Principal component analysis (PCA) is a technique for transforming a set of correlated variables into a set of linearly uncorrelated variables, known as principal components. The first principal component is the direction along which the data varies the most, and each subsequent principal component is the direction of highest remaining variance that is orthogonal to all of the previous ones.
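To make the definition concrete, here is a minimal sketch that computes principal components by hand with NumPy. It is not how scikit-learn implements PCA internally. The synthetic two-dimensional data and the variable names are assumptions made purely for illustration. It shows that the projected data is uncorrelated and that the components are ordered by variance.

```python
import numpy as np

# Synthetic, strongly correlated 2-D data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])

# Center the data, then eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder by descending variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]  # columns are the principal directions

# Project the centered data onto the principal components.
scores = centered @ components

# Off-diagonal entries are (numerically) zero: the new variables are uncorrelated.
# Diagonal entries equal the eigenvalues: the variance along each component.
scores_cov = np.cov(scores, rowvar=False)
print(np.round(scores_cov, 6))
```

The columns of components are orthogonal unit vectors, which is the "orthogonal to the previous components" condition in the definition above.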
PCA can be used for a variety of purposes, such as data compression, noise reduction, and visualization. In this blog post, we will focus on how PCA can be used for data visualization and dimensionality reduction.
PCA with Python
Let’s start by importing the necessary libraries and loading the dataset that we will use for our examples.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
digits = load_digits()
X = digits.data
y = digits.target
The load_digits() function from sklearn.datasets loads the handwritten digits dataset. This dataset contains images of handwritten digits, with each image represented as an 8×8 matrix of pixel values. The data attribute of the dataset contains a flattened version of these matrices, with each row representing an image. The target attribute contains the corresponding digit labels.
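As a quick sanity check on that description, the following sketch inspects the array shapes and reshapes one flattened row back into its 8×8 image. The variable name first_image is an assumption introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data
y = digits.target

# Each row of X is one image flattened to 64 pixel values.
print(X.shape)
print(y.shape)

# Reshape the first row back into its 8x8 pixel grid.
first_image = X[0].reshape(8, 8)
print(first_image)
print("Label:", y[0])
```

The dataset also exposes the unflattened images directly through digits.images, which should match the reshaped rows.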
Now that we have loaded the dataset, we can perform PCA on it. We will start by scaling the data using StandardScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
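We scale the data because PCA is sensitive to the scale of the variables: features with larger numeric ranges would otherwise dominate the principal components. StandardScaler centers each feature at zero and rescales it to unit variance, so that all features contribute on the same scale.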
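A short check, sketched below, confirms what the scaler did. One caveat worth knowing: pixels that have the same value in every image have zero variance, and StandardScaler leaves those columns at zero instead of dividing by zero. The names means, stds, and constant are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

means = X_scaled.mean(axis=0)
stds = X_scaled.std(axis=0)

# Pixels that never change across images cannot be rescaled to unit variance.
constant = X.std(axis=0) == 0

print("max |mean|:", np.abs(means).max())
print("std range of varying pixels:", stds[~constant].min(), stds[~constant].max())
print("number of constant pixels:", int(constant.sum()))
```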
Next, we will perform PCA using the PCA class from sklearn.decomposition.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
We specify n_components=2 to reduce the dimensionality of the data to 2 principal components, which will allow us to visualize the data in a two-dimensional plot. Calling fit_transform() fits the PCA model to the scaled data and returns the data projected onto those two components.
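Two components rarely capture all of the variance in a 64-dimensional dataset, so it is worth checking how much they retain before reading too much into the plot. The sketch below repeats the steps above in a self-contained form and prints the explained_variance_ratio_ attribute of the fitted model. The variable name ratios is an assumption.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# One row per image, one column per principal component.
print(X_pca.shape)

# Fraction of the total variance captured by each component, and by both together.
ratios = pca.explained_variance_ratio_
print(ratios)
print("Total retained:", ratios.sum())
```

If the total is small, the two-dimensional plot is still useful for a first look, but clusters that overlap there may be well separated in the full feature space.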
Now that we have transformed the data, we can plot it using Matplotlib.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.colorbar()
plt.show()
This code produces a scatter plot of the data, with the x-axis representing the first principal component and the y-axis representing the second principal component. The color of each point represents the corresponding digit label. The colorbar() function adds a color scale to the plot that shows which color corresponds to which label.
We have demonstrated how PCA can be used for data visualization and dimensionality reduction, using the handwritten digits dataset as an example. PCA is a valuable tool for any data scientist or machine learning practitioner, and we encourage you to explore its many applications.