Principal Component Analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting it into a lower-dimensional subspace. It tries to preserve the essential parts of the data, those with more variation, and remove the non-essential parts with less variation.
Dimensions are nothing but features that represent the data. For example, a 28 X 28 image has 784 picture elements (pixels) that are the dimensions or features which together represent that image.
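To make the pixel example concrete, here is a minimal sketch. The image is filled with random intensities purely for illustration; it is not taken from any real dataset.

```python
import numpy as np

# A hypothetical 28 x 28 grayscale image with random pixel intensities.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flattening turns the 2D pixel grid into a single vector of features.
features = image.reshape(-1)

print(image.shape)     # (28, 28)
print(features.shape)  # (784,)
```

Each of the 784 entries in the flattened vector is one dimension of that image.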
One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: it needs no labels, and you can group similar data points based on the correlation between their features without any supervision.
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities that each consist of various numerical values) into a set of values of linearly uncorrelated variables called principal components.
Note: Features, Dimensions, and Variables are all referring to the same thing. You will find them often being used interchangeably.
Where Can You Apply PCA?
- Data Visualization: When working on any data-related problem, the challenge is the sheer volume of data and the number of variables/features that define it. To solve a problem where data is the key, you need extensive data exploration, such as finding out how the variables are correlated or understanding the distribution of a few variables. With a large number of variables or dimensions along which the data is distributed, visualization can be challenging or almost impossible. PCA helps here because it projects the data into a lower dimension, allowing you to visualize it in 2D or 3D space with the naked eye.
- Speeding Up a Machine Learning (ML) Algorithm: Since PCA’s main idea is dimensionality reduction, you can leverage it to speed up your machine learning algorithm’s training and testing time when your data has a lot of features and the algorithm’s learning is too slow. At an abstract level, you take a dataset having many features and simplify it by keeping a few principal components, each of which is a linear combination of the original features.
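As a rough sketch of the second use case, you can place PCA in front of a classifier so that the model trains on fewer inputs. The example below uses scikit-learn's bundled breast cancer data (the dataset used later in this tutorial). The 90% variance threshold, the logistic regression classifier, and the train/test split are all illustrative assumptions, not choices made by the text above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale, keep enough components to explain 90% of the variance, then classify.
# The 0.90 threshold and the classifier are assumptions made for this sketch.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
print(f"components kept: {n_kept} of {X.shape[1]}")
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

Because the scaler and PCA are fitted inside the pipeline, they only ever see the training split, which avoids leaking information from the test data.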
What is a Principal Component?
Principal components are the key to PCA; they represent what’s underneath the hood of your data. When the data is projected into a lower dimension (assume three dimensions) from a higher space, the three dimensions are nothing but the three principal components that capture (or hold) most of the variance (information) of your data.
Principal components have both direction and magnitude. The direction represents the principal axis along which the data is most spread out (has the most variance), and the magnitude signifies the amount of variance that the principal component captures when the data is projected onto that axis. Each principal component is a straight line (an axis), and the first principal component holds the most variance in the data. Each subsequent principal component is orthogonal to all the previous ones and holds less variance. In this way, given a set of x correlated variables over y samples, you obtain a set of u uncorrelated principal components over the same y samples.
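One common way to see direction and magnitude is through the eigen-decomposition of the covariance matrix: the eigenvectors give the directions and the eigenvalues give the variance along each of them. The sketch below uses synthetic two-dimensional data, which is an assumption made only for illustration.

```python
import numpy as np

# Synthetic 2D data with two correlated features (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# Center the data and compute the covariance matrix of the features.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]       # magnitude: variance along each axis
eigenvectors = eigenvectors[:, order]  # direction: one unit vector per column

print("variances:", eigenvalues)
print("first direction:", eigenvectors[:, 0])
print("dot product of directions:", eigenvectors[:, 0] @ eigenvectors[:, 1])
```

The first variance is the largest, and the dot product of the two directions is zero up to floating-point error, which is what orthogonality means here.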
The reason you achieve uncorrelated principal components from the original features is that the correlated features contribute to the same principal component, thereby reducing the original data features into uncorrelated principal components, each representing a different set of correlated features with different amounts of variation.
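You can check this behaviour on a small example. The sketch below builds three strongly correlated synthetic features (an assumption for illustration) and compares the correlation matrix of the original features with that of the principal component scores.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))

# Three features that all depend on the same underlying signal.
X = np.hstack([
    base + 0.1 * rng.normal(size=(300, 1)),
    2 * base + 0.1 * rng.normal(size=(300, 1)),
    -base + 0.1 * rng.normal(size=(300, 1)),
])

scores = PCA(n_components=3).fit_transform(X)

feature_corr = np.corrcoef(X, rowvar=False)
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(feature_corr, 2))  # large off-diagonal entries
print(np.round(score_corr, 2))    # off-diagonal entries close to zero
```

The original features are highly correlated with one another, while the principal component scores are not.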
Each principal component represents a percentage of total variation captured from the data.
(The following examples are derived from the well-known Breast Cancer dataset.)
The Breast Cancer data set is a real-valued multivariate dataset that consists of two classes, where each class signifies whether a patient has breast cancer or not. The two categories are: malignant and benign.
The malignant class has 212 samples, whereas the benign class has 357 samples.
It has 30 features shared across all classes: radius, texture, perimeter, area, smoothness, fractal dimension, etc.
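A minimal sketch for loading the data and confirming these numbers, assuming you use the copy of the dataset bundled with scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X = breast.data    # feature matrix
y = breast.target  # 0 = malignant, 1 = benign in scikit-learn's encoding

print(X.shape)               # (569, 30)
print(breast.target_names)   # ['malignant' 'benign']
print(np.bincount(y))        # [212 357]
print(breast.feature_names[:5])
```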
Data Visualization Using PCA
- You start by standardizing the data, since PCA's output is influenced by the scale of the features. It is common practice to normalize your data before feeding it to any machine learning algorithm. To apply normalization, you will import the StandardScaler class from the sklearn library and select only the features from breast_dataset. Once you have the features, you will apply scaling by calling fit_transform on the feature data.
- StandardScaler rescales each feature independently so that it has a mean of zero and a standard deviation of one. It does not require the features to be normally distributed, and it does not change the shape of their distributions.
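The standardization step can be sketched as follows. The breast_dataset object mentioned above is not shown in this section, so this sketch reads the features directly from scikit-learn's loader instead; treat that substitution as an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

breast = load_breast_cancer()
x = breast.data  # features only; the labels are left out

# fit_transform learns each feature's mean and standard deviation,
# then rescales every column to zero mean and unit variance.
x = StandardScaler().fit_transform(x)
print(x.shape)  # (569, 30)
```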
Let’s check whether the normalized data has a mean of zero and a standard deviation of one.
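One way to run this check, again assuming the features come from scikit-learn's loader:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(load_breast_cancer().data)

# Overall mean and standard deviation: approximately 0 and 1.
print(np.mean(x), np.std(x))

# The same check per feature (column).
means_ok = np.allclose(x.mean(axis=0), 0.0)
stds_ok = np.allclose(x.std(axis=0), 1.0)
print(means_ok, stds_ok)
```

The mean will typically print as a tiny number rather than exactly zero because of floating-point rounding.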
Let’s convert the normalized features into a tabular format with the help of DataFrame.
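A short sketch of that conversion with pandas. The variable name normalised_breast is a hypothetical choice, and the column labels are taken from the loader's feature_names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

breast = load_breast_cancer()
x = StandardScaler().fit_transform(breast.data)

# One row per sample, one named column per standardized feature.
normalised_breast = pd.DataFrame(x, columns=breast.feature_names)
print(normalised_breast.tail())
```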
- Now comes the critical part: the next few lines of code project the thirty-dimensional Breast Cancer data onto two-dimensional principal components.
- You will import the PCA class from the sklearn library, pass the number of components (n_components=2) to its constructor, and finally call fit_transform on the standardized feature data. Here, the number of components is the lower dimension onto which you project your higher-dimensional data.
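The projection step can be sketched like this. The variable names pca_breast and principal_components are illustrative, not fixed by the library.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(load_breast_cancer().data)

# Project the 30 standardized features onto the first two principal components.
pca_breast = PCA(n_components=2)
principal_components = pca_breast.fit_transform(x)

print(principal_components.shape)  # (569, 2)
```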
Next, let’s create a DataFrame that will have the principal component values for all 569 samples.
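A minimal sketch of that DataFrame. The variable name principal_breast_df and the column labels are assumptions chosen for readability.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(load_breast_cancer().data)
principal_components = PCA(n_components=2).fit_transform(x)

# One row per sample, one column per principal component.
principal_breast_df = pd.DataFrame(
    principal_components,
    columns=["principal component 1", "principal component 2"],
)
print(principal_breast_df.tail())
```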
- Once you have the principal components, you can inspect explained_variance_ratio_ (note the trailing underscore). It gives the fraction of the total variance (information) that each principal component holds after projecting the data onto the lower-dimensional subspace.
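A short sketch that prints these ratios, along with the share of variance the two components leave out. It assumes the same standardized features as in the earlier steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(load_breast_cancer().data)
pca_breast = PCA(n_components=2).fit(x)

# Fraction of total variance captured by each principal component.
ratios = pca_breast.explained_variance_ratio_
print("explained variance ratio:", ratios)

# Whatever the two components do not capture is lost in the projection.
lost = 1 - ratios.sum()
print(f"variance not captured: {lost:.1%}")
```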
From the above output, you can observe that principal component 1 holds 44.2% of the information while principal component 2 holds only 19%. Note also that projecting the thirty-dimensional data onto two dimensions lost 36.8% of the information.
Let’s plot the 569 samples along the principal component 1 and principal component 2 axes. It should give you good insight into how your samples are distributed among the two classes.
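A plotting sketch with matplotlib follows. The colors, marker size, and figure size are arbitrary choices, and the class encoding (0 = malignant, 1 = benign) is the one used by scikit-learn's loader.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

breast = load_breast_cancer()
x = StandardScaler().fit_transform(breast.data)
y = breast.target  # 0 = malignant, 1 = benign
pcs = PCA(n_components=2).fit_transform(x)

fig, ax = plt.subplots(figsize=(8, 8))
for label, name, color in [(0, "malignant", "r"), (1, "benign", "g")]:
    mask = y == label  # select the samples of one class
    ax.scatter(pcs[mask, 0], pcs[mask, 1], c=color, s=30, label=name)

ax.set_xlabel("Principal Component - 1")
ax.set_ylabel("Principal Component - 2")
ax.set_title("PCA of the Breast Cancer dataset")
ax.legend()
plt.show()
```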
From the above graph, you can observe that the two classes, benign and malignant, when projected to a two-dimensional space, are linearly separable to some extent. Another observation is that the
benign class is spread out as compared to the