Principal Component Analysis (PCA)

Concept

Principal Component Analysis uses the following ideas:

  • extracts variance structure from high-dimensional datasets
  • performs an orthogonal projection (transformation) of the data into a possibly lower-dimensional subspace that maximizes variance (thought experiment: how would a 2D being capture a human?)
  • retains as much of the original information as possible for the number of dimensions kept
  • helps to see how strongly features are correlated with (and thus redundant with) one another
  • can make features less interpretable
  • each dimension in the dataset yields a principal component (eigenvector)

The PCA Process

  1. Normalize the features to remove the influence of differing scales.
  2. Compute the covariance matrix.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Sort the components by percentage of explained variance (eigenvalue magnitude).
  5. Keep or drop components based on your thresholds (how many components you want, how much variance you need, etc.), as sketched in the NumPy example below.
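
For intuition, here is a minimal NumPy sketch of these five steps, assuming X is a samples-by-features array (the function name and the n_keep default are illustrative only):

Code
import numpy as np

def pca_by_hand(X, n_keep=2):
    # 1. normalize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. eigenvalues/eigenvectors (eigh, since the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. sort components by explained variance, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained_ratio = eigvals / eigvals.sum()
    # 5. keep the leading components and project the data onto them
    return X_std @ eigvecs[:, :n_keep], explained_ratio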

PCA vs. Linear Regression

Note that we are not trying to find a straight line that best fits or predicts our data. Linear regression minimizes the vertical prediction error in the target variable, whereas PCA finds orthogonal directions of maximum variance, which amounts to minimizing the perpendicular distance of the points to those directions. The two generally give different lines, as the toy sketch below illustrates.
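
A hedged toy comparison (the synthetic data and variable names are illustrative only):

Code
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

# slope of the first principal component (direction of maximum variance)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]
pca_slope = pc1[1] / pc1[0]

# slope of the ordinary least-squares regression of y on x
ols_slope = np.polyfit(x, y, deg=1)[0]

print(pca_slope, ols_slope)  # similar, but not the same line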

Implementation (sklearn)

Import Libraries and Data

Code
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# from sklearn.preprocessing import StandardScaler (shown how to use in cell below)
Code
# Load the homework data set

# Rename the columns based on their features.

columns = ['class','alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
    'proanthocyanins', 'color_intensity', 'hue',
    'dilution_of_wines', 'proline']

df = pd.read_csv('data/wine.csv', names=columns, header=0)
df.head()
class alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue dilution_of_wines proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

Scale Data

With scale

Code
# Standardize the features (excluding the class label in column 0) to zero mean and unit variance.
X = df.iloc[:, 1:].values
X_normal = scale(X)

With StandardScaler

Code
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_normal = scaler.fit_transform(X)
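
Both approaches standardize each column to zero mean and unit variance, so they should produce the same array; a quick sanity check (assuming both scale and StandardScaler are imported as above):

Code
np.allclose(scale(X), StandardScaler().fit_transform(X))  # expected: True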

Implement PCA

Question 1 - Use Principal Component Analysis (PCA) to reduce the dimensionality of the data in order to optimize the result of the clustering (5 Points)

Code
# implement pca with the normalized features
pca = PCA()
principalComponents = pca.fit_transform(X_normal)
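
Here PCA() keeps all 13 components. If the number of components, or the fraction of variance to retain, is known up front, sklearn can truncate directly; the 0.80 threshold below is only an illustration, not part of the assignment:

Code
# keep exactly 3 components
pca_3 = PCA(n_components=3)
X_pca_3 = pca_3.fit_transform(X_normal)

# or keep as many components as needed to explain at least 80% of the variance
pca_80 = PCA(n_components=0.80)
X_pca_80 = pca_80.fit_transform(X_normal)
print(pca_80.n_components_)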

Question 2 - Create a dataframe featuring the Principal components that you acquired through PCA and show the output (5 Points)

Code
# create the dataframe from the principal components
principalDf = pd.DataFrame(data = principalComponents)
principalDf.columns = [f'principal component {col + 1}' for col in range(principalDf.shape[1])]

# display the head of data
principalDf.head()
principal component 1 principal component 2 principal component 3 principal component 4 principal component 5 principal component 6 principal component 7 principal component 8 principal component 9 principal component 10 principal component 11 principal component 12 principal component 13
0 3.316751 -1.443463 -0.165739 -0.215631 0.693043 -0.223880 0.596427 0.065139 0.641443 1.020956 -0.451563 0.540810 -0.066239
1 2.209465 0.333393 -2.026457 -0.291358 -0.257655 -0.927120 0.053776 1.024416 -0.308847 0.159701 -0.142657 0.388238 0.003637
2 2.516740 -1.031151 0.982819 0.724902 -0.251033 0.549276 0.424205 -0.344216 -1.177834 0.113361 -0.286673 0.000584 0.021717
3 3.757066 -2.756372 -0.176192 0.567983 -0.311842 0.114431 -0.383337 0.643593 0.052544 0.239413 0.759584 -0.242020 -0.369484
4 1.008908 -0.869831 2.026688 -0.409766 0.298458 -0.406520 0.444074 0.416700 0.326819 -0.078366 -0.525945 -0.216664 -0.079364

Question 3 - Compute the amount of variance that each principal component explains. Display the output (10 Points)

Code
# compute explained variance each PCA explains
explained_variance = pca.explained_variance_ratio_

# numerical array of explained variance
explained_variance

# dataframe of explained variance
explained_variance_df = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance))],
    'Explained Variance': explained_variance
})

explained_variance_df.round(4)
Principal Component Explained Variance
0 PC1 0.3620
1 PC2 0.1921
2 PC3 0.1112
3 PC4 0.0707
4 PC5 0.0656
5 PC6 0.0494
6 PC7 0.0424
7 PC8 0.0268
8 PC9 0.0222
9 PC10 0.0193
10 PC11 0.0174
11 PC12 0.0130
12 PC13 0.0080
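
As a sanity check that ties this back to the eigendecomposition view from the concept section, the ratios reported by sklearn should match the eigenvalues of the covariance matrix of the standardized data divided by their sum (a sketch using the variables defined in the cells above):

Code
eigvals = np.linalg.eigvalsh(np.cov(X_normal, rowvar=False))[::-1]
np.allclose(eigvals / eigvals.sum(), explained_variance)  # expected: True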

Question 4 - Calculate the cumulative variances to 4 decimal places. Display the output (10 points)

Code
# compute the cumulative variance across the principal components
# (the ratios are rounded to 4 decimal places before summing, which is why
# the final total can come out slightly above 1.0)
cumulative_variance = np.cumsum(explained_variance.round(4))

# numerical array of cumulative variance
cumulative_variance

# dataframe of cumulative variance
cumulative_variance_df = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(cumulative_variance))],
    'Cumulative Variance': cumulative_variance
})

cumulative_variance_df
Principal Component Cumulative Variance
0 PC1 0.3620
1 PC2 0.5541
2 PC3 0.6653
3 PC4 0.7360
4 PC5 0.8016
5 PC6 0.8510
6 PC7 0.8934
7 PC8 0.9202
8 PC9 0.9424
9 PC10 0.9617
10 PC11 0.9791
11 PC12 0.9921
12 PC13 1.0001

Question 5 - Create the variance plot for the PCA components obtained and comment on the plot (10 points)

Code
# plot the individual and cumulative explained variance
plt.figure()
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.5, align='center', color = 'blue', label = 'Explained Variance')
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, color = 'red', label = 'Cumulative Variance')
plt.legend()
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.xticks(range(1, len(explained_variance) + 1))
plt.title('Individual and Cumulative Variance Explained by Each Principal Component')
plt.grid(True)
plt.show()

In the chart above, we display the individual explained variance and the cumulative explained variance for each principal component. Principal Component 1 individually explains the largest share of the variance, just under 40%, and the contributions of the remaining components taper off from there. The cumulative explained variance reaches about 80% at Principal Component 5.

Depending on our threshold (either explained variance or number of components), we have some information to choose which components we would like to include.

If we wanted to keep our model to 3 principal components, we would retain roughly 67% of the explained variance (cumulative variance of 0.6653).

However, by including up to 5 or 6 principal components, we could increase the retained explained variance into the 80% range.
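
A small sketch of how such a threshold could be applied programmatically (the 0.80 cut-off and variable name are illustrative only):

Code
threshold = 0.80
# index of the first component at which the cumulative variance reaches the threshold
n_needed = int(np.argmax(cumulative_variance >= threshold)) + 1
print(n_needed)  # 5 for this dataset, per the cumulative variance table above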

Question 6 - As our results are suggesting to use first 3 principal components for further computation, extract the three features from the PCA_dataset into PCA1, PCA2, PCA3 (10 points)

Code
# extract the first three principal components into a final dataframe
final_df = principalDf.iloc[:, :3]

final_df.head()

final_df.shape
(178, 3)

Question 7 - Create a dataframe for further clustering algorithms with PCA1, PCA2 and PCA3 as column headings. Display your results (10 points)

Code
# join the class column into the final dataframe
final_df = pd.concat([final_df, df['class']], axis=1)

# change column headings
final_df.columns = ['PCA1', 'PCA2', 'PCA3', 'Class']

# double check shape
final_df.shape

# double check nulls
final_df.isnull().sum()

# display the head of the results
final_df.head()
PCA1 PCA2 PCA3 Class
0 3.316751 -1.443463 -0.165739 1
1 2.209465 0.333393 -2.026457 1
2 2.516740 -1.031151 0.982819 1
3 3.757066 -2.756372 -0.176192 1
4 1.008908 -0.869831 2.026688 1

Question 8 - As done in class, go ahead and visualize the results of the 3D PCA. Properly label the x-, y-, and z-axes. Comment on and summarize your results after the plot. (20 points)

Code
# visualize the results
fig = plt.figure(figsize = (12,10))
ax = fig.add_subplot(1,1,1,  projection='3d')
ax.set_xlabel('PCA1', fontsize = 15)
ax.set_ylabel('PCA2', fontsize = 15)
ax.set_zlabel('PCA3', fontsize = 15)
ax.set_title('3 Component PCA', fontsize = 20)

targetsName = ['1', '2', '3']

targets = [1, 2, 3]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = final_df['Class'] == target
    ax.scatter(final_df.loc[indicesToKeep, 'PCA1'],
               final_df.loc[indicesToKeep, 'PCA2'],
               final_df.loc[indicesToKeep, 'PCA3'],
               c = color,
               s = 50)
ax.legend(targetsName)
ax.grid()

In the plot above, we created a 3-dimensional plot to display the result of our principal component analysis. Our three target classes of wine were given different colors:

  • 1: Red
  • 2: Green
  • 3: Blue

As seen from the coloring, PCA does a decent job of separating the 3 classes of wine: for the most part, each class sits within its own cluster. There is some minor overlap between Class 1 and Class 2, and again between Class 2 and Class 3, but there appears to be no overlap between the clusters of Class 1 and Class 3.
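
One way to put a number on this visual impression would be a silhouette score computed on the three retained components against the known class labels (a hedged sketch using sklearn; not part of the original assignment):

Code
from sklearn.metrics import silhouette_score

score = silhouette_score(final_df[['PCA1', 'PCA2', 'PCA3']], final_df['Class'])
print(round(score, 3))  # values closer to 1 indicate better-separated classes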