StatQuest (Josh Starmer) has published a great video explaining principal component analysis (PCA). In the video he mentions that he is using Singular Value Decomposition (SVD), but he does not present that part of the maths: at no point does he explicitly calculate a singular value decomposition.
He says he is using SVD because the example works directly with the data matrix: a table of measurements where each column represents a variable and each row represents a sample. (In his table of mice and genes the data matrix is transposed.) The alternative would be to first form a covariance matrix and then use Eigenvalue Decomposition (EvD). The two methods are equivalent: applied to the same centered data they yield the same principal components, up to the sign of each component.
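The equivalence can be sketched in a few lines of algebra. The symbols are assumptions introduced here for illustration, not notation from the video: X is the centered data matrix with n samples in the rows, C is its covariance matrix, and X = U Σ Vᵀ is the SVD of X.

```latex
\begin{align}
  X &= U \Sigma V^{T}, \qquad U^{T} U = I, \quad V^{T} V = I \\
  C &= \frac{1}{n-1} X^{T} X
     = \frac{1}{n-1} V \Sigma^{T} U^{T} U \Sigma V^{T}
     = V \left( \frac{\Sigma^{T} \Sigma}{n-1} \right) V^{T} \\
  \lambda_i &= \frac{\sigma_i^{2}}{n-1} \\
  X V &= U \Sigma
\end{align}
```

Read this way, the right-singular vectors of X are the eigenvectors of C, each eigenvalue is a squared singular value divided by n - 1, and the transformed data can be obtained either as XV or as UΣ.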
This post aims to fill in some of the gaps by presenting a program that does the maths. It also shows the connection between EvD and SVD, and demonstrates how to do PCA using ojAlgo.
Unfortunately the program does not get exactly the same numbers as in the video. If you can spot what is causing the differences then please let me and/or Josh Starmer know (depending on who made the mistake). In the video the loading scores for PC1 (2 mice version) are 0.97 and 0.242, but in our calculations they are 0.94 and 0.34. The corresponding eigenvalues also differ, which makes derived numbers such as the relative importance of the principal components very different.
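As an independent cross-check of the 2-variable numbers, here is a minimal sketch in plain Java that does not use ojAlgo at all. It assumes the sample covariance (dividing by n - 1) and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix. The class and method names are hypothetical and are not taken from the post's own program.

```java
/** Hypothetical helper, not part of ojAlgo or of the post's own program. */
final class TwoVariablePcaSketch {

    /** Sample covariance matrix (divides by n - 1); rows are samples, columns are variables. */
    static double[][] covariance(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[] mean = new double[p];
        for (double[] row : data) {
            for (int j = 0; j < p; j++) {
                mean[j] += row[j] / n;
            }
        }
        double[][] cov = new double[p][p];
        for (double[] row : data) {
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n - 1);
                }
            }
        }
        return cov;
    }

    /** Eigenvalues of a symmetric 2x2 matrix, largest first (closed form). */
    static double[] eigenvalues(double[][] c) {
        double centre = (c[0][0] + c[1][1]) / 2.0;
        double radius = Math.hypot((c[0][0] - c[1][1]) / 2.0, c[0][1]);
        return new double[] { centre + radius, centre - radius };
    }

    /** Unit-length eigenvector of a symmetric 2x2 matrix for the given eigenvalue. */
    static double[] eigenvector(double[][] c, double lambda) {
        // First row of (C - lambda I) v = 0 gives v proportional to (c01, lambda - c00).
        double x = c[0][1];
        double y = lambda - c[0][0];
        double norm = Math.hypot(x, y);
        if (norm == 0.0) {
            return new double[] { 1.0, 0.0 }; // first row vanished, so (1, 0) is an eigenvector
        }
        return new double[] { x / norm, y / norm };
    }

    public static void main(String[] args) {
        // The 2-variable data set from the console output of this post.
        double[][] data = { { 10, 6 }, { 11, 4 }, { 8, 5 }, { 3, 3 }, { 2, 2.8 }, { 1, 1 } };
        double[][] cov = covariance(data);
        double[] lambda = eigenvalues(cov);
        double[] pc1 = eigenvector(cov, lambda[0]);
        System.out.printf("eigenvalues: %.6f, %.6f%n", lambda[0], lambda[1]);
        System.out.printf("PC1 loadings: %.6f, %.6f%n", pc1[0], pc1[1]);
    }
}
```

With the data above this arrives at PC1 loadings of roughly 0.94 and 0.34, agreeing with the ojAlgo results in this post. That only shows the two calculations here are consistent with each other under the n - 1 covariance assumption; it does not explain where the figures in the video come from.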
Remember!
PCA will not (directly) tell you which variables are the most important. It removes noise and reduces the dimensionality of the data, which makes the data easier to work with. Further analysis is most likely needed, although that may be as simple as plotting a chart.
Example Code
Console Output
2 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6
11 4
8 5
3 3
2 2.8
1 1
There are 2 variables and 6 samples.
Covariances
18.966667 6.486667
6.486667 3.126667
EvD
Eigenvalues (on the diagonal)
21.284012 0
0 0.809321
Eigenvectors (in the columns)
0.941711 0.336424
0.336424 -0.941711
Relative (Variance) Importance: { 0.963368085822159, 0.03663191417784095 }
Data (centered)
4.166667 2.366667
5.166667 0.366667
2.166667 1.366667
-2.833333 -0.633333
-3.833333 -0.833333
-4.833333 -2.633333
SVD
Left-singular Vectors (in the columns)
0.457541 -0.411087
0.483604 0.692426
0.242357 -0.277432
-0.279299 -0.177362
-0.377107 -0.250975
-0.527095 0.424429
Singular values (on the diagonal)
10.31601 0
0 2.011618
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.941711 0.336424
0.336424 -0.941711
Sum of eigenvalues/variance: 22.093333333333334 == 22.093333333333344
PC1: Variance=21.284012242764245 (96.34%%) Loadings={ 0.9417106889306233, 0.3364238076501293 }
PC2: Variance=0.8093210905690993 (3.66%%) Loadings={ 0.3364238076501293, -0.9417106889306233 }
Data (transformed)
4.719998 -0.826949
4.988861 1.392896
2.500152 -0.558086
-2.881249 -0.356784
-3.890244 -0.504866
-5.437518 0.85379
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
4.719998 -0.826949
4.988861 1.392896
2.500152 -0.558086
-2.881249 -0.356784
-3.890244 -0.504866
-5.437518 0.85379
Covariances (from SVD) – compare this what we originally calculated
18.966667 6.486667
6.486667 3.126667
Covariances (from SVD using only 2 components)
18.966667 6.486667
6.486667 3.126667
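The link between the singular values and the eigenvalues in the 2-variable output can be checked with a few lines of plain Java. This is a sketch under the assumption that the covariance is the sample covariance (dividing by n - 1, with n = 6 samples). The class and method names are hypothetical, and the singular values are copied from the output above.

```java
/** Hypothetical helper, not part of ojAlgo or of the post's own program. */
final class SingularToEigenSketch {

    /** Covariance eigenvalues from the singular values of the centered data: sigma^2 / (n - 1). */
    static double[] eigenvaluesFromSingular(double[] singularValues, int samples) {
        double[] eigen = new double[singularValues.length];
        for (int i = 0; i < eigen.length; i++) {
            eigen[i] = singularValues[i] * singularValues[i] / (samples - 1);
        }
        return eigen;
    }

    /** Share of the total variance carried by each principal component. */
    static double[] relativeImportance(double[] eigenvalues) {
        double total = 0.0;
        for (double value : eigenvalues) {
            total += value;
        }
        double[] share = new double[eigenvalues.length];
        for (int i = 0; i < share.length; i++) {
            share[i] = eigenvalues[i] / total;
        }
        return share;
    }

    public static void main(String[] args) {
        // Singular values as printed (rounded) in the 2-variable console output.
        double[] sigma = { 10.31601, 2.011618 };
        double[] eigen = eigenvaluesFromSingular(sigma, 6);
        double[] share = relativeImportance(eigen);
        for (int i = 0; i < eigen.length; i++) {
            System.out.printf("PC%d: variance=%.6f share=%.4f%n", i + 1, eigen[i], share[i]);
        }
    }
}
```

Because the inputs are the rounded values from the output, the results agree with the printed eigenvalues and relative importances only to about six significant digits.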
3 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6 12
11 4 9
8 5 10
3 3 2.5
2 2.8 1.3
1 1 2
There are 3 variables and 6 samples.
Covariances
18.966667 6.486667 19.286667
6.486667 3.126667 7.486667
19.286667 7.486667 22.246667
EvD
Eigenvalues (on the diagonal)
42.45597 0 0
0 1.357664 0
0 0 0.526366
Eigenvectors (in the columns)
0.654759 -0.747649 0.110959
0.244157 0.348147 0.905086
0.715316 0.565522 -0.410496
Relative (Variance) Importance: { 0.9575094809015773, 0.030619393636449194, 0.011871125461973569 }
Data (centered)
4.166667 2.366667 5.866667
5.166667 0.366667 2.866667
2.166667 1.366667 3.866667
-2.833333 -0.633333 -3.633333
-3.833333 -0.833333 -4.833333
-4.833333 -2.633333 -4.133333
SVD
Left-singular Vectors (in the columns)
0.514936 0.393973 -0.120891
0.379072 -0.811392 0.16742
0.310108 0.400155 0.067738
-0.316323 -0.060214 -0.37223
-0.423529 -0.060447 -0.495894
-0.464265 0.137926 0.753857
Singular values (on the diagonal)
14.569827 0 0
0 2.60544 0
0 0 1.622291
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.654759 -0.747649 -0.110959
0.244157 0.348147 -0.905086
0.715316 0.565522 0.410496
Sum of eigenvalues/variance: 44.34 == 44.34000000000004
PC1: Variance=42.455970383175966 (95.75%%) Loadings={ 0.6547591745515717, 0.24415739058194894, 0.715316427858859 }
PC2: Variance=1.3576639138401558 (3.06%%) Loadings={ -0.7476486995434165, 0.34814683061538987, 0.5655220653550999 }
Data (transformed)
7.502525 1.026474
5.523021 -2.114035
4.518217 1.04258
-4.608767 -0.156885
-6.170737 -0.157492
-6.764258 0.359358
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
7.502525 1.026474 -0.196121
5.523021 -2.114035 0.271604
4.518217 1.04258 0.109891
-4.608767 -0.156885 -0.603865
-6.170737 -0.157492 -0.804485
-6.764258 0.359358 1.222976
Covariances (from SVD) – compare this what we originally calculated
18.966667 6.486667 19.286667
6.486667 3.126667 7.486667
19.286667 7.486667 22.246667
Covariances (from SVD using only 2 components)
18.960186 6.433805 19.310642
6.433805 2.695478 7.68223
19.310642 7.68223 22.15797
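The "using only 2 components" covariances in the 3-variable output can be reproduced by summing the contributions of the first two eigenpairs. The sketch below is plain Java with hypothetical names and is not the ojAlgo code that generated the output. It assumes the eigenvectors are stored in the columns of the matrix, as in the printout, and it uses the rounded values copied from that printout.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of ojAlgo or of the post's own program. */
final class LowRankCovarianceSketch {

    /**
     * Rebuilds a covariance matrix from its first k eigenpairs:
     * the sum over c < k of lambda[c] * v_c * v_c^T, where v_c is column c of v.
     */
    static double[][] reconstruct(double[] lambda, double[][] v, int k) {
        int p = v.length;
        double[][] cov = new double[p][p];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    cov[i][j] += lambda[c] * v[i][c] * v[j][c];
                }
            }
        }
        return cov;
    }

    public static void main(String[] args) {
        // Eigenvalues and eigenvectors (in the columns) from the 3-variable console output.
        double[] lambda = { 42.45597, 1.357664, 0.526366 };
        double[][] v = {
            { 0.654759, -0.747649, 0.110959 },
            { 0.244157, 0.348147, 0.905086 },
            { 0.715316, 0.565522, -0.410496 }
        };
        System.out.println("2 components: " + Arrays.deepToString(reconstruct(lambda, v, 2)));
        System.out.println("3 components: " + Arrays.deepToString(reconstruct(lambda, v, 3)));
    }
}
```

Keeping all three components gives back the original covariance matrix up to rounding, while dropping the third changes the entries only slightly, which is consistent with PC3 carrying about 1.2% of the variance. The sign of an eigenvector does not matter here because each vector enters the sum twice.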
4 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6 12 5
11 4 9 7
8 5 10 6
3 3 2.5 2
2 2.8 1.3 4
1 1 2 7
There are 4 variables and 6 samples.
Covariances
18.966667 6.486667 19.286667 3.033333
6.486667 3.126667 7.486667 -0.086667
19.286667 7.486667 22.246667 3.413333
3.033333 -0.086667 3.413333 3.766667
EvD
Eigenvalues (on the diagonal)
42.951938 0 0 0
0 3.730063 0 0
0 0 1.320309 0
0 0 0 0.104357
Eigenvectors (in the columns)
0.651053 0.001663 0.759018 0.004493
0.239553 -0.375033 -0.20981 0.8706
0.711502 -0.020939 -0.60817 -0.351361
0.111845 0.926773 -0.100005 0.344355
Relative (Variance) Importance: { 0.8928479327252333, 0.07753734201523488, 0.027445446117285287, 0.0021692791422464803 }
Data (centered)
4.166667 2.366667 5.866667 -0.166667
5.166667 0.366667 2.866667 1.833333
2.166667 1.366667 3.866667 0.833333
-2.833333 -0.633333 -3.633333 -3.166667
-3.833333 -0.833333 -4.833333 -1.166667
-4.833333 -2.633333 -4.133333 1.833333
SVD
Left-singular Vectors (in the columns)
0.507358 0.268131 0.34454 0.05478
0.388702 -0.349683 -0.746453 0.046354
0.312688 -0.042237 0.419224 -0.177092
-0.336798 0.608043 -0.197987 0.523235
-0.427491 0.156041 -0.125105 -0.766633
-0.444459 -0.640295 0.305781 0.319356
Singular values (on the diagonal)
14.654681 0 0 0
0 4.318601 0 0
0 0 2.569347 0
0 0 0 0.722346
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.651053 -0.001663 -0.759018 -0.004493
0.239553 0.375033 0.20981 -0.8706
0.711502 0.020939 0.60817 0.351361
0.111845 -0.926773 0.100005 -0.344355
Sum of eigenvalues/variance: 48.10666666666667 == 48.10666666666664
PC1: Variance=42.95193788363521 (89.28%%) Loadings={ 0.6510525699470171, 0.23955264672389256, 0.711502412186202, 0.11184542040771327 }
PC2: Variance=3.7300630665462307 (7.75%%) Loadings={ -0.001662974141421872, 0.3750331043127808, 0.020938613599998712, -0.9267734241156432 }
Data (transformed)
7.435167 1.157951
5.696298 -1.51014
4.58235 -0.182406
-4.935668 2.625896
-6.264743 0.67388
-6.513403 -2.76518
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
7.435167 1.157951 0.885242 0.03957
5.696298 -1.51014 -1.917896 0.033484
4.58235 -0.182406 1.077132 -0.127922
-4.935668 2.625896 -0.508698 0.377957
-6.264743 0.67388 -0.321437 -0.553774
-6.513403 -2.76518 0.785657 0.230685
Covariances (from SVD) – compare this what we originally calculated
18.966667 6.486667 19.286667 3.033333
6.486667 3.126667 7.486667 -0.086667
19.286667 7.486667 22.246667 3.413333
3.033333 -0.086667 3.413333 3.766667
Covariances (from SVD using only 2 components)
18.206025 6.696517 19.896302 3.133391
6.696517 2.98945 7.350117 -0.145655
19.896302 7.350117 21.745439 3.345658
3.133391 -0.145655 3.345658 3.741088
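The "Data (transformed)" rows in the outputs are what you get when each centered sample is projected onto the loading vectors, which is also how the 4 variables are reduced to 2. Here is a minimal plain-Java sketch of that projection, with hypothetical names and no ojAlgo calls. It uses the first two centered rows and the PC1/PC2 loadings copied from the 4-variable output.

```java
/** Hypothetical helper, not part of ojAlgo or of the post's own program. */
final class ProjectionSketch {

    /** Projects centered samples (rows) onto loading vectors (one array per component). */
    static double[][] project(double[][] centered, double[][] loadings) {
        double[][] scores = new double[centered.length][loadings.length];
        for (int r = 0; r < centered.length; r++) {
            for (int c = 0; c < loadings.length; c++) {
                for (int j = 0; j < centered[r].length; j++) {
                    scores[r][c] += centered[r][j] * loadings[c][j];
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // First two rows of "Data (centered)" from the 4-variable console output.
        double[][] centered = {
            { 4.166667, 2.366667, 5.866667, -0.166667 },
            { 5.166667, 0.366667, 2.866667, 1.833333 }
        };
        // PC1 and PC2 loadings as printed in the same output.
        double[][] loadings = {
            { 0.6510525699470171, 0.23955264672389256, 0.711502412186202, 0.11184542040771327 },
            { -0.001662974141421872, 0.3750331043127808, 0.020938613599998712, -0.9267734241156432 }
        };
        for (double[] row : project(centered, loadings)) {
            System.out.printf("%.6f %.6f%n", row[0], row[1]);
        }
    }
}
```

The other way of deriving the same numbers, multiplying each column of left-singular vectors by its singular value, gives for example 0.507358 × 14.654681 ≈ 7.4352 for the first sample on PC1.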