What’s the point?

Both Principal Components Analysis (PCA) and Factor Analysis (FA) are methods of “dimension reduction”. The name should give a clue about what their purpose is: To reduce the complexity in your data while sacrificing as little information as possible. Since PCA is a little simpler, we will focus on it throughout this explainer. Also, since it’s likely that you’ve taken a personality test in the past, let’s take it as an example. Assuming it was an actual scientific measure and not something off of Bookface1, once you scored it, you probably got some kind of description of your personality on a few traits (e.g., extraversion or agreeableness). This is exactly what dimension reduction is. You started with many questions and ended up with five or so traits.

Have you ever wondered how the creators of the questionnaire know how each of the questions contributes to each of the traits? No? Well, take a moment to ponder before you read on and find out the answer…

Dimensions

When we say ‘dimension’, what we mean is the number of variables you’re working with. If you only have one variable, you only need one axis (the number line) to plot all the points. For two variables, say, a person’s age and height, you need a 2-D plot, with age on one of the axes and height on the other. Each individual’s value of age and height gives you the coordinates of the point representing that individual on your plot. If you add another variable, you need to expand the plotting region into a higher dimension: a 3-D space with height, width, and depth. Beyond 3 variables, plotting becomes difficult as our imagination is mostly restricted to three spatial dimensions, but the idea is the same2: if someone answers 50 personality questions, you can locate them as a single point in a 50-dimensional space, where each one of the 50 axes is somehow perpendicular to all the others. Mind = blown.

Since this is supposed to be a visual guide, let’s restrict ourselves to two dimensions:
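
The post doesn’t show how the df data frame was created, so here is one way to simulate data with a similar positive correlation (the seed, sample size, and coefficients below are illustrative guesses, not the original values):

# illustrative simulation of a two-variable data frame like the df used below
set.seed(123)                         # arbitrary seed
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.75)  # y is positively correlated with x
df <- data.frame(x, y)
plot(df)                              # scatterplot of the two variables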

Okay, we have some data. Hopefully, you can see that the two variables depicted are positively correlated, but let’s see just how much:

cor(df)
##           x         y
## x 1.0000000 0.6296835
## y 0.6296835 1.0000000

There’s a reason I’m not showing you just the single number you get from cor(df$x, df$y): I would like you to familiarise yourself with the correlation matrix. As you can see, this is a 2 \(\times\) 2 symmetrical matrix3 (a rectangular arrangement of numbers) with 1s on the diagonal – because a correlation of a variable with itself is always 1 – and cor(x, y) in place of the two off-diagonal elements. The reason why I’m showing you the full matrix is that we can visualise it on the plot.

Think of the columns of the matrix as coordinates of two points on our plot, one with x-value of 1 and y-value of 0.63, and the other one with the values swapped:

Better still, we can imagine these as arrows from the origin point (0, 0) to these two separate points. These arrows represent vectors that have magnitude (length) and direction:
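
The original figure isn’t reproduced here, but a quick base-R sketch of these two arrows might look something like this (assuming the df from above):

# sketch: draw the columns of the correlation matrix as arrows from the origin
cmat <- cor(df)
plot(NULL, xlim = c(0, 1.1), ylim = c(0, 1.1), xlab = "x", ylab = "y", asp = 1)
arrows(0, 0, cmat[1, 1], cmat[2, 1])  # first column:  (1, cor(x, y))
arrows(0, 0, cmat[1, 2], cmat[2, 2])  # second column: (cor(x, y), 1)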

How do we reduce the number of dimensions?

Right, so the question now is: how can we reduce the dimensionality of our data so that we lose as little information as possible? To reduce dimensionality means to flatten out your data, in our case to put all the points on a single number line. To preserve information in this context means to put the points on the line in such a way that points that started close together in the 2-D plane end up relatively close together on the line and points that started out distant end up relatively far apart. So, in other words, the question is…

Which way should the line go?

Here is where the idea of eigenvectors comes into the picture. A typical correlation matrix4 has as many eigenvectors as it has columns (and indeed rows), and each eigenvector comes with a number attached to it, called its eigenvalue. Let’s see:

eigen(cor(df))
## eigen() decomposition
## $values
## [1] 1.6296835 0.3703165
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068

Again, in the $vectors section, we have a matrix whose columns represent the directions of the two eigenvectors. They are both of length 1. We will get to the $values section in due course. For the time being, let’s replace the correlation vectors in the plot with lines going through them and visualise these two eigenvectors:

There are several things to notice about eigenvectors of correlation matrices:

  1. they are always orthogonal with respect to each other,

  2. they bisect the angles between the dashed lines,

  3. importantly, the first (here, purple) eigenvector points in the direction of the greatest variance in the data.

 

And there, in point 3, lies the answer to our question: the line we want all the data points to fall on should be in the direction of the greatest variance. In order to better understand what this means, we need to talk briefly about linear transformation.
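
Before we do, you can already check point 3 numerically: if you project the (standardised) data onto different directions, the spread is largest along the first eigenvector. A quick sketch, assuming the df from above:

# variance of the standardised data projected onto a few different directions
z <- scale(df)
proj_var <- function(angle) var(drop(z %*% c(cos(angle), sin(angle))))
proj_var(pi / 4)      # along the first eigenvector (45 degrees) -- the largest possible
proj_var(0)           # along the horizontal axis
proj_var(3 * pi / 4)  # along the second eigenvector -- the smallest possible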

Linear transformation

This is just a fancy name for transformations built out of addition, subtraction, multiplication, and division. The reason these operations are called linear transformations is that they change the values provided but preserve the proportional distances between them. So the relative distances between 1, 3, and 5 are the same as the relative distances between 1 + 4, 3 + 4, and 5 + 4, or between 1 \(\times\) 7, 3 \(\times\) 7, and 5 \(\times\) 7. In visual terms, you can think of linear transformations as ways of stretching, squishing, skewing, or rotating space itself (shifting all the points along an axis is, strictly speaking, an affine rather than a linear transformation, because it moves the origin). All of these ways of warping space keep three things unchanged:

  1. the origin point (0, 0) stays where it is,

  2. any lines (horizontal, vertical, diagonal…) that start out as parallel remain parallel, and

  3. the distances between points (or lines) remain proportional.

Here’s an example of a linear transformation: a 90° anticlockwise rotation with shear. Notice that all three above-listed conditions apply. If you are interested in the nitty-gritty, the final effect is achieved by multiplying the data matrix (D) by the 90° rotation matrix (R) and then by the shear matrix (S)5.
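
If you want to reproduce something like this yourself, here is a rough sketch of the matrix multiplication described in footnote 5 (assuming the df from above; the pretty animation is left as an exercise):

# sketch of the rotation-then-shear transformation from footnote 5
D <- t(as.matrix(df))                  # 2 x n data matrix: first row x, second row y
R <- matrix(c(0, -1, 1, 0), nrow = 2)  # the rotation matrix from footnote 5
S <- matrix(c(1, 0, 1, 1), nrow = 2)   # the shear matrix from footnote 5
transformed <- S %*% (R %*% D)         # rotate first, then shear
plot(t(transformed), asp = 1)          # the transformed point cloud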

And just in case you’re wondering what a non-linear transformation might look like, here’s an example of the same plot in polar coordinates:

Pretty cool, huh? But, you might be wondering, what does any of this have to do with PCA? Well, you’re invited to read on…

Eigenvalues

Let’s go back to the plot of our data:

print(p)

To recap, we have our individual rows of df plotted as points in a 2-dimensional orthogonal space defined by our horizontal and vertical axes. The dashed lines represent the correlation matrix returned by cor(df). The purple and green arrows represent the two eigenvectors of this matrix. We discussed how the first (purple) eigenvector points in the direction of the greatest variance in the data. It should feel intuitively true that were you to draw an oval shape around the datapoints, a line drawn through the first eigenvector would cut the oval lengthwise in half.

We also talked about how linear transformations of data can be pictured as warping of the plotting space. So what happens if we transform df using the correlation matrix? The animation below gives the answer:
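
The animation itself isn’t reproduced here, but the end state is easy to compute: multiply each data point by the correlation matrix. A minimal sketch, assuming the df and cmat <- cor(df) from above:

cmat <- cor(df)
warped <- as.matrix(df) %*% cmat  # each row (x, y) multiplied by the correlation matrix
plot(warped, asp = 1)             # the point cloud stretched along the first eigenvector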

Again, there are several things to notice here:

  1. After the transformation, the axes are aligned with the dashed lines (after all, that is what the transformation really is!) and are no longer orthogonal.

  2. The plotting space got stretched, just like if you took a square sheet of rubber and pulled at the opposite corners.

  3. The individual data points still have the same coordinates. It is just the coordinate system (the axes) that has changed.

  4. The data points got spread out in the direction of the greatest variance.

  5. The eigenvectors did not change orientation. Any other vector (arrow) that is not parallel to these two would get rotated as a result of the transformation (see here for a nice visualisation).

  6. The first eigenvector got stretched and the second one got squished.

The factor by which each eigenvector got scaled is the eigenvalue:

eigen(cor(df))$values
## [1] 1.6296835 0.3703165

This means that the first eigenvector gets stretched to 1.63 times its previous length (which was one6),

\[1^{st}\ eigenvector:\ 1.63\cdot\begin{bmatrix}0.71\\0.71\end{bmatrix} = \begin{bmatrix}1.15\\1.15\end{bmatrix}\] while the second eigenvector gets compressed to about 37% of its original length (also one).

\[2^{nd}\ eigenvector:\ 0.37\cdot\begin{bmatrix}-0.71\\0.71\end{bmatrix} = \begin{bmatrix}-0.26\\0.26\end{bmatrix}\]
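
This stretching by the eigenvalue is exactly the defining property of eigenvectors, and you can check it directly in R (a quick sketch; eig is just eigen(cor(df)) again):

# the defining relationship: matrix %*% eigenvector = eigenvalue * eigenvector
eig <- eigen(cor(df))
cor(df) %*% eig$vectors[ , 1]      # the stretched first eigenvector (compare the equation above)
eig$values[1] * eig$vectors[ , 1]  # the same thing
cor(df) %*% eig$vectors[ , 2]      # the squished second eigenvector
eig$values[2] * eig$vectors[ , 2]  # the same thing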

Let’s recap what we know about eigenvectors and eigenvalues. Eigenvectors are those vectors in space that maintain their orientation after a given transformation; they can only get stretched or squished as a result of it. The factor by which they get stretched or squished is their eigenvalue. A linear transformation can always be represented with a matrix and the eigenvalues and eigenvectors are properties of that matrix. Not all transformations (and thus not all matrices) have eigenvectors. For instance, the rotation and shear transformation pictured above doesn’t7. But, at least generally speaking, a correlation or a variance-covariance matrix of your data will always have eigenvectors. The eigenvectors will all be perpendicular to each other and there will be as many of them as the original number of variables in your data.

OK, that was a lie. The eigendecomposition only behaves this nicely if there is no multicollinearity in your data, that is, if none of the variables can be expressed as a linear combination of the other variables. If some can, it means that at least one variable in your data is redundant: some of the eigenvalues of the correlation matrix will be zero and there will be fewer meaningful dimensions than variables. R will let you know by complaining about a “non-positive definite” matrix.
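
To see multicollinearity in action, here is a small self-contained example (made-up variables, not the df from this post):

set.seed(1)                        # arbitrary seed
u <- rnorm(100)
v <- rnorm(100)
w <- u + v                         # w is an exact linear combination of u and v
eigen(cor(cbind(u, v, w)))$values  # the third eigenvalue is (numerically) zero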

The nice thing about eigenvectors and eigenvalues is that you can use them to recreate the original correlation matrix you started with, with a little mathemagic (AKA linear algebra):

(cmat <- cor(df))
##           x         y
## x 1.0000000 0.6296835
## y 0.6296835 1.0000000
(eig <- eigen(cmat))
## eigen() decomposition
## $values
## [1] 1.6296835 0.3703165
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068  0.7071068
# %*% is an operator used in R for matrix multiplication
# t() gives the transpose of a matrix: swaps rows and columns
eig$vectors %*% diag(eig$values) %*% t(eig$vectors)
##           [,1]      [,2]
## [1,] 1.0000000 0.6296835
## [2,] 0.6296835 1.0000000

The diag(eig$values) call turns the vector of eigenvalues into a matrix with the eigenvalues on the diagonal and zeros elsewhere, so that it can be matrix-multiplied with the matrix of eigenvectors.

Principal Components

Now that we have an understanding of what those weird German-sounding words mean, we can talk about principal components. Recall that the whole point of PCA is to reduce the dimensionality (number of variables) in your data while losing as little information as possible. We do this by extracting a number of principal components. But what are these and what does it mean to extract them?

Well, principal components are just a combination of eigenvectors and eigenvalues. You can visualise them as axes going through the eigenvector arrows. In our case, you can draw two axes through the two vectors and then make them your new horizontal and vertical axes. Like this:
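
In code, this re-plotting amounts to rotating the standardised data onto the eigenvector axes. A quick sketch, assuming the df and eig objects from above:

rotated <- scale(df) %*% eig$vectors  # column 1: position along PC1; column 2: along PC2
round(cor(rotated), 10)               # the new axes are uncorrelated
apply(rotated, 2, var)                # the variances along the new axes are the eigenvalues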

Notice several things about this new plot:

  1. There is now no correlation in the data.

  2. The data are mostly spread along the purple axis representing the 1st principal component. That is why we said that the first eigenvector points in the direction of the greatest variance.

  3. There is not much variance along the green axis. In other words, components with large eigenvalues account for a lot of the variance in the data, while ones with small eigenvalues account for little variance.

 

Now, just like with eigenvectors and their eigenvalues, we can use the principal components to reproduce the original correlation matrix. Each principal component is just an eigenvector multiplied by the square root of its eigenvalue:

# get principal components
(pc1 <- eig$vectors[ , 1] * sqrt(eig$values[1]))
## [1] 0.9026859 0.9026859
(pc2 <- eig$vectors[ , 2] * sqrt(eig$values[2]))
## [1] -0.4303002  0.4303002
# put them together in a matrix
# the same thing could be done with
# pc <- eig$vectors %*% diag(sqrt(eig$values))
(pc <- cbind(pc1, pc2))
##            pc1        pc2
## [1,] 0.9026859 -0.4303002
## [2,] 0.9026859  0.4303002
# reproduce the correlation matrix
(rep_cmat <- pc %*% t(pc))
##           [,1]      [,2]
## [1,] 1.0000000 0.6296835
## [2,] 0.6296835 1.0000000

So there we have it – the original correlation matrix. Since we can reproduce the matrix perfectly, we have lost no information whatsoever. On the flip-side, however, we still have a matrix with the same number of rows and columns as the original correlation matrix. That means we haven’t reduced the dimensionality in the data. In order to do that we extract only a subset of the available principal components. In our case, there’s not much to think about: we have 2 PCs, so we can only extract one of them. Which one? The one that accounts for the most variance: the 1st principal component (purple).

This means that we disregard the green axis and consider only the position of the points on the purple axis. This, ladies and gentlemen, is what it means to “extract the first principal component” from the variables!

The entire process can be pictured like this:

OK, now that we extracted only one of the PCs, it should be apparent that we have lost some of the information in the original data. This is because the 1st PC doesn’t account for all the variance. We can still try to reproduce the original correlation matrix though. Let’s give it a go:

pc1
## [1] 0.9026859 0.9026859
(rep_cmat <- pc1 %*% t(pc1))
##           [,1]      [,2]
## [1,] 0.8148418 0.8148418
## [2,] 0.8148418 0.8148418

This is clearly not the same matrix as cmat. Let’s look at the difference:

(rep_cmat - cmat)
##            x          y
## x -0.1851582  0.1851582
## y  0.1851582 -0.1851582

This is the residual matrix. And based on this matrix, you can assess how well your model fits the data. The smaller the residuals in the matrix, the more closely you can reproduce the original correlation or variance-covariance matrix and the better the fit.
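
For example, the root mean square of the residuals (the RMSR that psych reports further down) can be approximated from this matrix directly. A rough sketch, assuming rep_cmat and cmat from above and glossing over the exact formula psych uses:

resid <- rep_cmat - cmat
sqrt(mean(resid[upper.tri(resid)]^2))  # root mean square of the off-diagonal residuals, ~0.185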

On the flip-side, though, notice that the data points that started out far apart ended up relatively far apart on the purple axis, and those that started out near each other also ended up near each other. This is the neat thing about dimension reduction – we preserved as much information about the original data as possible.

 

OK, so now that we’ve extracted our 1st PC, what can we do with it? Well, we can see how our participants scored on this component. To convert our two variables df$x and df$y into component scores, we multiply each column of df by the corresponding component loading and then add the columns together. The loadings are simply the elements of the extracted components, in our case:

pc1
## [1] 0.9026859 0.9026859

This means that we multiply column x by 0.9026859 and column y by 0.9026859 (the values will not always be the same!) and then add those two up. These scores are often standardised so let’s also do that:

pca_scores <- scale(df$x * pc1[1] + df$y * pc1[2])
hist(pca_scores, main = NULL)

And there we are! We successfully reduced the dimensionality in our data from 2 dimensions to only one, while losing as little information as possible. Sweet…

 

As a sanity check – for it is always useful to perform those – let’s compare what we calculated by hand to what R gives us when we do PCA on our data using the pca() function from the psych package, extracting one principal component:

library(psych)

m1 <- pca(df, nfactors = 1)

m1
## Principal Components Analysis
## Call: principal(r = r, nfactors = nfactors, residuals = residuals, 
##     rotate = rotate, n.obs = n.obs, covar = covar, scores = scores, 
##     missing = missing, impute = impute, oblique.scores = oblique.scores, 
##     method = method)
## Standardized loadings (pattern matrix) based upon correlation matrix
##   PC1   h2   u2 com
## x 0.9 0.81 0.19   1
## y 0.9 0.81 0.19   1
## 
##                 PC1
## SS loadings    1.63
## Proportion Var 0.81
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.19 
##  with the empirical chi square  3.43  with prob <  NA 
## 
## Fit based upon off diagonal values = 0.91

These numbers should look familiar. In the Standardized loadings section, we have the loadings of the 1st principal component. You can see that they are identical to the two elements of our pc1 (0.9026859, 0.9026859). Squaring these gives us the h2 (communality) column, whose numbers represent the proportion of variance in each of the original variables explained by the extracted principal component(s). In our case, we can account for 81% of the variance in each variable, leaving 19% unexplained (the u2 – uniqueness – column). The com column indicates item complexity – how many components each variable loads on (or, in the case of factor analysis, how many factors load on the given item). As we only extracted 1 PC, the variables can only load on 1 principal component.

Next, you have the Sum of Squared loadings, which is exactly what it says on the tin: sum(pc1^2) = 1.6296835. It is also the eigenvalue of the first eigenvector. Right below it is the proportion of variance in the data explained by each extracted PC, which is just SS loadings divided by the number of variables.
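
In code, reusing the pc1 we computed by hand:

sum(pc1^2)      # SS loadings -- identical to the first eigenvalue, 1.6296835
sum(pc1^2) / 2  # proportion of variance explained, reported as 0.81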

Finally, there are the various statistics assessing the fit of our 1-component model. They are all derived from the residual matrix we saw above.

Since we saved the pca() model into an object m1, we can get the component scores out of it. Let’s look at their distribution:

hist(m1$scores, main = NULL)

As you can see, the plot looks exactly the same as the one of the scores we calculated by hand. If that doesn’t make you happy, I don’t know what will! :)

 

And with that, I hope you can PCA happily ever after…


  1. or the Myers-Briggs FLNS (four-letter non-sense).

  2. This is where matrices come in handy. Your data frames are basically matrices, where – provided that you have one row per participant – each row gives the coordinates of the given participant in the n-dimensional space.

  3. 3 \(\times\) 3 for a data set of three variables, 4 \(\times\) 4… You get the idea.

  4. or variance-covariance matrix, which is just the unstandardised version of the correlation matrix (you can look at it using var(data frame) or cov(data frame))…

  5. S(R D) = \(\begin{bmatrix}1 & 1\\0 & 1\end{bmatrix}\begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix}\begin{bmatrix}x_1 & \cdots & x_n\\y_1 & \cdots & y_n\end{bmatrix}\)

  6. Applying the Pythagorean theorem, we get length before = \(\sqrt{0.71^2 + 0.71^2}\) = 1 and length after = \(\sqrt{1.15^2 + 1.15^2}\) = 1.63

  7. At least not in the domain of real numbers; it has two, but they involve complex numbers. In case you’re wondering, they’re \(\begin{bmatrix}0.35-0.61i & 0.35+0.61i\\0.71 & 0.71\end{bmatrix}\)