## Varimax: Sure, it’s always worked but now there’s maths!

Some day you will be lovedDeath Cab for Cutie

Here is a paper that I just read that I really liked: Karl Rohe and Muzhe Zeng’s Vintage Factor Analysis with Varimax Performs Statistical Inference.

(Does anyone else get a Simon Smith and His Amazing Dancing Bear vibe off that title? No? Just me? Fine.)

This paper is in the fine tradition of papers that examine a method that’s been around for ages and everyone (maybe in a particular discipline) knows to work well, but no one has ever cracked why. Almost universally, it turns out that these methods do actually work1. But it usually takes someone to hold them up to the light in just the right way to see why.

And that’s what this paper does.

The basic idea is that you can often rotate principal components in order to make the loading vectors sparse. This typically leads to more interpretable principal components and, therefore, better exploratory data analysis.

The black fly in this particular glass of Chardonay2 is that in the standard working model for PCA (namely that that everything is independent and Gaussian) we end up with an awkward rotational invariance that doesn’t exactly scream “Hey! Rotate me! I’ll get better! I promise!”.

But nonetheless the Varimax rotations do actually seem to help most of the time and so they’ve been comfortably used for the last 50 or so years. It’s even in the base stats package in R!

So what’s the deal, daddy-o? It turns out that the whole method works as long as the entries in the principal component matrix are sufficiently non-Gaussian3. Figure 1 from their paper (shown below) shows this clearly. It turns out that when the principal components have heavy tails you see “radial streaks” and then there is a good rotation—the one that lines the axes up with these streaks!

Checking for these streaks was part of the workflow for using variamax rotations proposed by Thurstone back in 1947 and this paper does a good job of using it as a jumping off point for building a theory of why these rotations work. (Side note: I really love how rooted in history this paper is while being extremely modern. It’s always nice to see a paper big-up the contributions of people who came before them while also doing amazing work.)

Anyway. You should all read the paper. It’s super sweet. Basically it shows that if your PCA is done on an appropriately centred and scaled matrix, then under some assumptions PCA + Varimax (which they call Vintage Factor Analysis) fits the model

$latex \mathbb{E}(X\mid Y,Z)=ZBY^T$,

where $latex X$ is the $latex n\times p$ matrix of covariates and the PCA+Varimax returns the matrix $latex Z$. (The B and Y are just in there for fun. Don’t stress about them. I promise the Z is the one we want.).

This basically means that while PCA will give you some $latex n\times k$ matrix $latex U$ such that $latex X^TX=U^TU$, Vintage Factor Analysis can actually recover (under conditions!) the original matrices Z and Y used to generate the data.

The key condition is that the entries in the true $latex Z$ come from a heavy-tailed distribution (each column is iid from the same distribution, different columns can come from different distributions as long as they’re independent). This causes the radial streaks seen in Figure 1 and gives a preferred orientation that the Varimax rotation can find.

Now just because this method corresponds to a model doesn’t mean we believe that model. Actually we probably don’t. Unlike some other great homes of working models (like linear regression), this type of factor analysis is typically exploratory. So we don’t actually need to believe the assumptions for it to be useful.

This paper justifies the diagnostic checks of Thurstone (1947) and gives a plausible reason why this PCA+Varimax method can actually produce good, interpretable results.

On a personal note, my very first stats paper involved fitting a Bayesian factor model (Chris did all of the hard stats stuff because I was terrified and knew nothing. I provided an algorithm.). And I’ve got to say I’d much rather just do a PCA.

===========
1 Yes. I know. Just because a method is persistent doesn’t make it valid. This blog is full of examples. But this is a slightly different thing. This is not a method that is mathematically well understood (like using p-values to do inference) where either the mathematical framework doesn’t match with reality (like using p-values to do inference) or they aren’t being used correctly (like using p-values to do inference). This is instead a method that works but does not have a well-developed mathematical theory. Those types of things typically do end up working provably, even though it might take a while for theory to catch up to the method. Fun fact: one of my very early papers did this for a much more obscure algorithm.

2 OMG OMG OMG THERE’S A NEW ALANIS ALBUM AND IT’S AMAZING!!!!!

3 This is also the thinking behind Independent Component Analysis. The paper discusses the links.

1. OliP says:

Has some of the post been swallowed here:”Basically it shows that if your PCA is done on an appropriately centred and scaled matrix,”, or was that bit just one of the less important components?

• Dan Simpson says:

No. That’s what the paper shows. The scaling and centring is a classic (and key) part of any PCA so I didn’t feel the burning urge to type it up. Also it’s in the paper in lurid detail.

• gec says:

I think OliP is referring to an odd bit of formatting, where there’s a line break after a comma but the next thing is a capital letter, so it looks like the end of a sentence is missing.

At least, I got tripped up there for a bit, though it still makes sense. Three cheers for PCA!

• Dan Simpson says:

Ah! Sorry OliP – I just realized what you meant. Apparently I just restarted the sentence immediately after. Eeeek

• OliP says:

Thanks for clarifying, and of course for the great post.

Great except for footnote 1. I could be wrong but I’m sort of picking up a subtext that maybe Dan somehow doesn’t think that P-values are useful for inference. Possibly. Just my impression.

2. ian says:

Reminds me of Chapter 34 (Independent Component Analysis and Latent Variable Modelling)
in the late great David MacKay’s “Information Theory, Inference, and Learning Algorithms” where
we can recover latent variables if they are not distributed as a gaussian but have heavier tails.

He has maths also! (Note the ‘s’ barbarians! ;).

• Dan Simpson says:

From the paper:

“To enable the regime d = k, the results for ICA typically presume that A = ZM is observed with little or no noise. In contrast, Theorem 4.1 covers situations where (i) d grows at the same rate as n, (ii) there is an abundance of noise in A, and (iii) A is mostly zeros (i.e., sparse). This allows the theorem to cover the contemporary factor models in Section 5.”

• We borrowed the “s” from “maths” and lent it to “sport”. I’d say we were barbarianz, but we can’t even be consistent on when to replace an ess with a zed.

Also, +1 for McKay’s Oxford comma. I was ruined by grad school in Edinburgh, though my undoing was the brilliant copy editor of my first book, which was with Cambridge University Press. I’ve been at odds with my American colleagues over punctuation and slightly confused on spelling words like “colour” ever since.

Plus, hurray for Dan being back!

• Martha (Smith) says:

Growing up as a border brat, I encountered “zee”/ “zed”, “aluminum”/”aluminium”, “company”/”limited”, different pronunciations of “o”, different pronunciations of “ou”, which hand you hold your fork in, etc., from the get-go — and often just used whatever custom the person I was with used.

• Anonymous says:

Same thought occurred to me. However note that MacKay was also dismissive of principle component analysis as well. Which is why there is no chapter on that in his book.

3. jlm says:

totally remember simon smith & the ADB. Varimax, though, doesn’t ring a bell

4. Roy says:

Dan, speaking of your early paper, I always wished there had been more follow-up on that work. Putting these models into state-space format makes a lot of sense because you can then use both the Jungbacker-Koopman transformation and the Durbin-Koopman univariate processing. These have pretty much been ignored in the spatial literature, and in fact are just being re-discovered in the Gaussian process literature (see “Scalable Exact Inference in Multi-Output Gaussian Processes” by Bruinsma et al., their transformation T is that same as the Jungbacker-Koopman transformation and the orthogonal model is essentially univariate processing). I also always suspected the likelihood in the Strickland et. al paper was missing a Jacobean due to the Jungbacker-Koopman transformation, but my math isn’t good enough.

Ricardo Lemos has done a lot of work related to my points above, unfortunately much of it tied up in NDAs. He did present some of it at the Bayesian meeting in Sardinia a few years ago.

5. Frenetic Skeptic says:

Thanks for this post – a polite request: in future blogs, please point out this contribution to ML is a win for psychology, maybe point this out as many times as himmicanes have been mentioned :)

6. Brian J. Reilly says:

“Such Pretty Forks in the Road”: Haven’t listened to the album yet, but is Alanis reminding us to take all the forking paths, not just one?

7. Erikson says:

Thanks for sharing, Dan! I have to read the paper more carefully, but I really like your take on it. I always thought that Thurstone ‘simple structure’ was only an ad hoc way to justify human interpretable factor loadings, but it might have a stronger foundation after all!

I wonder if this paper would have any significant impact in the area Thurstone worked on, i.e., the development of psychometric scales. Although Varimax is still popular, it has been losing ground to various oblique transformations of the factor loadings, e.g., Oblimin (the psych package for R does Oblimin transformation in EFA by default for a long time now). My math isn’t enough to work out if the theorems holds for oblique transformations, but the paper states that the data should come from independent latent variables, so I believe it doesn’t.

I also wonder what this would mean to normal-theory factor analysis, certainly the most used set of assumptions to estimate and evaluate model fit in exploratory, confirmatory factor analysis and Structural Equations Modeling in Psychology. The leptokurtic condition for Varimax to work seems to preclude any multivariate normal assumptions about the distribution of observed and latent variables, if we take the model seriously.

Anyway, the nagging problem of deciding the number of latent factor remains. The author use the classic scree plot to decide the number of factors to retain. If we are using FA solely for the purpose of dimensionality reduction, well, this method is good as any, but if want to build scientific theory on top of those unbeliveable models, it’s hard to trust methods that create new entities solely due to sample variability.

• Karl Rohe says:

Thanks @Dan for sharing the article and the thoughtful write up!

@Erikson, it explicitly does not work for Gaussian latent variables. There is a fancy 1860 theorem that shows something even stronger (Gaussian is the *only* latent variables for which the rotation does not work). As for oblique rotations, I don’t think they do what we think. I think it sometimes looks like we want an oblique rotation because we didn’t do the centering properly (sometimes you should and sometimes you shouldn’t and sometimes both choices are wrong!). This is not clearly discussed in the paper, but is something that I am currently pondering. Remark 3.2 in the paper is related.

If you, or anyone you know, has a simulation (not a data analysis) that demonstrates the need for oblique rotations, I would be so so interested. Everything that I can imagine a simulation for… I can also prove that oblique rotations don’t help.