Principal Component Analysis in Theory and Practice

Prerequisites for this post are linear algebra, including tensors, and basic probability theory. Already knowing how PCA works will also be helpful. In section 1, I'll summarize the technique of principal component analysis (PCA), stubbornly doing so in a coordinate-free manner, partly because I am an asshole but mostly because it is rhetorically useful for emphasizing my point in section 2. In section 2, I'll gripe about how PCA is often used in ways that shouldn't be expected to work, but works just fine anyway. In section 3, I'll discuss some useless but potentially amusing ways that PCA could be modified. Thanks to Laurens Gunnarsen for inspiring this post by talking to me about the problem that I discuss in section 2.

A brief introduction to Principal Component Analysis

You start with a finite-dimensional real inner product space $V$ and a probability distribution $\mu$ on $V$. Actually, you probably just started with a large finite number of elements of $V$, and you've inferred a probability distribution that you're supposing they came from, but that difference is not important here. The goal is to find the $n$-dimensional (for some $n\leq\dim V$) affine subspace $W_{n}\subseteq V$ minimizing the expected squared distance between a vector (distributed according to $\mu$) and its orthogonal projection onto $W_{n}$. We can assume without loss of generality that the mean of $\mu$ is $0$, because we can just shift any probability distribution by its mean and get a probability distribution with mean $0$. This is useful because then $W_{n}$ will be a linear subspace of $V$. In fact, we will solve this problem for all $n\leq\dim V$ simultaneously by finding an ordered orthonormal basis such that $W_{n}$ is the span of the first $n$ basis elements.

First you take $\text{Cov}_{\mu}\in V\otimes V$, called the covariance of $\mu$, defined as the bilinear form on $V^{*}$ given by $\text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\varphi\left(x\right)\psi\left(x\right)d\mu\left(x\right)$. From this, we get the covariance operator $C_{\mu}\in V^{*}\otimes V$ by raising the first index, which means starting with $\left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in V^{*}\otimes V^{*}\otimes V\otimes V$ and performing a tensor contraction (in other words, $C_{\mu}$ is obtained from $\text{Cov}_{\mu}$ by applying the map $V\rightarrow V^{*}$ given by the inner product to the first index). $\text{Cov}_{\mu}$ is symmetric and positive semi-definite, so $C_{\mu}$ is self-adjoint and positive semi-definite, and hence $V$ has an orthonormal basis of eigenvectors of $C_{\mu}$, with non-negative real eigenvalues. This gives an orthonormal basis in which $\text{Cov}_{\mu}$ is diagonal, where the diagonal entries are the eigenvalues. Ordering the eigenvectors in decreasing order of the corresponding eigenvalues gives us the desired ordered orthonormal basis.

The problem

There's no problem with principal component analysis as I described it above. It works just fine, and in fact is quite beautiful. But often people apply principal component analysis to probability distributions on finite-dimensional real vector spaces that don't have a natural inner product structure. There are two closely related problems with this: First, the goal is underdefined. We want to find a projection onto an $n$-dimensional subspace that minimizes the expected squared distance from a vector to its projection, but we don't have a measure of distance. Second, the procedure is underdefined. $\text{Cov}_{\mu}$ is a bilinear form, not a linear operator, so it doesn't have eigenvectors or eigenvalues, and we don't have a way of raising an index to produce something that does. It should come as no surprise that these two problems arise together. After all, you shouldn't be able to find a fully specified solution to an underspecified problem.

People will apply principal component analysis in such cases by picking an inner product. This solves the second problem, since it allows you to carry out the procedure. But it does not solve the first problem. If you wanted to find a projection onto an $n$-dimensional subspace such that the distance from a vector to its projection tends to be small, then you must have already had some notion of distance in mind by which to judge success. Haphazardly picking an inner product gives you a new notion of distance, and then allows you to find an optimal solution with respect to your new notion of distance, and it is not clear to me why you should expect this solution to be reasonable with respect to the notion of distance that you actually care about.

In fact, it's worse than that. Of course, principal component analysis can't given you literally any ordered basis at all, but it is almost as bad. The thing that you use PCA for is the projection onto the span of the first $n$ basis elements along the span of the rest. These projections only depend on the sequence of $1$-dimensional subspaces spanned by the basis elements, and not the basis elements themselves. That is, we might as well only pay attention to the principal components up to scale, rather than making sure that are all unit length. Let a "coordinate system" refer to an ordered basis up to two ordered bases being equivalent if they differ only by scaling the basis vectors, so that we're paying attention to the coordinate systems given to us by PCA. If the covariance of $\mu$ is nondegenerate, then the set of coordinate systems that can be obtained from principal component analysis by a suitable choice of inner product is dense in the space of coordinate systems. More generally, where $U$ is the smallest subspace of $V$ such that $\mu\left(U\right)=1$, then the space of coordinate systems that you can get from principal component analysis is dense in the space of all coordinate systems whose first $\dim U$ coordinates span $U$ ($\dim U$ will be the rank of the covariance of $\mu$). So in a sense, for suitably poor choices of inner product, principal component analysis can give you arbitrarily terrible results, subject only to the weak constraint that it will always notice if all of the vectors in your sample belong to a common subspace.

It is thus somewhat mysterious that machine learning people seem to be able to often get good results from principal component analysis apparently without being very careful about the inner product they choose. Vector spaces that arise in machine learning seem to almost always come with a set of preferred coordinate axes, so these axes are taken to be orthogonal, leaving only the question of how to scale them relative to each other. If these axes are all labeled with the same units, then this also gives you a way of scaling them relative to each other, and hence an inner product. If they are aren't, then I'm under the impression that the most popular method is to normalize them such that the pushforward of $\mu$ along each coordinate axis has the same variance. This is unsatisfying, since figuring out which axes $\mu$ has enough variance along to be worth paying attention to seems like the sort of thing that you would want principal component analysis to be able to tell you. Normalizing the axes in this way seems to me like an admission that you don't know exactly what question you're hoping to use principal component analysis to answer, so you just tell it not to answer that part of the question to minimize the risk of asking it to answer the wrong question, and let it focus on telling you how the axes, which you're pretty sure should be considered orthogonal, correlate with each other.

That conservatism is actually pretty understandable, because figuring out how to ask the right question seems hard. You implicitly have some metric $d$ on $V$ such that you want to find a projection $\pi$ onto an $n$-dimensional subspace such that $d\left(x,\pi\left(x\right)\right)$ is usually small when $x$ is distributed according to $\mu$. This metric is probably very difficult to describe explicitly, and might not be the metric induced by any inner product (for that matter, it might not even be a metric; $d\left(x,y\right)$ could be any way of quantifying how bad it is to be told the value $y$ when the correct value you wanted to know is $x$). Even if you somehow manage to explicitly describe your metric, coming up with a version of PCA with the inner product replaced with an arbitrary metric also sounds hard, so the next thing you would want to do is fit an inner product to the metric.
The usual approach is essentially to skip the step of attempting to explicitly describe the metric, and just find an inner product that roughly approximates your implicit metric based on some rough heuristics about what the implicit metric probably looks like. The fact that these heuristics usually work so well seems to indicate that the implicit metric tends to be fairly tame with respect to ways of describing the data that we find most natural. Perhaps this shouldn't be too surprising, but I still feel like this explanation does not make it obvious a priori that this should work so well in practice. It might be interesting to look into why these heuristics work as well as they do with more precision, and how to go about fitting a better inner product to implicit metrics. Perhaps this has been done, and I just haven't found it.

To take a concrete example, consider eigenfaces, the principal components that you get from a set of images of people's faces. Here, you start with the coordinates in which each coordinate axis represents a pixel in the image, and the value of that coordinate is the brightness of the corresponding pixel. By declaring that the coordinate axes are orthogonal, and measuring the brightness of each pixel on the same scale, we get our inner product, which is arguably a fairly natural one.

Presumably, the implicit metric we're using here is visual distance, by which I mean a measure of how similar two images look. It seems clear to me that visual distance is not very well approximated by our inner product, and in fact, there is no norm such that the visual distance between two images is approximately the norm of their difference. To see this, if you take an image and make it brighter, you haven't changed how it looks very much, so the visual distance between the image and its brighter version is small. But their difference is just a dimmer version of the same image, and if you add that difference to a completely different image, you will get the two images superimposed on top of each other, a fairly radical change. Thus the visual distance traversed by adding a vector depends on where you start from.

Despite this, producing eigenfaces by using PCA on images of faces, using the inner product described above, performs well with respect to visual distance, in the sense that you can project the images onto a relatively small number of principal components and leave them still recognizable. I think this can be explained on an intuitive level. In a human eye, each photoreceptor has a narrow receptive field that it detects light in, much like a pixel, so the representation of an image in the eye as patterns of photoreceptor activity is very similar to the representation of an image in a computer as a vector of pixel brightnesses, and the inner product metric is a reasonable measure of distance in this representation. When the visual cortex processes this information from the eye, it is difficult (and perhaps also not useful) for it to make vast distinctions between images that are close to each other according to the inner product metric, and thus result in similar patterns of photoreceptor activity in the eye. Thus the visual distance between two images cannot be too much greater than their inner product distance, and hence changing an image by a small amount according to the inner product metric can only change it by a small amount according to visual distance, even though the reverse is not true.

Generalizations

The serious part of this post is now over. Let's have some fun. Some of the following ways of modifying principal component analysis could be combined, but I'll consider them one at a time for simplicity.

As hinted at above, you could start with an arbitrary metric on $V$ rather than an inner product, and try to find the rank-$n$ projection (for some $n\leq\dim V$) that minimizes the expected squared distance from a vector to its projection. This would probably be difficult, messy, and not that much like principal component analysis. If it can be done, it would be useful in practice if we were much better at fitting explicit metrics to our implicit metrics than at fitting inner products to our implicit metrics, but I'm under the impression that this is not currently the case. This also differs from the other proposals in this section in that it is a modification of the problem looking for a solution, rather than a modification of the solution looking for a problem.

$V$ could be a real Hilbert space that is not necessarily finite-dimensional. Here we can run into the problem that $C_{\mu}$ might not even have any eigenvectors. However, if $\mu$ (which hopefully was not inferred from a finite sample) is Gaussian (and possibly also under weaker conditions), then $C_{\mu}$ is a compact operator, so $V$ does have an orthonormal basis of eigenvectors of $C_{\mu}$, which still have non-negative eigenvalues. There probably aren't any guarantees you can get about the order-type of this orthonormal basis when you order the eigenvectors in decreasing order of their eigenvalues, and there probably isn't a sense in which the orthogonal projection onto the closure of the span of an initial segment of the basis accounts for the most variance of any closed subspace of the same "size" ("size" would have to refer to a refinement of the notion of dimension for this to be the case). However, a weaker statement is probably still true: namely that each orthonormal basis element maximizes the variance that it accounts for conditioned on values along the previous orthonormal basis elements. I guess considering infinite-dimensional vector spaces goes against the spirit of machine learning though.

$V$ could be a finite-dimensional complex inner product space. $\text{Cov}_{\mu}\in\overline{V}\otimes V$ would be the sesquilinear form on $V^{*}$ given by $\text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\overline{\varphi\left(x\right)}\psi\left(x\right)d\mu\left(x\right)$. $\left\langle \cdot,\cdot\right\rangle \in\overline{V}^{*}\otimes V^{*}$, so $\left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in\overline{V}^{*}\otimes V^{*}\otimes\overline{V}\otimes V$, and applying a tensor contraction to the conjugated indices gives us our covariance operator $C_{\mu}\in V^{*}\otimes V$ (in other words, the inner product gives us an isomorphism $\overline{V}\rightarrow V^{*}$, and applying this to the first index of $\text{Cov}_{\mu}$ gives us $C_{\mu}\in V^{*}\otimes V$). $C_{\mu}$ is still self-adjoint and positive semi-definite, so $V$ still has an orthonormal basis of eigenvectors with non-negative real eigenvalues, and we can order the basis in decreasing order of the eigenvalues. Analogously to the real case, projecting onto the span of the first $n$ basis vectors along the span of the rest is the complex rank-$n$ projection that minimizes the expected squared distance from a vector to its projection. As far as I know, machine learning tends to deal with real data, but if you have complex data and for some reason you want to project onto a lower-dimensional complex subspace without losing too much information, now you know what to do.

Suppose your sample consists of events, where you've labeled them with both their spatial location and the time at which they occurred. In this case, events are represented as points in Minkowski space, a four-dimensional vector space representing flat spacetime, which is equipped with a nondegenerate symmetric bilinear form called the Minkowski inner product, even though it is not an inner product because it is not positive-definite. Instead, the Minkowski inner product is such that $\left\langle x,x\right\rangle$ is positive if $x$ is a space-like vector, negative if $x$ is time-like, and zero if $x$ is light-like. We can still get $C_{\mu}\in V^{*}\otimes V$ out of $\text{Cov}_{\mu}\in V\otimes V$ and the Minkowski inner product in $V^{*}\otimes V^{*}$ in the same way, and $V$ has a basis of eigenvectors of $C_{\mu}$, and we can still order the basis in decreasing order of their eigenvalues. The first 3 eigenvectors will be space-like, with non-negative eigenvalues, and the last eigenvector will be time-like, with a non-positive eigenvalue. The eigenvectors are still orthogonal. Thus principal component analysis provides us with a reference frame in which the span of the first 3 eigenvectors is simultaneous, and the span of the last eigenvector is motionless. If $\mu$ is Gaussian, then this will be the reference frame in which the spatial position of an event and the time at which it occurs are mean independent of each other, meaning that conditioning on one of them doesn't change the expected value of the other one. For general $\mu$, there might not be a reference frame in which the space and time of an event are mean independent, but the reference frame given to you by by principal component analysis is still the unique reference frame with the property that the time coordinate is uncorrelated with any spatial coordinate.

More generally, we could consider $V$ equipped with any symmetric bilinear form $\left\langle \cdot,\cdot\right\rangle$ taking the role of the inner product. Without loss of generality, we can consider only nondegenerate symmetric bilinear forms, because in the general case, where $D:=\left\{ x\mid\forall y\,\left\langle x,y\right\rangle =0\right\}$, applying principal component analysis with $\left\langle \cdot,\cdot\right\rangle$ is equivalent to projecting the data onto $V/D$, applying principal component analysis there with the nondegenerate symmetric bilinear form on $V/D$ induced by $\left\langle \cdot,\cdot\right\rangle$, and then lifting back to $V$ and throwing in a basis for $D$ with eigenvalues $0$ at the end, essentially treating $D$ as the space of completely irrelevant distinctions between data points that we intend to immediately forget about. Anyway, nondegenerate symmetric bilinear forms are classified up to isomorphism by their signature $\left(n,m\right)$, which is such that any orthogonal basis contains exactly $n+m$ basis elements, $n$ of which are space-like and $m$ of which are time-like, using the convention that $x$ is space-like if $\left\langle x,x\right\rangle >0$, time-like if $\left\langle x,x\right\rangle <0$, and light-like if $\left\langle x,x\right\rangle =0$, as above. Using principal component analysis on probability distributions over points in spacetime (or rather, points in the tangent space to spacetime at a point, so that it is a vector space) in a universe with $n$ spatial dimensions and $m$ temporal dimensions still gives you a reference frame in which the span of the first $n$ basis vectors is simultaneous and the span of the last $m$ basis vectors is motionless, and this is again the unique reference frame in which each time coordinate is uncorrelated with each spatial coordinate. Incidentally, I've heard that much of physics still works with multiple temporal dimensions. I don't know what that would mean, except that I think it means there's something wrong with my intuitive understanding of time. But that's another story. Anyway, the spaces spanned by the first $n$ and by the last $m$ basis vectors could be used to establish a reference frame, and then the data might be projected onto the first few (at most $n$) and last few (at most $m$) coordinates to approximate the positions of the events in space and in time, respectively, in that reference frame.