Exact 2-cycles are degenerate isomorphisms

The situation in which you have vector spaces V and W, and linear maps f:V\rightarrow W and g:W\rightarrow V such that \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right) often arises in the situation in which you would have an isomorphism between V and W if you knew how to divide by 0. Specifically, this happens when you'd need to divide by 0 exactly once; in similar situations in which you'd need to know how to divide by 0 multiple times in order to get an isomorphism, you get f:V\rightarrow W and g:W\rightarrow V such that f\circ g=0 and g\circ f=0 but whose kernels and images are not necessarily equal.

I'll call such a pair \left(f,g\right) with \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right) an exact 2-cycle of vector spaces. Note that the two vector spaces V and W in an exact 2-cycle are in fact isomorphic, as \dim\left(V\right)=\dim\left(\ker\left(f\right)\right)+\dim\left(\text{im}\left(f\right)\right)=\dim\left(\text{im}\left(g\right)\right)+\dim\left(\ker\left(g\right)\right)=\dim\left(W\right).


Given a finite-dimensional vector space V and an invertible linear map f:V\rightarrow V, its adjugate is almost its inverse; you just have to divide by \det\left(f\right). If f:V\rightarrow V is not invertible, then of course, \det\left(f\right)=0, so dividing by \det\left(f\right) doesn't work. But if f has nullity 1, then \ker\left(f\right)=\text{im}\left(\text{adj}\left(f\right)\right) and \ker\left(\text{adj}\left(f\right)\right)=\text{im}\left(f\right). That is, \left(f,\text{adj}\left(f\right)\right) is an exact 2-cycle. If f has nullity N\left(f\right)\geq2, then \det\left(f\right)=0^{N\left(f\right)}, and hence inverting f requires dividing by 0 more than once, and \text{adj}\left(f\right)=0.

Homogeneous polynomials and multilinear forms

Given a vector space V over a field k, let k\left[V\right]_{2} denote the space of quadratic forms on V (that is, homogeneous quadratic polynomial maps V\rightarrow k), and let Sym^{2}V^{*} denote the space of symmetric bilinear forms on V.

Given a symmetric bilinear form B on V, we can construct a quadratic form Q on V by Q\left(\vec{x}\right):=B\left(\vec{x},\vec{x}\right). This gives us a map f:Sym^{2}V^{*}\rightarrow k\left[V\right]_{2} by f\left(B\right)\left(\vec{x}\right)=B\left(\vec{x},\vec{x}\right).

2B\left(\vec{x},\vec{y}\right)=B\left(\vec{x}+\vec{y},\vec{x}+\vec{y}\right)-B\left(\vec{x},\vec{x}\right)-B\left(\vec{y},\vec{y}\right), so we can recover B from f\left(B\right) by B\left(\vec{x},\vec{y}\right)=\frac{1}{2}\left(f\left(B\right)\left(\vec{x}+\vec{y}\right)-f\left(B\right)\left(\vec{x}\right)-f\left(B\right)\left(\vec{y}\right)\right). That is, the map g:k\left[V\right]_{2}\rightarrow Sym^{2}V^{*} given by g\left(Q\right)\left(\vec{x},\vec{y}\right)=Q\left(\vec{x}+\vec{y}\right)-Q\left(\vec{x}\right)-Q\left(\vec{y}\right) is twice the inverse of f.

This doesn't quite work if \text{char}\left(k\right)=2, since we can't do the part where we divide by 2. In fact, f is not invertible in this case. But g is still a well-defined map k\left[V\right]_{2}\rightarrow Sym^{2}V^{*}, and it's still true that g\circ f=2 id_{Sym^{2}V^{*}} and f\circ g=2 id_{k\left[V\right]_{2}}; it's just that now that means g\circ f=0 and f\circ g=0. In fact, \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right). \ker\left(g\right) and \text{im}\left(f\right) are the \dim\left(V\right)-dimensional space of diagonal quadratic forms (polynomials that are linear combinations of squares of linear functions V\rightarrow k), and \ker\left(f\right) and \text{im}\left(g\right) are the {\dim\left(V\right) \choose 2}-dimensional space of alternating symmetric bilinear forms. Thus Sym^{d}V^{*} and k\left[V\right]_{2} are both \dim\left(V\right)+{\dim\left(V\right) \choose 2}-dimensional.

Similar things happen with higher degree homogeneous polynomials and symmetric multilinear forms. Let k\left[V\right]_{d} be the space of homogeneous degree-d polynomials on V and Sym^{d}V^{*} the space of symmetric d-linear forms on V. We have functions f:Sym^{d}V^{*}\rightarrow k\left[V\right]_{d} given by f\left(\Phi\right)\left(\vec{x}\right):=\Phi\left(\vec{x},...,\vec{x}\right) and g:k\left[V\right]_{d}\rightarrow Sym^{d}V^{*} given by g\left(\phi\right)\left(\vec{x}^{1},...,\vec{x}^{d}\right)=\sum_{I\subseteq\left[d\right]}\left(-1\right)^{d-\left|I\right|}\phi\left(\sum_{i\in I}\vec{x}^{i}\right). g\left(f\left(\Phi\right)\right)=d!\Phi and f\left(g\left(\phi\right)\right)=d!\phi, so if \text{char}\left(k\right)=0 or \text{char}\left(k\right)>d, then f and g are bijections, and d! times each others' inverse. Otherwise, g\circ f=0 and f\circ g=0. If \frac{d}{2}<\text{char}\left(k\right)\leq d, then \text{char}\left(k\right) divides d! with multiplicity 1, and \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right). If 1<\text{char}\left(k\right)\leq\frac{d}{2}, then \text{char}\left(k\right) divides d! with multiplicity \geq2, and all bets are off. Though \dim\left(k\left[V\right]_{d}\right)=\dim\left(Sym^{d}V^{*}\right)={\dim\left(V\right)+d-1 \choose d}, no matter what \text{char}\left(k\right) is.

Newtonian spacetime

In special relativity, we work with a 4-dimensional (3 for space and 1 for time) real vector space T, with a symmetric bilinear form \left\langle \cdot,\cdot\right\rangle , called the Minkowski inner product, of signature \left(-+++\right); that is, the associated quadratic form can be given, in coordinates, by -t^{2}+x^{2}+y^{2}+z^{2} (t is the time coordinate and x,y,z are spatial coordinates for some reference frame). If \left\langle \vec{v},\vec{v}\right\rangle >0, then \vec{v} is spacelike, and \sqrt{\left\langle \vec{v},\vec{v}\right\rangle } measures its distance (in the reference frame in which its temporal coordinate is 0). If \left\langle \vec{v},\vec{v}\right\rangle <0, then \vec{v} is timelike, and \sqrt{-\left\langle \vec{v},\vec{v}\right\rangle } measures its duration (in the reference frame in which it is at rest). By currying, the Minkowski inner product can be seen as a linear map T\rightarrow T^{*}, where T^{*} is the vector space of linear maps T\rightarrow\mathbb{R}. Since the Minkowski inner product is nondegenerate, this linear map T\rightarrow T^{*} is an isomorphism.

In Newtonian physics, things are a little different. We can still work in 4-dimensional spacetime, but we don't have a single Minkowski inner product measuring both distance and duration. We do have a global notion of time; that is, there's a linear map t:T\rightarrow\mathbb{R} that tells you what time it is at each point in spacetime. \ker\left(t\right) is space in the present moment, so it should be Euclidean space; that is, it should be equipped with an ordinary inner product.

The time function t induces a degenerate inner product on T by \left\langle \vec{v},\vec{w}\right\rangle :=t\left(\vec{v}\right)t\left(\vec{w}\right). As before, this can be seen as a linear map T\rightarrow T^{*} (it sends \vec{v} to t\left(\vec{v}\right)t), with 1-dimensional image \text{span}\left(t\right) and 3-dimensional kernel \ker\left(t\right).

The ordinary inner product on \ker\left(t\right) gives us a degenerate inner product on T^{*}: since our inner product on \ker\left(t\right) is non-degenerate, it induces an isomorphism between \ker\left(t\right) and its dual, and hence induces an inner product on \ker\left(t\right)^{*}. There's a canonical map T^{*}\rightarrow\ker\left(t\right)^{*} given by restriction: \varphi\mapsto\varphi\restriction_{\ker\left(t\right)}. So given \varphi,\psi\in T^{*}, we can define their inner product to be the spatial inner product of their restrictions to \ker\left(t\right). This can be seen as a linear map T^{*}\rightarrow T (given \varphi:T\rightarrow\mathbb{R}, restrict it to \ker\left(t\right), and then find the element of \ker\left(t\right)\subseteq T that corresponds to it via the spatial inner product) with image \ker\left(t\right) and kernel \text{span}\left(t\right). We have thus found canonical maps T\rightarrow T^{*} and T^{*}\rightarrow T such that the kernel of each is the image of the other.


In the spacetime example, it is conventional in special relativity to normalize the speed of light to 1. But another thing we can do is let the speed of light be the variable c. So \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-c^{2}t_{1}t_{2}+x_{1}x_{2}+y_{1}y_{2}+z_{1}z_{1}. As a map T\rightarrow T^{*}, this is \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -c^{2}t & x & y & z\end{array}\right]. The inverse map T^{*}\rightarrow T is \left[\begin{array}{cccc} \tau & \alpha & \beta & \gamma\end{array}\right]\mapsto\left[\begin{array}{c} -c^{-2}\tau\\ \alpha\\ \beta\\ \gamma \end{array}\right], or, as an inner product on T^{*}, \left\langle \left[\begin{array}{cccc} \tau_{1} & \alpha_{1} & \beta_{1} & \gamma_{1}\end{array}\right],\left[\begin{array}{cccc} \tau_{2} & \alpha_{2} & \beta_{2} & \gamma_{2}\end{array}\right]\right\rangle =-c^{-2}\tau_{1}\tau_{2}+\alpha_{1}\alpha_{2}+\beta_{1}\beta_{2}+\gamma_{1}\gamma_{2}. We're going to want to take a limit as c\rightarrow\infty and get something finite, so we'll have to scale our inner product on T down by a factor of c^{2}, giving us \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-t_{1}t_{2}+c^{-2}x_{1}x_{2}+c^{-2}y_{1}y_{2}+c^{-2}z_{1}z_{1}, or, as a map T\rightarrow T^{*}, \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -t & c^{-2}x & c^{-2}y & c^{-2}z\end{array}\right]. The limit c\rightarrow\infty gives us our temporal inner product on Newtonian spacetime, \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-t_{1}t_{2}, and our spatial inner product on the dual space \left\langle \left[\begin{array}{cccc} \tau_{1} & \alpha_{1} & \beta_{1} & \gamma_{1}\end{array}\right],\left[\begin{array}{cccc} \tau_{2} & \alpha_{2} & \beta_{2} & \gamma_{2}\end{array}\right]\right\rangle =\alpha_{1}\alpha_{2}+\beta_{1}\beta_{2}+\gamma_{1}\gamma_{2}, giving us our exact 2-cycle of maps between T and T^{*}, \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -t & 0 & 0 & 0\end{array}\right] and \left[\begin{array}{cccc} \tau & \alpha & \beta & \gamma\end{array}\right]\mapsto\left[\begin{array}{c} 0\\ \alpha\\ \beta\\ \gamma \end{array}\right]. (I did say that this should only work if we have to divide by 0 once, not if we must do so twice, and this involved c^{2}, but we never used c on its own anywhere, so we can just say c=\sqrt{\infty}, and it's fine).

Let's go back to the first example. Given f:V\rightarrow V of nullity 1, perturb f slightly to make it invertible by adding an infinitesimal \varepsilon times some map g:V\rightarrow V. The only condition we need g to satisfy is g\left(\ker\left(f\right)\right)\nsubseteq\text{im}\left(f\right). That way \det\left(f+\varepsilon g\right), which must be a multiple of \varepsilon, is not a multiple of \varepsilon^{2}. \left(f+\varepsilon g\right)\circ\text{adj}\left(f+\varepsilon g\right)=\text{adj}\left(f+\varepsilon g\right)\circ\left(f+\varepsilon g\right)=\det\left(f+\varepsilon g\right)id_{V}. Clearly f\circ\text{adj}\left(f\right)=\text{adj}\left(f\right)\circ f=\det\left(f\right)id_{V}=0. Given \vec{x}\in\ker\left(f\right), \left(f+\varepsilon g\right)\left(\vec{x}\right)=\varepsilon g\left(\vec{x}\right), so \text{adj}\left(f+\varepsilon g\right)\left(\varepsilon g\left(\vec{x}\right)\right)=\det\left(f+\varepsilon g\right)\vec{x}. Hence \text{adj}\left(f+\varepsilon g\right)\left(\frac{\varepsilon}{\det\left(f+\varepsilon g\right)}g\left(\vec{x}\right)\right)=\vec{x}. Since \det\left(f+\varepsilon g\right) has 0 constant term but nonzero coefficient of \varepsilon, \frac{\varepsilon}{\det\left(f+\varepsilon g\right)} can be evaluated at \varepsilon=0, and has a nonzero, finite value. Then \text{adj}\left(f\right)\left(\frac{\varepsilon}{\det\left(f+\varepsilon g\right)}\vert_{\varepsilon=0}g\left(\vec{x}\right)\right)=\vec{x}. So \left(f,\text{adj}\left(f\right)\right) forms an exact 2-cycle for reasons closely relating to the fact that perturbing each of them infinitesimally can make them inverses up to an infinitesimal scalar multiple.

Now, in the second example, where V is a vector space over a field k of positive characteristic, \frac{d}{2}<\text{char}\left(k\right)\leq d, and we have an exact 2-cycle f:Sym^{d}V^{*}\rightarrow k\left[V\right]_{d}, g:k\left[V\right]_{d}\rightarrow Sym^{d}V^{*}, let {\cal O} be an integral domain of characteristic 0 with a unique maximal ideal \mathfrak{m}, such that {\cal O}/\mathfrak{m}=k and \text{char}\left(k\right)\notin\mathfrak{m}^2 (for instance, if k=\mathbb{F}_{p}, we can use {\cal O}=\mathbb{Z}_{p} and \mathfrak{m}=p\mathbb{Z}_{p}). Lift V to a free {\cal O}-module \tilde{V} with \tilde{V}\otimes k=V (in coordinates, this means, instead of V=k^{n}, work with \tilde{V}={\cal O}^{n}, which carries a natural map to k^{n} by reducing each coordinate mod \mathfrak{m}). Then there are natural maps \tilde{f}:Sym^{d}\tilde{V}^{*}\rightarrow{\cal O}\left[\tilde{V}\right]_{d} and \tilde{g}:{\cal O}\left[\tilde{V}\right]_{d}\rightarrow Sym^{d}\tilde{V}^{*} such that \tilde{g}\circ\tilde{f}=d!id_{Sym^{d}\tilde{V}^{*}} and \tilde{f}\circ\tilde{g}=d!id_{{\cal O}\left[\tilde{V}\right]_{d}}, and \tilde{f} and \tilde{g} reduce mod \mathfrak{m} to f and g, respectively. Where K is the field of fractions of {\cal O} (so K=\mathbb{Q}_p in our example with k=\mathbb{F}_p and \cal{O}=\mathbb{Z}_p), \tilde{f}\otimes K:Sym^{d}\left(\tilde{V}\otimes K\right)^{*}\rightarrow K\left[\tilde{V}\otimes K\right]_{d} and \tilde{g}\otimes K:K\left[\tilde{V}\otimes K\right]_{d}\rightarrow Sym^{d}\left(\tilde{V}\otimes K\right)^{*} are bijections (in coordinates, \tilde{V}\otimes K=K^{n}, and tensoring a map with K just means the same map extended over the field of fractions), as they are inverses of each other up to a multiple of d! , which is invertible in K. Since \text{char}\left(k\right)\leq d, d!\in\mathfrak{m}, and g\circ f=0 and f\circ g=0. Given \phi\in\ker\left(g\right), if we lift \phi to \tilde{\phi}\in{\cal O}\left[\tilde{V}\right]_{d}, \tilde{g}\left(\tilde{\phi}\right)\in\mathfrak{m}Sym^{d}\tilde{V}^{*}. Since \text{char}\left(k\right)>\frac{d}{2}, d!\notin\mathfrak{m}^{2}, and hence \frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\in Sym^{d}\tilde{V}^{*}, and of course, \tilde{f}\left(\frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\right)=\tilde{\phi}. Reducing mod \mathfrak{m}, we get f\left(\frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\mod\mathfrak{m}\right)=\phi. Thus \ker\left(g\right)=\text{im}\left(f\right). Similarly, given \Phi\in\ker\left(f\right), lift \Phi to \tilde{\Phi}\in Sym^{d}\tilde{V}^{*}. \tilde{f}\left(\tilde{\Phi}\right)\in\mathfrak{m}{\cal O}\left[\tilde{V}\right]_{d}. \frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\in{\cal O}\left[\tilde{V}\right]_{d}, and \tilde{g}\left(\frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\right)=\tilde{\Phi}. Reducing mod \mathfrak{m}, we get g\left(\frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\mod\mathfrak{m}\right)=\Phi. Thus \ker\left(f\right)=\text{im}\left(g\right). So \left(f,g\right) forms an exact cycle because, in K, they are inverses up to a factor of d! , which we can divide by, and which is 0 with multiplicity 1 in k, since d!\in\mathfrak{m}\setminus\mathfrak{m}^{2}.

The general story

All three arguments from the previous section took the following form: Let {\cal O} be a discrete valuation ring with residue field k, field of fractions K, and valuation \nu:K^{\times}\rightarrow\mathbb{Z}. Let V and W be free {\cal O}-modules, and let f:V\rightarrow W and g:W\rightarrow V be such that f\circ g=\varepsilon id_{W} and g\circ f=\varepsilon id_{V}, for some \varepsilon\in{\cal O} with \nu\left(\varepsilon\right)=1. Then f\otimes K:V\otimes K\rightarrow W\otimes K and g\otimes K:W\otimes K\rightarrow V\otimes K are isomorphisms, and each is \varepsilon times the inverse of the other. f\otimes k:V\otimes k\rightarrow W\otimes k and g\otimes k:W\otimes k\rightarrow V\otimes k form an exact 2-cycle: they compose to 0 because f and g compose to \varepsilon id, which goes to 0 in k, and given \vec{x}\in V\otimes k such that \left(f\otimes k\right)\left(\vec{x}\right)=0, we can lift \vec{x} to \tilde{x}\in V. f\left(\tilde{x}\right)\in\varepsilon W, so \varepsilon^{-1}f\left(\tilde{x}\right)\in W, and g\left(\varepsilon^{-1}f\left(\tilde{x}\right)\right)=\tilde{x}, so tensoring with k sends \varepsilon^{-1}f\left(\tilde{x}\right) to some \vec{y}\in W such that \left(g\otimes k\right)\left(\vec{y}\right)=\vec{x}. Thus \ker\left(f\right)=\text{im}\left(g\right). The same argument with f and g switched shows \ker\left(g\right)=\text{im}\left(f\right). The exact 2-cycle \left(f\otimes k,g\otimes k\right) is a sort of shadow of the isomorphisms f\otimes K,g\otimes K.

In the spacetime example, K=\mathbb{R}\left(\left(c^{-2}\right)\right), {\cal O}=\mathbb{R}\left[\left[c^{-2}\right]\right], k=\mathbb{R}, and \varepsilon=c^{-2}. In the adjugates example, K=k\left(\left(\varepsilon\right)\right), {\cal O}=k\left[\left[\varepsilon\right]\right], and the \varepsilon in the general story is \det\left(f+\varepsilon g\right). In the homogeneous polynomials and symmetric multilinear forms example, K is a discretely valued field of characteristic 0 with residue field k, {\cal O} is its valuation ring, and \varepsilon=d! .

All exact 2-cycles of vector spaces can be fit into this general story. Given any exact 2-cycle f:V\rightarrow W, g:W\rightarrow V (V, W vector spaces over k), we can take a discretely valued field K with residue field k, and then lift f,g to \tilde{f},\tilde{g} with \tilde{g}\circ\tilde{f}=\varepsilon id for some \varepsilon\in K with \nu\left(\varepsilon\right)=1, exactly the conditions in the above argument.

What more?

What about exact 2-cycles in abelian categories other than vector spaces? In general, the two objects in an exact 2-cycle need not be isomorphic. For instance, with abelian groups, there's an exact 2-cycle between the 4-element cyclic group and the Klein four-group. Though two objects in an exact 2-cycle must be isomorphic in any category in which every short exact sequence splits (this is the gist of the dimension-counting argument from the beginning showing that two vector spaces in an exact 2-cycle must be isomorphic). Is there still some way of seeing exact 2-cycles as degenerate isomorphisms even in contexts in which there need not be actual isomorphisms?

Also, what about exact n-cycles? That is, a cycle of n functions such that the image of each is the kernel of the next. If an exact 2-cycle is a degenerate form of an isomorphism, and an isomorphism is an exact sequence of length 2, then perhaps an exact 3-cycle should be a degenerate form of an exact sequence of length 3 (i.e. a short exact sequence). This is hard to picture, as a short exact sequence is not symmetric between its objects. However, for reasons not understood by me, algebraic topologists care about exact 3-cycles in which two of the three objects involved are the same (these are called exact couples), and this apparently has something to do with short exact sequences in which the first two objects are isomorphic, which provides some support for the idea that exact 3-cycles should have something to do with short exact sequences. An exact sequence of length 1 just consists of the 0 object, so this suggests an exact 1-cycle (i.e. an endomorphism of an object whose kernel and image are the same) should be considered a degenerate form of the 0 object, which is also hard to picture.

Metamathematics and probability

Content warning: mathematical logic.

Note: This write-up consists mainly of open questions rather than results, but may contain errors anyway.


I'd like to describe a logic for talking about probabilities of logical sentences. Fix some first-order language {\cal L}. This logic deals with pairs \left(\varphi,p\right), which I'm calling assertions, where \varphi\in{\cal L} is a formula and p\in\left[0,1\right]. Such a pair is to be interpreted as a claim that \varphi has probability at least p.

A theory consists of a set of assertions. A model of a theory T consists of a probability space \left(X,P\right) whose points are {\cal L}-structures, such that for every assertion \left(\varphi,p\right)\in T, P_{*}\left(\left\{ {\cal M}\in X\mid{\cal M}\models\varphi\right\} \right)\geq p, where P_{*} is inner probability. I'll write T\vdash_{p}\varphi for \left(\varphi,p\right) can be proved from T, and T\models_{p}\varphi for all models of T are also models of \left\{ \left(\varphi,p\right)\right\} .

The rules of inference are all rules \Gamma\vdash_{p}\varphi where \Gamma is a finite set of assertions, and \left(\varphi,p\right) is an assertion such that P_{*}\left(\left\{ {\cal M}\in X\mid{\cal M}\models\varphi\right\} \right)\geq p in all models of \Gamma. Can we make an explicit finite list of inference rules that generate this logic? If not, is the set of inference rules at least recursively enumerable? (For recursive enumerability to make sense here, we need to restrict attention to probabilities in some countable dense subset of \left[0,1\right] that has a natural explicit bijection with \mathbb{N}, such as \mathbb{Q}\cap\left[0,1\right].) I'm going to assume later that the set of inference rules is recursively enumerable; if it isn't, everything should still work if we use some recursively enumerable subset of the inference rules that includes all of the ones that I use.

Note that the compactness theorem fails for this logic; for example, \left\{ \left(\varphi,p\right)\mid p<1\right\} \models_{1}\varphi, but no finite subset of \left\{ \left(\varphi,p\right)\mid p<1\right\} implies \left(\varphi,1\right), and hence \left\{ \left(\varphi,p\right)\mid p<1\right\} \nvdash_{1}\varphi.

Any classical first-order theory T can be converted into a theory in this logic as \left\{ \left(\varphi,1\right)\mid T\vdash\varphi\right\} .

Löb's Theorem

Let T be a consistent, recursively axiomatizable extension of Peano Arithmetic. By the usual sort of construction, there is a \Sigma_{1}^{0} binary predicate \square_{y}\left(x\right) such that T\vdash_{p}\varphi\iff\mathbb{N}\models\square_{p}\left(\ulcorner\varphi\urcorner\right) for any sentence \varphi and p\in\left[0,1\right]\cap\mathbb{Q}, where \ulcorner\urcorner is a coding of sentences with natural numbers. We have a probabilistic analog of Löb's theorem: if T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi, then T\vdash_{p}\varphi. Peano arithmetic can prove this theorem, in the sense that PA\vdash_{1}\square_{p}\left(\ulcorner\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi\urcorner\right)\rightarrow\square_{p}\left(\ulcorner\varphi\urcorner\right).

Proof: Assume T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi. By the diagonal lemma, there is a sentence \psi such that T\vdash_{1}\psi\leftrightarrow\left(\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi\right). If \square_{p}\left(\ulcorner\psi\urcorner\right), then \square_{1}\left(\ulcorner\square_{p}\left(\ulcorner\psi\urcorner\right)\urcorner\right) and \square_{p}\left(\ulcorner\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi\urcorner\right), so \square_{p}\left(\ulcorner\varphi\urcorner\right). This shows that T\cup\left\{ \left(\square_{p}\left(\ulcorner\psi\urcorner\right),1\right)\right\} \vdash_{1}\square_{p}\left(\ulcorner\varphi\urcorner\right). By the assumption that T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi, this implies that T\cup\left\{ \left(\square_{p}\left(\ulcorner\psi\urcorner\right),1\right)\right\} \vdash_{p}\varphi. By a probabilistic version of the deduction theorem, T\vdash_{p}\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi. That is, T\vdash_{p}\psi. Going back around through all that again, we get T\vdash_{p}\varphi.

If we change the assumption to be that T\vdash_{q}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi for some q<p, then the above proof does not go through (if q>p, then it does, because \left(\theta,q\right)\vdash_{p}\theta). Is there a consistent theory extending Peano Arithmetic that proves a soundness schema about itself, \left\{ \left(\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi,q\right)\mid q<p\right\} , or can this be used to derive a contradiction some other way? If there is no such consistent theory, then can the soundness schema be modified so that it is consistent, while still being nontrivial? If there is such a consistent theory with a soundness schema, can the theory also be sound? That is actually several questions, because there are multiple things I could mean by "sound". The possible syntactic things "sound" could mean, in decreasing order of strictness, are: 1) The theory does not assert a positive probability to any sentence that is false in \mathbb{N}. 2) There is an upper bound below 1 for all probabilities asserted of sentences that are false in \mathbb{N}. 3) The theory does not assert probability 1 to any sentence that is false in \mathbb{N}.

There are also semantic versions of the above questions, which are at least as strict as their syntactic analogs, but probably aren't equivalent to them, since the compactness theorem does not hold. The semantic version of asking if the soundness schema is consistent is asking if it has a model. The first two soundness notions also have semantic analogs. 1') \left\{ \mathbb{N}\right\} is a model of the theory. 2') There is a model of the theory that assigns positive probability to \mathbb{N}. I don't have a semantic version of 3, but metaphorically speaking, a semantic version of 3 should mean that there is a model that assigns nonzero probability density at \mathbb{N}, even though it might not have a point mass at \mathbb{N}.


This is somewhat similar to Definability of Truth in Probabilistic Logic. But in place of adding a probability predicate to the language, I'm only changing the metalanguage to refer to probabilities, and using this to express statements about probability in the language through conventional metamathematics. An advantage of this approach is that it's constructive. Theories with the properties described by the Christiano et al paper are unsound, so if some reasonably strong notion of soundness applies to an extension of Peano Arithmetic with the soundness schema I described, that would be another advantage of my approach.

A type of situation that this might be useful for is that when an agent is reasoning about what actions it will take in the future, it should be able to trust its future self's reasoning. An agent with the soundness schema can assume that its future self's beliefs are accurate, up to arbitrarily small loss in precision. A related type of situation is if an agent reaches some conclusion, and then writes it to external storage instead of its own memory, and later reads the claim it had written to external storage. With the soundness schema, if the agent has reason to believe that the external storage hasn't been tampered with, it can reason that since its past self had derived the claim, the claim is to be trusted arbitrarily close to as much as it would have been if the agent had remembered it internally.

First Incompleteness Theorem

For a consistent theory T, say that a sentence \varphi is T-measurable if there is some p\in\left[0,1\right] such that T\vdash_{q}\varphi for every q<p and T\vdash_{q}\neg\varphi for every q<1-p. So T-measurability essentially means that T pins down the probability of the sentence. If \varphi is not T-measurable, then you could say that T has Knightian uncertainty about \varphi. Say that T is complete if every sentence is T-measurable. Essentially, complete theories assign a probability to every sentence, while incomplete theories have Knightian uncertainty.

The first incompleteness theorem (that no recursively axiomatizable extension of PA is consistent and complete) holds in this setting. In fact, for every consistent recursively axiomatizable extension of PA, there must be sentences that are given neither a nontrivial upper bound nor a nontrivial lower bound on their probability. Otherwise, we would be able to recursively separate the theorems of PA from the negations of theorems of PA, by picking some recursive enumeration of assertions of the theory, and sorting sentences by whether they are first given a nontrivial lower bound or first given a nontrivial upper bound; theorems of PA will only be given a nontrivial lower bound, and their negations will only be given a nontrivial upper bound. [Thanks to Sam Eisenstat for pointing this out; I had somehow managed not to notice this on my own.]

For an explicit example of a sentence for which no nontrivial bounds on its probability can be established, use the diagonal lemma to construct a sentence \varphi which is provably equivalent to "for every proof of \left(\varphi,p\right) for any p>0, there is a proof of \left(\neg\varphi,q\right) for some q>0 with smaller Gödel number."

Thus a considerable amount of Knightian uncertainty is inevitable in this framework. Dogmatic Bayesians such as myself might find this unsatisfying, but I suspect that any attempt to unify probability and first-order arithmetic will suffer similar problems.

A side note on model theory and compactness

I'm a bit unnerved about the compactness theorem failing. It occurred to me that it might be possible to fix this by letting models use hyperreal probabilities. Problem is, the hyperreals aren't complete, so the countable additivity axiom for probability measures doesn't mean anything, and it's unclear what a hyperreal-valued probability measure is. One possible solution is to drop countable additivity, and allow finitely-additive hyperreal-valued probability measures, but I'm worried that the logic might not even be sound for such models.

A different possible solution to this is to take a countably complete ultrafilter U on a set \kappa, and use probabilities valued in the ultrapower \mathbb{R}^{\kappa}/U. Despite \mathbb{R}^{\kappa}/U not being Cauchy complete, it inherits a notion of convergence of sequences from \mathbb{R}, since a sequence \left\{ \left[x_{i,j}\mid i\in\kappa\right]\mid j\in\omega\right\} can be said to converge to \left[\lim_{j\rightarrow\infty}x_{i,j}\mid i\in\kappa\right], and this is well-defined (if \lim_{j\rightarrow\infty}x_{i,j} is for a U-large set of indices i) by countable completeness. Thus the countable additivity axiom makes sense for \mathbb{R}^{\kappa}/U-valued probability measures. Allowing models to use \mathbb{R}^{\kappa}/U-valued probability measures might make the compactness theorem work. [Edit: This doesn't work, because \mathbb{R}^{\kappa}/U\cong\mathbb{R}. To see this, it is enough to show that \mathbb{R}^{\kappa}/U is Archimedean, since \mathbb{R} has no proper Archimedean extensions. Given \left[x_i\mid i\in\kappa\right]\in\mathbb{R}^{\kappa}/U, let A_n:=\{i\in\kappa\mid| x_i|<n\} for n\in\mathbb{N}. \bigcup_{n\in\mathbb{N}}A_n = \kappa, so by countable completeness of U, there is some n\in\mathbb{N} such that A_n\in U, and thus \left[x_i\mid i\in\kappa\right]<n.]

Complexity classes of natural numbers (googology for ultrafinitists)

Ultrafinitists think common ways of defining extremely large numbers don't actually refer to numbers that exist. For example, most ultrafinitists would maintain that a googolplex isn't a number. But to a classical mathematician, while numbers like a googolplex are far larger than the numbers we deal with on a day-to-day basis like 10, both numbers have the same ontological status. In this post, I want to consider a compromise position, that any number we can define can be meaningfully reasoned about, but that a special status is afforded to the sorts of numbers that ultrafinitists can accept.

Specifically, define an “ultrafinite number” to be a natural number that it is physically possible to express in unary. This isn't very precise, since there are all sorts of things that “physically possible to express in unary” could mean, but let's just not worry about that too much. Also, many ultrafinitists would not insist that numbers must be expressible in such an austere language as unary, but I'm about to get to that.

Examples: 20 is an ultrafinite number, because 20 = SSSSSSSSSSSSSSSSSSSS0, where S is the successor function. 80,000 is also an ultrafinite number, but it is a large one, and it isn't worth demonstrating its ultrafiniteness. A googol is not ultrafinite. The observable universe isn't even big enough to contain a googol written in unary.

Now, define a “polynomially finite number” to be a natural number that it is physically possible to express using addition and multiplication. Binary and decimal are basically just concise ways of expressing certain sequences of addition and multiplication operations. For instance, “18,526” means (((1*10 + 8)*10 + 5)*10 + 2)*10 + 6. Conversely, if you multiply an n-digit number with an m-digit number, you get an at most n+m+1-digit number, which is the same number of symbols it took write down “[the n-digit number] times [the m-digit number]” in the first place, so any number that can be written using addition and multiplication can be written in decimal. Thus, another way to define polynomially finite numbers is as the numbers that it is physically possible to express in binary or in decimal. I've been ignoring some small constant factors that might make these definitions not quite equivalent, but any plausible candidate for a counterexample would be an ambiguous edge case according to each definition anyway, so I'm not worried about that. Many ultrafinitists may see something more like polynomially finite number, rather than ultrafinite number, as a good description of what numbers exist.

Examples: A googol is polynomially finite, because a googol is 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000. A googolplex is not polynomially finite, because it would require a googol digits to express in decimal, which is physically impossible.

Define an “elementarily finite number” to be a number that it is physically possible to express using addition, multiplication, subtraction, exponentiation, and the integer division function \lfloor x/y\rfloor. Elementarily finite is much broader than polynomially finite, so it might make sense to look at intermediate classes. Say a number is “exponentially finite” if it is physically possible to express using the above operations but without any nested exponentiation (e.g. a^b c^d is okay, but a^{(b^c)} is not). More generally, we can say that a number is “k-exponentially finite” if it can be expressed with exponentiation nested to depth at most k, so a polynomially finite number is a 0-exponentially finite number, an exponentially finite number is a 1-exponentially finite number, and an elementarily finite number is a number that is k-exponentially finite for some k (or equivalently, for some ultrafinite k).

Examples: a googolplex is exponentially finite, because it is 10^{10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}. Thus a googolduplex, meaning 10^\text{googolplex}, is 2-exponentially finite, but it is not exponentially finite. For examples of non-elementarily finite numbers, and numbers that are only k-exponentially finite for fairly large k, I'll use up-arrow notation. a\uparrow b just means a^b, a\uparrow^{n+1} b means a\uparrow^n a\uparrow^n a\uparrow^n ... a, where b is the number of copies of a, and using order of operations that starts on the right. So 3\uparrow\uparrow3 = 3^{(3^3)} = 3^{27} = 7,625,597,484,987, which is certainly polynomially finite, and could also be ultrafinite depending on what is meant by “physically possible” (a human cannot possibly count that high, but a computer with a large enough hard drive can store 3\uparrow\uparrow3 in unary). 3\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow(3\uparrow\uparrow3) = 3\text{^}3\text{^}3\text{^}...\text{^}3, where there are 3^{27} threes in that tower. Under the assumptions that imply 3^{27} is ultrafinite, 3\uparrow\uparrow\uparrow3 is elementarily finite. Specifically, it is 3^{27}-exponentially finite, but I'm pretty sure it's not 3^{26}-exponentially finite, or even 7,625,597,484,000-exponentially finite. 3\uparrow\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow\uparrow(3\uparrow\uparrow\uparrow3), and is certainly not elementarily finite.

Interestingly, even though a googolplex is exponentially finite, there are numbers less than a googolplex that are not. There's an easy nonconstructive proof of this: in order to be able to represent every number less than a googolplex in any encoding scheme at all, there has to be some number less than a googolplex that requires at least a googol decimal digits of information to express. But it is physically impossible to store a googol decimal digits of information. Therefore for any encoding scheme for numbers, there is some number less than a googolplex that cannot physically be expressed in it. This is why the definition of elementarily finite is significantly more complicated than the definition of polynomially finite; in the polynomial case, if n can be expressed using addition and multiplication and m<n, then m can also be expressed using addition and multiplication, so there's no need for additional operations to construct smaller numbers, but in the elementary case, the operations of subtraction and integer division are useful for expressing more numbers, and are simpler than exponentiation. For example, these let us express the number that you get from reading off the last googol digits, or the first googol digits, of 3\uparrow\uparrow100, so these numbers are elementarily finite. However, it is exceptionally unlikely that the number you get from reading off the first googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite. But for a difficult exercise, show that the number you get from reading off the last googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite.

Why stop there instead of including more operations for getting smaller numbers, like \lfloor\log\rfloor, which I implicitly used when I told you that the number formed by the first googol digits of 3\uparrow\uparrow100 is elementarily finite? We don't have to. The functions that you can get by composition from addition, multiplication, exponentiation, \max(x-y,0), and \lfloor x/y\rfloor coincide with the functions that can be computed in iterated exponential time (meaning O(2^{2^{...^{2^n}}}) time, for some height of that tower). So if you have any remotely close to efficient way to compute an operation, it can be expressed in terms of the operations I already specified.

We can go farther. Consider a programming language that has the basic arithmetic operations, if/else clauses, and loops, where the number of iterations of each loop must be fixed in advance. The programs that can be written in such a language are the primitive recursive functions. Say that a number is primitive recursively finite if it is physically possible to write a program (that does not take any input) in this language that outputs it. For each fixed n, the binary function (x,y)\mapsto x\uparrow^n y is primitive recursive, so 3\uparrow\uparrow\uparrow\uparrow3 is primitive recursively finite. But the ternary function (x,y,z)\mapsto x\uparrow^y z is not primitive recursive, so 3\uparrow^\text{googol}3 is not primitive recursively finite.

The primitive recursively finite numbers can be put in a hierarchy of subclasses based on the depth of nested loops that are needed to express them. If the only arithmetic operation available is the successor function (from which other operations can be defined using loops), then the elementarily finite numbers are those that can be expressed with loops nested to depth at most 2. The k-exponentially finite numbers should roughly correspond to the numbers that can be expressed with at most k loops at depth 2.

Next comes the provably computably finite numbers. Say that a number is provably computably finite if it is physically possible to write a program in a Turing-complete language that outputs the number (taking no input), together with a proof in Peano Arithmetic that the program halts. The famous Graham's number is provably computably finite. Graham's number is defined in terms of a function g, defined recursively as g(0):=4 and g(n+1):=3\uparrow^{g(n)}3. Graham's number is g(64). You could write a computer program to compute g, and prove that g is total using Peano arithmetic. By replacing Peano arithmetic with other formal systems, you can get other variations on the notion of provably computably finite.

For an example of a number that is not provably computably finite, I'll use the hydra game, which is described here. There is no proof in Peano arithmetic (that can physically be written down) that it is possible to win the hydra game starting from the complete binary tree of depth a googol. So the number of turns it takes to win the hydra game on the complete binary tree of depth a googol is not provably computably finite. If you start with a reasonably small hydra (say, with 100 nodes), you could write a program to search for the shortest winning strategy, and prove in Peano arithmetic that it succeeds in finding one, if you're sufficiently clever and determined, and you use a computer to help you search for proofs. The proof you'd get out of this endeavor would be profoundly unenlightening, but the point is, the number of turns it takes to win the hydra game for a small hydra is provably computably finite (but not primitive recursively finite, except in certain trivial special cases).

Next we'll drop the provability requirement, and say that a number is computably finite if it is physically possible to write a computer program that computes it from no input. Of course, in order to describe a computably finite number, you need the program you use to actually halt, so you'd need some argument that it does halt in order to establish that you're describing a computably finite number. Thus this is arguably just a variation on provably computably finite, where Peano arithmetic is replaced by some unspecified strong theory encompassing the sort of reasoning that classical mathematicians tend to endorse. This is probably the point where even the most patient of ultrafinitists would roll their eyes in disgust, but oh well. Anyway, the number of steps that it takes to win the hydra game starting from the complete binary tree of depth a googol is a computably finite number, because there exists a shortest winning strategy, and you can write a computer program to exhaustively search for it.

The busy-beaver function BB is defined so that BB(n) is the longest any Turing machine with n states runs before halting (among those that do halt). BB(10^{100}) is not computably finite, because Turing machines with a googol states cannot be explicitly described, and since the busy-beaver function is very fast-growing, no smaller Turing machine has comparable behavior. What about BB(10,000)? Turing machines with 10,000 states are not too big to describe explicitly, so it may be tempting to say that BB(10,000) is computably finite. But on the other hand, it is not possible to search through all Turing machines with 10,000 states and find the one that runs the longest before halting. No matter how hard you search and no matter how clever your heuristics for finding Turing machines that run for exceptionally long and then halt, it is vanishingly unlikely that you will find the 10,000-state Turing machine that runs longest before halting, let alone realize that you have found it. And the idea is to use classical reasoning for large numbers themselves, but constructive reasoning for descriptions of large numbers. So since it is pretty much impossible to actually write a program that outputs BB(10,000), it is not computably finite.

For a class that can handle busy-beaver numbers too, let's turn to the arithmetically finite numbers. These are the numbers that are defined by arithmetical formulas. These form a natural hierarchy, where the \Sigma^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n unbounded quantifiers starting with \exists, the \Pi^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n unbounded quantifiers starting with \forall, and the \Delta^0_n-finite numbers are those that are both \Sigma^0_n-finite and \Pi^0_n-finite. The \Delta^0_1-finite numbers are the same as the computably finite numbers. BB(10^{100}) is \Pi^0_2-finite, because it is defined by “\forall n every Turing machine with 10^{100} states that halts in at most n steps halts in at most BB(10^{100}) steps, and there is a Turing machine with 10^{100} states that halts in exactly BB(10^{100}) steps.” Everything after the first quantifier in that formula is computable. BB(BB(10^{100})) is \Delta^0_2-finite, but no lower than that. To get a number that is not arithmetically finite, consider the function f given by f(n) is the largest number defined by an arithmetical formula with n symbols. f(10,000) is \Delta^0_{5,000}-finite, but f(10^{100}) is not arithmetically finite. I'll stop there.

Principal Component Analysis in Theory and Practice

Prerequisites for this post are linear algebra, including tensors, and basic probability theory. Already knowing how PCA works will also be helpful. In section 1, I'll summarize the technique of principal component analysis (PCA), stubbornly doing so in a coordinate-free manner, partly because I am an asshole but mostly because it is rhetorically useful for emphasizing my point in section 2. In section 2, I'll gripe about how PCA is often used in ways that shouldn't be expected to work, but works just fine anyway. In section 3, I'll discuss some useless but potentially amusing ways that PCA could be modified. Thanks to Laurens Gunnarsen for inspiring this post by talking to me about the problem that I discuss in section 2.

A brief introduction to Principal Component Analysis

You start with a finite-dimensional real inner product space V and a probability distribution \mu on V. Actually, you probably just started with a large finite number of elements of V, and you've inferred a probability distribution that you're supposing they came from, but that difference is not important here. The goal is to find the n-dimensional (for some n\leq\dim V) affine subspace W_{n}\subseteq V minimizing the expected squared distance between a vector (distributed according to \mu) and its orthogonal projection onto W_{n}. We can assume without loss of generality that the mean of \mu is 0, because we can just shift any probability distribution by its mean and get a probability distribution with mean 0. This is useful because then W_{n} will be a linear subspace of V. In fact, we will solve this problem for all n\leq\dim V simultaneously by finding an ordered orthonormal basis such that W_{n} is the span of the first n basis elements.

First you take \text{Cov}_{\mu}\in V\otimes V, called the covariance of \mu, defined as the bilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\varphi\left(x\right)\psi\left(x\right)d\mu\left(x\right). From this, we get the covariance operator C_{\mu}\in V^{*}\otimes V by raising the first index, which means starting with \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in V^{*}\otimes V^{*}\otimes V\otimes V and performing a tensor contraction (in other words, C_{\mu} is obtained from \text{Cov}_{\mu} by applying the map V\rightarrow V^{*} given by the inner product to the first index). \text{Cov}_{\mu} is symmetric and positive semi-definite, so C_{\mu} is self-adjoint and positive semi-definite, and hence V has an orthonormal basis of eigenvectors of C_{\mu}, with non-negative real eigenvalues. This gives an orthonormal basis in which \text{Cov}_{\mu} is diagonal, where the diagonal entries are the eigenvalues. Ordering the eigenvectors in decreasing order of the corresponding eigenvalues gives us the desired ordered orthonormal basis.

The problem

There's no problem with principal component analysis as I described it above. It works just fine, and in fact is quite beautiful. But often people apply principal component analysis to probability distributions on finite-dimensional real vector spaces that don't have a natural inner product structure. There are two closely related problems with this: First, the goal is underdefined. We want to find a projection onto an n-dimensional subspace that minimizes the expected squared distance from a vector to its projection, but we don't have a measure of distance. Second, the procedure is underdefined. \text{Cov}_{\mu} is a bilinear form, not a linear operator, so it doesn't have eigenvectors or eigenvalues, and we don't have a way of raising an index to produce something that does. It should come as no surprise that these two problems arise together. After all, you shouldn't be able to find a fully specified solution to an underspecified problem.

People will apply principal component analysis in such cases by picking an inner product. This solves the second problem, since it allows you to carry out the procedure. But it does not solve the first problem. If you wanted to find a projection onto an n-dimensional subspace such that the distance from a vector to its projection tends to be small, then you must have already had some notion of distance in mind by which to judge success. Haphazardly picking an inner product gives you a new notion of distance, and then allows you to find an optimal solution with respect to your new notion of distance, and it is not clear to me why you should expect this solution to be reasonable with respect to the notion of distance that you actually care about.

In fact, it's worse than that. Of course, principal component analysis can't given you literally any ordered basis at all, but it is almost as bad. The thing that you use PCA for is the projection onto the span of the first n basis elements along the span of the rest. These projections only depend on the sequence of 1-dimensional subspaces spanned by the basis elements, and not the basis elements themselves. That is, we might as well only pay attention to the principal components up to scale, rather than making sure that are all unit length. Let a "coordinate system" refer to an ordered basis up to two ordered bases being equivalent if they differ only by scaling the basis vectors, so that we're paying attention to the coordinate systems given to us by PCA. If the covariance of \mu is nondegenerate, then the set of coordinate systems that can be obtained from principal component analysis by a suitable choice of inner product is dense in the space of coordinate systems. More generally, where U is the smallest subspace of V such that \mu\left(U\right)=1, then the space of coordinate systems that you can get from principal component analysis is dense in the space of all coordinate systems whose first \dim U coordinates span U (\dim U will be the rank of the covariance of \mu). So in a sense, for suitably poor choices of inner product, principal component analysis can give you arbitrarily terrible results, subject only to the weak constraint that it will always notice if all of the vectors in your sample belong to a common subspace.

It is thus somewhat mysterious that machine learning people seem to be able to often get good results from principal component analysis apparently without being very careful about the inner product they choose. Vector spaces that arise in machine learning seem to almost always come with a set of preferred coordinate axes, so these axes are taken to be orthogonal, leaving only the question of how to scale them relative to each other. If these axes are all labeled with the same units, then this also gives you a way of scaling them relative to each other, and hence an inner product. If they are aren't, then I'm under the impression that the most popular method is to normalize them such that the pushforward of \mu along each coordinate axis has the same variance. This is unsatisfying, since figuring out which axes \mu has enough variance along to be worth paying attention to seems like the sort of thing that you would want principal component analysis to be able to tell you. Normalizing the axes in this way seems to me like an admission that you don't know exactly what question you're hoping to use principal component analysis to answer, so you just tell it not to answer that part of the question to minimize the risk of asking it to answer the wrong question, and let it focus on telling you how the axes, which you're pretty sure should be considered orthogonal, correlate with each other.

That conservatism is actually pretty understandable, because figuring out how to ask the right question seems hard. You implicitly have some metric d on V such that you want to find a projection \pi onto an n-dimensional subspace such that d\left(x,\pi\left(x\right)\right) is usually small when x is distributed according to \mu. This metric is probably very difficult to describe explicitly, and might not be the metric induced by any inner product (for that matter, it might not even be a metric; d\left(x,y\right) could be any way of quantifying how bad it is to be told the value y when the correct value you wanted to know is x). Even if you somehow manage to explicitly describe your metric, coming up with a version of PCA with the inner product replaced with an arbitrary metric also sounds hard, so the next thing you would want to do is fit an inner product to the metric.
The usual approach is essentially to skip the step of attempting to explicitly describe the metric, and just find an inner product that roughly approximates your implicit metric based on some rough heuristics about what the implicit metric probably looks like. The fact that these heuristics usually work so well seems to indicate that the implicit metric tends to be fairly tame with respect to ways of describing the data that we find most natural. Perhaps this shouldn't be too surprising, but I still feel like this explanation does not make it obvious a priori that this should work so well in practice. It might be interesting to look into why these heuristics work as well as they do with more precision, and how to go about fitting a better inner product to implicit metrics. Perhaps this has been done, and I just haven't found it.

To take a concrete example, consider eigenfaces, the principal components that you get from a set of images of people's faces. Here, you start with the coordinates in which each coordinate axis represents a pixel in the image, and the value of that coordinate is the brightness of the corresponding pixel. By declaring that the coordinate axes are orthogonal, and measuring the brightness of each pixel on the same scale, we get our inner product, which is arguably a fairly natural one.

Presumably, the implicit metric we're using here is visual distance, by which I mean a measure of how similar two images look. It seems clear to me that visual distance is not very well approximated by our inner product, and in fact, there is no norm such that the visual distance between two images is approximately the norm of their difference. To see this, if you take an image and make it brighter, you haven't changed how it looks very much, so the visual distance between the image and its brighter version is small. But their difference is just a dimmer version of the same image, and if you add that difference to a completely different image, you will get the two images superimposed on top of each other, a fairly radical change. Thus the visual distance traversed by adding a vector depends on where you start from.

Despite this, producing eigenfaces by using PCA on images of faces, using the inner product described above, performs well with respect to visual distance, in the sense that you can project the images onto a relatively small number of principal components and leave them still recognizable. I think this can be explained on an intuitive level. In a human eye, each photoreceptor has a narrow receptive field that it detects light in, much like a pixel, so the representation of an image in the eye as patterns of photoreceptor activity is very similar to the representation of an image in a computer as a vector of pixel brightnesses, and the inner product metric is a reasonable measure of distance in this representation. When the visual cortex processes this information from the eye, it is difficult (and perhaps also not useful) for it to make vast distinctions between images that are close to each other according to the inner product metric, and thus result in similar patterns of photoreceptor activity in the eye. Thus the visual distance between two images cannot be too much greater than their inner product distance, and hence changing an image by a small amount according to the inner product metric can only change it by a small amount according to visual distance, even though the reverse is not true.


The serious part of this post is now over. Let's have some fun. Some of the following ways of modifying principal component analysis could be combined, but I'll consider them one at a time for simplicity.

As hinted at above, you could start with an arbitrary metric on V rather than an inner product, and try to find the rank-n projection (for some n\leq\dim V) that minimizes the expected squared distance from a vector to its projection. This would probably be difficult, messy, and not that much like principal component analysis. If it can be done, it would be useful in practice if we were much better at fitting explicit metrics to our implicit metrics than at fitting inner products to our implicit metrics, but I'm under the impression that this is not currently the case. This also differs from the other proposals in this section in that it is a modification of the problem looking for a solution, rather than a modification of the solution looking for a problem.

V could be a real Hilbert space that is not necessarily finite-dimensional. Here we can run into the problem that C_{\mu} might not even have any eigenvectors. However, if \mu (which hopefully was not inferred from a finite sample) is Gaussian (and possibly also under weaker conditions), then C_{\mu} is a compact operator, so V does have an orthonormal basis of eigenvectors of C_{\mu}, which still have non-negative eigenvalues. There probably aren't any guarantees you can get about the order-type of this orthonormal basis when you order the eigenvectors in decreasing order of their eigenvalues, and there probably isn't a sense in which the orthogonal projection onto the closure of the span of an initial segment of the basis accounts for the most variance of any closed subspace of the same "size" ("size" would have to refer to a refinement of the notion of dimension for this to be the case). However, a weaker statement is probably still true: namely that each orthonormal basis element maximizes the variance that it accounts for conditioned on values along the previous orthonormal basis elements. I guess considering infinite-dimensional vector spaces goes against the spirit of machine learning though.

V could be a finite-dimensional complex inner product space. \text{Cov}_{\mu}\in\overline{V}\otimes V would be the sesquilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\overline{\varphi\left(x\right)}\psi\left(x\right)d\mu\left(x\right). \left\langle \cdot,\cdot\right\rangle \in\overline{V}^{*}\otimes V^{*}, so \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in\overline{V}^{*}\otimes V^{*}\otimes\overline{V}\otimes V, and applying a tensor contraction to the conjugated indices gives us our covariance operator C_{\mu}\in V^{*}\otimes V (in other words, the inner product gives us an isomorphism \overline{V}\rightarrow V^{*}, and applying this to the first index of \text{Cov}_{\mu} gives us C_{\mu}\in V^{*}\otimes V). C_{\mu} is still self-adjoint and positive semi-definite, so V still has an orthonormal basis of eigenvectors with non-negative real eigenvalues, and we can order the basis in decreasing order of the eigenvalues. Analogously to the real case, projecting onto the span of the first n basis vectors along the span of the rest is the complex rank-n projection that minimizes the expected squared distance from a vector to its projection. As far as I know, machine learning tends to deal with real data, but if you have complex data and for some reason you want to project onto a lower-dimensional complex subspace without losing too much information, now you know what to do.

Suppose your sample consists of events, where you've labeled them with both their spatial location and the time at which they occurred. In this case, events are represented as points in Minkowski space, a four-dimensional vector space representing flat spacetime, which is equipped with a nondegenerate symmetric bilinear form called the Minkowski inner product, even though it is not an inner product because it is not positive-definite. Instead, the Minkowski inner product is such that \left\langle x,x\right\rangle is positive if x is a space-like vector, negative if x is time-like, and zero if x is light-like. We can still get C_{\mu}\in V^{*}\otimes V out of \text{Cov}_{\mu}\in V\otimes V and the Minkowski inner product in V^{*}\otimes V^{*} in the same way, and V has a basis of eigenvectors of C_{\mu}, and we can still order the basis in decreasing order of their eigenvalues. The first 3 eigenvectors will be space-like, with non-negative eigenvalues, and the last eigenvector will be time-like, with a non-positive eigenvalue. The eigenvectors are still orthogonal. Thus principal component analysis provides us with a reference frame in which the span of the first 3 eigenvectors is simultaneous, and the span of the last eigenvector is motionless. If \mu is Gaussian, then this will be the reference frame in which the spatial position of an event and the time at which it occurs are mean independent of each other, meaning that conditioning on one of them doesn't change the expected value of the other one. For general \mu, there might not be a reference frame in which the space and time of an event are mean independent, but the reference frame given to you by by principal component analysis is still the unique reference frame with the property that the time coordinate is uncorrelated with any spatial coordinate.

More generally, we could consider V equipped with any symmetric bilinear form \left\langle \cdot,\cdot\right\rangle taking the role of the inner product. Without loss of generality, we can consider only nondegenerate symmetric bilinear forms, because in the general case, where D:=\left\{ x\mid\forall y\,\left\langle x,y\right\rangle =0\right\} , applying principal component analysis with \left\langle \cdot,\cdot\right\rangle is equivalent to projecting the data onto V/D, applying principal component analysis there with the nondegenerate symmetric bilinear form on V/D induced by \left\langle \cdot,\cdot\right\rangle , and then lifting back to V and throwing in a basis for D with eigenvalues 0 at the end, essentially treating D as the space of completely irrelevant distinctions between data points that we intend to immediately forget about. Anyway, nondegenerate symmetric bilinear forms are classified up to isomorphism by their signature \left(n,m\right), which is such that any orthogonal basis contains exactly n+m basis elements, n of which are space-like and m of which are time-like, using the convention that x is space-like if \left\langle x,x\right\rangle >0, time-like if \left\langle x,x\right\rangle <0, and light-like if \left\langle x,x\right\rangle =0, as above. Using principal component analysis on probability distributions over points in spacetime (or rather, points in the tangent space to spacetime at a point, so that it is a vector space) in a universe with n spatial dimensions and m temporal dimensions still gives you a reference frame in which the span of the first n basis vectors is simultaneous and the span of the last m basis vectors is motionless, and this is again the unique reference frame in which each time coordinate is uncorrelated with each spatial coordinate. Incidentally, I've heard that much of physics still works with multiple temporal dimensions. I don't know what that would mean, except that I think it means there's something wrong with my intuitive understanding of time. But that's another story. Anyway, the spaces spanned by the first n and by the last m basis vectors could be used to establish a reference frame, and then the data might be projected onto the first few (at most n) and last few (at most m) coordinates to approximate the positions of the events in space and in time, respectively, in that reference frame.

Ordered algebraic geometry

Edit: Shortly after posting this, I found where the machinery I develop here was discussed in the literature. Real Algebraic Geometry by Bochnak, Coste, and Roy covers at least most of this material. I may eventually edit this to clean it up and adopt more standard notation, but don't hold your breath.


In algebraic geometry, an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some finite set of polynomials. Since all ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] are finitely generated, this is equivalent to saying that an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some arbitrary set of polynomials.

In semialgebraic geometry, a closed semialgebraic set is a subset of \mathbb{R}^{n} of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  for some finite set of polynomials F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Unlike in the case of affine algebraic sets, if F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is an arbitrary set of polynomials, \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  is not necessarily a closed semialgebraic set. As a result of this, the collection of closed semialgebraic sets are not the closed sets of a topology on \mathbb{R}^{n}. In the topology on \mathbb{R}^{n} generated by closed semialgebraic sets being closed, the closed sets are the sets of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  for arbitrary F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Semialgebraic geometry usually restricts itself to the study of semialgebraic sets, but here I wish to consider all the closed sets of this topology. Notice that closed semialgebraic sets are also closed in the standard topology, so the standard topology is a refinement of this one. Notice also that the open ball B_{r}\left(\bar{p}\right) of radius r centered at \bar{p} is the complement of the closed semialgebraic set \left\{ \bar{x}\in\mathbb{R}^{n}\mid\left|\bar{x}-\bar{p}\right|^{2}-r^{2}\geq0\right\} , and these open balls are a basis for the standard topology, so this topology is a refinement of the standard one. Thus, the topology I have defined is exactly the standard topology on \mathbb{R}^{n}.

In algebra, instead of referring to a set of polynomials, it is often nicer to talk about the ideal generated by that set instead. What is the analog of an ideal in ordered algebra? It's this thing:

Definition: If A is a partially ordered commutative ring, a cone C in A is a subsemiring of A which contains all positive elements, and such that C\cap-C is an ideal of A. By "subsemiring", I mean a subset that contains 0 and 1, and is closed under addition and multiplication (but not necessarily negation). If F\subseteq A, the cone generated by F, denoted \left\langle F\right\rangle , is the smallest cone containing F. Given a cone C, the ideal C\cap-C will be called the interior ideal of C, and denoted C^{\circ}.

\mathbb{R}\left[x_{1},...,x_{n}\right] is partially ordered by f\geq g\iff f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in\mathbb{R}^{n}. If F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a set of polynomials and \bar{x}\in\mathbb{R}^{n}, then f\left(\bar{x}\right)\geq0\,\forall f\in F\iff f\left(\bar{x}\right)\geq0\,\forall f\in\left\langle F\right\rangle . Thus I can consider closed sets to be defined by cones. We now have a Galois connection between cones of \mathbb{R}\left[x_{1},...,x_{n}\right] and subsets of \mathbb{R}^{n}, given by, for a cone C, its positive-set is P_{\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\}  (I'm calling it the "positive-set" even though it is where the polynomials are all non-negative, because "non-negative-set" is kind of a mouthful), and for X\subseteq\mathbb{R}^{n}, its cone is C_{\mathbb{R}}\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\bar{x}\right)\geq0\,\forall\bar{x}\in X\right\} P_{\mathbb{R}}\circ C_{\mathbb{R}} is closure in the standard topology on \mathbb{R}^{n} (the analog in algebraic geometry is closure in the Zariski topology on \mathbb{C}^{n}). A closed set X is semialgebraic if and only if it is the positive-set of a finitely-generated cone.

Quotients by cones, and coordinate rings

An affine algebraic set V is associated with its coordinate ring \mathbb{C}\left[V\right]:=\mathbb{C}\left[x_{1},...,x_{n}\right]/I\left(V\right). We can do something analogous for closed subsets of \mathbb{R}^{n}.

Definition: If A is a partially ordered commutative ring and C\subseteq A is a cone, A/C is the ring A/C^{\circ}, equipped with the partial order given by f+C^{\circ}\geq g+C^{\circ} if and only if f-g\in C, for f,g\in A.

Definition: If X\subseteq\mathbb{R}^{n} is closed, the coordinate ring of X is \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C\left(X\right). This is the ring of functions X\rightarrow\mathbb{R} that are restrictions of polynomials, ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. For arbitrary X\subseteq\mathbb{R}^{n}, the ring of regular functions on X, denoted \mathcal{O}\left(X\right), consists of functions on X that are locally ratios of polynomials, again ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. Assigning its ring of regular functions to each open subset of X endows X with a sheaf of partially ordered commutative rings.

For closed X\subseteq\mathbb{R}^{n}, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right), and this inclusion is generally proper, both because it is possible to divide by polynomials that do not have roots in X, and because X may be disconnected, making it possible to have functions given by different polynomials on different connected components.


What is C_{\mathbb{R}}\circ P_{\mathbb{R}}? The Nullstellensatz says that its analog in algebraic geometry is the radical of an ideal. As such, we could say that the radical of a cone C, denoted \text{Rad}_{\mathbb{R}}\left(C\right), is C_{\mathbb{R}}\left(P_{\mathbb{R}}\left(C\right)\right), and that a cone C is radical if C=\text{Rad}_{\mathbb{R}}\left(C\right). In algebraic geometry, the Nullstellensatz shows that a notion of radical ideal defined without reference to algebraic sets in fact characterizes the ideals which are closed in the corresponding Galois connection. It would be nice to have a description of the radical of a cone that does not refer to the Galois connection. There is a semialgebraic analog of the Nullstellensatz, but it does not quite characterize radical cones.

Positivstellensatz 1: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C.

There are two ways in which this is unsatisfactory: first, it applies only to finitely-generated cones, and second, it tells us exactly which polynomials are strictly positive everywhere on a closed semialgebraic set, whereas we want to know which polynomials are non-negative everywhere on a set.

The second problem is easier to handle: a polynomial p is non-negative everywhere on a set S if and only if there is a decreasing sequence of polynomials \left(p_{i}\mid i\in\mathbb{N}\right) converging to p such that each p_{i} is strictly positive everywhere on S. Thus, to find \text{Rad}_{\mathbb{R}}\left(C\right), it is enough to first find all the polynomials that are strictly positive everywhere on P_{\mathbb{R}}\left(C\right), and then take the closure under lower limits. Thus we have a characterization of radicals of finitely-generated cones.

Positivstellensatz 2: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone, \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\} , where the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X.

This still doesn't even tell us what's going on for cones which are not finitely-generated. However, we can generalize the Positivstellensatz to some other cones.

Positivstellensatz 3: Let C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] be a cone containing a finitely-generated subcone D\subseteq C such that P_{\mathbb{R}}\left(D\right) is compact. If p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C. As before, it follows that \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\}.

proof: For a given p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap P_{\mathbb{R}}\left(C\right)=\left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap\bigcap\left\{ P_{\mathbb{R}}\left(\left\langle f\right\rangle \right)\mid f\in C\right\} , an intersection of closed sets contained in the compact set P_{\mathbb{R}}\left(D\right), which is thus empty if and only if some finite subcollection of them has empty intersection within P_{\mathbb{R}}\left(D\right). Thus if p is strictly positive everywhere on P_{\mathbb{R}}\left(C\right), then there is some finitely generated subcone E\subseteq C such that p is strictly positive everywhere on P_{\mathbb{R}}\left(E\right)\cap P_{\mathbb{R}}\left(D\right)=P_{\mathbb{R}}\left(\left\langle E\cup D\right\rangle \right), and \left\langle E\cup D\right\rangle is finitely-generated, so by Positivstellensatz 1, there is f\in\left\langle E\cup D\right\rangle \subseteq C such that pf-1\in\left\langle E\cup D\right\rangle \subseteq C\square

For cones that are not finitely-generated and do not contain any finitely-generated subcones with compact positive-sets, the Positivstellensatz will usually fail. Thus, it seems likely that if there is a satisfactory general definition of radical for cones in arbitrary partially ordered commutative rings that agrees with this one in \mathbb{R}\left[x_{1},...,x_{n}\right], then there is also an abstract notion of "having a compact positive-set" for such cones, even though they don't even have positive-sets associated with them.

Beyond \mathbb{R}^{n}

An example of cone for which the Positivstellensatz fails is C_{\infty}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x\in\mathbb{R}\,\forall y\geq x\, f\left(y\right)\geq0\right\} , the cone of polynomials that are non-negative on sufficiently large inputs (equivalently, the cone of polynomials that are either 0 or have positive leading coefficient). P_{\mathbb{R}}\left(C\right)=\emptyset, and -1 is strictly positive on \emptyset, but for f\in C_{\infty}-f-1\notin C_{\infty}.

However, it doesn't really look C_{\infty} is trying to point to the empty set; instead, C_{\infty} is trying to describe the set of all infinitely large reals, which only looks like the empty set because there are no infinitely large reals. Similar phenomena can occur even for cones that do contain finitely-generated subcones with compact positive-sets. For example, let C_{\varepsilon}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x>0\,\forall y\in\left[0,x\right]\, f\left(y\right)\geq0\right\} P_{\mathbb{R}}\left(C_{\varepsilon}\right)=\left\{ 0\right\} , but C_{\varepsilon} is trying to point out the set containing 0 and all positive infinitesimals. Since \mathbb{R} has no infinitesimals, this looks like \left\{ 0\right\} .

To formalize this intuition, we can change the Galois connection. We could say that for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]P_{\text{*}\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\left(\text{*}\mathbb{R}\right)^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\} , where \text{*}\mathbb{R} is the field of hyperreals. All you really need to know about \text{*}\mathbb{R} is that it is a big ordered field extension of \mathbb{R}. P_{\text{*}\mathbb{R}}\left(C_{\infty}\right) is the set of hyperreals that are bigger than any real number, and P_{\text{*}\mathbb{R}}\left(C_{\varepsilon}\right) is the set of hyperreals that are non-negative and smaller than any positive real. The cone of a subset X\subseteq\left(\text{*}\mathbb{R}\right)^{n}, denoted C_{\text{*}\mathbb{R}}\left(X\right) will be defined as before, still consisting only of polynomials with real coefficients. This defines a topology on \left(\text{*}\mathbb{R}\right)^{n} by saying that the closed sets are the fixed points of P_{\text{*}\mathbb{R}}\circ C_{\text{*}\mathbb{R}}. This topology is not T_{0} because, for example, there are many hyperreals that are larger than all reals, and they cannot be distinguished by polynomials with real coefficients. There is no use keeping track of the difference between points that are in the same closed sets. If you have a topology that is not T_{0}, you can make it T_{0} by identifying any pair of points that have the same closure. If we do this to \left(\text{*}\mathbb{R}\right)^{n} , we get what I'm calling ordered affine n-space over \mathbb{R}.

Definition: An n-type over \mathbb{R} is a set \Phi of inequalities, consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},..,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0, such that there is some totally ordered field extension \mathcal{R}\supseteq\mathbb{R} and \bar{x}\in\mathcal{R}^{n} such that all inequalities in \Phi are true about \bar{x}. \Phi is called the type of \bar{x}. Ordered affine n-space over \mathbb{R}, denoted \mathbb{OA}_{\mathbb{R}}^{n} is the set of n-types over \mathbb{R}.

Compactness Theorem: Let \Phi be a set of inequalities consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},..,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0. Then \Phi is an n-type if and only if for any finite subset \Delta\subseteq\Phi, there is \bar{x}\in\mathbb{R} such that all inequalities in \Delta are true about \bar{x}.

proof: Follows from the compactness theorem of first-order logic and the fact that ordered field extensions of \mathbb{R} embed into elementary extensions of \mathbb{R}. The theorem is not obvious if you do not know what those mean. \square

An n-type represents an n-tuple of elements of an ordered field extension of \mathbb{R}, up to the equivalence relation that identifies two such tuples that relate to \mathbb{R} by polynomials in the same way. One way that a tuple of elements of an extension of \mathbb{R} can relate to elements of \mathbb{R} is to equal a tuple of elements of \mathbb{R}, so there is a natural inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n} that associates an n-tuple of reals with the set of polynomial inequalities that are true at that n-tuple.

A tuple of polynomials \left(f_{1},...,f_{m}\right)\in\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right)^{m} describes a function f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}, which extends naturally to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{m} by f\left(\Phi\right) is the type of \left(f_{1}\left(\bar{x}\right),...,f_{m}\left(\bar{x}\right)\right), where \bar{x} is an n-tuple of elements of type \Phi in an extension of \mathbb{R}. In particular, a polynomial f\in\mathbb{R}\left[x_{1},...,x_{n}\right] extends to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{1}, and \mathbb{OA}_{\mathbb{R}}^{1} is totally ordered by \Phi\geq\Psi if and only if x\geq y, where x and y are elements of type \Phi and \Psi, respectively, in an extension of \mathbb{R}f\left(\Phi\right)\geq0 if and only if \text{, so we can talk about inequalities satisfied by types in place of talking about inequalities contained in types.

I will now change the Galois connection that we are talking about yet again (last time, I promise). It will now be a Galois connection between the set of cones in \mathbb{R}\left[x_{1},...,x_{n}\right] and the set of subsets of \mathbb{OA}_{\mathbb{R}}^{n}. For a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right], P\left(C\right):=\left\{ \Phi\in\mathbb{OA}_{\mathbb{R}}^{n}\mid f\left(\Phi\right)\geq0\,\forall f\in C\right\} . For a set X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, C\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\,\forall\Phi\in X\right\} . Again, this defines a topology on \mathbb{OA}_{\mathbb{R}}^{n} by saying that fixed points of P\circ C are closed. \mathbb{OA}_{\mathbb{R}}^{n} is T_{0}; in fact, it is the T_{0} topological space obtained from \left(\text{*}\mathbb{R}\right)^{n} by identifying points with the same closure as mentioned earlier. \mathbb{OA}_{\mathbb{R}}^{n} is also compact, as can be seen from the compactness theorem. \mathbb{OA}_{\mathbb{R}}^{n} is not T_{1} (unless n=0). Note that model theorists have their own topology on \mathbb{OA}_{\mathbb{R}}^{n}, which is distinct from the one I use here, and is a refinement of it.

The new Galois connection is compatible with the old one via the inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, in the sense that if X\subseteq\mathbb{R}^{n}, then C_{\mathbb{R}}\left(X\right)=C\left(X\right) (where we identify X with its image in \mathbb{OA}_{\mathbb{R}}^{n}), and for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]P_{\mathbb{R}}=P\left(C\right)\cap\mathbb{R}^{n}.

Like our intermediate Galois connection \left(P_{\text{*}\mathbb{R}},C_{\text{*}\mathbb{R}}\right), our final Galois connection \left(P,C\right) succeeds in distinguishing P\left(C_{\infty}\right) and P\left(C_{\varepsilon}\right) from \emptyset and \left\{ 0\right\} , respectively, in the desirable manner. P\left(C_{\infty}\right) consists of the type of numbers larger than any real, and P\left(C_{\varepsilon}\right) consists of the types of 0 and of positive numbers smaller than any positive real.

Just like for subsets of \mathbb{R}^{n}, a closed subset X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a coordinate ring \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C\left(X\right), and an arbitrary X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a ring of regular functions \mathcal{O}\left(X\right) consisting of functions on X that are locally ratios of polynomials, ordered by f\geq0 if and only if \forall\Phi\in X, where f=\frac{p}{q} is a representation of f as a ratio of polynomials in a neighborhood of \Phi, either p\left(\Phi\right)\geq0 and q\left(\Phi\right)>0, or p\left(\Phi\right)\leq0 and q\left(\Phi\right)<0, and f\geq g if and only if f-g\geq0. As before, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right) for closed X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} is analogous to \mathbb{A}_{\mathbb{C}}^{n} from algebraic geometry because if, in the above definitions, you replace "\geq" and "<" with "=" and "\neq", replace totally ordered field extensions with field extensions, and replace cones with ideals, then you recover a description of \mathbb{A}_{\mathbb{C}}^{n}, in the sense of \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right).

What about an analog of projective space? Since we're paying attention to order, we should look at spheres, not real projective space. The n-sphere over \mathbb{R}, denoted \mathbb{S}_{\mathbb{R}}^{n}, can be described as the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{\mathbb{R}}^{n}.

For any totally ordered field k, we can define \mathbb{OA}_{k}^{n} similarly to \mathbb{OA}_{\mathbb{R}}^{n}, as the space of n-types over k, defined as above, replacing \mathbb{R} with k (although a model theorist would no longer call it the space of n-types over k). The compactness theorem is not true for arbitrary k, but its corollary that \mathbb{OA}_{k}^{n} is compact still is true.

Visualizing \mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n}

\mathbb{S}_{\mathbb{R}}^{n} should be thought of as the n-sphere with infinitesimals in all directions around each point. Specifically, \mathbb{S}_{\mathbb{R}}^{0} is just \mathbb{S}^{0}, a pair of points. The closed points of \mathbb{S}_{\mathbb{R}}^{n+1} are the points of \mathbb{S}^{n+1}, and for each closed point p, there is an n-sphere of infinitesimals around p, meaning a copy of \mathbb{S}_{\mathbb{R}}^{n}, each point of which has p in its closure.

\mathbb{OA}_{\mathbb{R}}^{n} should be thought of as n-space with infinitesimals in all directions around each point, and infinities in all directions. Specifically, \mathbb{OA}_{\mathbb{R}}^{n} contains \mathbb{R}^{n}, and for each point p\in\mathbb{R}^{n}, there is an n-1-sphere of infinitesimals around p, and there is also a copy of \mathbb{S}_{\mathbb{R}}^{n-1} around the whole thing, the closed points of which are limits of rays in \mathbb{R}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n} relate to each other the same way that \mathbb{R}^{n} and \mathbb{S}^{n} do. If you remove a closed point from \mathbb{S}_{\mathbb{R}}^{n}, you get \mathbb{OA}_{\mathbb{R}}^{n}, where the sphere of infinitesimals around the removed closed point becomes the sphere of infinities of \mathbb{OA}_{\mathbb{R}}^{n}.

More generally, if k is a totally ordered field, let k^{r} be its real closure. \mathbb{OA}_{k}^{n} consists of the Cauchy completion of \left(k^{r}\right)^{n} (as a metric space with distances valued in k^{r}), and for each point p\in\left(k^{r}\right)^{n} (though not for points that are limits of Cauchy sequences that do not converge in \left(k^{r}\right)^{n}), an n-1-sphere \mathbb{S}_{k}^{n-1} of infinitesimals around p, and an n-1-sphere \mathbb{S}_{k}^{n-1} around the whole thing, where \mathbb{S}_{k}^{n} is the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{k}^{n}. \mathbb{OA} does not distinguish between fields with the same real closure.

More Positivstellensätze

This Galois connection gives us a new notion of what it means for a cone to be radical, which is distinct from the old one and is better, so I will define \text{Rad}\left(C\right) to be C\left(P\left(C\right)\right). A cone C will be called radical if C=\text{Rad}\left(C\right). Again, it would be nice to be able to characterize radical cones without referring to the Galois connection. And this time, I can do it. Note that since \mathbb{OA}_{\mathbb{R}}^{n} is compact, the proof of Positivstellensatz 3 shows that in our new context, the Positivstellensatz holds for all cones, since even the subcone generated by \emptyset has a compact positive-set.

Positivstellensatz 4: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\Phi\right)>0\,\forall\Phi\in P\left(C\right) if and only if \exists f\in C such that pf-1\in C.

However, we can no longer add in lower limits of sequences of polynomials. For example, -x+\varepsilon\in C_{\varepsilon} for all real \varepsilon>0, but -x\notin C_{\varepsilon}, even though C_{\varepsilon} is radical. This happens because, where \Sigma is the type of positive infinitesimals, -\Sigma+\varepsilon>0 for real \varepsilon>0, but -\Sigma<0. However, we can add in lower limits of sequences contained in finitely-generated subcones, and this is all we need to add, so this characterizes radical cones.

Positivstellensatz 5: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone, \text{Rad}\left(C\right) is the union over all finitely-generated subcones D\subseteq C of the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\}  (again the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X).

Proof: Suppose D\subseteq C is a subcone generated by a finite set \left\{ f_{1},...,f_{m}\right\} , and q is the infimum of a chain \left\{ q_{\alpha}\right\} _{\alpha\in A}\subseteq\left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\} . For any \bar{x}\in\mathbb{R}^{n}, if f_{i}\left(\bar{x}\right)\geq0 for each i, then q_{\alpha}\left(\bar{x}\right)>0 for each \alpha, and hence q\left(\bar{x}\right)\geq0. That is, the finite set of inequalities \left\{ f_{i}\left(\bar{x}\right)\geq0\mid1\leq i\leq m\right\} \cup\left\{ q\left(\bar{x}\right)<0\right\}  does not hold anywhere in \mathbb{R}^{n}. By the compactness theorem, there are no n-types satisfying all those inequalities. Given \Phi\in P\left(C\right)f_{i}\left(\Phi\right)\geq0, so q\left(\Phi\right)\nless0; that is, q\left(\Phi\right)\geq0.

Conversely, suppose q\in\text{Rad}\left(C\right). Then by the compactness theorem, there are some f_{1},...,f_{m}\in C such that q\in\text{Rad}\left(\left\langle f_{1},...,f_{m}\right\rangle \right). Then \forall\varepsilon>0, q+\varepsilon is strictly positive on P\left(\left\langle f_{1},...,f_{m}\right\rangle \right), and hence by Positivstellensatz 4, \exists f\in\left\langle f_{1},...,f_{m}\right\rangle  such that pf-1\in\left\langle f_{1},...,f_{m}\right\rangle . That is, \left\{ q+\varepsilon\mid\varepsilon>0\right\} is a chain contained in \left\langle f_{1},...,f_{m}\right\rangle , a finitely-generated subcone of C, whose infimum is q. \square

Ordered commutative algebra

Even though they are technically not isomorphic, \mathbb{C}^{n} and \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) are closely related, and can often be used interchangeably. Of the two, \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) is of a form that can be more easily generalized to more abstruse situations in algebraic geometry, which may indicate that it is the better thing to talk about, whereas \mathbb{C}^{n} is merely the simpler thing that is easier to think about and just as good in practice in many contexts. In contrast, \mathbb{R}^{n} and \mathbb{OA}_{\mathbb{R}}^{n} are different in important ways. The situation in algebraic geometry provides further reason to pay more attention to \mathbb{OA}_{\mathbb{R}}^{n} than to \mathbb{R}^{n}.

The next thing to look for would be an analog of the spectrum of a ring for a partially ordered commutative ring (I will henceforth abbreviate "partially ordered commutative ring" as "ordered ring" in order to cut down on the profusion of adjectives) in a way that makes use of the order, and gives us \mathbb{OA}_{\mathbb{R}}^{n} when applied to \mathbb{R}\left[x_{1},...,x_{n}\right]. I will call it the order spectrum of an ordered ring A, denoted \text{OrdSpec}\left(A\right). Then of course \mathbb{OA}_{A}^{n} can be defined as \text{OrdSpec}\left(A\left[x_{1},...,x_{n}\right]\right)\text{OrdSpec}\left(A\right) should be, of course, the set of prime cones. But what even is a prime cone?

Definition: A cone \mathfrak{p}\subseteq A is prime if A/\mathfrak{p} is a totally ordered integral domain.

Definition: \text{OrdSpec}\left(A\right) is the set of prime cones in A, equipped with the topology whose closed sets are the sets of prime cones containing a given cone.

An n-type \Phi\in\mathbb{OA}_{\mathbb{R}}^{n} can be seen as a cone, by identifying it with \left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\right\} , aka C\left(\left\{ \Phi\right\} \right). Under this identification, \mathbb{OA}_{\mathbb{R}}^{n}=\text{OrdSpec}\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right), as desired. The prime cones in \mathbb{R}\left[x_{1},...,x_{n}\right] are also the radical cones C such that P\left(C\right) is irreducible. Notice that irreducible subsets of \mathbb{OA}_{\mathbb{R}}^{n} are much smaller than irreducible subsets of \mathbb{A}_{\mathbb{C}}^{n}; in particular, none of them contain more than one element of \mathbb{R}^{n}.

There is also a natural notion of maximal cone.

Definition: A cone \mathfrak{m}\subseteq A is maximal if \mathfrak{m}\neq A and there are no strictly intermediate cones between \mathfrak{m} and A. Equivalently, if \mathfrak{m} is prime and closed in \text{OrdSpec}\left(A\right).

Maximal ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] correspond to elements of \mathbb{C}^{n}. And the cones of elements of \mathbb{R}^{n} are maximal cones in \mathbb{R}\left[x_{1},...,x_{n}\right], but unlike in the complex case, these are not all the maximal cones, since there are closed points in \mathbb{OA}_{\mathbb{R}}^{n} outside of \mathbb{R}^{n}. For example, C_{\infty} is a maximal cone, and the type of numbers greater than all reals is closed. To characterize the cones of elements of \mathbb{R}^{n}, we need something slightly different.

Definition: A cone \mathfrak{m}\subseteq A is ideally maximal if A/\mathfrak{m} is a totally ordered field. Equivalently, if \mathfrak{m} is maximal and \mathfrak{m}^{\circ} is a maximal ideal.

Elements of \mathbb{R}^{n} correspond to ideally maximal cones of \mathbb{R}\left[x_{1},...,x_{n}\right].

\text{OrdSpec} also allows us to define the radical of a cone in an arbitrary partially ordered commutative ring.

Definition: For a cone C\subseteq A, \text{Rad}\left(C\right) is the intersection of all prime cones containing C. C is radical if C=\text{Rad}\left(C\right).

Conjecture: \text{Rad}\left(C\right) is the union over all finitely-generated subcones C\subseteq D of the closure of \left\{ p\in A\mid\exists f\in D\, pf-1\in D\right\}  (as before, the closure of a subset X\subseteq A is defined to be the set of all elements of A which are infima of chains contained in X).

Order schemes

Definition: An ordered ringed space is a topological space equipped with a sheaf of ordered rings. An ordered ring is local if it has a unique ideally maximal cone, and a locally ordered ringed space is an ordered ringed space whose stalks are local.

\text{OrdSpec}\left(A\right) can be equipped with a sheaf of ordered rings \mathcal{O}_{A}, making it a locally ordered ringed space.

Definition: For a prime cone \mathfrak{p}\subseteq A, the localization of A at \mathfrak{p}, denoted A_{\mathfrak{p}}, is the ring A_{\mathfrak{p}^{\circ}} equipped with an ordering that makes it a local ordered ring. This will be the stalk at \mathfrak{p} of \mathcal{O}_{A}. A fraction \frac{a}{b}\in A_{\mathfrak{p}} (b\notin\mathfrak{p}^{\circ}) is also an element of A_{\mathfrak{q}} for any prime cone \mathfrak{q}\subseteq A whose interior ideal does not contain b. This is an open neighborhood of \mathfrak{p} (its complement is the set of prime cones containing \left\langle b,-b\right\rangle ). There is a natural map A_{\mathfrak{p}}\rightarrow\text{Frac}\left(A/\mathfrak{p}\right) given by \frac{a}{b}\mapsto\frac{a+\mathfrak{p}^{\circ}}{b+\mathfrak{p}^{\circ}}, and the total order on A/\mathfrak{p} extends uniquely to a total order on the fraction field, so for a,b\in A_{\mathfrak{p}}, we can say that a\geq b at \mathfrak{p} if this is true of their images in \text{Frac}\left(A/\mathfrak{p}\right). We can then say that a\geq b near \mathfrak{p} if a\geq b at every point in some neighborhood of \mathfrak{p}, which defines the ordering on A_{\mathfrak{p}}.

Definition: For open U\subseteq\text{OrdSpec}\left(A\right), \mathcal{O}_{A}\left(U\right) consists of elements of \prod_{\mathfrak{p}\in U}A_{\mathfrak{p}} that are locally ratios of elements of A. \mathcal{O}_{A}\left(U\right) is ordered by a\geq b if and only if \forall\mathfrak{p}\in\text{OrdSpec}\left(A\right) a\geq b near \mathfrak{p} (equivalently, if \forall\mathfrak{p}\in\text{OrdSpec}\left(A\right) a\geq b at \mathfrak{p}).

A\subseteq\mathcal{O}_{A}\left(\text{OrdSpec}\left(A\right)\right), and this inclusion can be proper. Conjecture: \text{OrdSpec}\left(\mathcal{O}_{A}\left(U\right)\right)\cong U as locally ordered ringed spaces for open U\subseteq\text{OrdSpec}\left(A\right). This conjecture says that it makes sense to talk about whether or not a locally ordered ringed space looks locally like an order spectrum near a given point. Thus, if this conjecture is false, it would make the following definition look highly suspect.

Definition: An order scheme is a topological space X equipped with a sheaf of ordered commutative rings \mathcal{O}_{X} such that for some open cover of X, the restrictions of \mathcal{O}_{X} to the open sets in the cover are all isomorphic to order spectra of ordered commutative rings.

I don't have any uses in mind for order schemes, but then again, I don't know what ordinary schemes are for either and they are apparently useful, and order schemes seem like a natural analog of them.

Nonabelian modules

This is a rough overview of my thoughts on a thing I've been thinking about, and as such is incomplete and may contain errors. Proofs have been omitted when writing them out would be at all tedious.

Edit: It has been pointed out to me that near-ring modules have already been defined, and the objects I describe in this post are just near-ring modules where the near-ring happens to be a ring.


As you all know (those of you who have the background for this post, anyway), an R-module is an abelian group M (written additively) together with a multiplication map R\times M\rightarrow M such that for all \alpha,\beta\in R and x,y\in M, \alpha\cdot\left(x+y\right)=\alpha\cdot x+\alpha\cdot y\left(\alpha+\beta\right)\cdot x=\alpha\cdot x+\beta\cdot x, \left(\alpha\beta\right)\cdot x=\alpha\cdot\left(\beta\cdot x\right), and 1\cdot x=x.

What if we don't want to restrict attention to abelian groups? One could attempt to define a nonabelian module using the same axioms, but without the restriction that the group be abelian. As it is customary to write groups multiplicatively if they are not assumed to be abelian, we will do that, and the map R\times M\rightarrow M will be written as exponentiation (since exponents are written on the right, I'll follow the definition of right-modules, rather than left-modules). The axioms become: for all \alpha,\beta\in R and x,y\in M, \left(xy\right)^{\alpha}=x^{\alpha}y^{\alpha}x^{\alpha+\beta}=x^{\alpha}x^{\beta}, x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, and x^{1}=x.

What has changed? Absolutely nothing, as it turns out. The first axiom says again that M is abelian, because yx=x^{-1}\left(xy\right)^{2}y^{-1}=x^{-1}\left(x^{2}y^{2}\right)y^{-1}=xy. We'll have to get rid of that axiom. Our new definition, which it seems to me captures the essence of a module except for abelianness:

A nonabelian R-module is a group M (written multiplicatively) together with a scalar exponentiation map R\times M\rightarrow M such that for all \alpha,\beta\in R and x\in M, x^{1}=xx^{\alpha+\beta}=x^{\alpha}x^{\beta}, and x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}.

These imply that x^{0}=1, 1^{\alpha}=1, and x^{-1} is the inverse of x, because x\cdot x^{0}=x^{1}x^{0}=x^{1+0}=x^{1}=x1^{\alpha}=\left(1^{0}\right)^{\alpha}=1^{\left(0\alpha\right)}=1^{0}=1, and x\cdot x^{-1}=x^{1-1}=1.

Just like a \mathbb{Z}-module is just an abelian group, a nonabelian \mathbb{Z}-module is just a group. Just like a \mathbb{Z}/n\mathbb{Z}-module is an abelian group whose exponent divides n, a nonabelian \mathbb{Z}/n\mathbb{Z}-module is a group whose exponent divides n.

Exponentiation-like families of operations

Perhaps a bit more revealing is what nonabelian modules over free rings look like, since then the generators are completely generic ring elements. Where A is the generating set, a \mathbb{Z}\left\langle A\right\rangle -module is an abelian group together with endomorphisms \left\{ x\mapsto\alpha x\mid\alpha\in A\right\} , which tells us that modules are about endomorphisms of an abelian group indexed by the elements of a ring. Nonabelian modules are certainly not about endomorphisms. After all, in a nonabelian group, the map x\mapsto x^{2} is not an endomorphism. I will call the things that nonabelian modules are about "exponentiation-like families of operations'', and give four equivalent definitions, in roughly increasing order of concreteness and decreasing order of elegance. Definition 2 uses basic model theory, so skip it if that scares you. Definition 3 is the "for dummies'' version of definition 2.

Definition 0: Let G be a group, and let A be a family of functions from G to G (not necessarily endomorphisms). If G can be made into a nonabelian \mathbb{Z}\left\langle A\right\rangle -module such that x^{\alpha}=\alpha\left(x\right) for x\in G and \alpha\in A, then A is called an exponentiation-like family of operations on G. If so, the nonabelian \mathbb{Z}\left\langle A\right\rangle -module structure on G with that property is unique, so define x^{p} to be its value according to that structure, for p\in\mathbb{Z}\left\langle A\right\rangle  and x\in G.

Definition 1: A is an exponentiation-like family of operations on G if for all x\in G, the smallest subgroup containing x which is closed under actions by elements of A (which I will call \overline{\left\{ x\right\} }) is abelian, and the elements of A restrict to endomorphisms of it. Using the universal property of \mathbb{Z}\left\langle A\right\rangle , this induces a homomorphism \mathbb{Z}\left\langle A\right\rangle \rightarrow\text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}. Let x^{p} denote the action of p on x under that map, for p\in\mathbb{Z}\left\langle A\right\rangle . By \text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}, I mean the endomorphism ring of \overline{\left\{ x\right\} } with composition running in the opposite direction (i.e., the multiplication operation given by \left(f,g\right)\mapsto g\circ f). This is because of the convention that nonabelian modules are written as nonabelian right-modules by default.

Definition 2: Let consider the language \mathcal{L}_{Rings}\sqcup A, where \mathcal{L}_{Rings}:=\left\{ 0,1,+,-,\cdot\right\} is the language of rings, and each element of A is used as a constant symbol. Closed terms in \mathcal{L}_{Rings}\sqcup A act as functions from G to G, with the action of t written as x\mapsto x^{t}, defined inductively as: x^{0}:=1, x^{1}:=x, x^{\alpha}:=\alpha\left(x\right) for \alpha\in X, x^{t+s}:=x^{t}x^{s}, x^{-t}:=\left(x^{t}\right)^{-1}, and x^{ts}:=\left(x^{t}\right)^{s} for closed \mathcal{L}_{Rings}\sqcup A-terms t and s. A is called an exponentiation-like family of operations on G if x^{t}=x^{s} whenever T_{Rings}\models t=s, where T_{Rings} is the theory of rings. If A is an exponentiation-like family of operations on G and p\in\mathbb{Z}\left\langle A\right\rangle  is a noncommutative polynomial with variables in A, then for x\in Gx^{p} is defined to be x^{t} where t is any term representing p.

Definition 3: Pick a total order on the free monoid on A (e.g. by ordering A and then using the lexicographic order). The order you use won't matter. Given x\in G and w:=\alpha_{1}...\alpha_{n} in the free monoid on A, let x^{w}=\alpha_{n}\left(...\alpha_{1}\left(x\right)\right). Where p\in\mathbb{Z}\left\langle A\right\rangle is a noncommutative polynomial, p=c_{1}w_{1}+...+c_{n}w_{n} for some c_{1},...,c_{n}\in\mathbb{Z} and decreasing sequence w_{1},...,w_{n} of noncommutative monomials (elements of the free monoid on A). Let x^{p}=\left(x^{c_{1}}\right)^{w_{1}}...\left(x^{c_{n}}\right)^{w_{n}}A is called an exponentiation-like family of operations on G if for every x\in G and p,q\in\mathbb{Z}\left\langle A\right\rangle x^{pq}=\left(x^{p}\right)^{q} and x^{p+q}=x^{p}x^{q}.

These four definitions of exponentiation-like family are equivalent, and for exponentiation-like families, their definitions of exponentiation by a noncommutative polynomial are equivalent.

Facts: \emptyset is an exponentiation-like family of operations on G. If A is an exponentiation-like family of operations on G and B\subseteq A, then so is B. If G is abelian, then \text{End}\left(G\right) is exponentiation-like. Given a nonabelian R-module structure on G, the actions of the elements of R on G form an exponentiation-like family. In particular, if A is an exponentiation-like family of operations on G, then so is \mathbb{Z}\left\langle A\right\rangle , with the actions being defined as above.

[The following paragraph has been edited since this comment.]

For an abelian group A, the endomorphisms of A form a ring \text{End}\left(A\right), and an R-module structure on A is simply a homomorphism R\rightarrow\text{End}\left(A\right). Can we say a similar thing about exponentiation-like families of operations of G? Let \text{Exp}\left(G\right) be the set of all functions G\rightarrow G (as sets). Given \alpha,\beta\in\text{Exp}\left(G\right), let multiplication be given by composition: x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, addition be given by x^{\alpha+\beta}=x^{\alpha}x^{\beta}, negation be given by x^{-\alpha}=\left(x^{\alpha}\right)^{-1}, and 0 and 1 be given by x^{0}=1 and x^{1}=x. This makes \text{Exp}\left(G\right) into a near-ring. A nonabelian R-module structure on G is a homomorphism R\rightarrow\text{Exp}\left(G\right), and a set of operations on G is an exponentiation-like family of operations on G if and only if it is contained in a ring which is contained in \text{Exp}\left(G\right).

Some aimless rambling

What are some interesting examples of nonabelian modules that are not abelian? (That might sound redundant, but "nonabelian module'' means that the requirement of abelianness has been removed, not that a requirement of nonabelianness has been imposed. Perhaps I should come up with better terminology. To make matters worse, since the requirement that got removed is actually stronger than abelianness, there are nonabelian modules that are abelian and not modules. For instance, consider the nonabelian \mathbb{Z}\left[\alpha\right]-module whose underlying set is the Klein four group (generated by two elements a,b) such that a^{\alpha}=a, b^{\alpha}=b, and \left(ab\right)^{\alpha}=1.)

In particular, what do free nonabelian modules look like? The free nonabelian \mathbb{Z}-modules are, of course, free groups. The free nonabelian \mathbb{Z}/n\mathbb{Z}-modules have been studied in combinatorial group theory; they're called Burnside groups. (Fun but tangential fact: not all Burnside groups are finite (the Burnside problem), but despite this, the category of finite nonabelian \mathbb{Z}/n\mathbb{Z}-modules has free objects on any finite generating set, called Restricted Burnside groups.)

The free nonabelian \mathbb{Z}\left[\alpha\right]-modules are monstrosities. They can be constructed in the usual way of constructing free objects in a variety of algebraic structures, but that construction seems not to be very enlightening about their structure. So I'll give a somewhat more direct construction of the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators, which may also not be that enlightening, and which is only suspected to be correct. Define an increasing sequence of groups G_{n}, and functions \alpha_{n}:G_{n}\rightarrow G_{n+1}, as follows: G_{0} is the free group on d generators. Given G_{n}, and given a subgroup X\leq G_{n}, let the top-degree portion of X be \alpha_{n-1}^{k}\left(X\right) for the largest k such that this is nontrivial. Let H_{n} be the free product of the top-degree portions of maximal abelian subgroups of G_{n}. Let G_{n+1} be the free product of G_{n} with H_{n} modulo commutativity of the maximal abelian subgroups of G_{n} with the images of their top-degree portions in H_{n}. Given a maximal abelian subgroup X\leq G_{n}, let \alpha_{n}\restriction_{X} be the homomorphism extending \alpha_{n-1}\restriction_{X\cap G_{n-1}} which sends the top-degree portion identically onto its image in H_{n}. Since every non-identity element of G_{n} is in a unique maximal abelian subgroup, this defines \alpha_{n}. G:=\bigcup_{n}G_{n} with \alpha:=\bigcup_{n}\alpha_{n} is the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators. If A is a set, the free nonabelian \mathbb{Z}\left\langle A\right\rangle -modules can be constructed similarly, with \left|A\right| copies of H_{n} at each step. Are these constructions even correct? Are there nicer ones?

A nonabelian \mathbb{Z}\left[\frac{1}{2}\right]-module would be a group with a formal square root operation. As an example, any group of odd exponent n can be made into a \mathbb{Z}\left[\frac{1}{2}\right]-module in a canonical way by letting x^{\frac{1}{2}}=x^{\frac{n+1}{2}}. More generally, any group of finite exponent n can be made into a \mathbb{Z}\left[\left\{ p^{-1}|p\nmid n\right\} \right]-module in a similar fashion. Are there any more nice examples of nonabelian modules over localizations of \mathbb{Z}?

In particular, a nonabelian \mathbb{Q}-module would be a group with formal nth root operations for all n. What are some nonabelian examples of these? Note that nonabelian \mathbb{Q}-modules cannot have any torsion, for suppose x^{n}=1 for some n\neq0. Then x=\left(x^{n}\right)^{\frac{1}{n}}=1^{\frac{1}{n}}=1. More generally, nonabelian modules cannot have any n-torsion (meaning x^{n}=1\implies x=1) for any n which is invertible in the scalar ring.

The free nonabelian \mathbb{Z}\left[\frac{1}{m}\right]-modules can be constructed similarly to the construction of free nonabelian \mathbb{Z}\left[\alpha\right]-modules above, except that when constructing G_{n+1} from G_{n} and H_{n}, we also mod out by elements of G_{n} being equal to the mth powers of their images in H_{n}. Using the fact that \mathbb{Q}\cong\mathbb{Z}\left\langle \left\{ p^{-1}|\text{primes }p\right\} \right\rangle , this lets us modify the construction of free nonabelian \mathbb{Z}\left\langle A\right\rangle -modules to give us a construction of free nonabelian \mathbb{Q}-modules. Again, is there a nicer way to do it?

Topological nonabelian modules

It is also interesting to consider topological nonabelian modules over topological rings; that is, nonabelian modules endowed with a topology such that the group operation and scalar exponentiation are continuous. A module over a topological ring has a canonical finest topology on it, and the same remains true for nonabelian modules. For finite-dimensional real vector spaces, this is the only topology. Does the same remain true for finitely-generated nonabelian \mathbb{R}-modules? Finite-dimensional real vector spaces are complete, and topological nonabelian modules are, in particular, topological groups, and can thus be made into uniform spaces, so the notion of completeness still makes sense, but I think some finitely-generated nonabelian \mathbb{R}-modules are not complete.

A topological nonabelian \mathbb{R}-module is a sort of Lie group-like object. One might try constructing a Lie algebra for a complete nonabelian \mathbb{R}-module M by letting the underlying set be M, and defining x+y=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}\right)^{\left(\varepsilon^{-1}\right)} and \left[x,y\right]=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}x^{-\varepsilon}y^{-\varepsilon}\right)^{\left(\varepsilon^{-2}\right)}. One might try putting a differential structure on M such that this is the Lie algebra of left-invariant derivations. Does this or something like it work?

A Lie group is a nonabelian \mathbb{R}-module if and only if its exponential map is a bijection between it and its Lie algebra. In this case, scalar exponentiation is closely related to the exponential map by a compelling formula: x^{\alpha}=\exp\left(\alpha\exp^{-1}\left(x\right)\right). As an example, the continuous Heisenberg group is a nonabelian \mathbb{R}-module which is not abelian. This observation actually suggests a nice class of examples of nonabelian modules without a topology: given a commutative ring R, the Heisenberg group over R is a nonabelian R-module.

The Heisenberg group of dimension 2n+1 over a commutative ring R has underlying set R^{n}\times R^{n}\times R^{1}, with the group operation given by \left(\boldsymbol{a}_{1},\boldsymbol{b}_{1},c_{1}\right)*\left(\boldsymbol{a}_{2},\boldsymbol{b}_{2},c_{2}\right):=\left(\boldsymbol{a}_{1}+\boldsymbol{a}_{2},\boldsymbol{b}_{1}+\boldsymbol{b}_{2},c_{1}+c_{2}+\boldsymbol{a}_{1}\cdot\boldsymbol{b}_{2}-\boldsymbol{a}_{2}\cdot\boldsymbol{b}_{1}\right). The continuous Heisenberg group means the Heisenberg group over \mathbb{R}. Scalar exponentiation on a Heisenberg group is just given by scalar multiplication: \left(\boldsymbol{a},\boldsymbol{b},c\right)^{\alpha}:=\left(\alpha\boldsymbol{a},\alpha\boldsymbol{b},\alpha c\right).