Uniqueness of mathematical structures

This post is an introduction to model theory, of sorts. Occasionally I get asked what model theory is, and I generally find it quite difficult to give someone who doesn't already know any model theory a good answer to this question, that actually says anything useful about what model theory is really about without leaving them hopelessly lost. This is my attempt to provide a real taste of model theory in a way that should be accessible to a math grad student without a background in logic.

Warm-up exercise

Let's say I make a graph with the following procedure: I start with a countably infinite set of vertices. For each pair of vertices, I flip a fair coin. If the coin lands heads, I put an edge between those two vertices; if the coin lands tails, no edge.

Now you make another graph in a very similar manner. You also start with a countably infinite set of vertices. But instead of flipping a coin, you roll a fair standard six-sided die for each pair of vertices. If the die comes up 6, you put an edge between those two vertices; if it comes up anything from 1 through 5, no edge.

What is the probability that these two graphs are isomorphic?

For the numerical answer, paste "Gur zhygvcyvpngvir vqragvgl" into https://rot13.com/. An explanation will appear later in this post.

Introduction

There are several cases in which we can identify a mathematical object up to isomorphism with a list of first-order properties it satisfies (I'll tell you what that means in a sec) and some data about cardinality. Here's a couple examples: All countable dense linear orders without endpoints are isomorphic. Any two algebraically closed fields of the same characteristic, which have transcendence bases of the same cardinality, are isomorphic. It turns out that the possibility of uniquely specifying a mathematical structure in this way corresponds to interesting structural properties of that structure.

First, the basic definitions:

A first-order language consists of a set of relation symbols, each of which is labeled with a number representing its arity (number of inputs it takes), a set of function symbols, each of which is also labeled with a number representing its arity, and a set of constant symbols (which could also just be thought of as 0-ary function symbols). For example, the language of linear orders has one binary relation < and no functions or constants. The language of fields has constants 0 and 1, binary functions + and \cdot, a unary function - (no unary function for reciprocal, because the functions should be total), and no relations.

A first-order structure in a given language is a set X in which each constant symbol is interpreted as an element of X, each n-ary function symbol is interpreted as a function X^n\rightarrow X, and each n-ary relation symbol is interpreted as a subset of X^n (or alternatively, as a function X^n\rightarrow\{\text{True},\text{False}\}). So linear orders and fields are examples of structures in their respective languages.

We can compose function symbols, constant symbols, and variables into ways of pointing to elements of a structure, called terms. We have as many variables as we want, and they are terms. Constant symbols are terms. And for each n-ary function symbol f and terms t_1,...,t_n, f(t_1,...,t_n) is a term. So in the language of fields, we can construct terms representing integers by adding however many 1s together (and then negating to get negative numbers), and then combine these with variables using addition and multiplication to get terms representing polynomials in however many variables with integer coefficients. In the language of linear orders, since we have no functions or constants, the only terms are variables.

A first-order formula is a way of actually saying things about a first-order structure and elements of it represented by variables. If R is an n-ary relation and t_1,...t_n are terms, then R(t_1,...,t_n) is a formula. If t_1,t_2 are terms, then t_1=t_2 is a formula (you can think of this as just meaning that languages always have the binary relation = by default). Boolean combinations of formulas are formulas (i.e. if \varphi and \psi are formulas, then so are \varphi \& \psi, \varphi\text{or}\psi, and \neg\varphi), and if \varphi is a formula that refers to a variable x, then \forall x\,\varphi and \exists x\,\varphi are formulas. Any variable that appears in a formula without being bound to a quantifier \forall or \exists is called a free variable, and if each free variable is assigned to an element of a structure, the formula makes a claim about them, which can be either true or false. For example, in a ring, \exists y\, x\cdot y=1 is true iff x is a unit.

A first-order formula with no free variables is called a sentence. These are true or false statements about a first-order structure. Many types of mathematical objects are defined by listing first-order sentences that are true of them. For instance, a linear order is a structure with a < relation satisfying transitivity (\forall x \forall y \forall z\, x<y \& y<z \implies x<z), antisymmetry (\forall x \forall y \neg(x<y \& y<x)), and totality (\forall x \forall y\, x<y \text{ or } y<x \text{ or } x=y), and a linear order is dense without endpoints if it also satisfies \forall x \forall y \exists z \, x<z \& z<y and \forall x \exists y \exists z \, y<x \& x<z. These are all first-order sentences. Algebraically closed fields of a given characteristic are another example. The field axioms are first-order sentences. For each positive integer n, we can formulate a first-order sentence saying that every polynomial of degree n has a root: \forall y_0 \forall y_1 ... \forall y_{n-1} \exists x \, y_0 + y_1x + ... + y_{n-1}x^{n-1}+x^n = 0 (the y_is represent the coefficients, with the leading coefficient normalized to 1). So we just add in these infinitely many sentences, one for each n. And we can say that the field has characteristic p by saying 1+...+1=0 (with p ones), or say that it has characteristic 0 by, for each prime p, saying \neg(1+...+1=0).

First-order sentences can tell us a lot about a structure, but not everything, unless the structure is finite.

Löwenheim–Skolem theorem: Given a countable set of first-order sentences (in particular, any set of sentences if the language is countable), if there is any infinite structure in which they are all true, then there are first-order structures of every infinite cardinality in which they are all true.

This is why the uniqueness results all have to say something about cardinality. You might also think of some examples of ways to identify an infinite mathematical object up to isomorphism with a list of axioms without saying directly anything about cardinality, but in all such cases, you'll be using an axiom that isn't first-order. For instance, all Dedekind-complete ordered fields are isomorphic to the reals, but Dedekind-completeness isn't a first-order sentence. Same goes for any way of characterizing the natural numbers up to isomorphism that says something like "every set of natural numbers that contains 0 and is closed under successor contains all of the natural numbers".

Countable structures

Let's go back to the example of countable dense linear orders. If you don't know the proof that all countable dense linear orders are isomorphic, here it goes: suppose we have two countable dense linear orders, X and Y. Since they're countable, we can label the elements of each of them with distinct natural numbers. We're going to match elements of X to elements of Y one at a time such that we get an isomorphism at the end. To ensure that every element of X gets matched to something in Y, on odd-numbered steps, we'll take the lowest-numbered element of X that hasn't been matched yet, and match it with an element of Y. Similarly, to ensure that every element of Y gets matched to something in X, on even-numbered steps, we'll take the lowest-numbered element of Y that hasn't been matched yet, and match it with an element of X. As for what we do on each step (suppose it's an odd-numbered step; even-numbered steps are the same but with the roles of X and Y reversed), at the start of the step, finitely many elements of X have already been matched. We take the first element that hasn't yet been matched. Call it x. x is either greater than all previously matched elements, less than all previously matched elements, or between two previously matched elements that don't already have previously matched elements between them. Since Y is dense and has no endpoints, we know that in the first case, there will be something greater than all previously matched elements of Y, so we can match x to it; in the second case, there will be something less than all previously matched elements of Y for us to match x to; and in the third case, there will be something between the elements matched to the elements on either side of x, which we can match x to. By doing this, we continue to preserve the ordering at each step, so the bijection we get at the end is order-preserving, and thus an isomorphism.

Now let's get back to the warm-up exercise. A graph can be viewed as a first-order structure whose elements are the vertices, with a single binary relation E (the edge relation) that is symmetric and anti-reflexive (symmetry and anti-reflexivity are both first-order conditions). There are some more first-order sentences satisfied by both of our random graphs with probability 1. Given any two finite disjoint sets of vertices, we can find another vertex that's connected to everything in the first set and not connected to anything in the second set. This is because each vertex has the same positive probability of having this property, they're all independent, and there's infinitely many of them, so there also must be some (in fact, infinitely many) that have all the desired edges and none of the undesired edges. To write this condition using first-order sentences, for each natural number n and m, we have a sentence \forall x_1 ... \forall x_n\forall y_1 ... \forall y_m \, (x_1\neq y_1 \& ... \& x_n\neq y_m) \implies \exists z \, E(z,x_1)\&...\&E(z,x_n)\&\neg E(z,y_1)\&...\&\neg E(z,y_m)
(the big conjunction before "\implies" includes x_i\neq y_j for each 1\leq i\leq n and 1\leq j\leq m, so that this says \{x_1,...,x_n\} and \{y_1,...,y_m\} are disjoint).

This is enough for us to construct an isomorphism, using essentially the same proof as for countable dense linear orders. Since we each started with countably many vertices, we can label each of our vertices with natural numbers, and then iteratively match the next available unmatched vertex in one graph to a vertex on the other, alternating between which graph we take the next available unmatched vertex from on each step, just like before. On each step, only finitely many vertices have been matched. The new vertex shares edges with some of the already matched vertices and doesn't share edges with some others. We need to match it with a vertex in the other graph that shares exactly the same pattern of edges with previously matched vertices. And we know that somewhere in that graph, there must be such a vertex. So we can match the new vertex and keep going, and the bijection we get at the end preserves the edge relation, and is thus an isomorphism.

For the general argument that these are both special cases of, we'll need the concept of a type (not to be confused with the identically-named concept from type theory). Given a first-order structure X and c_1,...,c_n\in X, say that a,b\in X have the same type over A if for every first-order formula \varphi(x,y_1,...,y_n) (where x,y_1,...,y_n are its free variables), \varphi(a,c_1,...,c_n) holds iff \varphi(b,c_1,...,c_n) does. So, for example, in a dense linear order without endpoints, if c_1<a<c_2, then, in order for b to have the same type as a over c_1,c_2, it must be the case that c_1<b<c_2 as well, since y_1<x<y_2 is a first-order formula, and a and b must satisfy exactly the same first-order formulas with parameters in c_1,c_2. And as it turns out, this is enough; if c_1<a<c_2 and c_1<b<c_2, then a and b have the same type over c_1,c_2. In an infinite random graph, if vertices a and b have the same type over some other vertices c_1,..,c_n, then a must have an edge to each c_i that b has an edge to, and vice-versa. Again, this turns out to be enough to guarantee that they have the same type.

In both of these cases, for any finite set of elements c_1,...,c_n, there are only finitely many types over c_1,...,c_n. Let's count them. Each c_i is its own type, since x=y_i is a formula, so if a and c_i have the same type over c_1,...,c_n, then, since c_i=c_i, a=c_i as well. Let's ignore these and count the rest. In a dense linear order without endpoints, we can assume WLOG that c_1<c_2<...<c_n. There are n+1 nontrivial types over c_1,...,c_n: \{a\mid a<c_1\}, \{a\mid c_n<a\}, and, for each 1\leq i<n, \{a\mid c_i<a<c_{i+1}\}. In an infinite random graph, there are 2^n nontrivial types over n vertices c_1,...,c_n: for each S\subseteq\{c_1,...,c_n\}, there's a type of vertices that have edges to everything in S and no edges to anything in \{c_1,...,c_n\}\setminus S.

Theorem: Let X be a countably infinite first-order structure such that for every n\in\mathbb{N} and c_1,...,c_n\in X, there are only finitely many types over c_1,...,c_n. Then every countable structure Y satisfying the same first-order sentences that X does is isomorphic to X.

That is, our ability to specify a countable structure (up to isomorphism) by its first-order properties corresponds exactly to the condition that there are only finitely many different behaviors that elements of the structure can have in relation to any given finite subset. The proofs that all countable dense linear orders without endpoints are isomorphic and that all countable random graphs are isomorphic look the same because they both follow the proof of this theorem, which goes like so:

Suppose there are only finitely many types over c_1,...,c_n. Let p be one of those types, and let q_1,...,q_k be the others. For each q_i, there's some formula \varphi_i(x,c_1,...,c_n) that's true for x\in p but not for x\in q_i. Then \varphi_1(x,c_1,...,c_n) \& ... \& \varphi_k(x,c_1,...,c_n) is a formula that holds only for x\in p. That is, the entire type is specified by a single formula; it wasn't just coincidence that we were able to find such a formula for the types in each of our two examples.

Lemma: If X and Y are two first-order structures in the same language which satisfy all the same sentences, c_1,...,c_n\in X and d_1,...,d_n\in Y satisfy all the same formulas (i.e., for any \varphi(x_1,...,x_n), \varphi(c_1,...,c_n) is true in X iff \varphi(d_1,...,d_n) is true in Y), and X has only finitely many types, then there's a natural bijection between types in X over c_1,...,c_n and types in Y over d_1,...,d_n. A formula \varphi(x,c_1,...,c_n) is true for x in some type over c_1,...,c_n iff \varphi(x,d_1,...,d_n) is true for x in the corresponding type over d_1,...,d_n.

Proof: Let p_1,...,p_k be the types in X over c_1,...,c_n, and for each 1\leq i\leq k, let \varphi_i(x,c_1,...,c_n) be a formula specifying that p_i is the type of x. These formulas \varphi_i specify types in Y over d_1,...,d_n as well; for any other formula \psi(x,y_1,...,y_n), either \forall x\, \varphi_i(x,c_1,...,c_n)\implies\psi(x,c_1,...,c_n) or \forall x\, \varphi_i(x,c_1,...,c_n)\implies\neg\psi(x,c_1,...,c_n) in X. These are first-order formulas, so again since d_1,...,d_n satisfy the same first-order formulas that c_1,...,c_n do, one of \forall x\, \varphi_i(x,d_1,...,d_n)\implies\psi(x,d_1,...,d_n) or \forall x\, \varphi_i(x,d_1,...,d_n)\implies\neg\psi(x,d_1,...,d_n) is true in Y as well. So \varphi_i(x,d_1,...,d_n) determines the truth value of every such formula; that is, it specifies the type of x over d_1,...,d_n, and formulas are true in this type iff they are true in the corresponding type in X over c_1,...,c_n. To show that these are all of the types in Y over d_1,...,d_n, consider the formula \forall x\, \varphi_1(x,y_1,...,y_n)\text{ or ... or }\varphi_k(x,y_1,...,y_n). In X, when we plug in c_1,...,c_n, the formula is true. And d_1,...,d_n satisfies all the same formulas as c_1,...,c_n, so the same formula must also be true in Y when we plug in d_1,...,d_n. That is, for any b\in Y, there's some i such that \varphi_i(b,d_1,...,d_n) is true, so every element of Y must have one of the above types.

Armed with this lemma, we can prove the theorem. Let X and Y be countable structures satisfying the same first-order sentences, and suppose for every c_1,...,c_n\in X, there are only finitely many types over c_1,...,c_n. We'll match elements of X to elements of Y one at a time, using the same back-and-forth trick from our two examples to ensure that we get a bijection at the end. After n steps, we'll have c_1,...,c_n from X matched with d_1,...,d_n from Y, and we'll want to ensure that c_1,...,c_n and d_1,...,d_n satisfy exactly the same first-order formulas. If we've done this, then on step n+1, we'll have an element either of X or of Y, which we need to match with some element of the other one. We can match it to an element that has the corresponding type; that is, we're matching c_{n+1}\in X and c_{n+1}\in Y such that the type of c_{n+1} over c_1,...,c_n corresponds to the type of d_{n+1} over d_1,...,d_n. Then c_1,...,c_{n+1} satisfy the same formulas that d_1,...,d_{n+1} do, so by induction, c_1,...,c_n and d_1,...,d_n satisfy the same formulas for every n (the assumption that X and Y satisfy the same first-order sentences provides a base case). Thus, the bijection we get at the end preserves the truth-values of all formulas, so it is an isomorphism, and we're done.

As it turns out, the converse of the theorem is also true. Given a set of first-order sentences for which there is, up to isomorphism, only one countable model, all models have only finitely many types over any finite list of elements. Whenever there's infinitely many types, there will be some types (which cannot be specified by a single formula) that appear in some models but not in others.

Uncountable structures

Let's turn to the other example I introduced at the beginning: any two algebraically closed fields of the same characteristic with transcendence bases of the same cardinality are isomorphic. Every field has a transcendence basis, so a corollary of this is that any two uncountable algebraically closed fields of the same characteristic and cardinality are isomorphic.

A sketch of the proof: Given algebraically closed fields F and K of the same characteristic, with transcendence bases B_F and B_K of the same cardinality, any isomorphism between a subfield of F and a subfield of K extends to a maximal such isomorphism (by Zorn's lemma). Since B_F and B_K have the same cardinality, there's a bijection between them, and since F and K have the same characteristic, this bijection extends to an isomorphism between the fields they generate. Thus there is a maximal isomorphism between subfields of F and of K which restricts to a bijection between B_F and B_K. Now we just need to show that these subfields are all of F and K. This is because, given any such isomorphism between subfields F'\subseteq F and K'\subseteq K with B_F\subseteq F' and B_K\subseteq K', if F'\neq F, then let a\in F\setminus F', and let f(x) be the minimal polynomial of a. Applying the isomorphism to the coefficients gives us an irreducible polynomial over K', which must have a root b\in K, and then by matching a with b, we get an isomorphism F'[a]\cong K'[b], contradicting maximality of the isomorphism.

Here's another example: Any two vector spaces over the same vector space, with bases of the same cardinality, are isomorphic. Since every vector space has a basis, a corollary of this is that, over a countable field, any two uncountable vector spaces of the same cardinality are isomorphic. Citing Zorn's lemma is overkill, since there's only one way to extend a bijection between bases to an isomorphism. But the basic idea is the same in each case: We have an appropriate notion of basis, and we extend a bijection between bases to an isomorphism. And vector spaces are also first-order structures; the language has a binary operation +, a constant 0, and, for each scalar \alpha, a unary operation for multiplication by \alpha.

The thing that unites both these cases is called strong minimality. A first-order structure is called minimal if every set defined by a first-order formula is either finite, or the complement of a finite set. More formally: X is minimal if for every formula \varphi(x,y_1,...,y_n) and b_1,...,b_n\in X, one of \{a\in X\mid\varphi(a,b_1,...,b_n)\} or \{a\in X\mid\neg\varphi(a,b_1,...,b_n)\} is finite. We call a structure strongly minimal if every structure satisfying the same first-order sentences is also minimal. (This turns out to be equivalent to, for each \varphi(x,y_1,...,y_n), there's a finite upper bound on the size of whichever of \{a\in X\mid\varphi(a,b_1,...,b_n)\} or \{a\in X\mid\neg\varphi(a,b_1,...,b_n)\} is finite, as b_1,...,b_n vary.)

Let's go over the general notion of "basis" we'll be using: Say that a is algebraic over b_1,...,b_n if there is a formula \varphi(x,y_1,...,y_n) such that \varphi(a,b_1,...,b_n) holds, and \{a'\in X\mid\varphi(a',b_1,...,b_n)\} is finite. In algebraically closed fields, this corresponds to the usual notion of a being algebraic over the subfield generated by b_1,...,b_n. In vector spaces, this corresponds to a being a linear combination of b_1,...,b_n. Call B\subseteq X independent if no element of B is algebraic over any other elements of B. In other words, you can't pin down an element of B to one of finitely many possibilities by using a single formula and other elements of B. In a vector space, independence is linear independence. In an algebraically closed field, independence is algebraic independence. Now call B a basis if it is a maximal independent set. If X is minimal, this turns out to imply that every a\in X is algebraic over some b_1,...,b_n\in B. An increasing union of independent sets is independent, so by Zorn's lemma, every structure has a basis.

Now let's look at type spaces in minimal structures. Let X be a minimal structure and b_1,...,b_n\in X. If a is algebraic over b_1,...,b_n, then there's some formula \varphi(x,y_1,...,y_n) such that \varphi(a,b_1,...,b_n) holds and \{a'\in X\mid\varphi(a',b_1,...,b_n)\} is as small as possible. Then a and a' have the same type iff \varphi(a',b_1,...,b_n). So this type is implied by a single formula, and there are only finitely many elements of this type. There's only one remaining type: the type of elements that aren't algebraic over b_1,...,b_n. If a and a' are both non-algebraic over b_1,...,b_n, then for every formula \varphi(x,y_1,...,y_n), since X is minimal, one of \varphi(x,b_1,...,b_n) and \neg\varphi(x,b_1,...,b_n) must have only finitely many solutions x; a and a', being non-algebraic, must both be solutions to the other one. This shows they have the same type over b_1,...,b_n. This non-algebraic type is optional; in some cases, there might not be any elements that aren't algebraic over b_1,...,b_n.

Let X and Y be minimal structures in the same language, which satisfy the same first-order sentences. They each have a basis. If those bases have the same cardinality, then X and Y are isomorphic. Say a "partial isomorphism" between X and Y is a bijection between a subset of X and a subset of Y, such that whenever a formula is true about some elements of the subset of X, then it is also true about the corresponding elements of the subset of Y, and vice-versa. If B_X and B_Y are bases for X and Y, respectively, then a bijection between B_X and B_Y is a partial isomorphism (this is because if b_1,...,b_n\in B_X and c_1,...,c_n\in B_Y satisfy all the same formulas, and b_{n+1}\in B_X\setminus\{b_1,...,b_n\} and c_{n+1}\in B_Y\setminus\{c_1,...,c_n\}, then b_{n+1} must have the unique non-algebraic type over b_1,...,b_n, c_{n+1} has the unique non-algebraic type over c_1,...,c_n, and these unique non-algebraic types satisfy the same formulas, so it follows by induction on the number of variables that a formula is true of distinct elements of B_X iff it is true of distinct elements of B_Y). An increasing union of partial isomorphisms is a partial isomorphism, so by Zorn's lemma, there's a maximal partial isomorphism extending a bijection between B_X and B_Y. If this maximal partial isomorphism is a bijection between X'\subseteq X and Y'\subseteq Y, and X'\neq X, then let a\in X\setminus X'. a is algebraic over X' (since B_X\subseteq X'), so there's a single formula \varphi(x,b_1,...,b_n) (b_1,..,b_n\in X') that is true for x=a, and which determines its type over X' (meaning, determines its type over b_1',...,b_m' for every b_1',...,b_m'\in X'). Then, where d_1,...,d_n\in Y' correspond to b_1,...,b_n under the partial isomorphism, there must be c\in Y such that \varphi(c,d_1,...,d_n) (since d_1,...,d_n satisfies the same formulas b_1,...,b_n do, and \exists x \varphi(x,b_1,...,b_n)). c\notin Y', because this can be expressed as part of the type of c over Y', which is the same as the type of a over X'. Thus we can extend the partial isomorphism by matching a with c. Thus, in our maximal partial isomorphism, X'=X, and for the same reason, Y'=Y, so it is an isomorphism.

So for a strongly minimal structure, the structures satisfying the same sentences are classified by the cardinality of a basis. This isn't quite the end of the story; in some cases, a structure with too small a basis would be finite, and we could thus distinguish it from the rest with a first-order sentence saying that there are n distinct elements (for large enough n). This isn't the case for algebraically closed fields, which are infinite even when the empty set is a transcendence basis. But for vector spaces, the empty basis generates a one-element vector space, so an infinite vector space must have basis of size at least one.

And if the vector space is over a finite field, then its basis must be infinite. Another case where where the basis must be infinite is an infinite set. A set is a first-order structure in the language with no relations, no functions, and no constants. Every subset of a set is independent, so a basis for the set is just the entire set. In these cases where a basis must be infinite; there's only one (up to isomorphism) countable model: the model with a countably infinite basis. You can check that both of these examples satisfy the finitely-many-types condition from the previous section for having a unique countable model.

So the general story, for a strongly minimal structure X, is that there is some n\in\mathbb{N}\cup\{\aleph_0\} such that structures satisfying the same sentences as X are classified by cardinalities that are at least n, that being the cardinality of a basis. In a countable language, the cardinality of a structure is the maximum of \aleph_0 and the cardinality of a basis, so it follows that an uncountable strongly minimal structure is isomorphic to all structures of the same cardinality satisfying the same sentences.

In the previous section, we had a converse, so you may ask, if an uncountable structure is isomorphic to all structures of the same cardinality satisfying the same sentences, is it strongly minimal? This is not quite true. For example, consider a vector space V over a countable field, where we add two unary relations W and U to the language, each of which define subspaces of V, which are disjoint and span V, and then add a unary function T to the language, which is a linear function such that T^2=id_V, and T\restriction_W is an isomorphism between W and U. Vector spaces like this are classified by the dimension of W, so there is a unique one (up to isomorphism) of any given uncountable cardinality. It is not strongly minimal because W itself is a formula picking out a set that is neither finite nor the complement of a finite set. But it is almost strongly minimal, in the sense that it is basically just the vector space W^2, and W is strongly minimal. It turns out that for any uncountable structure (in a finite or countable language) that is isomorphic to every structure of the same cardinality and satisfying the same sentences, there's a formula defining a subset that is strongly minimal in an appropriate sense, such that the rest of the structure can be parameterized somehow using the subset.

Exact 2-cycles are degenerate isomorphisms

The situation in which you have vector spaces V and W, and linear maps f:V\rightarrow W and g:W\rightarrow V such that \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right) often arises in the situation in which you would have an isomorphism between V and W if you knew how to divide by 0. Specifically, this happens when you'd need to divide by 0 exactly once; in similar situations in which you'd need to know how to divide by 0 multiple times in order to get an isomorphism, you get f:V\rightarrow W and g:W\rightarrow V such that f\circ g=0 and g\circ f=0 but whose kernels and images are not necessarily equal.

I'll call such a pair \left(f,g\right) with \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right) an exact 2-cycle of vector spaces. Note that the two vector spaces V and W in an exact 2-cycle are in fact isomorphic, as \dim\left(V\right)=\dim\left(\ker\left(f\right)\right)+\dim\left(\text{im}\left(f\right)\right)=\dim\left(\text{im}\left(g\right)\right)+\dim\left(\ker\left(g\right)\right)=\dim\left(W\right).

Adjugates

Given a finite-dimensional vector space V and an invertible linear map f:V\rightarrow V, its adjugate is almost its inverse; you just have to divide by \det\left(f\right). If f:V\rightarrow V is not invertible, then of course, \det\left(f\right)=0, so dividing by \det\left(f\right) doesn't work. But if f has nullity 1, then \ker\left(f\right)=\text{im}\left(\text{adj}\left(f\right)\right) and \ker\left(\text{adj}\left(f\right)\right)=\text{im}\left(f\right). That is, \left(f,\text{adj}\left(f\right)\right) is an exact 2-cycle. If f has nullity N\left(f\right)\geq2, then \det\left(f\right)=0^{N\left(f\right)}, and hence inverting f requires dividing by 0 more than once, and \text{adj}\left(f\right)=0.

Homogeneous polynomials and multilinear forms

Given a vector space V over a field k, let k\left[V\right]_{2} denote the space of quadratic forms on V (that is, homogeneous quadratic polynomial maps V\rightarrow k), and let Sym^{2}V^{*} denote the space of symmetric bilinear forms on V.

Given a symmetric bilinear form B on V, we can construct a quadratic form Q on V by Q\left(\vec{x}\right):=B\left(\vec{x},\vec{x}\right). This gives us a map f:Sym^{2}V^{*}\rightarrow k\left[V\right]_{2} by f\left(B\right)\left(\vec{x}\right)=B\left(\vec{x},\vec{x}\right).

2B\left(\vec{x},\vec{y}\right)=B\left(\vec{x}+\vec{y},\vec{x}+\vec{y}\right)-B\left(\vec{x},\vec{x}\right)-B\left(\vec{y},\vec{y}\right), so we can recover B from f\left(B\right) by B\left(\vec{x},\vec{y}\right)=\frac{1}{2}\left(f\left(B\right)\left(\vec{x}+\vec{y}\right)-f\left(B\right)\left(\vec{x}\right)-f\left(B\right)\left(\vec{y}\right)\right). That is, the map g:k\left[V\right]_{2}\rightarrow Sym^{2}V^{*} given by g\left(Q\right)\left(\vec{x},\vec{y}\right)=Q\left(\vec{x}+\vec{y}\right)-Q\left(\vec{x}\right)-Q\left(\vec{y}\right) is twice the inverse of f.

This doesn't quite work if \text{char}\left(k\right)=2, since we can't do the part where we divide by 2. In fact, f is not invertible in this case. But g is still a well-defined map k\left[V\right]_{2}\rightarrow Sym^{2}V^{*}, and it's still true that g\circ f=2 id_{Sym^{2}V^{*}} and f\circ g=2 id_{k\left[V\right]_{2}}; it's just that now that means g\circ f=0 and f\circ g=0. In fact, \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right). \ker\left(g\right) and \text{im}\left(f\right) are the \dim\left(V\right)-dimensional space of diagonal quadratic forms (polynomials that are linear combinations of squares of linear functions V\rightarrow k), and \ker\left(f\right) and \text{im}\left(g\right) are the {\dim\left(V\right) \choose 2}-dimensional space of alternating symmetric bilinear forms. Thus Sym^{d}V^{*} and k\left[V\right]_{2} are both \dim\left(V\right)+{\dim\left(V\right) \choose 2}-dimensional.

Similar things happen with higher degree homogeneous polynomials and symmetric multilinear forms. Let k\left[V\right]_{d} be the space of homogeneous degree-d polynomials on V and Sym^{d}V^{*} the space of symmetric d-linear forms on V. We have functions f:Sym^{d}V^{*}\rightarrow k\left[V\right]_{d} given by f\left(\Phi\right)\left(\vec{x}\right):=\Phi\left(\vec{x},...,\vec{x}\right) and g:k\left[V\right]_{d}\rightarrow Sym^{d}V^{*} given by g\left(\phi\right)\left(\vec{x}^{1},...,\vec{x}^{d}\right)=\sum_{I\subseteq\left[d\right]}\left(-1\right)^{d-\left|I\right|}\phi\left(\sum_{i\in I}\vec{x}^{i}\right). g\left(f\left(\Phi\right)\right)=d!\Phi and f\left(g\left(\phi\right)\right)=d!\phi, so if \text{char}\left(k\right)=0 or \text{char}\left(k\right)>d, then f and g are bijections, and d! times each others' inverse. Otherwise, g\circ f=0 and f\circ g=0. If \frac{d}{2}<\text{char}\left(k\right)\leq d, then \text{char}\left(k\right) divides d! with multiplicity 1, and \ker\left(f\right)=\text{im}\left(g\right) and \ker\left(g\right)=\text{im}\left(f\right). If 1<\text{char}\left(k\right)\leq\frac{d}{2}, then \text{char}\left(k\right) divides d! with multiplicity \geq2, and all bets are off. Though \dim\left(k\left[V\right]_{d}\right)=\dim\left(Sym^{d}V^{*}\right)={\dim\left(V\right)+d-1 \choose d}, no matter what \text{char}\left(k\right) is.

Newtonian spacetime

In special relativity, we work with a 4-dimensional (3 for space and 1 for time) real vector space T, with a symmetric bilinear form \left\langle \cdot,\cdot\right\rangle , called the Minkowski inner product, of signature \left(-+++\right); that is, the associated quadratic form can be given, in coordinates, by -t^{2}+x^{2}+y^{2}+z^{2} (t is the time coordinate and x,y,z are spatial coordinates for some reference frame). If \left\langle \vec{v},\vec{v}\right\rangle >0, then \vec{v} is spacelike, and \sqrt{\left\langle \vec{v},\vec{v}\right\rangle } measures its distance (in the reference frame in which its temporal coordinate is 0). If \left\langle \vec{v},\vec{v}\right\rangle <0, then \vec{v} is timelike, and \sqrt{-\left\langle \vec{v},\vec{v}\right\rangle } measures its duration (in the reference frame in which it is at rest). By currying, the Minkowski inner product can be seen as a linear map T\rightarrow T^{*}, where T^{*} is the vector space of linear maps T\rightarrow\mathbb{R}. Since the Minkowski inner product is nondegenerate, this linear map T\rightarrow T^{*} is an isomorphism.

In Newtonian physics, things are a little different. We can still work in 4-dimensional spacetime, but we don't have a single Minkowski inner product measuring both distance and duration. We do have a global notion of time; that is, there's a linear map t:T\rightarrow\mathbb{R} that tells you what time it is at each point in spacetime. \ker\left(t\right) is space in the present moment, so it should be Euclidean space; that is, it should be equipped with an ordinary inner product.

The time function t induces a degenerate inner product on T by \left\langle \vec{v},\vec{w}\right\rangle :=t\left(\vec{v}\right)t\left(\vec{w}\right). As before, this can be seen as a linear map T\rightarrow T^{*} (it sends \vec{v} to t\left(\vec{v}\right)t), with 1-dimensional image \text{span}\left(t\right) and 3-dimensional kernel \ker\left(t\right).

The ordinary inner product on \ker\left(t\right) gives us a degenerate inner product on T^{*}: since our inner product on \ker\left(t\right) is non-degenerate, it induces an isomorphism between \ker\left(t\right) and its dual, and hence induces an inner product on \ker\left(t\right)^{*}. There's a canonical map T^{*}\rightarrow\ker\left(t\right)^{*} given by restriction: \varphi\mapsto\varphi\restriction_{\ker\left(t\right)}. So given \varphi,\psi\in T^{*}, we can define their inner product to be the spatial inner product of their restrictions to \ker\left(t\right). This can be seen as a linear map T^{*}\rightarrow T (given \varphi:T\rightarrow\mathbb{R}, restrict it to \ker\left(t\right), and then find the element of \ker\left(t\right)\subseteq T that corresponds to it via the spatial inner product) with image \ker\left(t\right) and kernel \text{span}\left(t\right). We have thus found canonical maps T\rightarrow T^{*} and T^{*}\rightarrow T such that the kernel of each is the image of the other.

Why?

In the spacetime example, it is conventional in special relativity to normalize the speed of light to 1. But another thing we can do is let the speed of light be the variable c. So \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-c^{2}t_{1}t_{2}+x_{1}x_{2}+y_{1}y_{2}+z_{1}z_{1}. As a map T\rightarrow T^{*}, this is \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -c^{2}t & x & y & z\end{array}\right]. The inverse map T^{*}\rightarrow T is \left[\begin{array}{cccc} \tau & \alpha & \beta & \gamma\end{array}\right]\mapsto\left[\begin{array}{c} -c^{-2}\tau\\ \alpha\\ \beta\\ \gamma \end{array}\right], or, as an inner product on T^{*}, \left\langle \left[\begin{array}{cccc} \tau_{1} & \alpha_{1} & \beta_{1} & \gamma_{1}\end{array}\right],\left[\begin{array}{cccc} \tau_{2} & \alpha_{2} & \beta_{2} & \gamma_{2}\end{array}\right]\right\rangle =-c^{-2}\tau_{1}\tau_{2}+\alpha_{1}\alpha_{2}+\beta_{1}\beta_{2}+\gamma_{1}\gamma_{2}. We're going to want to take a limit as c\rightarrow\infty and get something finite, so we'll have to scale our inner product on T down by a factor of c^{2}, giving us \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-t_{1}t_{2}+c^{-2}x_{1}x_{2}+c^{-2}y_{1}y_{2}+c^{-2}z_{1}z_{1}, or, as a map T\rightarrow T^{*}, \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -t & c^{-2}x & c^{-2}y & c^{-2}z\end{array}\right]. The limit c\rightarrow\infty gives us our temporal inner product on Newtonian spacetime, \left\langle \left[\begin{array}{c} t_{1}\\ x_{1}\\ y_{1}\\ z_{1} \end{array}\right],\left[\begin{array}{c} t_{2}\\ x_{2}\\ y_{2}\\ z_{2} \end{array}\right]\right\rangle =-t_{1}t_{2}, and our spatial inner product on the dual space \left\langle \left[\begin{array}{cccc} \tau_{1} & \alpha_{1} & \beta_{1} & \gamma_{1}\end{array}\right],\left[\begin{array}{cccc} \tau_{2} & \alpha_{2} & \beta_{2} & \gamma_{2}\end{array}\right]\right\rangle =\alpha_{1}\alpha_{2}+\beta_{1}\beta_{2}+\gamma_{1}\gamma_{2}, giving us our exact 2-cycle of maps between T and T^{*}, \left[\begin{array}{c} t\\ x\\ y\\ z \end{array}\right]\mapsto\left[\begin{array}{cccc} -t & 0 & 0 & 0\end{array}\right] and \left[\begin{array}{cccc} \tau & \alpha & \beta & \gamma\end{array}\right]\mapsto\left[\begin{array}{c} 0\\ \alpha\\ \beta\\ \gamma \end{array}\right]. (I did say that this should only work if we have to divide by 0 once, not if we must do so twice, and this involved c^{2}, but we never used c on its own anywhere, so we can just say c=\sqrt{\infty}, and it's fine).

Let's go back to the first example. Given f:V\rightarrow V of nullity 1, perturb f slightly to make it invertible by adding an infinitesimal \varepsilon times some map g:V\rightarrow V. The only condition we need g to satisfy is g\left(\ker\left(f\right)\right)\nsubseteq\text{im}\left(f\right). That way \det\left(f+\varepsilon g\right), which must be a multiple of \varepsilon, is not a multiple of \varepsilon^{2}. \left(f+\varepsilon g\right)\circ\text{adj}\left(f+\varepsilon g\right)=\text{adj}\left(f+\varepsilon g\right)\circ\left(f+\varepsilon g\right)=\det\left(f+\varepsilon g\right)id_{V}. Clearly f\circ\text{adj}\left(f\right)=\text{adj}\left(f\right)\circ f=\det\left(f\right)id_{V}=0. Given \vec{x}\in\ker\left(f\right), \left(f+\varepsilon g\right)\left(\vec{x}\right)=\varepsilon g\left(\vec{x}\right), so \text{adj}\left(f+\varepsilon g\right)\left(\varepsilon g\left(\vec{x}\right)\right)=\det\left(f+\varepsilon g\right)\vec{x}. Hence \text{adj}\left(f+\varepsilon g\right)\left(\frac{\varepsilon}{\det\left(f+\varepsilon g\right)}g\left(\vec{x}\right)\right)=\vec{x}. Since \det\left(f+\varepsilon g\right) has 0 constant term but nonzero coefficient of \varepsilon, \frac{\varepsilon}{\det\left(f+\varepsilon g\right)} can be evaluated at \varepsilon=0, and has a nonzero, finite value. Then \text{adj}\left(f\right)\left(\frac{\varepsilon}{\det\left(f+\varepsilon g\right)}\vert_{\varepsilon=0}g\left(\vec{x}\right)\right)=\vec{x}. So \left(f,\text{adj}\left(f\right)\right) forms an exact 2-cycle for reasons closely relating to the fact that perturbing each of them infinitesimally can make them inverses up to an infinitesimal scalar multiple.

Now, in the second example, where V is a vector space over a field k of positive characteristic, \frac{d}{2}<\text{char}\left(k\right)\leq d, and we have an exact 2-cycle f:Sym^{d}V^{*}\rightarrow k\left[V\right]_{d}, g:k\left[V\right]_{d}\rightarrow Sym^{d}V^{*}, let {\cal O} be an integral domain of characteristic 0 with a unique maximal ideal \mathfrak{m}, such that {\cal O}/\mathfrak{m}=k and \text{char}\left(k\right)\notin\mathfrak{m}^2 (for instance, if k=\mathbb{F}_{p}, we can use {\cal O}=\mathbb{Z}_{p} and \mathfrak{m}=p\mathbb{Z}_{p}). Lift V to a free {\cal O}-module \tilde{V} with \tilde{V}\otimes k=V (in coordinates, this means, instead of V=k^{n}, work with \tilde{V}={\cal O}^{n}, which carries a natural map to k^{n} by reducing each coordinate mod \mathfrak{m}). Then there are natural maps \tilde{f}:Sym^{d}\tilde{V}^{*}\rightarrow{\cal O}\left[\tilde{V}\right]_{d} and \tilde{g}:{\cal O}\left[\tilde{V}\right]_{d}\rightarrow Sym^{d}\tilde{V}^{*} such that \tilde{g}\circ\tilde{f}=d!id_{Sym^{d}\tilde{V}^{*}} and \tilde{f}\circ\tilde{g}=d!id_{{\cal O}\left[\tilde{V}\right]_{d}}, and \tilde{f} and \tilde{g} reduce mod \mathfrak{m} to f and g, respectively. Where K is the field of fractions of {\cal O} (so K=\mathbb{Q}_p in our example with k=\mathbb{F}_p and \cal{O}=\mathbb{Z}_p), \tilde{f}\otimes K:Sym^{d}\left(\tilde{V}\otimes K\right)^{*}\rightarrow K\left[\tilde{V}\otimes K\right]_{d} and \tilde{g}\otimes K:K\left[\tilde{V}\otimes K\right]_{d}\rightarrow Sym^{d}\left(\tilde{V}\otimes K\right)^{*} are bijections (in coordinates, \tilde{V}\otimes K=K^{n}, and tensoring a map with K just means the same map extended over the field of fractions), as they are inverses of each other up to a multiple of d! , which is invertible in K. Since \text{char}\left(k\right)\leq d, d!\in\mathfrak{m}, and g\circ f=0 and f\circ g=0. Given \phi\in\ker\left(g\right), if we lift \phi to \tilde{\phi}\in{\cal O}\left[\tilde{V}\right]_{d}, \tilde{g}\left(\tilde{\phi}\right)\in\mathfrak{m}Sym^{d}\tilde{V}^{*}. Since \text{char}\left(k\right)>\frac{d}{2}, d!\notin\mathfrak{m}^{2}, and hence \frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\in Sym^{d}\tilde{V}^{*}, and of course, \tilde{f}\left(\frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\right)=\tilde{\phi}. Reducing mod \mathfrak{m}, we get f\left(\frac{\tilde{g}\left(\tilde{\phi}\right)}{d!}\mod\mathfrak{m}\right)=\phi. Thus \ker\left(g\right)=\text{im}\left(f\right). Similarly, given \Phi\in\ker\left(f\right), lift \Phi to \tilde{\Phi}\in Sym^{d}\tilde{V}^{*}. \tilde{f}\left(\tilde{\Phi}\right)\in\mathfrak{m}{\cal O}\left[\tilde{V}\right]_{d}. \frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\in{\cal O}\left[\tilde{V}\right]_{d}, and \tilde{g}\left(\frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\right)=\tilde{\Phi}. Reducing mod \mathfrak{m}, we get g\left(\frac{\tilde{f}\left(\tilde{\Phi}\right)}{d!}\mod\mathfrak{m}\right)=\Phi. Thus \ker\left(f\right)=\text{im}\left(g\right). So \left(f,g\right) forms an exact cycle because, in K, they are inverses up to a factor of d! , which we can divide by, and which is 0 with multiplicity 1 in k, since d!\in\mathfrak{m}\setminus\mathfrak{m}^{2}.

The general story

All three arguments from the previous section took the following form: Let {\cal O} be a discrete valuation ring with residue field k, field of fractions K, and valuation \nu:K^{\times}\rightarrow\mathbb{Z}. Let V and W be free {\cal O}-modules, and let f:V\rightarrow W and g:W\rightarrow V be such that f\circ g=\varepsilon id_{W} and g\circ f=\varepsilon id_{V}, for some \varepsilon\in{\cal O} with \nu\left(\varepsilon\right)=1. Then f\otimes K:V\otimes K\rightarrow W\otimes K and g\otimes K:W\otimes K\rightarrow V\otimes K are isomorphisms, and each is \varepsilon times the inverse of the other. f\otimes k:V\otimes k\rightarrow W\otimes k and g\otimes k:W\otimes k\rightarrow V\otimes k form an exact 2-cycle: they compose to 0 because f and g compose to \varepsilon id, which goes to 0 in k, and given \vec{x}\in V\otimes k such that \left(f\otimes k\right)\left(\vec{x}\right)=0, we can lift \vec{x} to \tilde{x}\in V. f\left(\tilde{x}\right)\in\varepsilon W, so \varepsilon^{-1}f\left(\tilde{x}\right)\in W, and g\left(\varepsilon^{-1}f\left(\tilde{x}\right)\right)=\tilde{x}, so tensoring with k sends \varepsilon^{-1}f\left(\tilde{x}\right) to some \vec{y}\in W such that \left(g\otimes k\right)\left(\vec{y}\right)=\vec{x}. Thus \ker\left(f\right)=\text{im}\left(g\right). The same argument with f and g switched shows \ker\left(g\right)=\text{im}\left(f\right). The exact 2-cycle \left(f\otimes k,g\otimes k\right) is a sort of shadow of the isomorphisms f\otimes K,g\otimes K.

In the spacetime example, K=\mathbb{R}\left(\left(c^{-2}\right)\right), {\cal O}=\mathbb{R}\left[\left[c^{-2}\right]\right], k=\mathbb{R}, and \varepsilon=c^{-2}. In the adjugates example, K=k\left(\left(\varepsilon\right)\right), {\cal O}=k\left[\left[\varepsilon\right]\right], and the \varepsilon in the general story is \det\left(f+\varepsilon g\right). In the homogeneous polynomials and symmetric multilinear forms example, K is a discretely valued field of characteristic 0 with residue field k, {\cal O} is its valuation ring, and \varepsilon=d! .

All exact 2-cycles of vector spaces can be fit into this general story. Given any exact 2-cycle f:V\rightarrow W, g:W\rightarrow V (V, W vector spaces over k), we can take a discretely valued field K with residue field k, and then lift f,g to \tilde{f},\tilde{g} with \tilde{g}\circ\tilde{f}=\varepsilon id for some \varepsilon\in K with \nu\left(\varepsilon\right)=1, exactly the conditions in the above argument.

What more?

What about exact 2-cycles in abelian categories other than vector spaces? In general, the two objects in an exact 2-cycle need not be isomorphic. For instance, with abelian groups, there's an exact 2-cycle between the 4-element cyclic group and the Klein four-group. Though two objects in an exact 2-cycle must be isomorphic in any category in which every short exact sequence splits (this is the gist of the dimension-counting argument from the beginning showing that two vector spaces in an exact 2-cycle must be isomorphic). Is there still some way of seeing exact 2-cycles as degenerate isomorphisms even in contexts in which there need not be actual isomorphisms?

Also, what about exact n-cycles? That is, a cycle of n functions such that the image of each is the kernel of the next. If an exact 2-cycle is a degenerate form of an isomorphism, and an isomorphism is an exact sequence of length 2, then perhaps an exact 3-cycle should be a degenerate form of an exact sequence of length 3 (i.e. a short exact sequence). This is hard to picture, as a short exact sequence is not symmetric between its objects. However, for reasons not understood by me, algebraic topologists care about exact 3-cycles in which two of the three objects involved are the same (these are called exact couples), and this apparently has something to do with short exact sequences in which the first two objects are isomorphic, which provides some support for the idea that exact 3-cycles should have something to do with short exact sequences. An exact sequence of length 1 just consists of the 0 object, so this suggests an exact 1-cycle (i.e. an endomorphism of an object whose kernel and image are the same) should be considered a degenerate form of the 0 object, which is also hard to picture.

Metamathematics and probability

Content warning: mathematical logic.

Note: This write-up consists mainly of open questions rather than results, but may contain errors anyway.

Setup

I'd like to describe a logic for talking about probabilities of logical sentences. Fix some first-order language {\cal L}. This logic deals with pairs \left(\varphi,p\right), which I'm calling assertions, where \varphi\in{\cal L} is a formula and p\in\left[0,1\right]. Such a pair is to be interpreted as a claim that \varphi has probability at least p.

A theory consists of a set of assertions. A model of a theory T consists of a probability space \left(X,P\right) whose points are {\cal L}-structures, such that for every assertion \left(\varphi,p\right)\in T, P_{*}\left(\left\{ {\cal M}\in X\mid{\cal M}\models\varphi\right\} \right)\geq p, where P_{*} is inner probability. I'll write T\vdash_{p}\varphi for \left(\varphi,p\right) can be proved from T, and T\models_{p}\varphi for all models of T are also models of \left\{ \left(\varphi,p\right)\right\} .

The rules of inference are all rules \Gamma\vdash_{p}\varphi where \Gamma is a finite set of assertions, and \left(\varphi,p\right) is an assertion such that P_{*}\left(\left\{ {\cal M}\in X\mid{\cal M}\models\varphi\right\} \right)\geq p in all models of \Gamma. Can we make an explicit finite list of inference rules that generate this logic? If not, is the set of inference rules at least recursively enumerable? (For recursive enumerability to make sense here, we need to restrict attention to probabilities in some countable dense subset of \left[0,1\right] that has a natural explicit bijection with \mathbb{N}, such as \mathbb{Q}\cap\left[0,1\right].) I'm going to assume later that the set of inference rules is recursively enumerable; if it isn't, everything should still work if we use some recursively enumerable subset of the inference rules that includes all of the ones that I use.

Note that the compactness theorem fails for this logic; for example, \left\{ \left(\varphi,p\right)\mid p<1\right\} \models_{1}\varphi, but no finite subset of \left\{ \left(\varphi,p\right)\mid p<1\right\} implies \left(\varphi,1\right), and hence \left\{ \left(\varphi,p\right)\mid p<1\right\} \nvdash_{1}\varphi.

Any classical first-order theory T can be converted into a theory in this logic as \left\{ \left(\varphi,1\right)\mid T\vdash\varphi\right\} .

Löb's Theorem

Let T be a consistent, recursively axiomatizable extension of Peano Arithmetic. By the usual sort of construction, there is a \Sigma_{1}^{0} binary predicate \square_{y}\left(x\right) such that T\vdash_{p}\varphi\iff\mathbb{N}\models\square_{p}\left(\ulcorner\varphi\urcorner\right) for any sentence \varphi and p\in\left[0,1\right]\cap\mathbb{Q}, where \ulcorner\urcorner is a coding of sentences with natural numbers. We have a probabilistic analog of Löb's theorem: if T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi, then T\vdash_{p}\varphi. Peano arithmetic can prove this theorem, in the sense that PA\vdash_{1}\square_{p}\left(\ulcorner\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi\urcorner\right)\rightarrow\square_{p}\left(\ulcorner\varphi\urcorner\right).

Proof: Assume T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi. By the diagonal lemma, there is a sentence \psi such that T\vdash_{1}\psi\leftrightarrow\left(\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi\right). If \square_{p}\left(\ulcorner\psi\urcorner\right), then \square_{1}\left(\ulcorner\square_{p}\left(\ulcorner\psi\urcorner\right)\urcorner\right) and \square_{p}\left(\ulcorner\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi\urcorner\right), so \square_{p}\left(\ulcorner\varphi\urcorner\right). This shows that T\cup\left\{ \left(\square_{p}\left(\ulcorner\psi\urcorner\right),1\right)\right\} \vdash_{1}\square_{p}\left(\ulcorner\varphi\urcorner\right). By the assumption that T\vdash_{p}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi, this implies that T\cup\left\{ \left(\square_{p}\left(\ulcorner\psi\urcorner\right),1\right)\right\} \vdash_{p}\varphi. By a probabilistic version of the deduction theorem, T\vdash_{p}\square_{p}\left(\ulcorner\psi\urcorner\right)\rightarrow\varphi. That is, T\vdash_{p}\psi. Going back around through all that again, we get T\vdash_{p}\varphi.

If we change the assumption to be that T\vdash_{q}\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi for some q<p, then the above proof does not go through (if q>p, then it does, because \left(\theta,q\right)\vdash_{p}\theta). Is there a consistent theory extending Peano Arithmetic that proves a soundness schema about itself, \left\{ \left(\square_{p}\left(\ulcorner\varphi\urcorner\right)\rightarrow\varphi,q\right)\mid q<p\right\} , or can this be used to derive a contradiction some other way? If there is no such consistent theory, then can the soundness schema be modified so that it is consistent, while still being nontrivial? If there is such a consistent theory with a soundness schema, can the theory also be sound? That is actually several questions, because there are multiple things I could mean by "sound". The possible syntactic things "sound" could mean, in decreasing order of strictness, are: 1) The theory does not assert a positive probability to any sentence that is false in \mathbb{N}. 2) There is an upper bound below 1 for all probabilities asserted of sentences that are false in \mathbb{N}. 3) The theory does not assert probability 1 to any sentence that is false in \mathbb{N}.

There are also semantic versions of the above questions, which are at least as strict as their syntactic analogs, but probably aren't equivalent to them, since the compactness theorem does not hold. The semantic version of asking if the soundness schema is consistent is asking if it has a model. The first two soundness notions also have semantic analogs. 1') \left\{ \mathbb{N}\right\} is a model of the theory. 2') There is a model of the theory that assigns positive probability to \mathbb{N}. I don't have a semantic version of 3, but metaphorically speaking, a semantic version of 3 should mean that there is a model that assigns nonzero probability density at \mathbb{N}, even though it might not have a point mass at \mathbb{N}.

Motivation

This is somewhat similar to Definability of Truth in Probabilistic Logic. But in place of adding a probability predicate to the language, I'm only changing the metalanguage to refer to probabilities, and using this to express statements about probability in the language through conventional metamathematics. An advantage of this approach is that it's constructive. Theories with the properties described by the Christiano et al paper are unsound, so if some reasonably strong notion of soundness applies to an extension of Peano Arithmetic with the soundness schema I described, that would be another advantage of my approach.

A type of situation that this might be useful for is that when an agent is reasoning about what actions it will take in the future, it should be able to trust its future self's reasoning. An agent with the soundness schema can assume that its future self's beliefs are accurate, up to arbitrarily small loss in precision. A related type of situation is if an agent reaches some conclusion, and then writes it to external storage instead of its own memory, and later reads the claim it had written to external storage. With the soundness schema, if the agent has reason to believe that the external storage hasn't been tampered with, it can reason that since its past self had derived the claim, the claim is to be trusted arbitrarily close to as much as it would have been if the agent had remembered it internally.

First Incompleteness Theorem

For a consistent theory T, say that a sentence \varphi is T-measurable if there is some p\in\left[0,1\right] such that T\vdash_{q}\varphi for every q<p and T\vdash_{q}\neg\varphi for every q<1-p. So T-measurability essentially means that T pins down the probability of the sentence. If \varphi is not T-measurable, then you could say that T has Knightian uncertainty about \varphi. Say that T is complete if every sentence is T-measurable. Essentially, complete theories assign a probability to every sentence, while incomplete theories have Knightian uncertainty.

The first incompleteness theorem (that no recursively axiomatizable extension of PA is consistent and complete) holds in this setting. In fact, for every consistent recursively axiomatizable extension of PA, there must be sentences that are given neither a nontrivial upper bound nor a nontrivial lower bound on their probability. Otherwise, we would be able to recursively separate the theorems of PA from the negations of theorems of PA, by picking some recursive enumeration of assertions of the theory, and sorting sentences by whether they are first given a nontrivial lower bound or first given a nontrivial upper bound; theorems of PA will only be given a nontrivial lower bound, and their negations will only be given a nontrivial upper bound. [Thanks to Sam Eisenstat for pointing this out; I had somehow managed not to notice this on my own.]

For an explicit example of a sentence for which no nontrivial bounds on its probability can be established, use the diagonal lemma to construct a sentence \varphi which is provably equivalent to "for every proof of \left(\varphi,p\right) for any p>0, there is a proof of \left(\neg\varphi,q\right) for some q>0 with smaller Gödel number."

Thus a considerable amount of Knightian uncertainty is inevitable in this framework. Dogmatic Bayesians such as myself might find this unsatisfying, but I suspect that any attempt to unify probability and first-order arithmetic will suffer similar problems.

A side note on model theory and compactness

I'm a bit unnerved about the compactness theorem failing. It occurred to me that it might be possible to fix this by letting models use hyperreal probabilities. Problem is, the hyperreals aren't complete, so the countable additivity axiom for probability measures doesn't mean anything, and it's unclear what a hyperreal-valued probability measure is. One possible solution is to drop countable additivity, and allow finitely-additive hyperreal-valued probability measures, but I'm worried that the logic might not even be sound for such models.

A different possible solution to this is to take a countably complete ultrafilter U on a set \kappa, and use probabilities valued in the ultrapower \mathbb{R}^{\kappa}/U. Despite \mathbb{R}^{\kappa}/U not being Cauchy complete, it inherits a notion of convergence of sequences from \mathbb{R}, since a sequence \left\{ \left[x_{i,j}\mid i\in\kappa\right]\mid j\in\omega\right\} can be said to converge to \left[\lim_{j\rightarrow\infty}x_{i,j}\mid i\in\kappa\right], and this is well-defined (if \lim_{j\rightarrow\infty}x_{i,j} is for a U-large set of indices i) by countable completeness. Thus the countable additivity axiom makes sense for \mathbb{R}^{\kappa}/U-valued probability measures. Allowing models to use \mathbb{R}^{\kappa}/U-valued probability measures might make the compactness theorem work. [Edit: This doesn't work, because \mathbb{R}^{\kappa}/U\cong\mathbb{R}. To see this, it is enough to show that \mathbb{R}^{\kappa}/U is Archimedean, since \mathbb{R} has no proper Archimedean extensions. Given \left[x_i\mid i\in\kappa\right]\in\mathbb{R}^{\kappa}/U, let A_n:=\{i\in\kappa\mid| x_i|<n\} for n\in\mathbb{N}. \bigcup_{n\in\mathbb{N}}A_n = \kappa, so by countable completeness of U, there is some n\in\mathbb{N} such that A_n\in U, and thus \left[x_i\mid i\in\kappa\right]<n.]

Complexity classes of natural numbers (googology for ultrafinitists)

Ultrafinitists think common ways of defining extremely large numbers don't actually refer to numbers that exist. For example, most ultrafinitists would maintain that a googolplex isn't a number. But to a classical mathematician, while numbers like a googolplex are far larger than the numbers we deal with on a day-to-day basis like 10, both numbers have the same ontological status. In this post, I want to consider a compromise position, that any number we can define can be meaningfully reasoned about, but that a special status is afforded to the sorts of numbers that ultrafinitists can accept.

Specifically, define an “ultrafinite number” to be a natural number that it is physically possible to express in unary. This isn't very precise, since there are all sorts of things that “physically possible to express in unary” could mean, but let's just not worry about that too much. Also, many ultrafinitists would not insist that numbers must be expressible in such an austere language as unary, but I'm about to get to that.

Examples: 20 is an ultrafinite number, because 20 = SSSSSSSSSSSSSSSSSSSS0, where S is the successor function. 80,000 is also an ultrafinite number, but it is a large one, and it isn't worth demonstrating its ultrafiniteness. A googol is not ultrafinite. The observable universe isn't even big enough to contain a googol written in unary.

Now, define a “polynomially finite number” to be a natural number that it is physically possible to express using addition and multiplication. Binary and decimal are basically just concise ways of expressing certain sequences of addition and multiplication operations. For instance, “18,526” means (((1*10 + 8)*10 + 5)*10 + 2)*10 + 6. Conversely, if you multiply an n-digit number with an m-digit number, you get an at most n+m+1-digit number, which is the same number of symbols it took write down “[the n-digit number] times [the m-digit number]” in the first place, so any number that can be written using addition and multiplication can be written in decimal. Thus, another way to define polynomially finite numbers is as the numbers that it is physically possible to express in binary or in decimal. I've been ignoring some small constant factors that might make these definitions not quite equivalent, but any plausible candidate for a counterexample would be an ambiguous edge case according to each definition anyway, so I'm not worried about that. Many ultrafinitists may see something more like polynomially finite number, rather than ultrafinite number, as a good description of what numbers exist.

Examples: A googol is polynomially finite, because a googol is 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000. A googolplex is not polynomially finite, because it would require a googol digits to express in decimal, which is physically impossible.

Define an “elementarily finite number” to be a number that it is physically possible to express using addition, multiplication, subtraction, exponentiation, and the integer division function \lfloor x/y\rfloor. Elementarily finite is much broader than polynomially finite, so it might make sense to look at intermediate classes. Say a number is “exponentially finite” if it is physically possible to express using the above operations but without any nested exponentiation (e.g. a^b c^d is okay, but a^{(b^c)} is not). More generally, we can say that a number is “k-exponentially finite” if it can be expressed with exponentiation nested to depth at most k, so a polynomially finite number is a 0-exponentially finite number, an exponentially finite number is a 1-exponentially finite number, and an elementarily finite number is a number that is k-exponentially finite for some k (or equivalently, for some ultrafinite k).

Examples: a googolplex is exponentially finite, because it is 10^{10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}. Thus a googolduplex, meaning 10^\text{googolplex}, is 2-exponentially finite, but it is not exponentially finite. For examples of non-elementarily finite numbers, and numbers that are only k-exponentially finite for fairly large k, I'll use up-arrow notation. a\uparrow b just means a^b, a\uparrow^{n+1} b means a\uparrow^n a\uparrow^n a\uparrow^n ... a, where b is the number of copies of a, and using order of operations that starts on the right. So 3\uparrow\uparrow3 = 3^{(3^3)} = 3^{27} = 7,625,597,484,987, which is certainly polynomially finite, and could also be ultrafinite depending on what is meant by “physically possible” (a human cannot possibly count that high, but a computer with a large enough hard drive can store 3\uparrow\uparrow3 in unary). 3\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow(3\uparrow\uparrow3) = 3\text{^}3\text{^}3\text{^}...\text{^}3, where there are 3^{27} threes in that tower. Under the assumptions that imply 3^{27} is ultrafinite, 3\uparrow\uparrow\uparrow3 is elementarily finite. Specifically, it is 3^{27}-exponentially finite, but I'm pretty sure it's not 3^{26}-exponentially finite, or even 7,625,597,484,000-exponentially finite. 3\uparrow\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow\uparrow(3\uparrow\uparrow\uparrow3), and is certainly not elementarily finite.

Interestingly, even though a googolplex is exponentially finite, there are numbers less than a googolplex that are not. There's an easy nonconstructive proof of this: in order to be able to represent every number less than a googolplex in any encoding scheme at all, there has to be some number less than a googolplex that requires at least a googol decimal digits of information to express. But it is physically impossible to store a googol decimal digits of information. Therefore for any encoding scheme for numbers, there is some number less than a googolplex that cannot physically be expressed in it. This is why the definition of elementarily finite is significantly more complicated than the definition of polynomially finite; in the polynomial case, if n can be expressed using addition and multiplication and m<n, then m can also be expressed using addition and multiplication, so there's no need for additional operations to construct smaller numbers, but in the elementary case, the operations of subtraction and integer division are useful for expressing more numbers, and are simpler than exponentiation. For example, these let us express the number that you get from reading off the last googol digits, or the first googol digits, of 3\uparrow\uparrow100, so these numbers are elementarily finite. However, it is exceptionally unlikely that the number you get from reading off the first googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite. But for a difficult exercise, show that the number you get from reading off the last googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite.

Why stop there instead of including more operations for getting smaller numbers, like \lfloor\log\rfloor, which I implicitly used when I told you that the number formed by the first googol digits of 3\uparrow\uparrow100 is elementarily finite? We don't have to. The functions that you can get by composition from addition, multiplication, exponentiation, \max(x-y,0), and \lfloor x/y\rfloor coincide with the functions that can be computed in iterated exponential time (meaning O(2^{2^{...^{2^n}}}) time, for some height of that tower). So if you have any remotely close to efficient way to compute an operation, it can be expressed in terms of the operations I already specified.

We can go farther. Consider a programming language that has the basic arithmetic operations, if/else clauses, and loops, where the number of iterations of each loop must be fixed in advance. The programs that can be written in such a language are the primitive recursive functions. Say that a number is primitive recursively finite if it is physically possible to write a program (that does not take any input) in this language that outputs it. For each fixed n, the binary function (x,y)\mapsto x\uparrow^n y is primitive recursive, so 3\uparrow\uparrow\uparrow\uparrow3 is primitive recursively finite. But the ternary function (x,y,z)\mapsto x\uparrow^y z is not primitive recursive, so 3\uparrow^\text{googol}3 is not primitive recursively finite.

The primitive recursively finite numbers can be put in a hierarchy of subclasses based on the depth of nested loops that are needed to express them. If the only arithmetic operation available is the successor function (from which other operations can be defined using loops), then the elementarily finite numbers are those that can be expressed with loops nested to depth at most 2. The k-exponentially finite numbers should roughly correspond to the numbers that can be expressed with at most k loops at depth 2.

Next comes the provably computably finite numbers. Say that a number is provably computably finite if it is physically possible to write a program in a Turing-complete language that outputs the number (taking no input), together with a proof in Peano Arithmetic that the program halts. The famous Graham's number is provably computably finite. Graham's number is defined in terms of a function g, defined recursively as g(0):=4 and g(n+1):=3\uparrow^{g(n)}3. Graham's number is g(64). You could write a computer program to compute g, and prove that g is total using Peano arithmetic. By replacing Peano arithmetic with other formal systems, you can get other variations on the notion of provably computably finite.

For an example of a number that is not provably computably finite, I'll use the hydra game, which is described here. There is no proof in Peano arithmetic (that can physically be written down) that it is possible to win the hydra game starting from the complete binary tree of depth a googol. So the number of turns it takes to win the hydra game on the complete binary tree of depth a googol is not provably computably finite. If you start with a reasonably small hydra (say, with 100 nodes), you could write a program to search for the shortest winning strategy, and prove in Peano arithmetic that it succeeds in finding one, if you're sufficiently clever and determined, and you use a computer to help you search for proofs. The proof you'd get out of this endeavor would be profoundly unenlightening, but the point is, the number of turns it takes to win the hydra game for a small hydra is provably computably finite (but not primitive recursively finite, except in certain trivial special cases).

Next we'll drop the provability requirement, and say that a number is computably finite if it is physically possible to write a computer program that computes it from no input. Of course, in order to describe a computably finite number, you need the program you use to actually halt, so you'd need some argument that it does halt in order to establish that you're describing a computably finite number. Thus this is arguably just a variation on provably computably finite, where Peano arithmetic is replaced by some unspecified strong theory encompassing the sort of reasoning that classical mathematicians tend to endorse. This is probably the point where even the most patient of ultrafinitists would roll their eyes in disgust, but oh well. Anyway, the number of steps that it takes to win the hydra game starting from the complete binary tree of depth a googol is a computably finite number, because there exists a shortest winning strategy, and you can write a computer program to exhaustively search for it.

The busy-beaver function BB is defined so that BB(n) is the longest any Turing machine with n states runs before halting (among those that do halt). BB(10^{100}) is not computably finite, because Turing machines with a googol states cannot be explicitly described, and since the busy-beaver function is very fast-growing, no smaller Turing machine has comparable behavior. What about BB(10,000)? Turing machines with 10,000 states are not too big to describe explicitly, so it may be tempting to say that BB(10,000) is computably finite. But on the other hand, it is not possible to search through all Turing machines with 10,000 states and find the one that runs the longest before halting. No matter how hard you search and no matter how clever your heuristics for finding Turing machines that run for exceptionally long and then halt, it is vanishingly unlikely that you will find the 10,000-state Turing machine that runs longest before halting, let alone realize that you have found it. And the idea is to use classical reasoning for large numbers themselves, but constructive reasoning for descriptions of large numbers. So since it is pretty much impossible to actually write a program that outputs BB(10,000), it is not computably finite.

For a class that can handle busy-beaver numbers too, let's turn to the arithmetically finite numbers. These are the numbers that are defined by arithmetical formulas. These form a natural hierarchy, where the \Sigma^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n unbounded quantifiers starting with \exists, the \Pi^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n unbounded quantifiers starting with \forall, and the \Delta^0_n-finite numbers are those that are both \Sigma^0_n-finite and \Pi^0_n-finite. The \Delta^0_1-finite numbers are the same as the computably finite numbers. BB(10^{100}) is \Pi^0_2-finite, because it is defined by “\forall n every Turing machine with 10^{100} states that halts in at most n steps halts in at most BB(10^{100}) steps, and there is a Turing machine with 10^{100} states that halts in exactly BB(10^{100}) steps.” Everything after the first quantifier in that formula is computable. BB(BB(10^{100})) is \Delta^0_2-finite, but no lower than that. To get a number that is not arithmetically finite, consider the function f given by f(n) is the largest number defined by an arithmetical formula with n symbols. f(10,000) is \Delta^0_{5,000}-finite, but f(10^{100}) is not arithmetically finite. I'll stop there.

Principal Component Analysis in Theory and Practice

Prerequisites for this post are linear algebra, including tensors, and basic probability theory. Already knowing how PCA works will also be helpful. In section 1, I'll summarize the technique of principal component analysis (PCA), stubbornly doing so in a coordinate-free manner, partly because I am an asshole but mostly because it is rhetorically useful for emphasizing my point in section 2. In section 2, I'll gripe about how PCA is often used in ways that shouldn't be expected to work, but works just fine anyway. In section 3, I'll discuss some useless but potentially amusing ways that PCA could be modified. Thanks to Laurens Gunnarsen for inspiring this post by talking to me about the problem that I discuss in section 2.

A brief introduction to Principal Component Analysis

You start with a finite-dimensional real inner product space V and a probability distribution \mu on V. Actually, you probably just started with a large finite number of elements of V, and you've inferred a probability distribution that you're supposing they came from, but that difference is not important here. The goal is to find the n-dimensional (for some n\leq\dim V) affine subspace W_{n}\subseteq V minimizing the expected squared distance between a vector (distributed according to \mu) and its orthogonal projection onto W_{n}. We can assume without loss of generality that the mean of \mu is 0, because we can just shift any probability distribution by its mean and get a probability distribution with mean 0. This is useful because then W_{n} will be a linear subspace of V. In fact, we will solve this problem for all n\leq\dim V simultaneously by finding an ordered orthonormal basis such that W_{n} is the span of the first n basis elements.

First you take \text{Cov}_{\mu}\in V\otimes V, called the covariance of \mu, defined as the bilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\varphi\left(x\right)\psi\left(x\right)d\mu\left(x\right). From this, we get the covariance operator C_{\mu}\in V^{*}\otimes V by raising the first index, which means starting with \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in V^{*}\otimes V^{*}\otimes V\otimes V and performing a tensor contraction (in other words, C_{\mu} is obtained from \text{Cov}_{\mu} by applying the map V\rightarrow V^{*} given by the inner product to the first index). \text{Cov}_{\mu} is symmetric and positive semi-definite, so C_{\mu} is self-adjoint and positive semi-definite, and hence V has an orthonormal basis of eigenvectors of C_{\mu}, with non-negative real eigenvalues. This gives an orthonormal basis in which \text{Cov}_{\mu} is diagonal, where the diagonal entries are the eigenvalues. Ordering the eigenvectors in decreasing order of the corresponding eigenvalues gives us the desired ordered orthonormal basis.

The problem

There's no problem with principal component analysis as I described it above. It works just fine, and in fact is quite beautiful. But often people apply principal component analysis to probability distributions on finite-dimensional real vector spaces that don't have a natural inner product structure. There are two closely related problems with this: First, the goal is underdefined. We want to find a projection onto an n-dimensional subspace that minimizes the expected squared distance from a vector to its projection, but we don't have a measure of distance. Second, the procedure is underdefined. \text{Cov}_{\mu} is a bilinear form, not a linear operator, so it doesn't have eigenvectors or eigenvalues, and we don't have a way of raising an index to produce something that does. It should come as no surprise that these two problems arise together. After all, you shouldn't be able to find a fully specified solution to an underspecified problem.

People will apply principal component analysis in such cases by picking an inner product. This solves the second problem, since it allows you to carry out the procedure. But it does not solve the first problem. If you wanted to find a projection onto an n-dimensional subspace such that the distance from a vector to its projection tends to be small, then you must have already had some notion of distance in mind by which to judge success. Haphazardly picking an inner product gives you a new notion of distance, and then allows you to find an optimal solution with respect to your new notion of distance, and it is not clear to me why you should expect this solution to be reasonable with respect to the notion of distance that you actually care about.

In fact, it's worse than that. Of course, principal component analysis can't given you literally any ordered basis at all, but it is almost as bad. The thing that you use PCA for is the projection onto the span of the first n basis elements along the span of the rest. These projections only depend on the sequence of 1-dimensional subspaces spanned by the basis elements, and not the basis elements themselves. That is, we might as well only pay attention to the principal components up to scale, rather than making sure that are all unit length. Let a "coordinate system" refer to an ordered basis up to two ordered bases being equivalent if they differ only by scaling the basis vectors, so that we're paying attention to the coordinate systems given to us by PCA. If the covariance of \mu is nondegenerate, then the set of coordinate systems that can be obtained from principal component analysis by a suitable choice of inner product is dense in the space of coordinate systems. More generally, where U is the smallest subspace of V such that \mu\left(U\right)=1, then the space of coordinate systems that you can get from principal component analysis is dense in the space of all coordinate systems whose first \dim U coordinates span U (\dim U will be the rank of the covariance of \mu). So in a sense, for suitably poor choices of inner product, principal component analysis can give you arbitrarily terrible results, subject only to the weak constraint that it will always notice if all of the vectors in your sample belong to a common subspace.

It is thus somewhat mysterious that machine learning people seem to be able to often get good results from principal component analysis apparently without being very careful about the inner product they choose. Vector spaces that arise in machine learning seem to almost always come with a set of preferred coordinate axes, so these axes are taken to be orthogonal, leaving only the question of how to scale them relative to each other. If these axes are all labeled with the same units, then this also gives you a way of scaling them relative to each other, and hence an inner product. If they are aren't, then I'm under the impression that the most popular method is to normalize them such that the pushforward of \mu along each coordinate axis has the same variance. This is unsatisfying, since figuring out which axes \mu has enough variance along to be worth paying attention to seems like the sort of thing that you would want principal component analysis to be able to tell you. Normalizing the axes in this way seems to me like an admission that you don't know exactly what question you're hoping to use principal component analysis to answer, so you just tell it not to answer that part of the question to minimize the risk of asking it to answer the wrong question, and let it focus on telling you how the axes, which you're pretty sure should be considered orthogonal, correlate with each other.

That conservatism is actually pretty understandable, because figuring out how to ask the right question seems hard. You implicitly have some metric d on V such that you want to find a projection \pi onto an n-dimensional subspace such that d\left(x,\pi\left(x\right)\right) is usually small when x is distributed according to \mu. This metric is probably very difficult to describe explicitly, and might not be the metric induced by any inner product (for that matter, it might not even be a metric; d\left(x,y\right) could be any way of quantifying how bad it is to be told the value y when the correct value you wanted to know is x). Even if you somehow manage to explicitly describe your metric, coming up with a version of PCA with the inner product replaced with an arbitrary metric also sounds hard, so the next thing you would want to do is fit an inner product to the metric.
The usual approach is essentially to skip the step of attempting to explicitly describe the metric, and just find an inner product that roughly approximates your implicit metric based on some rough heuristics about what the implicit metric probably looks like. The fact that these heuristics usually work so well seems to indicate that the implicit metric tends to be fairly tame with respect to ways of describing the data that we find most natural. Perhaps this shouldn't be too surprising, but I still feel like this explanation does not make it obvious a priori that this should work so well in practice. It might be interesting to look into why these heuristics work as well as they do with more precision, and how to go about fitting a better inner product to implicit metrics. Perhaps this has been done, and I just haven't found it.

To take a concrete example, consider eigenfaces, the principal components that you get from a set of images of people's faces. Here, you start with the coordinates in which each coordinate axis represents a pixel in the image, and the value of that coordinate is the brightness of the corresponding pixel. By declaring that the coordinate axes are orthogonal, and measuring the brightness of each pixel on the same scale, we get our inner product, which is arguably a fairly natural one.

Presumably, the implicit metric we're using here is visual distance, by which I mean a measure of how similar two images look. It seems clear to me that visual distance is not very well approximated by our inner product, and in fact, there is no norm such that the visual distance between two images is approximately the norm of their difference. To see this, if you take an image and make it brighter, you haven't changed how it looks very much, so the visual distance between the image and its brighter version is small. But their difference is just a dimmer version of the same image, and if you add that difference to a completely different image, you will get the two images superimposed on top of each other, a fairly radical change. Thus the visual distance traversed by adding a vector depends on where you start from.

Despite this, producing eigenfaces by using PCA on images of faces, using the inner product described above, performs well with respect to visual distance, in the sense that you can project the images onto a relatively small number of principal components and leave them still recognizable. I think this can be explained on an intuitive level. In a human eye, each photoreceptor has a narrow receptive field that it detects light in, much like a pixel, so the representation of an image in the eye as patterns of photoreceptor activity is very similar to the representation of an image in a computer as a vector of pixel brightnesses, and the inner product metric is a reasonable measure of distance in this representation. When the visual cortex processes this information from the eye, it is difficult (and perhaps also not useful) for it to make vast distinctions between images that are close to each other according to the inner product metric, and thus result in similar patterns of photoreceptor activity in the eye. Thus the visual distance between two images cannot be too much greater than their inner product distance, and hence changing an image by a small amount according to the inner product metric can only change it by a small amount according to visual distance, even though the reverse is not true.

Generalizations

The serious part of this post is now over. Let's have some fun. Some of the following ways of modifying principal component analysis could be combined, but I'll consider them one at a time for simplicity.

As hinted at above, you could start with an arbitrary metric on V rather than an inner product, and try to find the rank-n projection (for some n\leq\dim V) that minimizes the expected squared distance from a vector to its projection. This would probably be difficult, messy, and not that much like principal component analysis. If it can be done, it would be useful in practice if we were much better at fitting explicit metrics to our implicit metrics than at fitting inner products to our implicit metrics, but I'm under the impression that this is not currently the case. This also differs from the other proposals in this section in that it is a modification of the problem looking for a solution, rather than a modification of the solution looking for a problem.

V could be a real Hilbert space that is not necessarily finite-dimensional. Here we can run into the problem that C_{\mu} might not even have any eigenvectors. However, if \mu (which hopefully was not inferred from a finite sample) is Gaussian (and possibly also under weaker conditions), then C_{\mu} is a compact operator, so V does have an orthonormal basis of eigenvectors of C_{\mu}, which still have non-negative eigenvalues. There probably aren't any guarantees you can get about the order-type of this orthonormal basis when you order the eigenvectors in decreasing order of their eigenvalues, and there probably isn't a sense in which the orthogonal projection onto the closure of the span of an initial segment of the basis accounts for the most variance of any closed subspace of the same "size" ("size" would have to refer to a refinement of the notion of dimension for this to be the case). However, a weaker statement is probably still true: namely that each orthonormal basis element maximizes the variance that it accounts for conditioned on values along the previous orthonormal basis elements. I guess considering infinite-dimensional vector spaces goes against the spirit of machine learning though.

V could be a finite-dimensional complex inner product space. \text{Cov}_{\mu}\in\overline{V}\otimes V would be the sesquilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\overline{\varphi\left(x\right)}\psi\left(x\right)d\mu\left(x\right). \left\langle \cdot,\cdot\right\rangle \in\overline{V}^{*}\otimes V^{*}, so \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in\overline{V}^{*}\otimes V^{*}\otimes\overline{V}\otimes V, and applying a tensor contraction to the conjugated indices gives us our covariance operator C_{\mu}\in V^{*}\otimes V (in other words, the inner product gives us an isomorphism \overline{V}\rightarrow V^{*}, and applying this to the first index of \text{Cov}_{\mu} gives us C_{\mu}\in V^{*}\otimes V). C_{\mu} is still self-adjoint and positive semi-definite, so V still has an orthonormal basis of eigenvectors with non-negative real eigenvalues, and we can order the basis in decreasing order of the eigenvalues. Analogously to the real case, projecting onto the span of the first n basis vectors along the span of the rest is the complex rank-n projection that minimizes the expected squared distance from a vector to its projection. As far as I know, machine learning tends to deal with real data, but if you have complex data and for some reason you want to project onto a lower-dimensional complex subspace without losing too much information, now you know what to do.

Suppose your sample consists of events, where you've labeled them with both their spatial location and the time at which they occurred. In this case, events are represented as points in Minkowski space, a four-dimensional vector space representing flat spacetime, which is equipped with a nondegenerate symmetric bilinear form called the Minkowski inner product, even though it is not an inner product because it is not positive-definite. Instead, the Minkowski inner product is such that \left\langle x,x\right\rangle is positive if x is a space-like vector, negative if x is time-like, and zero if x is light-like. We can still get C_{\mu}\in V^{*}\otimes V out of \text{Cov}_{\mu}\in V\otimes V and the Minkowski inner product in V^{*}\otimes V^{*} in the same way, and V has a basis of eigenvectors of C_{\mu}, and we can still order the basis in decreasing order of their eigenvalues. The first 3 eigenvectors will be space-like, with non-negative eigenvalues, and the last eigenvector will be time-like, with a non-positive eigenvalue. The eigenvectors are still orthogonal. Thus principal component analysis provides us with a reference frame in which the span of the first 3 eigenvectors is simultaneous, and the span of the last eigenvector is motionless. If \mu is Gaussian, then this will be the reference frame in which the spatial position of an event and the time at which it occurs are mean independent of each other, meaning that conditioning on one of them doesn't change the expected value of the other one. For general \mu, there might not be a reference frame in which the space and time of an event are mean independent, but the reference frame given to you by by principal component analysis is still the unique reference frame with the property that the time coordinate is uncorrelated with any spatial coordinate.

More generally, we could consider V equipped with any symmetric bilinear form \left\langle \cdot,\cdot\right\rangle taking the role of the inner product. Without loss of generality, we can consider only nondegenerate symmetric bilinear forms, because in the general case, where D:=\left\{ x\mid\forall y\,\left\langle x,y\right\rangle =0\right\} , applying principal component analysis with \left\langle \cdot,\cdot\right\rangle is equivalent to projecting the data onto V/D, applying principal component analysis there with the nondegenerate symmetric bilinear form on V/D induced by \left\langle \cdot,\cdot\right\rangle , and then lifting back to V and throwing in a basis for D with eigenvalues 0 at the end, essentially treating D as the space of completely irrelevant distinctions between data points that we intend to immediately forget about. Anyway, nondegenerate symmetric bilinear forms are classified up to isomorphism by their signature \left(n,m\right), which is such that any orthogonal basis contains exactly n+m basis elements, n of which are space-like and m of which are time-like, using the convention that x is space-like if \left\langle x,x\right\rangle >0, time-like if \left\langle x,x\right\rangle <0, and light-like if \left\langle x,x\right\rangle =0, as above. Using principal component analysis on probability distributions over points in spacetime (or rather, points in the tangent space to spacetime at a point, so that it is a vector space) in a universe with n spatial dimensions and m temporal dimensions still gives you a reference frame in which the span of the first n basis vectors is simultaneous and the span of the last m basis vectors is motionless, and this is again the unique reference frame in which each time coordinate is uncorrelated with each spatial coordinate. Incidentally, I've heard that much of physics still works with multiple temporal dimensions. I don't know what that would mean, except that I think it means there's something wrong with my intuitive understanding of time. But that's another story. Anyway, the spaces spanned by the first n and by the last m basis vectors could be used to establish a reference frame, and then the data might be projected onto the first few (at most n) and last few (at most m) coordinates to approximate the positions of the events in space and in time, respectively, in that reference frame.

Ordered algebraic geometry

Edit: Shortly after posting this, I found where the machinery I develop here was discussed in the literature. Real Algebraic Geometry by Bochnak, Coste, and Roy covers at least most of this material. I may eventually edit this to clean it up and adopt more standard notation, but don't hold your breath.

Introduction

In algebraic geometry, an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some finite set of polynomials. Since all ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] are finitely generated, this is equivalent to saying that an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some arbitrary set of polynomials.

In semialgebraic geometry, a closed semialgebraic set is a subset of \mathbb{R}^{n} of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  for some finite set of polynomials F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Unlike in the case of affine algebraic sets, if F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is an arbitrary set of polynomials, \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  is not necessarily a closed semialgebraic set. As a result of this, the collection of closed semialgebraic sets are not the closed sets of a topology on \mathbb{R}^{n}. In the topology on \mathbb{R}^{n} generated by closed semialgebraic sets being closed, the closed sets are the sets of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\}  for arbitrary F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Semialgebraic geometry usually restricts itself to the study of semialgebraic sets, but here I wish to consider all the closed sets of this topology. Notice that closed semialgebraic sets are also closed in the standard topology, so the standard topology is a refinement of this one. Notice also that the open ball B_{r}\left(\bar{p}\right) of radius r centered at \bar{p} is the complement of the closed semialgebraic set \left\{ \bar{x}\in\mathbb{R}^{n}\mid\left|\bar{x}-\bar{p}\right|^{2}-r^{2}\geq0\right\} , and these open balls are a basis for the standard topology, so this topology is a refinement of the standard one. Thus, the topology I have defined is exactly the standard topology on \mathbb{R}^{n}.

In algebra, instead of referring to a set of polynomials, it is often nicer to talk about the ideal generated by that set instead. What is the analog of an ideal in ordered algebra? It's this thing:

Definition: If A is a partially ordered commutative ring, a cone C in A is a subsemiring of A which contains all positive elements, and such that C\cap-C is an ideal of A. By "subsemiring", I mean a subset that contains 0 and 1, and is closed under addition and multiplication (but not necessarily negation). If F\subseteq A, the cone generated by F, denoted \left\langle F\right\rangle , is the smallest cone containing F. Given a cone C, the ideal C\cap-C will be called the interior ideal of C, and denoted C^{\circ}.

\mathbb{R}\left[x_{1},...,x_{n}\right] is partially ordered by f\geq g\iff f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in\mathbb{R}^{n}. If F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a set of polynomials and \bar{x}\in\mathbb{R}^{n}, then f\left(\bar{x}\right)\geq0\,\forall f\in F\iff f\left(\bar{x}\right)\geq0\,\forall f\in\left\langle F\right\rangle . Thus I can consider closed sets to be defined by cones. We now have a Galois connection between cones of \mathbb{R}\left[x_{1},...,x_{n}\right] and subsets of \mathbb{R}^{n}, given by, for a cone C, its positive-set is P_{\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\}  (I'm calling it the "positive-set" even though it is where the polynomials are all non-negative, because "non-negative-set" is kind of a mouthful), and for X\subseteq\mathbb{R}^{n}, its cone is C_{\mathbb{R}}\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\bar{x}\right)\geq0\,\forall\bar{x}\in X\right\} P_{\mathbb{R}}\circ C_{\mathbb{R}} is closure in the standard topology on \mathbb{R}^{n} (the analog in algebraic geometry is closure in the Zariski topology on \mathbb{C}^{n}). A closed set X is semialgebraic if and only if it is the positive-set of a finitely-generated cone.

Quotients by cones, and coordinate rings

An affine algebraic set V is associated with its coordinate ring \mathbb{C}\left[V\right]:=\mathbb{C}\left[x_{1},...,x_{n}\right]/I\left(V\right). We can do something analogous for closed subsets of \mathbb{R}^{n}.

Definition: If A is a partially ordered commutative ring and C\subseteq A is a cone, A/C is the ring A/C^{\circ}, equipped with the partial order given by f+C^{\circ}\geq g+C^{\circ} if and only if f-g\in C, for f,g\in A.

Definition: If X\subseteq\mathbb{R}^{n} is closed, the coordinate ring of X is \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C\left(X\right). This is the ring of functions X\rightarrow\mathbb{R} that are restrictions of polynomials, ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. For arbitrary X\subseteq\mathbb{R}^{n}, the ring of regular functions on X, denoted \mathcal{O}\left(X\right), consists of functions on X that are locally ratios of polynomials, again ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. Assigning its ring of regular functions to each open subset of X endows X with a sheaf of partially ordered commutative rings.

For closed X\subseteq\mathbb{R}^{n}, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right), and this inclusion is generally proper, both because it is possible to divide by polynomials that do not have roots in X, and because X may be disconnected, making it possible to have functions given by different polynomials on different connected components.

Positivstellensätze

What is C_{\mathbb{R}}\circ P_{\mathbb{R}}? The Nullstellensatz says that its analog in algebraic geometry is the radical of an ideal. As such, we could say that the radical of a cone C, denoted \text{Rad}_{\mathbb{R}}\left(C\right), is C_{\mathbb{R}}\left(P_{\mathbb{R}}\left(C\right)\right), and that a cone C is radical if C=\text{Rad}_{\mathbb{R}}\left(C\right). In algebraic geometry, the Nullstellensatz shows that a notion of radical ideal defined without reference to algebraic sets in fact characterizes the ideals which are closed in the corresponding Galois connection. It would be nice to have a description of the radical of a cone that does not refer to the Galois connection. There is a semialgebraic analog of the Nullstellensatz, but it does not quite characterize radical cones.

Positivstellensatz 1: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C.

There are two ways in which this is unsatisfactory: first, it applies only to finitely-generated cones, and second, it tells us exactly which polynomials are strictly positive everywhere on a closed semialgebraic set, whereas we want to know which polynomials are non-negative everywhere on a set.

The second problem is easier to handle: a polynomial p is non-negative everywhere on a set S if and only if there is a decreasing sequence of polynomials \left(p_{i}\mid i\in\mathbb{N}\right) converging to p such that each p_{i} is strictly positive everywhere on S. Thus, to find \text{Rad}_{\mathbb{R}}\left(C\right), it is enough to first find all the polynomials that are strictly positive everywhere on P_{\mathbb{R}}\left(C\right), and then take the closure under lower limits. Thus we have a characterization of radicals of finitely-generated cones.

Positivstellensatz 2: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone, \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\} , where the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X.

This still doesn't even tell us what's going on for cones which are not finitely-generated. However, we can generalize the Positivstellensatz to some other cones.

Positivstellensatz 3: Let C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] be a cone containing a finitely-generated subcone D\subseteq C such that P_{\mathbb{R}}\left(D\right) is compact. If p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C. As before, it follows that \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\}.

proof: For a given p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap P_{\mathbb{R}}\left(C\right)=\left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap\bigcap\left\{ P_{\mathbb{R}}\left(\left\langle f\right\rangle \right)\mid f\in C\right\} , an intersection of closed sets contained in the compact set P_{\mathbb{R}}\left(D\right), which is thus empty if and only if some finite subcollection of them has empty intersection within P_{\mathbb{R}}\left(D\right). Thus if p is strictly positive everywhere on P_{\mathbb{R}}\left(C\right), then there is some finitely generated subcone E\subseteq C such that p is strictly positive everywhere on P_{\mathbb{R}}\left(E\right)\cap P_{\mathbb{R}}\left(D\right)=P_{\mathbb{R}}\left(\left\langle E\cup D\right\rangle \right), and \left\langle E\cup D\right\rangle is finitely-generated, so by Positivstellensatz 1, there is f\in\left\langle E\cup D\right\rangle \subseteq C such that pf-1\in\left\langle E\cup D\right\rangle \subseteq C\square

For cones that are not finitely-generated and do not contain any finitely-generated subcones with compact positive-sets, the Positivstellensatz will usually fail. Thus, it seems likely that if there is a satisfactory general definition of radical for cones in arbitrary partially ordered commutative rings that agrees with this one in \mathbb{R}\left[x_{1},...,x_{n}\right], then there is also an abstract notion of "having a compact positive-set" for such cones, even though they don't even have positive-sets associated with them.

Beyond \mathbb{R}^{n}

An example of cone for which the Positivstellensatz fails is C_{\infty}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x\in\mathbb{R}\,\forall y\geq x\, f\left(y\right)\geq0\right\} , the cone of polynomials that are non-negative on sufficiently large inputs (equivalently, the cone of polynomials that are either 0 or have positive leading coefficient). P_{\mathbb{R}}\left(C\right)=\emptyset, and -1 is strictly positive on \emptyset, but for f\in C_{\infty}-f-1\notin C_{\infty}.

However, it doesn't really look C_{\infty} is trying to point to the empty set; instead, C_{\infty} is trying to describe the set of all infinitely large reals, which only looks like the empty set because there are no infinitely large reals. Similar phenomena can occur even for cones that do contain finitely-generated subcones with compact positive-sets. For example, let C_{\varepsilon}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x>0\,\forall y\in\left[0,x\right]\, f\left(y\right)\geq0\right\} P_{\mathbb{R}}\left(C_{\varepsilon}\right)=\left\{ 0\right\} , but C_{\varepsilon} is trying to point out the set containing 0 and all positive infinitesimals. Since \mathbb{R} has no infinitesimals, this looks like \left\{ 0\right\} .

To formalize this intuition, we can change the Galois connection. We could say that for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]P_{\text{*}\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\left(\text{*}\mathbb{R}\right)^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\} , where \text{*}\mathbb{R} is the field of hyperreals. All you really need to know about \text{*}\mathbb{R} is that it is a big ordered field extension of \mathbb{R}. P_{\text{*}\mathbb{R}}\left(C_{\infty}\right) is the set of hyperreals that are bigger than any real number, and P_{\text{*}\mathbb{R}}\left(C_{\varepsilon}\right) is the set of hyperreals that are non-negative and smaller than any positive real. The cone of a subset X\subseteq\left(\text{*}\mathbb{R}\right)^{n}, denoted C_{\text{*}\mathbb{R}}\left(X\right) will be defined as before, still consisting only of polynomials with real coefficients. This defines a topology on \left(\text{*}\mathbb{R}\right)^{n} by saying that the closed sets are the fixed points of P_{\text{*}\mathbb{R}}\circ C_{\text{*}\mathbb{R}}. This topology is not T_{0} because, for example, there are many hyperreals that are larger than all reals, and they cannot be distinguished by polynomials with real coefficients. There is no use keeping track of the difference between points that are in the same closed sets. If you have a topology that is not T_{0}, you can make it T_{0} by identifying any pair of points that have the same closure. If we do this to \left(\text{*}\mathbb{R}\right)^{n} , we get what I'm calling ordered affine n-space over \mathbb{R}.

Definition: An n-type over \mathbb{R} is a set \Phi of inequalities, consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},..,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0, such that there is some totally ordered field extension \mathcal{R}\supseteq\mathbb{R} and \bar{x}\in\mathcal{R}^{n} such that all inequalities in \Phi are true about \bar{x}. \Phi is called the type of \bar{x}. Ordered affine n-space over \mathbb{R}, denoted \mathbb{OA}_{\mathbb{R}}^{n} is the set of n-types over \mathbb{R}.

Compactness Theorem: Let \Phi be a set of inequalities consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},..,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0. Then \Phi is an n-type if and only if for any finite subset \Delta\subseteq\Phi, there is \bar{x}\in\mathbb{R} such that all inequalities in \Delta are true about \bar{x}.

proof: Follows from the compactness theorem of first-order logic and the fact that ordered field extensions of \mathbb{R} embed into elementary extensions of \mathbb{R}. The theorem is not obvious if you do not know what those mean. \square

An n-type represents an n-tuple of elements of an ordered field extension of \mathbb{R}, up to the equivalence relation that identifies two such tuples that relate to \mathbb{R} by polynomials in the same way. One way that a tuple of elements of an extension of \mathbb{R} can relate to elements of \mathbb{R} is to equal a tuple of elements of \mathbb{R}, so there is a natural inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n} that associates an n-tuple of reals with the set of polynomial inequalities that are true at that n-tuple.

A tuple of polynomials \left(f_{1},...,f_{m}\right)\in\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right)^{m} describes a function f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}, which extends naturally to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{m} by f\left(\Phi\right) is the type of \left(f_{1}\left(\bar{x}\right),...,f_{m}\left(\bar{x}\right)\right), where \bar{x} is an n-tuple of elements of type \Phi in an extension of \mathbb{R}. In particular, a polynomial f\in\mathbb{R}\left[x_{1},...,x_{n}\right] extends to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{1}, and \mathbb{OA}_{\mathbb{R}}^{1} is totally ordered by \Phi\geq\Psi if and only if x\geq y, where x and y are elements of type \Phi and \Psi, respectively, in an extension of \mathbb{R}f\left(\Phi\right)\geq0 if and only if \text{, so we can talk about inequalities satisfied by types in place of talking about inequalities contained in types.

I will now change the Galois connection that we are talking about yet again (last time, I promise). It will now be a Galois connection between the set of cones in \mathbb{R}\left[x_{1},...,x_{n}\right] and the set of subsets of \mathbb{OA}_{\mathbb{R}}^{n}. For a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right], P\left(C\right):=\left\{ \Phi\in\mathbb{OA}_{\mathbb{R}}^{n}\mid f\left(\Phi\right)\geq0\,\forall f\in C\right\} . For a set X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, C\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\,\forall\Phi\in X\right\} . Again, this defines a topology on \mathbb{OA}_{\mathbb{R}}^{n} by saying that fixed points of P\circ C are closed. \mathbb{OA}_{\mathbb{R}}^{n} is T_{0}; in fact, it is the T_{0} topological space obtained from \left(\text{*}\mathbb{R}\right)^{n} by identifying points with the same closure as mentioned earlier. \mathbb{OA}_{\mathbb{R}}^{n} is also compact, as can be seen from the compactness theorem. \mathbb{OA}_{\mathbb{R}}^{n} is not T_{1} (unless n=0). Note that model theorists have their own topology on \mathbb{OA}_{\mathbb{R}}^{n}, which is distinct from the one I use here, and is a refinement of it.

The new Galois connection is compatible with the old one via the inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, in the sense that if X\subseteq\mathbb{R}^{n}, then C_{\mathbb{R}}\left(X\right)=C\left(X\right) (where we identify X with its image in \mathbb{OA}_{\mathbb{R}}^{n}), and for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]P_{\mathbb{R}}=P\left(C\right)\cap\mathbb{R}^{n}.

Like our intermediate Galois connection \left(P_{\text{*}\mathbb{R}},C_{\text{*}\mathbb{R}}\right), our final Galois connection \left(P,C\right) succeeds in distinguishing P\left(C_{\infty}\right) and P\left(C_{\varepsilon}\right) from \emptyset and \left\{ 0\right\} , respectively, in the desirable manner. P\left(C_{\infty}\right) consists of the type of numbers larger than any real, and P\left(C_{\varepsilon}\right) consists of the types of 0 and of positive numbers smaller than any positive real.

Just like for subsets of \mathbb{R}^{n}, a closed subset X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a coordinate ring \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C\left(X\right), and an arbitrary X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a ring of regular functions \mathcal{O}\left(X\right) consisting of functions on X that are locally ratios of polynomials, ordered by f\geq0 if and only if \forall\Phi\in X, where f=\frac{p}{q} is a representation of f as a ratio of polynomials in a neighborhood of \Phi, either p\left(\Phi\right)\geq0 and q\left(\Phi\right)>0, or p\left(\Phi\right)\leq0 and q\left(\Phi\right)<0, and f\geq g if and only if f-g\geq0. As before, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right) for closed X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} is analogous to \mathbb{A}_{\mathbb{C}}^{n} from algebraic geometry because if, in the above definitions, you replace "\geq" and "<" with "=" and "\neq", replace totally ordered field extensions with field extensions, and replace cones with ideals, then you recover a description of \mathbb{A}_{\mathbb{C}}^{n}, in the sense of \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right).

What about an analog of projective space? Since we're paying attention to order, we should look at spheres, not real projective space. The n-sphere over \mathbb{R}, denoted \mathbb{S}_{\mathbb{R}}^{n}, can be described as the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{\mathbb{R}}^{n}.

For any totally ordered field k, we can define \mathbb{OA}_{k}^{n} similarly to \mathbb{OA}_{\mathbb{R}}^{n}, as the space of n-types over k, defined as above, replacing \mathbb{R} with k (although a model theorist would no longer call it the space of n-types over k). The compactness theorem is not true for arbitrary k, but its corollary that \mathbb{OA}_{k}^{n} is compact still is true.

Visualizing \mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n}

\mathbb{S}_{\mathbb{R}}^{n} should be thought of as the n-sphere with infinitesimals in all directions around each point. Specifically, \mathbb{S}_{\mathbb{R}}^{0} is just \mathbb{S}^{0}, a pair of points. The closed points of \mathbb{S}_{\mathbb{R}}^{n+1} are the points of \mathbb{S}^{n+1}, and for each closed point p, there is an n-sphere of infinitesimals around p, meaning a copy of \mathbb{S}_{\mathbb{R}}^{n}, each point of which has p in its closure.

\mathbb{OA}_{\mathbb{R}}^{n} should be thought of as n-space with infinitesimals in all directions around each point, and infinities in all directions. Specifically, \mathbb{OA}_{\mathbb{R}}^{n} contains \mathbb{R}^{n}, and for each point p\in\mathbb{R}^{n}, there is an n-1-sphere of infinitesimals around p, and there is also a copy of \mathbb{S}_{\mathbb{R}}^{n-1} around the whole thing, the closed points of which are limits of rays in \mathbb{R}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n} relate to each other the same way that \mathbb{R}^{n} and \mathbb{S}^{n} do. If you remove a closed point from \mathbb{S}_{\mathbb{R}}^{n}, you get \mathbb{OA}_{\mathbb{R}}^{n}, where the sphere of infinitesimals around the removed closed point becomes the sphere of infinities of \mathbb{OA}_{\mathbb{R}}^{n}.

More generally, if k is a totally ordered field, let k^{r} be its real closure. \mathbb{OA}_{k}^{n} consists of the Cauchy completion of \left(k^{r}\right)^{n} (as a metric space with distances valued in k^{r}), and for each point p\in\left(k^{r}\right)^{n} (though not for points that are limits of Cauchy sequences that do not converge in \left(k^{r}\right)^{n}), an n-1-sphere \mathbb{S}_{k}^{n-1} of infinitesimals around p, and an n-1-sphere \mathbb{S}_{k}^{n-1} around the whole thing, where \mathbb{S}_{k}^{n} is the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{k}^{n}. \mathbb{OA} does not distinguish between fields with the same real closure.

More Positivstellensätze

This Galois connection gives us a new notion of what it means for a cone to be radical, which is distinct from the old one and is better, so I will define \text{Rad}\left(C\right) to be C\left(P\left(C\right)\right). A cone C will be called radical if C=\text{Rad}\left(C\right). Again, it would be nice to be able to characterize radical cones without referring to the Galois connection. And this time, I can do it. Note that since \mathbb{OA}_{\mathbb{R}}^{n} is compact, the proof of Positivstellensatz 3 shows that in our new context, the Positivstellensatz holds for all cones, since even the subcone generated by \emptyset has a compact positive-set.

Positivstellensatz 4: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\Phi\right)>0\,\forall\Phi\in P\left(C\right) if and only if \exists f\in C such that pf-1\in C.

However, we can no longer add in lower limits of sequences of polynomials. For example, -x+\varepsilon\in C_{\varepsilon} for all real \varepsilon>0, but -x\notin C_{\varepsilon}, even though C_{\varepsilon} is radical. This happens because, where \Sigma is the type of positive infinitesimals, -\Sigma+\varepsilon>0 for real \varepsilon>0, but -\Sigma<0. However, we can add in lower limits of sequences contained in finitely-generated subcones, and this is all we need to add, so this characterizes radical cones.

Positivstellensatz 5: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone, \text{Rad}\left(C\right) is the union over all finitely-generated subcones D\subseteq C of the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\}  (again the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X).

Proof: Suppose D\subseteq C is a subcone generated by a finite set \left\{ f_{1},...,f_{m}\right\} , and q is the infimum of a chain \left\{ q_{\alpha}\right\} _{\alpha\in A}\subseteq\left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\} . For any \bar{x}\in\mathbb{R}^{n}, if f_{i}\left(\bar{x}\right)\geq0 for each i, then q_{\alpha}\left(\bar{x}\right)>0 for each \alpha, and hence q\left(\bar{x}\right)\geq0. That is, the finite set of inequalities \left\{ f_{i}\left(\bar{x}\right)\geq0\mid1\leq i\leq m\right\} \cup\left\{ q\left(\bar{x}\right)<0\right\}  does not hold anywhere in \mathbb{R}^{n}. By the compactness theorem, there are no n-types satisfying all those inequalities. Given \Phi\in P\left(C\right)f_{i}\left(\Phi\right)\geq0, so q\left(\Phi\right)\nless0; that is, q\left(\Phi\right)\geq0.

Conversely, suppose q\in\text{Rad}\left(C\right). Then by the compactness theorem, there are some f_{1},...,f_{m}\in C such that q\in\text{Rad}\left(\left\langle f_{1},...,f_{m}\right\rangle \right). Then \forall\varepsilon>0, q+\varepsilon is strictly positive on P\left(\left\langle f_{1},...,f_{m}\right\rangle \right), and hence by Positivstellensatz 4, \exists f\in\left\langle f_{1},...,f_{m}\right\rangle  such that pf-1\in\left\langle f_{1},...,f_{m}\right\rangle . That is, \left\{ q+\varepsilon\mid\varepsilon>0\right\} is a chain contained in \left\langle f_{1},...,f_{m}\right\rangle , a finitely-generated subcone of C, whose infimum is q. \square

Ordered commutative algebra

Even though they are technically not isomorphic, \mathbb{C}^{n} and \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) are closely related, and can often be used interchangeably. Of the two, \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) is of a form that can be more easily generalized to more abstruse situations in algebraic geometry, which may indicate that it is the better thing to talk about, whereas \mathbb{C}^{n} is merely the simpler thing that is easier to think about and just as good in practice in many contexts. In contrast, \mathbb{R}^{n} and \mathbb{OA}_{\mathbb{R}}^{n} are different in important ways. The situation in algebraic geometry provides further reason to pay more attention to \mathbb{OA}_{\mathbb{R}}^{n} than to \mathbb{R}^{n}.

The next thing to look for would be an analog of the spectrum of a ring for a partially ordered commutative ring (I will henceforth abbreviate "partially ordered commutative ring" as "ordered ring" in order to cut down on the profusion of adjectives) in a way that makes use of the order, and gives us \mathbb{OA}_{\mathbb{R}}^{n} when applied to \mathbb{R}\left[x_{1},...,x_{n}\right]. I will call it the order spectrum of an ordered ring A, denoted \text{OrdSpec}\left(A\right). Then of course \mathbb{OA}_{A}^{n} can be defined as \text{OrdSpec}\left(A\left[x_{1},...,x_{n}\right]\right)\text{OrdSpec}\left(A\right) should be, of course, the set of prime cones. But what even is a prime cone?

Definition: A cone \mathfrak{p}\subseteq A is prime if A/\mathfrak{p} is a totally ordered integral domain.

Definition: \text{OrdSpec}\left(A\right) is the set of prime cones in A, equipped with the topology whose closed sets are the sets of prime cones containing a given cone.

An n-type \Phi\in\mathbb{OA}_{\mathbb{R}}^{n} can be seen as a cone, by identifying it with \left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\right\} , aka C\left(\left\{ \Phi\right\} \right). Under this identification, \mathbb{OA}_{\mathbb{R}}^{n}=\text{OrdSpec}\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right), as desired. The prime cones in \mathbb{R}\left[x_{1},...,x_{n}\right] are also the radical cones C such that P\left(C\right) is irreducible. Notice that irreducible subsets of \mathbb{OA}_{\mathbb{R}}^{n} are much smaller than irreducible subsets of \mathbb{A}_{\mathbb{C}}^{n}; in particular, none of them contain more than one element of \mathbb{R}^{n}.

There is also a natural notion of maximal cone.

Definition: A cone \mathfrak{m}\subseteq A is maximal if \mathfrak{m}\neq A and there are no strictly intermediate cones between \mathfrak{m} and A. Equivalently, if \mathfrak{m} is prime and closed in \text{OrdSpec}\left(A\right).

Maximal ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] correspond to elements of \mathbb{C}^{n}. And the cones of elements of \mathbb{R}^{n} are maximal cones in \mathbb{R}\left[x_{1},...,x_{n}\right], but unlike in the complex case, these are not all the maximal cones, since there are closed points in \mathbb{OA}_{\mathbb{R}}^{n} outside of \mathbb{R}^{n}. For example, C_{\infty} is a maximal cone, and the type of numbers greater than all reals is closed. To characterize the cones of elements of \mathbb{R}^{n}, we need something slightly different.

Definition: A cone \mathfrak{m}\subseteq A is ideally maximal if A/\mathfrak{m} is a totally ordered field. Equivalently, if \mathfrak{m} is maximal and \mathfrak{m}^{\circ} is a maximal ideal.

Elements of \mathbb{R}^{n} correspond to ideally maximal cones of \mathbb{R}\left[x_{1},...,x_{n}\right].

\text{OrdSpec} also allows us to define the radical of a cone in an arbitrary partially ordered commutative ring.

Definition: For a cone C\subseteq A, \text{Rad}\left(C\right) is the intersection of all prime cones containing C. C is radical if C=\text{Rad}\left(C\right).

Conjecture: \text{Rad}\left(C\right) is the union over all finitely-generated subcones C\subseteq D of the closure of \left\{ p\in A\mid\exists f\in D\, pf-1\in D\right\}  (as before, the closure of a subset X\subseteq A is defined to be the set of all elements of A which are infima of chains contained in X).

Order schemes

Definition: An ordered ringed space is a topological space equipped with a sheaf of ordered rings. An ordered ring is local if it has a unique ideally maximal cone, and a locally ordered ringed space is an ordered ringed space whose stalks are local.

\text{OrdSpec}\left(A\right) can be equipped with a sheaf of ordered rings \mathcal{O}_{A}, making it a locally ordered ringed space.

Definition: For a prime cone \mathfrak{p}\subseteq A, the localization of A at \mathfrak{p}, denoted A_{\mathfrak{p}}, is the ring A_{\mathfrak{p}^{\circ}} equipped with an ordering that makes it a local ordered ring. This will be the stalk at \mathfrak{p} of \mathcal{O}_{A}. A fraction \frac{a}{b}\in A_{\mathfrak{p}} (b\notin\mathfrak{p}^{\circ}) is also an element of A_{\mathfrak{q}} for any prime cone \mathfrak{q}\subseteq A whose interior ideal does not contain b. This is an open neighborhood of \mathfrak{p} (its complement is the set of prime cones containing \left\langle b,-b\right\rangle ). There is a natural map A_{\mathfrak{p}}\rightarrow\text{Frac}\left(A/\mathfrak{p}\right) given by \frac{a}{b}\mapsto\frac{a+\mathfrak{p}^{\circ}}{b+\mathfrak{p}^{\circ}}, and the total order on A/\mathfrak{p} extends uniquely to a total order on the fraction field, so for a,b\in A_{\mathfrak{p}}, we can say that a\geq b at \mathfrak{p} if this is true of their images in \text{Frac}\left(A/\mathfrak{p}\right). We can then say that a\geq b near \mathfrak{p} if a\geq b at every point in some neighborhood of \mathfrak{p}, which defines the ordering on A_{\mathfrak{p}}.

Definition: For open U\subseteq\text{OrdSpec}\left(A\right), \mathcal{O}_{A}\left(U\right) consists of elements of \prod_{\mathfrak{p}\in U}A_{\mathfrak{p}} that are locally ratios of elements of A. \mathcal{O}_{A}\left(U\right) is ordered by a\geq b if and only if \forall\mathfrak{p}\in\text{OrdSpec}\left(A\right) a\geq b near \mathfrak{p} (equivalently, if \forall\mathfrak{p}\in\text{OrdSpec}\left(A\right) a\geq b at \mathfrak{p}).

A\subseteq\mathcal{O}_{A}\left(\text{OrdSpec}\left(A\right)\right), and this inclusion can be proper. Conjecture: \text{OrdSpec}\left(\mathcal{O}_{A}\left(U\right)\right)\cong U as locally ordered ringed spaces for open U\subseteq\text{OrdSpec}\left(A\right). This conjecture says that it makes sense to talk about whether or not a locally ordered ringed space looks locally like an order spectrum near a given point. Thus, if this conjecture is false, it would make the following definition look highly suspect.

Definition: An order scheme is a topological space X equipped with a sheaf of ordered commutative rings \mathcal{O}_{X} such that for some open cover of X, the restrictions of \mathcal{O}_{X} to the open sets in the cover are all isomorphic to order spectra of ordered commutative rings.

I don't have any uses in mind for order schemes, but then again, I don't know what ordinary schemes are for either and they are apparently useful, and order schemes seem like a natural analog of them.

Nonabelian modules

This is a rough overview of my thoughts on a thing I've been thinking about, and as such is incomplete and may contain errors. Proofs have been omitted when writing them out would be at all tedious.

Edit: It has been pointed out to me that near-ring modules have already been defined, and the objects I describe in this post are just near-ring modules where the near-ring happens to be a ring.

Introduction

As you all know (those of you who have the background for this post, anyway), an R-module is an abelian group M (written additively) together with a multiplication map R\times M\rightarrow M such that for all \alpha,\beta\in R and x,y\in M, \alpha\cdot\left(x+y\right)=\alpha\cdot x+\alpha\cdot y\left(\alpha+\beta\right)\cdot x=\alpha\cdot x+\beta\cdot x, \left(\alpha\beta\right)\cdot x=\alpha\cdot\left(\beta\cdot x\right), and 1\cdot x=x.

What if we don't want to restrict attention to abelian groups? One could attempt to define a nonabelian module using the same axioms, but without the restriction that the group be abelian. As it is customary to write groups multiplicatively if they are not assumed to be abelian, we will do that, and the map R\times M\rightarrow M will be written as exponentiation (since exponents are written on the right, I'll follow the definition of right-modules, rather than left-modules). The axioms become: for all \alpha,\beta\in R and x,y\in M, \left(xy\right)^{\alpha}=x^{\alpha}y^{\alpha}x^{\alpha+\beta}=x^{\alpha}x^{\beta}, x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, and x^{1}=x.

What has changed? Absolutely nothing, as it turns out. The first axiom says again that M is abelian, because yx=x^{-1}\left(xy\right)^{2}y^{-1}=x^{-1}\left(x^{2}y^{2}\right)y^{-1}=xy. We'll have to get rid of that axiom. Our new definition, which it seems to me captures the essence of a module except for abelianness:

A nonabelian R-module is a group M (written multiplicatively) together with a scalar exponentiation map R\times M\rightarrow M such that for all \alpha,\beta\in R and x\in M, x^{1}=xx^{\alpha+\beta}=x^{\alpha}x^{\beta}, and x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}.

These imply that x^{0}=1, 1^{\alpha}=1, and x^{-1} is the inverse of x, because x\cdot x^{0}=x^{1}x^{0}=x^{1+0}=x^{1}=x1^{\alpha}=\left(1^{0}\right)^{\alpha}=1^{\left(0\alpha\right)}=1^{0}=1, and x\cdot x^{-1}=x^{1-1}=1.

Just like a \mathbb{Z}-module is just an abelian group, a nonabelian \mathbb{Z}-module is just a group. Just like a \mathbb{Z}/n\mathbb{Z}-module is an abelian group whose exponent divides n, a nonabelian \mathbb{Z}/n\mathbb{Z}-module is a group whose exponent divides n.

Exponentiation-like families of operations

Perhaps a bit more revealing is what nonabelian modules over free rings look like, since then the generators are completely generic ring elements. Where A is the generating set, a \mathbb{Z}\left\langle A\right\rangle -module is an abelian group together with endomorphisms \left\{ x\mapsto\alpha x\mid\alpha\in A\right\} , which tells us that modules are about endomorphisms of an abelian group indexed by the elements of a ring. Nonabelian modules are certainly not about endomorphisms. After all, in a nonabelian group, the map x\mapsto x^{2} is not an endomorphism. I will call the things that nonabelian modules are about "exponentiation-like families of operations'', and give four equivalent definitions, in roughly increasing order of concreteness and decreasing order of elegance. Definition 2 uses basic model theory, so skip it if that scares you. Definition 3 is the "for dummies'' version of definition 2.

Definition 0: Let G be a group, and let A be a family of functions from G to G (not necessarily endomorphisms). If G can be made into a nonabelian \mathbb{Z}\left\langle A\right\rangle -module such that x^{\alpha}=\alpha\left(x\right) for x\in G and \alpha\in A, then A is called an exponentiation-like family of operations on G. If so, the nonabelian \mathbb{Z}\left\langle A\right\rangle -module structure on G with that property is unique, so define x^{p} to be its value according to that structure, for p\in\mathbb{Z}\left\langle A\right\rangle  and x\in G.

Definition 1: A is an exponentiation-like family of operations on G if for all x\in G, the smallest subgroup containing x which is closed under actions by elements of A (which I will call \overline{\left\{ x\right\} }) is abelian, and the elements of A restrict to endomorphisms of it. Using the universal property of \mathbb{Z}\left\langle A\right\rangle , this induces a homomorphism \mathbb{Z}\left\langle A\right\rangle \rightarrow\text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}. Let x^{p} denote the action of p on x under that map, for p\in\mathbb{Z}\left\langle A\right\rangle . By \text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}, I mean the endomorphism ring of \overline{\left\{ x\right\} } with composition running in the opposite direction (i.e., the multiplication operation given by \left(f,g\right)\mapsto g\circ f). This is because of the convention that nonabelian modules are written as nonabelian right-modules by default.

Definition 2: Let consider the language \mathcal{L}_{Rings}\sqcup A, where \mathcal{L}_{Rings}:=\left\{ 0,1,+,-,\cdot\right\} is the language of rings, and each element of A is used as a constant symbol. Closed terms in \mathcal{L}_{Rings}\sqcup A act as functions from G to G, with the action of t written as x\mapsto x^{t}, defined inductively as: x^{0}:=1, x^{1}:=x, x^{\alpha}:=\alpha\left(x\right) for \alpha\in X, x^{t+s}:=x^{t}x^{s}, x^{-t}:=\left(x^{t}\right)^{-1}, and x^{ts}:=\left(x^{t}\right)^{s} for closed \mathcal{L}_{Rings}\sqcup A-terms t and s. A is called an exponentiation-like family of operations on G if x^{t}=x^{s} whenever T_{Rings}\models t=s, where T_{Rings} is the theory of rings. If A is an exponentiation-like family of operations on G and p\in\mathbb{Z}\left\langle A\right\rangle  is a noncommutative polynomial with variables in A, then for x\in Gx^{p} is defined to be x^{t} where t is any term representing p.

Definition 3: Pick a total order on the free monoid on A (e.g. by ordering A and then using the lexicographic order). The order you use won't matter. Given x\in G and w:=\alpha_{1}...\alpha_{n} in the free monoid on A, let x^{w}=\alpha_{n}\left(...\alpha_{1}\left(x\right)\right). Where p\in\mathbb{Z}\left\langle A\right\rangle is a noncommutative polynomial, p=c_{1}w_{1}+...+c_{n}w_{n} for some c_{1},...,c_{n}\in\mathbb{Z} and decreasing sequence w_{1},...,w_{n} of noncommutative monomials (elements of the free monoid on A). Let x^{p}=\left(x^{c_{1}}\right)^{w_{1}}...\left(x^{c_{n}}\right)^{w_{n}}A is called an exponentiation-like family of operations on G if for every x\in G and p,q\in\mathbb{Z}\left\langle A\right\rangle x^{pq}=\left(x^{p}\right)^{q} and x^{p+q}=x^{p}x^{q}.

These four definitions of exponentiation-like family are equivalent, and for exponentiation-like families, their definitions of exponentiation by a noncommutative polynomial are equivalent.

Facts: \emptyset is an exponentiation-like family of operations on G. If A is an exponentiation-like family of operations on G and B\subseteq A, then so is B. If G is abelian, then \text{End}\left(G\right) is exponentiation-like. Given a nonabelian R-module structure on G, the actions of the elements of R on G form an exponentiation-like family. In particular, if A is an exponentiation-like family of operations on G, then so is \mathbb{Z}\left\langle A\right\rangle , with the actions being defined as above.

[The following paragraph has been edited since this comment.]

For an abelian group A, the endomorphisms of A form a ring \text{End}\left(A\right), and an R-module structure on A is simply a homomorphism R\rightarrow\text{End}\left(A\right). Can we say a similar thing about exponentiation-like families of operations of G? Let \text{Exp}\left(G\right) be the set of all functions G\rightarrow G (as sets). Given \alpha,\beta\in\text{Exp}\left(G\right), let multiplication be given by composition: x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, addition be given by x^{\alpha+\beta}=x^{\alpha}x^{\beta}, negation be given by x^{-\alpha}=\left(x^{\alpha}\right)^{-1}, and 0 and 1 be given by x^{0}=1 and x^{1}=x. This makes \text{Exp}\left(G\right) into a near-ring. A nonabelian R-module structure on G is a homomorphism R\rightarrow\text{Exp}\left(G\right), and a set of operations on G is an exponentiation-like family of operations on G if and only if it is contained in a ring which is contained in \text{Exp}\left(G\right).

Some aimless rambling

What are some interesting examples of nonabelian modules that are not abelian? (That might sound redundant, but "nonabelian module'' means that the requirement of abelianness has been removed, not that a requirement of nonabelianness has been imposed. Perhaps I should come up with better terminology. To make matters worse, since the requirement that got removed is actually stronger than abelianness, there are nonabelian modules that are abelian and not modules. For instance, consider the nonabelian \mathbb{Z}\left[\alpha\right]-module whose underlying set is the Klein four group (generated by two elements a,b) such that a^{\alpha}=a, b^{\alpha}=b, and \left(ab\right)^{\alpha}=1.)

In particular, what do free nonabelian modules look like? The free nonabelian \mathbb{Z}-modules are, of course, free groups. The free nonabelian \mathbb{Z}/n\mathbb{Z}-modules have been studied in combinatorial group theory; they're called Burnside groups. (Fun but tangential fact: not all Burnside groups are finite (the Burnside problem), but despite this, the category of finite nonabelian \mathbb{Z}/n\mathbb{Z}-modules has free objects on any finite generating set, called Restricted Burnside groups.)

The free nonabelian \mathbb{Z}\left[\alpha\right]-modules are monstrosities. They can be constructed in the usual way of constructing free objects in a variety of algebraic structures, but that construction seems not to be very enlightening about their structure. So I'll give a somewhat more direct construction of the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators, which may also not be that enlightening, and which is only suspected to be correct. Define an increasing sequence of groups G_{n}, and functions \alpha_{n}:G_{n}\rightarrow G_{n+1}, as follows: G_{0} is the free group on d generators. Given G_{n}, and given a subgroup X\leq G_{n}, let the top-degree portion of X be \alpha_{n-1}^{k}\left(X\right) for the largest k such that this is nontrivial. Let H_{n} be the free product of the top-degree portions of maximal abelian subgroups of G_{n}. Let G_{n+1} be the free product of G_{n} with H_{n} modulo commutativity of the maximal abelian subgroups of G_{n} with the images of their top-degree portions in H_{n}. Given a maximal abelian subgroup X\leq G_{n}, let \alpha_{n}\restriction_{X} be the homomorphism extending \alpha_{n-1}\restriction_{X\cap G_{n-1}} which sends the top-degree portion identically onto its image in H_{n}. Since every non-identity element of G_{n} is in a unique maximal abelian subgroup, this defines \alpha_{n}. G:=\bigcup_{n}G_{n} with \alpha:=\bigcup_{n}\alpha_{n} is the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators. If A is a set, the free nonabelian \mathbb{Z}\left\langle A\right\rangle -modules can be constructed similarly, with \left|A\right| copies of H_{n} at each step. Are these constructions even correct? Are there nicer ones?

A nonabelian \mathbb{Z}\left[\frac{1}{2}\right]-module would be a group with a formal square root operation. As an example, any group of odd exponent n can be made into a \mathbb{Z}\left[\frac{1}{2}\right]-module in a canonical way by letting x^{\frac{1}{2}}=x^{\frac{n+1}{2}}. More generally, any group of finite exponent n can be made into a \mathbb{Z}\left[\left\{ p^{-1}|p\nmid n\right\} \right]-module in a similar fashion. Are there any more nice examples of nonabelian modules over localizations of \mathbb{Z}?

In particular, a nonabelian \mathbb{Q}-module would be a group with formal nth root operations for all n. What are some nonabelian examples of these? Note that nonabelian \mathbb{Q}-modules cannot have any torsion, for suppose x^{n}=1 for some n\neq0. Then x=\left(x^{n}\right)^{\frac{1}{n}}=1^{\frac{1}{n}}=1. More generally, nonabelian modules cannot have any n-torsion (meaning x^{n}=1\implies x=1) for any n which is invertible in the scalar ring.

The free nonabelian \mathbb{Z}\left[\frac{1}{m}\right]-modules can be constructed similarly to the construction of free nonabelian \mathbb{Z}\left[\alpha\right]-modules above, except that when constructing G_{n+1} from G_{n} and H_{n}, we also mod out by elements of G_{n} being equal to the mth powers of their images in H_{n}. Using the fact that \mathbb{Q}\cong\mathbb{Z}\left\langle \left\{ p^{-1}|\text{primes }p\right\} \right\rangle , this lets us modify the construction of free nonabelian \mathbb{Z}\left\langle A\right\rangle -modules to give us a construction of free nonabelian \mathbb{Q}-modules. Again, is there a nicer way to do it?

Topological nonabelian modules

It is also interesting to consider topological nonabelian modules over topological rings; that is, nonabelian modules endowed with a topology such that the group operation and scalar exponentiation are continuous. A module over a topological ring has a canonical finest topology on it, and the same remains true for nonabelian modules. For finite-dimensional real vector spaces, this is the only topology. Does the same remain true for finitely-generated nonabelian \mathbb{R}-modules? Finite-dimensional real vector spaces are complete, and topological nonabelian modules are, in particular, topological groups, and can thus be made into uniform spaces, so the notion of completeness still makes sense, but I think some finitely-generated nonabelian \mathbb{R}-modules are not complete.

A topological nonabelian \mathbb{R}-module is a sort of Lie group-like object. One might try constructing a Lie algebra for a complete nonabelian \mathbb{R}-module M by letting the underlying set be M, and defining x+y=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}\right)^{\left(\varepsilon^{-1}\right)} and \left[x,y\right]=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}x^{-\varepsilon}y^{-\varepsilon}\right)^{\left(\varepsilon^{-2}\right)}. One might try putting a differential structure on M such that this is the Lie algebra of left-invariant derivations. Does this or something like it work?

A Lie group is a nonabelian \mathbb{R}-module if and only if its exponential map is a bijection between it and its Lie algebra. In this case, scalar exponentiation is closely related to the exponential map by a compelling formula: x^{\alpha}=\exp\left(\alpha\exp^{-1}\left(x\right)\right). As an example, the continuous Heisenberg group is a nonabelian \mathbb{R}-module which is not abelian. This observation actually suggests a nice class of examples of nonabelian modules without a topology: given a commutative ring R, the Heisenberg group over R is a nonabelian R-module.

The Heisenberg group of dimension 2n+1 over a commutative ring R has underlying set R^{n}\times R^{n}\times R^{1}, with the group operation given by \left(\boldsymbol{a}_{1},\boldsymbol{b}_{1},c_{1}\right)*\left(\boldsymbol{a}_{2},\boldsymbol{b}_{2},c_{2}\right):=\left(\boldsymbol{a}_{1}+\boldsymbol{a}_{2},\boldsymbol{b}_{1}+\boldsymbol{b}_{2},c_{1}+c_{2}+\boldsymbol{a}_{1}\cdot\boldsymbol{b}_{2}-\boldsymbol{a}_{2}\cdot\boldsymbol{b}_{1}\right). The continuous Heisenberg group means the Heisenberg group over \mathbb{R}. Scalar exponentiation on a Heisenberg group is just given by scalar multiplication: \left(\boldsymbol{a},\boldsymbol{b},c\right)^{\alpha}:=\left(\alpha\boldsymbol{a},\alpha\boldsymbol{b},\alpha c\right).