## Cardinality and the axiom of choice

If the axiom of choice is false, then there are two sets which are not the same size, but neither one of them is larger than the other. This, and similar seemingly absurd results, are sometimes used as motivation for the axiom of choice. But this is not so absurd when you unpack what it means: A set $X$ is said to be at least as large as a set $Y$ if there is an injection from $Y$ into $X$, the same size as $Y$ if there is a bijection between $X$ and $Y$, and strictly larger than $Y$ if it is at least as large as $Y$ but not the same size. So all this result is saying is that if the axiom of choice is false, then there are two sets, neither of which can be injected into the other. It’s not hard to find two groups without an injective homomorphism between them in either direction, and not hard to find two topological spaces without any injective continuous maps between them in either direction. So why not sets?

The usual reason to think this shouldn’t happen for sets is that cardinalities of sets are supposed to correspond to an intuitive notion of size. But this intuitive interpretation is not God-given; it is an interpretation that people came up with because it seemed to fit. If the axiom of choice is false, this interpretation fits less well. The axiom of choice can be viewed as saying that sets are so flexible that the possibility of fitting one inside another is limited only by their relative sizes, whereas the negation of the axiom of choice says that sets have some essential structure that can’t be adequately interpreted as size.

But it gets worse. Without the axiom of choice, it’s possible to have an equivalence relation $\sim$ on a set $X$ such that there are strictly more equivalence classes of $\sim$ than there are elements of $X$. For instance, if all subsets of $\mathbb{R}$ are Lebesgue measurable, then this can be done with an equivalence relation on $\mathbb{R}$, namely $x\sim y$ iff $x-y\in\mathbb{Q}$.

Surely this is still absurd? Again, I think it is not. It only sounds absurd because of the inappropriate language we used to describe the situation in which there’s an injection from $X$ to $X/\sim$ but no bijection between them. Instead, you can think about it as $X$ and $X/\sim$ being flexible in the ways that allow $X$ to fit inside $X/\sim$, but rigid in ways that prevent $X/\sim$ from fitting inside $X$, rather than in terms of bigness.

I suspect that, to many people, “$X/\sim$ has strictly larger cardinality than $X$” sounds more absurd than “$X$ and $X/\sim$ have incomparable cardinalities” does, but it really shouldn’t, since these are almost the same thing. The reason these are almost the same is that an infinite set can be in bijection with the union of two disjoint copies of itself, another phenomenon that could be thought of as an absurdity if the identification of cardinality with size is taken too literally, but which you have probably long since gotten used to. If $X$ and $X/\sim$ have incomparable cardinalities and $X$ can be put into bijection with two copies of itself, then, using such a bijection to identify $X$ with two copies of itself, mod out one of them by $\sim$ while leaving the other unchanged. The set of equivalence classes of this new equivalence relation looks like $X\sqcup(X/\sim)$, which $X$ easily injects into.

And it shouldn’t be surprising that there could be an equivalence relation $\sim$ such that $X/\sim$ can’t inject into $X$; the only obvious reason $X/\sim$ should be able to inject into $X$ is that you could pick an element of each equivalence class, but the possibility of doing this is a restatement of the axiom of choice. For instance, in the example of $\mathbb{R}$ and $\mathbb{R}/\mathbb{Q}$, a set consisting of one element from each equivalence class would not be Lebesgue measurable, and thus doesn’t exist if all sets are Lebesgue measurable.

The sense in which cardinality can be about structure more general than size can become even more apparent in more austere foundations. Consider a mathematical universe in which everything that exists can be coded for somehow with natural numbers, and every function is computable. There’s a set of real numbers in this universe, which we know of as the computable real numbers: they’re coded by numbers representing programs computing Cauchy sequences that converge to them. It doesn’t really make sense to think of this universe as containing anything “bigger” than $\mathbb{N}$, since everything is coded for by integers. But Cantor’s theorem is constructive, so it applies here. Given a computable sequence of computable reals, we can produce a computation of a real number that isn’t in the sequence. So the integers and the (computable) reals here have different “cardinalities” in the sense that, due to their differing computational structure, there’s no bijection between them in this computability universe.
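The constructive diagonal argument can be sketched in code. This is only an illustration under assumptions that are not in the text: a "computable real" is modelled as a Python function taking `k` and returning a `Fraction` within $2^{-k}$ of the real, and `diagonal_intervals` and `example_seq` are hypothetical names. The sketch shrinks a rational interval by a factor of three at each step, always moving away from the next real in the sequence, so the point the intervals close in on differs from every listed real.

```python
from fractions import Fraction

def diagonal_intervals(seq, steps):
    """Return nested intervals (lo, hi); the n-th one excludes the real seq(n)."""
    lo, hi = Fraction(0), Fraction(1)
    intervals = []
    for n in range(steps):
        width = hi - lo
        # Pick a precision k with 2**-k <= width/12.
        k = 0
        while Fraction(1, 2 ** k) > width / 12:
            k += 1
        q = seq(n)(k)  # rational approximation of the n-th real
        if q < lo + width / 2:
            # x_n <= q + width/12 < lo + 2*width/3: keep the right third.
            lo = lo + 2 * width / 3
        else:
            # x_n >= q - width/12 > lo + width/3: keep the left third.
            hi = lo + width / 3
        intervals.append((lo, hi))
    return intervals

def example_seq(n):
    """Assumed example sequence: the rational (3n mod 7)/7, returned exactly."""
    return lambda k: Fraction((3 * n) % 7, 7)

intervals = diagonal_intervals(example_seq, 8)
```

Any point of the last interval is already guaranteed to differ from the first eight reals of the sequence; running more steps pins the diagonal real down to any desired precision.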

I think that it can be helpful to think of cardinalities as potentially being about inherent structure of sets rather than simply “size” even if you’re assuming the axiom of choice the whole time. Fun fact: if there’s any model of ZFC at all, then there’s a countable model. This often strikes people as absurd; ZFC asserts the existence of uncountable sets, like $\mathbb{R}$, so how could something countable be a model of ZFC? The answer is that an infinite set being uncountable just means that there’s no bijection between it and $\mathbb{N}$. A countable model of ZFC can contain a countable set but not contain any of the bijections between it and $\mathbb{N}$; then, internally to the model, this set qualifies as uncountable. This is sometimes described as the countable model of ZFC “believing” that some of its sets are uncountable, but being “wrong”. I think this is a little sloppy; models of ZFC are just mathematical structures, not talk radio hosts with bad opinions. Restricting attention to a certain model of ZFC means imposing additional structure on its elements; namely that structure which is preserved by the functions in the model. This additional structure isn’t respected by functions outside the model, just like equipping a set with a topology imposes structure on it that isn’t respected by discontinuous maps.

To give a couple concrete examples of how I visualize cardinality as being about structure: When we encounter mathematical objects of cardinality $2^{\aleph_0}$ in practice, they often naturally carry separable topologies on them, so I think of $2^{\aleph_0}$ as being thicker than $\aleph_0$, but no longer. Since smaller ordinals are initial segments of larger ordinals, I think of $\aleph_1$, the cardinality of the first uncountable ordinal, as longer than $\aleph_0$, but no thicker. $\mathbb{R}$ being well-orderable would mean we can rearrange something thick into something long.

It’s interesting to note that you can forge something long ($\aleph_1$) out of something thick ($2^{\aleph_0}$) by modding out by an equivalence relation. (This gives another example of a quotient of $\mathbb{R}$ that, axiom of choice aside, it’s perfectly reasonable to think shouldn’t fit back inside $\mathbb{R}$). This is because $\aleph_1$ is the cardinality of the set of countable ordinals, each countable ordinal is the order-type of a well-ordering of $\mathbb{N}$, and well-orderings on $\mathbb{N}$ are binary relations on $\mathbb{N}$, aka subsets of $\mathbb{N}\times\mathbb{N}$. So, starting with $\mathcal{P}(\mathbb{N}\times\mathbb{N})$ (with cardinality $2^{\aleph_0}$), say that any two elements that are both well-orderings of the same order-type are equivalent (and, if you want to end up with just $\aleph_1$, rather than $\aleph_1+2^{\aleph_0}$, also say that all the left-overs that aren’t well-orderings are equivalent to each other). The set of equivalence classes then corresponds to the set of countable ordinals (plus whatever you did with the leftovers that aren’t well-orderings).

The idea behind this post was a specific instance of the general principle that, when a result seems absurd, this doesn’t necessarily refute the foundational assumptions used to prove it, but rather means that your way of thinking isn’t well adapted to a mathematical universe in which those assumptions are true. Another example of this is that the Banach-Tarski theorem and similar results often strike people as patently absurd, but people get used to them, and one could try explaining why such results shouldn’t be seen as being as absurd as they first seem, as a way of conveying intuition about what a mathematical universe in which the axiom of choice holds looks like.

While I don’t find the allegedly counterintuitive things that are likely to happen without the axiom of choice compelling, this doesn’t undercut other arguments for the axiom of choice. I think the strongest is that every $\Pi^1_4$ statement (a broad class of statements that arguably includes everything concrete or directly applicable in the real world) that can be proved in ZFC can also be proved in ZF. So assuming the axiom of choice isn’t going to lead us astray about concrete things regardless of whether it is true in some fundamental sense, but assuming it can sometimes make it easier to prove something even if in theory it could be proved otherwise. This seems like a good reason to assume the axiom of choice to me, but that’s different from the axiom of choice being fundamentally true, or things that can happen if the axiom of choice is false being absurd.

## Uniqueness of mathematical structures

This post is an introduction to model theory, of sorts. Occasionally I get asked what model theory is, and I generally find it quite difficult to give someone who doesn’t already know any model theory a good answer to this question, that actually says anything useful about what model theory is really about without leaving them hopelessly lost. This is my attempt to provide a real taste of model theory in a way that should be accessible to a math grad student without a background in logic.

### Warm-up exercise

Let’s say I make a graph with the following procedure: I start with a countably infinite set of vertices. For each pair of vertices, I flip a fair coin. If the coin lands heads, I put an edge between those two vertices; if the coin lands tails, no edge.

Now you make another graph in a very similar manner. You also start with a countably infinite set of vertices. But instead of flipping a coin, you roll a fair standard six-sided die for each pair of vertices. If the die comes up 6, you put an edge between those two vertices; if it comes up anything from 1 through 5, no edge.

What is the probability that these two graphs are isomorphic?

For the numerical answer, paste “Gur zhygvcyvpngvir vqragvgl” into https://rot13.com/. An explanation will appear later in this post.

### Introduction

There are several cases in which we can identify a mathematical object up to isomorphism with a list of first-order properties it satisfies (I’ll tell you what that means in a sec) and some data about cardinality. Here’s a couple examples: All countable dense linear orders without endpoints are isomorphic. Any two algebraically closed fields of the same characteristic, which have transcendence bases of the same cardinality, are isomorphic. It turns out that the possibility of uniquely specifying a mathematical structure in this way corresponds to interesting structural properties of that structure.

First, the basic definitions:

A first-order language consists of a set of relation symbols, each of which is labeled with a number representing its arity (number of inputs it takes), a set of function symbols, each of which is also labeled with a number representing its arity, and a set of constant symbols (which could also just be thought of as 0-ary function symbols). For example, the language of linear orders has one binary relation $<$ and no functions or constants. The language of fields has constants $0$ and $1$, binary functions $+$ and $\cdot$, a unary function $-$ (no unary function for reciprocal, because the functions should be total), and no relations.

A first-order structure in a given language is a set $X$ in which each constant symbol is interpreted as an element of $X$, each n-ary function symbol is interpreted as a function $X^n\rightarrow X$, and each n-ary relation symbol is interpreted as a subset of $X^n$ (or alternatively, as a function $X^n\rightarrow\{\text{True},\text{False}\}$). So linear orders and fields are examples of structures in their respective languages.

We can compose function symbols, constant symbols, and variables into ways of pointing to elements of a structure, called terms. We have as many variables as we want, and they are terms. Constant symbols are terms. And for each n-ary function symbol $f$ and terms $t_1,\dots,t_n$, $f(t_1,\dots,t_n)$ is a term. So in the language of fields, we can construct terms representing integers by adding however many 1s together (and then negating to get negative numbers), and then combine these with variables using addition and multiplication to get terms representing polynomials in however many variables with integer coefficients. In the language of linear orders, since we have no functions or constants, the only terms are variables.

A first-order formula is a way of actually saying things about a first-order structure and elements of it represented by variables. If $R$ is an n-ary relation and $t_1,\dots,t_n$ are terms, then $R(t_1,\dots,t_n)$ is a formula. If $t_1,t_2$ are terms, then $t_1=t_2$ is a formula (you can think of this as just meaning that languages always have the binary relation $=$ by default). Boolean combinations of formulas are formulas (i.e. if $\varphi$ and $\psi$ are formulas, then so are $\varphi\wedge\psi$, $\varphi\vee\psi$, and $\neg\varphi$), and if $\varphi$ is a formula that refers to a variable $x$, then $\forall x\,\varphi$ and $\exists x\,\varphi$ are formulas. Any variable that appears in a formula without being bound to a quantifier $\forall$ or $\exists$ is called a free variable, and if each free variable is assigned to an element of a structure, the formula makes a claim about them, which can be either true or false. For example, in a ring, $\exists y\; x\cdot y=1$ is true iff $x$ is a unit.
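In a finite structure, a formula with a free variable can be evaluated by brute force. As a small sketch (assuming the ring $\mathbb{Z}/12\mathbb{Z}$ as the structure; `is_unit` is a hypothetical helper name), the existential quantifier becomes a search over all elements:

```python
def is_unit(x, n):
    """Evaluate the formula 'exists y such that x*y = 1' in Z/nZ at the given x."""
    return any((x * y) % n == 1 % n for y in range(n))

# The set defined by the formula in Z/12Z: exactly the units.
units_mod_12 = [x for x in range(12) if is_unit(x, 12)]
print(units_mod_12)  # [1, 5, 7, 11]
```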

A first-order formula with no free variables is called a sentence. These are true or false statements about a first-order structure. Many types of mathematical objects are defined by listing first-order sentences that are true of them. For instance, a linear order is a structure with a relation $<$ satisfying transitivity ($\forall x\,\forall y\,\forall z\;(x<y\wedge y<z)\rightarrow x<z$), antisymmetry ($\forall x\,\forall y\;\neg(x<y\wedge y<x)$), and totality ($\forall x\,\forall y\;x<y\vee y<x\vee x=y$), and a linear order is dense without endpoints if it also satisfies $\forall x\,\forall y\;x<y\rightarrow\exists z\;(x<z\wedge z<y)$ and $\forall x\,\exists y\,\exists z\;(y<x\wedge x<z)$. These are all first-order sentences. Algebraically closed fields of a given characteristic are another example. The field axioms are first-order sentences. For each positive integer $n$, we can formulate a first-order sentence saying that every polynomial of degree $n$ has a root: $\forall a_0\,\dots\,\forall a_{n-1}\,\exists x\;x^n+a_{n-1}x^{n-1}+\dots+a_1x+a_0=0$ (the $a_i$s represent the coefficients, with the leading coefficient normalized to $1$). So we just add in these infinitely many sentences, one for each $n$. And we can say that the field has characteristic $p$ by saying $1+\dots+1=0$ (with $p$ ones), or say that it has characteristic $0$ by, for each prime $p$, saying $\neg(1+\dots+1=0)$ (with $p$ ones).

First-order sentences can tell us a lot about a structure, but not everything, unless the structure is finite.

Löwenheim–Skolem theorem: Given a countable set of first-order sentences (in particular, any set of sentences if the language is countable), if there is any infinite structure in which they are all true, then there are first-order structures of every infinite cardinality in which they are all true.

This is why the uniqueness results all have to say something about cardinality. You might also think of some examples of ways to identify an infinite mathematical object up to isomorphism with a list of axioms without saying directly anything about cardinality, but in all such cases, you’ll be using an axiom that isn’t first-order. For instance, all Dedekind-complete ordered fields are isomorphic to the reals, but Dedekind-completeness isn’t a first-order sentence. Same goes for any way of characterizing the natural numbers up to isomorphism that says something like “every set of natural numbers that contains 0 and is closed under successor contains all of the natural numbers”.

### Countable structures

Let’s go back to the example of countable dense linear orders without endpoints. If you don’t know the proof that all of them are isomorphic, here it goes: suppose we have two countable dense linear orders without endpoints, $X$ and $Y$. Since they’re countable, we can label the elements of each of them with distinct natural numbers. We’re going to match elements of $X$ to elements of $Y$ one at a time such that we get an isomorphism at the end. To ensure that every element of $X$ gets matched to something in $Y$, on odd-numbered steps, we’ll take the lowest-numbered element of $X$ that hasn’t been matched yet, and match it with an element of $Y$. Similarly, to ensure that every element of $Y$ gets matched to something in $X$, on even-numbered steps, we’ll take the lowest-numbered element of $Y$ that hasn’t been matched yet, and match it with an element of $X$. As for what we do on each step (suppose it’s an odd-numbered step; even-numbered steps are the same but with the roles of $X$ and $Y$ reversed), at the start of the step, finitely many elements of $X$ have already been matched. We take the first element that hasn’t yet been matched. Call it $x$. $x$ is either greater than all previously matched elements, less than all previously matched elements, or between two previously matched elements that don’t already have previously matched elements between them. Since $Y$ is dense and has no endpoints, we know that in the first case, there will be something greater than all previously matched elements of $Y$, so we can match $x$ to it; in the second case, there will be something less than all previously matched elements of $Y$ for us to match $x$ to; and in the third case, there will be something between the elements matched to the elements on either side of $x$, which we can match $x$ to. By doing this, we continue to preserve the ordering at each step, so the bijection we get at the end is order-preserving, and thus an isomorphism.
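The first few steps of this back-and-forth construction can be run on a computer. The following is a minimal sketch, assuming the two orders are the rationals and the dyadic rationals with particular enumerations chosen here for illustration; the function names (`rationals`, `dyadics`, `back_and_forth`) are hypothetical and not from the argument above.

```python
from fractions import Fraction
from itertools import count

def rationals():
    """Enumerate every rational exactly once (one arbitrary choice of labelling)."""
    seen = set()
    for s in count(1):                      # s = |numerator| + denominator
        for d in range(1, s + 1):
            for q in (Fraction(s - d, d), Fraction(d - s, d)):
                if q not in seen:
                    seen.add(q)
                    yield q

def dyadics():
    """Enumerate every dyadic rational k / 2**e exactly once."""
    seen = set()
    for s in count(0):
        for e in range(s + 1):
            for q in (Fraction(s - e, 2 ** e), Fraction(e - s, 2 ** e)):
                if q not in seen:
                    seen.add(q)
                    yield q

def back_and_forth(enum_x, enum_y, steps):
    """Run `steps` steps of back-and-forth; return the matched pairs (x, y)."""
    xs, ys, gx, gy = [], [], enum_x(), enum_y()
    pairs = []

    def nth(lst, gen, i):
        while len(lst) <= i:                # extend the enumerated prefix lazily
            lst.append(next(gen))
        return lst[i]

    def first_unmatched(lst, gen, matched):
        i = 0
        while nth(lst, gen, i) in matched:
            i += 1
        return lst[i]

    def find_partner(a, matched_pairs, lst, gen):
        # matched_pairs holds (s, t) with s on a's side and t on the other side.
        below = [t for s, t in matched_pairs if s < a]
        above = [t for s, t in matched_pairs if s > a]
        lo = max(below) if below else None
        hi = min(above) if above else None
        i = 0
        while True:                         # density and no endpoints guarantee a hit
            c = nth(lst, gen, i)
            if (lo is None or lo < c) and (hi is None or c < hi):
                return c
            i += 1

    for step in range(1, steps + 1):
        if step % 2 == 1:                   # odd step: next unmatched element of X
            a = first_unmatched(xs, gx, {x for x, _ in pairs})
            pairs.append((a, find_partner(a, pairs, ys, gy)))
        else:                               # even step: next unmatched element of Y
            b = first_unmatched(ys, gy, {y for _, y in pairs})
            a = find_partner(b, [(y, x) for x, y in pairs], xs, gx)
            pairs.append((a, b))
    return pairs

pairs = back_and_forth(rationals, dyadics, 20)
```

After any finite number of steps, `pairs` is an order-preserving partial matching; the isomorphism itself only exists in the limit.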

Now let’s get back to the warm-up exercise. A graph can be viewed as a first-order structure whose elements are the vertices, with a single binary relation $E$ (the edge relation) that is symmetric and anti-reflexive (symmetry and anti-reflexivity are both first-order conditions). There are some more first-order sentences satisfied by both of our random graphs with probability 1. Given any two finite disjoint sets of vertices, we can find another vertex that’s connected to everything in the first set and not connected to anything in the second set. This is because each vertex has the same positive probability of having this property, they’re all independent, and there’s infinitely many of them, so there also must be some (in fact, infinitely many) that have all the desired edges and none of the undesired edges. To write this condition using first-order sentences, for each natural number $n$ and $m$, we have a sentence

$$\forall x_1\,\dots\,\forall x_n\,\forall y_1\,\dots\,\forall y_m\;\left(\bigwedge_{i,j}x_i\neq y_j\right)\rightarrow\exists z\;\left(\bigwedge_{i}E(z,x_i)\right)\wedge\left(\bigwedge_{j}\neg E(z,y_j)\right)$$

(the big conjunction before “$\rightarrow$” includes $x_i\neq y_j$ for each $1\leq i\leq n$ and $1\leq j\leq m$, so that this says $\{x_1,\dots,x_n\}$ and $\{y_1,\dots,y_m\}$ are disjoint).

This is enough for us to construct an isomorphism, using essentially the same proof as for countable dense linear orders. Since we each started with countably many vertices, we can label each of our vertices with natural numbers, and then iteratively match the next available unmatched vertex in one graph to a vertex on the other, alternating between which graph we take the next available unmatched vertex from on each step, just like before. On each step, only finitely many vertices have been matched. The new vertex shares edges with some of the already matched vertices and doesn’t share edges with some others. We need to match it with a vertex in the other graph that shares exactly the same pattern of edges with previously matched vertices. And we know that somewhere in that graph, there must be such a vertex. So we can match the new vertex and keep going, and the bijection we get at the end preserves the edge relation, and is thus an isomorphism.
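The extension property that drives this argument can be seen concretely in a deterministic graph. The sketch below does not use the coin-flip graph; it assumes instead the BIT-predicate graph on the natural numbers (where $i<j$ are adjacent iff bit $i$ of $j$ is 1), a standard explicit graph with the same property. The names `edge` and `witness` are hypothetical.

```python
def edge(i, j):
    """BIT-predicate graph: i and j are adjacent iff bit min(i, j) of max(i, j) is 1."""
    if i == j:
        return False
    lo, hi = min(i, j), max(i, j)
    return (hi >> lo) & 1 == 1

def witness(connected, avoided):
    """Return a vertex adjacent to all of `connected` and to none of `avoided`."""
    connected, avoided = set(connected), set(avoided)
    assert not connected & avoided, "the two sets must be disjoint"
    # A bit position above every listed vertex keeps the witness larger than all of them.
    top = max(connected | avoided, default=0) + 1
    return sum(1 << x for x in connected) + (1 << top)

z = witness({0, 3, 5}, {1, 2, 4})
```

Because the witness is larger than every listed vertex, adjacency to each of them is read off from the bits of the witness, which were chosen to be 1 exactly on `connected`.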

For the general argument that these are both special cases of, we’ll need the concept of a type (not to be confused with the identically-named concept from type theory). Given a first-order structure $X$ and $c_1,\dots,c_n\in X$, say that $a,b\in X$ have the same type over $c_1,\dots,c_n$ if for every first-order formula $\varphi(x,y_1,\dots,y_n)$ (where $x,y_1,\dots,y_n$ are its free variables), $\varphi(a,c_1,\dots,c_n)$ holds iff $\varphi(b,c_1,\dots,c_n)$ does. So, for example, in a dense linear order without endpoints, if $c_1<a<c_2$, then, in order for $b$ to have the same type as $a$ over $c_1,c_2$, it must be the case that $c_1<b<c_2$ as well, since $y_1<x<y_2$ is a first-order formula, and $a$ and $b$ must satisfy exactly the same first-order formulas with parameters in $\{c_1,c_2\}$. And as it turns out, this is enough; if $c_1<a<c_2$ and $c_1<b<c_2$, then $a$ and $b$ have the same type over $c_1,c_2$. In an infinite random graph, if vertices $v$ and $w$ have the same type over some other vertices $u_1,\dots,u_n$, then $w$ must have an edge to each $u_i$ that $v$ has an edge to, and vice-versa. Again, this turns out to be enough to guarantee that they have the same type.

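In both of these cases, for any finite set of elements $c_1,\dots,c_n$, there are only finitely many types over $c_1,\dots,c_n$. Let’s count them. Each $c_i$ is its own type, since $x=y_i$ is a formula, so if $a$ and $c_i$ have the same type over $c_1,\dots,c_n$, then, since $c_i=c_i$, $a=c_i$ as well. Let’s ignore these and count the rest. In a dense linear order without endpoints, we can assume WLOG that $c_1<c_2<\dots<c_n$. There are $n+1$ nontrivial types over $c_1,\dots,c_n$: $\{a\mid a<c_1\}$, $\{a\mid a>c_n\}$, and, for each $1\leq i<n$, $\{a\mid c_i<a<c_{i+1}\}$. In an infinite random graph, there are $2^n$ nontrivial types over $n$ vertices $u_1,\dots,u_n$: for each $S\subseteq\{u_1,\dots,u_n\}$, there’s a type of vertices that have edges to everything in $S$ and no edges to anything in $\{u_1,\dots,u_n\}\setminus S$.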
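These two counts can be checked on small samples. This is a sketch only: it takes for granted the fact stated above that the position among the parameters (for orders) and the set of neighbours among the parameters (for graphs) determine the type, and it uses the rationals and the BIT-predicate graph as assumed stand-ins for the structures in the text. All function names are hypothetical.

```python
from fractions import Fraction

def dlo_type(a, cs):
    """Nontrivial type of a over cs in a dense linear order: how many parameters lie below a."""
    return sum(1 for c in cs if c < a)

def bit_edge(i, j):
    """BIT-predicate graph on the natural numbers (an assumed stand-in for a random graph)."""
    if i == j:
        return False
    lo, hi = min(i, j), max(i, j)
    return (hi >> lo) & 1 == 1

def graph_type(v, us, edge):
    """Nontrivial type of v over us: the set of parameters that v is adjacent to."""
    return frozenset(u for u in us if edge(v, u))

cs = [Fraction(0), Fraction(1), Fraction(5, 2)]
samples = [Fraction(k, 4) for k in range(-8, 20)]
dlo_types = {dlo_type(a, cs) for a in samples if a not in cs}

us = [0, 1, 2]
graph_types = {graph_type(v, us, bit_edge) for v in range(3, 64)}

print(len(dlo_types), len(graph_types))  # 4 8
```

With $n=3$ parameters, the sample realizes $n+1=4$ types in the order and $2^n=8$ types in the graph.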

Theorem: Let $X$ be a countably infinite first-order structure such that for every $n\in\mathbb{N}$ and $c_1,\dots,c_n\in X$, there are only finitely many types over $c_1,\dots,c_n$. Then every countable structure $Y$ satisfying the same first-order sentences that $X$ does is isomorphic to $X$.

That is, our ability to specify a countable structure (up to isomorphism) by its first-order properties corresponds exactly to the condition that there are only finitely many different behaviors that elements of the structure can have in relation to any given finite subset. The proofs that all countable dense linear orders without endpoints are isomorphic and that all countable random graphs are isomorphic look the same because they both follow the proof of this theorem, which goes like so:

Suppose there are only finitely many types over $c_1,\dots,c_n$. Let $p$ be one of those types, and let $q_1,\dots,q_k$ be the others. For each $q_i$, there’s some formula $\varphi_i(x,c_1,\dots,c_n)$ that’s true for $x\in p$ but not for $x\in q_i$. Then $\varphi_1(x,c_1,\dots,c_n)\wedge\dots\wedge\varphi_k(x,c_1,\dots,c_n)$ is a formula that holds only for $x\in p$. That is, the entire type is specified by a single formula; it wasn’t just coincidence that we were able to find such a formula for the types in each of our two examples.

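Lemma: If $X$ and $Y$ are two first-order structures in the same language which satisfy all the same sentences, $c_1,\dots,c_n\in X$ and $d_1,\dots,d_n\in Y$ satisfy all the same formulas (i.e., for any $\varphi(x_1,\dots,x_n)$, $\varphi(c_1,\dots,c_n)$ is true in $X$ iff $\varphi(d_1,\dots,d_n)$ is true in $Y$), and $X$ has only finitely many types over $c_1,\dots,c_n$, then there’s a natural bijection between types in $X$ over $c_1,\dots,c_n$ and types in $Y$ over $d_1,\dots,d_n$. A formula $\varphi(x,c_1,\dots,c_n)$ is true for $x$ in some type over $c_1,\dots,c_n$ iff $\varphi(x,d_1,\dots,d_n)$ is true for $x$ in the corresponding type over $d_1,\dots,d_n$.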

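Proof: Let $p_1,\dots,p_k$ be the types in $X$ over $c_1,\dots,c_n$, and for each $1\leq i\leq k$, let $\varphi_i(x,c_1,\dots,c_n)$ be a formula specifying that $p_i$ is the type of $x$. These formulas $\varphi_i(x,d_1,\dots,d_n)$ specify types in $Y$ over $d_1,\dots,d_n$ as well; for any other formula $\psi(x,y_1,\dots,y_n)$, either $\forall x\;\varphi_i(x,c_1,\dots,c_n)\rightarrow\psi(x,c_1,\dots,c_n)$ or $\forall x\;\varphi_i(x,c_1,\dots,c_n)\rightarrow\neg\psi(x,c_1,\dots,c_n)$ in $X$. These are first-order formulas, so again since $d_1,\dots,d_n$ satisfy the same first-order formulas that $c_1,\dots,c_n$ do, one of $\forall x\;\varphi_i(x,d_1,\dots,d_n)\rightarrow\psi(x,d_1,\dots,d_n)$ or $\forall x\;\varphi_i(x,d_1,\dots,d_n)\rightarrow\neg\psi(x,d_1,\dots,d_n)$ is true in $Y$ as well. So $\varphi_i(x,d_1,\dots,d_n)$ determines the truth value of every such formula; that is, it specifies the type of $x$ over $d_1,\dots,d_n$, and formulas are true in this type iff they are true in the corresponding type in $X$ over $c_1,\dots,c_n$. To show that these are all of the types in $Y$ over $d_1,\dots,d_n$, consider the formula $\forall x\;\varphi_1(x,y_1,\dots,y_n)\vee\dots\vee\varphi_k(x,y_1,\dots,y_n)$. In $X$, when we plug in $c_1,\dots,c_n$, the formula is true. And $d_1,\dots,d_n$ satisfies all the same formulas as $c_1,\dots,c_n$, so the same formula must also be true in $Y$ when we plug in $d_1,\dots,d_n$. That is, for any $b\in Y$, there’s some $i$ such that $\varphi_i(b,d_1,\dots,d_n)$ is true, so every element of $Y$ must have one of the above types.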

Armed with this lemma, we can prove the theorem. Let $X$ and $Y$ be countable structures satisfying the same first-order sentences, and suppose for every $c_1,\dots,c_n\in X$, there are only finitely many types over $c_1,\dots,c_n$. We’ll match elements of $X$ to elements of $Y$ one at a time, using the same back-and-forth trick from our two examples to ensure that we get a bijection at the end. After $n$ steps, we’ll have $c_1,\dots,c_n$ from $X$ matched with $d_1,\dots,d_n$ from $Y$, and we’ll want to ensure that $c_1,\dots,c_n$ and $d_1,\dots,d_n$ satisfy exactly the same first-order formulas. If we’ve done this, then on step $n+1$, we’ll have an element either of $X$ or of $Y$, which we need to match with some element of the other one. We can match it to an element that has the corresponding type; that is, we’re matching $c_{n+1}\in X$ and $d_{n+1}\in Y$ such that the type of $c_{n+1}$ over $c_1,\dots,c_n$ corresponds to the type of $d_{n+1}$ over $d_1,\dots,d_n$. Then $c_1,\dots,c_{n+1}$ satisfy the same formulas that $d_1,\dots,d_{n+1}$ do, so by induction, $c_1,\dots,c_n$ and $d_1,\dots,d_n$ satisfy the same formulas for every $n$ (the assumption that $X$ and $Y$ satisfy the same first-order sentences provides a base case). Thus, the bijection we get at the end preserves the truth-values of all formulas, so it is an isomorphism, and we’re done.

As it turns out, the converse of the theorem is also true. Given a set of first-order sentences for which there is, up to isomorphism, only one countable model, all models have only finitely many types over any finite list of elements. Whenever there’s infinitely many types, there will be some types (which cannot be specified by a single formula) that appear in some models but not in others.

### Uncountable structures

Let’s turn to the other example I introduced at the beginning: any two algebraically closed fields of the same characteristic with transcendence bases of the same cardinality are isomorphic. Every field has a transcendence basis, so a corollary of this is that any two uncountable algebraically closed fields of the same characteristic and cardinality are isomorphic.

A sketch of the proof: Given algebraically closed fields $F$ and $K$ of the same characteristic, with transcendence bases $B_F$ and $B_K$ of the same cardinality, any isomorphism between a subfield of $F$ and a subfield of $K$ extends to a maximal such isomorphism (by Zorn’s lemma). Since $B_F$ and $B_K$ have the same cardinality, there’s a bijection between them, and since $F$ and $K$ have the same characteristic, this bijection extends to an isomorphism between the fields they generate. Thus there is a maximal isomorphism between subfields of $F$ and of $K$ which restricts to a bijection between $B_F$ and $B_K$. Now we just need to show that these subfields are all of $F$ and $K$. This is because, given any such isomorphism between subfields $F'\subseteq F$ and $K'\subseteq K$ with $B_F\subseteq F'$ and $B_K\subseteq K'$, if $F'\neq F$, then let $a\in F\setminus F'$, and let $f(x)\in F'[x]$ be the minimal polynomial of $a$. Applying the isomorphism to the coefficients gives us an irreducible polynomial over $K'$, which must have a root $b\in K$, and then by matching $a$ with $b$, we get an isomorphism $F'[a]\cong K'[b]$, contradicting maximality of the isomorphism.

Here’s another example: Any two vector spaces over the same field, with bases of the same cardinality, are isomorphic. Since every vector space has a basis, a corollary of this is that, over a countable field, any two uncountable vector spaces of the same cardinality are isomorphic. Citing Zorn’s lemma is overkill, since there’s only one way to extend a bijection between bases to an isomorphism. But the basic idea is the same in each case: We have an appropriate notion of basis, and we extend a bijection between bases to an isomorphism. And vector spaces are also first-order structures; the language has a binary operation $+$, a constant $0$, and, for each scalar $\alpha$, a unary operation for multiplication by $\alpha$.

The thing that unites both these cases is called strong minimality. A first-order structure is called minimal if every set defined by a first-order formula is either finite, or the complement of a finite set. More formally: $X$ is minimal if for every formula $\varphi(x,y_1,\dots,y_n)$ and $c_1,\dots,c_n\in X$, one of $\{a\in X\mid\varphi(a,c_1,\dots,c_n)\}$ or $\{a\in X\mid\neg\varphi(a,c_1,\dots,c_n)\}$ is finite. We call a structure strongly minimal if every structure satisfying the same first-order sentences is also minimal. (This turns out to be equivalent to, for each $\varphi$, there’s a finite upper bound on the size of whichever of $\{a\in X\mid\varphi(a,c_1,\dots,c_n)\}$ or $\{a\in X\mid\neg\varphi(a,c_1,\dots,c_n)\}$ is finite, as $c_1,\dots,c_n$ vary.)

Let’s go over the general notion of “basis” we’ll be using: Say that $a$ is algebraic over $c_1,\dots,c_n$ if there is a formula $\varphi(x,y_1,\dots,y_n)$ such that $\varphi(a,c_1,\dots,c_n)$ holds, and $\{b\in X\mid\varphi(b,c_1,\dots,c_n)\}$ is finite. In algebraically closed fields, this corresponds to the usual notion of $a$ being algebraic over the subfield generated by $c_1,\dots,c_n$. In vector spaces, this corresponds to $a$ being a linear combination of $c_1,\dots,c_n$. Call $B\subseteq X$ independent if no element of $B$ is algebraic over any other elements of $B$. In other words, you can’t pin down an element of $B$ to one of finitely many possibilities by using a single formula and other elements of $B$. In a vector space, independence is linear independence. In an algebraically closed field, independence is algebraic independence. Now call $B$ a basis if it is a maximal independent set. If $X$ is minimal, this turns out to imply that every $a\in X$ is algebraic over some $b_1,\dots,b_n\in B$. An increasing union of independent sets is independent, so by Zorn’s lemma, every structure has a basis.

Now let’s look at type spaces in minimal structures. Let $X$ be a minimal structure and $c_1,\dots,c_n\in X$. If $a$ is algebraic over $c_1,\dots,c_n$, then there’s some formula $\varphi(x,y_1,\dots,y_n)$ such that $\varphi(a,c_1,\dots,c_n)$ holds and $\{b\in X\mid\varphi(b,c_1,\dots,c_n)\}$ is as small as possible. Then $a$ and $b$ have the same type iff $\varphi(b,c_1,\dots,c_n)$. So this type is implied by a single formula, and there are only finitely many elements of this type. There’s only one remaining type: the type of elements that aren’t algebraic over $c_1,\dots,c_n$. If $a$ and $b$ are both non-algebraic over $c_1,\dots,c_n$, then for every formula $\varphi(x,y_1,\dots,y_n)$, since $X$ is minimal, one of $\varphi(x,c_1,\dots,c_n)$ and $\neg\varphi(x,c_1,\dots,c_n)$ must have only finitely many solutions $x$; $a$ and $b$, being non-algebraic, must both be solutions to the other one. This shows they have the same type over $c_1,\dots,c_n$. This non-algebraic type is optional; in some cases, there might not be any elements that aren’t algebraic over $c_1,\dots,c_n$.

Let $X$ and $Y$ be minimal structures in the same language, which satisfy the same first-order sentences. They each have a basis. If those bases have the same cardinality, then $X$ and $Y$ are isomorphic. Say a “partial isomorphism” between $X$ and $Y$ is a bijection between a subset of $X$ and a subset of $Y$, such that whenever a formula is true about some elements of the subset of $X$, then it is also true about the corresponding elements of the subset of $Y$, and vice-versa. If $B_X$ and $B_Y$ are bases for $X$ and $Y$, respectively, then a bijection between $B_X$ and $B_Y$ is a partial isomorphism (this is because if $b_1,\dots,b_n\in B_X$ and $d_1,\dots,d_n\in B_Y$ satisfy all the same formulas, and $b_{n+1}\in B_X\setminus\{b_1,\dots,b_n\}$ and $d_{n+1}\in B_Y\setminus\{d_1,\dots,d_n\}$, then $b_{n+1}$ must have the unique non-algebraic type over $b_1,\dots,b_n$, $d_{n+1}$ has the unique non-algebraic type over $d_1,\dots,d_n$, and these unique non-algebraic types satisfy the same formulas, so it follows by induction on the number of variables that a formula is true of distinct elements of $B_X$ iff it is true of distinct elements of $B_Y$). An increasing union of partial isomorphisms is a partial isomorphism, so by Zorn’s lemma, there’s a maximal partial isomorphism extending a bijection between $B_X$ and $B_Y$. If this maximal partial isomorphism is a bijection between $X'\subseteq X$ and $Y'\subseteq Y$, and $X'\neq X$, then let $a\in X\setminus X'$. $a$ is algebraic over $X'$ (since $B_X\subseteq X'$), so there’s a single formula $\varphi(x,c_1,\dots,c_n)$ ($c_1,\dots,c_n\in X'$) that is true for $x=a$, and which determines its type over $X'$ (meaning, determines its type over $c'_1,\dots,c'_m$ for every $c'_1,\dots,c'_m\in X'$). Then, where $d_1,\dots,d_n\in Y'$ correspond to $c_1,\dots,c_n$ under the partial isomorphism, there must be $b\in Y$ such that $\varphi(b,d_1,\dots,d_n)$ (since $d_1,\dots,d_n$ satisfies the same formulas $c_1,\dots,c_n$ do, and $\exists x\;\varphi(x,c_1,\dots,c_n)$). $b\notin Y'$, because this can be expressed as part of the type of $b$ over $Y'$, which is the same as the type of $a$ over $X'$. Thus we can extend the partial isomorphism by matching $a$ with $b$. Thus, in our maximal partial isomorphism, $X'=X$, and for the same reason, $Y'=Y$, so it is an isomorphism.

So for a strongly minimal structure, the structures satisfying the same sentences are classified by the cardinality of a basis. This isn’t quite the end of the story; in some cases, a structure with too small a basis would be finite, and we could thus distinguish it from the rest with a first-order sentence saying that there are $n$ distinct elements (for large enough $n$). This isn’t the case for algebraically closed fields, which are infinite even when the empty set is a transcendence basis. But for vector spaces, the empty basis generates a one-element vector space, so an infinite vector space must have basis of size at least one.

And if the vector space is over a finite field, then its basis must be infinite. Another case where the basis must be infinite is an infinite set. A set is a first-order structure in the language with no relations, no functions, and no constants. Every subset of a set is independent, so a basis for the set is just the entire set. In these cases where a basis must be infinite, there’s only one (up to isomorphism) countable model: the model with a countably infinite basis. You can check that both of these examples satisfy the finitely-many-types condition from the previous section for having a unique countable model.

So the general story, for a strongly minimal structure $X$, is that there is some $n\in\mathbb{N}\cup\{\aleph_0\}$ such that structures satisfying the same sentences as $X$ are classified by cardinalities that are at least $n$, that being the cardinality of a basis. In a countable language, the cardinality of a structure is the maximum of $\aleph_0$ and the cardinality of a basis, so it follows that an uncountable strongly minimal structure is isomorphic to all structures of the same cardinality satisfying the same sentences.

In the previous section, we had a converse, so you may ask, if an uncountable structure is isomorphic to all structures of the same cardinality satisfying the same sentences, is it strongly minimal? This is not quite true. For example, consider a vector space over a countable field, where we add two unary relations and to the language, each of which define subspaces of , which are disjoint and span , and then add a unary function to the language, which is a linear function such that , and is an isomorphism between and . Vector spaces like this are classified by the dimension of , so there is a unique one (up to isomorphism) of any given uncountable cardinality. It is not strongly minimal because itself is a formula picking out a set that is neither finite nor the complement of a finite set. But it is almost strongly minimal, in the sense that it is basically just the vector space , and is strongly minimal. It turns out that for any uncountable structure (in a finite or countable language) that is isomorphic to every structure of the same cardinality and satisfying the same sentences, there’s a formula defining a subset that is strongly minimal in an appropriate sense, such that the rest of the structure can be parameterized somehow using the subset.

## Exact 2-cycles are degenerate isomorphisms

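The situation in which you have vector spaces $V$ and $W$, and linear maps $f: V \to W$ and $g: W \to V$ such that $\ker(f) = \operatorname{im}(g)$ and $\ker(g) = \operatorname{im}(f)$ often arises in the situation in which you would have an isomorphism between $V$ and $W$ if you knew how to divide by $0$. Specifically, this happens when you’d need to divide by $0$ exactly once; in similar situations in which you’d need to know how to divide by $0$ multiple times in order to get an isomorphism, you get $f$ and $g$ such that $f \circ g = 0$ and $g \circ f = 0$ but whose kernels and images are not necessarily equal.

I’ll call such a pair $(f, g)$ with $\ker(f) = \operatorname{im}(g)$ and $\ker(g) = \operatorname{im}(f)$ an exact 2-cycle of vector spaces. Note that the two vector spaces $V$ and $W$ in an exact 2-cycle are in fact isomorphic, as $\dim(V) = \dim(\ker(f)) + \dim(\operatorname{im}(f)) = \dim(\operatorname{im}(g)) + \dim(\ker(g)) = \dim(W)$.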

### Adjugates

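Given a finite-dimensional vector space $V$ and an invertible linear map $f: V \to V$, its adjugate $\operatorname{adj}(f)$ is almost its inverse; you just have to divide by $\det(f)$. If $f$ is not invertible, then of course, $\det(f) = 0$, so dividing by $\det(f)$ doesn’t work. But if $f$ has nullity $1$, then $\ker(f) = \operatorname{im}(\operatorname{adj}(f))$ and $\ker(\operatorname{adj}(f)) = \operatorname{im}(f)$. That is, $(f, \operatorname{adj}(f))$ is an exact 2-cycle. If $f$ has nullity $\geq 2$, then $\operatorname{adj}(f) = 0$, and hence inverting $f$ requires dividing by $0$ more than once, and $(f, \operatorname{adj}(f))$ is no longer an exact 2-cycle, since $\ker(\operatorname{adj}(f)) = V \neq \operatorname{im}(f)$.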
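As a sanity check, here is a minimal Python sketch of the nullity-1 and nullity-2 cases using integer matrices. The helper names (`adjugate`, `rank`, `is_exact_2_cycle`) and the two example matrices are my own choices for illustration. Since both maps are endomorphisms of the same $n$-dimensional space, exactness is checked as: both products vanish and the two ranks add up to $n$.

```python
from fractions import Fraction


def minor(m, i, j):
    """Matrix m with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 0:
        return 1
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def adjugate(m):
    """Transpose of the cofactor matrix."""
    n = len(m)
    return [[(-1) ** (i + j) * det(minor(m, j, i)) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rank(m):
    """Rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def is_zero(m):
    return all(x == 0 for row in m for x in row)


def is_exact_2_cycle(a, b):
    """For n x n matrices: a*b = 0, b*a = 0, and rank(a) + rank(b) = n."""
    return (is_zero(matmul(a, b)) and is_zero(matmul(b, a))
            and rank(a) + rank(b) == len(a))


nullity_one = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # rank 2
nullity_two = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]   # rank 1

exact_when_nullity_one = is_exact_2_cycle(nullity_one, adjugate(nullity_one))
exact_when_nullity_two = is_exact_2_cycle(nullity_two, adjugate(nullity_two))
```

The first matrix has rank 2 and its adjugate has rank 1, so the pair passes the check. The second has rank 1, so its adjugate is the zero matrix and the check fails.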

### Homogeneous polynomials and multilinear forms

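Given a vector space $V$ over a field $k$, let $Q(V)$ denote the space of quadratic forms on $V$ (that is, homogeneous quadratic polynomial maps $V \to k$), and let $S(V)$ denote the space of symmetric bilinear forms on $V$.

Given a symmetric bilinear form $b$ on $V$, we can construct a quadratic form $q$ on $V$ by $q(v) = b(v, v)$. This gives us a map $f: S(V) \to Q(V)$ by $f(b)(v) = b(v, v)$.

$f(b)(v + w) - f(b)(v) - f(b)(w) = 2b(v, w)$, so we can recover $b$ from $f(b)$ by $b(v, w) = \frac{1}{2}\left(f(b)(v + w) - f(b)(v) - f(b)(w)\right)$. That is, the map $g: Q(V) \to S(V)$ given by $g(q)(v, w) = q(v + w) - q(v) - q(w)$ is twice the inverse of $f$.

This doesn’t quite work if $\operatorname{char}(k) = 2$, since we can’t do the part where we divide by $2$. In fact, $f$ is not invertible in this case. But $g$ is still a well-defined map $Q(V) \to S(V)$, and it’s still true that $f \circ g = 2\,\mathrm{id}_{Q(V)}$ and $g \circ f = 2\,\mathrm{id}_{S(V)}$; it’s just that now that means $f \circ g = 0$ and $g \circ f = 0$. In fact, $\ker(f) = \operatorname{im}(g)$ and $\ker(g) = \operatorname{im}(f)$. Where $n = \dim V$, $\operatorname{im}(f)$ and $\ker(g)$ are the $n$-dimensional space of diagonal quadratic forms (polynomials that are linear combinations of squares of linear functions $V \to k$), and $\operatorname{im}(g)$ and $\ker(f)$ are the $\binom{n}{2}$-dimensional space of alternating symmetric bilinear forms. Thus $Q(V)$ and $S(V)$ are both $\binom{n+1}{2}$-dimensional.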
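The characteristic 2 case is small enough to check by brute force. Below is a sketch over the two-element field with a 2-dimensional space, where a quadratic form $ax^2 + bxy + cy^2$ is stored as the coefficient triple `(a, b, c)` and a symmetric bilinear form as the triple `(p, r, s)` of matrix entries. The encodings and the names `to_quadratic` and `to_bilinear` are assumptions of mine for illustration: `to_quadratic` sends a bilinear form to "plug the same vector in twice", and `to_bilinear` sends a quadratic form $q$ to $(u, v) \mapsto q(u + v) - q(u) - q(v)$.

```python
from itertools import product

P = 2  # work over the field with two elements
VECTORS = list(product(range(P), repeat=2))
QUADRATIC_FORMS = list(product(range(P), repeat=3))   # (a, b, c)
BILINEAR_FORMS = list(product(range(P), repeat=3))    # (p, r, s)


def quad_eval(q, v):
    a, b, c = q
    x, y = v
    return (a * x * x + b * x * y + c * y * y) % P


def bil_eval(form, u, v):
    p, r, s = form
    return (p * u[0] * v[0] + r * (u[0] * v[1] + u[1] * v[0]) + s * u[1] * v[1]) % P


def to_quadratic(form):
    """The quadratic form v -> form(v, v), found by matching values."""
    return next(q for q in QUADRATIC_FORMS
                if all(quad_eval(q, v) == bil_eval(form, v, v) for v in VECTORS))


def to_bilinear(q):
    """The bilinear form (u, v) -> q(u + v) - q(u) - q(v), found by matching values."""
    def polar(u, v):
        w = ((u[0] + v[0]) % P, (u[1] + v[1]) % P)
        return (quad_eval(q, w) - quad_eval(q, u) - quad_eval(q, v)) % P
    return next(form for form in BILINEAR_FORMS
                if all(bil_eval(form, u, v) == polar(u, v)
                       for u in VECTORS for v in VECTORS))


zero = (0, 0, 0)
kernel_to_quadratic = {form for form in BILINEAR_FORMS if to_quadratic(form) == zero}
image_to_quadratic = {to_quadratic(form) for form in BILINEAR_FORMS}
kernel_to_bilinear = {q for q in QUADRATIC_FORMS if to_bilinear(q) == zero}
image_to_bilinear = {to_bilinear(q) for q in QUADRATIC_FORMS}
```

Here the diagonal quadratic forms make up a 2-dimensional space (4 elements), the alternating bilinear forms a 1-dimensional one (2 elements), and both full spaces have 8 elements.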

Similar things happen with higher degree homogeneous polynomials and symmetric multilinear forms. Let be the space of homogeneous degree- polynomials on and the space of symmetric -linear forms on . We have functions given by and given by . and , so if or , then and are bijections, and times each others’ inverse. Otherwise, and . If , then divides with multiplicity , and and . If , then divides with multiplicity , and all bets are off. Though , no matter what is.

### Newtonian spacetime

In special relativity, we work with a 4-dimensional (3 for space and 1 for time) real vector space , with a symmetric bilinear form , called the Minkowski inner product, of signature ; that is, the associated quadratic form can be given, in coordinates, by ( is the time coordinate and are spatial coordinates for some reference frame). If , then is spacelike, and measures its distance (in the reference frame in which its temporal coordinate is ). If , then is timelike, and measures its duration (in the reference frame in which it is at rest). By currying, the Minkowski inner product can be seen as a linear map , where is the vector space of linear maps . Since the Minkowski inner product is nondegenerate, this linear map is an isomorphism.

In Newtonian physics, things are a little different. We can still work in 4-dimensional spacetime, but we don’t have a single Minkowski inner product measuring both distance and duration. We do have a global notion of time; that is, there’s a linear map that tells you what time it is at each point in spacetime. is space in the present moment, so it should be Euclidean space; that is, it should be equipped with an ordinary inner product.

The time function induces a degenerate inner product on by . As before, this can be seen as a linear map (it sends to ), with 1-dimensional image and 3-dimensional kernel .

The ordinary inner product on gives us a degenerate inner product on : since our inner product on is non-degenerate, it induces an isomorphism between and its dual, and hence induces an inner product on . There’s a canonical map given by restriction: . So given , we can define their inner product to be the spatial inner product of their restrictions to . This can be seen as a linear map (given , restrict it to , and then find the element of that corresponds to it via the spatial inner product) with image and kernel . We have thus found canonical maps and such that the kernel of each is the image of the other.

### Why?

In the spacetime example, it is conventional in special relativity to normalize the speed of light to . But another thing we can do is let the speed of light be the variable . So . As a map , this is . The inverse map is , or, as an inner product on , . We’re going to want to take a limit as and get something finite, so we’ll have to scale our inner product on down by a factor of , giving us , or, as a map , . The limit gives us our temporal inner product on Newtonian spacetime, , and our spatial inner product on the dual space , giving us our exact 2-cycle of maps between and , and . (I did say that this should only work if we have to divide by once, not if we must do so twice, and this involved , but we never used on its own anywhere, so we can just say , and it’s fine).

Let’s go back to the first example. Given of nullity , perturb slightly to make it invertible by adding an infinitesimal times some map . The only condition we need to satisfy is . That way , which must be a multiple of , is not a multiple of . . Clearly . Given , , so . Hence . Since has constant term but nonzero coefficient of , can be evaluated at , and has a nonzero, finite value. Then . So forms an exact 2-cycle for reasons closely relating to the fact that perturbing each of them infinitesimally can make them inverses up to an infinitesimal scalar multiple.

Now, in the second example, where is a vector space over a field of positive characteristic, , and we have an exact -cycle , , let be an integral domain of characteristic with a unique maximal ideal , such that and (for instance, if , we can use and ). Lift to a free -module with (in coordinates, this means, instead of , work with , which carries a natural map to by reducing each coordinate mod ). Then there are natural maps and such that and , and and reduce mod to and , respectively. Where is the field of fractions of (so in our example with and ), and are bijections (in coordinates, , and tensoring a map with just means the same map extended over the field of fractions), as they are inverses of each other up to a multiple of , which is invertible in . Since , , and and . Given , if we lift to , . Since , , and hence , and of course, . Reducing mod , we get . Thus . Similarly, given , lift to . . , and . Reducing mod , we get . Thus . So forms an exact cycle because, in , they are inverses up to a factor of , which we can divide by, and which is with multiplicity in , since .

### The general story

All three arguments from the previous section took the following form: Let be a discrete valuation ring with residue field , field of fractions , and valuation . Let and be free -modules, and let and be such that and , for some with . Then and are isomorphisms, and each is times the inverse of the other. and form an exact 2-cycle: they compose to because and compose to , which goes to in , and given such that , we can lift to . , so , and , so tensoring with sends to some such that . Thus . The same argument with and switched shows . The exact 2-cycle is a sort of shadow of the isomorphisms .

In the spacetime example, , , , and . In the adjugates example, , , and the in the general story is . In the homogeneous polynomials and symmetric multilinear forms example, is a discretely valued field of characteristic with residue field , is its valuation ring, and .

All exact 2-cycles of vector spaces can be fit into this general story. Given any exact 2-cycle , (, vector spaces over ), we can take a discretely valued field with residue field , and then lift to with for some with , exactly the conditions in the above argument.

### What more?

What about exact 2-cycles in abelian categories other than vector spaces? In general, the two objects in an exact 2-cycle need not be isomorphic. For instance, with abelian groups, there’s an exact 2-cycle between the 4-element cyclic group and the Klein four-group. Though two objects in an exact 2-cycle must be isomorphic in any category in which every short exact sequence splits (this is the gist of the dimension-counting argument from the beginning showing that two vector spaces in an exact 2-cycle must be isomorphic). Is there still some way of seeing exact 2-cycles as degenerate isomorphisms even in contexts in which there need not be actual isomorphisms?
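One concrete choice of maps realizing the exact 2-cycle between the cyclic group of order 4 and the Klein four-group can be checked by enumeration. The particular maps below (reduce mod 2 into the first coordinate, and double the second coordinate) are my own choice for illustration; other choices work too.

```python
from itertools import product

Z4 = list(range(4))                       # cyclic group of order 4
K4 = list(product(range(2), repeat=2))    # Klein four-group as pairs of bits


def add_z4(x, y):
    return (x + y) % 4


def add_k4(u, v):
    return ((u[0] + v[0]) % 2, (u[1] + v[1]) % 2)


def f(x):
    """Z4 -> K4: reduce mod 2 into the first coordinate."""
    return (x % 2, 0)


def g(v):
    """K4 -> Z4: send the second coordinate to 0 or 2."""
    return (2 * v[1]) % 4


f_is_hom = all(f(add_z4(x, y)) == add_k4(f(x), f(y)) for x in Z4 for y in Z4)
g_is_hom = all(g(add_k4(u, v)) == add_z4(g(u), g(v)) for u in K4 for v in K4)

ker_f = {x for x in Z4 if f(x) == (0, 0)}
im_f = {f(x) for x in Z4}
ker_g = {v for v in K4 if g(v) == 0}
im_g = {g(v) for v in K4}

# The groups are not isomorphic: doubling kills everything in K4 but not in Z4.
doubling_kills_z4 = all(add_z4(x, x) == 0 for x in Z4)
doubling_kills_k4 = all(add_k4(v, v) == (0, 0) for v in K4)
```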

Also, what about exact $n$-cycles? That is, a cycle of functions such that the image of each is the kernel of the next. If an exact 2-cycle is a degenerate form of an isomorphism, and an isomorphism is an exact sequence of length 2, then perhaps an exact 3-cycle should be a degenerate form of an exact sequence of length 3 (i.e. a short exact sequence). This is hard to picture, as a short exact sequence is not symmetric between its objects. However, for reasons not understood by me, algebraic topologists care about exact 3-cycles in which two of the three objects involved are the same (these are called exact couples), and this apparently has something to do with short exact sequences in which the first two objects are isomorphic, which provides some support for the idea that exact 3-cycles should have something to do with short exact sequences. An exact sequence of length 1 just consists of the $0$ object, so this suggests an exact 1-cycle (i.e. an endomorphism of an object whose kernel and image are the same) should be considered a degenerate form of the $0$ object, which is also hard to picture.

## Metamathematics and probability

Content warning: mathematical logic.

Note: This write-up consists mainly of open questions rather than results, but may contain errors anyway.

### Setup

I’d like to describe a logic for talking about probabilities of logical sentences. Fix some first-order language . This logic deals with pairs , which I’m calling assertions, where is a formula and . Such a pair is to be interpreted as a claim that has probability at least .

A theory consists of a set of assertions. A model of a theory consists of a probability space whose points are -structures, such that for every assertion , , where is inner probability. I’ll write for can be proved from , and for all models of are also models of .

The rules of inference are all rules where is a finite set of assertions, and is an assertion such that in all models of . Can we make an explicit finite list of inference rules that generate this logic? If not, is the set of inference rules at least recursively enumerable? (For recursive enumerability to make sense here, we need to restrict attention to probabilities in some countable dense subset of that has a natural explicit bijection with , such as .) I’m going to assume later that the set of inference rules is recursively enumerable; if it isn’t, everything should still work if we use some recursively enumerable subset of the inference rules that includes all of the ones that I use.

Note that the compactness theorem fails for this logic; for example, , but no finite subset of implies , and hence .

Any classical first-order theory can be converted into a theory in this logic as .

### Löb’s Theorem

Let be a consistent, recursively axiomatizable extension of Peano Arithmetic. By the usual sort of construction, there is a binary predicate such that for any sentence and , where is a coding of sentences with natural numbers. We have a probabilistic analog of Löb’s theorem: if , then . Peano arithmetic can prove this theorem, in the sense that .

Proof: Assume . By the diagonal lemma, there is a sentence such that . If , then and , so . This shows that . By the assumption that , this implies that . By a probabilistic version of the deduction theorem, . That is, . Going back around through all that again, we get .

If we change the assumption to be that for some , then the above proof does not go through (if , then it does, because ). Is there a consistent theory extending Peano Arithmetic that proves a soundness schema about itself, , or can this be used to derive a contradiction some other way? If there is no such consistent theory, then can the soundness schema be modified so that it is consistent, while still being nontrivial? If there is such a consistent theory with a soundness schema, can the theory also be sound? That is actually several questions, because there are multiple things I could mean by “sound”. The possible syntactic things “sound” could mean, in decreasing order of strictness, are: 1) The theory does not assert a positive probability to any sentence that is false in . 2) There is an upper bound below for all probabilities asserted of sentences that are false in . 3) The theory does not assert probability to any sentence that is false in .

There are also semantic versions of the above questions, which are at least as strict as their syntactic analogs, but probably aren’t equivalent to them, since the compactness theorem does not hold. The semantic version of asking if the soundness schema is consistent is asking if it has a model. The first two soundness notions also have semantic analogs. 1′) is a model of the theory. 2′) There is a model of the theory that assigns positive probability to . I don’t have a semantic version of 3, but metaphorically speaking, a semantic version of 3 should mean that there is a model that assigns nonzero probability density at , even though it might not have a point mass at .

### Motivation

This is somewhat similar to Definability of Truth in Probabilistic Logic. But in place of adding a probability predicate to the language, I’m only changing the metalanguage to refer to probabilities, and using this to express statements about probability in the language through conventional metamathematics. An advantage of this approach is that it’s constructive. Theories with the properties described by the Christiano et al paper are unsound, so if some reasonably strong notion of soundness applies to an extension of Peano Arithmetic with the soundness schema I described, that would be another advantage of my approach.

A type of situation that this might be useful for is that when an agent is reasoning about what actions it will take in the future, it should be able to trust its future self’s reasoning. An agent with the soundness schema can assume that its future self’s beliefs are accurate, up to arbitrarily small loss in precision. A related type of situation is if an agent reaches some conclusion, and then writes it to external storage instead of its own memory, and later reads the claim it had written to external storage. With the soundness schema, if the agent has reason to believe that the external storage hasn’t been tampered with, it can reason that since its past self had derived the claim, the claim is to be trusted arbitrarily close to as much as it would have been if the agent had remembered it internally.

### First Incompleteness Theorem

For a consistent theory , say that a sentence is -measurable if there is some such that for every and for every . So -measurability essentially means that pins down the probability of the sentence. If is not -measurable, then you could say that has Knightian uncertainty about . Say that is complete if every sentence is -measurable. Essentially, complete theories assign a probability to every sentence, while incomplete theories have Knightian uncertainty.

The first incompleteness theorem (that no recursively axiomatizable extension of PA is consistent and complete) holds in this setting. In fact, for every consistent recursively axiomatizable extension of PA, there must be sentences that are given neither a nontrivial upper bound nor a nontrivial lower bound on their probability. Otherwise, we would be able to recursively separate the theorems of PA from the negations of theorems of PA, by picking some recursive enumeration of assertions of the theory, and sorting sentences by whether they are first given a nontrivial lower bound or first given a nontrivial upper bound; theorems of PA will only be given a nontrivial lower bound, and their negations will only be given a nontrivial upper bound. [Thanks to Sam Eisenstat for pointing this out; I had somehow managed not to notice this on my own.]

For an explicit example of a sentence for which no nontrivial bounds on its probability can be established, use the diagonal lemma to construct a sentence which is provably equivalent to “for every proof of for any , there is a proof of for some with smaller Gödel number.”

Thus a considerable amount of Knightian uncertainty is inevitable in this framework. Dogmatic Bayesians such as myself might find this unsatisfying, but I suspect that any attempt to unify probability and first-order arithmetic will suffer similar problems.

### A side note on model theory and compactness

I’m a bit unnerved about the compactness theorem failing. It occurred to me that it might be possible to fix this by letting models use hyperreal probabilities. Problem is, the hyperreals aren’t complete, so the countable additivity axiom for probability measures doesn’t mean anything, and it’s unclear what a hyperreal-valued probability measure is. One possible solution is to drop countable additivity, and allow finitely-additive hyperreal-valued probability measures, but I’m worried that the logic might not even be sound for such models.

A different possible solution to this is to take a countably complete ultrafilter on a set , and use probabilities valued in the ultrapower . Despite not being Cauchy complete, it inherits a notion of convergence of sequences from , since a sequence can be said to converge to , and this is well-defined (if is for a -large set of indices ) by countable completeness. Thus the countable additivity axiom makes sense for -valued probability measures. Allowing models to use -valued probability measures might make the compactness theorem work. [Edit: This doesn’t work, because . To see this, it is enough to show that is Archimedean, since has no proper Archimedean extensions. Given , let for . , so by countable completeness of , there is some such that , and thus .]

## Complexity classes of natural numbers (googology for ultrafinitists)

Ultrafinitists think common ways of defining extremely large numbers don’t actually refer to numbers that exist. For example, most ultrafinitists would maintain that a googolplex isn’t a number. But to a classical mathematician, while numbers like a googolplex are far larger than the numbers we deal with on a day-to-day basis like 10, both numbers have the same ontological status. In this post, I want to consider a compromise position, that any number we can define can be meaningfully reasoned about, but that a special status is afforded to the sorts of numbers that ultrafinitists can accept.

Specifically, define an “ultrafinite number” to be a natural number that it is physically possible to express in unary. This isn’t very precise, since there are all sorts of things that “physically possible to express in unary” could mean, but let’s just not worry about that too much. Also, many ultrafinitists would not insist that numbers must be expressible in such an austere language as unary, but I’m about to get to that.

Examples: is an ultrafinite number, because , where is the successor function. 80,000 is also an ultrafinite number, but it is a large one, and it isn’t worth demonstrating its ultrafiniteness. A googol is not ultrafinite. The observable universe isn’t even big enough to contain a googol written in unary.

Now, define a “polynomially finite number” to be a natural number that it is physically possible to express using addition and multiplication. Binary and decimal are basically just concise ways of expressing certain sequences of addition and multiplication operations. For instance, “123” means $(1 \times 10 + 2) \times 10 + 3$. Conversely, if you multiply an $m$-digit number with an $n$-digit number, you get an at most $(m+n)$-digit number, which is the same number of symbols it took to write down “[the $m$-digit number] times [the $n$-digit number]” in the first place, so any number that can be written using addition and multiplication can be written in decimal. Thus, another way to define polynomially finite numbers is as the numbers that it is physically possible to express in binary or in decimal. I’ve been ignoring some small constant factors that might make these definitions not quite equivalent, but any plausible candidate for a counterexample would be an ambiguous edge case according to each definition anyway, so I’m not worried about that. Many ultrafinitists may see something more like polynomially finite number, rather than ultrafinite number, as a good description of what numbers exist.
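To make the "decimal is just addition and multiplication" point concrete, here is a small Python sketch that unfolds a decimal numeral into an explicit expression using only `+` and `*`. The function names are hypothetical, and the expression treats `10` and the single digits as given constants.

```python
def horner_expression(numeral: str) -> str:
    """Rewrite a decimal numeral as nested additions and multiplications by 10."""
    expression = numeral[0]
    for digit in numeral[1:]:
        expression = f"({expression}*10+{digit})"
    return expression


def horner_value(numeral: str) -> int:
    """Evaluate a decimal numeral using only + and * on small constants."""
    value = 0
    for digit in numeral:
        value = value * 10 + "0123456789".index(digit)
    return value


example_expression = horner_expression("123")
googol = horner_value("1" + "0" * 100)
```

Each extra digit adds a fixed number of symbols to the expression, so the two notations have lengths within a constant factor of each other.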

Examples: A googol is polynomially finite, because a googol is 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000. A googolplex is not polynomially finite, because it would require a googol digits to express in decimal, which is physically impossible.

Define an “elementarily finite number” to be a number that it is physically possible to express using addition, multiplication, subtraction, exponentiation, and the integer division function . Elementarily finite is much broader than polynomially finite, so it might make sense to look at intermediate classes. Say a number is “exponentially finite” if it is physically possible to express using the above operations but without any nested exponentiation (e.g. is okay, but is not). More generally, we can say that a number is “-exponentially finite” if it can be expressed with exponentiation nested to depth at most , so a polynomially finite number is a -exponentially finite number, an exponentially finite number is a -exponentially finite number, and an elementarily finite number is a number that is -exponentially finite for some (or equivalently, for some ultrafinite ).

Examples: a googolplex is exponentially finite, because it is . Thus a googolduplex, meaning , is -exponentially finite, but it is not exponentially finite. For examples of non-elementarily finite numbers, and numbers that are only -exponentially finite for fairly large , I’ll use up-arrow notation. just means , means , where is the number of copies of , and using order of operations that starts on the right. So , which is certainly polynomially finite, and could also be ultrafinite depending on what is meant by “physically possible” (a human cannot possibly count that high, but a computer with a large enough hard drive can store in unary). , where there are threes in that tower. Under the assumptions that imply is ultrafinite, is elementarily finite. Specifically, it is -exponentially finite, but I’m pretty sure it’s not -exponentially finite, or even -exponentially finite. , and is certainly not elementarily finite.
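For readers who prefer code to towers of symbols, here is a direct recursive sketch of up-arrow notation in Python. It uses the standard recursion $a \uparrow^n b = a \uparrow^{n-1} (a \uparrow^n (b - 1))$ with $a \uparrow^n 0 = 1$ and $a \uparrow^1 b = a^b$; the function name and argument order are my own choices. It is only usable for tiny inputs, which is rather the point.

```python
def up_arrow(a: int, arrows: int, b: int) -> int:
    """Compute a followed by `arrows` up-arrows followed by b (arrows >= 1, b >= 0)."""
    if arrows == 1:
        return a ** b
    if b == 0:
        return 1
    return up_arrow(a, arrows - 1, up_arrow(a, arrows, b - 1))


three_cubed = up_arrow(3, 1, 3)        # 3^3
small_tower = up_arrow(3, 2, 3)        # 3^(3^3)
two_triple_three = up_arrow(2, 3, 3)   # a power tower of four 2s
```

Calling `up_arrow(3, 3, 3)` would ask for a tower of `small_tower` threes, so do not try it.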

Interestingly, even though a googolplex is exponentially finite, there are numbers less than a googolplex that are not. There’s an easy nonconstructive proof of this: in order to be able to represent every number less than a googolplex in any encoding scheme at all, there has to be some number less than a googolplex that requires at least a googol decimal digits of information to express. But it is physically impossible to store a googol decimal digits of information. Therefore for any encoding scheme for numbers, there is some number less than a googolplex that cannot physically be expressed in it. This is why the definition of elementarily finite is significantly more complicated than the definition of polynomially finite; in the polynomial case, if can be expressed using addition and multiplication and , then can also be expressed using addition and multiplication, so there’s no need for additional operations to construct smaller numbers, but in the elementary case, the operations of subtraction and integer division are useful for expressing more numbers, and are simpler than exponentiation. For example, these let us express the number that you get from reading off the last googol digits, or the first googol digits, of , so these numbers are elementarily finite. However, it is exceptionally unlikely that the number you get from reading off the first googol decimal digits of is elementarily finite. But for a difficult exercise, show that the number you get from reading off the last googol decimal digits of is elementarily finite.

Why stop there instead of including more operations for getting smaller numbers, like , which I implicitly used when I told you that the number formed by the first googol digits of is elementarily finite? We don’t have to. The functions that you can get by composition from addition, multiplication, exponentiation, , and coincide with the functions that can be computed in iterated exponential time (meaning time, for some height of that tower). So if you have any remotely close to efficient way to compute an operation, it can be expressed in terms of the operations I already specified.

We can go farther. Consider a programming language that has the basic arithmetic operations, if/else clauses, and loops, where the number of iterations of each loop must be fixed in advance. The programs that can be written in such a language are the primitive recursive functions. Say that a number is primitive recursively finite if it is physically possible to write a program (that does not take any input) in this language that outputs it. For each fixed , the binary function is primitive recursive, so is primitive recursively finite. But the ternary function is not primitive recursive, so is not primitive recursively finite.

The primitive recursively finite numbers can be put in a hierarchy of subclasses based on the depth of nested loops that are needed to express them. If the only arithmetic operation available is the successor function (from which other operations can be defined using loops), then the elementarily finite numbers are those that can be expressed with loops nested to depth at most 2. The $n$-exponentially finite numbers should roughly correspond to the numbers that can be expressed with at most $n$ loops at depth 2.
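Here is a sketch of what bounded loops buy you, with Python standing in for the restricted language (an assumption on my part; the function names are hypothetical). `range(x)` is evaluated once when a loop is entered, so the number of iterations is fixed in advance even though the loop body changes `x`. The only arithmetic used is the successor function.

```python
def succ(x: int) -> int:
    return x + 1


def double(x: int) -> int:
    """One loop at depth 1: runs x times, adding 1 each time."""
    for _ in range(x):
        x = succ(x)
    return x


def power_of_two(n: int) -> int:
    """One loop nested to depth 2: doubles x a total of n times."""
    x = 1
    for _ in range(n):
        for _ in range(x):      # iteration count fixed at loop entry
            x = succ(x)
    return x


def double_exponential(n: int) -> int:
    """Two depth-2 loops in sequence: computes 2^(2^n)."""
    y = 1
    for _ in range(n):
        for _ in range(y):
            y = succ(y)
    x = 1
    for _ in range(y):
        for _ in range(x):
            x = succ(x)
    return x
```

A run of depth-1 doublings behaves like writing a number in binary, a single depth-2 loop already gives exponentiation, and each further depth-2 loop adds a level to the tower.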

Next come the provably computably finite numbers. Say that a number is provably computably finite if it is physically possible to write a program in a Turing-complete language that outputs the number (taking no input), together with a proof in Peano Arithmetic that the program halts. The famous Graham’s number is provably computably finite. Graham’s number is defined in terms of a function $g$, defined recursively as $g(0) = 4$ and $g(n+1) = 3 \uparrow^{g(n)} 3$. Graham’s number is $g(64)$. You could write a computer program to compute $g$, and prove that $g$ is total using Peano arithmetic. By replacing Peano arithmetic with other formal systems, you can get other variations on the notion of provably computably finite.

For an example of a number that is not provably computably finite, I’ll use the hydra game, which is described here. There is no proof in Peano arithmetic (that can physically be written down) that it is possible to win the hydra game starting from the complete binary tree of depth a googol. So the number of turns it takes to win the hydra game on the complete binary tree of depth a googol is not provably computably finite. If you start with a reasonably small hydra (say, with 100 nodes), you could write a program to search for the shortest winning strategy, and prove in Peano arithmetic that it succeeds in finding one, if you’re sufficiently clever and determined, and you use a computer to help you search for proofs. The proof you’d get out of this endeavor would be profoundly unenlightening, but the point is, the number of turns it takes to win the hydra game for a small hydra is provably computably finite (but not primitive recursively finite, except in certain trivial special cases).

Next we’ll drop the provability requirement, and say that a number is computably finite if it is physically possible to write a computer program that computes it from no input. Of course, in order to describe a computably finite number, you need the program you use to actually halt, so you’d need some argument that it does halt in order to establish that you’re describing a computably finite number. Thus this is arguably just a variation on provably computably finite, where Peano arithmetic is replaced by some unspecified strong theory encompassing the sort of reasoning that classical mathematicians tend to endorse. This is probably the point where even the most patient of ultrafinitists would roll their eyes in disgust, but oh well. Anyway, the number of steps that it takes to win the hydra game starting from the complete binary tree of depth a googol is a computably finite number, because there exists a shortest winning strategy, and you can write a computer program to exhaustively search for it.

The busy-beaver function $BB$ is defined so that $BB(n)$ is the longest any Turing machine with $n$ states runs before halting (among those that do halt). $BB(10^{100})$ is not computably finite, because Turing machines with a googol states cannot be explicitly described, and since the busy-beaver function is very fast-growing, no smaller Turing machine has comparable behavior. What about $BB(10000)$? Turing machines with 10,000 states are not too big to describe explicitly, so it may be tempting to say that $BB(10000)$ is computably finite. But on the other hand, it is not possible to search through all Turing machines with 10,000 states and find the one that runs the longest before halting. No matter how hard you search and no matter how clever your heuristics for finding Turing machines that run for exceptionally long and then halt, it is vanishingly unlikely that you will find the 10,000-state Turing machine that runs longest before halting, let alone realize that you have found it. And the idea is to use classical reasoning for large numbers themselves, but constructive reasoning for descriptions of large numbers. So since it is pretty much impossible to actually write a program that outputs $BB(10000)$, it is not computably finite.

For a class that can handle busy-beaver numbers too, let’s turn to the arithmetically finite numbers. These are the numbers that are defined by arithmetical formulas. These form a natural hierarchy, where the -finite numbers are the numbers defined by arithmetical formulas with at most unbounded quantifiers starting with , the -finite numbers are the numbers defined by arithmetical formulas with at most unbounded quantifiers starting with , and the -finite numbers are those that are both -finite and -finite. The -finite numbers are the same as the computably finite numbers. is -finite, because it is defined by “ every Turing machine with states that halts in at most steps halts in at most steps, and there is a Turing machine with states that halts in exactly steps.” Everything after the first quantifier in that formula is computable. is -finite, but no lower than that. To get a number that is not arithmetically finite, consider the function given by is the largest number defined by an arithmetical formula with symbols. is -finite, but is not arithmetically finite. I’ll stop there.

## Principal Component Analysis in Theory and Practice

Prerequisites for this post are linear algebra, including tensors, and basic probability theory. Already knowing how PCA works will also be helpful. In section 1, I’ll summarize the technique of principal component analysis (PCA), stubbornly doing so in a coordinate-free manner, partly because I am an asshole but mostly because it is rhetorically useful for emphasizing my point in section 2. In section 2, I’ll gripe about how PCA is often used in ways that shouldn’t be expected to work, but works just fine anyway. In section 3, I’ll discuss some useless but potentially amusing ways that PCA could be modified. Thanks to Laurens Gunnarsen for inspiring this post by talking to me about the problem that I discuss in section 2.

### A brief introduction to Principal Component Analysis

You start with a finite-dimensional real inner product space $V$ and a probability distribution $\mu$ on $V$. Actually, you probably just started with a large finite number of elements of $V$, and you’ve inferred a probability distribution that you’re supposing they came from, but that difference is not important here. The goal is to find the $n$-dimensional (for some $n \leq \dim V$) affine subspace $W_n \subseteq V$ minimizing the expected squared distance between a vector (distributed according to $\mu$) and its orthogonal projection onto $W_n$. We can assume without loss of generality that the mean of $\mu$ is $0$, because we can just shift any probability distribution by its mean and get a probability distribution with mean $0$. This is useful because then $W_n$ will be a linear subspace of $V$. In fact, we will solve this problem for all $n \leq \dim V$ simultaneously by finding an ordered orthonormal basis such that $W_n$ is the span of the first $n$ basis elements.

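First you take $\operatorname{Cov}_\mu \in V \otimes V$, called the covariance of $\mu$, defined as the bilinear form on $V^*$ given by $\operatorname{Cov}_\mu(\varphi, \psi) = \int_V \varphi(x)\,\psi(x)\,d\mu(x)$. From this, we get the covariance operator $C_\mu \in V^* \otimes V$ by raising the first index, which means starting with $\langle\cdot,\cdot\rangle \otimes \operatorname{Cov}_\mu \in V^* \otimes V^* \otimes V \otimes V$ and performing a tensor contraction (in other words, $C_\mu$ is obtained from $\operatorname{Cov}_\mu$ by applying the map $V \to V^*$ given by the inner product to the first index). $\operatorname{Cov}_\mu$ is symmetric and positive semi-definite, so $C_\mu$ is self-adjoint and positive semi-definite, and hence $V$ has an orthonormal basis of eigenvectors of $C_\mu$, with non-negative real eigenvalues. This gives an orthonormal basis in which $\operatorname{Cov}_\mu$ is diagonal, where the diagonal entries are the eigenvalues. Ordering the eigenvectors in decreasing order of the corresponding eigenvalues gives us the desired ordered orthonormal basis.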
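
In coordinates, the procedure above looks roughly like the following NumPy sketch. It assumes the coordinate axes are orthonormal (so the inner product is the ordinary dot product) and that the distribution is the empirical distribution of a sample; the function name `pca` and the synthetic data are made up for illustration.

```python
import numpy as np

def pca(samples):
    """Return (eigenvalues, basis) of the sample covariance, largest eigenvalue first.

    Rows of `samples` are data points; columns of `basis` are the principal components.
    """
    centered = samples - samples.mean(axis=0)      # shift so the mean is 0
    cov = centered.T @ centered / len(centered)    # covariance matrix in these coordinates
    eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric matrix: orthonormal eigenbasis
    order = np.argsort(eigvals)[::-1]              # decreasing order of eigenvalue
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.2])
eigvals, basis = pca(samples)

n = 2
W = basis[:, :n]                                   # spans the best n-dimensional subspace
centered = samples - samples.mean(axis=0)
residual = centered - centered @ W @ W.T           # vector minus its orthogonal projection
# The mean squared distance to the projection equals the sum of the discarded eigenvalues.
print(np.mean(np.sum(residual**2, axis=1)), eigvals[n:].sum())
```

The two printed numbers agree up to rounding, which is the sense in which the first $n$ eigenvectors solve the minimization problem for every $n$ at once.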

### The problem

There’s no problem with principal component analysis as I described it above. It works just fine, and in fact is quite beautiful. But often people apply principal component analysis to probability distributions on finite-dimensional real vector spaces that don’t have a natural inner product structure. There are two closely related problems with this: First, the goal is underdefined. We want to find a projection onto an $n$-dimensional subspace that minimizes the expected squared distance from a vector to its projection, but we don’t have a measure of distance. Second, the procedure is underdefined. The covariance $\operatorname{Cov}_\mu$ is a bilinear form, not a linear operator, so it doesn’t have eigenvectors or eigenvalues, and we don’t have a way of raising an index to produce something that does. It should come as no surprise that these two problems arise together. After all, you shouldn’t be able to find a fully specified solution to an underspecified problem.

People will apply principal component analysis in such cases by picking an inner product. This solves the second problem, since it allows you to carry out the procedure. But it does not solve the first problem. If you wanted to find a projection onto an $n$-dimensional subspace such that the distance from a vector to its projection tends to be small, then you must have already had some notion of distance in mind by which to judge success. Haphazardly picking an inner product gives you a new notion of distance, and then allows you to find an optimal solution with respect to that new notion. It is not clear to me why you should expect this solution to be reasonable with respect to the notion of distance that you actually care about.

In fact, it’s worse than that. Of course, principal component analysis can’t give you literally any ordered basis at all, but it is almost as bad. The thing that you use PCA for is the projection onto the span of the first $n$ basis elements along the span of the rest. These projections only depend on the sequence of $1$-dimensional subspaces spanned by the basis elements, and not the basis elements themselves. That is, we might as well only pay attention to the principal components up to scale, rather than making sure that they are all unit length. Let a “coordinate system” refer to an ordered basis, where two ordered bases are considered equivalent if they differ only by scaling the basis vectors, so that we’re paying attention to the coordinate systems given to us by PCA. If the covariance of $\mu$ is nondegenerate, then the set of coordinate systems that can be obtained from principal component analysis by a suitable choice of inner product is dense in the space of coordinate systems. More generally, where $W$ is the smallest subspace of $V$ such that $\mu(W) = 1$, the space of coordinate systems that you can get from principal component analysis is dense in the space of all coordinate systems whose first $\dim W$ coordinates span $W$ ($\dim W$ will be the rank of the covariance of $\mu$). So in a sense, for suitably poor choices of inner product, principal component analysis can give you arbitrarily terrible results, subject only to the weak constraint that it will always notice if all of the vectors in your sample belong to a common subspace.
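
The dependence on the inner product is easy to see numerically. The sketch below assumes the inner product is given by a symmetric positive-definite Gram matrix `gram` in some fixed coordinates, so that the covariance operator is the matrix `cov @ gram`; the function name and the example matrices are mine, chosen only for illustration.

```python
import numpy as np

def pca_with_inner_product(cov, gram):
    """PCA for covariance matrix `cov` under the inner product <u, v> = u @ gram @ v.

    Returns the eigenvalues (descending) and a gram-orthonormal eigenbasis (columns)
    of the covariance operator cov @ gram.
    """
    L = np.linalg.cholesky(gram)                  # gram = L @ L.T
    eigvals, U = np.linalg.eigh(L.T @ cov @ L)    # symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]
    V = np.linalg.solve(L.T, U[:, order])         # v = L^{-T} u is an eigenvector of cov @ gram
    return eigvals[order], V

cov = np.array([[2.0, 1.0], [1.0, 2.0]])          # one fixed covariance
for gram in (np.eye(2), np.diag([1.0, 100.0]), np.diag([100.0, 1.0])):
    eigvals, V = pca_with_inner_product(cov, gram)
    first = V[:, 0] / np.linalg.norm(V[:, 0])     # first principal direction, up to scale
    print(np.diag(gram), first)
```

The same covariance produces three different first principal directions (each determined only up to sign), one for each choice of inner product. These examples only rescale the axes; Gram matrices with off-diagonal entries give even more freedom.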

It is thus somewhat mysterious that machine learning people often seem to get good results from principal component analysis, apparently without being very careful about the inner product they choose. Vector spaces that arise in machine learning seem to almost always come with a set of preferred coordinate axes, so these axes are taken to be orthogonal, leaving only the question of how to scale them relative to each other. If these axes are all labeled with the same units, then this also gives you a way of scaling them relative to each other, and hence an inner product. If they aren’t, then I’m under the impression that the most popular method is to normalize them such that the pushforward of $\mu$ along each coordinate axis has the same variance. This is unsatisfying, since figuring out which axes have enough variance along them to be worth paying attention to seems like the sort of thing that you would want principal component analysis to be able to tell you. Normalizing the axes in this way seems to me like an admission that you don’t know exactly what question you’re hoping to use principal component analysis to answer, so you just tell it not to answer that part of the question to minimize the risk of asking it to answer the wrong question, and let it focus on telling you how the axes, which you’re pretty sure should be considered orthogonal, correlate with each other.
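
Here is a small sketch of that normalization, under the assumption that “normalize” means dividing each centered coordinate by its sample standard deviation, so that PCA is effectively run on the correlation matrix. The data and the names `metres` and `millimetres` are invented for the example.

```python
import numpy as np

def standardized_pca(samples):
    """PCA after rescaling every coordinate axis to unit variance."""
    centered = samples - samples.mean(axis=0)
    scaled = centered / centered.std(axis=0)       # unit variance along each axis
    corr = scaled.T @ scaled / len(scaled)         # this is the correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.6, 0.2], [0.0, 0.8, 0.5], [0.0, 0.0, 0.3]])
metres = rng.normal(size=(1000, 3)) @ mix
millimetres = metres * np.array([1.0, 1000.0, 1.0])   # change units on the second axis only

vals_m, _ = standardized_pca(metres)
vals_mm, _ = standardized_pca(millimetres)
print(vals_m)
print(vals_mm)
```

The two spectra agree up to rounding, so the result no longer depends on the units of each axis. The price is that the eigenvalues always sum to the number of axes, so the analysis can only report how the axes correlate, not which of them carries more variance.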

That conservatism is actually pretty understandable, because figuring out how to ask the right question seems hard. You implicitly have some metric $d$ on $V$ such that you want to find a projection $\pi$ onto an $n$-dimensional subspace such that $d(x, \pi(x))$ is usually small when $x$ is distributed according to $\mu$. This metric is probably very difficult to describe explicitly, and might not be the metric induced by any inner product (for that matter, it might not even be a metric; $d(x, y)$ could be any way of quantifying how bad it is to be told the value $y$ when the correct value you wanted to know is $x$). Even if you somehow manage to explicitly describe your metric, coming up with a version of PCA with the inner product replaced with an arbitrary metric also sounds hard, so the next thing you would want to do is fit an inner product to the metric.

The usual approach is essentially to skip the step of attempting to explicitly describe the metric, and just find an inner product that roughly approximates your implicit metric based on some rough heuristics about what the implicit metric probably looks like. The fact that these heuristics usually work so well seems to indicate that the implicit metric tends to be fairly tame with respect to ways of describing the data that we find most natural. Perhaps this shouldn’t be too surprising, but I still feel like this explanation does not make it obvious a priori that this should work so well in practice. It might be interesting to look into why these heuristics work as well as they do with more precision, and how to go about fitting a better inner product to implicit metrics. Perhaps this has been done, and I just haven’t found it.

To take a concrete example, consider eigenfaces, the principal components that you get from a set of images of people’s faces. Here, you start with the coordinates in which each coordinate axis represents a pixel in the image, and the value of that coordinate is the brightness of the corresponding pixel. By declaring that the coordinate axes are orthogonal, and measuring the brightness of each pixel on the same scale, we get our inner product, which is arguably a fairly natural one.

Presumably, the implicit metric we’re using here is visual distance, by which I mean a measure of how similar two images look. It seems clear to me that visual distance is not very well approximated by our inner product, and in fact, there is no norm such that the visual distance between two images is approximately the norm of their difference. To see this, if you take an image and make it brighter, you haven’t changed how it looks very much, so the visual distance between the image and its brighter version is small. But their difference is just a dimmer version of the same image, and if you add that difference to a completely different image, you will get the two images superimposed on top of each other, a fairly radical change. Thus the visual distance traversed by adding a vector depends on where you start from.

Despite this, producing eigenfaces by using PCA on images of faces, using the inner product described above, performs well with respect to visual distance, in the sense that you can project the images onto a relatively small number of principal components and leave them still recognizable. I think this can be explained on an intuitive level. In a human eye, each photoreceptor has a narrow receptive field that it detects light in, much like a pixel, so the representation of an image in the eye as patterns of photoreceptor activity is very similar to the representation of an image in a computer as a vector of pixel brightnesses, and the inner product metric is a reasonable measure of distance in this representation. When the visual cortex processes this information from the eye, it is difficult (and perhaps also not useful) for it to make vast distinctions between images that are close to each other according to the inner product metric, and thus result in similar patterns of photoreceptor activity in the eye. Thus the visual distance between two images cannot be too much greater than their inner product distance, and hence changing an image by a small amount according to the inner product metric can only change it by a small amount according to visual distance, even though the reverse is not true.

### Generalizations

The serious part of this post is now over. Let’s have some fun. Some of the following ways of modifying principal component analysis could be combined, but I’ll consider them one at a time for simplicity.

As hinted at above, you could start with an arbitrary metric on rather than an inner product, and try to find the rank- projection (for some ) that minimizes the expected squared distance from a vector to its projection. This would probably be difficult, messy, and not that much like principal component analysis. If it can be done, it would be useful in practice if we were much better at fitting explicit metrics to our implicit metrics than at fitting inner products to our implicit metrics, but I’m under the impression that this is not currently the case. This also differs from the other proposals in this section in that it is a modification of the problem looking for a solution, rather than a modification of the solution looking for a problem.

could be a real Hilbert space that is not necessarily finite-dimensional. Here we can run into the problem that might not even have any eigenvectors. However, if (which hopefully was not inferred from a finite sample) is Gaussian (and possibly also under weaker conditions), then is a compact operator, so does have an orthonormal basis of eigenvectors of , which still have non-negative eigenvalues. There probably aren’t any guarantees you can get about the order-type of this orthonormal basis when you order the eigenvectors in decreasing order of their eigenvalues, and there probably isn’t a sense in which the orthogonal projection onto the closure of the span of an initial segment of the basis accounts for the most variance of any closed subspace of the same “size” (“size” would have to refer to a refinement of the notion of dimension for this to be the case). However, a weaker statement is probably still true: namely that each orthonormal basis element maximizes the variance that it accounts for conditioned on values along the previous orthonormal basis elements. I guess considering infinite-dimensional vector spaces goes against the spirit of machine learning though.

could be a finite-dimensional complex inner product space. would be the sesquilinear form on given by . , so , and applying a tensor contraction to the conjugated indices gives us our covariance operator (in other words, the inner product gives us an isomorphism , and applying this to the first index of gives us ). is still self-adjoint and positive semi-definite, so still has an orthonormal basis of eigenvectors with non-negative real eigenvalues, and we can order the basis in decreasing order of the eigenvalues. Analogously to the real case, projecting onto the span of the first basis vectors along the span of the rest is the complex rank- projection that minimizes the expected squared distance from a vector to its projection. As far as I know, machine learning tends to deal with real data, but if you have complex data and for some reason you want to project onto a lower-dimensional complex subspace without losing too much information, now you know what to do.
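
A minimal numerical sketch of the complex case, assuming the standard Hermitian inner product on $\mathbb{C}^3$ and an empirical distribution. The convention for which index is conjugated is my choice here and may differ from the one intended above; the synthetic data is made up.

```python
import numpy as np

rng = np.random.default_rng(2)
mix = np.array([[2.0, 0.5j, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.1]])
z = (rng.normal(size=(400, 3)) + 1j * rng.normal(size=(400, 3))) @ mix
z = z - z.mean(axis=0)                        # rows are centered complex data points

cov = z.T @ z.conj() / len(z)                 # Hermitian: cov[i, j] = mean(z_i * conj(z_j))
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh accepts Hermitian input; eigenvalues are real
order = np.argsort(eigvals)[::-1]
eigvals, basis = eigvals[order], eigvecs[:, order]

n = 1
P = basis[:, :n] @ basis[:, :n].conj().T      # orthogonal projection onto the first n components
residual = z - z @ P.T                        # each row minus its projection
print(np.mean(np.sum(np.abs(residual)**2, axis=1)), eigvals[n:].sum())
```

As in the real case, the mean squared distance to the projection matches the sum of the discarded eigenvalues, and those eigenvalues are real and non-negative even though the data is complex.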

Suppose your sample consists of events, where you’ve labeled them with both their spatial location and the time at which they occurred. In this case, events are represented as points in Minkowski space, a four-dimensional vector space representing flat spacetime, which is equipped with a nondegenerate symmetric bilinear form called the Minkowski inner product, even though it is not an inner product because it is not positive-definite. Instead, the Minkowski inner product is such that $\langle v, v \rangle$ is positive if $v$ is a space-like vector, negative if $v$ is time-like, and zero if $v$ is light-like. We can still get the covariance operator $C_\mu$ out of the covariance $\operatorname{Cov}_\mu$ and the Minkowski inner product in the same way, and $V$ has a basis of eigenvectors of $C_\mu$, and we can still order the basis in decreasing order of their eigenvalues. The first 3 eigenvectors will be space-like, with non-negative eigenvalues, and the last eigenvector will be time-like, with a non-positive eigenvalue. The eigenvectors are still orthogonal. Thus principal component analysis provides us with a reference frame in which the span of the first 3 eigenvectors is simultaneous, and the span of the last eigenvector is motionless. If $\mu$ is Gaussian, then this will be the reference frame in which the spatial position of an event and the time at which it occurs are mean independent of each other, meaning that conditioning on one of them doesn’t change the expected value of the other one. For general $\mu$, there might not be a reference frame in which the space and time of an event are mean independent, but the reference frame given to you by principal component analysis is still the unique reference frame with the property that the time coordinate is uncorrelated with any spatial coordinate.
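
The Minkowski case can be checked numerically as well. This sketch assumes coordinates $(x, y, z, t)$ with the form $\operatorname{diag}(1, 1, 1, -1)$ and a positive-definite covariance matrix; the function name `minkowski_pca` and the random covariance are illustrative only. It goes through a symmetric square root so that a symmetric eigensolver can be used.

```python
import numpy as np

eta = np.diag([1.0, 1.0, 1.0, -1.0])   # Minkowski form: space-like positive, time-like negative

def minkowski_pca(cov):
    """Eigen-decompose the operator cov @ eta for a positive-definite covariance `cov`."""
    w, Q = np.linalg.eigh(cov)
    root = Q @ np.diag(np.sqrt(w)) @ Q.T             # symmetric square root of cov
    eigvals, U = np.linalg.eigh(root @ eta @ root)   # symmetric, same spectrum as cov @ eta
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], root @ U[:, order]        # columns are eigenvectors of cov @ eta

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
cov = A @ A.T + 0.1 * np.eye(4)                      # an arbitrary positive-definite covariance

eigvals, frame = minkowski_pca(cov)
print(eigvals)                                       # three positive, then one negative
print(np.round(frame.T @ eta @ frame, 6))            # diagonal: eigenvectors are eta-orthogonal

inv = np.linalg.inv(frame)
print(np.round(inv @ cov @ inv.T, 6))                # covariance of the coordinates in the new frame
```

The second matrix is diagonal with three positive entries and one negative entry, so the first three eigenvectors are space-like and the last is time-like. The third matrix is diagonal too, which is the statement that the time coordinate is uncorrelated with every spatial coordinate in this frame.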

More generally, we could consider equipped with any symmetric bilinear form taking the role of the inner product. Without loss of generality, we can consider only nondegenerate symmetric bilinear forms, because in the general case, where , applying principal component analysis with is equivalent to projecting the data onto , applying principal component analysis there with the nondegenerate symmetric bilinear form on induced by , and then lifting back to and throwing in a basis for with eigenvalues at the end, essentially treating as the space of completely irrelevant distinctions between data points that we intend to immediately forget about. Anyway, nondegenerate symmetric bilinear forms are classified up to isomorphism by their signature , which is such that any orthogonal basis contains exactly basis elements, of which are space-like and of which are time-like, using the convention that is space-like if , time-like if , and light-like if , as above. Using principal component analysis on probability distributions over points in spacetime (or rather, points in the tangent space to spacetime at a point, so that it is a vector space) in a universe with spatial dimensions and temporal dimensions still gives you a reference frame in which the span of the first basis vectors is simultaneous and the span of the last basis vectors is motionless, and this is again the unique reference frame in which each time coordinate is uncorrelated with each spatial coordinate. Incidentally, I’ve heard that much of physics still works with multiple temporal dimensions. I don’t know what that would mean, except that I think it means there’s something wrong with my intuitive understanding of time. But that’s another story. 
Anyway, the spaces spanned by the first and by the last basis vectors could be used to establish a reference frame, and then the data might be projected onto the first few (at most ) and last few (at most ) coordinates to approximate the positions of the events in space and in time, respectively, in that reference frame.

## Ordered algebraic geometry

Edit: Shortly after posting this, I found where the machinery I develop here was discussed in the literature. Real Algebraic Geometry by Bochnak, Coste, and Roy covers at least most of this material. I may eventually edit this to clean it up and adopt more standard notation, but don’t hold your breath.

### Introduction

In algebraic geometry, an affine algebraic set is a subset of which is the set of solutions to some finite set of polynomials. Since all ideals of are finitely generated, this is equivalent to saying that an affine algebraic set is a subset of which is the set of solutions to some arbitrary set of polynomials.

In semialgebraic geometry, a closed semialgebraic set is a subset of of the form for some finite set of polynomials . Unlike in the case of affine algebraic sets, if is an arbitrary set of polynomials, is not necessarily a closed semialgebraic set. As a result of this, the collection of closed semialgebraic sets are not the closed sets of a topology on . In the topology on generated by closed semialgebraic sets being closed, the closed sets are the sets of the form for arbitrary . Semialgebraic geometry usually restricts itself to the study of semialgebraic sets, but here I wish to consider all the closed sets of this topology. Notice that closed semialgebraic sets are also closed in the standard topology, so the standard topology is a refinement of this one. Notice also that the open ball of radius centered at is the complement of the closed semialgebraic set , and these open balls are a basis for the standard topology, so this topology is a refinement of the standard one. Thus, the topology I have defined is exactly the standard topology on .

In algebra, instead of referring to a set of polynomials, it is often nicer to talk about the ideal generated by that set instead. What is the analog of an ideal in ordered algebra? It’s this thing:

Definition: If is a partially ordered commutative ring, a cone in is a subsemiring of which contains all positive elements, and such that is an ideal of . By “subsemiring”, I mean a subset that contains and , and is closed under addition and multiplication (but not necessarily negation). If , the cone generated by , denoted , is the smallest cone containing . Given a cone , the ideal will be called the interior ideal of , and denoted .

is partially ordered by . If is a set of polynomials and , then . Thus I can consider closed sets to be defined by cones. We now have a Galois connection between cones of and subsets of , given by, for a cone , its positive-set is (I’m calling it the “positive-set” even though it is where the polynomials are all non-negative, because “non-negative-set” is kind of a mouthful), and for , its cone is . is closure in the standard topology on (the analog in algebraic geometry is closure in the Zariski topology on ). A closed set is semialgebraic if and only if it is the positive-set of a finitely-generated cone.

### Quotients by cones, and coordinate rings

An affine algebraic set is associated with its coordinate ring . We can do something analogous for closed subsets of .

Definition: If is a partially ordered commutative ring and is a cone, is the ring , equipped with the partial order given by if and only if , for .

Definition: If is closed, the coordinate ring of is . This is the ring of functions that are restrictions of polynomials, ordered by if and only if . For arbitrary , the ring of regular functions on , denoted , consists of functions on that are locally ratios of polynomials, again ordered by if and only if . Assigning its ring of regular functions to each open subset of endows with a sheaf of partially ordered commutative rings.

For closed , , and this inclusion is generally proper, both because it is possible to divide by polynomials that do not have roots in , and because may be disconnected, making it possible to have functions given by different polynomials on different connected components.

### Positivstellensätze

What is ? The Nullstellensatz says that its analog in algebraic geometry is the radical of an ideal. As such, we could say that the radical of a cone , denoted , is , and that a cone is radical if . In algebraic geometry, the Nullstellensatz shows that a notion of radical ideal defined without reference to algebraic sets in fact characterizes the ideals which are closed in the corresponding Galois connection. It would be nice to have a description of the radical of a cone that does not refer to the Galois connection. There is a semialgebraic analog of the Nullstellensatz, but it does not quite characterize radical cones.

Positivstellensatz 1: If is a finitely-generated cone and is a polynomial, then if and only if such that .

There are two ways in which this is unsatisfactory: first, it applies only to finitely-generated cones, and second, it tells us exactly which polynomials are strictly positive everywhere on a closed semialgebraic set, whereas we want to know which polynomials are non-negative everywhere on a set.

The second problem is easier to handle: a polynomial is non-negative everywhere on a set if and only if there is a decreasing sequence of polynomials converging to such that each is strictly positive everywhere on . Thus, to find , it is enough to first find all the polynomials that are strictly positive everywhere on , and then take the closure under lower limits. Thus we have a characterization of radicals of finitely-generated cones.

Positivstellensatz 2: If is a finitely-generated cone, is the closure of , where the closure of a subset is defined to be the set of all polynomials in which are infima of chains contained in .

This still doesn’t even tell us what’s going on for cones which are not finitely-generated. However, we can generalize the Positivstellensatz to some other cones.

Positivstellensatz 3: Let be a cone containing a finitely-generated subcone such that is compact. If is a polynomial, then if and only if such that . As before, it follows that is the closure of .

proof: For a given , , an intersection of closed sets contained in the compact set , which is thus empty if and only if some finite subcollection of them has empty intersection within . Thus if is strictly positive everywhere on , then there is some finitely generated subcone such that is strictly positive everywhere on , and is finitely-generated, so by Positivstellensatz 1, there is such that .

For cones that are not finitely-generated and do not contain any finitely-generated subcones with compact positive-sets, the Positivstellensatz will usually fail. Thus, it seems likely that if there is a satisfactory general definition of radical for cones in arbitrary partially ordered commutative rings that agrees with this one in , then there is also an abstract notion of “having a compact positive-set” for such cones, even though they don’t even have positive-sets associated with them.

### Beyond

An example of a cone for which the Positivstellensatz fails is , the cone of polynomials that are non-negative on sufficiently large inputs (equivalently, the cone of polynomials that are either $0$ or have positive leading coefficient). , and is strictly positive on , but for , .
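
Membership in this cone is easy to test using the leading-coefficient description. A tiny sketch, assuming one-variable polynomials stored as coefficient lists from the constant term upward (the representation and the function names are mine):

```python
def leading_coefficient(coeffs):
    """Last nonzero coefficient of a polynomial given as [c0, c1, c2, ...], or 0 if none."""
    for c in reversed(coeffs):
        if c != 0:
            return c
    return 0

def eventually_nonnegative(coeffs):
    """True iff the polynomial is zero or has positive leading coefficient."""
    return leading_coefficient(coeffs) >= 0

print(eventually_nonnegative([-1000, 1]))   # x - 1000
print(eventually_nonnegative([5, 0, -1]))   # 5 - x^2
print(eventually_nonnegative([-1]))         # the constant -1
```

Every polynomial $x - r$ with $r$ real belongs to the cone, so no real number satisfies all of the cone’s inequalities at once, while a negative constant never belongs to it.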

However, it doesn’t really look like is trying to point to the empty set; instead, is trying to describe the set of all infinitely large reals, which only looks like the empty set because there are no infinitely large reals. Similar phenomena can occur even for cones that do contain finitely-generated subcones with compact positive-sets. For example, let . , but is trying to point out the set containing and all positive infinitesimals. Since has no infinitesimals, this looks like .

To formalize this intuition, we can change the Galois connection. We could say that for a cone , , where is the field of hyperreals. All you really need to know about is that it is a big ordered field extension of . is the set of hyperreals that are bigger than any real number, and is the set of hyperreals that are non-negative and smaller than any positive real. The cone of a subset , denoted will be defined as before, still consisting only of polynomials with real coefficients. This defines a topology on by saying that the closed sets are the fixed points of . This topology is not because, for example, there are many hyperreals that are larger than all reals, and they cannot be distinguished by polynomials with real coefficients. There is no use keeping track of the difference between points that are in the same closed sets. If you have a topology that is not , you can make it by identifying any pair of points that have the same closure. If we do this to , we get what I’m calling ordered affine -space over .

Definition: An -type over is a set of inequalities, consisting of, for each polynomial , one of the inequalities or , such that there is some totally ordered field extension and such that all inequalities in are true about . is called the type of . Ordered affine -space over , denoted is the set of -types over .

Compactness Theorem: Let be a set of inequalities consisting of, for each polynomial , one of the inequalities or . Then is an -type if and only if for any finite subset , there is such that all inequalities in are true about .

proof: Follows from the compactness theorem of first-order logic and the fact that ordered field extensions of embed into elementary extensions of . The theorem is not obvious if you do not know what those mean.

An -type represents an -tuple of elements of an ordered field extension of , up to the equivalence relation that identifies two such tuples that relate to by polynomials in the same way. One way that a tuple of elements of an extension of can relate to elements of is to equal a tuple of elements of , so there is a natural inclusion that associates an -tuple of reals with the set of polynomial inequalities that are true at that -tuple.

A tuple of polynomials describes a function , which extends naturally to a function by is the type of , where is an -tuple of elements of type in an extension of . In particular, a polynomial extends to a function , and is totally ordered by if and only if , where and are elements of type and , respectively, in an extension of . if and only if , so we can talk about inequalities satisfied by types in place of talking about inequalities contained in types.

I will now change the Galois connection that we are talking about yet again (last time, I promise). It will now be a Galois connection between the set of cones in and the set of subsets of . For a cone , . For a set , . Again, this defines a topology on by saying that fixed points of are closed. is ; in fact, it is the topological space obtained from by identifying points with the same closure as mentioned earlier. is also compact, as can be seen from the compactness theorem. is not (unless ). Note that model theorists have their own topology on , which is distinct from the one I use here, and is a refinement of it.

The new Galois connection is compatible with the old one via the inclusion , in the sense that if , then (where we identify with its image in ), and for a cone , .

Like our intermediate Galois connection , our final Galois connection succeeds in distinguishing and from and , respectively, in the desirable manner. consists of the type of numbers larger than any real, and consists of the types of and of positive numbers smaller than any positive real.

Just like for subsets of , a closed subset has a coordinate ring , and an arbitrary has a ring of regular functions consisting of functions on that are locally ratios of polynomials, ordered by if and only if , where is a representation of as a ratio of polynomials in a neighborhood of , either and , or and , and if and only if . As before, for closed .

is analogous to from algebraic geometry because if, in the above definitions, you replace “” and “” with “” and ““, replace totally ordered field extensions with field extensions, and replace cones with ideals, then you recover a description of , in the sense of .

What about an analog of projective space? Since we’re paying attention to order, we should look at spheres, not real projective space. The -sphere over , denoted , can be described as the locus of in .

For any totally ordered field , we can define similarly to , as the space of -types over , defined as above, replacing with (although a model theorist would no longer call it the space of -types over ). The compactness theorem is not true for arbitrary , but its corollary that is compact still is true.

### Visualizing and

should be thought of as the -sphere with infinitesimals in all directions around each point. Specifically, is just , a pair of points. The closed points of are the points of , and for each closed point , there is an -sphere of infinitesimals around , meaning a copy of , each point of which has in its closure.

should be thought of as -space with infinitesimals in all directions around each point, and infinities in all directions. Specifically, contains , and for each point , there is an -sphere of infinitesimals around , and there is also a copy of around the whole thing, the closed points of which are limits of rays in .

and relate to each other the same way that and do. If you remove a closed point from , you get , where the sphere of infinitesimals around the removed closed point becomes the sphere of infinities of .

More generally, if is a totally ordered field, let be its real closure. consists of the Cauchy completion of (as a metric space with distances valued in ), and for each point (though not for points that are limits of Cauchy sequences that do not converge in ), an -sphere of infinitesimals around , and an -sphere around the whole thing, where is the locus of in . does not distinguish between fields with the same real closure.

### More Positivstellensätze

This Galois connection gives us a new notion of what it means for a cone to be radical, which is distinct from the old one and is better, so I will define to be . A cone will be called radical if . Again, it would be nice to be able to characterize radical cones without referring to the Galois connection. And this time, I can do it. Note that since is compact, the proof of Positivstellensatz 3 shows that in our new context, the Positivstellensatz holds for all cones, since even the subcone generated by has a compact positive-set.

Positivstellensatz 4: If is a cone and is a polynomial, then if and only if such that .

However, we can no longer add in lower limits of sequences of polynomials. For example, for all real , but , even though is radical. This happens because, where is the type of positive infinitesimals, for real , but . However, we can add in lower limits of sequences contained in finitely-generated subcones, and this is all we need to add, so this characterizes radical cones.

Positivstellensatz 5: If is a cone, is the union over all finitely-generated subcones of the closure of (again the closure of a subset is defined to be the set of all polynomials in which are infima of chains contained in ).

Proof: Suppose is a subcone generated by a finite set , and is the infimum of a chain . For any , if for each , then for each , and hence . That is, the finite set of inequalities does not hold anywhere in . By the compactness theorem, there are no -types satisfying all those inequalities. Given , , so ; that is, .

Conversely, suppose . Then by the compactness theorem, there are some such that . Then , is strictly positive on , and hence by Positivstellensatz 4, such that . That is, is a chain contained in , a finitely-generated subcone of , whose infimum is .

### Ordered commutative algebra

Even though they are technically not isomorphic, and are closely related, and can often be used interchangeably. Of the two, is of a form that can be more easily generalized to more abstruse situations in algebraic geometry, which may indicate that it is the better thing to talk about, whereas is merely the simpler thing that is easier to think about and just as good in practice in many contexts. In contrast, and are different in important ways. The situation in algebraic geometry provides further reason to pay more attention to than to .

The next thing to look for would be an analog of the spectrum of a ring for a partially ordered commutative ring (I will henceforth abbreviate “partially ordered commutative ring” as “ordered ring” in order to cut down on the profusion of adjectives) in a way that makes use of the order, and gives us when applied to . I will call it the order spectrum of an ordered ring , denoted . Then of course can be defined as . should be, of course, the set of prime cones. But what even is a prime cone?

Definition: A cone is prime if is a totally ordered integral domain.

Definition: is the set of prime cones in , equipped with the topology whose closed sets are the sets of prime cones containing a given cone.

An -type can be seen as a cone, by identifying it with , aka . Under this identification, , as desired. The prime cones in are also the radical cones such that is irreducible. Notice that irreducible subsets of are much smaller than irreducible subsets of ; in particular, none of them contain more than one element of .

There is also a natural notion of maximal cone.

Definition: A cone is maximal if and there are no strictly intermediate cones between and . Equivalently, if is prime and closed in .

Maximal ideals of correspond to elements of . And the cones of elements of are maximal cones in , but unlike in the complex case, these are not all the maximal cones, since there are closed points in outside of . For example, is a maximal cone, and the type of numbers greater than all reals is closed. To characterize the cones of elements of , we need something slightly different.

Definition: A cone is ideally maximal if is a totally ordered field. Equivalently, if is maximal and is a maximal ideal.

Elements of correspond to ideally maximal cones of .

also allows us to define the radical of a cone in an arbitrary partially ordered commutative ring.

Definition: For a cone , is the intersection of all prime cones containing . is radical if .

Conjecture: is the union over all finitely-generated subcones of the closure of (as before, the closure of a subset is defined to be the set of all elements of which are infima of chains contained in ).

### Order schemes

Definition: An ordered ringed space is a topological space equipped with a sheaf of ordered rings. An ordered ring is local if it has a unique ideally maximal cone, and a locally ordered ringed space is an ordered ringed space whose stalks are local.

can be equipped with a sheaf of ordered rings , making it a locally ordered ringed space.

Definition: For a prime cone , the localization of at , denoted , is the ring equipped with an ordering that makes it a local ordered ring. This will be the stalk at of . A fraction () is also an element of for any prime cone whose interior ideal does not contain . This is an open neighborhood of (its complement is the set of prime cones containing ). There is a natural map given by , and the total order on extends uniquely to a total order on the fraction field, so for , we can say that at if this is true of their images in . We can then say that near if at every point in some neighborhood of , which defines the ordering on .

Definition: For open , consists of elements of that are locally ratios of elements of . is ordered by if and only if near (equivalently, if at ).

, and this inclusion can be proper. Conjecture: as locally ordered ringed spaces for open . This conjecture says that it makes sense to talk about whether or not a locally ordered ringed space looks locally like an order spectrum near a given point. Thus, if this conjecture is false, it would make the following definition look highly suspect.

Definition: An order scheme is a topological space equipped with a sheaf of ordered commutative rings such that for some open cover of , the restrictions of to the open sets in the cover are all isomorphic to order spectra of ordered commutative rings.

I don’t have any uses in mind for order schemes, but then again, I don’t know what ordinary schemes are for either and they are apparently useful, and order schemes seem like a natural analog of them.

## Nonabelian modules

This is a rough overview of my thoughts on a thing I’ve been thinking about, and as such is incomplete and may contain errors. Proofs have been omitted when writing them out would be at all tedious.

Edit: It has been pointed out to me that near-ring modules have already been defined, and the objects I describe in this post are just near-ring modules where the near-ring happens to be a ring.

### Introduction

As you all know (those of you who have the background for this post, anyway), an -module is an abelian group (written additively) together with a multiplication map such that for all and , , , , and .

What if we don’t want to restrict attention to abelian groups? One could attempt to define a nonabelian module using the same axioms, but without the restriction that the group be abelian. As it is customary to write groups multiplicatively if they are not assumed to be abelian, we will do that, and the map will be written as exponentiation (since exponents are written on the right, I’ll follow the definition of right-modules, rather than left-modules). The axioms become: for all and , , , and .

What has changed? Absolutely nothing, as it turns out. The first axiom says again that is abelian, because . We’ll have to get rid of that axiom. Our new definition, which it seems to me captures the essence of a module except for abelianness:
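The argument that the first axiom forces commutativity can be spelled out as follows. This is only a sketch: the symbols $x, y$ for group elements and the reading of the first axiom as $(xy)^r = x^r y^r$ are assumptions made here for illustration, since the notation is not visible in this copy of the text.

```latex
\begin{align*}
(xy)^{2} &= x^{2} y^{2}
  && \text{(assumed first axiom, with } r = 1 + 1\text{)} \\
xyxy &= xxyy
  && \text{(since } z^{1+1} = z^{1} z^{1} = zz \text{ for every } z\text{)} \\
yx &= xy
  && \text{(cancel } x \text{ on the left and } y \text{ on the right)}
\end{align*}
```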

A nonabelian -module is a group (written multiplicatively) together with a scalar exponentiation map  such that for all and , , and .

These imply that , , and is the inverse of , because , and .
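One way to derive these three consequences from the remaining axioms is sketched below. The notation is an assumption: $x$ is a group element, $r$ a ring element, and $1$ denotes both the unit of the ring (in exponents) and the identity of the group (elsewhere). Only $x^{1} = x$, $x^{r+s} = x^{r} x^{s}$, and $x^{rs} = (x^{r})^{s}$ are used.

```latex
\begin{align*}
x = x^{1+0} = x\,x^{0} &\implies x^{0} = 1, \\
1^{r} = (x^{0})^{r} = x^{0 \cdot r} = x^{0} &= 1, \\
x\,x^{-1} = x^{1 + (-1)} = x^{0} &= 1 .
\end{align*}
```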

Just like a -module is just an abelian group, a nonabelian -module is just a group. Just like a -module is an abelian group whose exponent divides , a nonabelian -module is a group whose exponent divides .
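As a sanity check on the claim about integer scalars, here is a minimal Python sketch using the symmetric group on three letters with ordinary integer powers. The helper names (`compose`, `power`, `axioms_hold`, `dropped_axiom_fails`) are hypothetical, and the axioms tested are the assumed readings $x^{1} = x$, $x^{r+s} = x^{r}x^{s}$, $x^{rs} = (x^{r})^{s}$. It also checks that the dropped axiom $(xy)^{r} = x^{r}y^{r}$ really does fail in this group.

```python
from itertools import permutations, product

def compose(p, q):
    """Group product of permutations given as tuples: apply p first, then q."""
    return tuple(q[p[i]] for i in range(len(p)))

def inverse(p):
    """Inverse permutation."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)

def power(p, n):
    """Integer power p**n by repeated multiplication (n may be negative)."""
    base = p if n >= 0 else inverse(p)
    result = tuple(range(len(p)))  # identity permutation
    for _ in range(abs(n)):
        result = compose(result, base)
    return result

S3 = list(permutations(range(3)))
EXPONENTS = range(-4, 5)

def axioms_hold(group, exponents):
    """Check x^1 = x, x^(r+s) = x^r x^s and x^(rs) = (x^r)^s on the samples."""
    for x in group:
        if power(x, 1) != x:
            return False
        for r, s in product(exponents, repeat=2):
            if power(x, r + s) != compose(power(x, r), power(x, s)):
                return False
            if power(x, r * s) != power(power(x, r), s):
                return False
    return True

def dropped_axiom_fails(group):
    """True if (xy)^2 != x^2 y^2 for some pair, i.e. the group is not abelian."""
    return any(
        power(compose(x, y), 2) != compose(power(x, 2), power(y, 2))
        for x in group
        for y in group
    )

print(axioms_hold(S3, EXPONENTS), dropped_axiom_fails(S3))
```

Nothing here is specific to this group; any finite group with a multiplication function could be substituted for `S3`.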

### Exponentiation-like families of operations

Perhaps a bit more revealing is what nonabelian modules over free rings look like, since then the generators are completely generic ring elements. Where is the generating set, a -module is an abelian group together with endomorphisms , which tells us that modules are about endomorphisms of an abelian group indexed by the elements of a ring. Nonabelian modules are certainly not about endomorphisms. After all, in a nonabelian group, the map  is not an endomorphism. I will call the things that nonabelian modules are about “exponentiation-like families of operations”, and give four equivalent definitions, in roughly increasing order of concreteness and decreasing order of elegance. Definition 2 uses basic model theory, so skip it if that scares you. Definition 3 is the “for dummies” version of definition 2.

Definition 0: Let be a group, and let be a family of functions from to (not necessarily endomorphisms). If can be made into a nonabelian -module such that for and , then is called an exponentiation-like family of operations on . If so, the nonabelian -module structure on with that property is unique, so define  to be its value according to that structure, for  and .

Definition 1: is an exponentiation-like family of operations on if for all , the smallest subgroup containing  which is closed under actions by elements of (which I will call ) is abelian, and the elements of  restrict to endomorphisms of it. Using the universal property of , this induces a homomorphism . Let denote the action of on under that map, for . By , I mean the endomorphism ring of with composition running in the opposite direction (i.e., the multiplication operation given by ). This is because of the convention that nonabelian modules are written as nonabelian right-modules by default.

Definition 2: Consider the language , where is the language of rings, and each element of is used as a constant symbol. Closed terms in act as functions from to , with the action of written as , defined inductively as: , ,  for , , , and for closed -terms  and . is called an exponentiation-like family of operations on if whenever , where  is the theory of rings. If is an exponentiation-like family of operations on and  is a noncommutative polynomial with variables in , then for is defined to be where is any term representing .

Definition 3: Pick a total order on the free monoid on (e.g. by ordering and then using the lexicographic order). The order you use won’t matter. Given and  in the free monoid on , let . Where is a noncommutative polynomial, for some  and decreasing sequence of noncommutative monomials (elements of the free monoid on ). Let is called an exponentiation-like family of operations on  if for every and and .

These four definitions of exponentiation-like family are equivalent, and for exponentiation-like families, their definitions of exponentiation by a noncommutative polynomial are equivalent.

Facts: is an exponentiation-like family of operations on . If is an exponentiation-like family of operations on  and , then so is . If is abelian, then  is exponentiation-like. Given a nonabelian -module structure on , the actions of the elements of  on form an exponentiation-like family. In particular, if  is an exponentiation-like family of operations on , then so is , with the actions being defined as above.

[The following paragraph has been edited since this comment.]

For an abelian group , the endomorphisms of form a ring , and an -module structure on is simply a homomorphism . Can we say a similar thing about exponentiation-like families of operations of ? Let be the set of all functions (as sets). Given , let multiplication be given by composition: , addition be given by , negation be given by , and  and be given by and . This makes into a near-ring. A nonabelian -module structure on  is a homomorphism , and a set of operations on is an exponentiation-like family of operations on if and only if it is contained in a ring which is contained in .
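To see why the functions on a nonabelian group form only a near-ring, the following Python sketch checks the two distributive laws on a few sample functions on the symmetric group on three letters. The conventions are assumptions made for illustration: functions act on the right, so the product `times(f, g)` means "apply f, then g", and the sum `add(f, g)` is the pointwise group product. All names are hypothetical.

```python
from itertools import permutations, product

def compose(p, q):
    """Group product of permutations given as tuples: apply p first, then q."""
    return tuple(q[p[i]] for i in range(len(p)))

S3 = list(permutations(range(3)))

def add(f, g):
    """Near-ring addition: x^(f+g) = x^f x^g (pointwise group product)."""
    return lambda x: compose(f(x), g(x))

def times(f, g):
    """Near-ring multiplication: x^(fg) = (x^f)^g (apply f, then g)."""
    return lambda x: g(f(x))

def same(f, g):
    """Two functions on S3 are equal if they agree on every element."""
    return all(f(x) == g(x) for x in S3)

def identity_map(x):
    return x

def square(x):
    return compose(x, x)

def constant(x):
    return (1, 0, 2)  # always returns one fixed transposition

SAMPLES = [identity_map, square, constant]

# h(f + g) = hf + hg holds for arbitrary functions h, f, g.
one_sided_law = all(
    same(times(h, add(f, g)), add(times(h, f), times(h, g)))
    for h, f, g in product(SAMPLES, repeat=3)
)

# (f + g)h = fh + gh can fail when h is not a homomorphism.
other_law_fails = not same(
    times(add(identity_map, constant), square),
    add(times(identity_map, square), times(constant, square)),
)

print(one_sided_law, other_law_fails)
```

The failing law is exactly the statement that `square` is not an endomorphism of a nonabelian group, which matches the earlier remark that nonabelian modules are not about endomorphisms.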

### Some aimless rambling

What are some interesting examples of nonabelian modules that are not abelian? (That might sound redundant, but “nonabelian module” means that the requirement of abelianness has been removed, not that a requirement of nonabelianness has been imposed. Perhaps I should come up with better terminology. To make matters worse, since the requirement that got removed is actually stronger than abelianness, there are nonabelian modules that are abelian and not modules. For instance, consider the nonabelian -module whose underlying group is the Klein four-group (generated by two elements ) such that , , and .)

In particular, what do free nonabelian modules look like? The free nonabelian -modules are, of course, free groups. The free nonabelian -modules have been studied in combinatorial group theory; they’re called Burnside groups. (Fun but tangential fact: not all Burnside groups are finite, which is the negative answer to the Burnside problem. Despite this, the category of finite nonabelian -modules has free objects on any finite generating set, called restricted Burnside groups.)

The free nonabelian -modules are monstrosities. They can be constructed in the usual way of constructing free objects in a variety of algebraic structures, but that construction seems not to be very enlightening about their structure. So I’ll give a somewhat more direct construction of the free nonabelian -module on generators, which may also not be that enlightening, and which is only suspected to be correct. Define an increasing sequence of groups , and functions , as follows: is the free group on generators. Given , and given a subgroup , let the top-degree portion of be for the largest such that this is nontrivial. Let be the free product of the top-degree portions of maximal abelian subgroups of . Let be the free product of with modulo commutativity of the maximal abelian subgroups of with the images of their top-degree portions in . Given a maximal abelian subgroup , let be the homomorphism extending which sends the top-degree portion identically onto its image in . Since every non-identity element of is in a unique maximal abelian subgroup, this defines . with  is the free nonabelian -module on generators. If is a set, the free nonabelian -modules can be constructed similarly, with copies of at each step. Are these constructions even correct? Are there nicer ones?

A nonabelian -module would be a group with a formal square root operation. As an example, any group of odd exponent can be made into a -module in a canonical way by letting . More generally, any group of finite exponent can be made into a -module in a similar fashion. Are there any more nice examples of nonabelian modules over localizations of ?
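A small Python sketch of the odd-exponent example, under stated assumptions. The group is the Heisenberg group of upper unitriangular 3×3 matrices over the integers mod 3, which is nonabelian of exponent 3. The formula used for the square root, $x^{1/2} = x^{(n+1)/2}$ for exponent $n$, is the natural candidate and is assumed here rather than taken from the text. All function names are hypothetical.

```python
from itertools import product

P = 3  # the group below has exponent 3, which is odd

IDENTITY = (0, 0, 0)

def mul(x, y):
    """Product of upper unitriangular matrices mod P, stored as (a, b, c)."""
    a, b, c = x
    d, e, f = y
    return ((a + d) % P, (b + e) % P, (c + f + a * e) % P)

def power(x, n):
    """x**n for a nonnegative integer n, by repeated multiplication."""
    result = IDENTITY
    for _ in range(n):
        result = mul(result, x)
    return result

def sqrt(x):
    """Assumed formal square root: x^(1/2) = x^((P + 1) / 2)."""
    return power(x, (P + 1) // 2)

GROUP = list(product(range(P), repeat=3))

# Every element is killed by P, so the exponent divides 3.
exponent_divides_p = all(power(x, P) == IDENTITY for x in GROUP)

# Squaring the assumed square root recovers the original element.
sqrt_is_correct = all(mul(sqrt(x), sqrt(x)) == x for x in GROUP)

# The group is not abelian.
nonabelian = mul((1, 0, 0), (0, 1, 0)) != mul((0, 1, 0), (1, 0, 0))

print(exponent_divides_p, sqrt_is_correct, nonabelian)
```

The check works because squaring the candidate gives $x^{n+1} = x^{n} x = x$ whenever $x^{n} = 1$.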

In particular, a nonabelian -module would be a group with formal th root operations for all . What are some nonabelian examples of these? Note that nonabelian -modules cannot have any torsion, for suppose for some . Then . More generally, nonabelian modules cannot have any -torsion (meaning ) for any which is invertible in the scalar ring.
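The torsion argument can be written out in one line. The notation is assumed for illustration: $x$ is a group element, $n$ a positive integer, and $1$ is the group identity. The only facts used are $x^{rs} = (x^{r})^{s}$ and $1^{r} = 1$.

```latex
\[
x^{n} = 1
\;\implies\;
x = x^{\,n \cdot \frac{1}{n}} = (x^{n})^{1/n} = 1^{1/n} = 1 .
\]
```

The same computation with any invertible scalar $r$ in place of $n$ gives the more general statement about $r$-torsion.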

The free nonabelian -modules can be constructed similarly to the construction of free nonabelian -modules above, except that when constructing from and , we also mod out by elements of being equal to the th powers of their images in . Using the fact that , this lets us modify the construction of free nonabelian -modules to give us a construction of free nonabelian -modules. Again, is there a nicer way to do it?

### Topological nonabelian modules

It is also interesting to consider topological nonabelian modules over topological rings; that is, nonabelian modules endowed with a topology such that the group operation and scalar exponentiation are continuous. A module over a topological ring has a canonical finest topology on it, and the same remains true for nonabelian modules. For finite-dimensional real vector spaces, this is the only topology. Does the same remain true for finitely-generated nonabelian -modules? Finite-dimensional real vector spaces are complete, and topological nonabelian modules are, in particular, topological groups, and can thus be made into uniform spaces, so the notion of completeness still makes sense, but I think some finitely-generated nonabelian -modules are not complete.

A topological nonabelian -module is a sort of Lie group-like object. One might try constructing a Lie algebra for a complete nonabelian -module by letting the underlying set be , and defining  and . One might try putting a differential structure on such that this is the Lie algebra of left-invariant derivations. Does this or something like it work?

A Lie group is a nonabelian -module if and only if its exponential map is a bijection between it and its Lie algebra. In this case, scalar exponentiation is closely related to the exponential map by a compelling formula: . As an example, the continuous Heisenberg group is a nonabelian -module which is not abelian. This observation actually suggests a nice class of examples of nonabelian modules without a topology: given a commutative ring , the Heisenberg group over is a nonabelian -module.

The Heisenberg group of dimension over a commutative ring has underlying set , with the group operation given by . The continuous Heisenberg group means the Heisenberg group over . Scalar exponentiation on a Heisenberg group is just given by scalar multiplication: .
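The claim that scalar multiplication serves as scalar exponentiation can be checked numerically. The sketch below works over the integers and assumes the antisymmetric form of the group law, $(a, b, c)(a', b', c') = (a + a',\, b + b',\, c + c' + a \cdot b' - a' \cdot b)$; the formula itself is not visible in this copy of the text, so this choice, like all function names here, is an assumption.

```python
from itertools import product

def dot(u, v):
    """Dot product of two equal-length tuples."""
    return sum(p * q for p, q in zip(u, v))

def mul(x, y):
    """Assumed Heisenberg group law with an antisymmetric correction term."""
    (a1, b1, c1), (a2, b2, c2) = x, y
    a = tuple(p + q for p, q in zip(a1, a2))
    b = tuple(p + q for p, q in zip(b1, b2))
    c = c1 + c2 + dot(a1, b2) - dot(a2, b1)
    return (a, b, c)

def scalar_exp(x, r):
    """Scalar exponentiation x^r, implemented as scalar multiplication."""
    a, b, c = x
    return (tuple(r * p for p in a), tuple(r * p for p in b), r * c)

# Sample elements of the 3-dimensional Heisenberg group over the integers.
VALUES = range(-1, 2)
ELEMENTS = [((a,), (b,), c) for a, b, c in product(VALUES, repeat=3)]
SCALARS = range(-2, 3)

axioms_hold = all(
    scalar_exp(x, 1) == x
    and scalar_exp(x, r + s) == mul(scalar_exp(x, r), scalar_exp(x, s))
    and scalar_exp(x, r * s) == scalar_exp(scalar_exp(x, r), s)
    for x in ELEMENTS
    for r, s in product(SCALARS, repeat=2)
)

X = ((1,), (0,), 0)
Y = ((0,), (1,), 0)
nonabelian = mul(X, Y) != mul(Y, X)

print(axioms_hold, nonabelian)
```

The additivity axiom works because the correction term vanishes on pairs of multiples of the same element. With the upper-triangular group law instead (correction term $a \cdot b'$ only), plain scalar multiplication would not satisfy $x^{1+1} = xx$.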