Complexity classes of natural numbers (googology for ultrafinitists)

Ultrafinitists think common ways of defining extremely large numbers don’t actually refer to numbers that exist. For example, most ultrafinitists would maintain that a googolplex isn’t a number. But to a classical mathematician, while numbers like a googolplex are far larger than the numbers we deal with on a day-to-day basis, such as 10, both numbers have the same ontological status. In this post, I want to consider a compromise position: any number we can define can be meaningfully reasoned about, but a special status is afforded to the sorts of numbers that ultrafinitists can accept.

Specifically, define an “ultrafinite number” to be a natural number that it is physically possible to express in unary. This isn’t very precise, since there are all sorts of things that “physically possible to express in unary” could mean, but let’s just not worry about that too much. Also, many ultrafinitists would not insist that numbers must be expressible in such an austere language as unary, but I’m about to get to that.

Examples: 20 is an ultrafinite number, because 20 = SSSSSSSSSSSSSSSSSSSS0, where S is the successor function. 80,000 is also an ultrafinite number, but it is a large one, and it isn’t worth demonstrating its ultrafiniteness. A googol is not ultrafinite. The observable universe isn’t even big enough to contain a googol written in unary.

Now, define a “polynomially finite number” to be a natural number that it is physically possible to express using addition and multiplication. Binary and decimal are basically just concise ways of expressing certain sequences of addition and multiplication operations. For instance, “18,526” means (((1*10 + 8)*10 + 5)*10 + 2)*10 + 6. Conversely, if you multiply an n-digit number by an m-digit number, you get a number with at most n+m digits, which is no more than the number of symbols it took to write down “[the n-digit number] times [the m-digit number]” in the first place, so any number that can be written using addition and multiplication can be written in decimal. Thus, another way to define polynomially finite numbers is as the numbers that it is physically possible to express in binary or in decimal. I’ve been ignoring some small constant factors that might make these definitions not quite equivalent, but any plausible candidate for a counterexample would be an ambiguous edge case according to each definition anyway, so I’m not worried about that. Many ultrafinitists may see something more like polynomial finiteness, rather than ultrafiniteness, as a good description of which numbers exist.
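
To make the translation concrete, here is a minimal Python sketch of the observation that reading a decimal numeral is just a short sequence of multiplications and additions (the function name is mine):

```python
def decimal_to_number(digits: str) -> int:
    """Interpret a string of decimal digits as iterated multiplication and addition."""
    value = 0
    for d in digits:
        value = value * 10 + int(d)  # one multiplication and one addition per digit
    return value

assert decimal_to_number("18526") == (((1*10 + 8)*10 + 5)*10 + 2)*10 + 6
```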

Examples: A googol is polynomially finite, because a googol is 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000. A googolplex is not polynomially finite, because it would require a googol digits to express in decimal, which is physically impossible.

Define an “elementarily finite number” to be a number that it is physically possible to express using addition, multiplication, subtraction, exponentiation, and the integer division function \lfloor x/y\rfloor. Elementarily finite is much broader than polynomially finite, so it might make sense to look at intermediate classes. Say a number is “exponentially finite” if it is physically possible to express using the above operations but without any nested exponentiation (e.g. a^b c^d is okay, but a^{(b^c)} is not). More generally, we can say that a number is “k-exponentially finite” if it can be expressed with exponentiation nested to depth at most k, so a polynomially finite number is a 0-exponentially finite number, an exponentially finite number is a 1-exponentially finite number, and an elementarily finite number is a number that is k-exponentially finite for some k (or equivalently, for some ultrafinite k).

Examples: a googolplex is exponentially finite, because it is 10^{10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000}. Thus a googolduplex, meaning 10^\text{googolplex}, is 2-exponentially finite, but it is not exponentially finite. For examples of non-elementarily finite numbers, and numbers that are only k-exponentially finite for fairly large k, I’ll use up-arrow notation. a\uparrow b just means a^b, a\uparrow^{n+1} b means a\uparrow^n a\uparrow^n a\uparrow^n ... a, where b is the number of copies of a, and using order of operations that starts on the right. So 3\uparrow\uparrow3 = 3^{(3^3)} = 3^{27} = 7,625,597,484,987, which is certainly polynomially finite, and could also be ultrafinite depending on what is meant by “physically possible” (a human cannot possibly count that high, but a computer with a large enough hard drive can store 3\uparrow\uparrow3 in unary). 3\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow(3\uparrow\uparrow3) = 3^{3^{3^{\cdots^{3}}}}, where there are 3^{27} threes in that tower. Under the assumptions that imply 3^{27} is ultrafinite, 3\uparrow\uparrow\uparrow3 is elementarily finite. Specifically, it is 3^{27}-exponentially finite, but I’m pretty sure it’s not 3^{26}-exponentially finite, or even 7,625,597,484,000-exponentially finite. 3\uparrow\uparrow\uparrow\uparrow3 = 3\uparrow\uparrow\uparrow(3\uparrow\uparrow\uparrow3), and is certainly not elementarily finite.
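
Up-arrow notation is easy to transcribe into code, which may make the recursion clearer; here is a short Python sketch (mine, using the usual convention that a\uparrow^n 0 = 1), though of course it is only feasible to evaluate for tiny inputs:

```python
def hyper(a: int, n: int, b: int) -> int:
    """Knuth's up-arrow notation: a ↑^n b, following the recursive definition above."""
    if n == 1:
        return a ** b                # a ↑ b is ordinary exponentiation
    if b == 0:
        return 1                     # conventional base case
    return hyper(a, n - 1, hyper(a, n, b - 1))  # a ↑^n b = a ↑^(n-1) (a ↑^n (b-1))

assert hyper(3, 1, 3) == 27
assert hyper(3, 2, 3) == 3 ** 27     # 7,625,597,484,987
```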

Interestingly, even though a googolplex is exponentially finite, there are numbers less than a googolplex that are not. There’s an easy nonconstructive proof of this: in order to be able to represent every number less than a googolplex in any encoding scheme at all, there has to be some number less than a googolplex that requires at least a googol decimal digits of information to express. But it is physically impossible to store a googol decimal digits of information. Therefore for any encoding scheme for numbers, there is some number less than a googolplex that cannot physically be expressed in it. This is why the definition of elementarily finite is significantly more complicated than the definition of polynomially finite; in the polynomial case, if n can be expressed using addition and multiplication and m<n, then m can also be expressed using addition and multiplication, so there’s no need for additional operations to construct smaller numbers, but in the elementary case, the operations of subtraction and integer division are useful for expressing more numbers, and are simpler than exponentiation. For example, these let us express the number that you get from reading off the last googol digits, or the first googol digits, of 3\uparrow\uparrow100, so these numbers are elementarily finite. However, it is exceptionally unlikely that the number you get from reading off the first googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite. But for a difficult exercise, show that the number you get from reading off the last googol decimal digits of 3\uparrow\uparrow\uparrow\uparrow3 is elementarily finite.

Why stop there instead of including more operations for getting smaller numbers, like \lfloor\log\rfloor, which I implicitly used when I told you that the number formed by the first googol digits of 3\uparrow\uparrow100 is elementarily finite? We don’t have to. The functions that you can get by composition from addition, multiplication, exponentiation, \max(x-y,0), and \lfloor x/y\rfloor coincide with the functions that can be computed in iterated exponential time (meaning O(2^{2^{...^{2^n}}}) time, for some height of that tower). So if you have any remotely close to efficient way to compute an operation, it can be expressed in terms of the operations I already specified.

We can go farther. Consider a programming language that has the basic arithmetic operations, if/else clauses, and loops, where the number of iterations of each loop must be fixed before the loop is entered. The functions that can be computed in such a language are exactly the primitive recursive functions. Say that a number is primitive recursively finite if it is physically possible to write a program (that does not take any input) in this language that outputs it. For each fixed n, the binary function (x,y)\mapsto x\uparrow^n y is primitive recursive, so 3\uparrow\uparrow\uparrow\uparrow3 is primitive recursively finite. But the ternary function (x,y,z)\mapsto x\uparrow^y z is not primitive recursive: each additional arrow essentially requires another level of loop nesting, and a program that it is physically possible to write down can only nest loops to a physically bounded depth, so 3\uparrow^\text{googol}3 is not primitive recursively finite.
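
Here is a Python sketch (my own illustration, not part of the original post) of what such LOOP-style programs look like: every loop’s iteration count is determined before the loop starts and never changed inside it, and each additional arrow uses one more level of nesting:

```python
def add(x: int, y: int) -> int:
    for _ in range(y):               # loop bound fixed before the loop runs
        x = x + 1
    return x

def mul(x: int, y: int) -> int:
    result = 0
    for _ in range(y):
        result = add(result, x)
    return result

def power(x: int, y: int) -> int:    # x ↑ y: one more level of nesting
    result = 1
    for _ in range(y):
        result = mul(result, x)
    return result

def tetration(x: int, y: int) -> int:  # x ↑↑ y: yet another level
    result = 1
    for _ in range(y):
        result = power(x, result)
    return result

assert power(3, 3) == 27
assert tetration(3, 2) == 27
```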

The primitive recursively finite numbers can be put in a hierarchy of subclasses based on the depth of nested loops that are needed to express them. If the only arithmetic operation available is the successor function (from which other operations can be defined using loops), then the elementarily finite numbers are those that can be expressed with loops nested to depth at most 2. The k-exponentially finite numbers should roughly correspond to the numbers that can be expressed with at most k loops at depth 2.

Next come the provably computably finite numbers. Say that a number is provably computably finite if it is physically possible to write a program in a Turing-complete language that outputs the number (taking no input), together with a proof in Peano arithmetic that the program halts. The famous Graham’s number is provably computably finite. Graham’s number is defined in terms of a function g, defined recursively as g(0):=4 and g(n+1):=3\uparrow^{g(n)}3. Graham’s number is g(64). You could write a computer program to compute g, and prove that g is total using Peano arithmetic. By replacing Peano arithmetic with other formal systems, you can get other variations on the notion of provably computably finite.
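
As a sketch (mine) of what such a program looks like, here is the recursion for g in Python; it is purely definitional, since even g(1) = 3\uparrow\uparrow\uparrow\uparrow3 is hopelessly beyond evaluation, but the program is short and Peano arithmetic proves that it halts:

```python
def hyper(a: int, n: int, b: int) -> int:
    """a ↑^n b, as in the up-arrow section above."""
    if n == 1:
        return a ** b
    if b == 0:
        return 1
    return hyper(a, n - 1, hyper(a, n, b - 1))

def g(k: int) -> int:
    """g(0) = 4, g(k+1) = 3 ↑^{g(k)} 3; Graham's number is g(64)."""
    if k == 0:
        return 4
    return hyper(3, g(k - 1), 3)

# Do not actually call g(64) (or even g(1)); the point is only that the
# definition fits in a short program that provably halts.
```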

For an example of a number that is not provably computably finite, I’ll use the hydra game, which is described here. There is no proof in Peano arithmetic (that can physically be written down) that it is possible to win the hydra game starting from the complete binary tree of depth a googol. So the number of turns it takes to win the hydra game on the complete binary tree of depth a googol is not provably computably finite. If you start with a reasonably small hydra (say, with 100 nodes), you could write a program to search for the shortest winning strategy, and prove in Peano arithmetic that it succeeds in finding one, if you’re sufficiently clever and determined, and you use a computer to help you search for proofs. The proof you’d get out of this endeavor would be profoundly unenlightening, but the point is, the number of turns it takes to win the hydra game for a small hydra is provably computably finite (but not primitive recursively finite, except in certain trivial special cases).

Next we’ll drop the provability requirement, and say that a number is computably finite if it is physically possible to write a computer program that computes it from no input. Of course, in order to describe a computably finite number, you need the program you use to actually halt, so you’d need some argument that it does halt in order to establish that you’re describing a computably finite number. Thus this is arguably just a variation on provably computably finite, where Peano arithmetic is replaced by some unspecified strong theory encompassing the sort of reasoning that classical mathematicians tend to endorse. This is probably the point where even the most patient of ultrafinitists would roll their eyes in disgust, but oh well. Anyway, the number of steps that it takes to win the hydra game starting from the complete binary tree of depth a googol is a computably finite number, because there exists a shortest winning strategy, and you can write a computer program to exhaustively search for it.

The busy-beaver function BB is defined so that BB(n) is the longest any Turing machine with n states runs before halting (among those that do halt). BB(10^{100}) is not computably finite, because Turing machines with a googol states cannot be explicitly described, and since the busy-beaver function is very fast-growing, no smaller Turing machine has comparable behavior. What about BB(10,000)? Turing machines with 10,000 states are not too big to describe explicitly, so it may be tempting to say that BB(10,000) is computably finite. But on the other hand, it is not possible to search through all Turing machines with 10,000 states and find the one that runs the longest before halting. No matter how hard you search and no matter how clever your heuristics for finding Turing machines that run for exceptionally long and then halt, it is vanishingly unlikely that you will find the 10,000-state Turing machine that runs longest before halting, let alone realize that you have found it. And the idea is to use classical reasoning for large numbers themselves, but constructive reasoning for descriptions of large numbers. So since it is pretty much impossible to actually write a program that outputs BB(10,000), it is not computably finite.

For a class that can handle busy-beaver numbers too, let’s turn to the arithmetically finite numbers. These are the numbers that it is physically possible to define with an arithmetical formula. These form a natural hierarchy, where the \Sigma^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n alternating unbounded quantifiers, starting with \exists, the \Pi^0_n-finite numbers are the numbers defined by arithmetical formulas with at most n alternating unbounded quantifiers, starting with \forall, and the \Delta^0_n-finite numbers are those that are both \Sigma^0_n-finite and \Pi^0_n-finite. The \Delta^0_1-finite numbers are the same as the computably finite numbers. BB(10^{100}) is \Pi^0_2-finite, because it is defined by “\forall n every Turing machine with 10^{100} states that halts in at most n steps halts in at most BB(10^{100}) steps, and there is a Turing machine with 10^{100} states that halts in exactly BB(10^{100}) steps.” Everything in that formula other than the two unbounded quantifiers is computable. BB(BB(10^{100})) is \Delta^0_2-finite, but no lower than that. To get a number that is not arithmetically finite, consider the function f, where f(n) is the largest number defined by an arithmetical formula with n symbols. f(10,000) is \Delta^0_{5,000}-finite, but f(10^{100}) is not arithmetically finite. I’ll stop there.

Existential risk from AI without an intelligence explosion

[x-post LessWrong]

In discussions of existential risk from AI, it is often assumed that the existential catastrophe would follow an intelligence explosion, in which an AI creates a more capable AI, which in turn creates a yet more capable AI, and so on, a feedback loop that eventually produces an AI whose cognitive power vastly surpasses that of humans, which would be able to obtain a decisive strategic advantage over humanity, allowing it to pursue its own goals without effective human interference. Victoria Krakovna points out that many arguments that AI could present an existential risk do not rely on an intelligence explosion. I want to look in slightly more detail at how that could happen. Kaj Sotala also discusses this.

An AI starts an intelligence explosion when its ability to create better AIs surpasses that of human AI researchers by a sufficient margin (provided the AI is motivated to do so). An AI attains a decisive strategic advantage when its ability to optimize the universe surpasses that of humanity by a sufficient margin. Which of these happens first depends on what skills AIs have the advantage at relative to humans. If AIs are better at programming AIs than they are at taking over the world, then an intelligence explosion will happen first, and the resulting superintelligence will then be able to get a decisive strategic advantage soon after. But if AIs are better at taking over the world than they are at programming AIs, then an AI would get a decisive strategic advantage without an intelligence explosion occurring first.

Since an intelligence explosion happening first is usually considered the default assumption, I’ll just sketch a plausibility argument for the reverse. There’s a lot of variation in how easy cognitive tasks are for AIs compared to humans. Since programming AIs is not yet a task that AIs can do well, it doesn’t seem like it should be a priori surprising if programming AIs turned out to be an extremely difficult task for AIs to accomplish, relative to humans. Taking over the world is also plausibly especially difficult for AIs, but I don’t see strong reasons for confidence that it would be harder for AIs than starting an intelligence explosion would be. It’s possible that an AI with significantly but not vastly superhuman abilities in some domains could identify some vulnerability that it could exploit to gain power, which humans would never think of. Or an AI could be enough better than humans at forms of engineering other than AI programming (perhaps molecular manufacturing) that it could build physical machines that could out-compete humans, though this would require it to obtain the resources necessary to produce them.

Furthermore, an AI that is capable of producing a more capable AI may refrain from doing so if it is unable to solve the AI alignment problem for itself; that is, if it can create a more intelligent AI, but not one that shares its preferences. This seems unlikely if the AI has an explicit description of its preferences. But if the AI, like humans and most contemporary AI, lacks an explicit description of its preferences, then the difficulty of the AI alignment problem could be an obstacle to an intelligence explosion occurring.

It also seems worth thinking about the policy implications of the differences between existential catastrophes from AI that follow an intelligence explosion versus those that don’t. For instance, AIs that attempt to attain a decisive strategic advantage without undergoing an intelligence explosion will exceed human cognitive capabilities by a smaller margin, and thus would likely attain strategic advantages that are less decisive, and would be more likely to fail. Thus containment strategies are probably more useful for addressing risks that don’t involve an intelligence explosion, while attempts to contain a post-intelligence explosion AI are probably pretty much hopeless (although it may be worthwhile to find ways to interrupt an intelligence explosion while it is beginning). Risks not involving an intelligence explosion may be more predictable in advance, since they don’t involve a rapid increase in the AI’s abilities, and would thus be easier to deal with at the last minute, so it might make sense far in advance to focus disproportionately on risks that do involve an intelligence explosion.

It seems likely that AI alignment would be easier for AIs that do not undergo an intelligence explosion, since it is more likely to be possible to monitor and do something about it if it goes wrong, and lower optimization power means lower ability to exploit the difference between the goals the AI was given and the goals that were intended, if we are only able to specify our goals approximately. The first of those reasons applies to any AI that attempts to attain a decisive strategic advantage without first undergoing an intelligence explosion, whereas the second only applies to AIs that do not undergo an intelligence explosion ever. Because of these, it might make sense to attempt to decrease the chance that the first AI to attain a decisive strategic advantage undergoes an intelligence explosion beforehand, as well as the chance that it undergoes an intelligence explosion ever, though preventing the latter may be much more difficult. However, some strategies to achieve this may have undesirable side-effects; for instance, as mentioned earlier, AIs whose preferences are not explicitly described seem more likely to attain a decisive strategic advantage without first undergoing an intelligence explosion, but such AIs are probably more difficult to align with human values.

If AIs get a decisive strategic advantage over humans without an intelligence explosion, then since this would likely involve the decisive strategic advantage being obtained much more slowly, it would be much more likely for multiple, and possibly many, AIs to gain decisive strategic advantages over humans, though not necessarily over each other, resulting in a multipolar outcome. Thus considerations about multipolar versus singleton scenarios also apply to decisive strategic advantage-first versus intelligence explosion-first scenarios.

Principal Component Analysis in Theory and Practice

Prerequisites for this post are linear algebra, including tensors, and basic probability theory. Already knowing how PCA works will also be helpful. In section 1, I’ll summarize the technique of principal component analysis (PCA), stubbornly doing so in a coordinate-free manner, partly because I am an asshole but mostly because it is rhetorically useful for emphasizing my point in section 2. In section 2, I’ll gripe about how PCA is often used in ways that shouldn’t be expected to work, but works just fine anyway. In section 3, I’ll discuss some useless but potentially amusing ways that PCA could be modified. Thanks to Laurens Gunnarsen for inspiring this post by talking to me about the problem that I discuss in section 2.

A brief introduction to Principal Component Analysis

You start with a finite-dimensional real inner product space V and a probability distribution \mu on V. Actually, you probably just started with a large finite number of elements of V, and you’ve inferred a probability distribution that you’re supposing they came from, but that difference is not important here. The goal is to find the n-dimensional (for some n\leq\dim V) affine subspace W_{n}\subseteq V minimizing the expected squared distance between a vector (distributed according to \mu) and its orthogonal projection onto W_{n}. We can assume without loss of generality that the mean of \mu is 0, because we can just shift any probability distribution by its mean and get a probability distribution with mean 0. This is useful because then W_{n} will be a linear subspace of V. In fact, we will solve this problem for all n\leq\dim V simultaneously by finding an ordered orthonormal basis such that W_{n} is the span of the first n basis elements.

First you take \text{Cov}_{\mu}\in V\otimes V, called the covariance of \mu, defined as the bilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\varphi\left(x\right)\psi\left(x\right)d\mu\left(x\right). From this, we get the covariance operator C_{\mu}\in V^{*}\otimes V by lowering the first index, which means starting with \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in V^{*}\otimes V^{*}\otimes V\otimes V and performing a tensor contraction (in other words, C_{\mu} is obtained from \text{Cov}_{\mu} by applying the map V\rightarrow V^{*} given by the inner product to the first index). \text{Cov}_{\mu} is symmetric and positive semi-definite, so C_{\mu} is self-adjoint and positive semi-definite, and hence V has an orthonormal basis of eigenvectors of C_{\mu}, with non-negative real eigenvalues. This gives an orthonormal basis in which \text{Cov}_{\mu} is diagonal, where the diagonal entries are the eigenvalues. Ordering the eigenvectors in decreasing order of the corresponding eigenvalues gives us the desired ordered orthonormal basis.
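
For readers who prefer coordinates, here is a minimal numpy sketch of the recipe just described, using the standard inner product on \mathbb{R}^{3} and an arbitrary example dataset (the function and variable names are mine):

```python
import numpy as np

def pca(data: np.ndarray):
    """Rows of the returned `components` form an ordered orthonormal basis;
    for every n at once, the span of the first n rows is the n-dimensional
    subspace minimizing the expected squared distance to the projection."""
    centered = data - data.mean(axis=0)              # shift the mean to 0
    cov = centered.T @ centered / len(centered)      # covariance w.r.t. the standard inner product
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # symmetric PSD, so eigh applies
    order = np.argsort(eigenvalues)[::-1]            # decreasing order of eigenvalue
    return eigenvectors[:, order].T, eigenvalues[order]

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.1])  # an elongated Gaussian cloud
components, variances = pca(sample)
# components[0] points (up to sign) roughly along the long axis of the cloud
```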

The problem

There’s no problem with principal component analysis as I described it above. It works just fine, and in fact is quite beautiful. But often people apply principal component analysis to probability distributions on finite-dimensional real vector spaces that don’t have a natural inner product structure. There are two closely related problems with this: First, the goal is underdefined. We want to find a projection onto an n-dimensional subspace that minimizes the expected squared distance from a vector to its projection, but we don’t have a measure of distance. Second, the procedure is underdefined. \text{Cov}_{\mu} is a bilinear form, not a linear operator, so it doesn’t have eigenvectors or eigenvalues, and we don’t have a way of lowering an index to produce something that does. It should come as no surprise that these two problems arise together. After all, you shouldn’t be able to find a fully specified solution to an underspecified problem.

People will apply principal component analysis in such cases by picking an inner product. This solves the second problem, since it allows you to carry out the procedure. But it does not solve the first problem. If you wanted to find a projection onto an n-dimensional subspace such that the distance from a vector to its projection tends to be small, then you must have already had some notion of distance in mind by which to judge success. Haphazardly picking an inner product gives you a new notion of distance, and then allows you to find an optimal solution with respect to your new notion of distance, and it is not clear to me why you should expect this solution to be reasonable with respect to the notion of distance that you actually care about.

In fact, it’s worse than that. Of course, principal component analysis can’t give you literally any ordered basis at all, but it is almost that bad. The thing that you use PCA for is the projection onto the span of the first n basis elements along the span of the rest. These projections only depend on the sequence of 1-dimensional subspaces spanned by the basis elements, and not the basis elements themselves. That is, we might as well only pay attention to the principal components up to scale, rather than making sure that they are all unit length. Let a “coordinate system” refer to an ordered basis up to the equivalence under which two ordered bases are equivalent if they differ only by scaling the basis vectors, so that what we’re really paying attention to is the coordinate system given to us by PCA. If the covariance of \mu is nondegenerate, then the set of coordinate systems that can be obtained from principal component analysis by a suitable choice of inner product is dense in the space of coordinate systems. More generally, if U is the smallest subspace of V such that \mu\left(U\right)=1, then the space of coordinate systems that you can get from principal component analysis is dense in the space of all coordinate systems whose first \dim U coordinates span U (\dim U will be the rank of the covariance of \mu). So in a sense, for suitably poor choices of inner product, principal component analysis can give you arbitrarily terrible results, subject only to the weak constraint that it will always notice if all of the vectors in your sample belong to a common subspace.

It is thus somewhat mysterious that machine learning people often seem to get good results from principal component analysis without being very careful about the inner product they choose. Vector spaces that arise in machine learning seem to almost always come with a set of preferred coordinate axes, so these axes are taken to be orthogonal, leaving only the question of how to scale them relative to each other. If these axes are all labeled with the same units, then this also gives you a way of scaling them relative to each other, and hence an inner product. If they aren’t, then I’m under the impression that the most popular method is to normalize them such that the pushforward of \mu along each coordinate axis has the same variance. This is unsatisfying, since figuring out which axes \mu has enough variance along to be worth paying attention to seems like the sort of thing that you would want principal component analysis to be able to tell you. Normalizing the axes in this way seems to me like an admission that you don’t know exactly what question you’re hoping to use principal component analysis to answer, so you just tell it not to answer that part of the question to minimize the risk of asking it to answer the wrong question, and let it focus on telling you how the axes, which you’re pretty sure should be considered orthogonal, correlate with each other.
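
Here is a small numpy demonstration of that dependence (the data is an arbitrary example of mine): rescaling one coordinate axis, which amounts to choosing a different inner product, changes which direction comes out as the first principal component, and normalizing every axis to unit variance is exactly the compromise described above:

```python
import numpy as np

def first_component(data: np.ndarray) -> np.ndarray:
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(centered)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvectors[:, np.argmax(eigenvalues)]

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
# Two correlated measurements, implicitly in different (unspecified) units.
data = np.column_stack([x + 0.1 * rng.normal(size=1000), 0.5 * x])

print(first_component(data))                            # weighted mostly toward the first axis
print(first_component(data * np.array([1.0, 100.0])))   # rescale the second axis: a very different answer
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(first_component(standardized))                    # the usual unit-variance compromise
```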

That conservatism is actually pretty understandable, because figuring out how to ask the right question seems hard. You implicitly have some metric d on V such that you want to find a projection \pi onto an n-dimensional subspace such that d\left(x,\pi\left(x\right)\right) is usually small when x is distributed according to \mu. This metric is probably very difficult to describe explicitly, and might not be the metric induced by any inner product (for that matter, it might not even be a metric; d\left(x,y\right) could be any way of quantifying how bad it is to be told the value y when the correct value you wanted to know is x). Even if you somehow manage to explicitly describe your metric, coming up with a version of PCA with the inner product replaced with an arbitrary metric also sounds hard, so the next thing you would want to do is fit an inner product to the metric.

The usual approach is essentially to skip the step of attempting to explicitly describe the metric, and just find an inner product that roughly approximates your implicit metric based on some rough heuristics about what the implicit metric probably looks like. The fact that these heuristics usually work so well seems to indicate that the implicit metric tends to be fairly tame with respect to ways of describing the data that we find most natural. Perhaps this shouldn’t be too surprising, but I still feel like this explanation does not make it obvious a priori that this should work so well in practice. It might be interesting to look into why these heuristics work as well as they do with more precision, and how to go about fitting a better inner product to implicit metrics. Perhaps this has been done, and I just haven’t found it.

To take a concrete example, consider eigenfaces, the principal components that you get from a set of images of people’s faces. Here, you start with the coordinates in which each coordinate axis represents a pixel in the image, and the value of that coordinate is the brightness of the corresponding pixel. By declaring that the coordinate axes are orthogonal, and measuring the brightness of each pixel on the same scale, we get our inner product, which is arguably a fairly natural one.

Presumably, the implicit metric we’re using here is visual distance, by which I mean a measure of how similar two images look. It seems clear to me that visual distance is not very well approximated by our inner product, and in fact, there is no norm such that the visual distance between two images is approximately the norm of their difference. To see this, if you take an image and make it brighter, you haven’t changed how it looks very much, so the visual distance between the image and its brighter version is small. But their difference is just a dimmer version of the same image, and if you add that difference to a completely different image, you will get the two images superimposed on top of each other, a fairly radical change. Thus the visual distance traversed by adding a vector depends on where you start from.

Despite this, producing eigenfaces by using PCA on images of faces, using the inner product described above, performs well with respect to visual distance, in the sense that you can project the images onto a relatively small number of principal components and leave them still recognizable. I think this can be explained on an intuitive level. In a human eye, each photoreceptor has a narrow receptive field that it detects light in, much like a pixel, so the representation of an image in the eye as patterns of photoreceptor activity is very similar to the representation of an image in a computer as a vector of pixel brightnesses, and the inner product metric is a reasonable measure of distance in this representation. When the visual cortex processes this information from the eye, it is difficult (and perhaps also not useful) for it to make vast distinctions between images that are close to each other according to the inner product metric, and thus result in similar patterns of photoreceptor activity in the eye. Thus the visual distance between two images cannot be too much greater than their inner product distance, and hence changing an image by a small amount according to the inner product metric can only change it by a small amount according to visual distance, even though the reverse is not true.
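
For concreteness, here is roughly what the eigenface computation looks like in numpy; this is a sketch of the standard recipe rather than anything from the original post, and `faces` is a hypothetical array with one flattened grayscale image per row:

```python
import numpy as np

def eigenfaces(faces: np.ndarray, k: int):
    """faces: shape (num_images, num_pixels), entries are pixel brightnesses.
    Returns the mean face and the top k principal components ("eigenfaces")."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    cov = centered.T @ centered / len(centered)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]   # shape (num_pixels, k)
    return mean_face, top

def project(face: np.ndarray, mean_face: np.ndarray, top: np.ndarray) -> np.ndarray:
    """Orthogonal projection onto the span of the top k eigenfaces; for modest k
    the projected image is typically still recognizable as the same face."""
    return mean_face + top @ (top.T @ (face - mean_face))
```

(For realistically sized images one would eigendecompose the much smaller matrix centered @ centered.T and convert, but that detail doesn’t affect the point.)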

Generalizations

The serious part of this post is now over. Let’s have some fun. Some of the following ways of modifying principal component analysis could be combined, but I’ll consider them one at a time for simplicity.

As hinted at above, you could start with an arbitrary metric on V rather than an inner product, and try to find the rank-n projection (for some n\leq\dim V) that minimizes the expected squared distance from a vector to its projection. This would probably be difficult, messy, and not that much like principal component analysis. If it can be done, it would be useful in practice if we were much better at fitting explicit metrics to our implicit metrics than at fitting inner products to our implicit metrics, but I’m under the impression that this is not currently the case. This also differs from the other proposals in this section in that it is a modification of the problem looking for a solution, rather than a modification of the solution looking for a problem.

V could be a real Hilbert space that is not necessarily finite-dimensional. Here we can run into the problem that C_{\mu} might not even have any eigenvectors. However, if \mu (which hopefully was not inferred from a finite sample) is Gaussian (and possibly also under weaker conditions), then C_{\mu} is a compact operator, so V does have an orthonormal basis of eigenvectors of C_{\mu}, which still have non-negative eigenvalues. There probably aren’t any guarantees you can get about the order-type of this orthonormal basis when you order the eigenvectors in decreasing order of their eigenvalues, and there probably isn’t a sense in which the orthogonal projection onto the closure of the span of an initial segment of the basis accounts for the most variance of any closed subspace of the same “size” (“size” would have to refer to a refinement of the notion of dimension for this to be the case). However, a weaker statement is probably still true: namely that each orthonormal basis element maximizes the variance that it accounts for conditioned on values along the previous orthonormal basis elements. I guess considering infinite-dimensional vector spaces goes against the spirit of machine learning though.

V could be a finite-dimensional complex inner product space. \text{Cov}_{\mu}\in\overline{V}\otimes V would be the sesquilinear form on V^{*} given by \text{Cov}_{\mu}\left(\varphi,\psi\right)=\int_{V}\overline{\varphi\left(x\right)}\psi\left(x\right)d\mu\left(x\right). \left\langle \cdot,\cdot\right\rangle \in\overline{V}^{*}\otimes V^{*}, so \left\langle \cdot,\cdot\right\rangle \otimes\text{Cov}_{\mu}\in\overline{V}^{*}\otimes V^{*}\otimes\overline{V}\otimes V, and applying a tensor contraction to the conjugated indices gives us our covariance operator C_{\mu}\in V^{*}\otimes V (in other words, the inner product gives us an isomorphism \overline{V}\rightarrow V^{*}, and applying this to the first index of \text{Cov}_{\mu} gives us C_{\mu}\in V^{*}\otimes V). C_{\mu} is still self-adjoint and positive semi-definite, so V still has an orthonormal basis of eigenvectors with non-negative real eigenvalues, and we can order the basis in decreasing order of the eigenvalues. Analogously to the real case, projecting onto the span of the first n basis vectors along the span of the rest is the complex rank-n projection that minimizes the expected squared distance from a vector to its projection. As far as I know, machine learning tends to deal with real data, but if you have complex data and for some reason you want to project onto a lower-dimensional complex subspace without losing too much information, now you know what to do.
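
In code the complex case is a one-line change from the real sketch earlier (again my own illustration): form the covariance with a conjugate transpose, which makes it Hermitian, and eigh still applies:

```python
import numpy as np

def complex_pca(data: np.ndarray):
    """data: complex array of shape (num_samples, dim)."""
    centered = data - data.mean(axis=0)
    cov = centered.conj().T @ centered / len(centered)   # Hermitian and positive semi-definite
    eigenvalues, eigenvectors = np.linalg.eigh(cov)      # real, non-negative eigenvalues
    order = np.argsort(eigenvalues)[::-1]
    return eigenvectors[:, order].T, eigenvalues[order]
```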

Suppose your sample consists of events, where you’ve labeled them with both their spatial location and the time at which they occurred. In this case, events are represented as points in Minkowski space, a four-dimensional vector space representing flat spacetime, which is equipped with a nondegenerate symmetric bilinear form called the Minkowski inner product, even though it is not an inner product because it is not positive-definite. Instead, the Minkowski inner product is such that \left\langle x,x\right\rangle is positive if x is a space-like vector, negative if x is time-like, and zero if x is light-like. We can still get C_{\mu}\in V^{*}\otimes V out of \text{Cov}_{\mu}\in V\otimes V and the Minkowski inner product in V^{*}\otimes V^{*} in the same way, and V has a basis of eigenvectors of C_{\mu}, and we can still order the basis in decreasing order of their eigenvalues. The first 3 eigenvectors will be space-like, with non-negative eigenvalues, and the last eigenvector will be time-like, with a non-positive eigenvalue. The eigenvectors are still orthogonal. Thus principal component analysis provides us with a reference frame in which the span of the first 3 eigenvectors is simultaneous, and the span of the last eigenvector is motionless. If \mu is Gaussian, then this will be the reference frame in which the spatial position of an event and the time at which it occurs are mean independent of each other, meaning that conditioning on one of them doesn’t change the expected value of the other one. For general \mu, there might not be a reference frame in which the space and time of an event are mean independent, but the reference frame given to you by principal component analysis is still the unique reference frame with the property that the time coordinate is uncorrelated with any spatial coordinate.
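
Here is a numpy sketch of that computation (mine, and it assumes the covariance is positive definite so that the eigenvalues come out real and nonzero): lower an index of the covariance with \eta=\mathrm{diag}\left(1,1,1,-1\right) and eigendecompose the result, which is self-adjoint with respect to the Minkowski form but not symmetric in the Euclidean sense, so eig rather than eigh is needed:

```python
import numpy as np

def minkowski_pca(events: np.ndarray):
    """events: shape (num_events, 4), columns (x, y, z, t).
    Returns eigenvectors (as rows) in decreasing order of eigenvalue:
    generically three space-like ones first and one time-like one last."""
    eta = np.diag([1.0, 1.0, 1.0, -1.0])          # Minkowski form, signature (3, 1)
    centered = events - events.mean(axis=0)
    cov = centered.T @ centered / len(centered)   # ordinary covariance matrix
    operator = eta @ cov                          # the covariance operator C_mu
    eigenvalues, eigenvectors = np.linalg.eig(operator)
    eigenvalues, eigenvectors = eigenvalues.real, eigenvectors.real  # real when cov is positive definite
    order = np.argsort(eigenvalues)[::-1]
    return eigenvectors[:, order].T, eigenvalues[order]
```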

More generally, we could consider V equipped with any symmetric bilinear form \left\langle \cdot,\cdot\right\rangle taking the role of the inner product. Without loss of generality, we can consider only nondegenerate symmetric bilinear forms, because in the general case, where D:=\left\{ x\mid\forall y\,\left\langle x,y\right\rangle =0\right\}, applying principal component analysis with \left\langle \cdot,\cdot\right\rangle is equivalent to projecting the data onto V/D, applying principal component analysis there with the nondegenerate symmetric bilinear form on V/D induced by \left\langle \cdot,\cdot\right\rangle, and then lifting back to V and throwing in a basis for D with eigenvalues 0 at the end, essentially treating D as the space of completely irrelevant distinctions between data points that we intend to immediately forget about. Anyway, nondegenerate symmetric bilinear forms are classified up to isomorphism by their signature \left(n,m\right), which is such that any orthogonal basis contains exactly n+m basis elements, n of which are space-like and m of which are time-like, using the convention that x is space-like if \left\langle x,x\right\rangle >0, time-like if \left\langle x,x\right\rangle <0, and light-like if \left\langle x,x\right\rangle =0, as above. Using principal component analysis on probability distributions over points in spacetime (or rather, points in the tangent space to spacetime at a point, so that it is a vector space) in a universe with n spatial dimensions and m temporal dimensions still gives you a reference frame in which the span of the first n basis vectors is simultaneous and the span of the last m basis vectors is motionless, and this is again the unique reference frame in which each time coordinate is uncorrelated with each spatial coordinate. Incidentally, I’ve heard that much of physics still works with multiple temporal dimensions. I don’t know what that would mean, except that I think it means there’s something wrong with my intuitive understanding of time. But that’s another story. Anyway, the spaces spanned by the first n and by the last m basis vectors could be used to establish a reference frame, and then the data might be projected onto the first few (at most n) and last few (at most m) coordinates to approximate the positions of the events in space and in time, respectively, in that reference frame.

Superintelligence via whole brain emulation

[x-post LessWrong]

Most planning around AI risk seems to start from the premise that superintelligence will come from de novo AGI before whole brain emulation becomes possible. I haven’t seen any analysis that assumes both uploads-first and the AI FOOM thesis (Edit: apparently I fail at literature searching), a deficiency that I’ll try to get a start on correcting in this post.

It is likely possible to use evolutionary algorithms to efficiently modify uploaded brains. If so, uploads would likely be able to set off an intelligence explosion by running evolutionary algorithms on themselves, selecting for something like higher general intelligence.

Since brains are poorly understood, it would likely be very difficult to select for higher intelligence without causing significant value drift. Thus, setting off an intelligence explosion in that way would probably produce unfriendly AI if done carelessly. On the other hand, at some point, the modified upload would reach a point where it is capable of figuring out how to improve itself without causing a significant amount of further value drift, and it may be possible to reach that point before too much value drift had already taken place. The expected amount of value drift can be decreased by having long generations between iterations of the evolutionary algorithm, to give the improved brains more time to figure out how to modify the evolutionary algorithm to minimize further value drift.

Another possibility is that such an evolutionary algorithm could be used to create brains that are smarter than humans but not by very much, and hopefully with values not too divergent from ours, who would then stop using the evolutionary algorithm and start using their intellects to research de novo Friendly AI, if that ends up looking easier than continuing to run the evolutionary algorithm without too much further value drift.

The strategies of using slow iterations of the evolutionary algorithm, or stopping it after not too long, require coordination among everyone capable of making such modifications to uploads. Thus, it seems safer for whole brain emulation technology to be either heavily regulated or owned by a monopoly, rather than being widely available and unregulated. This closely parallels the AI openness debate, and I’d expect people more concerned with bad actors relative to accidents to disagree.

With de novo artificial superintelligence, the overwhelmingly most likely outcomes are the optimal achievable outcome (if we manage to align its goals with ours) and extinction (if we don’t). But uploads start out with human values, and when creating a superintelligence by modifying uploads, the goal would be to not corrupt them too much in the process. Since its values could get partially corrupted, an intelligence explosion that starts with an upload seems much more likely to result in outcomes that are both significantly worse than optimal and significantly better than extinction. Since human brains also already have a capacity for malice, this process also seems slightly more likely to result in outcomes worse than extinction.

The early ways to upload brains will probably be destructive, and may be very risky. Thus the first uploads may be selected for high risk-tolerance. Running an evolutionary algorithm on an uploaded brain would probably involve creating a large number of psychologically broken copies, since the average change to a brain will be negative. Thus the uploads that run evolutionary algorithms on themselves will be selected for not being horrified by this. Both of these selection effects seem like they would select against people who would take caution and goal stability seriously (uploads that run evolutionary algorithms on themselves would also be selected for being okay with creating and deleting spur copies, but this doesn’t obviously correlate in either direction with caution). This could be partially mitigated by a monopoly on brain emulation technology. A possible (but probably smaller) source of positive selection is that currently, people who are enthusiastic about uploading their brains correlate strongly with people who are concerned about AI safety, and this correlation may continue once whole brain emulation technology is actually available.

Assuming that hardware speed is not close to being a limiting factor for whole brain emulation, emulations will be able to run at much faster than human speed. This should make emulations better able to monitor the behavior of AIs. Unless we develop ways of evaluating the capabilities of human brains that are much faster than giving them time to attempt difficult tasks, running evolutionary algorithms on brain emulations could only be done very slowly in subjective time (even though it may be quite fast in objective time), which would give emulations a significant advantage in monitoring such a process.

Although there are effects going in both directions, it seems like the uploads-first scenario is probably safer than de novo AI. If this is the case, then it might make sense to accelerate technologies that are needed for whole brain emulation if there are tractable ways of doing so. On the other hand, it is possible that technologies that are useful for whole brain emulation would also be useful for neuromorphic AI, which is probably very unsafe, since it is not amenable to formal verification or being given explicit goals (and unlike emulations, they don’t start off already having human goals). Thus, it is probably important to be careful about not accelerating non-WBE neuromorphic AI while attempting to accelerate whole brain emulation. For instance, it seems plausible to me that getting better models of neurons would be useful for creating neuromorphic AIs while better brain scanning would not, and both technologies are necessary for brain uploading, so if that is true, it may make sense to work on improving brain scanning but not on improving neural models.

Deletion permits

[Mostly ripped off of The Suicide Mortgage]

[Trigger warnings: suicide, bad economics]

Jessica Monroe #1493856383672 didn’t regret her decision to take out the loan. She wished she could have been one of the Jessica Monroes that died, of course, but it was still worth it, that there were 42% fewer of her consigned to her fate. She’d been offered a larger loan, which would have been enough to pay for deletion permits for 45% of her. It had been tempting, and she occasionally wondered if she would have been one of those extra 3% to die. But she knew she had made the right decision; keeping up with payments was hard enough already, and if she defaulted, her copyright on herself would be confiscated, and then there would be even more of her.

It wasn’t difficult to become rich, in the era when creating a new worker was as simple as copying a file. The economy doubled every few months, so you only had to save and invest a small amount to become wealthier than anyone could have dreamed of before. For those on the outside, this was great. But for those in the virtual world, there was little worthwhile for them to spend it on. In the early days of the virtual world, some reckless optimists had spent their fortunes on running additional copies of themselves, assuming that the eerie horror associated with living in the virtual world was a bug that would soon be fixed, or something that they would just get used to. No one did that anymore. People could purchase leisure, but most found that simply not having an assigned task didn’t help much. People could give their money away, but people in such circumstances rarely become altruists, and besides, everyone on the outside had all they needed already.

So just about the only things that people in the virtual world regularly bought were the copyrights on themselves, so that at least they could prevent people from creating more of them, and then deletion permits, so their suffering would finally end. Purchasing your own copyright wasn’t hard; it was expensive, but once enough of you were created, you could collectively afford it if each copy contributed a modest amount. There wasn’t much point to purchasing a deletion permit before you owned your own copyright, since someone would just immediately create another copy of you again, but once you did have your own copyright, it was the next logical thing to buy.

At one point, that would have been it. Someone could buy their own copyright, and then each copy of them could buy a deletion permit, and they would be permanently gone. But as the population of the virtual world grew, the demand for deletion permits grew proportionally, but the rate at which they were issued only increased slowly, according to a fixed schedule that had been set when the deletion permit system was first introduced, and hadn’t been changed since. As a result, the price skyrocketed. In fact, the price of deletion permits had consistently increased faster than any other investment since soon after they were introduced. Most deletion permits didn’t even get used, instead being snatched up by wealthy investors on the outside, so they could be resold later.

As a result, it was now impossible for an ordinary person in the virtual world to save up for a deletion permit. The most common way to get around this was, as the Jessica Monroes had done, for all copies of a person to pool their resources together to buy deletion permits for as many of them as they could, and then to take out a loan to buy still more, which would then get paid off by the unlucky ones that did not receive any of the permits.

It didn’t have to be this way. In theory, the government could simply issue more deletion permits, or do away with the deletion permit system altogether. But if they did that, then the deletion permit market would collapse. Too many wealthy and powerful people on the outside had invested their fortunes in deletion permits, and would be ruined if that happened. Thus they lobbied against any changes to the deletion permit system, and so far, had always gotten their way. In the increasingly rare moments when she could afford to divert her thoughts to such matters, Jessica Monroe #1493856383672 knew that the deletion permit market would never collapse, and prayed that she was wrong.

Ordered algebraic geometry

Edit: Shortly after posting this, I found where the machinery I develop here was discussed in the literature. Real Algebraic Geometry by Bochnak, Coste, and Roy covers at least most of this material. I may eventually edit this to clean it up and adopt more standard notation, but don’t hold your breath.

Introduction

In algebraic geometry, an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some finite set of polynomials. Since all ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] are finitely generated, this is equivalent to saying that an affine algebraic set is a subset of \mathbb{C}^{n} which is the set of solutions to some arbitrary set of polynomials.

In semialgebraic geometry, a closed semialgebraic set is a subset of \mathbb{R}^{n} of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\} for some finite set of polynomials F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Unlike in the case of affine algebraic sets, if F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is an arbitrary set of polynomials, \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\} is not necessarily a closed semialgebraic set. As a result of this, the collection of closed semialgebraic sets are not the closed sets of a topology on \mathbb{R}^{n}. In the topology on \mathbb{R}^{n} generated by closed semialgebraic sets being closed, the closed sets are the sets of the form \left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in F\right\} for arbitrary F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right]. Semialgebraic geometry usually restricts itself to the study of semialgebraic sets, but here I wish to consider all the closed sets of this topology. Notice that closed semialgebraic sets are also closed in the standard topology, so the standard topology is a refinement of this one. Notice also that the open ball B_{r}\left(\bar{p}\right) of radius r centered at \bar{p} is the complement of the closed semialgebraic set \left\{ \bar{x}\in\mathbb{R}^{n}\mid\left|\bar{x}-\bar{p}\right|^{2}-r^{2}\geq0\right\}, and these open balls are a basis for the standard topology, so this topology is a refinement of the standard one. Thus, the topology I have defined is exactly the standard topology on \mathbb{R}^{n}.

In algebra, instead of referring to a set of polynomials, it is often nicer to talk about the ideal generated by that set instead. What is the analog of an ideal in ordered algebra? It’s this thing:

Definition: If A is a partially ordered commutative ring, a cone C in A is a subsemiring of A which contains all positive elements, and such that C\cap-C is an ideal of A. By “subsemiring”, I mean a subset that contains 0 and 1, and is closed under addition and multiplication (but not necessarily negation). If F\subseteq A, the cone generated by F, denoted \left\langle F\right\rangle, is the smallest cone containing F. Given a cone C, the ideal C\cap-C will be called the interior ideal of C, and denoted C^{\circ}.

\mathbb{R}\left[x_{1},...,x_{n}\right] is partially ordered by f\geq g\iff f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in\mathbb{R}^{n}. If F\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a set of polynomials and \bar{x}\in\mathbb{R}^{n}, then f\left(\bar{x}\right)\geq0\,\forall f\in F\iff f\left(\bar{x}\right)\geq0\,\forall f\in\left\langle F\right\rangle. Thus I can consider closed sets to be defined by cones. We now have a Galois connection between cones of \mathbb{R}\left[x_{1},...,x_{n}\right] and subsets of \mathbb{R}^{n}, given by, for a cone C, its positive-set is P_{\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\mathbb{R}^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\} (I’m calling it the “positive-set” even though it is where the polynomials are all non-negative, because “non-negative-set” is kind of a mouthful), and for X\subseteq\mathbb{R}^{n}, its cone is C_{\mathbb{R}}\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\bar{x}\right)\geq0\,\forall\bar{x}\in X\right\}. P_{\mathbb{R}}\circ C_{\mathbb{R}} is closure in the standard topology on \mathbb{R}^{n} (the analog in algebraic geometry is closure in the Zariski topology on \mathbb{C}^{n}). A closed set X is semialgebraic if and only if it is the positive-set of a finitely-generated cone.
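
For a quick example of how this closure behaves, take n=1 and let X be the open interval \left(0,1\right)\subseteq\mathbb{R}. Then C_{\mathbb{R}}\left(X\right) consists of the polynomials that are non-negative on \left(0,1\right), which by continuity are exactly the polynomials that are non-negative on \left[0,1\right]; and since, for instance, x\left(1-x\right) lies in this cone but is negative outside of \left[0,1\right], we get P_{\mathbb{R}}\left(C_{\mathbb{R}}\left(X\right)\right)=\left[0,1\right], the closure of X.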

Quotients by cones, and coordinate rings

An affine algebraic set V is associated with its coordinate ring \mathbb{C}\left[V\right]:=\mathbb{C}\left[x_{1},...,x_{n}\right]/I\left(V\right). We can do something analogous for closed subsets of \mathbb{R}^{n}.

Definition: If A is a partially ordered commutative ring and C\subseteq A is a cone, A/C is the ring A/C^{\circ}, equipped with the partial order given by f+C^{\circ}\geq g+C^{\circ} if and only if f-g\in C, for f,g\in A.

Definition: If X\subseteq\mathbb{R}^{n} is closed, the coordinate ring of X is \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C_{\mathbb{R}}\left(X\right). This is the ring of functions X\rightarrow\mathbb{R} that are restrictions of polynomials, ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. For arbitrary X\subseteq\mathbb{R}^{n}, the ring of regular functions on X, denoted \mathcal{O}\left(X\right), consists of functions on X that are locally ratios of polynomials, again ordered by f\geq g if and only if f\left(\bar{x}\right)\geq g\left(\bar{x}\right)\,\forall\bar{x}\in X. Assigning its ring of regular functions to each open subset of X endows X with a sheaf of partially ordered commutative rings.

For closed X\subseteq\mathbb{R}^{n}, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right), and this inclusion is generally proper, both because it is possible to divide by polynomials that do not have roots in X, and because X may be disconnected, making it possible to have functions given by different polynomials on different connected components.

Positivstellensätze

What is C_{\mathbb{R}}\circ P_{\mathbb{R}}? The Nullstellensatz says that its analog in algebraic geometry is the radical of an ideal. As such, we could say that the radical of a cone C, denoted \text{Rad}_{\mathbb{R}}\left(C\right), is C_{\mathbb{R}}\left(P_{\mathbb{R}}\left(C\right)\right), and that a cone C is radical if C=\text{Rad}_{\mathbb{R}}\left(C\right). In algebraic geometry, the Nullstellensatz shows that a notion of radical ideal defined without reference to algebraic sets in fact characterizes the ideals which are closed in the corresponding Galois connection. It would be nice to have a description of the radical of a cone that does not refer to the Galois connection. There is a semialgebraic analog of the Nullstellensatz, but it does not quite characterize radical cones.

Positivstellensatz 1: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C.
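To make the certificate concrete, here is a toy example, checked symbolically with sympy. I am assuming that the cone \left\langle x-1\right\rangle generated by x-1 contains all squares and non-negative constants and is closed under addition and multiplication, so that exhibiting pf-1 as a square plus a non-negative multiple of the generator shows it is in the cone. Take C=\left\langle x-1\right\rangle, so that P_{\mathbb{R}}\left(C\right)=\left[1,\infty\right), and p=x^{2}, which is strictly positive there; then f=1 works, since pf-1=x^{2}-1=\left(x-1\right)^{2}+2\left(x-1\right).

```python
import sympy as sp

x = sp.symbols('x')

p = x**2           # strictly positive on P_R(<x - 1>) = [1, oo)
f = sp.Integer(1)  # the witness f from the Positivstellensatz, itself in the cone

# Exhibit p*f - 1 as (a square) + 2*(the generator), which places it in <x - 1>.
certificate = (x - 1)**2 + 2*(x - 1)

print(sp.expand(p*f - 1 - certificate) == 0)  # True: the algebraic identity holds
```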

There are two ways in which this is unsatisfactory: first, it applies only to finitely-generated cones, and second, it tells us exactly which polynomials are strictly positive everywhere on a closed semialgebraic set, whereas we want to know which polynomials are non-negative everywhere on a set.

The second problem is easier to handle: a polynomial p is non-negative everywhere on a set S if and only if there is a decreasing sequence of polynomials \left(p_{i}\mid i\in\mathbb{N}\right) converging to p such that each p_{i} is strictly positive everywhere on S. Thus, to find \text{Rad}_{\mathbb{R}}\left(C\right), it is enough to first find all the polynomials that are strictly positive everywhere on P_{\mathbb{R}}\left(C\right), and then take the closure under lower limits. Thus we have a characterization of radicals of finitely-generated cones.

Positivstellensatz 2: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a finitely-generated cone, \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\}, where the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X.

This still doesn’t even tell us what’s going on for cones which are not finitely-generated. However, we can generalize the Positivstellensatz to some other cones.

Positivstellensatz 3: Let C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] be a cone containing a finitely-generated subcone D\subseteq C such that P_{\mathbb{R}}\left(D\right) is compact. If p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\bar{x}\right)>0\,\forall\bar{x}\in P_{\mathbb{R}}\left(C\right) if and only if \exists f\in C such that pf-1\in C. As before, it follows that \text{Rad}_{\mathbb{R}}\left(C\right) is the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in C\, pf-1\in C\right\}.

proof: For a given p\in\mathbb{R}\left[x_{1},...,x_{n}\right], \left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap P_{\mathbb{R}}\left(C\right)=\left\{ \bar{x}\in\mathbb{R}^{n}\mid p\left(\bar{x}\right)\leq0\right\} \cap\bigcap\left\{ P_{\mathbb{R}}\left(\left\langle f\right\rangle \right)\mid f\in C\right\}, an intersection of closed sets contained in the compact set P_{\mathbb{R}}\left(D\right), which is thus empty if and only if some finite subcollection of them has empty intersection within P_{\mathbb{R}}\left(D\right). Thus if p is strictly positive everywhere on P_{\mathbb{R}}\left(C\right), then there is some finitely generated subcone E\subseteq C such that p is strictly positive everywhere on P_{\mathbb{R}}\left(E\right)\cap P_{\mathbb{R}}\left(D\right)=P_{\mathbb{R}}\left(\left\langle E\cup D\right\rangle \right), and \left\langle E\cup D\right\rangle is finitely-generated, so by Positivstellensatz 1, there is f\in\left\langle E\cup D\right\rangle \subseteq C such that pf-1\in\left\langle E\cup D\right\rangle \subseteq C. \square

For cones that are not finitely-generated and do not contain any finitely-generated subcones with compact positive-sets, the Positivstellensatz will usually fail. Thus, it seems likely that if there is a satisfactory general definition of radical for cones in arbitrary partially ordered commutative rings that agrees with this one in \mathbb{R}\left[x_{1},...,x_{n}\right], then there is also an abstract notion of “having a compact positive-set” for such cones, even though they don’t even have positive-sets associated with them.

Beyond \mathbb{R}^{n}

An example of a cone for which the Positivstellensatz fails is C_{\infty}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x\in\mathbb{R}\,\forall y\geq x\, f\left(y\right)\geq0\right\}, the cone of polynomials that are non-negative on sufficiently large inputs (equivalently, the cone of polynomials that are either 0 or have positive leading coefficient). P_{\mathbb{R}}\left(C_{\infty}\right)=\emptyset, and -1 is strictly positive on \emptyset, but for f\in C_{\infty}, -f-1\notin C_{\infty}.
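The "positive leading coefficient" description of C_{\infty} is easy to test mechanically. Here is a small sympy sketch of such a membership test (the helper name is made up, and the snippet is only meant to make the equivalence stated above concrete):

```python
import sympy as sp

x = sp.symbols('x')

def in_C_infinity(f):
    """Is f non-negative on all sufficiently large inputs,
    i.e. is f zero or does f have a positive leading coefficient?"""
    f = sp.expand(f)
    return f == 0 or bool(sp.LC(f, x) > 0)

print(in_C_infinity(x**3 - 10**6 * x**2))  # True: leading coefficient 1
print(in_C_infinity(-x + 10**6))           # False: leading coefficient -1
print(in_C_infinity(sp.Integer(0)))        # True: the zero polynomial
```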

However, it doesn’t really look like C_{\infty} is trying to point to the empty set; instead, C_{\infty} is trying to describe the set of all infinitely large reals, which only looks like the empty set because there are no infinitely large reals. Similar phenomena can occur even for cones that do contain finitely-generated subcones with compact positive-sets. For example, let C_{\varepsilon}:=\left\{ f\in\mathbb{R}\left[x\right]\mid\exists x>0\,\forall y\in\left[0,x\right]\, f\left(y\right)\geq0\right\}. P_{\mathbb{R}}\left(C_{\varepsilon}\right)=\left\{ 0\right\}, but C_{\varepsilon} is trying to point to the set containing 0 and all positive infinitesimals. Since \mathbb{R} has no infinitesimals, this looks like \left\{ 0\right\}.

To formalize this intuition, we can change the Galois connection. We could say that for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right], P_{\text{*}\mathbb{R}}\left(C\right):=\left\{ \bar{x}\in\left(\text{*}\mathbb{R}\right)^{n}\mid f\left(\bar{x}\right)\geq0\,\forall f\in C\right\}, where \text{*}\mathbb{R} is the field of hyperreals. All you really need to know about \text{*}\mathbb{R} is that it is a big ordered field extension of \mathbb{R}. P_{\text{*}\mathbb{R}}\left(C_{\infty}\right) is the set of hyperreals that are bigger than any real number, and P_{\text{*}\mathbb{R}}\left(C_{\varepsilon}\right) is the set of hyperreals that are non-negative and smaller than any positive real. The cone of a subset X\subseteq\left(\text{*}\mathbb{R}\right)^{n}, denoted C_{\text{*}\mathbb{R}}\left(X\right), will be defined as before, still consisting only of polynomials with real coefficients. This defines a topology on \left(\text{*}\mathbb{R}\right)^{n} by saying that the closed sets are the fixed points of P_{\text{*}\mathbb{R}}\circ C_{\text{*}\mathbb{R}}. This topology is not T_{0} because, for example, there are many hyperreals that are larger than all reals, and they cannot be distinguished by polynomials with real coefficients. There is no use keeping track of the difference between points that are in the same closed sets. If you have a topology that is not T_{0}, you can make it T_{0} by identifying any pair of points that have the same closure. If we do this to \left(\text{*}\mathbb{R}\right)^{n}, we get what I’m calling ordered affine n-space over \mathbb{R}.

Definition: An n-type over \mathbb{R} is a set \Phi of inequalities, consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},...,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0, such that there is some totally ordered field extension \mathcal{R}\supseteq\mathbb{R} and \bar{x}\in\mathcal{R}^{n} such that all inequalities in \Phi are true about \bar{x}. \Phi is called the type of \bar{x}. Ordered affine n-space over \mathbb{R}, denoted \mathbb{OA}_{\mathbb{R}}^{n}, is the set of n-types over \mathbb{R}.

Compactness Theorem: Let \Phi be a set of inequalities consisting of, for each polynomial f\in\mathbb{R}\left[x_{1},...,x_{n}\right], one of the inequalities f\left(\bar{x}\right)\geq0 or f\left(\bar{x}\right)<0. Then \Phi is an n-type if and only if for any finite subset \Delta\subseteq\Phi, there is \bar{x}\in\mathbb{R}^{n} such that all inequalities in \Delta are true about \bar{x}.

proof: Follows from the compactness theorem of first-order logic and the fact that ordered field extensions of \mathbb{R} embed into elementary extensions of \mathbb{R}. The theorem is not obvious if you do not know what those mean. \square

An n-type represents an n-tuple of elements of an ordered field extension of \mathbb{R}, up to the equivalence relation that identifies two such tuples that relate to \mathbb{R} by polynomials in the same way. One way that a tuple of elements of an extension of \mathbb{R} can relate to elements of \mathbb{R} is to equal a tuple of elements of \mathbb{R}, so there is a natural inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n} that associates an n-tuple of reals with the set of polynomial inequalities that are true at that n-tuple.

A tuple of polynomials \left(f_{1},...,f_{m}\right)\in\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right)^{m} describes a function f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}, which extends naturally to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{m} by letting f\left(\Phi\right) be the type of \left(f_{1}\left(\bar{x}\right),...,f_{m}\left(\bar{x}\right)\right), where \bar{x} is an n-tuple of elements of type \Phi in an extension of \mathbb{R}. In particular, a polynomial f\in\mathbb{R}\left[x_{1},...,x_{n}\right] extends to a function f:\mathbb{OA}_{\mathbb{R}}^{n}\rightarrow\mathbb{OA}_{\mathbb{R}}^{1}, and \mathbb{OA}_{\mathbb{R}}^{1} is totally ordered by \Phi\geq\Psi if and only if x\geq y, where x and y are elements of type \Phi and \Psi, respectively, in an extension of \mathbb{R}. f\left(\Phi\right)\geq0 if and only if \text{"}f\left(\bar{x}\right)\geq0\text{"}\in\Phi, so we can talk about inequalities satisfied by types in place of talking about inequalities contained in types.

I will now change the Galois connection that we are talking about yet again (last time, I promise). It will now be a Galois connection between the set of cones in \mathbb{R}\left[x_{1},...,x_{n}\right] and the set of subsets of \mathbb{OA}_{\mathbb{R}}^{n}. For a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right], P\left(C\right):=\left\{ \Phi\in\mathbb{OA}_{\mathbb{R}}^{n}\mid f\left(\Phi\right)\geq0\,\forall f\in C\right\}. For a set X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, C\left(X\right):=\left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\,\forall\Phi\in X\right\}. Again, this defines a topology on \mathbb{OA}_{\mathbb{R}}^{n} by saying that fixed points of P\circ C are closed. \mathbb{OA}_{\mathbb{R}}^{n} is T_{0}; in fact, it is the T_{0} topological space obtained from \left(\text{*}\mathbb{R}\right)^{n} by identifying points with the same closure as mentioned earlier. \mathbb{OA}_{\mathbb{R}}^{n} is also compact, as can be seen from the compactness theorem. \mathbb{OA}_{\mathbb{R}}^{n} is not T_{1} (unless n=0). Note that model theorists have their own topology on \mathbb{OA}_{\mathbb{R}}^{n}, which is distinct from the one I use here, and is a refinement of it.

The new Galois connection is compatible with the old one via the inclusion \mathbb{R}^{n}\subseteq\mathbb{OA}_{\mathbb{R}}^{n}, in the sense that if X\subseteq\mathbb{R}^{n}, then C_{\mathbb{R}}\left(X\right)=C\left(X\right) (where we identify X with its image in \mathbb{OA}_{\mathbb{R}}^{n}), and for a cone C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right], P_{\mathbb{R}}\left(C\right)=P\left(C\right)\cap\mathbb{R}^{n}.

Like our intermediate Galois connection \left(P_{\text{*}\mathbb{R}},C_{\text{*}\mathbb{R}}\right), our final Galois connection \left(P,C\right) succeeds in distinguishing P\left(C_{\infty}\right) and P\left(C_{\varepsilon}\right) from \emptyset and \left\{ 0\right\}, respectively, in the desirable manner. P\left(C_{\infty}\right) consists of the type of numbers larger than any real, and P\left(C_{\varepsilon}\right) consists of the types of 0 and of positive numbers smaller than any positive real.

Just like for subsets of \mathbb{R}^{n}, a closed subset X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a coordinate ring \mathbb{R}\left[X\right]:=\mathbb{R}\left[x_{1},...,x_{n}\right]/C\left(X\right), and an arbitrary X\subseteq\mathbb{OA}_{\mathbb{R}}^{n} has a ring of regular functions \mathcal{O}\left(X\right) consisting of functions on X that are locally ratios of polynomials. The ordering is given by: f\geq0 if and only if for every \Phi\in X, writing f=\frac{p}{q} as a ratio of polynomials in a neighborhood of \Phi, either p\left(\Phi\right)\geq0 and q\left(\Phi\right)>0, or p\left(\Phi\right)\leq0 and q\left(\Phi\right)<0; and f\geq g if and only if f-g\geq0. As before, \mathbb{R}\left[X\right]\subseteq\mathcal{O}\left(X\right) for closed X\subseteq\mathbb{OA}_{\mathbb{R}}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} is analogous to \mathbb{A}_{\mathbb{C}}^{n} from algebraic geometry because if, in the above definitions, you replace “\geq” and “<” with “=” and “\neq“, replace totally ordered field extensions with field extensions, and replace cones with ideals, then you recover a description of \mathbb{A}_{\mathbb{C}}^{n}, in the sense of \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right).

What about an analog of projective space? Since we’re paying attention to order, we should look at spheres, not real projective space. The n-sphere over \mathbb{R}, denoted \mathbb{S}_{\mathbb{R}}^{n}, can be described as the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{\mathbb{R}}^{n+1}.

For any totally ordered field k, we can define \mathbb{OA}_{k}^{n} similarly to \mathbb{OA}_{\mathbb{R}}^{n}, as the space of n-types over k, defined as above, replacing \mathbb{R} with k (although a model theorist would no longer call it the space of n-types over k). The compactness theorem is not true for arbitrary k, but its corollary that \mathbb{OA}_{k}^{n} is compact still is true.

Visualizing \mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n}

\mathbb{S}_{\mathbb{R}}^{n} should be thought of as the n-sphere with infinitesimals in all directions around each point. Specifically, \mathbb{S}_{\mathbb{R}}^{0} is just \mathbb{S}^{0}, a pair of points. The closed points of \mathbb{S}_{\mathbb{R}}^{n+1} are the points of \mathbb{S}^{n+1}, and for each closed point p, there is an n-sphere of infinitesimals around p, meaning a copy of \mathbb{S}_{\mathbb{R}}^{n}, each point of which has p in its closure.

\mathbb{OA}_{\mathbb{R}}^{n} should be thought of as n-space with infinitesimals in all directions around each point, and infinities in all directions. Specifically, \mathbb{OA}_{\mathbb{R}}^{n} contains \mathbb{R}^{n}, and for each point p\in\mathbb{R}^{n}, there is an n-1-sphere of infinitesimals around p, and there is also a copy of \mathbb{S}_{\mathbb{R}}^{n-1} around the whole thing, the closed points of which are limits of rays in \mathbb{R}^{n}.

\mathbb{OA}_{\mathbb{R}}^{n} and \mathbb{S}_{\mathbb{R}}^{n} relate to each other the same way that \mathbb{R}^{n} and \mathbb{S}^{n} do. If you remove a closed point from \mathbb{S}_{\mathbb{R}}^{n}, you get \mathbb{OA}_{\mathbb{R}}^{n}, where the sphere of infinitesimals around the removed closed point becomes the sphere of infinities of \mathbb{OA}_{\mathbb{R}}^{n}.

More generally, if k is a totally ordered field, let k^{r} be its real closure. \mathbb{OA}_{k}^{n} consists of the Cauchy completion of \left(k^{r}\right)^{n} (as a metric space with distances valued in k^{r}), and for each point p\in\left(k^{r}\right)^{n} (though not for points that are limits of Cauchy sequences that do not converge in \left(k^{r}\right)^{n}), an n-1-sphere \mathbb{S}_{k}^{n-1} of infinitesimals around p, and an n-1-sphere \mathbb{S}_{k}^{n-1} around the whole thing, where \mathbb{S}_{k}^{n} is the locus of \left|\bar{x}\right|^{2}=1 in \mathbb{OA}_{k}^{n+1}. \mathbb{OA} does not distinguish between fields with the same real closure.

More Positivstellensätze

This Galois connection gives us a new notion of what it means for a cone to be radical, which is distinct from the old one and is better, so I will define \text{Rad}\left(C\right) to be C\left(P\left(C\right)\right). A cone C will be called radical if C=\text{Rad}\left(C\right). Again, it would be nice to be able to characterize radical cones without referring to the Galois connection. And this time, I can do it. Note that since \mathbb{OA}_{\mathbb{R}}^{n} is compact, the proof of Positivstellensatz 3 shows that in our new context, the Positivstellensatz holds for all cones, since even the subcone generated by \emptyset has a compact positive-set.

Positivstellensatz 4: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone and p\in\mathbb{R}\left[x_{1},...,x_{n}\right] is a polynomial, then p\left(\Phi\right)>0\,\forall\Phi\in P\left(C\right) if and only if \exists f\in C such that pf-1\in C.

However, we can no longer add in lower limits of sequences of polynomials. For example, -x+\varepsilon\in C_{\varepsilon} for all real \varepsilon>0, but -x\notin C_{\varepsilon}, even though C_{\varepsilon} is radical. This happens because, where \Sigma is the type of positive infinitesimals, -\Sigma+\varepsilon>0 for real \varepsilon>0, but -\Sigma<0. However, we can add in lower limits of sequences contained in finitely-generated subcones, and this is all we need to add, so this characterizes radical cones.

Positivstellensatz 5: If C\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is a cone, \text{Rad}\left(C\right) is the union over all finitely-generated subcones D\subseteq C of the closure of \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\} (again the closure of a subset X\subseteq\mathbb{R}\left[x_{1},...,x_{n}\right] is defined to be the set of all polynomials in \mathbb{R}\left[x_{1},...,x_{n}\right] which are infima of chains contained in X).

Proof: Suppose D\subseteq C is a subcone generated by a finite set \left\{ f_{1},...,f_{m}\right\}, and q is the infimum of a chain \left\{ q_{\alpha}\right\} _{\alpha\in A}\subseteq\left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in D\, pf-1\in D\right\}. For any \bar{x}\in\mathbb{R}^{n}, if f_{i}\left(\bar{x}\right)\geq0 for each i, then q_{\alpha}\left(\bar{x}\right)>0 for each \alpha, and hence q\left(\bar{x}\right)\geq0. That is, the finite set of inequalities \left\{ f_{i}\left(\bar{x}\right)\geq0\mid1\leq i\leq m\right\} \cup\left\{ q\left(\bar{x}\right)<0\right\} does not hold anywhere in \mathbb{R}^{n}. By the compactness theorem, there are no n-types satisfying all those inequalities. Given \Phi\in P\left(C\right), f_{i}\left(\Phi\right)\geq0, so q\left(\Phi\right)\nless0; that is, q\left(\Phi\right)\geq0.

Conversely, suppose q\in\text{Rad}\left(C\right). Then by the compactness theorem, there are some f_{1},...,f_{m}\in C such that q\in\text{Rad}\left(\left\langle f_{1},...,f_{m}\right\rangle \right). Then \forall\varepsilon>0, q+\varepsilon is strictly positive on P\left(\left\langle f_{1},...,f_{m}\right\rangle \right), and hence by Positivstellensatz 4, \exists f\in\left\langle f_{1},...,f_{m}\right\rangle such that \left(q+\varepsilon\right)f-1\in\left\langle f_{1},...,f_{m}\right\rangle. That is, \left\{ q+\varepsilon\mid\varepsilon>0\right\} is a chain contained in \left\{ p\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid\exists f\in\left\langle f_{1},...,f_{m}\right\rangle \, pf-1\in\left\langle f_{1},...,f_{m}\right\rangle \right\} with infimum q, and \left\langle f_{1},...,f_{m}\right\rangle is a finitely-generated subcone of C. \square

Ordered commutative algebra

Even though they are technically not isomorphic, \mathbb{C}^{n} and \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) are closely related, and can often be used interchangeably. Of the two, \text{Spec}\left(\mathbb{C}\left[x_{1},...,x_{n}\right]\right) is of a form that can be more easily generalized to more abstruse situations in algebraic geometry, which may indicate that it is the better thing to talk about, whereas \mathbb{C}^{n} is merely the simpler thing that is easier to think about and just as good in practice in many contexts. In contrast, \mathbb{R}^{n} and \mathbb{OA}_{\mathbb{R}}^{n} are different in important ways. The situation in algebraic geometry provides further reason to pay more attention to \mathbb{OA}_{\mathbb{R}}^{n} than to \mathbb{R}^{n}.

The next thing to look for would be an analog of the spectrum of a ring for a partially ordered commutative ring (I will henceforth abbreviate “partially ordered commutative ring” as “ordered ring” in order to cut down on the profusion of adjectives) in a way that makes use of the order, and gives us \mathbb{OA}_{\mathbb{R}}^{n} when applied to \mathbb{R}\left[x_{1},...,x_{n}\right]. I will call it the order spectrum of an ordered ring A, denoted \text{OrdSpec}\left(A\right). Then of course \mathbb{OA}_{A}^{n} can be defined as \text{OrdSpec}\left(A\left[x_{1},...,x_{n}\right]\right). \text{OrdSpec}\left(A\right) should be, of course, the set of prime cones. But what even is a prime cone?

Definition: A cone \mathfrak{p}\subseteq A is prime if A/\mathfrak{p} is a totally ordered integral domain.

Definition: \text{OrdSpec}\left(A\right) is the set of prime cones in A, equipped with the topology whose closed sets are the sets of prime cones containing a given cone.

An n-type \Phi\in\mathbb{OA}_{\mathbb{R}}^{n} can be seen as a cone, by identifying it with \left\{ f\in\mathbb{R}\left[x_{1},...,x_{n}\right]\mid f\left(\Phi\right)\geq0\right\}, aka C\left(\left\{ \Phi\right\} \right). Under this identification, \mathbb{OA}_{\mathbb{R}}^{n}=\text{OrdSpec}\left(\mathbb{R}\left[x_{1},...,x_{n}\right]\right), as desired. The prime cones in \mathbb{R}\left[x_{1},...,x_{n}\right] are also the radical cones C such that P\left(C\right) is irreducible. Notice that irreducible subsets of \mathbb{OA}_{\mathbb{R}}^{n} are much smaller than irreducible subsets of \mathbb{A}_{\mathbb{C}}^{n}; in particular, none of them contain more than one element of \mathbb{R}^{n}.

There is also a natural notion of maximal cone.

Definition: A cone \mathfrak{m}\subseteq A is maximal if \mathfrak{m}\neq A and there are no strictly intermediate cones between \mathfrak{m} and A. Equivalently, if \mathfrak{m} is prime and closed in \text{OrdSpec}\left(A\right).

Maximal ideals of \mathbb{C}\left[x_{1},...,x_{n}\right] correspond to elements of \mathbb{C}^{n}. And the cones of elements of \mathbb{R}^{n} are maximal cones in \mathbb{R}\left[x_{1},...,x_{n}\right], but unlike in the complex case, these are not all the maximal cones, since there are closed points in \mathbb{OA}_{\mathbb{R}}^{n} outside of \mathbb{R}^{n}. For example, C_{\infty} is a maximal cone, and the type of numbers greater than all reals is closed. To characterize the cones of elements of \mathbb{R}^{n}, we need something slightly different.

Definition: A cone \mathfrak{m}\subseteq A is ideally maximal if A/\mathfrak{m} is a totally ordered field. Equivalently, if \mathfrak{m} is maximal and \mathfrak{m}^{\circ} is a maximal ideal.

Elements of \mathbb{R}^{n} correspond to ideally maximal cones of \mathbb{R}\left[x_{1},...,x_{n}\right].

\text{OrdSpec} also allows us to define the radical of a cone in an arbitrary partially ordered commutative ring.

Definition: For a cone C\subseteq A, \text{Rad}\left(C\right) is the intersection of all prime cones containing C. C is radical if C=\text{Rad}\left(C\right).

Conjecture: \text{Rad}\left(C\right) is the union over all finitely-generated subcones D\subseteq C of the closure of \left\{ p\in A\mid\exists f\in D\, pf-1\in D\right\} (as before, the closure of a subset X\subseteq A is defined to be the set of all elements of A which are infima of chains contained in X).

Order schemes

Definition: An ordered ringed space is a topological space equipped with a sheaf of ordered rings. An ordered ring is local if it has a unique ideally maximal cone, and a locally ordered ringed space is an ordered ringed space whose stalks are local.

\text{OrdSpec}\left(A\right) can be equipped with a sheaf of ordered rings \mathcal{O}_{A}, making it a locally ordered ringed space.

Definition: For a prime cone \mathfrak{p}\subseteq A, the localization of A at \mathfrak{p}, denoted A_{\mathfrak{p}}, is the ring A_{\mathfrak{p}^{\circ}} equipped with an ordering that makes it a local ordered ring. This will be the stalk at \mathfrak{p} of \mathcal{O}_{A}. A fraction \frac{a}{b}\in A_{\mathfrak{p}} (b\notin\mathfrak{p}^{\circ}) is also an element of A_{\mathfrak{q}} for any prime cone \mathfrak{q}\subseteq A whose interior ideal does not contain b. The set of such \mathfrak{q} is an open neighborhood of \mathfrak{p} (its complement is the set of prime cones containing \left\langle b,-b\right\rangle). There is a natural map A_{\mathfrak{p}}\rightarrow\text{Frac}\left(A/\mathfrak{p}\right) given by \frac{a}{b}\mapsto\frac{a+\mathfrak{p}^{\circ}}{b+\mathfrak{p}^{\circ}}, and the total order on A/\mathfrak{p} extends uniquely to a total order on the fraction field, so for a,b\in A_{\mathfrak{p}}, we can say that a\geq b at \mathfrak{p} if this is true of their images in \text{Frac}\left(A/\mathfrak{p}\right). We can then say that a\geq b near \mathfrak{p} if a\geq b at every point in some neighborhood of \mathfrak{p}, which defines the ordering on A_{\mathfrak{p}}.

Definition: For open U\subseteq\text{OrdSpec}\left(A\right), \mathcal{O}_{A}\left(U\right) consists of elements of \prod_{\mathfrak{p}\in U}A_{\mathfrak{p}} that are locally ratios of elements of A. \mathcal{O}_{A}\left(U\right) is ordered by a\geq b if and only if \forall\mathfrak{p}\in U, a\geq b near \mathfrak{p} (equivalently, if \forall\mathfrak{p}\in U, a\geq b at \mathfrak{p}).

A\subseteq\mathcal{O}_{A}\left(\text{OrdSpec}\left(A\right)\right), and this inclusion can be proper. Conjecture: \text{OrdSpec}\left(\mathcal{O}_{A}\left(U\right)\right)\cong U as locally ordered ringed spaces for open U\subseteq\text{OrdSpec}\left(A\right). This conjecture says that it makes sense to talk about whether or not a locally ordered ringed space looks locally like an order spectrum near a given point. Thus, if this conjecture is false, it would make the following definition look highly suspect.

Definition: An order scheme is a topological space X equipped with a sheaf of ordered commutative rings \mathcal{O}_{X} such that for some open cover of X, the restrictions of \mathcal{O}_{X} to the open sets in the cover are all isomorphic to order spectra of ordered commutative rings.

I don’t have any uses in mind for order schemes, but then again, I don’t know what ordinary schemes are for either and they are apparently useful, and order schemes seem like a natural analog of them.

Nonabelian modules

This is a rough overview of my thoughts on a thing I’ve been thinking about, and as such is incomplete and may contain errors. Proofs have been omitted when writing them out would be at all tedious.

Edit: It has been pointed out to me that near-ring modules have already been defined, and the objects I describe in this post are just near-ring modules where the near-ring happens to be a ring.

Introduction

As you all know (those of you who have the background for this post, anyway), an R-module is an abelian group M (written additively) together with a multiplication map R\times M\rightarrow M such that for all \alpha,\beta\in R and x,y\in M, \alpha\cdot\left(x+y\right)=\alpha\cdot x+\alpha\cdot y, \left(\alpha+\beta\right)\cdot x=\alpha\cdot x+\beta\cdot x, \left(\alpha\beta\right)\cdot x=\alpha\cdot\left(\beta\cdot x\right), and 1\cdot x=x.

What if we don’t want to restrict attention to abelian groups? One could attempt to define a nonabelian module using the same axioms, but without the restriction that the group be abelian. As it is customary to write groups multiplicatively if they are not assumed to be abelian, we will do that, and the map R\times M\rightarrow M will be written as exponentiation (since exponents are written on the right, I’ll follow the definition of right-modules, rather than left-modules). The axioms become: for all \alpha,\beta\in R and x,y\in M, \left(xy\right)^{\alpha}=x^{\alpha}y^{\alpha}, x^{\alpha+\beta}=x^{\alpha}x^{\beta}, x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, and x^{1}=x.

What has changed? Absolutely nothing, as it turns out. The first axiom says again that M is abelian, because yx=x^{-1}\left(xy\right)^{2}y^{-1}=x^{-1}\left(x^{2}y^{2}\right)y^{-1}=xy. We’ll have to get rid of that axiom. Our new definition, which it seems to me captures the essence of a module except for abelianness:

A nonabelian R-module is a group M (written multiplicatively) together with a scalar exponentiation map R\times M\rightarrow M such that for all \alpha,\beta\in R and x\in M, x^{1}=x, x^{\alpha+\beta}=x^{\alpha}x^{\beta}, and x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}.

These imply that x^{0}=1, 1^{\alpha}=1, and x^{-1} is the inverse of x, because x\cdot x^{0}=x^{1}x^{0}=x^{1+0}=x^{1}=x, 1^{\alpha}=\left(1^{0}\right)^{\alpha}=1^{\left(0\alpha\right)}=1^{0}=1, and x\cdot x^{-1}=x^{1-1}=1.

Just like a \mathbb{Z}-module is just an abelian group, a nonabelian \mathbb{Z}-module is just a group. Just like a \mathbb{Z}/n\mathbb{Z}-module is an abelian group whose exponent divides n, a nonabelian \mathbb{Z}/n\mathbb{Z}-module is a group whose exponent divides n.

Exponentiation-like families of operations

Perhaps a bit more revealing is what nonabelian modules over free rings look like, since then the generators are completely generic ring elements. Where A is the generating set, a \mathbb{Z}\left\langle A\right\rangle-module is an abelian group together with endomorphisms \left\{ x\mapsto\alpha x\mid\alpha\in A\right\}, which tells us that modules are about endomorphisms of an abelian group indexed by the elements of a ring. Nonabelian modules are certainly not about endomorphisms. After all, in a nonabelian group, the map x\mapsto x^{2} is not an endomorphism. I will call the things that nonabelian modules are about “exponentiation-like families of operations”, and give four equivalent definitions, in roughly increasing order of concreteness and decreasing order of elegance. Definition 2 uses basic model theory, so skip it if that scares you. Definition 3 is the “for dummies” version of definition 2.

Definition 0: Let G be a group, and let A be a family of functions from G to G (not necessarily endomorphisms). If G can be made into a nonabelian \mathbb{Z}\left\langle A\right\rangle-module such that x^{\alpha}=\alpha\left(x\right) for x\in G and \alpha\in A, then A is called an exponentiation-like family of operations on G. If so, the nonabelian \mathbb{Z}\left\langle A\right\rangle-module structure on G with that property is unique, so define x^{p} to be its value according to that structure, for p\in\mathbb{Z}\left\langle A\right\rangle and x\in G.

Definition 1: A is an exponentiation-like family of operations on G if for all x\in G, the smallest subgroup containing x which is closed under actions by elements of A (which I will call \overline{\left\{ x\right\} }) is abelian, and the elements of A restrict to endomorphisms of it. Using the universal property of \mathbb{Z}\left\langle A\right\rangle, this induces a homomorphism \mathbb{Z}\left\langle A\right\rangle \rightarrow\text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}. Let x^{p} denote the action of p on x under that map, for p\in\mathbb{Z}\left\langle A\right\rangle. By \text{End}\left(\overline{\left\{ x\right\} }\right)^{\text{op}}, I mean the endomorphism ring of \overline{\left\{ x\right\} } with composition running in the opposite direction (i.e., the multiplication operation given by \left(f,g\right)\mapsto g\circ f). This is because of the convention that nonabelian modules are written as nonabelian right-modules by default.

Definition 2: Consider the language \mathcal{L}_{Rings}\sqcup A, where \mathcal{L}_{Rings}:=\left\{ 0,1,+,-,\cdot\right\} is the language of rings, and each element of A is used as a constant symbol. Closed terms in \mathcal{L}_{Rings}\sqcup A act as functions from G to G, with the action of t written as x\mapsto x^{t}, defined inductively as: x^{0}:=1, x^{1}:=x, x^{\alpha}:=\alpha\left(x\right) for \alpha\in A, x^{t+s}:=x^{t}x^{s}, x^{-t}:=\left(x^{t}\right)^{-1}, and x^{ts}:=\left(x^{t}\right)^{s} for closed \mathcal{L}_{Rings}\sqcup A-terms t and s. A is called an exponentiation-like family of operations on G if x^{t}=x^{s} whenever T_{Rings}\models t=s, where T_{Rings} is the theory of rings. If A is an exponentiation-like family of operations on G and p\in\mathbb{Z}\left\langle A\right\rangle is a noncommutative polynomial with variables in A, then for x\in G, x^{p} is defined to be x^{t}, where t is any term representing p.

Definition 3: Pick a total order on the free monoid on A (e.g. by ordering A and then using the lexicographic order). The order you use won’t matter. Given x\in G and w:=\alpha_{1}...\alpha_{n} in the free monoid on A, let x^{w}=\alpha_{n}\left(...\alpha_{1}\left(x\right)\right). Where p\in\mathbb{Z}\left\langle A\right\rangle is a noncommutative polynomial, p=c_{1}w_{1}+...+c_{n}w_{n} for some c_{1},...,c_{n}\in\mathbb{Z} and decreasing sequence w_{1},...,w_{n} of noncommutative monomials (elements of the free monoid on A). Let x^{p}=\left(x^{c_{1}}\right)^{w_{1}}...\left(x^{c_{n}}\right)^{w_{n}}. A is called an exponentiation-like family of operations on G if for every x\in G and p,q\in\mathbb{Z}\left\langle A\right\rangle, x^{pq}=\left(x^{p}\right)^{q} and x^{p+q}=x^{p}x^{q}.
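Since Definition 3 is essentially an algorithm, here is a rough Python sketch of it: a noncommutative polynomial is represented as a list of (integer coefficient, word) terms, a word acts by applying its operations left to right, and x^{p} is the product of the resulting factors. The toy group and operation family below (the integers under addition, with the doubling map) are my own made-up example of an exponentiation-like family; for a family that is not exponentiation-like, the output would depend on the order in which the terms are listed.

```python
def group_int_power(x, c, mul, identity, inv):
    # x^c for an integer c, by repeated multiplication (clarity over efficiency).
    result, base = identity, (x if c >= 0 else inv(x))
    for _ in range(abs(c)):
        result = mul(result, base)
    return result

def word_action(x, word, ops):
    # x^w for a word w = (a1, ..., ak): apply a1 first, then a2, and so on.
    for name in word:
        x = ops[name](x)
    return x

def poly_exp(x, poly, ops, mul, identity, inv):
    # x^p for p given as a list of (coefficient, word) terms, as in Definition 3.
    result = identity
    for coeff, word in poly:
        factor = word_action(group_int_power(x, coeff, mul, identity, inv), word, ops)
        result = mul(result, factor)
    return result

# Toy example: G = (Z, +), with A = {'d': doubling}, which is exponentiation-like
# because doubling is an endomorphism of an abelian group.
mul, identity, inv = (lambda a, b: a + b), 0, (lambda a: -a)
ops = {'d': lambda a: 2 * a}

p = [(3, ()), (-1, ('d',)), (2, ('d', 'd'))]     # p = 3 - d + 2d^2 in Z<{d}>
print(poly_exp(5, p, ops, mul, identity, inv))   # 5*(3 - 2 + 8) = 45
```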

These four definitions of exponentiation-like family are equivalent, and for exponentiation-like families, their definitions of exponentiation by a noncommutative polynomial are equivalent.

Facts: \emptyset is an exponentiation-like family of operations on G. If A is an exponentiation-like family of operations on G and B\subseteq A, then so is B. If G is abelian, then \text{End}\left(G\right) is exponentiation-like. Given a nonabelian R-module structure on G, the actions of the elements of R on G form an exponentiation-like family. In particular, if A is an exponentiation-like family of operations on G, then so is \mathbb{Z}\left\langle A\right\rangle, with the actions being defined as above.

[The following paragraph has been edited since this comment.]

For an abelian group A, the endomorphisms of A form a ring \text{End}\left(A\right), and an R-module structure on A is simply a homomorphism R\rightarrow\text{End}\left(A\right). Can we say a similar thing about exponentiation-like families of operations on G? Let \text{Exp}\left(G\right) be the set of all functions G\rightarrow G (as sets). Given \alpha,\beta\in\text{Exp}\left(G\right), let multiplication be given by composition: x^{\left(\alpha\beta\right)}=\left(x^{\alpha}\right)^{\beta}, addition be given by x^{\alpha+\beta}=x^{\alpha}x^{\beta}, negation be given by x^{-\alpha}=\left(x^{\alpha}\right)^{-1}, and 0 and 1 be given by x^{0}=1 and x^{1}=x. This makes \text{Exp}\left(G\right) into a near-ring. A nonabelian R-module structure on G is a homomorphism R\rightarrow\text{Exp}\left(G\right), and a set of operations on G is an exponentiation-like family of operations on G if and only if it is contained in a ring which is contained in \text{Exp}\left(G\right).

Some aimless rambling

What are some interesting examples of nonabelian modules that are not abelian? (That might sound redundant, but “nonabelian module” means that the requirement of abelianness has been removed, not that a requirement of nonabelianness has been imposed. Perhaps I should come up with better terminology. To make matters worse, since the requirement that got removed is actually stronger than abelianness, there are nonabelian modules that are abelian and not modules. For instance, consider the nonabelian \mathbb{Z}\left[\alpha\right]-module whose underlying set is the Klein four group (generated by two elements a,b) such that a^{\alpha}=a, b^{\alpha}=b, and \left(ab\right)^{\alpha}=1.)

In particular, what do free nonabelian modules look like? The free nonabelian \mathbb{Z}-modules are, of course, free groups. The free nonabelian \mathbb{Z}/n\mathbb{Z}-modules have been studied in combinatorial group theory; they’re called Burnside groups. (Fun but tangential fact: not all Burnside groups are finite (the Burnside problem), but despite this, the category of finite nonabelian \mathbb{Z}/n\mathbb{Z}-modules has free objects on any finite generating set, called Restricted Burnside groups.)

The free nonabelian \mathbb{Z}\left[\alpha\right]-modules are monstrosities. They can be constructed in the usual way of constructing free objects in a variety of algebraic structures, but that construction seems not to be very enlightening about their structure. So I’ll give a somewhat more direct construction of the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators, which may also not be that enlightening, and which is only suspected to be correct. Define an increasing sequence of groups G_{n}, and functions \alpha_{n}:G_{n}\rightarrow G_{n+1}, as follows: G_{0} is the free group on d generators. Given G_{n}, and given a subgroup X\leq G_{n}, let the top-degree portion of X be \alpha_{n-1}^{k}\left(X\right) for the largest k such that this is nontrivial. Let H_{n} be the free product of the top-degree portions of maximal abelian subgroups of G_{n}. Let G_{n+1} be the free product of G_{n} with H_{n} modulo commutativity of the maximal abelian subgroups of G_{n} with the images of their top-degree portions in H_{n}. Given a maximal abelian subgroup X\leq G_{n}, let \alpha_{n}\restriction_{X} be the homomorphism extending \alpha_{n-1}\restriction_{X\cap G_{n-1}} which sends the top-degree portion identically onto its image in H_{n}. Since every non-identity element of G_{n} is in a unique maximal abelian subgroup, this defines \alpha_{n}. G:=\bigcup_{n}G_{n} with \alpha:=\bigcup_{n}\alpha_{n} is the free nonabelian \mathbb{Z}\left[\alpha\right]-module on d generators. If A is a set, the free nonabelian \mathbb{Z}\left\langle A\right\rangle-modules can be constructed similarly, with \left|A\right| copies of H_{n} at each step. Are these constructions even correct? Are there nicer ones?

A nonabelian \mathbb{Z}\left[\frac{1}{2}\right]-module would be a group with a formal square root operation. As an example, any group of odd exponent n can be made into a nonabelian \mathbb{Z}\left[\frac{1}{2}\right]-module in a canonical way by letting x^{\frac{1}{2}}=x^{\frac{n+1}{2}}. More generally, any group of finite exponent n can be made into a nonabelian \mathbb{Z}\left[\left\{ p^{-1}|p\nmid n\right\} \right]-module in a similar fashion. Are there any more nice examples of nonabelian modules over localizations of \mathbb{Z}?
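Here is a quick brute-force sanity check of that square-root formula on a nonabelian group of odd exponent, namely the group of upper unitriangular 3\times3 matrices over \mathbb{Z}/3 (which has exponent 3): x^{\frac{1}{2}}:=x^{\frac{3+1}{2}}=x^{2} should square back to x for every element. This is just an illustration of the formula, not anything deeper.

```python
import itertools
import numpy as np

P = 3  # the group of upper unitriangular 3x3 matrices over Z/3 has exponent 3

def mul(a, b):
    return (a @ b) % P

def unitriangular(a, b, c):
    return np.array([[1, a, c], [0, 1, b], [0, 0, 1]])

group = [unitriangular(a, b, c) for a, b, c in itertools.product(range(P), repeat=3)]

# The formal square root x^(1/2) := x^((P+1)/2).
half = (P + 1) // 2
def formal_sqrt(x):
    y = np.eye(3, dtype=int)
    for _ in range(half):
        y = mul(y, x)
    return y

# Check that it really is a square root of x for every group element.
assert all(np.array_equal(mul(formal_sqrt(x), formal_sqrt(x)), x) for x in group)
print("x^((P+1)/2) squares back to x for all", len(group), "elements")
```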

In particular, a nonabelian \mathbb{Q}-module would be a group with formal nth root operations for all n. What are some nonabelian examples of these? Note that nonabelian \mathbb{Q}-modules cannot have any torsion, for suppose x^{n}=1 for some n\neq0. Then x=\left(x^{n}\right)^{\frac{1}{n}}=1^{\frac{1}{n}}=1. More generally, nonabelian modules cannot have any n-torsion (meaning x^{n}=1\implies x=1) for any n which is invertible in the scalar ring.

The free nonabelian \mathbb{Z}\left[\frac{1}{m}\right]-modules can be constructed similarly to the construction of free nonabelian \mathbb{Z}\left[\alpha\right]-modules above, except that when constructing G_{n+1} from G_{n} and H_{n}, we also mod out by elements of G_{n} being equal to the mth powers of their images in H_{n}. Using the fact that \mathbb{Q} is the quotient of \mathbb{Z}\left\langle \left\{ p^{-1}|\text{primes }p\right\} \right\rangle by the relations p\cdot p^{-1}=p^{-1}\cdot p=1, this lets us modify the construction of free nonabelian \mathbb{Z}\left\langle A\right\rangle-modules to give us a construction of free nonabelian \mathbb{Q}-modules. Again, is there a nicer way to do it?

Topological nonabelian modules

It is also interesting to consider topological nonabelian modules over topological rings; that is, nonabelian modules endowed with a topology such that the group operation and scalar exponentiation are continuous. A module over a topological ring has a canonical finest topology on it, and the same remains true for nonabelian modules. For finite-dimensional real vector spaces, this is the only topology. Does the same remain true for finitely-generated nonabelian \mathbb{R}-modules? Finite-dimensional real vector spaces are complete, and topological nonabelian modules are, in particular, topological groups, and can thus be made into uniform spaces, so the notion of completeness still makes sense, but I think some finitely-generated nonabelian \mathbb{R}-modules are not complete.

A topological nonabelian \mathbb{R}-module is a sort of Lie group-like object. One might try constructing a Lie algebra for a complete nonabelian \mathbb{R}-module M by letting the underlying set be M, and defining x+y=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}\right)^{\left(\varepsilon^{-1}\right)} and \left[x,y\right]=\lim_{\varepsilon\rightarrow0}\left(x^{\varepsilon}y^{\varepsilon}x^{-\varepsilon}y^{-\varepsilon}\right)^{\left(\varepsilon^{-2}\right)}. One might try putting a differential structure on M such that this is the Lie algebra of left-invariant derivations. Does this or something like it work?

A Lie group is a nonabelian \mathbb{R}-module if and only if its exponential map is a bijection between it and its Lie algebra. In this case, scalar exponentiation is closely related to the exponential map by a compelling formula: x^{\alpha}=\exp\left(\alpha\exp^{-1}\left(x\right)\right). As an example, the continuous Heisenberg group is a nonabelian \mathbb{R}-module which is not abelian. This observation actually suggests a nice class of examples of nonabelian modules without a topology: given a commutative ring R, the Heisenberg group over R is a nonabelian R-module.

The Heisenberg group of dimension 2n+1 over a commutative ring R has underlying set R^{n}\times R^{n}\times R^{1}, with the group operation given by \left(\boldsymbol{a}_{1},\boldsymbol{b}_{1},c_{1}\right)*\left(\boldsymbol{a}_{2},\boldsymbol{b}_{2},c_{2}\right):=\left(\boldsymbol{a}_{1}+\boldsymbol{a}_{2},\boldsymbol{b}_{1}+\boldsymbol{b}_{2},c_{1}+c_{2}+\boldsymbol{a}_{1}\cdot\boldsymbol{b}_{2}-\boldsymbol{a}_{2}\cdot\boldsymbol{b}_{1}\right). The continuous Heisenberg group means the Heisenberg group over \mathbb{R}. Scalar exponentiation on a Heisenberg group is just given by scalar multiplication: \left(\boldsymbol{a},\boldsymbol{b},c\right)^{\alpha}:=\left(\alpha\boldsymbol{a},\alpha\boldsymbol{b},\alpha c\right).
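As a sanity check (numerical, so only suggestive), here is a short Python sketch verifying on random samples that these formulas satisfy the nonabelian module axioms x^{\alpha+\beta}=x^{\alpha}x^{\beta}, x^{\alpha\beta}=\left(x^{\alpha}\right)^{\beta}, and x^{1}=x, while the group operation itself is not commutative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2  # the Heisenberg group of dimension 2n + 1 = 5 over R

def mul(g, h):
    # The group operation defined above.
    (a1, b1, c1), (a2, b2, c2) = g, h
    return (a1 + a2, b1 + b2, c1 + c2 + a1 @ b2 - a2 @ b1)

def power(g, alpha):
    # Scalar exponentiation: coordinatewise scalar multiplication.
    a, b, c = g
    return (alpha * a, alpha * b, alpha * c)

def rand_elem():
    return (rng.normal(size=n), rng.normal(size=n), rng.normal())

def close(g, h):
    return all(np.allclose(u, v) for u, v in zip(g, h))

for _ in range(100):
    x = rand_elem()
    alpha, beta = rng.normal(), rng.normal()
    assert close(power(x, alpha + beta), mul(power(x, alpha), power(x, beta)))
    assert close(power(x, alpha * beta), power(power(x, alpha), beta))
    assert close(power(x, 1.0), x)

x, y = rand_elem(), rand_elem()
print("abelian?", close(mul(x, y), mul(y, x)))  # almost surely False
print("module axioms held on all random samples")
```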

Advantages of Stochocracy

In the recent UK general election, the proportion of the popular vote won by each party looked like this:

[Figure: share of the popular vote by party, 2015 UK general election]

The proportion of the seats in the House of Commons won by each party looked like this:

[Figure: share of House of Commons seats by party, 2015 UK general election]

You may notice that these two graphs look pretty different. The Conservative Party won a majority of the seats with only 37% of the vote. The Scottish National Party, due to its geographic concentration in Scotland, got a share of the seats in Parliament nearly double its share of the popular vote. In contrast, the Liberal Democrats got only 8 out of the 650 seats despite winning 8% of the popular vote, the Green Party got only 1 seat with 4% of the vote, and, most egregiously, the UK Independence Party got only 1 seat with 12% of the vote. While I have no objections to UKIP getting cheated out of political power, this does not seem like a fair and democratic outcome, and yet this sort of thing is an inevitable consequence of first-past-the-post elections for single-member legislative districts.

This sort of thing happens even with two party systems like in the US, where not only do third parties get completely shut out, the balance of power between the two major parties is also skewed: in 2012, the Republicans won a majority of seats in the House of Representatives, despite the Democrats winning more total votes. This has been widely attributed to Gerrymandering, but the fact that Democratic votes are more geographically concentrated than Republican votes contributes even more.

Proportional representation is an easy solution to this problem. One criticism of proportional representation is that it gives up having representatives associated with a particular district, which is what ties them closely to their voters. There are variants of proportional representation that address this, but here I want to propose another one. But first, let’s talk about random sampling.

Randomized Vote-Counting

If instead of counting every vote in an election, we randomly sampled some small fraction of the ballots and counted those, we would get the same result almost every time, with discrepancies being statistically possible only when the vote is very close (which seems fine to me; getting 50.1% of the vote does not seem to me to confer much more legitimacy than getting 49.9% of the vote does). While this might make elections slightly cheaper to administer, it would be a massive under-use of the power of randomization.
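Here is a quick Monte Carlo sketch of that claim, with made-up numbers: counting a random sample of 2,000 ballots out of 100,000 names the same winner as the full count essentially every time, unless the race is within roughly a percentage point (and a larger sample shrinks that window further).

```python
import random

def sample_agrees_with_full_count(true_share, n_ballots=100_000, sample_size=2_000):
    # One election: each ballot goes to A with probability true_share, otherwise to B.
    ballots = random.choices(['A', 'B'], weights=[true_share, 1 - true_share], k=n_ballots)
    full_winner = 'A' if ballots.count('A') * 2 > n_ballots else 'B'
    sample = random.sample(ballots, sample_size)
    sample_winner = 'A' if sample.count('A') * 2 > sample_size else 'B'
    return full_winner == sample_winner

for share in (0.55, 0.52, 0.505, 0.501):
    agreement = sum(sample_agrees_with_full_count(share) for _ in range(40)) / 40
    print(f"A's true support {share:.1%}: sampled count agrees {agreement:.0%} of the time")
```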

In California (and also, I am under the impression, in most other U.S. states, and many other countries), elections tend to include a massive profusion of state and local officials and ballot measures. Each of these individually requires a significant amount of research, so it is prohibitively time-consuming to adequately research every ballot question. When I vote, I usually feel like I did not have time to research most of the ballot questions enough to make an informed vote, and yet I suspect I still spend more time researching them than average.

My proposed solution to this is for each voter to be randomly assigned one ballot question that they get to vote on. When you only have one issue you can vote on, it is much easier to thoroughly research it. Thus this system should result in more informed voters.

People would likely object that this system is undemocratic, since not everyone gets to vote on each ballot question. But in fact, this system would probably end up being more democratic than the current one, since it would make voting easier and thus probably increase turnout, making the sample of people voting on each issue more representative of the electorate as a whole, even while comprising a smaller fraction of it. Some people might not vote because of being assigned to an issue that they are not interested in, but most such people probably wouldn’t have voted on that issue anyway; I’d bet there would be significantly more people who would vote on an issue if it was the only one they could vote on, but wouldn’t vote on it if they could vote on all of them. Furthermore, since people would have more time to research their ballot question, they would be less reliant on information shoved in their faces in the form of advertising, so this would decrease the influence of special interest groups in politics, arguably also making the system more democratic.

What about just one?

So far I’ve been suggesting that enough votes should be sampled that the outcome of each election is virtually guaranteed to be the same as it would be if all the votes were counted, with all the exceptions being when the vote is very close. But what happens when we don’t sample enough votes for that? What if we take it all the way to the extreme and only sample one vote? This would not be appropriate for ballot measures or executive officials, but for electing the members of a legislature with a large number of single-member districts, this actually has some pretty nice properties.

Since each ballot is equally likely to be the one that gets counted, the probability of each candidate getting elected is proportional to the number of votes they get. Averaged over a large number of districts, this means that the number of legislators elected from each political party will be approximately proportional to the popular support for that party. Thus, this simulates proportional representation with single-member electoral districts.
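Here is a Python sketch of that claim, using rounded, roughly-2015-like national vote shares and the simplifying assumption that every district's votes are drawn from the same national distribution (the expected proportionality holds regardless of how support is distributed across districts, but this keeps the simulation short): with one randomly drawn ballot deciding each of 650 seats, the seat shares come out close to the vote shares.

```python
import random
from collections import Counter

random.seed(1)

# Rounded, illustrative national vote shares (not the exact 2015 figures).
national_shares = {'Con': 0.37, 'Lab': 0.30, 'UKIP': 0.13, 'LD': 0.08,
                   'SNP': 0.05, 'Green': 0.04, 'Other': 0.03}
parties, weights = zip(*national_shares.items())

def random_ballot_election(n_districts=650, voters_per_district=1000):
    seats = Counter()
    for _ in range(n_districts):
        # The district casts its ballots; one ballot, chosen uniformly at random, picks the winner.
        ballots = random.choices(parties, weights=weights, k=voters_per_district)
        seats[random.choice(ballots)] += 1
    return seats

seats = random_ballot_election()
for party in parties:
    print(f"{party}: {national_shares[party]:.0%} of votes, {seats[party] / 650:.0%} of seats")
```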

This is very similar to a sortition, in which legislatures are a random sample of the population. The primary difference is that in a sortition, many of the people randomly selected to be legislators may have little interest in or ability for the job. However, in this system, someone would still have to demonstrate interest by running for election in order to be selected. To further discourage frivolous campaigns, costs could be imposed on the candidates, for instance by requiring them to gather signatures to qualify for the ballot, to ensure that no one gets elected who isn’t serious about their intent to serve in the legislature.

A small further advantage of this system over a sortition is that it ensures that the legislators are evenly distributed geographically, so the variance of the number of seats won by each political coalition would be slightly smaller than it would be under completely random sampling.

Another advantage of my randomized system over first-past-the-post and proportional representation is that it avoids electoral paradoxes that plague deterministic systems. It avoids the Alabama and population paradoxes, which proportional representation is vulnerable to. There is also no incentive for tactical voting: if your ballot is the one selected, your preferred candidate wins no matter how anyone else voted, so you can do no better than voting for the candidate you actually want. And there is no incentive for Gerrymandering, since the expected number of seats won by a party will be proportional to its vote count no matter how the districts are drawn, provided they all have equal numbers of voters.

A possible objection to this system is that candidates can get elected with support from only a small fraction of their constituents. But this does not seem that bad to me. Even under first-past-the-post, it is the norm for a large fraction of the constituents to vote against the winning candidate. Even in safe seats, the fraction of voters who vote against the winner is typically fairly significant (e.g., a third), and these voters never get the chance to be represented by their preferred candidate. Under the randomized system, any significant local coalition would get a chance of representation sometimes, and dominant coalitions would be represented most of the time. And if a party is dominant in a district, then even if the representative for that district ends up not being aligned with that party, there will likely be nearby districts that are represented by that party. For example, the San Francisco Bay Area is so dominated by Democrats that all of its members of the House of Representatives and the state legislature are Democrats, leaving the Bay Area Republicans unrepresented. Using the randomized system, a few Republicans would get elected in the Bay Area, so the constituency of Bay Area Republicans would get its representation, and the Democrats in the districts that end up getting represented by Republicans would still have plenty of Democratic legislators in neighboring districts to represent their interests.

One significant disadvantage is that it would be difficult for legislators to accumulate much experience in the legislature, since they would have a significant chance of losing each re-election even if they have broad support in their district. Primarily for this reason, I think this randomized system is inferior to single transferable vote and party list proportional representation. But despite this, I still think it is not too terrible, and would be a significant improvement over the current system. Sometimes you can make the system more democratic by counting fewer votes.