Unsupervised Learning is NOT optimization
The overall point of this essay is to explain how a non-optimization-based unsupervised learning method can assign codes to a set of inputs in a way that embeds the (in general, multidimensional) similarity structure over that set of inputs, using a single-trial, and thus necessarily on-line (sometimes called "sequential"), learning process. The key is that the codes are sparse distributed codes (SDCs), i.e., subsets of units chosen from a much larger (universal) set. This allows the similarity structure over the inputs to be represented by the intersection structure of the SDCs. It requires that the learning process, specifically the process that chooses the units to be included in an SDC, map more similar inputs to more highly intersecting SDCs.
The capabilities of Machine Learning (ML), e.g., LLMs, are increasing very rapidly. But all leading ML models implement unsupervised learning as optimization, usually some form of gradient descent with gradients computed via backpropagation. The offline training of the underlying "foundation" models requires computing the gradients of objective functions (e.g., reconstruction loss) with respect to billions of parameters, numerous times. This is what consumes the massive energy needed to train these models.
But, what is the goal of this training process? It is the creation of a model of the statistical structure of the input space, i.e., a "world model". The complete statistical structure of the input space, i.e., the total information contained in the input space, includes representations of the frequencies of occurrence of:
- every atomic feature in the input space, e.g., how often pixel (6,18) has intensity 0.73
- every pair of features (i.e., pairwise, or 1st order statistics), e.g., how often pixel (6,18) has intensity 0.73 AND pixel (14,25) has intensity 0.1
- every triple of features (2nd order statistics), etc.
- up to the highest-order statistics present, e.g., for a 32x32 pixel array, how often every particular setting of the 1024 pixel intensities occurs.
Possession of all these statistics constitutes the total information contained in the input space and therefore, in principle, contains the answer to any conceivable question that could be asked about the input space. This is a combinatorial number of statistics. However, in practice, the physical constraints underlying any natural input space, specifically, the fact that natural entities have strong recursive part-whole structure, mean that almost all such higher-order statistics will be near zero and thus do not need to be explicitly stored. We'll refer to the vanishingly small fraction of all statistics that are sufficiently above zero, and thus potentially relevant for making inferences about ongoing inputs, i.e., recognizing, classifying, predicting, as "important statistics".
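To make the notion of input-space statistics concrete, here is a minimal sketch (illustrative only, not part of Sparsey; the toy 6-feature inputs are invented for the example) that tallies how often single features, pairs, and triples of features are jointly active over a tiny set of binary inputs. Even at this scale, the fraction of possible statistics that are actually nonzero shrinks as the order grows.

```python
from itertools import combinations
from collections import Counter
from math import comb

# Tiny toy input space: each input is a binary vector over 6 "pixels".
inputs = [
    (1, 0, 1, 0, 0, 1),
    (1, 0, 1, 0, 1, 0),
    (0, 1, 0, 1, 0, 1),
    (1, 0, 1, 0, 0, 1),
]

def cooccurrence_counts(inputs, order):
    """Count how often every set of `order` features is jointly active."""
    counts = Counter()
    for x in inputs:
        active = [i for i, v in enumerate(x) if v == 1]
        for combo in combinations(active, order):
            counts[combo] += 1
    return counts

for order in (1, 2, 3):
    counts = cooccurrence_counts(inputs, order)
    possible = comb(len(inputs[0]), order)
    # Most higher-order statistics are zero and need not be stored explicitly.
    print(f"order {order}: {len(counts)} nonzero of {possible} possible statistics")
```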
So, the goal of unsupervised learning, which mainstream ML implements as gradient descent, is the creation of a data structure that contains (embeds) all of these important statistics.
What if there were another, better, far more efficient (i.e., using far less power) way to produce such a data structure? There is! The model is called Sparsey, and its principles can be described succinctly as follows.
- Individual inputs are represented as sparse distributed codes (SDCs) chosen from a coding field (CF) of binary units.
- The algorithm that chooses SDCs preserves similarity, i.e., more similar inputs are mapped to more highly intersecting SDCs, as illustrated in the sketch just below this list.
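To make the second principle concrete, here is a minimal, hand-constructed illustration. The codes below are hypothetical and chosen by hand purely to show the idea; this is not Sparsey's code-selection algorithm. Each code picks one winner per CM, and the number of CMs in which two codes agree serves as a graded similarity measure.

```python
# Illustrative only: hypothetical SDCs over a tiny coding field (CF).
# Q = 8 winner-take-all modules (CMs), K = 4 units each; a code is one
# winning unit per CM, written as a tuple of Q winner indices.
Q, K = 8, 4

code_A = (0, 1, 2, 3, 0, 1, 2, 3)   # hypothetical code for input A
code_B = (0, 1, 2, 3, 0, 1, 0, 0)   # shares 6 of 8 winners with A
code_C = (3, 2, 1, 0, 3, 2, 2, 3)   # shares 2 of 8 winners with A

def intersection(c1, c2):
    """Number of CMs in which two codes chose the same winner."""
    return sum(1 for a, b in zip(c1, c2) if a == b)

print(intersection(code_A, code_B))  # 6 -> A and B read as highly similar
print(intersection(code_A, code_C))  # 2 -> A and C read as dissimilar
```

Under this scheme, similarity is read out purely from code intersections, with no distance computation over the raw inputs.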
At a high level, those are the only two principles needed in order to explain how such a data structure can be built using only single-trial learning. Why? Well, first consider this. The model is initially a tabula rasa, i.e., all its weights are zero. This means it knows nothing about the input space. In ML parlance, it has no idea what the latent variables underlying (generating) the input space are, nor how many there are, nor their valuednesses. The whole point of unsupervised learning is to discover the latent variables. So, how can the model assign codes--subsets of which will represent the different values of the different latent variables--based on single trials? That is, assume that after some amount of experience, i.e., after experiencing (and storing codes for) some number, N, of inputs, there is a particular pattern of intersections (pairwise and up to all orders present, i.e., N) that exists over all the stored codes. Assume that that intersection pattern has correctly captured the underlying latent variables. For example, there exists a subset of units that is included in the code of every input that has feature X (e.g., that is a chair). And there is a different (though possibly overlapping) subset of units that is included in the code of every input that has feature Y (e.g., that is blue), etc. If the model has no prior knowledge of what features exist in the input space, nor the number of features, nor their valuednesses, then how can it ultimately find (assign) these various subsets of units that represent the features? More starkly, how can it assign the very first code? Again, we're assuming single-trial learning. So the code assigned to the first input is set, once and for all, in that single learning instance. How can that code, which, in fact, must be assigned completely randomly (since all weights are zero), contain the correct subsets that represent all of the (yet-to-be-experienced) features?
There is a simple answer. If the codes, the SDCs, are sparse enough, then as each input is presented, and thus as each SDC is assigned (stored), the degrees of freedom that remain in the codespace are sufficient to represent future correlations in (i.e., not-yet-experienced dimensions of) the input space. The first code assigned can be random. The act of assigning that first code places permanent constraints on the set of physical units that ultimately, i.e., after numerous further inputs occur, represent the particular values of the particular latent variables present in that first input (and also, that ultimately represent all values of all latent variables of the input space). It's like the act of carving a statue out of a block of wood. With each bit of wood removed, the space of possible final statues decreases (in fact, exponentially): all possible final statues are confined to the block that remains after each bit is removed. The analogy isn't perfect because the wood block, and more to the point, the individual bits of wood removed, are dense in the latent space (which in the case of performing physical operations on physical objects is 3D), whereas the SDCs (analogous to the complements of the removed bits) are sparse in the codespace. The parenthetical underscores a different imperfection of the analogy: the atomic action in the wood-carving case is a removal (subtraction) operation, whereas the atomic operation in the storing-memories-as-SDCs case is a storage (addition) operation. Nevertheless, the analogy is very important in that both processes can be viewed as sequences that progressively reduce the space of possible final entities.
Let's go through this again, more methodically, and with concrete numbers. Let the CF be composed of Q=100 winner-take-all (WTA) competitive modules (CMs), each having K=20 binary units. Thus, all SDCs are of size Q=100. There is an input field of binary units that is completely connected to this CF. We can assume all inputs have the same number of active features, S. The unsupervised learning process will just be the sequential presentation of single trials of each input. For each input, an SDC will be assigned and the weights from the S active input features to the Q active SDC units will be set to 1. In general, as each successive input is presented, many of the relevant weights will have already been set to 1 for a prior input. This is a natural input space, which therefore has an underlying (latent) dimensionality. That is, there will be some underlying number of dimensions that explain most (or all, according to some criterion, e.g., class) of the variation. But again, the model has zero prior knowledge as to how many of these latent dimensions there are, or what these dimensions are, or the valuednesses of these dimensions, i.e., all weights are initially zero.
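Here is a minimal sketch of the storage (learning) step just described, under the stated parameters Q=100 and K=20. The input-field size (N_INPUT=256) and S=20 are values assumed purely for illustration, and the random code chooser is only a placeholder: Sparsey's actual algorithm chooses winners so as to preserve similarity, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

Q, K = 100, 20      # Q WTA competitive modules (CMs), each with K binary units
N_INPUT = 256       # size of the binary input field (value assumed for illustration)
S = 20              # number of active features per input (value assumed for illustration)

# Weight matrix from every input unit to every CF unit, initially all zero (tabula rasa).
# CF units are indexed as (cm, unit-within-cm).
W = np.zeros((N_INPUT, Q, K), dtype=np.uint8)

def choose_code(x):
    """Placeholder code chooser: one winner per CM, picked at random.
    Sparsey's actual algorithm picks winners so that similar inputs receive
    highly intersecting codes; that logic is not reproduced here."""
    return rng.integers(0, K, size=Q)

def store(x):
    """Single-trial storage: set to 1 the weights from the S active input
    features to the Q active code units."""
    code = choose_code(x)               # one winning unit index per CM
    active = np.flatnonzero(x)          # indices of the S active input features
    for cm, winner in enumerate(code):
        W[active, cm, winner] = 1       # binary, single-trial, Hebbian-style update
    return code

# Present the first input: a random binary vector with exactly S active features.
x1 = np.zeros(N_INPUT, dtype=np.uint8)
x1[rng.choice(N_INPUT, size=S, replace=False)] = 1
phi_x1 = store(x1)
print("nonzero weights after one input:", int(W.sum()))   # S * Q = 2000
```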
Now, despite the fact that the model knows nothing a priori about the input space, suppose that we, as external observers, know that the input space has D underlying (latent) dimensions. For concreteness, let D=3. And for simplicity and concreteness, let each of those dimensions be 2-valued. Thus, we are saying that regardless of all the possible input patterns that might be presented to the model, there are only 2^3 = 8 underlying states of the physical entities that generate this input space. Now, let the first input, x1, be presented, and let the SDC, φ(x1), be assigned as its code. The first thing to note is that there are K^Q (20^100) possible codes that the model could assign as φ(x1). This code, φ(x1), is of size Q=100. φ(x1) was assigned randomly by choosing one (out of the K=20) units in each of the Q=100 CMs. We know that this input reflects one particular state, out of 8 possible states, of the world, and that it corresponds to one particular value (out of two) on each of the three dimensions. The act of assigning this particular set of 100 units (out of the total of 2000 units that comprise the CF) as φ(x1) vastly constrains the space of codes that this model can assign to future inputs. That is, this set of 100 units collectively represents the specific values of the D=3 latent dimensions. We don't know which of these 100 units represents which value of which dimension. At this point, we only know that, collectively, these 100 units represent all three values. The identities of the specific subsets of the 100 units that represent each specific value of each latent dimension will only, indeed can only, become known as future inputs are presented. Indeed, their identities are not determined until (a sufficiently large number of) these future inputs are presented and their codes assigned.
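As a quick check of the numbers in the preceding paragraph, the size of the coding field and of the code space can be computed directly:

```python
K, Q = 20, 100
print(K * Q)              # 2000 units in the coding field
print(K ** Q)             # 20**100 possible codes (one unit chosen per CM)
print(len(str(K ** Q)))   # 131 digits, i.e., roughly 1.3 x 10**130
```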
So, how do we formally talk about how the space of future codes decreases with each successive assignment of a code? Well, ok, φ(x1) has been assigned. It consists of 100 units and it represents three values (one on each of the D=3 dimensions). Let's also assume that the latent dimensions are completely uncorrelated, i.e., that the 8 possible states all occur at chance levels. Now a second input, x2, is presented. x2 will have a specific value on each of the three latent dimensions. Consider one of the dimensions, d1. x1 also had one of two possible values for d1. Denote that value d1(x1). That value must be physically represented by some subset of the 100 units of φ(x1). We don't yet know which units they are. And we don't know how many of those 100 units specifically represent d1(x1). Nevertheless, that set (possibly of size one) must exist. How does the fact of that set's existence constrain the set of units that represent d1(x2)? Again, we've assumed that there are only two values of d1. Suppose d1(x2) = d1(x1). Let's make the simplest assumption possible, which is that in the final mapping from inputs to codes, i.e., after all inputs have been presented and assigned to codes, i.e., stored, the sets of units that represent each value of each dimension are disjoint. Let's call such a mapping "monolithic" (it can also be referred to as a "redundant localist" mapping). There are 3 × 2 = 6 unique value-dimension pairs that must be represented. So, since Q=100, these 6 unique quantities could be represented by sets of roughly Q/6 ≈ 16 units each. But there is no requirement that all the sets need to be the same size, or even approximately the same size.
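As a baseline for what "more highly intersecting" means under these parameters (an illustrative calculation, not a claim made in the text above): two codes assigned completely at random, one winner per CM chosen uniformly, agree in any given CM with probability 1/K, so their expected intersection is Q/K = 100/20 = 5 CMs. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, K = 100, 20

def random_code():
    """One winner per CM, chosen uniformly at random."""
    return rng.integers(0, K, size=Q)

# Intersection = number of CMs in which two codes chose the same winner.
# For two independent random codes the expectation is Q * (1/K) = 5.
overlaps = [int(np.sum(random_code() == random_code())) for _ in range(10_000)]
print(sum(overlaps) / len(overlaps))   # close to 5.0
```

Intersections well above this chance level are what carry information about shared latent values.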
The first question to ask, then, is: when φ(x1), which represents three particular feature values, is assigned, by how much has the code space available for future inputs been decreased?
Before answering that question, here is an example that helps to explain all this. (In progress)