The video below illustrates the fundamentally novel concept of Sparsey's code selection algorithm (CSA) (see Rinkus, 2010 or Rinkus, 2014 for improved CSA), which is that the property that more similar inputs are mapped to more similar codes (SISC) can be achieved simply by making the scalar level of noise in the code choice process vary inversely with the familiarity (directly with the novelty) of the input. This can only be achieved if the codes are distributed codes, rather than localist. In Sparsey, the codes (representations) of inputs are sparse distributed codes (SDCs), i.e., small subsets of an overall population of binary representational units. Therefore, the natural metric for similarity in code space is size of intersection.
We will first focus on the V-to-µ plot (lower right) and the three sliders controlling the sigmoid's parameters and on the histograms at upper right. Although the upper left panel ("Region with 6 WTA CMs") also changes as the sliders vary, please disregard it for the moment. It will be explained below.
At the start of the video, several clicks, which span a range of V values, are made in the V-to-µ plot. Each click adds a unit with that V value to the single CM represented at upper right. Note that the p values renormalize with each cell added. A unit's V value is a measure of how closely its current input matches its afferent weight pattern (tuning), i.e., the "evidence" or "support" that it should become active. The CSA transforms these local V values into relative win probabilities (µ values) based on a mac-global familiarity measure, G. The µ's are then renormalized to absolute win probabilities (p values). The p histogram reflects the particular settings of the sigmoid's three parameters (sliders at left) and the set of V values. In the applet version, the user could experiment with the parameters and different choices of V values to see the effect on the p distributions.
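The V-to-µ-to-p pipeline described above can be sketched in a few lines. This is only an illustration: the exact sigmoid used by the CSA is given in Rinkus (2010) and is not reproduced here, so the logistic form below, the function names, and the `eta` (maximum height) parameter are all assumptions. The only properties taken from the text are that the map's height grows with G, that it collapses to a constant when G = 0, and that the µ values are renormalized to p values.

```python
import math

def v_to_mu(v, g, y=0.5, ecc=30.0, eta=100.0):
    """Map a unit's local support V (0..1) to a relative win likelihood mu.

    Assumed form: a logistic curve whose height scales with familiarity G.
    y is the inflection point position, ecc the steepness (eccentricity),
    eta a hypothetical maximum height. At g == 0 the map is the constant 1.
    """
    return 1.0 + (g * eta) / (1.0 + math.exp(-ecc * (v - y)))

def mu_to_p(mus):
    """Renormalize relative likelihoods into win probabilities that sum to 1."""
    total = sum(mus)
    return [m / total for m in mus]

def cm_win_probs(vs, g, y=0.5, ecc=30.0, eta=100.0):
    """V values of one CM's units -> p values of that CM."""
    return mu_to_p([v_to_mu(v, g, y, ecc, eta) for v in vs])

if __name__ == "__main__":
    vs = [1.0, 0.4, 0.2, 0.05]  # one CM with four units
    for g in (1.0, 0.5, 0.0):
        print(g, [round(p, 3) for p in cm_win_probs(vs, g)])
```

With these assumed settings, the unit with V = 1 dominates at G = 1, and all four units become equally likely at G = 0.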
This video's upper left panel shows a tiny "toy" instance of a macrocolumn ("mac"), which consists of Q=6 winner-take-all (WTA) competitive modules (CMs), which I hypothesize correspond to cortical minicolumns ("mincs"). Thus, a single macrocolumn code consists of one winning unit in each of the Q mincs. When the user clicks in the V-to-µ plot, it adds a unit to all six of the CMs. Note, however, that the specific V values associated with the clicks are not used in the upper left panel. The clicks merely determine the number, K, of units per CM in the upper left panel. For each of the six CMs, a random set of K V values is generated from a constrained range. The details of the constraints depend on which of the five radio buttons at upper left is selected.
The four upper radio buttons represent scenarios in which G is assumed to be 1. In this case, there must be at least one cell in each CM that has V = 1. One cell (magenta) in each CM is randomly picked to have V = 1. The V values of the other K-1 cells in each CM are then chosen from a distribution whose parameters are set so as to simulate conditions at various points in the 'life' of the model.
- Familiar-Early: The idea is that early in the model's life, when few memories have been stored, there will be little crosstalk (interference, overlap) between memory traces. To simulate this, the V values of the other cells are chosen randomly from the interval [0,0.1].
- Familiar-Middle: In this case, there is still a single cell with V = 1. However, the other V values are now selected from the interval [0.1,0.5]. This simulates a later period of life, after many more inputs have been stored and any given cell has been used in the codes of many of those inputs. Since the expected number of occurrences of any particular input feature must increase with the number of inputs stored, the cross-talk between codes (memory traces) must increase. Thus, although G is assumed to be 1, which, again, means that there is at least one cell in each CM with V=1, there will be a significant number of other units in each CM with appreciable V values.
- Familiar-Late: Again, one cell in each CM is randomly chosen to have V=1. The remaining V values are chosen from the interval [0.2,0.8], indicating mounting cross-talk (interference) between traces.
- Familiar-Old: The model has stored so many traces that cross-talk is very high. Specifically, the K-1 other V values are chosen from the interval [0.6,0.9].
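The four Familiar scenarios above amount to a simple recipe for generating V values. The sketch below follows that recipe; the intervals come from the scenario descriptions, while the function names and the use of a uniform distribution over each interval are assumptions made for illustration.

```python
import random

# Intervals for the K-1 non-matching units, taken from the scenario descriptions.
FAMILIAR_RANGES = {
    "early":  (0.0, 0.1),
    "middle": (0.1, 0.5),
    "late":   (0.2, 0.8),
    "old":    (0.6, 0.9),
}

def familiar_cm(k, scenario, rng):
    """Return (v_values, magenta_index) for one CM with k units when G = 1."""
    lo, hi = FAMILIAR_RANGES[scenario]
    # Assumption: the other units' V values are uniform over the interval.
    vs = [rng.uniform(lo, hi) for _ in range(k)]
    magenta = rng.randrange(k)  # the unit that matches the input perfectly
    vs[magenta] = 1.0
    return vs, magenta

def familiar_mac(q, k, scenario, seed=0):
    """Build q independent CMs, each with k units."""
    rng = random.Random(seed)
    return [familiar_cm(k, scenario, rng) for _ in range(q)]

if __name__ == "__main__":
    for vs, magenta in familiar_mac(q=6, k=5, scenario="late"):
        print(magenta, [round(v, 2) for v in vs])
```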
A major principle to see here is that, for any given set of sigmoid parameters, increasing cross-talk reduces the expected number of CMs in which the cell with the highest V value ends up being chosen winner. When the input is perfectly familiar, i.e., when G=1, each CM in which the max-V cell fails to win constitutes an error, i.e., the wrong cell has been activated in the CM. However, a code consists of Q units; thus, making a few unit-level errors may still allow the code, as a whole, to exert the proper influence on downstream computations. More generally, the fact that in Sparsey the Q units that comprise a code are chosen in independent processes is what allows for graded levels of similarity (intersection) between codes, i.e., the SISC property. Note that although we don't actually make the final winner choices in this video, the gradual flattening of the p distributions as one goes from "Early" to "Old" implies the graded reduction in the expected number of max-V cells that end up winning in their respective CMs. In the applet, the user can play with the sigmoid parameters to see how they affect the final p distributions.
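The "expected number of max-V cells that win" can be read directly off the p distributions: it is the sum, over CMs, of the p value of the max-V cell. A minimal sketch follows, with a hypothetical function name and made-up p values chosen only to contrast a peaked case with a flattened one.

```python
def expected_correct_units(p_by_cm, correct_idx):
    """Expected number of CMs whose winner is the intended (max-V) unit.

    p_by_cm[c][i] is the win probability of unit i in CM c;
    correct_idx[c] is the index of the intended unit in CM c.
    By linearity of expectation this is a plain sum of probabilities.
    """
    return sum(p[i] for p, i in zip(p_by_cm, correct_idx))

# Illustrative (made-up) p distributions for Q = 6 CMs with K = 3 units each.
peaked = [[0.9, 0.05, 0.05]] * 6   # low cross-talk: distributions are peaked
flat = [[0.4, 0.3, 0.3]] * 6       # high cross-talk: distributions are flatter
correct = [0] * 6

print(expected_correct_units(peaked, correct))  # about 5.4 of 6
print(expected_correct_units(flat, correct))    # about 2.4 of 6
```

The expected intersection with the stored code thus falls gradually, not all at once, as the distributions flatten.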
- Novel-Early: If the input is highly novel, then no cell in any CM has a high V value. In fact, if G = 0, then all cells' V values must also be 0. When G = 0, the sigmoid collapses to the constant function. I should really modify the applet so that it pins the G slider all the way to the left when the user clicks this radio button. But for now, the user should explicitly do this. Here, rather than assuming all cells' V values are exactly zero, we assume they are distributed in a small range near 0. The magenta color can be ignored in this case. The main point to see here is that when G=0, all cells end up equally likely to win. This corresponds to maximizing the average Hamming distance between the code being newly assigned at the current moment and all previously assigned codes. This in turn corresponds to maximizing the number of codes that can be stored.
The Inflection Point Position parameter, Y, controls where along the x-axis the rapidly changing region of the V-to-µ map lies. Select the Familiar-Early radio button. You will see highly peaked distributions in each CM (assuming that you have added some cells to the CMs by clicking in the V-to-µ plot). If you now slide the Y slider around, you will see that the range of Y for which the correct cells (magenta) have very high probability of winning is very large, extending from just over 0.1 (10) up to 1 (100). Now select the Familiar-Middle button. Note that the model can still ensure a high probability of reactivating the correct code (the magenta cells), but that the range of Y for which this is the case has contracted to the right, typically spanning from about 0.5 to 1.0. If you click on the "Late" and "Old" buttons, you will see the range of Y yielding a high probability of success shrink further. By the time the model is 'Old', the crosstalk is so high that Y must be very near 1.0 (100) in order to have a good chance that the entire correct code (all six magenta cells) will be chosen.
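The shrinking range of workable Y values can be reproduced qualitatively with a small sweep. The sketch below assumes a logistic V-to-µ map (not the applet's exact function), a hypothetical height parameter `eta`, and, to keep the output deterministic, spreads the K-1 other V values evenly over each scenario's interval instead of drawing them randomly. The specific numbers it prints therefore will not match the applet; only the trend should.

```python
import math

def v_to_mu(v, g, y, ecc=30.0, eta=100.0):
    # Assumed logistic form; not the applet's exact function.
    return 1.0 + (g * eta) / (1.0 + math.exp(-ecc * (v - y)))

def p_magenta(y, lo, hi, k=5, g=1.0):
    """Win probability of the V = 1 unit in a CM whose other k-1 units
    have V values spread evenly over [lo, hi]. Requires k >= 3."""
    others = [lo + (hi - lo) * i / (k - 2) for i in range(k - 1)]
    mus = [v_to_mu(1.0, g, y)] + [v_to_mu(v, g, y) for v in others]
    return mus[0] / sum(mus)

SCENARIOS = {"Early": (0.0, 0.1), "Middle": (0.1, 0.5),
             "Late": (0.2, 0.8), "Old": (0.6, 0.9)}

# One row per scenario; columns are Y = 0.1, 0.2, ..., 1.0.
for name, (lo, hi) in SCENARIOS.items():
    row = [round(p_magenta(y / 10, lo, hi), 2) for y in range(1, 11)]
    print(f"{name:7s}", row)
```

Under these assumptions, the Early row stays high over most of the Y range, while the Old row is high only in its rightmost columns.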
The question arises: why not simply keep Y at a high value throughout the model's life? The answer is that there is a tradeoff involving Y. When Y is very high, then even fairly high V values will be squashed to near-zero p values. This greatly diminishes the propensity of the model to assign similar (i.e., more overlapped) codes to similar moments. Keeping Y at or near 1.0 would simulate a person who stored almost all new experiences very uniquely, but was impoverished at forming categories based on similarity. Such a person would have very (i.e., too) strong episodic memory ability/capacity and weak semantic memory ability, perhaps something like a savant syndrome, e.g., Luria's famous patient S, "the mnemonist".
Y should depend on how saturated a region's afferent synaptic matrices have become, i.e., what fraction of its synapses have been increased to the high setting, e.g., w=1, assuming binary synapses. I think that such a neurophysiological parameter would be quite easy for the brain to keep track of through the life of an organism. Moreover, it seems reasonable to believe that such a 'degree of saturation' parameter could be maintained on a region-by-region (or macrocolumn-by-macrocolumn) basis across cortex. Thus, for example, the brain region(s) that are most used in storing the memory/knowledge of a given individual's particular field of expertise might fill up faster than other regions.
The Sigmoid Eccentricity parameter simply controls how abruptly the V-to-µ map changes from its low to its higher values. It affects the granularity of the spatiotemporal categories formed by the model. However, we will not discuss it further here.
In summary, the goal of this page and video is to show how a local (cell-level) measure of support, V, over a competitive population of cells can be transformed, using global (macrocolumn-level) information, G, into a distribution of relative likelihoods of becoming active (i.e., of winning the competition), µ, and then, by simple normalization, into a final probability distribution of winning, p. Note that we haven't told you how G is computed here (that is described in Rinkus, 2010), but it is extremely simple and is a constant-time operation (as is the entire CSA). In the Sparsey family of models, V is a product of three normalized summations: a cell's normalized bottom-up (U) summation, its normalized top-down (D) summation, and its normalized horizontal (H) summation. The H and D inputs carry temporal context information, i.e., about the sequence of inputs leading up to the current input. Therefore, G (which is computed from the V's) is formally a spatiotemporal similarity metric. Every code stored in every macrocolumn in Sparsey, and we believe in any cortical macrocolumn, formally represents a specific moment in time, not a specific purely spatial input pattern.
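As a rough illustration of V as a product of three normalized summations, consider the sketch below. How each summation is normalized is not stated here, so dividing by the largest attainable sum is an assumption, as are the function names. The point of the example is only that weak temporal-context support (H or D) lowers V even when the bottom-up match is perfect.

```python
def normalized_sum(active_weights, max_total):
    """Sum of the weights on currently active afferents, scaled into [0, 1].

    max_total, the largest sum attainable, is an assumed normalizer.
    """
    return sum(active_weights) / max_total if max_total else 0.0

def v_value(u, h, d):
    """Combine normalized bottom-up (U), horizontal (H), and top-down (D) support."""
    return u * h * d

# Binary weights; each list holds the weights of the afferents that are active now.
u = normalized_sum([1, 1, 1, 1], max_total=4)  # input matches the tuning exactly
h = normalized_sum([1, 1, 0, 0], max_total=4)  # only half the expected context
d = normalized_sum([1, 1, 1, 1], max_total=4)

print(v_value(u, h, d))  # 0.5: same spatial input, less familiar moment
```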
Sliding the G slider all the way right corresponds to G=1.0 (even though the slider values go from 0 to 100). This maximizes sigmoid height, providing the maximal advantage of high-V cells over low-V cells. Sliding it all the way left (G=0.0) collapses the V-to-µ map to the constant function, ultimately causing all cells to be equally likely to win.
The logic is simple. G=1.0 means that the current moment is completely familiar. In this case, the model should reactivate the set of cells that are causing G to be high. These are the V=1.0 cells. Hence, the policy of maximizing the height of the V-to-µ map. On the other hand, when G ~ 0, meaning that the current moment is completely novel, the model should pick a new code to represent the novel moment. This new code should have low (or at least, minimal) intersection with any previously stored code. Hence, the policy of collapsing the V-to-µ map to the constant function. This constitutes a fundamentally new way of using noise (randomness) to achieve SISC, which requires the use of SDCs.
You can now clearly see the main point of this video: even though the winner selection occurs independently in each CM, the probability of the entire correct code, i.e., all Q=6 units with V=1.0, being chosen (which indicates recognition of a previously experienced input) can be close to 1. Exactly how close depends on many model parameters, e.g., the V-to-µ function parameters, and on the amount of information stored in the network (roughly, its "age").
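Because the Q draws are independent, the probability of reactivating the whole code is the product of the per-CM probabilities of the correct units. The sketch below computes that product and checks it by actually drawing winners, a step the video itself does not show. The function names and the p values are made up for illustration.

```python
import random

def draw_code(p_by_cm, rng):
    """Pick one winner per CM, independently, according to that CM's p values."""
    return [rng.choices(range(len(p)), weights=p, k=1)[0] for p in p_by_cm]

def prob_whole_code(p_by_cm, correct_idx):
    """Exact probability that every CM picks its intended unit."""
    prob = 1.0
    for p, i in zip(p_by_cm, correct_idx):
        prob *= p[i]
    return prob

def estimate_whole_code(p_by_cm, correct_idx, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    target = list(correct_idx)
    hits = sum(draw_code(p_by_cm, rng) == target for _ in range(trials))
    return hits / trials

# Made-up p values: Q = 6 CMs, K = 4 units, intended unit at index 0.
p_by_cm = [[0.97, 0.01, 0.01, 0.01]] * 6
correct = [0] * 6

print(prob_whole_code(p_by_cm, correct))      # 0.97 ** 6, roughly 0.83
print(estimate_whole_code(p_by_cm, correct))  # should land near the exact value
```

Even a per-CM success probability of 0.97 leaves the whole-code probability near 0.83 for Q = 6, which is why the per-CM distributions must be sharply peaked for recognition to be reliable.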
NOTE: In this video, we do not show the inputs explicitly. Rather, the user can vary the level of input familiarity, G (see the G slider at left), and see the effect it has on the code assigned. More to the point, the user can see G's effect on the expected intersection of the code being assigned with the code of the most similar previously stored input.