Storage Capacity of a whole Hierarchy of Macrocolumns

This page has my general argument that for naturalistic input domains, a hierarchy of macrocolumns could operate indefinitely storing all novel, important inputs, without ever reaching capacity, and operating with O(1) time complexity for both learning and best-match retrieval.

My prior work has demonstrated that Sparsey’s core algorithm (for learning and inference) runs in fixed time, i.e., the number of operations needed either to store a new item or retrieve the best-matching item remains fixed as the number of stored items increases. A Sparsey mac (SDR coding field) is an associative memory in which items (memory traces) are stored in superposition. Thus, as the number of items stored grows, interference (crosstalk) will increase, reducing the average retrieval accuracy over the set of stored items. We define a mac's storage capacity to be the number of items that can be safely stored, i.e., stored so that average retrieval accuracy over all stored items exceeds a defined threshold, e.g., 97%. My 1996 Thesis results showed that a mac's storage capacity scales well, i.e., linearly in the weights and super-linearly in the units. Nevertheless, the numbers of units and weights needed to safely store N items grow with N. Since the number of operations needed to store or retrieve an item involves an iteration over the weights to the mac (i.e., each cell must sum its inputs), the best we can say for an individual mac is that it has fixed run time but not that it truly has O(1) time complexity.

However, if we can construct an overall system—specifically, a hierarchy of macs—in such a way that, throughout a lifetime operating and in a naturalistic domain, i.e., a domain having 1/f statistics, i.e., a domain whose objects/events are strongly compositional, i.e., a recursive part-whole hierarchy/heterarchy:

  1. no mac ever exceeds its capacity
  2. the rate at which macs approach capacity, i.e., the rate at which new inputs occur, decreases quickly, e.g., exponentially, with level; and
  3. processing (learning / retrieving) an input entails a single iteration over all the macs, i.e., a single bottom-up, level-by-level processing pass over all the macs

then the overall system can remain able to store new inputs without losing old knowledge and with fixed time storage and best-match retrieval over potentially very long lifetimes.  I refer to such a system as effectively possessing O(1) time complexity storage and best-match retrieval. Point 3 is needed because if the run time of each mac is fixed, then the sum of the run times of all macs is also fixed. Indeed, Sparsey's algorithm satisfies Point 3. Furthermore, in practice, as can be seen in many of this website's animations, only a small fraction of the macs activate, and thus contibute to run time, on each time step. [Furthermore, it is not necessary that a processing pass be limited to only a single iteration over all macs (though that is the case for the current version of Sparsey): any fixed number of iterations over the macs implies that the total processing time is fixed.]

With regard to Point 1, there are two scenarios by which a mac can avoid exceeding capacity over the lifetime of the overall model.  Either: A) we can "freeze" the mac when it reaches capacity, preventing any further codes from being stored in it; or B) it can happen that the set of novel inputs generated by the naturalistic input domain is smaller than the mac's capacity. Broadly, scenario A dominates in the lower level macs and scenario B dominates at the higher levels.  Part of the reason scenario B dominates in higher level macs is the freezing that occurs in lower level macs.

First we consider the lower level macs. For concreteness, let's consider the first internal level (L1) of macs, e.g., V1. An L1 mac has a small receptive field (RF), i.e., a small aperture onto the visual field.  Sparsey assumes the input level (L0) is an array of binary pixels. Thus, an L1 mac's RF might be a 8x8-pixel patch of the input level. Furthermore, we assume input video frames are preprocessed by being edge-filtered, binarized, and skeletonized.  For naturalistic video and such a small aperture, this generally yields frames containing one/few edge segments, often with quite low curvature. Figure 1a shows examples of such inputs. Figure 1b shows examples of inputs that would be extremely unlikely to occur given the preprocessing.

Likely and unlikely features in a small RF

Figure 1: Possible input patterns (features), assuming the described preprocessing, for a hexagonal RF of about 70 pixels. (a) Features likely to occur. (b) Features not likely to occur.

In addition, Sparsey implements the policy that a mac only activates if the number of active units in its RF is within a small range. Consider a smaller hexagonal RF with 37 pixels (approx. 6x6) and suppose the range is [9,10], i.e., an SDR is only activated in the mac if 9 or 10 pixels are active in its RF.  In that case, there are C(37,9) + C(37,10) = ~473 million possible input patterns for which the mac will become active. However, given the preprocessing, only a small fraction of those, perhaps in the tens or hundreds of thousands, are likely to ever occur.  Figure 2 shows some examples of patterns that might occur.

acceptable and unacceptable features for 6 to 10 pixel range

Figure 2: Possible input patterns (features), assuming the described preprocessing, for a hexagonal RF of about 37 pixels. (a) If the allowable pixel activation range is [9,10], these inputs will activate the mac and be assigned to a code. (b) These inputs will not be assigned since they are outside the allowable range.

Nevertheless, for biologically realistic instantiations of a mac, e.g., Q=70, K=20, likely at most only a few thousand codes could be safely stored.  [This estimate is based on roughly extrapolating from these capacity results in which a model with Q=100, K=40, and a 10x10-pixel input level, was able to store, to 97% recall accuracy, 3,084 10-frame sequences (equivalent to 30,840 individual frames). We would expect lower capacity in the realistic case due to the lower Q and K values and the need to map similar inputs to similar codes (which was not imposed in the hyperlinked studies).]  Thus, as the eye experiences the world, it is likely that the number of unique inputs this mac experiences, will quickly exceed its capacity. To prevent this mac from losing memories, we will have to freeze it.  In practice, the freezing is implemented simply by preventing any further changes to the weights comprising the mac's bottom-up afferent weight matrix: this effectively prevents new codes from being stored.

We must now consider the consequences of freezing a mac. We will refer to the set of codes stored in the mac, or rather their corresponding inputs, as the mac's basis. This consideration is done with two questions.

  1. First, given the assumptions above, i.e., the preprocessing, the tight range of input activation required to activate the mac, and the naturalistic input domain, how good a job can the frozen basis do at representing all future inputs to the mac? More formally, we define the fidelity of a basis to be the expected intersection of the mac's input to the nearest basis element, averaged over all future inputs for which the mac activates. Given all the assumptions of this example, we expect that a basis of only a few thousand, or even a few hundred, could yield quite high fidelity, approaching 100%. We intend to conduct such an empirical analysis in the near future.  Figure 3 provides some intuition as to why we might expect high fidelity even with a relatively small basis.
  2. intuition why small basis could nevertheless yield high fidelity

    Figure 3: Illustration of why a small basis might be expected to yield high fidelity. The question is wouldn't each pattern in the top row be a sufficiently close representative of the any of the patterns below it in its column?

  3. Second, given that an L1 mac corresponds to only a small portion of the whole visual field and thus that its input pattern may correspond to only a small portion of an object or scene to be recognized, how high a fidelity is really required at L1? This concept is illustrated in Figure 4. Broadly, the idea is simply that the object of interest here, the plane, spans many macs' RFs. The overall shape of the plane, i.e., the relative position of its wings, nose, tail, etc., is carried not by the precise features active in each RF, but by the relative positions of the RFs themselves. This information, i.e., the overall pattern of active L1 macs, can, in principle, be recognized by an L2 mac whose RF included all the L1 macs shown. Thus, to the extent that some of the meaning of an overall input image is carried by the spatial topology of the macs at a level, J, this reduces the fidelity needed by the level J macs in order for the level J+1 macs to achieve any given level of fidelity. Figure 4b shows the case where the closest-matching basis element to the actual contour (red in panel a) is activated.  Figure 4d shows that even if very poorly matching basis elements are activated in many/all of the apertures, substantial information, e.g., that the object has four lobes is still present.  Again, we intend to empirically analyze this principle in the near future.
  4. illustration of meaning partially carried by topology

    Figure 4: Illustration that in a hierarchy, some information can be carried by the topology, i.e., where features occur in the global input space, which reduces the amount of information, e.g., fidelity, needed in individual portions of the global space.

Overall, these considerations suggest that freezing learning in lower level macs, e.g., in V1, V2, may be quite acceptable: it permanently preserves the learning in these macs (i.e., protects against catastrophic forgetting), yet allows that the frozen bases might yield sufficienly high fidelity for representing all future inputs to the system.

Now, we consider the higher level macs.  First, it is essential that any autonomous, intelligent system be able to store new information, i.e., new permanent memorry traces, for its entire lifetime, cf. "lifelong learning".  In practice, we will be satisfied with a statistical argument that the expected time to reach the condition where all macs comprising the overall hierarchy have reached their capacities is some large value, e.g., a human lifetime, 10 human lifetimes, etc. More precisely, we will be satisfied with a statistical argument that the rate at which new inputs are expected to occur (to macs) decreases exponentially with level. Thus, broadly, the envisioned scenario is that as the system experiences the world, many/all of the macs at lower levels may reach capacity and be frozen rather early in the system's lifetime, but that due to the statistical argument below, many/some of the macs at higher levels will never reach capacity, and thus never be frozen, and thus be able to store new inputs throughout the expected lifetime of the system. I emphasize that there is substantial evidence, across many/all sensority modalites, for "critical periods", i.e., the phenomenon that if a particular cortical region does not experience input, or certain classes of input, prior to some age, that region never develops the ability to recognize such inputs.  I believe that the purpose of biological critical periods is precisely to serve the functionality described here, i.e., prevent too many codes from being stored in macrocolumns, and thus prevent catastrophic forgetting.  Furthermore, there is plentiful anecdotal evidence that as we age, the features/concepts with which we experience, intrepret, deal with, the world become entrenched.  In general, over many decades, one retains to ability to form new higher level concepts (abstractions), but they are likely built out of long-fixed low-level features/concepts.

So, by what reasoning can we claim that the rate at which macs at higher levels approach capacity likely slows down exponentially with level?  First, as described above, the particular preprocessing used and the tight criteria for activating a mac already greatly constrain the set of inputs expected to be experienced and stored in a mac. Suppose that we impose that following the point of freezing, not only will no new codes ever be stored (learned) in the mac, but also that only the set of already stored codes will ever be allowed to activate in the mac.  Now, suppose that the RF of a level J+1 mac is a set of levl J macs.  That is, the boundary of the level J+1 mac follows exactly the boundary of the set of level J macs comprising its RF. That means that following the point at which all level J macs freeze, the set possible of inputs to a level J+1 mac that can ever possibly occur is fixed.  Suppose also that a level J+1 mac only activates if the number of active level J macs in its RF is within a tight range. For concreteness, suppose an L2 mac's RF consists of 9 L1 macs and that the L2 mac will ony activate if exactly three of its nine L1 macs is active, i.e., if the number of active L1 macs is in the range [3,3].  Thus, the space of all possible inputs to the L2 mac can be partitioned into sets corresponding to C(9,3)=84 3-part compositions. Suppose that L1's critical period has elapsed (though, the freezing actually occurs on a mac-by-mac basis) and that the basis size of all the frozen L1 macs is 1000. In this case, the total number of actual input patterns that the L2 mac can ever experience for the remainder of the system's life is 84x1000x1000x1000 = 84x109. This may seem like a large space, but we must compare it to the total number of unique L0 input patterns that were initially possible, i.e., prior to any learning at any level, within the "composite L1 RFs" corresponding to the 84 specific 3-part compositions. Let's consider just one of those 84 3-part compositions. We said above that the space of possible patterns that would cause any of the L1 macs to activate was C(37,6) + C(37,10) (~473 million). However, due to the filtering, we speculated that only perhaps a few tens/hundreds of thousand patterns, let's say 105, for concreteness, would actually ever arise in the RF of any given L1 mac. Thus, the total unique patterns for a single 3-part composition is 1015, and the total for all 84 3-part compostions is 84x1015. Thus, the effect of the critical period has been to reduce the input space to the L2 mac by 6 orders of magnitude.

The above reasoning applies at every level. In general, we could quantify the reduction in input space at level J+1 post freezing of level J macs, but the broad conclusion is that the combined effect of hierarchy, compositional RFs, and critical periods, reduces the space of possible inputs to a level J mac exponentially relative to its initial size, i.e., prior to any learning. Note that in Sparsey, all weights are zero at "birth".  Further, the size of that relative reduction increases exponentially with level.

However, our ultimate goal (Point 2, above) is to show that the expected rate at which new inputs occur decreases exponentially with level. If that is true, then the higher levels of a model with only a small number of (e.g., <10) levels could have very low expected rates of occurrence of new features.  Such a model would therefore have a very long expected lifetime operating in a natural world without exhausting its memory capacity. Broadly, the intution that supports this is that natural worlds have 1/f statistics, i.e., the expected frequency of input patterns, i.e., features, decreases approximately linearly with the size/complexity of the features. This reflects the strongly, hierarchical, recursive, compositional structure of the natural world, both in space and time, i.e., entities are made of parts, which are made of smaller scale parts, etc. The forces/relations that exist between parts at the same scale and across scales severely limits the configuration spaces of wholes (and intermediate compositions) and therefore the possible appearances of wholes. What this essentially means is that only a relatively tiny fraction of the space of all possible inputs can actually occur, but those that do occur tend to recur far more frequently than chance, i.e., a "heavy tail". If a relatively small set of input patterns occur far more frequently than chance, that implies that the rate at which new patterns occur must be much lower than chance.  And, note that a novel input, X, is assigned a code only on its first occurrence: all subsequent occurrences of X simply reactivate its previously assigned code, i.e., X is recognized. Moreover, the sensitivity of a mac's match computation (i.e., the computation of the familarity of a new input given the set of previously stored inputs) can be parametrically varied.  In general, there will be hypersphere around X within which a novel input, X', will, with high probability, cause X's code to reactivate exactly.  This acts to further reduce the effective rate at which new codes are assigned, i.e., the rate at which the mac approaches its storage capacity.

Sparsey has one further principle that further reduces the rate at which a mac adds new codes. It is the principle that the number of time steps that a code stays active, i.e., its persistence, increases with level.  In almost all studies to date, we have simply doubled persistence with level. This explicitly halves the rate at which new codes are assigned at each higher level and acts on top of the statistical principles described above.

While not a proof, I believe the above principles collectively support the claim that the expected rate of occurrence of new inputs drops exponentially with level and therefore that we can expect hierarchies of macs to have long operational lifetimes in natural worlds without ever exhausting memory capacity. Empirically verifying and quantifying this property is a high priority task for the near future.  As part of this analysis I will also compute fidelity measures for each level. That is, in addition to the requirement that the overall model retains the ability to learn new permanent memory traces throughout its life, it must also be the case that the model builds an internal model that captures the domain's underlying statistics and thus does well at recognizing new instances and making predictions.  Given that the algorithm that runs in each mac, for both learning and best-match retrieval, has a fixed number of operations for the life of the system, and that an instance of either learning or retrieval for the whole model involves a single iteration over the macs comprising the hierarchy, the time complexity of learning and best-match retrieval for the whole model is also fixed for the life of the system, and thus, effectively O(1).