Elie Bienenstock
Division of Applied Mathematics
Brown University
Summary of presentation at the ESF Workshop on
Principles of information coding and processing in the brain
Trieste, Italy, 4-6 September 1999
Rather than discussing the “binding problem”—a phrase which I find imprecise and stale—I would like to discuss specific idiosyncratic short-lived functional links. In short, fast links.
Here is a simple thought experiment to illustrate the notion of a fast link. Consider familiar object classes, such as fish, cloud, hat, telephone, etc. Each such object is a semantic item, which I will also call a basic category or a generic class. It can come in a great diversity of instances. An instance, or view, is a complete specification, pixel by pixel, of an image of an object in the class. Let’s say there are one hundred instances for each object class. Two instances in the same class can be as different as you can possibly imagine, as long as both unambiguously represent the same generic object, say a hat. In the experiment, the subject is first briefly shown a sample image with four objects in it. Each of the four objects is chosen randomly from a large collection of 10,000 (we have on the order of 10,000 basic categories in our brain). The instance is chosen randomly from the one hundred available. The four objects in the sample image are actually presented as two pairs, say an upper pair and a lower pair. The pairs are also defined randomly. To make things more explicit, there is a thick double-headed arrow connecting the two objects in the upper pair, say the fish and the telephone, and a thick double-headed arrow connecting the two objects in the lower pair, the hat and the cloud. The subject is asked to memorize these associations in short-term memory. Shortly thereafter, and possibly after the presentation of a masking image, the subject is shown a test image with the same four objects as appeared in the sample image, but the instances are different, the positions are random, and there are no arrows. One of the objects is circled, and the subject is asked to point to the other element of the pair, i.e., to the object that was paired, in the sample image, with the circled object. My guess is that with a little training subjects will perform very well in such a task.
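To make the protocol concrete, here is a minimal sketch of how one trial of this thought experiment could be generated. It is only an illustration: the function name `make_trial`, the dictionary layout, and the assumption that the four categories are distinct are mine and are not specified in the text.

```python
import random

N_CATEGORIES = 10_000  # basic categories, as assumed in the text
N_INSTANCES = 100      # instances (views) per category, as assumed in the text

def make_trial(rng):
    """Build one hypothetical trial: a sample image and a test image."""
    # Four distinct categories in random order (distinctness is an assumption).
    cats = rng.sample(range(N_CATEGORIES), 4)
    # Random pairing: first two form the upper pair, last two the lower pair.
    pairs = [(cats[0], cats[1]), (cats[2], cats[3])]
    partner = {}
    for a, b in pairs:
        partner[a] = b
        partner[b] = a
    # One instance per object in the sample image.
    sample_inst = {c: rng.randrange(N_INSTANCES) for c in cats}
    # The test image uses a different instance of each object.
    test_inst = {
        c: rng.choice([i for i in range(N_INSTANCES) if i != sample_inst[c]])
        for c in cats
    }
    # Random positions in the test image, no arrows; one object is circled.
    test_order = rng.sample(cats, 4)
    cued = rng.choice(cats)
    return {
        "categories": cats,
        "pairs": pairs,
        "sample_instances": sample_inst,
        "test_instances": test_inst,
        "test_order": test_order,
        "cued": cued,
        "answer": partner[cued],  # the object the subject should point to
    }

trial = make_trial(random.Random(0))
print(trial["cued"], "->", trial["answer"])
```

The correct response depends only on the category pairing, not on the instances or positions, which is exactly what makes the link one between semantic items.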
If this is indeed the case, we may say that showing a sample image with two pairs of objects in it has resulted in the establishment of two of these “fast links” in the subject’s working memory. In this example, the fast links are associative links between semantic items. In the first paragraph, however, I defined a fast link more generally as a “specific idiosyncratic short-lived functional link.” The link is functional in the sense that it causes representations to affect each other. In the present case the link connects high-level representations, and it also communicates bidirectionally with low-level representations. The link is short-lived since it is probably all but erased by a few more trials involving other objects. It can nevertheless be transferred to long-term memory, for instance by asking the subject to memorize the last trial in the session. The link is idiosyncratic in the sense that it never occurred before, and is created on the fly, to satisfy the needs of the task at hand. Finally, the link is specific since it is one out of 49,995,000 possible links, the number of unordered pairs that can be formed from 10,000 categories.
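The figure of 49,995,000 is simply the number of unordered pairs of distinct items among 10,000, that is 10,000 × 9,999 / 2. A short check, with variable names chosen here for illustration:

```python
from math import comb

N_CATEGORIES = 10_000  # number of basic categories assumed in the text

# Number of unordered pairs of distinct categories: C(10000, 2)
n_possible_links = comb(N_CATEGORIES, 2)
print(n_possible_links)  # 49995000
```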
From this thought experiment I wish to conclude that our brains have the ability to compute with fast links between arbitrary pairs of items—so far we are talking about a very simple kind of computation but things will get more complicated below. The pairs are arbitrary in the sense that nothing, prior to the presentation of randomly constructed images, predisposed the two items in a pair to become linked. It is likely that the brain uses some sort of “common currency” to establish and compute with such links. This could plausibly be a form of “von der Malsburg synapse,” or synchrony of firing—on a time resolution to be specified—or a combination thereof.
But do we actually have a clear picture of how the items to be linked are themselves encoded in neural hardware? Personally I’m not sure we do, and I presume that participants in the workshop may differ on this issue. Surely there is plenty of evidence for rate-coded representations of objects, e.g. in IT. Yet it seems to me that in most studies involving high-level (“invariant”) representations the stimuli used are either part of a small overlearned class (e.g. Miyashita and Chang 1988), or have high salience, such as faces (Rolls 1992). I’m not sure whether this is true of Edmund’s more recent data as well. I take it Ed would argue that it is not unreasonable to assume that each of the 10,000 or so basic objects (semantic items) that we carry in our minds is rate-encoded in a specific cell population, although current electrophysiological methods may not allow us to find cells that code for an object that is neither overlearned nor particularly salient.
I would have no serious objection to such a proposal. And yet, even if at an appropriate abstraction level rates were all there was to the representation of an object, it would be hard for me to conceive how rates alone could (1) achieve this highly invariant representation in the first place, and (2) make it useful in various types of computation. To rephrase these two points, I would argue that representations need to be, respectively, content-sensitive and context-sensitive. I will now elaborate on the first point, using a long (very long) digression.
Let me start with a simple observation from computer vision: the interpretation of unrestricted natural images is a daunting computational task for which no solution exists at the present time. There are several sources of this difficulty. One is the ubiquitous presence of ambiguities. Another is that objects in natural scenes present themselves in an immense variety of poses, illuminations, states of contiguity and occlusion. An approach that is often used to try to overcome these difficulties is to use a learning algorithm, to construct knowledge about objects “from scratch.” In statistical terms, a “machine” that uses such a learning algorithm is called a non-parametric, or model-free, estimator. The hope is that one could show the machine enough instances for it to subsequently recognize arbitrary unseen instances. Although some success along these lines has been achieved in limited environments, non-parametric methods do not come close to being able to deal with natural scenes, in which the variability of the presentation of instances, including not only pose but also context, is immense. From a theoretical perspective, the problem can be traced to the “curse of dimensionality.” This means that the number of parameters that one tries to estimate (roughly of the order of the size of the image) is so large that the number of examples required is prohibitive.
In relation to this issue, Stuart Geman and I (Geman, Bienenstock and Doursat 1992) have argued that the deep problem in computational neuroscience is not the learning problem per se but rather the identification of the a priori structure (what is called a bias in statistical estimation theory) that makes it possible for real brains to perform effectively in enormously high-dimensional spaces with so remarkably few training examples. Vision and other cognitive tasks demand highly structured and parametric models, and we believe that a lot of insight into the nature of these models will be needed before we can understand the mechanisms used by the brain to achieve invariant responses of the type described for instance by Edmund Rolls in IT.
Stuart Geman, Dan Potter and I have been working for a few years now on what we call a compositional model for vision, to try and address just this issue (Bienenstock, Geman and Potter 1997; Geman, Potter and Chi 1998). Formally, our proposal combines models used in computational linguistics (unification grammars) with Bayesian probability and the information-theoretic notion of “Minimum Description Length” (MDL). The gist of this proposal is that the representation of an object such as a hat is a complex composition, which involves a large collection of fast links between object parts that sit at different levels of a hierarchy. These parts, which come in relatively small numbers, are familiar and reutilized in other compositions. Examples of parts are small line elements, extended lines, curves, angles, T-junctions, surface patches, parts of objects, and entire objects. Each one of these comes with attributes, such as position, orientation, size, thickness, color, texture, etc. There are countless ways to instantiate a generic object—e.g. the generic object hat—by composing parts with each other in appropriate arrangements. As a result, the representation of a hat ends up being generic and idiosyncratic. What makes it idiosyncratic is the way the parts are combined with each other, by fast links. These fast links are “vertical”: they connect items in different levels of the hierarchy. What makes the composition generic is the use of a given high-level item somewhere near the top levels of the hierarchy.
A compositional structure such as just outlined is a highly structured model, in which each composition rule is specified by a small number of parameters. These parameters are easy to learn. For instance, when learning the rule that governs the composition of two straight lines into a right angle, one estimates essentially two parameters: the allowed range for the distance between the endpoints of the lines, and the allowed range for the angle between them. Hopefully then, the curse of dimensionality has been broken, by dividing the high-dimensional estimation problem into smaller independent problems.
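As a rough illustration of how few parameters such a rule involves, the sketch below encodes the right-angle example with exactly two learned quantities: a maximal endpoint gap and a tolerance around 90 degrees. The class names `Segment` and `RightAngleRule`, the choice of learning each range as the largest value seen in positive examples, and the example coordinates are all assumptions made here; this is not the formulation used in the compositional model itself.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    x0: float
    y0: float
    x1: float
    y1: float

    def endpoints(self):
        return [(self.x0, self.y0), (self.x1, self.y1)]

    def orientation(self):
        # Undirected orientation of the line, in [0, pi).
        return math.atan2(self.y1 - self.y0, self.x1 - self.x0) % math.pi

def endpoint_gap(a, b):
    """Smallest distance between an endpoint of a and an endpoint of b."""
    return min(math.dist(p, q) for p in a.endpoints() for q in b.endpoints())

def angle_between(a, b):
    """Angle between the two lines, in [0, pi/2]."""
    diff = abs(a.orientation() - b.orientation())
    return min(diff, math.pi - diff)

@dataclass(frozen=True)
class RightAngleRule:
    max_gap: float          # allowed range for the endpoint distance: [0, max_gap]
    angle_tolerance: float  # allowed range for the angle: pi/2 +/- tolerance

    @classmethod
    def fit(cls, examples):
        # Hypothetical learning step: take the widest values seen in positive examples.
        gaps = [endpoint_gap(a, b) for a, b in examples]
        devs = [abs(angle_between(a, b) - math.pi / 2) for a, b in examples]
        return cls(max_gap=max(gaps), angle_tolerance=max(devs))

    def accepts(self, a, b):
        return (endpoint_gap(a, b) <= self.max_gap
                and abs(angle_between(a, b) - math.pi / 2) <= self.angle_tolerance)

examples = [
    (Segment(0, 0, 1, 0), Segment(1.05, 0, 1.05, 1)),
    (Segment(0, 0, 0, 1), Segment(0, 1.1, 1, 1.2)),
]
rule = RightAngleRule.fit(examples)
corner_ok = rule.accepts(Segment(2, 2, 3, 2), Segment(3, 2.02, 3, 3))
parallel_ok = rule.accepts(Segment(0, 0, 1, 0), Segment(1.01, 0, 2, 0))
print(rule, corner_ok, parallel_ok)
```

Two numbers estimated from a handful of examples suffice here, in contrast with a model-free estimator that would have to learn the notion of a right angle from raw pixels.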
The mode of computation in such a “composition machine” is, very roughly, as follows. Upon presentation of an image, the machine seeks to interpret it by constructing a hierarchical composition that is legitimate in that it agrees both with the data and with the composition rules learned. In general, there are many legitimate interpretations. In particular, there is always the trivial interpretation, where nothing at all gets composed. Consider for instance an image that contains five adjacent, aligned, black dots. The trivial interpretation of this image is that this alignment of five dots occurred purely by chance. An alternative interpretation is that it occurred as an instance of the generic symbol “line element.” Associated with each of the two interpretations (trivial or line element) is a probability. If the model is right, and the composition rule for the symbol “line element” has been learned correctly, the probability of the line-element interpretation will be much larger than that of the trivial interpretation. From an information-theoretic viewpoint, the line-element interpretation is better in the sense that it allows a more succinct description (smaller number of bits). In general, the model contains a reasonably small repertoire of symbols, sitting at different hierarchical levels, with their composition rules. These composition rules are partly recursive, as they define symbols in terms of simpler symbols. The model seeks an interpretation that is globally optimal, in the Bayesian (equivalently MDL) sense just outlined. This problem raises non-trivial computational issues. Interestingly, it fits rather well with the suggestions made by Wolfgang in his introduction to the Coding and Computation session of the workshop.
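The five-dot example can be turned into a back-of-the-envelope description-length comparison. The coding scheme below is purely an assumption made for illustration (a 64 × 64 pixel grid, 8 possible orientations, 4 possible dot spacings, uniform codes throughout); it is not the prior used in the compositional model. Under the usual identification of code length with negative log2 probability, the difference in bits gives the log2 odds in favor of the shorter description.

```python
import math

def bits_trivial(n_dots, grid_size):
    """Code each dot independently as a pixel position on a grid_size x grid_size image."""
    return n_dots * math.log2(grid_size ** 2)

def bits_line_element(grid_size, n_orientations=8, n_spacings=4):
    """Code one anchor position, an orientation and a spacing; the other dots follow."""
    return (math.log2(grid_size ** 2)
            + math.log2(n_orientations)
            + math.log2(n_spacings))

grid = 64                          # assumed image size
trivial = bits_trivial(5, grid)    # five independent positions
line = bits_line_element(grid)     # one "line element" symbol
saving = trivial - line            # log2 odds in favor of "line element"

print(trivial, line, saving)       # 60.0 17.0 43.0
```

With these assumed numbers the line-element interpretation saves 43 bits, i.e. it is favored by a factor of about 2^43 over the chance alignment.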
Let me now return to the issue of whether invariant representations—or representations of generic objects—can be achieved by rates alone. The proposal I outlined is just that, a proposal, and as such it doesn’t prove anything. However, to the extent that one finds the proposal convincing—it does show promise as a computer-vision approach—one may also find it suggestive of the need to activate large idiosyncratic configurations of fast links in each and every computation. Even if at the “generic-hat” level the neural representation is essentially a rate code, it is hard to see how one could dispense with a more complicated medium to achieve the recognition of a hat, if indeed recognition entails the instantiation of “generic hat” into one particular object.
Finally, I would contend that representations need to be context-sensitive as well. This, again, would appear to require extensive use of fast links. To illustrate, a hat will typically occur in a specific context: it would sit on somebody’s head, or be used to protect that head from the weather, or make some form of statement, social, religious, sports-related or other, or be displayed in a store window, etc. Associated with each of these possibilities is an endless range of possible computations, which involve: (1) the specific instance of the hat; (2) any number of specific items in the same scene; (3) any number of specific relationships, spatial and otherwise, between these items. Such computations will typically straddle a broad range of abstraction levels, from the lowest—position, color, size, style of object—to the highest—semantic category of object—and will require the use of fast links between the neural representations at these different levels.
I would not want to venture too far into speculations as to the nature of such fast links. As mentioned, von der Malsburg synapses are a natural candidate. Some manifestation in neural activity should also exist. This could include synchrony of firing, or more general accurately-timed events. These could for instance be synfire-type events, but could also be considerably more general. Thus, if we assume that two items are represented in the brain by two complex spatio-temporal patterns defined by respective probability distributions, say on two disjoint cell populations, one may want to look for the signature of a fast link between these two items in any departure from conditional independence between the two populations. Statistical methods can be devised to test for such a departure, i.e. reject the “null hypothesis” which states that the processes are independent (Date, Bienenstock and Geman, 1998).
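To give a flavor of what testing for such a departure can look like, here is a minimal sketch of a generic trial-shuffle permutation test on coincident spikes between two cells. It is not the method of Date, Bienenstock and Geman (1998); the statistic, the shuffling scheme, the simulated data and all function names are assumptions made for illustration. Shuffling the trial order of one cell preserves each cell's stimulus-locked firing profile while destroying any trial-by-trial link between the two.

```python
import random

def coincidences(a_trials, b_trials):
    """Count time bins in which both cells fire, summed over paired trials."""
    return sum(x & y for a, b in zip(a_trials, b_trials) for x, y in zip(a, b))

def trial_shuffle_p_value(a_trials, b_trials, n_perm=500, seed=0):
    """One-sided permutation p-value for the null of independence across trials."""
    rng = random.Random(seed)
    observed = coincidences(a_trials, b_trials)
    shuffled = list(b_trials)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # re-pair trials of cell b with trials of cell a
        if coincidences(a_trials, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)

def simulate(n_trials=40, n_bins=100, rate=0.1, link=0.5, seed=1):
    """Toy data: cell b copies each spike of cell a with probability `link`,
    and also fires independently at a low background rate."""
    rng = random.Random(seed)
    a = [[int(rng.random() < rate) for _ in range(n_bins)] for _ in range(n_trials)]
    b = [[int((x and rng.random() < link) or rng.random() < rate / 2) for x in row]
         for row in a]
    return a, b

a_trials, b_trials = simulate()
p_linked = trial_shuffle_p_value(a_trials, b_trials)
print(p_linked)
```

A small p-value rejects the null hypothesis that the two spike trains are independent given the trial structure; the time resolution at which coincidences are counted (here one bin) is itself a parameter one would want to vary.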
References
Bienenstock, E., Geman, S., and Potter, D. (1997) Compositionality, MDL Priors, and Object Recognition. In: Advances in Neural Information Processing Systems 9, M.C. Mozer, M.I. Jordan, and T. Petsche eds, MIT Press, pp 838-844.
Date, A., Bienenstock, E., and Geman, S. (1998) On the Temporal Resolution of Neural Activity. Technical report, Division of Applied Mathematics, Brown University.
Geman, S., Bienenstock, E., and Doursat, R. (1992) Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4: 1-58.
Geman, S., Potter, D., and Chi, Z. (1998) Composition systems. Technical report, Division of Applied Mathematics, Brown University.
Miyashita, Y., and Chang, H.S. (1988) Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331: 68-70.
Rolls, E.T. (1992) Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Philosophical Transactions of the Royal Society, 335: 11-21.