Our first work in this area concentrated on specific algorithms for specific object recognition problems. In particular, we have developed an approach based on a context-free grammar and nonparametric "noise'' models for recognizing mines in shallow water optical imagery (sponsored by ONR and the Naval Surface Warfare Center). We have also developed a robust algorithm for reading bar codes in visible-light images; this algorithm is presently being used for wafer tracking in the semiconductor manufacturing industry.
Current foundational work is much broader, not concentrating on specific algorithms for specific applications, yet building on what we have learned from the early algorithm development efforts. The current effort is referred to as the compositional approach to vision. Compositionality refers to the hierarchical structure of conceptual units by which entities at one level are grouped into larger structures at higher levels.
We propose to develop a general framework for machine vision based upon the idea of composition. In this framework, more-or-less local entities are recursively composed, subject to syntactic restrictions, to form objects and object groupings. The restricted composition rules amount to an explicit formulation of contextual constraints.
One approach to evaluating groupings is with description length, wherein both the raw data (e.g. the pixels) and the part-whole relationships (the hierarchy) are described in a binary code. Through entropy, information theory relates description length to probability, whereby higher probabilities are assigned shorter codes. Therefore, minimum description length (MDL) recognition is Bayesian inference under a prior distribution that favors succinct descriptions. Oftentimes the MDL formulation offers a compelling evaluation of groupings that is much less evident in the more traditional probabilistic/Bayesian viewpoint.
The actual recognition engine generates multiple compositional structures, corresponding to multiple scene interpretations, which are later resolved by appealing to the minimum-description-length (maximum aposteriori probability) principle. The essence of the approach is that ambiguities are propagated "upwards" while the detection of higher order structures is propagated "downwards" until a best interpretation emerges. For instance, in object recognition the correct grouping of pixels into an object may be hard to identify bottom-up. But once a suitable object model is found, it strongly supports all the lower level groupings of edges, curves, etc. needed to delineate its shape. Run to its conclusion, the output is a hierarchical interpretation of all of the pixels and pixel groupings in a scene.
Composition rules can be formulated at multiple scales, and this leads
to compositional descriptions at multiple resolutions. Coarse descriptions
are computationally easier, since they require fewer groupings. Finer
descriptions are more accurate and actually require shorter descriptions,
since the raw data is better accommodated. Recognition is then a
process of building an ever more detailed compositional description, and
can be stopped and queried at any stage during the construction.