Overview and Theoretical Considerations

The visual systems in primate and avian species appear to have undergone widely different evolutionary histories (see Husband & Shimizu, 2001). As a result, the anatomical layout of avian and human visual systems is strikingly different. There do, however, appear to be analogous pathways, operating in parallel. Moreover, pigeons and primates have evolved with many of the same pressures: objects must be recognized from different vantage points, different distances, when they are partially occluded by other objects, etc. So, it is possible that common selective pressures may have yielded visual systems in which there are analogous mechanisms underlying visual functioning.

The study of avian visual cognition can provide insight into the extent to which the structurally different avian and primate visual systems may function similarly or differently. The set of studies in the present chapter on "Object Recognition" examined the attributes of objects that contribute to the perception and recognition of familiar images. The particular training objects and manipulations of object features were chosen because they are similar to those used in the study of object recognition in humans by Biederman and colleagues. However, the training methods employed in the human and pigeon studies are quite different, a factor which may contribute to any observed differences in performance between the two species.

Geon Structural Descriptions. Pigeons demonstrated a sensitivity to the spatial organization of object components. When the geons were spatially displaced from their normal position, recognition accuracy suffered. Moreover, pigeons were able to discriminate between different spatial arrangements of the same geons. Finally, the pigeons demonstrated generalization to pseudo-objects that contained the same spatial organization as a reinforced pseudo-object (see section on relative importance of geons and interrelations). The geons themselves were also seen to control behavior. First, there was above-chance performance to scrambled versions of objects, which must have been due to the presence of local features other than the vertices between components. The only local features that were available were the geons and their component pieces of contour. Second, the principle of three-geon sufficiency was observed: the deletion of a single geon had no effect on recognition, but the deletion of three geons had a large detrimental effect (see section on importance of components). Third, pigeons demonstrated substantial generalization to pseudo-objects containing the same geons as a reinforced pseudo-object (see section on relative importance of geons and interrelations). These combined results indicate that pigeons, too, may store geon structural descriptions of objects. The similarities in primal object recognition between pigeons and people encourage the use of a single theoretical account that may apply to many species of animal.

Invariant Object Recognition. In the study of invariances in object recognition in pigeons, there was significant transfer to stimuli that were rotated in depth, moved, or altered in size (see section on invariance). However, only translational invariance was complete. Both rotation in depth and changes in size produced systematic decrements in accuracy: as the degree of rotation or change in size of the training object was increased, greater decrements in performance were observed. In human subjects, these kinds of systematic decrements have been observed in studies on rotational and size invariance, but usually the costs are in reaction time, not recognition accuracy (e.g., Bartram, 1974; Hayward & Tarr, 1997; Jolicoeur, 1987; Shepard & Metzler, 1972; Tarr et al., 1998). In other studies, changes in rotation or size produced no effect on recognition accuracy or speed (e.g., Biederman & Bar, 1999; Biederman & Cooper, 1992; Biederman & Gerhardstein, 1993; Fiser & Biederman, 1995).

Thus far, the only operation that has produced complete invariance in the pigeon is training with multiple examples along the to-be-manipulated dimension. In the work on rotational invariance, training with widely spaced views results in a flat generalization gradient. Complete translational invariance was also only observed when multiple training locations were used (but the number and particular set of training locations have not been systematically studied). We did not conduct investigations with multiple training sizes, so we do not know whether complete size invariance can be observed.

In sum, it is difficult to determine the degree of similarity between pigeons and humans in their sensitivity to variations in the size, location, or angle of view because (1) humans sometimes act like pigeons, demonstrating a systematic generalization decrement, but at other times demonstrate complete invariance, and (2) rotation in depth or change in size most commonly affects reaction times in humans, but affects recognition accuracy in pigeons -- this difference may indicate different underlying mechanisms (see below). Further work may determine the conditions that affect the degree of invariance that is observed. Important considerations are likely the features of the stimuli in the training set (see, for example, Biederman & Gerhardstein, 1993) and the experimental procedure that is employed in training.

Many of the experiments presented in this chapter were designed to contrast Recognition-by-components (RBC; Biederman, 1987), a model which incorporates both features and their organization, with Particulate feature theory (Cerella, 1986), a feature-only model. Therefore, the chapter was organized around contrasting these two competing views. From the experiments on the importance of spatial organization alone, it is apparent that a feature-only account is insufficient to explain the mechanism of object recognition in the pigeon. The remainder of the discussion will therefore focus on evaluating RBC and contrasting it with template models of object recognition.

Recognition-by-components. A number of the present results are consistent with RBC. First, the dual importance of object components and their spatial organization is consistent with the notion that object representations are stored as geon structural descriptions, which specify the object components and their spatial interrelations. Second, the spatial interrelations appeared to be more salient than component shape. RBC contends that there are a small number of primitives (about 30) that make up all objects in the environment, much like the small number of phonemes that make up human language. These primitive components can be combined to make up hundreds of thousands of objects through the use of variations in the particular sets of components and their organizational rules. Therefore, two objects may contain the same geons, but in different interrelations, thereby producing two discriminably different images (see cup-pail example). The organizational specifications of objects therefore can provide information above and beyond the specification of the constituent components. Third, there was considerable generalization to objects that appeared in novel locations, with altered viewpoints, or were changed in size. RBC predicts that object recognition should be invariant across these kinds of manipulations, provided that the same structural description can be extracted from the original and modified objects.
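The idea that a structural description specifies both the components and their interrelations can be made concrete with a small sketch. The Python code below is only an illustration, not an implementation of RBC: the class name, the geon labels, and the relation labels ("side-attached", "top-attached") are hypothetical names chosen for the cup-pail example, and real geon structural descriptions carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralDescription:
    """Toy stand-in for a geon structural description (hypothetical format)."""
    geons: frozenset      # labels of the component shapes
    relations: frozenset  # (geon_a, relation, geon_b) triples


def same_geons(a, b):
    """True if two descriptions contain the same set of components."""
    return a.geons == b.geons


def same_description(a, b):
    """True only if both the components and their interrelations match."""
    return a.geons == b.geons and a.relations == b.relations


# Illustrative labels only: both objects are built from the same two components.
PARTS = frozenset({"cylinder", "curved_handle"})

cup = StructuralDescription(
    geons=PARTS,
    relations=frozenset({("curved_handle", "side-attached", "cylinder")}),
)
pail = StructuralDescription(
    geons=PARTS,
    relations=frozenset({("curved_handle", "top-attached", "cylinder")}),
)

if __name__ == "__main__":
    print("same geons:", same_geons(cup, pail))
    print("same description:", same_description(cup, pail))
```

In this sketch a feature-only comparison (`same_geons`) cannot tell the two objects apart, whereas a comparison that includes the interrelations (`same_description`) can, which is the sense in which the organizational specification adds information beyond the components.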

However promising RBC may seem in explaining a wide range of phenomena in pigeon visual perception, it does appear to suffer from some limitations. In particular, RBC has difficulty in dealing with the systematic deficit in accuracy that occurred when an object was rotated in depth away from the training viewpoint. Biederman and Gerhardstein (1993) argued that if the operation of rotation alters the geon structural description, then complete rotational invariance will not occur. The testing viewpoints did involve changes in the geon structural descriptions, so these changes may have produced the observed generalization decrement. Accordingly, training with multiple views may have resulted in a more complete structural description, thereby allowing for the complete generalization that occurred in the second experiment.

A recent study by Peissig et al. (2000), however, disclosed similar results using single geons rotated in depth. They reported that training with only a single viewpoint of an individual geon resulted in a generalization gradient, with systematic fall-offs in performance on either side of the training orientation. These results are problematic for RBC because the images were geons. By definition, geons are supposed to be detectable relatively independently of viewpoint. The geon structural description of Peissig et al.'s images should not have changed with rotation.

An equally problematic finding was the systematic decrement in accuracy that was observed when object size was manipulated, relative to the training size. Here, it is difficult to see how RBC could account for the lack of true size invariance. None of the geons or their spatial interrelations were altered, so there should have been complete generalization across all sizes.

Because true rotational and size invariances are not always observed in human subjects, it is possible that these limitations of RBC are not restricted to pigeons. What needs to be determined are the factors that affect the degree of rotational and size invariance that occurs, in both species.

Template models. Template models propose that metric specifications of an object such as its viewpoint (Edelman & Bülthoff, 1992; Tarr & Bülthoff, 1995; Tarr & Pinker, 1989; Ullman, 1989) or size (Kosslyn, 1987; Ullman, 1989) are stored in memory along with specifications of the object's shape. (See Heinemann & Chase, 1990, and the chapters by Don Blough (2001) and Chase & Heinemann (2001) for further descriptions.) When the object is later encountered with new metric specifications, the recognition process engages in transformations of the current object in order to find a match to a stored representation. When a new viewpoint is encountered, the transformation is mental rotation; when a new size is encountered, the transformation is mental zooming.

Template models would predict costs in time or accuracy in recognizing objects when metric specifications have changed. Because larger changes in size or angle of view produce the need for more extensive transformations, template models predict the systematic decrement in recognition accuracy that was observed in the size and rotational invariance studies. Moreover, template models that allow for the storage of multiple views of frequently encountered objects (e.g., Tarr & Pinker, 1989) would also predict better performance to novel views when training occurred with multiple viewpoints.
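The qualitative prediction of a multiple-views template account can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any published model: the linear cost per degree, the peak and chance accuracy values, and the function names are all arbitrary, hypothetical choices made for illustration. The only property taken from the text is that predicted accuracy falls as the required transformation (here, the angular distance to the nearest stored view) grows.

```python
def angular_distance(a, b):
    """Smallest rotation in degrees (0 to 180) between two viewing angles."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def predicted_accuracy(test_view, stored_views,
                       peak=0.9, chance=0.25, cost_per_degree=0.005):
    """Toy prediction: accuracy drops linearly with the rotation needed to
    align the test view with the nearest stored view, never below chance.

    peak, chance, and cost_per_degree are arbitrary illustrative values.
    """
    nearest = min(angular_distance(test_view, v) for v in stored_views)
    return max(chance, peak - cost_per_degree * nearest)


if __name__ == "__main__":
    single_view = [0]             # training with one viewpoint
    multiple_views = [0, 120, 240]  # training with widely spaced viewpoints
    for test in (0, 30, 60, 100):
        print(test,
              round(predicted_accuracy(test, single_view), 3),
              round(predicted_accuracy(test, multiple_views), 3))
```

With a single stored view the sketch yields a gradient that falls off on either side of the training orientation; with several widely spaced stored views every test view lies closer to some stored view, so the gradient is shallower. Whether the gradient becomes fully flat depends on the spacing of the stored views and on the assumed cost function.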

Although template models may provide a better description of the rotational and size invariance studies than RBC, they have difficulty explaining the remainder of the results. These models have not been extended to deal with the importance of components and their spatial organization to object recognition. It is not even clear how these models would attack this problem. Template models assume that an integrated representation of an object is stored. Therefore, any transformations that may occur operate at the level of the entire object. A reorganized object would therefore be treated as an entirely new object. This did not appear to be the case in our pigeon data (or in humans). The pigeons responded to scrambled versions significantly above chance, indicating that they must have recognized the fact that the scrambled versions contained the same components as the original, but they also recognized the fact that the parts were not in the proper spatial concatenation. Template models also have difficulty in explaining the fact that pigeons can generalize to a set of complementary contours after having been trained with the opposing half of the contours (Van Hamme, Wasserman, & Biederman, 1992 -- see section on importance of components for a further description). Here, there are no shared features to mediate generalization between the original and complementary set of contours, so there would be no basis for template matching to occur.

Biederman and colleagues (e.g., Biederman & Cooper, 1992; Biederman & Gerhardstein, 1993) have argued that metric properties of objects such as size, location, and viewpoint are processed separately from shape attributes. In macaques, ablation of the inferior temporal cortex results in gross impairments in shape discriminations, whereas ablation of the posterior parietal region impairs the use of spatial cues in targeting the location of an object (Mishkin & Appenzeller, 1987; Ungerleider & Mishkin, 1982). Biederman and Cooper (1992) contended that the dorsal system may encode metric attributes such as rotation in depth and size in addition to spatial position:

When we pick up a cup by its handle, our motor movements are exquisitely tuned to the cup's position, size, and the orientation of the handle in depth. Thus, in a single skilled movement, we may reach to the right in the direction of the cup, simultaneously bend our wrist if the handle is on the right side (keeping our wrist straight if the handle is directly in front) and making a bridge between our thumb and fingers just wide enough to accept the handle. None of the information critically important for this act appears to be required for speeded object recognition; conversely, the identity of the object need not affect how it is picked up. (p. 130).

Accordingly, Biederman and colleagues have argued that cases in which there were costs in recognizing rotated (e.g., Bartram, 1974; Hayward & Tarr, 1997; Shepard & Metzler, 1972; Tarr et al., 1998) or resized (e.g., Jolicoeur, 1987) stimuli may reflect the participation of the dorsal pathway in the recognition process, whereas cases in which rotational (e.g., Biederman & Bar, 1999; Biederman & Gerhardstein, 1993) or size (Biederman & Cooper, 1992; Fiser & Biederman, 1995) invariance is complete may reflect the sole operation of the ventral pathway. The determination of whether the ventral system alone or both systems participate may rely on the experimental procedure that is used. One interesting feature of the multiple systems approach is that it may allow for an understanding of the pattern of similarities and differences in object recognition by pigeons and humans. It is possible that shape recognition systems in humans and pigeons may operate similarly, but the system for determining metric properties may differ in its neural mechanisms and/or in the extent to which it participates in entry-level object identification.

At the present time, it appears that no current theory of visual object recognition is capable of accounting for all of the facts of pigeon (or human) visual perception. However, RBC provides a reasonably good qualitative description of many facts of object recognition. Given that the object recognition process in the human and the pigeon may be quite similar, it may be possible to employ a single theory of object recognition for both species.