Chapter 8 Human Object Recognition and Computational Models

Seoung, Yeji

The City University of New York – Hunter College

8.1 Introduction

The ability to identify visual information is important for the survival of humans and other animals. When you walk around downtown, you might identify buildings, traffic signs, an approaching car, or the faces of people on the street. We effortlessly recognize and classify visually presented objects and faces with high accuracy (Biederman, 1987), even though each object produces tremendous variation in appearance (Logothetis, & Sheinberg, 1996). This automatic process in the visual recognition system enables us to build conceptual representations, both by generalizing a novel object into an existing category and by identifying characteristics shared by different kinds of things (Grill-Spector et al., 2001). For example, when you see a golden retriever on the street, you can categorize it as a dog, distinguish it from a poodle, and notice the features the two share, such as having four legs and a tail. Although vision serves various functions (DiCarlo, & Cox, 2007), in the present review recognition refers to a task that includes both identification and categorization: identification, in which one recognizes a specific object or face among others, and categorization, in which one recognizes a dog among other object classes (Poggio, & Ullman, 2013).

Although computerized recognition systems do not completely duplicate human recognition performance, the study of such artificial models contributes to our understanding of the processes of the human visual recognition system (Pinto et al., 2008). A large body of literature has sought to determine whether computational approaches can yield a realistic theory of human and animal object recognition. Since more than 50 percent of the neocortex is devoted to visual processing (Felleman, & Van Essen, 1991), it is not surprising that this ability is difficult to emulate computationally. Early computational approaches focused primarily on recognizing three-dimensional (3D) objects, including artifacts (e.g., buildings, tables, and automobiles), animals, and human faces. The main problem for such methods is that objects are first projected onto the retina as two-dimensional images, even though we perceive them as 3D objects whose appearance varies with pose and lighting (Ullman, 1996). More recently developed computational models, especially those inspired by the brain, are able to recognize meaningful patterns in objects (e.g., Lazebnik, Schmid, & Ponce, 2006; Mutch, & Lowe, 2006; Wang, Zhang, & Fei-Fei, 2006; Zhang, Berg, & Malik, 2006).

A natural way to approach this general theme is first to review the basic capacities of the primate recognition system. After a brief description of some general principles of object recognition, this paper explores specific effects and phenomena reported in the object recognition literature. It then discusses whether each computational pattern classification theory can explain these phenomena. Finally, the paper examines a possible solution, motivated by the ventral visual stream of the brain, to the challenges that object recognition poses for modern computational models.

8.2 Human Object Recognition System

To adapt theories of human object recognition into computational pattern models, we first need to examine the specialized brain regions that are, or are not, activated when objects or faces are recognized (Johnson, 1980). This section reviews how the brain performs object recognition.

8.2.1 The homology of human and macaque visual systems

Object recognition seems easy for humans and other primates. Studies comparing macaques with humans have observed common patterns of activation to visual stimuli in the lateral occipital complex (LOC) (Orban, Van Essen, & Vanduffel, 2004) and in the inferotemporal cortex (IT) (Kriegeskorte et al., 2008). Compared with the macaque brain, we know little about how neurons in the human brain are connected to each other or how neurochemical responses unfold during object recognition (Clarke et al., 1999). However, Orban et al. (2004) found similar activity in humans and macaques in retinotopic visual regions, including early visual cortex (V1, V2, V3) and several mid-level visual areas (V4, MT, and V3A). Each of these areas transmits population-based visual information to other brain areas (Felleman, & Van Essen, 1991). Beyond retinotopic cortex, several visual areas contribute to object recognition in different ways: MT/MST is involved in identifying object motion (Watson et al., 1993; Tootell & Taylor, 1995); TEO, V4, and V8 are involved in color processing (Engel, Zhang, & Wandell, 1997; Hadjikhani, Liu, Dale, Cavanagh, & Tootell, 1998; Bartels & Zeki, 2000); and KO is activated by kinetic motion (Van Oostende, Sunaert, Van Hecke, Marchal, & Orban, 1997).

The lateral occipital complex (LOC), a non-retinotopic region, is also crucial for recognition (Grill-Spector et al., 1998; Tootell, Mendola, Hadjikhani, Liu, & Dale, 1998). The LOC includes several areas, such as the lateral bank of the fusiform gyrus (Grill-Spector et al., 1998; Tootell, Mendola, Hadjikhani, Liu, & Dale, 1998). The LOC responds to parts of objects as well as to whole objects (Grill-Spector et al., 1998). A similar capacity to recognize visual items is observed in the inferotemporal cortex (IT) (Gross, Rocha, & Bender, 1972; Ito, Tamura, Fujita, & Tanaka, 1995). Specifically, IT neuronal activity has been observed in at least six areas (Tsao et al., 2003; Tsao et al., 2008a; Ku et al., 2011), suggesting that IT may support face recognition as well as the processing of other visual stimuli (Tsao et al., 2008b). Overall, studies of the homology between human and macaque visual mechanisms emphasize the crucial role of the LOC and IT in object recognition.

8.2.2 Object-selective visual areas in the human brain

The advent of brain imaging techniques such as positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) has made it possible to explore the neurobiological basis of object recognition in humans. A large body of literature has revealed that object recognition is localized in specific areas, collectively called the ventral visual processing stream, which includes parts of the occipital and temporal lobes (Miyashita, 1993; Orban, 2008; Rolls, 2000). For example, several PET studies have provided evidence that ventral and temporal areas are strongly activated when subjects are shown faces and objects (Haxby, Grady, Ungerleider, & Horwitz, 1991; Kosslyn et al., 1994; Nobre, Allison, & McCarthy, 1994; Woldorff et al., 1997). In addition, these areas are activated even when subjects view object shapes passively (Corbetta, Miezin, Dobmeyer, Shulman, & Petersen, 1991; Haxby et al., 1994). Furthermore, an fMRI study found that the extent of activation in the LOC depends on the quality of the visual stimuli and on whether they convey a clear shape (Malach et al., 1995).

A variety of recognition deficits have been reported in patients with damage to the fusiform gyrus and the occipito-temporal junction (Farah, Hammond, Mehta, & Ratcliff, 1989; Damasio, Tranel, & Damasio, 1990; Goodale, Milner, Jakobson, & Carey, 1991; Feinberg, Schindler, Ochoa, Kwan, & Farah, 1994; Farah, Klein, & Levinson, 1995; Moscovitch, Winocur, & Behrmann, 1999). Studies using event-related potentials (ERPs) found that, compared with jumbled control images, a wide array of artifacts (e.g., furniture, buildings, and tools) elicited stronger activity in the LOC (McCarthy, Puce, Belger, & Allison, 1999; Allison, Puce, Spencer, & McCarthy, 1999). These studies also reported that responses to faces were concentrated in the middle and anterior fusiform gyrus (McCarthy, Puce, Belger, & Allison, 1999; Allison, Puce, Spencer, & McCarthy, 1999). In conclusion, the development of brain imaging techniques has revealed the significant role of the ventral visual stream, whose regions are associated with the recognition of objects.

8.2.3 Face-selective visual areas in the human brain

Evidence from a large body of studies suggests that the brain regions involved in face recognition differ from those involved in object recognition, although the two partially overlap (Behrmann et al., 1992; Caldara et al., 2003; Desimone, 1991; Tanaka, & Farah, 1993; Turk, & Pentland, 1991; Perrett et al., 1992). For example, patients with prosopagnosia have difficulty recognizing human faces because of damage to specific brain areas, particularly bilateral (Damasio et al., 1982; Gauthier et al., 1999) or unilateral right occipito-temporal lesions (De Renzi, 1986; Landis et al., 1986). Conversely, patients with agnosia have been reported to lose the ability to identify objects while their face recognition is preserved (Moscovitch et al., 1997). These findings suggest that the brain processes objects and faces independently.

In the same vein, fMRI studies report a significant role for occipito-temporal regions in face recognition (Kanwisher et al., 1991; McCarthy et al., 1997). Specifically, a region of the human fusiform gyrus, the fusiform face area (FFA), is exclusively dedicated to face processing (Kanwisher, McDermott, & Chun, 1997; Grill-Spector, Knouf, & Kanwisher, 2004). Neuronal evidence showed that activity in the FFA was stronger for face images than for non-face stimuli (Tsao et al., 2006). Taken together, the FFA is a crucial area for face recognition, which suggests that object and face recognition rely on distinct brain areas.

8.3 The behavioral phenomena of interest in object recognition

Understanding object recognition requires investigating how behavioral phenomena are linked to recognition systems (DiCarlo, Zoccolan, & Rust, 2012). This section discusses several phenomena observed in the object recognition literature and asks whether computational models can account for each effect.

8.4 The behavioral phenomenon: Other-race effect

Although we encounter thousands of people in our lifetime, we can easily recognize one face among many individuals. This automatic mechanism can be affected by the categories, including race, into which we classify people. According to Feingold (1914), we recognize faces of our own race (same-race faces, SR) better than faces of another race (other-race faces, OR); this is called the “other-race” effect (ORE). Over many years, a number of studies have supported this effect (Caldara, & Abdi, 2006; Furl et al., 2002; Goldstein, & Chance, 1985; O’Toole, & Peterson, 1996; O’Toole et al., 1994; Valentine, 1991). The phenomenon suggests that it may be hard to perceive the uniqueness or individuality of other-race faces (Furl et al., 2002).

Although the ORE is robustly and reliably observed in the psychological literature, its explanation remains controversial. Several hypotheses have been proposed (O’Toole et al., 1994), but some of them have little support (Brigham, 1986): that faces of some races are inherently more difficult to identify than others; that prejudicial attitudes impede recognition of OR faces; and that OR faces are processed more superficially than SR faces.

A fourth possibility, proposed by most studies, is that the amount of experience with a particular race influences the ORE (O’Toole et al., 1991; Furl et al., 2002). If we have had much more experience with SR faces than with OR faces, we will tend to recognize SR faces more easily (Caldara, & Abdi, 2006). In other words, the more experience we have with other-race faces, the better we recognize them. While some studies (e.g., Brigham, & Barkowitz, 1978; Lavrakas, Buri, & Mayzner, 1976; Malpass, & Kravitz, 1969; Ng, & Lindsay, 1994) failed to find convincing support for this hypothesis, at least one study (Brigham et al., 1982) found a small but significant effect of contact experience on the ORE. The underlying explanation of the ORE therefore remains unclear (Meissner, & Brigham, 2001), and further study is necessary.

8.4.1 Face-space model

The ORE also carries an advantage in classifying faces, even though it makes individual OR faces harder to recognize. For example, when subjects are asked to categorize faces by race, they classify other-race faces more quickly than own-race faces (Caldara et al., 2004; Valentine, & Endo, 1992). According to Valentine (1991), this other-race advantage can be explained by an exemplar model. In such computational models, face stimuli are placed in a multidimensional space, usually Euclidean (Valentine, 1991). The direction and distance of each face from the average face determine where it is located in the space (Valentine, 1991), and psychophysical data support this view (Leopold et al., 2001). Empirical evidence showed that typical faces lie closer to the origin than other faces (Burton, & Vokey, 1998). Each face is represented as a point in this multidimensional space, with OR faces densely clustered and SR faces broadly distributed (Caldara et al., 2004). Because of this high density, OR faces can be classified by race more quickly than SR faces (Caldara et al., 2004). The same dense clustering, however, makes it harder to discriminate among different OR exemplars, which produces the other-race effect (Caldara, & Abdi, 2006).
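
The geometry described above can be illustrated with a toy simulation. The following Python sketch is only an illustration under stated assumptions: the four dimensions, the Gaussian clusters, their centers and spreads, and the helper names (make_faces, mean_distance_to, mean_nearest_neighbor_distance) are all hypothetical and are not taken from Valentine (1991) or Caldara et al. (2004). It places SR faces broadly around the origin and OR faces in a tight cluster away from it, then measures how far each group lies from the average face and how crowded each group is.

```python
import math
import random

DIMS = 4  # assumed number of face-space dimensions (illustrative only)

def make_faces(rng, n, center, spread):
    """Draw n toy faces as Gaussian points scattered around `center`."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]

def mean_distance_to(point, faces):
    """Average Euclidean distance from every face to `point`."""
    return sum(math.dist(face, point) for face in faces) / len(faces)

def mean_nearest_neighbor_distance(faces):
    """Average distance from each face to its closest other face.
    Small values mean a dense cluster whose exemplars are easy to confuse."""
    total = 0.0
    for i, face in enumerate(faces):
        total += min(math.dist(face, other)
                     for j, other in enumerate(faces) if j != i)
    return total / len(faces)

rng = random.Random(0)
# Assumed layout: SR faces spread widely around the origin,
# OR faces packed tightly at a distance from it.
sr_faces = make_faces(rng, 50, [0.0] * DIMS, 1.0)
or_faces = make_faces(rng, 50, [3.0] + [0.0] * (DIMS - 1), 0.3)

# The "average face" is estimated from the experienced (SR) population.
average_face = [sum(f[d] for f in sr_faces) / len(sr_faces) for d in range(DIMS)]

sr_from_average = mean_distance_to(average_face, sr_faces)
or_from_average = mean_distance_to(average_face, or_faces)
sr_crowding = mean_nearest_neighbor_distance(sr_faces)
or_crowding = mean_nearest_neighbor_distance(or_faces)

print("mean distance from average face: SR", sr_from_average, "OR", or_from_average)
print("mean nearest-neighbor distance:  SR", sr_crowding, "OR", or_crowding)
```

In this toy layout the OR cluster lies far from the average face, which is consistent with fast classification by race, while its small nearest-neighbor distances show why individual OR exemplars would be harder to tell apart.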

Although the face-space model is useful for explaining the ORE, a significant drawback is that it does not explain how faces are encoded (Caldara, & Abdi, 2006). To address this problem, Burton and Vokey (1998) extracted statistical properties from real faces to specify the dimensions. This approach also fixes the weights of the dimensions by learning from face inputs (Caldara, & Abdi, 2006). Moreover, perceptual learning can play an important role in acquiring these dimensions (Caldara, & Abdi, 2006).

8.4.2 Perceptual learning theory

O’Toole et al. (1995) proposed that perceptual learning can play a significant role in understanding the mechanism of the ORE. This perceptual learning theory suggests that as face recognition ability develops, individuals acquire the skill of discriminating among individual faces by learning to use the relevant perceptual dimensions. O’Toole et al. (1996) found benefits of perceptual learning for the recognition of SR faces. Neuronal evidence also showed that, when a set of Caucasian faces was learned as the reference group, neural networks responded more strongly to new faces from the reference group (Caucasian) than to faces from the other-race group (Asian) (O’Toole et al., 1996). This result is consistent with the finding that a computational recognition model trained with Caucasian faces as the reference group recognized Caucasian faces more accurately than other-race faces (Furl et al., 2002).

There are at least two major similarities between the face-space model and perceptual learning theory (Caldara, & Abdi, 2006). To explain the ORE, both emphasize the importance of variation in experience and of the resulting internal representations (Caldara, & Abdi, 2006). Since both theories have advantages and disadvantages in explaining the ORE, Caldara and Abdi (2006) suggested a complementary approach using neural networks associated with conceptual representations.

8.4.3 Neural network evidence

Given the limitations of both the face-space model and perceptual learning theory, Caldara and Abdi (2006) improved an algorithm, motivated by neural network associations, that aimed to construct conceptual representations of faces in a multidimensional space and to clarify whether such models can account for the ORE.

Their simulations revealed that when SR faces were learned as the target group, their representations were broadly spread in the face space, while OR faces were densely clustered (Caldara, & Abdi, 2006). The neural network patterns were consistent with the face-space account of OR faces, suggesting that individuals with more experience of SR faces benefit from perceptual learning and therefore become more skilled at recognizing SR faces than OR faces (Caldara, & Abdi, 2006). Their findings are consistent with evidence that perceptual learning is a crucial factor in explaining the ORE (Furl et al., 2002; O’Toole et al., 1994; O’Toole et al., 1996). Thus, although perceptual learning improves the encoding of distinctions relevant to SR faces, it makes OR faces harder to recognize (Caldara, & Abdi, 2006). Overall, the neural network evidence supports the face-space and perceptual learning explanations of the ORE.

8.5 The behavioral phenomenon: Unfamiliar faces

We effortlessly recognize the faces of different people, but this ability differs significantly between familiar faces, which belong to our personal acquaintances, and unfamiliar faces (Bindemann, Avetisyan, & Rakow, 2012; Bruce et al., 2001; Hancock et al., 2000; Jenkins, & Burton, 2011; Johnston, & Edmonds, 2009). Recognition of familiar faces is strikingly stable and reliable even under poor visual conditions (e.g., poor illumination, low-quality images, and variable viewpoints), whereas recognition of unfamiliar faces is remarkably poor even under good viewing conditions (Bindemann et al., 2012; Bruce et al., 2001; Hancock et al., 2000). Behavioral (e.g., Bindemann, & Sandford, 2011) and psychophysical evidence (e.g., Haxby et al., 2001) supports this phenomenon.

8.5.1 Face-space model

The face-space model was reviewed in Section 8.4.1; here the focus is on how it accounts for unfamiliar faces. This well-known model seeks to reproduce human conceptual representations of faces. One way to derive the underlying dimensions is multidimensional scaling (MDS) of similarity judgments, inspired by exemplar models (Busey, 1998). For example, Busey (1998) derived six identifiable dimensions (e.g., age, facial hair, and hair color) from a set of images of bald men in order to replicate human face recognition performance. When the target faces are not bald, the important dimensions would be hair color and style (Hancock et al., 2000).

In addition, statistical analysis of a face set, such as principal component analysis (PCA), is commonly used in computational recognition tasks (Hancock et al., 2000). This approach represents a set of faces by a small number of global eigenvectors, which encode the major variations in the input set (Grudin, 2000). However, one serious drawback of PCA is that it performs poorly when the image set has large within-class variance (Kalocsai et al., 1998). Useful information is therefore likely to be neglected, because each face in the dataset can appear in several possible views (Hancock et al., 2000). Although PCA provides the dimensionality of the space, not all of its dimensions can be labeled and interpreted (Hancock et al., 2000). According to Dailey et al. (1999), MDS data used with a kernel density estimation model predict human face recognition better than a PCA-based model. Although efforts to apply the processes of human face recognition to computational models have produced systems that can recognize faces, fully reproducing human perception remains difficult (Hancock et al., 2000).
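
The idea that a few global eigenvectors capture the major variation in a face set can be sketched in a few lines. The example below is a minimal illustration, not the method of any cited study: the six-value toy "faces", the hidden direction of variation, and the function names are assumptions, and the leading eigenvector is found by power iteration on the covariance matrix (a simple stand-in for a full eigendecomposition) so that only the Python standard library is needed.

```python
import math
import random

def center_rows(rows):
    """Subtract the mean face from every face; return (mean, centered rows)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    return mean, [[r[k] - mean[k] for k in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_component(cov, iterations=200):
    """Leading eigenvector of `cov` by power iteration (unit length)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

rng = random.Random(1)
PIXELS = 6
# Assumed ground truth: faces differ mostly along one hidden direction.
true_direction = [0.6, 0.8, 0.0, 0.0, 0.0, 0.0]  # unit length
faces = []
for _ in range(100):
    amount = rng.gauss(0.0, 2.0)  # large variation along the hidden direction
    faces.append([5.0 + amount * true_direction[k] + rng.gauss(0.0, 0.1)
                  for k in range(PIXELS)])

mean_face, centered = center_rows(faces)
component = first_component(covariance(centered))

# Each face is now summarized by one number: its score on the first component.
scores = [sum(row[k] * component[k] for k in range(PIXELS)) for row in centered]
alignment = abs(sum(component[k] * true_direction[k] for k in range(PIXELS)))
print("alignment of first component with hidden direction:", alignment)
```

The sketch also hints at the drawback noted above: the component is recovered from the statistics alone, so nothing guarantees that it corresponds to a nameable facial property.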

8.6 How do we deal with the difficulties of computational models?

8.6.1 Core Recognition

Humans and animals can recognize objects extremely accurately and quickly, as empirical evidence shows. For example, humans can recognize a briefly presented image in as little as 350 ms (Rousselet, Fabre-Thorpe, & Thorpe, 2002; Thorpe, Fize, & Marlot, 1996), and monkeys can do so in 250 ms (Fabre-Thorpe, Richard, & Thorpe, 1998). Event-related potential (ERP) experiments found that the complex visual processing underlying object recognition is achieved within 150 ms (Thorpe et al., 1996). This ability of primates to perceive and classify visually presented objects quickly and accurately is referred to as “core recognition” (DiCarlo, & Cox, 2007).

8.6.2 Invariance problem

Because images that have undergone tremendous transformations are still perceived as the same identity (DiCarlo, Zoccolan, & Rust, 2012), one can recognize items that are projected two-dimensionally onto the retina (DiCarlo, & Cox, 2007). Perceptual constancy is an important mechanism that allows object recognition to proceed undisturbed by changes in lighting, size, and background (Grill-Spector et al., 2001). That is, the variability of the world and of the observer produces an enormous number of images of each object, all of which must be assigned to the same category (e.g., “dog”) (DiCarlo et al., 2012; Grill-Spector et al., 2001). Human abilities to perceive and classify objects are thus not impeded by enormous variability in position, scale, pose, illumination, and clutter (DiCarlo et al., 2012; Grill-Spector et al., 2001).

In addition, one can recognize each part of an object (e.g., the eyes and legs of a dog) as well as the object as a whole (e.g., a dog), and one can assign different exemplars to an existing category such as “cats,” “apartments,” or “bicycles” despite the differences among them, which is called intraclass variability (Grill-Spector et al., 2001). Although each object appears in a virtually infinite number of variants, the visual system treats all these different patterns as equivalent without confusing them with images of other possible objects. This insensitivity of human recognition to differences in visual appearance is referred to as ‘cue-invariance’ (Grill-Spector et al., 2001).

Although the invariance problem is an impediment to computational models that aim to fully reproduce human recognition, especially when items can appear in infinitely many variants, both empirical (Thorpe, Fize, & Marlot, 1996) and physiological (Hung, Kreiman, Poggio, & DiCarlo, 2005) findings suggest that the visual stream offers clues to how this problem is solved rapidly (DiCarlo, & Cox, 2007; DiCarlo et al., 2012; Grill-Spector, & Malach, 2001; Grill-Spector et al., 2001). For example, Grill-Spector et al. (1998) reported that object-selective areas, particularly the LOC, were activated when subjects viewed objects that varied in visual appearance. This finding implies that cue-invariance is present in our visual recognition system. Kourtzi and Kanwisher (2000) investigated the level of activity in the LOC. Their results revealed that LOC responses were stronger when subjects viewed whole grayscale objects than when they viewed line drawings. In addition, when pairs of identical images were presented, the levels of LOC activity were similar (Kourtzi and Kanwisher, 2000). Similar responses were found even when subjects passively viewed pairs of stimuli that showed the same kind of object in different forms (e.g., two different golden retrievers) (Kourtzi and Kanwisher, 2000). Moreover, Grill-Spector and Malach (2001) studied the invariance properties of the LOC using functional magnetic resonance adaptation (fMR-A). They reported that this area supports object recognition regardless of variation in object size and position (Grill-Spector, & Malach, 2001). Together, these results demonstrate cue-invariance in the visual object system and emphasize the crucial role of the LOC in object recognition.

In the same vein, fMRI studies revealed that IT sub-populations also exhibit cue-invariance (Majaj et al., 2012), and that this is not limited to object (i.e., non-face) recognition (Freiwald, & Tsao, 2010): similar processing is observed when individuals recognize the faces of other people (Freiwald, & Tsao, 2010). Notably, IT neuronal populations support human pattern recognition more strongly than neuronal populations in earlier visual areas (Freiwald, & Tsao, 2010; Hung et al., 2005; Rust, & DiCarlo, 2010). IT neuronal populations are readily activated even by intricate visual forms (Brincat, & Connor, 2004; Desimone et al., 1984; Perrett et al., 1982; Rust, & DiCarlo, 2010; Tanaka, 1996). Moreover, IT neurons respond across relatively minor changes to objects, such as changes in position and size (Brincat, & Connor, 2004; Ito et al., 1995; Li et al., 2009; Rust, & DiCarlo, 2010; Tovee et al., 1994), pose (Logothetis et al., 1994), lighting (Vogels, & Biederman, 2002), and clutter (Li et al., 2009; Missal et al., 1999; Zoccolan et al., 2005). Thus, the visual object representations constructed in IT neuronal populations are invariant to such changes (DiCarlo et al., 2012).

8.6.3 The explanation of IT neuronal populations on object recognition

Visual object information is processed hierarchically in the brain: it is first registered on the retina, then transmitted to the lateral geniculate nucleus (LGN) of the thalamus, and then passed to the occipital lobe, specifically from area V1 to V2 to V4 to IT (Felleman, & Van Essen, 1991). Traditionally, biologically inspired computational models have worked with 2D images, on the assumption that 2D visual information is transmitted from earlier stages (areas V2 and V4) to the final stage (IT) of the ventral visual stream (Anzai, Peng, & Van Essen, 2007; Gallant, Braun, & Van Essen, 1993; Pasupathy, & Connor, 2001). Although neurons at each stage are tuned to component-level shape, IT can acquire holistic shape tuning through learning (Baker, Behrmann, & Olson, 2002). Although a large body of literature has clarified the mechanisms of early visual areas such as V1 (Lennie, & Movshon, 2005), the mechanisms of the final stage, IT, are not clearly understood (Hung et al., 2005; Rust, & DiCarlo, 2010). However, several relatively recent studies have revealed that IT population activity supports clear and stable recognition of objects and faces across variation ranging from position to background (Hung et al., 2005; Rust, & DiCarlo, 2010). Moreover, neuronal analyses showed that IT neuronal populations were activated when subjects performed face recognition tasks (Freiwald, & Tsao, 2010), which confirms cue-invariance in our visual system (Freiwald, & Tsao, 2010). Thus, the mechanisms of IT neuronal activity can explain human invariant object recognition behavior (Majaj et al., 2012). These studies also demonstrate that explanations of object recognition based on IT populations account for our visual functions more clearly than those based on the early visual stream (Freiwald, & Tsao, 2010; Hung et al., 2005; Rust, & DiCarlo, 2010).

DiCarlo et al. (2012) summarized the neurophysiological evidence on IT neuronal populations as follows. First, IT populations decode and transfer visual representations within 50 ms. Second, this visual information becomes accessible for decoding less than 100 ms after an image is presented. Third, IT populations convert visual representations into a neuronal format that preserves the cue-invariance of objects (DiCarlo et al., 2012). Finally, these simple weighted summation codes are observed when subjects view objects without any prior training on the image set (Hung et al., 2005). Taken together, because they make visual representations decodable, IT neuronal populations at the final stage of the ventral visual stream can both inform computational pattern models (Pinto et al., 2010) and account for human object recognition behavior (Hung et al., 2005; Rust, & DiCarlo, 2010).
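
A "simple weighted summation code" means that object identity can be read out by summing the population's firing rates with fixed weights and comparing the sum with a threshold. The sketch below illustrates that idea on simulated data only. Everything in it is an assumption made for illustration: the 20 simulated neurons, their mean rates and noise level, and the readout rule (weights equal to the difference between the two class means), which is one simple linear classifier and not the decoder used in the cited studies.

```python
import random

N_NEURONS = 20   # assumed population size
NOISE_SD = 3.0   # assumed trial-to-trial variability (spikes/s)

def simulate_trial(rng, mean_rates):
    """One noisy population response to a single image presentation."""
    return [rng.gauss(m, NOISE_SD) for m in mean_rates]

def mean_vector(trials):
    """Average response of each neuron across trials."""
    n = len(trials)
    return [sum(t[k] for t in trials) / n for k in range(len(trials[0]))]

def fit_readout(trials_a, trials_b):
    """Linear readout: weights are the difference of the class means,
    and the threshold lies halfway between the two class means."""
    mean_a, mean_b = mean_vector(trials_a), mean_vector(trials_b)
    weights = [a - b for a, b in zip(mean_a, mean_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mean_a, mean_b)]
    threshold = sum(w * m for w, m in zip(weights, midpoint))
    return weights, threshold

def classify(weights, threshold, response):
    """Weighted sum of firing rates compared with the threshold."""
    total = sum(w * r for w, r in zip(weights, response))
    return "A" if total > threshold else "B"

rng = random.Random(2)
# Hypothetical mean firing rates of each neuron for object A and object B.
rates_a = [rng.uniform(5.0, 15.0) for _ in range(N_NEURONS)]
rates_b = [rng.uniform(5.0, 15.0) for _ in range(N_NEURONS)]

train_a = [simulate_trial(rng, rates_a) for _ in range(100)]
train_b = [simulate_trial(rng, rates_b) for _ in range(100)]
weights, threshold = fit_readout(train_a, train_b)

held_out = ([("A", simulate_trial(rng, rates_a)) for _ in range(200)] +
            [("B", simulate_trial(rng, rates_b)) for _ in range(200)])
accuracy = sum(classify(weights, threshold, response) == label
               for label, response in held_out) / len(held_out)
print("held-out readout accuracy:", accuracy)
```

No single simulated neuron separates the two objects reliably, but the weighted sum over the population does, which is the sense in which the population code is explicit.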

8.6.4 Shape similarity vs. semantic category information in IT neuronal populations

Efforts to understand human object recognition behavior have led to the development of several computational models. In particular, artificial systems based on IT neuronal population representations offer better explanations of object recognition performance. However, how visual information is represented and organized in the visual areas is still debated. There are two main hypotheses about visual representations: shape similarity and semantic category (Baldassi et al., 2013; Huth et al., 2012; Khaligh-Razavi, & Kriegeskorte, 2014). In the shape similarity view, IT can be considered a visual representation, because objects are segmented and clustered into visually similar groups; the visual features of objects are thus the crucial factor for visual representations in the brain. In the semantic category view, by contrast, semantic information is the significant criterion for organizing visual representations, and IT can be thought of as a visuo-semantic representation (Khaligh-Razavi, & Kriegeskorte, 2014).

Several studies claim that simple or intricate visual components are coded by IT neurons, which serve as visual representations based on shape similarity (Brincat, & Connor, 2004; Kayaert, Biederman, & Vogels, 2003; Kayaert, Biederman, Beeck, & Vogels, 2005; Yamane et al., 2008; Zoccolan et al., 2007; Zoccolan et al., 2005). For example, Yamane et al. (2008) reported that the representations of IT neuronal populations provide evidence for the important role of shape similarity in constructing visual neural signals. An fMRI study of monkey IT showed that the representations of animate objects overlapped with those of inanimate objects (Freedman et al., 2006). Importantly, in contrast to the prefrontal cortex, activity in IT neuronal populations did not differ significantly between visually similar objects and objects grouped by semantic category (e.g., cat-like versus dog-like stimuli) (Freedman et al., 2006). Overall, these studies support the shape similarity theory of visual representation, in which IT neurons play a significant role in representing object identity (Kourtzi, & Connor, 2011).

Although biological evidence has shown that IT neurons can be considered a visual representation, several recent studies argue that visual objects are coded in IT populations in terms of semantic category information (Huth et al., 2012). For example, visual representations cluster more strongly by semantic category, such as animals, inanimate objects, and faces, than by visual feature similarity (Huth et al., 2012). Although no general semantic area for visual representations has been found in the brain, individual areas provide evidence that visual representations are organized by semantically related categories (Connolly et al., 2012; Just et al., 2010; Konkle and Oliva, 2012; Kriegeskorte et al., 2008; Naselaris et al., 2009; O’Toole et al., 2005). In studies of monkey IT, visual representations were grouped by semantic category, and similar patterns were found in the human brain. In particular, the groups for inanimate objects were clearly segregated from those for animate objects (Connolly et al., 2012; Just et al., 2010; Konkle and Oliva, 2012; Kriegeskorte et al., 2008; Naselaris et al., 2009; O’Toole et al., 2005). Moreover, fMRI studies found that semantic-based metrics can explain the visual representations in monkey IT population patterns (Bell et al., 2009), which is consistent with results from several studies showing that semantically segregated visual representations account for human object recognition patterns (Downing, Jiang, Shuman, & Kanwisher, 2001; Kanwisher, 2010; Kanwisher, McDermott, & Chun, 1997; Mahon et al., 2007; Mahon, & Caramazza, 2009; Naselaris et al., 2009).

This semantic category theory can be tested by studies that construct object-defining visual features and contrast their explanatory power with that of semantic categories. However, a semantic category-based model cannot explain the functions of IT populations without a shape similarity-based model. Thus, visual representations in IT appear to be organized by visual similarity as well as by semantic category (Kriegeskorte et al., 2008; Connolly et al., 2012; Huth et al., 2012; Carlson et al., 2013). To reproduce representational metrics similar to those of IT, models appear to require semantic information, even for images with uncontrolled property variation (Cadieu et al., 2014; Yamins et al., 2014). In fact, an efficient way to account for the visual representations in IT is to assume that visual similarity does not conflict with the semantic features of objects, which suggests that the two are correlated (Khaligh-Razavi, & Kriegeskorte, 2014).
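
One way to make the contrast between the two accounts concrete is to measure how strongly a set of response patterns clusters under a given labeling. The following Python sketch is an illustration only: the response patterns are made-up numbers rather than recorded data, Euclidean distance is assumed as the dissimilarity measure, and `category_clustering_index` is a hypothetical helper, not a metric taken from the cited studies.

```python
from itertools import combinations
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length response patterns."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def category_clustering_index(patterns, labels):
    """Mean between-category minus mean within-category dissimilarity.

    A positive value means that patterns sharing a label are, on average,
    closer to each other than patterns with different labels.
    """
    within, between = [], []
    for i, j in combinations(range(len(patterns)), 2):
        d = euclidean(patterns[i], patterns[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


# Hypothetical population responses (4 stimuli x 4 units), invented for illustration.
patterns = [
    [1.0, 0.9, 0.1, 0.2],  # dog
    [0.9, 1.0, 0.2, 0.1],  # cat
    [0.1, 0.2, 1.0, 0.8],  # chair
    [0.2, 0.1, 0.9, 1.0],  # car
]
animacy = ["animate", "animate", "inanimate", "inanimate"]
shuffled = ["animate", "inanimate", "animate", "inanimate"]  # control labeling

cci_animacy = category_clustering_index(patterns, animacy)
cci_shuffled = category_clustering_index(patterns, shuffled)
print(f"animacy labels:  {cci_animacy:.3f}")
print(f"shuffled labels: {cci_shuffled:.3f}")
```

A positive index for the animacy labels together with a lower index for the shuffled labels would indicate that these toy patterns are organized by the animate/inanimate distinction. The same function could be applied with shape-based labels to compare the two accounts on the same patterns.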

8.6.5 Computational models accounting for the IT representation

Computational frameworks have been developed that partly reproduce the similarity patterns observed in primate IT cortex (Khaligh-Razavi, & Kriegeskorte, 2014). However, artificial models cannot yet fully match the object recognition performance of humans. The development of computational pattern recognition models has made it possible to test realistic theories and provides an effective measure for explaining primate visual object recognition (Pinto et al., 2008). The question arises whether current computational recognition models can fully account for IT neuronal populations and for recognition behavior. Here, the present review compares several biologically motivated computational approaches with other artificial approaches and discusses whether those models can account for the visual object representations in primate IT.

Khaligh-Razavi and Kriegeskorte (2014) divided the models into two main classes, each with subordinate categories: (1) Unsupervised, trained without category labels: (a) biologically-inspired object-vision models (e.g., HMAX, VisNet, Stable model, sparse localized features (SLF), biological transform (BT), and convolutional networks) (Ghodrati et al., 2012; Ghodrati et al., 2014; Hinton, 2012; Jarrett et al., 2009; LeCun, & Bengio, 1995; Riesenhuber, & Poggio, 1999; Serre, Oliva, & Poggio, 2007; Sountsov, Santucci, & Lisman, 2011; Wallis, & Rolls, 1997); (b) computer-vision models (e.g., GIST, SIFT, PHOG, PHOW, self-similarity features, geometric blur) (Bosch, Zisserman, & Munoz, 2007; Deselaers, & Ferrari, 2010; Lazebnik, Schmid, & Ponce, 2006; Lowe, 1999; Ojala, Pietikäinen, & Mäenpää, 2001; Oliva, & Torralba, 2001); (2) Supervised, trained with category labels: (a) biologically-inspired object-vision models: GMAX and supervised HMAX (Ghodrati et al., 2012), which can discriminate animate from inanimate objects after training on a set of 884 images; and a deep supervised convolutional neural network (DNN) (Krizhevsky et al., 2012), which performs object recognition after learning from a large number of semantically categorized images from ImageNet (Deng et al., 2009). Whereas computer-vision models compute various local image descriptors, biologically-inspired models are hierarchical, consisting of a sequence of transforms that build an invariant representation of the input image in a neurally plausible way (Khaligh-Razavi, & Kriegeskorte, 2014).

Khaligh-Razavi and Kriegeskorte (2014) investigated whether 37 computational models can account for the visual representations in human IT. A separate set of images that did not overlap with the test images was used to reweight and remix the model features (Khaligh-Razavi, & Kriegeskorte, 2014). They found that the HMAX model and several computer-vision models predicted early visual cortex responses well. Furthermore, most of the models distinguished the representational patterns of IT from those of other visual areas (Khaligh-Razavi, & Kriegeskorte, 2014). Several models produced categorical divisions between animal and human faces, a finding consistent with results from human and monkey IT populations, in which human faces are grouped separately from animal faces (Khaligh-Razavi, & Kriegeskorte, 2014).
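
Comparisons of this kind are commonly made with representational similarity analysis, in which each system is summarized by the pairwise dissimilarities among its responses to the same stimuli. The sketch below is a minimal illustration under stated assumptions: dissimilarity is taken to be 1 minus the Pearson correlation, the dissimilarity vectors are compared with Spearman rank correlation, and all patterns and names (`it_patterns`, `model_a`, `model_b`) are hypothetical values invented for the example, not data or code from Khaligh-Razavi and Kriegeskorte (2014).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rdm(patterns):
    """Upper triangle of the dissimilarity matrix (1 - Pearson r) as a flat list."""
    return [1.0 - pearson(patterns[i], patterns[j])
            for i, j in combinations(range(len(patterns)), 2)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical responses to the same four stimuli (two animate, two inanimate).
it_patterns = [[1.0, 0.9, 0.1, 0.2], [0.9, 1.0, 0.2, 0.1],
               [0.1, 0.2, 1.0, 0.8], [0.2, 0.1, 0.9, 1.0]]
# Model A groups the stimuli the same way as the reference patterns.
model_a = [[0.8, 0.1, 0.3], [0.7, 0.2, 0.3], [0.1, 0.9, 0.4], [0.2, 0.8, 0.5]]
# Model B groups stimulus 1 with 3 and stimulus 2 with 4 instead.
model_b = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3], [0.8, 0.2, 0.1], [0.2, 0.9, 0.2]]

rho_a = spearman(rdm(it_patterns), rdm(model_a))
rho_b = spearman(rdm(it_patterns), rdm(model_b))
print(f"model A vs. reference: {rho_a:.2f}")
print(f"model B vs. reference: {rho_b:.2f}")
```

Because only the dissimilarity structure is compared, the models and the reference may have different numbers of units, which is why a model with three features can be compared with a four-unit reference here.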

While several supervised models performed well on recognition tasks, none of the unsupervised models succeeded in distinguishing human from non-human faces. In addition, the unsupervised models failed to reproduce the animate/inanimate division of IT populations (Khaligh-Razavi, & Kriegeskorte, 2014). Computational models thus remain limited in reproducing the semantic categorizations commonly found in human IT. However, these results strongly suggest that computational models trained with categorically labeled image sets could account efficiently for the visual object representations of IT (Khaligh-Razavi, & Kriegeskorte, 2014).

8.6.6 Deep Neural Networks

As deep neural networks have developed alongside progress in learning algorithms (Hinton, Osindero, & Teh, 2006; Krizhevsky, Sutskever, & Hinton, 2012; LeCun, & Bengio, 1995), learning from enormous datasets has enabled computational models to classify and recognize visually presented objects. More recent studies report that the new deep models predict the responses of human IT populations better than other computational models do. Furthermore, these new models better reproduce the categorical divisions found in the visual areas of both humans and monkeys (Cadieu et al., 2014; Khaligh-Razavi, & Kriegeskorte, 2014).

The new deep neural network approaches share components inspired by the primate visual system (Khaligh-Razavi, & Kriegeskorte, 2014). First, a feedforward hierarchical structure is common to the new deep neural models: visual information is passed from each stage to the next. Second, each stage applies linear filters to the output of the preceding stage, followed by a nonlinearity; without the nonlinearity, the stacked stages would reduce to a single linear transformation (Khaligh-Razavi, & Kriegeskorte, 2014). Third, the linear filters of each stage are applied convolutionally, that is, the same filter is applied at every image location. This weight sharing reduces the number of parameters that must be learned and, at the same time, gives the representation a degree of invariance to shifts of the input (LeCun, & Bengio, 1995). Fourth, from stage to stage, the visual representations move from a space based on image features toward one in which they are clustered by shape similarity or semantic category information. Moreover, the networks include four or more layers of representation (Bengio, 2009). Deep networks can represent complex visual patterns accurately with fewer units than a shallow network would require. Finally, the new deep neural network models can be trained with supervision on a large number of categorically labeled pictures (e.g., more than a million images) (Krizhevsky, Sutskever, & Hinton, 2012). Thus, it might be that the more closely a computational model resembles IT, the better it performs object recognition.
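
The role of the nonlinearity between stages can be checked numerically. The sketch below is a toy illustration, not an implementation of any cited network: the weight matrices `W1` and `W2` are arbitrary values chosen for the example, and a rectifier (ReLU) is assumed as the nonlinearity. It shows that two stacked linear stages equal one linear stage whose weights are the matrix product, whereas inserting the rectifier lets the same weights compute the absolute difference of the two inputs, which no single linear map can do.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix product A @ B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(x):
    """Elementwise rectifier, assumed here as the stage nonlinearity."""
    return [max(0.0, v) for v in x]


# Arbitrary example weights for a two-stage network (2 inputs -> 2 units -> 1 output).
W1 = [[1.0, -1.0],
      [-1.0, 1.0]]
W2 = [[1.0, 1.0]]


def linear_stack(x):
    """Two linear stages with no nonlinearity in between."""
    return matvec(W2, matvec(W1, x))


def collapsed(x):
    """A single linear stage using the product of the two weight matrices."""
    return matvec(matmul(W2, W1), x)


def nonlinear_stack(x):
    """The same two stages with a rectifier after the first one."""
    return matvec(W2, relu(matvec(W1, x)))


for x in ([3.0, 1.0], [1.0, 3.0], [2.0, 2.0]):
    print(x, linear_stack(x), collapsed(x), nonlinear_stack(x))
```

With these particular weights the purely linear stack outputs zero for every input, because the product of `W2` and `W1` is the zero matrix, while the rectified stack still distinguishes inputs whose two components differ.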

8.6.7 The advantages of hierarchical features in computational models

Unlike computer-vision models, biologically inspired object-vision models have a hierarchical structure in which the features of the visual representation increase in complexity from lower to higher stages (Poggio, & Ullman, 2013). Such hierarchical models are powerful in replicating visual object recognition in IT populations.

Hierarchical structure offers several possible advantages for visual representations. First, hierarchical computational approaches can emulate the responses of IT neurons, which achieve efficient and robust recognition even though an object can take on a virtually unlimited range of appearances depending on illumination, position, and viewpoint (Logothetis et al., 1994; Logothetis, & Sheinberg, 1996). Although computer vision systems can quite readily achieve scale and position invariance by simply matching target objects against a dataset containing images at different scales and positions (Valentin, & Abdi, 1996), this approach is inadequate as a realistic theory of recognition (Poggio, & Ullman, 2013). Second, hierarchical approaches can be more efficient in both speed and resources (Poggio, & Ullman, 2013). Hierarchical models can recognize even complex objects after training on more than a million images. Thus, these approaches can achieve successful recognition performance by learning from visual images (Poggio, & Ullman, 2013). Finally, hierarchical models can identify and classify parts of objects (e.g., the ears and tail of a dog) as well as whole objects (e.g., a dog and a cat) (Epshtein, Lifshitz, & Ullman, 2008).
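
The way a hierarchy can obtain position tolerance without storing every shifted view can be illustrated in one dimension. The sketch below assumes a single hand-picked filter (`kernel`) applied with shared weights at every position, followed by a global max-pooling step; the signal values are made up for the example, and the code is a simplified illustration rather than any of the cited models.

```python
def correlate(signal, kernel):
    """Sliding dot product: the same kernel weights are applied at every position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def global_max_pool(feature_map):
    """Keep only the strongest filter response, discarding where it occurred."""
    return max(feature_map)


def shift_right(signal, offset):
    """Shift a signal to the right by `offset` samples, padding with zeros."""
    return [0.0] * offset + signal[:len(signal) - offset]


# A hand-picked filter and a toy signal that contains the matching pattern once.
kernel = [1.0, -1.0, 1.0]
image = [0.0, 1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = shift_right(image, 3)  # same pattern, three samples to the right

fmap = correlate(image, kernel)
fmap_shifted = correlate(shifted, kernel)

pooled = global_max_pool(fmap)
pooled_shifted = global_max_pool(fmap_shifted)

print("feature map:        ", fmap)
print("shifted feature map:", fmap_shifted)
print("pooled responses:   ", pooled, pooled_shifted)
```

The feature map moves with the pattern, while the pooled response stays the same, so a later stage reading the pooled value responds identically to both positions without a separate stored template for each one.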

8.7 Conclusion

Object recognition is a remarkable ability in everyday life. The present paper reviewed the ability to recognize visually presented objects and faces, surveyed several phenomena and effects reported in the object recognition literature, and discussed whether computational models can explain the processing involved in object recognition. Although the literature reviewed here highlights the significant role of the ventral stream in object recognition and points to potential solutions to problems observed in computational models, the functional relevance of these findings remains to be established. It may not yet be possible to fully understand how the brain solves object recognition. However, artificial models have become far more computationally sophisticated, and new theories show great promise for explaining object recognition by integrating insights from all of these domains.

8.8 References

Allison, T., McCarthy, G., Nobre, A., Puce, A., & Belger, A. (1994). Human extrastriate visual cortex and the perception of faces, words, numbers, and colors. Cerebral cortex, 4(5), 544-554.

Allison, T., Puce, A., Spencer, D. D., & McCarthy, G. (1999). Electrophysiological studies of human face perception. I: Potentials generated in occipitotemporal cortex by face and non-face stimuli. Cerebral cortex, 9(5), 415-430.

Anzai, A., Peng, X., & Van Essen, D. C. (2007). Neurons in monkey visual area V2 encode combinations of orientations. Nature neuroscience, 10(10), 1313.

Baker, C. I., Behrmann, M., & Olson, C. R. (2002). Impact of learning on representation of parts and wholes in monkey inferotemporal cortex. Nature neuroscience, 5(11), 1210.

Baldassi, C., Alemi-Neissi, A., Pagan, M., DiCarlo, J. J., Zecchina, R., & Zoccolan, D. (2013). Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons. PLoS computational biology, 9(8), e1003167.

Bartels, A., & Zeki, S. (2000). The architecture of the colour centre in the human visual brain: new results and a review. European Journal of Neuroscience, 12, 172-193.

Behrmann, M., Moscovitch, M., & Winocur, G. (1999). Vision and visual mental imagery. Case studies in the neuropsychology of vision. Psychology Press, UK, 81-110.

Behrmann, M., Winocur, G., & Moscovitch, M. (1992). Dissociation between mental imagery and object recognition in a brain-damaged patient. Nature, 359(6396), 636.

Bell, A. H., Hadj-Bouziane, F., Frihauf, J. B., Tootell, R. B., & Ungerleider, L. G. (2009). Object representations in the temporal cortex of monkeys and humans as revealed by functional magnetic resonance imaging. Journal of neurophysiology, 101(2), 688-700.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and trends in Machine Learning, 2(1), 1-127.

Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological review, 94(2), 115-147.

Bindemann, M., Avetisyan, M., & Rakow, T. (2012). Who can recognize unfamiliar faces? Individual differences and observer consistency in person identification. Journal of Experimental Psychology: Applied, 18(3), 277.

Bindemann, M., & Sandford, A. (2011). Me, myself, and I: Different recognition rates for three photo-IDs of the same person. Perception, 40(5), 625-627.

Bosch, A., Zisserman, A., & Munoz, X. (2007, July). Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM international conference on Image and video retrieval (pp. 401-408). ACM.

Brigham, J. C. (1986). The influence of race on face recognition. In Aspects of face processing (pp. 170-177). Springer, Dordrecht.

Brigham, J. C., & Barkowitz, P. (1978). Do “they all look alike?” The effect of race, sex, experience, and attitudes on the ability to recognize faces. Journal of Applied Social Psychology, 8(4), 306-318.

Brincat, S. L., & Connor, C. E. (2004). Underlying principles of visual shape selectivity in posterior inferotemporal cortex. Nature neuroscience, 7(8), 880.

Bruce, V., Henderson, Z., Newman, C., & Burton, A. M. (2001). Matching identities of familiar and unfamiliar faces caught on CCTV images. Journal of Experimental Psychology: Applied, 7(3), 207.

Burton, A. M., & Vokey, J. R. (1998). The face-space typicality paradox: Understanding the face-space metaphor. The Quarterly Journal of Experimental Psychology: Section A, 51(3), 475-483.

Busey, T. A. (1998). Physical and psychological representations of faces: Evidence from morphing. Psychological Science, 9(6), 476-483.

Cadieu, C. F., Hong, H., Yamins, D. L., Pinto, N., Ardila, D., Solomon, E. A., … & DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS computational biology, 10(12), e1003963.

Caldara, R., & Abdi, H. (2006). Simulating the ‘other-race’ effect with autoassociative neural networks: further evidence in favor of the face-space model. Perception, 35(5), 659-670.

Caldara, R., Rossion, B., Bovet, P., & Hauert, C. A. (2004). Event-related potentials and time course of the ‘other-race’ face classification advantage. Neuroreport, 15(5), 905-910.

Caldara, R., Thut, G., Servoir, P., Michel, C. M., Bovet, P., & Renault, B. (2003). Face versus non-face object perception and the ‘other-race’ effect: a spatio-temporal event-related potential study. Clinical Neurophysiology, 114(3), 515-528.

Clarke, S., Riahi-Arya, S., Tardif, E., Eskenasy, C., & Probst, A. (1999). Thalamic projections of the fusiform gyrus in man. European Journal of Neuroscience, 11, 1835-1838.

Carlson, J. M., Cha, J., & Mujica-Parodi, L. R. (2013). Functional and structural amygdala–anterior cingulate connectivity correlates with attentional bias to masked fearful faces. Cortex, 49(9), 2595-2600.

Connolly, A. C., Guntupalli, J. S., Gors, J., Hanke, M., Halchenko, Y. O., Wu, Y. C., … & Haxby, J. V. (2012). The representation of biological classes in the human brain. Journal of Neuroscience, 32(8), 2608-2618.

Corbetta, M., Miezin, F. M., Dobmeyer, S., Shulman, G. L., & Petersen, S. E. (1991). Selective and divided attention during visual discriminations of shape, color, and speed: functional anatomy by positron emission tomography. Journal of neuroscience, 11(8), 2383-2402.

Dailey, M. N., Cottrell, G. W., & Busey, T. A. (1999). Facial memory is kernel density estimation (almost). In Advances in neural information processing systems (pp. 24-30).

Damasio, A. R., Damasio, H., & Van Hoesen, G. W. (1982). Prosopagnosia: Anatomic basis and behavioral mechanisms. Neurology, 32(4), 331-331.

Damasio, A. R., Tranel, D., & Damasio, H. (1990). Face agnosia and the neural substrates of memory. Annual review of neuroscience, 13(1), 89-109.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 248-255). IEEE.

Deselaers, T., & Ferrari, V. (2010, June). Global and efficient self-similarity for object classification and detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 1633-1640). IEEE.

Desimone, R., Albright, T. D., Gross, C. G., & Bruce, C. (1984). Stimulus-selective properties of inferior temporal neurons in the macaque. Journal of Neuroscience, 4(8), 2051-2062.

Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of cognitive neuroscience, 3(1), 1-8.

De Renzi, E. (1986). Current issues on prosopagnosia. In Aspects of face processing (pp. 243-252). Springer, Dordrecht.

DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in cognitive sciences, 11(8), 333-341.

DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73(3), 415-434.

Downing, P. E., Jiang, Y., Shuman, M., & Kanwisher, N. (2001). A cortical area selective for visual processing of the human body. Science, 293(5539), 2470-2473.

Engel, S., Zhang, X., & Wandell, B. (1997). Colour tuning in human visual cortex measured with functional magnetic resonance imaging. Nature, 388, 68-71.

Epshtein, B., Lifshitz, I., & Ullman, S. (2008). Image interpretation by a single bottom-up top-down cycle. Proceedings of the National Academy of Sciences, 105(38), 14298-14303.

Fabre-Thorpe, M., Richard, G., & Thorpe, S. J. (1998). Rapid categorization of natural images by rhesus monkeys. Neuroreport, 9(2), 303-308.

Farah, M. J., Hammond, K. M., Mehta, Z., & Ratcliff, G. (1989). Category-specificity and modality-specificity in semantic memory. Neuropsychologia, 27(2), 193-200.

Feinberg, T. E., Schindler, R. J., Ochoa, E., Kwan, P. C., & Farah, M. J. (1994). Associative visual agnosia and alexia without prosopagnosia. Cortex, 30(3), 395-411.

Feingold, G. A. (1914). Influence of environment on identification of persons and things. J. Am. Inst. Crim. L. & Criminology, 5, 39.

Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1), 1-47.

Freiwald, W. A., & Tsao, D. Y. (2010). Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science, 330(6005), 845-851.

Freedman, D. J., Riesenhuber, M., Poggio, T., & Miller, E. K. (2006). Experience-dependent sharpening of visual shape selectivity in inferior temporal cortex. Cerebral Cortex, 16(11), 1631-1644.

Furl, N., Phillips, P. J., & O’Toole, A. J. (2002). Face recognition algorithms and the other-race effect: computational mechanisms for a developmental contact hypothesis. Cognitive Science, 26(6), 797-815.

Gallant, J. L., Braun, J., & Van Essen, D. C. (1993). Selectivity for polar, hyperbolic, and Cartesian gratings in macaque visual cortex. Science, 259(5091), 100-103.

Gauthier, I., Tarr, M. J., Anderson, A. W., Skudlarski, P., & Gore, J. C. (1999). Activation of the middle fusiform ‘face area’ increases with expertise in recognizing novel objects. Nature neuroscience, 2(6), 568.

Ghodrati, M., Khaligh-Razavi, S. M., Ebrahimpour, R., Rajaei, K., & Pooyan, M. (2012). How can selection of biologically inspired features improve the performance of a robust object recognition model? PloS one, 7(2), e32357.

Ghodrati, M., Farzmahdi, A., Rajaei, K., Ebrahimpour, R., & Khaligh-Razavi, S. M. (2014). Feedforward object-vision models only tolerate small image variations compared to human. Frontiers in computational neuroscience, 8, 74.

Goldstein, A. G., & Chance, J. E. (1985). Effects of training on Japanese face recognition: Reduction of the other-race effect. Bulletin of the Psychonomic Society, 23(3), 211-214.

Goodale, M. A., Milner, A. D., Jakobson, L. S., & Carey, D. P. (1991). A neurological dissociation between perceiving objects and grasping them. Nature, 349(6305), 154-155.

Grill-Spector, K., Kourtzi, Z., & Kanwisher, N. (2001). The lateral occipital complex and its role in object recognition. Vision research, 41(10-11), 1409-1422.

Grill-Spector, K., Knouf, N., & Kanwisher, N. (2004). The fusiform face area subserves face perception, not generic within-category identification. Nature neuroscience, 7(5), 555.

Grill-Spector, K., Kushnir, T., Hendler, T., Edelman, S., Itzchak, Y., & Malach, R. (1998). A sequence of object processing stages revealed by fMRI in the human occipital lobe. Human Brain Mapping, 316-328.

Gross, C., Rocha, M., & Bender, D. (1972). Visual properties of neurons in inferotemporal cortex of the Macaque. Journal of Neurophysiology, 35, 96-111.

Grudin, M. A. (2000). On internal representations in face recognition systems. Pattern recognition, 33(7), 1161-1177.

Hadjikhani, N., Liu, A., Dale, A., Cavanagh, P., & Tootell, R. (1998). Retinotopy and color sensitivity in human visual cortical area V8. Nature Neuroscience, 1, 235-241.

Hancock, P. J., Bruce, V., & Burton, A. M. (2000). Recognition of unfamiliar faces. Trends in cognitive sciences, 4(9), 330-337.

Haxby, J. V., Grady, C. L., Ungerleider, L. G., & Horwitz, B. (1991). Mapping the functional neuroanatomy of the intact human brain with brain work imaging. Neuropsychologia, 29(6), 539-555.

Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539), 2425-2430.

Haxby, J. V., Hoffman, E. A., & Gobbini, M. I. (2002). Human neural systems for face recognition and social communication. Biological psychiatry, 51(1), 59-67.

Hung, C. P., Kreiman, G., Poggio, T., & DiCarlo, J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749), 863-866.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., … & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82-97.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554.

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210-1224.

Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. Journal of Neurophysiology, 73, 218-226.

Jenkins, R., & Burton, A. M. (2011). Stable face representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 366(1571), 1671-1683.

Jerabek, P. (1997). Retinotopic organization of early visual spatial attention effects as revealed by PET and ERPs. Human brain mapping, 5(4), 280-286.

Jarrett, K., Kavukcuoglu, K., & LeCun, Y. (2009, September). What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on (pp. 2146-2153). IEEE.

Johnson, K. O. (1980). Sensory discrimination: Decision process. Journal of Neurophysiology, 43(6), 1771-1792.

Johnston, R. A., & Edmonds, A. J. (2009). Familiar and unfamiliar face recognition: A review. Memory, 17(5), 577-596.

Just, M. A., Cherkassky, V. L., Aryal, S., & Mitchell, T. M. (2010). A neurosemantic theory of concrete noun representation based on the underlying brain codes. PloS one, 5(1), e8622.

Kalocsai, P., Zhao, W., & Elagin, E. (1998, April). Face similarity space as perceived by humans and artificial systems. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on (pp. 177-180). IEEE.

Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of neuroscience, 17(11), 4302-4311.

Kanwisher, N. (2010). Functional specificity in the human brain: a window into the functional architecture of the mind. Proceedings of the National Academy of Sciences, 107(25), 11163-11170.

Kanwisher, N. (1991). Repetition blindness and illusory conjunctions: Errors in binding visual types with visual tokens. Journal of Experimental Psychology: Human Perception and Performance, 17(2), 404.

Kayaert, G., Biederman, I., & Vogels, R. (2003). Shape tuning in macaque inferior temporal cortex. Journal of Neuroscience, 23(7), 3016-3027.

Kayaert, G., Biederman, I., Op de Beeck, H. P., & Vogels, R. (2005). Tuning for shape dimensions in macaque inferior temporal cortex. European Journal of Neuroscience, 22(1), 212-224.

Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11), e1003915.

Konkle, T., & Oliva, A. (2012). A real-world size organization of object responses in occipitotemporal cortex. Neuron, 74(6), 1114-1124.

Kosslyn, S. M., Alpert, N. M., Thompson, W. L., Chabris, C. F., Rauch, S. L., & Anderson, A. K. (1994). Identifying objects seen from different viewpoints: A PET investigation. Brain, 117(5), 1055-1071.

Kourtzi, Z., & Connor, C. E. (2011). Neural representations for object perception: structure, category, and adaptive coding. Annual review of neuroscience, 34, 45-67.

Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., … & Bandettini, P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6), 1126-1141.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Ku, S. P., Tolias, A. S., Logothetis, N. K., & Goense, J. (2011). fMRI of the face-processing network in the ventral temporal lobe of awake and anesthetized macaques. Neuron, 70(2), 352-362.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). A discriminative framework for texture and object recognition using local image features. In Toward Category-Level Object Recognition (pp. 423-442). Springer, Berlin, Heidelberg.

Landis, T., Cummings, J. L., Benson, D. F., & Palmer, E. P. (1986). Loss of topographic familiarity: An environmental agnosia. Archives of neurology, 43(2), 132-136.

Lavrakas, P. J., Buri, J. R., & Mayzner, M. S. (1976). A perspective on the recognition of other-race faces. Perception & Psychophysics, 20(6), 475-481.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer vision and pattern recognition, 2006 IEEE computer society conference on (Vol. 2, pp. 2169-2178). IEEE.

LeCun, Y., & Bengio, Y. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.

Lennie, P., & Movshon, J. A. (2005). Coding of color and form in the geniculostriate visual pathway (invited review). JOSA A, 22(10), 2013-2033.

Leopold, D. A., Rhodes, G., Müller, K. M., & Jeffery, L. (2005). The dynamics of visual adaptation to faces. Proceedings of the Royal Society of London B: Biological Sciences, 272(1566), 897-904.

Li, N., Cox, D. D., Zoccolan, D., & DiCarlo, J. J. (2009). What response properties do individual neurons need to underlie position and clutter “invariant” object recognition? Journal of Neurophysiology, 102(1), 360-376.

Logothetis, N. K., Pauls, J., Bülthoff, H. H., & Poggio, T. (1994). View-dependent object recognition by monkeys. Current biology, 4(5), 401-414.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual review of neuroscience, 19(1), 577-621.

Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on (Vol. 2, pp. 1150-1157). IEEE.

McCarthy, G., Puce, A., Gore, J. C., & Allison, T. (1997). Face-specific processing in the human fusiform gyrus. Journal of cognitive neuroscience, 9(5), 605-610.

Majaj, N., Hong, H., Solomon, E., & DiCarlo, J. J. (2012). A unified neuronal population code fully explains human object recognition. Cosyne Abstracts.

Mahon, B. Z., Milleville, S. C., Negri, G. A., Rumiati, R. I., Caramazza, A., & Martin, A. (2007). Action-related properties shape object representations in the ventral stream. Neuron, 55(3), 507-520.

Mahon, B. Z., & Caramazza, A. (2009). Concepts and categories: A cognitive neuropsychological perspective. Annual review of psychology, 60, 27-51.

Malach, R., Reppas, J. B., Benson, R. R., Kwong, K. K., Jiang, H., Kennedy, W. A., … & Tootell, R. B. (1995). Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proceedings of the National Academy of Sciences, 92(18), 8135-8139.

Malpass, R. S., & Kravitz, J. (1969). Recognition for faces of own and other race. Journal of personality and social psychology, 13(4), 330.

McCarthy, G., Puce, A., Belger, A., & Allison, T. (1999). Electrophysiological studies of human face perception. II: Response properties of face-specific potentials generated in occipitotemporal cortex. Cerebral cortex, 9(5), 431-444.

Meissner, C. A., & Brigham, J. C. (2001). Thirty years of investigating the own-race bias in memory for faces: A meta-analytic review. Psychology, Public Policy, and Law, 7(1), 3.

Missal, M., Vogels, R., Li, C. Y., & Orban, G. A. (1999). Shape interactions in macaque inferior temporal neurons. Journal of Neurophysiology, 82(1), 131-142.

Miyashita, Y. (1993). Inferior temporal cortex: where visual perception meets memory. Annual review of neuroscience, 16(1), 245-263.

Moscovitch, M., Winocur, G., & Behrmann, M. (1997). What is special about face recognition? Nineteen experiments on a person with visual object agnosia and dyslexia but normal face recognition. Journal of cognitive neuroscience, 9(5), 555-604.

Mutch, J., & Lowe, D. G. (2006). Multiclass object recognition with sparse, localized features. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 11-18). IEEE.

Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., & Gallant, J. L. (2009). Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6), 902-915.

Ng, W. J., & Lindsay, R. C. (1994). Cross-race facial recognition: Failure of the contact hypothesis. Journal of Cross-Cultural Psychology, 25(2), 217-232.

Ojala, T., Pietikäinen, M., & Mäenpää, T. (2001, March). A generalized local binary pattern operator for multiresolution gray scale and rotation invariant texture classification. In International Conference on Advances in Pattern Recognition (pp. 399-408). Springer, Berlin, Heidelberg.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3), 145-175.

Orban, G. A. (2008). Higher order visual processing in macaque extrastriate cortex. Physiological reviews, 88(1), 59-89.

Orban, G. A., Van Essen, D., & Vanduffel, W. (2004). Comparative mapping of higher visual areas in monkeys and humans.Trends in cognitive sciences,8(7), 315-324.

O’toole, A. J., Deffenbacher, K. A., Valentin, D., & Abdi, H. (1994). Structural aspects of face recognition and the other-race effect.Memory & Cognition,22(2), 208-224.

O’Toole, A. J., Peterson, J., & Deffenbacher, K. A. (1996). An ‘other-race effect’ for categorizing faces by sex.Perception,25(6), 669-676.

O’Toole, A. J., Vetter, T., Bülthoff, H. H., & Troje, N. F. (1995). The role of shape and texture information in sex classification.

O’Toole, A. J., Jiang, F., Abdi, H., & Haxby, J. V. (2005). Partially distributed representations of objects and faces in ventral temporal cortex. Journal of Cognitive Neuroscience, 17(4), 580-590.

Pasupathy, A., & Connor, C. E. (2001). Shape representation in area V4: position-specific tuning for boundary conformation. Journal of Neurophysiology, 86(5), 2505-2519.

Perrett, D. I., Hietanen, J. K., Oram, M. W., & Benson, P. J. (1992). Organization and functions of cells responsive to faces in the temporal cortex. Phil. Trans. R. Soc. Lond. B, 335(1273), 23-30.

Poggio, T., & Ullman, S. (2013). Vision: are models of object recognition catching up with the brain? Annals of the New York Academy of Sciences, 1305(1), 72-82.

Perrett, D. I., Rolls, E. T., & Caan, W. (1982). Visual neurones responsive to faces in the monkey temporal cortex. Experimental Brain Research, 47(3), 329-342.

Pinto, N., Cox, D. D., & DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), e27.

Pinto, N., Majaj, N., Barhomi, Y., Solomon, E., & DiCarlo, J. J. (2010). Human versus machine: comparing visual object recognition systems on a level playing field. Cosyne Abstracts.

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019.

Rousselet, G. A., Fabre-Thorpe, M., & Thorpe, S. J. (2002). Parallel processing in high-level categorization of natural images. Nature Neuroscience, 5(7), 629-630.

Rolls, E. T. (2000). The orbitofrontal cortex and reward. Cerebral Cortex, 10(3), 284-294.

Rust, N. C., & DiCarlo, J. J. (2010). Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT. Journal of Neuroscience, 30(39), 12978-12995.

Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15), 6424-6429.

Sountsov, P., Santucci, D. M., & Lisman, J. E. (2011). A biologically plausible transform for visual recognition that is invariant to translation, scale, and rotation. Frontiers in Computational Neuroscience, 5, 53.

Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19(1), 109-139.

Tanaka, J. W., & Farah, M. J. (1993). Parts and wholes in face recognition. The Quarterly Journal of Experimental Psychology, 46(2), 225-245.

Tsao, D., Freiwald, W., Knutsen, T., Mandeville, J., & Tootell, R. (2003). Faces and objects in macaque cerebral cortex. Nature Neuroscience, 6, 989-995.

Tsao, D. Y., Freiwald, W. A., Tootell, R. B., & Livingstone, M. S. (2006). A cortical region consisting entirely of face-selective cells. Science, 311(5761), 670-674.

Tsao, D., Moeller, S., & Freiwald, W. (2008a). Comparing face patch systems in macaques and humans. Proceedings of the National Academy of Sciences, 105(49), 19514-19519.

Tsao, D. Y., Schweers, N., Moeller, S., & Freiwald, W. A. (2008b). Patches of face-selective cortex in the macaque frontal lobe. Nature Neuroscience, 11(8), 877.

Tootell, R., & Taylor, J. (1995). Anatomical evidence for MT and additional cortical visual areas in humans. Cerebral Cortex, 5, 39-55.

Tovee, M. J., Rolls, E. T., & Azzopardi, P. (1994). Translation invariance in the responses to faces of single neurons in the temporal visual cortical areas of the alert macaque. Journal of Neurophysiology.

Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71-86.

Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381(6582), 520-522.

Tootell, R., Mendola, J., Hadjikhani, N., Liu, A., & Dale, A. (1998). The representation of the ipsilateral visual field in human cerebral cortex. Proceedings of the National Academy of Sciences USA, 95, 818-824.

Ullman, S. (1996). High-level vision: Object recognition and visual cognition (Vol.2). Cambridge, MA: MIT press.

Valentine, T. (1991). A unified account of the effects of distinctiveness, inversion, and race in face recognition. The Quarterly Journal of Experimental Psychology Section A, 43(2), 161-204.

Valentin, D., & Abdi, H. (1996). Can a linear autoassociator recognize faces from new orientations? JOSA A, 13(4), 717-724.

Valentine, T., & Endo, M. (1992). Towards an exemplar model of face processing: The effects of race and distinctiveness. The Quarterly Journal of Experimental Psychology Section A, 44(4), 671-703.

Van Oostende, S., Sunaert, S., Van Hecke, P., Marchal, G., & Orban, G. (1997). The kinetic occipital (KO) region in man: an fMRI study. Cerebral Cortex, 7, 690-701.

Vogels, R., & Biederman, I. (2002). Effects of illumination intensity and direction on object coding in macaque inferior temporal cortex. Cerebral Cortex, 12(7), 756-766.

Wang, G., Zhang, Y., & Fei-Fei, L. (2006). Using dependent regions for object categorization in a generative framework. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 2, pp. 1597-1604). IEEE.

Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51(2), 167-194.

Watson, D., Myers, R., Frackowiak, R., Hajnal, J., Woods, R., Mazziotta, J., Shipp, S., & Zeki, S. (1993). Area V5 of the human brain: evidence from a combined study using positron emission tomography and magnetic resonance imaging. Cerebral Cortex, 3, 79-94.

Woldorff, M. G., Fox, P. T., Matzke, M., Lancaster, J. L., Veeraswamy, S., Zamarripa, F., …

Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., & Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11(11), 1352.

Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619-8624.

Zhang, H., Berg, A. C., Maire, M., & Malik, J. (2006). SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 2, pp. 2126-2136). IEEE.

Zoccolan, D., Cox, D. D., & DiCarlo, J. J. (2005). Multiple object response normalization in monkey inferotemporal cortex. Journal of Neuroscience, 25(36), 8150-8164.

Zoccolan, D., Kouh, M., Poggio, T., & DiCarlo, J. J. (2007). Trade-off between object selectivity and tolerance in monkey inferotemporal cortex. Journal of Neuroscience, 27(45), 12292-12307.