Tomaso A. Poggio, Massachusetts Institute of Technology
Present learning algorithms have high sample complexity and shallow architectures. One of the most obvious differences between such algorithms and biological learners is the apparent ability of people and animals to learn from very few examples (the “poverty of stimulus” problem). Are hierarchical architectures the answer to this? Visual cortex: a hierarchical architecture, from neuroscience to a class of models. In this area, the dorsal stream is for “where” and the ventral stream is for “what”. The ventral stream in humans has at least an order of magnitude more neurons than in our close taxonomic relatives.
As you go from V1 to higher areas in the ventral stream (VS), the optimal stimulus increases in visual complexity – by the time you get to area IT, neurons are really only driven by images at the complexity level of faces. The VS has both feedforward and backprojection connections. How far can we push the simplest type of feedforward (FF) hierarchical model? It’s a good place to start. It takes 30–50 ms for a signal to go from the retina to this area. The model of visual recognition (millions of units) is based on the neuroscience of the cortex, and the software is available online. An overcomplete dictionary of “templates” or image “patches” is learned during an unsupervised learning stage from ~10000 natural images by tuning S units. The preprocessing stages lead to a representation that has lower sample complexity than the image itself, where he defines the sample complexity of the preprocessing stage as the number of labeled examples required by the classifier at the top.
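The talk didn’t give implementation details, but the template-dictionary idea can be sketched roughly as follows: S units compare image patches against stored templates (Gaussian tuning), and a pooling (C-like) stage takes local maxima for tolerance to position. Everything here is a toy stand-in – patch size, dictionary size, random images, and the function names are all my own assumptions, not the actual CBCL model code.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 4          # template size (assumption; the real model uses multiple scales)
N_TEMPLATES = 16   # dictionary size (assumption; the real dictionary is much larger)

def sample_templates(images, n_templates, patch=PATCH):
    """Unsupervised stage (toy version): store random patches as templates.
    In the talk this dictionary is sampled from ~10000 natural images."""
    templates = []
    for _ in range(n_templates):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch)
        x = rng.integers(img.shape[1] - patch)
        templates.append(img[y:y + patch, x:x + patch].ravel())
    return np.stack(templates)

def s_layer(image, templates, sigma=1.0, patch=PATCH):
    """S units: Gaussian tuning of every image patch to every template."""
    H, W = image.shape
    out = np.zeros((len(templates), H - patch + 1, W - patch + 1))
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            p = image[i:i + patch, j:j + patch].ravel()
            d2 = ((templates - p) ** 2).sum(axis=1)
            out[:, i, j] = np.exp(-d2 / (2 * sigma ** 2))
    return out

def c_layer(s_maps, pool=2):
    """C units: local max pooling, giving tolerance to position."""
    n, H, W = s_maps.shape
    Hp, Wp = H // pool, W // pool
    trimmed = s_maps[:, :Hp * pool, :Wp * pool]
    return trimmed.reshape(n, Hp, pool, Wp, pool).max(axis=(2, 4))

# The C-layer output is the representation that would feed the classifier
# at the top; random 12x12 "images" stand in for natural images here.
images = [rng.random((12, 12)) for _ in range(5)]
templates = sample_templates(images, N_TEMPLATES)
features = c_layer(s_layer(images[0], templates))
```

The point of the sketch is the division of labour: the S/C stages are learned (or fixed) without labels, so the supervised classifier only sees the pooled features – which is what makes the sample complexity of the top stage lower than that of raw pixels.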
What can we say about how the model works? There is a long series of comparisons, based on the literature and on collaborations. It is a hierarchical feedforward model of the VS, and there is data comparing the model against IT, V4, V1, etc., and against psychophysics. The latter involves rapid categorization – a good test case because the short presentation times leave no room for backprojections. In terms of accuracy, the model and human observers perform similarly, with a high correlation of correct responses (images that are difficult for one are difficult for the other, and so on). It was surprising to find this kind of agreement. Compared to the computer vision systems of the time (a couple of years ago), the model based on the neuroscience of the visual cortex did a better job at labelling things correctly. Hierarchical FF models of visual cortex may be wrong, but they present a challenge for “classical” learning theory.
They have started to develop a theory called HKM. There are a number of fashionable models going under the name of deep-learning networks. You can consider images as functions: in a vision problem, for example, a function can be interpreted as a greyscale image. What followed was a series of technical slides on the algorithm that I didn’t quite get.
Extensions of the model to videos and sequences of images. The specific system discussed looks at mice in cages: the goal is to classify simple behaviours over a couple of seconds (grooming, walking, etc.). They collected ~100 hours of video and then performed an automated analysis. The system is almost as good as humans – labellers agree with each other about 70% of the time, which is about the same as the agreement between labellers and the system. They’re doing 24-hour monitoring of 4 different strains to test the system. You can infer the mouse strain from the behaviour with about 50% accuracy from 10 minutes of video.
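To make the 70% figure concrete, the agreement measure is presumably just the fraction of clips on which two labelings coincide. Here is a minimal sketch of that comparison; the behaviour labels and all the sequences below are invented for illustration, not real data from the talk.

```python
def agreement(a, b):
    """Fraction of clips on which two label sequences agree."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical per-clip labels from two human labellers and the system.
human_1 = ["groom", "walk", "walk", "rest", "eat", "walk", "groom", "rest", "eat", "walk"]
human_2 = ["groom", "walk", "rest", "rest", "eat", "walk", "walk", "rest", "eat", "eat"]
system = ["groom", "walk", "walk", "rest", "eat", "rest", "groom", "rest", "walk", "eat"]

print(agreement(human_1, human_2))  # human-human agreement: 0.7
print(agreement(human_1, system))   # human-system agreement: 0.7
```

The claim in the talk is that these two numbers come out about the same (~70%), i.e. the system sits roughly at the level of inter-labeller noise rather than clearly below it.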
Limits of present FF models: vision is more than categorization or identification – it is image understanding, inference, and parsing. Our visual system can “answer” almost any kind of question about an image or video (a Turing test for vision). He doesn’t think the types of models he’s describing could handle this kind of Turing-style test.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!