Katholieke Universiteit Leuven
IDIAP Research Institute

CARTER

Classification of visual scenes using Affine invariant Regions and TExt Retrieval methods

Project timeline: 1 September 2004 to 30 August 2005.

Fig. 1 - Example scene images representing the scene types used in this project (landscape, city, indoors, city).

Overview

Project description (this page)

Representations and classification methods

Analogy with text

Some results

Databases used for experiments

Resulting publications

References

Acknowledgments

People

IDIAP: Pedro Quelhas, Florent Monay, Jean-Marc Odobez, and Daniel Gatica-Perez

K.U.Leuven: Mihai Osian, Tinne Tuytelaars, and Luc Van Gool

Contacts

Daniel Gatica-Perez (IDIAP)

Tinne Tuytelaars (KUL)

Goal

The goal of this project is to investigate whether image representations based on local invariant features and document analysis algorithms such as probabilistic latent semantic analysis can be successfully adapted and combined for the specific problem of scene categorization. More precisely, our aim is to distinguish between indoor/outdoor or city/landscape images, as well as, at a later stage, more diverse scene categories. This is interesting in its own right in the context of image retrieval or automatic image annotation, and also helps to provide contextual information to guide other processes such as object recognition or categorization.

So far, the intuitive analogy between local invariant features in an image and words in a text document has only been explored at the level of object rather than scene categories. Moreover, it has mostly been limited to a bag-of-keywords representation. The prime research objective of this project is to introduce visual equivalents of more advanced text retrieval methods that deal with word stemming, spatial relations between words, synonymy, and polysemy. A second objective is to study the statistics of the extracted local features, to determine to what degree the analogy between local visual features and words really holds in the context of scene classification, and how the description based on local features needs to be adapted where it does not.

Motivation

Local, viewpoint-invariant features have recently made a rather impressive entrance in the field of computer vision [8, 16, 17]. They have already proven their potential in long-standing problems such as viewpoint-independent object recognition and wide-baseline matching, and have recently also been applied to challenging tasks such as object categorization and texture classification. Thanks to their local character, they provide robustness to image clutter, partial visibility, and occlusion. Thanks to their invariant nature, changes in viewpoint can be dealt with in a natural way, while robustness to changes in lighting conditions can be included at the same time.

In a sense, these local invariant features have much in common with the role played by words in traditional document analysis techniques: they are local, they have a high repeatability between similar documents or images of similar scenes, respectively, and they have a relatively high discriminative power. This has also been exploited in recent work by Sivic and Zisserman [14], where local invariant features are clustered into so-called visual words, which make it possible to efficiently search through a video for frames of the same object or scene using inverted files. Zhu et al. applied vector quantization to fixed-size image windows to extract a bag-of-keyblocks representation, and used a simple direct matching technique for an image retrieval task [21]. More recently, Opelt et al. proposed to use AdaBoost to learn classifiers from a set of visual feature types, including local invariant ones; the framework implicitly performs feature selection, and was applied to object detection [11]. Finally, Csurka et al. reported good results on object matching and multi-class categorization (from 5 to 10 objects) with a system based on a bag-of-words representation built from local invariant features and naive Bayes and Support Vector Machine (SVM) classifiers [2].
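
To make the visual-word machinery concrete, here is a minimal Python/NumPy sketch (the project itself was implemented in C++ and MATLAB; the names and toy data below are illustrative assumptions, not the code of [14] or of this project). It quantizes local descriptors against a pre-learned vocabulary and builds an inverted file from visual words to the images that contain them:

    import numpy as np

    def assign_visual_words(descriptors, vocabulary):
        # descriptors: (n, d) local invariant descriptors from one image
        # vocabulary: (k, d) cluster centres learned offline, e.g. by k-means
        # returns, for each descriptor, the index of its nearest centre
        d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def build_inverted_file(all_descriptors, vocabulary):
        # inverted file: visual word -> set of image ids containing it,
        # so images sharing words with a query can be fetched directly
        inverted = {}
        for img_id, desc in enumerate(all_descriptors):
            for w in set(assign_visual_words(desc, vocabulary)):
                inverted.setdefault(int(w), set()).add(img_id)
        return inverted

    # toy usage: three images with random 128-d descriptors
    rng = np.random.default_rng(0)
    vocab = rng.normal(size=(50, 128))      # stand-in for a learned vocabulary
    images = [rng.normal(size=(30, 128)) for _ in range(3)]
    inv = build_inverted_file(images, vocab)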

Here, we want to go one step further, by combining several local features with good invariance properties and applying state-of-the-art document analysis techniques to classify scenes into a set of categories. Scene classification is interesting in its own right in the context of image indexing, and also helps to provide contextual information to guide other processes such as object recognition or categorization. From the application point of view, scene classification is relevant to multimodal systems for the organization of personal and professional imaging collections, whether these contain only images (e.g. photos) or mixed media (e.g. annotated photos, or videos with speech transcriptions).

The problem of scene classification using low-level features has been studied in image and video retrieval for several years [18, 15, 19]. Color, texture, and shape features have been used in combination with supervised learning methods to classify images into a small number of semantic classes, such as indoor/outdoor, city/landscape, or sunset/forest/mountain. Hierarchical classification has often been used to deal with the multi-class case [19]. To our knowledge, however, the use of invariant local descriptors for scene classification has not been investigated.

Instead of representing documents simply as bags of words, as in traditional text vector-space representations, we will focus on more advanced techniques. In particular, generative models for collections of discrete data have attracted recent attention. In these models, documents in a collection are modeled as mixtures of aspects, where the aspects are discrete hidden variables that capture co-occurrence information between elements in the corpus that the simple vector-space representation usually cannot. These latent space approaches, such as Probabilistic Latent Semantic Analysis (PLSA) [6], make it possible to address the issues of synonymy (different words may represent the same concept) and polysemy (the same word may represent different concepts in different contexts). Some of the methods are amenable to both dimensionality reduction and clustering. Embedded in a supervised learning task, these methods have been applied to text-based categorization with promising results [7]. Finally, although variations of these methods have recently been applied to the problem of modeling annotated images [1, 9, 10], they have relied mostly on global image features without much viewpoint and/or illumination invariance. In this view, the proposed project also has potential for applications in other areas of computer vision.
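
Concretely, PLSA models the probability of word w in document d as P(w|d) = sum_z P(w|z) P(z|d), where z ranges over a small number of latent aspects, and the parameters are fit with the EM algorithm. The following compact NumPy sketch of the standard EM updates is an illustration under our own naming, not the project's implementation:

    import numpy as np

    def plsa(counts, n_aspects, n_iter=100, seed=0):
        # counts: (n_docs, n_words) co-occurrence matrix n(d, w)
        # fits P(w|z) and P(z|d) with the standard PLSA EM updates
        rng = np.random.default_rng(seed)
        counts = np.asarray(counts, dtype=float)
        n_docs, n_words = counts.shape
        p_w_z = rng.random((n_aspects, n_words))
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = rng.random((n_docs, n_aspects))
        p_z_d /= p_z_d.sum(1, keepdims=True)
        for _ in range(n_iter):
            new_p_w_z = np.zeros_like(p_w_z)
            new_p_z_d = np.zeros_like(p_z_d)
            for d in range(n_docs):
                # E-step: posterior P(z|d,w) proportional to P(z|d) P(w|z)
                post = p_z_d[d][:, None] * p_w_z
                post /= post.sum(0, keepdims=True) + 1e-12
                # accumulate expected aspect-word counts for the M-step
                expected = post * counts[d][None, :]
                new_p_w_z += expected
                new_p_z_d[d] = expected.sum(1)
            p_w_z = new_p_w_z / (new_p_w_z.sum(1, keepdims=True) + 1e-12)
            p_z_d = new_p_z_d / (new_p_z_d.sum(1, keepdims=True) + 1e-12)
        return p_w_z, p_z_d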

Technical contributions and novelty

We investigated an approach to scene classification based on the use of bags-of-visterms (i.e. quantized invariant local descriptors) to represent scenes, and demonstrated that this approach successfully classifies scenes.
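
As a minimal sketch of this representation (hypothetical names; the vocabulary is assumed to have been learned beforehand, e.g. by k-means over training descriptors), each image is mapped to a fixed-length histogram of quantized local descriptors:

    import numpy as np

    def bag_of_visterms(descriptors, vocabulary):
        # quantize each local invariant descriptor to its nearest
        # vocabulary centre ("visterm") and count the occurrences:
        # the image becomes a k-dimensional term-frequency vector
        d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
        visterms = d2.argmin(axis=1)
        return np.bincount(visterms, minlength=vocabulary.shape[0]).astype(float)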

Furthermore, to provide new insights into the analogy between the bag-of-visterms representation and text, we conducted a study of the sparsity, co-occurrence, and discriminative power of visterms, which complements and extends the initial work of [14] on a different media source.
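
For instance, statistics of this kind can be read directly off the image-by-visterm count matrix; the sketch below (variable names are ours) computes the sparsity, the visterm co-occurrence matrix, and the document frequency of each visterm:

    import numpy as np

    def visterm_statistics(counts):
        # counts: (n_images, n_visterms) bag-of-visterms matrix
        sparsity = (counts == 0).mean()         # fraction of zero entries
        present = (counts > 0).astype(float)
        cooccurrence = present.T @ present      # visterm-visterm co-occurrence
        doc_freq = present.sum(0)               # images containing each visterm
        return sparsity, cooccurrence, doc_freq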

We presented a novel approach for scene classification, based on probabilistic latent space models [6], which have been successful in text modeling, to build scene representations beyond the bag-of-visterms. Latent space models capture co-occurrence information between elements in a collection of discrete data that simpler representations usually cannot, and make it possible to address issues related to synonymy (different visterms may represent the same scene type) and polysemy (the same visterm may represent different scene types in different contexts), both of which can be encountered in scene classification. Through extensive experiments, we showed that Probabilistic Latent Semantic Analysis (PLSA) allows the extraction of a compact, discriminant representation for accurate scene classification that outperforms global scene representations and remains competitive with recently proposed approaches. This compact representation is especially robust when labeled training data is scarce and, since labeling is a time-consuming task, allows for a greater re-usability of our framework.
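
One plausible classification pipeline along these lines, sketched below purely as an illustration (scikit-learn stands in for the TORCH tools actually used, and plsa/fold_in are the hypothetical helpers from the sketches above): keep P(w|z) fixed, estimate the aspect mixture P(z|d) of each image by "folding in" its bag-of-visterms, and train an SVM on those low-dimensional aspect vectors:

    import numpy as np
    from sklearn.svm import SVC

    def fold_in(counts_new, p_w_z, n_iter=50, seed=0):
        # estimate P(z|d) for an unseen image, holding P(w|z) fixed
        # (the usual PLSA fold-in heuristic)
        rng = np.random.default_rng(seed)
        p_z_d = rng.random(p_w_z.shape[0])
        p_z_d /= p_z_d.sum()
        for _ in range(n_iter):
            post = p_z_d[:, None] * p_w_z        # (n_aspects, n_words)
            post /= post.sum(0, keepdims=True) + 1e-12
            p_z_d = (post * counts_new[None, :]).sum(1)
            p_z_d /= p_z_d.sum() + 1e-12
        return p_z_d

    # hypothetical usage: train_pzd is P(z|d) from plsa() on the training
    # images, labels their scene classes, test_counts a new image's
    # bag-of-visterms
    # clf = SVC(kernel="rbf").fit(train_pzd, labels)
    # pred = clf.predict(fold_in(test_counts, p_w_z)[None, :])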

We proposed a novel approach for scene ranking and clustering based on the PLSA formulation. We show that PLSA is able to automatically capture meaningful scene aspects from data, within which scene similarity is evident; this makes our PLSA-derived representation useful for exploring the scene structure of an image collection, turning it into a tool with potential for the visualization, organization, browsing, and annotation of images in large collections.
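
A simple way to rank images in aspect space, again only a sketch under assumed names: compare the PLSA aspect distribution of a query image with those of the collection by cosine similarity.

    import numpy as np

    def rank_by_aspects(query_pzd, corpus_pzd):
        # corpus_pzd: (n_images, n_aspects) matrix of P(z|d) vectors
        # returns image indices sorted from most to least similar
        q = query_pzd / (np.linalg.norm(query_pzd) + 1e-12)
        c = corpus_pzd / (np.linalg.norm(corpus_pzd, axis=1, keepdims=True) + 1e-12)
        return np.argsort(-(c @ q))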

We investigated the application of text retrieval methods, namely methods for document similarity analysis, to the problem of scene classification. We also inspected the properties of the visterms and compared them to those of words in text documents.

Software

This project was developed using C++ and MATLAB. The clustering and classification methods used are part of the TORCH machine learning library.