Abstract:
Visual attention is a mechanism of human perception that selects
relevant regions of a scene and passes them on to higher-level
processing such as object recognition. This enables humans to act
effectively in their environment despite the complexity of the
perceivable sensory data. Computational vision systems face the same
problem as humans: a large amount of information has to be processed,
and if the processing is to be efficient, possibly even real-time in
robotic applications, the order in which a scene is investigated has
to be determined intelligently.
A promising approach to achieve this is the use of computational
attention systems that simulate human visual attention.
This thesis introduces the biologically motivated computational
attention system VOCUS (Visual Object detection with a CompUtational
attention System) that detects regions of interest in images. It
operates in two modes: an exploration mode, in which no task is
provided, and a search mode, in which a target is specified. In exploration
mode, regions of interest are defined by strong contrasts (e.g. color
or intensity contrasts) and by the uniqueness of a feature. For
example, a black sheep is salient in a flock of white sheep. In
search mode, the system uses previously learned information about a
target object to bias the saliency computations toward the
target. Various experiments show that the target is found with
fewer than three fixations on average, that usually fewer than
five training images suffice to learn the target information, and that
the system is largely robust to changes in viewpoint and
illumination.
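The two modes can be illustrated with a minimal sketch. It is not the
VOCUS implementation: the 1/sqrt(m) uniqueness weighting (m being the
number of local maxima in a feature map), the toy 3x3 maps, the weight
values, and all function names are assumptions chosen only to show the
idea that a unique feature wins in exploration mode, while learned
weights shift the focus in search mode.

```python
from math import sqrt

def local_maxima(fmap, threshold=0.5):
    """Count cells above `threshold` that exceed all of their 4-neighbours."""
    rows, cols = len(fmap), len(fmap[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            v = fmap[r][c]
            if v <= threshold:
                continue
            neighbours = [fmap[rr][cc]
                          for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                          if 0 <= rr < rows and 0 <= cc < cols]
            if all(v > n for n in neighbours):
                count += 1
    return count

def uniqueness_weight(fmap, threshold=0.5):
    """Assumed weighting: divide the map by sqrt(number of local maxima),
    so maps with a single peak keep their strength and crowded maps fade."""
    m = local_maxima(fmap, threshold)
    if m == 0:
        return [row[:] for row in fmap]
    return [[v / sqrt(m) for v in row] for row in fmap]

def combine(maps, weights=None):
    """Weighted sum of equally sized maps; all weights are 1 without a target."""
    if weights is None:
        weights = [1.0] * len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps))
             for c in range(cols)] for r in range(rows)]

def most_salient(smap):
    """Return the (row, col) of the first cell holding the maximum value."""
    cells = ((r, c) for r in range(len(smap)) for c in range(len(smap[0])))
    return max(cells, key=lambda rc: smap[rc[0]][rc[1]])

# Toy scene: one dark object in the centre, four bright objects in the corners.
dark = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
bright = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
maps = [uniqueness_weight(dark), uniqueness_weight(bright)]

# Exploration mode: no target, so the unique (dark) object is most salient.
exploration = combine(maps)
print(most_salient(exploration))  # (1, 1)

# Search mode: hypothetical learned weights favour the "bright" feature.
search = combine(maps, weights=[0.2, 3.0])
print(most_salient(search))  # (0, 0), one of the bright corners
```

In the exploration run, the single dark peak keeps its full strength
while the four bright peaks are each halved, which mirrors the black
sheep example. In the search run, the assumed weights stand in for
target information that would normally be learned from training images.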
Furthermore, we demonstrate how VOCUS benefits from additional sensor
data: we apply the system to depth and reflectance data from a 3D
laser scanner and show the advantages that the laser modes provide. By
fusing the data of both modes, we demonstrate how the system can
take distinct object properties into account and how considering
different data increases its flexibility. Finally, the regions
of interest provided by VOCUS serve as input for a classifier that
recognizes the object in the detected region. We show how and in which
cases the classification is sped up and how the detection quality is
improved by the attentional front end. This approach is especially
useful when many object classes have to be considered, a situation
that occurs frequently in robotics.
VOCUS provides a powerful approach to improve existing vision systems
by concentrating computational resources on regions that are more
likely to contain relevant information. The more the complexity and
power of vision systems increase in the future, the more they will
benefit from an attentional front end like VOCUS.