Abstract
We focus on the problem of vision-based scene understanding, i.e. “lifting” a scene
which is observed by visual means across time, to a symbolic representation that can be
processed by a computational system. We are interested in dynamic indoor scenes, in
which humans purposefully interact with their environment. Existing approaches have
mostly performed scene understanding through coarse modelling of the observed
processes, because more detailed modelling is computationally demanding and is
difficult to integrate with computer vision methods.
We suggest that it is now feasible to incorporate detailed scene modelling that
integrates easily with computer vision techniques and copes efficiently with the
associated computational requirements. With respect to scene understanding, we
are in a position to model and simulate the process of image acquisition through 3D rendering
(appearance), and the dynamics of the observed processes through physics simulation
(behavior). Thus, we identify 3D rendering and physics simulation as two significant processes
towards scene understanding. We propose combining the simulation capabilities
of these tools with powerful optimization methods, in order to yield effective inference
tools for scene understanding.
More specifically, we consider the process of scene understanding as an optimization
problem. We design parametric models that describe what can take place in a dynamic
scene and how this can be observed by visual means. We define these parameters to
constitute the domain of the optimization problem. Optimization is decoupled from modelling
and is performed in a hypothesize-and-test framework which is implemented based
on black-box optimization techniques. The outcome of the optimization is the instance
of the parametric models which best “explains” the observations. In the context
of this work, the tested hypotheses agree with the laws of physics because they
originate from physics simulators. The compatibility of every hypothesis with the actual observations
of the scene is evaluated through 3D rendering. Thus, our proposal focuses on
three points: (a) forward modelling of the scene, (b) incorporation of physics simulation
and (c) exploitation of black-box optimization methods.
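
The hypothesize-and-test loop described above can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the framework developed in this work: the functions `simulate`, `render`, `discrepancy` and `black_box_minimize` are hypothetical names, the "scene" is a ball thrown upward with an unknown initial speed, and a simple random local search stands in for the black-box optimizer.

```python
import random


def simulate(params, steps=20, dt=0.05, g=9.81):
    """Physics stand-in (assumed toy model): heights of a ball thrown
    upward from height 0 with initial speed params[0]."""
    height, speed = 0.0, params[0]
    trajectory = []
    for _ in range(steps):
        speed -= g * dt        # semi-implicit Euler: update speed first
        height += speed * dt   # then position
        trajectory.append(height)
    return trajectory


def render(trajectory, pixels_per_metre=100.0):
    """Rendering stand-in (assumed): map each height to an image coordinate."""
    return [h * pixels_per_metre for h in trajectory]


def discrepancy(params, observations):
    """Score one hypothesis: simulate it, render it, compare with the data."""
    predicted = render(simulate(params))
    return sum((p - o) ** 2 for p, o in zip(predicted, observations))


def black_box_minimize(objective, lower, upper, iterations=2000, seed=0):
    """Random local search. It only evaluates `objective` and never looks
    inside it, so modelling stays decoupled from optimization."""
    rng = random.Random(seed)
    best = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    best_score = objective(best)
    step = [(hi - lo) / 4 for lo, hi in zip(lower, upper)]
    for _ in range(iterations):
        # Hypothesize: perturb the best parameters, clamped to the domain.
        candidate = [
            min(hi, max(lo, b + rng.gauss(0.0, s)))
            for b, s, lo, hi in zip(best, step, lower, upper)
        ]
        # Test: keep the hypothesis only if it explains the data better.
        score = objective(candidate)
        if score < best_score:
            best, best_score = candidate, score
        else:
            step = [s * 0.99 for s in step]  # shrink the search radius
    return best, best_score


# Synthetic observations generated from a known ("hidden") initial speed.
true_speed = 3.0
observations = render(simulate([true_speed]))

estimate, score = black_box_minimize(
    lambda p: discrepancy(p, observations), lower=[0.0], upper=[10.0]
)
```

In this sketch the initial speed is never observed directly; it is recovered only because the simulated physics ties it to the observed positions. Replacing `simulate`, `render` or `black_box_minimize` with a more capable module leaves the rest of the loop unchanged.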
We have developed a computational framework which, based on the above, performs
aspects of 3D scene understanding. We present this framework and its application to the
problems of 3D tracking and motion estimation. We emphasize the necessity of incorporating
physics. More specifically, we show that by acknowledging that visual
observations capture physical phenomena governed by the laws of physics, we can even perform
inference on initially “hidden” parameters. That is, we can estimate parameters
that were not directly observable before incorporating physics, and which can be
recovered only by attributing observations to side-effects of physical processes.
The proposed computational framework has been employed to solve problems that
range from tracking a single object to tracking two hands interacting with many objects,
in 3D and using different visual modalities and camera arrangements. Through a
series of experiments we show how important it is to incorporate computer graphics and
physics processes in 3D scene understanding. These processes were successfully used as
black-box simulation tools, and their inherent complexity has not hindered their integration
with computer vision processes, thanks to the design choice of employing black-box optimization.
We were also able to show that the proposed framework exhibits a favorable
scalability profile when applied to domains of increasing complexity. Through careful
design, the invocation of otherwise expensive simulations can be performed so efficiently
that interactive processing frame rates are achieved. All of the above argues for a modular
computational solution to 3D scene understanding problems with a clear potential for
improvement or generalization: substituting parts with better or more general modules
automatically improves the entire framework.