Abstract
Hand pose estimation and tracking is a problem of both theoretical and practical
interest. It is a challenging problem that, despite the significant effort devoted
to it, has not been solved in its full generality.
This thesis presents methods to track the position, orientation and full articulation
of human hands in various everyday scenarios.
The investigated scenarios include tracking one or two hands, either in isolation
or in interaction with the environment. The design choices for the presented
methods concern the type of input, the selection of appropriate visual cues, the
way these cues are synthesized and evaluated, and the optimization algorithms used
to solve the formulated optimization problems. All scenarios use markerless visual
observations of the scene as input. We explore the visual cues of skin color,
edges, depth maps, and the visual hull. These observations can come either from a
network of cameras or from an RGB-D sensor. The choice of input type partially
dictates the visual cues that are employed.
We follow a model-based approach to the problem, formulating the pose estimation
task for each frame as an optimization problem. The search space of this problem
is defined by the adopted representation of the hand kinematics. For the case of a
single hand, the search space is the set of kinematic parameters, whereas for
hand-object or hand-hand interaction, it is augmented to include all the tracked
entities. This joint consideration, while resulting in optimization problems with
tens of parameters, has the advantage that the interaction between the tracked
objects can be effortlessly modeled and evaluated. Temporal continuity is
exploited by initializing the search for each frame near the solution for the
previous frame.
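The per-frame formulation with temporal continuity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the names `track` and `toy_objective`, the two-parameter pose, the quadratic discrepancy, and the plain Gaussian sampling that stands in for the actual optimizer.

```python
import random

def track(frames, objective, initial_pose, spread=0.1, samples=200, seed=0):
    """Estimate one pose per frame, starting each search at the previous solution."""
    rng = random.Random(seed)
    pose = list(initial_pose)
    trajectory = []
    for observation in frames:
        # Temporal continuity: the previous solution is the first hypothesis.
        best, best_score = pose, objective(pose, observation)
        for _ in range(samples):
            # Hypothetical stand-in search: sample candidates near the previous pose.
            candidate = [p + rng.gauss(0.0, spread) for p in pose]
            score = objective(candidate, observation)
            if score < best_score:
                best, best_score = candidate, score
        pose = best
        trajectory.append(pose)
    return trajectory

def toy_objective(pose, observation):
    """Hypothetical discrepancy between a pose hypothesis and an observation."""
    return sum((p - o) ** 2 for p, o in zip(pose, observation))

# Synthetic "observations" that drift slowly from frame to frame.
frames = [[0.05 * t, -0.03 * t] for t in range(1, 11)]
trajectory = track(frames, toy_objective, initial_pose=[0.0, 0.0])
```

In an actual tracker the pose vector would hold the kinematic parameters of all tracked entities, and the objective would compare rendered hypotheses against the visual cues.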
Joint modeling of the observed entities in the scene makes it straightforward to
treat scenarios of complex interaction between these entities. For the case of
hand-object interaction, we show how the observed occlusions can provide useful
information rather than being an obstacle. For the case of two hands in strong
interaction, to the best of our knowledge, the presented results involve the most
complex hand-hand interaction attempted so far in the relevant literature.
For the task of optimizing the objective functions that result from the adopted
formulation of the problem, we use black-box optimization algorithms. Specifically,
variants of Particle Swarm Optimization (PSO) are employed in most scenarios.
PSO is an evolutionary optimization algorithm that is derivative-free and easily
parallelizable. It suits our task because it copes well with multi-modal,
non-differentiable objective functions. A novel evolutionary optimization algorithm is
also presented in this thesis, and applied to two of the examined scenarios. This
algorithm exploits the useful properties of quasi-random sampling, as well as the
power of evolutionary computing.
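As a rough illustration of the derivative-free search described above, the following is a minimal sketch of a textbook global-best PSO. It is not the specific variant used in the thesis. The function names, the coefficient values (common constriction-style defaults), the clamping to bounds, and the toy objective are all assumptions.

```python
import math
import random

def pso(objective, bounds, particles=32, generations=60, seed=0,
        w=0.7298, c1=1.4962, c2=1.4962):
    """Minimize `objective` inside box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]                    # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(generations):
        for i in range(particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia plus attraction to the personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            # Only objective values are used; no derivatives are required.
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_objective(x):
    """Hypothetical multi-modal objective, non-differentiable at the origin."""
    return sum(abs(v) + 0.3 * (1.0 - math.cos(8.0 * v)) for v in x)

bounds = [(-2.0, 2.0)] * 4
best, best_val = pso(toy_objective, bounds)
```

Each particle's objective evaluation within a generation is independent of the others, which is what makes this family of methods easy to parallelize.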
The computational steps of all presented methods are carefully designed so that
they consist of parallelizable computations. This makes it possible to exploit
modern parallel hardware such as GPUs, resulting in practical systems that
achieve real-time or interactive frame rates.