Abstract
Markerless articulated human body pose recovery and tracking is a challenging
problem of great importance, with strong theoretical and practical implications. The
recent introduction of low-cost depth cameras triggered a number of interesting new
works, pushing forward the state of the art. However, despite the remarkable
progress, estimating the body pose in realistic, complex scenarios is still an open
research task.
In this thesis we propose and develop a markerless model-based method to recover
and track the full body pose, from RGB-D sequences, in arbitrary scenarios where
users can freely enter or leave the scene, move, act and interact with other users or
the environment. Our research focuses mainly on the problem of handling occlusions,
either across body parts belonging to the same user, or across different users. At the
same time, we attempt to tackle additional important issues encountered in the
problem at hand, such as dealing with the large diversity of human bodies or the
unconstrained initialization of tracking.
Towards this goal, we introduce the novel concept of Top View Reprojection (TVR)
of cylindrical objects, which uniquely defines the pose of a cylinder based on certain
quantitative appearance properties of its Top View, i.e. the view aligned with the
cylinder's main axis. Based on this, the problem of estimating the pose of a cylindrical
object becomes that of estimating the corresponding Top View. Interestingly, the
developed formulation of TVR remains unaffected by factors such as noisy or
missing data.
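The geometric intuition behind viewing a cylinder along its main axis can be illustrated with a small sketch. This is not the TVR formulation developed in the thesis; it only shows, under simplifying assumptions (ideal noise-free surface points, a known point on the axis), that surface points collapse onto a circle when seen along the true axis and spread out when seen along a wrong one. All function names below are hypothetical.

```python
import math
import random


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def top_view_radii(points, center, axis):
    """Distance of each point from the candidate axis through `center`.

    These are the radii the points would have in a view aligned with `axis`.
    """
    a = normalize(axis)
    radii = []
    for p in points:
        d = tuple(pc - cc for pc, cc in zip(p, center))
        along = sum(dc * ac for dc, ac in zip(d, a))
        perp = tuple(dc - along * ac for dc, ac in zip(d, a))
        radii.append(math.sqrt(sum(c * c for c in perp)))
    return radii


def radial_spread(points, center, axis):
    """Standard deviation of the top-view radii (an illustrative score only)."""
    radii = top_view_radii(points, center, axis)
    mean = sum(radii) / len(radii)
    return math.sqrt(sum((r - mean) ** 2 for r in radii) / len(radii))


def sample_cylinder(radius, height, n, rng):
    """Sample n points on the lateral surface of a cylinder along the z-axis."""
    pts = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        z = rng.uniform(0.0, height)
        pts.append((radius * math.cos(t), radius * math.sin(t), z))
    return pts


rng = random.Random(0)
points = sample_cylinder(radius=0.1, height=0.5, n=200, rng=rng)
spread_true = radial_spread(points, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
spread_tilted = radial_spread(points, (0.0, 0.0, 0.0), (1.0, 0.0, 1.0))
```

In this toy setting the spread is essentially zero for the true axis and clearly larger for the tilted one, which is why appearance properties of the view along the axis can serve as a cue for the cylinder's pose.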
Capitalizing on the TVR concept, we represent the human body by a cylinder-based
model, consisting of 11 body parts. The body is uniformly treated within the TVR
framework following a local optimization technique; body parts, represented as
cylinders, are examined in a top-to-bottom sequential order, starting from the head.
For each body part a set of hypotheses is generated and tracked over time by a Particle
Filter (PF). To evaluate each hypothesis, we employ a novel metric that considers the
virtual Top View of the corresponding body part. The latter, in conjunction with
regular depth information, effectively copes with difficult and ambiguous cases, such
as severe inter- and intra-person occlusions.
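The hypothesize-and-track loop described above can be sketched as a generic Particle Filter step. This is a minimal illustration and not the thesis implementation: the state is reduced to a single joint angle, the motion model is assumed to be a Gaussian random walk, and `score` is a hypothetical stand-in for the Top View based metric.

```python
import math
import random


def particle_filter_step(particles, score, motion_std, rng):
    """One predict / weight / resample cycle over scalar pose hypotheses."""
    # Predict: diffuse every hypothesis (assumed random-walk motion model).
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: evaluate every hypothesis with the supplied scoring function.
    weights = [score(p) for p in predicted]
    if sum(weights) <= 0.0:
        # Fall back to uniform weights if every hypothesis scores zero.
        weights = [1.0] * len(predicted)
    # Resample: draw a new set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Point estimate of the pose: the mean of the hypotheses."""
    return sum(particles) / len(particles)


rng = random.Random(0)
true_angle = 0.8  # hypothetical ground-truth joint angle, in radians


def score(angle):
    """Hypothetical likelihood: Gaussian around the true angle."""
    return math.exp(-0.5 * ((angle - true_angle) / 0.1) ** 2)


# Unconstrained initialization: hypotheses spread over the whole angle range.
particles = [rng.uniform(-math.pi / 2, math.pi / 2) for _ in range(500)]
for _ in range(20):
    particles = particle_filter_step(particles, score, 0.05, rng)
```

In a sequential scheme of the kind outlined above, one such filter would run per body part, with the estimate for a parent part constraining the hypotheses generated for the next one.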
For evaluation purposes, we conducted several series of experiments addressing
realistic scenarios of gradually increasing difficulty, involving a varying number of
users interacting with each other. We further compared the performance of the
proposed method against that of state-of-the-art approaches using public or
self-collected datasets with ground truth annotation. The presented quantitative and
qualitative results attest to the effectiveness of our approach.