Identifier |
000429227 |
Title |
Unsupervised co-segmentation of actions in motion capture data and videos |
Alternative Title |
Μη-εποπτευόμενη συν-τμηματοποίηση δράσεων σε ακολουθίες δεδομένων κίνησης και εικόνων |
Author |
Παπουτσάκης, Κωνσταντίνος Ε |
Thesis advisor |
Αργυρός, Αντώνιος |
Reviewer |
Τραχανιάς, Παναγιώτης |
Τζιρίτας, Γεώργιος |
Ζαμπούλης, Ξενοφών |
Κοσμόπουλος, Δημήτριος |
Παναγιωτάκης, Κωνσταντίνος |
Δάρας, Πέτρος |
Abstract |
We focus on the problem of temporal co-segmentation of actions in sequences of 3D
motion capture data and in image sequences (videos). Given two data sequences
representing action-relevant information, the goal is to detect and temporally
co-segment all pairs of matching sub-sequences (temporal segments), where the
segments of a pair represent a common (identical or similar) action or sub-action. This
is an important and challenging problem in the research communities of Computer
Vision, Pattern Recognition and Machine Learning which, despite the research efforts
devoted to its solution, remains unsolved in its full generality.
We investigate the problem of interest by following a data-driven, unsupervised
approach, where no a priori models or labels of the actions represented in the
sequences are available. Various challenging scenarios and conditions are considered,
namely: (a) one or multiple actions are demonstrated by different subjects in each
sequence, (b) the number of common actions between the sequences may be
unknown, (c) the common actions may be located anywhere in the sequences, (d)
instances of the common action or sub-action can be of variable duration and of
different speed and execution style, and (e) actions may involve a single human or
multiple humans, generic objects, or even complex human-object interactions.
Two novel, efficient methodologies are proposed in this thesis to deal with this
problem. They are based on a stochastic optimization approach and a deterministic,
graph-based approach, respectively. Furthermore, we leverage the robust
performance of the proposed temporal action co-segmentation strategies to develop
a method that estimates the similarity of the original sequences and provides
meaningful arguments supporting this estimation, taking a step towards explainable
assessment of video and action similarity.
Specifically, two novel methods are introduced to perform temporal co-segmentation
between two sequences of motion capture data (3D/6D human skeletal data and/or
object pose data) or of RGB images. Each data sequence is treated as a multivariate
time-series for any of the data modalities. The first method discovers and co-segments
the N best pairs of common sub-sequences (commonalities) between the compared
time-series by minimizing a cost function that expresses their non-linear temporal
alignment cost. The cost is quantified using the Dynamic Time Warping (DTW) method
and its minimization is treated as a stochastic optimization problem that is solved
using Canonical Particle Swarm Optimization (PSO). The PSO method relies on
evolutionary search strategies to minimize the DTW-based cost function and is applied
iteratively in order to discover the N best commonalities. The second method treats
temporal action co-segmentation as a search problem on a graph defined on the
matrix of the pair-wise Euclidean distances (EDM) of the frame-wise features between
the two compared time-series. An efficient graph-based search algorithm is used for
solving the problem of discovering N commonalities. The number N of the best
commonalities to be discovered for two time-series may be unknown or given a priori.
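The DTW-based alignment cost that the two methods build on can be illustrated with a minimal sketch (this is an illustrative stand-in, not the thesis implementation; the function name and the toy sequences below are invented for the example). The pairwise Euclidean distance matrix `d` computed here is also the EDM on which the graph-based method searches for low-cost paths, while in the first method the boundaries of the candidate sub-sequences being compared would be the variables optimized by PSO:

```python
import numpy as np

def dtw_cost(x, y):
    """DTW alignment cost between two multivariate time-series.

    x, y: arrays of shape (n_frames, n_features). The pairwise Euclidean
    distance matrix computed below corresponds to the EDM used by the
    graph-based co-segmentation method.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # (n, m) EDM
    n, m = d.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # standard DTW recurrence: match frame i-1 with frame j-1,
            # extending the cheapest of the three admissible predecessors
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j],
                                              acc[i, j - 1],
                                              acc[i - 1, j - 1])
    return acc[n, m]

# A time-warped copy (a repeated frame, i.e., slower execution) still aligns
# at zero cost, which is why DTW tolerates variable duration and speed.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
zero = dtw_cost(a, b)
```

In the stochastic-optimization method, this cost (normalized by alignment path length) would be evaluated for each candidate pair of sub-sequences proposed by the particle swarm, and iterative re-optimization with already-found segments excluded would yield the N best commonalities.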
Both methods have been extensively tested using pairs of image sequences (videos)
or pairs of sequences containing 3D motion capture data. Various types of action
scenarios have been considered such as physical exercises, daily living activities and
human-object interaction, while quantitative experiments demonstrate the
effectiveness of the proposed methods in comparison to existing, state-of-the-art
approaches.
In addition, a novel method is proposed for fine-grained similarity assessment of two
actions in videos that capitalizes on the effectiveness of temporal co-segmentation
between the trajectories of the tracked human joints and/or the tracked objects and
their semantic relatedness. A graph matching approach based on Graph Edit Distance
is employed to combine the object-level features and semantic information, towards
computing spatio-temporal correspondences between objects across videos, provided
that these objects are semantically related, interact similarly, or both.
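The graph-matching idea can be sketched as follows (a minimal illustration, not the thesis pipeline; the toy graphs, semantic classes, and cost function are invented for the example). Each video is abstracted as a graph whose nodes are tracked entities and whose edges encode interactions; a Graph Edit Distance with a semantics-aware substitution cost, here via NetworkX's `graph_edit_distance`, then quantifies their dissimilarity:

```python
import networkx as nx

# Hypothetical "object graphs" for two videos: nodes are tracked entities
# (human joints, objects) carrying a semantic class; edges mark interactions.
g1 = nx.Graph()
g1.add_node("hand", cls="hand")
g1.add_node("cup", cls="cup")
g1.add_edge("hand", "cup")  # the hand manipulates the cup

g2 = nx.Graph()
g2.add_node("hand", cls="hand")
g2.add_node("mug", cls="cup")  # different object instance, same class
g2.add_edge("hand", "mug")

def node_subst_cost(a, b):
    # substituting one node for another is free only when their semantic
    # classes match; otherwise it costs as much as a deletion + insertion
    return 0.0 if a["cls"] == b["cls"] else 2.0

ged = nx.graph_edit_distance(g1, g2, node_subst_cost=node_subst_cost)
```

Because the two graphs describe the same interaction between semantically related objects, the edit distance here is zero; mismatched classes or missing interactions would each add edit cost, and the contributing edit operations are what make the resulting similarity score explainable.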
The proposed framework aspires to take an important step towards explainable
assessment of video and action similarity. It is evaluated on publicly available datasets
on the tasks of action classification, action matching and action-based ranking in
triplets of videos and is shown to compare favorably to state-of-the-art unsupervised
and supervised learning methods.
Keywords: Temporal action co-segmentation, video similarity, temporal alignment,
pairwise action ranking, action matching, action recognition, Graph Edit Distance,
Particle Swarm Optimization.
|
Language |
English |
Subject |
Action matching |
|
Action recognition |
|
Graph Edit Distance |
|
Pairwise action ranking |
|
Particle Swarm Optimization |
|
Temporal action co-segmentation |
|
Temporal alignment |
|
Video similarity |
|
Αναγνώριση δράσεων / δραστηριοτήτων |
|
Αναζήτηση / ανάκτηση ακολουθιών εικόνων |
|
Αντιστοίχιση / ταίριασμα δράσεων |
|
Κατάταξη ομοιότητας δράσεων κατά ζεύγη |
|
Ομοιότητα ακολουθιών εικόνων |
|
Ομοιότητα ακολουθιών δεδομένων καταγραφής κίνησης |
|
Χρονική συντμηματοποίηση δράσεων / δραστηριοτήτων |
Issue date |
2020-03-27 |
Collection |
School/Department--School of Sciences and Engineering--Department of Computer Science--Doctoral theses |
Type of Work--Doctoral theses |