E-Locus - Institutional Repository of the University of Crete

Doctoral theses

Identifier

000429227

Title

Unsupervised co-segmentation of actions in motion capture data and videos

Alternative Title

Μη-εποπτευόμενη συν-τμηματοποίηση δράσεων σε ακολουθίες δεδομένων κίνησης και εικόνων (Unsupervised co-segmentation of actions in sequences of motion data and images)

Author

Παπουτσάκης, Κωνσταντίνος Ε

Thesis advisor

Αργυρός, Αντώνιος

Reviewer

Τραχανιάς, Παναγιώτης
Τζιρίτας, Γεώργιος
Ζαμπούλης, Ξενοφών
Κοσμόπουλος, Δημήτριος
Παναγιωτάκης, Κωνσταντίνος
Δάρας, Πέτρος

Abstract

We focus on the problem of temporal co-segmentation of actions in sequences of 3D motion capture data and in image sequences (videos). Given two data sequences representing action-relevant information, the goal is to detect and temporally co-segment all pairs of matching sub-sequences (temporal segments), where the segments of a pair represent a common (identical or similar) action or sub-action. This is an important and challenging problem in Computer Vision, Pattern Recognition and Machine Learning which, despite the research effort devoted to it, remains unsolved in its full generality. We investigate the problem following a data-driven, unsupervised approach, where no a priori models or labels of the actions represented in the sequences are available. Several challenging scenarios and conditions are considered: (a) one or multiple actions are demonstrated by different subjects in each sequence, (b) the number of common actions between the sequences may be unknown, (c) the common actions may be located anywhere in the sequences, (d) instances of a common action or sub-action can vary in duration, speed and execution style, and (e) actions may involve a single human or multiple humans, generic objects or even complex human-object interactions. Two novel, efficient methodologies are proposed in this thesis to deal with this problem. They are based on a stochastic optimization approach and a deterministic, graph-based approach, respectively. Furthermore, we leverage the robust performance of the proposed temporal action co-segmentation strategies to develop a method that estimates the similarity of the original sequences and provides meaningful arguments supporting this estimate, making a step towards explainable assessment of video and action similarity.
Specifically, two novel methods are introduced to perform temporal co-segmentation between two sequences of motion capture data (3D/6D human skeletal data and/or object pose data) or of RGB images. Whatever the data modality, each sequence is treated as a multivariate time-series. The first method discovers and co-segments the N best pairs of common sub-sequences (commonalities) between the compared time-series by minimizing a cost function that expresses their non-linear temporal alignment cost. The cost is quantified using Dynamic Time Warping (DTW), and its minimization is treated as a stochastic optimization problem that is solved using Canonical Particle Swarm Optimization (PSO). The PSO method relies on evolutionary search strategies to minimize the DTW-based cost function and is applied iteratively in order to discover the N best commonalities. The second method treats temporal action co-segmentation as a search problem on a graph defined on the Euclidean distance matrix (EDM), that is, the matrix of pairwise Euclidean distances between the frame-wise features of the two compared time-series. An efficient graph-based search algorithm is used to discover the N commonalities. The number N of best commonalities to be discovered for two time-series may be unknown or given a priori. Both methods have been extensively tested using pairs of image sequences (videos) or pairs of sequences containing 3D motion capture data. Various types of action scenarios have been considered, such as physical exercises, daily living activities and human-object interaction, and quantitative experiments demonstrate the effectiveness of the proposed methods in comparison to existing, state-of-the-art approaches.
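As a rough illustration of the two ingredients named above, the following Python sketch builds the matrix of pairwise Euclidean distances between the frames of two time-series and evaluates a DTW alignment cost for one candidate pair of sub-sequences. It is not the thesis implementation: the function names, the parametrization of a candidate by a start index and a length in each sequence, and the normalisation of the cost by the combined segment length are all assumptions made for this example. In the described methods, a search procedure (PSO or a graph-based search) would be responsible for choosing which candidates to evaluate; that part is not shown here.

```python
import math


def pairwise_distances(x, y):
    """Matrix of Euclidean distances between every frame of x and every frame of y.

    x and y are sequences of equal-dimensional feature vectors (one per frame).
    """
    return [[math.dist(a, b) for b in y] for a in x]


def dtw_cost(dist):
    """Cumulative cost of the classic DTW alignment over a full distance matrix."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = dist[i - 1][j - 1] + best_prev
    return acc[n][m]


def segment_cost(edm, start_x, len_x, start_y, len_y):
    """DTW cost of aligning x[start_x:start_x+len_x] with y[start_y:start_y+len_y].

    The division by (len_x + len_y) is an assumed normalisation so that
    candidates of different lengths are comparable; it is not taken from the thesis.
    """
    block = [row[start_y:start_y + len_y] for row in edm[start_x:start_x + len_x]]
    return dtw_cost(block) / (len_x + len_y)


# Toy one-dimensional example: both sequences contain the ramp 1, 2, 3,
# executed at a different speed in y.
x = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0]]
y = [[5.0], [1.0], [2.0], [2.0], [3.0], [5.0]]
edm = pairwise_distances(x, y)

matching = segment_cost(edm, 2, 3, 1, 4)      # x[2:5] against y[1:5]
non_matching = segment_cost(edm, 0, 3, 0, 3)  # x[0:3] against y[0:3]
print(matching, non_matching)
```

In this toy case the matching candidate aligns perfectly despite the different durations, so its cost is lower than that of the non-matching candidate. Slicing the precomputed distance matrix avoids recomputing frame distances for every candidate.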
In addition, a novel method is proposed for fine-grained similarity assessment of two actions in videos. It capitalizes on the effectiveness of temporal co-segmentation between the trajectories of the tracked human joints and/or the tracked objects, together with the semantic relatedness of those objects. A graph matching approach based on Graph Edit Distance is employed to combine the object-level features and semantic information in order to compute spatio-temporal correspondences between objects across videos when these objects are semantically related, when they interact similarly, or both. The proposed framework aspires to take an important step towards explainable assessment of video and action similarity. It is evaluated on publicly available datasets on the tasks of action classification, action matching and action-based ranking in triplets of videos, and is shown to compare favorably to state-of-the-art unsupervised and supervised learning methods.
Keywords: Temporal action co-segmentation, video similarity, temporal alignment, pairwise action ranking, action matching, action recognition, Graph Edit Distance, Particle Swarm Optimization.

Language

English

Subject

Action matching

Action recognition

Graph Edit Distance

Pairwise action ranking

Particle Swarm Optimization

Temporal action co-segmentation

Temporal alignment

Video similarity

Action / activity recognition