Abstract
The recognition of human activities in video sequences represents a longstanding
objective within the domain of Computer Vision. This endeavor holds vast implications
across a diverse spectrum of applications, encompassing fields such as assistive
technologies and human-robot interactions, spanning both industrial and everyday
life contexts. In the most complex manifestation of the problem, we are dealing with
activities that may comprise (a) multiple constituent actions characterized by varying
temporal structures, (b) action groups that are hard to distinguish due to high similarity
in their characteristics, and (c) large portions of shared action sub-sequences. In this
setting, the overarching objective is robust human action recognition. This dissertation
proposes several supervised learning models and techniques for addressing the problem of
action recognition robustness, with a special interest in the challenge of disambiguating
between actions that exhibit similar appearance and motion characteristics, commonly
referred to as fine-grained actions.
We investigate fine-grained action recognition from two perspectives. As a first
direction, motivated by the ability of language to provide context to video data and the
on-going advancements in language models, we present three approaches that exploit
semantic ambiguity and distinctiveness of action labels to assist video action recognition
models. Our approaches exploit knowledge from large-scale text corpora to define
semantic similarities between the available action labels. These semantic similarities are
then utilized either as a means to strictly penalize model misclassifications into actions
with similar semantic context, or to define multi-granular action class associations based
on abstract or finer contextual relations of the lexical descriptions of the action labels.
Additionally, we present a flexible multi-granular temporal aggregation framework based
on the latter direction, which facilitates the learning of human action recognition models
under both single- and dual-dataset learning scenarios. This framework is particularly
advantageous when dealing with under-represented actions in human action/activity
recognition datasets, which is a common characteristic of fine-grained action classes. It
enables the models to learn meaningful distinctions even for actions with
limited data availability.
Our second set of contributions is motivated by the general observation that
actions, whether fine-grained or not, are closely tied to the changes they induce in
the states of scene elements. To capture this characteristic, we propose a novel
supervised approach, structured around the concept of task repetitiveness, for learning
representations from videos suitable for enriching the discrimination ability of action
recognition models, especially in the case of fine-grained actions. We also contribute a set
of datasets that aim to highlight and explore the characteristics of repetitive actions, and
the effect of exploiting task repetitiveness to enrich the general understanding of human
actions.
This dissertation introduces innovative model architectures that harness the semantic
relationships between human actions and their associated label annotations. It also
investigates the implications and attributes of task repetitiveness in the realm of human
action comprehension, incorporating a series of novel model designs and datasets
to support this exploration. A comprehensive evaluation of these methodologies is
conducted on established benchmarks and against contemporary state-of-the-art models.
The dissertation concludes by outlining prospective research avenues and highlighting
unresolved issues within the domain of human action understanding.