Abstract
Deep learning, a thriving field of machine learning, has witnessed an unprecedented revolution over the last decade. The powerful idea of hierarchical representation learning, combined with the abundance of data that the digital era effortlessly
provides, has led to breathtaking achievements in numerous scientific fields. Nevertheless, applications exist where a plethora of annotated training data is not
available due to privacy restrictions, annotation difficulties, or prohibitive costs.
Developing deep learning approaches that can be effective in such low-data regime
scenarios is still a largely open problem.
In this work we consider such a low-data regime scenario for the problem of
image classification, which is a fundamental problem of Computer Vision. In the
literature, this setting is also known as few-shot visual learning. In this case,
given only a very small set of annotated images representing the available categories
(e.g., even a single annotated image per category), the correct classification of an
unlabeled image set is required. A common approach, termed metric learning, is
to project both sets onto a space in which samples cluster according to their
categories, so that they can be classified using a similarity metric.
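As an illustration of this paradigm (an illustration only, not the exact pipeline developed in this work), a nearest-prototype classifier over normalized embeddings can be sketched as follows, assuming a generic embedding function `embed`:

```python
import torch
import torch.nn.functional as F

def classify_queries(embed, support_images, support_labels, query_images, num_classes):
    """Assign each query image to the category whose prototype (the mean of its
    few annotated support embeddings) is closest under cosine similarity."""
    with torch.no_grad():
        s = F.normalize(embed(support_images), dim=-1)  # [N_support, D]
        q = F.normalize(embed(query_images), dim=-1)    # [N_query, D]
    # One prototype per category, averaged over its support embeddings.
    prototypes = torch.stack(
        [s[support_labels == c].mean(dim=0) for c in range(num_classes)]
    )
    prototypes = F.normalize(prototypes, dim=-1)        # [num_classes, D]
    # Cosine similarity of every query to every prototype; predict the argmax.
    return (q @ prototypes.t()).argmax(dim=-1)          # [N_query]
```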
Following the metric learning paradigm, we propose a methodology that utilizes deep embedding functions to project the samples onto the embedding space.
To implement these embedding functions, we leverage the representational power
of vision transformers, a state-of-the-art deep learning architecture, amplified by
employing pre-trained self-supervised foundation models (an illustrative sketch is
given below). Undoubtedly, a few-shot learning algorithm should harness every bit
of available information from the annotated data to be effective under such a low-data
regime. Hence, instead of merely incorporating prior knowledge, encoded in the
embedding functions' parameters, we additionally exploit the information exchange
between those functions. Specifically, we conduct a case study that can be summarized
in two main questions: (i) Is an exchange of information between the embedding
functions beneficial for the problem at hand? (ii) How can this exchange of
information be established?
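For concreteness, such an embedding function could be instantiated from a publicly available self-supervised vision transformer. The sketch below uses the DINO ViT-S/16 checkpoint purely as an assumed example; it is not necessarily the exact backbone employed in this work.

```python
import torch

# Illustrative only: a pre-trained self-supervised ViT used as a frozen embedding
# function. The DINO ViT-S/16 checkpoint is an assumption made for this sketch.
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
backbone.eval()

def embed(images):
    """Map a batch of images [B, 3, 224, 224] to embedding vectors [B, 384]."""
    with torch.no_grad():
        return backbone(images)
```

A function of this kind could then play the role of `embed` in the nearest-prototype sketch above.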
In an attempt to answer these questions, we propose three main methods.
These are, namely, ParallelVits, ParallelVits+Encoder, and BlendedVits. The ParallelVits method serves as a performance baseline, since it restricts the
information flow between the embedding functions, whereas the other two methods enable information exchange by leveraging the flexibility of the vision transformer
architecture. Moreover, several hyper-parameters of the employed meta-learning
framework, the neural network architectures, and the aforementioned methods
have been put under scrutiny. The evaluation of our methods has yielded some
interesting findings as well as very promising experimental results, achieving
near state-of-the-art performance on the miniImageNet dataset.