Abstract
Speaker recognition is the process of automatically identifying a speaker
based on specific features extracted from the speech signal. A broad range
of applications relies on speaker recognition at its core, where the
presence of environmental noise in the speech signal typically impedes
correct decisions. An additional factor that makes recognizing a speaker
correctly more difficult is the limited amount of available training and
evaluation data.
Focusing on overcoming the above limitations, this dissertation is divided
into two main parts. In the first part, the problem of speaker recognition
is reduced to an equivalent classification problem. To this end, we develop
and study the performance of classification techniques based on the
framework of sparse representations, focusing on the task of speaker
identification with highly limited amounts of training and evaluation
data, in environments with high levels of noise. The main assumption
governing these techniques is that the speech signal to be identified, and
specifically the features extracted from it, can be expressed as a sparse
linear combination of the columns of an overcomplete matrix, often
referred to in the literature as a “dictionary”. The optimally estimated
sparse weights of the linear combination, the so-called sparse codes,
which are obtained as the solution of an optimization problem, are then
employed for the final identification of the speaker based on a minimum
reconstruction error criterion.
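The minimum-reconstruction-error rule can be sketched as follows. This is a minimal illustration of sparse-representation classification in general, not the dissertation's exact system: the ISTA solver, the regularization weight `lam`, and the per-class residual rule are illustrative assumptions.

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=200):
    """Solve min_x 0.5*||y - D x||^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - D.T @ (D @ x - y) / L      # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x

def src_classify(D, labels, y, lam=0.1):
    """Minimum-reconstruction-error rule: keep only each class's dictionary
    atoms and their sparse codes, and pick the class whose partial
    reconstruction of y has the smallest residual."""
    x = ista(D, y, lam)
    errs = {}
    for c in set(labels):
        m = np.array([l == c for l in labels])
        errs[c] = np.linalg.norm(y - D[:, m] @ x[m])
    return min(errs, key=errs.get)
```

Here the dictionary's columns are the (normalized) training feature vectors, grouped by speaker, as in the abstract's description.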
Extending the previous classification method based on sparse
representations, we study the efficiency of a discriminative dictionary
learning method. This method jointly estimates a dictionary built from the
training data together with an appropriate linear classifier. The
advantage of this approach is that it yields sparse codes with enhanced
discriminative capability. Extensive comparisons with probabilistic
models based on the hypothesis that the extracted speech features follow
a generalized Gaussian distribution, as well as with state-of-the-art
classification methods such as Gaussian mixture models and joint factor
analysis, revealed the superiority of the proposed method.
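One common way to formalize such a joint estimation is to stack a reconstruction term and a classification term and alternate between sparse coding and least-squares updates. The sketch below is in the spirit of discriminative (label-consistent) dictionary learning; the specific objective weights, the ridge-regularized updates, and the prediction rule are illustrative assumptions, not the dissertation's exact method.

```python
import numpy as np

def soft(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def ddl(Y, H, n_atoms, lam=0.1, gamma=1.0, n_outer=20, n_inner=50, seed=0):
    """Alternating minimization (heuristic) for
        min_{D,W,X} ||Y - D X||_F^2 + gamma*||H - W X||_F^2 + lam*||X||_1,
    where Y holds training features (columns) and H one-hot labels."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    W = np.zeros((H.shape[0], n_atoms))
    X = np.zeros((n_atoms, Y.shape[1]))
    for _ in range(n_outer):
        # Sparse coding on the stacked system [Y; sqrt(gamma) H], [D; sqrt(gamma) W]
        A = np.vstack([Y, np.sqrt(gamma) * H])
        B = np.vstack([D, np.sqrt(gamma) * W])
        L = np.linalg.norm(B, 2) ** 2
        for _ in range(n_inner):
            X = soft(X - B.T @ (B @ X - A) / L, lam / L)
        # Dictionary and classifier updates: ridge-regularized least squares
        G = X @ X.T + 1e-6 * np.eye(n_atoms)
        D = np.linalg.solve(G, X @ Y.T).T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        W = np.linalg.solve(G, X @ H.T).T
    return D, W, X

def predict(D, W, y, lam=0.1, n_iter=100):
    """Code y over D alone, then classify with the learned linear map W."""
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        x = soft(x - D.T @ (D @ x - y) / L, lam / L)
    return int(np.argmax(W @ x))
```

Because the classifier participates in the coding objective during training, the resulting codes are pushed toward being linearly separable, which is the enhanced discriminative capability the abstract refers to.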
The second part of this dissertation focuses on the use of low-rank
techniques as a powerful tool for extracting reliable features from a
speech signal. More specifically, we design a low-rank matrix recovery
technique that reconstructs the spectral regions of a speech signal that
are unreliable due to the presence of noise. The reconstruction of the
unreliable spectral regions is performed by adopting the Singular Value
Thresholding (SVT) algorithm, based on the assumption that the
logarithmic magnitude representation of a speech signal in the
time-frequency domain, obtained via the short-time Fourier transform
(STFT), is of low rank. A comparison against the widely used method of
sparse imputation, which is based on sparse representations, reveals that
the proposed approach produces more reliable features.
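The SVT iteration itself is simple to state: alternate a singular-value soft-thresholding step with a gradient step on the observed entries. A minimal sketch follows, where in the speech setting the "observed" entries would be the reliable time-frequency bins of the log-magnitude STFT; the defaults for the threshold `tau` and step size `delta` are illustrative choices, not the dissertation's settings.

```python
import numpy as np

def svt(M, mask, tau=None, delta=1.2, n_iter=500, tol=1e-4):
    """Singular Value Thresholding for matrix completion: recover a
    low-rank X agreeing with M on the observed entries (mask == 1)."""
    m, n = M.shape
    if tau is None:
        tau = 5 * np.sqrt(m * n)            # heuristic threshold
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s = np.maximum(s - tau, 0.0)        # soft-threshold singular values
        X = (U * s) @ Vt
        resid = mask * (M - X)              # mismatch on observed entries only
        Y += delta * resid
        if np.linalg.norm(resid) <= tol * np.linalg.norm(mask * M):
            break
    return X
```

For a noisy spectrogram, `X` fills in the masked (unreliable) bins with values consistent with the low-rank structure of the reliable ones.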
Finally, we propose an extension of the matrix completion method that
exploits not only the prior knowledge that the data matrix is low rank,
but also the knowledge that the data can be represented efficiently in
terms of a dictionary. In particular, we propose an algorithm for joint
low-rank representation and matrix completion (J-SVT). J-SVT outperforms
the standard SVT in computing the low-rank representation of a data
matrix in terms of a given dictionary from a small number of observed
entries of the original matrix. Through extensive simulations, we
observed that J-SVT achieves a lower reconstruction error than the
standard SVT across several distinct experimental scenarios.
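The abstract does not spell out the J-SVT iteration, so the following is only a plausible sketch of the problem setting it addresses: completing a matrix M assumed to factor as M = D Z, with Z low rank over a given dictionary D. It is solved here by a generic proximal-gradient scheme, not by the dissertation's J-SVT algorithm; `tau` and the iteration count are assumptions.

```python
import numpy as np

def complete_with_dictionary(M, mask, D, tau=0.5, n_iter=1000):
    """Proximal-gradient sketch for
        min_Z 0.5*||P_Omega(D Z - M)||_F^2 + tau*||Z||_*,
    i.e. complete M from its observed entries (mask == 1) assuming
    M = D Z with Z low rank over the given dictionary D."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    Z = np.zeros((D.shape[1], M.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (mask * (D @ Z - M))    # gradient on observed entries
        U, s, Vt = np.linalg.svd(Z - grad / L, full_matrices=False)
        s = np.maximum(s - tau / L, 0.0)     # singular-value soft threshold
        Z = (U * s) @ Vt
    return Z
```

Restricting the low-rank structure to the representation Z, rather than to the full data matrix, is what lets the dictionary prior reduce the number of observations needed, which matches the improvement over plain SVT reported in the abstract.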