Abstract
Speech is one of our most important abilities, since it is one of the principal ways in which we communicate with the world. In the past few years a great deal of interest has been shown in developing voice-based applications. Such applications involve the isolation of speech from an audio file. The algorithms that achieve this are called Voice Detection algorithms. A given input audio signal is analysed, the parts containing voice are kept, and the remaining parts (noise, silence, etc.) are discarded. In this way the amount of information to be further processed is greatly reduced.
The task of Voice Detection is closely related to Speech/Nonspeech Classification. In addition, Singing Voice Detection and Speech/Music Discrimination can be seen as subclasses of what we generally call Voice Detection. When dealing with such tasks, an audio signal is given as input to a system and is then processed. The signal is usually analysed in frames, from which features are extracted. The frame duration depends mostly on the application and sometimes on the features being used. Many features have been proposed so far, and they can be divided into two categories: time-domain and frequency-domain features. In the time domain, the short-time energy, the zero-crossing rate and autocorrelation-based features are most often used. In the frequency domain, cepstral features are most frequently used, because they carry useful information about the presence of speech.
More specifically, in Singing Voice Detection and in Speech/Music Discrimination the state-of-the-art features are the Mel-Frequency Cepstral Coefficients (MFCCs), which have been reported to provide the best performance in the majority of cases.
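To illustrate the frame-based, time-domain features mentioned above, the following is a minimal sketch in plain Python; the frame length and hop size are arbitrary example values, not the settings used in this thesis:

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a sampled signal into (possibly overlapping) frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# Example: one second of a 100 Hz sine at an 8 kHz sampling rate,
# analysed in 50 ms frames (400 samples) with 50% overlap.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
frames = frame_signal(x, frame_len=400, hop=200)
energies = [short_time_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

Voiced speech typically shows high short-time energy and a low zero-crossing rate, while unvoiced sounds and many noises show the opposite pattern, which is what makes these simple features useful for speech/nonspeech decisions.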
In this thesis an algorithm is developed that performs voice detection in spontaneous
and real-life recordings from music lessons. The content of the recordings was such that
the proposed algorithm was challenged to discriminate both speech and singing voice from
music and other noises. A classic approach for this problem would use MFCCs as the
discrimination feature and an SVM classifier for the classification into “speech” or “nonspeech”. In our work the methodology of this approach is expanded by preserving the
MFCCs as the main feature and incorporating three other features, namely the Cepstral
Flux, the Clarity and the Harmonicity. Cepstral Flux is extracted from the Cepstrum,
while Clarity and Harmonicity are time-domain autocorrelation-based features. The goal
is to use these additional features to improve the performance of the system that uses only the MFCCs. To this end, different combinations of the three additional features with the MFCCs were examined and evaluated. A 10-fold cross-validation is applied on segments, which are
labelled as “speech” or “nonspeech”. The database used for the training and the testing
purposes of our algorithm consists of three seminars: two of them concern traditional Cretan music classes with the lira, and the third one traditional Cretan music classes with the lute.
Each recording has been carried out under different environmental conditions.
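The autocorrelation-based features can be illustrated with a small sketch. Here Clarity is taken as the peak of the normalized autocorrelation over a lag range, which is a common definition for such features; the exact formulation and the lag range used in the thesis may differ:

```python
import math
import random

def autocorr(frame, lag):
    """Unnormalized autocorrelation of a frame at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def clarity(frame, min_lag, max_lag):
    """Peak of the normalized autocorrelation over a lag range: close to 1
    for periodic (voiced or sung) frames, low for noise-like frames."""
    energy = autocorr(frame, 0)
    if energy == 0:
        return 0.0
    n = len(frame)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Rescale by the overlap length so long lags are not penalized.
        best = max(best, autocorr(frame, lag) / energy * n / (n - lag))
    return best

# A periodic frame (100 Hz sine sampled at 8 kHz) versus a noise frame.
fs = 8000
periodic = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(400)]
c_periodic = clarity(periodic, min_lag=40, max_lag=200)
c_noise = clarity(noise, min_lag=40, max_lag=200)
```

Harmonicity can be derived along the same lines, for instance as a ratio of periodic to total energy in the frame; a high value of either feature indicates a voiced or sung frame rather than noise.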
Performance evaluation was conducted using the Detection Error Tradeoff (DET) and Receiver Operating Characteristic (ROC) curves as visual evaluation tools. In addition, the Equal
Error Rate (EER), the Efficiency and the Area Under the Curve (AUC) were computed in
each case. Each seminar was evaluated separately, as well as all of them together. Training and testing sets from different seminars were also combined, in order to provide reliable results. It is shown that the use of the additional features significantly improves the performance of the classic algorithm that uses only the MFCCs, by between about 0.5% and 20%. Specifically, three out of the five combinations stand out, reducing the miss probability by about 20% at a false alarm probability of 5%.
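As a sketch of how such operating points can be computed from per-segment classifier scores (the metric definitions below are the standard ones; the scores are hypothetical and the thesis's exact implementation is not shown here):

```python
def error_rates(speech_scores, nonspeech_scores):
    """Sweep a decision threshold over all observed scores and return the
    (false alarm, miss) pairs that make up a DET or ROC curve."""
    thresholds = sorted(set(speech_scores) | set(nonspeech_scores))
    points = []
    for t in thresholds:
        miss = sum(s < t for s in speech_scores) / len(speech_scores)
        false_alarm = sum(s >= t for s in nonspeech_scores) / len(nonspeech_scores)
        points.append((false_alarm, miss))
    return points

def equal_error_rate(speech_scores, nonspeech_scores):
    """EER: the error rate at the threshold where the miss and false alarm
    probabilities are closest to each other."""
    fa, miss = min(error_rates(speech_scores, nonspeech_scores),
                   key=lambda p: abs(p[0] - p[1]))
    return (fa + miss) / 2

# Perfectly separated "speech" and "nonspeech" scores give an EER of 0.
eer = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

A miss probability read off such a curve at a fixed false alarm probability (here 5%) is the kind of operating point used above to compare the feature combinations.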