Abstract
Speech is one of our most important abilities, since it is one of the principal ways in which we communicate with the world. In the past few years a great deal of interest has been shown in developing voice-based applications. Such applications involve the isolation of speech from an audio file. The algorithms that achieve this are called Voice Detection algorithms. A given input audio signal is analysed, the parts containing voice are kept, and the remaining parts (noise, silence, etc.) are discarded. In this way the amount of information to be further processed is greatly reduced.
The task of Voice Detection is closely related to Speech/Nonspeech Classification. In addition, Singing Voice Detection and Speech/Music Discrimination can be seen as subclasses of what we generally call Voice Detection. When dealing with such tasks, an audio signal is given as input to a system and is then processed. The signal is usually analysed in frames, from which features are extracted. The frame duration depends mostly on the application and sometimes on the features being used. Many features have been proposed so far, and they can be divided into two categories: time-domain and frequency-domain features. In the time domain, the short-time energy, the zero-crossing rate and autocorrelation-based features are most often used. In the frequency domain, cepstral features are most frequently used, because they carry useful information about the presence of speech.
More specifically, in Singing Voice Detection and in Speech/Music Discrimination the state-of-the-art features are the Mel-Frequency Cepstral Coefficients (MFCCs), which have been reported to provide the best performance in the majority of cases.
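To illustrate the frame-based, time-domain features mentioned above, the following is a minimal sketch in plain Python; the frame length and hop size are arbitrary example values, not the settings used in this thesis:

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a sampled signal into (possibly overlapping) frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# Example: one second of a 100 Hz sine at an 8 kHz sampling rate,
# analysed in 50 ms frames (400 samples) with 50% overlap.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
frames = frame_signal(x, frame_len=400, hop=200)
energies = [short_time_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

Voiced speech typically shows high short-time energy and a low zero-crossing rate, while unvoiced sounds and many noises show the opposite pattern, which is what makes these simple features useful for speech/nonspeech decisions.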
In this thesis an algorithm is developed that performs voice detection in spontaneous
and real-life recordings from music lessons. The content of the recordings was such that
the proposed algorithm was challenged to discriminate both speech and singing voice from
music and other noises. A classic approach for this problem would use MFCCs as the
discrimination feature and an SVM classifier for the classification into “speech” or “nonspeech”. In our work the methodology of this approach is expanded by preserving the
MFCCs as the main feature and incorporating three other features, namely the Cepstral
Flux, the Clarity and the Harmonicity. Cepstral Flux is extracted from the Cepstrum,
while Clarity and Harmonicity are time-domain autocorrelation-based features. The goal
is to use these additional features to improve the performance of the system that uses only the MFCCs. To this end, different combinations of the three additional features with the MFCCs were examined and evaluated. A 10-fold cross-validation is applied on segments, which are
labelled as “speech” or “nonspeech”. The database used for the training and the testing
purposes of our algorithm consists of three seminars: two of them concern traditional Cretan music classes with the lira, and the third one traditional Cretan music classes with the lute.
Each recording has been carried out under different environmental conditions.
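The autocorrelation-based features can be illustrated with a small sketch. Here Clarity is taken as the peak of the normalized autocorrelation over a lag range, which is a common definition for such features; the exact formulation and the lag range used in the thesis may differ:

```python
import math
import random

def autocorr(frame, lag):
    """Unnormalized autocorrelation of a frame at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def clarity(frame, min_lag, max_lag):
    """Peak of the normalized autocorrelation over a lag range: close to 1
    for periodic (voiced or sung) frames, low for noise-like frames."""
    energy = autocorr(frame, 0)
    if energy == 0:
        return 0.0
    n = len(frame)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Rescale by the overlap length so long lags are not penalized.
        best = max(best, autocorr(frame, lag) / energy * n / (n - lag))
    return best

# A periodic frame (100 Hz sine sampled at 8 kHz) versus a noise frame.
fs = 8000
periodic = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(400)]
c_periodic = clarity(periodic, min_lag=40, max_lag=200)
c_noise = clarity(noise, min_lag=40, max_lag=200)
```

Harmonicity can be derived along the same lines, for instance as a ratio of periodic to total energy in the frame; a high value of either feature indicates a voiced or sung frame rather than noise.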
Performance evaluation was conducted using the Detection Error Tradeoff (DET) and Receiver Operating Characteristic (ROC) curves as visual evaluation tools. In addition, the Equal
Error Rate (EER), the Efficiency and the Area Under the Curve (AUC) were computed in
each case. Each seminar was evaluated separately, as well as all of them together. Training and testing sets from different seminars were also combined, in order to provide reliable results. It is shown that the use of the additional features significantly improves the performance of the classic algorithm that uses only the MFCCs, by between about 0.5% and 20%. Specifically, three out of the five combinations stand out, reducing the miss probability by about 20% at a false alarm probability of 5%.
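As a sketch of how such operating points can be computed from per-segment classifier scores (the metric definitions below are the standard ones; the scores are hypothetical and the thesis's exact implementation is not shown here):

```python
def error_rates(speech_scores, nonspeech_scores):
    """Sweep a decision threshold over all observed scores and return the
    (false alarm, miss) pairs that make up a DET or ROC curve."""
    thresholds = sorted(set(speech_scores) | set(nonspeech_scores))
    points = []
    for t in thresholds:
        miss = sum(s < t for s in speech_scores) / len(speech_scores)
        false_alarm = sum(s >= t for s in nonspeech_scores) / len(nonspeech_scores)
        points.append((false_alarm, miss))
    return points

def equal_error_rate(speech_scores, nonspeech_scores):
    """EER: the error rate at the threshold where the miss and false alarm
    probabilities are closest to each other."""
    fa, miss = min(error_rates(speech_scores, nonspeech_scores),
                   key=lambda p: abs(p[0] - p[1]))
    return (fa + miss) / 2

# Perfectly separated "speech" and "nonspeech" scores give an EER of 0.
eer = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

A miss probability read off such a curve at a fixed false alarm probability (here 5%) is the kind of operating point used above to compare the feature combinations.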