Abstract
Emotional (or stressed/expressive) speech can be defined as the speech style produced by
an emotionally charged speaker. Speakers who feel sad, angry, or happy put a certain
stress in their speech that is typically characterized as emotional. Emotional speech
is considered among the most challenging speech styles for modelling, recognition, and classification.
The emotional condition of speakers may be revealed by the analysis of their speech,
and such knowledge could be valuable in emergency conditions, in health care applications, and as a
pre-processing step in recognition and classification systems, among others.
Acoustic analysis of speech produced under different emotional conditions reveals a great
number of speech characteristics that vary according to the emotional state of the speaker. Therefore,
these characteristics could be used to identify and/or classify different emotional speech
styles. There is little research on the parameters of the Sinusoidal Model (SM), namely amplitude,
frequency, and phase, as features to separate different speaking styles. However, the
estimation of these parameters is subject to an important constraint: they are derived under
the assumption of local stationarity, that is, the speech signal is assumed to be stationary inside
the analysis window. Speaking styles described as fast or angry may violate this
assumption. Recently, this problem has been addressed by the adaptive Sinusoidal Models (aSMs),
which project the signal onto a set of amplitude- and frequency-varying basis functions inside the
analysis window. Hence, the sinusoidal parameters are estimated more accurately.
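As a rough illustration (the notation is ours and may differ from the thesis body), the stationary SM and the adaptive models can be contrasted as

\begin{align*}
\text{SM:}\quad & x(t) \approx \sum_{k=1}^{K} A_k \cos(2\pi f_k t + \phi_k), & & A_k,\ f_k,\ \phi_k \ \text{constant inside the window,}\\
\text{aSM:}\quad & x(t) \approx \sum_{k=1}^{K} A_k(t) \cos\!\big(\phi_k(t)\big), & & \text{instantaneous amplitude and phase,}
\end{align*}

so that amplitude and frequency are allowed to evolve within a single analysis window.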
In this thesis, we propose the use of an adaptive Sinusoidal Model (aSM), the extended adaptive
Quasi-Harmonic Model (eaQHM), for emotional speech analysis and classification. The
eaQHM adapts the amplitude and the phase of its basis functions to the local characteristics of
the signal. First, the eaQHM is employed to analyze emotional speech into accurate, robust, continuous,
time-varying parameters (amplitude and frequency). It is shown that these parameters
can adequately and accurately represent emotional speech content. Using a well-known database of pre-labeled narrowband expressive speech (SUSAS) and the Berlin database of emotional speech, we
show that very high Signal-to-Reconstruction-Error Ratio (SRER) values can be obtained compared
to the standard Sinusoidal Model (SM). Specifically, the eaQHM outperforms the SM in SRER
by 100% on average.
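For reference, the SRER is commonly defined in the aSM literature (the exact definition used in the thesis body may differ) as

\[
\mathrm{SRER} = 20 \log_{10} \frac{\sigma_{x}}{\sigma_{x - \hat{x}}}\ \mathrm{dB},
\]

where $x$ is the original signal, $\hat{x}$ its resynthesis, and $\sigma$ denotes the standard deviation.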
Additionally, formal listening tests on a wideband custom database of running emotional speech
show that the eaQHM outperforms the SM in terms of perceptual resynthesis
quality. The parameters obtained from the eaQHM can therefore represent an emotional
speech signal more accurately. We propose the use of these parameters in an application
based on emotional speech, namely the classification of emotional speech. Using the SUSAS and
Berlin databases, we develop two separate Vector Quantizers (VQs) for the classification, one for
amplitude and one for frequency features. Finally, we suggest a combined amplitude-frequency
classification scheme (a sketch of such a VQ classifier is given below). Experiments show that both the single and the combined classification schemes
achieve higher performance when the features are obtained from the eaQHM.
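As an illustration only, and not the implementation used in the thesis (the function names, the codebook size, and the k-means training are our assumptions), a per-class VQ classifier of the kind described above could be sketched as follows:

import numpy as np
from scipy.cluster.vq import kmeans2, vq

def train_codebooks(features_per_class, codebook_size=64, seed=0):
    # Train one k-means codebook per emotion class.
    # features_per_class: dict mapping class label -> (n_frames, dim) array
    # of frame-level features (e.g., amplitude or frequency tracks).
    codebooks = {}
    for label, feats in features_per_class.items():
        centroids, _ = kmeans2(feats.astype(float), codebook_size,
                               minit='++', seed=seed)
        codebooks[label] = centroids
    return codebooks

def classify(utterance_feats, codebooks):
    # Pick the class whose codebook gives the lowest mean quantization
    # distortion over the utterance frames.
    best_label, best_dist = None, np.inf
    for label, cb in codebooks.items():
        _, dists = vq(utterance_feats.astype(float), cb)
        if dists.mean() < best_dist:
            best_label, best_dist = label, dists.mean()
    return best_label

A combined amplitude-frequency scheme could, for instance, fuse the per-feature distortions (after suitable normalization) before taking the minimum; the combination rule actually used in the thesis may differ.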