Identifier uch.csd.msc//2006pantazis
Title Ανίχνευση Ασυνεχειών στη Συνδετική Σύνθεση Φωνής με Ακουστικές Μονάδες
Alternative Title Detection of Discontinuities in Concatenative Speech Synthesis
Creator Pantazis, Ioannis
Abstract Over the last decade, unit selection synthesis has become a hot topic in speech synthesis research. Unit selection gives the greatest naturalness because it applies little digital signal processing to the recorded speech; heavy processing often makes recorded speech sound less natural. To find the best units in the database, unit selection relies on two cost functions: the target cost and the concatenation cost. The concatenation cost measures how well adjacent units can be joined. Finding a concatenation cost function breaks down into two subproblems: finding a proper parameterization of the signal and finding the right distance measure. Recent studies have attempted to identify concatenation distance measures that can predict audible discontinuities and therefore correlate highly with human perception of discontinuity at the concatenation point. However, none of the concatenation costs used so far can measure the similarity (or (dis)continuity) of two consecutive units effectively. Many features, such as line spectral frequencies (LSF) and Mel-frequency cepstral coefficients (MFCC), have been used for the detection of discontinuities.

In this study, three new sets of features for detecting discontinuities are introduced. The first set is obtained by modeling the speech signal as a sum of harmonics with time-varying complex amplitudes, which yields a nonlinear speech model. The second set is based on a nonlinear speech analysis technique that decomposes speech signals into AM and FM components. The third set exploits the nonlinear nature of the ear: using Lyon's auditory model, the behaviour of the cochlea is measured by evaluating neural firing rates.

To measure the difference between two vectors of such parameters, a distance measure is needed. Examples of such measures are the absolute distance (l1 norm) and the Euclidean distance (l2 norm). However, these measures are naive and give rather poor results. We therefore also propose Fisher's linear discriminant and a quadratic discriminant as discrimination functions. Linear regression, based on a least-squares fit, was also tested as a discrimination function.

The objective distance measures (or concatenation costs) were evaluated, and the discriminant functions trained, on two databases. Each database was built by running a psychoacoustic listening experiment and collecting listeners' opinions. The first database was created by Klabbers and Veldhuis in Holland, while the second was created by Stylianou and Syrdal at AT&T Labs. This makes it possible to compare the same approaches on different databases and obtain more robust results. Results from the two psychoacoustic listening tests showed that the nonlinear harmonic model, combined with Fisher's linear discriminant or with linear regression, performed very well in both tests. It was significantly better than MFCC with the Euclidean distance, which is a common concatenation cost in modern TTS systems. Another good concatenation cost, though not as good as the nonlinear harmonic model, is the AM-FM decomposition, again with Fisher's linear discriminant or linear regression. These results indicate that a concatenation cost based on nonlinear features separated by a statistical discriminant function is a good choice.
Issue date 2006-12-01
Date available 2007-10-11
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses
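
Illustration. The abstract describes concatenation costs built from a feature parameterization plus a distance or discriminant function evaluated across the join between two units. The Python sketch below is not taken from the thesis: the function names, the NumPy dependency, and the use of absolute feature differences as discriminant inputs are assumptions made purely for illustration. It shows the general shape of such a cost: simple l1/l2 distances between feature vectors on either side of a concatenation point, and a Fisher-style linear discriminant trained on feature differences labelled by listeners as continuous or discontinuous.

    import numpy as np

    def l1_distance(a, b):
        # Absolute distance (l1 norm) between two feature vectors.
        return float(np.sum(np.abs(a - b)))

    def l2_distance(a, b):
        # Euclidean distance (l2 norm) between two feature vectors.
        return float(np.sqrt(np.sum((a - b) ** 2)))

    def fisher_direction(diff_continuous, diff_discontinuous):
        # Fisher's linear discriminant direction separating feature-difference
        # vectors from joins judged continuous vs. discontinuous by listeners.
        # Both inputs are (n_examples, n_features) arrays of |left - right|.
        mu_c = diff_continuous.mean(axis=0)
        mu_d = diff_discontinuous.mean(axis=0)
        # Pooled within-class scatter matrix.
        Sw = (np.cov(diff_continuous, rowvar=False) * (len(diff_continuous) - 1)
              + np.cov(diff_discontinuous, rowvar=False) * (len(diff_discontinuous) - 1))
        w = np.linalg.solve(Sw, mu_d - mu_c)
        return w / np.linalg.norm(w)

    def concatenation_cost(left_features, right_features, w):
        # Project the absolute feature difference at the join onto the
        # discriminant direction; a larger score suggests a more audible
        # discontinuity at the concatenation point.
        return float(np.dot(w, np.abs(left_features - right_features)))

In this sketch the feature vectors could be, for example, MFCCs or harmonic-model amplitudes extracted from the frames on either side of the join; the thesis itself compares several such parameterizations and discriminant functions on the two perceptually labelled databases mentioned in the abstract.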