Abstract
In everyday speech, linguistic information plays the major role in communication
between people. However, voice quality and individuality are also important for
recognizing and understanding speech. For instance, it is essential to be able to
distinguish between two or more speakers in a radio or television program. Beyond
these advantages in communication, voice individuality enriches our daily life with
variety.
For a number of modern applications, it is important to create and maintain
databases of different speakers, for example in gaming, text-to-speech synthesis, and
cartoon movies. This may be time-consuming and expensive, depending on the
requirements of the application. Speaker interpolation (SI) is the process of producing
an intermediate voice between two or more speakers, while voice conversion (VC) is
the technique of processing the voice of one person, namely the source speaker, such
that his/her voice resembles the voice of another person, namely the target speaker.
Moreover, the converted or interpolated speech should sound natural and intelligible.
Despite extensive research in VC, high-quality voice conversion has not yet been
achieved. Several reasons explain this shortcoming, the main ones being a) the
oversmoothing effect caused by statistical modeling, b) inaccurate estimation of the
speaker-dependent features, and c) the inadequacy of the synthesis methods used.
Voice conversion methods are based on spectral envelope information, which represents
the vocal tract, since it plays an important role in speech individuality. In
conventional VC, the excitation signal of the source speaker is first extracted by
inverse filtering. This excitation signal is then filtered by the vocal tract of the
target speaker. In speech interpolation, the excitation signal is filtered by an
interpolated vocal tract of the given speakers.
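As a minimal illustration of this source-filter view, the sketch below assumes an
all-pole (LPC-style) vocal tract model described by reflection coefficients. This is
a simplifying assumption made only for the example and is not the true-envelope
method used in this thesis. All function names are hypothetical. The excitation is
obtained by inverse filtering with the source filter and is then passed through the
target filter (conversion) or through a filter whose reflection coefficients are
interpolated between two speakers (interpolation).

```python
def inverse_filter(x, a):
    """Apply the FIR analysis filter A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...
    to the signal x and return the excitation (residual)."""
    e = []
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        e.append(acc)
    return e


def all_pole_filter(e, a):
    """Apply the synthesis filter 1 / A(z) to the excitation e."""
    y = []
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def reflection_to_lpc(ks):
    """Step-up recursion: reflection coefficients -> coefficients of A(z)."""
    a = []
    for k in ks:
        a = [ai + k * aj for ai, aj in zip(a, reversed(a))] + [k]
    return a


def interpolate_reflection(k_a, k_b, alpha):
    """Linear interpolation of reflection coefficients (alpha in [0, 1]).
    If all |k| < 1 for both speakers, the interpolated filter stays stable."""
    return [(1 - alpha) * ka + alpha * kb for ka, kb in zip(k_a, k_b)]


def convert(x, k_source, k_target, alpha=1.0):
    """alpha = 1 applies the target vocal tract (conversion);
    0 < alpha < 1 applies an intermediate vocal tract (interpolation)."""
    excitation = inverse_filter(x, reflection_to_lpc(k_source))
    k_mix = interpolate_reflection(k_source, k_target, alpha)
    return all_pole_filter(excitation, reflection_to_lpc(k_mix))


if __name__ == "__main__":
    frame = [1.0, 0.5, -0.25, 0.1, 0.0, -0.3]  # toy samples, not real speech
    halfway = convert(frame, [0.5, -0.3], [-0.2, 0.4], alpha=0.5)
    print(halfway)
```

Interpolating reflection coefficients rather than the filter coefficients themselves
is a design choice of this sketch: it keeps the intermediate filter stable, which
direct interpolation of the polynomial coefficients does not guarantee.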
This thesis addresses this research gap and aims to achieve high-quality speech
interpolation and voice conversion on parallel corpora, using accurate methods for
spectral envelope estimation (true envelope), time and frequency alignment
(piecewise linear time and frequency warping), and speech synthesis (interpolated
lattice filter or overlap-and-add). Using precise methods in each processing step was
expected to reduce the artifacts currently encountered in voice conversion. In
speech interpolation, the produced vocal tract is not just an interpolation between
the given speakers; the vocal tract length can also be altered, producing a broad
range of voices. Hence, from a limited database, a substantially larger one
containing individual speakers for every use can be created.