Abstract
A speech production model that views speech as the result of passing a glottal
excitation waveform through a time-varying linear filter (the latter modeling the
resonant characteristics of the vocal tract) is widely used in digital speech signal
processing. In many speech applications, two possible states of the glottal excitation
can be assumed: voiced or unvoiced. Voice models often split the speech
spectrum into two (or more) voiced and unvoiced frequency bands separated by
cutoff frequencies. Voiced speech is usually modeled deterministically in the
lower frequencies, while a stochastic approach is used for the upper frequency part.
A so-called Maximum Voiced Frequency separates the deterministic and stochastic
parts. However, it can be observed from the actual voice production mechanisms
that the amplitude spectrum of the voice source decreases smoothly without any
abrupt frequency changes that would justify such a classification of the spectrum
into deterministic and stochastic components. Multiband models therefore
struggle to estimate these cutoff frequencies, and the resulting artifacts
can degrade the perceived quality. Moreover, the Fan
Chirp Transform (FChT), which uses a linear frequency basis adapted to the
nonstationarities of the speech signal, has demonstrated that harmonicity is present
at frequencies higher than those usually considered as voiced based on the Discrete
Fourier Transform (DFT). This motivates alternative models which are based on a
full-band modeling approach.
Sinusoidal and harmonic models aim to represent the speech signal with a set of
parameters such as frequencies, amplitudes and phases. The accuracy and precision
of the model parameters are key issues. All voice models have to be both precise
and fast in order to represent the speech signal adequately and be able to process
large amounts of data in a reasonable amount of time. So far, the Sinusoidal Model
(SM), where the glottal excitation is represented as a sum of sine waves, has been
widely used for many applications such as speech analysis, coding and modification.
However, as the evaluations in this thesis show, its estimated parameters are
not as accurate as those computed by harmonic models. The adaptive
Quasi-Harmonic Model (aQHM) has been proposed as an alternative, more adaptive
method for speech analysis that exploits some of the harmonic structure of
a signal. The aQHM offers even more flexibility than the FChT by using a set of
adaptive non-linear basis functions. However, because aQHM assumes that the
error of the initial frequency tracks is confined, a frequency matching
problem may occur. Hence, neither method is very suitable for full-band modeling
of a speech signal.
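As a rough illustration of the full-band harmonic representation discussed
above, the following sketch synthesizes a signal as a sum of harmonics of a
time-varying f0, keeping every harmonic below the Nyquist frequency. The
function name synth_harmonic, the constant per-harmonic amplitudes and the zero
initial phases are simplifying assumptions made here for illustration; they are
not taken from the SM, aQHM or aHM formulations.

```python
import numpy as np

def synth_harmonic(f0, amps, fs):
    """Sum of harmonics k*f0[n]; illustrative sketch, not the aHM synthesis.

    f0   : per-sample fundamental frequency in Hz
    amps : one constant amplitude per harmonic (assumption: no time variation)
    fs   : sampling rate in Hz
    """
    f0 = np.asarray(f0, dtype=float)
    # Phase of the fundamental is the running integral of f0 (in radians).
    phase0 = 2.0 * np.pi * np.cumsum(f0) / fs
    signal = np.zeros_like(f0)
    for k, a in enumerate(amps, start=1):
        # Drop samples where harmonic k would reach or exceed Nyquist.
        below_nyquist = k * f0 < fs / 2.0
        signal += a * np.cos(k * phase0) * below_nyquist
    return signal
```

Because each harmonic phase is k times the integrated f0, the harmonics stay
locked to the fundamental even when f0 glides, which is the property that an
adaptive frequency basis is meant to exploit.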
Harmonic models were initially designed to represent the deterministic
part of speech but, as implied by the FChT, the need for a cutoff frequency
is questionable. Thus, exploiting the properties of aQHM, the full-band adaptive
Harmonic Model (aHM), along with its corresponding algorithms for the estimation
of harmonics up to the Nyquist frequency, has been proposed. The aHM
uses the Least Squares (LS) solution in the Adaptive Iterative Refinement (AIR)
algorithm in order to refine the f0 curve without the
problems caused by frequency errors. Even though aHM-AIR using LS allows for a
robust estimation of the harmonic components, it lacks the computational efficiency
that would make its use convenient for large databases, due to the use of the LS
solution.
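To make the cost of the LS step concrete, the sketch below fits harmonic
amplitudes to a single frame by least squares. It is a simplified stand-in, not
the aHM-AIR algorithm: f0 is assumed constant within the frame (no adaptivity),
and the function name ls_harmonic_amplitudes is hypothetical.

```python
import numpy as np

def ls_harmonic_amplitudes(frame, f0, fs):
    """Least-squares fit of harmonic amplitudes to one frame.

    Assumption: f0 (Hz) is constant over the frame, unlike the adaptive aHM.
    Returns one amplitude per harmonic strictly below the Nyquist frequency.
    """
    frame = np.asarray(frame, dtype=float)
    # Time axis in samples, centered on the middle of the frame.
    n = np.arange(len(frame)) - (len(frame) - 1) / 2.0
    n_harm = int(np.ceil(fs / 2.0 / f0)) - 1
    k = np.arange(1, n_harm + 1)
    # Real-valued basis: one cosine and one sine column per harmonic.
    arg = 2.0 * np.pi * f0 / fs * np.outer(n, k)
    basis = np.hstack([np.cos(arg), np.sin(arg)])
    coef, *_ = np.linalg.lstsq(basis, frame, rcond=None)
    # Amplitude of harmonic k from its cosine and sine coefficients.
    return np.hypot(coef[:n_harm], coef[n_harm:])
```

For a frame of N samples and K harmonics, the basis matrix has N rows and 2K
columns, and a dense least-squares solve scales roughly with N times K squared.
With harmonics up to Nyquist, K is large for low-pitched voices, which gives an
idea of why a per-frame LS solution becomes expensive on large databases.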
In this thesis, a Peak-Picking (PP) approach is proposed as a substitute for the
LS solution used by the AIR algorithm. In order to integrate the adaptivity scheme
of aHM in the PP approach, an adaptive Discrete Fourier Transform (aDFT), whose
frequency basis can fully follow the variations of the f0 curve, is also proposed. To
evaluate the proposed method, its computation time was measured; compared
to the original LS-based aHM-AIR algorithm, the proposed improvements reduce
the computation time by a factor of almost four on average. Moreover, the
quality of the re-synthesis is preserved: measured with the
Signal-to-Reconstruction-Error Ratio (SRER)
and the Perceptual Evaluation of Speech Quality (PESQ), the speech
reconstructed using aHM-AIR with PP and aDFT retains the quality of aHM-AIR
using LS. Finally, formal listening tests show that the speech reconstructed by
aHM-AIR with PP and aDFT is very similar to that reconstructed by aHM-AIR using
LS.
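The two ingredients named above can be sketched as follows. The first function
reads harmonic amplitudes off the peaks of an ordinary zero-padded DFT; it does
not implement the aDFT, whose basis follows the f0 curve, and it assumes a
constant f0 within the frame. The second computes SRER under the commonly used
definition 20 log10(std(x) / std(x - x_hat)), which is assumed here rather than
quoted from the thesis. The names pick_harmonic_peaks and srer_db, the Hann
window, the FFT size and the search width are all illustrative choices.

```python
import numpy as np

def pick_harmonic_peaks(frame, f0, fs, nfft=4096):
    """Harmonic amplitudes from DFT magnitude peaks near each k*f0.

    Plain DFT peak picking (assumption: stationary f0), not the aDFT.
    """
    frame = np.asarray(frame, dtype=float)
    win = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * win, nfft))
    spec *= 2.0 / win.sum()  # a sinusoid of amplitude a peaks near a
    n_harm = int(np.ceil(fs / 2.0 / f0)) - 1
    # Search +/- f0/4 (in bins) around the expected harmonic position.
    half = max(1, int(round(0.25 * f0 * nfft / fs)))
    amps = np.zeros(n_harm)
    for k in range(1, n_harm + 1):
        center = int(round(k * f0 * nfft / fs))
        lo, hi = max(center - half, 0), min(center + half + 1, len(spec))
        amps[k - 1] = spec[lo:hi].max()
    return amps

def srer_db(x, x_hat):
    """Signal-to-Reconstruction-Error Ratio in dB (assumed definition)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return 20.0 * np.log10(np.std(x) / np.std(x - x_hat))
```

A single FFT per frame costs on the order of nfft log nfft operations
regardless of the number of harmonics, which is the kind of saving a
peak-picking approach aims for relative to a per-frame LS solve.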