E-Locus - Institutional Repository of the University of Crete

Identifier

000391352

Title

Speech analysis / synthesis using an adaptive harmonic model

Alternative Title

Speech analysis and synthesis using an adaptive harmonic model (Greek title: Ανάλυση και σύνθεση λόγου με χρήση ενός προσαρμοστικού αρμονικού μοντέλου)

Author

Μόρφη, Γνωστοθέα-Βερονίκη Χ

Thesis advisor

Μουχτάρης, Αθανάσιος

Reviewer

Τσακαλίδης, Παναγιώτης
Τζιρίτας, Γεώργιος

Abstract

A speech production model that views speech as the result of passing a glottal excitation waveform through a time-varying linear filter (which models the resonant characteristics of the vocal tract) is widely used in digital speech signal processing. In many speech applications, the glottal excitation is assumed to be in one of two states: voiced or unvoiced. Voice models often split the speech spectrum into two (or more) voiced/unvoiced frequency bands using corresponding cutoff frequencies. Voiced speech is usually modeled deterministically in the lower frequencies, while a stochastic approach is used for the upper frequencies. A so-called Maximum Voiced Frequency separates the deterministic and stochastic parts. However, the actual voice production mechanism shows that the amplitude spectrum of the voice source decreases smoothly, without any abrupt change in frequency that would justify dividing the spectrum into deterministic and stochastic components. As a result, multiband models struggle to estimate these cutoff frequencies, and the artifacts they produce can degrade the perceived quality. Moreover, the Fan Chirp Transformation (FChT), which uses a linear frequency basis adapted to the nonstationarities of the speech signal, has demonstrated that harmonicity is present at frequencies higher than those usually considered voiced on the basis of the Discrete Fourier Transform (DFT). This motivates alternative models based on a full-band approach. Sinusoidal and harmonic models represent the speech signal with a set of parameters such as frequencies, amplitudes and phases. The accuracy and precision of these parameters are key issues: a voice model must be both precise and fast in order to represent the speech signal adequately and to process large amounts of data in a reasonable amount of time.
So far, the Sinusoidal Model (SM), in which the glottal excitation is represented as a sum of sine waves, has been widely used in applications such as speech analysis, coding and modification. However, as the evaluations in this thesis show, its estimated parameters are not as accurate as those computed by harmonic models. The adaptive Quasi-Harmonic Model (aQHM) has been proposed as an alternative, more adaptive method for speech analysis that uses some attributes of the harmonicity of a signal. The aQHM offers even more flexibility than the FChT by using a set of adaptive non-linear basis functions. However, because aQHM assumes that the error in the initial frequency tracks is confined, a frequency matching problem may occur. Hence, neither method is well suited to full-band modeling of a speech signal. Harmonic models were initially designed to represent the deterministic part of speech but, as the FChT implies, the need for a cutoff frequency is questionable. Thus, exploiting the properties of aQHM, the full-band adaptive Harmonic Model (aHM) has been proposed, along with corresponding algorithms for estimating harmonics up to the Nyquist frequency. The aHM uses the Least Squares (LS) solution in the Adaptive Iterative Refinement (AIR) algorithm to refine the f0 curve without the problems caused by frequency errors. Even though aHM-AIR using LS allows a robust estimation of the harmonic components, the LS solution makes it too computationally expensive for convenient use on large databases. In this thesis, a Peak-Picking (PP) approach is proposed as a substitute for the LS solution used by the AIR algorithm.
To integrate the adaptivity scheme of aHM into the PP approach, an adaptive Discrete Fourier Transform (aDFT), whose frequency basis can fully follow the variations of the f0 curve, is also proposed. To evaluate the proposed method, computation time was measured; on average, the proposed improvements reduce it by a factor of almost four relative to the original LS-based aHM-AIR algorithm. Additionally, the quality of the re-synthesis is preserved: using the Signal-to-Reconstruction-Error Ratio (SRER) and the Perceptual Evaluation of Speech Quality (PESQ), we show that speech reconstructed using aHM-AIR with PP and aDFT retains the quality of aHM-AIR using LS. Finally, formal listening tests show that speech reconstructed by aHM-AIR with PP and aDFT is very similar to that reconstructed by aHM-AIR using LS.
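
Illustrative note (not part of the thesis): the sketch below shows, under simplifying assumptions, what a least-squares harmonic fit and an SRER measurement can look like for a single analysis frame. It assumes a constant f0 within the frame, so it is a plain stationary harmonic model and not the adaptive aHM or the AIR algorithm. The function names (harmonic_basis, ls_harmonic_fit, srer_db) are hypothetical, and the SRER formula used here, 20*log10(std(x) / std(x - x_hat)), is the commonly used definition, taken as an assumption.

```python
import numpy as np

def harmonic_basis(f0, fs, n_samples):
    """Complex exponentials at every multiple of f0 strictly below Nyquist.

    Assumption: f0 is constant over the frame (no adaptivity).
    """
    n_harm = int(np.ceil((fs / 2) / f0)) - 1   # highest harmonic below fs/2
    t = np.arange(n_samples) / fs              # time axis in seconds
    k = np.arange(-n_harm, n_harm + 1)         # negative, DC, positive harmonics
    return np.exp(2j * np.pi * f0 * np.outer(t, k))

def ls_harmonic_fit(x, f0, fs):
    """Least-squares estimate of the harmonic amplitudes; returns the re-synthesis."""
    E = harmonic_basis(f0, fs, len(x))
    a, *_ = np.linalg.lstsq(E, x.astype(complex), rcond=None)
    return (E @ a).real                        # imaginary part is ~0 for real x

def srer_db(x, x_hat):
    """Signal-to-reconstruction-error ratio in dB (assumed definition)."""
    return 20 * np.log10(np.std(x) / np.std(x - x_hat))

if __name__ == "__main__":
    fs, f0, n = 16000, 200.0, 320              # one 20 ms frame at 16 kHz
    t = np.arange(n) / fs
    rng = np.random.default_rng(0)
    x = (np.cos(2 * np.pi * f0 * t)
         + 0.5 * np.cos(2 * np.pi * 2 * f0 * t)
         + 0.25 * np.cos(2 * np.pi * 3 * f0 * t)
         + 0.01 * rng.standard_normal(n))      # small additive noise
    print("SRER (dB):", srer_db(x, ls_harmonic_fit(x, f0, fs)))
```

The cost of solving the least-squares system for every frame is what the abstract identifies as the bottleneck; a peak-picking approach would read the harmonic parameters from a spectrum instead of solving such a system.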

Language

English

Subject

Peak picking

Voice analysis

Harmonic model

Adaptive

Issue date

2015-03-20

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses