E-Locus - Institutional Repository of the University of Crete - End-to-end neural based Greek text-to-speech synthesis

Identifier

000425995

Title

End-to-end neural based Greek text-to-speech synthesis

Alternative Title

Από-άκρη-σε-άκρη νευρωνική σύνθεση ομιλίας από κείμενο για την Ελληνική Γλώσσα (End-to-end neural speech synthesis from text for the Greek language)

Author

Σισαμάκη, Ειρήνη Δ.

Thesis advisor

Στυλιανού, Γιάννης

Reviewer

Τσιάρας, Βασίλης
Πανταζής, Γιάννης

Abstract

Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken language. TTS systems play an important role in natural human-computer interaction. Concatenative speech synthesis and statistical parametric speech synthesis were the prominent methods for decades. In the era of deep learning, end-to-end TTS systems have dramatically improved the quality of synthetic speech. The aim of this work was to implement an end-to-end neural-based TTS system for the Greek language. The Tacotron-2 neural network architecture is used to synthesize speech directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to acoustic features, followed by a modified WaveNet model acting as a vocoder that synthesizes time-domain waveforms from the predicted acoustic features. Developing a TTS system for any given language is a significant challenge and requires a large amount of high-quality acoustic recordings. Because of this, such systems are available only for the most widely spoken languages. In this work, experiments are described for various languages and freely available databases. A Greek database, originally created for speech recognition, was obtained from ILSP (Institute for Language and Speech Processing). In our first experiment, only 3 hours of recorded Greek speech were used. Language adaptation was then applied, using 3 hours of Greek and 18 hours of Spanish. We also applied speaker adaptation in order to produce speech in the voices of specific speakers from our database. Our TTS system for Greek generates good-quality speech with very natural prosody. In a listening test with 30 volunteers, our model received a Mean Opinion Score (MOS) of 3.15, compared with 3.82 for the original recordings.

Language

English

Subject

Neural networks

Tacotron-2

Issue date

2019-11-22

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses
