Abstract
In this thesis, we explore advancements in machine learning, focusing on improved training
algorithms for Generative Adversarial Networks (GANs) and on their application to image generation
and speech synthesis.
Given the recent strides in GAN training, improving the stability of the training process remains
essential. The first part of this thesis therefore focuses on algorithmic advances for GAN training,
examining strategies that mitigate the instabilities encountered during training and thereby refine the
training process as a whole. We propose a novel weight-based algorithm that strengthens the Generator.
Theoretical analysis suggests that it outperforms the baseline algorithm by producing a more potent
Generator at each iteration, and empirical results show substantial accuracy improvements, ranging
between 5% and 50%, along with faster convergence on both synthetic and image datasets.
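As a purely illustrative sketch of the general idea of weighting generator updates (the function name and the scalar weighting scheme below are hypothetical, not the specific algorithm developed in the thesis):

```python
import numpy as np

def weighted_generator_step(params, grad, lr=1e-2, w=2.0):
    """One hypothetical gradient step in which the generator's gradient
    is multiplied by a weight w > 1, amplifying the update relative to
    an unweighted baseline step."""
    return params - lr * (w * grad)

params = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
baseline = params - 1e-2 * grad                    # unweighted step
weighted = weighted_generator_step(params, grad)   # weighted step
```

With w > 1, the weighted step moves the generator parameters further along the descent direction than the baseline step, which is one way a "more potent Generator at each iteration" could be realized.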
In the realm of GAN loss functions, we introduce a novel approach based on cumulant generating
functions, which offers a fresh perspective by encompassing various divergences and distances and
relies on a recently derived variational formula. We show that the corresponding optimization is
equivalent to Rényi divergence minimization, thus offering a (partially) unified perspective on GAN
losses: the Rényi family encompasses the Kullback-Leibler divergence (KLD), reverse KLD, the Hellinger
distance, and the χ²-divergence. The approach also enhances training stability, particularly when weaker
discriminators are employed, and yields substantial improvements in image generation on CIFAR-10 and
ImageNet.
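For reference, the Rényi divergence of order α (the standard definition, not a result of this thesis) is

```latex
R_\alpha(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1} \,\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx ,
\qquad \alpha \in (0,1) \cup (1,\infty),
```

and it recovers the listed divergences at special orders: R_α(P‖Q) → KLD(P‖Q) as α → 1, R_{1/2} is a monotone function of the squared Hellinger distance, R_2 = log(1 + χ²(P‖Q)), and reverse KLD is the α → 1 limit with the arguments P and Q exchanged.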
Disentangled representations are crucial for capturing probability distributions and measuring
divergences effectively. Mutual information (MI) estimation, typically based on the Kullback-Leibler
divergence (KLD), is commonly used to enforce disentanglement. We explore variational representations,
in particular those based on minimizing Rényi divergences, as an alternative to KLD; Rényi divergences
offer advantages when comparing distributions of different types. We emphasize scalable neural-network
estimators for efficient MI estimation. Despite the potential for large statistical estimation errors,
incorporating a variational representation based on Rényi divergences proves feasible and effective. The
method is particularly successful at enhancing stability on real biological data, enabling the detection
of rare sub-populations even with limited samples. Moreover, precisely estimating divergences remains a
significant challenge in many machine learning tasks, especially for high-dimensional datasets, where
estimator variance can grow large. To address this challenge, we incorporate an explicit variance
penalty (VP) into the objective function of the divergence estimator. The penalty reduces the variance
associated with the estimator, improving the accuracy of divergence estimates.
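As a minimal numerical sketch of a variance penalty, assuming the Donsker-Varadhan variational form of the KLD objective (the function names and the exact placement of the penalty are illustrative, not the thesis's formulation):

```python
import numpy as np

def dv_objective(f_p, f_q):
    """Donsker-Varadhan lower bound on KLD: E_P[f] - log E_Q[exp f],
    evaluated on critic outputs f_p (samples from P) and f_q (samples from Q)."""
    return f_p.mean() - np.log(np.exp(f_q).mean())

def dv_objective_with_vp(f_p, f_q, lam=0.1):
    """Same bound minus an explicit variance penalty (VP) on the
    exponential term, discouraging high-variance estimates."""
    return dv_objective(f_p, f_q) - lam * np.var(np.exp(f_q))

rng = np.random.default_rng(0)
f_p = rng.normal(0.5, 1.0, size=1000)   # critic outputs on samples from P
f_q = rng.normal(0.0, 1.0, size=1000)   # critic outputs on samples from Q
```

Since the penalty term is non-negative, setting lam > 0 can only lower the objective, trading a small amount of bias for reduced estimator variance.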
The second part of the thesis turns to practical applications in speech synthesis, namely transforming
one voice into another (voice conversion) and turning written text into spoken words (text-to-speech
synthesis). We introduce techniques for many-to-many voice conversion. Leveraging the earlier
weight-based algorithm, we propose a weight multiplication approach that enhances the Generator's
gradients, making it more adept at fooling the Discriminator and yielding a robust Weighted StarGAN
(WeStarGAN) system. WeStarGAN achieves significantly better performance than conventional methods,
with preference scores of 75% and 65% for subjective speech quality and speaker similarity, respectively.
Neural vocoders often struggle to generalize, especially to unseen speakers and conditions.
We introduce the Speaker Conditional WaveRNN (SC-WaveRNN), which leverages speaker embeddings to
improve speech quality and performance. This variant significantly outperforms the baseline WaveRNN,
achieving improvements of up to 95% in Mean Opinion Score (MOS) for unseen speakers and conditions. We
extend this work by implementing a multi-speaker text-to-speech (TTS) synthesis approach that effectively
tackles zero-shot speaker adaptation.
In the realm of universal TTS, we present a system capable of generating speech with varied speaking
styles and speaker characteristics, without explicit style annotations or speaker labels. We propose a
novel approach based on Rényi divergence and disentangled representations. This method effectively
reduces content and style leakage, yielding substantial improvements in word error rate and speech
quality: approximately 16-20% in MOS speech quality, alongside a 15% boost in MOS style similarity.
Lastly, the growing use of digital assistants underscores the importance of TTS synthesis on modern
devices, where generating intelligible speech in noisy environments is crucial. Our transfer learning
approach to TTS combines two effective strategies: Lombard speaking-style data and Spectral Shaping and
Dynamic Range Compression (SSDRC). The resulting system, Lombard-SSDRC TTS, significantly improves
intelligibility, with relative gains of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in
competing-speaker noise (CSN) compared to state-of-the-art TTS methods. Subjective evaluations confirm
these improvements, with a median keyword correction rate increase of 455% for SSN and 104% for CSN
over the baseline TTS method.