Abstract
In this thesis, we explore advancements in machine learning, focusing on improved training
algorithms for Generative Adversarial Networks (GANs) and on their application to image generation
and speech synthesis.
Given the recent strides in GAN training, improving the stability of the training process remains
essential. The first part of this thesis therefore focuses on algorithmic advances for GAN training,
examining strategies that mitigate the instabilities encountered during training and thereby refine the
training process as a whole. We propose a novel weight-based algorithm that strengthens the Generator.
Theoretical analysis suggests that it outperforms the baseline algorithm by producing a more potent
Generator at each iteration, and empirical results show substantial accuracy improvements, ranging
between 5% and 50%, along with faster convergence on both synthetic and image datasets.
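As a purely illustrative sketch of the general idea of weighting generator updates (the function name and the scalar weighting scheme below are hypothetical, not the specific algorithm developed in the thesis):

```python
import numpy as np

def weighted_generator_step(params, grad, lr=1e-2, w=2.0):
    """One hypothetical gradient step in which the generator's gradient
    is multiplied by a weight w > 1, amplifying the update relative to
    an unweighted baseline step."""
    return params - lr * (w * grad)

params = np.array([0.5, -0.3])
grad = np.array([0.1, 0.2])
baseline = params - 1e-2 * grad                    # unweighted step
weighted = weighted_generator_step(params, grad)   # weighted step
```

With w > 1, the weighted step moves the generator parameters further along the descent direction than the baseline step, which is one way a "more potent Generator at each iteration" could be realized.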
In the realm of GAN loss functions, we introduce a novel approach based on cumulant generating
functions, which offers a fresh perspective by encompassing various divergences and distances and
relies on a recently derived variational formula. We show that the corresponding optimization is
equivalent to Rényi divergence minimization, thus offering a (partially) unified perspective on GAN
losses: the Rényi family encompasses the Kullback-Leibler divergence (KLD), reverse KLD, the Hellinger
distance, and the χ²-divergence. The approach also enhances training stability, particularly when weaker
discriminators are employed, and yields substantial improvements in image generation on CIFAR-10 and
ImageNet.
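For reference, the Rényi divergence of order α (the standard definition, not a result of this thesis) is

```latex
R_\alpha(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1} \,\log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx ,
\qquad \alpha \in (0,1) \cup (1,\infty),
```

and it recovers the listed divergences at special orders: R_α(P‖Q) → KLD(P‖Q) as α → 1, R_{1/2} is a monotone function of the squared Hellinger distance, R_2 = log(1 + χ²(P‖Q)), and reverse KLD is the α → 1 limit with the arguments P and Q exchanged.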
Disentangled representations are crucial for capturing probability distributions and measuring
divergences effectively. Mutual information (MI) estimation, typically based on the Kullback-Leibler
divergence (KLD), is commonly used to enforce disentanglement. We explore variational representations,
in particular those based on minimizing Rényi divergences, as an alternative to KLD; Rényi divergences
offer advantages when comparing distributions of different types. We emphasize scalable neural-network
estimators for efficient MI estimation. Despite the potential for large statistical estimation errors,
incorporating a variational representation based on Rényi divergences proves feasible and effective. The
method is particularly successful at enhancing stability on real biological data, enabling the detection
of rare sub-populations even with limited samples. Moreover, precisely estimating divergences remains a
significant challenge in many machine learning tasks, especially for high-dimensional datasets, where
estimator variance can grow large. To address this challenge, we incorporate an explicit variance
penalty (VP) into the objective function of the divergence estimator. The penalty reduces the variance
associated with the estimator, improving the accuracy of divergence estimates.
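As a minimal numerical sketch of a variance penalty, assuming the Donsker-Varadhan variational form of the KLD objective (the function names and the exact placement of the penalty are illustrative, not the thesis's formulation):

```python
import numpy as np

def dv_objective(f_p, f_q):
    """Donsker-Varadhan lower bound on KLD: E_P[f] - log E_Q[exp f],
    evaluated on critic outputs f_p (samples from P) and f_q (samples from Q)."""
    return f_p.mean() - np.log(np.exp(f_q).mean())

def dv_objective_with_vp(f_p, f_q, lam=0.1):
    """Same bound minus an explicit variance penalty (VP) on the
    exponential term, discouraging high-variance estimates."""
    return dv_objective(f_p, f_q) - lam * np.var(np.exp(f_q))

rng = np.random.default_rng(0)
f_p = rng.normal(0.5, 1.0, size=1000)   # critic outputs on samples from P
f_q = rng.normal(0.0, 1.0, size=1000)   # critic outputs on samples from Q
```

Since the penalty term is non-negative, setting lam > 0 can only lower the objective, trading a small amount of bias for reduced estimator variance.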
The second part of the thesis turns to practical applications in speech synthesis, namely transforming
one voice into another (voice conversion) and turning written text into spoken words (text-to-speech
synthesis). We introduce techniques for many-to-many voice conversion. Leveraging the earlier
weight-based algorithm, we propose a weight multiplication approach that enhances the Generator's
gradients, making it more adept at fooling the Discriminator and yielding a robust Weighted StarGAN
(WeStarGAN) system. WeStarGAN achieves significantly better performance than conventional methods,
with preference scores of 75% and 65% for subjective speech quality and speaker similarity, respectively.
Neural vocoders often struggle to generalize, especially to unseen speakers and conditions.
We introduce the Speaker Conditional WaveRNN (SC-WaveRNN), which leverages speaker embeddings to
improve speech quality and performance. This variant significantly outperforms the baseline WaveRNN,
achieving improvements of up to 95% in Mean Opinion Score (MOS) for unseen speakers and conditions. We
extend this work by implementing a multi-speaker text-to-speech (TTS) synthesis approach that effectively
tackles zero-shot speaker adaptation.
In the realm of universal TTS, we present a system capable of generating speech with varied speaking
styles and speaker characteristics, without explicit style annotations or speaker labels. We propose a
novel approach based on Rényi divergence and disentangled representations. This method effectively
reduces content and style leakage, yielding substantial improvements in word error rate and speech
quality: approximately 16-20% in MOS speech quality, alongside a 15% boost in MOS style similarity.
Lastly, the growing use of digital assistants underscores the importance of TTS synthesis on modern
devices, where generating intelligible speech in noisy environments is crucial. Our transfer learning
approach to TTS combines two effective strategies: Lombard speaking-style data and Spectral Shaping and
Dynamic Range Compression (SSDRC). The resulting system, Lombard-SSDRC TTS, significantly improves
intelligibility, with relative gains of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in
competing-speaker noise (CSN) compared to state-of-the-art TTS methods. Subjective evaluations confirm
these improvements, with a median keyword correction rate increase of 455% for SSN and 104% for CSN
over the baseline TTS method.