Abstract |
Speech is the most effective way to communicate ideas generated in human minds.
However, spoken communication in real life is often affected by noise in the
surroundings which can substantially reduce the intelligibility and perceived quality
of the signal. Techniques to enhance the communication have been proposed in the
past and successfully tested in modern engines like Amazon Alexa, allowing them to
operate in adverse conditions. Ambient noise can disrupt both signal acquisition
by a device and speech perception by the listener. Speech enhancement (SE)
techniques are developed to restore speech from its disrupted observations, and
listening enhancement (LE) techniques are designed to improve the perceived
intelligibility by altering the speech before its presentation in noise, as naturally
produced speech is not always very intelligible.
SE and LE systems are often operated as two independent modules in modern
devices, which limits their performance. The effort in this thesis is to combine SE
and LE techniques into an end-to-end system for communication
applications. We approach the problem from the neural network perspective. As
such, multiple novel architectures for SE and LE were invented, and the concepts from
those models have been used to build the final end-to-end system.
Regarding speech enhancement (SE), three new architectures have been invented,
two of which are in the feature domain and one in the waveform domain. The feature
domain architectures formulate the enhancement task in the short-time Fourier
transform (STFT) representation of speech and are therefore parametrically less
complex. Features from the two-dimensional (2D) representation of speech are
extracted with the use of the gruCNN neural cell, which is found effective in isolating noises
with high variance. The gruCNN-SE model has outperformed state-of-the-art speech
enhancement systems with standard convolution (CNN) and long short-term memory
(LSTM) cells. Subsequently, a bidirectional extension of the gruCNN module (BigruCNN)
is proposed with the inclusion of backward dependencies among the 2D frames.
In addition, a novel waveform domain network with a characteristic dilation pattern
(SE-FFTNet) is presented. The SE-FFTNet is found efficient in learning the statistical
dissimilarity of speech and noise in a noisy observation.
Regarding listening enhancement (LE), a novel WaveNet-like architecture to improve
speech intelligibility for listeners in noise (wSSDRC) is proposed. The wSSDRC system
performs both spectral shaping (SS) and dynamic range compression (DRC) of the
input for intelligibility enhancement. The model is found to produce a median
absolute intelligibility boost of 39% for normal-hearing and 38% for hearing-impaired
listeners in stationary noise over the unprocessed speech.
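To make the notion of dynamic range compression concrete, the following is a minimal sketch of a generic, textbook-style static compressor. It is not the wSSDRC network, nor the DRC stage it is modelled on. The function name `compress_dynamic_range` and the parameters `threshold_db`, `ratio` and `frame_len` are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def compress_dynamic_range(x, threshold_db=-20.0, ratio=4.0, frame_len=256):
    """Frame-wise static compressor (illustrative only, not the thesis model).

    Frames whose RMS level exceeds `threshold_db` are attenuated so that the
    output level rises by only 1/ratio dB per input dB above the threshold.
    """
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    eps = 1e-12  # avoids log10(0) on silent frames
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        level_db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + eps)
        if level_db > threshold_db:
            # Negative gain (attenuation) proportional to the overshoot.
            gain_db = (threshold_db - level_db) * (1.0 - 1.0 / ratio)
            y[start:start + frame_len] = frame * 10.0 ** (gain_db / 20.0)
    return y
```

With these assumed settings, a loud frame is attenuated while a quiet frame below the threshold passes unchanged, which narrows the level difference between the two.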
Subsequently, a novel end-to-end system which combines the objectives of SE and LE
is proposed to enhance the intelligibility of noisy observations. The end-to-end
system was found to increase the listeners’ keyword correct rate in stationary noise
from 2.5% to 60% at 0 dB input SNR, and from about 10% to 75% at 5 dB input SNR,
compared with the unprocessed speech, while substantially outperforming the
modular setup with SE followed by LE.
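The input SNR conditions quoted above (0 dB and 5 dB) describe the power ratio between speech and noise in the mixture. As a minimal sketch of how such a condition can be constructed, the snippet below scales a noise signal so that the mixture reaches a requested SNR. The function name `mix_at_snr` and the synthetic signals in the usage lines are assumptions for illustration, not the evaluation material or procedure used in the thesis.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db` (in dB).

    Assumes `noise` is at least as long as `speech` and is not all zeros.
    Returns the noisy mixture and the scaled noise.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

# Usage with synthetic stand-ins for speech and stationary noise.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
noise = rng.standard_normal(16000)
noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
```

At 0 dB the speech and the scaled noise carry equal power; at 5 dB the speech power is about 3.16 times the noise power.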
|