Abstract
Automatic Speech Recognition (ASR) was first introduced in the 1950s. Since then, considerable effort
has gone into improving speech recognition in single-channel recordings. In the last few years, many
researchers have shown interest in combining speech recognition with multichannel recordings,
as many everyday devices incorporate multiple microphones. These microphones are usually placed in
specific topologies, allowing us to take advantage of the directivity of the input signal and achieve more
robust speech enhancement. Examples of such devices and applications include mobile phones, tablets, home
automation services such as Amazon Echo and Google Home, and digital personal assistants such as
Google Now, Siri and Cortana.
In this thesis, we aim to create a robust ASR system combined with a front-end to improve
speech recognition in challenging environments such as reverberant rooms with or without background
noise. The experiments included scenarios with stationary and moving speakers as well as
overlapping speakers. We divided the problem into three phases. The first phase experimented with the
training data for the acoustic model. To identify the best acoustic model, we trained three models:
one on clean speech signals, one on processed speech signals and one on
the combination of the two training sets. During the second phase, we tested several front-ends,
i.e. array processing techniques, and evaluated them in terms of their speech recognition performance.
Each array processing technique consists of two main modules, a beamformer and a postfilter. In addition,
we proposed a new front-end framework based on binary masks and a Wiener postfilter, which
achieved better recognition results. The recognition results showed that a superdirective
beamformer followed by a Wiener postfilter performs better in single-speaker experiments, while the same
beamformer combined with binary masks performs better in overlapping-speaker experiments. The last
phase used the outcomes of the first two phases to create a robust combination of a
front-end and an acoustic model.
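To illustrate the idea of a binary mask applied after beamforming, the following is a minimal sketch and not the exact criterion used in this thesis. It assumes two magnitude spectrograms (frames by frequency bins), one from a beam steered at the target speaker and one from a beam steered at the interfering speaker. The function names and the parameters `margin_db` and `floor` are hypothetical and chosen only for illustration.

```python
import math


def floored_binary_mask(target_mag, interferer_mag, margin_db=0.0, floor=0.1):
    """Build a binary mask with a spectral floor (illustrative sketch).

    A time-frequency bin is kept (gain 1.0) when the target magnitude
    exceeds the interferer magnitude by more than `margin_db` decibels.
    All other bins are attenuated to `floor` instead of being zeroed.
    Both inputs are lists of frames, each frame a list of magnitudes.
    """
    eps = 1e-12  # guards against division by zero and log of zero
    mask = []
    for t_frame, i_frame in zip(target_mag, interferer_mag):
        row = []
        for t, i in zip(t_frame, i_frame):
            ratio_db = 20.0 * math.log10((t + eps) / (i + eps))
            row.append(1.0 if ratio_db > margin_db else floor)
        mask.append(row)
    return mask


def apply_mask(spectrum, mask):
    """Multiply each time-frequency bin of `spectrum` by its mask gain."""
    return [[s * m for s, m in zip(s_frame, m_frame)]
            for s_frame, m_frame in zip(spectrum, mask)]
```

In this sketch, raising `margin_db` makes the keep decision stricter, and a non-zero `floor` avoids hard zeros in the masked spectrum. Both values here are placeholders, not settings taken from the experiments.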
To evaluate the performance of each acoustic model and each front-end, we used a common
speech recognition metric, the Word Error Rate (WER). The final proposed acoustic model combined
with the proposed front-end led to a significant improvement in WER in all experiments, i.e. stationary
speaker, moving speaker and overlapping speakers. The relative WER improvement of the
processed speech signals over the unprocessed speech signals is 62.4% for the stationary
speaker, 57.9% for the moving speaker and 49.6% for overlapping speakers. In particular, the modification we
proposed for the binary masks used in the front-end for the overlapping-speaker scenarios, namely a
spectral floor and a stricter criterion for applying the binary masks, led to a relative improvement
of 9.9% in WER.
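For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The sketch below shows that standard definition together with a relative-improvement helper. The function names are hypothetical, the formula for relative improvement is the usual one and is assumed rather than quoted from the thesis, and the numbers in the usage note are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(wer_baseline, wer_processed):
    """Relative WER reduction in percent, measured against the baseline WER."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_processed) / wer_baseline
```

As an illustrative example, a baseline WER of 40.0 reduced to 30.0 by a front-end gives `relative_improvement(40.0, 30.0)`, that is, a 25% relative improvement.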