Abstract
Speech recordings are everywhere, from social media, YouTube, and online learning to
podcasts and audiobooks. In today’s fast-paced world, it is sometimes necessary to speed
up speech recordings in order to promote faster information consumption. A population
group that benefits the most from such technologies is visually impaired individuals who
employ screen reading on their mobile phones. A series of algorithms have been developed
for the time-scale expansion or compression of speech recordings. It is well known that
fast speech, also known as time-scale compressed speech, is less intelligible due to a loss
of speech parts that are important in distinguishing syllables and words. The majority of
these parts are non-stationary in nature, such as transient sounds, plosives, and fricatives.
In this work, we investigate algorithms for non-stationary speech protection in order
to provide highly intelligible time-scale compression. We base our experiments on the so-called Waveform Similarity Overlap-and-Add (WSOLA) method of time-scale compression.
WSOLA is capable of providing both uniform and non-uniform time-scale compression. We
propose to characterize speech waveforms according to their non-stationarity using simple
time and frequency domain criteria. Utilizing a frame-by-frame analysis, the first criterion
(C1) is based on the RMS energy of each frame. Additionally, we implement a Line Spectral
Frequency (LSF)-based criterion, named C2, and in combination with C1, we end up with
a hybrid non-stationarity detection criterion named C3. C1 and C3 are applied to a
dataset of Greek speech recordings named GrHarvard. The latter consists of 720 sentences,
recorded by speakers of both genders, that form 72 phonemically balanced lists of 10 sentences each.
Intelligibility and preference experiments were performed on four of the GrHarvard
lists involving both sighted and visually impaired individuals. Subsequently, a statistical
analysis was carried out to assess the significance of the differences in the results obtained
from both experiments’ tests. In the first experiment, we conducted a comparative analysis
involving uniform WSOLA, non-uniform C1-based WSOLA, and non-uniform C3-based
WSOLA. The principal objective was to assess whether the incorporation of protective
measures had a positive or negative impact on the intelligibility of speech signals. The
findings consistently demonstrated that C1-based WSOLA outperformed the others in both
intelligibility and user preference. It was followed by C3-based WSOLA, with uniform
WSOLA ranking last. In this experiment, characterized by substantial differences, the
majority of observed variations were found to be statistically significant. In the second
experiment, our objective was to assess the same three methods under equal words per
minute (WPM) conditions. This made it challenging for users to distinguish between
different methods and resulted in more uniform outcomes. Differences primarily stemmed
from variations within the signals, related to the sizes of their stationary and non-stationary
parts. Even though the C1-based method tended to achieve the highest intelligibility
(in most cases except at 0.25), it remained challenging to definitively determine which
method was superior in both preference and intelligibility tests. Moreover, although we
initially expected the visually impaired group to perform better than the control group,
no such difference materialized, mainly due to the limited number of visually impaired
individuals willing to participate in our tests. As a result of these challenges, most of
the observed results did not attain statistical significance, even though a discernible
pattern was occasionally evident among the methods.
Future work may include further parameter tuning of the stationarity detection algorithm. As an example, different lengths of analysis and hop frames can be used, as
well as pitch-synchronous analysis in stationary parts of speech. Furthermore, the base
method used for time-scale compression can be replaced by more complex models
(such as the Harmonic+Noise model). Finally, further experiments, including a larger
sample of visually impaired people, could strengthen statistical
conclusions about the performance of each method.
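To make the frame-by-frame RMS energy analysis behind criterion C1 more concrete, the following is a minimal sketch only. The abstract does not state the exact decision rule, so the function names (`frame_rms`, `flag_nonstationary`), the frame and hop lengths, and the energy-ratio threshold are all illustrative assumptions rather than the settings used in this work.

```python
import numpy as np


def frame_rms(signal, frame_len, hop):
    """Return the RMS energy of each analysis frame of a 1-D signal."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


def flag_nonstationary(rms, ratio_threshold=2.0, eps=1e-12):
    """Flag frames whose RMS energy changes sharply from the previous frame.

    Assumption: a large frame-to-frame energy ratio (in either direction)
    is used as a stand-in for non-stationarity; the threshold is hypothetical.
    """
    ratio = (rms[1:] + eps) / (rms[:-1] + eps)
    change = np.maximum(ratio, 1.0 / ratio)  # symmetric rise/fall measure
    flags = np.zeros(len(rms), dtype=bool)
    flags[1:] = change > ratio_threshold
    return flags


# Usage: a quiet segment followed by a sudden loud segment.
sig = np.concatenate([0.01 * np.ones(400), np.ones(400)])
rms = frame_rms(sig, frame_len=200, hop=200)
flags = flag_nonstationary(rms)
```

In a non-uniform scheme, frames flagged this way would be candidates for protection (compressed less or not at all), while the remaining frames would absorb more of the compression; how that budget is distributed is not specified here.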