E-Locus - Institutional Repository of the University of Crete

Post-graduate theses

Identifier

000460697

Title

Exploration of non-stationary speech protection for highly intelligible time-scale compression.

Alternative Title

Εξερεύνηση προστασίας μη στάσιμου λόγου για υψηλής καταληπτότητας συμπίεση σε χρονική κλίμακα

Author

Πανταλός, Παναγιώτης Ε.

Thesis advisor

Στυλιανού, Ιωάννης

Reviewer

Πανταζής, Ιωάννης
Τσαγκατάκης, Γρηγόριος

Abstract

Speech recordings are everywhere, from social media, YouTube, and online learning to podcasts and audiobooks. In today’s fast-paced world, it is sometimes necessary to speed up speech recordings so that information can be consumed faster. A population group that benefits the most from such technologies is visually impaired individuals who use screen reading on their mobile phones. A series of algorithms have been developed for the time-scale expansion or compression of speech recordings. It is well known that fast speech, also known as time-scale compressed speech, is less intelligible due to the loss of speech parts that are important for distinguishing syllables and words. The majority of these parts are non-stationary in nature, such as transient sounds, plosives, and fricatives. In this work, we investigate algorithms for non-stationary speech protection in order to provide highly intelligible time-scale compression. We base our experiments on the so-called Waveform Similarity Overlap-and-Add (WSOLA) method of time-scale compression. WSOLA is capable of providing both uniform and non-uniform time-scale compression. We propose to characterize speech waveforms according to their non-stationarity using simple time and frequency domain criteria. Utilizing a frame-by-frame analysis, the first criterion (C1) is based on the RMS energy of each frame. Additionally, we implement a Line Spectral Frequency (LSF)-based criterion, named C2, and by combining it with C1, we obtain a hybrid non-stationarity detection criterion named C3. C1 and C3 are applied to a dataset of Greek speech recordings named GrHarvard. The latter consists of 720 sentences from speakers of both genders that form 72 phonemically balanced lists of 10 sentences each. Intelligibility and preference experiments were performed on four of the GrHarvard lists, involving both sighted and visually impaired individuals.
Subsequently, a statistical analysis was carried out to assess the significance of the differences in the results obtained from the tests of both experiments. In the first experiment, we conducted a comparative analysis involving uniform WSOLA, non-uniform C1-based WSOLA, and non-uniform C3-based WSOLA. The principal objective was to assess whether the incorporation of protective measures had a positive or negative impact on the intelligibility of speech signals. The findings consistently demonstrated that C1-based WSOLA outperformed the others in both intelligibility and user preference. It was followed by C3-based WSOLA, with uniform WSOLA ranking last. In this experiment, where the differences between methods were substantial, the majority of observed variations were found to be statistically significant. In the second experiment, our objective was to assess the same three methods under equal words per minute (WPM) conditions. This made it challenging for users to distinguish between the methods and resulted in more uniform outcomes. Differences primarily stemmed from variations within the signals, related to the sizes of their stationary and non-stationary parts. Even though the C1-based method tended to achieve the highest intelligibility (in most cases except at 0.25), it remained difficult to determine definitively which method was superior in both preference and intelligibility tests. Moreover, contrary to our initial expectation that the visually impaired group would perform better than the control group, no such difference materialized, mainly due to the limited number of visually impaired participants willing to take part in our tests. Consequently, the majority of observed results did not attain statistical significance, even though a discernible pattern was occasionally evident among the methods.
Future work may include further parameter tuning of the stationarity detection algorithm. For example, different analysis frame lengths and hop sizes can be used, as well as pitch-synchronous analysis in stationary parts of speech. Furthermore, the base method used for time-scale compression can be replaced by more complex models (such as the Harmonic+Noise model). Finally, further experiments, including a larger sample of visually impaired people, could strengthen statistical conclusions about the performance of each method.
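
As a rough illustration of the frame-by-frame RMS analysis that the abstract associates with criterion C1, the sketch below computes the RMS energy of each frame and flags a frame when its RMS differs sharply from that of the previous frame. The abstract does not state how C1 turns RMS values into a decision, so the ratio rule, the `ratio_threshold` value, the frame length, and the hop size are all assumptions made for this example, and the function names are hypothetical. It is not the thesis's implementation.

```python
import math

def frame_rms(signal, frame_len, hop):
    """Return the RMS energy of every full frame of `signal`."""
    rms = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return rms

def flag_nonstationary(rms, ratio_threshold=2.0, floor=1e-8):
    """Flag frame i when its RMS and that of frame i-1 differ by more than
    `ratio_threshold` (an assumed rule, not the criterion from the thesis)."""
    if not rms:
        return []
    flags = [False]  # the first frame has no predecessor to compare with
    for prev, cur in zip(rms, rms[1:]):
        larger = max(prev, cur)
        smaller = max(min(prev, cur), floor)  # floor avoids division by zero
        flags.append(larger / smaller > ratio_threshold)
    return flags

# Synthetic test signal: a quiet sine followed by a loud sine (period 20 samples).
signal = [0.1 * math.sin(2 * math.pi * n / 20) for n in range(400)]
signal += [1.0 * math.sin(2 * math.pi * n / 20) for n in range(400)]

rms = frame_rms(signal, frame_len=100, hop=100)
flags = flag_nonstationary(rms)
```

With this synthetic input, only the frame at the amplitude jump is flagged. In a non-uniform scheme, frames flagged in this way would be the ones to compress less (or not at all), while the remaining frames absorb more of the compression.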

Language

English

Subject

Intelligibility

Non-stationarity protection

Speech processing

Speech rate

Time-scale compression