Abstract
Speech rhythm refers to the rhythmic patterns and timing variations that occur
in spoken language. It encompasses the natural flow, stress patterns, and timing
of speech sounds, syllables, and words. Rhythm constitutes an important dynamic
prosodic feature of speech that is linked with speech perception. The detection of
speech rhythm is a significant task with diverse applications. This study
focuses on using rhythmic measures to estimate voice preference. It is motivated by the hypothesis that voices exhibiting specific
rhythm patterns are generally preferred by listeners.
In this thesis, speech rhythm was studied as a possible predictor of listener preference. Even though rhythm can be perceived by humans, there is no universally
accepted definition or measure of speech rhythm in the scientific community. In
the literature, there is strong evidence that rhythm is encoded in the amplitude
envelope of a signal. Typically, the envelope is decomposed into partials, and the
instantaneous frequency of each partial is then extracted, which is assumed to carry the
information regarding the signal’s rhythmicity. Two techniques were utilized to
achieve the decomposition of the envelope into meaningful components. The first
technique, which was proposed in a previous study, includes extracting rhythmic
measures via an Empirical Mode Decomposition (EMD) of the envelope. Here,
we propose extracting the same measures by applying an AM-FM decomposition
to the envelope instead of EMD. This modification has the potential to improve
the accuracy of the resulting values, since EMD is not mathematically robust. The
envelope, although informative to some extent, is a simplified representation of
the speech signal. It lacks important elements like pitch, which could potentially
contribute to the understanding of rhythm. Relying solely on the envelope may
overlook relevant rhythmic features present in the speech signal. We hypothesize
that the rhythmicity of speech is closely related to the manner in which individuals transition between syllables. Therefore, an approach that directly captures
the rhythmicity of speech was introduced by considering the segment of the speech
signal associated with syllable transitions. This effectively addresses the concern
of information loss that occurs during envelope extraction.
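As a rough illustration of the envelope-based idea described above, the following minimal sketch extracts an amplitude envelope and the instantaneous frequency of a single envelope component. It is an assumption-laden toy, not the procedure used in this thesis: the envelope is taken as the magnitude of an FFT-based analytic signal, the decomposition step (EMD or AM-FM) is omitted, and the mean-removed envelope of a synthetic signal stands in for one partial. All function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal built in the frequency domain (discrete Hilbert construction)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC term
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist term
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def amplitude_envelope(x):
    """Envelope as the magnitude of the analytic signal (one common choice)."""
    return np.abs(analytic_signal(x))

def instantaneous_frequency(component, fs):
    """Instantaneous frequency in Hz from the unwrapped analytic phase."""
    phase = np.unwrap(np.angle(analytic_signal(component)))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Synthetic stand-in for speech (assumption): a 200 Hz carrier whose
# amplitude is modulated at 4 Hz, a rate in the range of syllable rhythm.
fs = 1000
t = np.arange(2 * fs) / fs
signal = (1.0 + 0.5 * np.cos(2 * np.pi * 4.0 * t)) * np.cos(2 * np.pi * 200.0 * t)

envelope = amplitude_envelope(signal)
component = envelope - envelope.mean()    # stand-in for one envelope partial
inst_freq = instantaneous_frequency(component, fs)
```

For this synthetic input the instantaneous frequency stays close to the 4 Hz modulation rate. Real speech envelopes contain several overlapping components, which is why a decomposition step precedes the frequency extraction.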
During this research, data consisting of speech signals from multiple speakers
were utilized. The information regarding the preferred speakers, as determined
by listeners, was also available. This knowledge allowed the investigation of the
underlying factors contributing to voice preference and the analysis of the specific
characteristics that make certain speakers more preferred than others. The experiments were extended beyond the natural speaking rate to a fast speaking
style, so that preference and rhythm in fast speech were explored as well.
Statistical analyses were conducted to evaluate the suitability of rhythmic metrics derived from envelope and signal-based techniques for the task at hand. Findings revealed that the envelope-derived metrics are heavily influenced by speech
rate and are not well-suited for accurately capturing rhythm. In contrast,
syllable-transition metrics derived directly from the speech signal showed promising
results. A satisfactory separation between preferred and non-preferred speakers
was achieved, effectively capturing certain characteristics that influence listeners’
preference. One-way ANOVA and pairwise comparison tests were performed to
validate the statistical significance of the differences between speakers.
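To make the statistical step concrete, the sketch below computes the one-way ANOVA F statistic for a rhythm metric grouped by speaker. The function name and the numbers are purely illustrative assumptions, not data or code from this thesis. The p-value and the pairwise comparisons are left out, since they require the F distribution and a post-hoc procedure.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up metric values for three hypothetical speakers
metric_by_speaker = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(metric_by_speaker)
```

A large F value relative to the F distribution with `df_b` and `df_w` degrees of freedom suggests that at least one speaker's mean differs from the others. Pairwise tests are then needed to identify which speakers differ.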
The results based on syllable transitions indicate promising avenues for future
research. Given the multi-component nature of preference, exploring
additional metrics is crucial for improving overall performance and for
achieving a comprehensive and reliable evaluation of listener preference.