PhD Thesis
My PhD thesis focuses mainly on two aspects of unit selection speech
synthesis: corpus design, and data management and unit selection during
synthesis.
To design an appropriate corpus for a unit selection system, I examined
a large text corpus of German newspapers (HGC), football texts and
tourist texts about different cities. These texts were
transcribed automatically with the IMS Festival System, and I carried out
statistical analyses of the distribution of different unit types (i.e. words,
syllables, phones and diphones), either taking into account or leaving out
annotated features (e.g. stress, tone, accent, positional attributes,
word class, phonemic context). The annotated text material comprised
about 300,000 sentences. Because of the LNRE (large number of rare events)
nature of language, I
decided to cover the most common syllables and to fill in the remainder
with the diphones that were still missing. Using a greedy algorithm, I
generated a new, smaller corpus of about 4000 sentences. These
sentences were recorded by a professional male speaker and a
professional female speaker and then labelled automatically
with an aligner developed at the IMS. Some of the files were
hand-corrected afterwards.
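
The greedy reduction step can be illustrated with a small sketch. This is
not the thesis implementation: the function name greedy_select, the toy
sentence IDs and the toy unit labels are all assumptions made for
illustration, and the real selection worked on syllables and diphones
with annotated features. The sketch only shows the general idea of
repeatedly picking the sentence that covers the most units that are still
missing.

```python
def greedy_select(sentences, target_units):
    """Greedy coverage: sentences maps a sentence id to its set of units.

    Returns the chosen sentence ids (in pick order) and any units that
    no sentence could cover.
    """
    uncovered = set(target_units)
    remaining = dict(sentences)
    selected = []
    while uncovered and remaining:
        # Pick the sentence that adds the most uncovered units;
        # iterating over sorted ids makes tie-breaking deterministic.
        best_id = max(sorted(remaining),
                      key=lambda sid: len(remaining[sid] & uncovered))
        gain = remaining[best_id] & uncovered
        if not gain:
            break  # nothing left that any sentence can contribute
        selected.append(best_id)
        uncovered -= gain
        del remaining[best_id]
    return selected, uncovered


# Hypothetical toy data: four "sentences" described by the units they contain.
toy_sentences = {
    "s1": {"ha", "lo", "a-l"},
    "s2": {"ha", "ta"},
    "s3": {"lo", "ta", "ge"},
    "s4": {"ge"},
}
toy_targets = {"ha", "lo", "ta", "ge", "a-l"}

selected, uncovered = greedy_select(toy_sentences, toy_targets)
print(selected, uncovered)  # ['s1', 's3'] set()
```

In this toy case two of the four sentences are enough to cover all
target units, which mirrors on a small scale the reduction from about
300,000 sentences to about 4000.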
The second part was the design and implementation of a data
management module to administer the speech corpus efficiently
and to provide fast access to the required units.
For this purpose I decided to
use a decision tree approach to cluster similar units, with each
level of the tree representing a particular feature (e.g. previous
phoneme, stress, tone, accent). The order of the features is
given by the user and is linguistically motivated. The trees can
easily be rebuilt if a feature order turns out to be suboptimal. The
features represent only symbolic attributes of the units, which avoids
relying on unreliable predictions by the system. All unit types use the
same tree model, but with feature orders that depend on the
unit type. The appropriate clusters are accessed efficiently
via indices that represent a feature order.
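
A minimal sketch of such a feature-ordered tree is given here, in Python
rather than the C++ of the actual module. The names build_tree and
lookup, the feature names and the toy units are assumptions for
illustration only. Each tree level corresponds to one feature in a
user-supplied, non-empty order, and the leaves hold the IDs of all units
that share the same symbolic feature values.

```python
def build_tree(units, feature_order):
    """Build a nested dict with one level per feature in feature_order.

    Leaves are lists of unit ids (the clusters). feature_order must
    contain at least one feature name.
    """
    tree = {}
    for unit in units:
        node = tree
        for feat in feature_order[:-1]:
            node = node.setdefault(unit[feat], {})
        node.setdefault(unit[feature_order[-1]], []).append(unit["id"])
    return tree


def lookup(tree, feature_order, query):
    """Follow the feature values of query down the tree to its cluster."""
    node = tree
    for feat in feature_order:
        node = node.get(query[feat])
        if node is None:
            return []  # no unit with this feature combination
    return node


# Hypothetical toy units with purely symbolic features.
units = [
    {"id": 0, "phone": "a", "prev": "h", "stress": "1"},
    {"id": 1, "phone": "a", "prev": "h", "stress": "0"},
    {"id": 2, "phone": "a", "prev": "t", "stress": "1"},
    {"id": 3, "phone": "a", "prev": "h", "stress": "1"},
]
query = {"phone": "a", "prev": "h", "stress": "1"}

order = ["phone", "prev", "stress"]
tree = build_tree(units, order)
cluster = lookup(tree, order, query)
print(cluster)  # [0, 3]

# Rebuilding with a different feature order only changes the tree shape.
alt_order = ["phone", "stress", "prev"]
alt_tree = build_tree(units, alt_order)
print(lookup(alt_tree, alt_order, query))  # [0, 3]
```

With a full query both feature orders return the same cluster. The order
matters once the search stops at an upper level of the tree, because it
determines which features are matched first.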
The unit selection process uses the PSM algorithm, i.e. a top-down
strategy which prefers longer units.
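
The top-down preference for longer units can be sketched as follows.
This is a strongly simplified illustration and not the PSM algorithm
itself: it ignores features and any ranking among candidates, and the
function name select_top_down, the tuple layout of the target and the toy
inventory are all assumptions. The idea shown is only that a unit is
taken at the highest level at which it is available, and that the search
descends to smaller units otherwise.

```python
def select_top_down(node, inventory):
    """Return the longest available units covering the target node.

    node is a tuple (level, label, children); inventory maps a level
    name to the set of labels available in the speech corpus.
    """
    level, label, children = node
    if label in inventory.get(level, set()):
        return [(level, label)]      # longest match found, stop descending
    if not children:
        return [("missing", label)]  # smallest unit is not in the corpus
    chosen = []
    for child in children:
        chosen.extend(select_top_down(child, inventory))
    return chosen


# Hypothetical target: the word "hallo" with its syllables and phones.
target = ("word", "hallo", [
    ("syllable", "ha", [("phone", "h", []), ("phone", "a", [])]),
    ("syllable", "lo", [("phone", "l", []), ("phone", "o", [])]),
])

# Hypothetical inventory: the word is missing, one syllable is present.
inventory = {
    "word": {"welt"},
    "syllable": {"ha"},
    "phone": {"h", "a", "l", "o"},
}

result = select_top_down(target, inventory)
print(result)  # [('syllable', 'ha'), ('phone', 'l'), ('phone', 'o')]
```

Here the whole word is not available, so the first syllable is taken as
one unit and the second syllable is assembled from single phones.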
The new module is integrated into the IMS Festival Speech Synthesis
System and is written in C++.
In my work I will further examine which basic unit is most appropriate for
synthesis, i.e. phone, demi-phone or diphone. I will also compare
different feature orders to determine the features that are most important
for perception.