synthesis@smartkom project plan (excerpt)
The goal of the speech synthesis project within the SmartKom consortium
is to develop a speech synthesis module that is capable of producing natural-sounding
German speech. This general goal is achieved when the user of
the SmartKom system is satisfied with the system's voice. This includes
intelligible speech and a friendly voice, but also the appropriateness
of the system response for a given task and within a certain dialogue state.
The speech output component to be developed has to be consistent with
other modes of the multimodal interaction that may be used in parallel with
speech. The goal, therefore, includes the successful integration of the
module within the SmartKom system. An additional goal of this project
is to invent and explore innovative methods within the framework of speech
synthesis in order to contribute to the research carried out in this field.
Objectives
A. Natural speech.
The naturalness of the speech output depends on two different aspects:
the segmental quality of the speech and its prosodic quality. Whether this goal
has been met can only be assessed through formal evaluations.
B. Friendly voice.
In order to obtain a friendly voice that can be used within the system
we have to select an appropriate speaker. The objective is to find a speaker
whose voice sounds natural when used for data-based speech synthesis
and remains stable under a specific signal processing method.
C. Adequacy.
The synthesized speech has to be accepted by the user of the system given
a specific task and interaction state. This may include the partial implementation
of non-German synthesis for tasks where the data is usually non-German.
The speech output has to be coordinated with the other modalities that
might be used in parallel or in sequence. The speech has to be applicable
to the three different SmartKom application scenarios Public, Mobile and
Home/Office.
D. Innovative methods.
To underline the research character of this project, we have
to allocate enough time to pursue indirect solutions that could eventually
lead to innovative technologies. Regular publications at established conferences
and relevant workshops, as well as articles in research journals, are essential to
show that important work is carried out at the IMS in the field of speech
synthesis.
E. System integration.
Although it is a prerequisite for all the other objectives, system integration
is listed as an objective in its own right. It includes the porting of the
system to the platform required in the SmartKom system (possibly even a
different operating system). Another important objective is the specification
and implementation of the interfaces to and from the synthesis module.
General Approach and Contractual Aspects
As the technical task is to develop a speech output module for a multimodal
system, all sub-tasks of a text-to-speech (TTS) system and of a concept-to-speech
(CTS) system have to be addressed. Detailed work on the following items
will be carried out within the project.
A. Selecting a friendly voice
Among a number of available speakers we have to find the one whose voice
is most appropriate for speech synthesis. The general approach is to perceptually
evaluate test recordings of the different speakers under various conditions.
B. Create a new diphone voice
The baseline speech synthesis method is state-of-the-art diphone synthesis.
We will first use freely available diphones and then record diphones with
the voice selected under task A. The diphones have to be integrated into
the system.
C. Construction of the speech synthesis database
The speech database will be recorded from the speaker selected under A.
Its size and content have to reflect the nature of the speech synthesis
methods that will be used (see tasks B and D), as well
as the application domains of the system.
D. Development and integration of natural speech synthesis methods
On top of the diphone system a new synthesis approach using non-uniform
segments will be developed. It will be based on the experience of known
approaches published in the field but will take into account the specific
character of the project.
E. Development of a prosody module for multimodal speech synthesis
We will develop a method of prosody prediction that is capable of generating
the appropriate accents and prosodic boundaries in a certain dialogue situation.
To do so, syntactic and other context information will be derived from
the language generation component that is developed within the SmartKom
project.
F. Definition of the Interfaces between SmartKom modules
Various interfaces will have to be defined between the speech output component
and other modules of the SmartKom system. Among them are the interfaces
to the language generation module, the audio module, the dynamic lexicon
and the presentation manager. The interface language is XML, with the interfaces
described in schemata.
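As an illustration only, the following sketch shows how the synthesis module might read an XML message from the language generation module. The message layout is hypothetical: the element and attribute names (synthesisRequest, utterance, word, boundary, scenario, accent) are assumptions made for this example and are not taken from the SmartKom schemata. The sketch only parses the message; validation against a schema would require an additional library and is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: all element and attribute names are illustrative
# assumptions, not the actual SmartKom interface definition.
SAMPLE_REQUEST = """\
<synthesisRequest scenario="Mobile">
  <utterance lang="de">
    <word accent="true">Willkommen</word>
    <word>bei</word>
    <word accent="true">SmartKom</word>
    <boundary type="major"/>
  </utterance>
</synthesisRequest>
"""


def parse_request(xml_text):
    """Turn the hypothetical request into a plain dictionary."""
    root = ET.fromstring(xml_text)
    utterance = root.find("utterance")
    tokens = []
    for child in utterance:
        if child.tag == "word":
            # A missing accent attribute is treated as "not accented".
            tokens.append({"text": child.text,
                           "accent": child.get("accent") == "true"})
        elif child.tag == "boundary":
            tokens.append({"boundary": child.get("type")})
    return {
        "scenario": root.get("scenario"),
        "lang": utterance.get("lang"),
        "tokens": tokens,
    }


if __name__ == "__main__":
    print(parse_request(SAMPLE_REQUEST))
```

Carrying accent and boundary marks in the message itself is one possible way for the language generation component to hand context information to the prosody module; the actual division of labour is a matter for the interface definition.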
G. Evaluation of speech synthesis
To ensure optimal speech synthesis quality and a system response that is
appropriate in a specific dialogue context the speech output component
has to be formally evaluated. Since only a few criteria can be tested objectively,
the main emphasis is put on perceptual evaluation.
H. System integration
The system has to be integrated into the SmartKom testbed and into the
SmartKom demonstrator. This task may include porting the module from Linux
to a different operating system such as Windows.
For several other tasks that are necessary to
build the speech output component we will integrate already existing results,
off-the-shelf solutions or research work outside the SmartKom project.
Among these are the speech synthesis architecture itself, the general text preprocessing,
the full-form lexicon, the duration prediction and F0 modeling. The SmartKom
project group is integrated within the phonetics group of IMS from which
valuable input is expected. We also aim at several cooperations that will
have a positive impact on the project. Cooperations are currently being negotiated
with the Faculté Polytechnique de Mons (in the field of speech signal
processing), the Oregon Graduate Institute (for the tasks of evaluation
and non-uniform unit selection) and the Royal Institute of Technology KTH
(for multimodal speech synthesis). Additional technical help will come
from several technical committees to be organized internally, namely the
project administration, the Festival administration group, the audio studio
administration group and the system administration group.