synthesis@smartkom project plan (excerpt)
The goal of the speech synthesis project within the SmartKom consortium is to develop a speech synthesis module that is capable of producing natural-sounding German speech. This general goal is achieved when the user of the SmartKom system is satisfied with the system's voice. This includes intelligible speech and a friendly voice, but also the appropriateness of the system response for a given task and within a certain dialogue state. The speech output component to be developed has to be in accordance with other modes of the multimodal interaction possibly used in parallel to speech. The goal, therefore, includes the successful integration of the module within the SmartKom system. An additional goal of this project is to invent and explore innovative methods within the framework of speech synthesis in order to contribute to the research carried out in this field.
A. Natural speech.
The naturalness of the speech output is related to two different aspects: the segmental quality of the speech and its prosodic quality. Whether this goal has been reached can only be established through formal evaluations.
B. Friendly voice.
In order to obtain a friendly voice that can be used within the system we have to select an appropriate speaker. The objective is to find a speaker whose voice will sound natural when used for data-based speech synthesis, including stability for a specific signal processing method.
C. Appropriate system response.
The synthesized speech has to be accepted by the user of the system given a specific task and interaction state. This may include the partial implementation of non-German synthesis for tasks where the data is usually non-German. The speech output has to be coordinated with the other modalities that might be used in parallel or in sequence. The speech has to be applicable to the three different SmartKom application scenarios Public, Mobile and Home/Office.
D. Innovative methods.
In order to underline the research character of this project we have to allocate enough time to pursue less direct solutions that could eventually lead to innovative technologies. Regular publications at established conferences and relevant workshops, as well as articles in research journals, are essential to show that important work is carried out at the IMS in the field of speech synthesis.
E. System integration.
Although it is a prerequisite for all the other objectives, system integration is listed as an objective in its own right. It includes porting the system to the platform required in the SmartKom system (possibly even a different operating system). Another important objective is the specification and implementation of the interfaces to and from the synthesis module.
General Approach and Contractual Aspects
As the technical task is to develop a speech output module for a multimodal system, all sub-tasks of a text-to-speech (TTS) system and of a concept-to-speech (CTS) system have to be taken care of. For the following items detailed work will be carried out within the project.
A. Selecting a friendly voice
Among a number of available speakers we have to find the one whose voice is most appropriate for speech synthesis. The general approach is to perceptually evaluate test recordings of the different speakers under various conditions.
B. Create a new diphone voice
The baseline speech synthesis method is state-of-the-art diphone synthesis. We will first use freely available diphones and then record diphones with the voice selected under task A. The diphones have to be integrated into the system.
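To illustrate what a diphone inventory has to cover, the following minimal sketch expands a phoneme sequence into the diphone units needed to synthesize it. The function name `to_diphones`, the silence symbol `_` and the `a-b` naming scheme are assumptions made for this example; they are not taken from the project or from any particular synthesis system.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone names that cover a phoneme sequence.

    The sequence is padded with a silence symbol on both sides so that the
    transitions into the first and out of the last phoneme are covered too.
    The "a-b" naming scheme is hypothetical.
    """
    padded = [silence] + list(phonemes) + [silence]
    # Each diphone spans the transition between two adjacent phonemes.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Illustrative transcription of a short German word.
diphones = to_diphones(["h", "a", "l", "o:"])
print(diphones)
```

A sequence of n phonemes yields n + 1 diphones under this padding convention, which gives a rough idea of how many distinct transitions a recording script has to elicit.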
C. Construction of the speech synthesis database
The speech database will be recorded from the speaker selected under A. Its size and content have to reflect the nature of the speech synthesis methods that will be used (see tasks B and D), as well as the application domains of the system.
D. Development and integration of natural speech synthesis methods
On top of the diphone system a new synthesis approach using non-uniform segments will be developed. It will be based on the experience of known approaches published in the field but will take into account the specific character of the project.
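The idea of non-uniform segments can be sketched as follows. The plan does not specify a selection algorithm, so this example uses a simple greedy longest-match strategy purely for illustration: it covers a target phoneme sequence with the longest contiguous stretches found in the recorded material. The function `select_units` and the toy data are hypothetical, and published unit selection approaches typically also weigh target and concatenation costs, which are omitted here.

```python
def select_units(target, database):
    """Greedily cover `target` with the longest segments found in `database`.

    `target` is a list of phoneme symbols; `database` is a list of recorded
    utterances, each given as a list of phoneme symbols. Returns a list of
    tuples, one per selected segment. Illustrative only: no cost functions.
    """
    # Index every contiguous sub-sequence that occurs in the recordings.
    available = set()
    for utterance in database:
        for start in range(len(utterance)):
            for end in range(start + 1, len(utterance) + 1):
                available.add(tuple(utterance[start:end]))

    units = []
    pos = 0
    while pos < len(target):
        # Try the longest remaining stretch first, then shorter ones.
        for end in range(len(target), pos, -1):
            segment = tuple(target[pos:end])
            if segment in available:
                units.append(segment)
                pos = end
                break
        else:
            raise ValueError(f"no recorded unit covers phoneme {target[pos]!r}")
    return units


# Toy recordings and target; the transcriptions are made up for the example.
recordings = [["g", "u:", "t", "@", "n"], ["t", "a:", "k"]]
units = select_units(["g", "u:", "t", "a:", "k"], recordings)
print(units)
```

With these toy recordings the target is covered by two segments of different length, which is what distinguishes the approach from a fixed diphone inventory.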
E. Development of a prosody module for multimodal speech synthesis
We will develop a method of prosody prediction that is capable of generating the appropriate accents and prosodic boundaries in a certain dialogue situation. To do so, syntactic and other context information will be derived from the language generation component that is developed within the SmartKom project.
F. Definition of the Interfaces between SmartKom modules
Various interfaces will have to be defined between the speech output component and other modules of the SmartKom system. Among them are the interfaces to the language generation module, the audio module, the dynamic lexicon and the presentation manager. The interface language is XML, with each interface described in an XML Schema.
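As a rough illustration of what such an XML interface might look like from the synthesis side, the sketch below parses a request that carries a dialogue state and per-word prosodic annotations. The element and attribute names (`synthesisRequest`, `dialogueState`, `word`, `accent`, `boundary`) are invented for this example and do not reflect the actual SmartKom schemas; only the Python standard library parser is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical message from the language generation module.
MESSAGE = """<synthesisRequest dialogueState="confirmation">
  <sentence>
    <word accent="none">Der</word>
    <word accent="pitch">Film</word>
    <word accent="none" boundary="phrase">beginnt</word>
  </sentence>
</synthesisRequest>"""


def parse_request(xml_text):
    """Return (dialogue_state, words) extracted from a request message.

    Each word is a dict with its text, its accent label (default "none")
    and an optional boundary label (None when the attribute is absent).
    """
    root = ET.fromstring(xml_text)
    words = [
        {
            "text": word.text,
            "accent": word.get("accent", "none"),
            "boundary": word.get("boundary"),
        }
        for word in root.iter("word")
    ]
    return root.get("dialogueState"), words


state, words = parse_request(MESSAGE)
print(state)
for word in words:
    print(word)
```

Note that `xml.etree.ElementTree` only checks well-formedness; validating messages against a schema would need an additional tool.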
G. Evaluation of speech synthesis
To ensure optimal speech synthesis quality and a system response that is appropriate in a specific dialogue context, the speech output component has to be formally evaluated. Since only few criteria can be tested objectively, the main emphasis is put on perceptual evaluation.
H. System integration
The system has to be integrated into the SmartKom testbed and into the SmartKom demonstrator. This task may include porting the module from Linux to a different operating system such as Windows.
For several other tasks that are necessary to build the speech output component we will integrate already existing results, off-the-shelf solutions or research work from outside the SmartKom project. Among these are the speech synthesis architecture itself, the general text pre-processing, the full-form lexicon, the duration prediction and the F0 modeling. The SmartKom project group is integrated within the phonetics group of IMS, from which valuable input is expected. We also aim to establish several cooperations that will have a positive impact on the project. Cooperations are currently being negotiated with the Faculté Polytechnique de Mons (in the field of speech signal processing), the Oregon Graduate Institute (for the tasks of evaluation and non-uniform unit selection) and the Royal Institute of Technology KTH (for multimodal speech synthesis). Additional technical help will come from several technical committees to be organized internally, namely the project administration, the Festival administration group, the audio studio administration group and the system administration group.