In this thesis the problem of automatic segmentation and handling of large heterogeneous speech corpora is addressed. While standard database systems handle large amounts of data very efficiently, they lack methods for using automatic segmentation and conversion tools as part of their core system. For this reason, a new approach based on the combination of standard database techniques and automatic speech recognition techniques is presented.

In recent years corpus-based techniques have become widely used 1 . In disciplines like automatic speech recognition or speech synthesis, for instance, large speech corpora have become a key technology. However, as such corpora increase in number and comprehensiveness, the number of different tools for creating, maintaining and searching these corpora increases, too. The shortcoming of these speech corpora and the corresponding tools is that most of them are realized in different notations and/or formats. To overcome these deficits, approaches to unify speech corpora by developing a very basic, but mathematically well motivated, annotation model have proven very useful (see for example Bird&Liberman, Bird&Harrington 2 ).

This work contributes to such approaches by focusing on the maintenance and retrieval of spontaneous speech signals, investigating linguistic annotations, automatic segmentation, and the use of database systems.

An overview of the connection between such speech signals and the corresponding annotations is presented in the first chapter, while the second chapter describes the most important annotation models and systems currently discussed in the literature. In the subsequent chapter the most relevant annotation and segmentation systems (manual as well as automatic ones) are introduced and discussed, whereas the relevant database design techniques are worked out in chapter 4. Chapter 5 reports on experiments that were conducted over the last two years by means of an automatic segmentation tool (a so-called "aligner"). The aim of these experiments is to enrich the annotation of large spontaneous speech corpora with time marks automatically produced by the aligner. The underlying database system (COSMAS II 3 ) is also described.

On the basis of these results and experiences a new speech database approach is presented. The approach is characterized by a highly modular architecture, which consists of a freely available database engine, a database management system developed from scratch, and an analysis tool. The database management system provides new features that are not available in standard systems, such as the generation of annotations or the transformation of speech signal formats. The system as well as the underlying technology are addressed in detail in chapter 6. Conclusions and future work are presented in chapter 7, and, finally, speech corpora and system documentation are made available in the appendix.

Linguistic Annotation

In the simplest case, "linguistic annotation" is an orthographic representation of speech, sometimes time-aligned to an audio recording. But there are many other annotations, for instance morphological analysis, part-of-speech tagging and syntactic bracketing, phonetic segmentation and labeling, annotation of disfluencies, prosodic phrasing, intonation, gesture, and discourse structure. Others are the marking of co-reference, named entity tagging, sense tagging, as well as phrase-level or word-level translation. According to (Bird&Liberman), such linguistic annotations may apply to both text and recorded signals.
The focus of this work lies on the presentation and discussion of annotation models for recorded speech signals. In order to facilitate machine-based treatment, all these possible annotations have to be integrated into one annotation model. A survey of annotation models and corresponding systems found in the recent literature is presented in chapter 2. This survey shows that in most cases these annotation models are organized in terms of tiers, e.g., a phoneme tier or a syllable tier. Figure 1 shows a typical tier-structured model (in this case the annotation model of the EMU 4 system).

Fig. 1: A tier-based annotation model

The major difference between the models lies in the organization of the tiers. In the EMU model the hierarchy is built by indexes which establish relations between the segments of different tiers. Relations within one tier are also possible. Because of these common basic notation techniques it is possible to convert all annotation models into one approach such as that of Bird&Liberman. In principle, their approach, the 'annotation graph' model, is defined as a mathematical calculus in which all queries are well formed, no matter whether an annotation is complete or not.
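The annotation graph idea can be illustrated with a minimal sketch (the class, the tier names and the toy data are hypothetical, not taken from Bird&Liberman's formalization): nodes carry optional time stamps, each labeled arc belongs to a tier, and a tier is simply the set of arcs sharing one type label, so queries stay well formed even when some nodes are not yet time-anchored.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)  # node id -> time stamp (None if unanchored)
    arcs: list = field(default_factory=list)   # (start node, end node, tier, label)

    def add_node(self, nid, time=None):
        self.nodes[nid] = time

    def add_arc(self, start, end, tier, label):
        self.arcs.append((start, end, tier, label))

    def tier(self, name):
        # A tier is just the sub-list of arcs carrying one type label.
        return [a for a in self.arcs if a[2] == name]

# Toy example: one word spanning two phonemes.
g = AnnotationGraph()
g.add_node(0, 0.00)
g.add_node(1, 0.32)
g.add_node(2, 0.61)
g.add_arc(0, 2, "word", "see")
g.add_arc(0, 1, "phoneme", "s")
g.add_arc(1, 2, "phoneme", "i:")
```

The word arc and the phoneme arcs share the same nodes, so the tier hierarchy falls out of the graph structure rather than being stored as explicit indexes.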

The Annotation and Segmentation Systems

There are many different methods for generating linguistic annotations. Annotations can be created manually, supported by programs such as annotation editors, or automatically by annotation systems. The automatic systems are divided into pure speech recognition systems and systems that are merely based on speech recognition techniques. Because standard speech recognition systems still suffer from a high word error rate, the latter are the more popular ones. Systems based on speech recognition techniques are known as automatic segmentation systems. Although these systems usually generate many annotation tiers in the above sense, they are referred to as segmentation systems because they need an orthographic transcription of the corresponding speech signal, which has to be provided by hand. The great advantage of such systems, however, is their ability to also generate time marks for the annotations.

A typical representative of an automatic segmentation system is the Stuttgart Alignment System 5 which is used in the work described here (and will be referred to as the "aligner"). The aligner identifies the linguistically relevant boundaries within the speech signal by means of a time alignment between the orthographic annotation and the speech signal. The aligner works in accordance with the principles of automatic speech recognition (ASR), and therefore the same ASR algorithms are used to design this system. This means that the aligner is based on Hidden Markov Models (HMMs, in the HTK version of formerly Entropic, now Cambridge University Engineering Department 6 ). The architecture of the aligner is shown in Figure 2.

Fig. 2: System overview of the aligner

Roughly speaking, the segmentation process consists of three stages. In the first stage the orthographic representation is transformed into a phonetic representation (a grapheme-to-phoneme conversion in a two-step approach of lexicon look-up and rule-based conversion). In the second stage the phonetic transcription is transformed into a system-internal representation (a so-called grammar network). The time alignment between the transformed text and the speech signal, which is itself transformed by means of a specific signal processing algorithm (mel frequency cepstral coefficients), follows in the final stage. How this alignment process was applied to concrete speech corpora is described in the following subsection.
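The first two stages can be sketched as follows. This is a toy illustration only: the lexicon, the fallback letter rules and the function names are invented, and the real aligner's rule set is far richer than a letter-by-letter mapping.

```python
# Toy lexicon for the look-up step (hypothetical entries, SAMPA-like symbols).
LEXICON = {"hallo": ["h", "a", "l", "o:"]}

# Naive rule-based fallback: map each letter to a phoneme symbol.
LETTER_RULES = {"a": "a", "b": "b", "d": "d", "n": "n", "e": "@"}

def grapheme_to_phoneme(word):
    word = word.lower()
    if word in LEXICON:                                 # step 1: lexicon look-up
        return LEXICON[word]
    return [LETTER_RULES.get(c, c) for c in word]       # step 2: rule-based conversion

def to_network(words):
    # Stage two: concatenate the phoneme sequences of all words into a
    # linear "grammar network" that the time alignment stage can walk through.
    network = []
    for w in words:
        network.extend(grapheme_to_phoneme(w))
    return network
```

In the final stage the real system would match this network against the MFCC-transformed signal with HMMs, which is omitted here.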

Experiments with an automatic segmentation system on large spontaneous speech corpora in the context of a speech database system

At the Institut für Deutsche Sprache (IDS), Mannheim, there are two tools for pre-processing and searching discourse transcripts: the Discourse Structure Manager and the retrieval system COSMAS II. The Discourse Structure Manager imports transcripts from different encoding systems, like a partitur editor 7 for the annotation of the discourse speech data, builds up an internal discourse structure, enriches it by collaborating with tools such as an aligner, calculates the Spoken Language Metric, and generates a transcript in a special SGML notation. The Spoken Language Metric describes discourse properties, which are stored in the retrieval system COSMAS II. The enrichment with the results of the aligner extends the transcript retrieval system to a much more powerful speech and transcript retrieval system (and therefore to a Speech Database System).

However, in order to obtain the needed time references for the transcripts, much work remained to be done, because many typical discourse phenomena caused errors in the segmentation process.

The most typical problem sources are the following:

  • discourse phenomena such as turntaking signals, hesitation (interjections)

  • different types of non-speech sequences such as noise, laughter, applause or music

  • technical problems such as bad audio quality, buzzing sound

  • simultaneous passages of two or more speakers

In order to obtain a robust alignment system for these specific discourse-based problems, several different experimental scenarios were defined. First, problems like turntaking signals, hesitations (interjections) and different types of non-speech sequences such as noise, laughter, and so on, were categorized as so-called model-based problems of the aligner. Model-based problems in this context means that there are no specific HMMs for phenomena such as interjections. All the interjections that occur in the corpora are treated as normal text segments. This means that they undergo grapheme-to-phoneme conversion and are modeled by the concatenation of the corresponding phoneme HMMs. It is not possible to recognize these segments if they are lengthened or articulated with multiple peaks. For this reason we designed new HMMs which were not concatenated from phonemes but trained as one-word models (so-called whole-word models). By this, we achieved an increase in accuracy of up to 13% after optimising the models in several experiments. Second, problems like simultaneous passages, but also technical problems of sound signals, were categorized as signal-based problems of the aligner. Most of these problems have been solved by now; for the problems concerning simultaneous passages further work is still required; however, some conceptual solutions and a first prototype are already available. Further experiments will show which ideas and methods are the most adequate ones.
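The model selection described above can be sketched in a few lines. Everything here is illustrative: the interjection list, the model-name prefixes and the helper names are invented; the actual whole-word HMMs are trained acoustic models, not strings.

```python
# Hypothetical set of interjection tokens that receive dedicated whole-word HMMs.
INTERJECTIONS = {"hm", "aeh", "aehm"}

def models_for_token(token, grapheme_to_phoneme):
    """Pick the HMM sequence for one token of the transcript.

    Interjections get a single trained whole-word model; every other token
    is modeled by concatenating the HMMs of its phonemes.
    """
    if token in INTERJECTIONS:
        return ["WHOLEWORD_" + token]
    return ["PH_" + p for p in grapheme_to_phoneme(token)]
```

The point of the switch is that a whole-word model can absorb lengthening and multiple articulation peaks that a fixed phoneme concatenation cannot represent.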

The active database system IMSPhoBase

The speech corpora of the IMS 8 comprise many different types of annotations, such as phoneme level, word level, syllable level, part-of-speech level or syntax level, as well as fundamental frequency, prosodic information, and speech signals in different sound formats. Some segmentations are generated completely manually, some are generated semi-automatically or fully automatically. Many speech signals have not yet been segmented at all. The IMSPhoBase 9 system is a speech database system which was designed to handle all these extremely heterogeneous corpora. The system was implemented with extensions to several transformation programs in order to be able to generate new annotation levels. For this reason, the above-mentioned aligner as well as other signal-based conversion programs are embedded into the system. There are no limits to incorporating conversion programs. Because of its ability to convert data into other formats or to generate new data, the database is called an active system.

The advantage of such an approach is that a user does not have to carry out several transformations before working on the data. All information is available to the user, whether it is already in the database or still has to be generated in the background. For example, should there be more speech signals than phoneme label files in the database, the missing phoneme labels will be generated automatically during the query process. Alternatively, missing labels can be generated by batch processing.
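The "active" query behaviour amounts to generate-on-miss with caching. The following sketch is hypothetical (class and callback names are invented, and the real system invokes the aligner as an external program rather than a Python callback), but it shows the control flow:

```python
class ActiveSpeechDB:
    """Toy model of the active database: labels missing at query
    time are produced by a generation callback and then cached."""

    def __init__(self, aligner):
        self.labels = {}        # signal id -> phoneme labels already in the database
        self.aligner = aligner  # callback standing in for the embedded aligner

    def query_labels(self, signal_id):
        if signal_id not in self.labels:
            # Label file missing: generate it in the background and store it,
            # so the next query finds it in the database.
            self.labels[signal_id] = self.aligner(signal_id)
        return self.labels[signal_id]

# Dummy aligner stub that records how often it was invoked.
calls = []
def dummy_aligner(signal_id):
    calls.append(signal_id)
    return ["h", "a", "l", "o:"]

db = ActiveSpeechDB(dummy_aligner)
```

Batch processing would simply loop this generation step over a list of signal ids ahead of time instead of waiting for a query.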

The main features of the database system are:

  • a simple annotation model. A simple but efficient tier-based annotation model is used to integrate the annotation levels into the database.

  • the database engine mSQL 10 . Because mSQL is public domain software, the whole system can be freely distributed. mSQL provides several interfaces which are used in this system, for example the C interface that sets up the connection to the database management system.

  • the database management system (DBMS) is an in-house and therefore free implementation. For the DBMS no standard tool could be used because of the specific tasks of the approach, so a DBMS was developed from scratch and implemented in C (the IMSPhoBase-MS). The IMSPhoBase-MS provides two data interfaces which are both command line-based. One interface is organized like a tree-based menu for manipulating and maintaining the data, intended for non-expert users, and the other interface is used for direct access by external programs, such as statistics programs.

  • two different user modes for the IMSPhoBase-MS. All the features of the interactive menu mode are also available in a batch mode, in which all tasks are driven by lists without user interaction.

  • an active annotation generation component. Several programs that generate new data, such as annotations and different speech signal formats, are conceptually integrated in the system.

  • the external analysis program. The analysis program is implemented in Perl and its design is characterized by the capability of freely defining all features the user is searching for, and of combining these features logically. An adjustable context range for the hits and a high flexibility concerning the output format are also provided. The definable features cover a wide range from phonetic features to phonemes, words, syllables and part-of-speech tags.

  • an open system architecture. The system design is modular and therefore new modules, such as external statistics programs, can be incorporated without any redesign of the system. The IMSPhoBase-MS and the analysis tool can be used both in combination and separately.
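The analysis tool's free definition and logical combination of features can be sketched as predicates over annotated segments. The segment records, feature names and combinators below are invented for illustration; the actual tool is a Perl program working on database query results.

```python
# Toy query results: segments carrying several annotation levels at once.
segments = [
    {"word": "Hallo", "pos": "ITJ", "phonemes": ["h", "a", "l", "o:"]},
    {"word": "Abend", "pos": "NN",  "phonemes": ["a", "b", "@", "n", "t"]},
]

def feature(name, value):
    # A user-defined feature is just a predicate on one annotation level.
    return lambda seg: seg[name] == value

def AND(*preds):
    return lambda seg: all(p(seg) for p in preds)

def OR(*preds):
    return lambda seg: any(p(seg) for p in preds)

def search(segs, pred):
    # Return the words of all segments matching the combined feature expression.
    return [s["word"] for s in segs if pred(s)]
```

Combining predicates with `AND`/`OR` corresponds to the tool's logical combination of search features across tiers.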

The whole system is shown in Figure 3. The arcs between the modules indicate the data flow in the system. The database management system serves as an interface between the user and the database.

Fig. 3: The system overview of IMSPhoBase

The architecture and design of this system support several different kinds of use: phonetic and phonological claims, for example, can be verified by searching and evaluating the large corpora that are covered by the system. On the other hand, one can prepare data for a speech recognizer training tool by searching the database and adapting the results with the analysis tool.

In this work a new approach of combining database techniques with an automatic segmentation system, as well as other conversion tools for speech signals, has been described. Although the system is at a prototype stage, it has proven to be very useful. New annotation levels can be generated and imported by the database system without user interaction. Because of this, users have to take care neither of the use of different programs nor of the import syntax of the database. The modular architecture of the whole system allows a rapid integration of new programs or newly structured annotation data. The feature-based analysis of the result of a database query is realized in a separate module which is implemented in Perl and can very easily be put into a web-based application, too.

Literature & more
[1] for example: Young, S and Bloothooft G. Corpus-based methods in language and speech pro­cessing. Text, Speech and Language Technology, Vol. 2. Kluwer Academic Publishers, 1997
[2] Bird S., Liberman M. A formal framework for linguistic annotation. In: Speech Annotation and Corpus Tools, Special Issue of Speech Communication, Vol. 33, in print. And:  Bird S., Harrington J. Speech Annotation and Corpus Tools. In: Speech Annotation and Corpus Tools, Special Issue of Speech Communication, Vol. 33, in print.
[3] []
[4] []
[5] [/fak5/ims/~rapp/]
[6] []
[7] []
[8] Experimental Phonetics Group of the Institut für Maschinelle Sprachverarbeitung (IMS)
[9] IMSPhoBase stands for speech database of the IMS
[10] mSQL is a database engine from Hughes Technologies, []