In this thesis the problem of automatic segmentation and handling of large heterogeneous speech corpora is addressed. While standard database systems handle large amounts of data very efficiently, they lack methods for using automatic segmentation and conversion tools as part of their core system. For this reason, a new approach based on the combination of standard database techniques and automatic speech recognition techniques is presented.

In recent years corpus-based techniques have become widely used 1 . In disciplines like automatic speech recognition or speech synthesis, for instance, large speech corpora have become a key technology. However, as such corpora increase in number and comprehensiveness, the number of different tools for creating, maintaining and searching these corpora increases, too. The shortcoming of these speech corpora and the corresponding tools is that most of them are realized in different notations and/or formats. To overcome these deficits, approaches to unify speech corpora by developing a very basic, but mathematically well motivated, annotation model have proven very useful (see for example Bird&Liberman, Bird&Harrington 2 ).

This work contributes to such approaches by focusing on the maintenance and retrieval of spontaneous speech signals, investigating linguistic annotations, automatic segmentation, and the use of database systems.

An overview of the connection between such speech signals and the corresponding annotations is presented in the first chapter, while the second chapter describes the most important annotation models and systems currently discussed in the literature. In the subsequent chapter the most relevant annotation and segmentation systems (manual as well as automatic ones) are introduced and discussed, whereas the relevant database design techniques are worked out in chapter 4. Chapter 5 reports on experiments that were conducted over the last two years by means of an automatic segmentation tool (a so-called "aligner"). The aim of these experiments is to enrich the annotation of large spontaneous speech corpora with time marks automatically produced by the aligner. The underlying database system (COSMAS II 3 ) is also described.

On the basis of these results and experiences a new speech database approach is presented. The approach is characterized by a highly modular architecture, which consists of a freely available database engine, a database management system developed from scratch, and an analysis tool. The database management system provides new features that are not available in standard systems, such as the generation of annotations or the transformation of speech signal formats. The system as well as the underlying technology are addressed in detail in chapter 6. Conclusions and future work are presented in chapter 7, and, finally, speech corpora and system documentation are made available in the appendix.

Linguistic Annotation

In the simplest case, "linguistic annotation" is an orthographic representation of speech, sometimes time-aligned to an audio recording. But there are many other annotations, for instance morphological analysis, part-of-speech tagging and syntactic bracketing, phonetic segmentation and labeling, annotation of disfluencies, prosodic phrasing, intonation, gesture, and discourse structure. Others are the marking of co-reference, named entity tagging, sense tagging, as well as phrase-level or word-level translation. According to (Bird&Liberman), such linguistic annotations may apply to both text and recorded signals.
The focus of this work lies on the presentation and discussion of annotation models for recorded speech signals. In order to facilitate machine-based treatment, all these possible annotations have to be integrated into one annotation model. A survey of annotation models and corresponding systems found in the recent literature is presented in chapter 2. This survey shows that in most cases these annotation models are organized in terms of tiers, e.g., a phoneme tier or a syllable tier. Figure 1 shows a typical tier-structured model (in this case the annotation model of the EMU 4 system).

Fig. 1: A tier-based annotation model

The major difference between the models lies in the organization of the tiers. In the EMU model the hierarchy is built by indexes which establish relations between the segments of different tiers. Relations within one tier are also possible. Because of these common basic notation techniques it is possible to convert all annotation models into one approach such as that of Bird&Liberman. In principle, their approach, the 'annotation graph' model, is defined as a mathematical calculus in which all queries are well formed, no matter whether an annotation is complete or not.
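The annotation graph idea can be illustrated with a minimal sketch (the class, the tier names and the toy data are hypothetical, not taken from Bird&Liberman's formalization): nodes carry optional time stamps, each labeled arc belongs to a tier, and a tier is simply the set of arcs sharing one type label, so queries stay well formed even when some nodes are not yet time-anchored.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)  # node id -> time stamp (None if unanchored)
    arcs: list = field(default_factory=list)   # (start node, end node, tier, label)

    def add_node(self, nid, time=None):
        self.nodes[nid] = time

    def add_arc(self, start, end, tier, label):
        self.arcs.append((start, end, tier, label))

    def tier(self, name):
        # A tier is just the sub-list of arcs carrying one type label.
        return [a for a in self.arcs if a[2] == name]

# Toy example: one word spanning two phonemes.
g = AnnotationGraph()
g.add_node(0, 0.00)
g.add_node(1, 0.32)
g.add_node(2, 0.61)
g.add_arc(0, 2, "word", "see")
g.add_arc(0, 1, "phoneme", "s")
g.add_arc(1, 2, "phoneme", "i:")
```

The word arc and the phoneme arcs share the same nodes, so the tier hierarchy falls out of the graph structure rather than being stored as explicit indexes.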

The Annotation and Segmentation Systems

There are many different methods for generating linguistic annotations. Annotations can be created manually, supported by programs such as annotation editors, or automatically by annotation systems. The automatic systems are divided into pure speech recognition systems and systems that are merely based on speech recognition techniques. Because standard speech recognition systems still suffer from a high word error rate, the latter are the more popular ones. Systems based on speech recognition techniques are known as automatic segmentation systems. Although these systems usually generate many annotation tiers in the above sense, they are referred to as segmentation systems because they need an orthographic transcription of the corresponding speech signal, which has to be provided by hand. The great advantage of such systems, however, is their ability to also generate time marks for the annotations.

A typical representative of an automatic segmentation system is the Stuttgart Alignment System 5 which is used in the work described here (and will be referred to as the "aligner"). The aligner identifies the linguistically relevant boundaries within the speech signal by means of a time alignment between the orthographic annotation and the speech signal. The aligner works in accordance with the principles of automatic speech recognition (ASR), and therefore the same ASR algorithms are used to design this system. This means that the aligner is based on Hidden Markov Models (HMMs, in the HTK version of formerly Entropic, now Cambridge University Engineering Department 6 ). The architecture of the aligner is shown in Figure 2.

Fig. 2: System overview of the aligner

Roughly speaking, the segmentation process consists of three stages. In the first stage the orthographic representation is transformed into a phonetic representation (a grapheme-to-phoneme conversion in a two-step approach of lexicon look-up and rule-based conversion). In the second stage the phonetic transcription is transformed into a system-internal representation (a so-called grammar network). The time alignment between the transformed text and the speech signal, which is itself transformed by means of a specific signal processing algorithm (mel frequency cepstral coefficients), follows in the final stage. How this alignment process was applied to concrete speech corpora is described in the following subsection.
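The first two stages can be sketched as follows. This is a toy illustration only: the lexicon, the fallback letter rules and the function names are invented, and the real aligner's rule set is far richer than a letter-by-letter mapping.

```python
# Toy lexicon for the look-up step (hypothetical entries, SAMPA-like symbols).
LEXICON = {"hallo": ["h", "a", "l", "o:"]}

# Naive rule-based fallback: map each letter to a phoneme symbol.
LETTER_RULES = {"a": "a", "b": "b", "d": "d", "n": "n", "e": "@"}

def grapheme_to_phoneme(word):
    word = word.lower()
    if word in LEXICON:                                 # step 1: lexicon look-up
        return LEXICON[word]
    return [LETTER_RULES.get(c, c) for c in word]       # step 2: rule-based conversion

def to_network(words):
    # Stage two: concatenate the phoneme sequences of all words into a
    # linear "grammar network" that the time alignment stage can walk through.
    network = []
    for w in words:
        network.extend(grapheme_to_phoneme(w))
    return network
```

In the final stage the real system would match this network against the MFCC-transformed signal with HMMs, which is omitted here.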

Experiments with an automatic segmentation system on large spontaneous speech corpora in the context of a speech database system

At the Institut für Deutsche Sprache (IDS), Mannheim, there are two tools for pre-processing and searching discourse transcripts: the Discourse Structure Manager and the retrieval system COSMAS II. The Discourse Structure Manager imports transcripts from different encoding systems, like a partitur editor 7 for the annotation of the discourse speech data, builds up an internal discourse structure, enriches it by collaborating with tools such as an aligner, calculates the Spoken Language Metric, and generates a transcript in a special SGML notation. The Spoken Language Metric describes discourse properties, which are stored in the retrieval system COSMAS II. The enrichment with the results of the aligner extends the transcript retrieval system to a much more powerful speech and transcript retrieval system (and therefore to a Speech Database System).

However, in order to obtain the needed time references for the transcripts, much work remained to be done, because many typical discourse phenomena caused errors in the segmentation process.

The most typical problem sources are the following:

  • discourse phenomena such as turntaking signals, hesitation (interjections)

  • different types of non-speech sequences such as noise, laughter, applause or music

  • technical problems such as bad audio quality, buzzing sound

  • simultaneous passages of two or more speakers

In order to obtain a robust alignment system for these specific discourse-based problems, several different experimental scenarios were defined. First, problems like turntaking signals, hesitations (interjections) and different types of non-speech sequences such as noise, laughter, and so on, were categorized as so-called model-based problems of the aligner. Model-based problems in this context means that there are no specific HMMs for phenomena such as interjections. All the interjections that occur in the corpora are treated as normal text segments. This means that they undergo grapheme-to-phoneme conversion and are modeled by the concatenation of the corresponding phoneme HMMs. It is not possible to recognize these segments if they are lengthened or articulated with multiple peaks. For this reason we designed new HMMs which were not concatenated from phonemes but trained as one-word models (so-called whole-word models). By this, we achieved an increase in accuracy of up to 13% after optimising the models in several experiments. Second, problems like simultaneous passages, but also technical problems of sound signals, were categorized as signal-based problems of the aligner. Most of these problems have been solved by now; for the problems concerning simultaneous passages further work is still required; however, some conceptual solutions and a first prototype are already available. Further experiments will show which ideas and methods are the most adequate ones.
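The model selection described above can be sketched in a few lines. Everything here is illustrative: the interjection list, the model-name prefixes and the helper names are invented; the actual whole-word HMMs are trained acoustic models, not strings.

```python
# Hypothetical set of interjection tokens that receive dedicated whole-word HMMs.
INTERJECTIONS = {"hm", "aeh", "aehm"}

def models_for_token(token, grapheme_to_phoneme):
    """Pick the HMM sequence for one token of the transcript.

    Interjections get a single trained whole-word model; every other token
    is modeled by concatenating the HMMs of its phonemes.
    """
    if token in INTERJECTIONS:
        return ["WHOLEWORD_" + token]
    return ["PH_" + p for p in grapheme_to_phoneme(token)]
```

The point of the switch is that a whole-word model can absorb lengthening and multiple articulation peaks that a fixed phoneme concatenation cannot represent.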

The active database system IMSPhoBase

The speech corpora of the IMS 8 comprise many different types of annotations, such as phoneme level, word level, syllable level, part-of-speech level or syntax level, as well as fundamental frequency, prosodic information, and speech signals in different sound formats. Some segmentations are generated completely manually, some are generated semi-automatically or fully automatically. Many speech signals have not yet been segmented at all. The IMSPhoBase 9 system is a speech database system which was designed to handle all these extremely heterogeneous corpora. The system was implemented with extensions to several transformation programs in order to be able to generate new annotation levels. For this reason, the above-mentioned aligner as well as other signal-based conversion programs are embedded into the system. There are no limits to incorporating conversion programs. Because of its ability to convert data into other formats or to generate new data, the database is called an active system.

The advantage of such an approach is that a user does not have to carry out several transformations before working on the data. All information is available to the user, whether it is already in the database or still has to be generated in the background. For example, should there be more speech signals than phoneme label files in the database, the missing phoneme labels will be generated automatically during the query process. Alternatively, missing labels can be generated by batch processing.
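The "active" query behaviour amounts to generate-on-miss with caching. The following sketch is hypothetical (class and callback names are invented, and the real system invokes the aligner as an external program rather than a Python callback), but it shows the control flow:

```python
class ActiveSpeechDB:
    """Toy model of the active database: labels missing at query
    time are produced by a generation callback and then cached."""

    def __init__(self, aligner):
        self.labels = {}        # signal id -> phoneme labels already in the database
        self.aligner = aligner  # callback standing in for the embedded aligner

    def query_labels(self, signal_id):
        if signal_id not in self.labels:
            # Label file missing: generate it in the background and store it,
            # so the next query finds it in the database.
            self.labels[signal_id] = self.aligner(signal_id)
        return self.labels[signal_id]

# Dummy aligner stub that records how often it was invoked.
calls = []
def dummy_aligner(signal_id):
    calls.append(signal_id)
    return ["h", "a", "l", "o:"]

db = ActiveSpeechDB(dummy_aligner)
```

Batch processing would simply loop this generation step over a list of signal ids ahead of time instead of waiting for a query.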

The main features of the database system are:

  • a simple annotation model. A simple but efficient tier-based annotation model is used to integrate the annotation levels into the database.

  • the database engine mSQL 10 . Because mSQL is public domain software, the whole system can be freely distributed. mSQL provides several interfaces which are used in this system, for example the C interface that sets up the connection to the database management system.

  • the database management system (DBMS) is an in-house and therefore free implementation. For the DBMS no standard tool could be used because of the specific tasks of the approach, so a DBMS was developed from scratch and implemented in C (the IMSPhoBase-MS). The IMSPhoBase-MS provides two data interfaces which are both command line-based. One interface is organized like a tree-based menu for manipulating and maintaining the data, intended for non-expert users, and the other interface is used for direct access by external programs, such as statistics programs.

  • two different user modes for the IMSPhoBase-MS. All the features of the interactive menu mode are also available in a batch mode, in which all tasks are driven by lists without user interaction.

  • an active annotation generation component. Several programs that generate new data, such as annotations and different speech signal formats, are conceptually integrated in the system.

  • the external analysis program. The analysis program is implemented in Perl and its design is characterized by the capability of freely defining all features the user is searching for, and of combining these features logically. An adjustable context range for the hits and a high flexibility concerning the output format are also provided. The definable features cover a wide range from phonetic features to phonemes, words, syllables and part-of-speech tags.

  • an open system architecture. The system design is modular and therefore new modules, such as external statistics programs, can be incorporated without any redesign of the system. The IMSPhoBase-MS and the analysis tool can be used both in combination and separately.
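The analysis tool's free definition and logical combination of features can be sketched as predicates over annotated segments. The segment records, feature names and combinators below are invented for illustration; the actual tool is a Perl program working on database query results.

```python
# Toy query results: segments carrying several annotation levels at once.
segments = [
    {"word": "Hallo", "pos": "ITJ", "phonemes": ["h", "a", "l", "o:"]},
    {"word": "Abend", "pos": "NN",  "phonemes": ["a", "b", "@", "n", "t"]},
]

def feature(name, value):
    # A user-defined feature is just a predicate on one annotation level.
    return lambda seg: seg[name] == value

def AND(*preds):
    return lambda seg: all(p(seg) for p in preds)

def OR(*preds):
    return lambda seg: any(p(seg) for p in preds)

def search(segs, pred):
    # Return the words of all segments matching the combined feature expression.
    return [s["word"] for s in segs if pred(s)]
```

Combining predicates with `AND`/`OR` corresponds to the tool's logical combination of search features across tiers.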

The whole system is shown in Figure 3. The arcs between the modules indicate the data flow in the system. The database management system serves as an interface between the user and the database.

Fig. 3: The system overview of IMSPhoBase

The architecture and design of this system support several different kinds of use: phonetic and phonological claims, for example, can be verified by searching and evaluating the large corpora that are covered by the system. On the other hand, one can prepare data for a speech recognizer training tool by searching the database and adapting the results with the analysis tool.

In this work a new approach of combining database techniques with an automatic segmentation system, as well as other conversion tools for speech signals, has been described. Although the system is at a prototype stage, it has proven to be very useful. New annotation levels can be generated and imported by the database system without user interaction. Because of this, users have to take care neither of the use of different programs nor of the import syntax of the database. The modular architecture of the whole system allows a rapid integration of new programs or newly structured annotation data. The feature-based analysis of the result of a database query is realized in a separate module which is implemented in Perl and can very easily be put into a web-based application, too.

Literature & more
[1] for example: Young, S and Bloothooft G. Corpus-based methods in language and speech pro­cessing. Text, Speech and Language Technology, Vol. 2. Kluwer Academic Publishers, 1997
[2] Bird S., Liberman M. A formal framework for linguistic annotation. In: Speech Annotation and Corpus Tools, Special Issue of Speech Communication, Vol. 33, in print. And:  Bird S., Harrington J. Speech Annotation and Corpus Tools. In: Speech Annotation and Corpus Tools, Special Issue of Speech Communication, Vol. 33, in print.
[3] []
[4] []
[5] [/fak5/ims/~rapp/]
[6] []
[7] []
[8] Experimental Phonetics Group of the Institut für Maschinelle Sprachverarbeitung (IMS)
[9] IMSPhoBase stands for speech database of the IMS
[10] mSQL is a database engine from Hughes Technologies, []