Prosody

Silvia Quazza & Juan Maria Garrido



1. Coding Purpose

In this chapter we describe a framework for the annotation of speech corpora at the level of prosodic analysis. The scope of the Prosody Level includes phonetic transcription, intonation annotation and prosodic phrasing. The phenomena addressed are aspects of speech that are not explicitly represented in its orthographic transcription, which may be considered the starting point for the other linguistic annotation levels considered in MATE. The Prosody Level thus integrates the linguistic description of dialogues with information closer to their actual acoustic realization. The common reference to the speech signal makes it possible to align prosodic annotations with the orthographic transcription and with higher linguistic levels, enabling cross-level analyses.

1.1 Scope

Prosodic phenomena are specific to spoken language. They concern the way in which speech sounds are acoustically realized: how long, how high and how loud they are. Such acoustic modulations are used by human speakers to express a variety of linguistic or paralinguistic features, from stress and syntactic boundaries to focus and emphasis or pragmatic and emotional attitudes. Linguistics and speech technology have approached prosody from a variety of points of view, so that a precise definition of the scope of prosodic research is not easy. A main distinction can be drawn between acoustic-phonetic analyses of prosody and more abstract, linguistic, phonological approaches.

Linguistically relevant prosodic events contribute to expressing sentence structure: they highlight linguistic units by marking their boundaries and suggesting their function. Linguistic-phonological descriptions of prosody usually identify a set of prosodic units (phonological units with a scope wider than a segment) and a set of prosodic phenomena which are ‘superimposed’ on these units. Prosodic units are the natural scope of prosodic events. Several types of prosodic units (differing mainly in their scope) have been proposed: paragraphs, sentences, intonation groups, intermediate groups, stress groups, feet, syllables, morae, and so on. Although prosody is by definition suprasegmental, prosodic analyses often take the phoneme as their minimal unit, on which rhythmical variations are measured and intonation events located. The family of prosodic phenomena includes the suprasegmental features of intonation, stress, rhythm and speech rate, whose variations express the function of the different prosodic units: the prominent syllable in a word is marked by stress, a falling intonation contour marks the conclusion of a sentence, a faster speech rate and lower intonation characterize a parenthetical phrase, and so on.

Such prosodic features are physically realized in the speech chain as variations of a set of acoustic parameters. Acoustic-phonetic analyses identify the following ‘phonetic correlates of prosody’: fundamental frequency (f0), changes in segmental duration, pauses, loudness and voice quality.

Depending on the research purpose and point of view, prosodic phenomena can be marked in a speech corpus by simple diacritics in its orthographic transcription, or by labels classifying intonation contours and unit boundaries according to some phonological theory, or by detailed measures of the acoustic-phonetic parameters.

We refer to D1.1 for a more detailed discussion concerning prosodic phenomena and their possible codings. What we state here are our minimal assumptions on the scope of prosodic coding:

  • coding should take into account at least segmental duration, pauses and intonation
  • it should consider the structuring role of prosody and provide means to delimit prosodic units by marking phrase boundaries
  • finally, it should allow both detailed phenomenological descriptions and more abstract functional ones, providing distinct levels for phonetic and phonological annotation.

    2. Existing Schemes

    Coding prosody is a complex task, which has to deal with the intrinsic complexity of prosodic phenomena and with the variety of purposes, theories and points of view from which prosody can be approached. This complexity is reflected in the wide variety of existing schemes that can be found in the literature. Examples of coding schemes more or less explicitly inspired by the different intonation theories and approaches are reviewed in D1.1. The review, by no means exhaustive, gives brief descriptions of the following schemes:

    PROSPA
    IPA
    TEI
    ToBI
    SAMPA
    SAMPROSA
    INTSINT
    SAMSINT
    IPO
    TSM
    TILT
    VERBMOBIL
    KIM
    PROZODIAG (Lund)
    Göteborg

    These schemes have revealed differences both in the phenomena covered and in the underlying theoretical assumptions. They reflect the different purposes of prosodic analysis, which range from the phenomenological description of prosody in itself, to the study of its relations with discourse structure, to its applications in speech technology - synthesis, recognition and dialogue systems. As stated in D1.1:
    "Each experimental study has adopted some kind of prosodic representation suited to its purposes, from abstract labels to acoustic measures".
    The schemes range from simple diacritic symbols integrating the orthographic transcription of corpora intended for linguistic analyses (e.g. TEI, Göteborg...), to theory-dependent phonological labels for intonation contours and phrasing (e.g. ToBI), to phonetic-acoustic representation of the f0 curve (e.g. INTSINT, IPO, TILT, ...).

    The conclusion in D1.1 is that defining a unique ‘standard’ coding scheme by choosing a single prosody annotation scheme seems a difficult task at this moment. Although, in the era of large speech corpora, there is a definite need for a common notation allowing for easy data exchange and comparison, a single scheme would certainly dissatisfy some of the many points of view in the field, would be unsuitable to some of the intended purposes, would be too detailed or too poor, too theoretically committed or lost in insignificant details.

    3. The MATE 'meta-scheme' for prosody annotation

    3.1 The 'meta-scheme'

    Due to the variety of points of view in prosodic studies and the difficulty in selecting the most representative coding schemes, the MATE proposal for the Prosody Level offers a 'meta-scheme', a framework where different existing notation conventions can be integrated and possibly new ones can be developed. The framework is detailed enough to suit the richer phonetic and phonological schemes and flexible enough to admit partial filling of its structure and to allow for different schemes to cooperate.

    Its definition reflects the multi-level nature of prosodic research - the fact that prosody can be studied both with a phonetic and a phonological approach - and the useful distinction between prosodic units and prosodic phenomena.

    The MATE ‘meta-scheme’ for prosody is a four-layer annotation structure, in which the different elements discussed in D1.1 can be accommodated. The sublevels are the following:

    1) phonetic transcription

    conceived for the representation of phonetic segments (the ‘phones’), but also of other phenomena related to the segmental aspects of prosody, such as pauses, and other sub-word units such as syllables

    2) phonetic representation of intonation

    intended for the phonetic annotation of intonation phenomena, where the shape of fundamental frequency curves (and possibly of other acoustic correlates of intonation, such as energy, which at present are not included in the meta-scheme) is described in detail, by means of stylization and/or explicit labelling

    3) phonological representation of intonation

    reserved for those schemes which annotate intonation from a phonological point of view, in terms of functional or underlying representations, and mark the role of relevant intonation events with respect to prosodic units

    4) prosodic phrasing

    intended for the segmentation of utterances in terms of high-level prosodic units (tone units, intonation groups, etc.)

    The four levels do not represent a fixed hierarchy. The two phonetic levels, intended for phoneme segmentation and f0 description, are directly aligned with the speech signal and in this sense may be considered as base levels. The two phonological levels, describing the linguistically relevant intonation events and the prosodic structure of the utterance, keep a natural relationship both with the base prosodic levels and with other linguistic units. Different links can therefore be established between levels. An intonation event such as a "pitch accent" or a "boundary contour" can be associated with the word or phrase (orthographic level) on which it occurs, with the syllable or vowel on which it reaches its f0 target (phonetic transcription level), or with the corresponding configuration of pitch movements (phonetic description of f0). The following picture sketches the possible links between levels:

    In the actual use of the scheme, the levels and their links can be fully or partially specified. In a linguistic, text-oriented analysis, prosody could be considered in its function, leaving out the details of its realization. In this case, only the phonological levels need be filled and linked to the orthographic level of words. Complex schemes like ToBI ([Silverman et al., 1992], D1.1A) could be used in this way, as could simpler schemes providing labels to distinguish types of accents, associated with words, and types of intonation boundaries.

    In a speech technology context, a more signal-oriented approach could be adopted. In order to recognize or synthesize prosodic patterns, detailed phonetic descriptions are necessary, requiring both phonetic segmentation and phonetic representation of intonation - in terms of pitch movements or target f0 levels. The annotator would in this case look at the signal to segment it, possibly stylize its f0 profile, and accurately label the stylized curve. For a complete analysis, the annotator would link the detected units and events - phonemes and f0 variations - to the phonological descriptions of intonation contours and phrase structure.
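
    The two usages described above, partial and full specification of the levels and their links, can be sketched as a small data structure. This is only an illustration under stated assumptions: the class `Unit`, its fields, the level names and the helper `linked_levels` are hypothetical and are not part of the MATE meta-scheme; the labels and time values are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Unit:
    """One annotated unit on some level (all names here are illustrative)."""
    uid: str
    level: str                      # e.g. "orthographic", "phonetic", "f0", "phonological"
    label: str
    start: Optional[float] = None   # seconds; only signal-aligned levels carry times
    end: Optional[float] = None
    links: List[str] = field(default_factory=list)  # ids of units on other levels


def linked_levels(unit: Unit, index: Dict[str, Unit]) -> List[str]:
    """Return the sorted names of the levels that a unit points to."""
    return sorted({index[target].level for target in unit.links})


# Text-oriented use: only the phonological level is filled, linked to a word.
word = Unit("w1", "orthographic", "Monday")
accent_text = Unit("a1", "phonological", "H*", links=["w1"])

# Signal-oriented use: the same kind of event also points to a syllable and
# to a stretch of the stylized f0 curve, both aligned with the signal.
syll = Unit("s1", "phonetic", "mVn", start=0.42, end=0.61)
rise = Unit("p1", "f0", "rise", start=0.45, end=0.58)
accent_full = Unit("a2", "phonological", "H*", links=["w1", "s1", "p1"])

index = {u.uid: u for u in (word, syll, rise, accent_text, accent_full)}
print(linked_levels(accent_text, index))  # ['orthographic']
print(linked_levels(accent_full, index))  # ['f0', 'orthographic', 'phonetic']
```

    Keeping the links as plain identifiers, rather than nesting the units, mirrors the point that the levels do not form a fixed hierarchy: any level may be left empty without invalidating the others.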

    3.2 Instances of the 'meta-scheme'

    The first goal of the MATE ‘meta-scheme’ is then to provide an empty framework in which existing (or future) prosody annotation schemes can be represented in a common (and accordingly compatible and easy-to-compare) format. But it has also been conceived to allow the annotation of corpora using some of the most widely used existing annotation schemes. For each layer, (at least) one existing coding scheme has been adapted to XML, in order to be integrated within the MATE workbench and to provide both a ready-to-use instance of the meta-scheme and an example and guideline for the future adaptation of other schemes.

    The chosen schemes for each level are the following:

    1) phonetic transcription: SAMPA ([Wells et al. 1992]; D1.1A )

    2) phonetic representation of intonation: INTSINT ([Hirst, 1991, 1994; Hirst & Di Cristo, 1998]; D1.1A), IPO (['t Hart et al., 1990]; D1.1A)

    3) phonological representation of intonation: ToBI (‘Tones’ layer) ([Silverman et al. 1992]; D1.1A)

    4) prosodic phrasing: ToBI (‘Break-Indices’ layer)

    Widespread schemes have been preferred as examples. In the case of phonetic description of intonation, two schemes have been selected in order to represent both the 'pitch movement' approach and the 'target level' one. It should be noted that for some schemes a reference definition is available, although it is not strictly respected in actual applications (ToBI has a number of 'variants' and is subject to language-adaptation). For IPO, the reference is the classical text in which the methodology of perceptual analysis of intonation was proposed, which was not explicitly intended to define a notation system. In any case, some simplifications of or additions to the original schemes have been made, in order to obtain a coherent adaptation.

    As suggested above, each scheme can be used alone or can be integrated with the others. One could for example keep to the IPO methodology and use SAMPA for phoneme segmentation and IPO for f0 description (and possibly a newly defined IPO-like "pitch configuration" scheme for the phonological level...). Or one could integrate the four layers using SAMPA, INTSINT and ToBI. To allow such a modular approach, separate DTDs have been defined for each layer:scheme pair. These DTDs are included in the Annex.

    The elements and attributes identified in the selected schemes are described in detail in the following. It should be noted that level 2), both in its IPO and INTSINT instances, has an inner structure corresponding to a typical three-step procedure in the phonetic annotation of intonation: obtain the raw f0 curve (element <f0>), stylize it (elements <closecopy> and <momel>) and label it (elements <pitmove> and <intone>). At the phonetic segmentation level, a useful extension is the <syllable> element, to which the element <phone> can be subordinated and to which the phonological intonation labels could profitably be linked. For each of the other levels a single main element is defined: <tobitone> for level 3) (<target>, <f0range>, and <repair> are accessory information), <breakindex> for level 4).
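
    As an illustration of how <phone> can be subordinated to <syllable> and how phonological labels can be linked to a syllable, the following sketch builds a small fragment with Python's standard xml.etree.ElementTree module. Only the element names are taken from the text. The wrapper element <prosody>, every attribute name and value (id, start, end, label, href) and the helper `resolve` are hypothetical placeholders and should not be read as the MATE DTDs, which are given in the Annex.

```python
import xml.etree.ElementTree as ET

# Element names come from the text; the wrapper element and every attribute
# name and value below are hypothetical placeholders, not the MATE DTDs.
root = ET.Element("prosody")

# A syllable with its phones subordinated to it (SAMPA-like symbols).
syll = ET.SubElement(root, "syllable", id="s1", start="0.42", end="0.61")
segments = [("m", "0.42", "0.48"), ("V", "0.48", "0.56"), ("n", "0.56", "0.61")]
for number, (symbol, t0, t1) in enumerate(segments, start=1):
    phone = ET.SubElement(syll, "phone", id=f"ph{number}", start=t0, end=t1)
    phone.text = symbol

# Phonological labels pointing at the syllable through a hypothetical href.
ET.SubElement(root, "tobitone", id="t1", label="H*", href="#s1")
ET.SubElement(root, "breakindex", id="b1", label="4", href="#s1")


def resolve(tree_root, element):
    """Follow the hypothetical href attribute to the element it names."""
    target_id = element.get("href").lstrip("#")
    for candidate in tree_root.iter():
        if candidate.get("id") == target_id:
            return candidate
    return None


tone = root.find("tobitone")
target = resolve(root, tone)
print(target.tag, [p.text for p in target.findall("phone")])
print(ET.tostring(root, encoding="unicode"))
```

    Linking by identifier rather than by nesting lets the <tobitone> and <breakindex> annotations live in separate files or layers from the segmentation, which is in the spirit of the modular layer:scheme design described above.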

    The list of elements adapted to XML, which is accordingly available for use in the MATE workbench, is the following:

    1) Phonetic transcription
    <syllable>
    <phone>

    2) Phonetic representation of intonation
    <f0>
    <closecopy> (IPO)
    <pitmove> (IPO)
    <momel> (INTSINT)
    <intone> (INTSINT)

    3) Phonological representation of intonation
    <tobitone>
    <target>
    <f0range>
    <repair>

    4) Prosodic phrasing
    <breakindex>

    In the following, each layer:scheme pair will be described in a separate section. For layer 2, to avoid duplication, a single description is given of the element <f0> for the raw f0 curve, which is present in both the IPO and INTSINT schemes. Moreover, it should be noted that there is apparently no formal difference between the respective elements for the stylized curve, <closecopy> and <momel>: both consist of target points on the f0 curve. The substantial differences lie in the intended interpolation function between the target points, which is linear for <closecopy> and parabolic for <momel>, and in the intended stylization procedure (manual vs. automatic).
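
    The difference between the two interpolation functions can be made concrete with a short sketch. The text states only that interpolation is linear for <closecopy> and parabolic for <momel>; the particular parabolic form used here (two parabolas meeting midway between successive targets, with zero slope at each target, i.e. a quadratic spline) is an assumption modelled on the usual description of MOMEL. The function names and the target values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Target = Tuple[float, float]  # (time in seconds, f0 in Hz)


def _segment(targets: List[Target], t: float) -> Tuple[Target, Target]:
    """Return the two successive targets that bracket time t."""
    if len(targets) < 2:
        raise ValueError("at least two target points are needed")
    times = [time for time, _ in targets]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the stylized stretch")
    i = min(bisect_right(times, t), len(targets) - 1)
    return targets[i - 1], targets[i]


def linear(targets: List[Target], t: float) -> float:
    """Straight-line interpolation between targets (close-copy style)."""
    (t1, h1), (t2, h2) = _segment(targets, t)
    x = (t - t1) / (t2 - t1)
    return h1 + (h2 - h1) * x


def parabolic(targets: List[Target], t: float) -> float:
    """Assumed quadratic-spline interpolation: two parabolas joined at the
    midpoint between targets, each with zero slope at its own target."""
    (t1, h1), (t2, h2) = _segment(targets, t)
    x = (t - t1) / (t2 - t1)
    if x <= 0.5:
        return h1 + (h2 - h1) * 2 * x * x
    return h2 - (h2 - h1) * 2 * (1 - x) ** 2


# Same target points, two different reconstructed curves.
targets = [(0.0, 120.0), (0.25, 180.0), (0.5, 100.0)]
print(linear(targets, 0.0625), parabolic(targets, 0.0625))  # 135.0 127.5
print(linear(targets, 0.125), parabolic(targets, 0.125))    # 150.0 150.0
```

    Both functions pass through every target point and agree at the midpoint between two targets; they differ in between, where the parabolic curve leaves and approaches each target smoothly instead of forming an angle.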



    4. Layer 1: Phonetic Transcription - SAMPA scheme