The phonetic representation of intonation should provide a detailed description of the utterance intonation profile, which is one of the main acoustic correlates of prosodic structure. The object of the description is fundamental frequency (f0), an acoustic parameter calculated from the voiced portions of the speech signal by means of signal processing algorithms. Once the gaps of unvoiced phones have been interpolated, f0 is a continuous curve showing three kinds of variation: perceptually irrelevant variations, micro-prosodic variations due to phoneme quality, and macro-prosodic variations which may have a linguistic function. A phonetic representation of this curve ignores minor details but describes the shape of the curve by classifying its relevant features, which a functional phonological analysis can later interpret.
While there are relevant approaches (e.g. Fujisaki) describing intonation as a superposition of mathematically defined curves, here we consider the family of linear models representing f0 as a sequence of phonetic events. Two steps are necessary to obtain a phonetic representation of intonation:
- a stylization of the f0 curve, where irrelevant details (and possible errors of the pitch tracking algorithm) are removed and the curve is represented by a sequence of discrete elements: inflection points, interpolated with a linear or parabolic function
- a classification of the elements of the stylized curve

The elements of the stylized curve are the 'relevant variations' of f0. Depending on the point of view and the underlying intonation theory, such variations may be seen in their movement between two f0 values or in their target value. So, you may see the curve as a chain of rises and falls or as a sequence of high and low values. The two approaches are represented by the two schemes that we have chosen as examples for layer 2. Both schemes start from the raw f0 curve (automatically obtained from the signal), represented as a sequence of frame by frame f0 values (<f0>). Both schemes rely on a stylization of the f0 curve, represented as a sequence of inflection points on the curve (named <closecopy> for IPO and <momel> for INTSINT, just to keep track of the different interpolation laws suggested by the two schemes). But the INTSINT description of the curve will directly label the inflection points as target tones, while IPO will label pitch movements from one inflection point to the following one. The difference will be reflected in the different use of the href attribute for the elements <intone> and <pitmove>, pointing to a single stylized element in one case and to two consecutive elements in the other.

In the following, the two schemes will be described separately. The description of the base <f0> element will be given once, as the element and its use are common to the two schemes. It should be noted that when the stylized curve is imported as such (obtained outside the MATE workbench), in its <closecopy> or <momel> version, it could be the base reference for prosodic annotation and the <f0> element may be unnecessary.
5.1. Layer 2: Phonetic Representation of Intonation - f0 contours
5.1.1 Markup declaration
Fundamental frequency (pitch) is a parameter estimated from the acoustic signal, in its voiced (quasi-periodic) portions. It is defined as the inverse of period length and generally measured in Hz (number of periods per second). Period length could in principle be manually measured on the waveform, but it is usually estimated by pitch detection algorithms, whose output can be the series of points in time corresponding to period boundaries or, more often, a sequence of pairs [time interval : f0 value], where f0 is the average fundamental frequency measured on the time interval or frame (typically a few milliseconds).
Here we define an element to represent such raw f0 values, whose sequence provides the so-called f0 contour of the utterance. It should be noted that pitch estimation algorithms are not fully reliable, so that raw f0 values should be considered just a starting point of intonation analysis rather than its unquestionable objective reference.
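The relation between period length and f0 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a pitch detection algorithm has delivered period boundaries in seconds; the function name and the sample boundaries are hypothetical and not part of the MATE specification.

```python
def f0_from_period_boundaries(boundaries_s):
    """Turn period boundary times (in seconds) into (start, end, f0_hz) triples.

    Each pair of consecutive boundaries delimits one period; f0 is the
    inverse of the period length, expressed in Hz.
    """
    frames = []
    for start, end in zip(boundaries_s, boundaries_s[1:]):
        period = end - start
        if period <= 0:
            raise ValueError("period boundaries must be strictly increasing")
        frames.append((start, end, 1.0 / period))
    return frames


# Hypothetical boundaries: periods of 5 ms, i.e. about 200 Hz
frames = f0_from_period_boundaries([0.000, 0.005, 0.010, 0.015])
```

Each triple corresponds to one pair [time interval : f0 value] as described above. Real pitch trackers usually average over a frame rather than a single period.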
5.1.2 The <f0> element
5.1.2.1 Description
This element has been included to allow each f0 value of an f0 contour to be considered as an XML element (and accordingly handled and displayed). Each <f0> element is intended to represent a pair [time interval : f0 value] of an f0 contour. The most useful representation of the <f0> element is a graphical display of the sequence of its values as a function of time (the f0 curve or contour or profile).
5.1.2.2 Data Source
The f0 contour is computed directly from the speech signal file (although some pitch detection algorithms rely on phonetic segmentation to obtain better estimates of fundamental frequency).
5.1.2.3 Segmentation/selection
<f0> elements will be generated automatically, from f0 values calculated by an f0 estimation algorithm. If possible, such an algorithm will be available in the workbench. Otherwise, the f0 values will be imported from external files.
5.1.2.4 Assignment
The attributes considered here for the <f0> element are the following:
- value: the f0 value (in Hz)
- start: time start of the calculation frame
- end: time end of the calculation frame

5.1.3 Markup Table
| <f0> |
| attribute | value |
| id | [ASCII] |
| value | [FLOAT] |
| start | [FLOAT] |
| end | [FLOAT] |
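As an illustration of how imported f0 values might be turned into <f0> elements with the attributes of the table above, here is a minimal Python sketch. The wrapper element name `f0contour`, the id pattern `f0_NNN` and the sample frames are assumptions made for the example; the specification does not prescribe them.

```python
import xml.etree.ElementTree as ET


def f0_elements(frames):
    """Build one <f0> element per (start, end, value) frame."""
    root = ET.Element("f0contour")  # hypothetical wrapper, not defined by the scheme
    for i, (start, end, value) in enumerate(frames, start=1):
        ET.SubElement(root, "f0", {
            "id": f"f0_{i:03d}",      # assumed id pattern
            "value": f"{value:g}",    # f0 in Hz
            "start": f"{start:g}",    # start of the calculation frame
            "end": f"{end:g}",        # end of the calculation frame
        })
    return root


# Hypothetical frames, with times in the same unit as the examples in this section
root = f0_elements([(130, 140, 207.0), (140, 150, 209.5)])
xml_text = ET.tostring(root, encoding="unicode")
```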
5.2 Layer 2: Phonetic Representation of Intonation - IPO scheme
5.2.1 Markup Declaration
The IPO methodology for the analysis of intonation relies on two main assumptions: the first is that what is not perceived is irrelevant for a linguistic description of intonation, the second is that we perceive tone variations (rise/fall movements) rather than tone levels (high/low). The steps in the perceptual analysis of intonation are:
1. obtain a stylized close copy of the original f0 curve, by approximating the original values with a sequence of straight segments: the re-synthesized signal should be perceptually equivalent to the original one
2. classify the f0 segments as pitch movements, according to their shape and position in the phone chain (the proper reference is the syllable)
3. build up a grammar of admissible configurations of pitch movements and link intonation patterns to linguistic functions

Here we consider only the first two steps, which pertain to the phonetic representation of intonation. In order to represent them, we need three hierarchically ordered elements:
- <f0>, representing the points of the raw f0 curve
- <closecopy>, representing the inflection points in the stylized curve
- <pitmove>, representing the classified movements from one inflection point to the next one.

In principle, <f0> should be linked to the signal, <closecopy> to one <f0> element, and <pitmove> to two consecutive <closecopy> elements.
In actual annotation, it is not required that <closecopy> points coincide with <f0> points (a good stylization removes irrelevant excursions and possible pitch detection errors). Moreover, as suggested in the paragraph on coding procedures, if stylization is performed outside the MATE workbench, it could be directly imported, without reference to <f0>. In this case, the element <closecopy> will directly be aligned with the soundfile by means of its time attributes. Conversely, a very simplified stylization (without the feedback of resynthesis) could be performed by directly linking <pitmove> to a sequence of <f0> elements, which could be thought of as approximated by a straight line. A further option would be to link the <pitmove> to the corresponding <syllable>: in this case, some of the precise acoustic content of the <pitmove> will be lost.

5.2.2 The <closecopy> element
5.2.2.1 Description
This element has been included to allow each f0 inflection point of a ‘close copy’ stylised f0 contour (used in the IPO annotation system as the phonetic base representation of f0 contours) to be considered – and accordingly handled – as an XML element. The close copy is intended as a clean version of the f0 curve, where errors and irrelevant details have been removed, gaps corresponding to unvoiced phonemes have been filled and only the relevant movements are apparent. A more detailed description of the concept of ‘close copy’ can be found in 't Hart et al. (1990), among others. Such a stylized description of f0 as a function of time can be displayed as a sequence of straight segments connecting the relevant f0 values (inflection points), which may coincide with selected points (frames) in the raw f0 curve or simply approximate them.
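The display of a close copy as straight segments amounts to linear interpolation between consecutive inflection points. The following Python sketch shows one possible way to compute the stylized f0 value at an arbitrary time; the function name is hypothetical, and the sample points are the first three inflection points of the example given later in this section.

```python
from bisect import bisect_right


def closecopy_value(points, t):
    """Linearly interpolate the stylized f0 at time t.

    points: list of (time, f0_hz) inflection points, sorted by time,
    with at least two points and no repeated times.
    """
    times = [p[0] for p in points]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the stylized contour")
    # Index of the inflection point that closes the segment containing t
    i = min(bisect_right(times, t), len(points) - 1)
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


points = [(130, 207), (540, 243), (690, 285)]
midway = closecopy_value(points, 335)  # halfway along the first segment
```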
5.2.2.2 Data Source
The starting point for the creation of a close copy should be the raw f0 contour, together with the whole speech signal to allow for resynthesis. Phonetic segmentation would be useful as accessory information. In the MATE workbench, the close copy will most probably be imported from external files.
5.2.2.3 Segmentation/selection
In the IPO methodology, the ‘close copy’ stylisations are defined by a resynthesis method which allows the perceptual definition of the relevant inflection points. The raw f0 curve is displayed, if possible aligned with phonetic segmentation. On this basis, the annotator draws a simplified curve which approximates the original one. He then listens to the speech resynthesized with the stylized artificial f0 values. He repeats these steps until he reaches the simplest stylization perceptually equivalent to the original contour.
The MATE workbench will not provide this complex environment. As a consequence, close copies will be imported from external files or will be obtained by a simplified (non-theory-conformant) procedure, where inflection points are directly selected on the raw f0 with no resynthesis feedback.
5.2.2.4 Assignment
The attributes considered here for the <closecopy> element are the following:
- value: the stylized f0 value (in Hz) at the inflection point
- href: optional, points to an <f0> element
- start: time start of the stylised point
- end: time end of the stylised point

If the close copy is imported from an external file, the (link with the) <f0> element may not be necessary. In this case, as each inflection point is indeed a point, the two time attributes will have the same value. Alternatively, in case the close copy is obtained by selection of <f0> elements, href will point to the selected <f0>, from which the time attributes (and possibly the value) might be inherited, and consequently the two time values will be different (the first one corresponding to the beginning of the f0 calculation frame and the second one corresponding to the end of the frame).
5.2.3 The <pitmove> element
5.2.3.1 Description
The element <pitmove> is intended for phonetic transcription of intonation contours according to IPO methodology. Within the IPO framework, a ‘pitch movement’ is a portion of ‘close-copy stylization’ between two inflection points. A complete phonetic description of the stylized f0 curve should capture its shape and its relation with the phone sequence. So it will classify its segments according to their size, direction (rise/fall) and position in the syllable. The principles for this classification are presented in 't Hart et al. (1990), which also provides a set of labels explicitly intended for Dutch. It should be noted that this work proposes a methodology rather than a notational standard. There have been several applications of the IPO approach to different languages (such as English [Willems et al., 1988], French [Beaugendre et al., 1992], Italian [Quazza, 1991] or German [Brindopke et al., 1997]) and different symbols have been used for the same concepts. Here, to keep to a concrete and classical example, we refer to the original proposal for Dutch.
In the IPO approach, pitch movements are intended to be superimposed on an ideal declination grid, which determines the height of the flat movements: two main declination lines (at least for Dutch) are identified as trends in the sequence of peaks and valleys. Pitch movements can follow the baseline or the topline, or depart from them. Every pitch movement departing from the declination lines can be characterized in terms of the following parameters:
a) direction (rise/fall)
b) timing (early in the syllable/late/very late)
c) rate of change (fast/slow)
d) size (full/half)

The combination of these features provides a set of possible pitch movements, which are labelled with a figure (if the movement is rising) or a letter (if the movement is falling):
| transcription symbol | 1 | 2 | 3 | 4 | 5 | A | B | C | D | E |
| Direction | | | | | | | | | | |
| rise | x | x | x | x | x | | | | | |
| fall | | | | | | x | x | x | x | x |
| Timing | | | | | | | | | | |
| early | x | | | | x | | x | | | x |
| late | | | x | | | x | | | | |
| very late | | x | | | | | | x | | |
| Rate of change | | | | | | | | | | |
| fast | x | x | x | | x | x | x | x | | x |
| slow | | | | x | | | | | x | |
| Size | | | | | | | | | | |
| full | x | x | x | x | | x | x | x | x | |
| half | | | | | x | | | | | x |
'Flat' pitch movements following the baseline or the topline are labelled with 0 or Ø respectively. A special diacritic '&' is used in the IPO transcription to join pitch movements occurring on the same syllable. For example a rise-fall with the peak in the middle of the syllable (pointed hat) could be labeled "1&A". In our formalization, where labels are assigned to <pitmove> elements, the diacritic '&' before a label will have the meaning "pitch movement realized on the same syllable as the preceding one". So, for example, a complex configuration rise-fall-rise occurring on a single syllable could be represented by three <pitmove>'s respectively labeled "1" "&A" "&2". Of course, two 'early' movements can't occur on the same syllable, so labels 1, 5, B, E can't be preceded by '&'.
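The constraint on the '&' diacritic can be checked mechanically. The Python sketch below is only an illustration: the function name is hypothetical, and the set of labels that may carry '&' is taken from the type values listed in the <pitmove> markup table of this section (&2, &3, &4, &A, &C, &D).

```python
BASE_LABELS = set("12345ABCDE") | {"0", "Ø"}
# Labels that may follow '&' according to the <pitmove> markup table;
# the 'early' movements 1, 5, B, E and the flat movements are excluded.
JOINABLE = {"2", "3", "4", "A", "C", "D"}


def check_pitmove_labels(labels):
    """Return a list of problems found in a sequence of <pitmove> type labels."""
    problems = []
    for i, label in enumerate(labels):
        joined = label.startswith("&")
        base = label[1:] if joined else label
        if base not in BASE_LABELS:
            problems.append(f"{i}: unknown label {label!r}")
        elif joined and i == 0:
            problems.append(f"{i}: {label!r} has no preceding movement to join")
        elif joined and base not in JOINABLE:
            problems.append(f"{i}: {base!r} cannot be preceded by '&'")
    return problems


ok = check_pitmove_labels(["1", "&A", "&2"])   # rise-fall-rise on one syllable
bad = check_pitmove_labels(["1", "&B"])        # two 'early' movements
```

Here `ok` is empty, while `bad` reports the second label, since B is an early movement.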
5.2.3.2 Data Source
The IPO notation scheme annotates pitch movements taking the ‘close copy’ stylisation as a starting point. Phonetic segmentation is also a necessary reference for selecting the proper labels, since it allows syllables to be identified.
5.2.3.3 Segmentation/selection
The IPO phonetic representation of intonation is a segmentation of the speech flow into consecutive pitch movements. Each <pitmove> covers a segment of the close copy stylized curve, the stretch between a <closecopy> inflection point and the following one. The annotator will label pitch movements by looking at the <closecopy>'s sequence, graphically displayed and aligned with the phonetic transcription of the utterance. He will define a <pitmove> by selecting two consecutive <closecopy> elements, from which the <pitmove> will inherit its time attributes start, end. Then, on the basis of the shape of the segment as displayed in the stylized curve and of its alignment with phones (syllable), he will assign the <pitmove> a proper label.
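The inheritance of time attributes from the two linked <closecopy> elements can be sketched in Python as follows. This is an illustrative sketch, not a workbench function: the function name and the dictionary representation are assumptions, and the href string simply imitates the form used in the example of this section.

```python
def pitmoves_from_closecopy(points, labels):
    """Derive <pitmove> attribute sets from consecutive <closecopy> points.

    points: list of dicts with keys id, value, start, end (sorted in time)
    labels: one IPO symbol per segment, chosen by the annotator
    """
    if len(labels) != len(points) - 1:
        raise ValueError("need exactly one label per pair of consecutive points")
    moves = []
    for n, (first, second, label) in enumerate(zip(points, points[1:], labels), start=1):
        moves.append({
            "id": f"pitm_{n:03d}",
            "type": label,
            "href": f"closecopy.xml#id({first['id']})..id({second['id']})",
            "start": first["start"],   # inherited from the first linked point
            "end": second["end"],      # inherited from the second linked point
        })
    return moves


points = [
    {"id": "clscpy_001", "value": 207, "start": 130, "end": 130},
    {"id": "clscpy_002", "value": 243, "start": 540, "end": 540},
    {"id": "clscpy_003", "value": 285, "start": 690, "end": 690},
]
moves = pitmoves_from_closecopy(points, ["4", "1"])
```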
5.2.3.4 Assignment
The attributes considered here for the <pitmove> element are the following:
- type: IPO symbol representing the movement
- href: two consecutive <closecopy> elements (the initial and final inflection points of the movement); alternatively, the <pitmove> might be linked with a <syllable>
- start: time start of the movement (inherited from the first linked <closecopy>)
- end: time end of the movement (inherited from the second linked <closecopy>)

5.2.4 Example
The following example shows the Italian sentence "quell'artificio contabile sara` scoperto facilmente" read by a female speaker. In the picture, the vertical bars correspond to phoneme boundaries (the phoneme symbols are not SAMPA), the blue line to the original f0 curve and the red line to the stylized one (close copy).
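[Figure: original f0 curve (blue) and close-copy stylization (red) of the example sentence, with phoneme boundaries as vertical bars]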
This example can be represented using the <closecopy>
and <pitmove> elements as below.
It is assumed that the closecopy is directly imported as a sequence of
inflection points (in this case f0.xml is not needed).
| closecopy.xml |
| <closecopy id="clscpy_001" value="207" start="130" end="130"/>
<closecopy id="clscpy_002" value="243" start="540" end="540"/>
<closecopy id="clscpy_003" value="285" start="690" end="690"/>
<closecopy id="clscpy_004" value="212" start="860" end="860"/>
<closecopy id="clscpy_005" value="189" start="1110" end="1110"/>
<closecopy id="clscpy_006" value="159" start="1290" end="1290"/>
<closecopy id="clscpy_007" value="209" start="1500" end="1500"/>
<closecopy id="clscpy_008" value="206" start="1750" end="1750"/>
<closecopy id="clscpy_009" value="246" start="2070" end="2070"/>
<closecopy id="clscpy_010" value="226" start="2600" end="2600"/>
<closecopy id="clscpy_011" value="148" start="2780" end="2780"/>
<closecopy id="clscpy_012" value="144" start="3070" end="3070"/> |
| pitmove.xml |
| <pitmove id="pitm_001" type="4" href="closecopy.xml#id(clscpy_001)..id(clscpy_002)"/>
<pitmove id="pitm_002" type="1" href="closecopy.xml#id(clscpy_002)..id(clscpy_003)"/>
<pitmove id="pitm_003" type="B" href="closecopy.xml#id(clscpy_003)..id(clscpy_004)"/>
<pitmove id="pitm_004" type="Ø" href="closecopy.xml#id(clscpy_004)..id(clscpy_005)"/>
<pitmove id="pitm_005" type="B" href="closecopy.xml#id(clscpy_005)..id(clscpy_006)"/>
<pitmove id="pitm_006" type="4" href="closecopy.xml#id(clscpy_006)..id(clscpy_007)"/>
<pitmove id="pitm_007" type="Ø" href="closecopy.xml#id(clscpy_007)..id(clscpy_008)"/>
<pitmove id="pitm_008" type="4" href="closecopy.xml#id(clscpy_008)..id(clscpy_009)"/>
<pitmove id="pitm_009" type="0" href="closecopy.xml#id(clscpy_009)..id(clscpy_010)"/>
<pitmove id="pitm_010" type="B" href="closecopy.xml#id(clscpy_010)..id(clscpy_011)"/>
<pitmove id="pitm_011" type="Ø" href="closecopy.xml#id(clscpy_011)..id(clscpy_012)"/> |
5.2.5 Coding Procedure
The objective of phonetic transcription of intonation according to the IPO methodology is to obtain a stylized curve where the sequence of pitch movements is properly labeled. The MATE workbench will not provide a true stylization/resynthesis environment. It might provide a pitch tracking function to obtain the raw f0 curve, or alternatively a means to import it from external files. The most IPO-conformant coding procedure will directly import the stylized f0 curve, obtained with the help of a proper external environment for perceptual stylization (e.g. Winpitch, see http://www.winpitch.com), using the <closecopy> element with no need of the <f0> element, and will consist in the following steps:
1. open the speech file in order to listen to its intonation
2. open the corresponding phonetic segmentation (<phone> and <syllable>)
3. import the close copy and display it as a curve, aligned with phonetic segmentation
4. define <pitmove> elements by selecting the segments of the stylized curve (delimited by two consecutive <closecopy> elements) and labeling each of them according to the following criteria:
- if it can be considered to coincide with the ideal baseline or topline, by a global look at the curve, label it 0 or Ø respectively
- otherwise choose the proper label on the basis of movement direction and size and of its position in the syllable, judged by looking at its phonetic alignment

If the close copy is not available, the third
step may be replaced by the following steps (a very simplified approximation
of the correct stylization procedure):
- import or generate automatically the raw f0 curve and display it
- obtain a closecopy by selecting the 'relevant' <f0> points on the raw curve; base such stylization on the shape of the curve, the perceived intonation of the sound file and the alignment with syllables (accents, boundaries...)
5.2.6 Markup Table
| <closecopy> |
| attribute | value |
| id | [ASCII] |
| value | [FLOAT] |
| href | <f0> |
| start | [FLOAT] |
| end | [FLOAT] |

| <pitmove> |
| attribute | value |
| id | [ASCII] |
| type | 0, Ø, 1, 2, 3, 4, 5, A, B, C, D, E, &2, &3, &4, &A, &C, &D |
| href | <closecopy>..<closecopy> |
| start | [FLOAT] |
| end | [FLOAT] |
5.3. Layer 2: Phonetic Representation of Intonation - INTSINT scheme
5.3.1 Markup Declaration
INTSINT is a coding system of intonation developed by Daniel Hirst and his colleagues at the CNRS centre of the Aix-en-Provence University. It is conceived "to provide a purely formal encoding of the macroprosodic curve. Each target point of the stylised curve is coded by a symbol either as an absolute tone, defined globally with respect to the speakers pitch-range or as a relative tone, defined locally with respect to the immediately neighbouring target-points" (Campione et al., 1997, p. 72). Descriptions of this method can be found in Hirst (1991, 1994) and Hirst & Di Cristo (1998), among other references.
The starting point is again the raw f0 curve, which is (automatically) stylized to remove irrelevant and micro-prosodic details. The stylized representation, called MOMEL (Hirst & Espesser, 1993), consists in a sequence of inflection points [time : f0 value], which should be interpolated by a parabolic function. As a second step, each target point in the MOMEL stylized curve is considered in its absolute or relative height and accordingly labeled as a high or low tone. The elements necessary to represent the INTSINT notation system are the following:
- <f0>, for the frames of the raw f0 curve
- <momel>, for the inflection points in the stylized curve
- <intone>, for the labeled tones

The three elements are hierarchically ordered, with a one-to-one mapping between <intone>'s and <momel>'s. The alignment with the soundfile is kept through the base element <f0>, although in case the <momel> stylized curve is directly imported, the link with <f0> can be skipped and <momel> can be directly aligned with the soundfile.

5.3.2 The <momel> element
5.3.2.1 Description
This element has been included to allow each f0 inflection point of a MOMEL stylised f0 contour (used in the INTSINT annotation system as the phonetic base representation of f0 contour) to be considered –and accordingly handled– as an XML element.
For a detailed description of the MOMEL stylization procedure, the reader is referred to Hirst & Espesser (1993), and Hirst (1994), among other references.
5.3.2.2 Data Source
The MOMEL stylised f0 contour is obtained automatically from the raw f0 curve.
5.3.2.3 Segmentation/selection
The MOMEL stylised f0 values, either calculated or imported from the ‘mes’ tool, will be automatically converted to <momel> elements. ‘Mes’ is described at (and can be downloaded from) the following site: ‘http://www.lpl.univ-aix.fr/ext/projects/mes_signaix.htm/’.
5.3.2.4 Assignment
The attributes considered here for the <momel> elements are the following:
- value: f0 value (in Hz) of the stylised point
- href: <f0>, optional
- start: time start of the stylised point
- end: time end of the stylised point

If the MOMEL curve is imported from the 'mes' tool, the reference to <f0> can be avoided. In this case, as each inflection point is indeed a point, the two time attributes will have the same value. Otherwise, they will be inherited from start, end of the <f0> frame.
5.3.3 The <intone> element
5.3.3.1 Description
The target points in the MOMEL stylized curve can be phonetically labeled as tones, here represented by the <intone> element.
INTSINT includes two types of symbols to transcribe f0 tones:
1) Absolute Tones
INTSINT includes three symbols to label the Absolute Tones, which are defined according to the speaker’s pitch range.
| T | top of the speaker’s pitch range |
| M | initial, mid value |
| B | bottom of the speaker’s pitch range |
2) Relative Tones
Relative tones are coded in INTSINT considering the height of the preceding and following target points. Five different symbols exist to transcribe these Relative Tones:
| H | target higher than both immediate neighbours |
| L | target lower than both immediate neighbours |
| S | target not different to preceding target |
| U | target in a rising sequence |
| D | target in a falling sequence |
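The definitions of the relative tones in the table above can be turned into a small decision procedure. The Python sketch below is a simplified illustration and not the algorithm implemented in the 'mes' tool: it assigns no absolute tones (T, M, B), it leaves the first and last targets unlabelled because they lack one neighbour, and the function name and the tolerance parameter are assumptions.

```python
def relative_tones(values, tolerance=0.0):
    """Assign H, L, S, U or D to each interior target point.

    values: f0 values (Hz) of consecutive target points
    tolerance: largest difference (Hz) still counted as 'not different'
    Returns a list with None for the first and last targets.
    """
    tones = [None] * len(values)
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if abs(cur - prev) <= tolerance:
            tones[i] = "S"   # not different from the preceding target
        elif cur > prev and cur > nxt:
            tones[i] = "H"   # higher than both immediate neighbours
        elif cur < prev and cur < nxt:
            tones[i] = "L"   # lower than both immediate neighbours
        elif cur > prev:
            tones[i] = "U"   # step up within a rising sequence
        else:
            tones[i] = "D"   # step down within a falling sequence
    return tones


tones = relative_tones([100, 120, 140, 140, 120, 100])
```

A full INTSINT coding would also decide, for each target, between an absolute and a relative tone, following the conventions of the references cited in this section.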
5.3.3.2 Data Source
The INTSINT representation is usually obtained from the MOMEL stylised f0 contour. So the <intone> element will be directly linked to <momel>. Phonetic segmentation is also useful to assign labels, although it is not strictly necessary.
5.3.3.3 Segmentation/selection
The INTSINT symbols are assigned to each inflection point of the MOMEL stylised contour, following a set of conventions which are described in Hirst (1991, 1994), Hirst et al. (1993) and Hirst & Di Cristo (1998), among other references.
In order to label <intone>'s, the <momel> elements should be displayed as a stylized curve (parabolic interpolation) aligned with phonetic segmentation.
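For display purposes, one common way to realize a parabolic interpolation between two target points is to join two parabolas at the temporal midpoint, with zero slope at each target. The Python sketch below follows that assumption; it is meant as an illustration of the idea, and Hirst & Espesser (1993) should be consulted for the actual MOMEL definition. The function name is hypothetical, and the sample targets are the first two <momel> points of the example in this section.

```python
def parabolic_segment(t, t1, h1, t2, h2):
    """Stylized f0 at time t between targets (t1, h1) and (t2, h2).

    Two parabolas meet at the midpoint of the segment; the curve is flat
    (zero slope) at both targets and smooth at the junction.
    """
    if t1 >= t2 or not t1 <= t <= t2:
        raise ValueError("need t1 < t2 and t1 <= t <= t2")
    x = (t - t1) / (t2 - t1)          # normalized position in [0, 1]
    if x <= 0.5:
        return h1 + 2 * (h2 - h1) * x * x
    return h2 - 2 * (h2 - h1) * (1 - x) ** 2


mid_value = parabolic_segment(185.5, 106, 163, 265, 217)
```

At the midpoint the curve passes exactly halfway between the two target values, as with linear interpolation, but it leaves and reaches the targets with zero slope.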
The INTSINT symbols can also be automatically assigned to the MOMEL inflection points by means of the ‘mes’ tool.
5.3.3.4 Assignment
The attributes considered here for the <intone> element are the following:
- type: the INTSINT symbol corresponding to the tone
- href: points to a single <momel> element
- start: time start of the stylised point, inherited from <momel>
- end: time end of the stylised point, inherited from <momel>

5.3.4 Example
The example presented here shows the MOMEL and INTSINT
annotation of the French utterance 'Il faut que je sois a Grenoble Samedi
vers quinze heures', using the <momel>
and <intone> elements.
| momel.xml |
| <momel id="mml_001" value="163" start="106" end="106"/>
<momel id="mml_002" value="217" start="265" end="265"/>
<momel id="mml_003" value="148" start="521" end="521"/>
<momel id="mml_004" value="190" start="617" end="617"/>
<momel id="mml_005" value="130" start="827" end="827"/>
<momel id="mml_006" value="223" start="1249" end="1249"/>
<momel id="mml_007" value="139" start="1614" end="1614"/>
<momel id="mml_008" value="172" start="1822" end="1822"/>
<momel id="mml_009" value="144" start="1983" end="1983"/>
<momel id="mml_010" value="185" start="2078" end="2078"/>
<momel id="mml_011" value="152" start="2248" end="2248"/>
<momel id="mml_012" value="99" start="2505" end="2505"/>
<momel id="mml_013" value="152" start="2730" end="2730"/> |
| intone.xml |
| <intone id="intn_001" type="L" href="momel.xml#id(mml_001)"/>
<intone id="intn_002" type="T" href="momel.xml#id(mml_002)"/>
<intone id="intn_003" type="M" href="momel.xml#id(mml_003)"/>
<intone id="intn_004" type="H" href="momel.xml#id(mml_004)"/>
<intone id="intn_005" type="L" href="momel.xml#id(mml_005)"/>
<intone id="intn_006" type="T" href="momel.xml#id(mml_006)"/>
<intone id="intn_007" type="M" href="momel.xml#id(mml_007)"/>
<intone id="intn_008" type="H" href="momel.xml#id(mml_008)"/>
<intone id="intn_009" type="L" href="momel.xml#id(mml_009)"/>
<intone id="intn_010" type="H" href="momel.xml#id(mml_010)"/>
<intone id="intn_011" type="D" href="momel.xml#id(mml_011)"/>
<intone id="intn_012" type="B" href="momel.xml#id(mml_012)"/>
<intone id="intn_013" type="M" href="momel.xml#id(mml_013)"/> |
5.3.5 Coding Procedure
A specific tool, 'mes' (available at ‘http://www.lpl.univ-aix.fr/ext/projects/mes_signaix.htm/’), has been developed to perform automatic intonation transcription according to the INTSINT system. Both stylization and annotation can be performed automatically by 'mes'. So, the simplest way to get to INTSINT annotation in the MATE environment would be the following:
1. import the raw f0 curve in <f0>
2. import the MOMEL stylized curve in <momel>
3. import the INTSINT annotation in <intone>
4. link <momel> to <f0> and <intone> to <momel> (an automatic function should be provided for that by the workbench)

The <f0> element may not be necessary (unless it is used as a reference by other layers). In case only <momel> is imported, <intone>'s may be created manually by the following procedure:
1. open the speech file in order to listen to its intonation
2. open the corresponding phonetic segmentation
3. import <momel> elements and display them as a stylized curve
4. define <intone> elements by selecting every <momel> element (inflection point in the stylized curve) and marking it with the proper label
5.3.6 Markup Table
| <momel> |
| attribute | value |
| id | [ASCII] |
| value | [FLOAT] |
| href | <f0> (optional) |
| start | [FLOAT] |
| end | [FLOAT] |

| <intone> |
| attribute | value |
| id | [ASCII] |
| type | T, M, B, H, S, L, U, D |
| href | <momel> |
| start | [FLOAT] |
| end | [FLOAT] |