4.1 Markup Declaration
The layer of phonetic transcription is a base level intended for the representation of the minimal units for phonetic and prosodic analysis: phones and syllables. The level defines a base element, the <phone> element, corresponding to a segment in the speech signal, labeled according to its phonetic features. A <syllable> element may be added, consisting of a sequence of <phones>. The annotation at this level is a transcription and a segmentation, in the sense that it refers directly to the speech signal, recognizes the uttered sounds and splits the speech continuum into phonetic chunks. Each <phone> will then be classified with a phonetic label and associated with time information specifying its start and end instants. Higher linguistic levels, like the phonological prosodic levels or the orthographic word level, might inherit time information from the phonetic level by linking their elements with <phone> elements or <syllable> elements.
The scheme adopted here for phonetic transcription is SAMPA [Wells et al., 1992], which is intended for multi-lingual phonetic transcription. In the original SAMPA notation, a transcription is a stream of phonetic labels and diacritics, where labels classify phones and diacritics give further specifications about phones, with the exception of stress marks which implicitly refer to the following syllable. In our adaptation, the <syllable> element is made explicit as a second layer built on top of the <phone> layer.
4.2 The <phone> element
4.2.1 Description
For the annotation of phones, SAMPA (SAM Phonetic Alphabet) has been chosen, providing a multilingual and computer-readable inventory of phonetic symbols.
The transcription task using SAMPA involves the use of a set of symbols and diacritics, which can be combined to represent the phonetic realisation of phones.
The considered SAMPA symbols provide labels for vowels and consonants. A further symbol (taken from the SAMPROSA extension of the SAMPA scheme) is considered for pauses, which are marked as a special kind of sound. Symbols can be combined in some cases, e.g. two vowel symbols may be combined to represent diphthongs. The set of allowable combinations may be language-dependent. A few diacritics are also available to mark additional features of phones: e.g. the length mark ":" may follow a phonetic label. The base symbols are listed below:
a) consonants
| SAMPA symbol | phonetic description |
| b | voiced bilabial plosive |
| c | voiceless palatal plosive |
| C | voiceless palatal fricative |
| d | voiced dental/alveolar plosive |
| D | voiced dental fricative |
| f | voiceless labiodental fricative |
| g | voiced velar plosive |
| G | voiced velar fricative |
| h | voiceless glottal fricative |
| j | palatal approximant |
| k | voiceless velar plosive |
| l | dental/alveolar lateral approximant |
| L | palatal lateral approximant |
| m | bilabial nasal |
| n | dental/alveolar nasal |
| J | palatal nasal |
| N | velar nasal |
| p | voiceless bilabial plosive |
| r | alveolar trill |
| R | uvular trill/fricative |
| s | voiceless alveolar fricative |
| S | voiceless postalveolar fricative |
| t | voiceless dental/alveolar plosive |
| T | voiceless dental fricative |
| v | voiced labiodental fricative |
| w | labial-velar approximant |
| x | voiceless velar fricative |
| H | labial-palatal approximant |
| z | voiced alveolar fricative |
| Z | voiced postalveolar fricative |
| ? | stød, glottal stop |
b) vowels
| SAMPA symbol | phonetic description |
| a | open front unrounded |
| A | open back unrounded |
| { | near-open front unrounded |
| 6 | near-open central unrounded |
| Q | open back rounded |
| O | open-mid back rounded |
| e | close-mid front unrounded |
| E | open-mid front unrounded |
| @ | mid central unrounded (schwa) |
| 3 | mid central unrounded |
| i | close front unrounded |
| I | near-close front unrounded lax |
| o | close-mid back rounded |
| 2 | close-mid front rounded |
| 9 | open-mid front rounded |
| & | open front rounded |
| u | close back rounded |
| U | near-close back rounded lax |
| } | close central rounded |
| V | open-mid back unrounded |
| y | close front rounded |
| Y | near-close front rounded lax |
c) pause
| SAMPA (SAMPROSA) symbol | phonetic description |
| ... | silent pause |
The following SAMPA diacritics may be combined with
the phonetic label (preceding or following it, according to the syntax
suggested by the example):
| SAMPA symbol | phonetic description | Example of use |
| ~ | nasalization | O~ |
| = | syllabic consonant | =n |
| : | length mark | a: |
The user is referred to Wells et al. (1992) for a detailed description of the SAMPA symbols and their corresponding usage. More information is also available at ‘http://www.phon.ucl.ac.uk/home/sampa/home.htm’, including guidelines for the use of SAMPA for transcription in the following languages: Bulgarian, Croatian, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish and Swedish. A description of the SAMPROSA scheme can be found at ‘http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm’.
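As an illustration of how labels and diacritics combine, the following is a minimal sketch of a tokenizer that splits an unspaced label stream into phone labels. It assumes that ":" and "~" follow their label and "=" precedes it, as suggested by the examples in the diacritics table; the function name `tokenize_sampa` is hypothetical, and language-dependent combinations such as diphthongs, as well as stress marks, are deliberately not handled.

```python
# Base symbols taken from the consonant and vowel tables of this section.
CONSONANTS = set("bcCdDfgGhjklLmnJNprRsStTvwxHzZ?")
VOWELS = set("aA{6QOeE@3iIo29&uU}VyY")
PAUSE = "..."                    # SAMPROSA silent pause
PREFIX_DIACRITICS = {"="}        # assumed to precede the label, e.g. "=n"
SUFFIX_DIACRITICS = {"~", ":"}   # assumed to follow the label, e.g. "O~", "a:"


def tokenize_sampa(text):
    """Split an unspaced SAMPA string into labels with their diacritics."""
    tokens = []
    i = 0
    while i < len(text):
        if text.startswith(PAUSE, i):
            tokens.append(PAUSE)
            i += len(PAUSE)
            continue
        start = i
        if text[i] in PREFIX_DIACRITICS:
            i += 1
        if i >= len(text) or text[i] not in CONSONANTS | VOWELS:
            raise ValueError(f"unexpected symbol at position {i}: {text[start:]!r}")
        i += 1
        while i < len(text) and text[i] in SUFFIX_DIACRITICS:
            i += 1
        tokens.append(text[start:i])
    return tokens


print(tokenize_sampa("kasa"))
print(tokenize_sampa("O~=na:..."))
```

Each returned token could then serve as the label of one <phone> element.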
4.2.2 Data Source
Phonetic transcription is usually carried out from speech files, where the speech sound is sampled. Speech files can be listened to and graphically displayed on a time axis, so that phones can easily be time-aligned to sound.
4.2.3 Segmentation/selection
Phonetic transcription is a segmentation task: the speech sound is segmented into a sequence of adjacent chunks, each corresponding to a <phone>. While in principle one could listen to the recorded speech and write down the perceived phones and the corresponding time (measured by a clock...), a reasonable segmentation procedure should rely on sampled speech, graphically shown as a waveform on a time axis and possibly also displayed in its spectrographic representation. The annotator would select a signal portion on the screen, listen to it and inspect its shape. Each phone will be characterized by its peculiar shape and show two transition zones where the boundaries with the adjacent phones should be placed. On this basis the annotator would recognize the uttered phone and segment it, possibly by mouse clicking on its start and end point on the screen.
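Since the segmentation is described as a sequence of adjacent chunks, a tool could check that each phone ends where the next one begins. The sketch below shows one possible check; the function name `check_adjacent`, the tuple layout and the `tolerance` parameter are assumptions made for illustration only.

```python
def check_adjacent(phones, tolerance=0.0):
    """Return a list of problems found in a time-ordered phone sequence.

    phones: list of (label, start, end) tuples, in temporal order.
    tolerance: largest accepted difference between one phone's end
               and the next phone's start.
    """
    problems = []
    for index, (label, start, end) in enumerate(phones):
        if end <= start:
            problems.append(f"{label!r} at position {index} has no positive duration")
    for (prev_label, _, prev_end), (next_label, next_start, _) in zip(phones, phones[1:]):
        if abs(prev_end - next_start) > tolerance:
            problems.append(f"gap or overlap between {prev_label!r} and {next_label!r}")
    return problems


segmentation = [("k", 345, 390), ("a", 390, 450), ("s", 450, 490), ("a", 490, 540)]
print(check_adjacent(segmentation))
```

An empty list means that the segments tile the signal portion without gaps or overlaps.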
4.2.4 Assignment
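The attributes considered here for the <phone> element are the following:
id: identifier of the <phone> element
type: the SAMPA phonetic label assigned to the phone
start: start instant of the phone
end: end instant of the phone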
4.3 The <syllable> element
4.3.1 Description
In many prosodic descriptions the syllable is taken
as the minimal prosodic unit, the building block of the rhythmical structure
and the scope of intonation events. Formally, it is a sequence of one or
more phonemes centered on a vocalic nucleus. Its precise definition is
language and theory dependent. In SAMPA, the diacritics for primary and
secondary stress are inserted at the beginning of the stressed syllable:
e.g. ["meZ@] (measure), [@"nVD@] (another). So, even if the
prosodic extension of SAMPA (SAMPROSA [Gibbon, 1989]) is not taken into
account, the notion of syllable is implicit in SAMPA phonetic notation.
Here an element <syllable> is defined
explicitly, linked to its component <phone>'s
and possibly carrying the stress mark, according to the following definition:
| " | primary stress |
| % | secondary stress |
The SAMPA primary stress symbol (") cannot be written literally inside a
double-quoted XML attribute value. For this reason, it has to be represented by the entity &quot;.
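The escaping of the primary stress mark can be left to a standard XML library. The following sketch uses Python's `xml.sax.saxutils.escape` with an extra entity mapping and then parses the result back with `xml.etree.ElementTree`; the helper name `stress_attribute` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def stress_attribute(mark):
    """Render a stress mark as a double-quoted XML attribute."""
    # escape() handles &, < and >; the extra mapping adds the double quote.
    return 'stress="{}"'.format(escape(mark, {'"': "&quot;"}))


primary = stress_attribute('"')
secondary = stress_attribute("%")
print(primary)    # the quote is written as an entity
print(secondary)  # the percent sign needs no escaping

# Parsing the attribute back yields the original SAMPA symbol.
element = ET.fromstring(f'<syllable id="sllbl_01" {primary}/>')
print(element.get("stress"))
```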
4.3.2 Data Source
Syllables are defined starting from <phone>'s.
4.3.3 Segmentation/selection
After the phonetic transcription has been obtained, syllables are defined by selecting their component phones, from syllable boundary to syllable boundary, according to the phonetic syllabification rules of the language (and of the chosen linguistic theory). Each syllable is then judged as to its accent degree. Language- and theory-dependent automatic procedures could be implemented for syllabification.
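To give an idea of what an automatic procedure might look like, the sketch below implements a deliberately naive rule that is not part of the scheme: every vowel label is taken as a syllable nucleus, a single consonant immediately before a nucleus becomes its onset, and any other consonants stay with the preceding syllable. The function names are hypothetical, and the rule ignores diphthongs, syllabic consonants and all language-specific constraints.

```python
VOWELS = set("aA{6QOeE@3iIo29&uU}VyY")


def is_vowel(label):
    """True if the label, stripped of diacritics, starts with a vowel symbol."""
    return label.strip("=:~")[:1] in VOWELS


def syllabify(labels):
    """Group phone labels into syllables with a toy one-consonant-onset rule."""
    nuclei = [i for i, label in enumerate(labels) if is_vowel(label)]
    if not nuclei:
        return [list(labels)] if labels else []
    starts = [0]
    for previous, current in zip(nuclei, nuclei[1:]):
        # With at least one consonant between two nuclei, the last one
        # opens the next syllable; otherwise the nucleus itself does.
        starts.append(current - 1 if current - previous > 1 else current)
    ends = starts[1:] + [len(labels)]
    return [list(labels[s:e]) for s, e in zip(starts, ends)]


print(syllabify(["k", "a", "s", "a"]))
print(syllabify(["@", "n", "V", "D", "@"]))
```

For the two inputs shown, the grouping agrees with the syllable boundaries implied by the stress marks in [@"nVD@] and by the usual syllabification of 'casa'; a realistic procedure would need language-specific onset and coda rules.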
4.3.4 Assignment
The attributes considered here for the <syllable> element are the following:
stress: optional label specifying if the syllable is stressed, with primary (") or secondary (%) stress; if not specified, the syllable is unstressed
href: a sequence of <phone> elements
start: start of the first phone of the syllable, inherited from the first <phone> element
end: end of the last phone of the syllable, inherited from the last <phone> element
4.4 Examples
The following example shows the phonetic transcription
of the Spanish word 'casa' ('house') and its corresponding syllabic
segmentation, using the <phone> and
<syllable> elements:
| phone.xml |
| <phone id="phn_01" type="k" start="345" end="390"/> |
| <phone id="phn_02" type="a" start="390" end="450"/> |
| <phone id="phn_03" type="s" start="450" end="490"/> |
| <phone id="phn_04" type="a" start="490" end="540"/> |
| syllable.xml |
| <syllable id="sllbl_01" stress="&quot;" href="phone.xml#id(phn_01)..id(phn_02)"/> |
| <syllable id="sllbl_02" href="phone.xml#id(phn_03)..id(phn_04)"/> |
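The inheritance of time information from <phone> elements to <syllable> elements can be sketched as follows. The code is an illustration under several assumptions: the elements are wrapped in hypothetical root elements (`<phones>`, `<syllables>`) so that each file is a well-formed XML document, the href ranges are written with the identifiers used in phone.xml, the primary stress mark is written as the entity &quot;, and the function name `resolve_syllables` is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

PHONE_XML = """<phones>
  <phone id="phn_01" type="k" start="345" end="390"/>
  <phone id="phn_02" type="a" start="390" end="450"/>
  <phone id="phn_03" type="s" start="450" end="490"/>
  <phone id="phn_04" type="a" start="490" end="540"/>
</phones>"""

SYLLABLE_XML = """<syllables>
  <syllable id="sllbl_01" stress="&quot;" href="phone.xml#id(phn_01)..id(phn_02)"/>
  <syllable id="sllbl_02" href="phone.xml#id(phn_03)..id(phn_04)"/>
</syllables>"""

# Matches the range part of an href such as "phone.xml#id(a)..id(b)".
HREF_RANGE = re.compile(r"#\s*id\((\w+)\)\.\.id\((\w+)\)")


def resolve_syllables(phone_xml, syllable_xml):
    """Attach component phones and inherited times to each syllable."""
    phones = ET.fromstring(phone_xml).findall("phone")
    position = {phone.get("id"): i for i, phone in enumerate(phones)}
    resolved = []
    for syllable in ET.fromstring(syllable_xml).findall("syllable"):
        match = HREF_RANGE.search(syllable.get("href", ""))
        if match is None:
            raise ValueError(f"cannot read href of {syllable.get('id')}")
        first, last = (position[phone_id] for phone_id in match.groups())
        members = phones[first:last + 1]
        resolved.append({
            "id": syllable.get("id"),
            "stress": syllable.get("stress"),  # None means unstressed
            "phones": [phone.get("type") for phone in members],
            "start": float(members[0].get("start")),  # from the first phone
            "end": float(members[-1].get("end")),      # from the last phone
        })
    return resolved


for entry in resolve_syllables(PHONE_XML, SYLLABLE_XML):
    print(entry)
```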
4.5 Coding Procedure
Manual phonetic segmentation would be helped by a software tool displaying the speech signal in its waveform and spectrographic representations, allowing listening, selecting signal portions, zooming, selecting pre-defined phonetic labels, choosing segmentation points on the time axis. The set of allowable phonetic labels (for the given language) should be defined in the DTD (the DTD included in the Annex does not define language-dependent symbol sets), while a specific coding guideline document will explicitly state the adopted set of segmentation criteria. The coding procedure would then be:
1. select the speech file and open the synchronized windows for phonetic segmentation and waveform and spectrum display
2. zoom until a detailed inspection of the signal is possible
3. inspect and listen to the signal portion until the uttered phonemes are recognized
4. select a phonetic label for the first phone
5. identify its boundaries according to the segmentation criteria and mark them by placing the cursor on the proper point on the time-axis (this should automatically set the time attribute)
6. after phonetic segmentation is concluded, define syllables by selecting their component <phone>'s and, if stressed, by assigning the proper stress mark
Tools for automatic segmentation are available, often language dependent. Good performance is offered by phonetic aligners, which align a speech signal to a predefined phonetic transcription. The procedure in this case would be:
1. listen to the speech sound and transcribe it as a sequence of phones
2. apply the phonetic aligner to the speech signal with its phonetic transcription and obtain its phonetic segmentation
3. import the phonetic segmentation into the MATE environment
4. define syllables as in step 6 above.
4.6 Markup Table
| <phone> |
| attribute | value |
| id | [ASCII] |
| type | b,c,..,a,A,..,=,: |
| start | [FLOAT] |
| end | [FLOAT] |
| <syllable> |
| attribute | value |
| id | [ASCII] |
| stress | ", % |
| start | [FLOAT] |
| end | [FLOAT] |
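As a sketch of how a segmentation could be turned into markup conforming to the attribute types above, the following code builds <phone> elements from (label, start, end) triples such as a phonetic aligner might produce. The triple layout, the `phn_` identifier pattern and the function name `build_phone_elements` are assumptions for illustration; the actual import format of the MATE environment is not described here.

```python
import xml.etree.ElementTree as ET


def build_phone_elements(segments, prefix="phn_"):
    """Create one <phone> element per (label, start, end) triple."""
    elements = []
    for number, (label, start, end) in enumerate(segments, start=1):
        start, end = float(start), float(end)  # start and end are [FLOAT]
        if not label.isascii():
            raise ValueError(f"label {label!r} is not ASCII")
        if end <= start:
            raise ValueError(f"phone {number} ends before it starts")
        elements.append(ET.Element("phone", {
            "id": f"{prefix}{number:02d}",      # id is [ASCII]
            "type": label,
            "start": str(start),
            "end": str(end),
        }))
    return elements


aligned = [("k", 345, 390), ("a", 390, 450), ("s", 450, 490), ("a", 490, 540)]
for phone in build_phone_elements(aligned):
    print(ET.tostring(phone, encoding="unicode"))
```

Syllables would then be defined on top of these elements, as in step 6 of the manual procedure.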