Each file may contain an arbitrary number of document blocks.
Document block:
#begin document
#end document
Content of a document block:
Tabular data with the following columns, one word per line. Blocks are separated by empty lines.
A block may represent a sentence or other kind of grouping within a document.
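The document and block structure described above can be read with a small line-based splitter. The following is a minimal sketch, not a reference implementation. The function name split_blocks is hypothetical, and it assumes that every line starting with '#' inside a document is a marker or property line rather than a data row. Rows are kept as raw strings because the column separator is not stated here.

```python
from typing import Iterable, Iterator, List


def split_blocks(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one list of blocks per document; each block is a list of raw row lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    inside = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#begin document"):
            inside, blocks, current = True, [], []
        elif line.startswith("#end document"):
            if current:
                blocks.append(current)  # flush the last open block
            if inside:
                yield blocks
            inside, blocks, current = False, [], []
        elif not inside or line.startswith("#"):
            continue  # assumption: '#' lines inside a document are properties
        elif not line.strip():
            if current:  # an empty line closes the current block
                blocks.append(current)
                current = []
        else:
            current.append(line)
```

Calling list(split_blocks(open(path, encoding="utf-8"))) would give one entry per document, each holding its blocks in file order.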
Properties:
Arbitrary property declarations, one per line, in the form of
#key value
'key' may be any non-empty string apart from the two reserved tokens "begin" and "end".
The first whitespace sequence is used as the delimiter between key and value, so the key itself must not contain
any whitespace characters. The value string, on the other hand, is unrestricted in its content except that it must not
contain any line-break characters.
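The key/value rules above can be sketched as a small parser. This is an illustration under stated assumptions, not the format's official reader. The name parse_property is hypothetical, the key is assumed to follow the '#' directly with no space in between, and a declaration without a value is assumed to yield an empty string.

```python
import re
from typing import Optional, Tuple

RESERVED_KEYS = {"begin", "end"}  # reserved for the document markers

# '#', a whitespace-free key, then optionally whitespace and the value.
_PROPERTY = re.compile(r"#(\S+)(?:\s+(.*))?")


def parse_property(line: str) -> Optional[Tuple[str, str]]:
    """Return (key, value) for a property line, or None for any other line."""
    match = _PROPERTY.fullmatch(line.rstrip("\r\n"))
    if match is None or match.group(1) in RESERVED_KEYS:
        return None
    # Assumption: a key without a value maps to the empty string.
    return match.group(1), match.group(2) or ""
```

Only the first whitespace sequence is consumed as the delimiter, so any further whitespace inside the value is preserved as written.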
Block-level fields:
Column  Name                   Type      Description
1       Word number            number    0 to block_length-1
2       Word itself            string    The token as segmented/tokenized in the Treebank.
3       Part-of-Speech         string
4       Features               string[]  Morphological features
5       Head                   number    Word number of the head in the dependency structure (0 means root)
6       DepRel                 string    Dependency relation (if the word is not the root node)
7       Speaker/Author         string    Speaker or author name where available, mostly in Broadcast Conversation and Web Log data.
8       Speaker-Features       string[]  Special features array to store information about speakers
9       Named Entities         string    Identifies the spans representing various named entities.
10      Coreference            string    Coreference chain information encoded in a parenthesis structure.
11      Begin-Timestamp        float     Timestamp of word begin in audio file
12      End-Timestamp          float     Timestamp of word end in audio file
13      Syllable-SoundOffsets  int[]     Character-based offsets for syllables in the word
14      Syllable-Labels        string[]  Syllable labels. This array is the main reference when determining the number of syllables in a word.
15      Syllable-Timestamps    float[]   Timestamps of each syllable's begin in audio file
16      Syllable-Vowel         string[]  Phonetic vowel description
17      Syllable-Stress        int[]     List of stressed syllables in a word. Values refer to the Syllable-Labels array.
18      Syllable-Duration      float[]   Duration of each syllable
19      Vowel-Duration         float[]   Duration of the vowel in each syllable
20      Syllable-Startpitch    float[]
21      Syllable-Midpitch      float[]
22      Syllable-Endpitch      float[]
23      Coda-Type              string[]
24      Coda-Size              int[]
25      Onset-Type             string[]
26      Onset-Size             int[]
27      Phoneme-Count          int[]
28:33   PaintE-Parameters      float[]   PaintE parameters (total of 6 columns) for each syllable.
All arrays use the pipe character ('|') as delimiter.
The underscore character ('_') signals an empty value.
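The two conventions above can be illustrated with a pair of small decoding helpers. This is a hedged sketch: the names parse_scalar and parse_array are hypothetical, and it assumes that a cell consisting only of '_' denotes an empty array, while a '_' between pipes marks a single missing element.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def parse_scalar(cell: str, convert: Callable[[str], T]) -> Optional[T]:
    """Decode a single value; '_' is treated as an empty value (None)."""
    return None if cell == "_" else convert(cell)


def parse_array(cell: str, convert: Callable[[str], T]) -> List[Optional[T]]:
    """Decode a pipe-delimited array cell such as '0.12|0.31|0.5'."""
    if cell == "_":
        return []  # assumption: a lone underscore means "no array at all"
    return [parse_scalar(item, convert) for item in cell.split("|")]
```

For instance, parse_array(cell, float) would suit the float[] columns and parse_array(cell, str) the string[] columns.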