General markup guidelines
Andreas Mengel, IMS
This chapter deals with general considerations useful for the formal
encoding of linguistic annotation data.
1. Markup
The term markup is often used in a very broad sense, so the
following three sub-concepts that it covers should be distinguished:
-
the phenomena investigated: Some researchers will describe the words
and their properties, some will go for the relations of chunks of words,
and others investigate their sound structures;
-
the theory: This is how the linguistic phenomena are labelled. They
may be called words, and the theory might claim that these can be
put into part-of-speech (POS) categories like A, B or C, while another theory might
say that the POS values are NN, OO, PP and QQ. And perhaps these categories
are not called POS but wc (word categories);
-
the markup: This is the kind of characters, the grammar, the formalism
used to represent the labels used by a given theory for the description
of linguistic material. So if a piece of the corpus is:
These three concepts (phenomena, theory, markup) are
orthogonal. So one can have XML, SGML, or xwaves/xlabel as the markup;
a Chomskyan or Tesnierian approach to syntax; and one can have descriptions
of the intonation, syntax, or semantics; and any possible combination of
the three.
2. Data files
There are various types of files specifying different kinds of information
to be encoded during the annotation of corpora entities:
|
non-XML data
|
-
source data: these are data that have been produced to record linguistic
behaviour, prior to any annotation process
-
text: text data are transcripts of speech or other written documents
-
non-text: non-text data are binary data that represent recordings
of speech behaviour
-
audio: audio data may be speech files
-
graphical: graphical representations of speech behaviour or the
communication situation
-
pictures: pictures taken during a communication situation
-
video: visual recordings of the communication situation
|
|
(possibly) XML data
|
-
analysis data: mathematical analyses performed on any of the data
-
signal measurement data: evaluations of the speech signal, e.g.
f0 and other spectral analysis
-
statistical calculations: descriptive statistical analysis and statistical
tests on any of the data
-
annotation data: data that have been produced to categorize linguistic
phenomena
-
level annotation: annotation of individual linguistic levels of
description
-
knowledge resources: data that describe phenomena in a generic way
and relate to phenomena the properties of which are considered stable during
the linguistic situation annotated
-
lexical data: type information provided for individual words
-
universe data: information on the situation and objects in a given
communication situation
-
annotation process data: data that relate to the process of annotation
-
prescriptive information: these are data that are a reference for
the coder when annotating
-
unit definitions: definitions of individual units to segment and
categorize
-
coding guidelines: descriptions of the annotation procedure, dependent
on the linguistic level
-
descriptive information: this is information related to the actual
annotation
-
source data list: a list of data that have been used as source data
-
log files: information on actual annotation processes such as the
creation, modification or deletion of a tag
-
settings: these are customizations of the software environment
-
user privileges: settings that describe permissions of the user
for access and modification of data
-
graphical user interface data: general settings of the software
environment
-
display style sheets: specific display and access descriptions for
given annotation tasks
-
user preference data: user controlled software environment settings
|
The following figure represents these types of information and the relations
among them. The ellipses in the middle symbolize annotation levels. Square
boxes are non-level information. Arrows indicate referring relations: arrows
between levels denote reference between levels, e.g. reference from
the word level to the phone level; arrows from the level files to square
boxes symbolize reference to rules that have been used during the annotation;
arrows pointing from square boxes to ellipses indicate (meta-)annotation
of the level label files. The shading of boxes or ellipses denotes that
their content is annotation of linguistic objects. Note that the direction
of the arrows also reflects the process of annotation, i.e. in most
cases the objects pointed to are produced earlier.

The data produced during the annotation of a dialogue will be kept in
separate files, but it should also be ensured that their mutual relationship
and the fact that they belong to the same annotation project is transparent.
There are two strategies to specify this membership information:
-
list: A project resource file is established that keeps the file
names of the data used for the annotation project (annotation process information)
-
links: Relations are established between elements of different levels
of description; the elements refer to each other and are located in separate
files (level annotation information)
In general, links between any two types of documents can be established
by the means described in the table below.
| from |
to |
by means of |
| list of resources |
files used during annotation |
file name attribute |
| XML document |
non-XML document |
file name attribute |
| non-XML document |
any other document |
not possible |
| XML document |
XML document |
href attributes between elements |
The following figure represents the different types of files and explicit
references. It is an adapted version of a figure in D1.2 which has been
modified in order to also reflect the direction of reference. Bold lines
represent reference by file name, dotted lines represent reference (href-attributes)
between annotation elements of different level annotation files.

3. Markup conventions
This section is dedicated to all markup aspects that can be described on
a general level and that are meant to be applied to level-specific markup descriptions.
The following is a general guideline for the mapping of XML syntax onto
linguistic concepts. In order to make this mapping as uniform, efficient,
and consistent as possible, these guidelines discuss some of the problems
encountered and offer proposals for solutions.
XML has been chosen for the representation of annotation
data in the MATE project. Besides the good software support currently available for
this standard, there are the following reasons for XML as the representation
model:
XML uses
-
one general description model containing
-
elements
-
their attributes
-
part-whole relations between elements
-
other relations between elements
-
one encoding format to represent, relate, and distinguish these kinds of
information
This is the reason why XML can be assumed to be able to encode markup irrespective
of the phenomena described or the theoretical approach taken. The underlying
(object-oriented-like) model of description that XML is based on can be adopted
by almost all theories: there are phenomena (elements), some of which can
be split into sub-phenomena (embedded elements), and these phenomena have relations
and properties (attributes).
On the formal side this model is reflected by a uniform syntax:
XML elements are enclosed in angle brackets,
<xyz att="val">
inside which the values (val) of attributes (att)
are specified. Elements are those entities that conceptually
group together descriptions of linguistic items. The element name is
put directly after the
<, as in <word>. Attributes
are those entities that allow further description of elements by specifying
their properties. The attribute name denotes the property dimension; its
actual status for a given element is described by the attribute value.
In <word pos="NN">, pos is the attribute and its
value is NN.
PCDATA are any characters which are not enclosed
in a pair of one opening (<) and one closing (>) angle
bracket (see below).
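As an illustration of these three notions, the following minimal Python sketch reads one element with the standard-library module xml.etree.ElementTree and separates its element name, its attributes and its PCDATA. The word content "tree" is an assumption added for the example; the element and attribute names follow the text above.

```python
import xml.etree.ElementTree as ET

# One element as in the text; the content "tree" is assumed for illustration.
element = ET.fromstring('<word pos="NN">tree</word>')

element_name = element.tag          # the name that follows "<"
attributes = dict(element.attrib)   # attribute names mapped to their values
pcdata = element.text               # the characters outside the angle brackets

print(element_name, attributes, pcdata)
```

Any XML parser would expose the same three kinds of information; the choice of library is incidental.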
Despite this general approach, XML does not provide the following:
-
typed or grammar-based specification of attribute values, e.g. for distinguishing
floats from characters or for defining attribute values by regular
expressions or BNF grammars,
-
inference models for element values that allow for centralized specification
of properties that are shared by more than one element,
-
applicability restrictions for attributes that are mutually exclusive, e.g.
words that are nouns cannot have tense information, and case information
is not applicable to verbs.
The XML community is aware of these problems and proposals are under way
to improve and extend the XML standard. The inference problem is discussed
in more detail in the chapter on cross-level annotation.
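Until such extensions are available, an applicability restriction can be checked by a program outside the XML grammar. The following Python sketch is one possible way to do this, not part of the MATE specification. The restriction table, the pos values "noun" and "verb", the wrapping <words> element and the sample words are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed restriction table: attributes that must not occur with a pos value.
NOT_APPLICABLE = {
    "noun": {"tense"},
    "verb": {"case"},
}

def find_violations(root):
    """Return (word id, attribute) pairs that break the assumed restrictions."""
    violations = []
    for word in root.iter("word"):
        blocked = NOT_APPLICABLE.get(word.get("pos"), set())
        for name in sorted(blocked & set(word.attrib)):
            violations.append((word.get("id"), name))
    return violations

# Hypothetical sample document with two deliberate violations.
doc = ET.fromstring(
    '<words>'
    '<word id="w_01" pos="noun" case="nom">tree</word>'
    '<word id="w_02" pos="verb" tense="past" case="nom">grew</word>'
    '<word id="w_03" pos="noun" tense="past">house</word>'
    '</words>'
)
violations = find_violations(doc)
print(violations)
```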
3.1 Minimal redundancy
The markup produced should be minimally redundant. That means that any
information applying to several elements that conceptually depend on each
other should only be represented once in the document if it is possible
to find general means to infer the information marked at one element when
accessing the other. On the level of defining elements, attributes, and
values for concepts to be annotated, this principle has the effect that
only those attributes and values that cannot be inferred from the element
name or the values of other attributes of the element, have to be specified.
To give an example: In the case of segmenting speech into phones, one would
not have to specify the voicing of sounds as extra attributes once the
identity of a sound is determined, thus
| redund.xml |
| <phone type="a:" voiced="true"/> |
is only sensible if the theory or the application assumes that there
are vowels which can be voiceless (e.g. in the case of whispering).
In general, elements used for tagging should not carry the theory itself
but only that part of the information that cannot be predicted.
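The principle can be sketched in Python as follows: voicing is looked up from the identity of the sound, and an explicit voiced attribute is only consulted where the markup deviates from the default. The lookup table and the function name is_voiced are assumptions for this example and would in practice come from the theory applied.

```python
import xml.etree.ElementTree as ET

# Assumed default voicing per sound symbol (illustrative subset only).
VOICED_BY_TYPE = {"a:": True, "b": True, "p": False, "s": False}

def is_voiced(phone):
    """Use an explicit voiced attribute if present, else infer it from type."""
    explicit = phone.get("voiced")
    if explicit is not None:
        return explicit == "true"
    return VOICED_BY_TYPE[phone.get("type")]

plain = ET.fromstring('<phone type="a:"/>')
whispered = ET.fromstring('<phone type="a:" voiced="false"/>')

print(is_voiced(plain), is_voiced(whispered))
```

With this arrangement the voiced attribute appears in the markup only for the unpredictable cases, such as a whispered vowel.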
3.2 Maximal consistency
If rules for the mutual dependency of information represented at different
places of a document can be stated, this type of consistency should be
enhanced. One means of enhancement is the minimal redundancy principle:
information that is stored in only one place and inferred from
there elsewhere in the document only has to be updated once. Furthermore,
maximal consistency covers the area of reference for the storage of annotation:
if the structure of the tags of the items to be linked varies in an unpredictable
fashion across corpora or parts of corpora, no reliable (automatic) tagging
or retrieval can be guaranteed. A further prerequisite of level annotation
is the existence of one general model that can be applied to any kind of
tagging, i.e. there is a need for a minimal standard of tagging on all
levels to be labelled.
3.3 Universal parsability
The markup used should be universally parsable. This has three levels of
consequence.
-
First, a general grammar should be used, supporting a uniform representation
of different entity types. One example of this is XML which guarantees
that any linguistic information can be parsed by one type of parser once
it is encoded in XML.
-
The second level of parsability refers to the actions or the meaning
of the markup. This is the behaviour of a piece of software after having
parsed the file. Examples of this behaviour may be the representation,
display, and reactions to user input. This second level of parsability
cannot be encoded by XML; it has to be defined elsewhere.
-
A third level of parsability is the processing of information (stored
in XML, processed and displayed by the computer) by human annotators: The
information displayed must be accessible for them, too. Thus, additional
information is needed to represent the actual meaning and theory behind
the XML annotation applied to linguistic data.
3.4 Optimal maintainability
Typical applications of XML are hierarchies of different elements which
are nested. It is not possible, however, to design one general hierarchical
model in which all linguistic information to be described in speech can
be represented. This is easy for elements like sounds, words, phrases and
sentences, but not for sounds, pauses, background noise,
head turns, and co-reference. Entities of these different categories have
no conceptual dependencies which could be represented in hierarchical structures.
This is the reason why the tagging of conceptually or theoretically exclusive
levels must be put into different XML files. Yet the encoding of mutually
dependent information should also reside in separate files,
as it will ideally be produced on different occasions and one element type
at a time. If one had to add a higher level of annotation to an existing lower
level tagging file, the file would have to be altered, requiring
complex manipulations. Thus, each conceptually different level
of description should be placed in a file of its own.
3.5 Naming conventions
In general, all names of elements, attributes, and values should be in lower
case only. This may look like a matter of layout fashion, but it makes documents
easier to read and more consistent in style. Also, where possible, names used for elements,
attributes and values should consist of more than two characters.
Where names for elements proposed by the TEI guidelines are
used, the TEI names should be employed, although <u> and
<w>
are not ideal, as they are not very intuitive for people who do not know
the TEI guidelines.
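A convention of this kind can be checked mechanically. The following Python sketch lists element and attribute names that are not in lower case or that are shorter than three characters. It is only an illustration: the function name naming_problems, the exemption set (the TEI-style names and the id attribute used throughout this chapter) and the sample document are assumptions, and attribute values are deliberately not checked.

```python
import xml.etree.ElementTree as ET

# Assumed exemptions: short names that this chapter itself uses.
ALLOWED_SHORT = {"id", "s", "u", "w"}

def naming_problems(root, min_length=3):
    """List (name, problem) pairs for element and attribute names."""
    problems = []
    for element in root.iter():
        for name in [element.tag, *element.attrib]:
            if name != name.lower():
                problems.append((name, "not lower case"))
            if len(name) < min_length and name not in ALLOWED_SHORT:
                problems.append((name, "too short"))
    return problems

# Hypothetical sample with one upper-case attribute and one short element name.
sample = ET.fromstring(
    '<s id="s_001"><w id="w_001" POS="NN">These</w><xy id="xy_001"/></s>'
)
problems = naming_problems(sample)
print(problems)
```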
3.6 Linking information
In general, there should be the possibility to link and to align various
levels of description. For the next example, assume a sentence level and a
word level of markup. Suppose there is a word tagger that provides
the user with basic tagging of words, which results in a document like:
| word.xml |
<w id="w_01">take</w>
<w id="w_02">this</w>
<w id="w_03">example</w> |
Adding annotation of the sentence level would either
-
a) produce a (new) document which is a copy of the first one (or a new
version of the first document) plus sentence tags (<s>) added
| word2.xml |
<s>
<w id="w_01">take</w>
<w id="w_02">this</w>
<w id="w_03">example</w>
</s> |
OR
-
b) be a second file with <s> elements that hyperlink to the
first one
| sent.xml |
| <s href="word.xml#id(w_01)..id(w_03)"/> |
As stated above, the second approach is recommended. Depending
on the DTD, it treats the <w> elements as children of <s>
elements, as in variant a), but in a non-invasive fashion (see [1]).
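How a processor might follow such a link can be sketched in Python. The sketch assumes href values of the form file#id(first)..id(last), and it assumes that the word file has a single wrapping root element (here called <words>, a name invented for the example), since an XML document needs exactly one root. The functions parse_href and resolve_range are hypothetical helpers, not part of any MATE tool.

```python
import re
import xml.etree.ElementTree as ET

# Assumed link syntax: file#id(first) or file#id(first)..id(last)
HREF_PATTERN = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<first>[^)]+)\)(?:\.\.id\((?P<last>[^)]+)\))?$"
)

def parse_href(href):
    """Split an href into file name, first id and last id."""
    match = HREF_PATTERN.match(href)
    if match is None:
        raise ValueError(f"unsupported href: {href!r}")
    first = match.group("first")
    return match.group("file"), first, match.group("last") or first

def resolve_range(target_root, first_id, last_id):
    """Return the id-bearing elements from first_id to last_id in document order."""
    elements = [e for e in target_root.iter() if e.get("id") is not None]
    ids = [e.get("id") for e in elements]
    return elements[ids.index(first_id):ids.index(last_id) + 1]

# Stand-in for the content of word.xml, wrapped in an assumed root element.
words = ET.fromstring(
    '<words><w id="w_01">take</w><w id="w_02">this</w><w id="w_03">example</w></words>'
)
file_name, first, last = parse_href("word.xml#id(w_01)..id(w_03)")
texts = [w.text for w in resolve_range(words, first, last)]
print(file_name, texts)
```

The word file is never modified; the sentence level only points into it.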
3.7 PCDATA
PCDATA are all textual entities which are not inside any tag, i.e.
outside angle brackets (<>). In the case of orthographic
text that is to be marked up and integrated into a corpus of dialogue annotation,
words will be the basic objects of markup. Around each word - if marking
up words is the application - there would be an element start tag <w>
and an element end tag
</w>:
| word.xml |
<w id="w_001">These</w>
<w id="w_002">are</w>
<w id="w_003">the</w>
<w id="w_004">words</w> |
In this case, the elements are filled by PCDATA which we perceive as
orthographical words. Sentence annotation building upon this would add
<s>
and </s> around the text before:
| sentence.xml |
<s id="s_001">
<w id="w_001">These</w>
<w id="w_002">are</w>
<w id="w_003">the</w>
<w id="w_004">words</w>
</s> |
In this case, the <s> element is filled, too.
Empty elements are those which include neither other elements
nor PCDATA, e.g. in the case of sentence annotation that refers to other
elements by an href attribute:
| sentence2.xml |
| <s id="s_001" href="word.xml#id(w_001)..id(w_004)"/> |
It is recommended to use PCDATA only if these PCDATA are textual information
(marked up text from source data). The use of PCDATA inside an element
for specification of values is only preferable if these are very long and
explicit.
The following two examples both provide location information
about a situation. The second example is a case for choosing PCDATA.
| situation1.xml |
| <situation id="sit_0223" place="home"/> |
| situation2.xml |
<situation id="sit_0223">
<place id="loc_001">
The participants are sitting in
the living room
of the apartment of the speaker
named Martha.
</place>
</situation> |
All other information should be coded by attribute values, links and
embedded elements.
3.8 Elements vs. Attributes
In general when annotating speech data, elements are often entities that
have an extension in time. For many categories of speech phenomena, there
are not only labels but also notation systems, i.e. sets of symbols that
denote the item as such and its category (cf. ToBI, POS). When describing
linguistic levels, one has to decide if the standard labels will be used
as attribute values or as elements in the markup [2].
| book.xml |
| <book title="The Call of the Wild" author="London, Jack"/>
or
<book author="London, Jack">The Call of the Wild</book>
or
<book>
<title>The Call of the Wild</title>
<author>London, Jack</author>
</book> |
Or, see the following options for representing prosody and phrase types:
Phrases:
| phrase.xml |
<phrase type="NP"/>
<phrase type="VP"/>
<phrase type="NP"/>
or
<NP/>
<VP/>
<NP/> |
ToBI labels:
| notobi.xml |
<pros type="L*"/>
<pros type="H*"/>
<pros type="L*H"/>
or
<L*/>
<H*/>
<L*H/> |
As a matter of convention, one should choose the level of abstraction
that allows one to segment a series of entities, name each of these entities
by the one term which can be applied to all of them, and encode their differences
by attributes and values. Note that there is an interrelation
between mutually exclusive attribute values and the choice of level of abstraction:
if the description level and element type chosen is <w>, a part-of-speech
value of "noun" theoretically blocks the application of tense.
If the element types chosen were <noun>, <verb>,
<adj>,
etc., this would certainly not happen, but the very information that all
of these elements belong to one group of phenomena would be lost and would
not be exploitable for query access to the data. One further solution is the
additional encoding of abstraction level information.
| tobi.xml |
<s>
<w>
<det num="sg" case="nom">The</det>
</w>
<w>
<noun num="sg" case="nom">tree</noun>
</w>
<w>
<verb num="sg" tense="past">grew</verb>
</w>
</s> |
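The consequence for query access can be illustrated with a short Python sketch based on the phrase example above. With one generic element name, a single path expression finds all phrases or all phrases of one type; with specific element names, the set of names has to be known to the query in advance. The wrapping <phrases> element and the set PHRASE_NAMES are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Variant 1: one generic element name, differences encoded by an attribute.
generic = ET.fromstring(
    '<phrases><phrase type="NP"/><phrase type="VP"/><phrase type="NP"/></phrases>'
)
# Variant 2: the category itself is the element name.
specific = ET.fromstring('<phrases><NP/><VP/><NP/></phrases>')

all_phrases_generic = generic.findall("phrase")
noun_phrases_generic = generic.findall("phrase[@type='NP']")

# For variant 2 the query needs an externally maintained list of names.
PHRASE_NAMES = {"NP", "VP"}
all_phrases_specific = [e for e in specific if e.tag in PHRASE_NAMES]

print(len(all_phrases_generic), len(noun_phrases_generic), len(all_phrases_specific))
```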
Attribute values should be chosen in a way that allows as much conceptualization
as possible. Consider phonetic sounds: in the case where speech has to
be segmented into sounds, one could think of a set of attributes specifying
articulatory or auditory properties of these sounds, the
values of which are set to plus (+) or minus (-). The alternative would
be to use the conventional IPA or SAM-PA symbols as values of
a single attribute, e.g. "type". The first alternative has two
disadvantages. First, if one assumes that the set of possible combinations of
articulatory or auditory properties is known and the sounds to be segmented
are limited to a small subset of all possible combinations, the effort
is out of proportion. Second, there might
be mutual dependencies between the values, e.g. no plosive can be rounded. These
dependencies cannot be constrained automatically by the grammar of
a DTD. Thus it is highly recommended to choose the set of attributes
that guarantees maximal mutual independence of the attributes, i.e. to find
entity descriptions which encode typical attribute-value constellations,
in this case sound symbols. It might not always be possible to find a set
of attributes that are not mutually dependent, cf. the word attribute CASE
which is not applicable to verbs, as discussed above.
3.10 Time information
Since speech is a behaviour, behaviour is action, and action involves
the concept of time, time is an obviously important property of speech
units. In order to assess aspects of speech like synchronicity, the sequence
and the duration of speech events have to be described.
And as time information is the minimal chain of common reference across
levels, time description conventions should be standardized as far as possible.
There are various options for the encoding of this information:
-
First of all, one has to decide between specifying time in samples (signal
measurement points) or seconds/milliseconds. The advantage of using samples
is that samples are the finest grain available relative to the sample
frequency used for a given segment of recorded speech. The disadvantage
of using samples is that it is more difficult to compare time relations
across documents: in order to calculate time, the sample frequency would have
to be available.
-
Second, the properties - and attributes respectively - to use have to be
chosen. In principle one could either employ start and end
or duration. As duration can be calculated from the other
two properties, but not the other way around, start and end
seem more appropriate and are recommended as standard attributes. Yet
for some elements this concept seems difficult to apply, e.g. in the case
of f0 values, door slams or other events which conceptually do not seem
to have an extension in time. However, to be consistent, these elements
should have the same attributes, with the special property that the values of the
start
and the end attribute are equal.
-
Third, one could argue for leaving one of the attributes out because
the start time of an element often equals the end time of the element
before. Unfortunately, this is not always the case, e.g. if there are
pauses or utterances of other speakers, or simply because the element is
located at the beginning or the end of the event chain. Some of these problems
could only be solved if one had all physical events listed and annotated
in one document, which would require very complex DTDs and massive efforts
in the handling of information. As it seems easier to keep elements of
different conceptual levels in separate files, and it would be more complicated
for the user or the software to decide when to include this information
and when not, the start and
end attributes should always appear
together in an element tag.
-
The fourth option concerns whether time information has to be complete for every
element on every level or whether it can be inherited.
Ideally, one would want to provide start and end
on only one layer and make elements from higher levels of description (e.g.,
words) that are conceptually related to these units (e.g., phones) point
to that information or inherit it. Basically two options can be considered:
-
There is a special kind of attribute that allows values of other
attributes to be inherited, such that one can say that the startinh
attribute of YZ elements inherits its value from the start
attributes of XY elements:
| xy.xml |
<xy id="xy001" start="000" end="002"/>
<xy id="xy002" start="002" end="005"/>
<xy id="xy003" start="005" end="009"/> |
| yz.xml |
| <yz startinh="xy001.start"/> |
-
There is no explicit start or end attribute in higher
level elements at all; to get this information, a processor
has to go down the element hierarchy and check all first children (of
first children)* until this information is found specified somewhere. Exactly
the same procedure is used for the end value, with the only difference
that one uses the last children (of last children)* in that case.
| xy.xml |
<xy id="xy001" start="000" end="002"/>
<xy id="xy002" start="002" end="005"/>
<xy id="xy003" start="005" end="009"/> |
| yz.xml |
| <yz href="xy.xml#id(xy001)..id(xy003)"/> |

The second variant is recommended: if two levels of description are
situated in the same hierarchy and the elements of one of them are parent
elements of the other, then it is proposed to note start and end
information for every element of the lower or lowest level in that conceptual
hierarchy. The attributes start and end of the higher
level elements can be calculated by evaluating the start value
of the first embedded element and the end value of the last embedded
element [2]. This is in accordance with the two principles
stated above, namely minimal redundancy and maximal consistency. If
time information were kept separately,
i.e. put on every level individually, then a change of time information
would have to be made for each element on each level. If there is only one basic representation
that all other tags refer to, and the time information of these tags is inferred
from the basic tag, then the time information only has to be changed once,
and changes applied to the time values on the lowest level will automatically
affect the time specification of the associated higher levels.
In the following example, the start value of the sentence (<s>)
would be 0.01 and the end value would be 0.62.
| time.xml |
<s id="s_001">
<w id="w_001" start="0.01" end="0.20">It</w>
<w id="w_002" start="0.20" end="0.37">was</w>
<w id="w_003" start="0.37" end="0.42">time</w>
<w id="w_004" start="0.42" end="0.62">again</w>
</s> |
To exploit this principle most effectively, time information and
thus the initial transcription of spoken material should be applied to
the level with the smallest units under investigation (cf. the section
on cross-level).
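The calculation described for the recommended variant can be sketched in Python. The functions start_of and end_of are hypothetical names; they descend through first (or last) children until an explicit attribute is found, as outlined above. The sample document repeats the sentence example, with each attribute value properly quoted.

```python
import xml.etree.ElementTree as ET

def start_of(element):
    """Start time: own attribute, else that of the first child, recursively."""
    value = element.get("start")
    if value is not None:
        return float(value)
    children = list(element)
    if not children:
        raise ValueError("no start information found")
    return start_of(children[0])

def end_of(element):
    """End time: own attribute, else that of the last child, recursively."""
    value = element.get("end")
    if value is not None:
        return float(value)
    children = list(element)
    if not children:
        raise ValueError("no end information found")
    return end_of(children[-1])

sentence = ET.fromstring(
    '<s id="s_001">'
    '<w id="w_001" start="0.01" end="0.20">It</w>'
    '<w id="w_002" start="0.20" end="0.37">was</w>'
    '<w id="w_003" start="0.37" end="0.42">time</w>'
    '<w id="w_004" start="0.42" end="0.62">again</w>'
    '</s>'
)
print(start_of(sentence), end_of(sentence))
```

Because the sentence element stores no time values of its own, a correction of a word boundary is picked up automatically the next time the sentence is queried.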
Time is relative: if an element has the start value
4.34
and this is to be compared to the start value of elements in
another hierarchy, problems may arise, because in both cases the time information
refers only to the time elapsed since the beginning of the
document the element resides in, and the two values may therefore be incompatible. It
is recommended that time is either specified relative to midnight, January
1, 1970 UTC (Coordinated Universal Time) or relative to the beginning of the recording.
For the latter case it is useful to apply the special attribute rectime
that states the time distance of the beginning of the recording relative
to midnight, January 1, 1970 UTC. Even if a piece of software is not able to calculate
and compare the time information of elements of different documents, it
will be easy to check whether the documents are compatible, and thus whether a query
that compares
start and end information of these documents
is sensible.
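A comparison across documents could then proceed as in the following Python sketch. It assumes that rectime is given in seconds and is attached to the root element of each document; the text above does not fix either point. The element names, the rectime values and the helper functions are likewise invented for the example.

```python
import xml.etree.ElementTree as ET

def absolute_interval(element, rectime):
    """Shift a recording-relative start/end pair onto the common time line."""
    return (rectime + float(element.get("start")),
            rectime + float(element.get("end")))

def overlaps(first, second):
    """True if two (start, end) intervals share some stretch of time."""
    return first[0] < second[1] and second[0] < first[1]

# Two hypothetical documents whose recordings started two seconds apart.
doc_a = ET.fromstring(
    '<words rectime="915148800.0">'
    '<w id="w_001" start="4.34" end="4.80">yes</w></words>'
)
doc_b = ET.fromstring(
    '<gestures rectime="915148802.0">'
    '<gesture id="g_001" start="2.50" end="3.00"/></gestures>'
)

word = absolute_interval(doc_a[0], float(doc_a.get("rectime")))
gesture = absolute_interval(doc_b[0], float(doc_b.get("rectime")))

# The raw values (4.34-4.80 and 2.50-3.00) do not overlap,
# but on the common time line the two events do.
print(overlaps(word, gesture))
```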
References
[1] Isard, A., McKelvie, D. and Thompson, H.S.: Towards
a Minimal Standard for Dialogue Transcripts: A New SGML Architecture for
the HCRC Map Task Corpus. Proceedings of the 5th International Conference
on Spoken Language Processing, ICSLP98, Sydney. http://www.cogsci.ed.ac.uk/~dmck/Papers/icslp98.ps
[2] The SGML/XML Web Page: SGML/XML Elements versus
Attributes. http://www.oasis-open.org/cover/elementsAndAttrs.html