General markup guidelines


Andreas Mengel, IMS

This chapter deals with general considerations useful for the formal encoding of linguistic annotation data.

1. Markup

As the term markup is often used in a very broad sense, three sub-concepts should be distinguished: the phenomena to be described, the theory adopted to describe them, and the markup, i.e. the formal encoding itself. These three concepts (phenomena, theory, markup) are orthogonal: one can have XML, SGML, or xwaves/xlabel as the markup; a Chomskyan or Tesnièrian approach to syntax as the theory; and intonation, syntax, or semantics as the phenomena described. Any combination of the three is possible.
 
 

2. Data files

There are various types of files specifying different kinds of information to be encoded during the annotation of corpora entities:
 
 
non-XML data
  • source data: these are data that have been produced to record linguistic data and prior to any annotation process
    • text: text data are transcripts of speech or other written documents
    • non-text: non-text data are binary data that represent recordings of speech behaviour
      • audio: audio data may be speech files
      • graphical: graphical representations of speech behaviour or the communication situation
        • pictures: pictures taken during a communication situation
        • video: visual recordings of the communication situation
(possibly) XML data
  • analysis data: mathematical analyses performed on any of the data
    • signal measurement data: evaluations of the speech signal, e.g. f0 and other spectral analysis
    • statistical calculations: descriptive statistical analysis and statistical tests on any of the data
  • annotation data: data that have been produced to categorize linguistic phenomena
    • level annotation: annotation of individual linguistic levels of description
  • knowledge resources: data that describe phenomena in a generic way and relate to phenomena the properties of which are considered stable during the linguistic situation annotated
    • lexical data: type information provided for individual words
    • universe data: information on the situation and objects in a given communication situation
  • annotation process data: data that relate to the process of annotation
    • prescriptive information: these are data that are a reference for the coder when annotating
      • unit definitions: definitions of individual units to segment and categorize
      • coding guidelines: descriptions of the annotation procedure, dependent on the linguistic level
    • descriptive information: this is information related to the actual annotation
      • source data list: a list of data that have been used as source data
      • log files: information on actual annotation processes such as the creation, modification or deletion of a tag
      • settings: these are customizations of the software environment
        • user privileges: settings that describe permissions of the user for access and modification of data
        • graphical user interface data: general settings of the software environment
        • display style sheets: specific display and access descriptions for given annotation tasks
        • user preference data: user controlled software environment settings

The following figure represents these types of information and the relations among them. The ellipses in the middle symbolize annotation levels. Square boxes are non-level information. Arrows indicate referring relations: arrows between levels denote the reference between levels, e.g. reference from the word level to the phone level; arrows from the level files to square boxes symbolize reference to rules that have been used during the annotation; arrows pointing from square boxes to ellipses indicate (meta-)annotation of the level label files. The shading of boxes or ellipses denotes that their content is annotation of linguistic objects. Note that the direction of the arrows also reflects the process of annotation, i.e. in most cases the objects pointed to are produced earlier.
 
 





The data produced during the annotation of a dialogue will be kept in separate files, but it should also be ensured that their mutual relationship and the fact that they belong to the same annotation project is transparent. There are two strategies to specify this membership information:

In general, links between any two types of documents can be established by the means described in the table below.
 
from                 to                             by means of
list of resources    files used during annotation   file name attribute
XML document         non-XML document               file name attribute
non-XML document     any other document             not possible
XML document         XML document                   href attributes between elements

The following figure represents the different types of files and explicit references. It is an adapted version of a figure in D1.2 which has been modified in order to also reflect the direction of reference. Bold lines represent reference by file name, dotted lines represent reference (href-attributes) between annotation elements of different level annotation files.

3. Markup conventions

This section covers those aspects of markup that can be described on a general level and applied to the level-specific markup descriptions. The following is a general guideline for the mapping of XML syntax onto linguistic concepts. In order to make this mapping as uniform, efficient, and consistent as possible, these guidelines discuss some of the problems encountered and offer proposals for solutions.

XML has been chosen for the representation of annotation data in the MATE project. Besides the good software support currently available for this standard, there are the following reasons for choosing XML as the representation model:

XML uses

This is the reason why XML can be assumed to be able to encode markup irrespective of the phenomena described or theoretical approach taken. The underlying (object oriented like) model of description XML is based on can be assumed by almost all theories: There are phenomena (elements) some of which can be split into sub-phenomena (embedded elements), these phenomena have relations and properties (attributes).

On the formal side, this model is reflected by the uniform syntax described above: the tags of XML elements are enclosed in angle brackets

<xyz att="val">

inside which the values (val) of attributes (att) are specified. In XML, elements are those entities that conceptually group together descriptions of linguistic items. The element name is put directly after the <, as in <word>. Attributes allow further description of elements by specifying their properties. The attribute name denotes the property dimension; the actual status of that property for a given element is described by the attribute value. In <word pos="NN">, pos is the attribute and NN is its value. PCDATA are any characters that are not enclosed in a pair of one opening (<) and one closing (>) angle bracket (see below).
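The terminology above can be illustrated with a short Python sketch based on the standard library module xml.etree.ElementTree. Only the element name word and the attribute pos="NN" come from the text; the character data "tree" and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element "word" and attribute pos="NN" follow the
# text above; the PCDATA "tree" is invented for illustration.
fragment = '<word pos="NN">tree</word>'

element = ET.fromstring(fragment)

element_name = element.tag          # the element name, here "word"
attributes = dict(element.attrib)   # attribute names mapped to their values
pcdata = element.text               # character data between start and end tag

print(element_name, attributes, pcdata)
```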

Despite this general approach, XML does not provide the following:

  • typed or grammar-based specification of attribute values, e.g. for the distinction of floats and characters, or for the definition of attribute values by regular expressions or BNF grammars,
  • inference models for element values that allow for centralized specification of properties that are shared by more than one element,
  • applicability restrictions for attributes that are mutually exclusive, e.g. words that are nouns cannot have tense information, and case information is not applicable to verbs.

The XML community is aware of these problems and proposals are under way to improve and extend the XML standard. The inference problem is discussed in more detail in the chapter on cross-level annotation.
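Since a DTD cannot express applicability restrictions of this kind, one option is to check them outside the XML machinery. The following Python sketch shows one possible way of doing so. The attribute names pos, tense and case, the part-of-speech values, and the rule table BLOCKED are assumptions chosen for illustration and are not part of any MATE specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: for each part-of-speech value, the attributes
# that must not occur on the same element.
BLOCKED = {
    "noun": {"tense"},
    "verb": {"case"},
}

def inapplicable_attributes(element):
    """Return the sorted attributes of `element` that its pos value blocks."""
    blocked = BLOCKED.get(element.get("pos"), set())
    return sorted(blocked & set(element.attrib))

# Invented sample data: the second word wrongly carries tense information.
doc = ET.fromstring(
    '<words>'
    '<w pos="verb" tense="past">grew</w>'
    '<w pos="noun" tense="past">tree</w>'
    '</words>'
)

violations = {w.text: inapplicable_attributes(w) for w in doc}
print(violations)
```

A check of this kind can be run after each annotation step, so that the markup itself stays free of the restriction logic.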
     

    3.1 Minimal redundancy

    The markup produced should be minimally redundant. That means that any information applying to several elements that conceptually depend on each other should be represented only once in the document, provided there are general means to infer the information marked at one element when accessing the other. On the level of defining elements, attributes, and values for the concepts to be annotated, this principle has the effect that only those attributes and values have to be specified that cannot be inferred from the element name or from the values of other attributes of the element. To give an example: in the case of segmenting speech into phones, one would not have to specify the voicing of sounds as an extra attribute once the identity of a sound is determined, thus
     
    redund.xml
    <phone type="a:" voiced="true"/>

    is only sensible if the theory or the application assumes that there are vowels which can be voiceless (e.g. in the case of whispering).

    In general, elements used for tagging should not carry the theory itself but only that part of the information that cannot be predicted.
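The phone example can be sketched in Python as follows. The lookup table VOICED_BY_TYPE stands in for a knowledge resource and is a tiny invented sample; the function name is_voiced is likewise hypothetical. The explicit voiced attribute is only consulted where it is present, e.g. for whispered vowels.

```python
import xml.etree.ElementTree as ET

# Hypothetical knowledge resource: voicing per sound symbol.
# The table is a small invented sample, not a complete inventory.
VOICED_BY_TYPE = {"a:": True, "b": True, "p": False, "s": False}

def is_voiced(phone):
    """Use an explicit voiced attribute if present, else infer it from type."""
    explicit = phone.get("voiced")
    if explicit is not None:
        return explicit == "true"
    return VOICED_BY_TYPE[phone.get("type")]

# The minimally redundant form, and a whispered vowel marked explicitly.
plain = ET.fromstring('<phone type="a:"/>')
whispered = ET.fromstring('<phone type="a:" voiced="false"/>')

print(is_voiced(plain), is_voiced(whispered))
```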
     

    3.2 Maximal consistency

    If rules for the mutual dependency of information represented at different places of a document can be stated, this type of consistency should be promoted. One means of doing so is the minimal redundancy principle, as information that is stored in only one place and can be inferred from there elsewhere in the document will only have to be updated once. Furthermore, maximal consistency covers the area of reference for the storage of annotation: if the structure of the tags of the items to be linked varies in an unpredictable fashion across corpora or parts of corpora, no reliable (automatic) tagging or retrieval can be guaranteed. A further prerequisite of level annotation is the existence of one general model that can be applied to any kind of tagging, i.e. there is a need for a minimal standard of tagging on all levels to be labelled.
     

    3.3 Universal parsability

    The markup used should be universally parsable. This has three levels of consequence.

    3.4 Optimal maintainability

    Typical applications of XML are hierarchies of nested elements of different types. It is not possible, however, to design one general hierarchical model in which all linguistic information to be described in speech can be represented. This is easy for elements like sounds, words, phrases and sentences, but not for sounds, pauses, background noise, head turns, and co-reference. Entities of these different categories have no conceptual dependencies that could be represented in hierarchical structures. This is the reason why the tagging of conceptually or theoretically exclusive levels must be put into different XML files. Yet the encoding of mutually dependent information should also reside in separate files, as it will ideally be produced on different occasions and one element type at a time. If a higher level of annotation had to be added to an existing lower-level tagging file, the file would have to be altered, which requires complex manipulations. Thus, each conceptually different level of description should be placed in a file of its own.
     

    3.5 Naming conventions

    In general, all names of elements, attributes, and values should be in lower case only. This may look like a matter of layout only, but it makes documents more consistent in style and easier to read. Also, where possible, names used for elements, attributes and values should consist of more than one or two characters. Where elements proposed by the TEI guidelines are used, the TEI names should be employed, although <u> and <w> are not ideal as they are not very intuitive for people who do not know the TEI guidelines.
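A convention like this can be checked mechanically. The Python sketch below is one possible check for element and attribute names; the function name naming_issues and the sample document are assumptions for illustration, and attribute values are deliberately not checked.

```python
import xml.etree.ElementTree as ET

def naming_issues(root):
    """List element and attribute names that are not lower case or very short."""
    issues = []
    for element in root.iter():
        names = [element.tag] + list(element.attrib)
        for name in names:
            if name != name.lower():
                issues.append((name, "not lower case"))
            if len(name) <= 2:
                issues.append((name, "two characters or fewer"))
    return issues

# Invented sample: a short element name and an upper-case attribute name.
sample = ET.fromstring('<sentence><w POS="NN">tree</w></sentence>')
issues = naming_issues(sample)
print(issues)
```

Short names taken over from the TEI guidelines, such as w, would be reported by this check and have to be accepted by hand.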
     

    3.6 Linking information

    In general, it should be possible to link and align various levels of description. As an example, consider sentence markup and word markup. Suppose there is a word tagger that provides the user with basic tagging of words, which results in a document like:
     
    word.xml
    <w id="w_01">take</w>
    <w id="w_02">this</w>
    <w id="w_03">example</w>

    Adding annotation of the sentence level would result in either

    word2.xml
    <s>
      <w id="w_01">take</w>
      <w id="w_02">this</w>
      <w id="w_03">example</w>
    </s>

    OR

    sent.xml
    <s href="word.xml#id(w_01)..id(w_03)"/>

    As stated above, the second approach is recommended. Depending on the DTD, it treats the <w> elements as children of <s> elements, as in the first variant, but in a non-invasive fashion (see [1]).
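How a tool might follow such a reference can be sketched in Python. This is a minimal illustration under stated assumptions, not the MATE implementation: the function resolve_range, the pattern HREF, and the root element words wrapped around the word elements are hypothetical, and the pattern only covers range references of the form used in this chapter.

```python
import re
import xml.etree.ElementTree as ET

# The word elements from word.xml, wrapped in a root element (an assumption,
# so that the fragment parses as a single document).
WORD_XML = """<words>
  <w id="w_01">take</w>
  <w id="w_02">this</w>
  <w id="w_03">example</w>
</words>"""

# Range references of the form file#id(first)..id(last); the second "id"
# is treated as optional.
HREF = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<first>[^)]+)\)\.\.(?:id)?\((?P<last>[^)]+)\)$"
)

def resolve_range(href, documents):
    """Return the elements from first to last id (inclusive), in document order."""
    match = HREF.match(href)
    if match is None:
        raise ValueError(f"unsupported href: {href!r}")
    elements = list(documents[match["file"]].iter())
    ids = [element.get("id") for element in elements]
    start = ids.index(match["first"])
    end = ids.index(match["last"])
    return elements[start:end + 1]

documents = {"word.xml": ET.fromstring(WORD_XML)}
sentence = ET.fromstring('<s href="word.xml#id(w_01)..id(w_03)"/>')
words = [w.text for w in resolve_range(sentence.get("href"), documents)]
print(words)
```

The word file is only read, never modified, which is what makes the second approach non-invasive.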
     
     

    3.7 PCDATA

    PCDATA are all textual entities which are not inside any tag, i.e. outside angle brackets (<>). In the case of orthographic text that is to be marked up and integrated into a corpus of dialogue annotation, words will be the basic objects of markup. Around each word, if marking up words is the application, there would be an element start tag <w> and an element end tag </w>:
     
    word.xml
    <w id="w_001">These</w>
    <w id="w_002">are</w>
    <w id="w_003">the</w>
    <w id="w_004">words</w>

    In this case, the elements are filled by PCDATA which we perceive as orthographical words. Sentence annotation building upon this would add <s> and </s> around the preceding text:
     

    sentence.xml
    <s id="s_001">
      <w id="w_001">These</w>
      <w id="w_002">are</w>
      <w id="w_003">the</w>
      <w id="w_004">words</w>
    </s>

    In this case, the <s> element is filled, too.

    Empty elements are those which include neither other elements nor PCDATA, e.g. in the case of sentence annotation that refers to other elements by an href attribute:
     

    sentence2.xml
    <s id="s_001" href="word.xml#id(w_001)..id(w_004)"/>

    It is recommended to use PCDATA only if these PCDATA are textual information (marked up text from source data). The use of PCDATA inside an element for specification of values is only preferable if these are very long and explicit.

    The following two examples provide location information for a situation. The second example is a case for choosing PCDATA.
     

    situation1.xml
    <situation id="sit_0223" place="home"/>

     
    situation2.xml
    <situation id="sit_0223">
      <place id="loc_001">
        The participants are sitting in the living room
        of the apartment of the speaker named Martha.
      </place>
    </situation>

    All other information should be coded by attribute values, links and embedded elements.
     

    3.8 Elements vs. Attributes

    In general when annotating speech data, elements are often entities that have an extension in time. For many categories of speech phenomena, there are not only labels but also notation systems, i.e. sets of symbols that denote the item as such and its category (cf. ToBI, POS). When describing linguistic levels, one has to decide if the standard labels will be used as attribute values or as elements in the markup [2].
     
     
    book.xml
    <book title="The Call of the Wild" author="London, Jack"/>

    or

    <book author="London, Jack">The Call of the Wild</book>

    or

    <book>
      <title>The Call of the Wild</title>
      <author>London, Jack</author>
    </book>

    Or, see the following options for representing prosody and phrase types:
     

    Phrases:
     

    phrase.xml
    <phrase type="NP"/>
    <phrase type="VP"/>
    <phrase type="NP"/>

    or

    <NP/>
    <VP/>
    <NP/>

    ToBI labels:
     

    notobi.xml
    <pros type="L*"/>
    <pros type="H*"/>
    <pros type="L*H"/>

    or

    <L*/>
    <H*/>
    <L*H/>
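A practical point when choosing between the two ToBI variants: the asterisk is not a permitted character in XML names, so a label such as L* can serve as an attribute value but not as an element name. The Python sketch below checks this with the standard library parser; the helper is_well_formed is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True

as_attribute = is_well_formed('<pros type="L*"/>')   # label as attribute value
as_element = is_well_formed('<L*/>')                 # label as element name
print(as_attribute, as_element)
```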

    As a matter of convention, one should choose the level of abstraction that allows one to segment a series of entities, name each of them by the one term that applies to all of them, and encode their differences by attributes and values. Note that there is an interrelation between mutually exclusive attribute values and the choice of the level of abstraction: if the description level and element type chosen is <w>, a part-of-speech value of "noun" theoretically blocks the application of tense. If the element types chosen were <noun>, <verb>, <adj>, etc., this would not happen, but the information that all of these elements belong to one group of phenomena would be lost and could not be exploited for query access to the data. One further solution is the additional encoding of abstraction level information.
     

    tobi.xml
    <s>
      <w>
        <det num="sg" case="nom">The</det>
      </w>
      <w>
        <noun num="sg" case="nom">tree</noun>
      </w>
      <w>
        <verb num="sg" tense="past">grew</verb>
      </w>
    </s>
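With the abstraction level encoded in this way, both levels remain available for queries. The following Python sketch illustrates this on the sentence above; the variable names are assumptions, and the standard library is used in place of a dedicated query tool.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s>
  <w><det num="sg" case="nom">The</det></w>
  <w><noun num="sg" case="nom">tree</noun></w>
  <w><verb num="sg" tense="past">grew</verb></w>
</s>"""

sentence = ET.fromstring(SENTENCE)

# Query on the abstract level: every word, whatever its part of speech.
all_words = [w[0].text for w in sentence.iter("w")]

# Query on the specific level: only nouns.
nouns = [n.text for n in sentence.iter("noun")]

print(all_words, nouns)
```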

    Attribute values should be chosen in a way that allows as much conceptualization as possible. Consider phonetic sounds: where speech has to be segmented into sounds, one could think of a set of attributes specifying articulatory or auditory properties of these sounds, the values of which are set to plus (+) or minus (-). The alternative would be to use the conventional IPA or SAM-PA symbols as values of a single attribute, e.g. "type". The first alternative has two disadvantages. First, if the set of possible combinations of articulatory or auditory properties is known and the sounds to be segmented are limited to a small subset of all possible combinations, the effort of specifying every property is out of proportion. Second, there might be mutual dependencies between the values, e.g. no plosive can be rounded, and these dependencies cannot be constrained automatically by the grammar of a DTD. Thus it is highly recommended to choose the set of attributes that guarantees maximal mutual independence of the attributes, i.e. to find entity descriptions which encode typical attribute-value constellations, in this case sound symbols. It might not always be possible to find a set of attributes that are not mutually dependent, cf. the word attribute CASE, which is not applicable to verbs as discussed above.
     

    3.10 Time information

    Since speech is a behaviour, behaviour is action, and action involves the concept of time, time is an obviously important property of speech units. In order to assess aspects of speech such as synchronicity, the sequence and the duration of speech events have to be described. And as time information is the minimal common chain of reference across levels, time description conventions should be standardized as far as possible.

    There are various options for the encoding of this information:


    The second variant is recommended: if two levels of description are situated in the same hierarchy and the elements of one of them are parent elements of the other, then it is proposed to note start and end information for every element of the lower or lowest level in that conceptual hierarchy. The attributes start and end of the higher level elements can be calculated by evaluating the start value of the first embedded element and the end value of the last embedded element [2]. This is in accordance with the two principles stated above, namely minimal redundancy and maximal consistency. If the time information is changed and all time information were kept separately, i.e. put on every level individually, then the information of each element on each level would have to be changed. If there is only one basic representation that all other tags refer to, and the time information of these tags is inferred from the basic tag, then the time information has to be changed only once, and changes applied to the time values on the lowest level will automatically affect the time specification of the associated higher levels.

    In the following example, the start value of the sentence (<s>) would be 0.01 and the end value would be 0.62.
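As a sketch of this inference in Python, assume a sentence whose words carry start and end attributes. Only the resulting sentence values 0.01 and 0.62 come from the text; the word boundaries in between, the ids, and the function name time_span are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; only the overall start (0.01) and end (0.62)
# are taken from the text, the word boundaries in between are invented.
SENTENCE = """<s id="s_001">
  <w id="w_001" start="0.01" end="0.20">take</w>
  <w id="w_002" start="0.20" end="0.35">this</w>
  <w id="w_003" start="0.35" end="0.62">example</w>
</s>"""

def time_span(element):
    """Return (start, end); inferred from the children if not given explicitly."""
    if "start" in element.attrib and "end" in element.attrib:
        return float(element.get("start")), float(element.get("end"))
    children = list(element)
    if not children:
        raise ValueError("no time information available for this element")
    # Start of the first embedded element, end of the last embedded element.
    return time_span(children[0])[0], time_span(children[-1])[1]

sentence = ET.fromstring(SENTENCE)
start, end = time_span(sentence)
print(start, end)
```

Because the function recurses, the same inference works for hierarchies with more than two levels, as long as the lowest level carries the time information.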
     


    To exploit this principle most effectively, time information and thus the initial transcription of spoken material should be applied to the level with the smallest units under investigation (cf. the section on cross-level).

    Time is relative: if an element has a start value of 4.34 that is to be compared with the start value of elements in another hierarchy, problems may arise, because in both cases the time information only states the time elapsed since the beginning of the document the element resides in, and the two values may therefore not be comparable. It is recommended that time is specified either relative to midnight, January 1, 1970 UTC (Coordinated Universal Time) or relative to the beginning of the recording. For the latter case it is useful to apply the special attribute rectime, which states the time distance of the beginning of the recording relative to midnight, January 1, 1970 UTC. Even if a piece of software is not able to calculate and compare the time information of elements of different documents, it will be easy to check whether the documents are compatible, and thus whether a query that compares start and end information of these documents is sensible.
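The use of rectime can be sketched in Python as follows. The rectime values and element offsets are invented, and the function absolute_time is a hypothetical helper; the sketch only assumes that rectime is given in seconds since midnight, January 1, 1970 UTC.

```python
from datetime import datetime, timezone

def absolute_time(rectime, offset):
    """Seconds since the epoch for an offset relative to the recording start."""
    return rectime + offset

# Hypothetical recordings: rectime is the start of each recording in seconds
# since midnight, January 1, 1970 UTC; the values are invented.
rectime_a = 900_000_000.0
rectime_b = 900_000_002.5

event_a = absolute_time(rectime_a, 4.34)   # element start 4.34 in document A
event_b = absolute_time(rectime_b, 1.84)   # element start 1.84 in document B

# The two elements start at the same instant, to within a millisecond,
# although their document-relative start values differ.
same_instant = abs(event_a - event_b) < 1e-3
readable = datetime.fromtimestamp(event_a, tz=timezone.utc).isoformat()
print(same_instant, readable)
```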
     
     

    References

    [1] Isard, A., McKelvie, D. and Thompson, H.S.: Towards a Minimal Standard for Dialogue Transcripts: A New SGML Architecture for the HCRC Map Task Corpus. Proceedings of the 5th International Conference on Spoken Language Processing, ICSLP98, Sydney. http://www.cogsci.ed.ac.uk/~dmck/Papers/icslp98.ps

    [2] The SGML/XML Web Page: SGML/XML Elements versus Attributes. http://www.oasis-open.org/cover/elementsAndAttrs.html