2. A walk through the TIGER-XML format

2.1 TIGER-XML corpora

The following subsections give a guided tour of the encoding format and explain its properties.

A TIGER-XML document consists of two parts. The header contains the corpus declaration and some meta information (cf. subsection 2.2). The body comprises the definition of the corpus graphs and their annotation (cf. subsection 2.3). The corpus body can also be divided into so-called subcorpora (cf. subsection 2.4). Corpus query matches can be represented within the TIGER-XML format as well (cf. subsection 2.5).

2.2 Corpus header

The corpus header consists of two parts. General meta information about the corpus is encoded in the first part of the header: name of the corpus, author, date, short description, format and history. The TIGERSearch tool displays this meta information when it lists the available corpora. The ID of the corpus (cf. root element <corpus>) should be unique with regard to all indexed corpora.

<corpus id="TESTCORPUS">

<head>

  <meta>
    <name>Test corpus</name>
    <author>Wolfgang Lezius</author>
    <date>April 2003</date>
    <description>illustrates the TIGER-XML format</description>
    <format>NeGra format, version 3</format>
    <history>first version</history>
  </meta>
  ...
</head>
...
</corpus>
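
As a minimal sketch of how this first part of the header might be read, the following Python snippet uses the standard library module xml.etree.ElementTree. The function name read_meta is hypothetical and not part of any TIGER tool; the sample document repeats the header shown above without the elided parts.

```python
import xml.etree.ElementTree as ET

# Sample header from above, with the elided parts ("...") left out.
CORPUS = """
<corpus id="TESTCORPUS">
  <head>
    <meta>
      <name>Test corpus</name>
      <author>Wolfgang Lezius</author>
      <date>April 2003</date>
      <description>illustrates the TIGER-XML format</description>
      <format>NeGra format, version 3</format>
      <history>first version</history>
    </meta>
  </head>
</corpus>
"""


def read_meta(corpus):
    """Return the corpus ID and a dict of the meta elements (hypothetical helper)."""
    meta = {
        child.tag: (child.text or "").strip()
        for child in corpus.iterfind("head/meta/*")
    }
    return corpus.get("id"), meta


corpus_id, meta = read_meta(ET.fromstring(CORPUS.strip()))
print(corpus_id, meta["name"], meta["author"])
```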

The second part of the corpus header provides information about the features used in the corpus. This feature declaration is obligatory for corpora to be indexed by the TIGERRegistry tool. Feature values and short explanations of the tags may be listed; the TIGERSearch GUI uses this meta information as corpus documentation. If it does not make sense to list all the values of a feature (e.g. for the feature word), the corresponding feature element is left empty.

In the following example, the feature word is declared as a feature of terminal nodes (T) and the feature cat as a feature of nonterminal nodes (NT). If a feature is used in both terminal and nonterminal nodes (e.g. case), its domain is called FREC (cf. description of the query language; section 8, chapter III). Element content of a feature value declaration is interpreted as an explanation of the feature value. Possible edge labels are declared in an <edgelabel> element, labels of secondary edges in a <secedgelabel> element.

<head>
  ...
  <annotation>

    <feature name="word" domain="T"/>

    <feature name="pos" domain="T">
      <value name="ART">determiner</value>
      <value name="ADV">adverb</value>
      <value name="KOKOM">conjunction</value>
      <value name="NN">noun</value>
      <value name="PIAT">indefinite attributive pronoun</value>
      <value name="VVFIN">finite verb</value>
    </feature>

    <feature name="morph" domain="T">
      <value name="Def.Fem.Nom.Sg"/>
      <value name="Fem.Nom.Sg.*"/>
      <value name="Masc.Akk.Pl.*"/>
      <value name="3.Sg.Pres.Ind"/>
      <value name="--">not bound</value>
    </feature>

    <feature name="cat" domain="NT">
      <value name="AP">adjective phrase</value>
      <value name="AVP">adverbial phrase</value>
      <value name="NP">noun phrase</value>
      <value name="S">sentence</value>
    </feature>

    <edgelabel>
      <value name="CC">comparative complement</value>
      <value name="CM">comparative conjunction</value>
      <value name="HD">head</value>
      <value name="MO">modifier</value>
      <value name="NK">noun kernel modifier</value>
      <value name="OA">accusative object</value>
      <value name="SB">subject</value>
    </edgelabel>

  </annotation>

</head>
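
The declaration above can be read into simple lookup tables. The following Python sketch assumes the element and attribute names shown in the example; the function name read_declarations and the shortened sample header are illustrative only and do not reflect how TIGERRegistry processes the declaration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the feature declaration above (illustrative subset).
HEADER = """
<head>
  <annotation>
    <feature name="word" domain="T"/>
    <feature name="pos" domain="T">
      <value name="ART">determiner</value>
      <value name="NN">noun</value>
    </feature>
    <feature name="cat" domain="NT">
      <value name="NP">noun phrase</value>
      <value name="S">sentence</value>
    </feature>
    <edgelabel>
      <value name="HD">head</value>
      <value name="SB">subject</value>
    </edgelabel>
  </annotation>
</head>
"""


def read_declarations(head):
    """Return (features, edge_labels) as dicts (hypothetical helper).

    features maps a feature name to (domain, {value: explanation});
    edge_labels maps an edge label to its explanation.
    """
    features = {}
    for feature in head.iterfind("annotation/feature"):
        values = {v.get("name"): (v.text or "") for v in feature.iterfind("value")}
        features[feature.get("name")] = (feature.get("domain"), values)
    edge_labels = {
        v.get("name"): (v.text or "")
        for v in head.iterfind("annotation/edgelabel/value")
    }
    return features, edge_labels


features, edge_labels = read_declarations(ET.fromstring(HEADER.strip()))
print(features["cat"])
print(edge_labels)
```

A feature without listed values, such as word, yields an empty value table here, which mirrors the empty feature element in the declaration.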

2.3 Corpus body

The supported data model is based on so-called syntax graphs, i.e. directed acyclic graphs with a single root node. Such graphs may contain crossing edges (in the example below, the AP node s5_502 dominates the tokens mehr and als je zuvor, but not the intervening token Teilnehmer), so corpus graphs cannot in general be encoded by nesting XML elements. Instead, all terminal and nonterminal nodes are listed, and edges are explicitly encoded as elements. The following example illustrates the corpus graph encoding.

Figure: Example sentence and its annotation

<body>

<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
</s>

</body>
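
To illustrate how the node lists and the explicit edges fit together, the following Python sketch rebuilds the graph of the example sentence and collects the terminals dominated by a node. The helper names load_graph and terminal_ids are hypothetical. The sketch also assumes that the <t> elements are listed in surface order, as they are in this example.

```python
import xml.etree.ElementTree as ET

# Example sentence s5 from above.
SENTENCE = """
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def load_graph(s):
    """Return (root ID, {terminal ID: word}, {nonterminal ID: [(label, child ID)]})."""
    graph = s.find("graph")
    words = {t.get("id"): t.get("word") for t in graph.iterfind("terminals/t")}
    children = {
        nt.get("id"): [(e.get("label"), e.get("idref")) for e in nt.iterfind("edge")]
        for nt in graph.iterfind("nonterminals/nt")
    }
    return graph.get("root"), words, children


def terminal_ids(node_id, words, children):
    """Collect the IDs of all terminals dominated by node_id, in edge order."""
    if node_id in words:
        return [node_id]
    found = []
    for _label, child in children[node_id]:
        found.extend(terminal_ids(child, words, children))
    return found


root, words, children = load_graph(ET.fromstring(SENTENCE.strip()))
order = list(words)  # assumption: terminals are listed in surface order
yield_ids = sorted(terminal_ids(root, words, children), key=order.index)
sentence = " ".join(words[i] for i in yield_ids)
print(sentence)
print(terminal_ids("s5_502", words, children))
```

The terminals under the AP node s5_502 skip the token s5_5, which shows why the edges are encoded through ID references rather than through element nesting.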

Please note: Feature values, represented as attribute-value pairs, cannot be omitted. If a feature value or edge label does not make sense for a token or inner node (e.g. in the example sentence the feature morph is sometimes unspecified), use a placeholder symbol instead. We recommend the symbol --, which is also used by our import filters. When viewing a matching corpus graph in the TIGERGraphViewer, the display of a feature value or edge label such as -- can be suppressed (cf. subsection 7.5, chapter IV).

2.4 Subcorpora

As a corpus grows, it sometimes needs to be divided into several files. For this purpose, the TIGER-XML format provides the concept of subcorpora. The main corpus contains a link to each subcorpus. A subcorpus consists of corpus graphs or further embedded subcorpora. It can be validated using the subcorpus subschema of the TIGER-XML format (cf. section 4).

The embedding syntax is the following: Within the corpus body, an element <subcorpus> is placed. Its attributes name and external specify the name of the subcorpus and its URL, respectively.

Please note: As the link is represented as a URL, a protocol has to be specified. If the subcorpus is located in the local file system, use the file: protocol. A relative path is evaluated with regard to the path of the embedding XML file.

The following example illustrates the embedding:

Main corpus (main.xml)

<corpus>

  <head>
    ...
  </head>

  <body>
    <subcorpus name="embedded corpus" external="file:subcorpus.xml"/>
  </body>

</corpus>

Subcorpus (subcorpus.xml)

<subcorpus name="embedded corpus">

  <s id="s1">
  ...
  </s>

  ...
  
</subcorpus>
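
The resolution of a relative file: link against the embedding file can be sketched as follows. This is only an illustration of the rule stated above, not the code used by the TIGER tools: the function name subcorpus_paths and the location /data/corpora/main.xml are hypothetical, and only the file: protocol is handled.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# Main corpus from above, without the elided header.
MAIN = """
<corpus>
  <body>
    <subcorpus name="embedded corpus" external="file:subcorpus.xml"/>
  </body>
</corpus>
"""


def subcorpus_paths(main_path, corpus):
    """Map each subcorpus name to the file path its link points to.

    Relative paths are resolved against the directory of the embedding
    file; absolute paths are kept as they are.
    """
    base_dir = PurePosixPath(main_path).parent
    paths = {}
    for sub in corpus.iterfind("body/subcorpus"):
        url = urlparse(sub.get("external"))
        if url.scheme != "file":
            raise ValueError("only the file: protocol is handled in this sketch")
        paths[sub.get("name")] = base_dir / url.path
    return paths


# "/data/corpora/main.xml" is a made-up location of the main corpus file.
paths = subcorpus_paths("/data/corpora/main.xml", ET.fromstring(MAIN.strip()))
print(paths["embedded corpus"])
```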

2.5 Corpus query matches

To be as flexible as possible, the TIGER-XML format has also been designed to represent corpus query matches. The following example illustrates the encoding of the match information for the query #v:[cat="NP"] > #w:[pos="NN"] and the matching corpus graph.

<matches>
  <match subgraph="s5_500">
    <variable name="#v" idref="s5_500"/>
    <variable name="#w" idref="s5_2"/>
  </match>
</matches>

Figure: Example sentence and its match visualization (red-colored)

Matches are represented by <match> elements. The <variable> elements refer to the graph nodes that match the variables #v and #w. Hence the IDs of the <t> and <nt> elements are essential for both the edge linking and the match reference mechanisms. The subgraph attribute of a <match> element refers to the root node of the matching subgraph.

In total, we get the following encoding of the corpus graph and query result:

<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
  <matches>
    <match subgraph="s5_500">
      <variable name="#w" idref="s5_2"/>
      <variable name="#v" idref="s5_500"/>
    </match>
    <match subgraph="s5_503">
      <variable name="#w" idref="s5_5"/>
      <variable name="#v" idref="s5_503"/>
    </match>
  </matches>
</s>
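
A consumer of this encoding might read the match information as sketched below in Python. The function name read_matches is hypothetical; the sample repeats only the <matches> element of the encoding above.

```python
import xml.etree.ElementTree as ET

# The <matches> element of sentence s5 from above.
MATCHES = """
<matches>
  <match subgraph="s5_500">
    <variable name="#w" idref="s5_2"/>
    <variable name="#v" idref="s5_500"/>
  </match>
  <match subgraph="s5_503">
    <variable name="#w" idref="s5_5"/>
    <variable name="#v" idref="s5_503"/>
  </match>
</matches>
"""


def read_matches(matches):
    """Return a list of (subgraph root ID, {variable name: node ID}) pairs."""
    result = []
    for match in matches.iterfind("match"):
        bindings = {v.get("name"): v.get("idref") for v in match.iterfind("variable")}
        result.append((match.get("subgraph"), bindings))
    return result


found = read_matches(ET.fromstring(MATCHES.strip()))
for subgraph, bindings in found:
    print(subgraph, bindings)
```

The node IDs returned here are the same IDs that the <edge> elements use, so they can be looked up directly in the node tables of the corresponding graph.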