2.2 Corpus header

The corpus header consists of two parts. The first part encodes general meta information about the corpus: name, author, date, short description, format, and history. The TIGERSearch tool displays this meta information when it lists the available corpora. The ID of the corpus (the id attribute of the root element <corpus>) should be unique among all indexed corpora.

<corpus id="TESTCORPUS">

<head>

  <meta>
    <name>Test corpus</name>
    <author>Wolfgang Lezius</author>
    <date>April 2003</date>
    <description>illustrates the TIGER-XML format</description>
    <format>NeGra format, version 3</format>
    <history>first version</history>
  </meta>
  ...
</head>
...
</corpus>

The second part of the corpus header declares the features used in the corpus. This feature declaration is obligatory for corpora that are to be indexed by the TIGERRegistry tool. Feature values and short explanations of the tags may also be listed; the TIGERSearch GUI uses this meta information as corpus documentation. If it does not make sense to list all the values of a feature (e.g. for the feature word), the corresponding feature element is left empty.

In the following example, the feature word is declared as a feature of terminal nodes (T) and the feature cat as a feature of nonterminal nodes (NT). If a feature is used in both terminal and nonterminal nodes (e.g. case), its domain is called FREC (cf. description of the query language; section 8, chapter III). The element content of a feature value declaration is interpreted as an explanation of the feature value. Potential edge labels are declared in an <edgelabel> element, labels of secondary edges in a <secedgelabel> element.

<head>
  ...
  <annotation>

    <feature name="word" domain="T"/>

    <feature name="pos" domain="T">
      <value name="ART">determiner</value>
      <value name="ADV">adverb</value>
      <value name="KOKOM">conjunction</value>
      <value name="NN">noun</value>
      <value name="PIAT">indefinite attributive pronoun</value>
      <value name="VVFIN">finite verb</value>
    </feature>

    <feature name="morph" domain="T">
      <value name="Def.Fem.Nom.Sg"/>
      <value name="Fem.Nom.Sg.*"/>
      <value name="Masc.Akk.Pl.*"/>
      <value name="3.Sg.Pres.Ind"/>
      <value name="--">not bound</value>
    </feature>

    <feature name="cat" domain="NT">
      <value name="AP">adjective phrase</value>
      <value name="AVP">adverbial phrase</value>
      <value name="NP">noun phrase</value>
      <value name="S">sentence</value>
    </feature>

    <edgelabel>
      <value name="CC">comparative complement</value>
      <value name="CM">comparative conjunction</value>
      <value name="HD">head</value>
      <value name="MO">modifier</value>
      <value name="NK">noun kernel modifier</value>
      <value name="OA">accusative object</value>
      <value name="SB">subject</value>
    </edgelabel>

  </annotation>

</head>
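
The example above does not show a feature with domain FREC or a declaration of secondary edge labels. A minimal sketch of both might look as follows. The feature name case is taken from the text; its values and the secondary edge label listed here are illustrative assumptions and are not taken from a particular corpus.

<head>
  ...
  <annotation>
    ...
    <!-- assumed example: a feature used in terminal and nonterminal nodes -->
    <feature name="case" domain="FREC">
      <value name="Nom">nominative</value>
      <value name="Akk">accusative</value>
    </feature>

    <edgelabel>
      ...
    </edgelabel>

    <!-- assumed example: labels that may occur on secondary edges -->
    <secedgelabel>
      <value name="SB">subject</value>
    </secedgelabel>

  </annotation>

</head>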