V. The TIGER-XML treebank encoding format

1. Introduction

Just as there are various formats for the representation of linguistic corpora, there are also a number of formats for the encoding of syntactically annotated corpora: Penn Treebank, Susanne, Negra, and several other formats, plus various kinds of parser or chunker output. Since applications like TIGERSearch cannot support all existing formats, it makes sense to define a single interface format for import and export. This format should be general enough to encode as many existing formats as possible. An XML-based approach is also advantageous because it solves the problem of Unicode character encoding.

The TIGER-XML format has been designed as an interface format. It is an XML-based equivalent of the corpus definition sublanguage of the TIGER description language (cf. chapter III). In addition to corpus definitions, the TIGER-XML format can also represent query results.

Any corpus to be processed by the TIGERSearch tool has to be encoded in the TIGER-XML format. For convenience, we have implemented corpus filters (i.e. converters to TIGER-XML) for many popular treebank and parser output formats like bracketing format, PennTreebank format, NeGra format etc. (cf. subsection 3.5, chapter VI for a list of implemented filters).

Why should you read this chapter? First of all, you might have corpora encoded in a format for which no corpus filter exists. In that case, knowledge of TIGER-XML is essential for converting them. In addition, TIGERSearch can export the matches of a query in the TIGER-XML format. If you would like to transform this XML output, e.g. with XSLT stylesheets, you need to know how the XML document is structured.

Section 2 walks through the TIGER-XML format using a small example. Section 3 presents a real-life example. Finally, section 4 describes the XML schema used to validate TIGER-XML documents.

If you are interested in the motivations that have influenced the development of the TIGER-XML format, you should have a look at Wolfgang Lezius' Ph.D. thesis [Lezius2002] (in German).

2. A walk through the TIGER-XML format

2.1 TIGER-XML corpora

The following subsections are a kind of guided tour. They walk through the encoding format and explain its properties.

A TIGER-XML document consists of two parts. In the header you can find the corpus declaration and some meta information (cf. subsection 2.2). The body comprises the definition of the corpus graphs and their annotation (cf. subsection 2.3). The corpus body can also be divided into so-called subcorpora (cf. subsection 2.4). Corpus query matches can also be represented within the TIGER-XML format (cf. subsection 2.5).

2.2 Corpus header

The corpus header consists of two parts. General meta information about the corpus is encoded in the first part of the header: name of the corpus, author, date, short description, format and history. This corpus meta information is displayed by the TIGERSearch tool when presenting the corpora available. The ID of the corpus (cf. root element <corpus>) should be unique with regard to all indexed corpora.

<corpus id="TESTCORPUS">

<head>

  <meta>
    <name>Test corpus</name>
    <author>Wolfgang Lezius</author>
    <date>April 2003</date>
    <description>illustrates the TIGER-XML format</description>
    <format>NeGra format, version 3</format>
    <history>first version</history>
  </meta>
  ...
</head>
...
</corpus>

The second part of the corpus header provides information about the features used in the corpus. This feature declaration is obligatory for corpora to be indexed by the TIGERRegistry tool. Feature values and short explanations of the tags may be listed; the TIGERSearch GUI uses this kind of meta information as corpus documentation. If it does not make sense to list all the values of a feature (e.g. for a feature word), the content of the corresponding feature element is left empty.

In the following example, the feature word is declared as a feature of terminal nodes (T) and the feature cat as a feature of nonterminal nodes (NT). If a feature is used in both terminal and nonterminal nodes (e.g. case), its domain is called FREC (cf. description of the query language; section 8, chapter III). The element content of a feature value declaration is interpreted as an explanation of that feature value. Potential edge labels are declared in an <edgelabel> element, labels of secondary edges in a <secedgelabel> element.

<head>
  ...
  <annotation>

    <feature name="word" domain="T"/>

    <feature name="pos" domain="T">
      <value name="ART">determiner</value>
      <value name="ADV">adverb</value>
      <value name="KOKOM">conjunction</value>
      <value name="NN">noun</value>
      <value name="PIAT">indefinite attributive pronoun</value>
      <value name="VVFIN">finite verb</value>
    </feature>

    <feature name="morph" domain="T">
      <value name="Def.Fem.Nom.Sg"/>
      <value name="Fem.Nom.Sg.*"/>
      <value name="Masc.Akk.Pl.*"/>
      <value name="3.Sg.Pres.Ind"/>
      <value name="--">not bound</value>
    </feature>

    <feature name="cat" domain="NT">
      <value name="AP">adjective phrase</value>
      <value name="AVP">adverbial phrase</value>
      <value name="NP">noun phrase</value>
      <value name="S">sentence</value>
    </feature>

    <edgelabel>
      <value name="CC">comparative complement</value>
      <value name="CM">comparative conjunction</value>
      <value name="HD">head</value>
      <value name="MO">modifier</value>
      <value name="NK">noun kernel modifier</value>
      <value name="OA">accusative object</value>
      <value name="SB">subject</value>
    </edgelabel>

  </annotation>

</head>
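
As a rough illustration of how a program might consume such a header, the following Python sketch reads the corpus ID, the corpus name, and the declared features from an abbreviated copy of the header above. It relies only on the standard library module xml.etree.ElementTree; the function name read_declarations is our own choice for this sketch and is not part of TIGERSearch.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the header shown above (the corpus body is left out).
HEADER_XML = """
<corpus id="TESTCORPUS">
  <head>
    <meta>
      <name>Test corpus</name>
      <author>Wolfgang Lezius</author>
    </meta>
    <annotation>
      <feature name="word" domain="T"/>
      <feature name="pos" domain="T">
        <value name="ART">determiner</value>
        <value name="NN">noun</value>
      </feature>
      <feature name="cat" domain="NT">
        <value name="NP">noun phrase</value>
      </feature>
      <edgelabel>
        <value name="HD">head</value>
      </edgelabel>
    </annotation>
  </head>
</corpus>
"""


def read_declarations(corpus):
    """Return {feature name: (domain, {value name: explanation})}."""
    declarations = {}
    for feature in corpus.findall("./head/annotation/feature"):
        # Element content of <value> is the explanation; it may be empty.
        values = {v.get("name"): (v.text or "").strip()
                  for v in feature.findall("value")}
        declarations[feature.get("name")] = (feature.get("domain"), values)
    return declarations


corpus = ET.fromstring(HEADER_XML)
declarations = read_declarations(corpus)
edge_labels = {v.get("name"): (v.text or "").strip()
               for v in corpus.findall("./head/annotation/edgelabel/value")}

print(corpus.get("id"), "-", corpus.findtext("./head/meta/name"))
for name, (domain, values) in declarations.items():
    print(name, domain, values)
print("edge labels:", edge_labels)
```

A feature without listed values, such as word, simply yields an empty dictionary.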

2.3 Corpus body

The supported data model is based on so-called syntax graphs, i.e. directed acyclic graphs with a single root node. Since such graphs may contain crossing edges, corpus graphs cannot in general be encoded by nesting XML elements. Instead, all terminal and nonterminal nodes are listed, and edges are explicitly encoded as elements that refer to their target nodes by ID. The following example illustrates the corpus graph encoding.

Figure: Example sentence and its annotation

<body>

<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
</s>

</body>

Please note: Feature values, represented as attribute-value pairs, cannot be omitted. If a feature value or edge label does not make sense for a token or inner node (e.g. the feature morph is unspecified for some tokens of the example sentence), use a placeholder symbol instead. We recommend the symbol --, which is also used by our import filters. When a matching corpus graph is viewed in the TIGERGraphViewer, the display of a feature value or edge label such as -- can be suppressed (cf. subsection 7.5, chapter IV).
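
To make the ID-based edge linking more concrete, here is a minimal Python sketch that follows the <edge> elements from the root node and prints the example graph as a labelled bracketing. It works on a copy of the example sentence reduced to the features word and pos. The helper names bracketing and covered_positions are assumptions made for this illustration. The second helper shows that the AP node covers tokens 4, 6, 7 and 8 but not token 5, i.e. its yield is discontinuous, which is why the graph could not be written as nested XML elements.

```python
import xml.etree.ElementTree as ET

# Copy of the example sentence, reduced to the word and pos features.
SENTENCE_XML = """
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART"/>
      <t id="s5_2" word="Tagung" pos="NN"/>
      <t id="s5_3" word="hat" pos="VVFIN"/>
      <t id="s5_4" word="mehr" pos="PIAT"/>
      <t id="s5_5" word="Teilnehmer" pos="NN"/>
      <t id="s5_6" word="als" pos="KOKOM"/>
      <t id="s5_7" word="je" pos="ADV"/>
      <t id="s5_8" word="zuvor" pos="ADV"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""

graph = ET.fromstring(SENTENCE_XML).find("graph")
# Index all nodes by ID; edges refer to their targets through these IDs.
nodes = {n.get("id"): n for n in graph.iter() if n.tag in ("t", "nt")}
# Surface position of each terminal, taken from the order inside <terminals>.
position = {t.get("id"): i
            for i, t in enumerate(graph.find("terminals"), start=1)}


def bracketing(node_id):
    """Render the subgraph below node_id, in the order of the <edge> elements."""
    node = nodes[node_id]
    if node.tag == "t":
        return node.get("word")
    children = " ".join(
        f"{edge.get('label')}:{bracketing(edge.get('idref'))}"
        for edge in node.findall("edge")
    )
    return f"({node.get('cat')} {children})"


def covered_positions(node_id):
    """Return the sorted token positions dominated by node_id."""
    node = nodes[node_id]
    if node.tag == "t":
        return [position[node_id]]
    covered = []
    for edge in node.findall("edge"):
        covered.extend(covered_positions(edge.get("idref")))
    return sorted(covered)


tree = bracketing(graph.get("root"))
ap_positions = covered_positions("s5_502")
print(tree)
print(ap_positions)
```

Note that the bracketing follows the order of the <edge> elements, not the surface order of the tokens.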

2.4 Subcorpora

As a corpus grows, it sometimes needs to be divided into several files. Therefore the concept of subcorpora has been introduced in the TIGER-XML format. In the main corpus a link is placed to a subcorpus. The subcorpus consists of corpus graphs or other embedded subcorpora. It can be validated using the subcorpus subschema of the TIGER-XML format (cf. section 4).

The embedding syntax is the following: Within the corpus body, an element <subcorpus> is placed. Its attributes name and external specify the name of the subcorpus and its URL, respectively.

Please note: As the link is represented as a URL, a protocol has to be specified. If the subcorpus is placed within the local file system, use the file: protocol. A relative path will be evaluated with regard to the path of the embedding XML file.

The following example illustrates the embedding:

Main corpus (main.xml)

<corpus>

  <head>
    ...
  </head>

  <body>
    <subcorpus name="embedded corpus" external="file:subcorpus.xml"/>
  </body>

</corpus>

Subcorpus (subcorpus.xml)

<subcorpus name="embedded corpus">

  <s id="s1">
  ...
  </s>

  ...
  
</subcorpus>
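
The link resolution described above can be sketched in Python as follows. The sketch is an illustration under simplifying assumptions, not the TIGERSearch implementation: it handles only the file: protocol, does not decode percent-escapes, and the function names resolve_external and iter_sentences are hypothetical. It writes the two example files to a temporary directory and then collects all sentences, following the external link.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def resolve_external(link, embedding_file):
    """Map a file: link to a path, relative to the embedding XML file."""
    if not link.startswith("file:"):
        raise ValueError(f"only file: links are handled in this sketch: {link}")
    target = Path(link[len("file:"):])
    if target.is_absolute():
        return target
    return Path(embedding_file).parent / target


def iter_sentences(element, source_file):
    """Yield all <s> elements, following external subcorpus links."""
    for child in element:
        if child.tag == "s":
            yield child
        elif child.tag == "subcorpus":
            link = child.get("external")
            if link is None:
                # Subcorpus embedded directly in the same file.
                yield from iter_sentences(child, source_file)
            else:
                path = resolve_external(link, source_file)
                yield from iter_sentences(ET.parse(path).getroot(), path)


with tempfile.TemporaryDirectory() as tmp:
    main = Path(tmp) / "main.xml"
    main.write_text(
        '<corpus id="DEMO"><body>'
        '<subcorpus name="embedded corpus" external="file:subcorpus.xml"/>'
        '</body></corpus>',
        encoding="utf-8",
    )
    (Path(tmp) / "subcorpus.xml").write_text(
        '<subcorpus name="embedded corpus"><s id="s1"/><s id="s2"/></subcorpus>',
        encoding="utf-8",
    )
    body = ET.parse(main).getroot().find("body")
    sentence_ids = [s.get("id") for s in iter_sentences(body, main)]

print(sentence_ids)
```

Because iter_sentences passes the path of the file it has just opened down to the recursive call, relative links in nested subcorpus files are evaluated with regard to the file that contains them.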

2.5 Corpus query matches

To be as flexible as possible, the TIGER-XML format has also been designed to represent corpus query matches. The following example illustrates the encoding of the match information for the query #v:[cat="NP"] > #w:[pos="NN"] and the matching corpus graph.

<matches>
    <match subgraph="s5_500">
      <variable name="#v" idref="s5_500"/>
      <variable name="#w" idref="s5_2"/>
    </match>
</matches>

Figure: Example sentence and its match visualization (red-colored)

Matches are represented by <match> elements. The <variable> elements refer to the corresponding graph nodes matching the variables #v and #w. Hence the IDs of the <t> and <nt> elements are essential for both the edge linking and match reference mechanism. The subgraph attribute of a <match> element refers to the root node of the matching subgraph.

In total, we get the following encoding of the corpus graph and query result:

<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
  <matches>
    <match subgraph="s5_500">
      <variable name="#w" idref="s5_2"/>
      <variable name="#v" idref="s5_500"/>
    </match>
    <match subgraph="s5_503">
      <variable name="#w" idref="s5_5"/>
      <variable name="#v" idref="s5_503"/>
    </match>
  </matches>
</s>
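
If you want to post-process exported matches, a small script is often sufficient. The following Python sketch is one possible way to do this with the standard library. It resolves the idref of each <variable> element against the node IDs of the graph and reports, for each match, the word or category bound to each variable. The input is a copy of the listing above, reduced to the features word and pos; the helper name describe is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Copy of the listing above, reduced to the word and pos features.
RESULT_XML = """
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART"/>
      <t id="s5_2" word="Tagung" pos="NN"/>
      <t id="s5_3" word="hat" pos="VVFIN"/>
      <t id="s5_4" word="mehr" pos="PIAT"/>
      <t id="s5_5" word="Teilnehmer" pos="NN"/>
      <t id="s5_6" word="als" pos="KOKOM"/>
      <t id="s5_7" word="je" pos="ADV"/>
      <t id="s5_8" word="zuvor" pos="ADV"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      <nt id="s5_502" cat="AP">
        <edge label="HD" idref="s5_4"/>
        <edge label="CC" idref="s5_501"/>
      </nt>
      <nt id="s5_503" cat="NP">
        <edge label="NK" idref="s5_502"/>
        <edge label="NK" idref="s5_5"/>
      </nt>
      <nt id="s5_504" cat="S">
        <edge label="SB" idref="s5_500"/>
        <edge label="HD" idref="s5_3"/>
        <edge label="OA" idref="s5_503"/>
      </nt>
    </nonterminals>
  </graph>
  <matches>
    <match subgraph="s5_500">
      <variable name="#w" idref="s5_2"/>
      <variable name="#v" idref="s5_500"/>
    </match>
    <match subgraph="s5_503">
      <variable name="#w" idref="s5_5"/>
      <variable name="#v" idref="s5_503"/>
    </match>
  </matches>
</s>
"""


def describe(node):
    """Word of a terminal node or category of a nonterminal node."""
    return node.get("word") if node.tag == "t" else node.get("cat")


sentence = ET.fromstring(RESULT_XML)
# The same IDs serve edge linking and match references.
nodes = {n.get("id"): n
         for n in sentence.find("graph").iter() if n.tag in ("t", "nt")}

bindings = []
for match in sentence.findall("./matches/match"):
    bindings.append({
        variable.get("name"): describe(nodes[variable.get("idref")])
        for variable in match.findall("variable")
    })

for binding in bindings:
    print(binding)
```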

3. Corpus example

The following listing shows the TIGER-XML representation of a small demo corpus. This corpus comprises two corpus graphs of the Wall Street Journal corpus of the PennTreebank. It makes use of edge labels and of secondary edges to represent coreference annotation. The TIGER-XML example file corpus.xml is also placed in the doc/examples/ subdirectory of your TIGERSearch installation.

Figure: Sentence 1: Pierre Vinken, 61 years old, will join the board...

Figure: Sentence 2: Rudolph Agnew, 55 years old and former chairman...

<corpus id="DEMO">

<head>
  <meta>
    <name>two sentences of Wall Street Journal corpus</name>
    <format>bracketing format</format>
  </meta>
  <annotation>
    <feature name="word" domain="T"/>
    <feature name="pos" domain="T">
       <value name=","/>
       <value name="-NONE-"/>
       <value name="."/>
       <value name="CC"/>
       <value name="CD"/>
       <value name="DT"/>
       <value name="IN"/>
       <value name="JJ"/>
       <value name="MD"/>
       <value name="NN"/>
       <value name="NNP"/>
       <value name="NNS"/>
       <value name="VB"/>
       <value name="VBD"/>
       <value name="VBN"/>
    </feature>
    <feature name="cat" domain="NT">
       <value name="ADJP"/>
       <value name="NP"/>
       <value name="PP"/>
       <value name="S"/>
       <value name="UCP"/>
       <value name="VP"/>
    </feature>
    <edgelabel>
       <value name="--">not bound</value>
       <value name="CLR"/>
       <value name="PRD"/>
       <value name="SBJ"/>
       <value name="TMP"/>
    </edgelabel>
    <secedgelabel>
       <value name="*"/>
    </secedgelabel>
  </annotation>
</head>

<body>

<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Pierre" pos="NNP"/>
      <t id="s1_2" word="Vinken" pos="NNP"/>
      <t id="s1_3" word="," pos=","/>
      <t id="s1_4" word="61" pos="CD"/>
      <t id="s1_5" word="years" pos="NNS"/>
      <t id="s1_6" word="old" pos="JJ"/>
      <t id="s1_7" word="," pos=","/>
      <t id="s1_8" word="will" pos="MD"/>
      <t id="s1_9" word="join" pos="VB"/>
      <t id="s1_10" word="the" pos="DT"/>
      <t id="s1_11" word="board" pos="NN"/>
      <t id="s1_12" word="as" pos="IN"/>
      <t id="s1_13" word="a" pos="DT"/>
      <t id="s1_14" word="nonexecutive" pos="JJ"/>
      <t id="s1_15" word="director" pos="NN"/>
      <t id="s1_16" word="Nov." pos="NNP"/>
      <t id="s1_17" word="29" pos="CD"/>
      <t id="s1_18" word="." pos="."/>
    </terminals>
    <nonterminals>
      <nt id="s1_502" cat="NP">
        <edge label="--" idref="s1_1"/>
        <edge label="--" idref="s1_2"/>
      </nt>
      <nt id="s1_504" cat="NP">
        <edge label="--" idref="s1_4"/>
        <edge label="--" idref="s1_5"/>
      </nt>
      <nt id="s1_503" cat="ADJP">
        <edge label="--" idref="s1_504"/>
        <edge label="--" idref="s1_6"/>
      </nt>
      <nt id="s1_501" cat="NP">
        <edge label="--" idref="s1_502"/>
        <edge label="--" idref="s1_3"/>
        <edge label="--" idref="s1_503"/>
        <edge label="--" idref="s1_7"/>
      </nt>
      <nt id="s1_507" cat="NP">
        <edge label="--" idref="s1_10"/>
        <edge label="--" idref="s1_11"/>
      </nt>
      <nt id="s1_509" cat="NP">
        <edge label="--" idref="s1_13"/>
        <edge label="--" idref="s1_14"/>
        <edge label="--" idref="s1_15"/>
      </nt>
      <nt id="s1_508" cat="PP">
        <edge label="--" idref="s1_12"/>
        <edge label="--" idref="s1_509"/>
      </nt>
      <nt id="s1_510" cat="NP">
        <edge label="--" idref="s1_16"/>
        <edge label="--" idref="s1_17"/>
      </nt>
      <nt id="s1_506" cat="VP">
        <edge label="--" idref="s1_9"/>
        <edge label="--" idref="s1_507"/>
        <edge label="CLR" idref="s1_508"/>
        <edge label="TMP" idref="s1_510"/>
      </nt>
      <nt id="s1_505" cat="VP">
        <edge label="--" idref="s1_8"/>
        <edge label="--" idref="s1_506"/>
      </nt>
      <nt id="s1_500" cat="S">
        <edge label="SBJ" idref="s1_501"/>
        <edge label="--" idref="s1_505"/>
        <edge label="--" idref="s1_18"/>
      </nt>
    </nonterminals>
  </graph>
</s>

<s id="s3">
  <graph root="s3_500">
    <terminals>
      <t id="s3_1" word="Rudolph" pos="NNP"/>
      <t id="s3_2" word="Agnew" pos="NNP"/>
      <t id="s3_3" word="," pos=","/>
      <t id="s3_4" word="55" pos="CD"/>
      <t id="s3_5" word="years" pos="NNS"/>
      <t id="s3_6" word="old" pos="JJ"/>
      <t id="s3_7" word="and" pos="CC"/>
      <t id="s3_8" word="former" pos="JJ"/>
      <t id="s3_9" word="chairman" pos="NN"/>
      <t id="s3_10" word="of" pos="IN"/>
      <t id="s3_11" word="Consolidated" pos="NNP"/>
      <t id="s3_12" word="Gold" pos="NNP"/>
      <t id="s3_13" word="Fields" pos="NNP"/>
      <t id="s3_14" word="PLC" pos="NNP"/>
      <t id="s3_15" word="," pos=","/>
      <t id="s3_16" word="was" pos="VBD"/>
      <t id="s3_17" word="named" pos="VBN"/>
      <t id="s3_18" word="*" pos="-NONE-"/>
      <t id="s3_19" word="a" pos="DT"/>
      <t id="s3_20" word="nonexecutive" pos="JJ"/>
      <t id="s3_21" word="director" pos="NN"/>
      <t id="s3_22" word="of" pos="IN"/>
      <t id="s3_23" word="this" pos="DT"/>
      <t id="s3_24" word="British" pos="JJ"/>
      <t id="s3_25" word="industrial" pos="JJ"/>
      <t id="s3_26" word="conglomerate" pos="NN"/>
      <t id="s3_27" word="." pos="."/>
    </terminals>
    <nonterminals>
      <nt id="s3_502" cat="NP">
        <edge label="--" idref="s3_1"/>
        <edge label="--" idref="s3_2"/>
      </nt>
      <nt id="s3_505" cat="NP">
        <edge label="--" idref="s3_4"/>
        <edge label="--" idref="s3_5"/>
      </nt>
      <nt id="s3_504" cat="ADJP">
        <edge label="--" idref="s3_505"/>
        <edge label="--" idref="s3_6"/>
      </nt>
      <nt id="s3_507" cat="NP">
        <edge label="--" idref="s3_8"/>
        <edge label="--" idref="s3_9"/>
      </nt>
      <nt id="s3_509" cat="NP">
        <edge label="--" idref="s3_11"/>
        <edge label="--" idref="s3_12"/>
        <edge label="--" idref="s3_13"/>
        <edge label="--" idref="s3_14"/>
      </nt>
      <nt id="s3_508" cat="PP">
        <edge label="--" idref="s3_10"/>
        <edge label="--" idref="s3_509"/>
      </nt>
      <nt id="s3_506" cat="NP">
        <edge label="--" idref="s3_507"/>
        <edge label="--" idref="s3_508"/>
      </nt>
      <nt id="s3_503" cat="UCP">
        <edge label="--" idref="s3_504"/>
        <edge label="--" idref="s3_7"/>
        <edge label="--" idref="s3_506"/>
      </nt>
      <nt id="s3_501" cat="NP">
        <edge label="--" idref="s3_502"/>
        <edge label="--" idref="s3_3"/>
        <edge label="--" idref="s3_503"/>
        <edge label="--" idref="s3_15"/>
        <secedge label="*" idref="s3_18"/>
      </nt>
      <nt id="s3_513" cat="NP">
        <edge label="--" idref="s3_18"/>
      </nt>
      <nt id="s3_515" cat="NP">
        <edge label="--" idref="s3_19"/>
        <edge label="--" idref="s3_20"/>
        <edge label="--" idref="s3_21"/>
      </nt>
      <nt id="s3_517" cat="NP">
        <edge label="--" idref="s3_23"/>
        <edge label="--" idref="s3_24"/>
        <edge label="--" idref="s3_25"/>
        <edge label="--" idref="s3_26"/>
      </nt>
      <nt id="s3_516" cat="PP">
        <edge label="--" idref="s3_22"/>
        <edge label="--" idref="s3_517"/>
      </nt>
      <nt id="s3_514" cat="NP">
        <edge label="--" idref="s3_515"/>
        <edge label="--" idref="s3_516"/>
      </nt>
      <nt id="s3_512" cat="S">
        <edge label="SBJ" idref="s3_513"/>
        <edge label="PRD" idref="s3_514"/>
      </nt>
      <nt id="s3_511" cat="VP">
        <edge label="--" idref="s3_17"/>
        <edge label="--" idref="s3_512"/>
      </nt>
      <nt id="s3_510" cat="VP">
        <edge label="--" idref="s3_16"/>
        <edge label="--" idref="s3_511"/>
      </nt>
      <nt id="s3_500" cat="S">
        <edge label="SBJ" idref="s3_501"/>
        <edge label="--" idref="s3_510"/>
        <edge label="--" idref="s3_27"/>
      </nt>
    </nonterminals>
  </graph>
</s>

</body>

</corpus>
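
If you have to convert a corpus for which no filter exists, the conversion essentially amounts to generating listings like the one above. The following Python sketch shows the idea for a single sentence. Everything about the input side is an assumption made for illustration: the tree is given as nested Python tuples, with (pos, word) pairs as leaves and (cat, child, ...) tuples as inner nodes. The ID scheme merely imitates the listing above (terminals numbered from 1, nonterminals from 500), and all edges receive the label --. Function labels, secondary edges, and the corpus header are not handled.

```python
import xml.etree.ElementTree as ET


def tree_to_sentence(tree, sentence_id):
    """Build an <s> element from a nested tuple tree (hypothetical input format)."""
    s = ET.Element("s", id=sentence_id)
    graph = ET.SubElement(s, "graph")
    terminals = ET.SubElement(graph, "terminals")
    nonterminals = ET.SubElement(graph, "nonterminals")
    counters = {"t": 0, "nt": 499}

    def add(node):
        label, *rest = node
        if len(rest) == 1 and isinstance(rest[0], str):
            # Leaf of the form (pos, word): becomes a <t> element.
            counters["t"] += 1
            node_id = f"{sentence_id}_{counters['t']}"
            ET.SubElement(terminals, "t", id=node_id, word=rest[0], pos=label)
            return node_id
        # Inner node: reserve its ID first, then convert the children.
        counters["nt"] += 1
        node_id = f"{sentence_id}_{counters['nt']}"
        child_ids = [add(child) for child in rest]
        nt = ET.SubElement(nonterminals, "nt", id=node_id, cat=label)
        for child_id in child_ids:
            # No edge label available in the input: use the placeholder --.
            ET.SubElement(nt, "edge", label="--", idref=child_id)
        return node_id

    graph.set("root", add(tree))
    return s


tree = ("S",
        ("NP", ("NNP", "Pierre"), ("NNP", "Vinken")),
        ("VP", ("MD", "will"), ("VB", "join")))

sentence = tree_to_sentence(tree, "s1")
print(ET.tostring(sentence, encoding="unicode"))
```

A complete converter would also have to write the header with the feature declarations, since these are obligatory for indexing.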

4. The TIGER-XML schema

The TIGER-XML format is validated against an XML schema. XML schema validation is supported by most major XML parsers. The schema is divided into three parts: the main schema, the subschema for the corpus header, and the subschema for subcorpora. The TIGER-XML schema and its two subschemas are placed in the schema/ subdirectory of your TIGERSearch installation.

Part 1: Main schema - TigerXML.xsd

<schema>

 <!-- ==================================================================
      XML Schema for the TIGER-XML format
      http://www.ims.uni-stuttgart.de/projekte/TIGER/public/TigerXML.xsd
      ==================================================================
      TIGER Project, Wolfgang Lezius
      IMS, University of Stuttgart, 04/01/2003
      ================================================================== -->


  <!-- ======================================================
       INCLUDES DECLARATION OF THE HEADER
       ====================================================== -->
  <include schemaLocation="TigerXMLHeader.xsd"/>


  <!-- ======================================================
       INCLUDES DECLARATION OF SUBCORPORA AND SENTENCES
       ====================================================== -->
  <include schemaLocation="TigerXMLSubcorpus.xsd"/>


  <!-- ======================================================
       DECLARATION OF THE CORPUS DOCUMENT
       ====================================================== -->

  <!-- declaration of the root element: corpus -->

  <element name="corpus">
  
    <complexType>

      <sequence>

        <choice>           
           <!-- header of the document is optional -->
           <element name="head" type="headType" minOccurs="0" maxOccurs="1"/>
        </choice>

        <element name="body" type="bodyType" minOccurs="1" maxOccurs="1"/>

      </sequence>

      <!-- corpus ID -->
      <attribute name="id" type="idType" use="required"/>

      <!-- optional attribute: TigerXML version; used by TIGERSearch only -->
      <attribute name="version" type="xsd:string" use="optional"/>

    </complexType>
  
  </element>


  <!-- declaration of the body type -->

  <complexType name="bodyType">

    <choice minOccurs="1" maxOccurs="unbounded">
      <element name="subcorpus" type="subcorpusType" minOccurs="1" maxOccurs="1"/>
      <element name="s" type="sentenceType" minOccurs="1" maxOccurs="1"/>
    </choice>

  </complexType>


</schema>

Part 2: Subschema for the corpus header - TigerXMLHeader.xsd

<schema>

 <!-- =======================================================================
      XML SubSchema for the header part of the TIGER-XML format
      http://www.ims.uni-stuttgart.de/projekte/TIGER/public/TigerXMLHeader.xsd
      =======================================================================
      TIGER Project, Wolfgang Lezius 
      IMS, University of Stuttgart, 04/01/2003
      ======================================================================= -->


  <!-- ======================================================
       DECLARATION OF THE HEADER
       ====================================================== -->


  <!-- declaration of the head element -->

  <element name="head" type="headType"/>


  <!-- declaration of the header type -->

  <complexType name="headType">

     <sequence>
        <element name="meta" type="metaType" minOccurs="0" maxOccurs="1"/>
        <element name="annotation" type="annotationType" minOccurs="0" maxOccurs="1"/>
     </sequence>    

     <!-- optional: reference to external header file 

          The header of a TigerXML corpus can also be stored in a separate file.
          This attribute points to the external header file. The pointer is
          a URI. Examples: file:relative.xml or file:/path/to/absolute.xml

          Note: If there is a pointer to an external file, the head
                element must be empty. -->

     <attribute name="external" type="xsd:anyURI"/>  

  </complexType>


  <!-- declaration of the meta information type -->

  <complexType name="metaType">

    <sequence>
      <element name="name" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      <element name="author" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      <element name="date" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      <element name="description" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      <element name="format" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      <element name="history" type="xsd:string" minOccurs="0" maxOccurs="1"/>
    </sequence>    

  </complexType>
  

  <!-- declaration of the annotation type -->

  <complexType name="annotationType">

    <sequence>
      <element name="feature" type="featureType" minOccurs="1" maxOccurs="unbounded"/>
      <element name="edgelabel" type="edgelabelType" minOccurs="0" maxOccurs="1"/>
      <element name="secedgelabel" type="edgelabelType" minOccurs="0" maxOccurs="1"/>
    </sequence>

  </complexType>


  <!-- declaration of the feature type -->

  <complexType name="featureType">

    <sequence>
       <element name="value" type="featurevalueType" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    
    <attribute name="name" type="featurenameType" use="required"/>

    <attribute name="domain" use="required">
       <simpleType>
         <restriction base="xsd:string">
           <enumeration value="T"/>     <!-- feature for terminal nodes -->
           <enumeration value="NT"/>    <!-- feature for nonterminal nodes -->
           <enumeration value="FREC"/>  <!-- feature for both -->
         </restriction>
       </simpleType>
    </attribute>

  </complexType>


  <!-- declaration of the (secondary) edge label type -->

  <complexType name="edgelabelType">

    <sequence>
       <element name="value" type="featurevalueType" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    
  </complexType>


  <!-- declaration of the feature value type -->

  <complexType name="featurevalueType">

    <simpleContent>   <!-- element content: documentation of the feature value -->
      <extension base="xsd:string">
        <attribute name="name" type="xsd:string"/>
      </extension>
    </simpleContent>


  </complexType>


  <!-- ======================================================
       HEADER DECLARATIONS THAT SHOULD BE REFINED
       ====================================================== -->

  <!-- declaration of the FEATURE NAMES used in the corpus header;
       this type is unrestricted, but should be refined by a 
       specialised, corpus-dependent schema -->

  <simpleType name="featurenameType">

    <restriction base="xsd:string">
      <minLength value="1"/>
      <maxLength value="20"/>
      <whiteSpace value="preserve"/>
    </restriction>

  </simpleType>


</schema>
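
Some of the constraints stated in this subschema can also be checked with a few lines of code, which may be handy when no schema-aware parser is available. The following Python sketch covers only three of them: an <annotation> element must contain at least one <feature>, a feature name must have between 1 and 20 characters, and the domain must be T, NT or FREC. The function name check_annotation is an assumption of this sketch, and the check is no substitute for full schema validation.

```python
import xml.etree.ElementTree as ET

ALLOWED_DOMAINS = {"T", "NT", "FREC"}


def check_annotation(annotation):
    """Check an <annotation> element against three header constraints."""
    problems = []
    features = annotation.findall("feature")
    if not features:
        problems.append("annotation declares no feature")
    for feature in features:
        name = feature.get("name") or ""
        domain = feature.get("domain")
        # featurenameType: minLength 1, maxLength 20
        if not 1 <= len(name) <= 20:
            problems.append(f"feature name must have 1 to 20 characters: {name!r}")
        # domain: enumeration T, NT, FREC
        if domain not in ALLOWED_DOMAINS:
            problems.append(f"feature {name!r} has invalid domain: {domain!r}")
    return problems


good = ET.fromstring(
    '<annotation>'
    '<feature name="word" domain="T"/>'
    '<feature name="cat" domain="NT"/>'
    '</annotation>'
)
bad = ET.fromstring(
    '<annotation>'
    '<feature name="averyveryverylongfeaturename" domain="T"/>'
    '<feature name="cat" domain="N"/>'
    '</annotation>'
)

good_problems = check_annotation(good)
bad_problems = check_annotation(bad)
print(good_problems)
for problem in bad_problems:
    print(problem)
```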

Part 3: Subschema for subcorpora - TigerXMLSubcorpus.xsd

<schema>

 <!-- ===========================================================================
      XML Schema for the subcorpus part of the TIGER-XML format
      http://www.ims.uni-stuttgart.de/projekte/TIGER/public/TigerXMLSubcorpus.xsd
      ===========================================================================
      TIGER Project, Wolfgang Lezius
      IMS, University of Stuttgart, 04/01/2003
      =========================================================================== -->

  <!-- ======================================================
       DECLARATION OF SUBCORPORA AND SENTENCES
       ====================================================== -->


  <!-- declaration of the subcorpus element -->

  <element name="subcorpus" type="subcorpusType"/>


  <!-- declaration of the subcorpus type -->

  <complexType name="subcorpusType">

    <!-- A subcorpus may comprise further subcorpora or sentences -->

    <choice minOccurs="0" maxOccurs="unbounded">
      <element name="subcorpus" type="subcorpusType" minOccurs="1" maxOccurs="1"/>
      <element name="s" type="sentenceType" minOccurs="1" maxOccurs="1"/>
    </choice>

    <!-- required: subcorpus name -->
 
    <attribute name="name" type="xsd:string" use="required"/>

    <!-- optional: reference to external subcorpus file 

         A subcorpus of a TigerXML corpus can also be stored in a separate file.
         This attribute points to the external subcorpus file. The pointer is
         a URI. Examples: file:relative.xml or file:/path/to/absolute.xml

         Note: If there is a pointer to an external file, the subcorpus
               element must be empty. -->

    <attribute name="external" type="xsd:anyURI"/>  

  </complexType>


  <!-- declaration of the sentence type -->

  <complexType name="sentenceType">

    <sequence>
      <element name="graph" type="graphType" minOccurs="0" maxOccurs="1"/>
      <element name="matches" type="matchesType" minOccurs="0" maxOccurs="1"/>
    </sequence>

    <attribute name="id" type="idType" use="required"/>

  </complexType>


  <!-- declaration of the graph type -->

  <complexType name="graphType">

    <sequence>
      <element name="terminals" type="terminalsType" minOccurs="1" maxOccurs="1"/>
      <element name="nonterminals" type="nonterminalsType" minOccurs="1" maxOccurs="1"/>
    </sequence>

    <attribute name="root" type="idrefType" use="required"/>

    <!-- indicates that the exported sentence is discontinuous -->
    <attribute name="discontinuous" type="xsd:boolean" default="false" use="optional"/>

  </complexType>


  <!-- declaration of the terminals type -->

  <complexType name="terminalsType">

    <sequence>
      <element name="t" type="tType" minOccurs="1" maxOccurs="unbounded"/>
    </sequence>

  </complexType>


  <!-- declaration of the t element -->

  <complexType name="tType">

    <!-- secondary edges possible -->
    <sequence>
      <element name="secedge" type="secedgeType" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>

    <attribute name="id" type="idType" use="required"/>    
    <attributeGroup ref="tfeatureAttributes"/>

  </complexType>


  <!-- declaration of the nonterminals type -->

  <complexType name="nonterminalsType">

    <sequence>
      <element name="nt" type="ntType" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>

  </complexType>


  <!-- declaration of the nt element -->

  <complexType name="ntType">

    <!-- edge and secondary edges possible -->
    <sequence>
      <element name="edge" type="edgeType" minOccurs="0" maxOccurs="unbounded"/>
      <element name="secedge" type="secedgeType" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>

    <attribute name="id" type="idType" use="required"/>    
    <attributeGroup ref="ntfeatureAttributes"/>

  </complexType>


  <!-- declaration of the edge type -->

  <complexType name="edgeType">

    <attribute name="idref" type="idrefType" use="required"/>    

    <attributeGroup ref="edgelabelAttribute"/>

  </complexType>


  <!-- declaration of the secondary edge type -->

  <complexType name="secedgeType">

    <attribute name="idref" type="idrefType" use="required"/>    

    <attributeGroup ref="secedgelabelAttribute"/>

  </complexType>


  <!-- declaration of the matches type -->

  <complexType name="matchesType">

    <sequence>
      <element name="match" type="matchType" minOccurs="1" maxOccurs="unbounded"/>
    </sequence>

  </complexType>


  <!-- declaration of the match type -->

  <complexType name="matchType">

    <sequence>
      <element name="variable" type="varType" minOccurs="1" maxOccurs="unbounded"/>
    </sequence>

    <attribute name="subgraph" type="idrefType" use="required"/>    

  </complexType>


  <!-- declaration of the variable type -->

  <complexType name="varType">

    <attribute name="name" type="xsd:string" use="required"/>    

    <attribute name="idref" type="idrefType" use="required"/>    

  </complexType>


  <!-- ======================================================
       SENTENCE DECLARATIONS THAT SHOULD BE REFINED
       ====================================================== -->

  <!-- declaration of the TERMINAL FEATURE ATTRIBUTES;
       this group is unrestricted, but should be refined by a 
       specialised, corpus-dependent schema -->

  <attributeGroup name="tfeatureAttributes">
  
    <anyAttribute processContents="skip"/>

  </attributeGroup>


  <!-- declaration of the NONTERMINAL FEATURE ATTRIBUTES;
       this group is unrestricted, but should be refined by a 
       specialised, corpus-dependent schema -->

  <attributeGroup name="ntfeatureAttributes">
  
    <anyAttribute processContents="skip"/>

  </attributeGroup>


  <!-- declaration of the EDGE-LABEL ATTRIBUTE;
       the label attribute is optional; this should be refined by a
       specialised, corpus-dependent schema -->

  <attributeGroup name="edgelabelAttribute">
  
    <attribute name="label" type="xsd:string" use="optional"/>    

  </attributeGroup>
    

  <!-- declaration of the SECONDARY-EDGE-LABEL ATTRIBUTE;
       the label attribute is optional; this should be refined by a
       specialised, corpus-dependent schema -->

  <attributeGroup name="secedgelabelAttribute">
  
    <attribute name="label" type="xsd:string" use="optional"/>    

  </attributeGroup>
 

  <!-- ======================================================
       ID and IDREF TYPE DECLARATIONS
       ====================================================== -->

  <!-- Even though XML Schema is a W3C Recommendation, schema
       support of XML parsers is still restricted. Using some
       parsers you might have problems with the ID and IDREF
       attributes in combination with an "anyAttribute"
       declaration. In this case, just modify the base type 
       of the following two declarations to "xsd:string".  -->


  <!-- declaration of idType -->

  <simpleType name="idType">

    <restriction base="xsd:ID"/>

  </simpleType>


  <!-- declaration of idrefType -->

  <simpleType name="idrefType">

    <restriction base="xsd:IDREF"/>

  </simpleType>


</schema>
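
If you follow the advice in the comment above and change the base types to xsd:string, the parser no longer checks that IDs are unique and that references point to existing nodes. A lightweight check of this kind can be approximated in a script. The following Python sketch is an illustration only: it inspects the id attributes of the <corpus>, <s>, <t> and <nt> elements and the reference attributes declared in the schema (root, idref, subgraph) within a single document. The function name check_references is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Elements whose id attribute is of type idType in the schema.
ID_ELEMENTS = {"corpus", "s", "t", "nt"}
# Elements with an attribute of type idrefType, and the name of that attribute.
IDREF_ATTRIBUTES = {
    "graph": "root",
    "edge": "idref",
    "secedge": "idref",
    "match": "subgraph",
    "variable": "idref",
}


def check_references(root):
    """Return a list of problems: missing or duplicate IDs, dangling references."""
    problems = []
    seen = set()
    for element in root.iter():
        if element.tag in ID_ELEMENTS:
            ident = element.get("id")
            if ident is None:
                problems.append(f"<{element.tag}> has no id")
            elif ident in seen:
                problems.append(f"duplicate id: {ident}")
            else:
                seen.add(ident)
    for element in root.iter():
        attribute = IDREF_ATTRIBUTES.get(element.tag)
        if attribute is not None and element.get(attribute) not in seen:
            problems.append(
                f"<{element.tag} {attribute}> has no target: {element.get(attribute)}"
            )
    return problems


GOOD = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Pierre" pos="NNP"/>
      <t id="s1_2" word="Vinken" pos="NNP"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="NP">
        <edge label="--" idref="s1_1"/>
        <edge label="--" idref="s1_2"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""
# Introduce two errors: a dangling edge target and a duplicate terminal ID.
BROKEN = GOOD.replace('idref="s1_2"', 'idref="s1_9"').replace(
    '<t id="s1_2"', '<t id="s1_1"')

good_problems = check_references(ET.fromstring(GOOD))
broken_problems = check_references(ET.fromstring(BROKEN))
print(good_problems)
for problem in broken_problems:
    print(problem)
```

The check works per document; for a corpus split into external subcorpus files, each file would have to be checked on its own.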