Section: Introduction

1. Introduction

Just as there are various formats for representing linguistic corpora, there are also a number of formats for encoding syntactically annotated corpora: Penn Treebank, Susanne, NeGra, and several others, plus various kinds of parser or chunker output. Since applications like TIGERSearch cannot support all existing formats, it makes sense to define a single interface format for import and export. This format should be general enough to encode as many existing formats as possible. An XML-based approach is also advantageous because it solves the problem of Unicode character encoding.

The TIGER-XML format has been designed as such an interface format. It is an XML-based equivalent of the corpus definition sublanguage of the TIGER description language (cf. chapter III). In addition to corpus definitions, the TIGER-XML format can also represent query results.

Any corpus to be processed by the TIGERSearch tool has to be encoded in the TIGER-XML format. For convenience, we have implemented corpus filters (i.e. converters to TIGER-XML) for many popular treebank and parser output formats, such as the bracketing format, the Penn Treebank format, and the NeGra format (cf. subsection 3.5, chapter VI for a list of implemented filters).

Why should you read this chapter? First of all, you might have corpora encoded in formats for which no corpus filter exists. In that case, knowledge of TIGER-XML is essential for converting them. In addition, TIGERSearch supports exporting the matches of a query in the TIGER-XML format. If you want to transform this XML output, e.g. with XSLT stylesheets, you will have to know how the XML document is structured.
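To give a first impression of what processing TIGER-XML outside TIGERSearch can look like, here is a minimal sketch in Python using only the standard library. The miniature document is made up for illustration, and the element and attribute names it uses (s, graph, terminals, t, word, pos, nonterminals, nt, edge) are assumptions at this point; the following sections and the XML schema are the authoritative description. The helper name tokens_by_sentence is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document. Element and attribute names are assumptions
# for illustration; consult the schema description for the actual format.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="the" pos="DT"/>
          <t id="s1_2" word="cat" pos="NN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="NP">
            <edge label="DET" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def tokens_by_sentence(xml_text):
    """Map each sentence id to its list of (word, pos) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for sentence in root.iter("s"):
        # Collect the terminal nodes of this sentence in document order.
        result[sentence.get("id")] = [
            (t.get("word"), t.get("pos")) for t in sentence.iter("t")
        ]
    return result


if __name__ == "__main__":
    for sentence_id, tokens in tokens_by_sentence(SAMPLE).items():
        print(sentence_id, tokens)
```

An XSLT stylesheet would address the same elements through XPath expressions, so the same knowledge of the document structure is needed in either case.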

Section 2 walks through an example of the TIGER-XML format. Section 3 presents a real-life example. Finally, section 4 describes the XML schema used to validate TIGER-XML documents.

If you are interested in the motivations behind the design of the TIGER-XML format, have a look at Wolfgang Lezius' Ph.D. thesis [Lezius2002] (in German).