1.2 Corpus formats

TIGERSearch supports many existing corpus encoding formats through a two-step approach: a corpus is first converted to a common XML representation and then indexed. Since the data model supported by TIGERSearch is more general and more extensible than the data models of most available formats (Penn Treebank format, Negra format, etc.), we have developed the TIGER-XML format, which maps this data model to XML. The TIGER-XML format is described in chapter V. Corpora to be indexed with the TIGERRegistry tool must be encoded in TIGER-XML.

To support as many treebank formats as possible, we have also implemented import filters (i.e. converters to TIGER-XML) for many popular formats, such as the Penn Treebank format and the Negra format, as well as for some parser output formats. Indexing TIGER-XML corpora and indexing corpora through an import filter are described in subsections 3.2 and 3.3, respectively. A list of the implemented import filters can be found in subsection 3.5.
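
To illustrate what an import filter does conceptually, the following Python sketch converts one Penn-Treebank-style bracketed tree into a TIGER-XML-like sentence element. It is not the implementation of the import filters shipped with TIGERRegistry. The function names (parse_brackets, to_tiger_sentence), the ID scheme (terminals numbered from 1, nonterminals from 500) and the "--" placeholder edge label are assumptions made for this example, and the element and attribute names should be checked against the format description in chapter V.

```python
import xml.etree.ElementTree as ET


def parse_brackets(text):
    """Parse one bracketed tree into nested (label, content) tuples.

    For a terminal, content is the word (a str); for a nonterminal,
    content is a list of child nodes. Hypothetical helper, not part
    of TIGERSearch.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected opening bracket"
        label = tokens[pos + 1]
        pos += 2
        if tokens[pos] != "(":
            # Pre-terminal: (POS word)
            word = tokens[pos]
            pos += 2  # skip the word and the closing bracket
            return (label, word)
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1  # skip the closing bracket
        return (label, children)

    return node()


def to_tiger_sentence(tree, sent_id="s1"):
    """Build a TIGER-XML-like <s> element from a parsed tree (sketch)."""
    s = ET.Element("s", id=sent_id)
    graph = ET.SubElement(s, "graph")
    terminals = ET.SubElement(graph, "terminals")
    nonterminals = ET.SubElement(graph, "nonterminals")
    # Assumed ID scheme: terminals from 1, nonterminals from 500.
    counters = {"t": 0, "nt": 500}

    def visit(node):
        label, content = node
        if isinstance(content, str):
            counters["t"] += 1
            node_id = f"{sent_id}_{counters['t']}"
            ET.SubElement(terminals, "t", id=node_id, word=content, pos=label)
            return node_id
        child_ids = [visit(child) for child in content]
        node_id = f"{sent_id}_{counters['nt']}"
        counters["nt"] += 1
        nt = ET.SubElement(nonterminals, "nt", id=node_id, cat=label)
        for child_id in child_ids:
            # Penn Treebank edges carry no labels; "--" is a placeholder.
            ET.SubElement(nt, "edge", label="--", idref=child_id)
        return node_id

    graph.set("root", visit(tree))
    return s


if __name__ == "__main__":
    tree = parse_brackets("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
    print(ET.tostring(to_tiger_sentence(tree), encoding="unicode"))
```

The sketch separates parsing the source format from generating XML, which mirrors the two-step idea: only the first function depends on the input format, while the second targets the common representation that is then indexed.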