3.5 List of implemented import filters

Please note: Corpora processed by the text-based import filters of TIGERRegistry (with the exception of some XML-based filters) must be encoded in ISO-Latin-1. If a corpus needs characters outside the ISO-Latin-1 character set, use the following Unicode escape convention: prefix the hexadecimal Unicode code point of the character with the string \u. For example, the Greek capital letter Omega, whose hexadecimal code point is 03a9, has to be encoded as \u03a9.
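The convention above can be illustrated with a minimal Python sketch. The function name escape_non_latin1 is hypothetical and not part of TIGERRegistry. The sketch assumes four-digit lowercase hexadecimal escapes, as in the \u03a9 example, and it rejects code points above FFFF because the convention described here does not say how they should be written.

```python
def escape_non_latin1(text: str) -> str:
    """Replace every character outside ISO-Latin-1 with a \\uXXXX escape.

    Hypothetical helper, not part of TIGERRegistry.
    """
    parts = []
    for ch in text:
        code = ord(ch)
        if code <= 0xFF:
            # The character exists in ISO-Latin-1: keep it unchanged.
            parts.append(ch)
        elif code <= 0xFFFF:
            # Assumption: four lowercase hex digits, as in \u03a9.
            parts.append("\\u%04x" % code)
        else:
            # The convention does not cover code points above FFFF.
            raise ValueError("code point U+%X needs more than four hex digits" % code)
    return "".join(parts)


# Example call: Greek capital Omega followed by plain ASCII text.
escaped = escape_non_latin1("\u03a9 = Omega")
```

The result contains only ISO-Latin-1 characters, so it can be written to a file with escaped.encode("latin-1").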

The following import filters have been implemented:

General bracketing formats

general () filter

This filter should work with bracketing-style corpora that use parentheses for structuring. It generates cat, pos and word features.

general [] filter

This filter should work with bracketing-style corpora that use square brackets for structuring. It generates cat, pos and word features.

PennTreebank formats

general PennTreebank filter

This filter should work with UPenn-style corpora. Syntactic functions are modelled as edge labels, and traces are modelled as secondary edges.

This filter has been tested with the Wall Street Journal and Brown Corpus (Penn Treebank, bracketing version in the mrg/ subdirectory), the Penn-Helsinki Parsed Corpus of Middle English, and the Chinese Treebank.

The Chinese Treebank has to be converted to the Unicode escape convention described at the beginning of this section first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit, which you can download at http://java.sun.com. For the Chinese Treebank, the command line is the following:

native2ascii -encoding GB2312 chinese.txt unicodeoutput.txt
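Where native2ascii is not at hand, a comparable conversion can be sketched in Python. This is an illustration under stated assumptions, not a replacement endorsed by TIGERRegistry: the names convert_bytes and convert_file are hypothetical, and the sketch leaves ISO-Latin-1 characters unescaped, which may differ from the output of native2ascii.

```python
import re

# Matches every character that cannot be represented in ISO-Latin-1.
_NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def convert_bytes(data: bytes, source_encoding: str) -> bytes:
    """Decode data from source_encoding and return ISO-Latin-1 bytes
    in which all other characters are written as \\uXXXX escapes.

    Hypothetical helper. Code points above FFFF would yield more than
    four hex digits; the encodings discussed here do not produce them.
    """
    text = data.decode(source_encoding)
    escaped = _NON_LATIN1.sub(lambda m: "\\u%04x" % ord(m.group()), text)
    return escaped.encode("latin-1")


def convert_file(src_path: str, dst_path: str, source_encoding: str) -> None:
    """File-level wrapper around convert_bytes (hypothetical helper)."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(convert_bytes(data, source_encoding))
```

With these helpers, the conversion above would correspond to convert_file("chinese.txt", "unicodeoutput.txt", "gb2312"), where "gb2312" is Python's codec name for the GB2312 encoding.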

ATIS corpus filter

This is a special filter for the ATIS corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It handles the different pos and word notation and other corpus-specific differences.

SWITCHBOARD corpus filter

This is a special filter for the SWITCHBOARD corpus format only (Penn Treebank - bracketing version in mrg/ subdirectory). It skips code and interjection sections.

Korean treebank filter

This is a special filter for the Korean Treebank corpus format only. The Korean Treebank has to be converted to the Unicode escape convention described at the beginning of this section first. The command line tool native2ascii can be used for this purpose. It is included in Sun's Java Development Kit, which you can download at http://java.sun.com. For the Korean Treebank, the command line is the following:

native2ascii -encoding KSC5601 korean.txt unicodeoutput.txt

Treebanks converted by negra-topenn

This is a special filter for corpora that have been generated using the negra-topenn command line tool. This tool is part of the Negra Corpus deliverable. It has been developed to linguistically transform treebanks which have been annotated according to the Negra annotation scheme to the UPenn style format.

Susanne and Christine

Susanne corpus filter

This is a special filter for the Susanne corpus format only.

Christine corpus filter

This is a special filter for the Christine corpus format only.

Negra format

general Negra format filter

This filter should work with any corpus encoded according to the Negra format, Version 3 or Version 4. It has been tested for the Negra Corpus, the Negra 2000 Corpus, the VerbMobil Treebank, and the TIGER Corpus Release 1.

IMS tools

LoPar format filter

LoPar is an implementation of a parser for head-lexicalised probabilistic context-free grammars. Grammars are currently available for German and English. LoPar has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/LoPar-en.html).

The LoPar format filter is able to process a special output format of LoPar. This output can be generated using the following LoPar command line:

cat input.txt | lopar -in <model> -stems -heads -viterbi -viterbi-probs -tgrep > output.txt

The input file must be a plain text file in one-word-per-line format.
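A one-word-per-line file can be produced from running text with a small Python sketch such as the following. The function name to_one_word_per_line is hypothetical, and splitting on whitespace is an assumption: the text is expected to be tokenized already (for example, punctuation separated from words), since the tokenization LoPar requires is not specified here.

```python
def to_one_word_per_line(text: str) -> str:
    """Return the whitespace-separated tokens of text, one per line.

    Hypothetical helper; assumes the input is already tokenized.
    """
    tokens = text.split()
    # End the output with a newline unless there are no tokens at all.
    return "\n".join(tokens) + "\n" if tokens else ""


# Example call on a short pre-tokenized sentence.
result = to_one_word_per_line("The cat sleeps .")
```

The returned string can then be written to input.txt and passed to the LoPar command line shown above.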

TreeTagger chunking filter

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It also comprises a chunker that is based on the tagging output. Chunking modules are currently available for German and English. The TreeTagger has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The TreeTagger chunking filter is able to process the XML output of the chunker.

YAC format filter

The chunker YAC (Yet Another Chunker) is a rule-based chunker for German. It has been developed at IMS, University of Stuttgart (cf. http://www.ims.uni-stuttgart.de/projekte/corplex/). The YAC format filter is able to process the XML-based YAC output.

Other formats

DEREKO format filter

This filter should work with corpora encoded according to the DEREKO corpus format. The DEREKO corpus format has been developed within the DEREKO project.