TIGER Corpus

The TIGER Corpus comprises approximately 900,000 tokens (50,000 sentences) of German newspaper text from the Frankfurter Rundschau. The corpus was semi-automatically annotated with POS tags and syntactic structure. It also contains morphological and lemma information for terminal nodes.

The TIGER Corpus (versions 2.1 and 2.2) consists of approximately 900,000 tokens (50,000 sentences) of German newspaper text taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. In addition, it contains morphological and lemma information for terminal nodes. For details, see the annotation page. Version 2.2 is a cleaned-up version of release 2.1.

The TIGER Corpus is delivered in two treebank formats:

Both versions of the corpus can be processed by the treebank query tool TIGERSearch, which has also been developed within the TIGER project.

Version 1 of the TIGER Corpus is still available as well. It consists of approximately 700,000 tokens (40,000 sentences). Unlike version 2, it does not include the morphological and lemma information.

In addition to the TIGER Corpus proper, several resources derived from it are available. These are:

  • TIGER 2.2-doc, which includes a full mapping of sentences to documents,
  • TIGER Corpus 2.2 converted into CoNLL-2009 dependency trees (by the tool Tiger2Dep),
  • the TIGER 10.000 MOD Bank, which includes the first 10,000 sentences from the TIGER Corpus 2.1, where the original POS tags have been replaced by new tags that provide a more fine-grained analysis of modification in German,
  • the TiGer Dependency Bank, which is a dependency-based gold standard for (hand-crafted) German parsers for the TIGER Corpus sentences 8,001 through 10,000,
  • the TIGER 700 RMRS Bank,
  • the TIGER data sets for the CoNLL-X shared task, and
  • dependency triple representations for (almost) the entire treebank, which, like the TiGer DB structures, are intended for evaluation purposes.

References

  • Brants, Sabine, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic Interpretation of a German Corpus. Journal of Language and Computation, 2004 (2), 597-620.
  • TIGER Project. 2003. TIGER Annotationsschema. Manuscript. Universität des Saarlandes, Universität Stuttgart, Universität Potsdam. July 2003.
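
For readers who want to work with the CoNLL-2009 conversion listed among the derived resources above, the following is a minimal Python sketch of reading that format and deriving (head, relation, dependent) triples from it. It assumes the standard tab-separated CoNLL-2009 column layout with blank lines between sentences. The names `read_conll09` and `dependency_triples` are hypothetical, and the sample sentence and its labels are invented for illustration, not taken from the corpus or from Tiger2Dep output.

```python
from typing import Iterator

# The 14 fixed columns of the CoNLL-2009 format; any further columns
# hold per-predicate argument labels (APREDs).
CONLL09_COLUMNS = [
    "id", "form", "lemma", "plemma", "pos", "ppos", "feat", "pfeat",
    "head", "phead", "deprel", "pdeprel", "fillpred", "pred",
]

# Invented example sentence (NOT from the TIGER Corpus); labels are illustrative.
SAMPLE_ROWS = [
    ("1", "Der", "der", "der", "ART", "ART", "_", "_", "2", "2", "NK", "NK", "_", "_"),
    ("2", "Hund", "Hund", "Hund", "NN", "NN", "_", "_", "3", "3", "SB", "SB", "_", "_"),
    ("3", "schläft", "schlafen", "schlafen", "VVFIN", "VVFIN", "_", "_", "0", "0", "--", "--", "_", "_"),
    ("4", ".", "--", "--", "$.", "$.", "_", "_", "3", "3", "--", "--", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"


def read_conll09(lines) -> Iterator[list[dict]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        token = dict(zip(CONLL09_COLUMNS, fields))
        token["apreds"] = fields[len(CONLL09_COLUMNS):]
        sentence.append(token)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def dependency_triples(sentence):
    """Return (head form, relation, dependent form) for every token."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    forms["0"] = "ROOT"  # head index 0 marks the root of the sentence
    return [
        (forms.get(tok["head"], "?"), tok["deprel"], tok["form"])
        for tok in sentence
    ]


if __name__ == "__main__":
    for sent in read_conll09(SAMPLE.splitlines()):
        for triple in dependency_triples(sent):
            print(triple)
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`. The sketch uses the gold `head` and `deprel` columns. The predicted columns (`phead`, `pdeprel`) can be substituted when evaluating parser output.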

1. Research and evaluation purposes

For research and evaluation purposes, the TIGER Corpus can be downloaded free of charge. However, we ask you to accept the TIGER Corpus license agreement for non-commercial use. The "Accept license terms" button at the bottom of the license will then take you to the download page.

2. Commercial purposes

If you are interested in a commercial license for the TIGER Corpus, please contact the secretary of Prof. Hans Uszkoreit's chair at Saarland University at sek-hu AT coli DOT uni-saarland DOT de.



CLARIN-D Stuttgart (clarin AT ims.uni-stuttgart.de)