Annotation
Annotation guidelines
The TIGER project aims to produce a large syntactically annotated corpus of German newspaper text. To ensure a high-quality and theoretically well-founded annotation of the corpus, detailed annotation guidelines have been developed.
Annotation example
The archive samples.tgz contains a short extract from the corpus, consisting of the following files:
- sample1.export : Negra export format (Release 1)
- sample1.xml : TIGER-XML format (Release 1)
- sample2.export : Negra export format (Release 2)
- sample2.xml : TIGER-XML format (Release 2)
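As a rough orientation for readers who want to inspect the XML samples programmatically, the following sketch reads a TIGER-XML document with Python's standard library. The embedded fragment is a hand-made miniature that follows the general TIGER-XML layout (terminals, nonterminals, labelled edges); the sentence, the ids, and the helper name read_sentences are assumptions made for illustration and are not taken from sample1.xml or sample2.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document in the TIGER-XML layout. The sentence and
# its annotation are invented for illustration, not copied from the corpus.
SAMPLE = """\
<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Sie" pos="PPER"/>
          <t id="s1_2" word="lacht" pos="VVFIN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def read_sentences(xml_text):
    """Return one dict per sentence with its tokens and labelled edges."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Terminals carry the word form and its part-of-speech tag.
        tokens = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Each nonterminal lists its children as labelled edges.
        edges = [
            (nt.get("cat"), e.get("label"), e.get("idref"))
            for nt in s.iter("nt")
            for e in nt.iter("edge")
        ]
        sentences.append({"id": s.get("id"), "tokens": tokens, "edges": edges})
    return sentences


SENTENCES = read_sentences(SAMPLE)
```

To try this on the real samples, the text of sample1.xml or sample2.xml could be passed to read_sentences instead of SAMPLE, assuming the files follow the same element and attribute names.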
Here is the graphical representation of a single corpus sentence:
- JPG format (Release 1, Annotate)
- Postscript format (Release 1, Annotate)
- JPG format (Release 2, TIGERGraphViewer)
- PDF format (Release 2, TIGERGraphViewer)
Annotation approaches
The quality (in terms of consistency) and the speed of the manual annotation are improved with the help of automatic annotation tools. For the annotation of the TIGER corpus, we are using two different approaches:
- Annotate
The major part of the TIGER corpus annotation is carried out by means of the Annotate software. Annotate is a graphical tool for efficient semi-automatic annotation of corpus data. In the framework of the TIGER project, the tool includes a partial parser and a part-of-speech tagger for the automatic partial corpus annotation. Annotate was developed in the NEGRA project at the University of Saarbrücken. For more information about Annotate, see the Annotate homepage and the LREC'2000 paper by Brants/Plaehn (ps.gz, pdf).
- LFG Annotation
In parallel to the Annotate tool, a broad-coverage symbolic LFG grammar, developed in the Pargram project at the University of Stuttgart, is used for annotating the TIGER corpus. Annotation with the LFG grammar involves two steps, each of which is illustrated by examples:
- LFG parsing: First the TIGER corpus is parsed by the LFG grammar. The output of the LFG grammar is disambiguated semi-automatically.
- TIGER transfer: The selected output is then automatically converted to the TIGER export format.
LFG Parsing
This section gives a short introduction to LFG parsing and disambiguation. For a more detailed description, see the LINC'2000 paper by Dipper (ps.gz, pdf).
Parsing
The German grammar applied in parsing is a Lexical Functional Grammar (LFG) and was developed in the Pargram project, using the Xerox Linguistic Environment (XLE).
In the context of the TIGER project, we investigate the possibilities and limits of grammar-based treebanking. Currently, 35% of real newspaper text is successfully analyzed by the grammar.
The analysis that an LFG grammar yields for a given sentence consists of two representations: the constituent structure (c-structure) and the functional structure (f-structure). C-structure encodes information about morphology, constituency, and linear ordering. F-structure represents information about predicate-argument structure, modification, and features such as tense and mood.
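To make the two levels of representation more concrete, here is a minimal sketch of how a c-structure and an f-structure might be modelled as plain Python data. The toy sentence, the category names, and the feature names are assumptions chosen for illustration; they are not the output of the Pargram grammar or of XLE.

```python
# c-structure: a tree written as (category, child, child, ...), with the
# words as string leaves. It records constituency and linear order.
C_STRUCTURE = ("S", ("NP", ("PRON", "Sie")), ("VP", ("V", "lacht")))

# f-structure: an attribute-value matrix written as a nested dict. It records
# the predicate with its arguments plus features such as tense and mood.
# All feature names and values here are illustrative assumptions.
F_STRUCTURE = {
    "PRED": "lachen<SUBJ>",
    "TENSE": "present",
    "MOOD": "indicative",
    "SUBJ": {"PRED": "pro", "PERS": 3, "NUM": "sg"},
}


def terminal_yield(tree):
    """Read the words off a c-structure tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _category, *children = tree
    words = []
    for child in children:
        words.extend(terminal_yield(child))
    return words


WORDS = terminal_yield(C_STRUCTURE)
```

The word order is recoverable only from the tree, while the subject relation is stated only in the dict, which mirrors the division of labour described above.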
Disambiguation
Most of the sentences of the TIGER corpus are syntactically ambiguous. Hence the grammar output has to be disambiguated before being mapped to the TIGER format. We use two different methods for disambiguation.
Automatic disambiguation: XLE provides a (non-statistical) mechanism for suppressing certain ambiguities automatically. The mechanism consists of a constraint ranking scheme inspired by Optimality Theory (OT). Grammar rules and lexicon entries can be marked by so-called OT marks. Highly improbable or marked readings are filtered out, thus reducing the number of ambiguities the human disambiguator has to deal with.
Manual disambiguation: Remaining ambiguities must be resolved manually. XLE supports manual disambiguation by packing all different readings into one complex representation that can easily be browsed by the human annotator. An example can be found here.
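The idea behind ranking-based filtering can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not XLE's actual mechanism: it treats every OT mark as a dispreference mark, compares analyses by strict domination (the highest-ranked mark decides first), and uses invented mark names and readings. The function name ot_filter is likewise hypothetical.

```python
from collections import Counter


def ot_filter(analyses, ranking):
    """Keep the analyses whose violation profile is best under `ranking`.

    analyses: list of (name, marks) pairs, where marks is a list of mark names.
    ranking:  mark names ordered from most to least severe (an assumption of
              this sketch; real grammars define their own ranking).
    """
    if not analyses:
        return []

    def profile(marks):
        counts = Counter(marks)
        # Violation counts, most severe mark first. Tuples compare
        # lexicographically, so a single severe violation outweighs any
        # number of milder ones.
        return tuple(counts[m] for m in ranking)

    best = min(profile(marks) for _, marks in analyses)
    return [name for name, marks in analyses if profile(marks) == best]


RANKING = ["MarkedWordOrder", "RareAttachment"]  # invented mark names
ANALYSES = [
    ("reading_a", ["RareAttachment", "RareAttachment"]),
    ("reading_b", ["MarkedWordOrder"]),
    ("reading_c", ["RareAttachment"]),
    ("reading_d", ["RareAttachment"]),
]
SURVIVORS = ot_filter(ANALYSES, RANKING)
```

In this toy run, two readings share the best profile and both survive, which corresponds to the situation where the remaining ambiguity is handed to the human annotator.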
Without the OT filter mechanism, a sentence receives 35,577 analyses on average (median: 20). After OT filtering, the average number of analyses drops to 16.5 (median: 2).
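The large gap between the average and the median suggests a heavily skewed distribution, in which a few sentences with very many analyses pull the mean upwards. The following sketch shows the effect with invented counts; the numbers are purely illustrative and are not TIGER corpus data.

```python
from statistics import mean, median

# Invented numbers of analyses per sentence (not corpus data): most sentences
# have few readings, one sentence has an extremely large number.
counts = [1, 2, 4, 8, 20, 35, 60, 150, 1_000_000]

MEAN = mean(counts)      # dominated by the single extreme value
MEDIAN = median(counts)  # the middle value, unaffected by the outlier
```

For such data, the median is the better indicator of what an annotator typically faces, while the mean reflects the total amount of ambiguity in the material.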