University of Stuttgart
Institute for Natural Language Processing

Lexical Information for German

 
 

The statistical grammar model, obtained by training the German Head-Lexicalised Context-Free Grammar, serves as a source of lexical information.

Frequency Information

The German grammar was trained on 35 million words of a large German newspaper corpus. The resulting statistical grammar model contains frequency information on word forms, part-of-speech tags and lemmas in the training corpus.

Verb Subcategorisation

The statistical grammar model provides lexical information, with emphasis on verb entries. For 16,946 verbs, we induced frequency and probability distributions over subcategorisation frames, argument selection, verb mode and auxiliary choice.
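
As a rough illustration of how a probability distribution over subcategorisation frames relates to frame frequencies, the following minimal sketch normalises per-verb counts into relative frequencies. It is not the induction procedure of the grammar model itself: the function name, the frame labels and the counts are all made up for illustration, and the lexicon's actual data format is not assumed.

```python
from collections import Counter


def frame_distribution(frame_counts):
    """Turn raw frame frequencies for one verb into relative frequencies."""
    total = sum(frame_counts.values())
    if total == 0:
        return {}
    return {frame: count / total for frame, count in frame_counts.items()}


# Made-up counts for illustration only; not taken from the actual lexicon.
toy_counts = {"glauben": Counter({"n": 10, "na": 30, "ns-dass": 60})}

# One probability distribution over frames per verb.
toy_probs = {verb: frame_distribution(counts) for verb, counts in toy_counts.items()}
```

Plain relative frequencies are the simplest possible estimate; a real lexicon may well apply smoothing or cut-offs on top of the raw counts.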

Viterbi Parses

On the basis of the statistical grammar model we parsed 50 million words of newspaper data and determined their most probable parse trees. (example from Donaukurier)
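
To make "most probable parse tree" concrete, here is a minimal sketch of Viterbi parsing with a CKY chart. It assumes a plain, unlexicalised probabilistic context-free grammar in Chomsky normal form, which is a strong simplification of the head-lexicalised model described on this page. The toy rules, their probabilities and the function name `viterbi_parse` are hypothetical.

```python
import math


def viterbi_parse(words, lexical, binary, start="S"):
    """Return (log_prob, tree) of the most probable parse, or None.

    lexical maps (tag, word) to a probability,
    binary maps (parent, left_child, right_child) to a probability.
    """
    n = len(words)
    # best[(i, j)][label] = (log probability, tree) for the span words[i:j]
    best = {}
    for i, word in enumerate(words):
        cell = {}
        for (tag, entry), prob in lexical.items():
            if entry == word:
                cell[tag] = (math.log(prob), (tag, word))
        best[(i, i + 1)] = cell
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for (parent, lc, rc), prob in binary.items():
                    if lc in left and rc in right:
                        score = math.log(prob) + left[lc][0] + right[rc][0]
                        # Keep only the highest-scoring analysis per label.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, left[lc][1], right[rc][1]))
            best[(i, j)] = cell
    return best[(0, n)].get(start) if n else None


# Toy grammar with invented probabilities, for illustration only.
toy_lexical = {("NP", "sie"): 0.5, ("NP", "Zeitungen"): 0.5, ("V", "liest"): 1.0}
toy_binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

result = viterbi_parse(["sie", "liest", "Zeitungen"], toy_lexical, toy_binary)
```

Working in log probabilities avoids numerical underflow on long sentences, which matters when parsing corpus-scale data.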


Documentation

The induction of the subcategorisation lexicon is described in Schulte im Walde (LREC, 2002). The induction of a database of verb and noun collocations is described in Schulte im Walde (COMPLEX, 2003). Documentation of the statistical grammar framework, the grammar code, training and evaluation is available for download as chapter 3 of Schulte im Walde (PhD thesis, 2003).



The data is freely available for education, research and other non-commercial purposes. Please contact Sabine Schulte im Walde for more information.