The statistical grammar model, based on the trained version of the German
Head-Lexicalised Context-Free Grammar, is a source of lexical
information.
Frequency Information
The German grammar was trained on 35 million words of a large German
newspaper corpus. The resulting statistical grammar model contains
frequency information on word forms, part-of-speech tags and lemmas in
the training corpus.
Verb Subcategorisation
The statistical grammar model provides lexical information, with
emphasis on verb entries. We induced frequency and probability
distributions for 16,946 verbs concerning subcategorisation frames,
argument selection, verb mode and auxiliary choice.
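As a rough illustration of how frequency and probability distributions relate, the following sketch normalises raw frame counts for a single verb into relative frequencies. It is an assumption for illustration only, not the procedure used to build the lexicon: the function name, the frame labels and the counts are all hypothetical.

```python
from collections import Counter


def frame_distribution(frame_counts):
    """Turn raw frame frequencies for one verb into relative frequencies."""
    total = sum(frame_counts.values())
    if total == 0:
        return {}
    return {frame: count / total for frame, count in frame_counts.items()}


# Hypothetical counts for a single verb; labels and numbers are invented.
toy_counts = Counter({"transitive": 60, "ditransitive": 30, "intransitive": 10})
toy_dist = frame_distribution(toy_counts)

for frame, prob in sorted(toy_dist.items(), key=lambda item: -item[1]):
    print(f"{frame}\t{prob:.2f}")
```

The same normalisation could in principle be applied to any of the count-based properties listed above, such as auxiliary choice.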
Viterbi Parses
On the basis of the statistical grammar model, we parsed 50 million
words of newspaper data and determined the most probable (Viterbi)
parse tree for each sentence (example from the Donaukurier).
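To make the notion of a most probable parse concrete, here is a minimal sketch of Viterbi parsing with a CKY chart over a toy probabilistic context-free grammar. It is deliberately simplified: the grammar is not head-lexicalised, and the rules, probabilities and the function name viterbi_parse are invented for illustration rather than taken from the German grammar.

```python
import math

# Toy PCFG in Chomsky normal form; rules and probabilities are invented.
LEXICAL = {  # word -> list of (category, probability)
    "der": [("Det", 1.0)],
    "Hund": [("N", 0.5)],
    "Mann": [("N", 0.5)],
    "schläft": [("VP", 1.0)],
}
BINARY = [  # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("NP", "Det", "N", 1.0),
]


def viterbi_parse(words, start="S"):
    """Return (log probability, tree) of the most probable parse, or None."""
    n = len(words)
    if n == 0:
        return None
    # best[(i, j)] maps a category to (log prob, tree) for the span words[i:j].
    best = {}
    for i, word in enumerate(words):
        cell = {}
        for cat, prob in LEXICAL.get(word, []):
            score = math.log(prob)
            if cat not in cell or score > cell[cat][0]:
                cell[cat] = (score, (cat, word))
        best[(i, i + 1)] = cell
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # every split point of the span
                left, right = best[(i, k)], best[(k, j)]
                for parent, lc, rc, prob in BINARY:
                    if lc in left and rc in right:
                        score = math.log(prob) + left[lc][0] + right[rc][0]
                        # Keep only the highest-scoring analysis per category.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, left[lc][1], right[rc][1]))
            best[(i, j)] = cell
    return best[(0, n)].get(start)


result = viterbi_parse(["der", "Hund", "schläft"])
print(result)
```

Each chart cell keeps only the best-scoring analysis per category, which is what distinguishes Viterbi parsing from summing over all parses.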
Documentation
The induction of the subcategorisation
lexicon is described by Schulte im Walde (LREC, 2002). The induction
of a database of verb and noun collocations is described in Schulte im
Walde (COMPLEX, 2003). Documentation of the statistical grammar
framework, the grammar code, training and evaluation can be downloaded
as Chapter 3 of Schulte im Walde (PhD thesis, 2003).
The data is freely available for education, research and other
non-commercial purposes. Please contact Sabine Schulte im
Walde for more information.