Empirical Lexical Information induced from Lexicalised PCFGs

Type: Lexicon
Author: Sabine Schulte im Walde
Description: Head-Lexicalised Probabilistic Context-Free Grammars (HeadLex-PCFGs) are a lexicalised extension of PCFGs that incorporates lexical heads into the grammar rules, cf. Charniak (1997) and Carroll and Rooth (1998). At the core of a HeadLex-PCFG is a context-free grammar whose rules mark the head among their children. The parameters of the probabilistic version of this grammar (for the unlexicalised PCFG, for a lexicalisation bootstrapping step, and for the lexicalised HeadLex-PCFG) are then estimated in an unsupervised training procedure using the Expectation-Maximization algorithm (Baum, 1972). The algorithm iteratively improves the model parameters by alternating between two steps: assessing expected frequencies under the current model, and re-estimating probabilities from those frequencies.
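
The alternation between assessing frequencies and estimating probabilities can be illustrated with a minimal sketch. It is not the LoPar training procedure: it assumes a toy unlexicalised grammar, a made-up two-sentence corpus, and candidate parses that are enumerated explicitly instead of being summed over with the inside-outside computation. The function name `em_step` and all rules below are hypothetical.

```python
from collections import defaultdict
from math import prod

def em_step(corpus, probs):
    """One EM iteration over explicitly enumerated parses (toy setting).

    corpus: list of sentences, each given as a list of candidate parses;
            a parse is a list of (lhs, rhs) rules.
    probs:  dict mapping each rule to its current probability.
    """
    expected = defaultdict(float)
    for parses in corpus:
        # E-step: weight every candidate parse by its probability under
        # the current model and normalise within the sentence.
        weights = [prod(probs[rule] for rule in parse) for parse in parses]
        total = sum(weights)
        for parse, weight in zip(parses, weights):
            for rule in parse:
                expected[rule] += weight / total  # expected rule frequency
    # M-step: relative frequency of each rule among rules with the same LHS.
    lhs_totals = defaultdict(float)
    for (lhs, _), count in expected.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in expected.items()}

# Hypothetical rules for a PP-attachment ambiguity.
VP_PP = ("VP", "V NP PP")   # PP attaches to the verb phrase
VP_NP = ("VP", "V NP")
NP_N = ("NP", "N")
NP_PP = ("NP", "N PP")      # PP attaches to the noun phrase

corpus = [
    [[VP_PP, NP_N], [VP_NP, NP_PP]],  # ambiguous sentence: two candidate parses
    [[VP_PP, NP_N]],                  # unambiguous sentence: one parse
]

uniform = {VP_PP: 0.5, VP_NP: 0.5, NP_N: 0.5, NP_PP: 0.5}
probs = dict(uniform)
for _ in range(5):
    probs = em_step(corpus, probs)

for rule, p in sorted(probs.items()):
    print(rule, round(p, 4))
```

In this toy setting the unambiguous sentence pulls probability mass towards the verb attachment: the first update moves `VP -> V NP PP` from 0.5 to 0.75, and later iterations continue in the same direction.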

We used the statistical parser LoPar to perform the parameter training. The trained grammar model provides lexicalised rules and syntax-semantics head-head co-occurrences, as an empirical resource for inducing quantitative lexical properties at the syntax-semantics interface. The lexical information can be used for lexical acquisition and modeling linguistic phenomena. We provide lexical information for German and for English.

The German HeadLex-PCFG was trained on 35 million words of the Huge German Corpus (HGC), a collection of newspaper corpora from the 1990s. We provide the unlexicalised and lexicalised grammar files for parsing, empirical word frequencies, and lexical data on various linguistic phenomena.

The English HeadLex-PCFG was trained on approximately half of the British National Corpus (BNC), about 50 million words. The resulting grammar model was then used to obtain Viterbi parses for the whole corpus (117 million words). From the Viterbi parses we extracted lexical information about verbs, their subcategorisation frames, and their arguments.
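
As a rough illustration of how extracted verb-frame observations can be turned into quantitative lexical information, the sketch below counts frames per verb and normalises the counts into relative frequencies. The input format (a list of verb-frame pairs), the frame labels, and the function name `frame_distributions` are assumptions made for this example; they do not describe the format of the distributed files.

```python
from collections import Counter, defaultdict

def frame_distributions(observations):
    """Relative frequency of each subcategorisation frame per verb.

    observations: iterable of (verb, frame) pairs, e.g. one pair per
    verb occurrence in a parsed corpus (hypothetical format).
    """
    counts = defaultdict(Counter)
    for verb, frame in observations:
        counts[verb][frame] += 1
    distributions = {}
    for verb, frame_counts in counts.items():
        total = sum(frame_counts.values())
        distributions[verb] = {f: c / total for f, c in frame_counts.items()}
    return distributions

# Made-up observations with made-up frame labels.
observations = [
    ("give", "NP NP"),
    ("give", "NP PP"),
    ("give", "NP NP"),
    ("sleep", "intrans"),
]

for verb, dist in frame_distributions(observations).items():
    print(verb, dist)
```
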

Reference: Sabine Schulte im Walde (2003). Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart. Published as AIMS Report 9(2).

Sabine Schulte im Walde (1998). Automatic Semantic Classification of Verbs According to their Alternation Behaviour. Diploma thesis (Diplomarbeit), Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Download: The lexical information is maintained here.
