University of Stuttgart
Institute for Natural Language Processing

Semantically Annotated Lexica

 
 

An important challenge in computational linguistics is the construction of large-scale computational lexicons for the many natural languages for which very large samples of language use are now available. Most approaches require a fixed taxonomy of semantic relations as a prerequisite. This is a problem because (i) entailment hierarchies are presently available for only a few languages, and (ii) we regard it as an open question whether, and to what degree, existing designs for lexical hierarchies are appropriate for representing lexical meaning. Both considerations suggest the relevance of inductive and experimental approaches to the construction of lexicons with semantic information. In the following papers

  • Inducing a Semantically Annotated Lexicon via EM-based Clustering. Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. In 37th Annual Meeting of the ACL, 1999, Maryland. (.ps/.ps.gz)

  • EM-Based Clustering for NLP Applications. Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. In Inducing Lexicons with the EM Algorithm, AIMS Report 4(3), 1998, IMS, Universität Stuttgart. 97-124. (.ps/.ps.gz)

we present a method for the automatic induction of semantically annotated subcategorization frames from unannotated corpora. We use a statistical subcat-induction system that estimates probability distributions and corpus frequencies for pairs of a head and a subcat frame. The statistical parser can also collect frequencies for the nominal fillers of slots in a subcat frame. The induction of labels for slots in a frame is based on estimating a probability distribution over tuples consisting of a class label, a selecting head, a grammatical relation, and a filler head. The class label is treated as hidden data in the EM framework for statistical estimation.
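To make the role of the hidden class label concrete, the following is a minimal Python sketch of EM estimation for a latent class model over head-filler pairs. It is an illustration under stated assumptions, not the implementation used in the papers: it assumes the joint probability factorizes as p(c) · p(head | c) · p(filler | c), it folds the grammatical relation into the head key (for example "eat:obj"), and the function name `em_latent_classes`, its parameters, and the toy counts are all hypothetical.

```python
import random
from collections import defaultdict


def em_latent_classes(pair_counts, n_classes, n_iters=50, seed=0):
    """Fit p(c), p(head|c), p(filler|c) by EM; the class c is never observed.

    pair_counts maps (head, filler) -> observed frequency. The factorization
    p(c) * p(head|c) * p(filler|c) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    heads = sorted({h for h, _ in pair_counts})
    fillers = sorted({f for _, f in pair_counts})

    def random_dist(keys):
        # Strictly positive random starting point, normalized to sum to 1.
        weights = [rng.random() + 0.1 for _ in keys]
        total = sum(weights)
        return {k: w / total for k, w in zip(keys, weights)}

    p_c = [1.0 / n_classes] * n_classes
    p_h = [random_dist(heads) for _ in range(n_classes)]
    p_f = [random_dist(fillers) for _ in range(n_classes)]

    for _ in range(n_iters):
        c_count = [0.0] * n_classes
        h_count = [defaultdict(float) for _ in range(n_classes)]
        f_count = [defaultdict(float) for _ in range(n_classes)]

        # E-step: distribute each observed pair over classes by its posterior.
        for (h, f), freq in pair_counts.items():
            joint = [p_c[c] * p_h[c][h] * p_f[c][f] for c in range(n_classes)]
            z = sum(joint)
            if z == 0.0:
                continue
            for c in range(n_classes):
                expected = freq * joint[c] / z
                c_count[c] += expected
                h_count[c][h] += expected
                f_count[c][f] += expected

        # M-step: renormalize expected counts into new parameter estimates.
        total = sum(c_count)
        if total == 0.0:
            break
        p_c = [c_count[c] / total for c in range(n_classes)]
        for c in range(n_classes):
            if c_count[c] > 0.0:
                p_h[c] = {h: h_count[c].get(h, 0.0) / c_count[c] for h in heads}
                p_f[c] = {f: f_count[c].get(f, 0.0) / c_count[c] for f in fillers}

    return p_c, p_h, p_f


def class_posterior(model, head, filler):
    """Return p(c | head, filler) for every class c under a fitted model."""
    p_c, p_h, p_f = model
    joint = [p_c[c] * p_h[c][head] * p_f[c][filler] for c in range(len(p_c))]
    z = sum(joint)
    return [j / z for j in joint]


# Toy counts, invented for illustration only.
toy_counts = {
    ("eat:obj", "apple"): 5,
    ("eat:obj", "bread"): 4,
    ("drink:obj", "water"): 6,
    ("drink:obj", "tea"): 3,
}
toy_model = em_latent_classes(toy_counts, n_classes=2)
toy_posterior = class_posterior(toy_model, "eat:obj", "apple")
```

A slot in a frame could then be labeled with the class that has the highest posterior for the observed head and filler. Because EM only finds a local optimum, the classes obtained depend on the random initialization.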

In the following, we report results on experiments with observations derived from large English and German corpora:
  • English
    Experiments with the British National Corpus (1,280,715 tokens of verb-noun pairs):
    • Latent Semantic Class Model (.ps.gz)
    • Semantically annotated lexicon of intransitive and transitive verbs (.ps.gz, 983 pages)

  • German
    Experiments with the Huge German Corpus (418,290 tokens of verb-noun and adj-noun pairs):
    • Latent Semantic Class Model (.ps.gz)
    • Semantically annotated lexicon of intransitive and transitive verbs (.ps.gz, 939 pages)

Please contact Stefan Riezler or Detlef Prescher for more information.