IMS Textcorpora and Lexicon Group

The Textcorpora and Lexicon Group was a research group at IMS that brought together researchers from different projects developing lexicons, corpora, and tools for working with them.

The major focus of the Textcorpora and Lexicon Group at the IMS is the creation of large-scale, high-quality lexicons for natural language applications. 'Large scale' and 'high quality' can only be achieved together if appropriate engineering methods are applied. We therefore use text retrieval tools and information extraction methods specialized for lexicography. This approach is usually called 'corpus-based lexicography'. ('Corpus' means 'text corpus', i.e. a large collection of texts.)

In recent years, we have developed the following linguistic resources and tools:

Lexicons

We have built IMSLex, a lexicon for German with morphosyntactic and subcategorization information. XML and relational database technology are used for efficient storage and manipulation. Manual maintenance work is done with a Java-based graphical user interface.

Tools for automatic text analysis and corpus annotation

For the annotation of part-of-speech information and for lemmatization, we use the TreeTagger. Currently, it is available with 'parameter sets' for English, German, French, and Italian. For syntactic analysis (automatic annotation with syntactic structures), a range of tools is available, e.g. the stochastic parser LoPar, the LFG grammar, and the YAC system.
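To illustrate what part-of-speech and lemma annotation looks like in practice, the following minimal sketch reads tagger output in a three-column, tab-separated layout (token, tag, lemma), which is the form TreeTagger emits when token and lemma output are enabled. The sample lines are hand-written for illustration, and the names `Token` and `parse_tagged` are assumptions, not part of any IMS tool.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> List[Token]:
    """Parse tab-separated tagger output: one 'word<TAB>tag<TAB>lemma' per line."""
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*fields))
    return tokens


# Hand-written sample using STTS tags (illustrative, not real tagger output).
SAMPLE = (
    "Die\tART\tdie\n"
    "Häuser\tNN\tHaus\n"
    "sind\tVAFIN\tsein\n"
    "alt\tADJD\talt\n"
    ".\t$.\t.\n"
)

tokens = parse_tagged(SAMPLE)
# Collect the lemmas of all common nouns (STTS tag NN).
noun_lemmas = [t.lemma for t in tokens if t.pos == "NN"]
print(noun_lemmas)
```

Keeping word form, tag, and lemma together per token is what makes later lexicographic queries possible, for example counting lemmas regardless of inflection.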

Retrieval and extraction tools

We have developed specialized retrieval software for linguistically annotated corpora, e.g. the IMS Corpus Workbench (CQP) and TIGERSearch, a tool for querying syntactically annotated corpora. To gather evidence from texts for specific linguistic properties of words and phrases, we have implemented grammar fragments and extraction tools. In many cases, these tools can find high-quality lexical information automatically.
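The idea of querying an annotated corpus by token attributes can be sketched in a few lines of Python. This is not how CQP is implemented; it is a simplified illustration, comparable in spirit to a CQP query such as `[pos="ADJA"] [pos="NN"]`, in which each token must satisfy regular-expression constraints on its attributes. The function `find_sequences`, the dictionary-based token format, and the sample corpus are all assumptions made for this example.

```python
import re
from typing import Dict, List

Token = Dict[str, str]


def find_sequences(corpus: List[Token], pattern: List[Dict[str, str]]) -> List[List[Token]]:
    """Return every run of adjacent tokens that satisfies the pattern.

    Each pattern element maps attribute names to regular expressions
    that must match the whole attribute value of the corresponding token.
    """
    if not pattern:
        return []
    hits = []
    n = len(pattern)
    for start in range(len(corpus) - n + 1):
        window = corpus[start:start + n]
        if all(
            re.fullmatch(regex, token.get(attr, "")) is not None
            for token, constraints in zip(window, pattern)
            for attr, regex in constraints.items()
        ):
            hits.append(window)
    return hits


# Hand-annotated toy corpus with STTS tags (illustrative only).
CORPUS = [
    {"word": "Das", "pos": "ART", "lemma": "die"},
    {"word": "alte", "pos": "ADJA", "lemma": "alt"},
    {"word": "Haus", "pos": "NN", "lemma": "Haus"},
    {"word": "steht", "pos": "VVFIN", "lemma": "stehen"},
    {"word": "leer", "pos": "ADJD", "lemma": "leer"},
    {"word": ".", "pos": "$.", "lemma": "."},
]

# Attributive adjective followed by a common noun.
ADJ_NOUN = [{"pos": "ADJA"}, {"pos": "NN"}]

hits = find_sequences(CORPUS, ADJ_NOUN)
phrases = [" ".join(t["word"] for t in hit) for hit in hits]
print(phrases)
```

A linear scan like this is adequate only for toy data; dedicated corpus tools rely on indexed storage to answer such queries over corpora of many millions of tokens.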

Linguistically annotated text corpora

The IMS Textcorpora and Lexicon Group has been involved in major efforts to create manually validated 'reference corpora', e.g. a German reference corpus with part-of-speech and lemma information, and a syntactically annotated corpus, the TIGER corpus. In addition, researchers at the IMS have access to several hundred million tokens of automatically annotated text corpora.

Linguistic Engineering Standards

We have been involved in various international efforts to standardize linguistic resources and tools: computational lexicons; speech, textual, and multimodal corpora and their annotations; and representation formalisms for lexical and syntactic specifications. We also contribute to establishing evaluation metrics and best-practice criteria for natural language processing tools and resources.

 

The Textcorpora and Lexicon Group has been involved in various projects with partners from industry.

Projects

  • 2010-2012: TTC
    (Terminology extraction, translation tools and comparable corpora)
  • 2000-2001: DeKo
    (Deutsche Derivations- und Kompositionsmorphologie / German morphological derivation and compounding)
  • 1999-2002: DEREKO
    (Deutsches Referenzkorpus / German Reference Corpus)
  • 1999-2000: Database Overheidsterminologie / Dutch administration terminology
    Term extraction from legal texts, design and implementation of a term database.
  • 1999-2000: ELSNET
    (European Network in Speech and Natural Language Processing)
    Subproject: Syntax/Semantics Annotation Task
  • 1998-1999: Digital Dictionary of the 20th Century German Language
    Subproject: creation of a morphosyntactically annotated corpus.
  • 1998-1999: MATE
    (Multilevel Annotation, Tools Engineering)
  • 1998: PAROLE
    Subproject: subcategorization lexicon for nouns, adjectives, and adverbs.
  • 1997-1999: DISC and DISC2
    (Dialogue Systems and Components)
  • 1997-1998: Feasibility Study on the Construction of Bilingual Dictionary Databases
  • 1996-1998: SPARKLE
    (Shallow Parsing and Knowledge Extraction for Language Engineering, LE-2111)
  • 1995-1997: SFB 340, B7
    (Partial Parsing and the Acquisition of Lexical Syntax and Semantics)
  • 1995-1996: ELSNET
    (European Network in Speech and Natural Language Processing)
    Subproject: Part-of-speech annotated reference corpus for German
  • 1994-1996: DECIDE
    (Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora)
  • 1994-1996: EAGLES
    (Expert Advisory Group on Linguistic Engineering Standards)
    Development of the STTS tagset for German part-of-speech annotation
  • 1993-1995: RELATOR
  • 1992-1994: ELWIS
    (Corpus-based development of lexical knowledge bases)
  • 1992-1996: TC
    (TextCorpora and tools for their exploration)
  • 1992-1995: DELIS
    (Descriptive Lexical Specifications)