Universität Stuttgart

IMS Textcorpora and Lexicon Group

 
 

Overview

The major focus of the Lexicon and Textcorpora Group at the IMS is the creation of large-scale, high-quality lexicons for natural language applications. 'Large scale' and 'high quality' can only be obtained simultaneously if appropriate engineering methods are applied. Therefore, we use text retrieval tools and information extraction methods specialized for lexicography. This approach is usually called 'corpus-based lexicography'. ('Corpus' means 'text corpus', i.e. a large collection of texts.)

In recent years, we have developed the following linguistic resources and tools:

Lexicons

We have built IMSLex, a lexicon for German with morphosyntactic and subcategorization information. XML and relational database technology are used for efficient storage and manipulation. Manual maintenance work is done with a Java-based graphical user interface. In addition, various kinds of word frequency lists and lexical data with semantic annotations have been compiled; see the Gramotron resources pages.

Tools for automatic text analysis and corpus annotation

For the annotation of part-of-speech information and for lemmatization, we use the TreeTagger. Currently, it is available with 'parameter sets' for English, German, French, and Italian. For syntactic analysis (automatic annotation with syntactic structures), a range of tools is available, e.g. the stochastic parser LoPar, the LFG grammar, and the YAC system.
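To make token-level annotation concrete, here is a minimal sketch in Python. It assumes tagger output with one token per line and three tab-separated columns (word form, part-of-speech tag, lemma), a layout commonly produced by taggers such as the TreeTagger. The sample lines, the STTS-style tags, and the names Token, parse_tagged, and lemma_frequencies are illustrative assumptions, not actual output or code from the IMS tools.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token: word form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> list[Token]:
    """Parse one-token-per-line text with tab-separated word, POS, lemma."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*parts))
    return tokens


def lemma_frequencies(tokens: list[Token], pos_prefix: str = "") -> Counter:
    """Count lemmas, optionally restricted to tags starting with pos_prefix."""
    return Counter(t.lemma for t in tokens if t.pos.startswith(pos_prefix))


# Hand-written sample in the assumed three-column format (hypothetical data).
SAMPLE = (
    "Die\tART\tdie\n"
    "Hunde\tNN\tHund\n"
    "bellen\tVVFIN\tbellen\n"
    ".\t$.\t.\n"
    "Der\tART\tdie\n"
    "Hund\tNN\tHund\n"
    "schläft\tVVFIN\tschlafen\n"
    ".\t$.\t.\n"
)

if __name__ == "__main__":
    tokens = parse_tagged(SAMPLE)
    # Noun lemmas only: both "Hunde" and "Hund" are counted under "Hund".
    print(lemma_frequencies(tokens, pos_prefix="NN").most_common())
```

Counting by lemma rather than by word form is what lets inflected variants fall together, which is the basis for lemma-level frequency lists.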

Retrieval and extraction tools

We have developed specialized retrieval software for linguistically annotated corpora, e.g. the IMS Corpus Workbench (CQP) and TIGERSearch, a tool for querying syntactically annotated corpora. To gather evidence from texts for specific linguistic properties of words and phrases, grammar fragments and extraction tools have been implemented. With these tools, high-quality lexical information can in many cases be extracted automatically.
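The general idea of extracting lexical evidence from an annotated corpus can be sketched as a search for fixed part-of-speech sequences. The Python example below is only an illustration of that idea: it is not the query language or implementation of the IMS Corpus Workbench or TIGERSearch, and the function name find_pos_sequences, the (word, tag, lemma) tuple layout, and the sample tokens with STTS-style tags are assumptions made for this sketch.

```python
from collections import Counter


def find_pos_sequences(tokens, pattern):
    """Return lemma tuples for every span whose POS tags equal `pattern`.

    tokens  -- list of (word, pos, lemma) tuples
    pattern -- list of POS tags to match at consecutive positions
    """
    hits = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(tok[1] == tag for tok, tag in zip(window, pattern)):
            hits.append(tuple(tok[2] for tok in window))
    return hits


# Hand-written sample tokens (hypothetical data, STTS-style tags).
SAMPLE_TOKENS = [
    ("ein", "ART", "eine"),
    ("großer", "ADJA", "groß"),
    ("Hund", "NN", "Hund"),
    ("und", "KON", "und"),
    ("eine", "ART", "eine"),
    ("kleine", "ADJA", "klein"),
    ("Katze", "NN", "Katze"),
]

if __name__ == "__main__":
    # Attributive adjective followed by a common noun.
    pairs = find_pos_sequences(SAMPLE_TOKENS, ["ADJA", "NN"])
    print(Counter(pairs).most_common())
```

On a large corpus, the counts of such adjective-noun lemma pairs would serve as raw evidence for collocation candidates, which a lexicographer can then review.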

Linguistically annotated text corpora

The IMS Textcorpora and Lexicon Group has been involved in major efforts to create manually validated 'reference corpora', e.g. a German reference corpus with part-of-speech and lemma information, and a syntactically annotated corpus, the TIGER corpus. In addition, researchers at the IMS have access to several hundred million tokens of automatically annotated text corpora.

Linguistic Engineering Standards

We have been involved in various international efforts to standardize linguistic resources and tools: computational lexicons; speech, textual, and multimodal corpora and their annotations; and representation formalisms for lexical and syntactic specifications. We also contribute to establishing evaluation metrics and best-practice criteria for natural language processing tools and resources.

The "Local" page

More details for new group members can be found on our local page.