Research Group Textcorpora and Lexicon

A research group at IMS that developed lexicons, corpora, and tools to work with them

IMS Textcorpora and Lexicon Group

Short description

The Textcorpora and Lexicon Group was a research group at IMS that brought together the researchers from different projects that were developing lexicons, corpora, and tools to work with them.

The major focus of the Textcorpora and Lexicon Group at the IMS is the creation of large-scale, high-quality lexicons for natural language applications. 'Large scale' and 'high quality' can be achieved together only if appropriate engineering methods are applied. We therefore use text retrieval tools and information extraction methods specialized for lexicography. This approach is usually called 'corpus-based lexicography'.

Long description

In recent years, we have developed the following linguistic resources and tools:

Lexicons

We have built IMSLex, a lexicon for German with morphosyntactic and subcategorization information. XML and relational database technology are used for efficient storage and manipulation. Manual maintenance work is done with a Java-based graphical user interface.

Tools for automatic text analysis and corpus annotation

For the annotation of part-of-speech information and for lemmatization, we use the TreeTagger. Currently, it is available with 'parameter sets' for English, German, French, and Italian. For syntactic analysis (automatic annotation with syntactic structures), a range of tools is available, e.g. the stochastic parser LoPar, the LFG grammar, and the YAC system.
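
As a rough illustration of how such token-level annotation can be consumed, the sketch below reads a one-token-per-line, tab-separated layout (word, part-of-speech tag, lemma), which is the layout TreeTagger prints when asked to output tokens and lemmas. The sample sentence is hand-written rather than real tagger output, and the names `Token` and `parse_vertical` are hypothetical, chosen only for this example. Tags follow the STTS tagset.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_vertical(text: str) -> List[Token]:
    """Parse word<TAB>tag<TAB>lemma lines into Token records.

    Blank lines are skipped; malformed lines raise ValueError.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        tokens.append(Token(*parts))
    return tokens


# Hand-written sample in the assumed layout (not actual tagger output).
SAMPLE = "Die\tART\tdie\nKatzen\tNN\tKatze\nschlafen\tVVFIN\tschlafen\n.\t$.\t.\n"

tokens = parse_vertical(SAMPLE)
# Collect the lemmas of all common nouns (STTS tag NN).
noun_lemmas = [t.lemma for t in tokens if t.pos == "NN"]
print(noun_lemmas)
```

Keeping the lemma next to the surface form is what lets later steps count 'Katze' and 'Katzen' as the same lexical item.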

Retrieval and extraction tools

We have developed specialized retrieval software for linguistically annotated corpora, e.g. the IMS Corpus Workbench (CQP) and TIGERSearch, a tool for querying syntactically annotated corpora. To gather evidence from texts for specific linguistic properties of words and phrases, we have implemented grammar fragments and extraction tools. In many cases, these tools can find high-quality lexical information automatically.
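
To make the idea of extracting lexical evidence from annotated text concrete, here is a minimal sketch. It is not the group's grammar fragments and not CQP itself. It scans a POS-tagged token list for an attributive adjective directly followed by a common noun (STTS tags ADJA and NN) and counts the lemma pairs as candidate collocations. The toy corpus and the function name `adj_noun_pairs` are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple

# (word, POS tag, lemma); tags follow the STTS tagset.
Token = Tuple[str, str, str]


def adj_noun_pairs(tokens: List[Token]) -> Counter:
    """Count (adjective lemma, noun lemma) pairs for adjacent ADJA + NN tokens."""
    counts: Counter = Counter()
    # Walk over each pair of neighbouring tokens.
    for (_, pos1, lemma1), (_, pos2, lemma2) in zip(tokens, tokens[1:]):
        if pos1 == "ADJA" and pos2 == "NN":
            counts[(lemma1, lemma2)] += 1
    return counts


# Hand-made toy data, not taken from any real corpus.
CORPUS: List[Token] = [
    ("starker", "ADJA", "stark"), ("Kaffee", "NN", "Kaffee"),
    ("und", "KON", "und"),
    ("starken", "ADJA", "stark"), ("Kaffee", "NN", "Kaffee"),
    ("mit", "APPR", "mit"),
    ("heißer", "ADJA", "heiß"), ("Milch", "NN", "Milch"),
]

pairs = adj_noun_pairs(CORPUS)
for (adj, noun), n in pairs.most_common():
    print(adj, noun, n)
```

In CQP, a comparable pattern over a corpus with a `pos` attribute would be written roughly as `[pos="ADJA"] [pos="NN"]`. Counting by lemma rather than by word form groups inflected variants such as 'starker' and 'starken' under a single entry.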

Linguistically annotated text corpora

The IMS Textcorpora and Lexicon Group has been involved in major efforts to create manually validated 'reference corpora', e.g. a German reference corpus with part-of-speech and lemma information, and a syntactically annotated corpus, the TIGER corpus. In addition, researchers at the IMS have access to several hundred million tokens of automatically annotated text corpora.

Linguistic Engineering Standards

We have been involved in various international efforts to standardize linguistic resources and tools: computational lexicons; speech, textual, and multimodal corpora and their annotations; and representation formalisms for lexical and syntactic specifications. We contribute to establishing evaluation metrics and best-practice criteria for natural language processing tools and resources.

The Textcorpora and Lexicon Group has been involved in various projects with partners from industry.

Projects

2010-2012: TTC
(Terminology extraction, translation tools and comparable corpora)
2000-2001: DeKo
(Deutsche Derivations- und Kompositionsmorphologie / German morphological derivation and compounding)
1999-2002: DEREKO
(Deutsches Referenzkorpus / German Reference Corpus)
1999-2000: Database Overheidsterminologie / Dutch administration terminology
Term extraction from legal texts, design and implementation of a term database.
1999-2000: ELSNET
(European Network in Speech and Natural Language Processing)
Subproject: Syntax/Semantics Annotation Task
1998-1999: Digital Dictionary of the 20th Century German Language
Subproject: creation of a morphosyntactically annotated corpus.
1998-1999: MATE
(Multilevel Annotation, Tools Engineering)
1998: PAROLE
Subproject: subcategorization lexicon for nouns, adjectives, and adverbs.
1997-1999: DISC and DISC2
(Dialogue Systems and Components)
1997-1998: Feasibility Study on the Construction of Bilingual Dictionary Databases
1996-1998: SPARKLE
(Shallow Parsing and Knowledge Extraction for Language Engineering, LE-2111)
1995-1997: SFB 340, B7
(Partial Parsing and the Acquisition of Lexical Syntax and Semantics)
1995-1996: ELSNET
(European Network in Speech and Natural Language Processing)
Subproject: Part-of-speech annotated reference corpus for German
1994-1996: DECIDE
(Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora)
1994-1996: EAGLES
(Expert Advisory Group on Linguistic Engineering Standards)
Development of the STTS tagset for German part-of-speech annotation
1993-1995: RELATOR
1992-1994: ELWIS
(Corpus-based development of lexical knowledge bases)
1992-1996: TC
(TextCorpora and tools for their exploration)
1992-1995: DELIS
(Descriptive Lexical Specifications)

General Contact IMS

Pfaffenwaldring 5 b, 70569 Stuttgart
