IMS Textcorpora and Lexicon Group
The Textcorpora and Lexicon Group was a research group at IMS that brought together researchers from different projects developing lexicons, corpora, and tools for working with them.
The major focus of the Textcorpora and Lexicon Group at the IMS is the creation of large-scale, high-quality lexicons for natural language applications. 'Large scale' and 'high quality' can only be achieved together if appropriate engineering methods are applied. We therefore use text retrieval tools and information extraction methods that are specialized for lexicography. This approach is usually called 'corpus-based lexicography'. ('Corpus' means 'text corpus', i.e. a large collection of texts.)
In recent years, we have developed the following linguistic resources and tools:
We have built IMSLex, a lexicon for German with morphosyntactic and subcategorization information. XML and relational database technology are used for efficient storage and manipulation. Manual maintenance work is done with a Java-based graphical user interface.
Tools for automatic text analysis and corpus annotation
For the annotation of part-of-speech information and for lemmatization, we use the TreeTagger. Currently, it is available with 'parameter sets' for English, German, French, and Italian. For syntactic analysis (automatic annotation with syntactic structures), a range of tools is available, e.g. the stochastic parser LoPar, the LFG grammar, and the YAC system.
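As a rough illustration of how part-of-speech and lemma annotation can be used downstream, the following sketch reads tagger output in a three-column, tab-separated layout (word, tag, lemma) and counts noun lemmas. The sample lines, the STTS-style tags, and the helper names `parse_tagged_lines` and `lemma_frequencies` are assumptions made for this example; they are not part of any IMS tool.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines):
    """Parse 'word<TAB>tag<TAB>lemma' lines into Token records."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not have exactly three columns
        tokens.append(Token(*parts))
    return tokens


def lemma_frequencies(tokens, pos_prefix):
    """Count lemmas of all tokens whose tag starts with pos_prefix."""
    return Counter(t.lemma for t in tokens if t.pos.startswith(pos_prefix))


# Hypothetical sample in the assumed three-column layout.
SAMPLE = [
    "Die\tART\tdie",
    "Katze\tNN\tKatze",
    "jagt\tVVFIN\tjagen",
    "die\tART\tdie",
    "Katzen\tNN\tKatze",
    ".\t$.\t.",
]

tokens = parse_tagged_lines(SAMPLE)
noun_counts = lemma_frequencies(tokens, "NN")
```

Counting by lemma rather than by word form groups 'Katze' and 'Katzen' under one entry, which is the kind of evidence a corpus-based lexicon builds on.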
Retrieval and extraction tools
We have developed specialized retrieval software for linguistically annotated corpora, e.g. the IMS Corpus Workbench (CQP) and TIGERSearch, a tool for querying syntactically annotated corpora. To gather evidence from texts for specific linguistic properties of words and phrases, we have implemented grammar fragments and extraction tools. In many cases, these tools can find high-quality lexical information automatically.
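The idea behind querying an annotated corpus can be sketched in a few lines: a query is a sequence of per-token constraints on attributes such as part of speech or lemma, and a match is any run of consecutive tokens that satisfies them. The sketch below is only an illustration of that idea, loosely modelled on queries of the form `[pos="ADJA"] [pos="NN"]`; the function `find_matches`, the dictionary-based token representation, and the sample data are assumptions for this example and do not reflect how CQP or TIGERSearch are implemented.

```python
import re


def find_matches(tokens, pattern):
    """Return (start, end) spans of consecutive tokens matching the pattern.

    tokens:  list of dicts mapping attribute names to values
    pattern: list of dicts mapping attribute names to regular expressions;
             each regex must match the whole attribute value
    """
    compiled = [
        {attr: re.compile(rx) for attr, rx in constraint.items()}
        for constraint in pattern
    ]
    if not compiled:
        return []  # an empty pattern matches nothing
    width = len(compiled)
    spans = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            rx.fullmatch(tok.get(attr, ""))
            for tok, constraint in zip(window, compiled)
            for attr, rx in constraint.items()
        ):
            spans.append((start, start + width))
    return spans


# Hypothetical annotated sentence with STTS-style tags.
corpus = [
    {"word": "Die", "pos": "ART", "lemma": "die"},
    {"word": "kleine", "pos": "ADJA", "lemma": "klein"},
    {"word": "Katze", "pos": "NN", "lemma": "Katze"},
    {"word": "schlief", "pos": "VVFIN", "lemma": "schlafen"},
]

# An attributive adjective followed directly by a common noun.
hits = find_matches(corpus, [{"pos": "ADJA"}, {"pos": "NN"}])
```

This linear scan is adequate for a toy example; retrieval software for corpora with hundreds of millions of tokens relies on indexed storage instead.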
Linguistically annotated text corpora
The IMS Textcorpora and Lexicon Group has been involved in major efforts to create manually validated 'reference corpora', e.g. a German reference corpus with part-of-speech and lemma information, and a syntactically annotated corpus, the TIGER corpus. In addition, researchers at the IMS have access to several hundred million tokens of automatically annotated text corpora.
Linguistic Engineering Standards
We have been involved in various international efforts to standardize linguistic resources and tools: computational lexicons; speech, textual, and multimodal corpora and their annotations; and representation formalisms for lexical and syntactic specifications. We also contribute to establishing evaluation metrics and best-practice criteria for natural language processing tools and resources.
The Textcorpora and Lexicon Group has been involved in various projects with partners from industry.