IMS Textcorpora and Lexicon Group
Overview
The major focus of the Lexicon and Textcorpora Group at the IMS is the
creation of large-scale, high-quality lexicons for natural language
applications. 'Large scale' and 'high quality' can only be obtained
simultaneously if appropriate engineering methods are applied.
Therefore, we use text retrieval tools and information extraction
methods specialized for lexicography. This approach is usually
called 'corpus-based lexicography'. ('Corpus' means 'text
corpus', i.e. a large collection of texts.)
In recent years, we have developed the following linguistic
resources and tools:
Lexicons
We have built
IMSLex, a lexicon for German with morphosyntactic and
subcategorization information. XML and relational database technology
are used for efficient storage and manipulation. Manual maintenance
work is done with a Java-based graphical user interface. In addition,
various kinds of word frequency lists and lexical data with semantic
annotations have been compiled; see the
Gramotron resources pages.
Tools for automatic text analysis and corpus annotation
For the annotation of part-of-speech information and for
lemmatization, we use the
TreeTagger. Currently, it is available with 'parameter sets' for
English, German, French, and Italian. For syntactic analysis
(automatic annotation with syntactic structures), a range of tools is
available, e.g. the stochastic parser LoPar,
the LFG
grammar, and the YAC
system.
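
As an illustration of how part-of-speech and lemma annotation can feed into word frequency lists, here is a minimal Python sketch. It assumes tagger output in the three-column, tab-separated form (token, tag, lemma) that TreeTagger prints, with `<unknown>` marking a missing lemma. The function name `lemma_frequencies` and the sample lines are hypothetical and hand-written for this example; they are not actual output of our tools.

```python
from collections import Counter


def lemma_frequencies(tagged_text):
    """Count lemmas in tab-separated token/tag/lemma lines (hypothetical helper)."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, _tag, lemma = parts
        # Fall back to the lowercased token when the tagger reports no lemma.
        key = token.lower() if lemma == "<unknown>" else lemma
        counts[key] += 1
    return counts


# Hand-written sample in the assumed format (not real tagger output).
sample = (
    "Die\tART\tdie\n"
    "Häuser\tNN\tHaus\n"
    "sind\tVAFIN\tsein\n"
    "alt\tADJD\talt\n"
    "Haus\tNN\tHaus\n"
)

frequencies = lemma_frequencies(sample)
for lemma, count in frequencies.most_common():
    print(lemma, count)
```

Counting lemmas rather than surface forms groups inflected variants such as 'Häuser' and 'Haus' under one entry, which is usually what a lexicon-oriented frequency list needs.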
Retrieval and extraction tools
We have developed specialized retrieval software for linguistically
annotated corpora, e.g. the IMS
Corpus Workbench (CQP) and
TIGERSearch, a tool for querying syntactically annotated corpora.
To gather evidence from texts for specific linguistic properties
of words and phrases, grammar fragments and extraction tools have been
implemented. In many cases, these tools can find high-quality lexical
information automatically.
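
The idea of extracting lexical evidence from annotated text can be sketched with a very simple pattern over part-of-speech tags. The Python example below is only an illustration under stated assumptions: it collects attributive adjective plus noun pairs from a token sequence tagged with the STTS tags `ADJA` and `NN`. The function `adjective_noun_pairs` and the sample sentence are hypothetical; the extraction tools mentioned above rely on grammar fragments and are considerably more elaborate.

```python
def adjective_noun_pairs(tagged_tokens):
    """Return (adjective lemma, noun lemma) pairs for adjacent ADJA + NN tokens.

    tagged_tokens is a list of (token, tag, lemma) triples (assumed layout).
    """
    pairs = []
    # Walk over adjacent token pairs and keep those matching the tag pattern.
    for (_, tag1, lemma1), (_, tag2, lemma2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1 == "ADJA" and tag2 == "NN":
            pairs.append((lemma1, lemma2))
    return pairs


# Hand-written sample sentence with STTS tags (not taken from a real corpus).
tokens = [
    ("ein", "ART", "ein"),
    ("großes", "ADJA", "groß"),
    ("Haus", "NN", "Haus"),
    ("steht", "VVFIN", "stehen"),
    ("dort", "ADV", "dort"),
]

pairs = adjective_noun_pairs(tokens)
print(pairs)
```

Aggregating such pairs over a large corpus gives candidate collocations that a lexicographer can then review.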
Linguistically annotated text corpora
The IMS Textcorpora and Lexicon Group has been involved in major
efforts to create manually validated 'reference corpora', e.g. a
German reference corpus with part-of-speech and lemma information, and
a syntactically annotated corpus, the
TIGER corpus. In addition, researchers at the IMS have access to
several hundred million tokens of automatically annotated text corpora.
Linguistic Engineering Standards
We have been involved in various international efforts to standardize
linguistic resources and tools: computational lexicons; speech,
textual, and multimodal corpora and their annotations; and representation
formalisms for lexical and syntactic specifications. We contribute to
establishing evaluation metrics and best-practice criteria for natural
language processing tools and resources.
The "Local" page
More details for new group members can be found on our local page.