Overview

The major focus of the Lexicon and Textcorpora Group at the IMS is the creation of large-scale, high-quality lexicons for natural language applications. 'Large scale' and 'high quality' can only be achieved simultaneously if appropriate engineering methods are applied. We therefore use text retrieval tools and information extraction methods specialized for the field of lexicography. This approach is usually called 'corpus-based lexicography'. ('Corpus' means 'text corpus', i.e. a large collection of texts.)

In recent years, we have developed the following linguistic resources and tools:

Lexicons

We have built IMSLex, a lexicon for German with morphosyntactic and subcategorization information. XML and relational database technology are used for efficient storage and manipulation. Manual maintenance work is done with a Java-based graphical user interface. In addition, various kinds of word frequency lists and lexical data with semantic annotations have been compiled; see the Gramotron resources pages.

Tools for automatic text analysis and corpus annotation

For the annotation of part-of-speech information and for lemmatization, we use the TreeTagger. Currently, it is available with 'parameter sets' for English, German, French, and Italian. For syntactic analysis (automatic annotation with syntactic structures), a range of tools is available, e.g. the stochastic parser LoPar, the LFG grammar, and the YAC system.
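
Part-of-speech taggers such as the TreeTagger typically write one token per line, with the word form, the tag, and the lemma separated by tabs. The following is a minimal sketch of how such output could be read for further processing. The `Token` type, the `parse_tagged` helper, and the sample lines are assumptions made for illustration and are not part of any IMS tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token: word form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged(text: str) -> list[Token]:
    """Parse tab-separated 'word<TAB>tag<TAB>lemma' lines into Token records.

    Empty lines and lines without exactly three fields (e.g. markup lines)
    are skipped rather than treated as errors.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample in the assumed three-column format.
SAMPLE = (
    "The\tDT\tthe\n"
    "corpora\tNNS\tcorpus\n"
    "were\tVBD\tbe\n"
    "annotated\tVBN\tannotate\n"
)

tokens = parse_tagged(SAMPLE)
lemmas = [t.lemma for t in tokens]
```

Keeping each token as a small record makes it straightforward to build lemma-based frequency lists or to filter by tag later on.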

Retrieval and extraction tools

We have developed specialized retrieval software for linguistically annotated corpora, e.g. the IMS Corpus Workbench (CQP) and TIGERSearch, a tool for querying syntactically annotated corpora. In order to get evidence from texts for specific linguistic properties of words and phrases, grammar fragments and extraction tools have been implemented. With these tools, high-quality lexical information can in many cases be extracted automatically.
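
The idea behind querying an annotated corpus can be illustrated with a small sketch: a query is a sequence of constraints on token attributes, and the search returns every token sequence that satisfies them. This is not how CQP or TIGERSearch are implemented; the function names, the dictionary-based token representation, and the sample sentence are assumptions chosen for illustration. The tags ADJA (attributive adjective) and NN (common noun) follow the STTS tagset for German.

```python
import re


def match_at(tokens, start, pattern):
    """Return True if `pattern` matches the tokens beginning at index `start`.

    `pattern` is a list of dicts mapping an attribute name (e.g. "pos")
    to a regular expression that the attribute value must match fully.
    """
    if start + len(pattern) > len(tokens):
        return False
    for tok, constraints in zip(tokens[start:], pattern):
        for attr, regex in constraints.items():
            if re.fullmatch(regex, tok.get(attr, "")) is None:
                return False
    return True


def find_matches(tokens, pattern):
    """Return all token subsequences that satisfy `pattern`."""
    if not pattern:
        return []
    n = len(pattern)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if match_at(tokens, i, pattern)
    ]


# Hypothetical annotated sentence: "ein neues Lexikon und eine kleine Sammlung"
corpus = [
    {"word": "ein", "pos": "ART", "lemma": "ein"},
    {"word": "neues", "pos": "ADJA", "lemma": "neu"},
    {"word": "Lexikon", "pos": "NN", "lemma": "Lexikon"},
    {"word": "und", "pos": "KON", "lemma": "und"},
    {"word": "eine", "pos": "ART", "lemma": "ein"},
    {"word": "kleine", "pos": "ADJA", "lemma": "klein"},
    {"word": "Sammlung", "pos": "NN", "lemma": "Sammlung"},
]

# Attributive adjective followed by a common noun.
query = [{"pos": "ADJA"}, {"pos": "NN"}]
hits = [[t["word"] for t in m] for m in find_matches(corpus, query)]
```

A linear scan like this is only practical for small samples; dedicated corpus tools rely on indexing to handle corpora of many millions of tokens.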

Linguistically annotated text corpora

The IMS Textcorpora and Lexicon Group has been involved in major efforts to create manually validated 'reference corpora', e.g. a German reference corpus with part-of-speech and lemma information, and a syntactically annotated corpus, the TIGER corpus. In addition, researchers at the IMS have access to several hundred million tokens of automatically annotated text corpora.

Linguistic Engineering Standards

We have been involved in various international efforts to standardize linguistic resources and tools: computational lexicons; speech, textual, and multimodal corpora and their annotations; and representation formalisms for lexical and syntactic specifications. We contribute to establishing evaluation metrics and best-practice criteria for natural language processing tools and resources.

The "Local" page

More details for new group members can be found on our local page.