Textcorpora und Erschliessungswerkzeuge (Text Corpora and Exploration Tools)

In 1993/1994 the project collected textual material for German, French and Italian and developed a representation for texts and markup, together with a query language and a corpus access system for linguistic exploration of the text material. Texts and analysis results are stored separately, which keeps the system flexible and extensible; this is made possible by the approach chosen for storage and representation. Tool components under development, both language-specific and general, range from morphosyntactic analysis to partial parsing, and from statistical methods (mutual information, t-score, collocation extraction, clustering) to HMM-based and n-gram tagging. Research on statistical models for noun phrases, verb-object collocations and similar constructions is ongoing.
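As an illustration of the association measures named above, the following minimal sketch computes mutual information and t-score for adjacent word pairs. It uses the common textbook formulations (observed versus expected co-occurrence frequency) and is not taken from the project's own tools; the function name `bigram_association` and the toy token list are hypothetical.

```python
import math
from collections import Counter

def bigram_association(tokens):
    """Return {(w1, w2): (mi, t_score)} for all adjacent word pairs.

    Assumed textbook definitions, with O the observed pair frequency
    and E = f(w1) * f(w2) / N the frequency expected under independence:
        mi      = log2(O / E)
        t_score = (O - E) / sqrt(O)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / n
        mi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (mi, t_score)
    return scores

# Toy input; a real corpus would supply tokenized running text.
scores = bigram_association("a b a b c d".split())
```

On such data, mutual information tends to rank rare pairs highest, while the t-score favours pairs that are frequent, which is why the two measures are often inspected side by side in collocation extraction.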


The project is funded in full (100%) by the Ministry of Science and Research of the Land Baden-Württemberg (MWF, Stuttgart), in 1993/1994 and 1995/1996, within the framework of the Forschungsschwerpunktprogramm Baden-Württemberg.



The part-of-speech tagset for German, developed mainly by Anne Schiller and Christine Thielen (University of Tübingen)

The various taggers built by André Kempe and Helmut Schmid

The IMS Corpus Workbench, developed by Oliver Christ and Bruno Schulze