Institut für Maschinelle Sprachverarbeitung


IMS Corpus Workbench (CWB)



To support work in lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').

Applications

The IMS Corpus Workbench is used for

Features

Query language

See the overview of the query syntax and some more sample queries.

Display of results

Corpus Administration and Preparation

Retrieval

The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
A Motif-based graphical user interface, 'xkwic', used to make access to CQP more convenient for non-programmers. It has not been updated for a couple of years and does not seem to run on newer operating system versions, so the Corpus Query Processor is currently a command-line tool only.

At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.
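To illustrate how annotations of this kind are addressed in a query, here is a minimal sketch that assembles a CQP token-sequence query as a string. It relies only on the general CQP pattern form `[attribute="regex"]`. The helper names `token` and `sequence` are hypothetical, and the attribute names `pos` and `lemma` and the tag values `ADJA` and `NN` are assumptions, since the actual names depend on how a given corpus was encoded.

```python
def token(**constraints):
    """Build one CQP token pattern, e.g. [lemma="Haus" & pos="NN"].

    Attribute names are passed through unchanged; they are assumed to
    match the positional attributes of the encoded corpus.
    """
    parts = [f'{attr}="{pattern}"' for attr, pattern in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token patterns into one query; CQP commands end with ';'."""
    return " ".join(tokens) + ";"


# Hypothetical query: an attributive adjective followed by the noun "Haus".
query = sequence(token(pos="ADJA"), token(lemma="Haus", pos="NN"))
print(query)
```

The resulting string would be typed at the CQP prompt after selecting a corpus. A pattern built with no constraints, `[]`, matches any token.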


Background papers

Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.

Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.

Contact

Mr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing, Azenbergstr. 12, 70174 Stuttgart, Germany
Uli.Heid@ims.uni-stuttgart.de, phone: +49-711-685-81374, fax: +49-711-685-81366


IMS Stuttgart, Fri Oct 27 12:19:00 2000 (Uli.Heid@ims.uni-stuttgart.de)