IMS Corpus Workbench (CWB)

To support work in the fields of lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').


The IMS Corpus Workbench is used for

  • Data-driven linguistics:
    Extraction of linguistic knowledge from textual resources or cross-checking of linguistic assumptions against large texts.
  • Lexicography:
    Corpus-based evidence for lexical descriptions.
  • Terminology:
    Extraction of terms and bootstrapping of terminological resources.


Query language

  • unrestricted number of attributes per corpus position
  • regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
  • regular expressions over sequences of corpus positions
  • (partial) support of structural annotations (e.g. SGML)
  • incremental concordancing
  • application of a query to all items of a list
  • 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
  • queries on parallel translated texts

See the overview of the query syntax and some more sample queries.
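
The idea behind the first few features (several attributes per corpus position, regular expressions over attribute values, and patterns over sequences of positions) can be illustrated with a small Python sketch. This is not CQP and does not use CQP syntax. The toy corpus, the attribute names `word` and `pos`, and the helper functions are assumptions made for illustration only, and the sketch handles only fixed-length sequences, whereas the query language described above is more general.

```python
import re

# Hypothetical toy corpus: one dict of attribute values per corpus position.
corpus = [
    {"word": "The", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "house", "pos": "NN"},
    {"word": "has", "pos": "VBZ"},
    {"word": "a", "pos": "DT"},
    {"word": "red", "pos": "JJ"},
    {"word": "door", "pos": "NN"},
]


def position_matches(token, constraints):
    """Return True if every attribute regex fully matches the token's value."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraints.items()
    )


def find_sequences(corpus, pattern):
    """Return (start, end) spans whose consecutive positions satisfy the pattern.

    `pattern` is a list with one constraint dict per position, so only
    fixed-length sequences are supported in this sketch.
    """
    hits = []
    n = len(pattern)
    for start in range(len(corpus) - n + 1):
        if all(position_matches(corpus[start + i], pattern[i]) for i in range(n)):
            hits.append((start, start + n))
    return hits


# An adjective followed by a noun, expressed as regexes over the 'pos' attribute.
pattern = [{"pos": "JJ.*"}, {"pos": "NN.*"}]
matches = find_sequences(corpus, pattern)
phrases = [" ".join(t["word"] for t in corpus[s:e]) for s, e in matches]
```

Here `phrases` collects the word forms of each matching span, which is the raw material for a concordance display.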

Display of results

  • user-definable size of 'keyword in context' display
  • 'keyword in context' lines can be sorted in various ways
  • frequency counts, e.g. for word combinations
  • multilingual concordances from aligned corpora
  • HTML and LaTeX output supported
  • query history
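
As a rough illustration of the display features listed above (context size chosen by the user, sorted 'keyword in context' lines, and frequency counts for word combinations), here is a minimal Python sketch. It is not the workbench's implementation. The sample text, the function names `kwic` and `format_line`, and the bracket notation for the keyword are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
tokens = "the cat sat on the mat and the dog sat on the rug".split()


def kwic(tokens, keyword, width=2):
    """Collect (left context, keyword, right context) for each occurrence.

    `width` is the number of tokens shown on either side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def format_line(line):
    """Render one concordance line with the keyword in brackets."""
    left, key, right = line
    return f"{' '.join(left)} [{key}] {' '.join(right)}"


lines = kwic(tokens, "the", width=2)

# One possible sort order: alphabetically by right-hand context.
sorted_lines = sorted(lines, key=lambda line: line[2])

# Frequency counts for word combinations, here adjacent word pairs.
pair_counts = Counter(zip(tokens, tokens[1:]))
```

Sorting by the left-hand context instead only requires changing the sort key to `line[0]`.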

Corpus Administration and Preparation

  • registration of corpora
  • 'encoding' of corpora, i.e. indexing (and compression)
    (for text sources in one-word-per-line format, using 8-bit character sets such as ISO 8859-1 (Latin-1), and possibly others)
    For example, the BNC corpus with part-of-speech and lemma annotation will need about 1 GB of disk space.
  • incremental addition of types of corpus annotations ('attributes'). For example, part-of-speech values can be added to a corpus once a POS tagger is available.
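
The following Python sketch illustrates the one-word-per-line input format and the idea of adding an attribute to an existing corpus later. It does not use the workbench's own encoding tools and says nothing about their on-disk format. The tab-separated column layout, the attribute names, and the functions `read_vertical` and `add_attribute` are assumptions for illustration.

```python
def read_vertical(text, attribute_names):
    """Read one-token-per-line text into one value list per attribute.

    Assumes (for this sketch) that each line holds tab-separated columns
    in the order given by `attribute_names`. Blank lines are skipped.
    """
    columns = {name: [] for name in attribute_names}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != len(attribute_names):
            raise ValueError(f"line {line_no}: expected {len(attribute_names)} columns")
        for name, value in zip(attribute_names, fields):
            columns[name].append(value)
    return columns


def add_attribute(columns, name, values):
    """Add a new attribute without touching the existing ones."""
    if name in columns:
        raise ValueError(f"attribute {name!r} already exists")
    size = len(next(iter(columns.values())))
    if len(values) != size:
        raise ValueError("new attribute must have one value per corpus position")
    columns[name] = list(values)


# Hypothetical input with word form and lemma columns.
sample = "The\tthe\nhouse\thouse\nstands\tstand\n"
corpus = read_vertical(sample, ["word", "lemma"])

# Later, once tagger output is available, add part-of-speech values.
add_attribute(corpus, "pos", ["DT", "NN", "VBZ"])
```

Keeping each attribute in its own list means that a new annotation layer can be attached without rewriting the layers that already exist, provided it supplies exactly one value per corpus position.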


The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years and does not seem to run on newer versions of the operating systems, so the Corpus Query Processor is currently a command-line tool only.

At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.


Background papers

Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.

Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.


Dr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing

