Project Corpus Workbench

A workbench for full-text retrieval from large textual resources ('corpora')

IMS Corpus Workbench (CWB)

Short description

To support work in the fields of lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora').
This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').

Features

Query language

unrestricted number of attributes per corpus position
regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
regular expressions over sequences of corpus positions
(partial) support of structural annotations (e.g. SGML)
incremental concordancing
application of a query to all items of a list
'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
queries on parallel translated texts

See the overview of the query syntax and some more sample queries.
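The two kinds of regular expressions listed above (over attribute values of a single corpus position, and over sequences of positions) can be illustrated with a small Python sketch. This is not CQP and not its query syntax. The toy corpus, the attribute names `word`, `pos` and `lemma`, and the helpers `position_matches` and `find_sequences` are all assumptions made for illustration. The sketch also assumes that a regular expression must match the whole attribute value, and it handles only fixed-length sequences.

```python
import re

# Toy corpus: every corpus position carries several attributes.
# Attribute names and values are hypothetical.
corpus = [
    {"word": "The", "pos": "DT", "lemma": "the"},
    {"word": "large", "pos": "JJ", "lemma": "large"},
    {"word": "corpora", "pos": "NNS", "lemma": "corpus"},
    {"word": "were", "pos": "VBD", "lemma": "be"},
    {"word": "indexed", "pos": "VBN", "lemma": "index"},
]

def position_matches(token, constraint):
    """True if every regex in the constraint matches the whole attribute value."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraint.items()
    )

def find_sequences(tokens, pattern):
    """Return the start offsets where consecutive tokens satisfy the pattern.

    `pattern` is a list of constraints, one per corpus position.
    """
    n = len(pattern)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(position_matches(tokens[start + k], pattern[k]) for k in range(n))
    ]

# An adjective followed by a noun whose lemma starts with "corp".
pattern = [{"pos": "JJ.*"}, {"pos": "NN.*", "lemma": "corp.*"}]
hits = find_sequences(corpus, pattern)
print(hits)
```

Because each constraint is a mapping from attribute name to regular expression, adding a further attribute to the corpus requires no change to the matching code, which mirrors the 'unrestricted number of attributes per corpus position' listed above.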

Display of results

user-definable size of 'keyword in context' display
'keyword in context' lines can be sorted in various ways
frequency counts, e.g. for word combinations
multilingual concordances from aligned corpora
HTML and LaTeX output supported
query history
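As a rough illustration of the display features above, the following Python sketch builds 'keyword in context' lines with a user-definable context size, sorts them by right context, and counts word combinations. It is a minimal sketch under stated assumptions, not the workbench's own implementation: the function names `kwic`, `sort_by_right_context` and `bigram_counts` are hypothetical, and the corpus is a plain list of word forms.

```python
from collections import Counter

def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of tokens shown on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def sort_by_right_context(lines):
    """Sort concordance lines alphabetically by their right context."""
    return sorted(lines, key=lambda line: line[2].lower())

def bigram_counts(tokens):
    """Count adjacent word pairs, a simple case of word-combination frequencies."""
    return Counter(zip(tokens, tokens[1:]))

text = "the corpus is large and the corpus is indexed".split()
lines = kwic(text, "corpus", width=2)
for left, key, right in sort_by_right_context(lines):
    print(f"{left:>10} [{key}] {right}")
```

Sorting by left context, or by the keyword itself, only requires a different sort key.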

Corpus Administration and Preparation

registration of corpora
'encoding' of corpora, i.e. indexing (and compression)
(for text sources in one-word-per-line format, using ISO 8859/Latin-1 8-bit character sets, and possibly others)
For example, the BNC with part-of-speech and lemma annotation needs about 1 GB of disk space.
incremental addition of types of corpus annotations ('attributes'); for example, part-of-speech values can be added to a corpus once a POS tagger is available
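To make the one-word-per-line input and the idea of per-attribute indexing more concrete, here is a minimal Python sketch. It rests on several assumptions that the text above does not confirm: columns are tab-separated, structural tags such as `<s>` stand on their own lines, and each attribute is indexed as a lexicon of distinct values plus a sequence of integer IDs. The functions `parse_vertical` and `encode_attribute` are hypothetical and do not reproduce the workbench's actual on-disk format.

```python
def parse_vertical(text, attributes=("word", "pos", "lemma")):
    """Parse one-word-per-line text into tokens and structural regions.

    Assumes tab-separated columns and tags like <s> ... </s> on their own lines.
    """
    tokens, regions, open_tags = [], [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("</"):
            name = line[2:-1]
            # Region spans from the first token after <name> to the last token seen.
            regions.append((name, open_tags.pop(name), len(tokens) - 1))
        elif line.startswith("<"):
            open_tags[line[1:-1]] = len(tokens)
        else:
            tokens.append(dict(zip(attributes, line.split("\t"))))
    return tokens, regions

def encode_attribute(tokens, attr):
    """Index one attribute: map each distinct value to an ID, store IDs per position."""
    lexicon, ids = {}, []
    for tok in tokens:
        ids.append(lexicon.setdefault(tok[attr], len(lexicon)))
    return lexicon, ids

sample = "\n".join([
    "<s>",
    "A\tDT\ta",
    "corpus\tNN\tcorpus",
    "</s>",
    "<s>",
    "The\tDT\tthe",
    "corpus\tNN\tcorpus",
    "grows\tVBZ\tgrow",
    "</s>",
])

tokens, regions = parse_vertical(sample)
word_lexicon, word_ids = encode_attribute(tokens, "word")
pos_lexicon, pos_ids = encode_attribute(tokens, "pos")
print(word_ids, pos_ids, regions)
```

Since each attribute is encoded independently of the others, a further annotation layer can be indexed later without touching the existing ones, which is the point of the incremental addition of attributes described above.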

Retrieval

The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years and does not seem to run on newer versions of the operating systems. The Corpus Query Processor is therefore available as a command-line tool only.

At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.

Applications

The IMS Corpus Workbench is used for

Data-driven linguistics:
Extraction of linguistic knowledge from textual resources or cross-checking of linguistic assumptions against large texts.
Lexicography:
Corpus-based evidence for lexical descriptions.
Terminology:
Extraction of terms and bootstrapping of terminological resources.

References

Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.

Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.

Ulrich Heid
