IMS Corpus Workbench (CWB)
- Short description
In order to support work in the fields of lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora').
This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').
- Long description
- unrestricted number of attributes per corpus position
- regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
- regular expressions over sequences of corpus positions
- (partial) support of structural annotations (e.g. SGML)
- incremental concordancing
- application of a query to all items of a list
- 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
- queries on parallel translated texts
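The first three query features above can be illustrated with a small Python sketch. It is not CQP's query syntax or implementation; the toy corpus, the attribute names `word` and `pos`, and the functions `position_matches` and `find_sequences` are assumptions made for illustration. Each corpus position carries any number of attributes, a constraint is a set of regular expressions over those attributes, and a query is a fixed-length sequence of constraints (no repetition operators).

```python
import re

# Toy corpus (hypothetical data): one dict of attribute values per corpus position.
CORPUS = [
    {"word": "The", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "dog", "pos": "NN"},
    {"word": "chased", "pos": "VBD"},
    {"word": "a", "pos": "DT"},
    {"word": "cat", "pos": "NN"},
]


def position_matches(token, constraint):
    """True if every attribute named in the constraint fully matches its regex."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in constraint.items()
    )


def find_sequences(corpus, query):
    """Return the start offsets at which consecutive positions satisfy the query."""
    n = len(query)
    return [
        start
        for start in range(len(corpus) - n + 1)
        if all(position_matches(corpus[start + i], query[i]) for i in range(n))
    ]


# A determiner followed by an adjective or a noun.
determiner_then_adj_or_noun = [{"pos": "DT"}, {"pos": "JJ|NN"}]
print(find_sequences(CORPUS, determiner_then_adj_or_noun))

# A single position constrained on two attributes at once.
print(find_sequences(CORPUS, [{"word": "c.*", "pos": "N.*"}]))
```

Storing the query as a list of per-position constraints keeps the two levels apart: regular expressions inside a position, and sequencing across positions.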
Display of results
- user-definable size of 'keyword in context' display
- 'keyword in context' lines can be sorted in various ways
- frequency counts, e.g. for word combinations
- multilingual concordances from aligned corpora
- HTML and LaTeX output supported
- query history
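As a rough illustration of the display features listed above, the following Python sketch builds 'keyword in context' lines with a caller-chosen context width, sorts them by right context, and counts adjacent word pairs. The sample text and the function names `kwic`, `sort_by_right_context`, `format_line` and `bigram_counts` are hypothetical and do not reflect how the workbench itself produces its output.

```python
from collections import Counter

# Hypothetical sample text, already split into tokens.
TOKENS = "the cat sat on the mat and the dog sat on the rug".split()


def kwic(tokens, keyword, width=2):
    """Collect (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_by_right_context(lines):
    """One possible sort order: alphabetically by the words after the keyword."""
    return sorted(lines, key=lambda line: line[2])


def format_line(line):
    """Render one concordance line with the keyword in brackets."""
    left, keyword, right = line
    return f"{' '.join(left)} [{keyword}] {' '.join(right)}"


def bigram_counts(tokens):
    """Frequency counts for adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))


for line in sort_by_right_context(kwic(TOKENS, "the", width=1)):
    print(format_line(line))
print(bigram_counts(TOKENS).most_common(2))
```

Sorting by left context instead would only require a different key function, for example one that compares the reversed left context.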
Corpus Administration and Preparation
- registration of corpora
- 'encoding' of corpora, i.e. indexing (and compression)
(for text sources in one-word-per-line format, using ISO 8859-1/Latin-1 8-bit character sets, and possibly others)
For example, the BNC with part-of-speech and lemma annotation needs about 1 GB of disk space.
- incremental addition of types of corpus annotations ('attributes'); for example, part-of-speech values can be added to a corpus once a POS tagger is available.
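The idea of encoding a one-word-per-line source and later adding an attribute can be sketched in Python as follows. This is a simplified model, not the workbench's on-disk format or tooling: the tab-separated column layout, the in-memory dictionaries, and the functions `read_vertical`, `add_attribute` and `build_index` are all assumptions made for illustration.

```python
from collections import defaultdict


def read_vertical(text, attributes):
    """Parse one-token-per-line text into one value list per attribute.

    Assumes (hypothetically) that attribute values on a line are tab-separated.
    """
    columns = {attr: [] for attr in attributes}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(attributes):
            raise ValueError(f"expected {len(attributes)} fields: {line!r}")
        for attr, value in zip(attributes, fields):
            columns[attr].append(value)
    return columns


def add_attribute(columns, name, values):
    """Add a new annotation layer; it must cover every corpus position."""
    size = len(next(iter(columns.values())))
    if len(values) != size:
        raise ValueError(f"{name}: got {len(values)} values for {size} positions")
    columns[name] = list(values)


def build_index(column):
    """Map each attribute value to the corpus positions where it occurs."""
    index = defaultdict(list)
    for position, value in enumerate(column):
        index[value].append(position)
    return dict(index)


corpus = read_vertical("The\nold\ndog\nsleeps\n", ["word"])
# Later, once tagger output is available, add it as a second attribute.
add_attribute(corpus, "pos", ["DT", "JJ", "NN", "VBZ"])
print(build_index(corpus["pos"]))
```

Keeping each attribute as a separate column is what makes incremental addition cheap in this sketch: a new annotation layer never requires touching the existing ones.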
The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific format.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for several years and does not seem to run on newer versions of the operating systems, so the Corpus Query Processor is currently a command-line tool only.
At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.
The IMS Corpus Workbench is used for
- Data-driven linguistics:
Extraction of linguistic knowledge from textual resources or cross-checking of linguistic assumptions against large texts.
Corpus-based evidence for lexical descriptions.
Extraction of terms and bootstrapping of terminological resources.
- Publications:
Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.
Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.