Overview

Note that the latest version of CWB is available at http://cwb.sourceforge.net

(The IMS version of 2003 is still available.)


 
 


The previous version of this page can be found here

In order to support work in the fields of lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').

Applications

The IMS Corpus Workbench is used for

  • Data-driven linguistics:
    Extraction of linguistic knowledge from textual resources or cross-checking of linguistic assumptions against large texts.
  • Lexicography:
    Corpus-based evidence for lexical descriptions.
  • Terminology:
    Extraction of terms and bootstrapping of terminological resources.

Features

Query language

  • unrestricted number of attributes per corpus position
  • regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
  • regular expressions over sequences of corpus positions
  • (partial) support of structural annotations (e.g. SGML)
  • incremental concordancing
  • application of a query to all items of a list
  • 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
  • queries on parallel translated texts
See the overview of the query syntax and some more sample queries.
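To make the idea of regular expressions over attribute values and over sequences of corpus positions more concrete, here is a minimal Python sketch. It is not CQP and does not use CQP query syntax; the toy corpus, the attribute names `word` and `pos`, and the helpers `matches` and `find_sequences` are all assumptions made for illustration. It also only handles fixed-length sequences, whereas a real query language supports full regular expressions over positions.

```python
import re

# Hypothetical toy corpus: each corpus position carries several attributes.
corpus = [
    {"word": "The", "pos": "DT"},
    {"word": "large", "pos": "JJ"},
    {"word": "corpora", "pos": "NNS"},
    {"word": "contain", "pos": "VBP"},
    {"word": "annotated", "pos": "JJ"},
    {"word": "corpus", "pos": "NN"},
    {"word": "positions", "pos": "NNS"},
]

def matches(token, constraints):
    """True if every attribute regex fully matches the token's value."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraints.items()
    )

def find_sequences(corpus, pattern):
    """Return start indices where consecutive positions satisfy the pattern.

    `pattern` is a list with one constraint dict per corpus position.
    """
    hits = []
    for start in range(len(corpus) - len(pattern) + 1):
        if all(matches(corpus[start + i], c) for i, c in enumerate(pattern)):
            hits.append(start)
    return hits

# An adjective followed by a noun whose word form starts with "corp".
pattern = [{"pos": "JJ"}, {"word": "corp.*", "pos": "NN.*"}]
hits = find_sequences(corpus, pattern)
for start in hits:
    print(" ".join(tok["word"] for tok in corpus[start:start + len(pattern)]))
```

Each constraint dict combines several attributes of one position, which mirrors the "unrestricted number of attributes per corpus position" point in the list above.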

Display of results

  • user-definable size of 'keyword in context' display
  • 'keyword in context' lines can be sorted in various ways
  • frequency counts, e.g. for word combinations
  • multilingual concordances from aligned corpora
  • HTML and LaTeX output supported
  • query history
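The display features listed above can be illustrated with a small Python sketch of a 'keyword in context' view with a user-definable context size, sorting of the resulting lines, and frequency counts for word combinations. The token list and the function name `kwic` are hypothetical, and the code says nothing about how CQP itself produces its output.

```python
from collections import Counter

# Hypothetical token stream standing in for a corpus.
tokens = "the corpus query processor reads the corpus and the registry".split()

def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every match.

    `width` is the number of tokens shown on each side of the keyword.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

lines = kwic(tokens, "corpus", width=2)

# One possible sort order: alphabetically by right-hand context.
lines_sorted = sorted(lines, key=lambda line: line[2])

# Frequency counts for word combinations (here: adjacent word pairs).
bigram_counts = Counter(zip(tokens, tokens[1:]))

for left, key, right in lines_sorted:
    print(f"{left:>12} [{key}] {right}")
```

Sorting by the left context instead only requires changing the sort key to `line[0]`.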

Corpus Administration and Preparation

  • registration of corpora
  • 'encoding' of corpora, i.e. indexing (and compression)
    (for text sources in one-word-per-line format, using ISO 8859/Latin-1 8-bit character sets, and possibly others)
    For example, the BNC corpus with part-of-speech and lemma annotation will need about 1 GB of disk space.
  • incremental addition of types of corpus annotations ('attributes'). E.g. add part-of-speech values to a corpus once you have access to a POS-tagger.
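As a rough illustration of the one-word-per-line input and of adding attributes incrementally, the following Python sketch reads such a text into one list per attribute and later attaches a further attribute. It assumes tab-separated columns on each line, and the names `read_vertical` and `add_attribute` are hypothetical; it does not reproduce the actual indexing or compression performed by the workbench.

```python
# Hypothetical input: one corpus position per line, columns separated by tabs.
vertical_text = "The\tDT\nworkbench\tNN\nindexes\tVBZ\ncorpora\tNNS\n"

def read_vertical(text, attribute_names):
    """Parse one-word-per-line text into one value list per attribute."""
    columns = {name: [] for name in attribute_names}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != len(attribute_names):
            raise ValueError(f"expected {len(attribute_names)} columns: {line!r}")
        for name, value in zip(attribute_names, fields):
            columns[name].append(value)
    return columns

def add_attribute(columns, name, values):
    """Attach a new attribute; it needs one value per corpus position."""
    size = len(next(iter(columns.values())))
    if len(values) != size:
        raise ValueError(f"attribute {name!r} has {len(values)} values, need {size}")
    columns[name] = list(values)

corpus = read_vertical(vertical_text, ["word", "pos"])

# Later, once lemma annotation becomes available, add it without re-reading.
add_attribute(corpus, "lemma", ["the", "workbench", "index", "corpus"])
print(sorted(corpus))
```

Storing each attribute as its own list keeps the existing attributes untouched when a new one is added, which is the property the last bullet above describes.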

Retrieval

The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years, and it does not seem to run on newer versions of the operating systems. The Corpus Query Processor is therefore a command-line tool only.

At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.


Background papers

Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.

Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.

Contact

Dr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing, Azenbergstr. 12, 70174 Stuttgart, Germany
Uli.Heid@ims.uni-stuttgart.de, phone: +49-711-685-81373, fax: +49-711-685-81366