Institut für Maschinelle Sprachverarbeitung
IMS Corpus Workbench (CWB)
In order to support work in the fields of lexicography and
terminology, IMS has developed a workbench for full-text retrieval
from large textual resources ('corpora').
This work was initiated by the TC Project ('Text Corpora and Tools
for their Exploitation').
Applications
The IMS Corpus Workbench is used for
- Data-driven linguistics:
Extraction of linguistic knowledge from textual resources or
cross-checking of linguistic assumptions against large texts.
- Lexicography:
Corpus-based evidence for lexical descriptions.
- Terminology:
Extraction of terms and bootstrapping of terminological resources.
Features
Query language
- unrestricted number of attributes per corpus position
- regular expressions over attribute values of individual corpus
positions (e.g. wild cards for word forms, part-of-speech values)
- regular expressions over sequences of corpus positions
- (partial) support of structural annotations (e.g. SGML)
- incremental concordancing
- application of a query to all items of a list
- 'virtual attributes', i.e. runtime access to external applications
(e.g. a thesaurus)
- queries on parallel translated texts
See the overview of the query syntax and some more sample queries.
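To make the idea of regular expressions over attribute values and over sequences of corpus positions more concrete, here is a minimal Python sketch. It is not how CQP is implemented, and it only handles fixed-length sequences (no repetition or alternation between positions). The attribute names `word` and `pos`, the tag values, and the helper names are assumptions chosen for illustration. In CQP notation the same search would presumably be written along the lines of `[pos="JJ"] [pos="NN.*" & word="[hc].*"];`.

```python
import re

# Each corpus position carries several attributes (names are illustrative).
corpus = [
    {"word": "The", "pos": "DT"},
    {"word": "old", "pos": "JJ"},
    {"word": "house", "pos": "NN"},
    {"word": "was", "pos": "VBD"},
    {"word": "a", "pos": "DT"},
    {"word": "small", "pos": "JJ"},
    {"word": "cottage", "pos": "NN"},
]

def position_matches(token, constraints):
    """True if every regex constraint matches the whole attribute value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())

def find_sequences(tokens, pattern):
    """Return start positions where consecutive tokens satisfy the pattern."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(position_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(start)
    return hits

# An adjective followed by a noun whose word form starts with "h" or "c".
pattern = [{"pos": "JJ"}, {"pos": "NN.*", "word": "[hc].*"}]
matches = find_sequences(corpus, pattern)
print(matches)
```

Each element of `pattern` constrains one corpus position, and any number of attributes can be constrained at once, which mirrors the first two features in the list above.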
Display of results
- user-definable size of 'keyword in context' display
- 'keyword in context' lines can be sorted in various ways
- frequency counts, e.g. for word combinations
- multilingual concordances from aligned corpora
- HTML and LaTeX output supported
- query history
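The display features listed above can be illustrated with a small Python sketch of a 'keyword in context' view with a configurable context size, sorting by right context, and frequency counts for two-word combinations. This is an assumption-laden illustration, not the workbench's own code; the toy token list and the function name `kwic` are hypothetical.

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the rug".split()

def kwic(tokens, keyword, width=2):
    """Return (left, keyword, right) triples with `width` context tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines

# Concordance for "the" with one token of context on each side.
lines = kwic(tokens, "the", width=1)

# One possible sort order: alphabetically by right context.
by_right = sorted(lines, key=lambda line: line[2])

# Frequency counts for adjacent word pairs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

for left, key, right in by_right:
    print(f"{left:>5} [{key}] {right}")
```

Sorting by left context or by the keyword itself only requires a different `key` function, which is one way to read "sorted in various ways".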
Corpus Administration and Preparation
- registration of corpora
- 'encoding' of corpora, i.e. indexing (and compression), for text
sources in one-word-per-line format using ISO 8859/Latin-1 8-bit
character sets, and possibly others.
For example, the BNC with part-of-speech and lemma annotation needs
about 1 GB of disk space.
- incremental addition of types of corpus annotations
('attributes'), e.g. adding part-of-speech values to a corpus
once a POS tagger is available.
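The following Python sketch shows, under stated assumptions, what encoding a one-word-per-line source and later adding a further attribute could look like in principle. It assumes tab-separated annotation columns and uses hypothetical names (`encode`, `add_attribute`, `word`, `lemma`, `pos`). It builds plain in-memory lists and inverted indexes, and does not reproduce the workbench's actual on-disk format or compression.

```python
from collections import defaultdict

# Assumed input layout: one token per line, annotation columns separated by tabs.
vertical_text = "The\tthe\nhouses\thouse\nare\tbe\nold\told\n"

def encode(text, names):
    """Build one value column and one inverted index per attribute."""
    columns = {name: [] for name in names}
    index = {name: defaultdict(list) for name in names}
    for position, line in enumerate(text.splitlines()):
        for name, value in zip(names, line.split("\t")):
            columns[name].append(value)
            index[name][value].append(position)
    return columns, index

def add_attribute(columns, index, name, values):
    """Attach a further annotation layer to an already encoded corpus."""
    if len(values) != len(columns["word"]):
        raise ValueError("need exactly one value per corpus position")
    columns[name] = list(values)
    index[name] = defaultdict(list)
    for position, value in enumerate(values):
        index[name][value].append(position)

# Encode word forms and lemmas first ...
columns, index = encode(vertical_text, ["word", "lemma"])
# ... and add part-of-speech tags later, e.g. once a tagger is available.
add_attribute(columns, index, "pos", ["DT", "NNS", "VBP", "JJ"])
print(index["lemma"]["house"], index["pos"]["JJ"])
```

Keeping each attribute in its own column is what makes the incremental addition cheap in this sketch: existing columns and indexes are left untouched when a new layer is attached.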
Retrieval
The query language is interpreted by the 'Corpus Query Processor'
(CQP). CQP requires corpora to be registered and encoded in a
specific manner.
There used to be a Motif-based graphical user interface, 'xkwic',
which made access to CQP more convenient for non-programmers. It has
not been updated for a couple of years and does not seem to run on
newer versions of the operating systems, so the Corpus Query
Processor is currently a command-line tool only.
At IMS, the largest corpus currently being handled by the Corpus
Workbench is a German newspaper corpus which consists of about 200
million tokens, annotated with lemmata, two different part-of-speech
tag sets, and sentence boundaries.
Background papers
Oli Christ:
"A modular and flexible architecture for an integrated corpus query
system". COMPLEX'94, Budapest, 1994.
Oli Christ and B. M. Schulze:
"Ein flexibles und modulares Anfragesystem für Textcorpora".
Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.
Contact
Mr. Ulrich Heid,
University of Stuttgart, Institute for Natural Language Processing,
Azenbergstr. 12, 70174 Stuttgart,
Germany
Uli.Heid@ims.uni-stuttgart.de,
phone: +49-711-685-81374, fax: +49-711-685-81366
IMS Stuttgart, Fri Oct 27 12:19:00 2000 (Uli.Heid@ims.uni-stuttgart.de)