1.1 The IMS Corpus Workbench (CWB)
- Tool development
- 1993 - 1996: Project on Text Corpora and Exploration Tools
(financed by the Land Baden-Württemberg)
- 1998 - 2004: Continued in-house development
(partly financed by various research and industrial projects)
- CWB version 3.0 to be released in early 2005
(pre-release versions have been shipped since October 2001)
- Related projects and applications at the IMS
- 1994 - 1998: EAGLES project (EU programme LRE/LE)
(morphosyntactic annotation, part-of-speech tagset, annotation tools)
- 1994 - 1996: DECIDE1 project (EU programme MLAP-93)
(extraction of collocation candidates, macro processor mp)
- 1996 - 1999: Construction of a subcategorization lexicon for German
(PhD thesis Eckle-Kohler, financed by the Land Baden-Württemberg)
- Since 1996: Various commercial and research applications
(terminology extraction, dictionary updates)
- 1999 - 2000: DOT project (Databank Overheidsterminologie)
(stand-alone system for extraction of Dutch legal terminology)
- 1999 - 2003: Implementation of YAC chunk parser for German
(PhD thesis Kermes, annotates results of CQP queries in the corpus)
- 2001 - 2003: Transferbereich 32 (financed by the DFG)
(applications in computational lexicography)
- Some external applications of the IMS Corpus Workbench
- CWB uses a proprietary token-based format for corpus storage:
- binary encoding (fast access)
- full index (fast look-up of word forms and annotations)
- specialised data compression algorithms
- corpus size: up to 500 million words, depending on annotations
- text data and annotations cannot be modified after encoding
(but it is possible to add new annotations or overwrite existing ones)
- assumes Latin-1 encoding, but compatible with other 8-bit ASCII extensions
(Unicode text in UTF-8 encoding can be processed with some caveats)
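To make the idea of a token-based, fully indexed format more concrete, here is a minimal Python sketch. It is not the actual CWB file format: the function names `encode` and `lookup` and the in-memory data structures are assumptions chosen purely for illustration. The sketch only shows the general principle of storing each token as an integer ID and keeping a list of corpus positions per word form.

```python
from collections import defaultdict

def encode(tokens):
    """Map each token to an integer ID and record its corpus positions.

    Hypothetical helper for illustration; not part of the CWB.
    """
    lexicon = {}                # word form -> integer ID
    ids = []                    # the corpus as a sequence of IDs, one per token
    index = defaultdict(list)   # ID -> ascending list of corpus positions
    for pos, tok in enumerate(tokens):
        tok_id = lexicon.setdefault(tok, len(lexicon))
        ids.append(tok_id)
        index[tok_id].append(pos)
    return lexicon, ids, dict(index)

def lookup(word, lexicon, index):
    """Return all corpus positions of a word form without scanning the text."""
    tok_id = lexicon.get(word)
    return [] if tok_id is None else index[tok_id]

tokens = "the cat sat on the mat".split()
lexicon, ids, index = encode(tokens)
print(ids)                            # [0, 1, 2, 3, 0, 4]
print(lookup("the", lexicon, index))  # [0, 4]
```

Because every token occupies exactly one slot in the ID sequence, further annotations (such as POS tags) can be stored as parallel sequences of the same length, which is one plausible reading of "token-based" here.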
- Typical compression ratios for a 100 million word corpus:
- uncompressed text: 1 GByte (without index & annotations)
- uncompressed CWB attributes: 790 MBytes (ratio: 1.3)
- word forms & lexical attributes: 360 MBytes (ratio: 2.8)
- categorical attributes (e.g. POS tags): 120 MBytes (ratio: 8.5)
- binary attributes (yes/no): 50 MBytes (ratio: 20.5)
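The ratios above can be checked with a few lines of Python. The sketch assumes that each ratio is the size of the uncompressed text divided by the size of the stored attribute, and that 1 GByte is counted as 1024 MBytes. Both are assumptions, but under them all four listed figures are reproduced after rounding to one decimal place.

```python
# Assumption: 1 GByte of uncompressed text is counted as 1024 MBytes.
UNCOMPRESSED_TEXT_MB = 1024

# Sizes in MBytes, taken from the list above.
sizes_mb = {
    "uncompressed CWB attributes": 790,
    "word forms & lexical attributes": 360,
    "categorical attributes": 120,
    "binary attributes": 50,
}

# Ratio = uncompressed text size / stored size, rounded to one decimal.
ratios = {
    name: round(UNCOMPRESSED_TEXT_MB / size, 1)
    for name, size in sizes_mb.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```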
- Supported operating systems:
- SUN Solaris 2.8 (Sparc processors)
- Linux 2.4+ (Intel i386 and compatible processors)
- Corpus data format is platform-independent
- Source code should compile on most POSIX-compliant 32-bit platforms
- tools for encoding, indexing, compression, decoding, and frequency distributions
- global "registry" holds information about corpora (name, attributes, data path)
- corpus query processor (CQP):
- fast corpus search (regular expression syntax)
- use in interactive or batch mode
- results displayed in terminal window
- CWB/Perl interface for post-processing, scripting and web interfaces
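As a rough illustration of what regular-expression search over annotated tokens means, the following Python sketch matches a sequence of per-token constraints against a tiny corpus. It does not reproduce CQP's query language or its implementation: the `search` function, the dictionary-based query representation, and the sample data are all hypothetical, and the naive linear scan here makes no use of an index.

```python
import re

# Toy corpus: one dict of attribute values per token (made-up sample data).
corpus = [
    {"word": "the", "pos": "DT"},
    {"word": "cat", "pos": "NN"},
    {"word": "sat", "pos": "VBD"},
    {"word": "on",  "pos": "IN"},
    {"word": "the", "pos": "DT"},
    {"word": "mat", "pos": "NN"},
]

def search(corpus, query):
    """Return (start, end) positions where consecutive tokens satisfy the query.

    `query` is a list with one dict per token; each dict maps an attribute
    name to a regular expression that must match the whole attribute value.
    """
    hits = []
    n = len(query)
    for start in range(len(corpus) - n + 1):
        window = corpus[start:start + n]
        if all(
            re.fullmatch(pattern, token[attr])
            for token, constraints in zip(window, query)
            for attr, pattern in constraints.items()
        ):
            hits.append((start, start + n - 1))
    return hits

# A determiner followed by a noun whose word form ends in "at".
query = [{"pos": "DT"}, {"word": ".*at", "pos": "NN"}]
print(search(corpus, query))  # [(0, 1), (4, 5)]
```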
Stefan Evert
2005-07-12