To support work in lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').
Applications
The IMS Corpus Workbench is used for
- Data-driven linguistics: Extraction of linguistic knowledge from textual resources or cross-checking of linguistic assumptions against large texts.
- Lexicography: Corpus-based evidence for lexical descriptions.
- Terminology: Extraction of terms and bootstrapping of terminological resources.
Query language
- unrestricted number of attributes per corpus position
- regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
- regular expressions over sequences of corpus positions
- (partial) support of structural annotations (e.g. SGML)
- incremental concordancing
- application of a query to all items of a list
- 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
- queries on parallel translated texts
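To make the idea of regular expressions over attribute values and over sequences of corpus positions more concrete, here is a minimal Python sketch. It is not CQP and does not use CQP's query syntax; the toy corpus, the attribute names `word` and `pos`, and the helper functions are assumptions made for illustration. It also simplifies matters by matching only fixed-length sequences, whereas a full query language would allow quantifiers over positions.

```python
import re

# Hypothetical toy corpus: every corpus position carries several attributes.
CORPUS = [
    {"word": "The", "pos": "DT"},
    {"word": "large", "pos": "JJ"},
    {"word": "corpora", "pos": "NNS"},
    {"word": "were", "pos": "VBD"},
    {"word": "indexed", "pos": "VBN"},
    {"word": "quickly", "pos": "RB"},
]


def position_matches(token, constraints):
    """True if every attribute of the token fully matches its regular expression."""
    return all(
        re.fullmatch(regex, token.get(attribute, "")) is not None
        for attribute, regex in constraints.items()
    )


def find_matches(corpus, pattern):
    """Return (start, end) corpus positions where the whole pattern matches.

    `pattern` is a list with one constraint dict per consecutive position.
    """
    hits = []
    length = len(pattern)
    for start in range(len(corpus) - length + 1):
        if all(
            position_matches(corpus[start + offset], pattern[offset])
            for offset in range(length)
        ):
            hits.append((start, start + length - 1))
    return hits


# An adjective followed by a noun whose word form starts with "corp".
PATTERN = [{"pos": "JJ.*"}, {"word": "corp.*", "pos": "NN.*"}]
MATCHES = find_matches(CORPUS, PATTERN)
```

In this sketch each dictionary in `PATTERN` constrains one corpus position, and a position may be constrained on any number of attributes at once.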
Display of results
- user-definable size of 'keyword in context' display
- 'keyword in context' lines can be sorted in various ways
- frequency counts, e.g. for word combinations
- multilingual concordances from aligned corpora
- HTML and LaTeX output supported
- query history
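The display features listed above can be illustrated with a short Python sketch of a 'keyword in context' view with a user-definable context size, sorting by right context, and frequency counts for word combinations. The token list and all function names are hypothetical and do not reflect how the Corpus Workbench implements these features.

```python
from collections import Counter

# Hypothetical tokenised text used only for illustration.
TOKENS = "the cat sat on the mat and the dog sat on the rug".split()


def kwic(tokens, keyword, context=2):
    """Collect (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, token, right))
    return lines


def sort_by_right_context(lines):
    """One of several possible sort orders: by the words following the keyword."""
    return sorted(lines, key=lambda line: line[2])


def format_line(line, width=12):
    """Render one concordance line with the keyword in a fixed column."""
    left, keyword, right = line
    return f"{' '.join(left):>{width}} [{keyword}] {' '.join(right)}"


def bigram_counts(tokens):
    """Frequency counts for adjacent word combinations."""
    return Counter(zip(tokens, tokens[1:]))


for line in sort_by_right_context(kwic(TOKENS, "the", context=1)):
    print(format_line(line))
```

Changing the `context` argument widens or narrows the display, and a different `key` function in `sort_by_right_context` would give another sort order, for example by left context.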
Corpus Administration and Preparation
- registration of corpora
- 'encoding' of corpora, i.e. indexing (and compression), for text sources in one-word-per-line format using ISO 8859/Latin-1 8-bit character sets, and possibly others. For example, the BNC corpus with part-of-speech and lemma annotation needs about 1 GB of disk space.
- incremental addition of types of corpus annotations ('attributes'), e.g. adding part-of-speech values to a corpus once you have access to a POS tagger.
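As a rough illustration of what encoding a one-word-per-line source might involve, the following Python sketch reads such a text into one column per attribute, maps each attribute's values to integer ids, and then adds a further attribute afterwards. The tab-separated column layout, the sample data, and the lexicon-plus-id representation are assumptions chosen for the example; they are not the Corpus Workbench's actual input conventions or on-disk format.

```python
# Hypothetical input: one token per line, attributes separated by tabs.
VERTICAL_TEXT = "The\tDT\nworkbench\tNN\nindexes\tVBZ\nthe\tDT\ntext\tNN\n"


def read_vertical(text, attribute_names):
    """Split one-word-per-line text into one list of values per attribute."""
    columns = {name: [] for name in attribute_names}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        for name, value in zip(attribute_names, fields):
            columns[name].append(value)
    return columns


def encode_attribute(values):
    """Map each distinct value to an integer id and return (lexicon, id sequence)."""
    lexicon = {}
    ids = []
    for value in values:
        ids.append(lexicon.setdefault(value, len(lexicon)))
    return lexicon, ids


columns = read_vertical(VERTICAL_TEXT, ["word", "pos"])
encoded = {name: encode_attribute(values) for name, values in columns.items()}

# Incremental addition: a new attribute is encoded later without touching
# the attributes that already exist (the lemma values here are made up).
columns["lemma"] = ["the", "workbench", "index", "the", "text"]
encoded["lemma"] = encode_attribute(columns["lemma"])
```

Keeping each attribute in its own column is what makes the incremental step cheap in this sketch: adding `lemma` requires only that the new list has one value per existing corpus position.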
Retrieval
The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years, and it does not seem to run on newer versions of the operating systems. As a result, the Corpus Query Processor is currently a command-line tool only.
At IMS, the largest corpus currently handled by the Corpus Workbench is a German newspaper corpus of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.
Papers
- Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.
- Oli Christ and B.M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.
Contact
Dr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing, Azenbergstr. 12, 70174 Stuttgart, Germany
Uli.Heid@ims.uni-stuttgart.de, phone: +49-711-685-81373, fax: +49-711-685-81366