In order to support work in the fields of lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').
Applications
The IMS Corpus Workbench is used for
- Data-driven linguistics: extraction of linguistic knowledge from textual resources, or cross-checking of linguistic assumptions against large texts.
- Lexicography: corpus-based evidence for lexical descriptions.
- Terminology: extraction of terms and bootstrapping of terminological resources.
Query language
- unrestricted number of attributes per corpus position
- regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
- regular expressions over sequences of corpus positions
- (partial) support of structural annotations (e.g. SGML)
- incremental concordancing
- application of a query to all items of a list
- 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
- queries on parallel translated texts
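The first three items above can be illustrated with a small sketch. It is not CQP itself and does not use CQP syntax; the token data, the attribute names (`word`, `pos`, `lemma`) and the helper functions are assumptions made for this example. It handles only fixed-length sequences, whereas the features listed above describe full regular expressions over sequences of corpus positions.

```python
import re

# Each corpus position carries several attributes (hypothetical sample data).
corpus = [
    {"word": "The", "pos": "DT", "lemma": "the"},
    {"word": "old", "pos": "JJ", "lemma": "old"},
    {"word": "houses", "pos": "NNS", "lemma": "house"},
    {"word": "were", "pos": "VBD", "lemma": "be"},
    {"word": "sold", "pos": "VBN", "lemma": "sell"},
]


def position_matches(token, constraints):
    """True if every regex in `constraints` matches the whole attribute value."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraints.items())


def find_sequences(tokens, pattern):
    """Return the start indices where consecutive positions satisfy `pattern`."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(position_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(start)
    return hits


# An adjective (any tag starting with "JJ") followed by a form of the lemma "house".
pattern = [{"pos": "JJ.*"}, {"lemma": "house"}]
hits = find_sequences(corpus, pattern)
```

Each element of `pattern` constrains one corpus position, and a position may be constrained on any number of its attributes at once.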
Display of results
- user-definable size of 'keyword in context' display
- 'keyword in context' lines can be sorted in various ways
- frequency counts, e.g. for word combinations
- multilingual concordances from aligned corpora
- HTML and LaTeX output supported
- query history
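As a rough illustration of the display features above, the following sketch builds 'keyword in context' lines with a user-chosen context size, sorts them by right context, and counts adjacent word combinations. The function `kwic`, the sample sentence and the output layout are assumptions for this example, not the workbench's own implementation.

```python
from collections import Counter


def kwic(tokens, keyword, context=3):
    """Collect (left, match, right) triples with `context` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, tok, right))
    return lines


text = "the old house and the new house near the old bridge".split()
lines = kwic(text, "house", context=2)

# One possible sort order: alphabetically by the right-hand context.
lines_by_right = sorted(lines, key=lambda line: [w.lower() for w in line[2]])

# Frequency counts for adjacent word pairs.
pair_counts = Counter(zip(text, text[1:]))

for left, match, right in lines_by_right:
    print(f"{' '.join(left):>15} [{match}] {' '.join(right)}")
```

Sorting by left context or by the matched form works the same way with a different `key` function.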
Corpus Administration and Preparation
- registration of corpora
- 'encoding' of corpora, i.e. indexing (and compression), for text sources in one-word-per-line format using ISO 8859/Latin-1 8-bit character sets, and possibly others. For example, the British National Corpus (BNC) with part-of-speech and lemma annotation needs about 1 GB of disk space.
- incremental addition of types of corpus annotations ('attributes'), e.g. adding part-of-speech values to a corpus once you have access to a POS tagger.
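To make the one-word-per-line input and the incremental addition of attributes more concrete, here is a minimal sketch. It assumes that annotations sit in tab-separated columns on each line, which the text above does not state, and it keeps everything in plain Python lists; it says nothing about the workbench's actual index or compression format. The function names are hypothetical.

```python
def read_vertical(text, attributes):
    """Split one-token-per-line text into one column (list) per attribute."""
    columns = {name: [] for name in attributes}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != len(attributes):
            raise ValueError(f"expected {len(attributes)} fields, got {len(fields)}")
        for name, value in zip(attributes, fields):
            columns[name].append(value)
    return columns


def add_attribute(columns, name, values):
    """Add a new annotation layer; it must cover every corpus position."""
    size = len(next(iter(columns.values())))
    if len(values) != size:
        raise ValueError("new attribute must have one value per corpus position")
    columns[name] = list(values)


# Assumed sample input: word form and lemma, separated by a tab.
sample = "The\tthe\nhouses\thouse\nstand\tstand\n"
columns = read_vertical(sample, ["word", "lemma"])

# Later, once a POS tagger is available, add its output as a further attribute.
add_attribute(columns, "pos", ["DT", "NNS", "VBP"])
```

Keeping each attribute as its own column is what makes it possible to add an annotation layer later without touching the existing ones.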
Retrieval
The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years, and it does not seem to run on newer versions of the operating systems. The Corpus Query Processor is therefore a command-line tool only.
At IMS, the largest corpus currently handled by the Corpus Workbench is a German newspaper corpus of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.
Papers
- Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.
- Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.
Contact
Dr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing, Azenbergstr. 12, 70174 Stuttgart, Germany
Uli.Heid@ims.uni-stuttgart.de, phone: +49-711-685-81373, fax: +49-711-685-81366