To support work in lexicography and terminology, IMS has developed a workbench for full-text retrieval from large textual resources ('corpora'). This work was initiated by the TC Project ('Text Corpora and Tools for their Exploitation').
The IMS Corpus Workbench is used for:
- Data-driven linguistics: extraction of linguistic knowledge from textual resources, or cross-checking of linguistic assumptions against large texts.
- Corpus-based evidence for lexical descriptions.
- Extraction of terms and bootstrapping of terminological resources.
Query language
- unrestricted number of attributes per corpus position
- regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
- regular expressions over sequences of corpus positions
- (partial) support of structural annotations (e.g. SGML)
- incremental concordancing
- application of a query to all items of a list
- 'virtual attributes', i.e. runtime access to external applications (e.g. a thesaurus)
- queries on parallel translated texts
See the overview of the query syntax and some more sample queries.
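The idea of matching regular expressions against attribute values, and then against runs of corpus positions, can be illustrated with a small Python sketch. This is not CQP syntax and not how the workbench is implemented; the attribute names (`word`, `pos`, `lemma`), the tag values, and the helper functions `matches` and `find_sequences` are assumptions made for illustration. The sequence matching here is limited to fixed-length patterns, a simplified stand-in for full regular expressions over sequences.

```python
import re

# Toy corpus: every corpus position carries several attributes.
# Attribute names and tag values are illustrative assumptions.
corpus = [
    {"word": "The", "pos": "DT", "lemma": "the"},
    {"word": "old", "pos": "JJ", "lemma": "old"},
    {"word": "houses", "pos": "NNS", "lemma": "house"},
    {"word": "were", "pos": "VBD", "lemma": "be"},
    {"word": "sold", "pos": "VBN", "lemma": "sell"},
]


def matches(position, constraints):
    """True if every constrained attribute value fully matches its regex."""
    return all(
        re.fullmatch(regex, position[attr]) is not None
        for attr, regex in constraints.items()
    )


def find_sequences(corpus, pattern):
    """Return start offsets where consecutive positions satisfy the pattern.

    `pattern` is a list of constraint dicts, one per corpus position.
    """
    n = len(pattern)
    return [
        start
        for start in range(len(corpus) - n + 1)
        if all(matches(corpus[start + k], pattern[k]) for k in range(n))
    ]


# An adjective followed by any noun whose word form starts with "h".
pattern = [{"pos": "JJ"}, {"pos": "NN.*", "word": "h.*"}]
hits = find_sequences(corpus, pattern)
```

Each dictionary in `pattern` constrains one corpus position, so a wild card such as `NN.*` applies to a single attribute value, while the list as a whole describes a sequence.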
Display of results
- user-definable size of 'keyword in context' display
- 'keyword in context' lines can be sorted in various ways
- frequency counts, e.g. for word combinations
- multilingual concordances from aligned corpora
- HTML and LaTeX output supported
- query history
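As a rough illustration of the display options above, the following Python sketch builds 'keyword in context' lines with a configurable context size, sorts them, and counts word combinations. The sample sentence and the function names `kwic` and `format_line` are assumptions for this example; they do not reflect the workbench's actual output format.

```python
from collections import Counter

# Illustrative token sequence; a real corpus would be far larger.
tokens = "the cat sat on the mat and the dog sat on the rug".split()


def kwic(tokens, keyword, context=2):
    """Collect (left context, keyword, right context) for every match."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, token, right))
    return lines


def format_line(line):
    """Render one concordance line with the keyword in brackets."""
    left, keyword, right = line
    return f"{' '.join(left)} [{keyword}] {' '.join(right)}"


lines = kwic(tokens, "sat", context=2)

# One possible sort order: by the word immediately left of the keyword.
by_left_word = sorted(lines, key=lambda line: line[0][-1] if line[0] else "")

# Frequency count of (keyword, following word) combinations.
pair_counts = Counter(
    (keyword, right[0]) for _, keyword, right in lines if right
)
```

Changing the `context` argument widens or narrows the displayed window, and other sort keys (for example the right context) can be passed to `sorted` in the same way.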
Corpus Administration and Preparation
- registration of corpora
- 'encoding' of corpora, i.e. indexing (and compression)
(for text sources in one-word-per-line format, using ISO 8859/Latin-1 8-bit character sets, and possibly others)
For example, the BNC corpus with part-of-speech and lemma annotation will need about 1 GB of disk space.
- incremental addition of types of corpus annotations ('attributes'). For example, part-of-speech values can be added to a corpus once a POS tagger is available.
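The steps in this list can be pictured with a minimal Python sketch: read a one-word-per-line source, index it, and later add a further attribute without touching the existing data. The in-memory dictionaries, the function names, and the sample tags are assumptions for illustration only; the workbench's own encoding tools and on-disk formats are not shown here.

```python
from collections import defaultdict

# One-word-per-line source text (illustrative sample).
vertical_text = "The\nold\nhouses\nwere\nsold\n"


def read_attribute(text):
    """Return one value per corpus position from one-word-per-line text."""
    return [line for line in text.splitlines() if line]


def build_index(values):
    """Map each attribute value to the corpus positions where it occurs."""
    index = defaultdict(list)
    for position, value in enumerate(values):
        index[value].append(position)
    return dict(index)


# 'Encoding' in this toy model: store the values and an index per attribute.
corpus = {"word": read_attribute(vertical_text)}
indexes = {"word": build_index(corpus["word"])}


def add_attribute(name, values):
    """Add a new annotation layer; existing attributes stay untouched."""
    if len(values) != len(corpus["word"]):
        raise ValueError("need exactly one value per corpus position")
    corpus[name] = values
    indexes[name] = build_index(values)


# Later, once tagger output is available (tags here are made up):
add_attribute("pos", ["DT", "JJ", "NNS", "VBD", "VBN"])
```

Keeping each attribute in its own list and index is what makes the incremental step cheap in this sketch: adding `pos` never rewrites the `word` data.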
The query language is interpreted by the 'Corpus Query Processor' (CQP). CQP requires corpora to be registered and encoded in a specific manner.
There used to be a Motif-based graphical user interface, 'xkwic', which made access to CQP more convenient for non-programmers. It has not been updated for a couple of years and does not seem to run on newer operating system versions, so the Corpus Query Processor is currently a command-line tool only.
At IMS, the largest corpus currently being handled by the Corpus Workbench is a German newspaper corpus which consists of about 200 million tokens, annotated with lemmata, two different part-of-speech tag sets, and sentence boundaries.
Oli Christ: "A modular and flexible architecture for an integrated corpus query system". COMPLEX'94, Budapest, 1994.
Oli Christ and B. M. Schulze: "Ein flexibles und modulares Anfragesystem für Textcorpora". Tagungsbericht des Arbeitstreffen Lexikon + Text. Niemeyer, Tübingen, 1995.
Dr. Ulrich Heid, University of Stuttgart, Institute for Natural Language Processing, Pfaffenwaldring 5b, 70569 Stuttgart, Germany, Uli.Heid@ims.uni-stuttgart.de, phone: +49-711-685-81373, fax: +49-711-685-81366
For more information see CorpusWorkbench.