IMS Corpus Workbench

CQP administrator's FAQ

How do I install the tools?

Installation is relatively easy, since the distribution package includes a Makefile. The site-dependent changes to the Makefile and the detailed installation process are described in the INSTALL file, which is also included in the distribution package.

How do I install corpora?

This process is described in detail in the Corpus Administrator's Manual.

How do I add another positional attribute to a corpus?

The new attribute must be available in one-``word''-per-line format (that is, one value per line). You can produce this format, for example, from a blank-separated token stream by using GNU's tr (we assume that the file datafile holds the values of your new attribute):
tr ' ' '\n' < datafile | ggrep .
tr reads from standard input only, so the file is passed in by redirection. The standard versions of tr probably won't work, since they don't allow \n in the replacement pattern. The ggrep . is necessary to delete empty lines.
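
If GNU tr is not available, the same one-value-per-line stream can probably be produced with awk instead. The following is only a sketch: the function name one_per_line is made up for this example, and unlike the tr command it also splits on tabs, which may or may not be what you want for your data.

```shell
# one_per_line [FILE...]: print every whitespace-separated token of the
# input on a line of its own. Empty lines produce no output, so no extra
# grep step is needed. (Hypothetical helper name, not part of the tools.)
one_per_line() {
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$@"
}

# Example call (datafile as in the text above):
#   one_per_line datafile
```

The output of one_per_line datafile can then be used wherever the tr pipeline is used.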

You must make sure that the token stream produced by your tool contains exactly as many tokens as are already encoded in an existing attribute of your corpus, e.g. the word attribute. The number of tokens of the word attribute is computed with the command

% lexdecode -S bncims
In this example, the ``bncims'' corpus contains 117599144 tokens.

To test whether your new attribute has exactly this number of tokens, you can pipe the output of the tr command used to produce the one-value-per-line format into the wc program:

tr ' ' '\n' < datafile | ggrep . | wc -l
This command should report exactly the same figure as the lexdecode command.
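
The comparison can also be scripted so that a mismatch is hard to overlook. The sketch below is an illustration only: the function names count_tokens and check_token_count are invented for this example, it uses plain grep rather than ggrep, and the expected figure has to be copied by hand from the lexdecode output.

```shell
# count_tokens FILE: number of non-empty, blank-separated tokens in FILE.
# grep -c . counts the non-empty lines, so no separate wc call is needed.
count_tokens() {
    tr ' ' '\n' < "$1" | grep -c .
}

# check_token_count FILE EXPECTED: succeed only if FILE holds exactly
# EXPECTED tokens (EXPECTED is the figure reported by lexdecode -S).
check_token_count() {
    actual=$(count_tokens "$1")
    if [ "$actual" -eq "$2" ]; then
        echo "OK: $actual tokens"
    else
        echo "MISMATCH: $1 has $actual tokens, expected $2" >&2
        return 1
    fi
}

# Example call, using the figure from the text above:
#   check_token_count datafile 117599144
```

Because check_token_count returns a non-zero status on a mismatch, it can be used to guard the encoding step in a script.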

Once you have made sure that the new attribute has the same number of tokens (or if you simply trust yourself), you can encode the new attribute. ``cd'' to the directory where the data is to be stored and call

tr ' ' '\n' < datafile | ggrep . | encode -p attrname
where attrname is the name of your new attribute (e.g. lemma or pos). Note the lowercase ``-p'' as the argument to encode. It's important NOT to use an uppercase P here. More information about ``encode'' can be found on the encode manual page.

When you have successfully encoded the new attribute, you must register it by adding a line similar to the following to the registry file of the corpus:

ATTRIBUTE attrname { DIR /some/directory/where/the/data/is }
The { DIR ... } clause may be omitted if the new attribute data is in the corpus' HOME directory.
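
If you register attributes often, the line can be appended by a small shell function. This is a sketch under assumptions: the function name register_attribute is hypothetical, the path of the registry file must be supplied by you, and the function does nothing more than append the line shown above unless an entry for the attribute already exists.

```shell
# register_attribute REGISTRY_FILE ATTRNAME [DATADIR]
# Append an ATTRIBUTE line for ATTRNAME to REGISTRY_FILE. The { DIR ... }
# clause is written only if DATADIR is given. (Hypothetical helper.)
register_attribute() {
    if [ -n "${3:-}" ]; then
        line="ATTRIBUTE $2 { DIR $3 }"
    else
        line="ATTRIBUTE $2"
    fi
    # Refuse to add a second entry for the same attribute name.
    if awk -v a="$2" '$1 == "ATTRIBUTE" && $2 == a { found = 1 }
                      END { exit !found }' "$1"; then
        echo "attribute $2 is already registered in $1" >&2
        return 1
    fi
    printf '%s\n' "$line" >> "$1"
}

# Example call (both paths are placeholders):
#   register_attribute /path/to/registry/file attrname /some/directory/where/the/data/is
```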

Now call makeall to index the data:

makeall corpusname attrname
where ``corpusname'' is the name of the corpus the attribute is associated with. Don't forget to give the new attribute name attrname as an additional argument (without a dot between the corpus and the attribute name); otherwise ``makeall'' will try to recreate all necessary data files of all positional attributes.

If necessary, you can now compress part of the corpus information.

Syntax of makeall

In an older version of the manual, the makeall syntax for indexing a single positional attribute is given as
makeall corpusname.attributename
This is wrong. The correct syntax is to declare a positional attribute with the standard -P option:
makeall -P attributename corpusname

How do I remove a positional attribute from a corpus?

Simply remove the corresponding entry from the registry file (or comment it out). If you delete the corresponding data files, you must change the registry file accordingly. Normally, you should first remove (or comment out) the entry in the registry file and delete the data files only later, once no problems have arisen in further use of the programs.
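
Commenting out the entry can be done by hand in any editor. For illustration, here is a sketch of how it might be scripted. It assumes that the registry file treats lines beginning with # as comments and that the entry starts with the keyword ATTRIBUTE followed by the attribute name; the function name unregister_attribute is made up for this example.

```shell
# unregister_attribute REGISTRY_FILE ATTRNAME
# Comment out the ATTRIBUTE entry for ATTRNAME. The original file is kept
# as REGISTRY_FILE.bak so the change is easy to undo. (Hypothetical helper;
# assumes '#' starts a comment in the registry file.)
unregister_attribute() {
    cp "$1" "$1.bak" || return 1
    awk -v a="$2" '
        $1 == "ATTRIBUTE" && $2 == a { print "# " $0; next }
        { print }
    ' "$1.bak" > "$1"
}

# Example call (the path is a placeholder):
#   unregister_attribute /path/to/registry/file attrname
```

The data files are deliberately left untouched, in line with the advice above to delete them only after the corpus has been used without problems.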

How can I compress corpus data?

A corpus can be compressed once it is fully indexed (that is, you must first run ``makeall'' to create the indices). You can then compress the ``token sequence file'' (the sequence of the values of an attribute), the ``inverted file index'' (which lists, for each value, the corpus positions at which this value appears), or both (which is the normal case).

IMS Stuttgart, Mon Feb 15 15:02:41 1999