- an important aspect of interfacing CQP with external tools is access to
the corpus positions of query matches (as well as target and
keyword anchors); this is a prerequisite for the extraction of
further information about the matches by direct corpus access, and it is the
most efficient way to relate query matches to external data structures
(e.g. in a SQL database)
- the dump command (Section 3.3) prints the
required information in a tabular ASCII format that can easily be parsed by
other tools or read into a SQL database (see footnote 7); each row of the resulting table corresponds to one match of the query, and
the four columns give the corpus positions of the match,
matchend, target and keyword anchors,
respectively; the example below is reproduced from
Section 3.3
1019887 1019888 -1 -1
1924977 1924979 1924978 -1
1986623 1986624 -1 -1
2086708 2086710 2086709 -1
2087618 2087619 -1 -1
2122565 2122566 -1 -1
undefined target anchors are represented by -1 in the
third column; even though no keywords were set for the query, the fourth
column is included in the dump table with all values set to -1
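- the table format described above is simple enough to parse in a few lines; the following is a minimal sketch in Python (the language choice, the Match record and the parse_dump function are assumptions made for illustration and are not part of CQP); it maps undefined anchors (-1) to None

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """One row of a dump table (hypothetical helper type)."""
    match: int
    matchend: int
    target: Optional[int]   # None if the anchor is undefined (-1 in the dump)
    keyword: Optional[int]  # None if the anchor is undefined (-1 in the dump)


def parse_dump(text: str) -> List[Match]:
    """Parse the text printed by the dump command into Match records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()   # accepts TABs or blanks as separators
        if not fields:          # skip empty lines
            continue
        match, matchend, target, keyword = (int(f) for f in fields)
        rows.append(Match(match, matchend,
                          None if target < 0 else target,
                          None if keyword < 0 else keyword))
    return rows


# the example table shown above
EXAMPLE = """\
1019887 1019888 -1 -1
1924977 1924979 1924978 -1
1986623 1986624 -1 -1
2086708 2086710 2086709 -1
2087618 2087619 -1 -1
2122565 2122566 -1 -1
"""

matches = parse_dump(EXAMPLE)
with_target = [m for m in matches if m.target is not None]
```

here with_target keeps only those matches for which a target anchor was set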
- the table created by the dump command is printed to
stdout by default (where it is parsed by the CWB/Perl interface);
it can also be redirected to a file or pipe; use the following command to
create a compressed dump file:
> dump A > "| gzip > dump.tbl.gz";
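- a compressed dump file can be read back by an external tool without unpacking it first; the sketch below assumes Python and its standard gzip module, and the function name read_dump_file is hypothetical

```python
import gzip


def read_dump_file(path):
    """Yield one tuple of integers per row of a dump file.

    Files whose name ends in .gz are decompressed on the fly;
    other files are read as plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="ascii") as fh:
        for line in fh:
            fields = line.split()
            if fields:  # ignore empty lines
                yield tuple(int(f) for f in fields)
```

for instance, list(read_dump_file("dump.tbl.gz")) would collect all rows of the file created by the command above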
- sometimes it is desirable to reload a dump file into CQP after it has
been modified by an external program; this functionality is provided by the
undump command, which reads its input from a file or pipe
> undump B < "gzip -cd mydump.tbl.gz |";
this command creates a new named query result B for the currently
activated corpus, or overwrites an existing one
- the format for undump tables is almost identical to the dump format,
with two exceptions: the first line must contain a single number specifying
the total number of rows in the table (i.e. the number of matches), and the
following table has only two columns (for the match and
matchend anchors) by default; the example below is a valid undump
file for the DICKENS corpus, creating a query result with 5 matches
5
20681 20687
379735 379741
1915978 1915983
2591586 2591591
2591593 2591598
in an interactive CQP session, the input file can be omitted and the undump
table can then be entered directly on the command line; this feature works
only when command-line editing support is enabled with the -e
switch, so CWB/Perl and similar interfaces have to create a temporary file
for the undump table; try out the example above with
> undump B;
further columns for the target and keyword anchors (in
this order) can be selected by adding with target or
with target keyword to the command:
> undump B with target keyword < "mydump.tbl";
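- an external program that generates an undump table has to write the row count first, followed by two, three or four columns per row; the following Python sketch illustrates this under some assumptions: the function format_undump and its columns parameter are hypothetical, and TAB is assumed as the column separator

```python
def format_undump(rows, columns=2):
    """Return the text of an undump table.

    rows    -- sequence of (match, matchend, target, keyword) tuples,
               with -1 for undefined anchors
    columns -- 2 for a plain undump, 3 for "with target",
               4 for "with target keyword"
    """
    if columns not in (2, 3, 4):
        raise ValueError("columns must be 2, 3 or 4")
    rows = list(rows)
    lines = [str(len(rows))]  # first line: total number of rows
    for row in rows:
        lines.append("\t".join(str(value) for value in row[:columns]))
    return "\n".join(lines) + "\n"


# the five DICKENS matches from the example above, without target or keyword
ROWS = [
    (20681, 20687, -1, -1),
    (379735, 379741, -1, -1),
    (1915978, 1915983, -1, -1),
    (2591586, 2591591, -1, -1),
    (2591593, 2591598, -1, -1),
]

table = format_undump(ROWS)
```

writing table to a file such as mydump.tbl yields input for a plain undump command; with columns=4 the file has to be loaded with the with target keyword option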
- when the rows of the undump table are not sorted in the natural order
(i.e. by corpus position), they have to be re-ordered internally so that
CQP can work with them; however, the original sort order is automatically
remembered and will be used by the cat and dump commands
(until it is modified by a new sort); if you sort a query result
A, save it with dump to a text file, and then read this
file back in as named query B (after adjusting the file format),
A and B will be sorted identically
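- the file format adjustment mentioned above amounts to prepending the number of rows; a minimal Python sketch follows (the function name dump_to_undump is hypothetical); it keeps all four columns and does not change the order of the rows, so the result is meant to be loaded with the with target keyword option

```python
def dump_to_undump(dump_text):
    """Convert the text of a dump table into undump format.

    The rows are copied unchanged and in their original order;
    only the leading row count is added.
    """
    rows = [line for line in dump_text.splitlines() if line.strip()]
    return "\n".join([str(len(rows))] + rows) + "\n"


# a dump of a sorted query result need not be in corpus order
SORTED_DUMP = "2086708\t2086710\t2086709\t-1\n1019887\t1019888\t-1\t-1\n"
undump_text = dump_to_undump(SORTED_DUMP)
```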
- in many cases, overlapping or unsorted matches are not intentional but
rather errors in an automatically generated undump table; in order to catch
such errors, the additional keyword ascending (or asc) can
be specified before the < character:
> undump B with target ascending < "mydump.tbl";
this command aborts with an error message (indicating the row number where the error
occurred) unless the match ranges in mydump.tbl are non-overlapping
and sorted in corpus order
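- the same kind of check can be run by the external program before the table is written; the Python sketch below is an illustration only: the function name is hypothetical, and the exact conditions tested by CQP are an assumption (here: every range must start after the end of the previous one, and no range may end before it starts)

```python
def first_ascending_violation(rows):
    """Return the 1-based number of the first row that is out of order.

    rows -- sequence of tuples whose first two elements are the
            match and matchend positions
    Returns None if all ranges are non-overlapping and in corpus order.
    """
    previous_end = -1
    for number, row in enumerate(rows, start=1):
        start, end = row[0], row[1]
        if end < start or start <= previous_end:
            return number
        previous_end = end
    return None


good = [(20681, 20687), (379735, 379741), (1915978, 1915983)]
unsorted_rows = [(379735, 379741), (20681, 20687)]
overlapping = [(10, 15), (15, 20)]
```

with these inputs, the function is expected to accept good and to report row 2 for both unsorted_rows and overlapping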
- a typical example for the use of dump and undump is
linking CQP queries to corpus metadata stored in an external SQL database;
assume that a corpus consists of a large collection of transcribed
dialogues, which are represented by <dialogue> regions; extensive
metadata (about the speakers, setting, topic, etc.) is available
in a SQL database; the database entries can be linked directly to the
<dialogue> regions by recording their start and end corpus
positions in the database (see footnote 8); the following commands generate a dump table with the required information,
which can easily be loaded into the database (ignoring the third and fourth
columns)
> A = <dialogue> [] expand to dialogue;
> dump A > "dialogues.tbl";
corpus queries will often be restricted to a subcorpus by specifying
constraints on the metadata; when the metadata constraints have been
resolved in the SQL database, they can be translated to the corresponding
regions in the corpus (again represented by start and end corpus position);
after sorting these regions in ascending order and saving them to a suitable
undump file, they are loaded into CQP with an undump command; the
resulting query result can then be activated as a subcorpus for the ensuing
query
> undump SubCorpus < "subcorpus.tbl";
> SubCorpus;
SubCorpus[..]> A = ... ;
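- the round trip through the database can be sketched as follows; this is an illustration under stated assumptions, not a prescribed setup: it uses Python with an in-memory SQLite database, the table and column names (dialogue, cpos_start, cpos_end, topic) are hypothetical, and the corpus positions and topics are made-up sample data

```python
import sqlite3

# made-up stand-in for the contents of dialogues.tbl (four dump columns)
DUMP = "0\t99\t-1\t-1\n100\t249\t-1\t-1\n250\t399\t-1\t-1\n"
# made-up metadata, one topic per dialogue in corpus order
TOPICS = ["weather", "travel", "weather"]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE dialogue (cpos_start INTEGER, cpos_end INTEGER, topic TEXT)"
)
for line, topic in zip(DUMP.splitlines(), TOPICS):
    # only the first two columns are needed; target and keyword are ignored
    start, end = (int(field) for field in line.split()[:2])
    db.execute("INSERT INTO dialogue VALUES (?, ?, ?)", (start, end, topic))


def undump_for_topic(connection, topic):
    """Resolve a metadata constraint and return the text of an undump table."""
    rows = connection.execute(
        "SELECT cpos_start, cpos_end FROM dialogue "
        "WHERE topic = ? ORDER BY cpos_start",  # ascending corpus order
        (topic,),
    ).fetchall()
    lines = [str(len(rows))] + [f"{start}\t{end}" for start, end in rows]
    return "\n".join(lines) + "\n"


subcorpus_table = undump_for_topic(db, "weather")
```

saving subcorpus_table to subcorpus.tbl gives a file that can be loaded with the undump command shown above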