
Regular expressions over characters

If you do not know exactly how a word is spelled in the corpus, you can leave its spelling ``underspecified'' by stating a regular expression. CQP has adopted the POSIX egrep notation for regular expressions. This comprises the following operations: parentheses for marking embedded expressions, concatenation, disjunction, lists of alternative characters, the unspecified character, optionality, Kleene star, and Kleene plus. Certain types of regular expressions can be abbreviated by the use of a ``flag''.

Embedded regular expressions

In the operations described below, it is often necessary to mark an embedded regular expression, i.e. to delimit the part of the expression to which an operator applies. Parentheses ( and ) are used for this purpose.

Concatenation

Even a simple query like

"Clinton";
is an instance of a regular expression. It is formed by the concatenation of the characters C, l, i, n, t, o, and n. Concatenation is expressed by the juxtaposition of regular expressions.

Disjunction

Let us assume that we want to find the occurrences of the English word ``the'', in both upper and lower case. This query can be expressed as

"(the)|(The)";

Here, the disjunction operator | lets CQP look for occurrences of the word form ``the'' and of the sentence-initial form ``The''. The disjunction operator is an infix operator which takes two regular expressions as its arguments. Because disjunction binds more weakly than concatenation, the parentheses are redundant here, and the above query is equivalent to

"the|The";

By placing parentheses around a disjunction of the initial characters only, the query can be written more concisely as

"(t|T)he";

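The equivalence of the three formulations can be checked outside CQP. The following is a minimal sketch in Python, not CQP itself: it assumes that `re.fullmatch` is an adequate stand-in for CQP matching a regular expression against a whole token, and the token list is hypothetical.

```python
import re

# Hypothetical tokens standing in for corpus positions.
tokens = ["the", "The", "THE", "then", "he", "They"]

# The three formulations of the query from the text.
patterns = ["(the)|(The)", "the|The", "(t|T)he"]

def matching_tokens(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

results = {p: matching_tokens(p, tokens) for p in patterns}
for pattern, matched in results.items():
    print(pattern, matched)
```

All three patterns select the same two tokens, ``the'' and ``The''.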
Lists of alternative characters

By using a list of alternative characters, enclosed in square brackets, the last query can be rewritten once more:

"[tT]he";

Similarly, you can search for all occurrences of single digits in the corpus with the query

"[0123456789]";
This will retrieve any of the tokens ``0'', ``1'', ``2'', ..., ``9''. A shorter way to formulate the same query uses a character range:

"[0-9]";

Unspecified character

If you do not know whether the correct spelling is ``Velazquez'' or ``Velasquez'', you can write

"Vela[zs]quez";

You could also use the unspecified character . in place of the list of alternative characters. This makes the query less precise, but it is quicker to write.

"Vela.quez";
The . operator matches any single character, so this query would also match a token such as ``Velaxquez''.
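
The loss of precision can be made concrete with a small sketch. It uses Python's `re.fullmatch` as an assumed stand-in for CQP's whole-token matching; the tokens, including the misspellings, are invented for illustration.

```python
import re

# Hypothetical tokens; the last two are deliberate misspellings.
tokens = ["Velazquez", "Velasquez", "Velaxquez", "Velaquez"]

# List of alternative characters: only z or s is allowed in this position.
strict = [t for t in tokens if re.fullmatch("Vela[zs]quez", t)]

# Unspecified character: exactly one character of any kind is allowed.
sloppy = [t for t in tokens if re.fullmatch("Vela.quez", t)]

print(strict)
print(sloppy)
```

The dot accepts ``Velaxquez'' as well, but not ``Velaquez'', because it stands for exactly one character and cannot be empty.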

Optionality

You may want to find simultaneously the two word forms ``walk'' and ``walks''. Both word forms are captured by the regular expression

"walk(s)?";
The optionality operator  ?  indicates that the preceding expression is optional. Since, by default, the ? operator takes the preceding character as its operand, the parentheses can be omitted in the above case.
"walks?";
However, in the query
"walk(ed)?";
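which retrieves the occurrences of ``walk'' and those of ``walked'', the parentheses cannot be omitted: without them, ? would apply only to the d, and the query would match ``walke'' and ``walked'' instead.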
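
The effect of the parentheses on the scope of ? can be sketched as follows. This is an illustration in Python under the assumption that `re.fullmatch` mirrors CQP's whole-token matching; the token list is hypothetical.

```python
import re

# Hypothetical tokens; "walke" is not an English word form and is
# included only to expose the difference in operator scope.
tokens = ["walk", "walke", "walked", "walks"]

# With parentheses, ? makes the whole sequence "ed" optional.
grouped = [t for t in tokens if re.fullmatch("walk(ed)?", t)]

# Without parentheses, ? applies to the single preceding character d.
ungrouped = [t for t in tokens if re.fullmatch("walked?", t)]

print(grouped)
print(ungrouped)
```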

Iteration (Kleene star and Kleene plus)

A word like ``walk'' has several morphological variants: ``walks'', ``walked'', and ``walking''. As a rough approximation, we can query for all word forms which start with the character sequence walk. This is expressed by

"walk.*";

The Kleene star operator * means that the preceding regular expression, here the unspecified character, can occur any number of times, including not at all. Since this only approximates the intended query, we also get matches like ``walker'', ``walkie-talkie'' etc.

In the last query, the word form ``walk'' itself was part of the query result. If you only want to see word forms which are strictly longer than ``walk'' itself, you have to use the plus operator + instead of the star *.

"walk.+";
The plus + works like the star *, but it requires that its argument expression occurs at least once.
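
The difference between the two iteration operators can be sketched in Python. The sketch assumes that `re.fullmatch` behaves like CQP's matching of a complete token, and it runs over a hypothetical token list.

```python
import re

# Hypothetical tokens standing in for corpus positions.
tokens = ["walk", "walks", "walked", "walking",
          "walker", "walkie-talkie", "sidewalk"]

# Star: "walk" followed by zero or more arbitrary characters.
star = [t for t in tokens if re.fullmatch("walk.*", t)]

# Plus: "walk" followed by at least one arbitrary character.
plus = [t for t in tokens if re.fullmatch("walk.+", t)]

print(star)
print(plus)
# The only token matched by the star but not by the plus.
print([t for t in star if t not in plus])
```

Neither query selects ``sidewalk'', since the expression has to cover the token from its first character.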

It is a bit hard to think of natural language examples which match a regular expression where the Kleene operator takes a string of length 2 or longer as its argument. In the Penn Treebank corpus, the following query will match only the word ``Honolulu''.

".*lu(lu)+.*";

Flags

Some common types of regular expressions can be expressed much more concisely with the help of CQP flags.

%d 
``insert diacritics''

It is sufficient to specify the plain character without diacritics in the query; all its occurrences with diacritics will still be considered. E.g. a query for the word ``Spätzle'' becomes:

"Spatzle" %d;

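The idea behind %d can be approximated outside CQP by removing diacritics from each token before comparing it with the plain query string. The sketch below is an analogy in Python, not a description of how CQP implements the flag; the helper `strip_diacritics` and the token list are hypothetical.

```python
import re
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical tokens standing in for corpus positions.
tokens = ["Spätzle", "Spatzle", "Spätzles", "Spaetzle"]

# Compare the plain query with the diacritic-free form of each token.
matched = [t for t in tokens if re.fullmatch("Spatzle", strip_diacritics(t))]
print(matched)
```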
%c 
``case insensitive''

Retrieve both upper and lower case variants of the query. Example:

"the" %c;
for searching ``the'' as well as ``The''.
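
A rough Python analogue of %c is the `re.IGNORECASE` option. The sketch assumes that `re.fullmatch` corresponds to CQP's whole-token matching and uses a hypothetical token list.

```python
import re

# Hypothetical tokens standing in for corpus positions.
tokens = ["the", "The", "THE", "then", "Then"]

# Without the flag, only the exact spelling matches.
case_sensitive = [t for t in tokens if re.fullmatch("the", t)]

# With case-insensitive matching, every capitalisation of "the" matches.
case_insensitive = [t for t in tokens
                    if re.fullmatch("the", t, flags=re.IGNORECASE)]

print(case_sensitive)
print(case_insensitive)
```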

%l 
``literal use''

With this option, all the CQP operators in the query are interpreted literally.

"+" %l;

finds all occurrences of + in the corpus. This query is equivalent to

"\+";

As the %l option turns off both %d and %c, only the combinations %l, %c, %d and %cd are useful.

A nontrivial example

We conclude this section with a nontrivial example of a regular expression. Let us assume that we want to find occurrences of the German verb ``treffen''. Since German has a rich inflectional morphology, many word forms are based on this stem:

``treffen'', ``treffe'', ``triffst'', ``trifft'', ``trefft'', ``traf'', ``trafst'', ``trafen'', ``traft'', ``getroffen'', ``träfe'', ``träfst'', ``träfen'', ``träft'', ``treffend'', ``treff'', ``triff''

The easiest way of rendering this list as a regular expression would be to write down a long disjunction of all the individual word forms. However, the query can be made shorter (though perhaps more opaque) by grouping the forms by stem (treff, triff, traf, träf) and factoring out their shared endings.

In total, we get the following regular expression

"[tT]reff(e(nd?)?|s?t)?|[tT]riff(s?t)?|[tT]raf(s?t|en)?|[tT]r\"af(en?|s?t)?|getroffen";

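An expression of this size is easy to get wrong, so it is worth testing it against the list of word forms. The following Python sketch does this under two assumptions: `re.fullmatch` stands in for CQP's whole-token matching, and the sequence \"a in the query denotes the character ä, which is written directly in the Python pattern.

```python
import re

# The query from the text, with \"a written as the character ä (assumption).
pattern = (r"[tT]reff(e(nd?)?|s?t)?|[tT]riff(s?t)?|[tT]raf(s?t|en)?"
           r"|[tT]räf(en?|s?t)?|getroffen")

# The word forms listed in the text.
forms = ["treffen", "treffe", "triffst", "trifft", "trefft", "traf",
         "trafst", "trafen", "traft", "getroffen", "träfe", "träfst",
         "träfen", "träft", "treffend", "treff", "triff"]

# Hypothetical related words that are not forms of the verb.
non_forms = ["Treffpunkt", "treffer", "getroffene"]

# Strings that are not in the list above but may still be accepted.
extras = ["treffst", "träf"]

missed = [w for w in forms if not re.fullmatch(pattern, w)]
false_hits = [w for w in non_forms if re.fullmatch(pattern, w)]
overgenerated = [w for w in extras if re.fullmatch(pattern, w)]

print(missed)
print(false_hits)
print(overgenerated)
```

In this sketch the expression covers every listed form and rejects the three unrelated words. It also accepts the two unlisted strings in `extras`, which illustrates the price of compactness: a factored expression can match more than the original enumeration.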
Esther Koenig-Baumer
8/16/1999