If you do not know exactly how a word is spelled in the corpus, you can leave its spelling `underspecified' by stating a regular expression. CQP has adopted the POSIX egrep notation for regular expressions. This comprises the following operations: parentheses for marking embedded expressions, concatenation, disjunction, lists of alternative characters, the unspecified character, optionality, Kleene star, and Kleene plus. Certain types of regular expressions can be abbreviated by the use of a `flag'.
In the queries that follow, it is sometimes necessary to mark embedded regular expressions. For this purpose, the parentheses ( and ) are used.
Even a simple query like
"Clinton";
is an instance of a regular expression. It is formed by the concatenation of the characters C, l, i, n, t, o, and n. Concatenation is expressed by the juxtaposition of regular expressions.
Let's assume that we want to find the occurrences of the English word ``the'', and that we want both upper and lower case occurrences. This query can be expressed as
"(the)|(The)";
Here, the disjunction operator | lets CQP look for occurrences of the word form ``the'' and of the sentence-initial form ``The''. The disjunction operator is an infix operator which takes two regular expressions as its arguments. Due to the bracketing conventions for the disjunction operator, the above query is equivalent to
"the|The";
By inserting parentheses again, the query can be reformulated more concisely as
"(t|T)he";
By using a list of alternative characters, the last query can be rewritten once more as
"[tT]he";
For example, you can search for all occurrences of single digits in the corpus with the query
"[0123456789]";
This will retrieve any of the following tokens: ``0'', ``1'', ``2'', ... ``9''. A shorter way to formulate the same query is
"[0-9]";
Say you don't know whether the correct spelling is ``Velazquez'' or ``Velasquez''. You would then write
"Vela[zs]quez";
But you could also use the unspecified character . in place of the list of alternative characters. This makes the query less precise, but it is quicker to write.
"Vela.quez";
The . operator matches any single character.
You may want to find the two word forms ``walk'' and ``walks'' simultaneously. Both word forms are captured by the regular expression
"walk(s)?";
The optionality operator ? indicates that the preceding expression is optional. Since, by default, the ? operator takes the preceding character as its operand, the parentheses can be omitted in the above case.
"walks?";
However, in the query
"walk(ed)?";
which retrieves the occurrences of ``walk'' and those of ``walked'', the parentheses cannot be omitted! Without them, the ? operator would apply to the d only.
A word like ``walk'' has several morphological variants: ``walks'', ``walked'', and ``walking''. Being sloppy, we query for all word forms which start with the character sequence walk. This is expressed by
"walk.*";
The Kleene star operator * means that the preceding regular expression, here the unspecified character, can occur any number of times or not at all. Since this is a sloppy way to express our intended query, we also get matches like ``walker'', ``walkie-talkie'', etc.
In the last query, the word form ``walk'' itself was part of the query result. If you only want to see word forms which are strictly longer than ``walk'' itself, you have to use the plus operator + instead of the star *.
"walk.+";
The plus + works like the star *, but it requires that its argument expression occurs at least once.
It is a bit hard to think of natural language examples which match a regular expression where the Kleene operator takes a string of length 2 or longer as its argument. In the Penn Treebank corpus, the following query will match only the word ``Honolulu''.
".*lu(lu)+.*";
Some common types of regular expressions can be expressed much more concisely with the help of CQP flags.
With the %d flag, it is sufficient to specify the plain character without diacritics in the query; all its occurrences with diacritics will still be considered. E.g. a query for the word Spätzle becomes:
"Spatzle" %d;
With the %c flag, both upper and lower case variants of the query are retrieved. Example:
"the" %c;
searches for ``the'' as well as ``The''.
With the %l flag, all the CQP operators in the query are interpreted literally. For example,
"+" %l;
finds all occurrences of + in the corpus. This query is equivalent to
"\+";
As the %l option turns off both %d and %c, only the combinations %l, %c, %d and %cd are useful.
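The behaviour of the three flags can be approximated outside CQP. The sketch below is an emulation under stated assumptions, not CQP's implementation: `emulate_query` and `strip_diacritics` are hypothetical helpers, %c is modelled by Python's `re.IGNORECASE`, %d by removing Unicode combining marks from both pattern and token, and %l by plain string comparison that ignores the other flags.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Decompose characters and drop the combining marks (assumed model of %d)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def emulate_query(pattern, token, flags=""):
    """Hypothetical emulation of a CQP pattern with the flags c, d and l."""
    if "l" in flags:
        # %l: no operator is special, and %c and %d are switched off.
        return pattern == token
    re_flags = re.IGNORECASE if "c" in flags else 0
    if "d" in flags:
        pattern = strip_diacritics(pattern)
        token = strip_diacritics(token)
    return re.fullmatch(pattern, token, re_flags) is not None


print(emulate_query("Spatzle", "Spätzle", "d"))
print(emulate_query("the", "The", "c"))
print(emulate_query("+", "+", "l"))
print(emulate_query(r"\+", "+"))
```

The last two calls correspond to the two equivalent ways of searching for a literal +: with the %l flag, or with a backslash in front of the operator.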
We will conclude this section with a fairly nontrivial example of a regular expression. Let's assume we want to find occurrences of the German verb ``treffen''. Since German has a rich inflectional morphology, many word forms are based on this stem:
``treffen'', ``treffe'', ``triffst'', ``trifft'', ``trefft'', ``traf'', ``trafst'', ``trafen'', ``traft'', ``getroffen'', ``träfe'', ``träfst'', ``träfen'', ``träft'', ``treffend'', ``treff'', ``triff''
The easiest way of rendering this list as a regular expression would be to write down a long disjunction of all the individual word forms. However, the query can become shorter (but maybe more opaque) based on the following observations.
In total, we get the following regular expression
"[tT]reff(e(nd?)?|s?t)?|[tT]riff(s?t)?|[tT]raf(s?t|en)?|[tT]r\"af(en?|s?t)?|getroffen";