
3.8 Improved Query Language (Q4M)
Andreas Mengel IMS Stuttgart
1 Application
2 Environment/Connectivity
3 Input
4 Query Language
4.1 Items
4.1.1 Element Attributes
4.1.2 Values
4.2 Operators
4.2.1 Assignment
4.2.2 Value Reference
4.2.3 Numerical Comparison
4.2.4 String Comparison
4.2.5 Time Relation
4.2.6 Structural Position
4.2.7 Logic
4.2.8 Set Expressions
4.2.9 Grouping
4.2.10 Elements
4.3 Syntax
5 Output
6 Error Messages
7 Acknowledgements
A corpus can consist of more than one of these hierarchies. For each element that contains sub-units, an internal structure can be defined. It is evident that the result of a query does not only depend on the expressiveness of a query language, but also on the encoding and the representation of the corpora. There are problems:
The query language described here, Q4M, is used for identifying constellations.
The term constellation is used in this document to refer to a very
general concept of output. In many query systems it is possible to search
for words or sequences of words that match certain criteria, thus the output
is a sequence of defined segments of corpora. The situation is different
if the corpora to be queried have a multi-level structure and annotation.
Consider a query that refers to time, POS-attributes, speaker identity,
coder properties, and a specific coreference aspect: What should be the
output? Certainly, it will make no sense at all to provide a sequence of
words as a result. Rather, it seems useful to provide information on where
elements of such constellations can be found, as it cannot be determined
what the user wants to be displayed. It is then up to the user to select
appropriate views of the locations that have been found.
<word lem="mouse">mice</word>
In the example above, the element is word; one attribute of the word is lem (the lemma), which has the value mouse. In query expressions the attribute is referred to by "word lem". Attribute names can also be given as regular expressions, e.g. word po.+.
The bundle of attributes of an element is called a tag. The content (in XML terminology: the #PCDATA child) of the element (mice) is referred to by the # attribute of the element (word).
<sentence id="sentence_1" coder="Alfred">
<word id="word_1" num="sing">The</word>
<word id="word_2" num="sing">small</word>
<word id="word_3" num="sing">boy</word>
<word id="word_4" ten="past">ran</word>
</sentence>
| sentence # : "The small boy ran" | sentence id : "sentence_1" | sentence coder : "Alfred" |
| word # : "The" | word id : "word_1" | word num : "sing" |
| word # : "small" | word id : "word_2" | word num : "sing" |
| word # : "boy" | word id : "word_3" | word num : "sing" |
| word # : "ran" | word id : "word_4" | word ten : "past" |
Variables (see below) are used to address elements; a variable can be used to address any of the values of the attributes of a tag. A single attribute of a tag is addressed by appending a space and the name of the attribute to the variable name. If word elements are referred to by a variable like $w, the attributes of these word elements can be referred to by $w lem, $w pos, $w num, and $w #. Element names can also be expressed by regular expressions, e.g. $a w.*.
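The addressing scheme can be pictured with a short Python sketch. It is only an illustration under assumptions: the helper names tag_of and bind are hypothetical and not part of Q4M, and the sentence is the example given above. Each element is turned into its tag (the bundle of attributes), with the element content stored under the # key.

```python
import xml.etree.ElementTree as ET

# The example sentence from the text above.
SENTENCE = """<sentence id="sentence_1" coder="Alfred">
<word id="word_1" num="sing">The</word>
<word id="word_2" num="sing">small</word>
<word id="word_3" num="sing">boy</word>
<word id="word_4" ten="past">ran</word>
</sentence>"""


def tag_of(element):
    """Return the attribute bundle (tag) of an element, content under '#'."""
    tag = dict(element.attrib)
    # Join all text below the element and normalise whitespace.
    tag["#"] = " ".join("".join(element.itertext()).split())
    return tag


def bind(root, name):
    """Hypothetical helper: collect the tags of all elements called `name`."""
    return [tag_of(e) for e in root.iter(name)]


root = ET.fromstring(SENTENCE)
w = bind(root, "word")               # plays the role of ($w word)
print([t["#"] for t in w])           # the values addressed by $w #
print([t.get("num") for t in w])     # the values of $w num (None where absent)
```

Missing attributes are returned as None here; how Q4M itself treats an attribute that an element does not carry is not specified in this section.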
The notion of element attributes has to be distinguished from
the notion of level of description. E.g., on the syntax level, there
may be elements like sentence, phrase, word etc. All
of them are distinct tags although belonging to one level of description.
The division of linguistic descriptions into levels is a matter of convention.
205.34
"NN"
"example"
$ab f0
(Assign f0 elements to $ab)
$cd phon
(Assign phone elements to $cd)
$ef word
(Assign word elements to $ef)
In the example ($ef word), after the assignment, any other attribute of the words, e.g. the start times ($ef start), can be accessed. The assignment operations have to be separated from the query expression by a semicolon:
($ef word)($gh phon); ($gh lab ~ "i:") && ($ef @ $gh)
Value Specification: ""
In order to access values of attributes which are not to be compared numerically, double quotes must be used:
$phrase type ~ "NP"
String Expansion: *, +, ., ?, \
The strings to be searched for can also be expressed by regular expressions. To search for sequences of characters that contain one of the symbols above, a backslash (\) must be put in front of the symbol:
$word ort ~ "sh*" (s, sh, shh, shhh, ...)
$word ort ~ "sh+" (sh, shh, shhh, ...)
$word ort ~ "sh?" (s, sh)
$word ort ~ "sh." (she, shu, shq, ...)
$word ort ~ ".+s" (is, has, miss, expressions, ...)
$word ort ~ "sh.*" (she, shed, show, shrink, ...)
$word ort ~ "sh\*" (sh*)
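The expansion symbols behave like their counterparts in common regular expression libraries. The following Python sketch reproduces the examples above with the standard re module. It assumes that a pattern has to cover the whole attribute value, which is what the listed matches suggest; the function name matches and the word list are illustrative only.

```python
import re


def matches(pattern, value):
    """Assumed semantics: the pattern must cover the whole attribute value."""
    return re.fullmatch(pattern, value) is not None


words = ["s", "sh", "shh", "she", "shed", "sh*", "is", "miss"]

# Print, for each pattern from the examples, the words it selects.
for pattern in ["sh*", "sh+", "sh?", "sh.", ".+s", "sh.*", r"sh\*"]:
    print(pattern, [w for w in words if matches(pattern, w)])
```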
Equal: ==
In order to check whether two values are equal, the double equal sign is used:
$a min == $b min
$a min == 50
Not Equal: !=
In order to check whether two values are unequal, the exclamation mark equal sign is used:
$a min != $b min
$a min != 50
Greater: >
In order to check whether one value (left) is greater than another (right), the > symbol is used:
$a min > $b min
$a min > 50
Smaller: <
In order to check whether one value (left) is smaller than another (right), the < symbol is used:
$a min < $b min
$a min < 50
Greater or equal: >=
In order to check whether one value (left) is greater than or equal to another (right), the >= symbol is used:
$a min >= $b min
$a min >= 50
Smaller or equal: <=
In order to check whether one value (left) is smaller than or equal to another (right), the <= symbol is used:
$a min <= $b min
$a min <= 5
Equal: ~
In order to check whether two string values are equal, the tilde sign is used:
$a lemma ~ $b lemma
$a lemma ~ "good"
$sent # ~ "I was here."
Not Equal: !~
In order to check whether two string values are unequal, the exclamation mark tilde sign is used:
$a lemma !~ $b lemma
$a lemma !~ "good"
$sent # !~ "I was here."
Left overlap: %
Left overlap of two elements specifies that the element in front of the operator starts before the second and ends after the start but before the end of the second element:
($a % $b) := ($a start < $b start) && ($a end > $b start) && ($a end < $b end)
$word % $hesitation
$pause % $eyemovement
Left alignment: [[
Left alignment of two elements specifies that both elements start at the same time.
($a [[ $b) := ($a start == $b start)
$word [[ $hesitation
$pause [[ $eyemovement
Right alignment: ]]
Right alignment of two elements specifies that the two elements end at the same time.
($a ]] $b) := ($a end == $b end)
$word ]] $hesitation
$pause ]] $eyemovement
Inclusion: @
Inclusion of two elements specifies that the element in front of the operator starts before the second and ends after the second element:
($a @ $b) := ($a start < $b start) && ($a end > $b end)
$word @ $hesitation
$pause @ $eyemovement
Identical duration: []
Identical duration of two elements specifies that the elements have the same start and end position:
($a [] $b) := ($a start == $b start) && ($a end == $b end)
$word [] $hesitation
$pause [] $eyemovement
Overlap: //
Overlap of two elements specifies that the element in front of the operator and the second element share some moments of time.
($a // $b) := (($a start == $b start) && ($a end == $b end)) || (($a start > $b start) && ($a start < $b end)) || (($a end > $b start) && ($a end < $b end))
$word // $hesitation
$pause // $eyemovement
Contact: ][
Contact of two elements specifies that the element in front of the operator ends at the same time as the second element starts.
($a ][ $b) := ($a end == $b start)
$word ][ $hesitation
$pause ][ $eyemovement
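The time relations are plain comparisons of start and end values, so they can be transcribed directly. The Python sketch below does this; the Segment class and the function names are illustrative assumptions, not part of Q4M. The overlap test follows the definition literally, so as written it does not cover the case in which the first element strictly includes the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for an element with start and end attributes."""
    start: float
    end: float


def left_overlap(a, b):       # $a % $b
    return a.start < b.start and b.start < a.end < b.end


def left_aligned(a, b):       # $a [[ $b
    return a.start == b.start


def right_aligned(a, b):      # $a ]] $b
    return a.end == b.end


def includes(a, b):           # $a @ $b
    return a.start < b.start and a.end > b.end


def same_duration(a, b):      # $a [] $b
    return a.start == b.start and a.end == b.end


def overlaps(a, b):           # $a // $b, transcribed from the definition
    return ((a.start == b.start and a.end == b.end)
            or (b.start < a.start < b.end)
            or (b.start < a.end < b.end))


def contact(a, b):            # $a ][ $b
    return a.end == b.start


word = Segment(1.0, 2.0)
hesitation = Segment(1.5, 2.5)
print(left_overlap(word, hesitation), overlaps(word, hesitation))
print(contact(word, Segment(2.0, 3.0)))
```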
Direct neighbourhood can be expressed by time relations. Looking for all occurrences of pairs of words can be expressed as
Referring to time:
($a word) ($b word) ; ($a ][ $b)
(Find all pairs of words where the end of the first equals the start of the second.)
But: This query will only find pairs of words that are not separated by a pause.
Two operators can be used for accessing structural and hierarchy related
units. The XML document and its graphical representation in the box below
may serve as illustration.
        doc1
        /  \
       /    \
      /      \
    s1        s2
   /  \      /  \
  /    \    /    \
 w1    w2  w3    w4

<doc id="doc1">
  <s id="s1">
    <w id="w1"/>
    <w id="w2"/>
  </s>
  <s id="s2">
    <w id="w3"/>
    <w id="w4"/>
  </s>
</doc>
Parent: ^
The element in front of ^ is an ancestor node of the element after ^. Note that any ancestor-descendant pair will be selected by this operator; there is no distinction between parents, grandparents and the like.
($p phrase)($w word); ($p ^ $w)
(Find phrase-word pairs where the phrases are ancestor nodes of the word elements.)
($w word) ($ph phone); ($w # ~ "stop") && ($w ^ $ph) && ($ph type ~ "\?")
(Find all realizations of the word stop where a glottal stop [?] appears.)
Numerical specification of hierarchical relations
In addition, numerals can be used to further specify the hierarchical relation as in
($d doc)($w word); ($d 2^3 $w)
The numeral to the left of the ^ operator (2) specifies the hierarchical distance between the nodes; a distance of 1 describes direct parenthood. The numeral to the right of the ^ operator (3) specifies the horizontal position of the element after the operator: all nodes at a given generation distance from the parent node are assigned a position value, and the horizontal descriptor selects the child node at that position in the row. In the example query above, the <w> element with ID w3 would be selected. There is a special character m which can be used after the operator as in
($e *)($w word); ($e ^m $w)
denoting each last node in a generation row. The result pairs of this query would be
<doc id="doc1"> <s id="s2">
<doc id="doc1"> <w id="w4">
<s id="s1"> <w id="w2">
<s id="s2"> <w id="w4">
The meaning of the query could be expressed as:
Find pairs of word elements and other elements which share an ancestor-relation and where the word node is the last in a given generation.
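The numeric forms of the ^ operator can be traced on the example document with a few lines of Python. This is a sketch of one possible reading, assuming that positions are counted from 1 and from left to right within a generation row; the dictionary TREE and the function names are illustrative and not part of Q4M.

```python
# The example document as a mapping from each node to its children.
TREE = {"doc1": ["s1", "s2"], "s1": ["w1", "w2"], "s2": ["w3", "w4"]}


def generation(node, distance):
    """Nodes `distance` levels below `node`, from left to right."""
    row = [node]
    for _ in range(distance):
        row = [child for n in row for child in TREE.get(n, [])]
    return row


def child_at(node, distance, position):
    """Assumed reading of ($x distance^position $y); 'm' means the last node."""
    row = generation(node, distance)
    if not row:
        return None
    if position == "m":
        return row[-1]
    if 1 <= position <= len(row):
        return row[position - 1]
    return None


def last_in_each_generation(node):
    """Pairs selected by the ^m form for one ancestor node."""
    pairs, distance = [], 1
    while generation(node, distance):
        pairs.append((node, child_at(node, distance, "m")))
        distance += 1
    return pairs


print(child_at("doc1", 2, 3))
print([pair for node in TREE for pair in last_in_each_generation(node)])
```

The sketch does not filter by element name, so it lists every pair in which the second node is the last of its generation row.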
Predecessor: <>
The element in front of <> is a predecessor of the element after <>. Expressions with this operator are evaluated in the following way:
Being a predecessor of another element implies that this other element is a neighbour element. The prerequisite for the neighbour concept is that neighbours are defined by the existence of a common parent node. As in the case of the ^ operator, the <> operator can also have numerals on both sides. The use of the operator without numerals will find all neighbour pairs of the elements specified.
($a word) ($b word); ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")
(Find nouns which are followed by the word lesser.)
($c phone) ($d phone); ($c class ~ "fricative") && ($c <> $d) && ($d seg ~ "\?")
(Find fricatives which are followed by a glottal stop [?].)
A numeral on the left hand side of the <> operator indicates the hierarchical distance of the element relative to which two given nodes can be defined as neighbours. In the query expression
($w1 word)($w2 word); ($w1 2<> $w2)
all <w> nodes which have a common parent node two hierarchical steps above are treated as neighbours, i.e., all <w> elements, because all of them have the <doc> node as a common parent node with a hierarchical distance of 2.
A numeral on the right hand side of the <> operator specifies the distance between the nodes. A query like
($w1 word)($w2 word); ($w1 2<>2 $w2)
will give the following result.
<w id="w1"> <w id="w3">
<w id="w2"> <w id="w4">
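The effect of the two numerals on the <> operator can be reproduced on the example document. The Python sketch below is an illustration under assumptions: the left numeral is read as the height of the common ancestor, the right numeral as the number of steps between the two nodes within that ancestor's generation row, and the names TREE, generation and neighbours are hypothetical.

```python
# The example document as a mapping from each node to its children.
TREE = {"doc1": ["s1", "s2"], "s1": ["w1", "w2"], "s2": ["w3", "w4"]}


def generation(node, distance):
    """Nodes `distance` levels below `node`, from left to right."""
    row = [node]
    for _ in range(distance):
        row = [child for n in row for child in TREE.get(n, [])]
    return row


def neighbours(height, step):
    """Pairs (a, b) where b follows a by `step` positions in a shared row."""
    pairs = []
    for ancestor in TREE:  # leaf nodes have no rows of their own
        row = generation(ancestor, height)
        pairs.extend(zip(row, row[step:]))
    return pairs


print(neighbours(2, 2))   # common ancestor two levels up, two steps apart
print(neighbours(1, 1))   # direct neighbours below a common parent
```

With 2 on both sides the sketch yields the two pairs listed above. It does not filter by element name, so with a height of 1 the pair of <s> elements is listed as well.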
($w word)($h hesitation)($n noise); ($w // $h) || ($w // $n)
(Find words which overlap with hesitations or noise or both.)
($var01 word); ($var01 f0mean == 120) || ($var01 end <= 233.33)
(Find words which have a mean f0 value of 120 or which end at or before time 233.33.)
($var02 word); ($var02 # ~ "hello") && ($var02 dur > 0.5)
(Find words written hello whose duration is longer than 0.5 seconds.)
($w1 word)($w2 word); ($w1 # ~ "no") && ($w2 # ~ "yes") && !($w1 // $w2)
(Find words written no which do not overlap with words written yes.)
Instead of && and ||, the words and and or can be used.
Negating a simple or a complex query expression will find all element tuples which are not in the set of tuples specified by the query expression that is negated.
In a series of simple expressions concatenated by && or ||, conjunctions (&&) are evaluated first.
NB: The use of && or and requires that the brackets to be combined by logical AND share at least one variable. Otherwise the result of the expression is void.
The expression
($a w)($b s); ($a pos ~ "nn") && ($b start > 12.01)
does not make sense since there are no elements by the identity of which the results of the two expressions could be joined. If one only wants all <w> elements and <s> elements that fulfil the conditions in the expression, the use of logical OR would be appropriate:
($a w)($b s); ($a pos ~ "nn") || ($b start > 12.01)
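One way to picture the requirement is to treat the result of each bracket as a list of variable bindings and logical AND as a join on the variables the two lists have in common. The Python sketch below illustrates this reading only; it is not a description of how Q4M is implemented, and the function name conjoin and the sample bindings are made up.

```python
def conjoin(left, right):
    """Join two result sets (lists of variable -> element id) on shared variables."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            # Without a shared variable there is nothing to join on,
            # so the combined result stays void.
            if shared and all(l[v] == r[v] for v in shared):
                joined.append({**l, **r})
    return joined


nouns = [{"$a": "w1"}, {"$a": "w3"}]          # e.g. ($a pos ~ "nn")
late = [{"$b": "s2"}]                         # e.g. ($b start > 12.01)
followed = [{"$a": "w1", "$c": "w2"},         # e.g. ($a <> $c)
            {"$a": "w2", "$c": "w3"}]

print(conjoin(nouns, late))       # no shared variable
print(conjoin(nouns, followed))   # joined on $a
```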
Intersection of elements: {
All elements denoted by the variable in front of the operator which are in the set of elements denoted by the variable after the operator are returned.
($a w)($b w); ($a <> $b) and ($a { $b)
(Returns all tuples of <w> elements for which it is true that they are neighbour elements and that each of the elements is within the set of left neighbours and is a member of the set of right hand side neighbours.)
Not in set of elements: !{
All elements denoted by the variable in front of the operator which are not in the set of elements denoted by the variable after the operator are returned.
($a w)($b w); ($a <> $b) and ($a !{ $b)
(Returns all tuples of <w> elements for which it is true that they are neighbour elements and that each of the elements is within the set of left neighbours and not a member of the set of right hand side neighbours.)
Union of elements: {}
All elements denoted by the variable in front of the operator plus the elements denoted by the variable after the operator are returned.
($a w)($b w); ($a <> $b) and ($a {} $b)
(Returns all tuples of <w> elements for which it is true that they are neighbour elements and that each of the elements is within the set of left neighbours or a member of the set of right hand side neighbours.)
Intersection of attribute values: {
The elements denoted by the variable in front of the operator whose attribute values are in the set of attribute values of the elements denoted by the variable after the operator are returned.
($a w)($b word); ($a pos { $b pos)
(Returns all <w> elements whose pos value is contained in the set of pos values of <word> elements.)
Not in set of attribute values: !{
The elements denoted by the variable in front of the operator whose attribute values are not in the set of attribute values of the elements denoted by the variable after the operator are returned.
($a w)($b word); ($a pos !{ $b pos)
(Returns all <w> elements whose pos value is not a member of the set of pos values of <word> elements.)
Union of attribute values: {}
The elements denoted by the variable in front of the operator are returned, plus those elements denoted by the variable after the operator whose attribute values are not in the set of attribute values of the elements denoted by the variable in front of the operator.
($a w)($b word); ($a pos {} $b pos)
(Returns all <w> elements and those <word> elements whose pos value is not in the set of pos values of <w> elements.)
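The three attribute value operators correspond to ordinary set operations. The Python sketch below restates them for small, made-up lists of elements; the element dictionaries and the function names are illustrative assumptions, not part of Q4M.

```python
# Made-up elements: three <w> elements and two <word> elements.
w_elems = [{"id": "w1", "pos": "NN"},
           {"id": "w2", "pos": "VB"},
           {"id": "w3", "pos": "JJ"}]
word_elems = [{"id": "word1", "pos": "NN"},
              {"id": "word2", "pos": "DT"}]


def in_values(a, b, att):         # ($a att { $b att)
    values = {e[att] for e in b}
    return [e for e in a if e[att] in values]


def not_in_values(a, b, att):     # ($a att !{ $b att)
    values = {e[att] for e in b}
    return [e for e in a if e[att] not in values]


def union_values(a, b, att):      # ($a att {} $b att)
    values = {e[att] for e in a}
    # All of a, plus those of b whose value does not occur in a.
    return a + [e for e in b if e[att] not in values]


print([e["id"] for e in in_values(w_elems, word_elems, "pos")])
print([e["id"] for e in not_in_values(w_elems, word_elems, "pos")])
print([e["id"] for e in union_values(w_elems, word_elems, "pos")])
```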
($wo word);
(($wo # ~ "here") || ($wo # ~ "hear")) &&
(($wo spk ~ "Mary") || ($wo spk ~ "Peter"))
(Find a word which is either "here" or "hear" and which is spoken by either Mary or Peter.)
($a w); would return a list of all <w> elements loaded.
| Express | ::= | ( ( DefinEx )+ ";" ( ComplEx )? ) ( <EOF> | "\n" ) |
| DefinEx | ::= | <BPL> Asg AsgElm <BPR> |
| Asg | ::= | "$" <CHR> |
| AsgElm | ::= | <CHR> |
| ComplEx | ::= | Brack ( <LOP> Brack )* |
| Brack | ::= | ( <NLOP> )? <BPL> ( Opert | ComplEx ) <BPR> |
| Opert | ::= | ( ValCmp | LocCmp | GrpCmp ) |
| ValCmp | ::= | ElmAtt ( NumComp | StrComp ) |
| ElmAtt | ::= | Elm Att |
| NumComp | ::= | <NOP> ( NUMBER | ElmAtt ) |
| StrComp | ::= | <SOP> ( ( <QUO> <CHR> <QUO> ) | ElmAtt ) |
| LocCmp | ::= | Elm ( <TIP> | ( <CHR> )* <HRP> ( <CHR> )* ) Elm |
| GrpCmp | ::= | ( GrpElm | GrpAtt ) |
| GrpElm | ::= | Elm <GRP> Elm |
| GrpAtt | ::= | ElmAtt <GRP> ElmAtt |
| Elm | ::= | ( "$" ) <CHR> |
| Att | ::= | ( <CHR> | <HSH> ) |
| NUMBER | ::= | ( <MIN> )? <CHR> |
| BPL | ::= | ( "(" ) |
| BPR | ::= | ( ")" ) |
| CHR | ::= | ( ~[";","$","^","~",",","=","\"","&","|","!","<",">",
"(",")","{","}","\n"," "] )+ |
| LOP | ::= | ( "&&" | "||" | "and" | "or" ) |
| NLOP | ::= | ( "!" ) |
| SOP | ::= | ( "~" | "!~" ) |
| QUO | ::= | ( "\"" ) |
| TIP | ::= | ( "%" | "@" | "[[" | "]]" | "][" | "[]" | "//" ) |
| HRP | ::= | ( "<>" | "^" ) |
| GRP | ::= | ( "{" | "!{" | "{}" ) |
| HSH | ::= | ( "#" ) |
| MIN | ::= | ( "-" ) |
| NOP | ::= | ( "==" | "<" | ">" | "<=" | ">=" | "!=" ) |
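As an illustration of the top-level rule Express, the following Python sketch splits a query at the first semicolon into its definitions (DefinEx) and the remaining query expression (ComplEx). It deliberately does not implement the full grammar; the regular expression and the function name split_query are assumptions made for this example.

```python
import re

# One definition: "(", a variable starting with "$", an element name, ")".
DEFINITION = re.compile(r"\(\s*(\$[^\s()$;]+)\s+([^\s()$;]+)\s*\)")


def split_query(query):
    """Return ({variable: element name}, expression string) for a query."""
    head, sep, tail = query.partition(";")
    if not sep:
        raise ValueError("missing ';' after the definitions")
    definitions = dict(DEFINITION.findall(head))
    if not definitions:
        raise ValueError("at least one definition is required")
    return definitions, tail.strip()


definitions, expression = split_query(
    '($ef word)($gh phon); ($gh lab ~ "i:") && ($ef @ $gh)')
print(definitions)
print(expression)
```

A query that consists of definitions only, such as ($a w);, yields an empty expression string, in line with the optional ComplEx in the grammar.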
If the query was this:
($w word) ($f f0); ($w # ~ "wonder") and ($w @ $f) and ($f cat ~ "H*")
The result of the query could be the following one:
<qures id="qures_34"
  qstring="($w word)($f f0) ; ($w # ~ &quot;wonder&quot;) and ($w @ $f) and ($f cat ~ &quot;H*&quot;)">
  <qrdesc id="qrdesc_1">
    <qrtype id="qrtype_1" name="word" var="$w"/>
    <qrtype id="qrtype_2" name="f0" var="$f"/>
  </qrdesc>
  <qutup id="qutup_1">
    <quelm id="quelm_1" href="words.xml#id(w_02)" refvar="$w"/>
    <quelm id="quelm_2" href="proso.xml#id(f0_01)" refvar="$f"/>
  </qutup>
  <qutup id="qutup_2">
    <quelm id="quelm_3" href="words.xml#id(w_07)" refvar="$w"/>
    <quelm id="quelm_4" href="proso.xml#id(f0_05)" refvar="$f"/>
  </qutup>
  <qutup id="qutup_3">
    <quelm id="quelm_5" href="words.xml#id(w_23)" refvar="$w"/>
    <quelm id="quelm_6" href="proso.xml#id(f0_12)" refvar="$f"/>
  </qutup>
  <qutup id="qutup_4">
    <quelm id="quelm_7" href="words.xml#id(w_45)" refvar="$w"/>
    <quelm id="quelm_8" href="proso.xml#id(f0_34)" refvar="$f"/>
  </qutup>
  <qutup id="qutup_5">
    <quelm id="quelm_9" href="words.xml#id(w_67)" refvar="$w"/>
    <quelm id="quelm_10" href="proso.xml#id(f0_45)" refvar="$f"/>
  </qutup>
</qures>
Note that by providing both the href to the elements found and the variable names used for the specification in the query expressions, each query result can be used as input for a new query. The IDs of the elements are output to tell the user which elements were identified by the query. This does not seem to be very important if the query expression includes only words and their part of speech information, but it is quite useful if the query expression includes sentences on the one hand and phones on the other: How else could a user know which phone within the sentence matches the query?
The DTD for query results looks as follows:
<!DOCTYPE qures[
<!ELEMENT qures (qrdesc?, (qutup)+)>
<!ATTLIST qures
id ID #REQUIRED
qstring CDATA #REQUIRED>
<!ELEMENT qrdesc (qrtype)+>
<!ATTLIST qrdesc
id ID #REQUIRED>
<!ELEMENT qrtype ANY>
<!ATTLIST qrtype
id ID #REQUIRED
name CDATA #IMPLIED
type CDATA #IMPLIED>
<!ELEMENT qutup (quelm)+>
<!ATTLIST qutup
id ID #REQUIRED>
<!ELEMENT quelm ANY>
<!ATTLIST quelm
id ID #REQUIRED
refvar CDATA #IMPLIED
href CDATA #IMPLIED
xml:link CDATA #FIXED "simple"
show CDATA #FIXED "embed"
actuate CDATA #FIXED "auto">
]>
Another way of reusing the output of queries is to refer to the identified items by a variable. If the aim of a user is to select particular elements among those found by a first query, s/he can use all elements selected in this first step in further queries by specifying these elements.
If the query was this:
($w word) ($f f0); ($w # ~ "wonder") && ($w @ $f) && ($f cat ~ "H*")
The result of the query could be the following one:
<qures id="qures_1">
<qutup id="qutup_1">
<quelm id="quelm_1" href="words.xml#id(w_02)" refvar="$w"/>
<quelm id="quelm_2" href="proso.xml#id(f0_01)" refvar="$f"/>
</qutup>
<qutup id="qutup_2">
<quelm id="quelm_3" href="words.xml#id(w_07)" refvar="$w"/>
<quelm id="quelm_4" href="proso.xml#id(f0_05)" refvar="$f"/>
</qutup>
...
...
</qures>
In the next query the user can point to all the words found in the
following way, if the words shall be constrained to their POS value:
($w word) ($e el); ($e refvar ~ "$w") && ($e ^ $w) && ($w pos ~ "V")
I.e., the user has to make explicit that there was a query element
referring to this word ($e ^ $w) by the variable name $w
($e refvar ~ "$w"); the additional information is given by adding
the POS constraint
($w pos ~ "V").
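Outside Q4M, a result file of this shape can also be inspected with any XML library. The Python sketch below groups the href values of a small result by variable name. It assumes the element and attribute names of the DTD above; the function name hrefs_by_variable and the shortened sample result are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A shortened query result following the DTD above.
RESULT = """<qures id="qures_1">
<qutup id="qutup_1">
<quelm id="quelm_1" href="words.xml#id(w_02)" refvar="$w"/>
<quelm id="quelm_2" href="proso.xml#id(f0_01)" refvar="$f"/>
</qutup>
<qutup id="qutup_2">
<quelm id="quelm_3" href="words.xml#id(w_07)" refvar="$w"/>
<quelm id="quelm_4" href="proso.xml#id(f0_05)" refvar="$f"/>
</qutup>
</qures>"""


def hrefs_by_variable(xml_text):
    """Map each variable name (refvar) to the hrefs of the elements it matched."""
    by_var = defaultdict(list)
    for quelm in ET.fromstring(xml_text).iter("quelm"):
        by_var[quelm.get("refvar")].append(quelm.get("href"))
    return dict(by_var)


found = hrefs_by_variable(RESULT)
print(found["$w"])   # locations of the words found
print(found["$f"])   # locations of the f0 elements found
```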
Last modification: 18 Nov 1999