Graph descriptions or graph constraints are (restricted) Boolean expressions over node relations and node descriptions. Currently, conjunction & and disjunction | are available as logical connectives. For example, with the help of the &-operator, the following node relations can be joined into a graph constraint which retrieves the tree shown below.
(#n1 >SB #n2) & (#n1 >HD #n3) & (#n2 >NK #n4) & (#n2 >NK #n5)
Parentheses can be omitted in the usual fashion:
#n1 >SB #n2 & #n1 >HD #n3 & #n2 >NK #n4 & #n2 >NK #n5
The operator precedence is defined as follows: Relation, &, |. This definition is illustrated by the following examples:
Example          Interpretation
#v > #w & #x     (#v > #w) & #x
#v & #w | #x     (#v & #w) | #x
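The precedence rule can be made concrete with a small sketch. The Python code below is not part of the query processor; the helper names split_top_level and group are hypothetical and serve only as an illustration. It splits an expression first at | and then at &, so that relations stay intact as the most tightly bound units. As a simplifying assumption, operator characters inside quoted strings are not handled.

```python
def split_top_level(expr, op):
    """Split expr at each occurrence of op outside parentheses and brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == op and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def group(expr):
    """Return a list of disjuncts, each given as a list of conjuncts.

    Splitting at | first and at & second mirrors the precedence
    Relation, &, |: relations are never split, & binds tighter than |.
    Assumption: no & or | occurs inside a quoted string.
    """
    return [split_top_level(d, "&") for d in split_top_level(expr, "|")]


print(group("#v > #w & #x"))
print(group("#v & #w | #x"))
```

The outer list corresponds to the disjunction and each inner list to a conjunction, which reproduces the two interpretations given in the table.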
Variables for feature values
Variables for feature values are typically used to express agreement constraints. The following query looks for two adjacent nodes that carry the same pos value, which must be either NN or NE.
[pos = #noun] . [pos = #noun:("NN" | "NE")]
Variables for feature constraints
With variables for feature constraints, we can search, for example, for sentences which contain the same preposition (the same word form!) twice:
[#f:(pos="APPR")] .* [#f]
Please note: There is a subtle difference if we use a feature value variable instead. If we only require the identity of the feature value, i.e. of the part-of-speech tag, we get all sentences which contain at least two prepositions (not necessarily the same word form!):
[pos = #v:"APPR"] .* [pos=#v]
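The difference between the two kinds of variables can be illustrated outside the query language. The following Python sketch rests on two assumptions: tokens are modelled as plain dictionaries with the features word and pos only, and a shared feature constraint variable is read as "all features of the two tokens are identical". The function names and the sample tokens are hypothetical.

```python
def pairs_with_same_value(tokens, feature, value):
    """Pairs (i, j) with i before j where both tokens have feature == value.

    Mirrors a shared feature value variable such as [pos=#v:"APPR"] .* [pos=#v].
    """
    return [
        (i, j)
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens))
        if tokens[i].get(feature) == value and tokens[j].get(feature) == value
    ]


def pairs_with_same_record(tokens, feature, value):
    """Pairs (i, j) with i before j where token i has feature == value and
    token j carries exactly the same features (assumed reading of [#f]).
    """
    return [
        (i, j)
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens))
        if tokens[i].get(feature) == value and tokens[i] == tokens[j]
    ]


# Made-up sample sentence, for illustration only.
tokens = [
    {"word": "in", "pos": "APPR"},
    {"word": "Haus", "pos": "NN"},
    {"word": "mit", "pos": "APPR"},
    {"word": "Garten", "pos": "NN"},
    {"word": "in", "pos": "APPR"},
    {"word": "Berlin", "pos": "NE"},
]

print(pairs_with_same_value(tokens, "pos", "APPR"))
print(pairs_with_same_record(tokens, "pos", "APPR"))
```

In this sample, the value-based check pairs every two prepositions, whereas the record-based check only pairs the two occurrences of "in".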
Node variables
Node variables are necessary to express multiple node relations with respect to one node, e.g. to list the children of a node, as in the example in subsection 7.1:
#np:[cat="NP"] & #np > [pos="ADJA"] & #np > [pos="NN"]
Node (in)equality
Two node variables #n1 and #n2 may match the same node in the corpus. If this causes problems, the inequality of the two node variables can be enforced, e.g. by adding the following subformula, which requires the variables #n1 and #n2 to match distinct nodes (due to the irreflexivity of the precedence relation):
((#n1 .* #n2) | (#n2 .* #n1))
If your corpus contains unary transitions (nonterminal nodes with one single nonterminal daughter), you should use a weaker constraint for node inequality:
((#n1 .* #n2) | (#n2 .* #n1)) | ((#n1 >* #n2) | (#n2 >* #n1))
In principle, the operators introduced so far are sufficient to describe syntax graphs. For reasons of convenience, and to a certain extent for reasons of completeness, we have added so-called graph predicates, e.g. to designate the root of a graph.
Root predicate
The root of a graph (for a whole sentence) can be identified by the predicate root.
root(#n1)
Arity predicates
The following graph description describes all graphs which contain a certain node #n1 with at least two children #n2 and #n3:
(#n1 > #n2) & (#n1 > #n3)
However, one would like to state that there must be exactly two children. For this reason, we introduce a two-place operator arity in order to be able to restrict the number of children of a node #n1, e.g. to two children:
(#n1 > #n2) & (#n1 > #n3) & arity(#n1,2)
The arity predicate can also take three arguments in order to specify an interval for the number of children, e.g. from two to four children:
(#n1 > #n2) & (#n1 > #n3) & arity(#n1,2,4)
Similarly, there is a tokenarity operator to constrain the number of leaves which are dominated by a node. In the following, the first example states that node #n1 must dominate exactly 5 terminal nodes, and the second example states that node #n1 must dominate between 5 and 7 leaves.
tokenarity(#n1,5)
tokenarity(#n1,5,7)
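The intended meaning of the two predicates can be sketched in Python. The code below is only an illustration and not the implementation used by the query processor: the tree is assumed to be a dictionary that maps each nonterminal to the list of its children, every node without an entry is treated as a leaf, and interval bounds are taken to be inclusive. The tree itself is a made-up example.

```python
def arity(tree, node, low, high=None):
    """True if node has between low and high children (exactly low if high is omitted)."""
    if high is None:
        high = low
    return low <= len(tree.get(node, [])) <= high


def leaves(tree, node):
    """All leaves dominated by node; a node without children counts as its own leaf."""
    children = tree.get(node, [])
    if not children:
        return [node]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


def tokenarity(tree, node, low, high=None):
    """True if node dominates between low and high leaves (exactly low if high is omitted)."""
    if high is None:
        high = low
    return low <= len(leaves(tree, node)) <= high


# Hypothetical tree: S has two children, NP has three, all other nodes are leaves.
tree = {
    "S": ["NP", "VVFIN"],
    "NP": ["ART", "ADJA", "NN"],
}

print(arity(tree, "S", 2), arity(tree, "NP", 2), arity(tree, "NP", 2, 4))
print(tokenarity(tree, "S", 4), tokenarity(tree, "S", 5, 7))
```

Note that arity counts direct children only, while tokenarity counts all leaves below the node, so S has arity 2 but tokenarity 4 in this sketch.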
Continuity predicates
It may be useful to state whether or not the leaves which are dominated by a node form a continuous string. For this purpose, the two unary operators continuous and discontinuous have been introduced:
continuous(#n1)
discontinuous(#n1)
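A minimal Python sketch of the continuity check follows. It is an illustration under stated assumptions, not the actual implementation: the tree is a dictionary from nonterminals to their children, each leaf has an integer position in the sentence, and a node counts as continuous if the positions of its leaves form an unbroken range. All names and the example tree are hypothetical.

```python
def leaf_positions(tree, position, node):
    """Set of sentence positions of all leaves dominated by node."""
    children = tree.get(node, [])
    if not children:
        return {position[node]}
    result = set()
    for child in children:
        result |= leaf_positions(tree, position, child)
    return result


def continuous(tree, position, node):
    """True if the leaves below node occupy an unbroken range of positions."""
    positions = leaf_positions(tree, position, node)
    return max(positions) - min(positions) + 1 == len(positions)


def discontinuous(tree, position, node):
    """True if there is at least one gap in the leaves below node."""
    return not continuous(tree, position, node)


# Hypothetical tree with crossing branches: VP covers t0, t3 and t4,
# while t1 and t2 are attached directly to S.
tree = {
    "S": ["VP", "t1", "t2"],
    "VP": ["t0", "t3", "t4"],
}
position = {"t0": 0, "t1": 1, "t2": 2, "t3": 3, "t4": 4}

print(continuous(tree, position, "S"))
print(discontinuous(tree, position, "VP"))
```

Here S is continuous because it covers positions 0 to 4 without a gap, whereas VP is discontinuous because positions 1 and 2 lie between its leaves.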