Book: The TIGERSearch query language

III. The TIGERSearch query language

1. Introduction

In this chapter, the TIGER language will be introduced, the query language for the TIGERSearch query engine. Actually, the TIGER language is not only a query language, but a general decription language for syntax graphs, i.e. restricted directed acyclic graphs. Syntax graphs are close relatives to (syntax) trees, to feature structures and to dependency graphs. Syntax graphs are syntax trees with two additions: Edges may have labels, and crossing edges are permitted. On the other hand, syntax graphs differ from feature structures due to the following two properties: First, the leaf nodes are ordered (like in syntax trees). Second, two edges must not join in a common node. This means that the structure sharing mechanism of feature structures is ruled out. However, structure sharing can be expressed on the additional level of so-called secondary edges.

We call the TIGER language a 'description language' since we want to emphasize that, in principle, the language serves to represent corpus annotations as well as corpus queries. In the TIGER system, for corpus annotation only a proper sublanguage of the TIGER description language may be used which does not include any possibility for underspecification. Note that, for reasons of technical convenience, the corpus annotation has to be encoded in the XML-counterpart of the corpus annotation sublanguage (TIGER-XML, cf. chapter V). The TIGER language accomodates a wide range of common treebank formats.

Acknowledgements and references

Syntax graphs are a formalization of the so-called Negra data structure, which has been used to encode the first major treebank for German, the Negra corpus (cf. [SkutEtAl1997]).

The design and the formal definition of the TIGER language has been influenced by other work on formal languages:

unification-based formalisms

(cf. [DoerreDorna1993], [Doerre1996], [DoerreEtAl1996], [EmeleZajac1990], [Emele1997], [HoehfeldSmolka1988], [Schmid1999])

tree description languages

(cf. [BlackburnEtAl1993], [DuchierNiehren1999], [MarcusEtAl1993], [RogersEtAl1992])

corpus query languages

(cf. [Christ1994], [ChristEtAl1999])

We want to thank our colleagues in the TIGER and DEREKO projects (cf. TIGER project homepage and DEREKO project homepage) for their creative input and their patience in discussing the details of earlier versions of this chapter.

2. Language overview

Like other formal languages, the TIGER language has been defined recursively, i.e. by nested layers of formal constructs. The invidual nodes in syntax graphs can be described with feature constraints, i.e. Boolean expressions over feature-value pairs. For the sake of computational simplicity, we do not admit nested feature structures. This means that feature value descriptions must denote constants or more precisely, strings. For this reason, we rather talk about feature records instead of feature structures. Here are some sample queries for the TIGERSampler corpus which is part of the TIGERSearch distribution (cf. [Smith2002] for an introduction to the corpus annotation).

[word="Abend" & pos="NN"]

[word=/Ma.*/ & pos= ("NN"|"NE")]

There are two elementary node relations, precedence (.) and labelled dominance (>L), and a range of derived node relations such as unlabelled direct dominance (>) and general dominance (>*). Example:

[cat="NP"] > [pos="ART"]

The graph descriptions are made from (restricted) Boolean expressions over node relations. Feature values, feature constraints, and nodes can be refered to by logical variables (e.g. #n) which are bound existentially at the outmost formula level. Example:

(#n:[cat="NP"] > [pos="ART"]) & (#n >* [pos="ADJA"])

Queries can include template calls and type names. Template definitions help to modularize lengthy queries. Types are a means to structure the universe of feature-value pairs. Type definitions include the declaration of features with domain and range types and the definition of type hierarchies.

In the subsequent sections, we introduce the TIGER language in an informal manner, i.e. by the way of examples. All sample queries should work on the TIGERSampler corpus, which is distributed with the TIGERSearch software. A formal definition of the TIGER language is given in a separate document (cf. [KoenigLezius2001]).

In section 12, the reader will find a quick reference of all language elements.

3. Feature value descriptions

3.1 String

Strings are to be marked by quotation marks, e.g. "NN". To function as an actual TIGERSearch query, we must use the string as the value in a feature-value pair, surrounded by brackets:

[pos="NN"]

Characters which are TIGER reserved symbols must be preceded by the \-symbol (cf. following query). TIGER reserved symbols are listed in section 12.

[word="\."]

3.2 Types

Constant-denoting type symbols can serve as descriptions of feature values (cf. section 8). If a feature value symbol comes without quotes, it is interpreted as a type name:

[pos=proform]

3.3 Boolean expressions

Descriptions of feature values may be complex in the sense that type symbols and strings can be combined as Boolean expressions. The operators are ! (negation), & (conjunction), and | (disjunction). Here are uses of a Boolean expressions as a feature value descriptions:

[pos = ("NN" | "NE")]

[pos = ! ("NN" | "NE") ]

By the way, the query above can be written as:

[pos != ("NN" | "NE") ]

Please note: Boolean expressions for feature values which involve binary operators (conjunction and disjunction) must always be put into parentheses. See the example above where the outer parentheses cannot be omitted!

The operator precedence is defined as follows: !, &, |. This definition is illustrated by the following examples:

Example	Interpretation
`! "NN" & "NE"`	`(!"NN") & ("NE")`
`"NN" & "NE" \| "PREL"`	`(NN"&"NE") \| ("PREL")`

3.4 Variables

A Boolean feature value description can be refered to by a variable (in the example: #c):

[pos= #c:("NN"|"NE")]

A variable name has to start with a #-symbol. See subsection 7.2 for more meaningful applications of variables.

3.5 Regular expressions

One can also use regular expressions as feature value descriptions. Regular expressions are marked by enclosing slashes /. The syntax of regular expressions in TIGER is compatible with the syntax of regular expressions in Perl 5.003 (cf. [WallEtAl1996]). In our implementation, the following expression types are available:

Single characters

`a`	the character `a`
`.`	any character

Character classes

`[ace]`	any of the characters `a, c, e`
`[a-z]`	any lower-case letter
`[^a-f]`	any character except `a` to `f`

Special characters

`\s`	whitespace (space, tab, return)
`\d`	digit (0-9)

Alternatives

(abc|de)

the string abc, or de

Quantifiers

`(ab)*`	no or any number of `ab` (empty string, `ab`, `abab`, ...)
`(ab)+`	at least one `ab` (`ab`, `abab`, ...)
`(ab)?`	no or exactly one `ab`
`(ab){m,n}`	from `m` to `n` occurences of `ab`

Grouping

`ab+`	`a` followed by at least one `b` (`a`, `ab`, `abb`, ...)
`(ab)+`	at least one `ab` (`ab`, `abab`, ...)

Please note: In our notation /x/ means /^x$/ in the Perl notation.

The following example means 'find words which start with spiel':

[word = /spiel.*/]

With the following query, one can locate the words das and der, irrespective of capitalization of the first letter:

[lemma = /[dD](as|er)/]

The following query finds words which contain at least one uppercase letter or a figure at a non-initial position, i.e. hyphenated compounds, and potential abbreviations and product names:

[word = /.+([0-9A-Z])+.*/]

Please note: There is a difference between . and \. in the context of regular expressions. The following example denotes all strings starting with the prefix sagt, whereas the subsequent query means all strings with the prefix sagt followed by an arbitrary, possibly empty number of full stops:

[word = /sagt.*/]

[word = /sagt\.*/]

Please note: The TIGER language compiler performs only a rough check of the syntax of regular expressions. The fine-grained syntax check for regular expressions will be carried out when a query is evaluated. Therefore it may involve more effort for you to discover the syntax errors you have made in a regular expression.

3.6 Boolean expressions vs. types vs. regular expressions

Regular expressions for feature values should be reserved for 'open-ended' feature values such as the word values. For features with a restricted range of a values such as syntactic categories, the use of types and Boolean expressions is suggested in order to increase readability and processing efficiency. If possible, types should be used instead of Boolean expressions.

4. Feature constraints

4.1 Feature-value pairs

Simple feature constraints, i.e. feature-value pairs such as pos="NN" have been already introduced in section 3.

4.2 Boolean expressions

Complex feature constraints are Boolean expressions over feature-value pairs:

[word="das" & pos="ART"]

[word = /sp.*/ | pos = "VVFIN"]

[word="das" & !(pos="ART")]

Basically, the last query is equivalent to the following two queries. Note that this kind of equivalence is not generally valid in a typed system (cf. section 8)!

[word="das" & pos != "ART"]

[word="das" & pos = (!"ART")]

The operator precedence is defined as follows: !, &, |. This definition is illustrated by the following examples:

Example	Interpretation
`[! pos="NN" & word="der" ]`	`[(!pos="NN") & (word="der")]`
`[pos="NN" & word="Haus" \| pos="NE"]`	`[(pos="NN" & word="Haus") \| (pos="NE")]`

4.3 Variables

A feature constraint can be prefixed by a variable. For further discussion see subsection 7.2.

[#f: (word="das" & pos="ART")]

4.4 Unspecific feature constraint

The unspecific feature constraint is written as []. It denotes the whole universe of nodes.

5. Node descriptions

A node description, or simply a node, is a pair of a node identifier and a feature constraint. In queries, node identifiers must be node variables (in the example: #n1):

#n1:[word="das" & pos="ART"]

Please note: The use of constant node identifiers is reserved for corpus definitions only. It may not be used in corpus queries:

(*) "id1":[word="das" & pos="ART"]

In queries, the node variable can be completely omitted from a node description unless it is needed for coreference with some other node description. This means that a plain feature constraint is interpreted as a node description with an unspecified node identifier.

[word="das" & pos="ART"]

On the other hand, a node variable by itself, #n1, is interpreted as an (unspecified) node description, i.e. as #n1:[ ].

6. Node relations

6.1 Elementary node relations

Since syntax graphs are two-dimensional objects, we need two operators to express the spatial relations. We simply take the nomenclature from linguistics, and call the vertical dimension the dominance relation and the horizontal dimension the precedence relation.

Labelled direct dominance

The symbol > means direct dominance. It is further specified by an edge label, for example HD. This means that labelled direct dominance is expressed by e.g. >HD. The constraint that an edge which is labelled HD leads from node #n1 to node #n3 (or that #n3 is a direct HD-successor of #n1) is written as follows:

#n1 >HD #n3

As a more comprehensive example, the following labelled dominance constraints encode the vertical dimension of the tree in the graph presented below. Note that labelled dominance is a relation among nodes, not a function like it is the case for feature structures. This means that there may be more than one edge with the same label leading out of one mother node, cf. the NK-edges in the presented graph.

#n1 >SB #n2

#n1 >HD #n3

#n2 >NK #n4

#n2 >NK #n5

Figure: A simple syntax graph

On the basis of the directed edges, which are defined by the dominance relation, the nodes of a syntax graph in the corpus annotation are classified in the following manner:

Inner nodes (nonterminal nodes)

Nodes with outgoing edges, i.e. nodes with children, are called inner nodes or nonterminal nodes. In the presented figure only the nodes #n1 and #n2 are inner nodes.

Leaf nodes (terminal nodes)

Nodes without successors are named leaf nodes or terminal nodes. For example, the nodes #n4, #n5, and #n3 are the leaves of the tree in the presented figure.

Direct precedence of leaf nodes

A syntax graph is not only defined by constraining the vertical arrangement of its nodes, but also by the horizontal order of its leaf nodes, i.e. by the precedence relation among the leaves. We use the . symbol to represent direct precedence, since it serves as the concatenation operator in some programming languages. Now, the horizontal dimension of the tree in the presented figure is determined by the two precedence constraints:

#n4 . #n5

#n5 . #n3

6.2 Derived node relations

In queries, one wants to refer to indeterminate portions of a graph. Therefore, generalized notions of dominance and precedence ('wildcards') are necessary. Furthermore, certain convenient abbreviations should be introduced like a sibling operator.

Dominance

Dominance in general means that there is a path from one node to another one via a connected series of direct dominance relations. The distance of a dominance relation is the length of the path between the two nodes, i.e. the number of direct dominance relations to be traversed. The minimum distance is 1, i.e. in TIGERSearch, a node does not dominate itself.

We introduce the following generalizations of the labelled direct dominance relation >L:

`>`	unlabelled direct dominance
`>*`	dominance (minimum distance 1)
`>n`	dominance with distance `n` (`n>0`)
`>m,n`	dominance with distance between `m` and `n` (`0<m<n`)
`>@l`	leftmost terminal successor ('left corner')
`>@r`	rightmost terminal successor ('right corner')
`!>L`	negated labelled direct dominance
`!>`	negated unlabelled direct dominance
`!>*`	negated dominance
`!>n`	negated dominance, distance `n`
`!>m,n`	negated dominance, distance `m`...`n` (`0<m<n`)
`!>@l`	negated left corner
`!>@r`	negated right corner

For unlabelled edges, the label is considered to be unspecified, i.e. an unlabelled edge can match any edge.

The left corner resp. the right corner relation is reflexive, i.e. the leftmost resp. rightmost successor of a leaf node is the leaf node itself.

For the example tree in subsection 6.1, the following relations hold:

#n1 >* #n4

#n1 >2 #n4

#n1 >@l #n4

#n4 >@l #n4

Concerning the negated node relations, remember that variables are bound existentially at the outermost formula level. For example, the following query has to be read as 'find a syntax graph with an NP node and an NE node which are not connected by an NK-edge.'

[cat="NP"] !>NK [pos="NE"]

Precedence

So far, we have only talked about the precedence of leaf nodes. Whereas, for syntax trees, the precedence relation among its inner nodes is obvious, the situation is not clear at all for syntax graphs, which admit crossing edges. For example, the following two figures are different visualizations of the same syntax graph wrt. the arrangement of the nodes #n2 and #n3.

Figure: Precedence of inner nodes, first layout

Figure: Precedence of inner nodes, second layout

There are several alternatives to define the precedence of inner nodes. One could adopt the strict precedence definition of proper trees: A node #n1 precedes a node #n2 if all nodes dominated by #n1 precede all nodes dominated by #n2 (cf. [SteinerKallmeyer2002]), i.e. if the right corner of #n1 precedes the left corner of #n2. Based on this strict definition, there would be no precedence relation at all between the nodes #n2 and #n3 in the first syntax graph shown above.

We decided for a somewhat weaker definition, which admits precedence statements even for overlapping graphs: A node #n1 is said to precede another node #n2 if the left corner of #n1 precedes the left corner of #n2 (left corner precedence).

According to the left corner based definition, node #n2 precedes node #n3 in the first syntax graph shown above. The user may define his own version of precedence as a template (cf. section 9), e.g. with the help of the right corner operator >@r.

The distance n between two non-identical leaf nodes is the number of leaf nodes between these two nodes increased by 1. For example, the distance between two direct neighbour leaves is 1. The distance between two inner nodes #x and #y is the distance between their respective left corners. Two nodes which share the same left corner, do not precede each other. In particular, a node does not precede itself. For the precedence relation, there are the derived operators listed below:

`.*`	precedence (minimum distance 1)
`.n`	precedence with distance `n` (`n>0`)
`.m,n`	precedence with distance between `m` and `n` (`0>m>n`)
`!.*`	negated precedence
`!.n`	negated precedence, distance `n`
`!.m,n`	negated precedence, distance `m`...`n`

Sibling relation

Two nodes #n1 and #n2 are siblings if they are directly dominated by the same node #n0, i.e. if they have the same mother node #n0. We use the following writing conventions:

`$`	siblings
`$.*`	siblings with precedence
`!$`	negated siblings

We did not include the negated $.* relation since it seems rather counter-intuitive.

6.3 Secondary edges

Although, in the kernel of the TIGER language, a node must not be dominated by more than one node, we have added so-called secondary edges as an extra layer in order to allow for structure sharing in syntax graphs. A secondary edge defines an additional parent node for a given node.

`>~L`	labelled secondary edge
`>~`	secondary edge
`!>~L`	negated labelled secondary edge
`!>~`	negated secondary edge

7. Graph descriptions

7.1 Boolean expressions

Graph descriptions or graph constraints are (restricted) Boolean expressions over node relations and node descriptions. Currently, conjunction & and disjunction | are available as logical connectives. For example, with the help of the &-operator, the following node relations can be joined into a graph constraint which retrieves the tree shown below.

Figure: A simple syntax graph

(#n1 >SB #n2) &
(#n1 >HD #n3) &
(#n2 >NK #n4) &
(#n2 >NK #n5)

Parentheses can be omitted in the usual fashion:

#n1 >SB #n2 &
#n1 >HD #n3 &
#n2 >NK #n4 &
#n2 >NK #n5

The operator precedence is defined as follows: Relation, &, |. This definition is illustrated by the following examples:

Example	Interpretation
`#v > #w & #x`	`(#v > #w) & #x`
`#v & #w \| #x`	`(#v & #w) \| #x`

7.2 Use of variables

Variables for feature values

Variables for feature values are typically used to express agreement constraints. The following query looks for two adjacent nodes which are labelled with NN or NE.

[pos = #noun] . [pos = #noun:("NN" | "NE")]

Variables for feature constraints

With variables for feature constraints, we can search e.g. for sentences which contain the same preposition (the same word form!), twice:

[#f:(pos="APPR")] .* [#f]

Please note: There is a subtle difference if we used a feature value variable instead. If we only require the identity of the feature value, i.e. of the part-of-speech tag, we get all sentences which contain at least two prepositions (not necessarily the same word form!):

[pos = #v:"APPR"] .* [pos=#v]

Node variables

Node variables are necessary to express multiple node relations with respect to one node, e.g. to list the children of a node like in the example in subsection 7.1:

#np:[cat="NP"] &
#np > [pos="ADJA"] &
#np > [pos="NN"]

Node (in)equality

Two nodes variables #n1 and #n2 may match the same node in the corpus. If this causes problems, the inequality of two node variables can be enforced e.g. by adding the following subformula which requires the variables #n1 and #n2 to match distinct nodes (due to the irreflexivity of the precedence relation):

((#n1 .* #n2) | (#n2 .* #n1))

In the case your corpus contains unary transitions (nonterminal nodes with one single nonterminal daughter), you should use a weaker constraint for node inequality:

((#n1 .* #n2) | (#n2 .* #n1)) | ((#n1 >* #n2) | (#n2 >* #n1))

7.3 Graph predicates

In principle, by now there are all the operators to describe syntax graphs. For reasons of convenience, and to a certain extent for reasons of completeness, we have added so-called graph predicates, e.g. to designate the root of a graph.

Root predicate

The root of a graph (for a whole sentence) can be identified by the predicate root.

root(#n1)

Arity predicates

The following graph description describes all graphs which contain a certain node #n1 with at least two children #n2 and #n3:

(#n1 > #n2) & (#n1 > #n3)

However, one would like to state that there must be exactly two children. For this reason, we introduce a two-place operator arity in order to be able to restrict the number of children of a node #n1, e.g. to two children:

(#n1 > #n2) & (#n1 > #n3) & arity(#n1,2)

The arity predicate can also come with three arguments in order to indicate an interval of number of children, e.g. from two to four children:

(#n1 > #n2) & (#n1 > #n3) & arity(#n1,2,4)

Similarly, there is a tokenarity operator to constrain the number of leaves which are dominated by this node. For example, the following means that node #n1 must dominate exactly 5 terminal nodes. And the subsequent example states that node #n1 must have between 5 and 7 leaves.

tokenarity(#n1,5)

tokenarity(#n1,5,7)

Continuity predicates

It may be useful to state that the leaves which are dominated by a node must form a continuous string or not. For this purpose, the two unary operators continuous and discontinuous have been introduced:

continuous(#n1)

discontinuous(#n1)

8. Type definitions

If the user has to declare the symbols which will be used in a corpus and in queries, inconsistencies in the corpus annotation and in the corpus query can be detected much more easily. TIGER allows for the declaration of type hierarchies (cf. subsection 8.2) and features (cf. subsection 8.4).

Type hierarchies have to be linked to a corpus. The linking of type hierarchies is described in subsection 4.3, chapter VI.

8.1 Built-in types

There are the following built-in types in the TIGER language:

String for feature values not listable such as the values of the features word or lemma

UserDefConstant for user-defined ('listable') feature values

Constant comprises both String and UserDefConstant

NT for feature constraints of nonterminal nodes

T for feature constraints of terminal nodes

FREC stands for all feature records (feature constraints), i.e. NT and T

Node stands for node descriptions

Graph means all graphs

Top means anything (in the world of syntax graphs)

The hierarchy of built-in types is visualized in the following figure. Defining a type hierarchy is introduced in the following subsection.

Figure: Built-in types

The built-in types Top, Node and Graph are only required on the conceptual level. In the current implementation, they cannot be referred to in the description language.

8.2 Definition of a type hierarchy

In the current implementation, the user can only add type definitions for feature values, i.e. for the type UserDefConstant. Let us start with a sample type hierarchy for the pos feature:

<typedeclaration base="pos" version="1.0">
...
  <!-- base type: pos -->
  <type name="pos">
     <subtype nameref="openclass"/>
     <subtype nameref="closedclass"/>
     <subtype nameref="punctuation"/>
     <subtype nameref="misc"/>
  </type>
...
</typedeclaration>

Type declaration are encoded in an XML-based format: The type declarations for a feature constitute the contents of a typedeclaration root element. The base attribute defines the base type (or root type) of the type system.

Type definition rules are encoded by type elements. The value of the name attribute is the type t which is being defined. The (direct) subtypes are given by the nameref attribute values of the individual subtype child elements. An occurrence of a type t' in an subtype element is called a use of t'.

In this way, type hierarchies can be defined. The 'terminal nodes' of a type hierarchy (of constant denoting types) are constants:

  <!-- open word classes-->
  <type name="openclass">
     <subtype nameref="noun"/>
     <subtype nameref="verb"/>
     <subtype nameref="adjective"/>
     <!-- adverb -->
     <constant value="ADV" comment="schon, bald, doch"/>
  </type>

  <!-- noun -->
  <type name="noun">
     <!-- common noun -->
     <constant value="NN" comment="Tisch, Herr, [das] Reisen"/>
     <!-- proper noun -->
     <constant value="NE" comment="Hans, Hamburg, HSV"/>
  </type>

On the basis of these type definitions, some disjunctions of feature values can be written in a more concise manner, e.g. the following query can now be replaced by the subsequent query:

[pos = ("NE"|"NN")]

[pos = noun]

Restrictions for type definitions

A type must be defined at most once.

A type must be used exactly once in a type definition.

The first restriction means that neither recursion nor cross-classification (alternative definitions of the same type symbol) can be expressed. If you think you need cross-classification, template definitions (section 9) might be a way out. The second restriction enforces that every type must be hooked up in the type hierarchy. In total this means that type definitions define a tree-shaped type hierarchy. Undefined types may only occur as the leaves of a type hierarchy.

8.3 Type definition example

In this subsection, the part-of-speech type hierarchy used for the TIGER Corpus Sampler (based on a modified version version of the STTS tag set) is listed as an example. The file is also placed in the doc/examples/ subdirectory of your TIGERSearch installation.

<typedeclaration base="pos" version="1.0">

  <!-- base type: pos -->
  <type name="pos">
     <subtype nameref="openclass"/>
     <subtype nameref="closedclass"/>
     <subtype nameref="punctuation"/>
     <subtype nameref="misc"/>
  </type>

  <!-- open word classes-->
  <type name="openclass">
     <subtype nameref="noun"/>
     <subtype nameref="verb"/>
     <subtype nameref="adjective"/>
     <!-- adverb -->
     <constant value="ADV" comment="schon, bald, doch"/>
  </type>

  <!-- closed word classes-->
  <type name="closedclass">
     <!-- definite or indefinite article -->
     <constant value="ART" comment="der, die, das, ein, eine"/>
     <subtype nameref="proform"/>
     <!-- cardinal number -->
     <constant value="CARD" comment="zwei [Männer], [im Jahre] 1994"/>
     <subtype nameref="conjunction"/>
     <subtype nameref="adposition"/>
     <!-- interjection -->
     <constant value="ITJ" comment="mhm, ach, tja"/>
     <subtype nameref="particle"/>
  </type>

  <!-- noun -->
  <type name="noun">
     <!-- common noun -->
     <constant value="NN" comment="Tisch, Herr, [das] Reisen"/>
     <!-- proper noun -->
     <constant value="NE" comment="Hans, Hamburg, HSV"/>
  </type>

  <!-- verb -->
  <type name="verb">
     <subtype nameref="finite"/>
     <subtype nameref="nonfinite"/>
  </type>

  <!-- finite verbform -->
  <type name="finite">
     <!-- finite full verb -->
     <constant value="VVFIN" comment="du] gehst, [wir] kommen [an]"/>
     <!-- finite auxiliary verb -->
     <constant value="VAFIN" comment="[du] bist, [wir] werden"/>
     <!-- finite modal verb --> 
     <constant value="VMFIN" comment="dürfen"/>
  </type>

  <!-- non-finite verbform -->
  <type name="nonfinite">
     <subtype nameref="infinitive"/>
     <subtype nameref="participle"/>
     <subtype nameref="imperative"/>
     <!-- infinitive with zu, full verb -->
     <constant value="VVIZU" comment="anzukommen, loszulassen"/>
  </type>

  <!-- infinitive verbform -->
  <type name="infinitive">
     <!-- inifinitive, full verb -->
     <constant value="VVINF" comment="gehen, ankommen"/>
     <!-- infinitive, auxiliary verb -->
     <constant value="VAINF" comment="werden, sein"/>
     <!-- infinitive, modal verb -->
     <constant value="VMINF" comment="wollen"/>
  </type>

  <!-- past participle -->
  <type name="participle">
     <!-- past participle, full verb -->
     <constant value="VVPP" comment="gegangen, angekommen"/>
     <!-- past participle, auxiliary verb -->
     <constant value="VAPP" comment="gewesen"/>
     <!-- past participle, modal verb -->
     <constant value="VMPP" comment="gekonnt, [er hat gehen] können"/>
  </type>

  <!-- past participle -->
  <type name="participle">
     <!-- past participle, full verb -->
     <constant value="VVPP" comment="gegangen, angekommen"/>
     <!-- past participle, auxiliary verb -->
     <constant value="VAPP" comment="gewesen"/>
     <!-- past participle, modal verb -->
     <constant value="VMPP" comment="gekonnt, [er hat gehen] können"/>
  </type>

  <!-- imperative -->
  <type name="imperative">
     <!-- imperative, full verb -->
     <constant value="VVIMP" comment="komm [!]"/>
     <!-- imperative, auxiliary verb -->
     <constant value="VAIMP" comment="sei [ruhig !]"/>
  </type>

  <!-- adjective -->
  <type name="adjective">
     <!-- attributive adjective -->
     <constant value="ADJA" comment="[das] große [Haus]"/>
     <!-- adverbal or predicative adjective -->
     <constant value="ADJD" comment="[er fährt] schnell, [er ist] schnell"/>
  </type>

  <!-- proform -->
  <type name="proform">
  <subtype nameref="prodemon"/>
  <subtype nameref="proindef"/>
  <!-- irreflexive personal pronoun -->
  <constant value="PPER" comment="ich, er, ihm, mich, dir"/>
  <subtype nameref="propos"/>
  <subtype nameref="prorel"/>
  <!-- reflexive pronoun -->
  <constant value="PRF" comment="sich, einander, dich, mir"/>
  <subtype nameref="prointer"/>
  <!-- pronominal adverb, bug, should be "PAV" -->
  <constant value="PROAV" comment="dafür, dabei, deswegen, trotzdem"/>
  </type>

  <!-- demonstrative pronoun -->
  <type name="prodemon">
     <!-- substitutive demonstrative pronoun -->
     <constant value="PDS" comment="dieser, jener"/>
     <!-- attributive demonstrative pronoun -->
     <constant value="PDAT" comment="jener [Mensch]"/>
  </type>

  <!-- indefinite pronoun -->
  <type name="proindef">
     <!-- substitutive indefinite pronoun -->
     <constant value="PIS" comment="keiner, viele, man, niemand"/>
     <!-- attributive indefinite pronoun -->
     <constant value="PIAT" comment="kein [Mensch], irgendein [Glas]"/>
  </type>

  <!-- posessive pronoun -->
  <type name="propos">
     <!-- substitutive possesive pronoun -->
     <constant value="PPOSS" comment="meins, deiner"/> 
     <!-- attributive possessive pronoun -->
     <constant value="PPOSAT" comment="mein [Buch], deine [Mutter]"/>
  </type>

  <!-- relative pronoun -->
  <type name="prorel">
     <!-- substitutive relative pronoun -->
     <constant value="PRELS" comment="[der Hund,] der"/>
     <!-- attributive relative pronoun -->
     <constant value="PRELAT" comment="[der Mann ,] dessen [Hund]"/>
  </type>

  <!-- interrogative pronoun -->
  <type name="prointer">
     <!-- substitutive interrogative pronoun -->
     <constant value="PWS" comment="wer, was"/>
     <!-- attributive interrogative pronoun -->
     <constant value="PWAT" comment="welche [Farbe], wessen [Hut]"/>
     <!-- interrogative adverb or adverbial relative pronoun -->
     <constant value="PWAV" comment="warum, wo, wann, worüber, wobei"/>
  </type>

  <!-- conjunction -->
  <type name="conjunction">
  <subtype nameref="conjsub"/>
  <!-- coordinating conjunction -->
  <constant value="KON" comment="und, oder, aber"/>
  <!-- comparative conjunction -->
  <constant value="KOKOM" comment="als, wie"/>
  </type>

  <!-- subordinating conjunction -->
  <type name="conjsub">
     <!-- subordinating conjunction with zu-infinitive -->
     <constant value="KOUI" comment="um [zu leben], anstatt [zu fragen]"/>
     <!-- subordinating conjunction with sentence -->
     <constant value="KOUS" comment="weil, daß, damit, wenn, ob"/>
  </type>

  <!-- adposition -->
  <type name="adposition">
     <!-- preposition -->
     <constant value="APPR" comment="in [der Stadt], ohne [mich], von [jetzt an]"/>
     <!-- preposition + article -->
     <constant value="APPRART" comment="im [Haus], zur [Sache]"/>
     <!-- postposition -->
     <constant value="APPO" comment="[ihm] zufolge, [der Sache] wegen"/>
     <!-- circumposition, right part -->
     <constant value="APZR" comment="[von jetzt] an"/>
  </type>

  <!-- particle -->
  <type name="particle">
     <!-- "zu" before infinitive -->
     <constant value="PTKZU" comment="zu [gehen]"/>
     <!-- negating particle -->
     <constant value="PTKNEG" comment="nicht"/>
     <!-- separated verb particle -->
     <constant value="PTKVZ" comment="er kommt] an, [er fährt] rad"/>
     <!-- answer particle -->
     <constant value="PTKANT" comment="ja, nein, danke, bitte"/>
     <!-- particle with adjektive or adverb -->
     <constant value="PTKA" comment="am [schönsten], zu [schnell]"/>
  </type>

  <type name="punctuation">
     <!-- comma -->
     <constant value="$," comment=","/>
     <!-- final punctuation -->
     <constant value="$." comment=". ? ! ; :"/>
     <!-- other punctuation marks -->
     <constant value="$(" comment="- [,]()"/>
  </type>

  <type name="misc">
     <!-- foreign material -->
     <constant value="FM" comment="[Er hat das mit ``] A big fish ['' übersetzt]"/>
     <!-- nonword, with special characters -->
     <constant value="XY" comment="3:7, H2O, D2XW3"/>
     <!-- truncated element -->
     <constant value="TRUNC" comment="An- [und Abreise]"/>
     <!-- untagged -->
     <constant value="--"/>
     <!-- tagging of the token unknown -->
     <constant value="UNKNOWN"/>
  </type>

</typedeclaration>

8.4 Feature declarations

Feature declarations are part of the corpus definition (cf. section 11). A feature declaration states the following information:

It declares the symbol under consideration as a feature name.

It restricts the domain of the feature, i.e. it states for which type the feature may be used.

It restricts the feature values (i.e. the range).

In the current implementation, features can only be declared for the built-in types NT and T. Furthermore, one cannot use a type to restrict the range of a feature, but the possible values for a feature have to be enumerated. One reason is that we want to keep the number of dependencies between corpus definition and type definitions as small as possible. The other reason is that such simple feature declarations can be also be constructed automatically - for those corpora which do not come with feature declarations.

If the value enumeration is omitted from a feature declaration, the default range of that feature is String.

For example, for the type T of feature constraints for terminal nodes, the features word, lemma, and pos may be defined as follows.

<feature name="word" domain="T"/>

<feature name="lemma" domain="T"/>

<feature name="pos" domain="T">
  ...
  <value name="VAFIN">...comment...</value>
  <value name="VAIMP"/>
  <value name="VAINF"/>
  <value name="VAPP"/>
  <value name="VMFIN"/>
  <value name="VMINF"/>
  ...
</feature>

Please note: Each feature must be declared exactly once. The exclusion of multiple declarations for the same feature means that polymorphic overloading of a feature symbol is not permitted.

Please note: In the TIGER description language being a typed language, the following two queries are not equivalent!

[word="das" & !(pos="ART")]

[word="das" & pos != "ART"]

The reason is that !(pos="ART") equals !(T & pos="ART") due to the corresponding feature declaration. The latter formula again is equivalent to !(T) | (pos != "ART"), i.e. either the feature pos is not defined on a type or it is defined and its value is not equal to "ART".

9. Template definitions

When working with a syntactic representation formalism such as the TIGER language, certain pieces of code will be used over and over again. Furthermore, it is good programming and grammar writing style to define generalizations. Therefore, there is an urgent need for a means for defining abbreviations, templates (cf. [Shieber1986]), or macros. We chose TIGER templates to be non-recursive logical relations. This means that, although template calls may be embedded into template definitions, there must be neither direct nor indirect self-reference in template definitions. Hence, templates realize the 'database programming' part of a logic programming language such as Prolog (cf. [SterlingShapiro1986]).

A simple scheme PrepPhrase of a prepositional phrase which consists only of a preposition (APPR) and a proper noun (NE) can be defined as follows:

PrepPhrase(#n0) <-
   #n0:[cat="PP"] > #n1:[pos="APPR"]
   & #n0 > #n2:[pos="NE"]
   & #n1.#n2
   & arity(#n0,2) ;

The arrow <- marks a defining clause of a template definition. The <- operator is a two-place operator which takes a template head on its lefthand side and a template body on its righthand side. The template head consists of the template name (e.g. PrepPhrase) and a list of variables, the argument parameters of the template clause. The list of argument parameters must have at least one element, because otherwise no information flow between the template body and the calling environment would be possible, i.e. the template definition would be useless. An argument parameter #x must come with enough information (in the template body or in the query context) so that one can decide which of the built-in types Constant, FRec, Node, or Graph, the variable #x belongs to. The template body consists of a graph description. The end of a defining clause is marked by the ; symbol. A template definition can consist of several defining clauses (not yet implemented).

The PrepPhrase template can now be used or called in a query:

#n0 & PrepPhrase(#n0) ;

This query will return all graphs from the given corpus, which contain a subtree (rooted in #n0) with the desired shape.

Please note: The '#n0 &' part is important. TIGERSearch assumes that variables which occur only in the template call but not elsewhere, do not make sense. If you omit the '#n0 &' you will get an error message.

The template PrepPhrase can also be used in the body of some other template definition, e.g. in the definition of a pattern for VerbPhrase like 'geht nach Stuttgart'.

VerbPhrase(#n0) <-
   #n0:[cat="VP"] > #n1:[pos="VVFIN"]
   & #n0 > #n2
   & #n1.#n2
   & arity(#n0,2)
   & PrepPhrase(#n2) ;

Please note: The scope of a variable is the defining clause it occurs in. E.g. the variable #n2 in the definition of PrepPhrase is distinct from the variable #n2 in the definition of VerbPhrase. The TIGER language interpreter will resolve the call to PrepPhrase in the following manner. It will:

look up the defining clause of PrepPhrase.
replace all the variable names by new ones (in order to avoid unintended confusion of variables from the 'calling' clause and the 'called' clause).
identify the node id variable #n2 in VerbPhrase with the argument #n0 of PrepPhrase's defining clause. Identification of variables means that the constraints for both variables are joined into a single constraint by a logical conjunction.

After the resolution step, the VerbPhrase-clause has the following shape:

VerbPhrase(#n0) <-
   #n0:[cat="VP"] > #n1:[pos="VVFIN"]
   & #n0 > #n2
   & #n1.#n2
   & #n2:[cat="PP"] > #n21:[pos="APPR"]
   & #n2 > #n22:[pos="NE"]
   & #n21.#n22
   & arity(#n0,2)
   & arity(#n2,2) ;

If we want to find out which verbs go with which prepositions, this can be done by adding the appropriate arguments or parameters.

PrepPhrase(#n0,#prep) <-
   #n0:[cat="PP"] > #n1:[word=#prep & pos="APPR"]
   & #n0 > #n2:[pos="NE"]
   & #n1.#n2
   & arity(#n0,2) ;

VerbPhrase(#n0,#verblemma,#prep) <-
   #n0:[cat="VP"] > #n1:[pos="VVFIN" & lemma=#verblemma]
   & #n0 > #n2
   & #n1.#n2
   & arity(#n0,2)
   & PrepPhrase(#n2,#prep) ;

After a while, we might get interested in a more complex notion of prepositional phrases. For that reason, a template definition may consist of several, alternative defining clauses (not yet implemented):

PrepPhrase(#n0,#prep) <-
   #n0:[cat="PP"] > #n1:[word=#prep & pos="APPR"]
   & #n0 > #n2:[pos="NE"]
   & #n1.#n2
   & arity(#n0,2) ;

PrepPhrase(#n0,#prep) <-
   #n0:[cat="PP"] > #n1:[word=#prep & pos="APPR"]
   & #n0 > #n2:[pos="ART"]
   & #n0 > #n3:[pos="NN"]
   & #n1.#n2
   & #n2.#n3
   & arity(#n0,3) ;

The above definition is equivalent to a disjunction of the two defining clauses:

PrepPhrase(#n0,#prep) <-
   ( #n0:[cat="PP"] > #n1:[word=#prep & pos="APPR"]
     & #n0 > #n2:[pos="NE"]
     & #n1.#n2
     & arity(#n0,2) )
   |
   ( #n0:[cat="PP"] > #n1:[word=#prep & pos="APPR"]
     & #n0 > #n2:[pos="ART"]
     & #n0 > #n3:[pos="NN"]
     & #n1.#n2
     & #n2.#n3
     & arity(#n0,3) );

or, in a more packed manner:

// packed definition of PrepPhrase
PrepPhrase(#n0,#prep) <-
   #n0:[cat="PP"]
   & #n1:[word=#prep & pos="APPR"]
   & #n1.#n2
   & ( ( #n0 > #n1
         & #n0 > #n2:[pos="NE"]
         & arity(#n0,2) )
       |
       ( #n0 > #n1
         & #n0 > #n2:[pos="ART"]
         & #n0 > #n3:[pos="NN"]
         & #n2.#n3
         & arity(#n0,3) ) ) ;

The //-symbol marks the remainder of a line as a comment.

Types vs. templates

One might think of abbreviating a disjunction of feature values by a template name:

gen-dat(#case) <-
    [ case = #case: ( "gen" | "dat" ) ] ;

But we recommend to use a type definition instead, if possible, since this does not introduce an explicit disjunction into the structure.

gen-dat := "gen", "dat" ;

However, TIGERSearch admits only a single type hierarchy. Therefore, views which are orthogonal to this primary type hierarchy, must be expressed by templates.

10. Possible extensions of the TIGER language

In this section, we discuss some possible extensions of the TIGER language and the reasons why we have not adopted them (yet).

10.1 Variables for edge labels

There are already implicit ('don't care') variables for edge labels, i.e. the unlabelled dominance operator > and the dominance wildcard >* etc. If explicit variables for edge labels were available, the user could require co-reference of edge labels. To us it is unclear what kind of computational complexity will be introduced by this additional expressivity. And furthermore, there is a work-around: Since the number of edge labels (grammatical functions) tends to be rather small, it does not seem too inconvenient to define appropriate templates which enumerate pairs of edges with the same labels.

10.2 Negated graph descriptions

Negation on the level of graph descriptions seems somewhat difficult to be grasped conceptually. Since it would have to be pushed down to the level of node relations and feature values anyway, and there is negation available on these lower levels, this might give enough expressivity.

10.3 Universal quantifier

You might be tempted to state the following expression to find trees which are rooted in a VP, but which do not contain any NP:

[cat="VP"] >* [cat=(!"NP")]

But in the TIGER language, this query has the interpretation: 'Find a (sub-)graph rooted in a VP with at least one inner node #n which is not labelled NP':

exists #child: ([cat="VP"] >* #child:[cat=(!"NP")])

The desired constraint of finding a graph which does not contain any NP requires universal quantification and the implication operator (i.e. negation of graph descriptions):

exists #node: forall #child: ( (#node:[cat="VP"] >* #child) =>
                               (#child:[cat=(!"NP")])
                             )

The use of the universal quantifier causes computational overhead since universal quantification usually means that a possibly large number of copies of logical expressions have to be produced. For the sake of computational simplicity and tractability, the universal quantifier is (currently) not part of the TIGER language.

11. Appendix: Corpus definition

Actually, corpora for TIGERSearch have to be defined in TIGER-XML. However, TIGER-XML is a direct translation of the corpus definition sublanguage of the TIGER description language. The restrictions for corpora are as follows:

Declaration of required features

A corpus definition must include the feature declarations for the type NT of nonterminal constraints and the type T of terminal constraints.

Please note: If no explicit feature declarations are given, in most cases, they can be derived automatically from the corpus.

Single Root Node, Connectedness, No Structure Sharing

Every node in a graph except for one distinguished node (root node) has to be directly dominated by exactly one other node.

Please note: A multi-rooted graph (unconnected subgraphs) can be turned automatically into a graph with unique root node by adding an 'artificial' root node plus the edges which point to the individual subgraphs.

Please note: A structure sharing mechanism (multi-dominance) is provided by the additional layer of 'secondary edges'.

Acyclicity

No node may (indirectly) dominate itself.

Full Disambiguation

The graph constraints may only include the basic node relations labelled direct dominance (>L) and direct precedence (.).

The feature constraint for a node must be a conjunction of feature-value pairs either for all features which have been declared for NT or for those which have been defined for T.

For each node, its arity (i.e. the number of its children) has to be fixed.

The precedence relation on terminal nodes (leaf nodes) must be a total order.

The only logical connective on all structural levels is the conjunction operator &.

Neither types nor template calls nor regular expressions are admitted.

12. Appendix: Query Language Quick Reference

Reserved symbols

All the operators and built-in types which are listed below are reserved symbols. In particular, this means that the use of the following characters

 ! " # $ & ( ) * + , - . / : ; = > ? @ | { }

in edge labels, strings, constants, predicate or type names may cause problems. In these cases, a preceding operator \ will help. Example:

[word = "\,"]

Built-in types

The sample queries for the built-in types are meant to illustrate the context where a built-in type may occur. However, these queries are not really meaningful, since they are too general.

Symbol	Meaning	Sample query
`Constant`	constants	`[word = Constant]`
`String`	strings	`[word = String]`
`UserDefConstant`	user defined constants	`[pos=UserDefConstant]`
`FREC`	feature records	`[FREC]`
`NT`	feature records for nonterminals	`[NT]`
`T`	feature records for terminals	`[T]`

Constants

"

constant mark

[word = "Geld"]

Regular expressions for constants

`/`	regular expression mark	`[word = /Geld/]`
`.`	unspecified character	`[word = /sag./]`
`*`	unrestricted repetition	`[lemma = /spiel.*/]`
`+`	repetition with minimum 1	`[word = /.+[0-9A-Z]+.*/]`
`?`	optionality	`[word = /(Leben)s?/]`
`[ ]`	character set	`[word = /.+[0-9A-Z]+.*/]`
`^`	negated character sets	`[word = /[^0-9A-Z].*/]`
`( )`	grouping	`[word = /([lmnp][aeiou])+/]`
`\|`	disjunction	`[word = /[dD](as\|er)/]`
`\`	escape for reserved characters	`[word = /.\-./]`

Feature-value pairs

`=`	feature-value pair	`[pos = "NN"]`
`!=`	negated feature-value pair	`[pos != "NN"]`

Graph predicates

`root()`	root of a graph	`root(#n)`
`arity( , )` `arity( , , )`	arity of a node	`arity(#n,2)` `arity(#n,2,4)`
`tokenarity( , )` `tokenarity( , , )`	number of dominated leaves	`tokenarity(#n,5)` `tokenarity(#n,5,8)`
`continuous( )`	continuous leaves	`continuous(#n)`
`discontinuous( )`	discontinuous leaves	`discontinuous(#n)`

Dominance relation

`>L`	labelled direct dominance	`[cat="NP"] >NK [cat="NP"]` `[cat="NP"] >OA\-MOD [cat="NP"]`
`>`	direct dominance	`[cat="NP"] > [pos="NE"]`
`>*`	dominance	`[cat="NP"] >* [pos="NE"]`
`>$`	dominance, distance n	`[cat="NP"] >2 [pos="NE"]`
`>m,n`	dominance, distance m..n	`[cat="NP"] >2,3 [pos="NE"]`
`>@l`	left corner	`[cat="NP"] >@l [word="die"]`
`>@r`	right corner	`[cat="NP"] >@r [word="Jahr"]`
`$`	siblings	`[word="die"] $ [cat="NP"]`
`$.*`	siblings with precedence	`[word="etwas"] $.* [cat="NP"]`
`!>L`	neg. labelled direct dominance	`[cat="NP"] !>GR [cat="NP"]`
`!>`	neg. direct dominance	`[cat="NP"] !> [pos="NE"]`
`!>*`	neg. dominance	`[cat="NP"] !>* [pos="NE"]`
`!>n`	neg. dominance, distance n	`[cat="NP"] !>2 [pos="NE"]`
`!>m,n`	neg. dominance, distance m..n	`[cat="NP"] !>2,3 [pos="NE"]`
`!>@l`	neg. left corner	`[cat="NP"] !>@l [word="etwas"]`
`!>@r`	neg. right corner	`[cat="NP"] !>@r [word="etwas"]`
`!$`	neg. siblings	`[word="etwas"] !$ [cat="NP"]`
`>~L`	labelled secondary edge	`[cat="VP"] >~HD [cat="NP"]`
`>~`	secondary edge	`[cat="VP"] >~ [cat="NP"]`
`!>~L`	neg. labelled secondary edge	`[cat="VP"] !>~HD [cat="NP"]`
`!>~`	neg. secondary edge	`[cat="VP"] !>~ [cat="NP"]`

Precedence relation

`.`	direct precedence	`[word="die"] . [pos=noun]`
`.*`	precedence	`[word="die"] .* [pos="NN"]`
`.n`	precedence, distance n	`[word="die"] .2 [pos="NN"]`
`.m,n`	precedence, distance m..n	`[word="die"] .2,4 [pos="NN"]`
`!.`	neg. direct precedence	`[word="etwas"] !. [pos="NN"]`
`!.*`	neg. precedence	`[word="etwas"] !.* [pos="NN"]`
`!.n`	neg. precedence, distance n	`[word="etwas"] !.2 [pos="NN"]`
`!.m,n`	neg. precedence, distance m..n	`[word="etwas"] !.2,4 [pos="NN"]`

Variables

`[f=#v]`	variable for a feature value	`[pos=#x:("NN"\|"NE")] . [pos=#x]`
`[#x]`	variable for a feature description	`[#x:(pos="APPR")] .* [#x]`
`#n`	variable for a node identifier	`#n:[cat="NP"] & #n > [pos="ADJA"]) & #n > [pos="NN"]`

Boolean expressions

`( )`	bracketing	`[pos=(!("NN" \| "ART"))]`
`!`	negation (feature values)	`[pos=(!"NN")]`
	negation (feature constraints)	`[!(pos="NN")]`
`&`	conjunction (feature values)	`[pos=(!"NN" & !"NE")]`
	conjunction (feature constraints)	`[pos="NE" & word="Bonn"]`
	conjunction (graph descriptions)	`#n1>#n2 & #n3.#n2`
`\|`	disjunction (feature values)	`[pos=("NN" \| "NE")]`
	disjunction (feature constraints)	`[pos="NE" \| word="es"]`
	disjunction (graph descriptions)	`#n1>#n2 \| #n1>#n3`

Please note: Negation (!) must not have scope over variables.

Please note: A Boolean expression for a feature value must always be put into parentheses (e.g. pos=("NN"|"NE")).

Comments

//

line comment

[pos="NE"].[pos="NE"] // 2-word proper nouns