COREFERENCE
Massimo Poesio

1 CODING PURPOSE

In this chapter we present the coding schemes for coreference in dialogues supported in the MATE project.

1.1 What is 'coreference'?

The term 'coreference annotation' is used in an informal way in corpus work to indicate the annotation both of (generalized) anaphoric information and of information about reference proper. We use the term Anaphoric Relation to indicate the relation between two textual elements that denote the same object; the subsequent mention of an entity already introduced is often marked by means of a particular type of noun phrase (NP) called an anaphoric expression. Annotating corpora with information about such relations between elements of a text is useful both from a linguistic point of view and for applications such as information extraction. Typical examples of anaphoric expressions are pronouns such as he in the text:

John arrived. He looked tired.

In the preferred reading of this text, the pronoun he is a sort of 'abbreviated mention' of the individual 'John' which is denoted by the expression John. Following the terminology introduced by Sidner (1979), we will say that in the example just discussed the pronoun he co-specifies with the proper name John, and we will call John the antecedent of the pronoun. We will also say that two strings co-refer when they point to the same entity in the world. In the example above, the pronoun he and the proper name John both co-specify and co-refer; more generally, two expressions may co-specify without co-referring, as we will see below.

The notion of anaphora just introduced is often generalized to relations other than identity. So-called bridging references (Clark, 1977) are expressions that denote objects only related to the denotation of their antecedent by (shared) generic knowledge. An example is the indicators in:

John has bought a new car. The indicators use the latest laser technology.

We are able to interpret the description the indicators because we know that indicators are a part of cars, and a car was mentioned in the first sentence. Some of the relations that may hold between a bridging reference and its antecedent include part-whole as in the example just seen, and element-set (as in The Italian team didn't play well yesterday until the centre-forward was replaced in the 30th minute). A bridging reference may also refer to the object filling a role in an event, whether implicitly or explicitly introduced, e.g. A young woman was attacked earlier this evening on Town Moor. The assailant was chased by a member of the public, but managed to escape. (A detailed survey of alternative classifications of bridging descriptions proposed in the literature can be found in Vieira (1998).)

Another example of an expression which has an 'antecedent', but whose relation with the antecedent is not one of identity, is the expression one in Wendy prefers the red T-shirt to the yellow one. In this case, we are talking about two distinct T-shirts, of different colours. The expression one thus denotes something like an object type rather than an object token. Pronouns can enter into the same type of semantic relation with their antecedents, albeit more rarely: the classical examples are sentences such as The man who gave his paycheck to his wife was wiser than the man who gave it to his mistress, which give this kind of pronoun the name paycheck pronouns. Yet another example of an indirect relation between an anaphoric expression and its antecedent is that of bound pronouns (Partee, 1972). In Nobody likes to lose his job, the pronoun his does not 'refer' to the same object as its antecedent, the quantifier nobody (which does not refer to anything); this anaphoric expression is best seen as playing the role of a variable in first order logic.

So far, we have seen examples of anaphoric expressions which refer back to an object introduced in the text, or are somehow related to it (as in the case of bridging references). However, for some applications (especially multimedia ones) it is also useful to mark the cases in which an expression in the text refers to an object that has not been mentioned before, but is 'accessible' because it is part of the visible situation: these expressions are called deictics or indexicals. An example of an indexical expression in a real-life conversation is the salt in an utterance of the sentence pass me the salt, please in a context in which the salt hasn't been mentioned before. The Map Task corpus collected at HCRC contains a number of references to so-called 'landmarks' - objects on a map that the participants in a conversation look at while doing the task - which are also deictic in this sense, as are the references to objects on the screen in the GOCAD corpus from LORIA.
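
The relation types surveyed in this section can be summarised as a small data model. The following Python sketch is purely illustrative and is not part of the MATE scheme: the names RelationType, Markable and Link are assumptions introduced here, and the instances are simply the examples discussed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationType(Enum):
    """Illustrative labels for the relations discussed in this section."""
    IDENTITY = "identity"   # he -> John
    BRIDGING = "bridging"   # the indicators -> a new car (part-whole)
    TYPE_IDENTITY = "type"  # one -> the red T-shirt (same type, different token)
    BOUND = "bound"         # his -> nobody (variable-like pronoun)
    DEICTIC = "deictic"     # the salt (visible situation, no textual antecedent)


@dataclass(frozen=True)
class Markable:
    text: str


@dataclass(frozen=True)
class Link:
    anaphor: Markable
    relation: RelationType
    antecedent: Optional[Markable] = None  # None for deictic expressions

    def has_textual_antecedent(self) -> bool:
        return self.antecedent is not None


links = [
    Link(Markable("He"), RelationType.IDENTITY, Markable("John")),
    Link(Markable("The indicators"), RelationType.BRIDGING, Markable("a new car")),
    Link(Markable("one"), RelationType.TYPE_IDENTITY, Markable("the red T-shirt")),
    Link(Markable("his"), RelationType.BOUND, Markable("Nobody")),
    Link(Markable("the salt"), RelationType.DEICTIC),
]

# Select the links whose two expressions denote the same object.
identity_links = [link for link in links if link.relation is RelationType.IDENTITY]
```

Keeping the relation type as an explicit field is one possible design; it lets a scheme that marks only identity discard the other links without changing the representation.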

1.2 Issues to be considered in a dialogue coreference annotation scheme

Whether one is working on text or dialogue, the main problem in annotating anaphora is that almost every word in a text may be anaphoric (in the generalized sense discussed above) to some extent; hand-annotating all anaphoric expressions and all anaphoric relations is therefore impossible, except for small amounts of text. When designing a scheme for annotating anaphoric relations it is therefore necessary to identify the anaphoric expressions and relations most relevant to one's needs. Narrowing the scope of the scheme may also be necessary in order to achieve good agreement among subjects. This can be done by specifying syntactic constraints on markables, which are the text spans which enter into coreference relationships, by specifying constraints on the sorts of objects in the world for which coreference will be marked up, or by restricting the kinds of coreferential relations which will be considered (for instance, by deliberately failing to mark bridging references). In addition to the problem of what counts as a markable, annotating dialogue instead of text raises further difficulties: what to do about marking up coreferences which occur during disfluent speech, and what to do if the participants in a dialogue do not agree about what an expression refers to, especially if they know about different objects in the world.

1.2.1 Syntactic restrictions on markables

One way of limiting the annotation task is to use syntactic restrictions to determine a set of text spans which the coder will then consider as markables for coreference relations. For instance, many schemes restrict mark-up to NPs, whether these are determined by the human coder or automatically via a morphosyntactic tagger. Even so, the choice of NPs to serve as markables is not straightforward. For instance, it is quite common to ignore first and second person pronouns when marking. It is not clear whether to mark appositions in noun phrases separately (as in "one of the engines at Elmira, say engine E2" or "The Admiral's Head, that famous Portsmouth hostelry"). Similarly, noun phrases in post-copular position can be problematic. For example, it can be argued that in (1.1) a policeman is clearly expressing a predicate, and therefore need not be marked, whereas in (1.2) (to be imagined being said while looking at the sky at night), both the planet on the left and Venus are clearly referring expressions; it is not so clear how to handle the president of the board in (1.3).

(1.1) John is a policeman.
(1.2) The planet on the left is Venus.
(1.3) John is the president of the board.

It may be useful to mark empty elements even though they leave no trace in the words of the transcript, as with the omitted object of Mix (shown here as _) in Sieve the flour and baking powder into the fat. Mix _. Anaphoric references to events and other abstract objects may also stretch the notion that markables are traceable NPs.

An issue that has to be considered when thinking about other languages is that in languages such as Spanish and Italian, anaphoric expressions may be morphologically incorporated in the verb. In Italian, for example, certain clitics behave like verb suffixes:

(1.4) A: Adesso dammelo.  [Now give-to me-it]

Because the most common syntactic constructions for coreferential expressions differ in different languages, because people may wish to use different syntactic constraints for different purposes, and because, even with the same purposes, people use different automatic morphosyntactic taggers which make different syntactic distinctions, it is not sensible to impose any standard views on the correct syntactic constraints to use for pre-filtering possible markables. As a result, our approach is to allow the user of the MATE workbench to decide upon a syntactic constraint which suits their corpus and their automatic tagging, by expressing it in the MATE query language. Users who do not wish to impose syntactic constraints at all (for instance, those interested in determining what the distribution of syntactic constructions is for the different kinds of coreference relations) may specify a null constraint, in which case the human coder must scan the complete text looking for referring expressions to code.
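
To make the idea of a user-chosen syntactic constraint concrete, here is a minimal sketch in Python. It does not use the MATE query language, whose syntax is not reproduced here; a plain Python predicate stands in for the constraint, and the Span type, its category labels and the example tagging are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    text: str
    category: str    # hypothetical label from a morphosyntactic tagger, e.g. "NP"
    person: int = 3  # grammatical person; only meaningful for pronouns


Constraint = Callable[[Span], bool]


def null_constraint(span: Span) -> bool:
    """No pre-filtering: every span is left for the human coder to inspect."""
    return True


def np_without_first_second_person(span: Span) -> bool:
    """Keep NPs, but ignore first and second person pronouns."""
    return span.category == "NP" and span.person == 3


def candidate_markables(spans: List[Span], constraint: Constraint) -> List[Span]:
    """Return the spans that the coder will be asked to consider."""
    return [s for s in spans if constraint(s)]


# Hypothetical tagging of "I think John bought a new car".
spans = [
    Span("I", "NP", person=1),
    Span("think", "V"),
    Span("John", "NP"),
    Span("bought", "V"),
    Span("a new car", "NP"),
]

filtered = candidate_markables(spans, np_without_first_second_person)
unfiltered = candidate_markables(spans, null_constraint)
```

Passing the constraint as a function mirrors the point made above: the filtering step stays the same, while each project supplies the restriction that suits its language, purpose and tagger.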

1.2.2 Choosing an object type constraint on markables

As well as using syntactic constraints to cut down on the number of coreference annotations, it is also possible to specify restrictions on the kinds of objects in the world for which coreference is of interest. For instance, in the Map Task, researchers often want to know about coreference relations for map landmarks but not for anything else. As with syntactic constraints, reasonable object type constraints will depend on the material being marked. Therefore, again our approach is to allow the user of the MATE workbench to specify this constraint, either as a pre-determined list of objects or by giving a description of the objects of interest. In this latter case, it is of course impossible for the workbench itself to determine which text spans fit the constraint, and so this constraint forms part of the coding instructions for the human user to follow.

1.2.3 Restricting the coreference relations to be marked

Another way of limiting the coreference annotation task is to ask the coder only to mark some kinds of coreference relations. For instance, the very simplest coreference schemes, like MUCCS (Hirschman, 1997) and the scheme used in the Map Task, only specify a relationship when the two discourse entities being linked refer to the same object. One good reason for limiting coreference annotation exercises by restricting the set of relations to be marked is that for many of the most interesting relations, reliable annotation schemes have not yet been developed. The best reliability information to date comes from work by Poesio and Vieira (1997), which concentrated on marking definite descriptions in texts from the Wall Street Journal. Their results confirm Fraurud's (1990) impression that the only distinction that can be marked reliably is that between first mentions and subsequent mentions; bridging references proved remarkably difficult to classify reliably. Of course, for many purposes, and especially for linguistic research on the role of bridging, even unreliable coding may be valuable; however, for large-scale annotation exercises with a language engineering bent, a simpler set of relations may be more appropriate.

1.2.4 Deciding what to do about disfluencies

When annotating dialogues, new problems arise, one of which is what to do about hesitations and disfluencies (such as repetitions and repairs), which break up the syntax of an utterance and can occur in the same location as a referring expression. In (1.5) (from the TRAINS corpus; Gross et al., 1993), the noun phrase one of the engines at Elmira, say engine E2 is divided between several utterances, broken by pauses and other hesitations. In (1.6) (from Passonneau, 1996), the definite description the other kids is repaired into the kid.

(1.5)  9.6: I think what we should do
       9.7: is
       9.8: hook up
       9.9: uh one of the [2sec]
       9.10: engines
       9.11: uh
       9.12: at Elmira
       9.13: say engine E2

(1.6) and the g guy on the bike gives the other kids... gives the kid that returns his hat...

This can cause difficulties for syntactic constraints on markables unless the morphosyntactic tagging takes disfluency into account by splicing disfluent utterances into their perceived targets. What one chooses to do about disfluency is likely to depend on the expected use of the coreference tagging and what possibilities the morphosyntactic tags leave open. If the morphosyntactic tagging allows one to splice together target utterances, then one might choose to ignore disfluencies by constructing and marking on these targets. Alternatively, one might choose to ignore all possible markables within disfluent speech.
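
As an illustration of splicing, the sketch below rebuilds a target utterance from the fragments of example (1.5). It is deliberately naive and rests on assumptions made here: the function name splice_target, the inventory of filled pauses and the treatment of bracketed pause markers are not taken from any existing tagger, and a repair such as the one in (1.6) would need far more than token filtering.

```python
import re
from typing import List

FILLED_PAUSES = {"uh", "um"}            # assumed inventory of hesitation tokens
PAUSE_MARKER = re.compile(r"^\[.*\]$")  # timed pauses such as "[2sec]"


def splice_target(fragments: List[str]) -> str:
    """Join utterance fragments, dropping hesitations and timed pauses."""
    tokens = []
    for fragment in fragments:
        for token in fragment.split():
            if token.lower() in FILLED_PAUSES or PAUSE_MARKER.match(token):
                continue
            tokens.append(token)
    return " ".join(tokens)


# Utterances 9.6 to 9.13 of example (1.5), without their utterance numbers.
fragments = [
    "I think what we should do",
    "is",
    "hook up",
    "uh one of the [2sec]",
    "engines",
    "uh",
    "at Elmira",
    "say engine E2",
]

target = splice_target(fragments)
```

On the spliced target, the noun phrase one of the engines at Elmira say engine E2 is contiguous again, so a syntactic constraint on markables could apply to it as it would to fluent text.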

1.2.5 Multiple perspectives and misunderstandings

Another problem with annotating coreference in dialogues is that the participants do not always share the same perspective on the world or on the discourse. Sometimes different participants know about different objects in the world, leading to difficulties when one refers to an object unknown to the other. The Map Task makes this obvious by establishing differences between the participants' maps, but some knowledge differences occur in most real-world situations. Even where the universe of objects is completely shared, misunderstandings can arise because people are not always very careful in establishing joint references. As a result, different participants may believe that different coreference relations hold for the same markables. It is possible to allow the annotation of multiple perspectives within a dialogue, if one both allows multiple universes of objects, so that differences in world knowledge are clear, and allows the marking of coreferential links with the set of participants for which they hold. However, this does make annotation rather more complicated than it would be otherwise, and the annotation itself may not be particularly reliable, since making these distinctions requires a certain amount of mind-reading on the part of the coder. Another possibility is to specify that the coder is to annotate only the interpretation of a given noun phrase intended by the speaker. This still requires mind-reading, but less, since only one participant's mind must be read and since the speaker leaves the largest trace of what they think in the transcript.
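
One way to picture links marked with the set of participants for which they hold is the following Python sketch. It is an assumption-laden illustration rather than the MATE representation: the PerspectiveLink type, the participant labels G and F and the example links are invented here, and separate universes of objects are not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class PerspectiveLink:
    anaphor: str
    antecedent: str
    holders: FrozenSet[str]  # participants who believe this link holds


def links_for(participant: str, links: List[PerspectiveLink]) -> List[PerspectiveLink]:
    """Return the links that hold from one participant's point of view."""
    return [link for link in links if participant in link.holders]


def contested(links: List[PerspectiveLink],
              participants: FrozenSet[str]) -> List[PerspectiveLink]:
    """Return the links that are not shared by every participant."""
    return [link for link in links if link.holders != participants]


# Invented example: two participants, G and F, resolve "it" differently.
participants = frozenset({"G", "F"})
links = [
    PerspectiveLink("the bridge", "a bridge", frozenset({"G", "F"})),
    PerspectiveLink("it", "the bridge", frozenset({"G"})),
    PerspectiveLink("it", "the lake", frozenset({"F"})),
]

g_view = links_for("G", links)
disputed = contested(links, participants)
```

Under the simpler policy of annotating only the speaker's intended interpretation, the same structure would be used with a single holder per link, namely the speaker of the utterance containing the anaphor.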
 
 

1.3 Sources of Examples

A few examples in this document are made up, but most of them come from three main corpora:

In addition, we took several examples from Quirk and Greenbaum (1973), from Passonneau's manual (Passonneau, 1996) and from the BBC News web site. We indicate the source of the examples either by explicitly mentioning the source or by means of the symbols (BBC) for the BBC texts, (MF) for the Microfusées texts, (QG) for Quirk and Greenbaum, and (T) for the TRAINS texts.



2 EXISTING SCHEMES