MATE LEVEL MARKUP: COREFERENCE
Second draft, May 99
1 CODING PURPOSE
In this chapter we present the coding schemes for coreference in dialogues supported in the MATE project.
1.1 What is 'coreference'?
The term 'coreference annotation' is used informally in corpus work to cover both the annotation of (generalized) anaphoric information and the annotation of information about reference proper. We use the term Anaphoric Relation to indicate the relation between two textual elements that denote the same object; the subsequent mention of an entity already introduced is often marked by means of a particular type of noun phrase (NP) called an anaphoric expression. Annotating corpora with information about such relations between elements of a text is useful both from a linguistic point of view and for applications such as information extraction. Typical examples of anaphoric expressions are pronouns, such as he in the text
John arrived. He looked tired.
In the preferred reading of this text, the pronoun he is a sort of 'abbreviated mention' of the individual 'John', who is denoted by the expression John. Following the terminology introduced by Sidner (1979), we will say that in the example just discussed the pronoun he co-specifies with the proper name John, and we will call John the antecedent of the pronoun. We will also say that two strings co-refer when they point to the same entity in the world. In the example above, the pronoun he and the proper name John both co-specify and co-refer; more generally, two expressions may co-specify without co-referring, as we will see below.
The notion of anaphora just introduced is often generalized to relations other than identity. So-called bridging references (Clark, 1977) are expressions that denote objects only related to the denotation of their antecedent by (shared) generic knowledge. An example is the indicators in:
John has bought a new car. The indicators use the latest laser technology.
We are able to interpret the description the indicators because we know that indicators are a part of cars, and a car was mentioned in the first sentence. Some of the relations that may hold between a bridging reference and its antecedent include part-whole as in the example just seen, and element-set (as in The Italian team didn't play well yesterday until the centre-forward was replaced in the 30th minute). A bridging reference may also refer to the object filling a role in an event, whether implicitly or explicitly introduced, e.g. A young woman was attacked earlier this evening on Town Moor. The assailant was chased by a member of the public, but managed to escape. (A detailed survey of alternative classifications of bridging descriptions proposed in the literature can be found in Vieira (1998).)
Another example of an expression which has an 'antecedent', but whose relation with the antecedent is not one of identity, is the expression one in Wendy prefers the red T-shirt to the yellow one. In this case, we are talking about two distinct T-shirts, of different colours. The expression one thus denotes something like an object type rather than an object token. Pronouns can enter into the same type of semantic relation with their antecedents, albeit more rarely: the classical example is a sentence such as The man who gave his paycheck to his wife was wiser than the man who gave it to his mistress, which gives this kind of pronoun the name paycheck pronoun. Yet another example of an indirect relation between an anaphoric expression and its antecedent is provided by bound pronouns (Partee, 1972). In Nobody likes to lose his job, the pronoun his does not 'refer' to the same object as its antecedent, the quantifier nobody (which does not refer to anything); this anaphoric expression is best seen as playing the role of a variable in first order logic.
So far, we have seen examples of anaphoric expressions which refer back to an object introduced in the text, or are somehow related to it (as in the case of bridging references). However, for some applications (especially multimedia ones) it is also useful to mark the cases in which an expression in the text refers to an object that has not been mentioned before, but is 'accessible' because it is part of the visible situation: these expressions are called deictics or indexicals. An example of an indexical expression in a real-life conversation is the salt in an utterance of the sentence pass me the salt, please in a context in which the salt has not been mentioned before. The MapTask corpus collected at HCRC contains a number of references to so-called 'landmarks' - objects on a map that the participants in a conversation look at while doing the task - which are also deictic in this sense, as are the references to objects on the screen in the GOCAD corpus from LORIA.
1.2 Issues to be considered in a dialogue coreference annotation scheme
Whether one is working on text or dialogue, the main problem in annotating anaphora is that almost every word in a text may be anaphoric (in the generalized sense discussed above) to some extent; hand-annotating all anaphoric expressions and all anaphoric relations is therefore impossible, except for small amounts of text. When designing a scheme for annotating anaphoric relations it is thus necessary to identify the anaphoric expressions and relations most relevant to one's needs. Narrowing the scope of the scheme may also be necessary in order to achieve good agreement among coders. This can be done by specifying syntactic constraints on markables (the text spans which enter into coreference relationships), by specifying constraints on the sorts of objects in the world for which coreference will be marked up, or by restricting the kinds of coreferential relations which will be considered (for instance, by deliberately not marking bridging references). Beyond the problem of what counts as a markable, annotating dialogue rather than text raises further difficulties: what to do about coreferences which occur during disfluent speech, and what to do if the participants in a dialogue do not agree about what an expression refers to, especially if they know about different objects in the world.
1.2.1 Syntactic restrictions on markables
One way of limiting the annotation task is to use syntactic restrictions to determine a set of text spans which the coder will then consider as markables for coreference relations. For instance, many schemes restrict mark-up to NPs, whether these are determined by the human coder or automatically via a morphosyntactic tagger. Even so, the choice of NPs to serve as markables is not straightforward. For instance, it is quite common to ignore first and second person pronouns when marking. It is not clear whether to mark appositions in noun phrases separately (as in "one of the engines at Elmira, say engine E2" or "The Admiral's Head, that famous Portsmouth hostelry"). Similarly, noun phrases in post-copular position can be problematic. For example, it can be argued that in (1.1) a policeman is clearly expressing a predicate, and therefore need not be marked, whereas in (1.2) (to be imagined being said while looking at the sky at night), both the planet on the left and Venus are clearly referring expressions; it is less clear how to handle the president of the board in (1.3).
(1.1) John is a policeman.
(1.2) The planet on the left is Venus.
(1.3) John is the president of the board.
It may be useful to mark empty elements, such as the one indicated by the underscore in "Sieve the flour and baking powder into the fat. Mix _.", even though they leave no trace in the words of the transcript. Anaphoric references to events and other abstract objects may also stretch the notion that markables are traceable NPs.
An issue that has to be considered when thinking about other languages is that in languages such as Spanish and Italian, anaphoric expressions may be morphologically incorporated in the verb. In Italian, for example, certain clitics behave like verb suffixes:
(1.4) A: Adesso dammelo. [Now give-to me-it]
Because the most common syntactic constructions for coreferential expressions differ in different languages, because people may wish to use different syntactic constraints for different purposes, and because, even with the same purposes, people use different automatic morphosyntactic taggers which make different syntactic distinctions, it is not sensible to impose any standard views on the correct syntactic constraints to use for pre-filtering possible markables. As a result, our approach is to allow the user of the MATE workbench to decide upon a syntactic constraint which suits their corpus and their automatic tagging, by expressing it in the MATE query language. Users who do not wish to impose syntactic constraints at all (for instance, those interested in determining what the distribution of syntactic constructions is for the different kinds of coreference relations) may specify a null constraint, in which case the human coder must scan the complete text looking for referring expressions to code.
1.2.2 Choosing an object type constraint on markables
As well as using syntactic constraints to cut down on the number of coreference annotations, it is also possible to specify restrictions on the kinds of objects in the world for which coreference is of interest. For instance, in the Map Task, researchers often want to know about coreference relations for map landmarks but not for anything else. As with syntactic constraints, reasonable object type constraints will depend on the material being marked. Therefore, again our approach is to allow the user of the MATE workbench to specify this constraint, either as a pre-determined list of objects or by giving a description of the objects of interest. In the latter case, it is of course impossible for the workbench itself to determine which text spans fit the constraint, and so this constraint forms part of the coding instructions for the human user to follow.
1.2.3 Restricting the coreference relations to be marked
Another way of limiting the coreference annotation task is to ask the coder only to mark some kinds of coreference relations. For instance, the very simplest coreference schemes, like MUCCS (Hirschman, 1997) and the scheme used in the Map Task, only specify a relationship when the two discourse entities being linked refer to the same object. One good reason for limiting coreference annotation exercises by restricting the set of relations to be marked is that for many of the most interesting relations, reliable annotation schemes have not yet been developed. The best reliability information to date comes from work by Poesio and Vieira (1997), which concentrated on marking definite descriptions on texts from the Wall Street Journal. Their results confirm Fraurud's (1990) impression that the only distinction that can be marked reliably is that between first mentions and subsequent mentions; bridging references proved remarkably difficult to classify reliably. Of course, for many purposes, and especially for linguistic research on the role of bridging, even unreliable coding may be valuable; however, for large-scale annotation exercises with a language engineering bent, a simpler set of relations may be more appropriate.
1.2.4 Deciding what to do about disfluencies
When annotating dialogues, new problems arise, one of which is what to do about hesitations and disfluencies (such as repetitions and repairs), which break up the syntax of an utterance and can occur in the same location as a referring expression. In (1.5) (from the TRAINS corpus; Gross et al, 1993), the noun phrase one of the engines at Elmira, say engine E2 is divided between several utterances, broken by pauses and other hesitations. In (1.6) (from Passonneau, 1996), the definite description the other kids is repaired into the kid.
(1.5) 9.6: I think what we should do
9.7: is
9.8: hook up
9.9: uh one of the [2sec]
9.10: engines
9.11: uh
9.12: at Elmira
9.13: say engine E2
(1.6) and the g guy on the bike gives the other kids... gives the kid that returns his hat...
This can cause difficulties for syntactic constraints on markables unless the morphosyntactic tagging takes disfluency into account by splicing disfluent utterances into their perceived targets. What one chooses to do about disfluency is likely to depend on the expected use of the coreference tagging and what possibilities the morphosyntactic tags leave open. If the morphosyntactic tagging allows one to splice together target utterances, then one might choose to ignore disfluencies by constructing and marking on these targets. Alternatively, one might choose to ignore all possible markables within disfluent speech.
1.2.5 Multiple perspectives and misunderstandings
Another problem with annotating coreference in dialogues is that the participants do not always share the same perspective of the world or of the discourse. Sometimes different participants know about different objects in the world, leading to difficulties when one refers to an object unknown to the other. The Map Task makes this obvious by establishing differences between the participants' maps, but some knowledge differences occur in most real-world situations. Even where the universe of objects is completely shared, misunderstandings can arise because people are not always very careful in establishing joint references. As a result, different participants may believe that different coreference relations hold for the same markables. It is possible to allow the annotation of multiple perspectives within a dialogue, if one both allows multiple universes of objects, so that differences in world knowledge are clear, and allows the marking of coreferential links with the set of participants for which they hold. However, this does make annotation rather more complicated than it would be otherwise, and the annotation itself may not be particularly reliable, since making these distinctions requires a certain amount of mind-reading on the part of the coder. Another possibility is to specify that the coder is to annotate only the interpretation of a given noun phrase intended by the speaker. This still requires mind-reading, but less, since only one participant's mind must be read and since the speaker leaves the largest trace of what they think in the transcript.
1.3 Sources of Examples
A few examples in this document are made up, but most of them come from three main corpora:
In addition, we took several examples from (Quirk and Greenbaum, 1973), from Passonneau's manual (Passonneau, 1996) and from the BBC News web site. We indicate the source of the examples either by explicitly mentioning the source or by means of the symbols (BBC) for the BBC texts, (MF) for the Microfusées texts, (QG) for Quirk and Greenbaum, and (T) for the TRAINS texts.
2 EXISTING SCHEMES
We analyzed five existing schemes in preparation for this proposal. Although in general MATE chose to review only schemes which had been proven reliable, in the case of coreference, reliability tests were rare or informal enough that this constraint was somewhat relaxed. The five schemes reviewed were the MUCCS scheme developed for MUC-7 (Hirschman, 1997), the DRAMA scheme (Passonneau, 1996), the Lancaster University UCREL scheme (Fligelstone, 1992), the scheme developed by Bruneseaux and Romary (1997) and the MapTask annotation of landmarks. These schemes are discussed in MATE deliverable D1.1.
The MUCCS scheme is the best known and most widely used of the existing coreference schemes, the most modest in scope (it concentrates on identity relations between NPs) and the only one whose reliability has been systematically tested. However, this scheme was designed for text, so it does not provide instructions either for dealing with problems in dialogue such as disfluencies or misunderstandings, or for annotating references to the visual situation, which are common, e.g., in the MapTask corpus and in multimodal applications, and which we hypothesize can be reliably annotated. Also, its syntactic constraint on markables is designed only for English. The DRAMA scheme was designed for dialogues and therefore does include instructions for dealing with some difficult problems of markable identification in dialogues, but still relies on English-specific syntactic constraints in order to reduce the annotation task to something doable. DRAMA also includes instructions for dealing with bridging references - whose reliability, however, still has to be ascertained - but not for references to the visual situation. Finally, the Lancaster scheme was also designed for texts, and in certain ways is more ambitious than any of the other schemes discussed here in that it also contains instructions for annotating elliptical references. We are not aware of any study of the reliability of this scheme.
3 SELECTED SCHEMES
Given that most coreference work is currently done with schemes which are not particularly reliable, and that there is little general agreement on the names of relations to use, we have adopted a modular approach to coreference schemes by which users can construct a scheme which is appropriate for them. Because the semantics of anaphora and coreference is relatively well-understood, it is possible to extract from the schemes discussed above a fairly short list of options available to the designer of a scheme. (This is unlike the case of dialogue acts, where different schemes are very difficult to compare.) These considerations suggested a 'meta-scheme' approach to the problem of developing a scheme for the coreference level that could be useful for a variety of applications. What this means is that instead of proposing a single scheme, we identified a range of types of information about 'coreference' that the designer of a scheme may want to annotate among those specified in the coding schemes for coreference discussed above; we evaluated how reliable each type of annotation is likely to be; and we specified the markup language needed to pursue each option. The workbench will support the whole range of elements and attributes of the meta-scheme; the task of the designer of a scheme will be to identify the options of interest among those supported by the workbench, ignoring the rest. Defining a particular coreference scheme out of this range of options involves specifying syntactic and object type constraints, both of which can be empty, which are to be used for pre-filtering markables, plus specifying the coreference relations of interest, which will be assumed to be defined meaningfully for the human user and documented within the coding module.
In order to show that this approach is workable, in addition to providing a specification of this range of options we also showed how a useful set of schemes can be specified using the elements and attributes of the meta-scheme, so that the coreference community can use the MATE workbench to annotate according to their favorite scheme. These schemes include a 'basic' scheme that can be used to do the type of annotation that is done using MUCCS; a scheme that can be used to annotate references to the visual situation, as in the MapTask scheme and in the scheme developed by Bruneseaux and Romary; and a scheme to do the type of annotation which is possible in DRAMA, which involves an extended set of anaphoric relations. In addition, we included a discussion of the possible options when selecting markables, including instructions for annotating anaphoric constructs typical in Romance languages such as clitics, and for dealing with some typical dialogue phenomena. (This discussion does not cover all possible sorts of anaphoric relations and uses of deictics; it is only concerned with the cases in which the anaphoric expression is an NP.)
4 MARKUP ELEMENTS COMMON TO ALL OPTIONS
In this section we introduce the core of markup distinctions common to all options allowed by the meta-scheme.
4.1 Markup Declaration
The following elements are used in the coreference schemes. As in all other schemes we use a single element to mark both anaphoric expressions and the NPs that serve as antecedents; the main difference from the MUC-7 scheme and DRAMA is that, following Bruneseaux and Romary (1997) (who, in turn, followed the TEI specification), we separated out the annotation of co-specification from the annotation of discourse entities. We therefore use two main elements: <coref:de>, used to annotate the elements which enter into co-specification relations; and <coref:link>, used for expressing co-specification between discourse entities. This way of annotating relations has the advantage that a discourse entity can be related by links to more than one other discourse entity; this is important to allow a discourse entity to be related both to an antecedent introduced in the discourse and to an entity in the universe of discourse. In addition, we have elements for specifying objects in the visual situation that can serve as antecedents, and for marking text constituents that introduce elements which participate in anaphoric relations in an indirect way.
Attributes: ID, HREF.
Attributes: HREF (obligatory)
TYPE (obligatory; with values as specified under the schemes)
WHO-BELIEVES (optional; default value SHARED; other values to be set to the participants in the dialogue (below, G and F)).
Embedded elements: <coref:anchor>
Attributes: HREF
Attributes: ID (obligatory)
MODIFIES (optional; the only permitted value is COMMON, used when the universe extends the common universe)
Embedded elements: <coref:ue>
Attributes: ID
Attributes: ID
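How the WHO-BELIEVES attribute might be used to recover a single participant's view of the links can be sketched as follows. This is a minimal illustration only, not workbench functionality: the dictionary representation, the key who_believes, the function name links_for and the IDs are hypothetical, and we assume that a link whose value is SHARED (the default) holds for every participant.

```python
# Hypothetical in-memory view of <coref:link> elements; only HREF, TYPE and
# WHO-BELIEVES come from the declaration above, everything else is assumed.
LINKS = [
    {"href": "coref.xml#id(de_001)", "type": "ident"},                         # default: SHARED
    {"href": "coref.xml#id(de_002)", "type": "ident", "who_believes": "G"},
    {"href": "coref.xml#id(de_003)", "type": "ident", "who_believes": "F"},
]


def links_for(participant, links):
    """Return the links that hold from one participant's perspective.

    Assumption: a link holds for a participant if it is SHARED (the default
    when WHO-BELIEVES is absent) or is attributed to that participant.
    """
    return [
        link for link in links
        if link.get("who_believes", "SHARED") in ("SHARED", participant)
    ]


if __name__ == "__main__":
    for link in links_for("G", LINKS):
        print(link["href"])
```

Under these assumptions, the view for G contains the shared link and the link attributed to G, but not the one attributed to F.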
4.2 Description of Elements
4.2.1 Discourse Entities
Description
The assumption underlying most annotation schemes for coreference is that processing text involves building a discourse model containing discourse entities, and that anaphoric relations are relations between these discourse entities (Webber, 1978; Heim, 1982; Kamp, 1981). We use the <coref:de> tag to annotate the text spans that introduce a discourse entity - that is, that can be subsequently referred to by means of anaphoric expressions. These are commonly noun phrases. Not all noun phrases do this: for example, whereas
John likes Bill
introduces two discourse entities, as can be shown by the fact that a follow-up like
He is crazy
is ambiguous in that he can refer either to John or to Bill, the sentence
John is a policeman
which from a syntactic point of view also contains two NPs, nevertheless only introduces one discourse entity, as can be seen by the fact that in this case, the continuation He is crazy is not ambiguous. As a consequence, the NP a policeman would not get a <coref:de> tag; in other words, the textual elements given a <coref:de> tag are a subset of the range of NPs.
Data Source
The annotation for <coref:de> elements should be included in a file with pointers to a base file which has already been XML tagged with information about the structure of the conversation, ideally using TEI coding (http://etext.virginia.edu/TEI.html), suitably converted into XML. A typical dialogue marked up in TEI has a <teiHeader>, a <head>, and a <body> which is broken up into utterances (<u>), marked for speaker. Each pause is marked with a <pause> element. The <u> might be further segmented, for example into prosodic phrases, using the TEI <seg> tags. Gestures and mouse clicks may also be marked, as may notes made by the annotator or the initial transcriber, and more detailed information can be given about pause durations, type of transitions between speakers, and many other features. The French conversation in (4.1), for example (from the Microfusées corpus), might be marked up as in (4.2):
(4.1)
Formateur: Alors donc / vous avez / ici [au milieu de la table] / les
modèles des fusées volé /
[Le formateur dispose le petit paquet de dessins des
9 fusées.]
Mia: Oui
Formateur: Et vous allez essayer de vous mettre d'accord sur un
classement / hein classer les fusées qui ont bien volé
ou qui ont moins bien volé / [Le formateur montre avec les
mains un endroit (bien volé) puis un autre (moins bien
volé).]
Mia: Alors par exemple de celle qui a / le / qui a volé le plus loin / à à celle qui a volé moins loin(?)
Instructor: OK, then, here you have [in the middle of the table] the
models of the rockets. [The instructor puts down the little packet of
9 rocket designs.]
Mia: Yes
Instructor: And you are going to try to agree on a
classification... to classify the rockets which flew well or which
flew less well.. [The instructor points to one place (those which flew
well) then another (those which flew less well)]
Mia: So for example from the one which.. it.. which flew the
furthest... to the one which flew the least far?
(4.2)
<u id="u1" who="F">
<seg id="u1seg1"> Alors donc <pause dur="short" /> vous avez <pause dur="short" /> ici <note place="inline"> au milieu de la table </note> <pause dur="short" /> les modèles des fusées <pause dur="short" /> </seg>
<note place="outline" type="stage directions">Le formateur dispose le petit paquet de dessins des 9 fusées. </note>
</u>
<u id="u2" who="M" trans="pause">
<seg id="u2seg1"> Oui </seg>
</u>
<u id="u3" who="F">
<seg id="u3seg1"> Et vous allez essayer de vous mettre d'accord sur un classement <pause dur="short" /> </seg>
<seg id="u3seg2"> hein classer les fusées qui ont bien volé ou qui ont moins bien volé <pause dur="short" /> </seg>
<note place="outline" type="stage directions">Le formateur montre avec les mains un endroit (bien volé) puis un autre (moins bien volé). </note>
</u>
<u id="u4" who="M" trans="pause">
<seg id="u4seg1">Alors par exemple de celle qui a <pause dur="short" /> le <pause dur="short" /> qui a volé le plus loin <pause dur="short" /> à celle qui a volé moins loin (?) </seg>
</u>
<u id="u1" who="F">
<seg id="u1seg1">OK, then, <pause dur="short" /> you have
<pause dur="short" /> here <note place="inline"> in the middle
of the table </note> <pause dur="short" /> the models of the
rockets <pause dur="short" /> </seg>
<note place="outline" type="stage directions">The instructor puts down the little packet of 9 rocket designs. </note>
</u>
<u id="u2" who="M" trans="pause">
<seg id="u2seg1">Yes</seg>
</u>
<u id="u3" who="F">
<seg id="u3seg1"> And you are going to try to agree on a classification <pause dur="short" /> </seg>
<seg id="u3seg2">to classify the rockets which flew well or which flew less well <pause dur="short" /> </seg>
<note place="outline" type="stage directions">The instructor points to one place (those which flew well) then another (those which flew less well). </note>
</u>
<u id="u4" who="M" trans="pause">
<seg id="u4seg1">So for example from the one which <pause dur="short" /> it <pause dur="short" /> which flew the furthest <pause dur="short" /> to the one which flew the least far (?) </seg>
</u>
The details of the TEI mark-up may not suit all corpora, depending on the format in which the initial transcription has been presented. For example, in the TRAINS corpus each speaker turn is segmented into a number of different utterances, separated at prosodic phrase boundaries (4.3). This means that the <u> are much shorter than those in true TEI-conformant mark-up, and there is then no TEI tag suitable for grouping the utterances into turns. For the moment, we have adopted the procedure in this case of introducing <turn> tags for a whole turn, and using <u> for each utterance or prosodic phrase:
(4.3)
44.1 S: +okay+
44.2 : okay
44.3 : lemme run /
44.4 : lemme make sure I got all this
44.5 : okay
44.6 : you wanna send E2
44.7 : you wanna link
44.8 : uh
44.9 : the boxcar at Elmira to E2
44.10 : and send that to Corning
45.1 M: yeah
46.1 S: and have it load oranges
47.1 M: right
48.1 S: okay
(4.4)
<turn id="t44" who="S">
<u id="u44.1"> +okay+</u>
<u id="u44.2"> okay</u>
<u id="u44.3"> lemme run /</u>
<u id="u44.4"> lemme make sure I got all this</u>
<u id="u44.5"> okay</u>
<u id="u44.6"> you wanna send E2 </u>
<u id="u44.7"> you wanna link</u>
<u id="u44.8"> uh</u>
<u id="u44.9"> the boxcar at Elmira to E2 </u>
<u id="u44.10"> and send that to Corning </u> </turn>
<turn id="t45" who="M">
<u id="u45.1"> yeah</u></turn>
<turn id="t46" who="S">
<u id="u46.1"> and have it load oranges </u>
</turn>
<turn id="t47" who="M">
<u id="u47.1"> right</u> </turn>
<turn id="t48" who="S">
<u id="u48.1"> okay </u></turn>
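The conversion from a transcript in the style of (4.3) to the <turn>/<u> markup of (4.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the MATE workbench: the function name transcript_to_turns and the regular expression are hypothetical, and we assume that every utterance line has the form "turn.utterance speaker: words", with the speaker given only on the first line of each turn.

```python
import re
from xml.sax.saxutils import escape

# Assumed line format: "44.1 S: +okay+" or "44.2 : okay"
# (turn number, utterance number, optional speaker, colon, words).
LINE = re.compile(r"^(\d+)\.(\d+)\s*(\w*)\s*:\s*(.*)$")


def transcript_to_turns(lines):
    """Wrap each utterance in <u> and group the utterances of a turn in <turn>."""
    out = []
    current_turn = None
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an utterance line
        turn, utt, speaker, words = match.groups()
        if turn != current_turn:
            if current_turn is not None:
                out.append("</turn>")  # close the previous turn
            out.append(f'<turn id="t{turn}" who="{speaker}">')
            current_turn = turn
        out.append(f'  <u id="u{turn}.{utt}">{escape(words.strip())}</u>')
    if current_turn is not None:
        out.append("</turn>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = ["44.1 S: +okay+", "44.2 : okay", "45.1 M: yeah"]
    print(transcript_to_turns(sample))
```

The sketch takes the speaker from the first utterance line of each turn, which is where (4.3) records it; a transcript that marks speakers differently would need a different pattern.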
If one wishes to impose syntactic restrictions on potential markables - which is a good idea for annotation exercises of any complexity - then this basic level must be further annotated with something which allows that constraint to be expressed - word tags, or full syntactic elements, or morpho-syntax tags as defined in the MATE Morpho-syntax scheme (Pirrelli and Soria, 1999). Since different schemes make different choices, the exact data source requirements are left to the individual schemes.
Assignment
The only attributes of <coref:de> that have to be set are id and href, both of which are automatically computed by the MATE workbench, either by making <coref:de> elements match the output of some MATE query on morphosyntactic tagging or by computation from text selected in the coding interface by the human user.
Example
Assuming that chunks with nominal governors are chosen as markables and that the sentence
(4.5) John likes Bill
would get annotated with chunks as follows:
(4.6) ch.xml
<ch id="ch_001" type="N">
<potgov id="p_001"> John</potgov></ch>
<ch id="ch_002" type="V">
<potgov id="p_002"> likes </potgov></ch>
<ch id="ch_003" type="N">
<potgov id="p_003"> Bill </potgov></ch>
then the following discourse entities would be annotated:
(4.7) coref.xml
<coref:de id="de_001" href="ch.xml#id(ch_001)"/>
<coref:de id="de_002" href="ch.xml#id(ch_003)"/>
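The step from the chunk file in (4.6) to the discourse entities in (4.7) can be sketched as follows. This is only an illustration of the idea of deriving <coref:de> elements from a syntactic constraint; it does not use the MATE query language, and the function name discourse_entities, the wrapping <chunks> root element and the sequential numbering of the de IDs are our assumptions.

```python
import xml.etree.ElementTree as ET

# Chunk-level markup as in (4.6), wrapped in a root element (an assumption)
# so that it parses as a single XML document.
CH_XML = """<chunks>
<ch id="ch_001" type="N"><potgov id="p_001">John</potgov></ch>
<ch id="ch_002" type="V"><potgov id="p_002">likes</potgov></ch>
<ch id="ch_003" type="N"><potgov id="p_003">Bill</potgov></ch>
</chunks>"""


def discourse_entities(ch_xml, base_file="ch.xml"):
    """Return one <coref:de> element (as a string) per nominal chunk.

    The syntactic constraint assumed here is simply type="N"; the href
    points back at the chunk in the base file, as in (4.7).
    """
    root = ET.fromstring(ch_xml)
    nominal = [ch for ch in root.iter("ch") if ch.get("type") == "N"]
    return [
        f'<coref:de id="de_{n:03d}" href="{base_file}#id({ch.get("id")})"/>'
        for n, ch in enumerate(nominal, start=1)
    ]


if __name__ == "__main__":
    for de in discourse_entities(CH_XML):
        print(de)
```

Note that the verbal chunk ch_002 is filtered out, so the second discourse entity points at ch_003.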
Important Note: Since the underlying XML representation is meant to be transparent to the annotator using the MATE tools, in the examples below we have simplified the notation considerably so as to make it easier for non-XML experts to understand the annotation; this would also make it clearer that the meta-scheme does not crucially depend on a particular type of basic level markup. First of all, we give examples in plain text, abstracting away from the chunking level, except in a few cases when this is necessary. Second, instead of representing the markup by means of href pointers as in (4.7), we will adopt a more conventional SGML-style format with tags wrapped around the parts of the text to be annotated with a <coref:de> element, so as to make it clearer to the annotator which part of the text to highlight and to mark; the representation in (4.7) will be automatically constructed by the tool and the annotator need not be aware of it. In our examples, we will generally use the following representation, rather than the format in (4.7):
(4.8) <coref:de>John</coref:de> likes <coref:de>Bill</coref:de>
Coding Procedure
Left to the individual schemes.
Markup Table
| Element | Attributes | Content |
| de | ID, HREF | none |
4.2.2 Link and Anchor Entities
Description
<coref:link> elements are used to mark anaphoric relations between discourse entities, the most basic of which is the identity relation. This relation obtains between two phrases in a text when they denote the same object in the world; the phrases used to refer to this object can be the same, like 'la surface... la surface' in (4.9), 'orange juice... orange juice' in (4.10), 'les ailerons... les ailerons' in (4.11) or different, as is seen with 'the engine E3... it... it' in (4.12), or 'ces deux fusées... elles' in (4.13). As these last two examples suggest, it is very common for a pronoun to be used to refer to a discourse entity previously referred to by a full noun phrase.
(4.9)
S: Créer la surface.
W: Opération effectuée
S: Modéliser la surface
W: Quel nom voulez-vous donner à la surface ?
S: Create the surface
W: Done
S: Model the surface
W: What name do you want to give to the surface ? (MF)
(4.10)
When do we have orange juice at Elmira?
We have orange juice at Elmira at 6 a.m. (T)
(4.11)
197 F: mmh / Donc qu'est ce que vous allez garder en fait (?) + /
198 M: |la longueur du tube et les ailerons |
199 D:| les ailerons |
200 F: Donc les ailerons vous m'avez dit.
197 F: mmm / Well, what are you going to keep, then ? /
198 M: the length of the tube and the wings |
199 D: | the wings |
200 F: well, the wings, you said (MF)
(4.12) we're gonna take the engine E3 and shove it over to Corning, hook it up to the tanker car... (T)
(4.13)
193 F: Donc qu'est ce qui / qu'est ce qui serait commun à ces deux fusées. Ces deux fusées ont /
194 D: c'est qu'elles ont / elles ont la même...
193 F: What would it be that these two rockets have in common? These two rockets have /
194 D: it's that they have / they have the same... (MF)
In this section we only discuss links describing identity relations, but nothing prevents an annotator from using a wider range of relations, as is done in the DRAMA scheme; some suggestions concerning possible relations are given in Section 8.
Data Source
The <coref:link> and <coref:anchor> elements point to <coref:de> elements.
Segmentation/Selection
Not applicable (the information provided by <coref:link> elements comes entirely from their attributes).
Assignment
The HREF attribute of a link element refers to the ID of the current discourse entity, and the HREF attribute of each anchor element embedded in it refers to the ID of an antecedent, which can be either a <coref:de> element, a <coref:ue> element, or a <coref:seg> element (see below). For the moment, we assume that the antecedent denotes the same object as the <coref:de> element, and the ident relation is used. We assume in the rest of this document that the annotation is contained within a file 'coref.xml' to which the href attributes point.
Coreference chains: It is often the case that more than two discourse entities refer to the same object; in this case, a coreference chain is formed. Because the identity relation is transitive, if A is ident with B and B is ident with C, then A is ident with C; so it doesn't matter which item in a coreference chain is chosen as antecedent for a new phrase. This can be tracked through the markup.
Furthermore, since the identity relation is symmetric, it doesn't matter which <coref:de> element is chosen as 'current element' and which one as 'anchor'. It is often less confusing, however, to adopt the convention that the <coref:link> element should point to the latest discourse entity, whereas the <coref:anchor> element should point to the antecedent.
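The chain tracking described above can be sketched with a union-find structure: because ident is transitive and symmetric, each link simply merges two sets, whichever element was chosen as anchor. The function name coreference_chains and the representation of links as (current, anchor) pairs are assumptions made for illustration; they are not part of the MATE workbench.

```python
from collections import defaultdict


def coreference_chains(ident_links):
    """Group IDs into chains from (current_id, anchor_id) ident pairs.

    Illustrative sketch only: the pair representation is an assumption,
    not the format used by the MATE tools.
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for current, anchor in ident_links:
        # Symmetry: the order of the two IDs does not affect the result.
        parent[find(current)] = find(anchor)

    chains = defaultdict(set)
    for item in list(parent):
        chains[find(item)].add(item)
    return sorted(sorted(chain) for chain in chains.values())


# Links from the engine E3 example: de_08 -> de_07 and de_09 -> de_08.
links = [("de_08", "de_07"), ("de_09", "de_08")]
print(coreference_chains(links))
```

Swapping the two IDs in any pair, or anchoring de_09 to de_07 instead of de_08, yields the same single chain, which is the practical consequence of symmetry and transitivity.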
Participants interpret anaphoric expressions differently: It is also possible to observe that at a certain point in a dialogue the conversational participants had differences of opinion about coreferential links. For this reason, links can contain specifications of which agent or set of agents believes them to hold, via the optional WHO-BELIEVES attribute. The default value for this attribute is SHARED.
Example
We use the <coref:link> and <coref:anchor> elements to mark anaphoric relations, as follows. When two noun phrases marked as <coref:de> elements co-specify, a <coref:link> element is added. The href attribute of this element points to the anaphoric expression, and the element contains at least one <coref:anchor> element specifying the antecedent (by means of a second href pointer). The type of relation that holds between the two discourse entities is specified by the type attribute of the <coref:link> element; its possible values depend on the exact scheme implemented. (As we will see below, specifying anaphoric relations by means of elements embedded into a <coref:link> element allows the annotator to mark ambiguities of co-specification.) Here are some example annotations.
(4.15)
coref.xml
When do we have <coref:de ID="de_01"> orange juice </coref:de> at Elmira?
We have <coref:de ID="de_02"> orange juice </coref:de> at Elmira at 6 a.m. (T)
<coref:link href="coref.xml#id(de_02)" type="ident" >
<coref:anchor href="coref.xml#id(de_01)" />
</coref:link>
(4.16)
coref.xml:
197 F: mmh / Donc qu'est ce que vous allez garder en fait (?) + /
198 M: |la longueur du tube et <coref:de ID="de_98"> les ailerons </coref:de>
199 D:<coref:de ID="de_99"> les ailerons </coref:de>
200 F: Donc <coref:de ID="de_100"> les ailerons </coref:de> vous m'avez dit.
<coref:link href="coref.xml#id(de_99)" type="ident" >
<coref:anchor href="coref.xml#id(de_98)" />
</coref:link>
<coref:link href="coref.xml#id(de_100)" type="ident" >
<coref:anchor href="coref.xml#id(de_99)" />
</coref:link>
(4.17)
we're gonna take <coref:de ID="de_07"> the engine E3 </coref:de> and shove <coref:de ID="de_08"> it </coref:de> over to Corning, hook <coref:de ID="de_09"> it </coref:de> up to the tanker car...
<coref:link href="coref.xml#id(de_08)" type="ident" >
<coref:anchor href="coref.xml#id(de_07)" />
</coref:link>
<coref:link href="coref.xml#id(de_09)" type="ident" >
<coref:anchor href="coref.xml#id(de_08)" />
</coref:link>
Ambiguity: More than one <coref:anchor> element may be embedded in a <coref:link> element in order to annotate ambiguity. If more than one entity appears to be an equally likely antecedent for an anaphoric expression, each possibility can be marked by means of a separate <coref:anchor> element. In the following example, the pronoun it in 15.16 could refer equally well to engine E3 or to the tanker car. If the annotator wishes to annotate both antecedents, as in DRAMA or in the Lancaster scheme, this can be done as shown below.
coref.xml:
15.12 : we're gonna take <coref:de ID="de_15">the engine E3</coref:de>
15.13 : and shove <coref:de ID="de_16"> it </coref:de> over to Corning
15.14 : hook <coref:de ID="de_17">it</coref:de> up to <coref:de ID="de_18">the tanker car</coref:de>
15.15 : _and_
15.16 : and send <coref:de ID="de_19">it</coref:de> back to Elmira
<coref:link href="coref.xml#id(de_16)" type="ident">
<coref:anchor href="coref.xml#id(de_15)"/>
</coref:link>
<coref:link href="coref.xml#id(de_17)" type="ident">
<coref:anchor href="coref.xml#id(de_16)"/>
</coref:link>
<coref:link href="coref.xml#id(de_19)" type="ident">
<coref:anchor href="coref.xml#id(de_17)"/>
<coref:anchor href="coref.xml#id(de_18)"/>
</coref:link>
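As a rough illustration of how ambiguous links like the last one could be located automatically, the sketch below parses a handful of link elements and reports those with more than one anchor. The wrapping <links> root element and the namespace URI bound to the coref prefix are hypothetical, introduced only so that the fragment parses as standalone XML; the function names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the source does not specify one for "coref".
COREF_NS = "urn:example:mate-coref"

# The <links> wrapper is an assumption so that the fragment is well-formed.
SAMPLE = f"""<links xmlns:coref="{COREF_NS}">
  <coref:link href="coref.xml#id(de_17)" type="ident">
    <coref:anchor href="coref.xml#id(de_16)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_19)" type="ident">
    <coref:anchor href="coref.xml#id(de_17)"/>
    <coref:anchor href="coref.xml#id(de_18)"/>
  </coref:link>
</links>"""


def target_id(href):
    """Extract 'de_19' from a pointer such as 'coref.xml#id(de_19)'."""
    return href.split("#id(", 1)[1].rstrip(")")


def ambiguous_links(xml_text):
    """Return {anaphor_id: [antecedent_ids]} for links with several anchors."""
    root = ET.fromstring(xml_text)
    result = {}
    for link in root.iter(f"{{{COREF_NS}}}link"):
        anchors = [target_id(a.get("href"))
                   for a in link.findall(f"{{{COREF_NS}}}anchor")]
        if len(anchors) > 1:
            result[target_id(link.get("href"))] = anchors
    return result


print(ambiguous_links(SAMPLE))
```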
Coding Procedure
Left to the individual schemes.
Markup Table
| Element | Attributes | Content |
| coref:link | HREF, WHO-BELIEVES, TYPE | one or more <coref:anchor>s |
| coref:anchor | HREF | none |
4.2.3 Universe and UE Entities
In face-to-face or human-machine dialogue, participants may make reference to items visible to them at the time of speaking. A simple example of this is Pass the salt, please, where salt may not have been previously mentioned in the conversation, and thus does not corefer with any other <coref:de>, but does refer to an entity which is in the visible situation. Tracking these references is important for multimodal systems (Bruneseaux and Romary, 1997), and they have been annotated reliably in the MapTask. This tracking requires two new elements: a <coref:universe> element (as in the Bruneseaux and Romary scheme) used to specify a 'universe of discourse', that is, a set of objects, each specified by a <coref:ue> element.
The <coref:universe> element may also be used to specify references to items in the non-visible 'universe' of shared knowledge which allows hearers to correctly assign reference to items such as the Eiffel Tower - the so-called `larger-situation' (Hawkins, 1978) or `hearer-old' (Prince, 1981) references; however, annotators should keep in mind that it is often difficult to do such categorizations reliably, as found out by Fraurud (1990) and Poesio and Vieira (1998).
Description
In order to mark up reference to items in the visual situation, the items in the visual situation are listed as universe entities (<coref:ue>), embedded within a <coref:universe> element. Each <coref:ue> element has an ID, as <coref:de> elements do, so that a relation of identity between a noun phrase and an object in the visual situation can be encoded by an ident link between a <coref:de> and a <coref:ue>, just like identity between two <coref:de> elements.
Where feasible, it is suggested that all objects in the visual situation be included in a single <coref:universe> element. In cases like the MapTask dialogues, where the participants in the conversation have two different maps, it is suggested that three universes be created: one with ID common containing all objects shared between the visual situations, and one universe for each conversational participant containing the elements known only to that participant, with the attribute modifies="common". This will ensure that the shared elements receive a unique ID.
In some types of dialogues the visual situation may change: new objects may be created and old objects destroyed (e.g., when the visual situation is the screen). These situations may be modeled by allowing for the creation of new universes in the middle of dialogues, although this is not yet supported.
Data Source
There are no additional requirements on source data for the use of universes, unless a scheme implements a restriction on what coreferences are to be annotated based on the types of objects referred to; in this case, the annotator needs a description of the objects to check against. For instance, if the annotator were to mark up only references to Map Task landmarks, then the annotator would need a list of landmarks or copies of the maps. This information may not be enshrined in the data files themselves but in the coding module for the scheme instantiation.
Segmentation
Not applicable.
Assignment
The modifies attribute for all but the common universe should be set to common.
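One way to read the modifies attribute is that a participant's universe extends the common universe with the entities known only to that participant. Under that reading (an interpretation, not something the scheme states formally), the entities available to a participant can be collected by following the modifies pointers. The dictionary layout and the function name below are assumptions made for illustration.

```python
def visible_entities(universes, universe_id):
    """Collect the ue IDs available in a universe, following 'modifies'.

    `universes` maps a universe ID to a (modifies, ue_ids) pair; this
    dictionary layout is an assumption made for illustration.
    """
    entities = []
    seen = set()
    current = universe_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental modifies cycles
        modifies, ue_ids = universes[current]
        entities.extend(ue_ids)
        current = modifies
    return sorted(set(entities))


# A three-universe setup of the kind suggested for the MapTask dialogues;
# the follower-only universe is left empty here.
universes = {
    "common": (None, ["ue2"]),
    "GIVER_universe": ("common", ["ue1"]),
    "FOLLOWER_universe": ("common", []),
}
print(visible_entities(universes, "GIVER_universe"))
```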
Example
The following is a simple example of the use of a universe.
(4.18)
<coref:universe ID="u1">
<coref:ue ID="ue1"> Diamond mine </coref:ue>
<coref:ue ID="ue2"> Graveyard </coref:ue>
<coref:ue ID="ue3"> Fast running creek </coref:ue>
<coref:ue ID="ue4"> Fast flowing river </coref:ue>
<coref:ue ID="ue5"> Canoes </coref:ue>
</coref:universe>
FOLLOWER: Uh-huh. Curve round. To your right.
GIVER: Uh-huh.
FOLLOWER: Right.... Right underneath <coref:de ID="de_50"> the diamond mine. </coref:de> Where do I stop.
GIVER: Well....... Do. Have you got <coref:de ID="de_51"> a graveyard?</coref:de> Sort of in the middle of the page?... On on a level to <coref:de ID="de_52"> the c--... er diamond mine. </coref:de>
FOLLOWER: No. I've got <coref:de ID="de_53"> a fast running creek. </coref:de>
GIVER: <coref:de ID="de_54"> A fast flowing river </coref:de>,... eh.
FOLLOWER: No. Where's <coref:de ID="de_55"> that </coref:de>. Mmhmm,... eh. <coref:de ID="de_56"> Canoes </coref:de>
<coref:link href="coref.xml#id(de_50)" type="ident" >
<coref:anchor href="coref.xml#id(ue1)" />
</coref:link>
<coref:link href="coref.xml#id(de_51)" type="ident" >
<coref:anchor href="coref.xml#id(ue2)" />
</coref:link>
<coref:link href="coref.xml#id(de_52)" type="ident" >
<coref:anchor href="coref.xml#id(ue1)" />
</coref:link>
<coref:link href="coref.xml#id(de_53)" type="ident" >
<coref:anchor href="coref.xml#id(ue3)" />
</coref:link>
<coref:link href="coref.xml#id(de_54)" type="ident" >
<coref:anchor href="coref.xml#id(ue4)" />
</coref:link>
<coref:link href="coref.xml#id(de_55)" type="ident" >
<coref:anchor href="coref.xml#id(de_54)" />
</coref:link>
<coref:link href="coref.xml#id(de_56)" type="ident" >
<coref:anchor href="coref.xml#id(ue5)" />
</coref:link>
Note that <coref:de ID="de_55">, that, could be marked up as ident with either the universe entity ue4, or with the discourse entity de_54. One of the advantages of this way of annotating references to the visual situation is that an extended coreference chain tracking mechanism should be able to include in a coreference chain both references to universe elements and references to discourse entities; the annotator may then choose how he/she wishes to annotate this. If the annotation tool can't do this type of coreference chain tracking, then the coding manual should include a disambiguation rule: for the type of multimodal applications on which Bruneseaux and Romary worked it seems preferable to mark links with universe entities rather than marking links with previous discourse entities.
The following is a more complex example, which includes multiple universes encoding different world knowledge and a disagreement about a coreferential link in the dialogue.
(4.19)
<coref:universe ID="common">
<coref:ue ID="ue2"> gold mine </coref:ue>
</coref:universe>
<coref:universe ID="GIVER_universe" modifies="common" >
<coref:ue ID="ue1"> diamond mine </coref:ue>
</coref:universe>
<coref:universe ID="FOLLOWER_universe" modifies="common" >
.....
</coref:universe>
GIVER: Do_you have <coref:de ID="de_20"> diamond_mine. </coref:de>
FOLLOWER: Yes I've got <coref:de ID="de_21"> a gold_mine. </coref:de>
GIVER: Ah. S--.
FOLLOWER: ....
GIVER: You don't have <coref:de ID="de_22"> diamond_mine </coref:de> though.
FOLLOWER: No. It's <coref:de ID="de_23"> a gold_mine </coref:de> according to this one. Presumably <coref:de ID="de_24"> that's </coref:de> the same.
GIVER: Well I've got <coref:de ID="de_25"> a gold_mine </coref:de> as well you see.
<coref:link href="coref.xml#id(de_20)" who-believes="G" type="ident">
<coref:anchor href="coref.xml#id(ue1)" />
</coref:link>
<coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
<coref:anchor href="coref.xml#id(ue2)" />
</coref:link>
<coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
<coref:anchor href="coref.xml#id(de_20)" />
</coref:link>
<coref:link href="coref.xml#id(de_22)" who-believes="G" type="ident">
<coref:anchor href="coref.xml#id(ue1)" />
</coref:link>
<coref:link href="coref.xml#id(de_22)" type="ident">
<coref:anchor href="coref.xml#id(de_20)" />
</coref:link>
<coref:link href="coref.xml#id(de_23)" who-believes="F" type="ident">
<coref:anchor href="coref.xml#id(ue2)" />
</coref:link>
<coref:link href="coref.xml#id(de_23)" who-believes="F" type="ident">
<coref:anchor href="coref.xml#id(de_22)" />
</coref:link>
<coref:link href="coref.xml#id(de_24)" who-believes="F" type="ident">
<coref:anchor href="coref.xml#id(de_22)" />
</coref:link>
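To compare what each participant takes the links to be, the links can be grouped by their WHO-BELIEVES value, with links lacking the attribute falling under the default SHARED. The sketch below assumes that the links have already been read into dictionaries with the keys 'href', 'anchor' and (optionally) 'who_believes'; this layout and the function name are hypothetical.

```python
from collections import defaultdict


def links_by_believer(links):
    """Group (current, anchor) pairs by the WHO-BELIEVES value.

    Each link is a dict with 'href', 'anchor' and an optional
    'who_believes' key; this layout is an illustrative assumption.
    A missing value falls back to the default, SHARED.
    """
    grouped = defaultdict(list)
    for link in links:
        believer = link.get("who_believes", "SHARED")
        grouped[believer].append((link["href"], link["anchor"]))
    return dict(grouped)


# A subset of the links in example (4.19).
links = [
    {"href": "de_20", "anchor": "ue1", "who_believes": "G"},
    {"href": "de_21", "anchor": "ue2", "who_believes": "F"},
    {"href": "de_22", "anchor": "de_20"},
]
print(links_by_believer(links))
```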
Coding Procedure
The annotation should begin with the creation of a <coref:universe> element (or a common universe plus one for each participant, if their knowledge is not the same). This is commonly done before the annotation of discourse entities if the universe is static.
Markup Table
| Element | Attributes | Content |
| coref:universe | ID, MODIFIES | one or more <coref:ue>s |
| coref:ue | ID | description of object |
4.2.4 Seg Elements
Description
Even if we only consider anaphoric relations involving nominal elements, there are at least two situations in which an annotator may wish to mark an anaphoric relation that also involves other types of constituents. The first is the case in which the anaphoric element is either unexpressed or incorporated in the verb. The second is that of so-called discourse deixis (Webber, 1991), in which the antecedent of a nominal expression is an abstract object, such as an event or proposition, introduced in the discourse somewhat indirectly by sentences. (DRAMA allows for such relations to be marked.)
The solution we propose is to use a <coref:seg> element which, like the TEI <seg> element, can be used to mark up arbitrary pieces of text. <coref:seg> elements are given an id which can then be pointed at by a <coref:link> element just like for other anaphoric relations.
The <coref:seg> element could also be used to annotate anaphoric relations between non-nominal elements, such as in VP ellipsis.
Data Source
Data source requirements for <coref:seg> elements are the same as for <coref:de> elements.
Segmentation
To be specified by the coding manual for a given scheme.
Assignment
The id attribute is automatically set by the workbench.
Example
Using <coref:seg> to mark up empty and incorporated constituents: As seen above, in Italian, Spanish and many other languages, certain nominal constituents may not be realized; this is especially common for nominals in subject position, but can also happen in object position, especially in instructions, as in:
Add the dry yeast to the water and let _ sit for a few minutes. Add the rest of the water and sugar. Stir _
These nominals are present in annotations produced by hand (e.g., in the Penn Treebank), but the parsers used for parsing spoken dialogues tend not to produce representations containing empty constituents in this case. If these nominals are not represented in the base level, the verb can be marked with a <coref:seg> element, and the anaphoric relation coded as usual by means of <coref:link> elements, as follows:
(4.20)
coref.xml:
A: Dov'e` <coref:de ID="de_157">Gianni?</coref:de>
[Where is Gianni?]
B: <coref:seg type="pred" ID="seg_158"> e` andato a mangiare </coref:seg>
[_ went to have lunch]
<coref:link href="coref.xml#id(seg_158)" type="ident">
<coref:anchor href="coref.xml#id(de_157)"/>
</coref:link>
This representation can only be used without loss of information when there is at most one empty element; this is true for Italian, but not for Japanese or Portuguese. If more precision is needed, the annotator could define more specific identity relations that also specify which empty argument of the verb enters into the anaphoric relation: such relations could be called, e.g., subj-ident, obj-ident, etc. These relations could then be used instead of ident as the value of the type attribute of the <coref:link> element; we won't make them part of the annotation scheme discussed here, however.
A second case in which an argument is not realized by means of a nominal is that of incorporated clitics, such as daselo in (4.21) below. Clitic suffixes are also found in transcriptions of spoken English:
44.4 : lemme make sure I got all this
44.5 : okay (T)
In the case of incorporated clitics, as well, the verb can be marked with a <coref:seg> element when the parser doesn't produce a morphologically decomposed representation, and then the anaphoric relations in which the clitics are involved can be encoded either by means of a single ident relation or by means of more fine-grained relations such as subj-ident or obj-ident.
(4.21)
coref.xml
A: Mira, te doy <coref:de ID="de_167"> este libro </coref:de> ¿Conoces a <coref:de ID="de_168"> mi suegra?</coref:de>
B: Sí, claro.
A: Pues <coref:seg ID="seg_169"> dáselo </coref:seg> cuando <coref:de ID="de_170"> la </coref:de> veas.
<coref:link href="coref.xml#id(seg_169)" type="obj-ident">
<coref:anchor href="coref.xml#id(de_167)"/>
</coref:link>
<coref:link href="coref.xml#id(seg_169)" type="iobj-ident">
<coref:anchor href="coref.xml#id(de_168)"/>
</coref:link>
Provided that the <coref:seg> elements are identified during the first pass of markable identification, encoding this information should not be any harder than in the case of MUCCS. The real question for this type of annotation is which empty elements to annotate: in addition to 'small pro' elements such as those discussed above, the annotator may also decide to annotate 'big PRO' elements, which according to some syntactic theories occupy the subject position of infinitival clauses.
Using SEG to mark the antecedents of discourse deixis: Abstract objects such as events, actions and propositions can all serve as antecedents of anaphoric expressions. We are not aware of any reliability results for this type of annotation, but the <coref:seg> element can be used to identify the antecedents in this type of anaphora. If desired, the annotator could use a second attribute type to specify the type of object introduced by the <coref:seg> element; type would have values event, prop and action.
(4.22)
<coref:seg type="event" ID="seg_130">The 23-year-old had hit his head against another player</coref:seg> during a game of Aussie-rules football.
McGlinn remembered nothing of <coref:de ID="de_131"> the collision </coref:de>, but developed a headache and had several seizures.
<coref:link href="coref.xml#id(de_131)" type="ident">
<coref:anchor href="coref.xml#id(seg_130)"/>
</coref:link>
(4.23)
a. Despite the latest negative results, doctors are still convinced that Tamoxifen can prevent breast cancer. This is because of the way it blocks the action of oestrogen, the female sex hormone that can make the breast cells of some women go out of control.
b. Despite the latest negative results, <coref:seg type="prop" ID="seg_129"> doctors are still convinced that <coref:de ID="de_131"> Tamoxifen </coref:de> can prevent breast cancer </coref:seg>. <coref:de ID="de_130"> This </coref:de> is because of the way <coref:de ID="de_132"> it </coref:de> blocks the action of oestrogen, the female sex hormone that can make the breast cells of some women go out of control.
<coref:link href="coref.xml#id(de_130)" type="ident">
<coref:anchor href="coref.xml#id(seg_129)"/>
</coref:link>
(4.24)
a.
GIVER: You're sort_of going past stone creek... but your line's curving up past the... flat rocks.
FOLLOWER: Right. Okay.
GIVER: and then starting to come down again.
FOLLOWER: Got that
b.
GIVER: You're sort_of going past stone creek... but your line's curving up past the... flat rocks.
FOLLOWER: Right. Okay.
GIVER: <coref:seg ID="seg_135" type="action">And then starting to come down again.</coref:seg>
FOLLOWER: Got <coref:de ID="de_136"> that </coref:de>.
<coref:link href="coref.xml#id(de_136)" type="ident">
<coref:anchor href="coref.xml#id(seg_135)"/>
</coref:link>
These examples also illustrate some of the problems to be addressed when designing a reliable annotation scheme for discourse deixis: these include deciding what part of the text counts as the antecedent, as well as deciding which type of object the antecedent is (see, e.g., (4.24)).
Coding Procedure
Left to the individual schemes.
Markup Table
| Element | Attributes | Content |
| seg | ID, TYPE | none |
4.3 Integrated Example
See (4.18) and (4.19).
4.4 Joint Coding Procedure
Left to the individual schemes.
5 A MUC 7-LIKE SCHEME
As a first example of how the markup elements introduced in the previous section can be used to annotate according to the indications of some of the current schemes, we show how to use them to annotate according to the MUCCS scheme developed for MUC-7 which, as mentioned above, is the simplest and most reliable of the existing schemes.
5.1 Markup Declaration
Two of the elements introduced in the previous section are needed for this scheme: <coref:de> and <coref:link>.
5.2 Description of Elements
As in the common section, except that the MUC-7 instructions should be used to segment discourse entities. Note that the MUCCS instructions also prescribe the mark up of parts of NPs, not only of full NPs. (See section 8 for a discussion of the options for <coref:de> markup.)
5.3 Integrated Example
See examples (4.15)-(4.17).
5.4 Joint Coding Procedure
According to the MUCCS instructions: first mark up all text elements specified as markables, then annotate all the coreference relations.
6 A MAPTASK SCHEME
This scheme illustrates how to fill in the modular components in order to instantiate the landmark coding which HCRC uses for the Map Task. In their original work, no syntactic constraints on markables were applied, since for their purposes the syntactic form of referring expressions was an empirical question, but annotation was limited to references to Map Task landmarks. This scheme requires the annotator to set up universes of landmarks corresponding to the set of landmarks on both maps, the set of objects only on the giver map, and the set of objects only on the follower map.
6.1 Markup Declaration
Mark-up is as in the common section.
6.2 Description of Elements
As in the common section, with the following exceptions:
6.2.1 Discourse Entities
Description
The <coref:de> tag is used to mark only spans of words referring to a landmark.
Data Source
Since annotators mark spans of words which refer to landmarks, the data must be marked up with words which have IDs, since these will be used to fill in the HREF attributes of the DE elements.
Segmentation/Selection
Annotators should only select text spans which refer to landmarks. Selection requires the annotator to have access either to a list of landmarks or to the Giver and Follower maps.
Assignment
The TYPE of DE elements must be set to IDENT.
Example
See examples (4.18) and (4.19), under "Common Markup".
Coding Procedure
See "Joint Coding Procedure" below.
6.2.2 Universes and universe elements
Description
All common landmarks should be included in a universe called common; then all landmarks in the giver's map only should be included in a universe called GIVER_universe, whereas all the landmarks included in the follower's map only should be included in a universe called FOLLOWER_universe.
Data Source
The annotation of these elements is done on the basis of the maps or a landmark list.
Assignment
The labels of the landmarks in the map should be used as the content of the coref:ue elements.
The modifies attribute of the GIVER_universe and FOLLOWER_universe coref:universe elements should be set to common.
Example
See examples (4.18) and (4.19), under "Common Markup".
Coding Procedure
See "Joint Coding Procedure" below.
6.2.3 Links
Description
The <coref:link> element is used to annotate references to the landmarks.
Data Source
The <coref:link> elements point to <coref:de> elements and <coref:ue> elements.
Segmentation/selection
A <coref:link> should be specified for every <coref:de> element.
Assignment
The href attribute of the <coref:link> element should be set to a discourse entity; the href attribute of the <coref:anchor> should be set to a universe entity. The WHO-BELIEVES attribute can be used to annotate misunderstandings, and in cases of ambiguity, two anchor elements may be included.
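The constraints just stated (every discourse entity has a link, each link starts from a discourse entity, and each anchor is a universe entity) lend themselves to a mechanical check. The following is a minimal sketch under the assumption that links are available as (de_id, [anchor_ids]) pairs; the function name and data layout are hypothetical and not part of the MATE tools.

```python
def check_maptask_links(de_ids, ue_ids, links):
    """Return a list of problems found in a landmark annotation.

    `links` is a list of (de_id, [anchor_ids]) pairs; the representation
    and the function name are assumptions made for illustration.
    """
    problems = []
    linked = set()
    for current, anchors in links:
        if current not in de_ids:
            problems.append(f"link source {current} is not a discourse entity")
        linked.add(current)
        for anchor in anchors:
            if anchor not in ue_ids:
                problems.append(
                    f"anchor {anchor} of {current} is not a universe entity")
    # Every discourse entity is expected to carry at least one link.
    for de_id in sorted(set(de_ids) - linked):
        problems.append(f"{de_id} has no link")
    return problems


de_ids = {"de_50", "de_51", "de_52", "de_55"}
ue_ids = {"ue1", "ue2"}
links = [("de_50", ["ue1"]), ("de_51", ["ue2"]), ("de_55", ["de_54"])]
for problem in check_maptask_links(de_ids, ue_ids, links):
    print(problem)
```

In this sample the link from de_55 is reported because its anchor is a discourse entity rather than a universe entity, and de_52 is reported because it has no link at all.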
Example
See examples (4.18) and (4.19), under "Common Markup".
Coding Procedure
See "Joint Coding Procedure" below.
6.3 Integrated example
See examples (4.18) and (4.19), under "Common Markup".
6.4 Joint Coding Procedure
The annotator, after having set up the universes of landmarks, reads the dialogue from start to finish, looking for references to landmarks to mark as discourse entities and simultaneously linking them to universe elements.
7 DRAMA
DRAMA (Passonneau, 1996) can also be implemented using the markup elements above, but extending the range of values of the TYPE attribute.
7.1 Markup Declaration
As in the common parts of the scheme.
7.2 Description of Elements
As in the common section, with the exception that the TYPE attribute of <coref:link> elements may be one of: {coref, subset, member, part, cause, poss, argptv, prop}. No restriction should be placed on the types of objects for which coreference information is specified. DRAMA gives explicit instructions about the syntax of markables which must be realized either using a MATE query on the morphosyntactic tagging or by the human coder.
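Restricting the TYPE attribute to this set of values can be checked mechanically. The sketch below takes the eight values listed above and reports any link whose type falls outside them; the (href, type) pair representation and the function name are assumptions made for illustration.

```python
# The TYPE values listed for the DRAMA instantiation.
DRAMA_TYPES = {"coref", "subset", "member", "part",
               "cause", "poss", "argptv", "prop"}


def invalid_link_types(links):
    """Return the (href, type) pairs whose type is not a DRAMA value.

    `links` is a list of (href, type) pairs; this layout is an
    illustrative assumption.
    """
    return [(href, link_type) for href, link_type in links
            if link_type not in DRAMA_TYPES]


links = [("de_01", "coref"), ("de_02", "part"), ("de_03", "ident")]
print(invalid_link_types(links))
```

Note that ident, the value used in the common examples, is not in the DRAMA list as given, so a link carrying it is reported here.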
7.3 Integrated Example
See Passonneau's manual.
7.4 Joint Coding Procedure
See Passonneau's manual.
8 AN OVERVIEW OF THE POSSIBLE DECISIONS ABOUT MARKUP
In this section we discuss in more detail some of the issues that the designer of a coreference scheme must confront, and give some suggestions.
8.1 Marking Up Discourse Entities
This section offers guidance on the types of words and phrases which may be marked up as discourse entities (<coref:de>). As in MUCCS and in DRAMA, we only give instructions about how to mark possible antecedents for nominal anaphora; not about marking antecedents of verbal ellipsis or other forms of anaphoric relations. The discussion follows roughly the order of the analogous discussion in DRAMA.
Our intention was to cover as many types of possible antecedents as possible, compatibly with what we think can be annotated reliably; depending on the particular application, only a subset of the markables identified here may actually be marked. The most important decision to be made by the designer of an 'instance' of the present scheme is whether to annotate all NPs, or only those that enter into anaphoric relations (i.e., are either anaphoric expressions or the antecedents of anaphoric expressions). In what follows, we assume that all NPs introducing discourse entities have to be specified as <coref:de> elements, and discuss which ones do not introduce discourse entities.
Whatever decision is made concerning the text constituents to mark, the experience with MUC indicates that it is best to split the annotation task in two: reaching agreement on a set of <coref:de>s before attempting to specify the anaphoric relations with <coref:link> elements as discussed in the next section.
8.1.1 NPs with head noun
The 'canonical' noun phrase consists of a head noun optionally pre- or post-modified by determiners, quantifiers, adjectives, etc. The whole NP (not just the head noun) is marked up wherever it denotes an entity which may be subsequently or previously referred to elsewhere in the text. Both definite and indefinite NPs can potentially enter into this kind of relationship. The following are some examples of canonical NPs which would be marked up (only the markup for the first example is shown).
(8.1) France came from behind to beat Croatia 2-1 and reach
their first World Cup final at the Stade de France. (BBC)
(8.2) France came from behind to beat Croatia 2-1 and reach
<coref:de id="de_30"> their first World Cup final
</coref:de> at the Stade de France. (BBC)
(8.3) But home fans at the Stade de France endured an agonising final
20 minutes after Laurent Blanc was shown the red card following
a tussle with Slaven Bilic. (BBC)
(8.4) Prolific Davor Suker gave the keeper no chance for his
fifth goal of the tournament. (BBC)
(8.5) A high-class move followed two minutes later with Youri
Djorkaeff finding Guivarc'h with an excellent ball. (BBC)
(8.6) The first three students came in.
(8.7) A lot of students followed
Note that in the case of the first three students, three students is not marked separately even though it could occur by itself in NP position; this is because the subconstituent could not serve as antecedent by itself. However, where the quantifier occurs in an of-construction with a full NP complement, both NPs should be marked as both could serve as antecedents, as follows:
(8.8) Some of the symptoms are barely noticeable except when the patient is tired.
(8.9) <coref:de ID="DE_101"> Some of <coref:de ID="DE_102"> the symptoms </coref:de> </coref:de> are barely noticeable except when the patient is tired.
NPs should only be marked up where they introduce a new discourse entity. As we will see below, there are cases in which this is clearly not so: e.g., when an indefinite NP occurs as a predicate nominal. In some cases it is difficult to tell whether a given indefinite NP is predicative or not. Our suggestion is to keep the rules for these cases simple: unless it has been decided only to mark NPs that actually enter into anaphoric relations, mark all indefinite NPs as <coref:de>s except when they occur in the special constructions discussed in 8.1.8.
It should be noted that indefinite NPs may introduce discourse entities even when they do not refer to anything in the world: e.g., in
(8.10) I want to buy a car.
A car does not refer to any particular object in the world (unlike, say, in I've just seen a lovely car, but it was too expensive). Yet, that car can still occur in anaphoric relations, although under particular conditions (`modal subordination'): e.g., it is possible to continue (8.10) by saying I need it/one to go to work. (Indeed, this possibility of anaphoric relations to expressions that do not refer is the reason why the intermediate level of `discourse entity' was introduced - see e.g., (Karttunen, 1967) or (Webber, 1978).)
8.1.2 NPs containing relative clauses
Where an NP contains a restrictive relative clause, the whole NP, including the restrictive relative clause, should be marked up as a single discourse entity.
(8.11) They will play Brazil on Sunday after a dominant performance
against a Croatia side who surprised many by reaching the
semi-final stage. (BBC)
(8.12) They will play Brazil on Sunday after a dominant performance
against <coref:de ID="DE_31"> a Croatia side who surprised many
by reaching the semi-final stage </coref:de>.
It may be argued that non-restrictive relative clauses should not be marked up as part of the discourse entity, as they supply additional information about the referent rather than helping to identify him or her. However, it may be useful for the goal of the annotation to include information about the pattern of reference of non-restrictive relative clauses, and annotators may therefore decide that these elements should be marked up. (If the non-restrictive relative clause is not to be marked up, the <coref:de> tag should be assigned to the part of the NP that precedes the relative clause.) (8.14) illustrates how to tag a non-restrictive relative clause leaving it outside the <coref:de> tag; (8.15) shows how to include it.
(8.13) The 26-year-old Thuram, who had never before scored for France, scored twice after prolific Davor Suker had put Croatia in front at the start of the second half. (BBC)
(8.14) <coref:de ID="DE_32"> The 26-year-old Thuram </coref:de>, who had never before scored for France, scored twice after prolific Davor Suker had put Croatia in front at the start of the second half.
(8.15) <coref:de ID="DE_33"> The 26-year-old Thuram, who had never before scored for France </coref:de>, scored twice after prolific Davor Suker had put Croatia in front at the start of the second half.
(8.16) A spokeswoman for the Prince of Wales has confirmed that the encounter, which is said to have been amicable, did take place, but stressed it was a private matter for the family. (BBC)
We assume in what follows that non-restrictive relative clauses, too, are included inside the <coref:de> tag for a given NP.
8.1.3 Bare nouns
Where the NP consists only of a noun, this should be marked up normally as a <coref:de>, as in (8.17).
(8.17)
5.1 M: and there're <coref:de> oranges </coref:de> at Corning (T)
However, bare nouns in non-head position - e.g., the premodifier orange in (8.18) - should normally not be marked up, since generally they do not enter in anaphoric relations; only the NP orange juice as a whole would be a markable.
(8.18)
7.2 : we have to make <coref:de> orange juice </coref:de>
However, the designer of a scheme may decide to allow for bare nouns in this position to optionally be marked as <coref:de>s when entering into anaphoric relations with other NPs.
Bare NPs may often be used to talk about kinds (Carlson, 1977) rather than tokens, as in utterance 1.6 in (8.19): in this example, no specific bananas have been chosen, and any will do.
(8.19)
1.1 M: okay
1.2 : I have to get
1.3 : one tanker of OJ
1.4 : to
1.5 : Avon
1.6 : and a boxcar of bananas
1.7 : ... to
1.8 : ... Corning
: ...
3.2 : so there're
3.3 : bananas at Avon
The relation between kinds such as the one in 1.6 and sets of objects such as the one denoted by the bare NP bananas in the later utterance 3.3 is not really identity of reference; at most, it can be said that the bananas in 3.3 are an instance of the type of objects mentioned in 1.6. The core scheme makes no explicit provision for this type of relation; however, if the designer of a scheme wished to mark relations like these within the core scheme, the ident relation could be used in a looser sense (denoting `sense identity'), as done in MUCSS. The extended scheme does make provision for the mark-up of the relationship between a kind and a token or instantiation of that kind; see Instantiation.
8.1.4 Noun phrases without a head noun: pronouns
Personal pronouns, demonstrative pronouns, possessive pronouns and indefinite pronouns may all enter into coreference and should be marked up wherever necessary.
(8.20)
31.4 : now I have a good i /
31.5 : oh no we can't use the same thing (T)
(8.21)
104.11 : the boxcar has a bad wheel
104.12 : and won't be available for 8 hours
105.1 M: oooh...
106.1 S: so we can't use that (T)
(8.22)
144.1 S: we get our OJ ready by 8 (T)
(8.23)
Sujet: Euh, peut-on construire la partie
supérieure sphérique du, de la surface ?
Compere: Je ne dispose pas de surface
Compere: A partir de quoi voulez-vous en créer une ? (G)
It is not usually necessary to mark up each occurrence of first and second person pronouns, as the cases in which they co-specify can generally be automatically determined. However, where assigning reference to these pronouns seems unusually complicated, as is sometimes the case where there are many speakers, the annotator may choose to mark them up.
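As an illustration of why these pronouns rarely need manual mark-up, the following minimal sketch resolves singular first and second person pronouns in a two-party dialogue. The word lists, the function name and the sample turns are our own assumptions; plural forms such as we are deliberately left out, and dialogues with more than two speakers are rejected, since these are the cases the annotator may still want to mark by hand.

```python
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}


def resolve_participant_pronouns(turns, speakers):
    """Map each singular first/second person pronoun to a speaker label.

    turns    -- list of (speaker, utterance) pairs
    speakers -- the two participant labels; only two-party dialogue is handled
    """
    if len(speakers) != 2:
        raise ValueError("ambiguous addressee: mark these pronouns up by hand")
    resolved = []
    for turn_index, (speaker, utterance) in enumerate(turns):
        # In a two-party dialogue the addressee is simply the other speaker.
        addressee = speakers[1] if speaker == speakers[0] else speakers[0]
        for token in utterance.lower().split():
            word = token.strip(".,?!")
            if word in FIRST_PERSON:
                resolved.append((turn_index, word, speaker))
            elif word in SECOND_PERSON:
                resolved.append((turn_index, word, addressee))
    return resolved


# Invented turns, for illustration only.
turns = [("M", "I have to get one tanker of OJ"), ("S", "do you want my help?")]
print(resolve_participant_pronouns(turns, ["M", "S"]))
```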
Pleonastic NPs (expletives) should not be marked up, as they never enter into anaphoric relations.
(8.24) 85.6 : now how long does
it take from Elmira to Corning? (T)
(8.25) It seems to me that John is going mad.
Reflexive pronouns may be marked up as <coref:de> if they are considered to truly denote an item in the world (8.26). This should be decided on the basis of whether they identify a person or thing which could be considered an argument of the verb. For example, in the English phrases Julie washed herself or Bill was talking to himself, the reflexive pronouns clearly identify an argument of the verb, and could indeed be replaced by other noun phrases. In Spanish, however, many verbs take what might be considered 'lexicalised' reflexive pronouns: they do not appear to refer to an argument, and could not be replaced by another NP, e.g. reírse (to laugh), irse (to leave / be off). There are also cases in Spanish of the reflexive pronoun 'se' being used in impersonal and passive constructions, in which the pronoun does not seem to refer to any argument of the verb. A rough guideline might be that these reflexive pronouns should be marked up when their use coincides with the referential use of reflexives in English. In (8.27), for example, the first two instances of the reflexive pronoun se do not seem to refer to anything, but the last (adapts itself) would be deemed referential and marked up as a <coref:de>. However, this may be an overly anglocentric view, and annotators may choose to adopt their own decision mechanisms in this area.
(8.26) He picked himself up off the floor.
(8.27) Durante el Congreso se escucharon llamados a que se
reconozcan los errores del pasado y para que el partido
<coref:de> se </coref:de> adapte al nuevo clima
político.
(During the conference calls were heard that past errors
should be recognised and that the party should adapt itself
to the new political climate.)
8.1.5 Other phrases without a head noun
Proper names should be marked up wherever necessary (8.28), (8.29). They may be pre-modified, in which case the modifier should be included in the <coref:de> (8.30). However, where a proper name contains another NP within it, this smaller NP usually need not be separately marked up, since it typically works as a modifier and does not enter into anaphoric relations: in the two examples numbered (8.31), for example, the word Parades would not be individually marked in the first, nor would Ulster in the second.
(8.28) France came from behind to beat Croatia 2-1 and reach their first World Cup final at the Stade de France. (BBC)
(8.29) But home fans at the Stade de France endured an agonising final 20 minutes after Laurent Blanc was shown the red card following a tussle with Slaven Bilic. (BBC)
(8.30) Prolific Davor Suker gave the keeper no chance for his fifth goal of the tournament. (BBC)
(8.31) The independent Parades Commission banned the Protestant march from entering the nationalist Garvaghy Road on Sunday. (BBC)
(8.31) While they have held a largely peaceful protest at the entrance of the Garvaghy Road, watched by the Royal Ulster Constabulary and the British army, loyalist violence has erupted throughout the province in protest.
Other examples of phrases classified as NPs without having a head noun are phrases with adjectives or quantifiers as heads, as in (8.32) and (8.33). Gerundive clauses also function in a similar way to NPs, and should be annotated where necessary.
(8.32) I prefer the largest. (D)
(8.33) A few people found their way to the destination but a great many did not understand the directions. (D)
(8.34) They had been accused of ignoring the environment.
8.1.6 Conjoined NPs
Where two or more NPs are conjoined or disjoined, it may be necessary to mark up the larger NP as well as the constituent NPs, depending on whether it is referred to later in the dialogue. In (8.35), for example, the coordinated NP John and Louise serves as antecedent for the plural pronoun They.
(8.35) John and Louise went out for a fancy meal for her graduation. They had a huge argument and he walked out without paying.
(8.36)
<coref:de ID="DE_40"> <coref:de ID="DE_41"> John </coref:de> and <coref:de ID="DE_42"> Louise </coref:de> </coref:de> went out for a fancy meal for <coref:de ID="DE_43"> her </coref:de> graduation. <coref:de ID="DE_44"> They </coref:de> had a huge argument and <coref:de ID="DE_45"> he </coref:de> walked out without paying.
8.1.7 Linguistic contexts that do not introduce discourse entities
While the discussion of markables above defines the class of elements which can be marked up, there are cases where the linguistic or discourse context of an NP makes it inappropriate to mark it as a <coref:de> element. Although we assume that in most cases markables will be automatically identified by means of search patterns formulated in terms of the MATE query language, the annotator may prefer not to annotate such NPs as <coref:de>s. Predicate nominals are one example: there the NP cannot be considered to introduce a discourse entity. Some such contexts are discussed below.
Predicate nominals
Where the noun phrase can clearly be identified as being predicative, this should not be marked up, as this type of phrase does not introduce a discourse entity.
(8.51) John is a policeman
(cf. John is tall.)
Indefinite NPs in copular position are typically predicative and need not be annotated. A more complex case is the one in which the NP in copular position is definite, as in the following example:
(8.52) John is the President of the USA.
In the case of sentences like (8.52) one might argue either that we have an explicit equality and therefore both NPs should be marked as <coref:de>s, or that we only need to mark one (presumably the subject) since only one discourse entity is introduced by this sentence. Again, the decision depends in part on the task: if the system is used for information extraction it may be useful to annotate all anaphoric relations, whereas in other cases annotating these NPs may be considered a waste of time. A cautious policy is that unless an NP of this type is clearly predicative, it should be marked up as a discourse entity.
Appositional phrases
There are a number of different types of appositive constructions. Basic apposition consists of two noun phrases with identical reference, which need not be contiguous (8.53), (8.54). However, the appositional phrase may not be a simple NP (8.55), (8.56), may be introduced or followed by a marker such as say or included (8.56), (8.57), (8.58), and may even enter into a restrictive apposition, lacking the distinctive commas (8.59), (8.60).
(8.53) News of the sudden death of the imprisoned opposition leader, Chief Moshood Abiola, has shaken Nigeria. (BBC)
(8.54) An unusual present awaited him, a book on ethics (QG)
(8.55) The reason that he gave, that he didn't notice the other car,... (QG)
(8.56) We should send one of the engines at Avon, say engine E1, to Bath to pick up the tanker car (T)
(8.57) Many people, my sister included, ... (QG)
(8.58) Many professions, such as the legal profession, ... (GG)
(8.59) The famous critic Paul Jones (QG)
(8.60) Your duty to report the accident takes precedence over everything else (QG)
Our suggestion is to follow what is proposed in MUCSS (Hirschman 1997), and tag the NP as a whole as well as any separate NP contained in the appositive clauses, if the appositive clause is contiguous to the NP (8.61), (8.63). Discontinuous appositions can be marked separately (8.62). An ident link can then be marked between the appositive clause and either the other clause (discontinuous apposition) or the NP as a whole (see below). In the case of restrictive appositions, only the NP as a whole will be marked (8.64), (8.65).
(8.61)
News of the sudden death of <coref:de ID="DE_44"> the imprisoned opposition leader, <coref:de ID="DE_45"> Chief Moshood Abiola </coref:de> </coref:de>, has shaken Nigeria
(8.62)
<coref:de ID="DE_46"> An unusual present </coref:de> awaited him, <coref:de ID="DE_47"> a book on ethics </coref:de>
(8.63)
We should send <coref:de ID="DE_50"> one of <coref:de ID="DE_51"> the engines at Avon </coref:de>, say <coref:de ID="DE_52"> engine E1 </coref:de> </coref:de>, to Bath to pick up the tanker car
(8.64) <coref:de ID="DE_53"> the famous critic Paul Jones </coref:de>
(8.65) <coref:de ID="DE_54"> Your duty to report the accident
</coref:de> takes precedence over everything else
With appositions we have the same problem as with predicate nominals: some appositional phrases are really just predicative, and need not be marked up separately (see Predicate nominals).
(8.66) Norman Jones, at that time a student,... (QG)
(8.67) <coref:de ID="DE_48"> Norman Jones, at that time a student </coref:de>,...
(8.68) <coref:de ID="DE_49"> Julius Caesar, a well-known emperor </coref:de>
In the following case, however, it may be argued that the appositional phrase does introduce a discourse entity which is equated with the one introduced by Julius Caesar:
(8.69) Julius Caesar, the well-known emperor
Our only recommendation is to treat the two cases in the same way.
Negated or questioned contexts
Where the existence of an entity is being denied by the context in which the noun phrase occurs, as in (8.70), the discourse entity introduced by that noun phrase often will not enter into any anaphoric relations; in these cases, the designer of a scheme may decide to mark the string as a <coref:de> only if it enters into subsequent anaphoric relations, as in (8.71).
(8.70) But the volume of noise from the home fans began to subside as the opening goal failed to materialise. (BBC)
(8.71) I don't want to buy a car. It would cost me too much money.
A similar problem arises with NPs which occur as part of a question. While NPs within questions often don't really refer to an item in the world, annotators may wish to mark these items up in order to link them with another reference to an object; a link which may be made in the answer to the question (8.72). However, it should be noted that this is not really a co-specification relation; see the section Instantiation in the extended scheme for a full treatment of <coref:de> in questions.
(8.72)
Does anyone have a pencil?
Yes - there's one there
Disfluencies
Disfluencies and repairs are common features of spoken dialogues. In this case, the problem is to decide whether noun phrases introduced within repaired parts of a dialogue should be marked up. As some theories suggest that repaired items are not available for subsequent processing, and to avoid unnecessary marking up of elements, we suggest making the marking optional: only those items within repaired sections which are subsequently referred to should be marked up. Note that if the repaired item corefers with an earlier <coref:de> and then is subsequently referred to, the later reference may be linked directly with the initial one, making the marking up of the repaired item unnecessary in this case. In (8.73), only the most complete phrase fragment needs to be marked up for the presence of elles, coreferential with ces deux fusées.
(8.73)
193 F: Donc qu'est ce qui / qu'est ce qui serait commun à ces deux fusées. Ces deux fusées ont /
194 D: c'est qu'elles ont / elles ont la même / elles /
elles / toutes les / tous les ailerons
(In this case we deviate from the approach taken in DRAMA by Passonneau (1996), who proposes to mark up all NPs occurring in disfluencies, although again they do not create ambiguity.)
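The optional-marking policy suggested here for repaired items (and above for negated contexts) amounts to keeping only those optional markables that take part in some link. The following is a minimal sketch, assuming links are available as (source, relation, anchor) triples; the function name and the IDs are hypothetical.

```python
def markables_to_keep(optional_ids, links):
    """Return the optional markables that take part in at least one link.

    optional_ids -- IDs of markables whose mark-up is optional
    links        -- (source_id, relation, anchor_id) triples
    """
    linked = set()
    for source_id, _relation, anchor_id in links:
        linked.add(source_id)
        linked.add(anchor_id)
    # Preserve the annotator's original order of the optional markables.
    return [de_id for de_id in optional_ids if de_id in linked]


# Hypothetical IDs: de_r1 and de_r2 occur inside repaired material,
# and only de_r2 is picked up by a later pronoun (de_7).
links = [("de_7", "ident", "de_r2")]
print(markables_to_keep(["de_r1", "de_r2"], links))
```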
8.1.8 Discontinuous elements
Sometimes a discourse entity is not introduced by a single
continuous phrase, but by a number of different utterances interrupted
by disfluencies or comments. Sometimes this may be due to the way the
text is segmented in the basic level: in the TRAINS corpus, for
example, a great number of <coref:de> elements appear discontinuous due to
the way in which the corpus has been marked up, with each speaker's
utterance split into small strings over many numbered lines (8.45). If
no comments or unrelated utterances intervene, it may be possible to
simply group a number of lines into a single utterance, that can then
be marked up as <coref:de>. If, however, the original separation
into utterances has to be preserved, we propose to rely on the fact
that information about discontinuous constituency is represented at
the chunk level by means of next and prev attributes (see (8.46)), and
to make the <coref:de> element point to the entire span of chunks
beginning with the first chunk that is actually part of the NP ("one
of the" in (8.45)) until the last chunk ("say engine E2" in
the same example), as in (8.47).
(8.45)
9.6: I think what we should do
9.7: is
9.8: hook up
9.9: uh one of the
[2sec]
9.10: engines
9.11: uh
9.12: at Elmira
9.13: say engine E2
(8.46) ch.xml:
9.6: I think what we should do
9.7: is
9.8: hook up
9.9: uh
<ch ID="ch_60"> one of the </ch>
[2sec]
9.10: <ch ID="ch_61" next="ch_63"> engines </ch>
9.11: <ch ID="ch_62"> uh </ch>
9.12: <ch ID="ch_63" prev="ch_61"> at Elmira </ch>
9.13: <ch ID="ch_64"> say </ch>
<ch ID="ch_65"> engine E2 </ch>
(8.47) coref.xml:
<coref:de ID="DE_01" href="ch.xml#id(ch_60)..ch.xml#id(ch_65)"/>
<coref:de ID="DE_02" href="ch.xml#id(ch_65)"/>
In (8.48), the giver's <coref:de> is interrupted by the follower's turn; again the mechanism for dealing with discontinuous constituents with chunks must be used to reconstruct it:
(8.48)
GIVER: curving, just curving round the diamond
FOLLOWER: uh-huh
GIVER: mine...... uh-huh
(8.49) ch.xml:
GIVER: Curving, just curving round
<ch ID="ch_66" next="ch_68"> the diamond </ch>
FOLLOWER: <ch ID="ch_67"> uh-huh </ch>
GIVER: <ch ID="ch_68" prev="ch_66"> mine </ch>
<ch ID="ch_69"> ...... uh-huh </ch>
(8.50) coref.xml:
<coref:de ID="DE_01" href="ch.xml#id(ch_66)..ch.xml#id(ch_68)"/>
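To make the pointer mechanism concrete, the following minimal sketch expands a span-valued href into the chunks it covers and, separately, follows the next attributes to recover the text of the discontinuous NP. It assumes that the chunks of ch.xml sit under a single wrapper element (here called chunks, a name of our own) and that a span is written as two #id(...) pointers separated by two dots, as in the examples above; both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

ID_POINTER = re.compile(r"#id\(([^)]+)\)")

# Chunks of (8.49); the <chunks> wrapper element is an assumption.
CHUNKS = """
<chunks>
  <ch ID="ch_66" next="ch_68"> the diamond </ch>
  <ch ID="ch_67"> uh-huh </ch>
  <ch ID="ch_68" prev="ch_66"> mine </ch>
  <ch ID="ch_69"> ...... uh-huh </ch>
</chunks>
"""


def chunks_in_span(href, chunk_xml):
    """Return (ID, text) for every chunk between the two ends of the href."""
    chunks = [(ch.get("ID"), (ch.text or "").strip())
              for ch in ET.fromstring(chunk_xml).iter("ch")]
    ids = [chunk_id for chunk_id, _ in chunks]
    pointers = ID_POINTER.findall(href)
    # A single pointer denotes a one-chunk span.
    first, last = pointers[0], pointers[-1]
    return chunks[ids.index(first): ids.index(last) + 1]


def follow_next(first_id, chunk_xml):
    """Join the text of the chunks reached from first_id via 'next' attributes."""
    by_id = {ch.get("ID"): ch for ch in ET.fromstring(chunk_xml).iter("ch")}
    texts, seen, current = [], set(), first_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against circular next pointers
        texts.append((by_id[current].text or "").strip())
        current = by_id[current].get("next")
    return " ".join(texts)


span = chunks_in_span("ch.xml#id(ch_66)..ch.xml#id(ch_68)", CHUNKS)
print(span)
print(follow_next("ch_66", CHUNKS))
```

The span includes the follower's intervening chunk, whereas following next skips it; which of the two a tool should display is a design decision that the scheme leaves open.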
8.2 Assigning Links
This section discusses a few issues concerning the specification of links between discourse entities.
8.2.1 Coordinated NPs
Where two or more NPs are conjoined or disjoined, it may be necessary to mark up the larger NP as well as the constituent NPs, depending on whether it is referred to later in the dialogue. In (8.74), for example, the coordinated NP John and Louise serves as antecedent for the plural pronoun They.
(8.74) John and Louise went out for a fancy meal for her graduation. They had a huge argument and he walked out without paying.
(8.75)
<coref:de ID="de_40"> <coref:de ID="de_41"> John </coref:de> and <coref:de ID="de_42"> Louise </coref:de> </coref:de> went out for a fancy meal for <coref:de ID="de_43"> her </coref:de> graduation. <coref:de ID="de_44"> They </coref:de> had a huge argument and <coref:de ID="de_45"> he </coref:de> walked out without paying.
<coref:link href="coref.xml#id(de_43)" type="ident" >
<coref:anchor href="coref.xml#id(de_42)" />
</coref:link>
<coref:link href="coref.xml#id(de_44)" type="ident">
<coref:anchor href="coref.xml#id(de_40)" />
</coref:link>
<coref:link href="coref.xml#id(de_45)" type="ident" >
<coref:anchor href="coref.xml#id(de_41)" />
</coref:link>
Note that in this example, discourse entity de_40 is a set which includes both John and Louise; the relation between these entities cannot be annotated using only ident, but it is possible to do so with the extended set of relations discussed below.
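The links of (8.75) can be read back as (source, relation, anchor) triples. The following is a minimal sketch, assuming the link elements are wrapped in a root that binds the coref prefix; the namespace URI and the function name read_links are illustrative assumptions rather than part of the scheme.

```python
import re
import xml.etree.ElementTree as ET

COREF_NS = "urn:example:mate-coref"  # hypothetical namespace URI
ID_POINTER = re.compile(r"#id\(([^)]+)\)")

LINKS_8_75 = """
<coref:link href="coref.xml#id(de_43)" type="ident">
  <coref:anchor href="coref.xml#id(de_42)"/>
</coref:link>
<coref:link href="coref.xml#id(de_44)" type="ident">
  <coref:anchor href="coref.xml#id(de_40)"/>
</coref:link>
<coref:link href="coref.xml#id(de_45)" type="ident">
  <coref:anchor href="coref.xml#id(de_41)"/>
</coref:link>
"""


def read_links(fragment):
    """Return one (source_id, relation, anchor_id) triple per anchor."""
    root = ET.fromstring(f'<links xmlns:coref="{COREF_NS}">{fragment}</links>')
    triples = []
    for link in root.iter(f"{{{COREF_NS}}}link"):
        source = ID_POINTER.search(link.get("href")).group(1)
        relation = link.get("type")
        for anchor in link.findall(f"{{{COREF_NS}}}anchor"):
            target = ID_POINTER.search(anchor.get("href")).group(1)
            triples.append((source, relation, target))
    return triples


print(read_links(LINKS_8_75))
```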
8.2.2 Possessive pronouns
Possessive pronouns co-specify with their antecedents (e.g. Louise ... her graduation), and are therefore marked as ident, as shown in (8.75). The relationship between the whole NP designating the possessed item and its possessor, however, is not identity; Louise and her graduation clearly do not refer to the same entity in the world. (Again, these relationships are part of the extended scheme: see Extended scheme: possessive.)
8.2.3 Clitics
As discussed in Section 7.2.2 above, we propose to use <coref:seg> elements to mark anaphoric expressions such as clitics which are morphologically incorporated into a verb. We then use <coref:link> elements to mark the anaphoric relations of the discourse entity referred to by a clitic with other discourse entities. So, for example, the anaphoric information in (8.38) would be annotated as follows:
(8.76)
A: Mira, te doy <coref:de id="de_1"> este libro </coref:de> ¿Conoces a <coref:de id="de_2"> mi suegra </coref:de>?
Pues <coref:seg id="seg_3"> dáselo </coref:seg> cuando <coref:de id="de_5"> la </coref:de> veas.
<coref:link href="coref.xml#id(seg_3)" type="ident">
<coref:anchor href="coref.xml#id(de_2)" />
</coref:link>
<coref:link href="coref.xml#id(de_5)" type="ident">
<coref:anchor href="coref.xml#id(de_2)" />
</coref:link>
8.2.4 Appositions
Depending on the application, it may be useful to mark an ident link between the appositive clause and either the other clause (discontinuous apposition) or the NP as a whole.
(8.77)
News of the sudden death of <coref:de ID="de_44"> the imprisoned opposition leader, <coref:de ID="de_45"> Chief Moshood Abiola </coref:de> </coref:de>, has shaken Nigeria
<coref:link href="coref.xml#id(de_45)" type="ident" >
<coref:anchor href="coref.xml#id(de_44)" />
</coref:link>
(8.78)
<coref:de ID="de_46"> An unusual present </coref:de> awaited him, <coref:de ID="de_47"> a book on ethics </coref:de>
<coref:link href="coref.xml#id(de_46)" type="ident" >
<coref:anchor href="coref.xml#id(de_47)" />
</coref:link>
(8.79)
We should send <coref:de ID="de_50"> one of <coref:de ID="de_51"> the engines at Avon </coref:de>, say <coref:de ID="de_52"> engine E1 </coref:de> </coref:de>, to Bath to pick up the tanker car
<coref:link href="coref.xml#id(de_52)" type="ident" >
<coref:anchor href="coref.xml#id(de_50)" />
</coref:link>
8.2.5 Coding Procedure for <coref:link> elements
The annotation should proceed in two steps: first all <coref:de> elements should be marked and agreed upon by the markers, then all links should be established. No convention for choosing a particular textual element as antecedent is needed, provided that the tool used supports coreference chains; in any case, the chains can be computed by hand.
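The reason no antecedent convention is needed is that ident links induce equivalence classes: whichever earlier mention an annotator picks as anchor, the resulting chain is the same. The following minimal sketch computes the chains with a union-find structure; the triple representation of links, the function name and the IDs are our own assumptions.

```python
def coreference_chains(links):
    """Group discourse entities into chains using only the ident links.

    links -- (source_id, relation, anchor_id) triples
    """
    parent = {}

    def find(de_id):
        # Follow parent pointers to the representative, halving paths on the way.
        parent.setdefault(de_id, de_id)
        while parent[de_id] != de_id:
            parent[de_id] = parent[parent[de_id]]
            de_id = parent[de_id]
        return de_id

    for source, relation, anchor in links:
        if relation == "ident":
            parent[find(source)] = find(anchor)

    chains = {}
    for de_id in list(parent):
        chains.setdefault(find(de_id), []).append(de_id)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical IDs. Annotator A links de_5 to the nearest mention (de_2),
# annotator B links it to the first mention (de_1); the chains coincide.
annotator_a = [("de_2", "ident", "de_1"), ("de_5", "ident", "de_2"), ("de_9", "ident", "de_7")]
annotator_b = [("de_2", "ident", "de_1"), ("de_5", "ident", "de_1"), ("de_9", "ident", "de_7")]
print(coreference_chains(annotator_a))
print(coreference_chains(annotator_a) == coreference_chains(annotator_b))
```

Links of other types are ignored here, since relations such as member or poss do not merge two discourse entities into one.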
8.3 Extending the set of anaphoric relations
In this section we discuss various issues that arise when trying to annotate more complex anaphoric relations than simple identity. As mentioned in Section 7, this can be done using the markup elements introduced in Section 4, but allowing more values for the type attribute of the <coref:link> element; we provide a specification of the modified <coref:link> element below. Our aim in this section is to highlight some of the problems that arise when doing so and suggest ways of reducing them. The set of relations allowed by the scheme derives from the analysis of Vieira (1998) and includes the bridging relations in DRAMA.
As the poor reliability scores which have been obtained by Poesio and Vieira (1998) for this kind of scheme indicate, once one moves beyond the ident relation, it can be difficult to decide how to classify the link between two elements. We addressed this problem by adopting the TEI technique of specifying `subtypes' of links: in those cases in which it may be difficult to identify precisely the type of relation that exists between two entities, we introduced a more general relation to be used as type of a link, as well as more specific relations to be used as values of the subtype attribute in those cases in which this additional specification is possible.
8.3.1 Links with Extended Relations
Description
The <coref:link> element in the Extended Relations Scheme has two attributes: type and subtype. The type attribute is used to specify the semantic relation between the discourse entity introduced by a textual element and a previous discourse entity; the relations allowed include, in addition to identity, many of the relations often grouped under 'bridging' relations (Clark, 1977).
Data Source
The basic level for link relations is the same as discussed in Section 4.
Segmentation
As above, link elements do not mark parts of text.
Assignment
8.3.1.1 Set Relations
Member
The member value should be used for the type attribute where the discourse entity pointed at by the coref:link element is a member of the set denoted by the discourse entity pointed at by the coref:anchor element. In the preferred reading of (8.82), for example, Paul and Jane are understood to be members of the set denoted by the kids. Note that this relation can apply whether it is the member or the set that appears first in the discourse, but when marking it up, the order of the arguments obviously matters.
(8.82) The kids went to a party last weekend. Paul wanted to wear his new suit, but Jane insisted on wearing her jeans.
(8.83)
<coref:de ID="de_85"> The kids </coref:de> went to a party last weekend. <coref:de ID="de_86"> Paul </coref:de> wanted to wear his new suit, but <coref:de ID="de_87"> Jane </coref:de> insisted on wearing her jeans
<coref:link href="coref.xml#id(de_86)" type="member">
<coref:anchor href="coref.xml#id(de_85)" />
</coref:link>
<coref:link href="coref.xml#id(de_87)" type="member">
<coref:anchor href="coref.xml#id(de_85)" />
</coref:link>
Subset
This value can be used when one discourse entity denotes a subset of the set denoted by the other discourse entity. As in the case of the member relation, the order of the arguments is important when marking up: the subset should be pointed at by the <coref:link> element, whereas the superset should be pointed at by the <coref:anchor> element. In the following example, there are two subsets of the initial set of rockets: the rockets which flew well, and the rockets which didn't fly well.
(8.84)
F: Alors donc / vous avez / ici / les modèles de fusées /
M: Oui
F: Et vous allez essayer de vous mettre d'accord sur un classement /hein classer les fusées qui ont bien volé ou qui ont moins bien volé /
(8.85)
F: Alors donc / vous avez / ici / <coref:de ID="de_88"> les modèles de fusées </coref:de>
M: Oui
F: Et vous allez essayer de vous mettre d'accord sur un classement /hein classer <coref:de ID="de_89"> les fusées qui ont bien volé </coref:de> ou <coref:de ID="de_90"> qui ont moins bien volé </coref:de>
<coref:link href="coref.xml#id(de_89)" type="subset">
<coref:anchor href="coref.xml#id(de_88)" />
</coref:link>
<coref:link href="coref.xml#id(de_90)" type="subset">
<coref:anchor href="coref.xml#id(de_88)" />
</coref:link>
Discourse entities can enter into a number of relationships that could generically be described as cases of `possession', and it is not always easy to decide precisely which type of relation is involved in any given case. In these cases, we propose to use the relation type poss; if a more detailed annotation is required, one of three subtypes can be specified: attribute, partitive, or strict possession.
Attribute
This relation is used when one <coref:de> expresses something which is an attribute of another <coref:de>; canonical examples of this include someone's height or weight. This relation may be expressed in two main ways: using a possessive pronoun or a genitive (our sheer effort, his team's application), or by means of an of-construction (The quality of both teams, la taille de ailerons). In both cases, the relation can be annotated by means of a <coref:link> element of type poss, subtype attr, between the whole NP and the NP denoting the possessor (8.87), (8.89), (8.91), (8.93). The order of the arguments is important: the <coref:link> element should point at the NP denoting the attribute, whereas the <coref:anchor> element should point at the NP denoting the possessor. (If the possessor has been previously mentioned, then an ident link would also be marked between the two mentions of the possessor.) Note that in (8.87) two possessive relations are annotated: a strict possessive link to Aime Jacquet for his team, and an attributive link for his team's application.
(8.86) French boss Aime Jacquet praised his team's application (BBC)
(8.87)
<coref:de ID="de_91"> French boss Aime Jacquet </coref:de> praised <coref:de ID="de_92"> <coref:de ID="de_93"> <coref:de ID="de_94"> his </coref:de> team's </coref:de> application </coref:de>.
<coref:link href="coref.xml#id(de_94)" type="ident">
<coref:anchor href="coref.xml#id(de_91)" />
</coref:link>
<coref:link href="coref.xml#id(de_93)" type="poss" subtype="sposs">
<coref:anchor href="coref.xml#id(de_94)" />
</coref:link>
<coref:link href="coref.xml#id(de_92)" type="poss" subtype="attr">
<coref:anchor href="coref.xml#id(de_93)" />
</coref:link>
(8.88) He said: "I think our sheer effort and mental concentration saw us through."
(8.89)
He said: "I think <coref:de ID="de_95"> <coref:de ID="de_96"> our </coref:de> sheer effort and mental concentration </coref:de> saw us through."
<coref:link href="coref.xml#id(de_95)" type="poss" subtype="attr">
<coref:anchor href="coref.xml#id(de_96)" />
</coref:link>
(8.90)
F: ...les ailerons...
M: la taille de ailerons
F: ...the wings...
M: the height of the wings
(8.91)
F: ...<coref:de ID="de_97"> les ailerons </coref:de>...
M: <coref:de ID="de_98"> la taille de <coref:de ID="de_99"> ailerons </coref:de> </coref:de>
<coref:link href="coref.xml#id(de_98)" type="poss" subtype="attr">
<coref:anchor href="coref.xml#id(de_99)" />
</coref:link>
(8.92) Team mate Rivaldo acknowledged the quality of both teams.
(8.93)
<coref:de ID="DE_100"> Team mate Rivaldo </coref:de> acknowledged <coref:de ID="DE_101"> the quality of <coref:de ID="DE_102"> both teams </coref:de> </coref:de>.
<coref:link href="coref.xml#id(DE_101)" type="poss" subtype="attr">
<coref:anchor href="coref.xml#id(DE_102)" />
</coref:link>
Part
A <coref:link> with type poss and subtype part is used where one <coref:de> denotes a physical part of another <coref:de>. Where the two objects are linked within one phrase, the link is marked in the same way as an attr link (8.95). Where the two <coref:de>s are separately expressed, they each form an argument of the part link, with the part being the first argument, the whole the second. (Note that in order to annotate expressions like the chair leg, which have one sense very similar to that of possessive expressions like the chair's leg, it would be necessary to mark nominal premodifiers, contrary to what is suggested in 8.1.1.)
(8.94) The seat of the chair broke when I stood on it to open the window.
(8.95)
<coref:de ID="DE_104"> The seat of <coref:de ID="DE_103"> the chair </coref:de> </coref:de> broke when I stood on <coref:de ID="DE_105"> it </coref:de> to open the window.
<coref:link href="coref.xml#id(DE_104)" type="poss" subtype="part">
<coref:anchor href="coref.xml#id(DE_103)" />
</coref:link>
(8.96) F: donc est-ce que ces deux fusées ont les même ailerons? (MF)
So do these two rockets have the same wings?
(8.97)
donc est-ce que <coref:de ID="de_105"> ces deux fusées </coref:de> ont <coref:de ID="de_106"> les même ailerons </coref:de>
<coref:link href="coref.xml#id(de_106)" type="poss" subtype="part">
<coref:anchor href="coref.xml#id(de_105)" />
</coref:link>
(8.98) Army experts in Northern Ireland have defused a 1400 pound bomb left near the main road in County Tyrone. The device, which included two booster tubes, may have been designed for an attack on a security force patrol.
(8.99) Army experts in Northern Ireland have defused <coref:de ID="de_107"> a 1400 pound bomb left near the main road in County Tyrone </coref:de>. <coref:de ID="de_108"> The device </coref:de>, which included <coref:de ID="de_109"> two booster tubes </coref:de>, may have been designed for an attack on a security force patrol
<coref:link href="coref.xml#id(de_108)" type="ident" >
<coref:anchor href="coref.xml#id(de_107)" />
</coref:link>
<coref:link href="coref.xml#id(de_109)" type="poss" subtype="part" >
<coref:anchor href="coref.xml#id(de_108)" />
</coref:link>
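The argument order of such links can be read back from the markup mechanically. The following is a minimal sketch in Python, not part of the MATE scheme: the wrapper element doc, the namespace URI and the helper names target_id and read_links are assumptions made for illustration, and type and subtype are read from the link element because that is where the examples above place them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the scheme does not specify one here.
NS = "http://example.org/coref"

# The two links of example (8.99), wrapped in an assumed root element.
SAMPLE = f"""
<doc xmlns:coref="{NS}">
  <coref:link href="coref.xml#id(de_108)" type="ident">
    <coref:anchor href="coref.xml#id(de_107)" />
  </coref:link>
  <coref:link href="coref.xml#id(de_109)" type="poss" subtype="part">
    <coref:anchor href="coref.xml#id(de_108)" />
  </coref:link>
</doc>
"""


def target_id(href):
    """Extract the ID from an href of the form coref.xml#id(de_109)."""
    match = re.search(r"id\(([^)]+)\)", href)
    if match is None:
        raise ValueError(f"unrecognised href: {href!r}")
    return match.group(1)


def read_links(xml_text):
    """Return one (first argument, second argument, type, subtype) per anchor."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter(f"{{{NS}}}link"):
        first = target_id(link.get("href"))
        for anchor in link.iter(f"{{{NS}}}anchor"):
            second = target_id(anchor.get("href"))
            links.append((first, second, link.get("type"), link.get("subtype")))
    return links
```

For the part link in the sample, the first argument is de_109 (two booster tubes, the part) and the second is de_108 (the device, the whole).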
Strict possession
A link of type poss and subtype sposs encodes the relationship between two objects where one 'belongs' to the other; typically, the possessor is a person or animate object. This link can be expressed, like the attributive construction, by a genitive or possessive pronoun (8.100), or by an of-construction (8.102). The order of the arguments is the same as in the attr link - possession first, possessor second.
(8.100) It was a brave decision by Jerry Seinfeld to turn down $5m an episode to make another series of his hugely popular sitcom. (BBC)
(8.101) It was a brave decision by <coref:de ID="de_110"> Jerry Seinfeld </coref:de> to turn down $5m an episode to make another series of <coref:de ID="de_111"> <coref:de ID="de_112"> his </coref:de> hugely popular sitcom </coref:de>
<coref:link href="coref.xml#id(de_112)" type="ident " >
<coref:anchor href="coref.xml#id(de_110)" />
</coref:link>
<coref:link href="coref.xml#id(de_111)" type="poss" subtype="sposs" >
<coref:anchor href="coref.xml#id(de_110)" />
</coref:link>
(8.102) The service is to be held in the Church of Our Lady and St Patrick, in Ballymoney. (BBC)
(8.103) The service is to be held in <coref:de ID="de_113"> the Church of <coref:de ID="de_114"> Our Lady and St Patrick </coref:de> </coref:de>, in Ballymoney.
<coref:link href="coref.xml#id(de_113)" type="poss" subtype="sposs">
<coref:anchor href="coref.xml#id(de_114)" />
</coref:link>
The three types of possessive link can sometimes be difficult to distinguish. Consider (8.104): both of-constructions could arguably be treated as strict possession, or the first could be an attribute and the second a part:
(8.104) The health of Ronaldo, in the hours leading up to Sunday's World Cup final, is dominating the sports pages of newspapers worldwide.
8.3.1.3 Other relations
In this section we illustrate a few other relations that may hold between two discourse entities. The designer of the annotation scheme may decide whether or not to annotate these, depending on the degree of precision needed; otherwise, a simple 'general relation' may be annotated.
Bound anaphors
The value bound should be used for type when a discourse entity is bound by a quantifier (8.105), (8.106). The pronoun and its antecedent are linked by a bound link, with the first argument of the link being bound by the second.
(8.105) Nobody likes to lose his job.
(8.106)
<coref:de ID="de_80"> Nobody </coref:de> likes to lose <coref:de ID="de_81"> his </coref:de> job
<coref:link href="coref.xml#id(de_81)" type="bound">
<coref:anchor href="coref.xml#id(de_80)" />
</coref:link>
(8.107) Every man for himself.
Function-value
The f-v value can be used to indicate the relationship between a function and its value(s). Although these objects have the same reference when the function NP denotes a value, distinguishing this relation from identity is useful in those cases in which it is the sense of the NP that matters, as when a single function is assigned two different values (8.108) - marking these links as ident would result in asserting that 90 degrees is identical with 70 degrees. In this case, we would mark up two f-v links, one between the temperature and 90 degrees, and another between the temperature and 70 degrees (8.109). Because the f-v link is not symmetrical or transitive, unlike the ident link, this does not lead to 70 degrees and 90 degrees being marked as ident. The first argument of the link is the function, the second the value.
(8.108) The temperature rose to 90 degrees before dropping to 70 degrees
(8.109)
<coref:de ID="de_82"> The temperature </coref:de> rose to <coref:de ID="de_83"> 90 degrees </coref:de> before dropping to <coref:de ID="de_84"> 70 degrees </coref:de>
<coref:link href="coref.xml#id(de_82)" type="f-v" >
<coref:anchor href="coref.xml#id(de_83)" />
</coref:link>
<coref:link href="coref.xml#id(de_82)" type="f-v" >
<coref:anchor href="coref.xml#id(de_84)" />
</coref:link>
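The consequence of ident being symmetric and transitive while f-v is not can be made concrete with a small sketch. The Python below is an illustration under stated assumptions rather than part of the scheme: links are represented as (first argument, second argument, type) triples, and the function name ident_classes is hypothetical. It groups IDs into classes using ident links only, so f-v links never merge their arguments.

```python
def ident_classes(links):
    """Group IDs into equivalence classes, merging only across ident links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for first, second, link_type in links:
        # Look up both IDs so that unmerged entities are reported too.
        root_first, root_second = find(first), find(second)
        if link_type == "ident":
            parent[root_first] = root_second

    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in classes.values())


# The markup of (8.109): two f-v links from the temperature.
FV_LINKS = [("de_82", "de_83", "f-v"), ("de_82", "de_84", "f-v")]
# The markup the text warns against: the same pairs marked as ident.
IDENT_LINKS = [("de_82", "de_83", "ident"), ("de_82", "de_84", "ident")]
```

With FV_LINKS the three entities stay in separate classes; with IDENT_LINKS they collapse into one class, which is the unwanted assertion that 90 degrees is identical with 70 degrees.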
Instantiation
This relationship holds between two discourse entities when the later <coref:de> refers to a particular instance of the class identified by the earlier <coref:de>, as in (8.110). The link is marked up as type inst, with the first argument being the instance, and the second the class or non-referential use.
(8.110)
A: We need oranges.
B: There are some at Corning.
(8.111)
A: We need <coref:de ID="de_115"> oranges </coref:de>.
B: There are <coref:de ID="de_116"> some </coref:de> at Corning.
<coref:link href="coref.xml#id(de_116)" type="inst" >
<coref:anchor href="coref.xml#id(de_115)" />
</coref:link>
This type of link might also be used to mark up the relationship between the class of entities whose existence or identity is queried by a question, and an entity that verifies that description:
(8.112)
A: which route do you want to take?
B: the Corning to Elmira route.
(8.113)
A: <coref:de ID="de_116a"> which route</coref:de> do you want to take?
B: <coref:de ID="de_115a"> the Corning to Elmira route</coref:de>.
<coref:link href="coref.xml#id(de_115a)" type="inst" >
<coref:anchor href="coref.xml#id(de_116a)" />
</coref:link>
(8.114) illustrates the difficulty of marking up this kind of relation: the first mention of un train is not referential, and the last clearly is, but it is not so clear what the link should be between either of these and the mention in turn O4; this one is talking about the same hypothetical train as the first mention, but is still not referential, and therefore cannot be inst. We have therefore tentatively linked these first two non-referential uses as ident.
(8.114)
C2:-- est-c'que vous pourriez me dire si: il y a un train vers les douze heures quarante-cinq au départ de Paris Saint-Lazare pour Pontoise?
O3:-- pour Pontoise ?
C3:-- oui
O4:-- un train qui circule tous les jours ?
C4:-- oui
O5:-- ne quittez pas s'il vous plaît
.....
O6:-- allo
C5:-- oui
O7:-- (h) oui vous avez un train à douze heures quarante-cinq hein, il circule tous les jours sauf les dimanches et fêtes (SNCF)
C2:-- Could you tell me if there's a train around twelve forty-five from Paris Saint-Lazare to Pontoise?
O3:-- to Pontoise?
C3:-- Yes.
O4:-- A train which runs every day?
C4:-- Yes.
O5:-- Please hold the line.
......
O6:-- Hello?
C5:-- Yes.
O7:-- Yes, you've got a train at 12:45; it runs every day except Sundays and holidays.
(8.115)
C2:-- est-c'que vous pourriez me dire si: il y a <coref:de ID="de_117"> un train </coref:de> vers les douze heures quarante-cinq au départ de Paris Saint-Lazare pour Pontoise?
O3:-- pour Pontoise ?
C3:-- oui
O4:-- <coref:de ID="de_118"> un train qui circule tous les jours </coref:de>?
C4:-- oui
O5:-- ne quittez pas s'il vous plaît
....
O6:-- allo
C5:-- oui
O7:-- (h) oui vous avez <coref:de ID="de_119"> un train </coref:de> à douze heures quarante-cinq hein, <coref:de ID="de_120"> il </coref:de> circule tous les jours sauf les dimanches et fêtes
<coref:link href="coref.xml#id(de_117)" type="ident" >
<coref:anchor href="coref.xml#id(de_118)" />
</coref:link>
<coref:link href="coref.xml#id(de_119)" type="inst" >
<coref:anchor href="coref.xml#id(de_118)" />
</coref:link>
<coref:link href="coref.xml#id(de_120)" type="ident" >
<coref:anchor href="coref.xml#id(de_119)" />
</coref:link>
An alternative analysis of this example could be as follows: question C2 queries whether the set of trains for Pontoise from Paris Saint-Lazare is non-empty; O4 introduces a new class of trains, which however specializes the first class. So the link between de_118 and de_117 could be analyzed either as a subset relation or perhaps by introducing a new intensional relation between types, specializes. We will not discuss how to do this here.
Event relations
The event relation link encodes the link between a discourse entity and a preceding event or situation, expressed by a noun phrase, verbal phrase or sentence, in case the discourse entity plays a role of some sort in the event/situation. (This link is a generalization of the cause and arg links in DRAMA.) As in the case of possession relations, we propose a general link type e-rel; if further detail is required as to the role the <coref:de> plays in the event, e.g. to follow the DRAMA encoding, this may be provided in the subtype (e.g. cause, agent, patient, etc.); we will not attempt to define a general-purpose set of subtypes here. The coref:link element should point to the discourse entity, the coref:anchor to the event.
In the following example (from Passonneau), the e-rel relation holds between two noun phrases:
There was an explosion. The noise was tremendous.
There was <coref:de ID="de_3"> an explosion </coref:de>. <coref:de ID="de_4"> The noise </coref:de> was tremendous.
<coref:link href="coref.xml#id(de_4)" type="e-rel" >
<coref:anchor href="coref.xml#id(de_3)" />
</coref:link>
In more complex cases, the event is introduced by a verb phrase or a sentence, which then has to be marked up. We propose to use the element <coref:seg> (already used in the Core Scheme for annotating verbal elements containing clitics) for this purpose.
(8.116) Muslims from all over the world were taught gun-making and guerrilla warfare in Afghanistan. The instructors were members of some of the most radical Islamic militant groups in the region. (Independent)
(8.117) <coref:seg ID="de_130"> Muslims from all over the world were taught gun-making and guerrilla warfare in Afghanistan.</coref:seg> <coref:de ID="de_131"> The instructors</coref:de> were members of some of the most radical Islamic militant groups in the region.
<coref:link href="coref.xml#id(de_131)" type="e-rel" >
<coref:anchor href="coref.xml#id(de_130)" />
</coref:link>
8.3.1.4 General
As mentioned above in the introduction to this section, the genrel (general relation) link may be used as a 'catch-all' label where an annotator believes that two discourse entities are related, but does not wish to give a very detailed classification of the type of non-ident relationship involved. Alternatively, this type of relation may also be used in addition to these classes to cover any other types of links which do not appear to fit into the above classification. One example of this can be seen in the phrase The man who gives his paycheck to his wife is wiser than the man who gives it to his mistress, in that 'it' here does not refer to the same entity as its antecedent, 'his paycheck', but rather to something which stands in the same relationship to the second man as the paycheck does to the first.
Example
Plenty of examples are given above.
Coding Procedure
As in the case of the MUCCS scheme, annotation with <coref:link> elements should follow the determination of <coref:de> elements. The value of type should be decided as follows:
a. see if the current discourse entity is identical with a previous discourse entity; if so, create a link, and specify type=ident;
b. else, see if it stands in one of the set relations;
c. else, see if it stands in a possession relation;
d. else, if it appears that the discourse entity is in a relation with one of the previous discourse entities, but the relation is not one of those listed above, create a link and then
1. if additional attribute values are used, examine if one of those applies;
2. else, use type=genrel
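The procedure above is an ordered cascade, which can be sketched as follows. This Python fragment is only an illustration: the function name choose_link_type and its parameters are hypothetical, and the annotator's judgements (whether the entities are identical, which set or possession relation holds, and so on) are assumed to be supplied as arguments rather than computed.

```python
def choose_link_type(identical=False, set_relation=None, poss_subtype=None,
                     related=False, extra_type=None):
    """Return (type, subtype) for a new link, or None if no link is created.

    identical    -- step a: the entity is identical with a previous one
    set_relation -- step b: name of the set relation that holds, if any
    poss_subtype -- step c: subtype of the possession relation, if any
    related      -- step d: some other relation to a previous entity holds
    extra_type   -- step d.1: an additional type value that applies, if any
    """
    if identical:                     # step a
        return ("ident", None)
    if set_relation is not None:      # step b
        return (set_relation, None)
    if poss_subtype is not None:      # step c
        return ("poss", poss_subtype)
    if related:                       # step d
        if extra_type is not None:    # step d.1
            return (extra_type, None)
        return ("genrel", None)       # step d.2
    return None
```

Because the tests are applied in order, an entity judged identical to a previous one is always linked as ident, even if a possession relation could also be argued for.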
Markup Table
| Element | Attributes | Content |
| link | HREF | one or more <anchor> elements |
| anchor | HREF, type, subtype | none |
The mapping between the DRAMA relations and those discussed above is specified as follows:
| DRAMA's name | Meta-scheme name |
| Part | poss, subtype part |
| Cause | e-rel, subtype cause |
| Poss | poss, subtype sposs |
| arg (as part of arg/ptv) | e-rel |
| prop | not included |
| ptv (as part of arg/ptv) | e-rel |
| Coref | ident |
| Subset | subset-of |
| Member | member |
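For converting existing DRAMA annotations, the mapping above can be held in a simple lookup table. The sketch below is in Python; the names DRAMA_TO_META and convert are hypothetical, DRAMA names are compared case-insensitively as an assumption of convenience, and prop is mapped to None to reflect that it is not included in the meta-scheme.

```python
# (type, subtype) in the meta-scheme for each DRAMA relation name.
DRAMA_TO_META = {
    "part": ("poss", "part"),
    "cause": ("e-rel", "cause"),
    "poss": ("poss", "sposs"),
    "arg": ("e-rel", None),
    "prop": None,  # not included in the meta-scheme
    "ptv": ("e-rel", None),
    "coref": ("ident", None),
    "subset": ("subset-of", None),
    "member": ("member", None),
}


def convert(drama_name):
    """Map a DRAMA relation name to (type, subtype), or None if not included."""
    key = drama_name.lower()
    if key not in DRAMA_TO_META:
        raise KeyError(f"unknown DRAMA relation: {drama_name!r}")
    return DRAMA_TO_META[key]
```

A caller would need to decide what to do with relations that map to None, for example skipping them or falling back to a genrel link.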
9 REFERENCES
Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G., Garrod, S.,
Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C.,
Thompson, H. and Weinert, R. (1991) The HCRC MapTask Corpus.
Language and Speech, 34(4): 351-366.
Bruneseaux, F. and Romary, L. (1997) REG: Reference Encoding
Guidelines, Draft, March 25, 1997.
Carletta, J. (1996) Assessing agreement on classification tasks: the
kappa statistic, Computational Linguistics 22(2): 249-254.
Carlson, G. N. (1977). Reference to Kinds in English. PhD
thesis, University of Massachusetts at Amherst
Clark, H. H. (1977) Bridging. In P. N. Johnson-Laird and P.C. Wason,
editors, Thinking: Readings in Cognitive Science. Cambridge
University Press, London and New York.
Fraurud, K. (1990) Definiteness and the processing of NPs in natural
discourse. Journal of Semantics, 7:395-433.
Gross, D., Allen, J. and Traum, D. (1993) The TRAINS 91 Dialogues. TRAINS
Technical Note 92-1.
Grosz, B. J. (1977). The Representation and Use of Focus in
Dialogue Understanding. PhD thesis, Stanford University.
Hawkins, J. A. (1978). Definiteness and Indefiniteness. Croom
Helm, London.
Hirschman, L. (1997) MUC-7 Coreference Task Definition, Version
3.0. In Proc. MUC-7.
Karttunen, L. (1976). Discourse Referents. In J. McCawley, editor,
Syntax and Semantics 7 - Notes from the Linguistic
Underground. Academic Press, New York.
Partee, B. H. (1972) Opacity, coreference, and pronouns. In
D. Davidson and G. Harman, editors, Semantics for Natural
Language. D. Reidel, Dordrecht, Holland, pages 415-441.
Passonneau, R. J. (1996) Instructions for Applying Discourse Reference
Annotation for Multiple Applications (DRAMA).
Poesio, M. and Vieira, R. (1998) A corpus-based investigation of
definite description use. Computational Linguistics, 24(2).
Prince, E. F. (1981) Toward a taxonomy of given-new information. In
P. Cole, editor, Radical Pragmatics. Academic Press, New York,
pages 223-256.
Quirk, R. and Greenbaum, S. (1973) A University Grammar of
English. Longman.
Sidner, C. L. (1979) Towards a computational theory of definite
anaphora comprehension in English discourse. Ph.D. thesis, MIT.
Vieira, R. (1998) Definite Description Resolution in Unrestricted
Texts. Ph.D. thesis, University of Edinburgh, Centre for Cognitive
Science.
Webber, B. (1978) A Formal Approach to Discourse
Anaphora. Ph.D. thesis, Harvard University.