4 MARKUP ELEMENTS COMMON TO ALL OPTIONS

In this section we introduce the core markup distinctions shared by all options allowed by the meta-scheme.

4.1 Markup Declaration

The following elements are used in the coreference schemes. As in all other schemes, we use a single element to mark both anaphoric expressions and the NPs that serve as antecedents. The main difference from the MUC-7 scheme and DRAMA is that, following Bruneseaux and Romary (1997) (who, in turn, followed the TEI specification), we separate the annotation of co-specification from the annotation of discourse entities. We therefore use two main elements: <coref:de>, used to annotate the elements which enter into co-specification relations; and <coref:link>, used to express co-specification between discourse entities. This way of annotating relations has the advantage that a discourse entity can be related by links to more than one other discourse entity; this is important because it allows a discourse entity to be related both to an antecedent introduced in the discourse and to an entity in the universe of discourse. In addition, we have elements for specifying objects in the visual situation that can serve as antecedents, and for marking text constituents that introduce elements which participate in anaphoric relations in an indirect way.

            Embedded elements: <coref:ue>


4.2 Description of Elements

4.2.1 Discourse Entities

Description

The assumption underlying most annotation schemes for coreference is that processing text involves building a discourse model containing discourse entities, and that anaphoric relations are relations between these discourse entities (Webber, 1978; Heim, 1982; Kamp, 1981). We use the <coref:de> tag to annotate the text spans that introduce a discourse entity - that is, that can be subsequently referred to by means of anaphoric expressions. These are commonly noun phrases.  Not all noun phrases do this: for example, whereas

John likes Bill

introduces two discourse entities, as can be shown by the fact that a follow-up like

He is crazy

is ambiguous in that he can refer either to John or to Bill, the sentence

John is a policeman

which from a syntactic point of view also contains two NPs, nevertheless only introduces one discourse entity, as can be seen by the fact that in this case, the continuation He is crazy is not ambiguous. As a consequence, the NP a policeman would not get a <coref:de> tag; in other words, the textual elements given a <coref:de> tag are a subset of the range of NPs.
 

Data Source

The annotation for <coref:de>'s should be included in a file with pointers to a base file which has already been XML tagged with information about the structure of the conversation, ideally using TEI coding (http://etext.virginia.edu/TEI.html), suitably converted into XML. A typical dialogue marked up in TEI has a <teiHeader>, <head>, and a <body> which is broken up into utterances (<u>), marked for speaker. Each <pause> is marked. The <u> might be further segmented, for example into prosodic phrases, using the TEI <seg> tags. Gestures and mouse clicks may also be marked, as may notes made by the annotator or the initial transcriber, and more detailed information can be given about pause durations, type of transitions between speakers, and many other features. The French conversation in (4.1), for example (from the Microfusées corpus), might be marked up as in (4.2):

(4.1)
Formateur: Alors donc / vous avez / ici [au milieu de la table] / les modèles des fusées volé /
[Le formateur dispose le petit paquet de dessins des 9 fusées.]
Mia: Oui
Formateur: Et vous allez essayer de vous mettre d'accord sur un classement / hein classer les fusées qui ont bien volé ou qui ont moins bien volé / [Le formateur montre avec les mains un endroit (bien volé) puis un autre (moins bien volé).]
Mia: Alors par exemple de celle qui a / le / qui a volé le plus loin / à à celle qui a volé moins loin(?)
Instructor: OK, then, here you have [in the middle of the table] the models of the rockets. [The instructor puts down the little packet of 9 rocket designs.]
Mia: Yes
Instructor: And you are going to try to agree on a classification... to classify the rockets which flew well or which flew less well.. [The instructor points to one place (those which flew well) then another (those which flew less well)]
Mia: So for example from the one which.. it.. which flew the furthest... to the one which flew the least far?

(4.2)
 

<u id="u1" who="F">
  <seg id="u1seg1">
    Alors donc
    <pause dur="short"/>
    vous avez
    <pause dur="short"/>
    ici
    <note place="inline">
      au milieu de la table
    </note>
    <pause dur="short"/>
    les modèles des fusées
    <pause dur="short"/>
  </seg>
  <note place="outline" type="stage directions">
    Le formateur dispose le petit paquet de dessins des 9 fusées.
  </note>
</u>
<u id="u2" who="M" trans="pause">
  <seg id="u2seg1">
    Oui
  </seg>
</u>
<u id="u3" who="F">
  <seg id="u3seg1">
    Et vous allez essayer de vous mettre d'accord sur un classement
    <pause dur="short"/>
  </seg>
  <seg id="u3seg2">
    hein classer les fusées qui ont bien volé ou qui ont moins bien volé
    <pause dur="short"/>
  </seg>
  <note place="outline" type="stage directions">
    Le formateur montre avec les mains un endroit (bien volé) puis un autre
    (moins bien volé) .
  </note>
</u>
<u id="u4" who="M" trans="pause">
  <seg id="u4seg1">
    Alors par exemple de celle qui a
    <pause dur="short"/>
    le
    <pause dur="short"/>
    qui a volé le plus loin
    <pause dur="short"/>
    à celle qui a volé moins loin (?)
  </seg>
</u>
<u id="u1" who="F">
  <seg id="u1seg1">
    OK, then,
    <pause dur="short"/>
    you have
    <pause dur="short"/>
    here
    <note place="inline">
      in the middle of the table
    </note>
    <pause dur="short"/>
    the models of the rockets
    <pause dur="short"/>
  </seg>
  <note place="outline" type="stage directions">
    The instructor puts down the little packet of 9 rocket designs
  </note>
</u>
<u id="u2" who="M" trans="pause">
  <seg id="u2seg1">
    Yes
  </seg>
</u>
<u id="u3" who="F">
  <seg id="u3seg1">
    And you are going to try to agree on a classification
    <pause dur="short"/>
  </seg>
  <seg id="u3seg2">
    to classify the rockets which flew well or which flew less well
    <pause dur="short"/>
  </seg>
  <note place="outline" type="stage directions">
    The instructor points to one place (those which flew well) then another
    (those which flew less well).
  </note>
</u>
<u id="u4" who="M" trans="pause">
  <seg id="u4seg1">
    So for example from the one which
    <pause dur="short"/>
    it
    <pause dur="short"/>
    which flew the furthest
    <pause dur="short"/>
    to the one which flew the least far (?)
  </seg>
</u>

The details of the TEI mark-up may not suit all corpora, depending on the format in which the initial transcription has been presented. For example, in the TRAINS corpus each speaker turn is segmented into a number of different utterances, separated at prosodic phrase boundaries (4.3). This means that the <u> elements are much shorter than those in true TEI-conformant mark-up, and there is then no TEI tag suitable for grouping the utterances into turns. For the moment, we have adopted the procedure in this case of introducing <turn> tags for a whole turn, and using <u> for each utterance or prosodic phrase, as in (4.4):

(4.3)
 
 

44.1   S: +okay+
44.2    : okay
44.3    : lemme run /
44.4    : lemme make sure I got all this
44.5    : okay
44.6    : you wanna send E2
44.7    : you wanna link
44.8    : uh
44.9    : the boxcar at Elmira to E2
44.10   : and send that to Corning
45.1   M: yeah
46.1   S: and have it load oranges
47.1   M: right
48.1   S: okay

(4.4)
 
 

<turn id="t44" who="S">
  <u id="u44.1">+okay+</u>
  <u id="u44.2">okay</u>
  <u id="u44.3">lemme run</u>
  <u id="u44.4">lemme make sure I got all this</u>
  <u id="u44.5">okay</u>
  <u id="u44.6">you wanna send E2</u>
  <u id="u44.7">you wanna link</u>
  <u id="u44.8">uh</u>
  <u id="u44.9">the boxcar at Elmira to E2</u>
  <u id="u44.10">and send that to Corning</u>
</turn>
<turn id="t45" who="M">
  <u id="u45.1">yeah</u>
</turn>
<turn id="t46" who="S">
  <u id="u46.1">and have it load oranges</u>
</turn>
<turn id="t47" who="M">
  <u id="u47.1">right</u>
</turn>
<turn id="t48" who="S">
  <u id="u48.1">okay</u>
</turn>
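The regrouping from (4.3) to (4.4) is mechanical, and can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name trains_to_turns and the regular expression are our own, the input is assumed to follow the line layout shown in (4.3), and a trailing "/" is simply dropped, as was done for utterance 44.3 in (4.4).

```python
import re
from xml.sax.saxutils import escape

# One transcript line, e.g. "44.1   S: +okay+" or a continuation "44.2    : okay".
# The speaker label is only present on the first utterance of a turn.
LINE_RE = re.compile(r"^(\d+)\.(\d+)\s+(\w*)\s*:\s*(.*)$")


def trains_to_turns(transcript: str) -> str:
    """Group numbered utterance lines into <turn> elements containing <u> elements."""
    turns = []  # each entry: (turn number, speaker, [(utterance id, text), ...])
    for raw in transcript.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        turn_no, utt_no, speaker, text = match.groups()
        text = text.rstrip(" /")  # drop a trailing "/" marker, as in u44.3 of (4.4)
        if not turns or turns[-1][0] != turn_no:
            turns.append((turn_no, speaker, []))
        turns[-1][2].append((f"u{turn_no}.{utt_no}", text))

    lines = []
    for turn_no, speaker, utterances in turns:
        lines.append(f'<turn id="t{turn_no}" who="{speaker}">')
        for utt_id, text in utterances:
            lines.append(f'  <u id="{utt_id}">{escape(text)}</u>')
        lines.append("</turn>")
    return "\n".join(lines)


sample = "44.1   S: +okay+\n44.2    : okay\n44.3    : lemme run /\n45.1   M: yeah"
print(trains_to_turns(sample))
```

The speaker of a turn is taken from its first utterance line; any richer information (overlap marks such as +okay+, for instance) is passed through as plain text rather than interpreted.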

If one wishes to impose syntactic restrictions on potential markables - which is a good idea for annotation exercises of any complexity - then this basic level must be further annotated with something which allows that constraint to be expressed - word tags, or full syntactic elements, or morpho-syntax tags as defined in the MATE Morpho-syntax scheme (Pirrelli and Soria, 1999).  Since different schemes make different choices, the exact data source requirements are left to the individual schemes.

Assignment

The only attributes of <coref:de> that have to be set are id and href, both of which are automatically computed by the MATE workbench, either by making <coref:de> elements match the output of some MATE query on morphosyntactic tagging or by computation from text selected in the coding interface by the human user.

Example

Assuming that chunks with nominal governors are chosen as markables and that the sentence

(4.5) John likes Bill

is annotated with chunks as follows:

(4.6)
 
 

ch.xml
<ch id="ch_001" type="N">
  <potgov id="p_001">John
  </potgov>
</ch>
<ch id="ch_002" type="V">
  <potgov id="p_002">likes
  </potgov>
</ch>
<ch id="ch_003" type="N">
  <potgov id="p_003">Bill
  </potgov>
</ch>

then the following discourse entities would be annotated:

(4.7)
 

coref.xml
<coref:de id="de_001" href="ch.xml#id(ch_001)"/>
<coref:de id="de_002" href="ch.xml#id(ch_003)"/>
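To make the pointer mechanism in (4.6) and (4.7) concrete, the following Python sketch resolves each <coref:de> href to the text of the chunk it points at. It is not part of the MATE workbench: the wrapper elements <chunks> and <annotation>, the namespace URI bound to the coref: prefix, and the function name resolve_des are all assumptions introduced here so that a standard XML parser accepts the fragments.

```python
import re
import xml.etree.ElementTree as ET

# Chunk file from (4.6); the <chunks> root element is an assumed wrapper.
CH_XML = """<chunks>
  <ch id="ch_001" type="N"><potgov id="p_001">John</potgov></ch>
  <ch id="ch_002" type="V"><potgov id="p_002">likes</potgov></ch>
  <ch id="ch_003" type="N"><potgov id="p_003">Bill</potgov></ch>
</chunks>"""

# Annotation file from (4.7); the root element and namespace URI are placeholders.
COREF_XML = """<annotation xmlns:coref="http://example.org/coref">
  <coref:de id="de_001" href="ch.xml#id(ch_001)"/>
  <coref:de id="de_002" href="ch.xml#id(ch_003)"/>
</annotation>"""

NS = {"coref": "http://example.org/coref"}
# Pointers have the form  file#id(identifier), as in (4.7).
HREF_RE = re.compile(r"^(?P<file>[^#]+)#id\((?P<id>[^)]+)\)$")


def resolve_des(coref_xml, base_files):
    """Map each <coref:de> id to the text of the base-level element it points to."""
    # Index every element that carries an id attribute, per base file.
    by_id = {}
    for name, text in base_files.items():
        root = ET.fromstring(text)
        by_id[name] = {el.get("id"): el for el in root.iter() if el.get("id")}

    resolved = {}
    for de in ET.fromstring(coref_xml).findall("coref:de", NS):
        match = HREF_RE.match(de.get("href", ""))
        if match is None:
            raise ValueError(f"unrecognised pointer on {de.get('id')}")
        target = by_id[match["file"]][match["id"]]
        resolved[de.get("id")] = "".join(target.itertext()).strip()
    return resolved


print(resolve_des(COREF_XML, {"ch.xml": CH_XML}))
```

Because the discourse entities only point into the base file, the same chunk file can be shared by several annotation layers without being modified.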

Important Note: Since the underlying XML representation is meant to be transparent to the annotator using the MATE tools, in the examples below we have simplified the notation considerably so as to make it easier for non-XML experts to understand the annotation; this would also make it clearer that the meta-scheme does not crucially depend on a particular type of basic level markup. First of all, we give examples in plain text, abstracting away from the chunking level, except in a few cases when this is necessary. Second, instead of representing the markup by means of href pointers as in (4.7), we will adopt a more conventional SGML-style format with tags wrapped around the parts of the text to be annotated with a <coref:de> element, so as to make it clearer to the annotator which part of the text to highlight and to mark; the representation in (4.7) will be automatically constructed by the tool and the annotator need not be aware of it. In our examples, we will generally use the following representation, rather than the format in (4.7):

(4.8)
 

<coref:de>John</coref:de>
likes
<coref:de>Bill</coref:de>

 

Coding Procedure

Left to the individual schemes.
 

Markup Table
 
 

<coref:de>
id [ASCII]
href <ch>

4.2.2 Link and Anchor Entities

Description

<coref:link> elements are used to mark anaphoric relations between discourse entities, the most basic of which is the identity relation. This relation obtains between two phrases in a text when they denote the same object in the world; the phrases used to refer to this object can be the same, like 'la surface... la surface' in (4.9), 'orange juice... orange juice' in (4.10), 'les ailerons... les ailerons' in (4.11) or different, as is seen with 'the engine E3... it... it' in (4.12), or 'ces deux fusées... elles' in (4.13). As these last two examples suggest, it is very common for a pronoun to be used to refer to a discourse entity previously referred to by a full noun phrase.

(4.9)
 

S: Créer la surface.
W: Opération effectuée
S: Modéliser la surface
W: Quel nom voulez-vous donner à la surface ?
S: Create the surface
W: Done
S: Model the surface
W: What name do you want to give to the surface ? (MF)

(4.10)
 
 

When do we have orange juice at Elmira?
We have orange juice at Elmira at 6 a.m. (T)

 

(4.11)
 
 

197 F: mmh / Donc qu'est ce que vous allez garder en fait (?) + /
198 M: |la longueur du tube et les ailerons |
199 D:| les ailerons |
200 F: Donc les ailerons vous m'avez dit.
197 F: mmm / Well, what are you going to keep, then ? /
198 M: the length of the tube and the wings |
199 D: | the wings |
200 F: well, the wings, you said (MF)

 

(4.12)
 

we're gonna take the engine E3 and shove it over to Corning, hook it up to the tanker car... (T)

 
 

(4.13)
 

193 F: Donc qu'est ce qui / qu'est ce qui serait commun à ces deux fusées. Ces deux fusées ont /
194 D: c'est qu'elles ont / elles ont la même...
193 F: What would it be that these two rockets have in common? These two rockets have /
194 D: it's that they have / they have the same... (MF)

 
(4.14)

A group of children perform an intricate dance in a small theatre in the northern Sri Lankan town of Jaffna.
The appreciative audience sit in the open air and applaud their performance.
The members of the Centre for Performing Arts in Jaffna are justly proud of their performance...(BBC)

In this section we only discuss links describing identity relations, but nothing prevents an annotator from using a wider range of relations, as is done in the DRAMA scheme; some suggestions concerning possible relations are given in Section 8.

Data Source

The <coref:link> and <coref:anchor> elements point to <coref:de> elements.

Segmentation/Selection

Not applicable (the information provided by <coref:link> elements comes entirely from their attributes).

Assignment

By convention, the href attribute of a <coref:link> element refers to the ID of the anaphoric expression, and the href attribute of each embedded <coref:anchor> element refers to the ID of an antecedent, which can be either a <coref:de> element, a <coref:ue> element, or a <coref:seg> element (see below). For the moment, we assume that the antecedent denotes the same object as the <coref:de> element, and the ident relation is used. We assume in the rest of this document that the annotation is contained within a file 'coref.xml' to which the href attributes point.

Coreference chains: It is often the case that more than two discourse entities refer to the same object; in this case, a coreference chain is formed. Because the identity relation is transitive, if A is ident with B and B is ident with C, then A is ident with C; so it doesn't matter which item in a coreference chain is chosen as antecedent for a new phrase.  This can be tracked through the markup.

Furthermore, since the identity relation is symmetric, it doesn't matter which <coref:de> element is chosen as 'current element' and which one as 'anchor'.  It is often less confusing, however, to adopt the convention that the <coref:link> element should point to the latest discourse entity, whereas the <coref:anchor> element should point to the antecedent.
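Since ident is transitive and symmetric, a coreference chain is simply a connected component of the graph whose edges are ident links. One way a tool might track chains is with a union-find structure, sketched below in Python. The function name coreference_chains and the representation of links as (anaphor, antecedent) pairs are assumptions made for illustration; the meta-scheme itself does not prescribe any particular tracking algorithm.

```python
from collections import defaultdict


def coreference_chains(links):
    """Group ids into chains, given (anaphor_id, antecedent_id) pairs of ident links."""
    parent = {}

    def find(x):
        # Return the representative of x's chain, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for anaphor, antecedent in links:
        union(antecedent, anaphor)

    chains = defaultdict(set)
    for node in list(parent):
        chains[find(node)].add(node)
    return list(chains.values())


# Two links, each pointing only to the immediately preceding mention.
links = [("de_99", "de_98"), ("de_100", "de_99")]
print(coreference_chains(links))

# Symmetry: swapping anaphor and antecedent yields the same chains.
swapped = [(antecedent, anaphor) for anaphor, antecedent in links]
assert coreference_chains(swapped) == coreference_chains(links)
```

With this representation it indeed does not matter which member of a chain the annotator picks as antecedent: any choice merges the new mention into the same set.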

Participants interpret anaphoric expressions differently: It is also possible to observe that at a certain point in a dialogue the conversational participants had differences of opinion about coreferential links.  For this reason, links can contain specifications of which agent or set of agents believes them to hold, via the optional WHO-BELIEVES attribute.  The default value for this attribute is SHARED.

Example

We use the <coref:link> and <coref:anchor> elements to mark anaphoric relations, as follows. When two noun phrases marked as <coref:de> elements co-specify, a <coref:link> element is added. The href attribute of this element points to the anaphoric expression, and the element contains at least one <coref:anchor> element specifying the antecedent (by means of a second href pointer). The type of relation that holds between the two discourse entities is specified by the type attribute of the <coref:link> element; its possible values depend on the exact scheme implemented. (As we will see below, specifying anaphoric relations by means of elements embedded into a <coref:link> element allows the annotator to mark ambiguities of co-specification.)  Here are some example annotations.

(4.15)
 
 

coref.xml
When do we have <coref:de ID="de_01">orange juice</coref:de> at Elmira?
We have <coref:de ID="de_02">orange juice</coref:de> at Elmira at 6 a.m. (T)

<coref:link type="ident" href="coref.xml#id(de_02)">
  <coref:anchor href="coref.xml#id(de_01)"/>
</coref:link>

(4.16)
 
 

coref.xml
197 F: mmh / Donc qu'est ce que vous allez garder en fait (?) + /
198 M: |la longueur du tube et <coref:de ID="de_98">les ailerons</coref:de>
199 D:<coref:de ID="de_99">les ailerons</coref:de>
200 F: Donc <coref:de ID="de_100">les ailerons</coref:de> vous m'avez dit.

<coref:link href="coref.xml#id(de_99)" type="ident">
  <coref:anchor href="coref.xml#id(de_98)" />
</coref:link>
<coref:link href="coref.xml#id(de_100)" type="ident" >
  <coref:anchor href="coref.xml#id(de_99)"/>
</coref:link>

(4.17)
 
 
 
 

we're gonna take <coref:de ID="de_07">the engine E3</coref:de> and shove
<coref:de ID="de_08">it</coref:de> over to Corning, hook
<coref:de ID="de_09">it</coref:de> up to the tanker car...

<coref:link href="coref.xml#id(de_08)" type="ident">
  <coref:anchor href="coref.xml#id(de_07)"/>
</coref:link>
<coref:link href="coref.xml#id(de_09)" type="ident">
  <coref:anchor href="coref.xml#id(de_08)"/>
</coref:link>

Ambiguity: More than one <coref:anchor> element may be embedded in a <coref:link> element in order to annotate ambiguity. If more than one entity appears to be an equally likely antecedent for an anaphoric expression, each of the possibilities can be marked by means of a separate <coref:anchor> element. In the following example, the pronoun it in 15.16 could refer equally well to engine E3 or to the tanker car. If the annotator wishes to annotate both antecedents, as in DRAMA or in the Lancaster scheme, this can be done as shown below.
 
 

coref.xml
15.12 : we're gonna take <coref:de ID="de_15">the engine E3</coref:de>
15.13 : and shove <coref:de ID="de_16">it</coref:de> over to Corning
15.14 : hook <coref:de ID="de_17">it</coref:de> up to
        <coref:de ID="de_18">the tanker car</coref:de>
15.15 : _and_
15.16 : and send <coref:de ID="de_19">it</coref:de> back to Elmira
 

<coref:link href="coref.xml#id(de_16)" type="ident">
  <coref:anchor href="coref.xml#id(de_15)"/>
</coref:link>
<coref:link href="coref.xml#id(de_17)" type="ident">
  <coref:anchor href="coref.xml#id(de_16)"/>
</coref:link>
<coref:link href="coref.xml#id(de_19)" type="ident">
  <coref:anchor href="coref.xml#id(de_17)"/>
  <coref:anchor href="coref.xml#id(de_18)"/>
</coref:link>
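Because ambiguity is encoded purely by the number of <coref:anchor> children, ambiguous anaphors can be recovered from the markup mechanically. The Python sketch below does this for the three links above. The <links> wrapper element, the namespace URI and the helper names target_id and ambiguous_links are assumptions introduced for the illustration.

```python
import xml.etree.ElementTree as ET

NS = {"coref": "http://example.org/coref"}  # placeholder namespace URI

# The links of the ambiguity example, wrapped in an assumed <links> root element.
LINKS_XML = """<links xmlns:coref="http://example.org/coref">
  <coref:link href="coref.xml#id(de_16)" type="ident">
    <coref:anchor href="coref.xml#id(de_15)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_17)" type="ident">
    <coref:anchor href="coref.xml#id(de_16)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_19)" type="ident">
    <coref:anchor href="coref.xml#id(de_17)"/>
    <coref:anchor href="coref.xml#id(de_18)"/>
  </coref:link>
</links>"""


def target_id(href):
    """Extract the identifier from a pointer of the form file#id(identifier)."""
    return href[href.index("#id(") + 4:-1]


def ambiguous_links(xml_text):
    """Return {anaphor id: [candidate antecedent ids]} for links with several anchors."""
    result = {}
    for link in ET.fromstring(xml_text).findall("coref:link", NS):
        anchors = [target_id(a.get("href"))
                   for a in link.findall("coref:anchor", NS)]
        if len(anchors) > 1:
            result[target_id(link.get("href"))] = anchors
    return result


print(ambiguous_links(LINKS_XML))
```

A scheme that does not allow ambiguity could use the same check as a validation step, rejecting any link for which more than one anchor is found.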


 

Coding Procedure

Left to the individual schemes.

Markup Table
 
 

<coref:link>
id [ASCII]
who-believes [ASCII]
type ident, member, subset, poss, e-rel, argptv, prop, bound, f-v, inst, genrel
subtype attr, part, sposs, cause
href <coref:de>
content <coref:anchor>

 
 
<coref:anchor>
id [ASCII]
href <coref:de>

 

4.2.3 Universe and UE Entities

In face-to-face or human-machine dialogue, participants may make reference to items visible to them at the time of speaking. A simple example of this is Pass the salt, please, where salt may not have been previously mentioned in the conversation, and thus does not corefer with any other <coref:de>, but does refer to an entity which is in the visible situation. Tracking these references is important for multimodal systems (Bruneseaux and Romary, 1997), and they have been annotated reliably in the MapTask. This tracking requires two new elements: a <coref:universe> element (as in the Bruneseaux and Romary scheme) used to specify a 'universe of discourse', that is, a set of objects, each specified by a <coref:ue> element.

The <coref:universe> element may also be used to specify references to items in the non-visible 'universe' of shared knowledge which allows hearers to correctly assign reference to items such as the Eiffel Tower - the so-called 'larger-situation' (Hawkins, 1978) or 'hearer-old' (Prince, 1981) references. However, annotators should keep in mind that it is often difficult to make such categorizations reliably, as found by Fraurud (1990) and Poesio and Vieira (1998).

Description

In order to mark up reference to items in the visual situation, these items are listed as universe entities (<coref:ue>), embedded within a <coref:universe> element. Each <coref:ue> element has an ID, as <coref:de> elements do, so that a relation of identity between a noun phrase and an object in the visual situation can be encoded by an ident link between a <coref:de> and a <coref:ue>, just like identity between two <coref:de> elements.

Where feasible, it is suggested that all objects in the visual situation be included in a single <coref:universe> element. In cases like the MapTask dialogues, where the participants in the conversation have two different maps, it is suggested that three universes be created: one with ID common containing all objects shared between the visual situations, and then one universe for each conversational participant containing the elements known only to that participant, with the attribute modifies="common". This will ensure that the shared elements receive a unique ID.

In some types of dialogues the visual situation may change: new objects may be created and old objects destroyed (e.g., when the visual situation is the screen). These situations may be modeled by allowing for the creation of new universes in the middle of dialogues, although this is not yet supported.

Data Source

There are no additional requirements on source data for the use of universes, unless a scheme implements a restriction on what coreferences are to be annotated based on the types of objects referred to; in this case, the annotator needs a description of the objects to check against.  For instance, if the annotator were to mark up only references to Map Task landmarks, then the annotator would need a list of landmarks or copies of the maps.  This information may not be enshrined in the data files themselves but in the coding module for the scheme instantiation.

Segmentation

Not applicable.

Assignment

The modifies attribute for all but the common universe should be set to common.
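As an illustration of how a tool might use the modifies attribute, the Python sketch below collects the universe entities available to one participant by following modifies pointers from that participant's universe to the common one. Reading modifies as "also includes the entities of the named universe" is our assumption, as are the <universes> wrapper, the namespace URI and the function name visible_entities. The data are the universes of example (4.19); the follower's own entities are elided in that example and are therefore left empty here.

```python
import xml.etree.ElementTree as ET

NS = {"coref": "http://example.org/coref"}  # placeholder namespace URI

UNIVERSES_XML = """<universes xmlns:coref="http://example.org/coref">
  <coref:universe ID="common">
    <coref:ue ID="ue2">gold mine</coref:ue>
  </coref:universe>
  <coref:universe ID="GIVER_universe" modifies="common">
    <coref:ue ID="ue1">diamond mine</coref:ue>
  </coref:universe>
  <coref:universe ID="FOLLOWER_universe" modifies="common"/>
</universes>"""


def visible_entities(xml_text, universe_id):
    """Return {ue id: description} for a universe plus every universe it modifies."""
    root = ET.fromstring(xml_text)
    universes = {u.get("ID"): u for u in root.findall("coref:universe", NS)}

    entities = {}
    seen = set()
    current = universe_id
    # Follow the modifies chain; the 'seen' set guards against accidental cycles.
    while current is not None and current not in seen:
        seen.add(current)
        universe = universes[current]
        for ue in universe.findall("coref:ue", NS):
            entities[ue.get("ID")] = (ue.text or "").strip()
        current = universe.get("modifies")
    return entities


print(visible_entities(UNIVERSES_XML, "GIVER_universe"))
print(visible_entities(UNIVERSES_XML, "FOLLOWER_universe"))
```

Keeping shared objects only in the common universe is what guarantees that each of them has a single ID, whichever participant's universe a link is resolved against.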

Example

The following is a simple example of the use of a universe.

(4.18)
 
 

<coref:universe ID="u1">
  <coref:ue ID="ue1">Diamond mine</coref:ue>
  <coref:ue ID="ue2">Graveyard</coref:ue>
  <coref:ue ID="ue3">Fast running creek</coref:ue>
  <coref:ue ID="ue4">Fast flowing river</coref:ue>
  <coref:ue ID="ue5">Canoes</coref:ue>
</coref:universe>

FOLLOWER: Uh-huh. Curve round. To your right.
GIVER:    Uh-huh.
FOLLOWER: Right.... Right underneath <coref:de ID="de_50">the diamond mine.</coref:de>
          Where do I stop.
GIVER:    Well....... Do. Have you got <coref:de ID="de_51">a graveyard?</coref:de>
          Sort of in the middle of the page? ... On on a level to
          <coref:de ID="de_52">the c-- ... er diamond mine.</coref:de>
FOLLOWER: No. I've got <coref:de ID="de_53">a fast running creek.</coref:de>
GIVER:    <coref:de ID="de_54">A fast flowing river</coref:de>,... eh.
FOLLOWER: No. Where's <coref:de ID="de_55"> that </coref:de>. Mmhmm,... eh.
          <coref:de ID="de_56">Canoes</coref:de>

<coref:link href="coref.xml#id(de_50)" type="ident">
  <coref:anchor href="coref.xml#id(ue1)"/>
</coref:link>
<coref:link href="coref.xml#id(de_51)" type="ident">
  <coref:anchor href="coref.xml#id(ue2)"/>
</coref:link>
<coref:link href="coref.xml#id(de_52)" type="ident">
  <coref:anchor href="coref.xml#id(ue1)"/>
</coref:link>
<coref:link href="coref.xml#id(de_53)" type="ident">
  <coref:anchor href="coref.xml#id(ue3)"/>
</coref:link>
<coref:link href="coref.xml#id(de_54)" type="ident">
  <coref:anchor href="coref.xml#id(ue4)"/>
</coref:link>
<coref:link href="coref.xml#id(de_55)" type="ident">
  <coref:anchor href="coref.xml#id(de_54)"/>
</coref:link>
<coref:link href="coref.xml#id(de_56)" type="ident">
  <coref:anchor href="coref.xml#id(ue5)"/>
</coref:link>


 

Note that <coref:de ID="de_55">, that, could be marked up as ident with either the universe entity ue4, or with the discourse entity de_54. One of the advantages of this way of annotating references to the visual situation is that an extended coreference chain tracking mechanism should be able to include in a coreference chain both references to universe elements and references to discourse entities; the annotator may then choose how he/she wishes to annotate this. If the annotation tool can't do this type of coreference chain tracking, then the coding manual should include a disambiguation rule: for the type of multimodal applications on which Bruneseaux and Romary worked it seems preferable to mark links with universe entities rather than marking links with previous discourse entities.
The following is a more complex example which includes multiple universes encoding different world knowledge and a disagreement about a coreferential link in the dialogue.

(4.19)
 
 
 

<coref:universe ID="common">
  <coref:ue ID="ue2">gold mine</coref:ue>
</coref:universe>
<coref:universe ID="GIVER_universe" modifies="common">
  <coref:ue ID="ue1">diamond mine</coref:ue>
</coref:universe>
<coref:universe ID="FOLLOWER_universe" modifies="common">
.....
</coref:universe>

GIVER:    Do_you have <coref:de ID="de_20">diamond_mine.</coref:de>
FOLLOWER: Yes I've got <coref:de ID="de_21">a gold_mine.</coref:de>
GIVER:    Ah. S--.
FOLLOWER: ....
GIVER:    You don't have <coref:de ID="de_22">diamond_mine</coref:de> though.
FOLLOWER: No. It's <coref:de ID="de_23"> a gold_mine </coref:de> according to this one.
          Presumably <coref:de ID="de_24">that's</coref:de> the same.
GIVER:    Well I've got <coref:de ID="de_25">a gold_mine</coref:de> as well you see.

<coref:link href="coref.xml#id(de_20)" who-believes="G" type="ident">
  <coref:anchor href="coref.xml#id(ue1)"/>
</coref:link>
<coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
  <coref:anchor href="coref.xml#id(ue2)"/>
</coref:link>
<coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
  <coref:anchor href="coref.xml#id(de_20)"/>
</coref:link>
<coref:link href="coref.xml#id(de_22)" who-believes="G" type="ident">
  <coref:anchor href="coref.xml#id(ue1)"/>
</coref:link>
<coref:link href="coref.xml#id(de_22)" type="ident">
  <coref:anchor href="coref.xml#id(de_20)"/>
</coref:link>
<coref:link href="coref.xml#id(de_23)" who-believes="F" type="ident">
  <coref:anchor href="coref.xml#id(ue2)"/>
</coref:link>
<coref:link href="coref.xml#id(de_23)" who-believes="F" type="ident">
  <coref:anchor href="coref.xml#id(de_22)"/>
</coref:link>
<coref:link href="coref.xml#id(de_24)" who-believes="F" type="ident">
  <coref:anchor href="coref.xml#id(de_22)"/>
</coref:link>
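To see how the who-believes attribute separates the participants' views, the links can be grouped by believer, with links lacking the attribute falling under the default value SHARED. The Python sketch below does this for the first four links of (4.19). The <links> wrapper, the namespace URI and the names target_id and links_by_believer are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"coref": "http://example.org/coref"}  # placeholder namespace URI

# First four links of (4.19), wrapped in an assumed <links> root element.
LINKS_XML = """<links xmlns:coref="http://example.org/coref">
  <coref:link href="coref.xml#id(de_20)" who-believes="G" type="ident">
    <coref:anchor href="coref.xml#id(ue1)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
    <coref:anchor href="coref.xml#id(ue2)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_21)" who-believes="F" type="ident">
    <coref:anchor href="coref.xml#id(de_20)"/>
  </coref:link>
  <coref:link href="coref.xml#id(de_22)" type="ident">
    <coref:anchor href="coref.xml#id(de_20)"/>
  </coref:link>
</links>"""


def target_id(href):
    """Extract the identifier from a pointer of the form file#id(identifier)."""
    return href[href.index("#id(") + 4:-1]


def links_by_believer(xml_text):
    """Return {believer: [(anaphor id, antecedent id), ...]}; the default is SHARED."""
    grouped = defaultdict(list)
    for link in ET.fromstring(xml_text).findall("coref:link", NS):
        believer = link.get("who-believes", "SHARED")
        anaphor = target_id(link.get("href"))
        for anchor in link.findall("coref:anchor", NS):
            grouped[believer].append((anaphor, target_id(anchor.get("href"))))
    return dict(grouped)


for believer, pairs in links_by_believer(LINKS_XML).items():
    print(believer, pairs)
```

Grouping in this way makes the disagreement visible: the giver links de_20 to the diamond mine (ue1), whereas the follower links de_21 both to the gold mine (ue2) and to de_20.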

Coding Procedure

The annotation should begin with the creation of a <coref:universe> element (or a common universe plus one for each participant, if their knowledge is not the same). This is commonly done before the annotation of discourse entities if the universe is static.
 

Markup Table
 
 

<coref:universe>
id [ASCII]
modifies <ch>
content <coref:ue>

 
<coref:ue>
id [ASCII]
content TEXT: description of object

4.2.4 Seg Elements
 

Description

Even if we only consider anaphoric relations involving nominal elements, there are at least two situations in which an annotator may wish to mark an anaphoric relation that also involves other types of constituents. The first is the case in which the anaphoric element is either unexpressed or incorporated in the verb. The second is the case of so-called discourse deixis (Webber, 1991), in which the antecedent of a nominal expression is an abstract object such as an event or proposition introduced in the discourse somewhat indirectly by sentences. (DRAMA allows for such relations to be marked.)
 
 

The solution we propose is to use a <coref:seg> element which, like the TEI <seg> element, can be used to mark up arbitrary pieces of text. <coref:seg> elements are given an id which can then be pointed at by a <coref:link> element, just as for other anaphoric relations.
 

The <coref:seg> element could also be used to annotate anaphoric relations between non-nominal elements, such as in VP ellipsis.
 
 

Data Source

Data source requirements for <coref:seg> elements are the same as for <coref:de> elements.
 
 

Segmentation

To be specified by the coding manual for a given scheme.
 
 

Assignment

The id attribute is automatically set by the workbench.
 
 

Example

Using <coref:seg> to mark up empty and incorporated constituents: As seen above, in Italian, Spanish and many other languages, certain nominal constituents may not be realized; this is especially common for nominals in subject position, but can also happen in object position, especially in instructions, as in:
 
 

Add the dry yeast to the water and let _ sit for a few minutes. Add the rest of the water and sugar. Stir _






These nominals are present in annotations produced by hand (e.g., in the Penn Treebank), but the parsers used for parsing spoken dialogues tend not to produce representations containing empty constituents in this case. If these nominals are not represented in the base level, the verb can be marked with a <coref:seg> element, and the anaphoric relation coded as usual by means of <coref:link> elements, as follows:
 
 

(4.20)
 
 

coref.xml
A: Dov'e` <coref:de ID="de_157">Gianni?</coref:de>
   [Where is Gianni?]
B: <coref:seg type="pred" ID="seg_158">
     e` andato a mangiare
   </coref:seg>
   [_ went to have lunch]

<coref:link href="coref.xml#id(seg_158)" type="ident">
  <coref:anchor href="coref.xml#id(de_157)"/>
</coref:link>


 

This representation can only be used without loss of information when there is at most one empty element; this is true for Italian, but not for Japanese or Portuguese. If more precision is needed, the annotator could define more specific identity relations that also specify which empty argument of the verb enters into the anaphoric relation: such relations could be called, e.g., subj-ident, obj-ident, etc. These relations could then be used instead of ident as the value of the type attribute of the <coref:link> element; we won't make them part of the annotation scheme discussed here, however.
 

A second case in which an argument is not realized by means of a nominal is that of incorporated clitics, such as dáselo in (4.21) below. Clitic suffixes are also found in transcriptions of spoken English:
 
 

44.4 : lemme make sure I got all this
44.5 : okay (T)

In the case of incorporated clitics, as well, the verb can be marked with a <coref:seg> element when the parser doesn't produce a morphologically decomposed representation, and then the anaphoric relations in which the clitics are involved can be encoded either by means of a single ident relation or by means of more fine-grained relations such as subj-ident or obj-ident.
 
 

(4.21)
 
 

coref.xml
A: Mira, te doy <coref:de ID="de_167">este libro</coref:de>
   ¿Conoces a <coref:de ID="de_168">mi suegra?</coref:de>
B: Sí, claro.
A: Pues <coref:seg ID="seg_169">dáselo</coref:seg> cuando
   <coref:de ID="de_170">la</coref:de> veas.

<coref:link href="coref.xml#id(seg_169)" type="obj-ident">
  <coref:anchor href="coref.xml#id(de_167)"/>
</coref:link>
<coref:link href="coref.xml#id(seg_169)" type="iobj-ident">
  <coref:anchor href="coref.xml#id(de_168)"/>
</coref:link>

Provided that the <coref:seg> elements are identified during the first pass of markable identification, encoding this information should not be any harder than in the case of MUCCS. The real question for this type of annotation is which empty elements to annotate: e.g., in addition to 'small pro' elements such as those discussed above, the annotator may also decide to annotate 'big PRO' elements that according to some syntactic theories occupy the subject position of infinitival clauses.
 

Using SEG to mark the antecedents of discourse deixis: Abstract objects such as events, actions and propositions can all serve as antecedents of anaphoric expressions. We are not aware of any reliability results for this type of annotation, but the <coref:seg> element can be used to identify the antecedents in this type of anaphora. If desired, the annotator could use a second attribute type to specify the type of object introduced by the <coref:seg> element; type would have values event, prop and action.
 

(4.22)
 

<coref:seg type="event" ID="seg_130">
The 23-year-old had hit his head against another player
</coref:seg> during a game of Aussie-rules football.
McGlinn remembered nothing of
<coref:de ID="de_131">
the collision
</coref:de>, but developed a headache and had several seizures.

<coref:link href="coref.xml#id(de_131)" type="ident">
  <coref:anchor href="coref.xml#id(seg_130)"/>
</coref:link>


 

(4.23)
 
 
 

a. Despite the latest negative results, doctors are still
   convinced that Tamoxifen can prevent breast cancer.
   This is because of the way it blocks the action of oestrogen,
   the female sex hormone that can make the breast cells of some
   women go out of control.

b. Despite the latest negative results,
   <coref:seg type="prop" ID="seg_129">
   doctors are still convinced that
   <coref:de ID="de_131">Tamoxifen</coref:de>
   can prevent breast cancer
   </coref:seg>.
   <coref:de ID="de_130">This</coref:de>
   is because of the way
   <coref:de ID="de_132">it</coref:de>
   blocks the action of oestrogen, the female sex hormone that
   can make the breast cells of some women go out of control.

<coref:link href="coref.xml#id(de_130)" type="ident">
  <coref:anchor href="coref.xml#id(seg_129)"/>
</coref:link>


 

(4.24)
 
 

a.
GIVER:     You're sort_of going past stone creek ...
           but your line's curving up past the ... flat rocks.
FOLLOWER:  Right. Okay.
GIVER:     and then starting to come down again.
FOLLOWER:  Got that
 

b.
GIVER:     You're sort_of going past stone creek ...
           but your line's curving up past the ... flat rocks.
FOLLOWER:  Right. Okay.
GIVER:     <coref:seg ID="seg_135" type="action">
           And then starting to come down again.
           </coref:seg>
FOLLOWER:  Got <coref:de ID="de_136">that</coref:de>.

<coref:link href="coref.xml#id(de_136)" type="ident">
  <coref:anchor href="coref.xml#id(seg_135)"/>
</coref:link>

These examples also illustrate some of the problems to be addressed when designing a reliable annotation scheme for discourse deixis: these include deciding what part of the text counts as the antecedent, as well as deciding which type of object the antecedent is (see, e.g., (4.24)).
 
 

Coding Procedure

Left to the individual schemes.
 
 

Markup Table
 
 

<coref:seg>
id [ASCII]
type [ASCII]

 

4.3 Integrated Example

See (4.18) and (4.19).
 
 

4.4 Joint Coding Procedure

Left to the individual schemes.



5 A MUC 7-LIKE SCHEME