4 Chunk-level Annotation Coding Module

4.1 Coding Purpose

This section is concerned with syntactic annotation at the chunking level. Chunks are textual units of adjacent word tokens which can be linked mutually through unambiguously identified dependency chains with no recourse to idiosyncratic lexical information.

In marking chunks, we are mainly interested in their category and start and end points. It should be noted that chunks do not necessarily cover the entire sentence, as there may be material that does not belong to any chunk. For example, prepositions, coordinators, subordinators, and adverbs are, in some cases and according to some instantiations of chunking, not part of any chunk.
 

4.2 Selected schemes

There are several approaches to chunking, which comply with somewhat varying notions of what a chunk is (cf. Abney, 1996; Federici, Montemagni and Pirrelli, 1996, 1998). The notation presented here is based on SPARKLE specifications (Carroll et al., 1997), which were developed so as to represent an edited intersection of different existing schemes.
 

4.3 Markup Declaration

<ch>
  <potgov>
  <aux>
  <cop>
  <intro>
  <modal>
  <caus>

4.3.1 <ch>: Chunks

4.3.1.1 Description

Chunks are defined strictly syntactically. Following Abney (1996:1), a chunk is "the non-recursive core of an intra-clausal constituent, extending from the beginning of a constituent to its head (or potential governor, see below), but not including post-head dependents".

Each chunk includes a sequence of adjacent word tokens which are mutually related through dependency links. For a more detailed discussion, cf. MATE Deliverable D.1.1. Examples and criteria for chunking are given in the following sections.

4.3.1.2 Data Source

Tagging at the chunk level presupposes the markup of morphological words.

4.3.1.3 Segmentation/Selection

Given the sentence "The hungry man could always eat the meals offered by the pious woman", the chunking will be as follows:

[The hungry man] [could always eat] [the meals] [offered] [by the pious woman].

The sentence is segmented into five chunks. As noted before, each chunk includes a sequence of adjacent word tokens (a text substring) which are mutually related through dependency links. The fact that two substrings are assigned to different chunks does not necessarily entail that there is no dependency relationship linking the two. It simply means that, on the basis of the available lexical knowledge, it is impossible to state unambiguously which neighbouring chunk a given chunk relates to and what the nature of this relationship is.
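The bracket notation used above can be read mechanically. The following is only an illustrative sketch: the function name `parse_bracketed` is hypothetical and not part of the scheme, and punctuation outside brackets is simply discarded. It returns the chunks as token lists, together with any word tokens that fall outside every chunk.

```python
import re

def parse_bracketed(text):
    """Split a bracketed sentence into (chunks, unchunked) token lists."""
    chunks, unchunked = [], []
    # Each match is either a bracketed group or a run of text outside brackets.
    for inside, outside in re.findall(r"\[([^\]]*)\]|([^\[\]]+)", text):
        if inside:
            chunks.append(inside.split())
        else:
            # Keep word-like tokens only; stray punctuation is dropped.
            unchunked.extend(t for t in outside.split() if re.search(r"\w", t))
    return chunks, unchunked

sentence = ("[The hungry man] [could always eat] [the meals] "
            "[offered] [by the pious woman].")
chunks, unchunked = parse_bracketed(sentence)
print(len(chunks))   # 5
print(chunks[1])     # ['could', 'always', 'eat']
print(unchunked)     # []
```

For a string such as "[He] [left] and [she] [stayed]", the coordinator "and" would be returned in the second list, reflecting the fact that chunks need not cover the whole sentence.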

4.3.1.4 Assignment

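Four attributes are needed for the description of chunks: id, href, type and broken (see the markup table in 4.3.1.6). The type and broken attributes are described below.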

Type

Each identified chunk must be labelled as to its category. Chunk categories, to be expressed as values of the type attribute, are given in the table below:
 

NAME TYPE
ADJ adjectival chunk
ADV adverbial chunk
FV finite verb chunk
G gerundival chunk
I infinitival chunk
N nominal chunk
P prepositional chunk
PART participial chunk
U unassigned
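As a convenience for tools that validate annotations, the table above could be held as a simple lookup. This is a sketch only: the names `CHUNK_TYPES` and `describe` are hypothetical, and the mapping covers just the nine values in this table (the markup table in 4.3.1.6 lists further values that are not described here).

```python
# Chunk categories exactly as given in the table above.
CHUNK_TYPES = {
    "ADJ": "adjectival chunk",
    "ADV": "adverbial chunk",
    "FV": "finite verb chunk",
    "G": "gerundival chunk",
    "I": "infinitival chunk",
    "N": "nominal chunk",
    "P": "prepositional chunk",
    "PART": "participial chunk",
    "U": "unassigned",
}

def describe(chunk_type):
    """Return the description of a chunk type, or raise ValueError."""
    try:
        return CHUNK_TYPES[chunk_type]
    except KeyError:
        raise ValueError(f"unknown chunk type: {chunk_type!r}") from None

print(describe("FV"))  # finite verb chunk
```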

In the following we provide basic guidelines for the identification of chunk categories, according to the SPARKLE specifications. Examples are given for Italian and English.

ADJ

Adjectival chunks are chunks beginning with any premodifying adverbs and intensifiers and ending with a head adjective. Adjectival phrases occurring in pre-nominal position are not marked as distinct chunks since their relationship to the governing noun is unambiguously identified within the nominal chunk.

ADJ chunks thus include:

ADV

Adverbial chunks are chunks extending from any adverbial pre-modifier to the head adverb. Adverbial phrases occurring between an auxiliary verb and a past participle are not isolated as distinct chunks due to their unambiguous dependency on the verb. By the same token, adverbs which immediately premodify verbs or adjectives are part of the verbal or adjectival chunk, respectively. Noun phrases used adverbially (e.g. last week, this morning) are treated as nominal chunks. E.g.:

FV

Finite verbal chunks include modals, ordinary and causative auxiliaries as well as medial adverbs and clitic pronouns, and they end at the head verb.

G

The "G" value marks gerundival chunks. When part of a tensed group (e.g. in the progressive construction), the gerundival verb form is not marked independently (but rather as part of an FV chunk). G chunks include gerunds functioning as noun phrases but not those functioning as adjectives.

I

Infinitival chunks include both bare infinitives and infinitives introduced by a preposition.

N

Noun chunks are chunks extending from the beginning of the noun phrase to the head noun. They include nominal chunks headed by nouns, pronouns, verbs in their infinitival form when preceded by an article (i.e. Italian nominalised infinitival constructions) and proper names. All kinds of modifiers and/or specifiers occurring between the beginning of the noun phrase and the head noun are included in N chunks. E.g.:

P

Prepositional chunks are chunks which extend from the preposition to the head of the embedded noun phrase. Typical instances of P chunks are:

PART

A past participle chunk includes participial constructions such as:

U

The U value indicates that a given chunk cannot be assigned any other value, generally because it is incomplete due to interruption.

For language-specific issues, the interested reader is referred to Abney (1996), Carroll et al. (1997), and Federici, Montemagni and Pirrelli (1996).
 

Broken

The attribute broken provides a way of representing partial and discontinuous chunks, which are often encountered as a result of interruptions, retracings, and so on.
 
 

4.3.1.5 Examples
 
 

(1) Hello, can I help you?

 
mword.xml
...
<mw id="mw_001">hello</mw>
<mw id="mw_002">can</mw>
<mw id="mw_003">I</mw>
<mw id="mw_004">help</mw>
<mw id="mw_005">you</mw>
...
chunk.xml
...
<ch id="ch_001" type="ADV" href="mword.xml#id(mw_001)"/>
<ch id="ch_002" type="FV"  href="mword.xml#id(mw_002)"/>
<ch id="ch_003" type="N"   href="mword.xml#id(mw_003)"/>
<ch id="ch_004" type="FV"  href="mword.xml#id(mw_004)"/>
<ch id="ch_005" type="N"   href="mword.xml#id(mw_005)"/>
...
 
 
(2) La preghiamo di rispondere alle domande del sistema 
(‘We ask you to reply to the system’s questions’)
mword.xml
...
<mw id="mw_008">la</mw>
<mw id="mw_009">preghiamo</mw>
<mw id="mw_010">di</mw>
<mw id="mw_011">rispondere</mw>
<mw id="mw_012">alle</mw>
<mw id="mw_013">domande</mw>
<mw id="mw_014">del</mw>
<mw id="mw_015">sistema</mw>
...
chunk.xml
...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)"/>
<ch id="ch_008" type="I"  href="mword.xml#id(mw_010)..id(mw_011)"/>
<ch id="ch_009" type="P"  href="mword.xml#id(mw_012)..id(mw_013)"/>
<ch id="ch_010" type="P"  href="mword.xml#id(mw_014)..id(mw_015)"/>
...
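The href values in the examples above point either at a single word, `id(mw_001)`, or at a span of words, `id(mw_008)..id(mw_009)`. A minimal sketch of how such references might be resolved against the word-level file follows. The helper names `parse_href` and `resolve` are hypothetical, and the sketch assumes that the words are available as an ordered list of (id, form) pairs.

```python
import re

# Matches "file#id(X)" and "file#id(X)..id(Y)" as used in the examples.
HREF_RE = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$"
)

def parse_href(href):
    """Return (file, start_id, end_id); a single-word href has start == end."""
    m = HREF_RE.match(href)
    if m is None:
        raise ValueError(f"unrecognised href: {href!r}")
    start = m.group("start")
    return m.group("file"), start, m.group("end") or start

def resolve(href, words):
    """Return the word forms covered by href; words is an ordered (id, form) list."""
    _, start, end = parse_href(href)
    ids = [wid for wid, _ in words]
    i, j = ids.index(start), ids.index(end)
    return [form for _, form in words[i:j + 1]]

words = [("mw_008", "la"), ("mw_009", "preghiamo"), ("mw_010", "di"),
         ("mw_011", "rispondere"), ("mw_012", "alle"), ("mw_013", "domande"),
         ("mw_014", "del"), ("mw_015", "sistema")]

print(resolve("mword.xml#id(mw_012)..id(mw_013)", words))  # ['alle', 'domande']
```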

The interrupted chunk you kn- can be annotated as follows:
 
 

mword.xml
...
<mw id="mw_001" type="PD">you</mw>
<mw id="mw_002" type="U">kn-</mw>
...
chunk.xml
...
<ch id="ch_001" type="N"            href="mword.xml#id(mw_001)"/>
<ch id="ch_002" type="U" broken="Y" href="mword.xml#id(mw_002)"/>
...
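A tool consuming this markup might want to list the chunks flagged as broken. The sketch below is illustrative only: it wraps the two elements in a hypothetical `<chunks>` root so that the fragment parses as a document, and it assumes that a missing broken attribute is equivalent to "N", which the scheme does not state explicitly.

```python
import xml.etree.ElementTree as ET

# The <chunks> wrapper is an assumption made so the fragment is well-formed.
fragment = """<chunks>
  <ch id="ch_001" type="N" href="mword.xml#id(mw_001)"/>
  <ch id="ch_002" type="U" broken="Y" href="mword.xml#id(mw_002)"/>
</chunks>"""

root = ET.fromstring(fragment)

# Assumption: an absent broken attribute is treated as "N".
broken_ids = [ch.get("id") for ch in root.iter("ch")
              if ch.get("broken", "N") == "Y"]
print(broken_ids)  # ['ch_002']
```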

4.3.1.6 Markup Table
 

<ch>
id [ASCII]
href <mw>
type ADJ, PA, ADV, SUBORD, N, P, FV, G, I, PART, Di, ADJ_PART, COORD, U
broken Y, N

4.3.2 <potgov>: Potential Governor

4.3.2.1 Description

A potential governor is the lexical head of the chunk, that is the lexical element (as opposed to a grammatical element), within a chunk, which neighbouring chunks can syntactically combine with in a dependency relation. Note that the specific nature of this relation plays no role in the definition of "potential governor". In fact, a potential governor can be either the head of a dependency or a dependent itself. Thus, it only represents the lexical hook on which other chunks can lean syntactically.

Grammatical words, such as auxiliaries and prepositions, are represented here as chunk-internal elements other than potential governors. This choice is geared towards combining grammatical and lexical information in the most informative and manageable way, and towards paving the way to functional annotation.

4.3.2.2 Data Source

For the tagging of potential governors, the markup of morphological words (<mw>) is necessary.

4.3.2.3 Segmentation/Selection

For every chunk type, the potential governor is always the rightmost element of the chunk. The table below lists, for each chunk type, the word classes to which its head may belong.
 

CHUNK TYPE POSSIBLE HEADS
ADJ adj
ADV adv
N noun, pron, verb
P noun, pron, verb
FV verb
G verb
I verb
PART verb

4.3.2.4 Assignment

Two attributes are needed for the description of potential governors: id and href (see the markup table in 4.3.2.6).

4.3.2.5 Examples
 
 
(1) a boxcar of oranges

 
mword.xml
...
<mw id="mw_020">a</mw>
<mw id="mw_021">boxcar</mw>
<mw id="mw_022">of</mw>
<mw id="mw_023">oranges</mw>
...
chunk.xml
...
<ch id="ch_014" type="N" href="mword.xml#id(mw_020)..id(mw_021)">
  <potgov id="p_014" href="mword.xml#id(mw_021)"/>
</ch>
<ch id="ch_015" type="P" href="mword.xml#id(mw_022)..id(mw_023)">
  <potgov id="p_015" href="mword.xml#id(mw_023)"/>
</ch>
...
 
 
(2) La preghiamo di rispondere alle domande del sistema
(‘We ask you to reply to the system’s questions’)

 
mword.xml
...
<mw id="mw_008">la</mw>
<mw id="mw_009">preghiamo</mw>
<mw id="mw_010">di</mw>
<mw id="mw_011">rispondere</mw>
<mw id="mw_012">alle</mw>
<mw id="mw_013">domande</mw>
<mw id="mw_014">del</mw>
<mw id="mw_015">sistema</mw>
...
chunk.xml
...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)">
  <potgov id="p_007"      href="mword.xml#id(mw_009)"/>
</ch>
<ch id="ch_008" type="I"  href="mword.xml#id(mw_010)..id(mw_011)">
  <potgov id="p_008"      href="mword.xml#id(mw_011)"/>
</ch>
<ch id="ch_009" type="P"  href="mword.xml#id(mw_012)..id(mw_013)">
  <potgov id="p_009"      href="mword.xml#id(mw_013)"/>
</ch>
<ch id="ch_010" type="P"  href="mword.xml#id(mw_014)..id(mw_015)">
  <potgov id="p_010"      href="mword.xml#id(mw_015)"/>
</ch>
...

4.3.2.6 Markup Table
 

<potgov>
id [ASCII]
href <mw>

 

4.3.3 <aux>: Auxiliary Verb

4.3.3.1 Description

The auxiliary verb in a verbal chunk, e.g. "have" in I have said, or "was" in was always caught.

4.3.3.2 Data Source

For the tagging of auxiliaries, the markup of morphological words is necessary.

4.3.3.3 Assignment

Two attributes are needed for the description of auxiliaries: id and href (see the markup table in 4.3.3.5).


4.3.3.4 Examples
 
 

(1) I have said

 
mword.xml
...
<mw id="mw_001">I</mw>
<mw id="mw_002">have</mw>
<mw id="mw_003">said</mw>
...
chunk.xml
...
<ch id="ch_001" type="N"  href="mword.xml#id(mw_001)">
  <potgov id="p_001"      href="mword.xml#id(mw_001)"/>
</ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)">
  <potgov id="p_002"      href="mword.xml#id(mw_003)"/>
  <aux id="aux_001"       href="mword.xml#id(mw_002)"/>
</ch>
...
 
(2) ho prenotato un altro biglietto
(‘(I) have booked another ticket’)

 
mword.xml
...
<mw id="mw_001">ho</mw>
<mw id="mw_002">prenotato</mw>
<mw id="mw_003">un</mw>
<mw id="mw_004">altro</mw>
<mw id="mw_005">biglietto</mw>
...
chunk.xml
...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_002)">
  <potgov id="p_001"      href="mword.xml#id(mw_002)"/>
  <aux id="aux_001"       href="mword.xml#id(mw_001)"/>
</ch>
<ch id="ch_002" type="N"  href="mword.xml#id(mw_003)..id(mw_005)">
  <potgov id="p_002"      href="mword.xml#id(mw_005)"/>
</ch>
...
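To show how the chunk-internal elements might be read back, the following sketch summarises, for each chunk in example (2), which word fills which role. The `<chunks>` wrapper, the `WORDS` lookup and the helper names are assumptions made for the illustration, and the href helper handles single-word references only, which is all that the child elements in the examples use.

```python
import xml.etree.ElementTree as ET

# Word forms from mword.xml in example (2), keyed by id.
WORDS = {"mw_001": "ho", "mw_002": "prenotato", "mw_003": "un",
         "mw_004": "altro", "mw_005": "biglietto"}

# The <chunks> wrapper is an assumption made so the fragment is well-formed.
CHUNK_XML = """<chunks>
  <ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_002)">
    <potgov id="p_001" href="mword.xml#id(mw_002)"/>
    <aux id="aux_001" href="mword.xml#id(mw_001)"/>
  </ch>
  <ch id="ch_002" type="N" href="mword.xml#id(mw_003)..id(mw_005)">
    <potgov id="p_002" href="mword.xml#id(mw_005)"/>
  </ch>
</chunks>"""

def word_id(href):
    """Extract the id from a single-word href such as mword.xml#id(mw_002)."""
    return href.split("#id(")[1].rstrip(")")

def roles(ch):
    """Map each child element name (potgov, aux, ...) to its word form."""
    return {child.tag: WORDS[word_id(child.get("href"))] for child in ch}

root = ET.fromstring(CHUNK_XML)
summary = {ch.get("id"): roles(ch) for ch in root.iter("ch")}
print(summary["ch_001"])  # {'potgov': 'prenotato', 'aux': 'ho'}
print(summary["ch_002"])  # {'potgov': 'biglietto'}
```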

4.3.3.5 Markup Table
 
<aux>
id [ASCII]
href <mw>

4.3.4 <cop>: Copulas

4.3.4.1 Description

This element marks all forms of ‘be’ functioning as a copula, e.g. "is" in this is good, or "è" (‘is’) in la prenotazione è obbligatoria (‘reservation is obligatory’).

4.3.4.2 Data Source

For the tagging of copulas, the markup of morphological words is necessary.

4.3.4.3 Assignment

Two attributes are needed for the description of copulas: id and href (see the markup table in 4.3.4.5).

4.3.4.4 Examples
 
 
(1) this is good

 
mword.xml
...
<mw id="mw_001">this</mw>
<mw id="mw_002">is</mw>
<mw id="mw_003">good</mw>
...
chunk.xml
...
<ch id="ch_001" type="N"  href="mword.xml#id(mw_001)">
  <potgov id="p_001"      href="mword.xml#id(mw_001)"/>
</ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)">
  <potgov id="p_002"      href="mword.xml#id(mw_003)"/>
  <cop id="cop_001"       href="mword.xml#id(mw_002)"/>
</ch>
...
 
(2) la prenotazione è obbligatoria 
(‘reservation is obligatory’)

 
mword.xml
...
<mw id="mw_001">la</mw>
<mw id="mw_002">prenotazione</mw>
<mw id="mw_003">è</mw>
<mw id="mw_004">obbligatoria</mw>
...
chunk.xml
...
<ch id="ch_001" type="N"  href="mword.xml#id(mw_001)..id(mw_002)">
  <potgov id="p_001"      href="mword.xml#id(mw_002)"/>
</ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_003)..id(mw_004)">
  <potgov id="p_002"      href="mword.xml#id(mw_004)"/>
  <cop id="cop_001"       href="mword.xml#id(mw_003)"/>
</ch>
...

4.3.4.5 Markup Table
 

<cop>
id [ASCII]
href <mw>

4.3.5 <intro>: Introducer


4.3.5.1 Description

An underspecified label for the grammatical unit introducing a prepositional, infinitival or verbal chunk, e.g. "to" in to the restaurant, "to" in I need to wash my hair, or "of" in of doing.

4.3.5.2 Data Source

For the tagging of introducers, the markup of morphological words is necessary.

4.3.5.3 Assignment

Two attributes are needed for the description of introducers: id and href (see the markup table in 4.3.5.5).

4.3.5.4 Examples
 
 
(1)  a boxcar of oranges

 
mword.xml
...
<mw id="mw_020">a</mw>
<mw id="mw_021">boxcar</mw>
<mw id="mw_022">of</mw>
<mw id="mw_023">oranges</mw>
...
chunk.xml
...
<ch id="ch_014" type="N" href="mword.xml#id(mw_020)..id(mw_021)">
  <potgov id="p_014"     href="mword.xml#id(mw_021)"/>
</ch>
<ch id="ch_015" type="P" href="mword.xml#id(mw_022)..id(mw_023)">
  <potgov id="p_015"     href="mword.xml#id(mw_023)"/>
  <intro id="i_007"      href="mword.xml#id(mw_022)"/>
</ch>
...
 
(2)  La preghiamo di rispondere 
(‘we ask you to reply’)

 
mword.xml
...
<mw id="mw_008">la</mw>
<mw id="mw_009">preghiamo</mw>
<mw id="mw_010">di</mw>
<mw id="mw_011">rispondere</mw>
...
chunk.xml
...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)">
  <potgov id="p_007"      href="mword.xml#id(mw_009)"/>
</ch>
<ch id="ch_008" type="I"  href="mword.xml#id(mw_010)..id(mw_011)">
  <intro id="i_001"       href="mword.xml#id(mw_010)"/>
  <potgov id="p_008"      href="mword.xml#id(mw_011)"/>
</ch>
...
 
4.3.5.5 Markup Table
 
<intro>
id [ASCII]
href <mw>

 

4.3.6 <modal>: Modal Auxiliary

4.3.6.1 Description

A label for modal auxiliaries such as "can", "have to", "may", "must", "need", "ought to".

4.3.6.2 Data Source

For the tagging of modal auxiliaries, the markup of morphological words is necessary.

4.3.6.3 Assignment

Two attributes are needed for the description of modal auxiliaries: id and href (see the markup table in 4.3.6.5).

4.3.6.4 Examples
 
 
(1)  I must admit it

 
mword.xml
...
<mw id="mw_001">I</mw>
<mw id="mw_002">must</mw>
<mw id="mw_003">admit</mw>
<mw id="mw_004">it</mw>
...
chunk.xml
...
<ch id="ch_001" type="N"  href="mword.xml#id(mw_001)">
  <potgov id="p_001"      href="mword.xml#id(mw_001)"/>
</ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)">
  <potgov id="p_002"      href="mword.xml#id(mw_003)"/>
  <modal id="mo_001"      href="mword.xml#id(mw_002)"/>
</ch>
<ch id="ch_003" type="N"  href="mword.xml#id(mw_004)">
  <potgov id="p_003"      href="mword.xml#id(mw_004)"/>
</ch>
...
 
(2)     lo devo ammettere 
         (‘I must admit it’)

 
mword.xml
...
<mw id="mw_001">lo</mw>
<mw id="mw_002">devo</mw>
<mw id="mw_003">ammettere</mw>
...
chunk.xml
...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_003)">
  <potgov id="p_001"      href="mword.xml#id(mw_003)"/>
  <modal id="mo_001"      href="mword.xml#id(mw_002)"/>
</ch>
...

4.3.6.5 Markup Table
 

<modal>
id [ASCII]
href <mw>

4.3.7 <caus>: Causative Verbs

4.3.7.1 Description

A label for verbs such as ‘let’, ‘make’ and ‘cause’ functioning in Italian, French and Spanish causative constructions.

4.3.7.2 Data Source

For the tagging of causative verbs, the markup of morphological words is necessary.

4.3.7.3 Assignment

Two attributes are needed for the description of causative verbs: id and href (see the markup table in 4.3.7.5).

4.3.7.4 Examples
 
 
(1)  L’ho fatto piangere 
       (‘I made him cry’)

 
mword.xml
...
<mw id="mw_001">lo</mw>
<mw id="mw_002">ho</mw>
<mw id="mw_003">fatto</mw>
<mw id="mw_004">piangere</mw>
...
chunk.xml
...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_004)">
  <potgov id="p_001"      href="mword.xml#id(mw_004)"/>
  <aux id="aux_001"       href="mword.xml#id(mw_002)"/>
  <caus id="caus_001"     href="mword.xml#id(mw_003)"/>
</ch>
...
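Since every chunk-internal element points at a word that belongs to its chunk, a simple consistency check is to verify that each child's href falls within the span of the enclosing `<ch>`. The sketch below is one possible way to do this. The function names are hypothetical, and it assumes that word ids have the form mw_NNN and are numbered in text order, as in the examples (the scheme itself does not require this).

```python
import re
import xml.etree.ElementTree as ET

def id_numbers(href):
    """Return the numeric part of every id(mw_NNN) in an href, in order."""
    return [int(n) for n in re.findall(r"id\(mw_(\d+)\)", href)]

def out_of_span(ch):
    """Return ids of child elements whose word lies outside the chunk span."""
    nums = id_numbers(ch.get("href"))
    lo, hi = nums[0], nums[-1]  # a single-word chunk has lo == hi
    return [child.get("id") for child in ch
            if not lo <= id_numbers(child.get("href"))[0] <= hi]

# The causative example from above.
good = ET.fromstring("""<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_004)">
  <potgov id="p_001" href="mword.xml#id(mw_004)"/>
  <aux id="aux_001" href="mword.xml#id(mw_002)"/>
  <caus id="caus_001" href="mword.xml#id(mw_003)"/>
</ch>""")

# A deliberately inconsistent variant: the potgov points outside the chunk.
bad = ET.fromstring("""<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_004)">
  <potgov id="p_001" href="mword.xml#id(mw_005)"/>
</ch>""")

print(out_of_span(good))  # []
print(out_of_span(bad))   # ['p_001']
```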

4.3.7.5 Markup Table
 

<caus>
id [ASCII]
href <mw>

 
 
