In marking chunks, we are mainly interested in their category and start
and end points. It should be noted that chunks do not necessarily cover
the entire sentence, as there may be material that does not belong to any
chunk. For example, prepositions, coordinators, subordinators, and adverbs
are, in some cases and according to some instantiations of chunking, not
part of any chunk.
Chunks are defined strictly syntactically. Following Abney (1996:1), a chunk is "the non-recursive core of an intra-clausal constituent, extending from the beginning of a constituent to its head (or potential governor, see below), but not including post-head dependents".
Each chunk includes a sequence of adjacent word tokens which are mutually related through dependency links. For a more detailed discussion, cf. MATE Deliverable D.1.1. Examples and criteria for chunking are given in the following sections.
Tagging at the chunk level presupposes the markup of morphological words.
4.3.1.3 Segmentation/Selection
Given the sentence "The hungry man could always eat the meals offered by the pious woman", the chunking will be as follows:
[The hungry man] [could always eat] [the meals] [offered] [by the pious woman].
The sentence is segmented into five chunks. As noted before, each chunk includes a sequence of adjacent word tokens (a text substring) which are mutually related through dependency links. The fact that two substrings are assigned to different chunks does not necessarily entail that there is no dependency relationship linking the two. It simply means that, on the basis of the available lexical knowledge, it is impossible to state unambiguously which chunk relates to which of its neighbouring chunks and what the nature of this relationship is.
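The bracketing above can be converted mechanically into standoff chunk elements of the kind used in the examples of this section. The following is a minimal sketch, not part of the specification: it assumes that morphological words are numbered consecutively (mw_001, mw_002, ...) in a file called mword.xml, and the function name bracketed_to_chunks is hypothetical. The chunk types are supplied by hand and reflect one reading of the category guidelines given below.

```python
import re

# Matches either a bracketed chunk or a single word outside any chunk.
CHUNK_OR_WORD = re.compile(r"\[([^\]]+)\]|(\S+)")


def bracketed_to_chunks(text, types, doc="mword.xml"):
    """Turn '[w1 w2] w3 [w4]' plus one type per chunk into <ch> elements."""
    elements = []
    next_mw = 1  # assumption: words are numbered mw_001, mw_002, ... in order
    for match in CHUNK_OR_WORD.finditer(text):
        if match.group(1) is None:
            next_mw += 1  # a word outside any chunk still consumes an id
            continue
        first = next_mw
        last = first + len(match.group(1).split()) - 1
        next_mw = last + 1
        if len(elements) >= len(types):
            raise ValueError("fewer types than chunks")
        # A one-word chunk points at a single id, a longer one at a range.
        target = f"id(mw_{first:03d})"
        if last > first:
            target += f"..id(mw_{last:03d})"
        ctype = types[len(elements)]
        elements.append(
            f'<ch id="ch_{len(elements) + 1:03d}" type="{ctype}" href="{doc}#{target}"/>'
        )
    return elements


sentence = "[The hungry man] [could always eat] [the meals] [offered] [by the pious woman]"
for element in bracketed_to_chunks(sentence, ["N", "FV", "N", "PART", "P"]):
    print(element)
```

Words left outside the brackets are skipped but still counted, so the sketch also covers sentences in which some material does not belong to any chunk.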
Four different attributes are needed for the description of chunks: id, href, type and broken (see the markup table at the end of this section).
Each identified chunk must be labelled as to its category. Chunk categories, to be expressed as values of the type attribute, are given in the table below:
| NAME | TYPE |
| ADJ | adjectival chunk |
| ADV | adverbial chunk |
| FV | finite verb chunk |
| G | gerundival chunk |
| I | infinitival chunk |
| N | nominal chunk |
| P | prepositional chunk |
| PART | participial chunk |
| U | unassigned |
In the following we provide basic guidelines for the identification of chunk categories, according to the SPARKLE specifications. Examples are given for Italian and English.
ADJ
Adjectival chunks are chunks beginning with any premodifying adverbs and intensifiers and ending with a head adjective. Adjectival phrases occurring in pre-nominal position are not marked as distinct chunks since their relationship to the governing noun is unambiguously identified within the nominal chunk.
ADJ chunks thus include:
Adverbial chunks are chunks extending from any adverbial pre-modifier to the head adverb. Adverbial phrases occurring between an auxiliary verb and a past participle are not isolated as distinct chunks due to their unambiguous dependency on the verb. By the same token, adverbs which immediately premodify verbs or adjectives are part of the verbal or the adjectival chunk respectively. Noun phrases used adverbially (e.g. last week, this morning) are treated as nominal chunks. E.g.:
Finite verbal chunks include modals, ordinary and causative auxiliaries as well as medial adverbs and clitic pronouns, and they end at the head verb.
The "G" value marks gerundival chunks. When part of a tensed group (e.g. in the progressive construction), the gerundival verb form is not marked independently (but rather as part of a FV). G chunks include gerunds functioning as noun phrases but not those functioning as adjectives.
Infinitival chunks include both bare infinitives and infinitives introduced by a preposition.
Noun chunks are chunks extending from the beginning of the noun phrase to the head noun. They include nominal chunks headed by nouns, pronouns, verbs in their infinitival form when preceded by an article (i.e. Italian nominalised infinitival constructions) and proper names. All kinds of modifiers and/or specifiers occurring between the beginning of the noun phrase and the head noun are included in N chunks. E.g.:
Prepositional chunks are chunks which extend from the preposition to the head of the embedded noun phrase. Typical instances of P chunks are:
A Past participle chunk includes participial constructions such as:
The U value indicates that a given chunk cannot be assigned
any other value, in general because it is incomplete due to interruption.
For any language-specific issue the interested reader
is referred to Abney (1996), Carroll et al. (1997), Federici, Montemagni
and Pirrelli (1996).
Broken
The attribute broken supplies a way of representing partial and discontinuous chunks, which are often encountered as a result of interruptions, retracings, and so on.
| (1) Hello, can I help you? |
|
|
| ...
<mw id="mw_001">hello</mw> <mw id="mw_002">can</mw> <mw id="mw_003">I</mw> <mw id="mw_004">help</mw> <mw id="mw_005">you</mw> ... |
|
|
| ...
<ch id="ch_001" type="ADV" href="mword.xml#id(mw_001)"/> <ch id="ch_002" type="FV" href="mword.xml#id(mw_002)"/> <ch id="ch_003" type="N" href="mword.xml#id(mw_003)"/> <ch id="ch_004" type="FV" href="mword.xml#id(mw_004)"/> <ch id="ch_005" type="N" href="mword.xml#id(mw_005)"/> ... |
| (2) La preghiamo di rispondere alle domande del sistema
(‘We ask you to reply to the system’s questions’) |
|
|
| ...
<mw id="mw_008">la</mw> <mw id="mw_009">preghiamo</mw> <mw id="mw_010">di</mw> <mw id="mw_011">rispondere</mw> <mw id="mw_012">alle</mw> <mw id="mw_013">domande</mw> <mw id="mw_014">del</mw> <mw id="mw_015">sistema</mw> ... |
|
|
| ...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)"/> <ch id="ch_008" type="I" href="mword.xml#id(mw_010)..id(mw_011)"/> <ch id="ch_009" type="P" href="mword.xml#id(mw_012)..id(mw_013)"/> <ch id="ch_010" type="P" href="mword.xml#id(mw_014)..id(mw_015)"/> ... |
The interrupted chunk "you kn-" can be annotated as follows:
|
|
| ...
<mw id="mw_001" type="PD">you</mw> <mw id="mw_002" type="U">kn-</mw> ... |
|
|
| ...
<ch id="ch_001" type="N" href="mword.xml#id(mw_001)"/> <ch id="ch_002" type="U" broken="Y" href="mword.xml#id(mw_002)"/> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
| type | ADJ, PA, ADV, SUBORD, N, P, FV, G, I, PART, Di, ADJ_PART, COORD, U |
| broken | Y, N |
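The href values in the examples above point either at a single morphological word or at a range of words written as id(first)..id(last). As an illustration only, the sketch below resolves such references back to word forms using Python's standard xml.etree.ElementTree module. The wrapper elements mwords and chunks and the function name resolve_chunks are assumptions introduced so that the fragments parse as well-formed documents; they are not part of the markup described here.

```python
import re
import xml.etree.ElementTree as ET

# Assumed wrapper elements (<mwords>, <chunks>) around example (2).
MWORDS = """<mwords>
<mw id="mw_008">la</mw> <mw id="mw_009">preghiamo</mw>
<mw id="mw_010">di</mw> <mw id="mw_011">rispondere</mw>
<mw id="mw_012">alle</mw> <mw id="mw_013">domande</mw>
<mw id="mw_014">del</mw> <mw id="mw_015">sistema</mw>
</mwords>"""

CHUNKS = """<chunks>
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)"/>
<ch id="ch_008" type="I" href="mword.xml#id(mw_010)..id(mw_011)"/>
<ch id="ch_009" type="P" href="mword.xml#id(mw_012)..id(mw_013)"/>
<ch id="ch_010" type="P" href="mword.xml#id(mw_014)..id(mw_015)"/>
</chunks>"""

# Captures the first id and, if present, the id that closes the range.
HREF = re.compile(r"#id\(([^)]+)\)(?:\.\.id\(([^)]+)\))?$")


def resolve_chunks(mwords_xml, chunks_xml):
    """Return (chunk id, chunk type, word forms) for every <ch> element."""
    words = [(mw.get("id"), mw.text) for mw in ET.fromstring(mwords_xml).iter("mw")]
    ids = [word_id for word_id, _ in words]
    resolved = []
    for ch in ET.fromstring(chunks_xml).iter("ch"):
        match = HREF.search(ch.get("href"))
        if match is None:
            raise ValueError(f"unrecognised href: {ch.get('href')}")
        start = ids.index(match.group(1))
        end = ids.index(match.group(2) or match.group(1))
        tokens = [form for _, form in words[start:end + 1]]
        resolved.append((ch.get("id"), ch.get("type"), tokens))
    return resolved


for chunk_id, chunk_type, tokens in resolve_chunks(MWORDS, CHUNKS):
    print(chunk_id, chunk_type, " ".join(tokens))
```

Because chunks only store pointers, the word forms themselves stay in the morphological word file and are never duplicated at the chunk level.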
A potential governor is the lexical head of the chunk, that is the lexical element (as opposed to a grammatical element), within a chunk, which neighbouring chunks can syntactically combine with in a dependency relation. Note that the specific nature of this relation plays no role in the definition of "potential governor". In fact, a potential governor can be either the head of a dependency or a dependent itself. Thus, it only represents the lexical hook on which other chunks can lean syntactically.
Grammatical words, such as auxiliaries and prepositions, are here represented as chunk-internal elements other than potential governors. This choice is geared towards combining grammatical and lexical information in the most informative and manageable way, and paves the way to functional annotation.
For the tagging of potential governors, the markup of morphological words (<mw>) is necessary.
4.3.2.3 Segmentation/Selection
For each chunk type, a potential governor is always the rightmost element
of a chunk. In the table below, chunks are classified according to the
type of head they require.
| CHUNK TYPE | POSSIBLE HEADS |
| ADJ | adj |
| ADV | adv |
| N | noun, pron, verb |
| P | noun, pron, verb |
| FV | verb |
| G | verb |
| I | verb |
| PART | verb |
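The rule and the table above lend themselves to a simple consistency check on annotated data. The sketch below is one possible formulation under stated assumptions: the part-of-speech labels (adj, adv, noun, pron, verb) are taken from the table, but how they are obtained from the morphological word level is not specified here, so they are passed in as a plain dictionary. The function name check_potgov is hypothetical.

```python
# Possible heads per chunk type, copied from the table above.
POSSIBLE_HEADS = {
    "ADJ": {"adj"},
    "ADV": {"adv"},
    "N": {"noun", "pron", "verb"},
    "P": {"noun", "pron", "verb"},
    "FV": {"verb"},
    "G": {"verb"},
    "I": {"verb"},
    "PART": {"verb"},
}


def check_potgov(chunk_type, chunk_word_ids, potgov_id, pos_of):
    """Return a list of problems found for one chunk (empty if none)."""
    problems = []
    # The potential governor is expected to be the rightmost word.
    if potgov_id != chunk_word_ids[-1]:
        problems.append("potential governor is not the rightmost element")
    # Types without an entry in the table (e.g. U) are not checked.
    allowed = POSSIBLE_HEADS.get(chunk_type)
    if allowed is not None and pos_of.get(potgov_id) not in allowed:
        problems.append("head category not allowed for chunk type " + chunk_type)
    return problems


# "of oranges": a P chunk whose potential governor is the noun "oranges".
pos_of = {"mw_022": "prep", "mw_023": "noun"}
print(check_potgov("P", ["mw_022", "mw_023"], "mw_023", pos_of))
print(check_potgov("P", ["mw_022", "mw_023"], "mw_022", pos_of))
```

The first call reports no problems. The second flags both a misplaced potential governor and a head category that the table does not allow for a P chunk.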
Two different attributes are needed for the description of potential governors:
| (1) a boxcar of oranges |
|
|
| ...
<mw id="mw_020">a</mw> <mw id="mw_021">boxcar</mw> <mw id="mw_022">of</mw> <mw id="mw_023">oranges</mw> ... |
|
|
| ...
<ch id="ch_014" type="N" href="mword.xml#id(mw_020)..id(mw_021)"> <potgov id="p_014" href="mword.xml#id(mw_021)"/> </ch> <ch id="ch_015" type="P" href="mword.xml#id(mw_022)..id(mw_023)"> <potgov id="p_015" href="mword.xml#id(mw_023)"/> </ch> ... |
| (2) La preghiamo di rispondere alle domande del sistema
(‘We ask you to reply to the system’s questions’) |
|
|
| ...
<mw id="mw_008">la</mw> <mw id="mw_009">preghiamo</mw> <mw id="mw_010">di</mw> <mw id="mw_011">rispondere</mw> <mw id="mw_012">alle</mw> <mw id="mw_013">domande</mw> <mw id="mw_014">del</mw> <mw id="mw_015">sistema</mw> ... |
|
|
| ...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)"> <potgov id="p_007" href="mword.xml#id(mw_009)"/> </ch> <ch id="ch_008" type="I" href="mword.xml#id(mw_010)..id(mw_011)"> <potgov id="p_008" href="mword.xml#id(mw_011)"/> </ch> <ch id="ch_009" type="P" href="mword.xml#id(mw_012)..id(mw_013)"> <potgov id="p_009" href="mword.xml#id(mw_013)"/> </ch> <ch id="ch_010" type="P" href="mword.xml#id(mw_014)..id(mw_015)"> <potgov id="p_010" href="mword.xml#id(mw_015)"/> </ch> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
The auxiliary verb in a verbal chunk, e.g. "have" in I have said, or "was" in was always caught.
For the tagging of auxiliaries, the markup of morphological words is necessary.
Two different attributes are needed for the description of auxiliaries:
| (1) I have said |
|
|
| ...
<mw id="mw_001">I</mw> <mw id="mw_002">have</mw> <mw id="mw_003">said</mw> ... |
|
|
| ...
<ch id="ch_001" type="N" href="mword.xml#id(mw_001)"> <potgov id="p_001" href="mword.xml#id(mw_001)"/> </ch> <ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)"> <potgov id="p_002" href="mword.xml#id(mw_003)"/> <aux id="aux_001" href="mword.xml#id(mw_002)"/> </ch> ... |
| (2) ho prenotato un altro biglietto
(‘(I) have booked another ticket’) |
|
|
| ...
<mw id="mw_001">ho</mw> <mw id="mw_002">prenotato</mw> <mw id="mw_003">un</mw> <mw id="mw_004">altro</mw> <mw id="mw_005">biglietto</mw> ... |
|
|
| ...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_002)"> <potgov id="p_001" href="mword.xml#id(mw_002)"/> <aux id="aux_001" href="mword.xml#id(mw_001)"/> </ch> <ch id="ch_002" type="N" href="mword.xml#id(mw_003)..id(mw_005)"> <potgov id="p_002" href="mword.xml#id(mw_005)"/> </ch> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
This element marks all forms of ‘be’ functioning as a copula, e.g. "is" in this is good, or "è" (‘is’) in la prenotazione è obbligatoria (‘reservation is obligatory’).
For the tagging of copulas, the markup of morphological words is necessary.
Two different attributes are needed for the description of copulas:
| (1) this is good |
|
|
| ...
<mw id="mw_001">this</mw> <mw id="mw_002">is</mw> <mw id="mw_003">good</mw> ... |
|
|
| ...
<ch id="ch_001" type="N" href="mword.xml#id(mw_001)"> <potgov id="p_001" href="mword.xml#id(mw_001)"/> </ch> <ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)"> <potgov id="p_002" href="mword.xml#id(mw_003)"/> <cop id="cop_001" href="mword.xml#id(mw_002)"/> </ch> ... |
| (2) la prenotazione è obbligatoria
(‘reservation is obligatory’) |
|
|
| ...
<mw id="mw_001">la</mw> <mw id="mw_002">prenotazione</mw> <mw id="mw_003">è</mw> <mw id="mw_004">obbligatoria</mw> ... |
|
|
| ...
<ch id="ch_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)"> <potgov id="p_001" href="mword.xml#id(mw_002)"/> </ch> <ch id="ch_002" type="FV" href="mword.xml#id(mw_003)..id(mw_004)"> <potgov id="p_002" href="mword.xml#id(mw_004)"/> <cop id="cop_001" href="mword.xml#id(mw_003)"/> </ch> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
An underspecified label for the grammatical unit introducing a prepositional, infinitival or verbal chunk, e.g. "to" in to the restaurant, "to" in I need to wash my hair, or "of" in of doing.
For the tagging of introducers, the markup of morphological words is necessary.
Two different attributes are needed for the description of introducers:
| (1) a boxcar of oranges |
|
|
| ...
<mw id="mw_020">a</mw> <mw id="mw_021">boxcar</mw> <mw id="mw_022">of</mw> <mw id="mw_023">oranges</mw> ... |
|
|
| ...
<ch id="ch_014" type="N" href="mword.xml#id(mw_020)..id(mw_021)"> <potgov id="p_014" href="mword.xml#id(mw_021)"/> </ch> <ch id="ch_015" type="P" href="mword.xml#id(mw_022)..id(mw_023)"> <potgov id="p_015" href="mword.xml#id(mw_023)"/> <intro id="i_007" href="mword.xml#id(mw_022)"/> </ch> ... |
| (2) La preghiamo di rispondere
(‘we ask you to reply’) |
|
|
| ...
<mw id="mw_008">la</mw> <mw id="mw_009">preghiamo</mw> <mw id="mw_010">di</mw> <mw id="mw_011">rispondere</mw> ... |
|
|
| ...
<ch id="ch_007" type="FV" href="mword.xml#id(mw_008)..id(mw_009)"> <potgov id="p_007" href="mword.xml#id(mw_009)"/> </ch> <ch id="ch_008" type="I" href="mword.xml#id(mw_010)..id(mw_011)"> <intro id="i_001" href="mword.xml#id(mw_010)"/> <potgov id="p_008" href="mword.xml#id(mw_011)"/> </ch> ... |
4.3.5.5 Markup Table
|
|
|
| id | [ASCII] |
| href | <mw> |
A label for modal auxiliaries such as "can", "have to", "may", "must", "need", "ought to".
For the tagging of modal auxiliaries, the markup of morphological words is necessary.
Two different attributes are needed for the description of modal auxiliaries:
| (1) I must admit it |
|
|
| ...
<mw id="mw_001">I</mw> <mw id="mw_002">must</mw> <mw id="mw_003">admit</mw> <mw id="mw_004">it</mw> ... |
|
|
| ...
<ch id="ch_001" type="N" href="mword.xml#id(mw_001)"> <potgov id="p_001" href="mword.xml#id(mw_001)"/> </ch> <ch id="ch_002" type="FV" href="mword.xml#id(mw_002)..id(mw_003)"> <potgov id="p_002" href="mword.xml#id(mw_003)"/> <modal id="mo_001" href="mword.xml#id(mw_002)"/> </ch> <ch id="ch_003" type="N" href="mword.xml#id(mw_004)"> <potgov id="p_003" href="mword.xml#id(mw_004)"/> </ch> ... |
| (2) lo devo ammettere
(‘I must admit it’) |
|
|
| ...
<mw id="mw_001">lo</mw> <mw id="mw_002">devo</mw> <mw id="mw_003">ammettere</mw> ... |
|
|
| ...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_003)"> <potgov id="p_001" href="mword.xml#id(mw_003)"/> <modal id="mo_001" href="mword.xml#id(mw_002)"/> </ch> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
A label for verbs meaning ‘let’, ‘make’ or ‘cause’ when they function in causative constructions, as in Italian, French and Spanish.
For the tagging of causative verbs, the markup of morphological words is necessary.
Two different attributes are needed for the description of causative verbs:
| (1) L’ho
fatto piangere
(‘I made him cry’) |
|
|
| ...
<mw id="mw_001">lo</mw> <mw id="mw_002">ho</mw> <mw id="mw_003">fatto</mw> <mw id="mw_004">piangere</mw> ... |
|
|
| ...
<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_004)"> <potgov id="p_001" href="mword.xml#id(mw_004)"/> <aux id="aux_001" href="mword.xml#id(mw_002)"/> <caus id="caus_001" href="mword.xml#id(mw_003)"/> </ch> ... |
|
|
|
| id | [ASCII] |
| href | <mw> |
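In the examples of this section, the chunk-internal elements (potgov, aux, cop, intro, modal, caus) each point at a single morphological word inside their parent chunk. As a rough illustration, the sketch below reads the causative example and lists the role, if any, assigned to each word. The WORDS dictionary stands in for the morphological word file, and the function name word_roles is hypothetical.

```python
import xml.etree.ElementTree as ET

# Word forms of example (1), keyed by morphological word id.
WORDS = {"mw_001": "lo", "mw_002": "ho", "mw_003": "fatto", "mw_004": "piangere"}

CHUNK = """<ch id="ch_001" type="FV" href="mword.xml#id(mw_001)..id(mw_004)">
  <potgov id="p_001" href="mword.xml#id(mw_004)"/>
  <aux id="aux_001" href="mword.xml#id(mw_002)"/>
  <caus id="caus_001" href="mword.xml#id(mw_003)"/>
</ch>"""


def word_roles(chunk_xml, words):
    """Pair each word form with the tag of the element pointing at it."""
    role_of = {}
    for child in ET.fromstring(chunk_xml):
        # Assumption: child elements reference exactly one word, id(mw_NNN).
        word_id = child.get("href").split("#id(")[1].rstrip(")")
        role_of[word_id] = child.tag
    # Words with no dedicated element (here the clitic "lo") get "-".
    return [(form, role_of.get(word_id, "-")) for word_id, form in words.items()]


for form, role in word_roles(CHUNK, WORDS):
    print(form, role)
```

The same function applies unchanged to the auxiliary, copula, introducer and modal examples, since all of them use the same id and href attributes.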