<stem><cpw>
<suffix>
<prefix>
<cpw_h>
<mw> elements are used to mark up morphological words. As will be made clear below (see section 3.3.1.3), a morphological word stands, in the unmarked case, in a one-to-one relationship with an orthographic word. Exceptions to this tendency are cases where several morphological words form part of the same orthographic word (e.g. in cliticized words) and cases where several orthographic words are in fact part of one and the same morphological word (e.g. in multi-word compounds).
Basically, morphological words are described at this level through the attributes type and subtype, which are intended to reflect EAGLES recommendations for morphosyntactic annotation. EAGLES recommendations make a distinction among three levels of specification, respectively encompassing obligatory, recommended and optional attribute-value pairs. Obligatory information concerns word category.
Word categories can further be specified by means of appropriate morphosyntactic features (such as gender, number, case etc.), expressed as supplementary recommended tags. The combination of a category tag with its morphosyntactic feature specification yields complex tags of considerable length and granularity, so as to cover the range of phenomena most commonly attested in European languages. A further set of optional tags is foreseen to deal with highly language-specific attribute-value pairs.
We partially mirror this three-level specification as follows: the attribute type is intended to contain major obligatory part-of-speech categories, while the attribute subtype is used to introduce some recommended morpho-syntactic values. The attribute type is obligatory, whereas the attribute subtype is optional.
3.3.1.2 Data Source
Tagging at the morphological level presupposes the markup of orthographic words.
3.3.1.3 Segmentation/Selection
As already observed above, in many cases, orthographic and morphological words are in a one-to-one correspondence. There are several exceptions to this tendency, however, as shown by the existence of compounds such as credit card, or cliticised words such as Italian dimmelo ‘say it to me’. In the former case, two distinct orthographic words make up a unique compounded word; in the latter case, a unique orthographic word is made up out of three distinct morphological words, namely di ‘say’, mi ‘to me’ and lo ‘it’.
From this it follows that annotation at the morphological level involves at least three different labelling mechanisms: i) marking up one word according to a set of morpholexical and morphosyntactic features; this represents the simplest case, where one morphological word corresponds to one orthographic word only; ii) grouping more than one orthographic word into one compounded constituent; this is the case of annotating a compound like credit card; iii) segmenting a morphologically complex orthographic word into its constituent elements, as in the case of cliticised words, such as Italian diglielo, fammi, fammelo, Spanish daselo, diselo, but also English it’s, lemme. Note that, in case ii) above, a single morphological word will point to a sequence of two or more orthographic words in the resource file. In case iii), on the other hand, two or more morphological words will point to the same orthographic form. In this latter case, a standardized form of the orthographic counterpart of each identified morphological word is specified by manually inserting it as a CDATA child of the <mw> element being annotated (see the examples below).
In simple cases, this standard form coincides with a substring of the annotated word form: for example the cliticised form daselo is to be segmented as da-se-lo, where the segments da, se and lo are also found as independent word forms in Spanish. In other cases, the standard form abstracts away from the orthographic segment, when the latter undergoes changes such as elision, epenthesis etc. For example, the form it’s should be segmented into the standard forms it and is, and not - say - into it and ‘s.
These annotation activities are not necessarily mutually exclusive, as it is often the case that segmented constituents are, in their turn, to be specified for their morphosyntactic features etc. In this document, we provide a unified mark-up framework for carrying out the required annotation practices in an integrated way.
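The three labelling mechanisms can be told apart mechanically from the href values alone. The following is a minimal Python sketch under stated assumptions: the <mwords> wrapper element and the helper names are hypothetical, and the <mw> elements are adapted from the examples in this section (with renumbered ids), not taken from an actual resource file.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical wrapper <mwords>; <mw> content adapted from the section's examples.
MW_SAMPLE = """<mwords>
<mw id="mw_001" type="V" subtype="V20113101" lemma="dire" href="orth.xml#id(w_001)">di</mw>
<mw id="mw_002" type="PD" subtype="PD3110315" lemma="egli" href="orth.xml#id(w_001)">gli</mw>
<mw id="mw_003" type="PD" subtype="PD3110415" lemma="esso" href="orth.xml#id(w_001)">lo</mw>
<mw id="mw_004" type="N" subtype="N101000" href="orth.xml#id(w_002)..id(w_003)"/>
<mw id="mw_005" type="V" subtype="V10111101" lemma="dire" href="orth.xml#id(w_004)"/>
</mwords>"""


def href_targets(href):
    """Return the ids named in an href such as 'orth.xml#id(w_001)..id(w_002)'."""
    return re.findall(r"id\(([^)]+)\)", href)


def classify(root):
    """Label each <mw> as 'one-to-one', 'grouping' or 'segmentation'."""
    mws = root.findall("mw")
    # Count how many <mw> elements point at each single orthographic word.
    shared = Counter(
        href_targets(mw.get("href"))[0]
        for mw in mws
        if len(href_targets(mw.get("href"))) == 1
    )
    labels = {}
    for mw in mws:
        targets = href_targets(mw.get("href"))
        if len(targets) > 1:
            # A range of orthographic words: case ii).
            labels[mw.get("id")] = "grouping"
        elif shared[targets[0]] > 1:
            # Several <mw> elements share one orthographic word: case iii).
            labels[mw.get("id")] = "segmentation"
        else:
            # Case i).
            labels[mw.get("id")] = "one-to-one"
    return labels


labels = classify(ET.fromstring(MW_SAMPLE))
```

Here the three clitic constituents of diglielo come out as "segmentation", the compound as "grouping", and the plain verb as "one-to-one".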
Four attributes are obligatorily needed for the description of morphological words:
Type
Each identified morphological word must be labelled as to its part-of-speech category. POS categories, to be expressed as values of the type attribute, are taken from the set of morphosyntactic major categories detailed in EAGLES specifications for morpho-syntactic annotation.
The table below summarizes the entire range of values.
| VALUE | POS |
| N | noun |
| V | verb |
| AJ | adjective |
| PD | pronoun/determiner |
| AT | article |
| AV | adverb |
| AP | adposition |
| C | conjunction |
| NU | numeral |
| I | interjection |
| U | unique/unassigned |
| R | residual |
| F | filler |
| DM | discourse marker |
| PU | punctuation |
Most of the above tags are self-explanatory; a few words about the following tags are needed (quoting from the relevant EAGLES document):
The residual value is assigned to words which lie outside the traditionally accepted range of grammatical classes, although they occur quite commonly (for example: foreign words).
Discourse markers or discourse items are adverbs, conjunctions and small clauses such as well, you know, I mean, and so on, used as interactional markers (turn-taking and giving, signals of correction, understanding, prompting, and connection to previous utterances).
Lemma
This attribute is intended to provide a place for specifying the exponent form of the lemma of the annotated word form, if a separate lexicon in XML format is not available. For example, the lemma of forms such as is and are is the bare infinitive be by lexical convention. This attribute should not be used if such a lexicon is available; in this case, an alternative annotation is applied instead (see section 5.5.1).
Subtype
This attribute specifies additional recommended morphosyntactic features, according to EAGLES specifications for morpho-syntactic annotation. A value of the attribute subtype is in fact a complex sequence of atomic symbols to be interpreted according to a specific grid of positional slots. In EAGLES, each major morphosyntactic category is associated with a specific grid of positions. For example, the grid for verb-specific morphosyntactic features is given in the following table:
| (i) | Person: | 1. First | 2. Second | 3. Third | |
| (ii) | Gender: | 1. Masculine | 2. Feminine | 3. Neuter | |
| (iii) | Number: | 1. Singular | 2. Plural | | |
| (iv) | Finiteness: | 1. Finite | 2. Non-finite | | |
| (v) | Verbform/Mood: | 1. Indicative | 2. Subjunctive | 3. Imperative | 4. Conditional |
| | | 5. Infinitive | 6. Participle | 7. Gerund | 8. Supine |
| (vi) | Tense: | 1. Present | 2. Imperfect | 3. Future | 4. Past |
| (vii) | Voice: | 1. Active | 2. Passive | | |
| (viii) | Status: | 1. Main | 2. Auxiliary | | |
An Italian verb form such as andò ‘(he) went’, for example, which conveys the values ‘3rd person, singular, finite, indicative, past tense, active, main verb, non-phrasal, non-reflexive, verb’ would be represented, according to the table above, as the complex value ‘V3011141101200’.
Wherever an attribute is inapplicable to a given word in a given tagset, the value 0 fills that attribute position in the string of digits. When the 0s occur in final position, without any non-zero digits following, they can be dropped.
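To make the positional reading concrete, the following Python sketch decodes the first eight positions of a verb subtype string using the grid given in the table above. It is only an illustration: the function and variable names are hypothetical, positions beyond the eighth (which belong to the fuller EAGLES grid in the Appendix) are ignored, and dropped trailing zeros are restored by padding.

```python
# Grid for the eight verb positions listed in the table above.
VERB_GRID = [
    ("person", {1: "first", 2: "second", 3: "third"}),
    ("gender", {1: "masculine", 2: "feminine", 3: "neuter"}),
    ("number", {1: "singular", 2: "plural"}),
    ("finiteness", {1: "finite", 2: "non-finite"}),
    ("verbform/mood", {1: "indicative", 2: "subjunctive", 3: "imperative",
                       4: "conditional", 5: "infinitive", 6: "participle",
                       7: "gerund", 8: "supine"}),
    ("tense", {1: "present", 2: "imperfect", 3: "future", 4: "past"}),
    ("voice", {1: "active", 2: "passive"}),
    ("status", {1: "main", 2: "auxiliary"}),
]


def decode_verb_subtype(subtype):
    """Decode the first eight positions of a verb subtype such as 'V3011141101200'."""
    if not subtype.startswith("V"):
        raise ValueError("not a verb subtype: %r" % subtype)
    # Restore any dropped trailing zeros so that every grid position has a digit.
    digits = subtype[1:].ljust(len(VERB_GRID), "0")
    features = {}
    # zip() stops after the eight grid positions; later digits are ignored here.
    for (name, values), digit in zip(VERB_GRID, digits):
        code = int(digit)
        if code != 0:  # 0 marks an inapplicable attribute
            features[name] = values[code]
    return features


ando = decode_verb_subtype("V3011141101200")
short = decode_verb_subtype("V2")
```

For andò this yields third person, singular, finite, indicative, past, active, main; the gender slot is absent because its digit is 0.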
The complete set of EAGLES positional grids for each part of speech is provided in the Appendix to this document.
Broken: the word is interrupted (Y) or not (N, default value).
| (1) We take the oranges to Elmira uh I mean to Corning |
|
|
| ...
<w id="w_001">we</w> <w id="w_002">take</w> <w id="w_003">the</w> <w id="w_004">oranges</w> <w id="w_005">to</w> <w id="w_006">Elmira</w> <w id="w_007">uh</w> <w id="w_008">I</w> <w id="w_009">mean</w> <w id="w_010">to</w> <w id="w_011">Corning</w> ... |
|
|
| ...
<mw id="mw_001" type="PD" subtype="PD1020115" lemma="we" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="V" subtype="V00011111" lemma="take" href="orth.xml#id(w_002)"/> <mw id="mw_003" type="AT" subtype="AT1000" lemma="the" href="orth.xml#id(w_003)"/> <mw id="mw_004" type="N" subtype="N102000" lemma="orange" href="orth.xml#id(w_004)"/> <mw id="mw_005" type="AP" subtype="AP1" lemma="to" href="orth.xml#id(w_005)"/> <mw id="mw_006" type="N" subtype="N201000" lemma="Elmira" href="orth.xml#id(w_006)"/> <mw id="mw_007" type="I" subtype="I" href="orth.xml#id(w_007)"/> <mw id="mw_008" type="PD" subtype="PD1010115" lemma="I" href="orth.xml#id(w_008)"/> <mw id="mw_009" type="V" subtype="V00011111" lemma="mean" href="orth.xml#id(w_009)"/> <mw id="mw_010" type="AP" subtype="AP1" lemma="to" href="orth.xml#id(w_010)"/> <mw id="mw_011" type="N" subtype="N201000" lemma="Corning" href="orth.xml#id(w_011)"/> ... |
| (2) diglielo |
|
|
| ...
<w id="w_001">diglielo</w> ... |
|
|
| ...
<mw id="mw_001" type="V" subtype="V20113101" lemma="dire" href="orth.xml#id(w_001)">di</mw> <mw id="mw_002" type="PD" subtype="PD3110315" lemma="egli" href="orth.xml#id(w_001)">gli</mw> <mw id="mw_003" type="PD" subtype="PD3110415" lemma="esso" href="orth.xml#id(w_001)">lo</mw> ... |
| (3) fammi |
|
|
| ...
<w id="w_001">fammi</w> ... |
|
|
| ...
<mw id="mw_001" type="V" subtype="V20113101" lemma="fare" href="orth.xml#id(w_001)">fa</mw> <mw id="mw_002" type="PD" subtype="PD1010315" lemma="io" href="orth.xml#id(w_001)">mi</mw> ... |
| (4) glielo dico |
|
|
| ...
<w id="w_001">glielo</w> <w id="w_002">dico</w> ... |
|
|
| ...
<mw id="mw_001" type="PD" subtype="PD3110315" lemma="egli" href="orth.xml#id(w_001)">gli</mw> <mw id="mw_002" type="PD" subtype="PD3110415" lemma="esso" href="orth.xml#id(w_001)">lo</mw> <mw id="mw_003" type="V" subtype="V10111101" lemma="dire" href="orth.xml#id(w_002)"/> ... |
| (5) red skin |
|
|
| ...
<w id="w_001">red</w> <w id="w_002">skin</w> ... |
|
|
| ...
<mw id="mw_001" type="N" subtype="N101000" href="orth.xml#id(w_001)..id(w_002)"/> ... |
| (6) I know it's late |
|
|
| ...
<w id="w_001">I</w> <w id="w_002">know</w> <w id="w_003">it’s</w> <w id="w_004">late</w> ... |
|
|
| ...
<mw id="mw_001" type="PD" subtype="PD1010115" lemma="I" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="V" subtype="V00011111" lemma="know" href="orth.xml#id(w_002)"/> <mw id="mw_003" type="PD" subtype="PD33101151" lemma="it" href="orth.xml#id(w_003)">it</mw> <mw id="mw_004" type="V" subtype="V30111111" lemma="be" href="orth.xml#id(w_003)">is</mw> <mw id="mw_005" type="AV" subtype="AV1120" lemma="late" href="orth.xml#id(w_004)"/> ... |
3.3.1.6 Markup Table
|
|
|
| id | [ASCII] |
| href | <w> |
| type | N, V, AJ, PD, AT, AV, AP, C, NU, I, U, R, F, DM, PU |
| subtype | (cfr. EAGLES tables and Appendix) |
| lemma | [ASCII] |
| broken | Y, N |
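The markup table lends itself to a simple consistency check. The Python sketch below is one possible reading, not part of the specification: it assumes that id, href and type must be present, that broken defaults to N, and that a subtype value begins with its type tag (as it does in all the examples above). The function name and the messages are hypothetical.

```python
import xml.etree.ElementTree as ET

# Permitted values of the type attribute, copied from the markup table.
MW_TYPES = {"N", "V", "AJ", "PD", "AT", "AV", "AP", "C",
            "NU", "I", "U", "R", "F", "DM", "PU"}


def check_mw(mw):
    """Return a list of problems found in one <mw> element (empty if none)."""
    problems = []
    # Assumption: id, href and type are treated as required.
    for required in ("id", "href", "type"):
        if mw.get(required) is None:
            problems.append("missing " + required)
    mw_type = mw.get("type")
    if mw_type is not None and mw_type not in MW_TYPES:
        problems.append("unknown type %r" % mw_type)
    subtype = mw.get("subtype")
    # Assumption drawn from the examples: a subtype starts with its type tag.
    if subtype is not None and mw_type is not None and not subtype.startswith(mw_type):
        problems.append("subtype does not start with type")
    # Assumption: an absent broken attribute counts as the default N.
    if mw.get("broken", "N") not in {"Y", "N"}:
        problems.append("broken must be Y or N")
    return problems


good = ET.fromstring(
    '<mw id="mw_004" type="N" subtype="N102000" lemma="orange" '
    'href="orth.xml#id(w_004)"/>'
)
bad = ET.fromstring('<mw id="mw_x" type="X" broken="maybe" href="orth.xml#id(w_001)"/>')
```

Calling check_mw(good) returns an empty list, while check_mw(bad) reports the unknown type and the invalid broken value.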
Compounds are annotated using <cpw> elements. <cpw> elements are multi-word units whose constituents are linked to morphological words through in-line reference. Thus, unlike any other element identified at this level of annotation, <cpw> elements are linked only indirectly to orthographic words (via morphological words). This is required by the specific linguistic nature of compounds, whose syntactic behaviour is in many respects completely independent of (or opaque to) their internal complex structure, which is nonetheless analysed in terms of (possibly recursive) levels of embedding. This hybrid status forces the annotator to annotate morphological constituents first, and then to tag them as forming part of the same morphological construct.
Compounds are represented orthographically in a variety of different ways, ranging from a one-word representation, as is commonly the case in German, and more rarely in English and Italian compounds (cupboard, blackbird, cassaforte ‘safe’ etc.), to a dashed multiword unit (as in common-or-garden), to a sequence of independent orthographic words (as in credit card or Italian "syntactic" compounds such as ferro da stiro ‘iron’). As orthographic representations are crucially a matter of convention and do not line up with linguistically grounded distinctions, it is recommended that annotation of compounds abstract away from considerations concerning orthography, and be motivated only on linguistic grounds. In the following, we will provide such a skeletal linguistically-based typology and suggest ways of representing it through XML. Two things, however, should be noted. First, the encoding conventions suggested here for compounds can be put to use to annotate other linguistic material which does not traditionally fall into the category of compounds. This material includes frozen expressions such as ad hoc, a priori, matter of fact etc., or complex dialogue markers such as you know. Secondly, one-word compounds should not be annotated as chains of segmented constituents as suggested for morphological derivatives. Doing so would lose much of the information normally associated with the identification of compounds.
Tagging of <cpw> presupposes the markup of morphological words.
3.3.2.3 Segmentation/Selection
Although compounding represents a critical area for both theoretical and computational morphology, annotation of compounds (as opposed to their identification or their interpretation) can be a relatively trivial issue if limited to signalling that a sequence of word forms (such as copy and editor in copy editor, or ferro + da + stiro in ferro da stiro) belongs to a morphosyntactically unique word. Concrete identification of these constituents may vary depending on the orthographic rendering of a compound. In cases such as English cupboard or pineapple, identification of the constituents of a compound requires singling out word-internal constituents, instead of grouping independent orthographic words. In still further cases, segmentation is already indirectly signalled in the orthography by means of dashes, as in common-or-garden. Since orthography is a rather poor indicator of the morphological nature of a compound to be annotated, we suggest that the representation of compounds be grounded on linguistic motivations only, as sketched in the following.
A useful distinction to be made is that between endocentric and exocentric compounds, the former showing a semantic head (e.g. mother milk is a kind of milk), which is not present in the latter (e.g. red skin or redskin is not a kind of skin). Due to their non-compositionality at the level of meaning, it is advisable that exocentric compounds be treated as whole morphological words. Note also that other types of multiword units which are not commonly categorised as compounds, such as discourse markers like you know, can be represented as whole morphological words.
There is wider annotation leeway in the case of endocentric compounds, where it can be argued that the constituent mother in mother milk is a sort of modifier of the head milk, both syntactically and semantically. Since we mark modifiers as independent units at the functional level, it may be convenient to keep the two constituents of the compound mother milk separate at the level of morphological analysis. Still, we are interested in annotating the fact that mother milk is a compound. The mark-up scheme suggested in the following section makes provision for both kinds of solution. A <cpw_h> element (see below) is contained by each <cpw> element and serves to specify the semantic head of the compound.
As a rule, we recommend that endocentric compounds be annotated via <cpw> elements, and exocentric compounds be treated as a single morphological unit.
Clearly, provided that consistency is assured, nothing prevents the annotator from annotating all compounds as one morphological word, or conversely from using a <cpw> element for the annotation of any compound, be it endo- or exo-centric.
Five different attributes are needed for the description of compound words:
| (1) mother milk |
|
|
| ...
<w id="w_001">mother</w> <w id="w_002">milk</w> ... |
|
|
| ...
<mw id="mw_001" type="N" subtype="N10101" lemma="mother" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="N" subtype="N10102" lemma="milk" href="orth.xml#id(w_002)"/> <cpw id="cpw_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)"/>
|
| (2) credit card |
|
|
| ...
<w id="w_001">credit</w> <w id="w_002">card</w> ... |
|
|
| ...
<mw id="mw_001" type="N" subtype="N10101" lemma="credit" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="N" subtype="N10101" lemma="card" href="orth.xml#id(w_002)"/> <cpw id="cpw_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)"/>
|
|
|
|
| id | [ASCII] |
| href | <mw> |
| type | N, V, AJ, PD, AT, AV, AP, C, NU, I, U, R, F, DM |
| subtype | (cfr. EAGLES specifications) |
| broken | Y, N |
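Because a <cpw> element reaches orthographic words only through its <mw> constituents, recovering the words of a compound takes two look-ups. The Python sketch below illustrates this indirection under stated assumptions: the <mwords> wrapper and the function names are hypothetical, and a range such as id(mw_001)..id(mw_002) is assumed to cover every <mw> between the two endpoints in document order.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper <mwords>; the content repeats example (2), credit card.
CPW_SAMPLE = """<mwords>
<mw id="mw_001" type="N" subtype="N10101" lemma="credit" href="orth.xml#id(w_001)"/>
<mw id="mw_002" type="N" subtype="N10101" lemma="card" href="orth.xml#id(w_002)"/>
<cpw id="cpw_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)"/>
</mwords>"""

ID_PATTERN = re.compile(r"id\(([^)]+)\)")


def expand_range(href, ordered_ids):
    """Expand 'file#id(a)..id(b)' into every id from a to b in document order."""
    ids = ID_PATTERN.findall(href)
    start = ordered_ids.index(ids[0])
    end = ordered_ids.index(ids[-1])
    return ordered_ids[start:end + 1]


def compound_words(root):
    """Map each <cpw> id to the orthographic word ids it covers via its <mw> members."""
    mws = {mw.get("id"): mw for mw in root.findall("mw")}
    order = list(mws)  # dicts keep insertion order, i.e. document order
    result = {}
    for cpw in root.findall("cpw"):
        words = []
        for mw_id in expand_range(cpw.get("href"), order):
            # Second hop: from the morphological word to its orthographic word(s).
            words.extend(ID_PATTERN.findall(mws[mw_id].get("href")))
        result[cpw.get("id")] = words
    return result


cpw_to_words = compound_words(ET.fromstring(CPW_SAMPLE))
```

For the sample, cpw_001 resolves to the orthographic words w_001 and w_002.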
A <cpw_h> element is used to mark the semantic head in a compound. As a rule of thumb, headed compounds (or endocentric compounds) should always be in an IS_A relationship to their heads: e.g., a school bus is_a bus, a credit card is_a card etc.
For the tagging of heads, the markup of morphological words is necessary.
3.3.3.3 Segmentation/Selection
The head of a compound normally occupies a fixed position in the sequence of word constituents. This position is subject to language-dependent parameterisation: in languages such as German or English, the head of a compound is the rightmost word element in the chain; in French or Italian the head normally takes the leftmost position, although there can be exceptions to this general tendency, due to the analogical pressure of right-headed compounds (as in Italian scuola bus ‘school bus’ instead of the expected bus scuola).
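The language-dependent default can serve as a first guess that the annotator then confirms or overrides. The following Python sketch is an assumption-laden illustration: the language codes, the table name and the override parameter are all hypothetical, and exceptions such as Italian scuola bus must still be handled by hand.

```python
# Default head position per language, as described in the paragraph above.
# The two-letter codes are illustrative only.
HEAD_POSITION = {"en": "right", "de": "right", "fr": "left", "it": "left"}


def default_head(constituents, language, override=None):
    """Guess the head of a compound; 'override' lets the annotator fix exceptions."""
    if not constituents:
        raise ValueError("a compound needs at least one constituent")
    if override is not None:
        if override not in constituents:
            raise ValueError("override must be one of the constituents")
        return override
    if HEAD_POSITION[language] == "right":
        return constituents[-1]
    return constituents[0]


english_head = default_head(["school", "bus"], "en")
italian_head = default_head(["ferro", "da", "stiro"], "it")
# Exceptional Italian compound: the default would pick 'scuola', so it is overridden.
exception_head = default_head(["scuola", "bus"], "it", override="bus")
```

The default picks bus for school bus and ferro for ferro da stiro; the override is what keeps scuola bus from being mis-headed.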
Two attributes are needed for the description of head elements:
| (1) school bus |
|
|
| ...
<w id="w_001">school</w> <w id="w_002">bus</w> ... |
|
|
| ...
<mw id="mw_001" type="N" subtype="N10101" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="N" subtype="N10101" href="orth.xml#id(w_002)"/> <cpw id="cpw_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)">
|
| (2) credit card |
|
|
| ...
<w id="w_001">credit</w> <w id="w_002">card</w> ... |
|
|
| ...
<mw id="mw_001" type="N" subtype="N10101" href="orth.xml#id(w_001)"/> <mw id="mw_002" type="N" subtype="N10101" href="orth.xml#id(w_002)"/> <cpw id="cpw_001" type="N" href="mword.xml#id(mw_001)..id(mw_002)">
|
|
|
|
| id | [ASCII] |
| href | <mw> |
This set of elements is used to annotate derivational morphology. Unlike inflected forms, derivatives such as Italian derivazion-ale, or English derivation-al, frank-ly, friend-ly, lend themselves more naturally to being marked up as a chain of segmented constituents. "Morpheme segmentation", whether immediate (e.g. signalling the outermost affix only, as in "derivation-al"), complete (as in "deriv-ation-al") or hierarchical (as in "(((deriv) ation) al)"), is provided, for example, in the CELEX electronic lexica (Burnage, 1990). Yet this type of representation is, in general, unable to account for non-concatenative phenomena such as stem allomorphy (admittedly far less frequent in derivational morphology than in inflectional morphology). In the absence of better consensual encoding practices, immediate flat morpheme segmentation can be proposed as a reasonable minimal annotation strategy for encoding derivational morphemes. Note that segmentation is represented here rather indirectly, that is, not through interspersion of dashes in the orthographic rendering of a derivative, but through indication of the standard form of the internal constituents of the derivative. The standard form is expressed, as in the case of cliticised words, as a value of the attribute seg; it is often not the lemma, but the corresponding form found as an independent orthographic unit. This definition, however, does not apply to bound morphemes such as derivational suffixes/prefixes. In this case we suggest, as a rule of thumb, assigning to seg a base form corresponding to the orthography of the suffix/prefix as it shows up in non-fused environments. In difficult cases, one can resort to more than one base form.
Annotation of derivational morphology presupposes the markup of orthographic words.
3.3.4.3 Segmentation/Selection
Flat immediate segmentation requires identification of the most external suffix/prefix in the derivative: e.g. industri-al, considerab-ly etc. When a derivative is both prefixed and suffixed, criteria of selectional restrictions on the way morphemes are concatenated are normally invoked to establish whether the newly added morpheme is a prefix or a suffix: e.g. in a word such as recognition, the suffix -ion is uncontroversially the last morpheme, since the prefix re- can only be attached to verb bases/roots (re-cognise), but not to nouns (*re-cognition). In other more ambiguous cases, the annotator must be guided by lexico-semantic criteria: e.g., the morpheme anti- can be prefixed to both nouns (anti-missile, anti-matter) and adjectives (anti-social). In a case such as anti-semitic, it is the existence of the noun anti-semite and its strong semantic relation with anti-semitic which favours consideration of the suffix -ic as the last morpheme, attached to the base anti-semite.
It should be noted that the style of XML representation illustrated in the following allows the annotator to get around a number of paradoxes of orthographic segmentation. In fact, word-internal constituents are represented not as strings separated by dashes, but rather as standard forms which are represented as CDATA children of the <stem>, <suffix> and <prefix> elements.
In cases of highly fused derivatives, such as, for example, recognition, we suggest assigning the standard forms recognise and ion to the two identified constituents of the immediate segmentation. This makes provision for some kind of abstract representation, thus avoiding the problem of defining an appropriate segmentation of the whole orthographic form recognition, which is fairly controversial.
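The point that standard forms are abstract, and need not concatenate back into the orthographic form, can be shown with a short Python sketch. The frankly element repeats example (2) below; the recognition element is a hypothetical encoding built from the standard forms suggested above, and the function name is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Example (2) of this section.
FRANKLY = """<mw id="mw_001" type="AV" href="orth.xml#id(w_001)">
<stem id="st_001" type="AJ" href="orth.xml#id(w_001)">frank</stem>
<suffix id="su_001" href="orth.xml#id(w_001)">ly</suffix>
</mw>"""

# Hypothetical encoding of 'recognition' with the suggested standard forms.
RECOGNITION = """<mw id="mw_002" type="N" href="orth.xml#id(w_002)">
<stem id="st_002" type="V" href="orth.xml#id(w_002)">recognise</stem>
<suffix id="su_002" href="orth.xml#id(w_002)">ion</suffix>
</mw>"""


def segmentation(mw):
    """Return (element name, standard form) pairs for the constituents of one <mw>."""
    return [
        (child.tag, (child.text or "").strip())
        for child in mw
        if child.tag in ("prefix", "stem", "suffix")
    ]


frankly_parts = segmentation(ET.fromstring(FRANKLY))
recognition_parts = segmentation(ET.fromstring(RECOGNITION))
# Joining standard forms recovers 'frankly' but not the fused form 'recognition'.
frankly_joined = "".join(form for _, form in frankly_parts)
recognition_joined = "".join(form for _, form in recognition_parts)
```

Joining the forms gives frankly in the first case but recogniseion in the second, which is why the orthographic form is reached through href rather than reconstructed from the constituents.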
Two attributes are needed for the description of stems, suffixes and prefixes:
| (1) derivational morphology |
|
|
| ...
<w id="w_001">derivational</w> <w id="w_002">morphology</w> ... |
|
|
| ...
<mw id="mw_001" type="AJ" href="orth.xml#id(w_001)"> <stem id="st_001" type="N" href="orth.xml#id(w_001)">derivation</stem> <suffix id="su_001" href="orth.xml#id(w_001)">al</suffix> </mw> <mw id="mw_002" type="N" href="orth.xml#id(w_002)"/>
|
| (2) frankly |
|
|
| ...
<w id="w_001">frankly</w> ... |
|
|
| ...
<mw id="mw_001" type="AV" href="orth.xml#id(w_001)"> <stem id="st_001" type="AJ" href="orth.xml#id(w_001)">frank</stem> <suffix id="su_001" href="orth.xml#id(w_001)">ly</suffix> </mw> ... |
|
|
|
| id | [ASCII] |
| href | <w> |
| type | N, V |
|
|
|
| id | [ASCII] |
| href | <w> |
|
|
|
| id | [ASCII] |
| href | <w> |