MATE LEVEL MARKUP
MORPHOSYNTAX
Vito Pirrelli, Claudia Soria
1 Introduction
The present document is intended to offer an edited selection of good
practices for annotation of dialogue text at the levels of word analysis,
chunking and representation of syntactic functional relations (sections
3, 4 and 5). It also provides guidelines for the edited transcription
of a dialogue (section 2). These levels are seen as conceptually independent,
though inter-connected, sub-levels of morpho-syntactic analysis. For a
more comprehensive review of the concrete practices followed in a representative
set of current annotation schemes, the interested reader is referred to
MATE Deliverable 1.1. Here, we provide
a formal framework, or annotation meta-scheme, to be used as a practical
blueprint for both language- and domain-specific scheme development and/or
customization. The meta-scheme described in these pages has the further
advantage of being usable as a
kind of lingua franca for the exchange of information and data. This
is supported by the choice of XML as mark-up language.
1.1 General Requirements
We first identified the following list of prerequisites to the design of
a meta-scheme for dialogue annotation at the morpho-syntactic level:
- be robust and wide-coverage;
- be flexible, customizable and usable for practical applications;
- be modular (to allow for partial instantiations of the meta-scheme);
- be redundantly specified (to be able to accommodate alternative practices
  for the annotation of the same phenomenon);
- make provision for graded levels of abstraction from raw data;
- be amenable to (semi)automatic annotation;
- be reliable in terms of inter-annotator agreement;
- have the potential for multi-lingual application.
In what follows, we will comment on each of these desiderata and illustrate
their relationship with morpho-syntactic annotation in MATE.
1.1.1 Coverage and Expected Feed-back
It is useful, at this stage, to make it clear what sort of input the present
deliverable is intended to elicit. First, we provide a list of phenomena
modelled through the suggested meta-scheme. The list exemplifies a representative
range of phenomena crucially involved in the annotation of a dialogue at
the levels of morpho-syntactic analysis touched on in this report, but
it is not intended to give instructions for marking up an exhaustive list
of language-specific facts. For example, we suggest annotating derivatives
through immediate morpheme segmentation, i.e. by signalling the
most external affix only, as in "(derivation(al))". However, we do not
provide language-specific recommendations concerning problems of segmentation
due to fusional phenomena or truncated stems, as in provide → provision
or truncate → truncation. There are two reasons for this. First, it would
simply be impossible to tackle problems
at this level of detail for even a subset of the languages that MATE is
interested in covering. Secondly, linguistic issues at this level of granularity
depend too heavily on the theoretical commitment of the annotators and
on the intended purposes of their annotation scheme. It is therefore questionable
whether a standard practice should be suggested at this level.
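The idea of immediate morpheme segmentation can be illustrated with a minimal Python sketch. It is not part of the MATE specification: the suffix inventory and the function names are hypothetical, and a real scheme would supply its own language-specific affix list. The sketch splits off only the outermost suffix and leaves the base unanalysed, even when the base is itself a derivative.

```python
# Hypothetical suffix inventory, for illustration only; a real scheme
# would supply its own list for each language.
SUFFIXES = ("al", "ation", "ion", "ness", "ly")


def segment_outermost(word, suffixes=SUFFIXES):
    """Split off only the outermost (rightmost) suffix, longest match first.

    Returns (base, suffix), or (word, None) if no listed suffix applies.
    The base is not analysed any further.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, None


def bracket(word):
    """Render the segmentation in the bracketed notation used in the text."""
    base, suffix = segment_outermost(word)
    return f"({base}({suffix}))" if suffix else f"({word})"


print(bracket("derivational"))  # (derivation(al))
```

Applied to "provision", the same function returns the pair ("provis", "ion"), which is exactly the kind of truncated-stem case that is left to language-specific schemes.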
Furthermore, the range of phenomena to be considered poses a considerable
challenge to any attempt to adapt existing annotation practices, predominantly
designed for annotating written texts, to the specific exigencies of dialogic
data. The challenge has mainly to do with the noisy nature of spoken texts:
namely, usage of non-standard forms, repetitions, false starts, anacolutha
and incomplete phrases, etc. We envisage that most of this noise should
first be annotated at a low level of edited transcription,
as shown in section 2 of the present report. This stage is intended as
a preliminary filter: it marks noisy or non-standard
material without editing it out. In fact, much of this noise will receive a linguistic
annotation at the level of chunking (section 4
of the present report), while being ignored at the level of functional
annotation.
1.1.2 Core and Periphery
For all sub-levels of annotation considered here, the meta-scheme consists
of two subsets of tags. The first subset, or core scheme, supplies
basic means for annotating obligatory information. The second subset, or
periphery tag set, serves the purpose of making provision for further
linguistic annotation to be added on top of obligatory information, whenever
this is required by the annotator. The periphery tag set is in turn divided
into two further subsets: a recommended set and an optional one. This makes
the meta-scheme highly modular, and open to further augmentation, both
in terms of more granular information and of further independent dimensions
of analysis.
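The three-way split between core, recommended and optional tags can be pictured as a simple check on an annotated item. The following Python sketch is only an illustration under stated assumptions: the attribute names (id, pos, lemma, number, register) and the function name are hypothetical and are not taken from the MATE tag sets.

```python
# Hypothetical tag inventory for one sub-level; real attribute names
# would come from the instantiated scheme.
CORE = {"id", "pos"}               # obligatory information
RECOMMENDED = {"lemma", "number"}  # periphery, recommended
OPTIONAL = {"register"}            # periphery, optional


def check_annotation(attrs):
    """Check the attributes of one annotated item against the tag subsets.

    Returns the core attributes that are missing and any attribute that
    belongs to none of the three subsets.
    """
    present = set(attrs)
    return {
        "missing_core": sorted(CORE - present),
        "unknown": sorted(present - CORE - RECOMMENDED - OPTIONAL),
    }
```

An annotation that carries only the core attributes passes the check; periphery attributes can be added later without invalidating it, which is the sense in which the meta-scheme is modular.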
1.1.3 Redundancy
In designing the MATE meta-scheme for morpho-syntactic annotation, considerable
care was taken to provide the potential user with a battery of open choices
for encoding a particular range of phenomena, rather than give one solution
only. For example, a morphological compound can either be represented as
a whole morphological word, or as two (or more) independent words linked
together through a compounding relationship.
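The two alternatives for compounds can be sketched as follows. The XML fragments are embedded in Python so that they can be checked mechanically. All element and attribute names (mw, compound, parts, id, pos), as well as the example word, are assumptions made for illustration and are not the actual MATE tags.

```python
import xml.etree.ElementTree as ET

# Option 1: the compound is a single morphological word.
# (Element and attribute names are hypothetical.)
whole = ET.fromstring('<mw id="mw_1" pos="noun">blackboard</mw>')

# Option 2: two independent words linked by a compounding relation.
parts = ET.fromstring(
    "<s>"
    '<mw id="mw_1" pos="adj">black</mw>'
    '<mw id="mw_2" pos="noun">board</mw>'
    '<compound id="c_1" parts="mw_1 mw_2"/>'
    "</s>"
)


def compound_forms(root):
    """Return the surface form of each compound, whichever encoding is used."""
    words = {w.get("id"): w.text for w in root.iter("mw")}
    links = list(root.iter("compound"))
    if not links:
        # Option 1: every morphological word stands for itself.
        return list(words.values())
    # Option 2: concatenate the linked words in the order given.
    return ["".join(words[i] for i in link.get("parts").split()) for link in links]


print(compound_forms(whole))  # ['blackboard']
print(compound_forms(parts))  # ['blackboard']
```

Both encodings yield the same surface form, so a tool reading the annotation can support either practice.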
1.1.4 Supported Annotation Schemes
We provide, at each sub-level of annotation, an indication of some currently
available mark-up schemes that i) are compatible with our meta-scheme,
and ii) will be supported by the MATE Workbench. These recommended practices
are to be interpreted as a fair basis for testing the MATE Workbench. Our
main concern was to support those annotation practices which both represent
current standardization efforts and are proven to be portable and usable
for practical applications. Two efforts, in particular, meet these requirements,
namely the EAGLES and SPARKLE projects. The EAGLES annotation scheme is the
product of a joint EC-funded European standardisation effort carried out
in the framework of the Expert Advisory Group on Language Engineering
Standards.
The output of EAGLES (http://www.ilc.pi.cnr.it/EAGLES/home.html)
has been particularly instrumental in the definition of a common set of
morphosyntactic tags, with some integration and extension. The EC-funded
project SPARKLE (Shallow Parsing and Knowledge Extraction for Language
Engineering, http://www.ilc.pi.cnr.it/sparkle.html)
has developed a layered scheme of syntactic annotation encompassing three
different levels of annotation: the chunk-based level, the constituency-based
level and the functional level. The scheme is designed so as to be flexible,
multipurpose, and amenable to finite state techniques of local and robust
parsing.
In fact, the SPARKLE specifications were used for intelligent cross-lingual
text editing and translation filtering in multilingual information retrieval
systems (Xerox, Sharp), and speech recognition (Daimler-Benz), with a steady
improvement in performance.
The annotation meta-scheme presented in this report represents an XML
instantiation of two of the three levels envisaged in SPARKLE: namely chunking
and functional annotation. In Deliverable
1.1 we argued that these two levels provide room for partial parsing,
underspecification and graded levels of abstraction from raw input text,
thus fulfilling many of the desiderata connected with syntactic processing
of typically noisy data such as dialogues.
1.1.5 Amenability to Automatic Annotation
The concern for amenability to (semi)automatic annotation is of paramount
importance throughout the present report, and has influenced a number of
choices. For example, the preference given to chunking and functional
annotation over, e.g., phrase structure trees at the syntactic level
is chiefly motivated by the locality of the analysis offered by
these two schemes, a useful feature if one wants to prevent a local parsing
failure from backfiring and causing the entire parse of an utterance to
fail. This is particularly desirable with a view to manual correction of
an automatically annotated output, since a reasonably reliable partial
analysis is always better than no analysis at all.
1.1.6 Reliability
Inter-coder human reliability deals with the issue of replicability of
a task by a human annotator. The task is defined by an annotation scheme
and a corpus to be annotated accordingly. The common assumption is that
each token phenomenon in the corpus should be given the same tag by several
independent annotators. A meta-scheme, as an edited collection of different
annotation schemes, can only i) provide figures for the reliability of each
independently supported scheme, and ii) possibly shed light on and give
reasons for the respective degrees of reliability of the tag sets considered.
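As an illustration of how such reliability figures might be computed, the Python sketch below calculates raw observed agreement and Cohen's kappa for two annotators who tagged the same tokens. The choice of kappa is an assumption: the text does not prescribe a particular measure, and kappa is shown here only as one commonly used chance-corrected statistic. It covers two annotators only. The tag labels are made up.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of tokens that two annotators tagged identically."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected agreement if both annotators tagged independently
    # according to their own tag frequencies.
    p_e = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used one and the same tag throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up tags for four tokens.
ann_1 = ["N", "V", "N", "ADJ"]
ann_2 = ["N", "V", "ADJ", "ADJ"]
print(observed_agreement(ann_1, ann_2))  # 0.75
```

For these two sequences the observed agreement is 3/4, the expected agreement is 5/16, and kappa is therefore 7/11.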
1.1.7 Multilinguality
The selection of a scheme is also motivated on a multilingual basis.
In fact, consideration of multilingual aspects was one of the main motivations
in support of our choice of the SPARKLE annotation scheme as a starting
point here (see sections 4 and 5
below), since SPARKLE provides annotated material in four different
languages, namely English, French, German and Italian. However, in the
present meta-scheme no attempt is made at covering any one language in
particular as the supported schemes present slightly different annotation
choices, depending on language-specific phenomena. For example, while,
for some languages, the syntactic function of a phrase can be read off
a traditional constituency-based parse tree, it is not clear how tenable
the same claim is for relatively free phrase order languages such as German
and Italian, which typically exhibit complex cases of phrase scrambling
and discontinuous phrases. This argument justifies our choice of
considering syntactic functions as primitive linguistic notions, rather
than as derived from the configurational properties of a language.
1.2 Overview
The present document is structured into four sections, apart from this
one. Each section covers a level of morpho-syntactic annotation, and their
sequence ideally represents the procedure which should be followed by the
potential user in annotating a dialogue at the morpho-syntactic level.
The individual sub-levels are:
a) the edited transcription level, whereby the transcript is
   annotated for all nonstandard forms, repetitions, false starts etc.;
b) the morphological word-level, whereby morphological words are
   identified and annotated for their morphological properties.
   Annotation at this level is required for annotation at higher
   levels of analysis;
c) the chunk-level, where the document is annotated for its syntactic
   structure into chunks (see section 4);
d) the functional-level, whereby functional dependencies between
   lexical heads are annotated.
All levels presuppose a transcription file marked up in XML for words.
Levels c) and d) are independent of each other, but dependent on level
b).
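The dependencies among levels can be pictured with a small stand-off example, embedded in Python so that the references can be checked. This is a sketch under assumptions, not the MATE encoding: every element and attribute name (w, mw, chunk, rel, ref, refs, head, dep) and the sample sentence are hypothetical. The chunk layer and the functional layer both point at morphological words and never at each other, which is what keeps them independent.

```python
import xml.etree.ElementTree as ET

# All element and attribute names are hypothetical, chosen for illustration.
doc = ET.fromstring("""
<dialogue>
  <words>
    <w id="w1">the</w><w id="w2">train</w><w id="w3">leaves</w>
  </words>
  <morph>
    <mw id="mw1" ref="w1" pos="det"/>
    <mw id="mw2" ref="w2" pos="noun"/>
    <mw id="mw3" ref="w3" pos="verb"/>
  </morph>
  <chunks>
    <chunk id="c1" type="N" refs="mw1 mw2"/>
    <chunk id="c2" type="V" refs="mw3"/>
  </chunks>
  <functions>
    <rel type="subj" head="mw3" dep="mw2"/>
  </functions>
</dialogue>
""")


def dangling_refs(root):
    """Return references from the chunk and functional layers that do not
    resolve to an element of the morphological word layer."""
    mw_ids = {mw.get("id") for mw in root.iter("mw")}
    used = []
    for chunk in root.iter("chunk"):
        used.extend(chunk.get("refs").split())
    for rel in root.iter("rel"):
        used.extend([rel.get("head"), rel.get("dep")])
    return [r for r in used if r not in mw_ids]


print(dangling_refs(doc))  # []
```

Because neither of the two upper layers refers to the other, either one can be dropped or produced separately without affecting the other, provided the morphological word layer is in place.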