MATE LEVEL MARKUP
MORPHOSYNTAX
Vito Pirrelli, Claudia Soria

 
 
 

1 Introduction

The present document offers an edited selection of good practices for annotating dialogue text at the levels of word analysis, chunking and representation of syntactic functional relations (sections 3, 4 and 5). It also provides guidelines for the edited transcription of a dialogue (section 2). These levels are seen as conceptually independent, though interconnected, sub-levels of morpho-syntactic analysis. For a more comprehensive review of the concrete practices followed in a representative set of current annotation schemes, the interested reader is referred to MATE Deliverable 1.1. Here, we provide a formal framework, or annotation meta-scheme, to be used as a practical blueprint for both language- and domain-specific scheme development and customization. The meta-scheme described in these pages can also serve as a kind of lingua franca for the exchange of information and data, a role supported by the choice of XML as mark-up language.
 

1.1 General Requirements

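We first identified the following list of prerequisites to the design of a meta-scheme for dialogue annotation at the morpho-syntactic level: coverage and expected feed-back, a distinction between core and periphery, redundancy, support for existing annotation schemes, amenability to automatic annotation, reliability, and multilinguality.

In what follows, we will comment on each of these desiderata and illustrate their relationship with morpho-syntactic annotation in MATE.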
 

1.1.1 Coverage and Expected Feed-back

It is useful, at this stage, to make clear what sort of input the present deliverable is intended to elicit. First, we provide a list of phenomena modelled through the suggested meta-scheme. The list exemplifies a representative range of phenomena crucially involved in the annotation of a dialogue at the levels of morpho-syntactic analysis touched on in this report, but it is not intended to give instructions for marking up an exhaustive list of language-specific facts. For example, we suggest annotating derivatives through immediate morpheme segmentation, i.e. by signalling the most external affix only, as in "(derivation(al))". However, we do not provide language-specific recommendations concerning problems of segmentation due to fusional phenomena or truncated stems, as in provide → provision or truncate → truncation. For one thing, it would simply be impossible to tackle problems at this level of detail for even a subset of the languages that MATE is interested in covering. For another, linguistic issues at this level of granularity depend too heavily on the theoretical commitment of the annotators and on the intended purposes of their annotation scheme, so that suggesting a standard practice at this level would be questionable.
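The idea of immediate morpheme segmentation can be illustrated with a minimal sketch. It assumes a small, purely hypothetical suffix list and plain string matching; neither is part of the meta-scheme, and, in line with the caveat above, fusional phenomena and truncated stems are deliberately not handled.

```python
# Hypothetical suffix inventory, for illustration only.
SUFFIXES = ("al", "ness", "ly")


def outer_affix_split(word, suffixes=SUFFIXES):
    """Split off the outermost suffix only; return (base, suffix) or (word, None)."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], suffix
    return word, None


def bracket(word):
    """Render the segmentation in the bracket notation used in the text."""
    base, suffix = outer_affix_split(word)
    # The base is left unanalysed: only the most external affix is signalled.
    return f"({base}({suffix}))" if suffix else f"({word})"


print(bracket("derivational"))
```

The base "derivation" is not segmented further, which is the point of marking the most external affix only.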

Furthermore, the range of phenomena to be considered poses a considerable challenge to any attempt to adapt existing annotation practices, predominantly designed for annotating written texts, to the specific exigencies of dialogic data. The challenge has mainly to do with the noisy nature of spoken texts: non-standard forms, repetitions, false starts, anacolutha, incomplete phrases, etc. We envisage that most of this noise should first be annotated at the low level of edited transcription, as shown in section 2 of the present report. This stage is intended as a preliminary filter: it marks noisy or non-standard material without editing it out. In fact, much of this noise will receive a linguistic annotation at the level of chunking (section 4 of the present report), while being ignored at the level of functional annotation.
 

1.1.2 Core and Periphery

For all sub-levels of annotation considered here, the meta-scheme consists of two subsets of tags. The first subset, or core scheme, supplies basic means for annotating obligatory information. The second subset, or periphery tag set, makes provision for further linguistic annotation to be added on top of obligatory information, whenever this is required by the annotator. The periphery tag set in turn divides into two further subsets: a recommended set and an optional one. This makes the meta-scheme highly modular and open to further augmentation, both in terms of more granular information and of further independent dimensions of analysis.
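The core/periphery split can be pictured as a small data structure. The sketch below is only an illustration: the class name `TagSet` and the attribute names `pos`, `lemma` and `derivation` are hypothetical and are not taken from the meta-scheme. It treats core attributes as obligatory and both periphery subsets as permitted but not required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagSet:
    core: frozenset                        # obligatory information
    recommended: frozenset = frozenset()   # periphery, recommended subset
    optional: frozenset = frozenset()      # periphery, optional subset

    def check(self, annotation):
        """Return a list of problems; an empty list means the annotation is acceptable."""
        present = set(annotation)
        allowed = self.core | self.recommended | self.optional
        problems = [f"missing core attribute: {name}"
                    for name in sorted(self.core - present)]
        problems += [f"unknown attribute: {name}"
                     for name in sorted(present - allowed)]
        return problems


# Hypothetical word-level tag set, for illustration only.
WORD_TAGS = TagSet(
    core=frozenset({"pos"}),
    recommended=frozenset({"lemma"}),
    optional=frozenset({"derivation"}),
)

print(WORD_TAGS.check({"pos": "N", "lemma": "train"}))
print(WORD_TAGS.check({"lemma": "train"}))
```

Adding a further dimension of analysis then amounts to extending one of the periphery subsets, leaving the core untouched.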
 

1.1.3 Redundancy

In designing the MATE meta-scheme for morpho-syntactic annotation, considerable care was taken to provide the potential user with a battery of open choices for encoding a particular range of phenomena, rather than give one solution only. For example, a morphological compound can either be represented as a whole morphological word, or as two (or more) independent words linked together through a compounding relationship.
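The two alternative encodings of a compound can be sketched as follows. The element and attribute names (`w`, `seg`, `rel`, `type`, `members`) are assumptions made for this illustration and should not be read as the actual MATE mark-up; the sketch only shows that both encodings carry the same surface material.

```python
import xml.etree.ElementTree as ET


def compound_as_word(parts):
    """Encode the compound as one whole morphological word."""
    w = ET.Element("w", {"id": "w1", "type": "compound"})
    w.text = "".join(parts)
    return w


def compound_as_link(parts):
    """Encode the compound as independent words plus a compounding relation."""
    seg = ET.Element("seg")
    ids = []
    for i, part in enumerate(parts, start=1):
        w = ET.SubElement(seg, "w", {"id": f"w{i}"})
        w.text = part
        ids.append(f"w{i}")
    # The relation points at its members by id instead of nesting them.
    ET.SubElement(seg, "rel", {"type": "compound", "members": " ".join(ids)})
    return seg


def surface(element):
    """Recover the surface string from either encoding."""
    return "".join(element.itertext())


whole = compound_as_word(["black", "board"])
linked = compound_as_link(["black", "board"])
print(ET.tostring(whole, encoding="unicode"))
print(ET.tostring(linked, encoding="unicode"))
```

The linked encoding keeps each component available for separate word-level annotation, at the cost of an extra relation element.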
 

1.1.4 Supported Annotation Schemes

We provide, at each sub-level of annotation, an indication of some currently available mark-up schemes that i) are compatible with our meta-scheme, and ii) will be supported by the MATE Workbench. These recommended practices are to be interpreted as a fair basis for testing the MATE Workbench. Our main concern was to support those annotation practices which both represent current standardization efforts and have proven to be portable and usable for practical applications. Two efforts in particular meet these requirements, namely the EAGLES and SPARKLE projects. The EAGLES annotation scheme is the product of a joint EC-funded standardisation effort carried out in the framework of the Expert Advisory Group on Language Engineering Standards.

The output of EAGLES (http://www.ilc.pi.cnr.it/EAGLES/home.html) has been particularly instrumental in the definition of a common set of morphosyntactic tags, with some integration and extension. The EC-funded project SPARKLE (Shallow Parsing and Knowledge Extraction for Language Engineering, http://www.ilc.pi.cnr.it/sparkle.html) has developed a layered scheme of syntactic annotation encompassing three different levels of annotation: the chunk-based level, the constituency-based level and the functional level. The scheme is designed to be flexible, multipurpose, and amenable to finite state techniques of local and robust parsing.

In fact, the SPARKLE specifications were used for intelligent cross-lingual text editing and translation filtering in multilingual information retrieval systems (Xerox, Sharp), and for speech recognition (Daimler-Benz), with a steady improvement in performance.

The annotation meta-scheme presented in this report represents an XML instantiation of two of the three levels envisaged in SPARKLE: namely chunking and functional annotation. In Deliverable 1.1 we argued that these two levels provide room for partial parsing, underspecification and graded levels of abstraction from raw input text, thus fulfilling many of the desiderata connected with syntactic processing of typically noisy data such as dialogues.
 

1.1.5 Amenability to Automatic Annotation

The concern for amenability to (semi)automatic annotation is of paramount importance throughout the present report, and has influenced a number of choices. For example, the preference given to chunking and functional annotation over, e.g., phrase structure trees at the syntactic level is chiefly motivated by the locality of the analysis offered by these two schemes, a useful feature if one wants to prevent a local parsing failure from backfiring and causing the entire parse of an utterance to fail. This is particularly desirable with a view to manual correction of an automatically annotated output, since a reasonably reliable partial analysis is always better than no analysis at all.
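The benefit of locality can be illustrated with a toy chunker. This is a sketch under stated assumptions, not the SPARKLE chunking procedure: the part-of-speech tags and the chunk labels `N_C`, `V_C` and `UNK` are hypothetical, and the grouping rule (merge adjacent tokens that map to the same chunk type) is deliberately simplistic.

```python
# Hypothetical mapping from part-of-speech tags to chunk labels.
CHUNK_OF = {"DET": "N_C", "ADJ": "N_C", "NOUN": "N_C", "AUX": "V_C", "VERB": "V_C"}


def chunk(tagged):
    """Group (word, pos) pairs into flat chunks.

    Material the chunker cannot analyse is isolated in its own UNK chunk,
    so a local failure never invalidates the chunks around it.
    """
    chunks = []
    for word, pos in tagged:
        label = CHUNK_OF.get(pos, "UNK")
        if chunks and chunks[-1][0] == label and label != "UNK":
            chunks[-1][1].append(word)      # extend the current chunk
        else:
            chunks.append((label, [word]))  # open a new chunk
    return chunks


utterance = [("the", "DET"), ("train", "NOUN"), ("uh", "FILLER"),
             ("is", "AUX"), ("leaving", "VERB")]
for label, words in chunk(utterance):
    print(label, " ".join(words))
```

The filler is left unanalysed, yet the nominal and verbal chunks on either side of it are still produced, which is the partial analysis a human corrector can then build on.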
 

1.1.6 Reliability

Inter-coder reliability concerns the replicability of an annotation task across human annotators. The task is defined by an annotation scheme and a corpus to be annotated accordingly. The common assumption is that each token phenomenon in the corpus should be given the same tag by several independent annotators. A meta-scheme, as an edited collection of different annotation schemes, can only i) provide figures for the reliability of each independently supported scheme, and ii) possibly shed light on, and give reasons for, the respective degrees of reliability of the tag sets considered.
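One common way of turning agreement between two annotators into a single figure is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text above does not prescribe any particular statistic, so the choice of kappa here is an assumption made for illustration, and the tag sequences are invented.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators tagging the same sequence of tokens."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must tag the same non-empty token sequence")
    n = len(tags_a)
    # Observed agreement: proportion of tokens given the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement, estimated from each annotator's own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this invented example the annotators agree on three tokens out of four, but half of that agreement is expected by chance, so kappa is 0.5 rather than 0.75.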
 

1.1.7 Multilinguality

The selection of a scheme is also motivated on a multilingual basis. In fact, consideration of multilingual aspects was one of the main motivations for our choice of the SPARKLE annotation scheme as a starting point here (see sections 4 and 5 below), since SPARKLE provides annotated material in four different languages, namely English, French, German and Italian. However, in the present meta-scheme no attempt is made at covering any one language in particular, as the supported schemes present slightly different annotation choices, depending on language-specific phenomena. For example, while for some languages the syntactic function of a phrase can be read off a traditional constituency-based parse tree, it is not clear how tenable the same claim is for relatively free phrase order languages such as German and Italian, which typically exhibit complex cases of phrase scrambling and discontinuous phrases. This argument justifies our choice of considering syntactic functions as primitive linguistic notions, rather than as somehow derivative of the configurational properties of a language.
 

1.2 Overview

The present document is structured into four sections, apart from this one. Each section covers a level of morpho-syntactic annotation, and their sequence ideally represents the procedure which should be followed by the potential user in annotating a dialogue at the morpho-syntactic level. The individual sub-levels are:

a) edited transcription (section 2);
b) word analysis (section 3);
c) chunking (section 4);
d) functional relations (section 5).

All levels presuppose a transcription file marked up in XML for words. Levels c) and d) are independent of each other, but dependent on level b).
 
 
