Edited Transcription Coding Module

name: Edited Transcription (ET)

coding purpose: to code disfluency phenomena in speech

coding level: Morphosyntax

data sources: spoken corpora

module references: orthographic transcription module

description: five elements are used to annotate disfluency phenomena. seg elements are all-purpose elements that mark disfluent portions of dialogue and their repairs, when these are present in context. Attribute type identifies the kind of disfluency found in the annotated segment: an interruption, a non-standard use, an omission, or a completion of a previous utterance. Attribute rep lets the annotator give the target or standard form of a non-standard usage. Attribute ins lets the annotator insert missing elements. seg elements convey the basic, obligatory information. This information can be refined through recommended and optional elements, namely dys, reparandum, signal and repair, which refer to seg elements through inline href links. dys elements specify the relationship between two seg elements that mark contiguous disfluencies; their attribute type can be used to further define the type of disfluency. The elements reparandum, signal and repair allow a more detailed analysis of the components of a disfluency. By referring to and qualifying seg elements, they specify which previously identified seg element is repaired, which element signals that a repairing sequence is about to be uttered, and which element is the repair in the strict sense.

example:

 
given this input…:
<w id="w_001"> I </w>
<w id="w_002"> wanted </w>
<w id="w_003"> uh </w>
<w id="w_004"> I </w>
<w id="w_005"> thought </w>
<w id="w_006"> I </w>
<w id="w_007"> wanted </w>
<w id="w_008"> to </w>
<w id="w_009"> invite </w>
<w id="w_010"> Margie </w>
…the following annotation is built:

<seg id="seg_001" type="broken" href="orth.xml#id(w_001)..id(w_002)"/>

<seg id="seg_002" href="orth.xml#id(w_004)..id(w_010)"/>
<dys id="dys_001" type="retrins" href="edit.xml#id(seg_001)..id(seg_002)">
<reparandum id="repm_001" href="edit.xml#id(seg_001)"/>
<repair id="rep_001" href="edit.xml#id(seg_002)"/>
</dys>
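
The href values in this example point either at a single element or at a span of elements in another file. The following is a minimal Python sketch of how such a pointer might be expanded into the identifiers it covers. The helper names parse_href and resolve are hypothetical, and the sketch assumes that a span written as id(a)..id(b) is inclusive and follows document order; neither assumption is stated by the module itself.

```python
import re

# Hypothetical helper, not part of the coding module: it splits an href such as
# "orth.xml#id(w_001)..id(w_002)" into a file name and the ids it names.
HREF_RE = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$"
)

def parse_href(href):
    """Return (file, start_id, end_id); end_id equals start_id for a single target."""
    m = HREF_RE.match(href.strip())
    if m is None:
        raise ValueError(f"unrecognised href: {href!r}")
    end = m.group("end") or m.group("start")
    return m.group("file"), m.group("start"), end

def resolve(href, ids_in_order):
    """Expand an href into the ids it covers.

    Assumption: a span is inclusive and follows the document order of the
    target file, given here as the list ids_in_order.
    """
    _, start, end = parse_href(href)
    i, j = ids_in_order.index(start), ids_in_order.index(end)
    return ids_in_order[i:j + 1]

# Word ids w_001 .. w_010, as in the example input above.
word_ids = [f"w_{n:03d}" for n in range(1, 11)]
print(resolve("orth.xml#id(w_001)..id(w_002)", word_ids))  # words covered by seg_001
print(resolve("orth.xml#id(w_004)..id(w_010)", word_ids))  # words covered by seg_002
```

Under these assumptions seg_001 covers the words "I wanted" and seg_002 covers "I thought I wanted to invite Margie", which leaves w_003 ("uh") outside both segments.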

 

markup declaration:

ELEMENT edit_file (seg+, dys+)

ELEMENT seg
ATTRIBUTES:
type (broken | sic | gap | scomp | ocomp)
rep TEXT
ins TEXT
ID
HREF
 

Recommended extensions to the core scheme:

ELEMENT dys (repair?, reparandum?, signal?)
ATTRIBUTES:
type TEXT
ID
HREF
 

Optional extensions:

ELEMENT reparandum
ATTRIBUTES:
ID
HREF

ELEMENT signal
ATTRIBUTES:
ID
HREF

ELEMENT repair
ATTRIBUTES:
ID
HREF

coding procedure: Encode by coder 1. Check by coder 2.

creation notes:
 

Authors: Claudia Soria, Vito Pirrelli
Version: 1., May 1999; 2., October 1999
Comments: none
Literature:

 

Morphosyntactic Annotation Coding Module

name: Morphosyntactic Annotation

coding purpose: identification of morphological words, annotation of part-of-speech categories, annotation of morpho-syntactic features, annotation of interrupted words, annotation of clitics, annotation of compound words, annotation of derivational morphology.

coding level: Morphosyntax

data sources: spoken or written corpora

module references: orthographic transcription module

description: six elements are used to annotate morphological analysis. mw elements identify morphological words. Attribute type is mandatory: it specifies the part-of-speech category of an item. In this implementation, type is used to encode EAGLES-conformant part-of-speech categories. Attribute subtype is optional, and may be used to specify additional morphosyntactic features to be associated with words; in the implementation presented here, subtype conveys EAGLES-conformant recommended morphosyntactic values. Attribute lemma specifies the lemma of the item in question. An optional attribute broken serves to annotate word partials. cpw elements are used to annotate compounds. Their attributes are the same as those of mw elements. A cpw_h element marks the semantic head of a compound. Three elements, namely stem, prefix and suffix, are used to annotate derivational morphology.

example:
 

<mw id="mw_001" type="PD" subtype="PD1020115" lemma="we" href="orth.xml#id(w_001)"/>
<mw id="mw_002" type="V" subtype="V00011111" lemma="take" href="orth.xml#id(w_002)"/>
<mw id="mw_003" type="AT" subtype="AT1000" lemma="the" href="orth.xml#id(w_003)"/>
<mw id="mw_004" type="N" subtype="N102000" lemma="orange" href="orth.xml#id(w_004)"/>
<mw id="mw_005" type="AP" subtype="AP1" lemma="to" href="orth.xml#id(w_005)"/>
<mw id="mw_006" type="N" subtype="N201000" lemma="Elmira" href="orth.xml#id(w_006)"/>
<mw id="mw_007" type="I" subtype="I" href="orth.xml#id(w_007)"/>
<mw id="mw_008" type="PD" subtype="PD1010115" lemma="I" href="orth.xml#id(w_008)"/>
<mw id="mw_009" type="V" subtype="V00011111" lemma="mean" href="orth.xml#id(w_009)"/>
<mw id="mw_010" type="AP" subtype="AP1" lemma="to" href="orth.xml#id(w_010)"/>
<mw id="mw_011" type="N" subtype="N201000" lemma="Corning" href="orth.xml#id(w_011)"/>


markup declaration:

ELEMENT mw (lexit*, stem*, suffix*, prefix*)
ATTRIBUTES
type (N|V|AJ|PD|AT|AV|AP|C|NU|I|U|R|F|DM|PU)
lemma TEXT
subtype TEXT
broken (Y|N)
ID
HREF

ELEMENT cpw (cpw_h?)
ATTRIBUTES
type (N|V|AJ|PD|AT|AV|AP|C|NU|I|U|R|F|DM|PU)
subtype TEXT
broken (Y|N)
ID
HREF

ELEMENT cpw_h
ATTRIBUTES
ID
HREF

ELEMENT stem
ATTRIBUTES
type (N|V)
ID
HREF

ELEMENT suffix
ATTRIBUTES
ID
HREF

ELEMENT prefix
ATTRIBUTES
ID
HREF
 

The following element is used when a reference lexicon in XML format is available:

ELEMENT lexit
ATTRIBUTES
ID
HREF
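
The value lists in the declaration above lend themselves to a mechanical check of annotated files. The following Python sketch illustrates one way this might be done for mw elements. The function check_mw, the wrapper element mwords and the deliberately invalid item mw_x are hypothetical and serve only as illustration; the part-of-speech values are copied from the declaration.

```python
import xml.etree.ElementTree as ET

# Part-of-speech values copied from the mw declaration above.
MW_TYPES = set("N V AJ PD AT AV AP C NU I U R F DM PU".split())

def check_mw(elem):
    """Return a list of problems found on one mw element (empty if none)."""
    problems = []
    pos = elem.get("type")
    if pos is None:
        # type is described as mandatory
        problems.append("missing mandatory attribute type")
    elif pos not in MW_TYPES:
        problems.append(f"unknown type {pos!r}")
    if elem.get("broken") not in (None, "Y", "N"):
        problems.append("broken must be Y or N")
    return problems

# The wrapper element <mwords> and the item mw_x are assumptions made for
# this sketch; the first two mw elements come from the example above.
sample = """<mwords>
<mw id="mw_001" type="PD" subtype="PD1020115" lemma="we" href="orth.xml#id(w_001)"/>
<mw id="mw_007" type="I" subtype="I" href="orth.xml#id(w_007)"/>
<mw id="mw_x" type="XX" href="orth.xml#id(w_099)"/>
</mwords>"""

for mw in ET.fromstring(sample).iter("mw"):
    print(mw.get("id"), check_mw(mw))
```

A check of this kind does not replace the manual checking recommended below; it only catches values that fall outside the declared lists.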

coding procedure: morphological annotation is almost always performed automatically. Manual checking is recommended.

creation notes:

 
Authors: Claudia Soria, Vito Pirrelli
Version: 1., May 1999; 2., October 1999
Comments: none
Literature:

 

Chunking Coding Module

name: Chunking

coding purpose: to code syntactic structure in terms of labelled entities corresponding to chunks. Each chunk is further analyzed for its internal structure.

coding level: Morphosyntax

data sources: spoken or written corpora

module references: morphosyntactic annotation module

description: seven elements are used to annotate syntactic analysis. ch elements identify a sequence of adjacent word tokens which are mutually related through dependency links (i.e., a chunk). Two attributes are used for the description of chunks: type is mandatory, and encodes the syntactic category to which a given chunk belongs; broken is optional, and serves to annotate chunk partials. potgov elements identify “potential governors”, namely the lexical heads of chunks. aux, cop, intro, modal and caus elements specify, respectively, the auxiliary verb, the copula, the introducer or preposition, the modal auxiliary verb and the causative verb in a chunk, if applicable.

example:

given this input…:

<mw id="mw_001"> hello </mw>
<mw id="mw_002"> can </mw>
<mw id="mw_003"> I </mw>
<mw id="mw_004"> help </mw>
<mw id="mw_005"> you </mw>

…the following annotation is built:

<ch id="ch_001" type="ADV" href="mword.xml#id(mw_001)">
<potgov id="p_001" href="mword.xml#id(mw_001)"/>
</ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_002)">
<potgov id="p_002" href="mword.xml#id(mw_002)"/>
</ch>
<ch id="ch_003" type="N" href="mword.xml#id(mw_003)">
<potgov id="p_003" href="mword.xml#id(mw_003)"/>
</ch>
<ch id="ch_004" type="FV" href="mword.xml#id(mw_004)">
<potgov id="p_004" href="mword.xml#id(mw_004)"/>
</ch>
<ch id="ch_005" type="N" href="mword.xml#id(mw_005)">
<potgov id="p_005" href="mword.xml#id(mw_005)"/>
</ch>
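
Because ch elements only point at morphological words, reading a chunk annotation requires following each href back into the morphosyntactic file. The following Python sketch shows how this lookup might be done for the example above. The wrapper elements mwords and chunks and the helper target_id are hypothetical, and the sketch only handles pointers to a single element, as found in this example.

```python
import re
import xml.etree.ElementTree as ET

# Wrapper elements <mwords> and <chunks> are assumptions made for this sketch.
mword_xml = """<mwords>
<mw id="mw_001"> hello </mw>
<mw id="mw_002"> can </mw>
<mw id="mw_003"> I </mw>
<mw id="mw_004"> help </mw>
<mw id="mw_005"> you </mw>
</mwords>"""

chunk_xml = """<chunks>
<ch id="ch_001" type="ADV" href="mword.xml#id(mw_001)"><potgov id="p_001" href="mword.xml#id(mw_001)"/></ch>
<ch id="ch_002" type="FV" href="mword.xml#id(mw_002)"><potgov id="p_002" href="mword.xml#id(mw_002)"/></ch>
<ch id="ch_003" type="N" href="mword.xml#id(mw_003)"><potgov id="p_003" href="mword.xml#id(mw_003)"/></ch>
<ch id="ch_004" type="FV" href="mword.xml#id(mw_004)"><potgov id="p_004" href="mword.xml#id(mw_004)"/></ch>
<ch id="ch_005" type="N" href="mword.xml#id(mw_005)"><potgov id="p_005" href="mword.xml#id(mw_005)"/></ch>
</chunks>"""

def target_id(href):
    """Extract the id from a single-target href such as 'mword.xml#id(mw_001)'."""
    return re.search(r"#id\(([^)]+)\)", href).group(1)

# Map each mw id to its word form, then label each chunk with the word it covers.
words = {mw.get("id"): mw.text.strip() for mw in ET.fromstring(mword_xml).iter("mw")}
labelled = [
    (ch.get("type"), words[target_id(ch.get("href"))])
    for ch in ET.fromstring(chunk_xml).iter("ch")
]
print(labelled)
# [('ADV', 'hello'), ('FV', 'can'), ('N', 'I'), ('FV', 'help'), ('N', 'you')]
```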

markup declaration:

ELEMENT ch (potgov, aux?, cop?, intro?, modal?, caus?)
ATTRIBUTES
type (ADJ|PA|ADV|SUBORD|N|P|FV|G|I|PART|Di|ADJ_PART|COORD|U)
broken (Y | N)
ID
HREF

ELEMENT potgov
ATTRIBUTES
ID
HREF

ELEMENT aux
ATTRIBUTES
ID
HREF

ELEMENT cop
ATTRIBUTES
ID
HREF

ELEMENT intro
ATTRIBUTES
ID
HREF

ELEMENT modal
ATTRIBUTES
ID
HREF

ELEMENT caus
ATTRIBUTES
ID
HREF
 

coding procedure: Chunking can be performed either automatically or manually. In the first case, manual checking of the chunker output is recommended. In the second case, the standard practice is sufficient (i.e., encode by coder 1, check by coder 2).

creation notes:
 

Authors: Claudia Soria, Vito Pirrelli
Version: 1., May 1999; 2., October 1999
Comments: none
Literature:

 

Functional Annotation Coding Module

name: Functional Annotation

coding purpose: to encode functional analysis of data, that is to provide information about how grammatical relations such as subject, object and indirect object are instantiated in context.

coding level: Morphosyntax

data sources: spoken or written corpora

module references: morphosyntactic annotation module

description: Encoding is carried out by means of funct elements, which point to lexical tokens only indirectly.

The terms of the relationship are annotated through two dedicated elements, head and dep, which are hierarchically embedded within funct elements and point to the relevant lexical tokens in the resource file.
The type of relationship involved is represented by means of a list of values for the attribute type, further specifying dep elements.
Morphosyntactic features can be specified, when needed, through attributes in head and dep elements.
head attributes are: diath (i.e., the diathesis of a verbal head, whether active, passive, or middle), tense, person, number and gender, respectively the morphosyntactic tense, person, number and gender of the head.
dep attributes are: intro (for introducer, i.e., the element which possibly introduces the dependent), case (i.e., the case of the dependent), and synt_real (i.e., the particular syntactic realization of the dependent, whether clausal or non-clausal).


example:
 

given this input…:
<mw id="mw_001"> Paul </mw>
<mw id="mw_002"> said </mw>
<mw id="mw_003"> that </mw>
<mw id="mw_004"> he </mw>
<mw id="mw_005"> will </mw>
<mw id="mw_006"> accept </mw>
<mw id="mw_007"> the </mw>
<mw id="mw_008"> job </mw>
…the following annotation is built:

<funct id="funct_001">
<head id="h_001" href="mword.xml#id(mw_002)"/>
<dep id="d_001" type="subj" href="mword.xml#id(mw_001)"/>
<dep id="d_002" type="comp" href="mword.xml#id(mw_006)"/>
</funct>
<funct id="funct_002">
<head id="h_002" href="mword.xml#id(mw_006)"/>
<dep id="d_003" type="subj" href="mword.xml#id(mw_004)"/>
<dep id="d_004" type="dobj" href="mword.xml#id(mw_008)"/>
</funct>
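
Each funct element in this example groups one head with its dependents, so the annotation can be read as a set of (head, relation, dependent) triples. The following Python sketch shows how such triples might be extracted. The wrapper element functs, the words dictionary and the helpers target_id and triples are hypothetical; the funct elements themselves are copied from the example.

```python
import re
import xml.etree.ElementTree as ET

# Word forms keyed by mw id, taken from the example input above.
words = {
    "mw_001": "Paul", "mw_002": "said", "mw_003": "that", "mw_004": "he",
    "mw_005": "will", "mw_006": "accept", "mw_007": "the", "mw_008": "job",
}

# The wrapper element <functs> is an assumption made for this sketch.
funct_xml = """<functs>
<funct id="funct_001">
<head id="h_001" href="mword.xml#id(mw_002)"/>
<dep id="d_001" type="subj" href="mword.xml#id(mw_001)"/>
<dep id="d_002" type="comp" href="mword.xml#id(mw_006)"/>
</funct>
<funct id="funct_002">
<head id="h_002" href="mword.xml#id(mw_006)"/>
<dep id="d_003" type="subj" href="mword.xml#id(mw_004)"/>
<dep id="d_004" type="dobj" href="mword.xml#id(mw_008)"/>
</funct>
</functs>"""

def target_id(href):
    """Extract the id from a single-target href such as 'mword.xml#id(mw_002)'."""
    return re.search(r"#id\(([^)]+)\)", href).group(1)

def triples(root, words):
    """Return one (head word, relation type, dependent word) triple per dep element."""
    out = []
    for funct in root.iter("funct"):
        head = words[target_id(funct.find("head").get("href"))]
        for dep in funct.findall("dep"):
            out.append((head, dep.get("type"), words[target_id(dep.get("href"))]))
    return out

root = ET.fromstring(funct_xml)
for triple in triples(root, words):
    print(triple)
```

For the example, this yields subj and comp relations headed by "said" and subj and dobj relations headed by "accept". The verb "accept" appears once as a dependent and once as a head, which is how the embedded clause is linked to the main clause.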


markup declaration:

ELEMENT funct (head, dep+)
ATTRIBUTES:
ID
HREF

ELEMENT head
ATTRIBUTES:
head TEXT
diath (active|passive|middle)
person (1|2|3)
number (sg|pl)
gender (m|f|n)
v_type (impers)
ID
HREF

ELEMENT dep
ATTRIBUTES:
type (subj|dobj|obj2|iobj|mod|comp)
intro TEXT
case TEXT
synt_real (n_cl|cl|c|x)
ID
HREF

ELEMENT coord (arg+)
ATTRIBUTES:
type (and|or|comma)
ID
HREF

ELEMENT arg
ATTRIBUTES:
ID
HREF

ELEMENT bind (arg+)
ATTRIBUTES:
ID
HREF
 

coding procedure: manual annotation of the functional syntactic analysis of a text is performed through the following steps:
 


creation notes:
 

Authors: Claudia Soria, Vito Pirrelli
Version: 1., May 1999; 2., October 1999
Comments: none
Literature: