Corpus-driven genitive disambiguation

Nadine Aldinger - Corpus Linguistics 2005, Birmingham, 07/17

1   Nominalizations and genitive attributes
2   Finding and evaluating context parameters for genitive disambiguation
3   Data collection and annotation
3.1   Corpus architecture
3.2   Database layout
3.3   Extraction procedure
3.4   Classification of genitives
4   First results
4.1   Singular vs. plural nominals
5   Summary
6   Next steps

1 Nominalizations and genitive attributes (1)

(1) a.
Dort  ist die Radioaktivität laut         Messungen
there is  the radioactivity  according-to measurements
der     Organisation hundertmal      höher  als  normal.
the-gen organization a-hundred-times higher than normal
"According to measurements of/by the organization, the radioactivity level there is a hundred times higher than normal."
measure (organization, obj)
  b.
Der Satellit  dient  der     Messung     von Verschiebungen in der Erdkruste.
The satellite serves the-dat measurement of  shifts         in the earth's-crust
"The satellite serves to measure shifts in the earth's crust."
measure (subj, shifts)

1 Nominalizations and genitive attributes (2)

Context parameters usable for corpus-based disambiguation

  1. morphosyntactic form of the NP headed by the nominal:
    i.e. number and definiteness, maybe case
    Messungen (a) vs. die Messung (b)

  2. syntactic structure of the local context:
    inner structure of the nominal's NP, but also embedding in PPs/VPs
    PP-laut (a)

  3. properties of the nominal's base verb:
    e.g. telicity, syntactic subcategorization
    measure (subj, s-comp(C_dass))

  4. lexical material in the local context:
    e.g. selectional restrictions on the governing verb
    dienen "to serve" (b)

2 Finding and evaluating context parameters for genitive disambiguation

  1. Collect linguistic context parameters that might be relevant for genitive interpretation from literature and from (qualitative) corpus observations.

  2. Collect corpus sentences containing representative nominalizations plus genitive attributes in a database and annotate their parameter values automatically.

  3. Annotate genitive interpretation manually for each sentence.

  4. Analyze frequency distributions to find the parameters and parameter combinations that are most useful to predict genitive interpretation.

  5. Implement tests based on these combinations.

  6. Re-run the tests on new corpus sentences and evaluate the quality of their predictions for genitive interpretation; if necessary and possible, improve the tests and look for more diagnostic parameters (bootstrapping).
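
Steps 4 and 5 can be illustrated with a minimal Python sketch. It assumes that each annotated sentence is reduced to a record of one parameter value (here the number of the nominal) and the manually annotated reading. The record layout and the function names are hypothetical. The six records are the example sentences from these slides, with readings read off their English translations; they only illustrate the mechanics and are far too few to support any conclusion.

```python
from collections import Counter, defaultdict

# Hypothetical records: (example id, number of the nominal, genitive reading).
# Readings are read off the translations of the slide examples (an assumption).
annotated = [
    ("1a", "pl", "subj"),  # Messungen der Organisation
    ("2a", "sg", "obj"),   # Vermietung ganzer Etagen
    ("2b", "pl", "subj"),  # Bodenmessungen des Umweltamtes
    ("3a", "sg", "subj"),  # Verfolgung der Armee
    ("3b", "sg", "obj"),   # Verfolgung aller Oppositionellen
    ("3c", "pl", "subj"),  # Verfolgungen der Nazis
]

def distribution(records):
    """Step 4: count readings per parameter value."""
    dist = defaultdict(Counter)
    for _example_id, value, reading in records:
        dist[value][reading] += 1
    return dist

def majority_rule(dist):
    """Step 5: a trivial test that predicts the most frequent reading."""
    return {value: counts.most_common(1)[0][0] for value, counts in dist.items()}

dist = distribution(annotated)
rule = majority_rule(dist)
for value in sorted(rule):
    print(value, dict(dist[value]), "->", rule[value])
```

A majority rule is only one possible kind of test; step 6 would evaluate such a rule on new sentences before it is trusted.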

3.1 Corpus architecture

German newspaper corpus (Frankfurter Rundschau 1992-93), ~40 million words;
pre-processed:

Tools:

3.2 Database layout (1): Tables

3.2 Database layout (2): Examples

(2) a.
der Gewinn, den   sie  durch   Vermietung ganzer    Etagen
the profit  which they through renting    whole-gen floors-gen
an polnische Leiharbeiter   erzielt hatten
to Polish    casual-workers made    had
"the profit they had made by renting whole floors to Polish casual workers"
  b.
Die Bodenmessungen  des     städtischen   Umweltamtes
the soil-measurings the-gen municipal-gen environmental-authority
ergaben katastrophale Ergebnisse.
yielded disastrous    results
"The soil measurements of the municipal environmental authority yielded disastrous results."

3.2 Database layout (3): Nominals

field              values  (2a)        (2b)
lemma              string  Vermietung  Bodenmessung
corpus frequency   int     107         7
compound non-head  string  -           Boden
compound head      string  Vermietung  Messung

3.2 Database layout (4): Base verbs

field                values  (2a)        (2b)
lemma (nominalized)  string  Vermietung  Messung
arg{1,2,3}           string  one line per subcategorization frame:

(2a)
  1: subj(NP_nom)  2: obj(NP_acc)
  1: subj(NP_nom)  2: obj(NP_acc)          3: iobj(NP_dat)
  1: subj(NP_nom)  2: obj(NP_acc)          3: obj-pred(PP_als)
  1: subj(NP_nom)  2: obj(NP_acc)          3: p-obj(PP_an_acc)
  1: subj(NP_nom)
  1: subj(NP_nom)                          3: p-obj(PP_an_acc)

(2b)
  1: subj(NP_nom)  2: arg(PRON_refl-acc)
  1: subj(NP_nom)  2: arg(PRON_refl-acc)   3: p-obj(PP_an_dat)
  1: subj(NP_nom)  2: arg(PRON_refl-acc)   3: p-obj(PP_mit_dat)
  1: subj(NP_nom)  2: obj(NP_acc)
  1: subj(NP_nom)  2: obj(NP_acc)          3: p-obj(PP_an_dat)
  1: subj(NP_nom)                          3: corr_pobj(PAV_an_dat)
  1: subj(NP_nom)                          3: s-comp(C_daß)
  1: subj(NP_nom)                          3: s-comp(C_ob)
  1: subj(NP_nom)                          3: s-comp(C_wh)
  1: subj(NP_nom)                          3: v-comp(VP_zu-inf-perf)
  1: subj(NP_nom)                          3: v-comp(VP_zu-inf-pres)
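
The frame notation used for the base verbs is regular enough to be read mechanically. The following Python sketch shows one way to do so; the regular expression and both function names are assumptions made for illustration, not part of the database described here. The check for a direct-object slot is likewise only an example of what a structured frame makes easy to ask.

```python
import re

# One slot looks like "2: obj(NP_acc)": slot number, function, category.
SLOT = re.compile(r"(\d):\s*([\w-]+)\(([^)]+)\)")

def parse_frame(text):
    """Return {slot number: (grammatical function, category)} for one frame."""
    return {int(num): (function, category)
            for num, function, category in SLOT.findall(text)}

def has_direct_object(frame):
    """True if slot 2 of the frame is filled by an obj(...) argument."""
    return frame.get(2, ("", ""))[0] == "obj"

transitive = parse_frame("1: subj(NP_nom) 2: obj(NP_acc) 3: p-obj(PP_an_acc)")
reflexive = parse_frame("1: subj(NP_nom) 2: arg(PRON_refl-acc)")

print(transitive)
print(has_direct_object(transitive), has_direct_object(reflexive))
```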

3.2 Database layout (5): matches and annotated features (a)

field                      values                            (2a)                (2b)
match sentence             text                              ...
corpus identifier          int

Features of the nominal and its NP
number                     sg, pl                            sg                  pl
definiteness               def, indef, null (bare singular)  null                def
case                       (set of) nom, gen, dat, acc       nom, gen, dat, acc  nom, acc
specifier: word            string                            -                   die
specifier: part of speech  STTS tag (listed below)           -                   ART
adjectival modifier(s)     string                            -                   -
post-genitival PP:
  preposition              string                            an                  -
post-genitival PP:
  case of governed NP      (set of) nom, gen, dat, acc       acc                 -

Values for "specifier: part of speech" (STTS tagset):
  ART     article: die, eine
  PDAT    demonstrative pronoun: diese, ...
  PIAT    indefinite pronoun: keine, etwas, ...
  PPOSAT  possessive pronoun: seine, ihre, ...
  NE      proper noun in genitive case

3.2 Database layout (6): matches and annotated features (b)

field                                 values                               (2a)      (2b)

Features of the genitive NP / von-PP
number                                sg, pl, von-PP                       pl        sg
definiteness                          def, indef, null (bare singular)     indef     def
head lemma                            string                               Etage     Umweltamt
animacy and/or other general lexical  string                               place?    institution?
  properties from GermaNet*

Features of embedding context
preposition of embedding PP           string                               durch     -
main verb lemma of the clause in      string                               erzielen  ergeben
  which the nominal's NP/PP is an
  argument or adjunct*
grammatical function of the nominal's subject, direct object, adjunct ...  adjunct   subject
  NP/PP w.r.t. clause verb*

* planned
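
The database layout can be made concrete with a small sketch. The slides do not name the database system, so the use of SQLite through Python's `sqlite3` module is an assumption, as are all table and column names. Only a subset of the fields listed above is included, and the two rows are examples (2a) and (2b).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nominal (              -- hypothetical table name
    lemma             TEXT PRIMARY KEY,
    corpus_frequency  INTEGER,
    compound_non_head TEXT,
    compound_head     TEXT
);
CREATE TABLE genitive_match (       -- hypothetical table name
    id                    INTEGER PRIMARY KEY,
    nominal_lemma         TEXT REFERENCES nominal(lemma),
    n_number              TEXT,     -- sg, pl
    n_definiteness        TEXT,     -- def, indef, null
    gen_number            TEXT,     -- sg, pl, von-PP
    gen_definiteness      TEXT,
    gen_head_lemma        TEXT,
    embedding_preposition TEXT
);
""")

conn.executemany(
    "INSERT INTO nominal VALUES (?, ?, ?, ?)",
    [("Vermietung", 107, None, "Vermietung"),
     ("Bodenmessung", 7, "Boden", "Messung")],
)
conn.executemany(
    "INSERT INTO genitive_match (nominal_lemma, n_number, n_definiteness,"
    " gen_number, gen_definiteness, gen_head_lemma, embedding_preposition)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("Vermietung", "sg", "null", "pl", "indef", "Etage", "durch"),
     ("Bodenmessung", "pl", "def", "sg", "def", "Umweltamt", None)],
)

# All matches whose nominal is plural, with the head of the compound.
rows = conn.execute(
    "SELECT n.lemma, n.compound_head, m.gen_head_lemma"
    " FROM genitive_match AS m JOIN nominal AS n ON n.lemma = m.nominal_lemma"
    " WHERE m.n_number = 'pl'"
).fetchall()
print(rows)
```

Keeping the nominals in their own table means that lemma-level facts such as corpus frequency are stored once and joined to each match when needed.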

3.3 Extraction procedure

3.4 Classification of genitives

4 First results (1)

Candidate parameters determining subject-to-object genitive ratio

4 First results (2): Singular vs. plural nominals (a)

(3) a.
Laut         A. leben dort  rund  45 000 Menschen, die vor    der
according-to A. live  there about 45 000 people    who before the
Verfolgung  der     Armee geflohen waren.
persecution the-gen army  fled     were
"According to A., about 45 000 people live there who had fled the persecution by the army."
  b.
V. warf  der     Führung    in Bagdad  die     Verfolgung
V. threw the-dat leadership in Baghdad the-acc persecution
aller   Oppositionellen vor.
all-gen oppositionals   before
"V. accused the leadership in Baghdad of the persecution of all members of the opposition."
  c.
Die Tochter  entkam  den Verfolgungen der     Nazis.
The daughter escaped the persecutions the-gen Nazis
"The daughter escaped the Nazi persecutions."

4 First results (3): Singular vs. plural nominals (b)

N-def    N-num  gen.subj.  gen.obj.  others  gen.subj. : gen.obj.
def      sg           626      7995    4184  1 : 12.8
def      pl           507       188     596  2.7 : 1
indef    sg           593      1857    1568  1 : 3.1
indef    pl           436       153     254  2.8 : 1
null     sg           565      1239    3246  1 : 2.2
(total)  sg          1784     11091    8998  1 : 6.2
(total)  pl           943       341     850  2.8 : 1
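
The ratio column and the totals can be recomputed from the raw counts. The Python sketch below does this for the subject and object counts in the table; the variable and function names are hypothetical, and the rounding to one decimal place is inferred from the figures shown.

```python
# (N-def, N-num) -> (gen.subj., gen.obj.), copied from the table above.
counts = {
    ("def", "sg"): (626, 7995),
    ("def", "pl"): (507, 188),
    ("indef", "sg"): (593, 1857),
    ("indef", "pl"): (436, 153),
    ("null", "sg"): (565, 1239),
}

def ratio(subj, obj):
    """Format subj : obj with the smaller count normalised to 1."""
    if subj >= obj:
        return f"{subj / obj:.1f} : 1"
    return f"1 : {obj / subj:.1f}"

# Sum subject and object counts over definiteness for each number value.
totals = {}
for (_definiteness, number), (subj, obj) in counts.items():
    total_subj, total_obj = totals.get(number, (0, 0))
    totals[number] = (total_subj + subj, total_obj + obj)

for key, (subj, obj) in counts.items():
    print(key, ratio(subj, obj))
for number, (subj, obj) in totals.items():
    print("(total)", number, subj, obj, ratio(subj, obj))
```

The recomputed totals show the contrast directly: object genitives dominate with singular nominals (1 : 6.2), subject genitives with plural nominals (2.8 : 1).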

5 Summary

6 Next steps

enlarge example database