Corpus-driven genitive disambiguation
Nadine Aldinger - Corpus Linguistics 2005, Birmingham, 07/17
1 Nominalizations and genitive attributes
2 Finding and evaluating context parameters for genitive disambiguation
3 Data collection and annotation
3.1 Corpus architecture
3.2 Database layout
3.3 Extraction procedure
3.4 Classification of genitives
4 First results
4.1 Singular vs. plural nominals
5 Summary
6 Next steps
1 Nominalizations and genitive attributes (1)
(1) a.
| Dort | ist | die | Radioaktivität | laut | Messungen |
| there | is | the | radioactivity | according-to | measurements |
| der | Organisation | hundertmal | höher | als | normal. |
| the-gen | organization | a-hundred-times | higher | than | normal |
"According to measurements of/by the organization, the radioactivity level is a hundred times higher there."
measure (organization, obj)
b.
| Der | Satellit | dient | der | Messung | von | Verschiebungen | in | der | Erdkruste. |
| The | satellite | serves | the-dat | measurement | of | shifts | in | the | earth's-crust |
"The satellite serves to measure shifts in the earth's crust."
measure (subj, shifts)
1 Nominalizations and genitive attributes (2)
Context parameters usable for corpus-based disambiguation
- morphosyntactic form of the NP headed by the nominal:
i.e. number and definiteness, maybe case
Messungen (a) vs. die Messung (b)
- syntactic structure of the local context:
inner structure of the nominal's NP, but also embedding in PPs/VPs
PP-laut (a)
- properties of the nominal's base verb:
e.g. telicity, syntactic subcategorization
measure (subj, s-comp(C_daß))
- lexical material in the local context:
e.g. selectional restrictions on the governing verb
dienen "to serve" (b)
2 Finding and evaluating context parameters for genitive disambiguation
- Collect linguistic context parameters that might be relevant for genitive interpretation from literature and from (qualitative) corpus observations.
- Collect corpus sentences containing representative nominalizations plus genitive attributes in a database and annotate their parameter values automatically.
- Annotate genitive interpretation manually for each sentence.
- Analyze frequency distributions to find the parameters and parameter combinations that are most useful to predict genitive interpretation.
- Implement tests based on these combinations.
- Re-run the tests on new corpus sentences and evaluate the quality of their predictions for genitive interpretation; if necessary and possible, improve the tests and look for more diagnostic parameters (bootstrapping).
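The frequency-analysis step above can be sketched as follows. This is a minimal illustration, not the procedure used in the project: the records in `MATCHES` are invented toy data, and the scoring function (`majority_accuracy`, a hypothetical name) simply asks how often the most frequent interpretation per parameter value would be the right guess.

```python
from collections import Counter, defaultdict

# Toy records (assumption, not corpus results): automatically annotated
# parameter values plus the manually assigned genitive interpretation.
MATCHES = [
    {"number": "sg", "definiteness": "def", "interpretation": "obj"},
    {"number": "sg", "definiteness": "def", "interpretation": "obj"},
    {"number": "sg", "definiteness": "indef", "interpretation": "obj"},
    {"number": "sg", "definiteness": "null", "interpretation": "subj"},
    {"number": "pl", "definiteness": "def", "interpretation": "subj"},
    {"number": "pl", "definiteness": "indef", "interpretation": "subj"},
    {"number": "pl", "definiteness": "def", "interpretation": "subj"},
    {"number": "sg", "definiteness": "def", "interpretation": "obj"},
]


def distribution(matches, params):
    """Count interpretations for every combination of values of `params`."""
    table = defaultdict(Counter)
    for match in matches:
        key = tuple(match[p] for p in params)
        table[key][match["interpretation"]] += 1
    return table


def majority_accuracy(matches, params):
    """Share of matches predicted correctly by the majority reading per cell."""
    table = distribution(matches, params)
    correct = sum(max(counts.values()) for counts in table.values())
    return correct / len(matches)


def rank_parameters(matches, candidates):
    """Sort candidate parameter combinations, most predictive first."""
    scored = [(majority_accuracy(matches, params), params) for params in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    candidates = [("number",), ("definiteness",), ("number", "definiteness")]
    for score, params in rank_parameters(MATCHES, candidates):
        print(params, score)
```

A combination of parameters can never score worse than its parts on the data it was counted on, which is why the last step of the procedure evaluates the resulting tests on new corpus sentences.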
3.1 Corpus architecture
German newspaper corpus (Frankfurter Rundschau 1992-93), ~40 million words;
pre-processed:
- tokenized
- part-of-speech tagged (TreeTagger, STTS tagset)
- lemmatized (TreeTagger)
- morphosyntactically annotated (DMOR)
- chunked / partially parsed (YAC)
Tools:
- IMS Corpus Workbench (CWB)
- powerful regular-expression query language (CQP)
- Perl scripting interface
- MySQL database
- PHP/HTML-based database interface
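To make the role of the annotation layers concrete, here is a rough sketch of how a nominal followed by a genitive NP could be located in tagged text. It is written in plain Python rather than CQP, and it is not the project's query: the `Token` class, the `tok` helper, the hand-made case annotations and the pattern itself (nominal, optional genitive specifier, optional genitive adjectives, genitive noun) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    pos: str                      # STTS part-of-speech tag
    lemma: str
    cases: frozenset = frozenset()  # possible cases from morphological analysis


def tok(word, pos, lemma, cases=""):
    """Hypothetical helper: build a token from a space-separated case list."""
    return Token(word, pos, lemma, frozenset(cases.split()))


SPECIFIER_TAGS = {"ART", "PDAT", "PIAT", "PPOSAT"}


def find_genitive_attributes(tokens, nominal_lemmas):
    """Return (nominal index, genitive head index) pairs.

    Pattern: nominal + optional genitive specifier + genitive adjectives
    + genitive noun. At least one specifier or adjective is required so
    that bare noun-noun sequences are not matched.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token.pos != "NN" or token.lemma not in nominal_lemmas:
            continue
        j = i + 1
        if (j < len(tokens) and tokens[j].pos in SPECIFIER_TAGS
                and "gen" in tokens[j].cases):
            j += 1
        while (j < len(tokens) and tokens[j].pos == "ADJA"
               and "gen" in tokens[j].cases):
            j += 1
        if (j > i + 1 and j < len(tokens)
                and tokens[j].pos in {"NN", "NE"}
                and "gen" in tokens[j].cases):
            hits.append((i, j))
    return hits


# Example (2b) with hand-made annotations (assumption, not tagger output).
SENTENCE_2B = [
    tok("Die", "ART", "die", "nom acc"),
    tok("Bodenmessungen", "NN", "Bodenmessung", "nom acc"),
    tok("des", "ART", "die", "gen"),
    tok("städtischen", "ADJA", "städtisch", "gen"),
    tok("Umweltamtes", "NN", "Umweltamt", "gen"),
    tok("ergaben", "VVFIN", "ergeben"),
    tok("katastrophale", "ADJA", "katastrophal", "nom acc"),
    tok("Ergebnisse", "NN", "Ergebnis", "nom acc"),
    tok(".", "$.", "."),
]

if __name__ == "__main__":
    print(find_genitive_attributes(SENTENCE_2B, {"Bodenmessung", "Vermietung"}))
```

The sketch ignores von-PPs and case ambiguity between adjacent tokens; a real query would have to handle both.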
3.2 Database layout (1): Tables
- all nominals in the corpus with frequency values and compound information
- all base verbs of the nominals with complete subcategorization information
- corpus matches of nominal + genitive attribute (in sentence) annotated with context parameters (automatically) and genitive interpretation (manually)
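One possible shape for these three tables is sketched below. The project used MySQL; this sketch swaps in Python's built-in `sqlite3` so that it runs without a server. Table names, column names and the helper functions are assumptions, only a handful of the annotated features are included, and the frames inserted are a small subset chosen for the demo.

```python
import sqlite3

# Hypothetical schema: one table per bullet above.
SCHEMA = """
CREATE TABLE nominals (
    lemma             TEXT PRIMARY KEY,
    corpus_frequency  INTEGER NOT NULL,
    compound_non_head TEXT,
    compound_head     TEXT NOT NULL
);
CREATE TABLE base_verb_frames (
    nominalized_lemma TEXT NOT NULL,
    arg1 TEXT,
    arg2 TEXT,
    arg3 TEXT
);
CREATE TABLE matches (
    id             INTEGER PRIMARY KEY,
    nominal_lemma  TEXT NOT NULL REFERENCES nominals(lemma),
    sentence       TEXT NOT NULL,
    number         TEXT CHECK (number IN ('sg', 'pl')),
    definiteness   TEXT CHECK (definiteness IN ('def', 'indef', 'null')),
    interpretation TEXT  -- manual annotation; NULL until annotated
);
"""


def build_demo_db():
    """Create an in-memory database holding examples (2a) and (2b)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO nominals VALUES (?, ?, ?, ?)",
        [("Vermietung", 107, None, "Vermietung"),
         ("Bodenmessung", 7, "Boden", "Messung")],
    )
    conn.executemany(
        "INSERT INTO base_verb_frames VALUES (?, ?, ?, ?)",
        [("Vermietung", "subj(NP_nom)", "obj(NP_acc)", None),
         ("Vermietung", "subj(NP_nom)", "obj(NP_acc)", "iobj(NP_dat)"),
         ("Messung", "subj(NP_nom)", "obj(NP_acc)", None)],
    )
    conn.executemany(
        "INSERT INTO matches (nominal_lemma, sentence, number, definiteness) "
        "VALUES (?, ?, ?, ?)",
        [("Vermietung",
          "der Gewinn, den sie durch Vermietung ganzer Etagen "
          "an polnische Leiharbeiter erzielt hatten", "sg", "null"),
         ("Bodenmessung",
          "Die Bodenmessungen des städtischen Umweltamtes "
          "ergaben katastrophale Ergebnisse.", "pl", "def")],
    )
    return conn


def frames_for_match(conn, match_id):
    """Look up base-verb frames via the compound head of the match's nominal."""
    return conn.execute(
        """
        SELECT f.arg1, f.arg2, f.arg3
        FROM matches m
        JOIN nominals n ON n.lemma = m.nominal_lemma
        JOIN base_verb_frames f ON f.nominalized_lemma = n.compound_head
        WHERE m.id = ?
        ORDER BY f.rowid
        """,
        (match_id,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(frames_for_match(db, 2))
```

Joining through the compound head lets a compound such as Bodenmessung share the subcategorization frames stored for Messung.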
3.2 Database layout (2): Examples
(2) a.
| der | Gewinn, | den | sie | durch | Vermietung | ganzer | Etagen |
| the | profit | which | they | through | renting | whole-gen | floors-gen |
| an | polnische | Leiharbeiter | erzielt | hatten |
| to | Polish | casual-workers | made | had |
"the profit they had made by renting whole floors to Polish casual workers"
b.
| Die | Bodenmessungen | des | städtischen | Umweltamtes |
| the | soil-measurements | the-gen | municipal-gen | environmental-authority |
| ergaben | katastrophale | Ergebnisse. |
| yielded | disastrous | results |
"The soil measurements of the municipal environmental authority yielded disastrous results."
3.2 Database layout (3): Nominals
| field | values | (2a) | (2b) |
| lemma | string | Vermietung | Bodenmessung |
| corpus frequency | int | 107 | 7 |
| compound non-head | string | - | Boden |
| compound head | string | Vermietung | Messung |
3.2 Database layout (4): Base verbs
| field | values | (2a) | (2b) |
| lemma (nominalized) | string | Vermietung | Messung |
| arg{1,2,3} | string |
1: subj(NP_nom), 2: obj(NP_acc)
1: subj(NP_nom), 2: obj(NP_acc), 3: iobj(NP_dat)
1: subj(NP_nom), 2: obj(NP_acc), 3: obj-pred(PP_als)
1: subj(NP_nom), 2: obj(NP_acc), 3: p-obj(PP_an_acc)
1: subj(NP_nom)
1: subj(NP_nom), 3: p-obj(PP_an_acc)
|
1: subj(NP_nom), 2: arg(PRON_refl-acc)
1: subj(NP_nom), 2: arg(PRON_refl-acc), 3: p-obj(PP_an_dat)
1: subj(NP_nom), 2: arg(PRON_refl-acc), 3: p-obj(PP_mit_dat)
1: subj(NP_nom), 2: obj(NP_acc)
1: subj(NP_nom), 2: obj(NP_acc), 3: p-obj(PP_an_dat)
|
1: subj(NP_nom), 3: corr_pobj(PAV_an_dat)
1: subj(NP_nom), 3: s-comp(C_daß)
1: subj(NP_nom), 3: s-comp(C_ob)
1: subj(NP_nom), 3: s-comp(C_wh)
1: subj(NP_nom), 3: v-comp(VP_zu-inf-perf)
1: subj(NP_nom), 3: v-comp(VP_zu-inf-pres)
|
3.2 Database layout (5): matches and annotated features (a)
| field | values | (2a) | (2b) |
| match sentence | text | ... |
| corpus identifier | int | |
| Features of the nominal and its NP |
| number | sg, pl | sg | pl |
| definiteness | def, indef, null (bare singular) | null | def |
| case | (set of) nom, gen, dat, acc | nom, gen, dat, acc | nom, acc |
| specifier: word | string | - | die |
| specifier: part of speech (STTS tagset) | ART (article: die, eine); PDAT (demonstrative pronoun: diese, ...); PIAT (indefinite pronoun: keine, etwas, ...); PPOSAT (possessive pronoun: seine, ihre, ...); NE (proper noun in genitive case) | - | ART |
| adjectival modifier(s) | string | - | - |
| post-genitival PP: preposition | string | an | - |
| post-genitival PP: case of governed NP | (set of) nom, gen, dat, acc | acc | - |
3.2 Database layout (6): matches and annotated features (b)
| field | values | (2a) | (2b) |
| Features of the genitive NP / von-PP |
| number | sg, pl, von-PP | pl | sg |
| definiteness | def, indef, null (bare singular) | indef | def |
| head lemma | string | Etage | Umweltamt |
| animacy and/or other general lexical properties from GermaNet* | string | place? | institution? |
| Features of embedding context |
| preposition of embedding PP | string | durch | - |
| main verb lemma of the clause in which the nominal's NP/PP is an argument or adjunct* | string | erzielen | ergeben |
| grammatical function of the nominal's NP/PP w.r.t. clause verb* | subject, direct object, adjunct ... | adjunct | subject |
* planned
3.3 Extraction procedure

3.4 Classification of genitives
- subject genitive
die Befürchtung der Gewerkschaft "the fear of the trade union"
- object genitive
die Befürchtung einer Finanzkrise "the fear of a financial crisis"
- subject genitive with non-transitive base verb
intransitive: eine Erkrankung des Herzens "a disease of the heart", base verb erkranken "to fall sick";
inherently (or very preferably) reflexive: die Annäherung des Autos "the approach of the car", base verb sich annähern "to approach"
- other thematic genitives (from indirect, genitive, or prepositional objects)
in Ermangelung guter Alternativen "for lack of good alternatives", base verb ermangeln (dummy subject, genitive object or an-PP) "to lack"
- non-thematic genitives (temporal, modal, with superlative, quantitative)
- genitive modifying the non-head of a compound nominal
die Altersverteilung der Studenten "the age distribution of the students"
4 First results (1)
Candidate parameters determining subject-to-object genitive ratio
- number of the nominal
Verfolgung "pursuing, persecution" (3 subject genitives : 178 object genitives) vs. Verfolgungen "pursuings, persecutions" (2 : 1)
- availability of Object reading (i.e. whether the nominal can refer to a physical or abstract object)
Herstellung "production" (0 : 369) vs. Veranstaltung "event, function" (357 : 14)
- subcategorization properties of the nominal's base verb: most notably for content/informational Objects, i.e. "Object" instances of nominals from proposition-embedding verbs
Äußerung "utterance" (343 : 10)
4 First results (2): Singular vs. plural nominals (a)
(3) a.
| Laut | A. | leben | dort | rund | 45 000 | Menschen, | die | vor | der |
| according-to | A. | live | there | about | 45 000 | people | who | before | the |
| Verfolgung | der | Armee | geflohen | waren. |
| persecution | the-gen | army | fled | were |
"According to A., about 45 000 people live there who had fled the persecution by the army."
b.
| V. | warf | der | Führung | in | Bagdad | die | Verfolgung |
| V. | threw | the-dat | leadership | in | Baghdad | the-acc | persecution |
| aller | Oppositionellen | vor. |
| all-gen | oppositionals | before |
"V. accused the leaders in Baghdad of persecuting all members of the opposition."
c.
| Die | Tochter | entkam | den | Verfolgungen | der | Nazis. |
| The | daughter | escaped | the | persecutions | the-gen | Nazis |
"The daughter escaped the Nazi persecutions."
4 First results (3): Singular vs. plural nominals (b)
| N-def | N-num | gen.subj. | gen.obj. | others | gen.subj. : gen.obj. |
| def | sg | 626 | 7995 | 4184 | 1 : 12.8 |
| def | pl | 507 | 188 | 596 | 2.7 : 1 |
| indef | sg | 593 | 1857 | 1568 | 1 : 3.1 |
| indef | pl | 436 | 153 | 254 | 2.8 : 1 |
| null | sg | 565 | 1239 | 3246 | 1 : 2.2 |
| (total) | sg | 1784 | 11091 | 8998 | 1 : 6.2 |
| (total) | pl | 943 | 341 | 850 | 2.8 : 1 |
- some plural nominals have a high rate of subject genitives compared to their singular form
- some (most?) nominals which occur exclusively with object genitives in singular form have no plural at all
- predominance of Object readings with plural nominals - no "real" subject genitives there
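The ratio column of the table above can be recomputed from the raw counts. The sketch below copies the per-row counts from the table; the function names and the rounding to one decimal place are assumptions made to match the figures as printed.

```python
# (N-def, N-num) -> (gen.subj., gen.obj., others), copied from the table.
COUNTS = {
    ("def", "sg"): (626, 7995, 4184),
    ("def", "pl"): (507, 188, 596),
    ("indef", "sg"): (593, 1857, 1568),
    ("indef", "pl"): (436, 153, 254),
    ("null", "sg"): (565, 1239, 3246),
}


def ratio(subj, obj):
    """Express subj : obj with the smaller side normalized to 1."""
    if subj >= obj:
        return (round(subj / obj, 1), 1)
    return (1, round(obj / subj, 1))


def totals_by_number(counts):
    """Sum the three count columns separately for sg and pl nominals."""
    totals = {}
    for (_definiteness, number), row in counts.items():
        previous = totals.get(number, (0, 0, 0))
        totals[number] = tuple(a + b for a, b in zip(previous, row))
    return totals


if __name__ == "__main__":
    for key, (subj, obj, _others) in COUNTS.items():
        print(key, ratio(subj, obj))
    for number, (subj, obj, _others) in totals_by_number(COUNTS).items():
        print("total", number, ratio(subj, obj))
```

The recomputed totals and ratios agree with the (total) rows, so the singular/plural contrast holds across all definiteness values listed.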
5 Summary
- task: disambiguation of genitive attributes of nominalizations
- approach: derive useful context parameters from large amounts of corpus data
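- proof of concept: number of the nominal (plural nominals favour subject genitives, singular nominals favour object genitives)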
6 Next steps
enlarge example database
- more examples
- more parameters:
  - sortal reading of the nominal
    at least distinguish Process/Event from Result State/Object
  - clausal context, esp. main verb
    requires full or partial parsing or manual annotation/checking
  - thematic proto-roles of base verbs
    requires manual annotation
  - ontological information
    for genitive attributes, e.g. strong preference for animate subject genitives
    for embedding verbs and prepositions, e.g. spatial verb -> Object reading