###############
## DErivBase ##
###############
This documentation file
* explains what DErivBase is
* gives details about the resource's file formats and versions (changelogs)
* provides information about the derivation rules with which DErivBase was
created and which are also shipped.
In case you wish to access an earlier version of DErivBase, please send a request
to: zeller at cl dot uni-heidelberg dot de.
+++++++++++++++++++++
++ About DErivBase ++
+++++++++++++++++++++
DErivBase is a large-coverage derivational resource for German. It consists of
derivational families, which are defined as equivalence classes of lemmas (nouns,
verbs, and adjectives). The lemmas of one family are derivationally related among
each other. They were extracted from the sdeWaC corpus; lemmatization and POS
tagging were done with TreeTagger, further morphological analysis with the MATE
tools. The resource was built with hand-written derivation rules, which use string
transformation functions to map base lemmas onto derived lemmas.
Since v2.0, DErivBase is additionally semantically refined. That is, derivational
families are clustered according to semantically coherent sub-families (for details,
see [2]). For instance, the family with 10 members:
Anbaggern_Nn Ausbaggern_Nn Abbaggern_Nn Bagger_Nm Baggern_Nn
baggern_V anbaggern_V aufbaggern_V ausbaggern_V abbaggern_V
is split into the following three clusters, which are semantically coherent:
1. abbaggern_V Abbaggern_Nn
2. ausbaggern_V Ausbaggern_Nn baggern_V Baggern_Nn aufbaggern_V Bagger_Nm
3. anbaggern_V Anbaggern_Nn
This clustering is achieved as follows: A supervised machine learning model
classifies each pair of derivationally related lemmas as semantically related
or not. Then, we use hierarchical agglomerative clustering to turn these
pairwise decisions into complete clusters within a family (the clustering
threshold was optimised for F1).
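The step from pairwise decisions to clusters can be sketched as follows. This is
only an illustration, not the procedure used to build DErivBase: the linkage
criterion (average linkage here), the threshold value (0.5) and the pairwise
scores are all assumptions made up for the example; see [2] for the actual setup.

```python
from itertools import combinations

def agglomerative_clusters(items, similarity, threshold):
    """Average-linkage agglomerative clustering (linkage type is an assumption).

    `similarity` maps frozenset({a, b}) to a pairwise score; merging stops as
    soon as the best remaining merge scores below `threshold`.
    """
    clusters = [[item] for item in items]

    def linkage(a, b):
        scores = [similarity[frozenset((x, y))] for x in a for y in b]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Pick the two clusters with the highest average pairwise score.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations() guarantees i < j
    return clusters

# Hypothetical scores for four lemmas of the "baggern" family (invented values).
lemmas = ["anbaggern_V", "Anbaggern_Nn", "abbaggern_V", "Abbaggern_Nn"]
scores = {
    frozenset(("anbaggern_V", "Anbaggern_Nn")): 0.9,
    frozenset(("abbaggern_V", "Abbaggern_Nn")): 0.9,
    frozenset(("anbaggern_V", "abbaggern_V")): 0.1,
    frozenset(("anbaggern_V", "Abbaggern_Nn")): 0.1,
    frozenset(("Anbaggern_Nn", "abbaggern_V")): 0.1,
    frozenset(("Anbaggern_Nn", "Abbaggern_Nn")): 0.1,
}
result = agglomerative_clusters(lemmas, scores, threshold=0.5)
```

With these invented scores, the four lemmas end up in two clusters, one per
prefix verb and its nominalisation.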
DErivBase v2.0 covers 280,336 lemmas; 65,420 of them are grouped into 20,371
non-singleton families (i.e., 214,916 are singleton families).
Each pair of words in a family is connected by a path of derivation rules.
There is a weak negative correlation between derivation path length and relatedness
of the connected lemmas. Thus, we assume that two lemmas from the same
derivational family are more closely related, the fewer rules are necessary to
connect them on the shortest path. This can be captured by, e.g., assigning a pair
the weight 1/n, where n is the length of the shortest path between them.
Such path weights can then be employed when applying DErivBase, e.g., for measuring
the semantic similarity of two lemmas.
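A minimal sketch of this weighting, assuming a family is given as a list of
single-rule links between lemmas. The function names are hypothetical; the links
are the ones of the "Aal" family shown in the "rule paths" section of this file,
and they are treated as undirected because rules can also be applied inversely.

```python
from collections import deque

def shortest_path_length(edges, source, target):
    """Breadth-first search over an undirected graph of single-rule links."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        lemma, dist = queue.popleft()
        if lemma == target:
            return dist
        for nxt in neighbours.get(lemma, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected, i.e. not in the same family

def path_weight(edges, source, target):
    """Weight 1/n for shortest path length n; None if n is 0 or undefined."""
    n = shortest_path_length(edges, source, target)
    return None if not n else 1.0 / n

edges = [
    ("Aal_Nn", "aalen_V"),
    ("aalen_V", "Aalen_Nn"),
    ("Aalen_Nn", "Aalener_Nm"),
]
```

For instance, `path_weight(edges, "Aal_Nn", "Aalen_Nn")` gives 0.5, because the
two lemmas are two rule applications apart.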
++++++++++++++++++
++ File formats ++
++++++++++++++++++
The resource is available in three file formats:
--------------------------------
-- Derivational families only --
--------------------------------
The file "DErivBase-v*.txt" contains information about derivational families, but
not about rules. Each line of the file contains one derivational family, where the
order of the individual lemmas is not relevant. For each lemma, a suffix specifies
its part-of-speech (_V for verb, _Nf for feminine noun, _Nm for masculine noun,
_Nn for neuter noun, _N for gender-unspecific noun, _A for adjective). The families
are sorted by size before semantic clustering (i.e., according to the size in v1.4.1)
in descending order, where same-size families were ordered alphabetically. The
semantic clusters within the same derivational family are in random order (i.e.,
neither cluster size nor alphabetical order matters).
Example for a derivational family containing 10 words:
Verschmelzung_Nf verschmelzend_A schmelzend_A Verschmelzen_Nn Schmelzen_Nn
Schmelze_Nf Schmelz_Nm umschmelzen_V verschmelzen_V schmelzen_V
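Reading this format could look roughly as follows. The sketch assumes that the
lemmas of a family are separated by whitespace and that the part-of-speech suffix
follows the last underscore; `parse_family_line` is a hypothetical helper name,
and the sketch does not attempt to recover the semantic clusters. The example
above is wrapped for readability; in the file, a family occupies a single line.

```python
def parse_family_line(line):
    """Split one family line into (lemma, pos) tuples."""
    members = []
    for token in line.split():
        # Split at the LAST underscore, in case a lemma itself contains one.
        lemma, _, pos = token.rpartition("_")
        members.append((lemma, pos))
    return members

line = ("Verschmelzung_Nf verschmelzend_A schmelzend_A Verschmelzen_Nn "
        "Schmelzen_Nn Schmelze_Nf Schmelz_Nm umschmelzen_V verschmelzen_V "
        "schmelzen_V")
family = parse_family_line(line)
```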
----------------------------------------------------
-- Derivational families with rule paths per pair --
----------------------------------------------------
The file "DErivBase-v*-rulePaths.txt" contains the derivational families enriched
with information about which derivation rules connect each lemma pair within the
same derivational family, and how long the derivation path between the lemma pair
is (see explanation above).
In order to easily access the shortest rule path for all possible lemma pairs,
this format contains *ALL* lemma pairs of a derivational family, and their
connection by the shortest possible derivation path between them.
This format also includes the corresponding length of the connecting rule path.
Each lemma pair and its derivation path is presented in a single row. That is, for
a derivational family of n members, there are "n choose 2" (= n*(n-1)/2) rows for
this family in the "rulePaths" format.
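The row count can be checked with a few lines of Python. The family used here is
the four-member one from the example that follows; `family_pairs` is a
hypothetical helper, and the order of the generated pairs is not meant to match
the ordering used in the file.

```python
from itertools import combinations
from math import comb

def family_pairs(members):
    """All unordered lemma pairs of a family, one per row of the format."""
    return list(combinations(members, 2))

family = ["Aalen_Nn", "Aalener_Nm", "Aal_Nn", "aalen_V"]
pairs = family_pairs(family)
n = len(family)
```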
For example, the following derivational family with four members:
Aalen_Nn Aalener_Nm Aal_Nn aalen_V
is shown in the following six lines (4*3/2):
Aal_Nn Aalen_Nn 2 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn
Aal_Nn aalen_V 1 Aal_Nn dNV09> aalen_V
Aal_Nn Aalener_Nm 3 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn dNN05> Aalener_Nm
Aalen_Nn aalen_V 1 Aalen_Nn dNV09> aalen_V
Aalen_Nn Aalener_Nm 1 Aalen_Nn dNN05> Aalener_Nm
aalen_V Aalener_Nm 2 aalen_V dVN09> Aalen_Nn dNN05> Aalener_Nm
where
* the first two lemmas are the actual lemma pair
* the number after the lemma pair indicates the (shortest) path length between
the very first and the very last lemma of this line
* the derivation path itself follows the path length; each two adjacent lemmas on
the path are connected by exactly one derivation rule
The derivation rules, in turn, correspond to the following pattern:
* they always start with "d"
* they always end with ">"
* two capital letters indicate the input and output part-of-speech, e.g., "NV" means
that the input lemma of the rule is a noun, and the output lemma is a verb
* rules contain a number, which is just an ID number for this part-of-speech pair
("05") and optionally, a sub-number separated by a "."; e.g. "05.1". The sub-number
range is [1,3]; it is technically necessary to indicate that the input lemma's
gender (for nouns) or grammatical suffix (for verbs) is the same as that of the
output lemma
* an optional asterisk "*" indicates that the rule is applied inversely; e.g.,
"behalten_V dNV01*> Halt_Nm", as opposed to "Halt_Nm dNV01> behalten_V"
Please note that the path weight proposed above can be trivially computed from the
path length n indicated in each line, i.e., 1/n.
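The line layout and the rule name pattern described above can be read with a
sketch like the following. It assumes whitespace-separated tokens and assumes
that an optional sub-number precedes the optional asterisk; the README does not
state this order. `parse_rule` and `parse_rule_path_line` are hypothetical names.

```python
import re

# Assumed shape: "d", two POS letters, an ID, optional ".sub", optional "*", ">".
RULE_RE = re.compile(
    r"^d(?P<src>[NVA])(?P<dst>[NVA])(?P<id>\d+)"
    r"(?:\.(?P<sub>[1-3]))?(?P<inverse>\*)?>$"
)

def parse_rule(token):
    """Decompose a rule token such as 'dNV09>' or 'dNV01*>'."""
    match = RULE_RE.match(token)
    if match is None:
        raise ValueError(f"not a rule name: {token!r}")
    return {
        "source_pos": match["src"],
        "target_pos": match["dst"],
        "rule_id": match["id"],
        "sub_id": match["sub"],            # None if there is no sub-number
        "inverse": match["inverse"] is not None,
    }

def parse_rule_path_line(line):
    """Parse one row: lemma pair, path length, then lemma/rule alternation."""
    tokens = line.split()
    first, second, length = tokens[0], tokens[1], int(tokens[2])
    path = tokens[3:]
    lemmas, rules = path[0::2], path[1::2]
    if len(rules) != length or len(lemmas) != length + 1:
        raise ValueError(f"path does not match stated length: {line!r}")
    return {
        "pair": (first, second),
        "length": length,
        "weight": 1.0 / length,            # path weight 1/n
        "lemmas": lemmas,
        "rules": [parse_rule(rule) for rule in rules],
    }

row = parse_rule_path_line(
    "Aal_Nn Aalener_Nm 3 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn dNN05> Aalener_Nm"
)
```

The length check is a cheap sanity test: a path of length n must contain n rules
and n+1 lemmas.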
For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending with 'en',
_Veln for verb ending with 'eln', _Vern for verb ending with 'ern', _Nf for feminine
noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-unspecific noun,
_A for adjective). The families are sorted by size before semantic clustering (i.e.,
according to the size in v1.4.1) in descending order, where same-size families were
ordered alphabetically. Within a family, the pairs as well as the lemmas in a pair
are ordered in alphabetic order according to the German alphabet (i.e., ä=a,
ö=o, ü=u, ß=ss).
-------------------------------------------------------
-- Derivational families with probabilities per pair --
-------------------------------------------------------
The file "DErivBase-v2*-probabilities.txt" contains the derivational families
enriched with the probability score produced by the machine learning classifier
for semantic validation (see [2]).
This format contains *ALL* lemma pairs of a derivational and semantically
validated family, and their respective probability score. Each lemma pair and
its probability score is presented in a single row. That is, for a derivational
family of n members, there are "n choose 2" (= n*(n-1)/2) rows for this family
in the "probabilities" format.
For example, the derivational family with 3 members:
Wasserchemie_Nf Wasserchemiker_Nm wasserchemisch_A
is shown in the following 3 lines:
Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885
Wasserchemie_Nf wasserchemisch_A 0.981155397119
Wasserchemiker_Nm wasserchemisch_A 0.955528623541
Please note that this probability score can be used as a reliability weight
for the respective lemma pair.
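A minimal sketch for loading these scores, assuming three whitespace-separated
fields per row. `load_probabilities` is a hypothetical helper; pairs are stored
as unordered sets so that lookups do not depend on the lemma order in the file.

```python
def load_probabilities(lines):
    """Map each unordered lemma pair to its probability score."""
    scores = {}
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        first, second, score = line.split()
        scores[frozenset((first, second))] = float(score)
    return scores

rows = [
    "Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885",
    "Wasserchemie_Nf wasserchemisch_A 0.981155397119",
    "Wasserchemiker_Nm wasserchemisch_A 0.955528623541",
]
scores = load_probabilities(rows)
```

To read an actual file, the same function can be called on an open file object,
since iterating over it yields its lines.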
For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending with 'en',
_Veln for verb ending with 'eln', _Vern for verb ending with 'ern', _Nf for feminine
noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-unspecific noun,
_A for adjective). The families are sorted by size before clustering (i.e., according
to the size in v1.4.1) in descending order, where same-size families were ordered
alphabetically. Within a family, the pairs are ordered in alphabetic order
according to the German alphabet (i.e., ä=a, ö=o, ü=u, ß=ss).
+++++++++++++++
++ Changelog ++
+++++++++++++++
----------
-- v2.0 --
----------
DErivBase v2.0 is a semantically refined version of v1.4.1, i.e., derivational
families from v1.4.1 are split according to semantic coherence. Therefore, the
size of many derivational families has changed.
The rule set used for v1.4.1 remained unchanged.
Additionally, there is a new data format: "Derivational families with probabilities
per pair", as explained above.
(October, 2014)
------------
-- v1.4.1 --
------------
Exactly the same as DErivBase v1.4, but with two duplicate lemmas removed
(Einstelle_Nf, Einsteller_Nm).
This change only applies to the "DErivBase-v1.4.1.txt" format file; all other files
are exactly as in version 1.4.
(May, 2014)
----------
-- v1.4 --
----------
DErivBase v1.4 builds upon v1.2, which was constructed as explained in Zeller et
al. (2013).
Version 1.4 underwent three major changes:
- addition of 109 derivation rules compared to v1.2; in total 267 rules: This version
of the resource tries to cover almost all (and surely all productive) possible
German derivation processes. This involves especially more prefixation rules, but also
some more suffixation rules.
- manual post-processing: refactoring of the 20 biggest derivational families: see
changelog v1.3
- new data format: Earlier DErivBase versions were available in the "Derivational
families only" format, and in the "Derivational families with pairwise weighted path
length" format, which was defined as follows:
--------------------------------------------------------------
-- Derivational families with pairwise weighted path length --
--------------------------------------------------------------
The file "DErivBase-v*-pathLength.txt" contains the derivational families
enriched with information about the length of the derivation path between
each pair of lemmas of the same family (see explanation above).
In order to easily access the path weights for all possible lemma pairs,
each lemma and its corresponding family members including their pairwise path
weights are presented in a single row. For example, the following derivational
family with four members:
Aalener_Nm Aalen_Nn aalen_V Aal_Nn
is shown in four successive lines:
Aalener_Nm: Aalen_Nn 1.00 aalen_V 0.50 Aal_Nn 0.33
Aalen_Nn: Aalener_Nm 1.00 aalen_V 1.00 Aal_Nn 0.50
aalen_V: Aal_Nn 1.00 Aalen_Nn 1.00 Aalener_Nm 0.50
Aal_Nn: aalen_V 1.00 Aalen_Nn 0.50 Aalener_Nm 0.33
where a number indicates the path weight between the lemma at the beginning
of the line (in front of the colon), and the lemma directly preceding the
number.
For each lemma, a suffix specifies its part-of-speech (_V for verb, _Nf for
feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-
unspecific noun, _A for adjective). The families are sorted in alphabetic
order; the lines within one family are again ordered alphabetically regarding
the first lemma in the line.
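For users who still work with this older format, a row could be read as sketched
below. The sketch assumes that the head lemma is followed by a colon and that
lemmas and weights then alternate; `parse_path_length_line` is a hypothetical
helper name.

```python
def parse_path_length_line(line):
    """Return the head lemma and a dict of its family members' path weights."""
    head, _, rest = line.partition(":")
    tokens = rest.split()
    weights = {
        lemma: float(weight)
        for lemma, weight in zip(tokens[0::2], tokens[1::2])
    }
    return head.strip(), weights

head, weights = parse_path_length_line(
    "Aalener_Nm: Aalen_Nn 1.00 aalen_V 0.50 Aal_Nn 0.33"
)
```

The weights in the example are 1/n rounded to two decimals, e.g. 0.33 for a path
of length 3.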
Now, we provide DErivBase in the "Derivational families only" and the "Derivational
families with rule paths per pair" formats (see above); the latter is more
informative than the former path length format.
Additionally, we revised the gold standard annotation presented in the paper [1],
because we encountered some annotation errors.
Version 1.4 consists of 267 rules and 219,214 derivational families, of which 17,314
are non-singletons. On the gold standard described in the paper [1], it achieves a
precision of 85% and a recall of 87% (+16 percentage points compared to v1.3). On the
revised gold standard, it achieves a precision of 85% and a recall of 91%.
----------
-- v1.3 --
----------
DErivBase v1.3 builds upon v1.2, which was constructed as explained in Zeller et
al. (2013). Version 1.3 underwent the following rule refinements and
post-processing steps subsequent to the final analysis described in the paper [1]:
- added two derivational rules (dNN28, dVA14): close some remaining coverage gaps
- refined one derivational rule (dVA13): Cover additional optional patterns
- manual post-processing: refactored the 20 biggest derivational families in v1.2:
Since the biggest families incorrectly aggregated several derivational families,
we split them up into their correct families. In case a corresponding family already
existed, we added the lemma(s) of the big family to the corresponding correct family.
NOTE: this manual split is done *only* for the "DErivBase-v*.txt" file format, since
the manual split would lead to broken derivation chains in formats describing the
rule path / chain length.
Version 1.3 consists of 160 rules and 239,796 derivational families, thereof 17,863
non-singletons. On the gold standard described in the paper [1], it achieves a precision
of 83% and a recall of 71%.
----------
-- v1.2 --
----------
Version 1.2 is the version of DErivBase which was evaluated as model "DErivBase-L123"
in Zeller et al. (2013). It consists of 158 rules and 240,237 derivational families,
thereof 17,764 non-singletons. On the gold standard described in the paper [1], it
achieves a precision of 83% and a recall of 71%.
++++++++++++++++++++++
++ Derivation rules ++
++++++++++++++++++++++
The file "DErivBase-v*-rules.txt" contains the derivation rules with which the
resource was built. The following information is given:
* One or more German derivation examples in a line beginning with "--"
* A unique name assigned to each rule, which indicates the parts-of-speech
involved
* The derivation rule itself
An exemplary rule description is:
-- Kunst -> künstlich, Herr -> herrlich, Glück -> glücklich
dNA01
(sfx "lich" & try uml) nouns adjectives
where "Kunst -> künstlich" (art -> artificial) is one derivation example, the
rule name "dNA01" indicates that this is a derivation rule which maps a noun into
an adjective, and the derivation rule itself transforms a noun into an adjective
by performing an umlaut shift within the noun (u -> ü) whenever possible (which
is the case for the vowels a, o, u) and obligatorily adding the suffix "lich".
This file should enable interested users to understand which derivation rules are
considered in DErivBase.
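To make the effect of the example rule concrete, here is a rough Python
re-implementation of dNA01. It is not the rule engine used to build DErivBase:
the assumptions are that "try uml" umlauts the last a, o or u of the noun (with
"au" becoming "äu") and leaves the noun unchanged if there is none, and that the
resulting adjective is lower-cased. `try_umlaut` and `apply_dNA01` are
hypothetical names.

```python
UMLAUT = {"a": "ä", "o": "ö", "u": "ü"}

def try_umlaut(stem):
    """Umlaut the last a/o/u of the stem if there is one (assumed semantics)."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            # The diphthong "au" is umlauted as "äu", so shift to the "a".
            if stem[i] == "u" and i > 0 and stem[i - 1] == "a":
                i -= 1
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem  # nothing to umlaut, e.g. "herr" or "glück"

def apply_dNA01(noun):
    """Noun -> adjective: optional umlaut shift plus obligatory suffix 'lich'."""
    return try_umlaut(noun.lower()) + "lich"

examples = [apply_dNA01(noun) for noun in ("Kunst", "Herr", "Glück")]
```

The three nouns from the rule description are mapped to the adjectives listed
there. A string transformation like this can overgenerate, so the output of a
rule is only useful when checked against the lemmas attested in the corpus.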
---
[1] Zeller, B., Šnajder, J., and Padó, S. (2013):
DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German.
Proceedings of ACL 2013, Sofia, Bulgaria.
[2] Zeller, B., Padó, S., and Šnajder, J. (2014):
Towards semantic validation of a derivational lexicon.
Proceedings of COLING 2014, Dublin, Ireland.