###############
## DErivBase ##
###############

This documentation file 
* explains what DErivBase is
* gives details about the resource's file formats and versions (changelogs)
* provides information about the derivation rules with which DErivBase was 
  created and which are also shipped.

In case you wish to access an earlier version of DErivBase, please send a request 
to: zeller at cl dot uni-heidelberg dot de.


+++++++++++++++++++++
++ About DErivBase ++
+++++++++++++++++++++

DErivBase is a large-coverage derivational resource for German. It consists of 
derivational families, which are defined as equivalence classes of lemmas (nouns,
verbs, and adjectives). The lemmas of one family are derivationally related to 
one another. They were extracted from the sdeWaC corpus; lemmatization and POS 
tagging were done with TreeTagger, further morphological analysis with the MATE 
tools. The resource was built with hand-written derivation rules, which use string 
transformation functions to map base lemmas onto derived lemmas. 

Since v2.0, DErivBase has additionally been semantically refined. That is, derivational
families are split into semantically coherent sub-families, or clusters (for details,
see [2]). For instance, the family with 10 members:
  Anbaggern_Nn Ausbaggern_Nn Abbaggern_Nn Bagger_Nm Baggern_Nn 
  baggern_V anbaggern_V aufbaggern_V ausbaggern_V abbaggern_V
is split into the following three clusters, which are semantically coherent:
  1. abbaggern_V Abbaggern_Nn
  2. ausbaggern_V Ausbaggern_Nn baggern_V Baggern_Nn aufbaggern_V Bagger_Nm
  3. anbaggern_V Anbaggern_Nn

This clustering is achieved as follows: A supervised machine learning model 
classifies each pair of derivationally related lemmas as semantically related 
or not. Then, we use hierarchical agglomerative clustering to transfer the 
pairwise decisions to complete clusters within a family (the threshold for 
clustering was optimised for F1).
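
The clustering step can be illustrated with a minimal sketch. It is not the 
implementation used to build DErivBase: the average-linkage criterion, the 
threshold value, and all pair scores below are assumptions made up for 
illustration (see [2] for the actual setup).

```python
from itertools import combinations


def cluster_family(lemmas, score, threshold):
    """Agglomerative clustering of one family (average linkage is an assumption).

    `score(a, b)` returns a pairwise relatedness score; merging stops as soon
    as the best available merge falls below `threshold`.
    """
    clusters = [[lemma] for lemma in lemmas]

    def linkage(a, b):
        # Average pairwise score between two clusters.
        pairs = [(x, y) for x in a for y in b]
        return sum(score(x, y) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # i < j always holds, so deleting index j keeps index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented toy scores (NOT taken from the resource).
toy_scores = {
    frozenset(("anbaggern_V", "Anbaggern_Nn")): 0.9,
    frozenset(("baggern_V", "Bagger_Nm")): 0.9,
}


def toy_score(a, b):
    return toy_scores.get(frozenset((a, b)), 0.1)


clusters = cluster_family(
    ["anbaggern_V", "Anbaggern_Nn", "baggern_V", "Bagger_Nm"],
    toy_score,
    threshold=0.5,
)
```

With these toy scores, the four lemmas end up in two clusters, one per 
high-scoring pair.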


DErivBase v2.0 covers 280,336 lemmas; 65,420 of them are grouped into 20,371 
non-singleton families (i.e., 214,916 are singleton families). 


Each pair of lemmas in a family is connected by a path of derivation rules. 
There is a weak negative correlation between derivation path length and relatedness 
of the connected lemmas. Thus, we assume that two lemmas from the same 
derivational family are the more closely related, the fewer rules are necessary to 
connect them on the shortest path. This can be captured by, e.g., assigning a pair 
the weight 1/n, where n is the length of the shortest path between them. 
Path weights can then be employed when applying DErivBase, e.g. for measuring 
the semantic similarity of two lemmas. 
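
As a small illustration of the proposed weighting, the following sketch maps 
a shortest-path length n to the weight 1/n. The function name is hypothetical 
and not part of the resource.

```python
def path_weight(path_length: int) -> float:
    """Return the weight 1/n for a shortest derivation path of length n."""
    if path_length < 1:
        raise ValueError("path length must be a positive integer")
    return 1.0 / path_length


# Directly related lemmas get weight 1.0; longer paths get smaller weights.
weights = [round(path_weight(n), 2) for n in (1, 2, 3)]
```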


++++++++++++++++++
++ File formats ++
++++++++++++++++++

The resource is available in three file formats:

--------------------------------
-- Derivational families only -- 
--------------------------------


The file "DErivBase-v*.txt" contains information about derivational families, but 
not about rules. Each line of the file contains one derivational family, where the 
order of the individual lemmas is not relevant. For each lemma, a suffix specifies 
its part-of-speech (_V for verb, _Nf for feminine noun, _Nm for masculine noun, 
_Nn for neuter noun, _N for gender-unspecific noun, _A for adjective). The families 
are sorted by size before semantic clustering (i.e., according to the size in v1.4.1) 
in descending order, where same-size families were ordered alphabetically. The
semantic clusters within the same derivational family are in random order (i.e., 
neither cluster size nor alphabetical order matters).

Example for a derivational family containing 10 words:
   Verschmelzung_Nf verschmelzend_A schmelzend_A Verschmelzen_Nn Schmelzen_Nn 
   Schmelze_Nf Schmelz_Nm umschmelzen_V verschmelzen_V schmelzen_V
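
A line of this format could be read roughly as follows. This is a sketch under 
the assumption that lemmas are separated by whitespace, as in the example above, 
and that the part-of-speech suffix follows the last underscore; the function 
names are hypothetical.

```python
def split_lemma(token: str) -> tuple[str, str]:
    """Split e.g. 'Schmelze_Nf' into ('Schmelze', 'Nf') at the last underscore."""
    lemma, _, pos = token.rpartition("_")
    return lemma, pos


def parse_family(line: str) -> list[tuple[str, str]]:
    """Parse one whitespace-separated family line into (lemma, pos) tuples."""
    return [split_lemma(token) for token in line.split()]


family = parse_family("Verschmelzung_Nf verschmelzend_A schmelzen_V")
```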


----------------------------------------------------
-- Derivational families with rule paths per pair -- 
----------------------------------------------------
    
The file "DErivBase-v*-rulePaths.txt" contains the derivational families enriched 
with information about which derivation rules connect each lemma pair within the
same derivational family, and how long the derivation path between the lemma pair 
is (see explanation above).

In order to easily access the shortest rule path for all possible lemma pairs, 
this format contains *ALL* lemma pairs of a derivational family, and their 
connection by the shortest possible derivation path between them.
This format also includes the corresponding length of the connecting rule path. 
Each lemma pair and its derivation path are presented in a single row. That is, for 
a derivational family of n members, there are "n choose 2" (= n*(n-1)/2) rows for 
this family in the "rulePaths" format.
For example, the following derivational family with four members:
   Aalen_Nn Aalener_Nm Aal_Nn aalen_V 
is shown in the following six lines (4*3/2):
  Aal_Nn Aalen_Nn 2 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn
  Aal_Nn aalen_V 1 Aal_Nn dNV09> aalen_V
  Aal_Nn Aalener_Nm 3 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn dNN05> Aalener_Nm
  Aalen_Nn aalen_V 1 Aalen_Nn dNV09> aalen_V
  Aalen_Nn Aalener_Nm 1 Aalen_Nn dNN05> Aalener_Nm
  aalen_V Aalener_Nm 2 aalen_V dVN09> Aalen_Nn dNN05> Aalener_Nm
where 
* the first two lemmas are the actual lemma pair
* the number after the lemma pair indicates the (shortest) path length between
  the very first and the very last lemma of this line
* the derivation path itself is given after the path length; each two adjacent
  lemmas on this path are connected by exactly one derivation rule

The derivation rules, in turn, correspond to the following pattern:
* they always start with "d"
* they always end with ">"
* two capital letters indicate the input and output part-of-speech, e.g., "NV" means
  that the input lemma of the rule is a noun, and the output lemma is a verb
* a number serves as the rule ID within this part-of-speech pair (e.g., "05"); it 
  is optionally followed by a sub-number separated by a ".", e.g., "05.1". The 
  sub-number range is [1,3]; it is technically necessary to indicate that the input 
  lemma's gender (for nouns) or grammatical suffix (for verbs) is the same as that 
  of the output lemma
* an optional asterisk "*" indicates that the rule is applied inversely; e.g., 
  "behalten_V dNV01*> Halt_Nm", as opposed to "Halt_Nm dNV01> behalten_V"

Please note that the path weight proposed above can be trivially computed from the 
path length n indicated in each line, i.e., 1/n.
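
The rule name pattern listed above can be checked with a regular expression. 
The following is a sketch only: the function name is hypothetical, and the 
position of the asterisk after an optional sub-number is an assumption, since 
the examples in this file do not show both together.

```python
import re

# d + input POS + output POS + ID + optional ".sub" + optional "*" + ">"
RULE_NAME = re.compile(
    r"^d(?P<src>[A-Z])(?P<dst>[A-Z])"
    r"(?P<id>\d+)(?:\.(?P<sub>[1-3]))?"
    r"(?P<inverse>\*)?>$"
)


def parse_rule_name(name: str) -> dict:
    """Decompose a rule name such as 'dNV01*>' into its components."""
    match = RULE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a rule name: {name!r}")
    return {
        "input_pos": match["src"],
        "output_pos": match["dst"],
        "rule_id": match["id"],
        "sub_id": match["sub"],  # None if the rule has no sub-number
        "inverse": match["inverse"] is not None,
    }


rule = parse_rule_name("dNV01*>")
```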

For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending with 'en',
_Veln for verb ending with 'eln', _Vern for verb ending with 'ern', _Nf for feminine 
noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-unspecific noun, 
_A for adjective). The families are sorted by size before semantic clustering (i.e., 
according to the size in v1.4.1) in descending order, where same-size families were 
ordered alphabetically. Within a family, the pairs as well as the lemmas in a pair 
are ordered in alphabetic order according to the German alphabet (i.e., ä=a, 
ö=o, ü=u, ß=ss).
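
The ordering convention can be approximated with a sort key that folds umlauts 
and ß as stated above. Ignoring case is an assumption inferred from the example 
rows (where "Aalen_Nn" precedes "aalen_V", which precedes "Aalener_Nm"); the 
function name is hypothetical.

```python
# Fold umlauts and sharp s as stated: ä=a, ö=o, ü=u, ß=ss.
_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def german_sort_key(lemma: str) -> str:
    """Case-insensitive key (assumption) with umlauts and ß folded."""
    return lemma.lower().translate(_FOLD)


ordered = sorted(
    ["aalen_V", "Aalener_Nm", "Aal_Nn", "Aalen_Nn"],
    key=german_sort_key,
)
```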


-------------------------------------------------------
-- Derivational families with probabilities per pair -- 
-------------------------------------------------------
    
The file "DErivBase-v2*-probabilities.txt" contains the derivational families 
enriched with the probability score produced by the machine learning classifier 
for semantic validation (see [2]).

This format contains *ALL* lemma pairs of a derivational and semantically
validated family, and their respective probability score. Each lemma pair and 
its probability score are presented in a single row. That is, for a derivational 
family of n members, there are "n choose 2" (= n*(n-1)/2) rows for this family 
in the "probabilities" format.
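
The row count per family can be verified with a short sketch that enumerates 
all unordered lemma pairs of a family. It uses only the Python standard 
library; the variable names are chosen for illustration.

```python
from itertools import combinations
from math import comb

family = ["Wasserchemie_Nf", "Wasserchemiker_Nm", "wasserchemisch_A"]

# All unordered pairs of family members: one row per pair.
pairs = list(combinations(family, 2))

# "n choose 2" = n*(n-1)/2
n = len(family)
expected_rows = n * (n - 1) // 2
```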

For example, the derivational family with 3 members:
  Wasserchemie_Nf Wasserchemiker_Nm wasserchemisch_A
is shown in the following 3 lines:
  Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885
  Wasserchemie_Nf wasserchemisch_A 0.981155397119
  Wasserchemiker_Nm wasserchemisch_A 0.955528623541

Please note that this probability score can be used as a reliability weight 
for the respective lemma pair.
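
Reading this format into a lookup table could look as follows. This is a 
sketch assuming three whitespace-separated fields per row, as in the example 
above; the function name is hypothetical, and unordered pairs are used as keys 
so that the lookup does not depend on lemma order.

```python
def load_pair_scores(lines) -> dict:
    """Map each unordered lemma pair to its probability score."""
    scores = {}
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        first, second, probability = line.split()
        scores[frozenset((first, second))] = float(probability)
    return scores


scores = load_pair_scores([
    "Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885",
    "Wasserchemie_Nf wasserchemisch_A 0.981155397119",
    "Wasserchemiker_Nm wasserchemisch_A 0.955528623541",
])
weight = scores[frozenset(("wasserchemisch_A", "Wasserchemie_Nf"))]
```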

For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending with 'en',
_Veln for verb ending with 'eln', _Vern for verb ending with 'ern', _Nf for feminine 
noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-unspecific noun, 
_A for adjective). The families are sorted by size before clustering (i.e., according
to the size in v1.4.1) in descending order, where same-size families were ordered 
alphabetically. Within a family, the pairs are ordered in alphabetic order 
according to the German alphabet (i.e., ä=a, ö=o, ü=u, ß=ss).


+++++++++++++++
++ Changelog ++
+++++++++++++++

----------
-- v2.0 --
----------

DErivBase v2.0 is a semantically refined version of v1.4.1, i.e., derivational
families from v1.4.1 are split according to semantic coherence. Therefore, the 
size of many derivational families has changed. 

The rule set used for v1.4.1 remained unchanged.

Additionally, there is a new data format: "Derivational families with probabilities 
per pair", as explained above.

(October, 2014)


------------
-- v1.4.1 --
------------

Exactly the same as DErivBase v1.4, except that two duplicate lemmas (Einstelle_Nf, 
Einsteller_Nm) were removed.
This change only applies to the "DErivBase-v1.4.1.txt" format file; all other files 
are exactly as in version 1.4.

(May, 2014)


----------
-- v1.4 --
----------

DErivBase v1.4 builds upon v1.2, which was constructed as explained in Zeller et 
al. (2013). 
Version 1.4 underwent three major changes:
- addition of 109 derivation rules compared to v1.2; in total 267 rules: This version 
  of the resource tries to cover almost all (and surely all productive) possible 
  German derivation processes. This involves especially more prefixation rules, but also 
  some more suffixation rules.
- manual post-processing: refactoring of the 20 biggest derivational families: see 
  changelog v1.3
- new data format: Earlier DErivBase versions were available in the "Derivational 
  families only" format, and in the "Derivational families with pairwise weighted path 
  length" format, which was defined as follows:
        --------------------------------------------------------------
        -- Derivational families with pairwise weighted path length -- 
        --------------------------------------------------------------
    
        The file "DErivBase-v*-pathLength.txt" contains the derivational families 
        enriched with information about the length of the derivation path between 
        each pair of lemmas of the same family (see explanation above).

        In order to easily access the path weights for all possible lemma pairs, 
        each lemma and its corresponding family members including their pairwise path 
        weights are presented in a single row. For example, the following derivational 
        family with four members:
           Aalener_Nm Aalen_Nn aalen_V Aal_Nn
        is shown in four successive lines:
           Aalener_Nm: Aalen_Nn 1.00 aalen_V 0.50 Aal_Nn 0.33
           Aalen_Nn: Aalener_Nm 1.00 aalen_V 1.00 Aal_Nn 0.50
           aalen_V: Aal_Nn 1.00 Aalen_Nn 1.00 Aalener_Nm 0.50
           Aal_Nn: aalen_V 1.00 Aalen_Nn 0.50 Aalener_Nm 0.33
        where a number indicates the path weight between the lemma at the beginning 
        of the line (in front of the colon), and the lemma directly preceding the 
        number.

        For each lemma, a suffix specifies its part-of-speech (_V for verb, _Nf for 
        feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for gender-
        unspecific noun, _A for adjective). The families are sorted in alphabetic 
        order; the lines within one family are again ordered alphabetically regarding 
        the first lemma in the line.
  Now, we provide DErivBase in the "Derivational families only" and the "Derivational
  families with rule paths per pair" formats (see above); the latter is more 
  informative than the earlier path length format.

Additionally, we revised the gold standard annotation presented in the paper [1], 
because we encountered some annotation errors.

Version 1.4 consists of 267 rules and 219,214 derivational families, thereof 17,314 
non-singletons. On the gold standard described in the paper [1], it achieves a precision 
of 85% and a recall of 87% (+16 percentage points compared to v1.3). On the revised gold standard, it 
achieves a precision of 85% and a recall of 91%.


----------
-- v1.3 --
----------

DErivBase v1.3 builds upon v1.2, which was constructed as explained in Zeller et 
al. (2013). Version 1.3 underwent the following rule refinements and 
post-processing steps subsequent to the final analysis described in the paper [1]:
- added two derivational rules (dNN28, dVA14): Close some remaining coverage gaps
- refined one derivational rule (dVA13): Cover additional optional patterns
- manual post-processing: refactored the 20 biggest derivational families in v1.2:  
  Since the biggest families incorrectly aggregated several derivational families, 
  we split them up into their correct families. In case a corresponding family already 
  existed, we added the lemma(s) of the big family to the corresponding correct family.
  NOTE: this manual split is done *only* for the "DErivBase-v*.txt" file format, since
  the manual split would lead to broken derivation chains in formats describing the
  rule path / chain length.

Version 1.3 consists of 160 rules and 239,796 derivational families, thereof 17,863 
non-singletons. On the gold standard described in the paper [1], it achieves a precision 
of 83% and a recall of 71%.
 

----------
-- v1.2 --
----------

Version 1.2 is the version of DErivBase which was evaluated as model "DErivBase-L123" 
in Zeller et al. (2013). It consists of 158 rules and 240,237 derivational families, 
thereof 17,764 non-singletons. On the gold standard described in the paper [1], it 
achieves a precision of 83% and a recall of 71%.


++++++++++++++++++++++
++ Derivation rules ++
++++++++++++++++++++++

The file "DErivBase-v*-rules.txt" contains the derivation rules with which the
resource was built. The following information is given:
* One or more German derivation examples in a line beginning with "--"
* A unique name assigned to each rule, which indicates the parts-of-speech 
  involved
* The derivation rule itself

An example rule description is:
  -- Kunst -> künstlich, Herr -> herrlich, Glück -> glücklich
  dNA01
  (sfx "lich" & try uml) nouns adjectives
where "Kunst -> künstlich" (art -> artificial) is one derivation example, the 
rule name "dNA01" indicates that this is a derivation rule which maps a noun into 
an adjective, and the derivation rule itself transforms a noun into an adjective 
by performing an umlaut shift within the noun (u -> ü) whenever possible (which
is the case for the vowels a, o, u) and obligatorily adding the suffix "lich".
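
The effect of this rule can be imitated in a few lines. The sketch below is 
not the rule framework used to build DErivBase; it only mirrors the 
explanation above. Which vowel is shifted (here: the last a, o, or u of the 
noun, with "au" becoming "äu") and the lowercasing of the result are 
assumptions, and the function names are hypothetical.

```python
UMLAUT = {"a": "ä", "o": "ö", "u": "ü"}


def try_umlaut(stem: str) -> str:
    """Shift the last a/o/u to its umlaut if there is one, else return stem."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            # Assumption: in the diphthong "au", the "a" is shifted ("äu").
            if stem[i] == "u" and i > 0 and stem[i - 1] == "a":
                i -= 1
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem  # no umlaut shift possible


def apply_dNA01(noun: str) -> str:
    """Noun -> adjective: optional umlaut shift plus obligatory suffix 'lich'."""
    return try_umlaut(noun.lower()) + "lich"


derived = [apply_dNA01(noun) for noun in ("Kunst", "Herr", "Glück")]
```

The three nouns from the rule's example line yield the three adjectives given 
there.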

The rule file should enable interested users to understand which derivation rules 
are considered in DErivBase.


---

[1] Zeller, B., Šnajder, J., and Padó, S. (2013):
    DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German. 
    Proceedings of ACL 2013, Sofia, Bulgaria.

[2] Zeller, B., Padó, S., and Šnajder, J. (2014):
    Towards semantic validation of a derivational lexicon. 
    Proceedings of COLING 2014, Dublin, Ireland.