###############
## DErivBase ##
###############

This documentation file
 * explains what DErivBase is
 * gives details about the resource's file formats and versions (changelogs)
 * provides information about the derivation rules with which DErivBase was
   created and which are also shipped.

In case you wish to access an earlier version of DErivBase, please send a
request to: zeller at cl dot uni-heidelberg dot de.


+++++++++++++++++++++
++ About DErivBase ++
+++++++++++++++++++++

DErivBase is a large-coverage derivational resource for German. It consists of
derivational families, which are defined as equivalence classes of lemmas
(nouns, verbs, and adjectives). The lemmas of one family are derivationally
related to each other. They were extracted from the sdeWaC corpus;
lemmatization and POS tagging were done with TreeTagger, further morphological
analysis with the MATE tools. The resource was built with hand-written
derivation rules, which use string transformation functions to map base lemmas
onto derived lemmas.

Since v2.0, DErivBase is additionally semantically refined. That is,
derivational families are split into semantically coherent sub-families
(for details, see [2]). For instance, the family with 10 members:

  Anbaggern_Nn Ausbaggern_Nn Abbaggern_Nn Bagger_Nm Baggern_Nn baggern_V anbaggern_V aufbaggern_V ausbaggern_V abbaggern_V

is split into the following three clusters, which are semantically coherent:

  1. abbaggern_V Abbaggern_Nn
  2. ausbaggern_V Ausbaggern_Nn baggern_V Baggern_Nn aufbaggern_V Bagger_Nm
  3. anbaggern_V Anbaggern_Nn

This clustering is achieved as follows: A supervised machine learning model
classifies each pair of derivationally related lemmas as semantically related
or not. Then, we use hierarchical agglomerative clustering to turn the pairwise
decisions into complete clusters within a family (the threshold for clustering
was optimised for F1).
DErivBase v2.0 covers 280,336 lemmas; 65,420 of them are grouped into 20,371
non-singleton families (i.e., the remaining 214,916 lemmas are singleton
families).

Each pair of words in a family is connected by a path of derivation rules.
There is a weak negative correlation between derivation path length and
relatedness of the connected lemmas. Thus, we assume that two lemmas from the
same derivational family are the more closely related, the fewer rules are
necessary to connect them on the shortest path. This can be captured by, e.g.,
assigning a pair the weight 1/n, where n is the length of the shortest path
between them. Path weights can then be employed when applying DErivBase, e.g.
for measuring the semantic similarity of two lemmas.


++++++++++++++++++
++ File formats ++
++++++++++++++++++

The resource is available in three file formats:

--------------------------------
-- Derivational families only --
--------------------------------

The file "DErivBase-v*.txt" contains information about derivational families,
but not about rules. Each line of the file contains one derivational family,
where the order of the individual lemmas is not relevant.

For each lemma, a suffix specifies its part-of-speech (_V for verb, _Nf for
feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for
gender-unspecific noun, _A for adjective).

The families are sorted by size before semantic clustering (i.e., according to
the size in v1.4.1) in descending order, where same-size families are ordered
alphabetically. The semantic clusters within the same derivational family are
in random order (i.e., they are sorted neither by cluster size nor
alphabetically).
Example for a derivational family containing 10 words:

  Verschmelzung_Nf verschmelzend_A schmelzend_A Verschmelzen_Nn Schmelzen_Nn Schmelze_Nf Schmelz_Nm umschmelzen_V verschmelzen_V schmelzen_V

----------------------------------------------------
-- Derivational families with rule paths per pair --
----------------------------------------------------

The file "DErivBase-v*-rulePaths.txt" contains the derivational families
enriched with information about which derivation rules connect each lemma pair
within the same derivational family, and how long the derivation path between
the lemma pair is (see explanation above).

In order to easily access the shortest rule path for all possible lemma pairs,
this format contains *ALL* lemma pairs of a derivational family, and their
connection by the shortest possible derivation path between them. This format
also includes the corresponding length of the connecting rule path.

Each lemma pair and its derivation path is presented in a single row. That is,
for a derivational family of n members, there are "n choose 2" (= n*(n-1)/2)
rows for this family in the "rulePaths" format.
For example, the following derivational family with four members:

  Aalen_Nn Aalener_Nm Aal_Nn aalen_V

is shown in the following six lines (4*3/2):

  Aal_Nn Aalen_Nn 2 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn
  Aal_Nn aalen_V 1 Aal_Nn dNV09> aalen_V
  Aal_Nn Aalener_Nm 3 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn dNN05> Aalener_Nm
  Aalen_Nn aalen_V 1 Aalen_Nn dNV09> aalen_V
  Aalen_Nn Aalener_Nm 1 Aalen_Nn dNN05> Aalener_Nm
  aalen_V Aalener_Nm 2 aalen_V dVN09> Aalen_Nn dNN05> Aalener_Nm

where
 * the first two lemmas are the actual lemma pair
 * the number after the lemma pair indicates the (shortest) path length
   between the very first and the very last lemma of this line
 * the derivation path follows the path length; each pair of adjacent lemmas
   on the path is connected by exactly one derivation rule

The derivation rules, in turn, correspond to the following pattern:
 * they always start with "d"
 * they always end with ">"
 * two capital letters indicate the input and output part-of-speech, e.g.,
   "NV" means that the input lemma of the rule is a noun, and the output lemma
   is a verb
 * rules contain a number, which is just an ID number for this part-of-speech
   pair ("05") and optionally, a sub-number separated by a "."; e.g. "05.1".
   The sub-number range is [1,3]; it is technically necessary to indicate that
   the input lemma's gender (for nouns) or grammatical suffix (for verbs) is
   the same as that of the output lemma
 * an optional asterisk "*" indicates that the rule is applied inversely;
   e.g., "behalten_V dNV01*> Halt_Nm", as opposed to
   "Halt_Nm dNV01> behalten_V"

Please note that the path weight proposed above can be trivially computed from
the path length n indicated in each line, i.e., 1/n.

For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending
with 'en', _Veln for verb ending with 'eln', _Vern for verb ending with 'ern',
_Nf for feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for
gender-unspecific noun, _A for adjective).
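The row layout and the rule name pattern described above can be read with a
few lines of Python. This is only a minimal sketch, not part of the DErivBase
distribution: it assumes that the fields of a row are separated by whitespace,
and the names Rule, RulePath, parse_rule and parse_rule_path_line are
hypothetical helpers chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Rule name pattern as described above: "d", two capital POS letters, an ID
# number, an optional ".<sub-number>", an optional "*" (inverse), then ">".
RULE_RE = re.compile(r"d([A-Z])([A-Z])(\d+)(?:\.(\d))?(\*)?>")


@dataclass
class Rule:
    source_pos: str        # input part-of-speech, e.g. "N"
    target_pos: str        # output part-of-speech, e.g. "V"
    rule_id: str           # ID number for this POS pair, e.g. "09"
    sub_id: Optional[str]  # optional sub-number, e.g. "1"
    inverse: bool          # True if the rule name carries "*"


@dataclass
class RulePath:
    lemma_a: str
    lemma_b: str
    length: int
    lemmas: List[str]      # lemmas visited on the path, in order
    rules: List[Rule]      # one rule between each pair of adjacent lemmas

    @property
    def weight(self) -> float:
        # Path weight 1/n as proposed in this documentation.
        return 1.0 / self.length


def parse_rule(token: str) -> Rule:
    match = RULE_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a derivation rule name: {token!r}")
    src, tgt, rule_id, sub_id, star = match.groups()
    return Rule(src, tgt, rule_id, sub_id, star is not None)


def parse_rule_path_line(line: str) -> RulePath:
    # Assumption: fields are whitespace-separated.
    fields = line.split()
    # Two lemmas, one length, then lemma (rule lemma)+ gives 2n + 4 fields.
    if len(fields) < 6 or len(fields) % 2 != 0:
        raise ValueError(f"unexpected number of fields: {line!r}")
    lemma_a, lemma_b, length = fields[0], fields[1], int(fields[2])
    path = fields[3:]
    lemmas = path[0::2]
    rules = [parse_rule(token) for token in path[1::2]]
    if len(rules) != length:
        raise ValueError(f"path length does not match rule count: {line!r}")
    return RulePath(lemma_a, lemma_b, length, lemmas, rules)


row = "Aal_Nn Aalener_Nm 3 Aal_Nn dNV09> aalen_Ven dVN09> Aalen_Nn dNN05> Aalener_Nm"
path = parse_rule_path_line(row)
print(path.lemma_a, path.lemma_b, path.length, path.weight)
```

The check that the number of rules equals the stated path length is a cheap
sanity test for each row; it can be dropped if the file is trusted.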
The families are sorted by size before semantic clustering (i.e., according to
the size in v1.4.1) in descending order, where same-size families are ordered
alphabetically. Within a family, the pairs as well as the lemmas in a pair are
ordered alphabetically according to the German alphabet (i.e., ä=a, ö=o, ü=u,
ß=ss).

-------------------------------------------------------
-- Derivational families with probabilities per pair --
-------------------------------------------------------

The file "DErivBase-v2*-probabilities.txt" contains the derivational families
enriched with the probability score produced by the machine learning classifier
for semantic validation (see [2]).

This format contains *ALL* lemma pairs of a derivational and semantically
validated family, and their respective probability score.

Each lemma pair and its probability score is presented in a single row. That
is, for a derivational family of n members, there are "n choose 2"
(= n*(n-1)/2) rows for this family in the "probabilities" format.

For example, the derivational family with 3 members:

  Wasserchemie_Nf Wasserchemiker_Nm wasserchemisch_A

is shown in the following 3 lines:

  Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885
  Wasserchemie_Nf wasserchemisch_A 0.981155397119
  Wasserchemiker_Nm wasserchemisch_A 0.955528623541

Please note that this probability score can be used as a reliability weight for
the respective lemma pair.

For each lemma, a suffix specifies its part-of-speech (_Ven for verb ending
with 'en', _Veln for verb ending with 'eln', _Vern for verb ending with 'ern',
_Nf for feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for
gender-unspecific noun, _A for adjective).

The families are sorted by size before clustering (i.e., according to the size
in v1.4.1) in descending order, where same-size families are ordered
alphabetically. Within a family, the pairs are ordered alphabetically according
to the German alphabet (i.e., ä=a, ö=o, ü=u, ß=ss).
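To use the probability scores as reliability weights, the rows can be loaded
into a lookup table keyed by the unordered lemma pair. The following Python
lines are only a sketch under assumptions: fields are taken to be
whitespace-separated, parse_probabilities and pair_score are hypothetical
helper names, and returning 0.0 for pairs that are not listed is an
illustrative choice, not something the resource prescribes.

```python
from typing import Dict, FrozenSet, Iterable


def parse_probabilities(lines: Iterable[str]) -> Dict[FrozenSet[str], float]:
    """Map each unordered lemma pair to its probability score."""
    scores: Dict[FrozenSet[str], float] = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated fields
        if not fields:
            continue           # skip empty lines
        if len(fields) != 3:
            raise ValueError(f"expected 'lemma lemma score': {line!r}")
        lemma_a, lemma_b, score = fields
        scores[frozenset((lemma_a, lemma_b))] = float(score)
    return scores


def pair_score(scores: Dict[FrozenSet[str], float],
               lemma_a: str, lemma_b: str, default: float = 0.0) -> float:
    """Look up a pair regardless of the order of its lemmas."""
    # Assumption: pairs not listed in the file get the default weight.
    return scores.get(frozenset((lemma_a, lemma_b)), default)


rows = [
    "Wasserchemie_Nf Wasserchemiker_Nm 0.779426142885",
    "Wasserchemie_Nf wasserchemisch_A 0.981155397119",
    "Wasserchemiker_Nm wasserchemisch_A 0.955528623541",
]
scores = parse_probabilities(rows)
print(pair_score(scores, "wasserchemisch_A", "Wasserchemie_Nf"))
```

For the real file, the list of rows would be replaced by an open file handle,
e.g. open(path, encoding="utf-8"); the encoding is likewise an assumption.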


+++++++++++++++
++ Changelog ++
+++++++++++++++

----------
-- v2.0 --
----------

DErivBase v2.0 is a semantically refined version of v1.4.1, i.e., derivational
families from v1.4.1 are split according to semantic coherence. Therefore, the
size of many derivational families has changed. The rule set used for v1.4.1
remained unchanged. Additionally, there is a new data format: "Derivational
families with probabilities per pair", as explained above. (October, 2014)

------------
-- v1.4.1 --
------------

Exactly the same as DErivBase v1.4, except that two duplicate lemmas were
removed (Einstelle_Nf, Einsteller_Nm). This change only applies to the
"DErivBase-v1.4.1.txt" format file; all other files are exactly as in version
1.4. (May, 2014)

----------
-- v1.4 --
----------

DErivBase v1.4 builds upon v1.2, which was constructed as explained in Zeller
et al. (2013). Version 1.4 underwent three major changes:

 - addition of 109 derivation rules compared to v1.2; in total 267 rules:
   This version of the resource tries to cover almost all (and surely all
   productive) possible German derivation processes. This involves especially
   more prefixation rules, but also some more suffixation rules.
 - manual post-processing: refactoring of the 20 biggest derivational
   families: see changelog v1.3
 - new data format: Earlier DErivBase versions were available in the
   "Derivational families only" format, and in the "Derivational families with
   pairwise weighted path length" format, which was defined as follows:

--------------------------------------------------------------
-- Derivational families with pairwise weighted path length --
--------------------------------------------------------------

The file "DErivBase-v*-pathLength.txt" contains the derivational families
enriched with information about the length of the derivation path between each
pair of lemmas of the same family (see explanation above).
In order to easily access the path weights for all possible lemma pairs, each
lemma and its corresponding family members including their pairwise path
weights are presented in a single row.

For example, the following derivational family with four members:

  Aalener_Nm Aalen_Nn aalen_V Aal_Nn

is shown in four successive lines:

  Aalener_Nm: Aalen_Nn 1.00 aalen_V 0.50 Aal_Nn 0.33
  Aalen_Nn: Aalener_Nm 1.00 aalen_V 1.00 Aal_Nn 0.50
  aalen_V: Aal_Nn 1.00 Aalen_Nn 1.00 Aalener_Nm 0.50
  Aal_Nn: aalen_V 1.00 Aalen_Nn 0.50 Aalener_Nm 0.33

where a number indicates the path weight between the lemma at the beginning of
the line (in front of the colon), and the lemma directly preceding the number.

For each lemma, a suffix specifies its part-of-speech (_V for verb, _Nf for
feminine noun, _Nm for masculine noun, _Nn for neuter noun, _N for
gender-unspecific noun, _A for adjective).

The families are sorted in alphabetic order; the lines within one family are
again ordered alphabetically regarding the first lemma in the line.

Now, we provide DErivBase in the "Derivational families only" and the
"Derivational families with rule paths per pair" formats (see above), the
latter being more informative.

Additionally, we revised the gold standard annotation presented in the paper
[1], because we encountered some annotation errors.

Version 1.4 consists of 267 rules and 219,214 derivational families, thereof
17,314 non-singletons. On the gold standard described in the paper [1], it
achieves a precision of 85% and a recall of 87% (+16% compared to v1.3). On the
revised gold standard, it achieves a precision of 85% and a recall of 91%.

----------
-- v1.3 --
----------

DErivBase v1.3 builds upon v1.2, which was constructed as explained in Zeller
et al. (2013).
Version 1.3 underwent the following rule refinements and post-processing steps
subsequent to the final analysis described in the paper [1]:

 - added two derivational rules (dNN28, dVA14): close some remaining coverage
   gaps
 - refined one derivational rule (dVA13): cover additional optional patterns
 - manual post-processing: refactored the 20 biggest derivational families in
   v1.2: Since the biggest families incorrectly aggregated several
   derivational families, we split them up into their correct families. In
   case a corresponding family already existed, we added the lemma(s) of the
   big family to the corresponding correct family.
   NOTE: this manual split is done *only* for the "DErivBase-v*.txt" file
   format, since the manual split would lead to broken derivation chains in
   formats describing the rule path / chain length.

Version 1.3 consists of 160 rules and 239,796 derivational families, thereof
17,863 non-singletons. On the gold standard described in the paper [1], it
achieves a precision of 83% and a recall of 71%.

----------
-- v1.2 --
----------

Version 1.2 is the version of DErivBase which was evaluated as model
"DErivBase-L123" in Zeller et al. (2013). It consists of 158 rules and 240,237
derivational families, thereof 17,764 non-singletons. On the gold standard
described in the paper [1], it achieves a precision of 83% and a recall of 71%.


++++++++++++++++++++++
++ Derivation rules ++
++++++++++++++++++++++

The file "DErivBase-v*-rules.txt" contains the derivation rules with which the
resource was built.
The following information is given:
 * One or more German derivation examples in a line beginning with "--"
 * A unique name assigned to each rule, which indicates the parts-of-speech
   involved
 * The derivation rule itself

An exemplary rule description is:

  -- Kunst -> künstlich, Herr -> herrlich, Glück -> glücklich
  dNA01 (sfx "lich" & try uml) nouns adjectives

where "Kunst -> künstlich" (art -> artificial) is one derivation example, the
rule name "dNA01" indicates that this is a derivation rule which maps a noun
onto an adjective, and the derivation rule itself transforms a noun into an
adjective by performing an umlaut shift within the noun (u -> ü) whenever
possible (which is the case for the vowels a, o, u) and obligatorily adding the
suffix "lich".

This rule file should enable interested users to understand which derivation
rules are considered in DErivBase.

---

[1] Zeller, B., Šnajder, J., and Padó, S. (2013): DErivBase: Inducing and
    Evaluating a Derivational Morphology Resource for German. Proceedings of
    ACL 2013, Sofia, Bulgaria.
[2] Zeller, B., Padó, S., and Šnajder, J. (2014): Towards semantic validation
    of a derivational lexicon. Proceedings of COLING 2014, Dublin, Ireland.