Experimental Settings for "A Graph-based Lattice Dependency Parser for Joint Morphological
Segmentation and Syntactic Analysis", TACL 2015, vol 3.

 

Data

The detached version of the Turkish Treebank contains 49 sentences with loops. In the original treebank annotation, punctuation marks are left unattached unless they take part in a dependency relation. When the data is converted to the CoNLL format, those punctuation marks are attached to the next token with the dependency relation "notconnected". This automatic process creates loops, e.g. in appositions. The example sentence below illustrates such a case:

1 Çünkü çünkü Conj Conj _ 6 S.MODIFIER
2 aleme alem Noun Noun A3sg|Pnon|Dat 5 DATIVE.ADJUNCT
3 bir bir Det Det _ 4 DETERMINER
4 ağa ağa Noun Noun A3sg|Pnon|Nom 5 SUBJECT
5 giriyor gir Verb Verb Pos|Prog1|A3sg 6 SENTENCE
6 : : Punc Punc _ 7 notconnected
7 Soğan soğan Noun Noun A3sg|Pnon|Nom 5 APPOSITION
8 . . Punc Punc _ 0 ROOT


The head attachment of the colon in token 6 causes the loop 5 -> 6 -> 7 -> 5. We break the loop by attaching token 5 to token 8 instead of token 6.
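
The loop and its correction can be checked mechanically. The following Python sketch is only an illustration and not the procedure used for the paper: the function name find_cycle and the dictionary representation of the head column are assumptions made for this example. It follows the head pointers from each token and reports a loop when a token is revisited before the artificial root (ID 0) is reached.

```python
def find_cycle(heads):
    """Return the token IDs on a head-attachment loop, or [] if there is none.

    `heads` maps each 1-based token ID to its head ID; 0 is the artificial root.
    (Hypothetical helper, written for this illustration only.)
    """
    for start in heads:
        path = []
        seen = set()
        node = start
        # Follow head pointers until the root or an already visited token.
        while node != 0 and node not in seen:
            seen.add(node)
            path.append(node)
            node = heads[node]
        if node != 0:
            # `node` was revisited: the loop is the part of the path from it on.
            return path[path.index(node):]
    return []


# Head column of the example sentence (token ID -> head ID).
heads = {1: 6, 2: 5, 3: 4, 4: 5, 5: 6, 6: 7, 7: 5, 8: 0}
cycle = find_cycle(heads)  # contains tokens 5, 6 and 7

# The correction described above: attach token 5 to token 8 instead of 6.
fixed = dict(heads)
fixed[5] = 8
no_cycle = find_cycle(fixed)  # empty list: every token now reaches the root
```

Run on the head column of the example, the check reports tokens 5, 6 and 7 as the loop; after token 5 is reattached to token 8, no loop remains.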

We limited ourselves to correcting only the loops in these 49 sentences and did not look into other possible mistakes. The corrected data can be downloaded from here