Generation Ranking Experiment Data
We carried out a number of native speaker judgement experiments. The experiments are described in detail in Cahill and Forst (2009) and Cahill and Forst (to appear 2010). The main aims of these experiments were: (i) to establish how much variation in German word order is acceptable for human judges, (ii) to find an automatic evaluation metric that mirrors the findings of the human evaluation, (iii) to provide detailed feedback for the designers of the surface realisation ranking model, and (iv) to establish what effect preceding context has on the choice of realisation.
Input Data
The input data for the 3 experiments are available to download for research purposes. Experiments 1a and 3a use identical data, as do experiments 1b and 3b.
Experiment 1a and 3a Input Data
Experiment 1b and 3b Input Data
Experiment 2 Input Data
What to expect in each file
Experiments 1a and 3a
The file name indicates the sentence id. Each item contains at least 2 sentences of context, which were presented to the participants only during Experiment 3a. Then, for each system (Gold, Language Model (LM), and Log-linear Model (LL)), the sentence chosen as most likely by that system is given.
Experiments 1b and 3b
The file name indicates the sentence id. Each item contains at least 2 sentences of context, which were presented to the participants only during Experiment 3b. Then the sentence chosen as most likely by the language model is given.
Experiment 2
The file name indicates the sentence id. Each item contains at least 2 sentences of context. Then all possible alternatives for the item, as generated by the LFG grammar, are given.
Judgements
The native speaker judgements for each item in each experiment are available to download for research purposes. You can download the judgements by experiment number or participant number.
Experiment 1a Judgements
Experiment 1b Judgements
Experiment 2 Judgements
Experiment 3a Judgements
Experiment 3b Judgements
Judgements grouped by participant
What to expect in each file
Experiment 1a
The first line contains "yes" if the participant has some linguistic experience and "no" otherwise. The second line describes their dialect (if they entered this information). The next line contains the sentence ids of all items shown to the native speaker, in the order they were presented (including duplicates). For each item, there are 4 lines describing the judgement. The first contains the sentence id, along with the order in which the different systems were presented (0=gold, 1=lm, 2=ll). Lines 2-4 of the item contain the rank assigned to each system.
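The layout described above can be read with a short parser. The following Python sketch is only an illustration under several assumptions that the description does not settle: fields are whitespace-separated, the dialect line is present but empty when no dialect was entered, each item's first line is the sentence id followed by three system codes, and the three rank lines follow the presentation order with the rank as the last token on each line. The names parse_ranking_file, RankedItem and RankingFile are hypothetical, and the sample content is invented. Check the assumptions against a downloaded file before relying on it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# System codes as listed in the format description.
SYSTEM_NAMES = {0: "gold", 1: "lm", 2: "ll"}


@dataclass
class RankedItem:
    sentence_id: str
    presentation_order: List[str]  # system names in the order they were shown
    ranks: Dict[str, int]          # system name -> rank given by the participant


@dataclass
class RankingFile:
    linguistic_experience: bool
    dialect: Optional[str]
    shown_ids: List[str]
    items: List[RankedItem]


def parse_ranking_file(text: str) -> RankingFile:
    lines = text.splitlines()
    if len(lines) < 3:
        raise ValueError("expected three header lines")
    experience = lines[0].strip().lower() == "yes"
    # ASSUMPTION: the dialect line exists but is empty if nothing was entered.
    dialect = lines[1].strip() or None
    shown_ids = lines[2].split()

    # ASSUMPTION: blank lines between items carry no information.
    body = [ln.split() for ln in lines[3:] if ln.strip()]
    if len(body) % 4 != 0:
        raise ValueError("expected four lines per judged item")

    items = []
    for start in range(0, len(body), 4):
        header = body[start]
        rank_lines = body[start + 1:start + 4]
        # ASSUMPTION: item header is "<sentence id> <code> <code> <code>".
        if len(header) != 4:
            raise ValueError(f"unexpected item header: {header}")
        order = [SYSTEM_NAMES[int(code)] for code in header[1:]]
        # ASSUMPTION: the k-th rank line belongs to the k-th presented system,
        # and the rank is the last token on that line.
        ranks = {name: int(tokens[-1]) for name, tokens in zip(order, rank_lines)}
        items.append(RankedItem(header[0], order, ranks))
    return RankingFile(experience, dialect, shown_ids, items)


# Invented sample content, for illustration only.
sample = "\n".join([
    "no", "", "s12 s40",
    "s12 2 0 1", "1", "3", "2",
    "s40 0 1 2", "2", "1", "3",
])
parsed = parse_ranking_file(sample)
print(parsed.items[0].ranks)
```

Mapping the numeric codes to system names at parse time keeps later analysis independent of the randomised presentation order.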
Experiment 1b
The first line contains the sentence ids of all items shown to the native speaker, in the order they were presented (including duplicates). Then, for each sentence id, the rating of the sentence by the native speaker (between 1 and 5) is stored.
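As an illustration, a rating file of this kind could be loaded as follows. This is a sketch, not a reference reader: it assumes that each line after the first holds a sentence id and its rating separated by whitespace, which the description does not state explicitly. The function names and the sample content are hypothetical. Because the presentation list can contain duplicates, the second helper averages repeated ratings of the same sentence id.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def parse_rating_file(text: str) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (ids in presentation order, list of (sentence id, rating))."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty judgement file")
    shown_ids = lines[0].split()
    ratings = []
    for line in lines[1:]:
        # ASSUMPTION: each judgement line is "<sentence id> <rating>".
        sentence_id, value = line.split()
        rating = int(value)
        # Ratings lie between 1 and 5 according to the format description.
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range for {sentence_id}: {rating}")
        ratings.append((sentence_id, rating))
    return shown_ids, ratings


def mean_rating_per_sentence(ratings: List[Tuple[str, int]]) -> Dict[str, float]:
    """Average the ratings of items that were shown more than once."""
    grouped = defaultdict(list)
    for sentence_id, rating in ratings:
        grouped[sentence_id].append(rating)
    return {sentence_id: mean(values) for sentence_id, values in grouped.items()}


# Invented sample content, for illustration only.
rating_sample = "s7 s3 s7\ns7 4\ns3 2\ns7 5\n"
rating_ids, rating_pairs = parse_rating_file(rating_sample)
rating_means = mean_rating_per_sentence(rating_pairs)
print(rating_means)
```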
Experiment 2
The first line contains the sentence ids of all items shown to the native speaker, in the order they were presented (including duplicates). Then, for each sentence id, the index of the sentence that the native speaker chose as most appropriate is given (indices start at 0).
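Since the stored choice is a 0-based index, it has to be resolved against the list of alternatives from the matching Experiment 2 input file. The Python sketch below shows one way to do that. It assumes each judgement line is a sentence id followed by an index, separated by whitespace, and that the alternatives have already been loaded into a list in file order; both points are assumptions, and the function names and placeholder strings are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_selection_file(text: str) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (ids in presentation order, list of (sentence id, chosen index))."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty judgement file")
    shown_ids = lines[0].split()
    selections = []
    for line in lines[1:]:
        # ASSUMPTION: each judgement line is "<sentence id> <index>".
        sentence_id, index = line.split()
        selections.append((sentence_id, int(index)))
    return shown_ids, selections


def chosen_sentence(index: int, alternatives: List[str]) -> str:
    """Look up a 0-based choice in the alternatives of the same item."""
    if not 0 <= index < len(alternatives):
        raise IndexError(f"index {index} outside {len(alternatives)} alternatives")
    return alternatives[index]


def selection_counts(selections: List[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count how often each alternative was picked, per sentence id."""
    counts: Dict[str, Counter] = {}
    for sentence_id, index in selections:
        counts.setdefault(sentence_id, Counter())[index] += 1
    return counts


# Invented sample content and placeholder alternatives, for illustration only.
selection_sample = "s5 s9 s5\ns5 0\ns9 2\ns5 0\n"
selection_ids, selection_pairs = parse_selection_file(selection_sample)
placeholder_alternatives = ["alternative 0", "alternative 1", "alternative 2"]
picked = chosen_sentence(selection_pairs[1][1], placeholder_alternatives)
per_item = selection_counts(selection_pairs)
print(picked, per_item["s5"])
```

The explicit bounds check guards against pairing a judgement file with the wrong input file, where Python's negative or out-of-range indexing would otherwise fail silently or with a less helpful message.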
Experiment 3a
The format is the same as for Experiment 1a.
Experiment 3b
The format is the same as for Experiment 1b.
References
Aoife Cahill and Martin Forst (to appear 2010). Human Evaluation of a German Surface Realisation Ranker. In Emiel Krahmer and Mariet Theune (eds.), Empirical Methods in Natural Language Generation. Springer, 2010.
Aoife Cahill and Martin Forst (2009). Human Evaluation of a German Surface Realisation Ranker. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 112-120, Athens, Greece, March. Association for Computational Linguistics. [pdf]
Aoife Cahill (2009). Correlating Human and Automatic Evaluation of a German Surface Realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (ACL-IJCNLP 2009), pages 97-100, Singapore. [pdf]