German Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Type

Corpus, Dataset

Author

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi

Description

This data collection contains the German test data for SemEval-2020 Task 1 (Unsupervised Lexical Semantic Change Detection). It comprises:

a lemmatized German text corpus pair (corpus1/lemma/, corpus2/lemma/)
48 lemmas (targets) which have been annotated for their lexical semantic change between the two corpora (targets.txt)
the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (truth/)
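As a rough illustration of how these files might be read, the sketch below parses a target list and a score file. It assumes that targets.txt holds one lemma per line and that each file under truth/ holds one tab-separated "lemma, score" pair per line; neither layout is specified on this page, so check the downloaded files first. The function names are hypothetical helpers, not part of the data release.

```python
from typing import Dict, Iterable, List


def read_targets(lines: Iterable[str]) -> List[str]:
    """Return the target lemmas, assuming one lemma per non-empty line."""
    return [line.strip() for line in lines if line.strip()]


def read_scores(lines: Iterable[str]) -> Dict[str, float]:
    """Parse 'lemma<TAB>score' lines into a dict (assumed file layout)."""
    scores: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lemma, value = line.rsplit("\t", 1)
        scores[lemma] = float(value)
    return scores


def missing_scores(targets: List[str], scores: Dict[str, float]) -> List[str]:
    """List targets that have no entry in a score file."""
    return [lemma for lemma in targets if lemma not in scores]


def load(path, parser):
    """Open a UTF-8 text file and hand its lines to one of the parsers above."""
    with open(path, encoding="utf-8") as handle:
        return parser(handle)
```

With the data unpacked, `load("targets.txt", read_targets)` should return 48 lemmas if the assumed layout holds, and `missing_scores` offers a quick check that the binary and graded score files cover every target.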

Corpus 1 (lemma version)

based on: DTA (Deutsches Textarchiv)
language: German
time covered: 1800-1899
size: ~70 million tokens
format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
encoding: UTF-8

Corpus 2 (lemma version)

based on: BZ (Berliner Zeitung) and ND (Neues Deutschland)
language: German
time covered: 1946-1990
size: ~72 million tokens
format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
encoding: UTF-8
note: contains frequent OCR errors
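
Given the format described above, a first sanity check is to count how often each target occurs in the two corpora. The sketch below is one possible way to do this. It assumes one sentence per line with whitespace-separated lemmas, and it accepts either plain or gzip-compressed text; the file names inside corpus1/lemma/ and corpus2/lemma/ are not given on this page, so the path is left to the caller.

```python
import gzip
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator, Set


def read_sentences(path: Path) -> Iterator[str]:
    """Yield one sentence per line; .gz files are decompressed on the fly."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_targets(sentences: Iterable[str], targets: Set[str]) -> Counter:
    """Count occurrences of each target lemma in whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tok for tok in sentence.split() if tok in targets)
    return counts
```

Because the sentences are streamed rather than loaded at once, memory use stays small even for corpora of roughly 70 million tokens. Note that OCR errors in Corpus 2 may lower the counts for some lemmas.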

In addition to the official lemma version of the corpora for SemEval-2020 Task 1, we provide the raw token version (corpus1/token/, corpus2/token/), which contains the raw sentences in the same order as the lemma version. More information on the data and on SemEval-2020 Task 1 is available in the paper referenced below.
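
Since the token and lemma versions share the same sentence order, the two can be read in parallel to recover the raw context of a lemmatized target. The following is a minimal sketch under the assumption that line i of the lemma file corresponds to line i of the token file; the helper names are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Tuple

_MISSING = object()  # sentinel marking that one file ended before the other


def paired_sentences(
    lemma_lines: Iterable[str], token_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (lemma sentence, raw sentence) pairs; fail if the lengths differ."""
    for lemma, token in zip_longest(lemma_lines, token_lines, fillvalue=_MISSING):
        if lemma is _MISSING or token is _MISSING:
            raise ValueError("lemma and token versions differ in length")
        yield lemma.rstrip("\n"), token.rstrip("\n")


def raw_contexts(
    target: str,
    lemma_lines: Iterable[str],
    token_lines: Iterable[str],
    limit: int = 5,
) -> List[str]:
    """Return up to `limit` raw sentences whose lemmatized form contains `target`."""
    hits: List[str] = []
    for lemma, token in paired_sentences(lemma_lines, token_lines):
        if target in lemma.split():
            hits.append(token)
            if len(hits) >= limit:
                break
    return hits
```

Both arguments can be open file handles, so the corpora never need to be held in memory. The length check guards against pairing a lemma file with the wrong token file.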

The creation of the data was supported by the CRETA center and by the CLARIN-D grant funded by the German Federal Ministry of Education and Research (BMBF).

Reference

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. SemEval@COLING2020.

Deutsches Textarchiv. 2017. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache. Published by the Berlin-Brandenburgische Akademie der Wissenschaften.

Berliner Zeitung. 2018. Diachronic newspaper corpus published by Staatsbibliothek zu Berlin.

Neues Deutschland. 2018. Diachronic newspaper corpus published by Staatsbibliothek zu Berlin.

Download

The resources are freely available.
