German Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
- Type
-
Corpus, Dataset
- Author
-
Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi
- Description
-
This data collection contains the German test data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection:
- a lemmatized German text corpus pair (corpus1/lemma/, corpus2/lemma/)
- 48 lemmas (targets) which have been annotated for their lexical semantic change between the two corpora (targets.txt)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (truth/)
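As a rough illustration of how the target list and gold scores might be read, the following Python sketch parses them into plain data structures. Several details are assumptions not stated on this page: that targets.txt holds one lemma per line, that the files under truth/ are tab-separated `lemma<TAB>score` lines, and that they are named binary.txt and graded.txt. The function names are hypothetical; check the downloaded archive before relying on any of this.

```python
from pathlib import Path


def parse_targets(text):
    """Return the target lemmas, assuming one lemma per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_scores(text, cast=float):
    """Parse 'lemma<TAB>score' lines (assumed layout) into a dict."""
    scores = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        lemma, value = line.split("\t")
        scores[lemma] = cast(value)
    return scores


def load_gold(root):
    """Load targets and both score files from an unpacked data directory.

    The file names truth/binary.txt and truth/graded.txt are assumptions.
    """
    root = Path(root)
    targets = parse_targets((root / "targets.txt").read_text(encoding="utf-8"))
    binary = parse_scores(
        (root / "truth" / "binary.txt").read_text(encoding="utf-8"), cast=int
    )
    graded = parse_scores(
        (root / "truth" / "graded.txt").read_text(encoding="utf-8"), cast=float
    )
    return targets, binary, graded
```

The parsing functions take text rather than paths, so they can be tried on a few hand-written lines before pointing `load_gold` at the real directory.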
Corpus 1 (lemma version)
- based on: DTA
- language: German
- time covered: 1800-1899
- size: ~70 million tokens
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
Corpus 2 (lemma version)
- based on: BZ and ND
- language: German
- time covered: 1946-1990
- size: ~72 million tokens
- format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
- encoding: UTF-8
- note: contains frequent OCR errors
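A common first step with a corpus pair like this is to check how often each target occurs in each period. The sketch below assumes that each corpus file stores one sentence per line with space-separated lemmas, and that it may be gzip-compressed; neither detail is confirmed on this page. `iter_sentences` and `count_targets` are hypothetical helper names.

```python
import gzip
from collections import Counter


def iter_sentences(path):
    """Yield one sentence per line from a plain or gzip-compressed UTF-8 file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_targets(sentences, targets):
    """Count occurrences of each target lemma in an iterable of sentences."""
    wanted = set(targets)
    counts = Counter({target: 0 for target in targets})
    for sentence in sentences:
        for token in sentence.split():  # assumes whitespace tokenization
            if token in wanted:
                counts[token] += 1
    return counts
```

Running `count_targets(iter_sentences(path), targets)` once per corpus gives two frequency tables that can be compared directly. Because Corpus 2 contains frequent OCR errors, exact string matching may undercount some targets there.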
Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (corpus1/token/, corpus2/token/). It contains the raw sentences in the same order as in the lemma version. Find more information on the data and SemEval-2020 Task 1 in the paper referenced below.
The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).
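Because the token version keeps the sentences in the same order as the lemma version, the two can be read in parallel to recover the raw context of a lemmatized sentence. The following sketch assumes one sentence per line in both versions and an equal number of lines; the function names are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the end of the shorter input


def paired_sentences(lemma_lines, token_lines):
    """Yield (lemma_sentence, raw_sentence) pairs, line by line."""
    for lemma, token in zip_longest(lemma_lines, token_lines, fillvalue=_MISSING):
        if lemma is _MISSING or token is _MISSING:
            raise ValueError("lemma and token versions differ in length")
        yield lemma.rstrip("\n"), token.rstrip("\n")


def raw_contexts(lemma_lines, token_lines, target):
    """Return the raw sentences whose lemmatized form contains the target."""
    return [
        raw
        for lemmas, raw in paired_sentences(lemma_lines, token_lines)
        if target in lemmas.split()
    ]
```

Matching on the lemma side and returning the token side means inflected forms of a target are found without any extra morphological processing.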
- Reference
-
Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. SemEval@COLING2020.
Deutsches Textarchiv. 2017. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache. Published by the Berlin-Brandenburgische Akademie der Wissenschaften.
Berliner Zeitung. 2018. Diachronic newspaper corpus published by Staatsbibliothek zu Berlin.
Neues Deutschland. 2018. Diachronic newspaper corpus published by Staatsbibliothek zu Berlin.
- Download
-
The resources are freely available.
Dr. Dominik Schlechtweg, Junior research group leader
Prof. Dr. Sabine Schulte im Walde, Akademische Rätin (Associate Professor)