English Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Type

Corpus, Dataset

Author

Dominik Schlechtweg, Haim Dubossarsky, Simon Hengchen, Barbara McGillivray, Nina Tahmasebi

Description

This data collection contains the English test data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection:

  • a lemmatized English text corpus pair (corpus1/lemma/, corpus2/lemma/)
  • 37 lemmas (targets) which have been annotated for their lexical semantic change between the two corpora (targets.txt)
  • the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (truth/)
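As a rough illustration of how the target list and the score files relate, the following Python sketch parses both and checks that every target has a score. It assumes one target per line in targets.txt and one tab-separated "target<TAB>score" pair per line in the files under truth/. These layouts, the function names read_targets and read_scores, and the sample rows are assumptions made for illustration and should be checked against the downloaded files.

```python
from typing import Dict, Iterable, List


def read_targets(lines: Iterable[str]) -> List[str]:
    """Return the non-empty, stripped lines as target lemmas."""
    return [line.strip() for line in lines if line.strip()]


def read_scores(lines: Iterable[str]) -> Dict[str, float]:
    """Parse 'target<TAB>score' lines into a dict (assumed file layout)."""
    scores: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        target, value = line.split("\t")
        scores[target] = float(value)
    return scores


# Hypothetical sample rows, NOT taken from the data set.
sample_targets = ["alpha_nn\n", "beta_vb\n"]
sample_binary = ["alpha_nn\t0\n", "beta_vb\t1\n"]

targets = read_targets(sample_targets)
binary = read_scores(sample_binary)

# Targets without a score would indicate a mismatch between the files.
missing = [t for t in targets if t not in binary]
```

With the real data, any iterable of lines works in place of the sample lists, for example a file handle opened with encoding="utf-8".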

Corpus 1 (lemma version)

  • based on: CCOHA / COHA
  • language: English
  • time covered: 1810-1860
  • size: ~6 million tokens
  • format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
  • encoding: UTF-8
  • note: targets have been concatenated with their broad POS tag ("target_pos"); sentences are split at replacement tokens (10 x "@") and replacement tokens are removed

Corpus 2 (lemma version)

  • based on: CCOHA / COHA
  • language: English
  • time covered: 1960-2010
  • size: ~6 million tokens
  • format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
  • encoding: UTF-8
  • note: targets have been concatenated with their broad POS tag ("target_pos"); sentences are split at replacement tokens (10 x "@") and replacement tokens are removed
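Because targets appear in the lemma corpora as "target_pos" tokens, their frequency in each corpus can be counted by simple token matching. The sketch below assumes one sentence per line with whitespace-separated tokens; that layout, the function name count_targets, and the sample sentences and targets are hypothetical and serve only as an illustration.

```python
from collections import Counter
from typing import Iterable, Set


def count_targets(sentences: Iterable[str], targets: Set[str]) -> Counter:
    """Count occurrences of each target token in whitespace-tokenized lines."""
    counts = Counter({t: 0 for t in targets})  # start every target at zero
    for sentence in sentences:
        for token in sentence.split():
            if token in targets:
                counts[token] += 1
    return counts


# Hypothetical targets and sentences, NOT taken from the corpora.
example_targets = {"alpha_nn", "beta_vb"}
corpus1_lines = [
    "the alpha_nn be on the table near the old door",
    "he beta_vb the alpha_nn and then go home again soon",
]
corpus2_lines = [
    "she see an alpha_nn by the river bank late at night",
]

counts1 = count_targets(corpus1_lines, example_targets)
counts2 = count_targets(corpus2_lines, example_targets)
```

Comparing counts1 and counts2 gives a quick sanity check that each target is attested in both time periods before any change detection model is trained.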

In addition to the official lemma version of the corpora for SemEval-2020 Task 1, we provide the raw token version (corpus1/token/, corpus2/token/), which contains the raw sentences in the same order as the lemma version. For more information on the data and on SemEval-2020 Task 1, see the paper referenced below.
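Since the token version keeps the same sentence order as the lemma version, the two can be read in parallel to recover the raw sentence for each lemmatized one. The following sketch assumes one sentence per line in both versions and whitespace-separated tokens in the lemma version; the function name raw_sentences_for and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def raw_sentences_for(
    target: str,
    lemma_lines: Iterable[str],
    token_lines: Iterable[str],
) -> Iterator[Tuple[str, str]]:
    """Yield (lemma_sentence, raw_sentence) pairs whose lemma line contains target."""
    # zip pairs line i of the lemma version with line i of the token version.
    for lemma_line, token_line in zip(lemma_lines, token_lines):
        if target in lemma_line.split():
            yield lemma_line.strip(), token_line.strip()


# Hypothetical parallel lines, NOT taken from the corpora.
lemma_sample = [
    "he see the alpha_nn and then go home again soon\n",
    "she walk along the river bank late at night again\n",
]
token_sample = [
    "He saw the alphas, and then went home again soon.\n",
    "She walked along the river bank late at night again.\n",
]

pairs = list(raw_sentences_for("alpha_nn", lemma_sample, token_sample))
```

This kind of lookup is useful for inspecting target usages in their original wording, for instance when checking model predictions by hand.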

The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).

References

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. SemEval@COLING2020.

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC’20). European Language Resources Association (ELRA).

Mark Davies. 2012. Expanding Horizons in Historical Linguistics with the 400-Million Word Corpus of Historical American English. Corpora, 7(2):121–157.

Download

The resources are freely available.
