English Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection

Type

Corpus, Dataset

Author

Dominik Schlechtweg, Haim Dubossarsky, Simon Hengchen, Barbara McGillivray, Nina Tahmasebi

Description

This data collection contains the English test data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection:

  • a lemmatized English text corpus pair (corpus1/lemma/, corpus2/lemma/)
  • 37 lemmas (targets) which have been annotated for their lexical semantic change between the two corpora (targets.txt)
  • the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (truth/)
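As a minimal sketch of how the target list and the gold scores might be read: the helpers below assume that targets.txt holds one target per line and that each file under truth/ holds one tab-separated "target, score" pair per line. Neither layout is stated in this description, so both are assumptions to check against the actual files. The function names and the example target strings are hypothetical.

```python
from typing import Dict, Iterable, List


def read_targets(lines: Iterable[str]) -> List[str]:
    """Return the non-empty lines of a targets file (assumed: one target per line)."""
    return [line.strip() for line in lines if line.strip()]


def read_scores(lines: Iterable[str], sep: str = "\t") -> Dict[str, float]:
    """Parse 'target<sep>score' lines into a dict.

    The tab-separated layout is an assumption, not something the
    dataset description confirms.
    """
    scores: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last separator so the target itself may contain others.
        target, value = line.rsplit(sep, 1)
        scores[target] = float(value)
    return scores
```

Both helpers accept any iterable of lines, so they can be fed an open file handle, for example `with open("targets.txt", encoding="utf-8") as f: targets = read_targets(f)`.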

Corpus 1 (lemma version)

  • based on: CCOHA / COHA
  • language: English
  • time covered: 1810-1860
  • size: ~6 million tokens
  • format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
  • encoding: UTF-8
  • note: targets have been concatenated with their broad POS tag ("target_pos"); sentences are split at replacement tokens (10 x "@") and replacement tokens are removed

Corpus 2 (lemma version)

  • based on: CCOHA / COHA
  • language: English
  • time covered: 1960-2010
  • size: ~6 million tokens
  • format: lemmatized, sentence length > 9 (before removal of punctuation), no punctuation, sentences randomly shuffled
  • encoding: UTF-8
  • note: targets have been concatenated with their broad POS tag ("target_pos"); sentences are split at replacement tokens (10 x "@") and replacement tokens are removed
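The "target_pos" convention noted for both corpora can be illustrated with a short sketch. It assumes one sentence per line with whitespace-separated tokens, and that the lemma and its POS tag are joined by an underscore, as the "target_pos" pattern suggests. The function names and the example tokens are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def split_target(token: str) -> Tuple[str, str]:
    """Split 'lemma_pos' into (lemma, pos); a token without '_' gets an empty pos."""
    lemma, sep, pos = token.rpartition("_")
    if not sep:
        return token, ""
    return lemma, pos


def count_targets(sentences: Iterable[str], targets: Set[str]) -> Counter:
    """Count how often each POS-tagged target occurs in the given sentences.

    Assumes whitespace-separated tokens, one sentence per line.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        for token in sentence.split():
            if token in targets:
                counts[token] += 1
    return counts
```

Running `count_targets` once per corpus gives the frequency of each target in the two time periods, which is a useful sanity check before training any model on the data.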

Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (corpus1/token/, corpus2/token/), which contains the raw sentences in the same order as the lemma version. More information on the data and on SemEval-2020 Task 1 is available in the paper referenced below.
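Because the token version keeps the sentences in the same order as the lemma version, the two can be read in parallel to recover the raw context of a lemmatized sentence. The sketch below assumes one sentence per line in both versions and an equal number of lines. The function names and example sentences are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def paired_sentences(
    lemma_lines: Iterable[str], token_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (lemmatized sentence, raw sentence) pairs, aligned by line position."""
    for lemma, raw in zip(lemma_lines, token_lines):
        yield lemma.rstrip("\n"), raw.rstrip("\n")


def find_contexts(
    lemma_lines: Iterable[str], token_lines: Iterable[str], target: str
) -> List[str]:
    """Return the raw sentences whose lemmatized counterpart contains the target."""
    return [
        raw
        for lemma, raw in paired_sentences(lemma_lines, token_lines)
        if target in lemma.split()
    ]
```

Note that `zip` stops at the shorter input, so a mismatch in line counts would go unnoticed here. Comparing the line counts of the two files first is a reasonable extra check.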

The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).

Reference

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. SemEval@COLING2020.
