Wind-Of-Change Corpora (WOCC)

This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019).

Type

Corpus

Author

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde

Description

This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019). It consists of two corpus pairs, one diachronic and one domain-specific:

diachronic:
- DTA18: sentences from documents in DTA (Deutsches Textarchiv) published between 1750 and 1799
- DTA19: sentences from documents in DTA published between 1850 and 1899

domain-specific:
- SDEWAC: a subsample of sentences from SdeWaC (Faaß & Eckart 2013)
- COOK: sentences from web-crawled cooking-related texts


Format

Low-frequency words and punctuation have been removed and words have been lemmatized. Each line corresponds to one sentence. Sentences were randomly shuffled within each corpus. Large files were split and all files were zipped. (See also sample from DTA18.)
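
The one-sentence-per-line layout described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling. It assumes that lemmas within a line are separated by whitespace, that files are UTF-8 encoded, and that a file ending in .gz is gzip-compressed while any other file is already unpacked plain text. The function names are hypothetical.

```python
import gzip
from collections import Counter
from pathlib import Path


def iter_sentences(path):
    """Yield each sentence of a corpus file as a list of lemmas.

    Assumptions (not confirmed by the corpus description):
    lemmas are whitespace-separated, the encoding is UTF-8,
    and a ".gz" suffix means the file is gzip-compressed.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            lemmas = line.split()
            if lemmas:  # skip empty lines
                yield lemmas


def lemma_frequencies(path):
    """Count how often each lemma occurs in one corpus file."""
    counts = Counter()
    for lemmas in iter_sentences(path):
        counts.update(lemmas)
    return counts
```

For example, calling lemma_frequencies on an unpacked part of DTA18 and then most_common(10) on the result would list the ten most frequent lemmas in that part. Because large files were split, the counts of all parts of a corpus would need to be summed to obtain corpus-level frequencies.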

The corpora are the basis for the DURel and SURel datasets (Schlechtweg et al. 2018, Hätty et al. 2019). Together with these datasets, they can be used to evaluate models of Lexical Semantic Change Detection on meaning changes across time periods or domains.

More detailed information about the corpora can be found in Schlechtweg et al. (2019).

Reference

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. ACL.

Download

The corpora can be downloaded here (large files have been split):
