Wind-Of-Change Corpora (WOCC)

This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019).

Type

Corpus

Author

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde

Description

This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019). It consists of a diachronic and a domain-specific corpus pair:

diachronic:
- DTA18: sentences from documents in DTA (Deutsches Textarchiv) published between 1750 and 1799
- DTA19: sentences from documents in DTA published between 1850 and 1899

domain-specific:
- SDEWAC: a subsample of sentences from SdeWaC (Faaß & Eckart 2013)
- COOK: sentences from web-crawled cooking-related texts

Format

Low-frequency words and punctuation have been removed and words have been lemmatized. Each line corresponds to one sentence. Sentences were randomly shuffled within each corpus. Large files were split and all files were zipped. (See also the sample from DTA18 listed under Download.)
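As an illustration of the one-sentence-per-line format, here is a minimal Python sketch for reading a corpus file and counting lemma frequencies. It rests on assumptions not stated on this page: that the archive has already been unpacked to a UTF-8 plain-text file and that lemmas within a line are separated by whitespace. The function names and the file name in the usage note are hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Iterator, List


def read_sentences(path: Path) -> Iterator[List[str]]:
    """Yield each non-empty line of an unpacked corpus file as a list of lemmas."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            # Assumption: lemmas are separated by whitespace.
            tokens = line.split()
            if tokens:
                yield tokens


def lemma_frequencies(path: Path) -> Counter:
    """Count how often each lemma occurs across all sentences in the file."""
    counts: Counter = Counter()
    for tokens in read_sentences(path):
        counts.update(tokens)
    return counts
```

For example, `lemma_frequencies(Path("dta18_sample.txt")).most_common(10)` would list the ten most frequent lemmas, where `dta18_sample.txt` stands in for wherever the unpacked sample was saved. Because sentences were shuffled within each corpus, the order of the lines carries no document-level information.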

The corpora are the basis for the DURel and SURel datasets (Schlechtweg et al. 2018, Hätty et al. 2019). Together with these datasets, they can be used to evaluate models of Lexical Semantic Change Detection on meaning changes between times or domains.

More detailed information about the corpora can be found in Schlechtweg et al. (2019).

Reference

Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. ACL.

Download

The corpora can be downloaded here (large files have been split):

DTA18
DTA19
SDEWAC: part 1, part 2, part 3
COOK
Sample corpus from DTA18

Related Resources

DURel: annotation data relying on diachronic corpus pair
SURel: annotation data relying on domain-specific corpus pair
Metaphoric Change: annotation data relying on diachronic corpus pair
