Wind-Of-Change Corpora (WOCC)
Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde
This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019). It consists of a diachronic and a domain-specific corpus pair:
- DTA18: sentences from documents in DTA (Deutsches Textarchiv) published between 1750 and 1799
- DTA19: sentences from documents in DTA published between 1850 and 1899
- SDEWAC: a subsample of sentences from SdeWaC (Faaß & Eckart 2013)
- COOK: sentences from web-crawled cooking-related texts
Low-frequency words and punctuation have been removed and words have been lemmatized. Each line corresponds to one sentence. Sentences were randomly shuffled within each corpus. Large files were split and all files were zipped. (See also sample from DTA18.)
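The format described above (one lemmatized sentence per line, files zipped) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes that the archives are standard zip files containing UTF-8 plain text and that lemmas within a line are separated by whitespace, neither of which is stated explicitly here. The function names `iter_sentences` and `lemma_frequencies` are hypothetical helpers introduced for illustration.

```python
import io
import zipfile
from collections import Counter
from typing import Iterable, Iterator, List


def iter_sentences(zip_path: str, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield each sentence in a zipped corpus file as a list of lemmas.

    Assumption: one sentence per line, lemmas separated by whitespace.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in sorted(archive.namelist()):
            if member.endswith("/"):  # skip directory entries
                continue
            with archive.open(member) as raw:
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    tokens = line.split()
                    if tokens:  # ignore empty lines
                        yield tokens


def lemma_frequencies(zip_paths: Iterable[str]) -> Counter:
    """Count lemma occurrences across all parts of a split corpus."""
    counts: Counter = Counter()
    for path in zip_paths:
        for sentence in iter_sentences(path):
            counts.update(sentence)
    return counts
```

Because large files were split, `lemma_frequencies` takes a list of archive paths so that all parts of one corpus can be counted together. The sentences were shuffled, so the order in which the parts are read should not matter for frequency counts.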
The corpora are the basis for the DURel and SURel datasets (Schlechtweg et al. 2018, Hätty et al. 2019). Together with these datasets, they can be used to evaluate models of Lexical Semantic Change Detection on meaning changes between times or domains.
More detailed information about the corpora can be found in Schlechtweg et al. (2019).
Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. ACL.
The corpora can be downloaded here (large files have been split):