Wind-Of-Change Corpora (WOCC)
- Type
-
Corpus
- Author
-
Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde
- Description
-
This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019). It consists of a diachronic corpus pair and a domain-specific corpus pair:
diachronic:
- DTA18: sentences from documents in DTA (Deutsches Textarchiv) published between 1750–1799
- DTA19: sentences from documents in DTA published between 1850–1899
domain-specific:
- SDEWAC: a subsample of sentences from SdeWaC (Faaß & Eckart 2013)
- COOK: sentences from web-crawled cooking-related texts
Format
=====
Low-frequency words and punctuation have been removed and words have been lemmatized. Each line corresponds to one sentence. Sentences were randomly shuffled within each corpus. Large files were split and all files were zipped. (See also sample from DTA18.)
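The one-sentence-per-line layout can be read with a few lines of Python. The following is only a minimal sketch: it assumes the files have already been unzipped and that lemmas within a line are separated by whitespace, which the description above does not state explicitly. The function names and the sample sentences are hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-empty line as a list of lemmas.

    Assumption: lemmas are whitespace-separated (not confirmed by the
    corpus description).
    """
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            yield tokens


def lemma_frequencies(lines: Iterable[str]) -> Counter:
    """Count how often each lemma occurs across all sentences."""
    counts: Counter = Counter()
    for tokens in iter_sentences(lines):
        counts.update(tokens)
    return counts


if __name__ == "__main__":
    # Made-up sample lines standing in for an unzipped corpus file.
    sample = ["der Wind wehen\n", "\n", "der Wandel kommen\n"]
    print(lemma_frequencies(sample).most_common(1))  # [('der', 2)]
```

Because both functions accept any iterable of strings, an open file handle (for example one opened with `encoding="utf-8"`) can be passed in directly, so a large corpus part is processed line by line rather than loaded into memory at once.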
The corpora are the basis for the DURel and SURel datasets (Schlechtweg et al. 2018, Hätty et al. 2019). Together with these datasets, they can be used to evaluate models of Lexical Semantic Change Detection on meaning changes between time periods or domains.
More detailed information about the corpora can be found in Schlechtweg et al. (2019).
- Reference
-
Dominik Schlechtweg, Anna Hätty, Marco del Tredici, and Sabine Schulte im Walde. 2019. A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. ACL.
- Download
- Related Resources
-
- DURel: annotation data relying on diachronic corpus pair
- SURel: annotation data relying on domain-specific corpus pair
- Metaphoric Change: annotation data relying on diachronic corpus pair
Dr. Dominik Schlechtweg, Junior research group leader
Prof. Dr. Sabine Schulte im Walde, Akademische Rätin (Associate Professor)