Clean Corpus of Historical American English (CCOHA)

Cleaned version of the Corpus of Historical American English (COHA)

Type

Corpus

Authors

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde

Description

The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies of English. We cleaned the corpus to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties.

The resulting corpus, CCOHA, also contains a larger number of clean word tokens, which can offer better insights into language change and allow a wider variety of tasks to be performed.

Reference

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn and Sabine Schulte im Walde. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA).

Download

To obtain the corpus, please send us your license for the COHA corpus, with Mark Davies in CC. We can then share the cleaned corpus with you.

A sample from the corpus used in SemEval-2020 Task 1 is also available.

We also provide the target word list used in the cleaning process.

Dominik Schlechtweg

Dr.

Junior research group leader

Sabine Schulte im Walde

Prof. Dr.

Akademische Rätin (Associate Professor)
