Clean Corpus of Historical American English (CCOHA)

Methods to clean the downloadable version of the COHA corpus

Type

Corpus

Authors

Reem Alatrash, Dominik Schlechtweg, Sabine Schulte im Walde

Description

The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies of English. We provide methods that can be applied to the downloadable version of COHA to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties.

The resulting corpus, CCOHA, contains a larger number of cleaned word tokens, which can offer better insights into language change and allow a wider variety of tasks to be performed.

Download

https://www.ims.uni-stuttgart.de/data/ccoha

Sabine Schulte im Walde
Apl. Prof. Dr.

Academic Councillor