Huge German Corpus (HGC)
The "Huge German Corpus (HGC)" is a collection of German texts (newspapers, legal texts) comprising about 204 million tokens including punctuation (about 180 million "real" words) in 12.2 million sentences. The corpus was automatically segmented into sentences, then lemmatized and part-of-speech tagged with the TreeTagger (Schmid 1994) using the STTS tagset (Schiller et al. 1999). The corpus is partly based on data taken from the European Corpus Initiative Multilingual Corpus I (ECI/MCI).
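TreeTagger writes its annotation as one token per line, with tab-separated token, STTS tag, and lemma columns. A minimal sketch of reading such output and separating "real" words from punctuation (the sample lines and the helper `parse_treetagger` are illustrative, not taken from the HGC itself; STTS marks punctuation with tags beginning with `$`):

```python
# Parse TreeTagger-style output: one token per line,
# TAB-separated "token<TAB>STTS-tag<TAB>lemma".
# The sample sentence is illustrative, not from the HGC.
sample = """\
Das\tART\tdie
Korpus\tNN\tKorpus
wurde\tVAFIN\twerden
getaggt\tVVPP\ttaggen
.\t$.\t.
"""

def parse_treetagger(text):
    """Yield (token, tag, lemma) triples from TreeTagger output."""
    for line in text.splitlines():
        if not line.strip():
            continue
        token, tag, lemma = line.split("\t")
        yield token, tag, lemma

triples = list(parse_treetagger(sample))

# STTS punctuation tags all start with "$" ($., $,, $( ),
# so the "real" word count excludes them.
real_words = [tok for tok, tag, _ in triples if not tag.startswith("$")]
```

Counting tokens with and without `$`-tags in this way is how a token total "including punctuation" can be reconciled with a smaller count of "real" words.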
Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report. Institute for Natural Language Processing (IMS), University of Stuttgart.
Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.
Unfortunately, the corpus cannot be made available. As an alternative, we refer to SdeWaC, which can be requested by e-mail from the WaCky initiative.