Huge German Corpus (HGC)

The "Huge German Corpus" (HGC) is a collection of German-language texts (newspaper articles and legal texts) prepared for use with the IMS Corpus Workbench (CWB).

Type
Corpus
Description

The "Huge German Corpus" (HGC) is a collection of German texts (newspaper articles and legal texts) comprising about 204 million tokens, punctuation included, in 12.2 million sentences (about 180 million "real" words). The corpus was automatically segmented into sentences. It was also lemmatized and part-of-speech tagged with the TreeTagger (Schmid 1994) using the STTS tagset (Schiller et al. 1999). The corpus is partly based on data from the European Corpus Initiative Multilingual Corpus I (ECI/MCI).
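The difference between tokens "including punctuation" and "real" words can be illustrated with a minimal sketch. It assumes the usual TreeTagger output layout of one token per line with tab-separated token, tag, and lemma columns, and it assumes that "real" words are all tokens whose STTS tag does not start with "$" (the STTS punctuation tags are "$.", "$," and "$("). The sample sentences and the function names are hypothetical and are not taken from the HGC. How the figures for the corpus were actually computed is not documented here.

```python
# Hypothetical sample in a token<TAB>tag<TAB>lemma layout (illustrative only,
# not HGC data; the lemma column is an approximation of tagger output).
SAMPLE = (
    "Das\tART\tdie\n"
    "Gericht\tNN\tGericht\n"
    "entscheidet\tVVFIN\tentscheiden\n"
    "heute\tADV\theute\n"
    ".\t$.\t.\n"
    "Ja\tPTKANT\tja\n"
    ",\t$,\t,\n"
    "das\tPDS\tdie\n"
    "stimmt\tVVFIN\tstimmen\n"
    ".\t$.\t.\n"
)


def parse_tagged(text):
    """Split tagged text into (token, tag, lemma) tuples, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        token, tag, lemma = line.split("\t")
        rows.append((token, tag, lemma))
    return rows


def count_tokens(rows):
    """Return (all tokens, non-punctuation tokens, sentence-final marks)."""
    total = len(rows)
    # Assumption: a "real" word is any token without an STTS punctuation tag.
    words = sum(1 for _, tag, _ in rows if not tag.startswith("$"))
    # Assumption: each "$." tag closes one sentence.
    sentences = sum(1 for _, tag, _ in rows if tag == "$.")
    return total, words, sentences


if __name__ == "__main__":
    total, words, sentences = count_tokens(parse_tagged(SAMPLE))
    print(f"tokens={total} words={words} sentences={sentences}")
```

Counting sentences by the "$." tag is only a stand-in here; the corpus itself was segmented automatically, and the segmenter's rules may differ.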

 

References

Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report. Institute for Natural Language Processing (IMS), University of Stuttgart.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

Download

Unfortunately, the corpus cannot be made available. As an alternative, we recommend SdeWaC, which can be requested by e-mail from the WaCky initiative.

 
