Huge German Corpus (HGC)

The "Huge German Corpus" (HGC) is a collection of German-language texts (newspaper articles and legal texts) prepared for use with the IMS Corpus Workbench (CWB).

Type: Corpus
Description: The "Huge German Corpus" (HGC) is a collection of German texts (newspaper articles and legal texts) comprising about 204 million tokens, punctuation included, in 12.2 million sentences (about 180 million "real" words). The corpus was automatically segmented into sentences, then lemmatized and part-of-speech tagged with the TreeTagger (Schmid 1994) using the STTS tagset (Schiller et al. 1999). The corpus is partly based on data taken from the European Corpus Initiative Multilingual Corpus I (ECI/MCI).
Reference: Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report. Institute for Natural Language Processing (IMS), University of Stuttgart.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.
Download: Unfortunately, the corpus cannot be made available. As an alternative, we recommend SdeWaC, which can be requested by e-mail from the WaCky initiative.
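
To illustrate the annotation layers named in the description (sentence segmentation, STTS part-of-speech tags, lemmas), here is a minimal Python sketch that reads a verticalized, TreeTagger-style file: one token per line with tab-separated word form, tag, and lemma. The `<s>`/`</s>` sentence markers, the column order, the names `Token` and `read_sentences`, and the sample data are assumptions made for illustration; the actual HGC file layout is not documented here.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str   # surface form
    pos: str    # STTS part-of-speech tag
    lemma: str  # base form


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per <s>...</s> block (assumed layout)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if "\t" not in line:
            # Lines without tabs are treated as structural markup.
            if line == "</s>" and sentence:
                yield sentence
            if line in ("<s>", "</s>"):
                sentence = []
            continue
        parts = line.split("\t")
        if len(parts) == 3:  # ignore lines that do not fit the assumed layout
            sentence.append(Token(*parts))


# Hypothetical sample data, not taken from the HGC.
SAMPLE = (
    "<s>\n"
    "Das\tPDS\tdie\n"
    "ist\tVAFIN\tsein\n"
    "ein\tART\tein\n"
    "Test\tNN\tTest\n"
    ".\t$.\t.\n"
    "</s>\n"
)

sentences = list(read_sentences(SAMPLE.splitlines()))
for sent in sentences:
    print(" ".join(f"{t.word}/{t.pos}" for t in sent))
```

Counting tokens and sentences over such a stream is how figures like those in the description would be obtained; 204 million tokens in 12.2 million sentences works out to roughly 17 tokens per sentence.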

General Contact IMS

Pfaffenwaldring 5 b, 70569 Stuttgart
