Huge German Corpus (HGC)

Type: Corpus
Title: Huge German Corpus (HGC)

Description

The "Huge German Corpus (HGC)" is a collection of German texts (newspaper and legal texts) comprising about 204 million tokens, punctuation included, in 12.2 million sentences (about 180 million "real" words). The corpus was automatically segmented into sentences. It was then lemmatized and part-of-speech tagged with the TreeTagger (Schmid 1994) using the STTS tagset (Schiller et al. 1999). The corpus is partly based on data taken from the European Corpus Initiative Multilingual Corpus I (ECI/MCI).
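The figures above imply an average of roughly 16.7 tokens per sentence (204 million / 12.2 million). The following is a minimal Python sketch of how such counts could be derived from tagged text. It rests on several assumptions that this description does not confirm: that the data is stored one token per line as tab-separated word form, STTS tag, and lemma; that "real" words are the tokens not tagged as punctuation (`$.`, `$,`, `$(` in STTS); and that a sentence ends at each `$.` tag, which is only a stand-in for the corpus's own automatic segmentation. The sample lines and all function names are invented for illustration.

```python
from typing import Iterable, List, Tuple

# STTS punctuation tags: sentence-final, comma, other punctuation.
STTS_PUNCT_TAGS = {"$.", "$,", "$("}

Token = Tuple[str, str, str]  # (word form, STTS tag, lemma)


def parse_tagged_lines(lines: Iterable[str]) -> List[Token]:
    """Parse 'word<TAB>tag<TAB>lemma' lines (assumed layout) into tuples."""
    tokens: List[Token] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank lines or anything not in three columns
        tokens.append((parts[0], parts[1], parts[2]))
    return tokens


def count_words(tokens: Iterable[Token]) -> int:
    """Count tokens whose tag is not an STTS punctuation tag."""
    return sum(1 for _, tag, _ in tokens if tag not in STTS_PUNCT_TAGS)


def split_sentences(tokens: Iterable[Token]) -> List[List[Token]]:
    """Naive split: close a sentence after every '$.' token (assumption)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for tok in tokens:
        current.append(tok)
        if tok[1] == "$.":
            sentences.append(current)
            current = []
    if current:  # trailing tokens without sentence-final punctuation
        sentences.append(current)
    return sentences


# Hand-written sample, not taken from the HGC.
sample = [
    "Das\tPDS\tdie",
    "ist\tVAFIN\tsein",
    "ein\tART\teine",
    "Korpus\tNN\tKorpus",
    ".\t$.\t.",
    "Es\tPPER\tes",
    "ist\tVAFIN\tsein",
    "groß\tADJD\tgroß",
    ",\t$,\t,",
    "sagt\tVVFIN\tsagen",
    "man\tPIS\tman",
    ".\t$.\t.",
]

tokens = parse_tagged_lines(sample)
sentences = split_sentences(tokens)
n_tokens = len(tokens)
n_words = count_words(tokens)
n_sentences = len(sentences)
tokens_per_sentence = n_tokens / n_sentences
```

On the sample, the sketch separates the total token count from the count of non-punctuation tokens, mirroring the distinction between tokens and "real" words made in the description.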

References

Anne Schiller, Simone Teufel, Christine Stöckert, and Christine Thielen. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS. Technical report. Institute for Natural Language Processing (IMS), University of Stuttgart.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.