SdeWaC

Type

Corpus

Description

SdeWaC is based on the deWaC web corpus of the WaCky-Initiative. It contains parsable sentences from deWaC documents of the .de domain.

SdeWaC is limited to the sentence context: sentences are not kept in their original document order. The sentences were sorted, and duplicate sentences within the same domain name were removed. In addition, some heuristics based on Quasthoff et al. 2006: "Corpus Portal for Search in Monolingual Corpora" have been applied.
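The sorting and per-domain duplicate removal described above can be illustrated with a minimal sketch. It is not the code used to build SdeWaC; the function name `dedup_by_domain` and the input shape (pairs of domain name and sentence) are assumptions made for illustration, and the Quasthoff et al. heuristics are not modelled.

```python
from collections import defaultdict


def dedup_by_domain(records):
    """Group sentences by domain name, drop exact duplicates, and sort.

    `records` is an iterable of (domain, sentence) pairs. This input
    shape is hypothetical; it is not the SdeWaC build pipeline.
    """
    by_domain = defaultdict(set)
    for domain, sentence in records:
        # A set keeps only one copy of each sentence per domain.
        by_domain[domain].add(sentence.strip())
    # Sorting makes the output order independent of the input order.
    return {domain: sorted(sentences) for domain, sentences in by_domain.items()}


records = [
    ("a.de", "Hallo Welt."),
    ("a.de", "Hallo Welt."),       # duplicate within the same domain: removed
    ("b.de", "Hallo Welt."),       # same sentence, other domain: kept
    ("a.de", "Auf Wiedersehen."),
]
result = dedup_by_domain(records)
```

Note that a sentence occurring under two different domain names survives in both, since duplicates are only removed within the same domain name.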

To extract parsable sentences, the FSPar dependency parser was applied.

Download

SdeWaC-v3 is made available by the WaCky-Initiative (corpus request via e-mail) and comes in two formats:

one sentence per line
one token per line, including part-of-speech and lemma annotation (tokenizer and TreeTagger by H. Schmid)

In both formats, additional metadata encodes the domain-name and an "error-rate" of the parser.
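As a rough sketch of how the one-token-per-line format might be read: the exact file layout is not specified here, so the code below assumes tab-separated columns in the order token, part-of-speech, lemma (the usual TreeTagger output), and assumes that any line without exactly three fields is metadata or markup that separates sentences. The function name `read_tokens` and the sample lines are hypothetical.

```python
def read_tokens(lines):
    """Yield sentences as lists of (token, pos, lemma) tuples.

    Assumptions (not confirmed by the corpus documentation):
    - token lines have exactly three tab-separated fields;
    - every other line is metadata or markup and ends the current sentence.
    """
    sentence = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            sentence.append(tuple(fields))
        elif sentence:
            # A non-token line closes the sentence collected so far.
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Hypothetical sample; the real metadata lines may look different.
sample = [
    "<meta>\n",
    "Das\tART\tdie\n",
    "Haus\tNN\tHaus\n",
    "</meta>\n",
    "Ja\tPTKANT\tja\n",
]
sentences = list(read_tokens(sample))
```

The domain name and parser "error-rate" metadata are deliberately skipped here, because their encoding would need to be checked against the actual files.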

General Contact IMS

Pfaffenwaldring 5 b, 70569 Stuttgart
