SdeWaC

SdeWaC is based on the deWaC web corpus of the WaCky-Initiative. SdeWaC contains parsable sentences from deWaC documents of the .de domain.

Type
Corpus
Description

SdeWaC is limited to the sentence context. The sentences were sorted, and duplicate sentences within the same domain name were removed. In addition, some heuristics based on Quasthoff et al. (2006), "Corpus Portal for Search in Monolingual Corpora", were applied.
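The sorting and duplicate removal described above can be illustrated with a minimal sketch. It is not the procedure actually used to build SdeWaC: the function name `dedupe_within_domain`, the (domain, sentence) record layout, the sort key, and the use of exact string matching are all assumptions made for illustration.

```python
def dedupe_within_domain(records):
    """Sort (domain, sentence) pairs and drop repeats within one domain.

    Assumption: two sentences are duplicates only if their strings
    match exactly and they come from the same domain name.
    """
    unique = []
    previous = None
    # After sorting, identical (domain, sentence) pairs are adjacent,
    # so comparing each record with its predecessor is sufficient.
    for record in sorted(records):
        if record != previous:
            unique.append(record)
        previous = record
    return unique


# Made-up sample data: the same sentence occurs twice under one domain
# and once under another.
records = [
    ("example.de", "Das ist ein Satz."),
    ("beispiel.de", "Das ist ein Satz."),
    ("example.de", "Das ist ein Satz."),
]
result = dedupe_within_domain(records)
```

In this sketch the sentence is kept once per domain, so `result` holds two records: the copy under the other domain name is not treated as a duplicate.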

To extract parsable sentences, the FSPar dependency parser was applied.

Download

SdeWaC-v3 is made available by the WaCky-Initiative and comes in two formats:

  • one sentence per line
  • one token per line, including part-of-speech and lemma annotation (tokenizer and TreeTagger by H. Schmid)

In both formats, additional metadata encodes the domain name and an "error-rate" of the parser.
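As a rough sketch of how the one-token-per-line format might be read, the following assumes the usual TreeTagger column layout of token, part-of-speech tag, and lemma separated by tabs. The exact layout of the SdeWaC files, and in particular of the metadata lines, is not specified here, so any line that does not have exactly three tab-separated fields is simply skipped. The names `Token` and `read_tokens` and the sample lines are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines of the assumed form 'token<TAB>pos<TAB>lemma'.

    Lines with a different number of fields (for example metadata or
    markup lines, whose format is not assumed here) are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield Token(*fields)


# Made-up sample input, not taken from the corpus.
sample = [
    "<hypothetical-metadata-line>\n",
    "Das\tPDS\tdie\n",
    "ist\tVAFIN\tsein\n",
    "gut\tADJD\tgut\n",
    ".\t$.\t.\n",
]
tokens = list(read_tokens(sample))
```

For real data, the same function could be passed an open file handle instead of a list, after checking the actual column layout and metadata encoding of the downloaded files.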
