SdeWaC is based on the deWaC web corpus of the WaCky-Initiative. SdeWaC contains parsable sentences from deWaC documents of the .de domain.
SdeWaC is limited to the sentence context. The sentences were sorted, and duplicate sentences within the same domain-name were removed. In addition, some heuristics based on Quasthoff et al. (2006), "Corpus Portal for Search in Monolingual Corpora", were applied.
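The sorting and duplicate removal described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build SdeWaC: the function name dedupe_within_domain and the (domain, sentence) input pairs are assumptions made for illustration, and only exact string duplicates are handled.

```python
def dedupe_within_domain(records):
    """Sort (domain, sentence) pairs and drop exact duplicates per domain.

    `records` is a hypothetical input layout: an iterable of
    (domain_name, sentence) tuples. A sentence that occurs under two
    different domain names is kept once for each domain.
    """
    # A set removes pairs that are identical in both domain and sentence;
    # sorted() then orders the result by domain first, sentence second.
    return sorted(set(records))


records = [
    ("a.de", "Hallo Welt."),
    ("b.de", "Hallo Welt."),
    ("a.de", "Hallo Welt."),  # duplicate within a.de, will be removed
    ("a.de", "Guten Tag."),
]
unique = dedupe_within_domain(records)
```

In this sketch the sentence "Hallo Welt." survives once under each of the two domain names, since duplicates are only removed within the same domain-name.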
To extract parsable sentences, the FSPar dependency parser was applied.
SdeWaC-v3 is made available by the WaCky-Initiative (corpus request via e-mail) and comes in two formats:
- one sentence per line
- one token per line, including part-of-speech and lemma annotation (Tokenizer and TreeTagger by H. Schmid)
In both formats, additional metadata encodes the domain-name and an "error-rate" of the parser.
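As a rough illustration of reading the one-token-per-line format, the following Python sketch assumes three tab-separated columns in the order token, part-of-speech, lemma. That column layout is an assumption and is not stated in this description. The layout of the metadata lines is not described here either, so the sketch simply skips every line that does not have exactly three fields. The function name read_tokens is hypothetical.

```python
def read_tokens(lines):
    """Yield (token, pos, lemma) triples from an iterable of text lines.

    Assumption: each token line holds exactly three tab-separated fields.
    Any other line (for example metadata, whose layout is not specified
    here) is skipped rather than interpreted.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield tuple(fields)


# Usage with a few hand-written example lines (not taken from the corpus).
sample = ["Das\tPDS\tdie\n", "ist\tVAFIN\tsein\n", "\n"]
tokens = list(read_tokens(sample))
```

With a real file, the same function could be called on an open file handle, since a file object also yields one line at a time.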