SdeWaC

Type: Corpus
Title: SdeWaC

Description

SdeWaC is based on the deWaC web corpus of the WaCky-Initiative. It contains the parsable sentences from deWaC documents of the .de domain.

SdeWaC is limited to the sentence context. The sentences were sorted, and duplicate sentences within the same domain name were removed. In addition, some heuristics based on Quasthoff et al. (2006), "Corpus Portal for Search in Monolingual Corpora", were applied.
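The sort-then-deduplicate step can be pictured with a minimal Python sketch. It is an illustration only, not the procedure actually used to build SdeWaC: the function name `dedupe_within_domain`, the (domain, sentence) pair layout, and the use of exact string equality as the duplicate criterion are all assumptions.

```python
def dedupe_within_domain(records):
    """Sort (domain, sentence) pairs and drop repeats within the same domain.

    Hypothetical helper: exact string match is assumed as the duplicate test.
    """
    # Sorting places identical (domain, sentence) pairs next to each other.
    ordered = sorted(records)
    unique = []
    previous = None
    for pair in ordered:
        # Keep a pair only if it differs from the one directly before it.
        if pair != previous:
            unique.append(pair)
        previous = pair
    return unique


records = [
    ("a.de", "Hallo Welt."),
    ("b.de", "Hallo Welt."),
    ("a.de", "Hallo Welt."),
    ("a.de", "Guten Tag."),
]
result = dedupe_within_domain(records)
```

In this sketch the same sentence survives once per domain name, so "Hallo Welt." is kept for both a.de and b.de but only once for a.de.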

To extract parsable sentences, the FSPar dependency parser was applied.


Download

SdeWaC-v3 is made available by the WaCky-Initiative and comes in two formats:

  • one sentence per line
  • one token per line, including part-of-speech and lemma annotation (tokenizer and TreeTagger by H. Schmid)

In both formats, additional metadata encodes the domain-name and an "error-rate" of the parser.
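As a rough sketch of how the token-per-line format might be read, the following Python function splits one line into token, part-of-speech tag, and lemma. It assumes three tab-separated fields per token line, which is TreeTagger's usual output layout. The exact column layout of the SdeWaC files and the way the metadata is encoded are not specified here, so the function name `parse_token_line` and the handling of other lines are hypothetical.

```python
def parse_token_line(line):
    """Split one token line into its three annotation fields.

    Assumption: fields are tab-separated in the order token, POS tag, lemma.
    Lines with any other shape (for example metadata) return None.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    token, pos, lemma = fields
    return {"token": token, "pos": pos, "lemma": lemma}


parsed = parse_token_line("Häuser\tNN\tHaus\n")
skipped = parse_token_line("not a token line")
```

Returning `None` for lines that do not match leaves it to the caller to decide how the metadata lines should be handled once the actual file layout is known.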