The GRAIN corpus -- (G)erman-(RA)dio-(IN)terviews -- based on weekly broadcast radio interviews

We present GRAIN (German RAdio INterviews) as part of the SFB732 Silver Standard Collection. GRAIN contains German radio interviews and is annotated on multiple linguistic layers. The data has been processed with state-of-the-art tools for text and speech and is therefore a resource for text-based linguistic research as well as for speech science. While there is a gold-standard part with manual annotations, the much larger silver-standard part, which grows as the radio station releases more interviews, relies entirely on automatic annotations. We explicitly release different versions of annotations for the same layers (e.g. morpho-syntax) so that multiple layers can be combined and compared in order to derive confidence estimations for the annotations. Parts of the data where the outputs of several tools match can thus be considered clear-cut cases, while mismatches hint at areas of interest that are potentially challenging or where rare phenomena can be found.

Primary Data

  • German radio interviews, just under 10 minutes each (mp3)
  • their edited transcripts from the radio station (pdf or doc)
  • 20 interviews chosen for gold-standard annotations on various layers (with an additional 3 interviews for training annotators)
  • silver-standard part (with only automatic annotations): remainder
    of the interviews
  • at the moment: 143 interviews, about 221,00 word tokens and
    about 23 hours of audio (silver-standard)

Automatic Annotations (silver-standard)

  • LAF Anchors to link annotations to characters in the text
  • Speaker turns
  • Document structure
  • Tokenization [9]
  • Sentence segmentation according to punctuation
  • Acoustic alignment [4]:
    • word boundaries
    • phone boundaries
    • syllable boundaries
  • Parametrized Intonation Events (PaIntE) [3]
  • Intonation:
    • PaIntE-based prediction of intonation events [10]:
      GToBI(S) pitch accent and boundary tone types [2]
    • CNN-based prediction of pitch accent placement
    • (combining these two layers yields higher precision)
  • Additional syllable-based phonetic features:
    • duration of the syllable
    • its position in the word
    • number of phonemes in onset and rhyme
    • van Santen/Hirschberg classification of onset and rhyme [13]
  • Morpho-Syntax
    • 3 constituency parsers (BitPar [14], ISC [15], Stanford [19])
    • 4 dependency parsers (ISC [15], Mate [17], IMSTrans [18], Stanford [16])
    • morpho-syntactic annotations from the parsers
    • confidence estimations as meta-annotation (based on agreement of the different parsers)
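The agreement-based confidence idea can be illustrated with a minimal sketch. It assumes that each parser's output has been reduced to one predicted head index per token and that confidence is simply the share of parsers voting for the majority head; the function name `head_confidence` and this exact scoring are illustrative assumptions, not the procedure used to produce the released annotations.

```python
from collections import Counter


def head_confidence(head_predictions):
    """Majority-vote head and agreement ratio per token.

    head_predictions: one list of head indices per parser, all of equal
    length (hypothetical input format, chosen for illustration only).
    Returns a list of (majority_head, share_of_parsers_agreeing) tuples.
    """
    n_parsers = len(head_predictions)
    result = []
    # zip(*...) regroups the data token by token instead of parser by parser.
    for token_heads in zip(*head_predictions):
        head, votes = Counter(token_heads).most_common(1)[0]
        result.append((head, votes / n_parsers))
    return result


# Four hypothetical dependency parsers, three tokens.
parsers = [
    [2, 0, 2],
    [2, 0, 2],
    [2, 0, 1],
    [3, 0, 2],
]
scores = head_confidence(parsers)
```

Tokens on which all parsers agree receive a score of 1.0 (the clear-cut cases), while lower scores mark the mismatches that may point to difficult or rare constructions.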

Manual annotations (gold-standard)

  • textual un-normalization: re-introducing some features of orality to the text transcript (see [1])
  • POS tagging (see [11])
  • Referential Information status [7]
  • Questions-under-discussion [6]
  • Information structure [8]



Katrin Schweitzer, Kerstin Eckart, Markus Gärtner, Agnieszka Falenska, Arndt Riester, Ina Rösiger, Antje Schweitzer, Sabrina Stehwien, Jonas Kuhn. German Radio Interviews: The GRAIN Release of the SFB732 Silver Standard Collection. In: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC), 7-12 May 2018, Miyazaki (Japan). 


Silver-standard part of the corpus (automatic annotations):

  • Word / phone / syllable boundaries (as ESPS label files: words, syls, phones; or as part of Praat TextGrids)
  • PaIntE-based, automatically predicted GToBI(S) pitch accents and boundary tones (as ESPS label files for accents and tones, or as part of the same Praat TextGrids as above)
  • CNN-based, automatically predicted pitch accent and boundary placements (as ESPS label files: accents, boundaries)
  • Morphosyntactic annotations with confidence estimations (in modified CoNLL09 format) and a merged version with blended dependency trees (again in CoNLL09 format)
  • Raw textual versions of the original interview transcripts, the constituency parses and their confidence estimations, as well as files with merged content from both pipeline types (text and speech) will be available soon.
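As an illustration of how the two accent layers might be combined, the following sketch reads two label files and keeps only the PaIntE-based accents that are confirmed by a CNN-based prediction close in time. It assumes the common ESPS/xwaves label layout (optional header lines, a line containing only `#`, then one `time color label` entry per line); the function names, the sample data, and the 50 ms tolerance are hypothetical and not taken from the corpus documentation.

```python
def parse_esps_labels(text):
    """Return (time_in_seconds, label) pairs from ESPS-style label text."""
    labels = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            # Everything up to and including the "#" line is header.
            if line == "#":
                in_body = True
            continue
        parts = line.split(None, 2)  # time, color code, label text
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        label = parts[2] if len(parts) > 2 else ""
        labels.append((float(parts[0]), label))
    return labels


def agreeing_accents(painte, cnn, tolerance=0.05):
    """Keep PaIntE-based accents that have a CNN accent within `tolerance` s."""
    return [
        (time, label)
        for time, label in painte
        if any(abs(time - cnn_time) <= tolerance for cnn_time, _ in cnn)
    ]


# Made-up sample data in the assumed layout.
painte_text = "signal demo\nnfields 1\n#\n    0.50  121 L*H\n    1.20  121 H*L\n"
cnn_text = "#\n    0.52  121 accent\n    2.00  121 accent\n"

painte = parse_esps_labels(painte_text)
cnn = parse_esps_labels(cnn_text)
confirmed = agreeing_accents(painte, cnn)
```

Intersecting the layers in this way trades recall for precision: only accents supported by both predictors survive.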

Gold-standard part of the corpus (manual annotations)

Audio files and textual transcriptions

Process metadata

  • Curation of the process metadata is almost done. Text files and zip archives for all annotations (and the entire underlying processing chains) will be available for download on this site soon. E-mail us if you want to be informed when the download is ready.