GRAIN

Type: Corpus
Title: GRAIN
Author: Katrin Schweitzer, Kerstin Eckart, Markus Gärtner, Agnieszka Falenska, Arndt Riester, Ina Rösiger, Antje Schweitzer, Sabrina Stehwien, Jonas Kuhn

Description

The GRAIN corpus -- (G)erman (RA)dio (IN)terviews -- is based on radio interviews that are broadcast weekly.

We present GRAIN (German RAdio INterviews) as part of the SFB732 Silver Standard Collection. GRAIN contains German radio interviews and is annotated on multiple linguistic layers. The data has been processed with state-of-the-art tools for text and speech and therefore represents a resource for text-based linguistic research as well as speech science. While there is a gold-standard part with manual annotations, the much larger silver-standard part, which grows as the radio station releases more interviews, relies completely on automatic annotations. We explicitly release different versions of annotations for the same layers (e.g. morpho-syntax) with the aim of combining and comparing multiple layers in order to derive confidence estimations for the annotations. Parts of the data where the outputs of several tools match can therefore be considered clear-cut cases, while mismatches hint at areas of interest that are potentially challenging or where rare phenomena can be found.

Primary Data

  • German radio interviews, just under 10 minutes each (mp3)
  • their edited transcripts from the radio station (pdf or doc)
  • 20 interviews chosen for gold-standard annotations on various layers (with an additional 3 interviews for training annotators)
  • silver-standard part (with only automatic annotations): remainder of the interviews
  • at the moment: 143 interviews, about 221,00 word tokens and about 23 hours of audio (silver-standard)

Automatic Annotations (silver-standard)

  • LAF Anchors to link annotations to characters in the text
  • Speaker turns
  • document structure
  • Tokenization [9]
  • Sentence segmentation according to punctuation
  • Acoustic alignment [4]:
    • word boundaries
    • phone boundaries
    • syllable boundaries
  • Parametrized Intonation Events (PaIntE) [3]
  • Intonation: 
    • PaIntE-based prediction of intonation events [10]:
      GToBI(S) pitch accent and boundary tone types [2]
    • CNN-based prediction of pitch accent placement [12]
    • (combination of those two layers yields higher precision)
  • Additional syllable-based phonetic features:
    • duration of the syllable
    • its position in the word
    • number of phonemes in onset and rhyme
    • VanSanten/Hirschberg classification of onset and rhyme [13]
  • Morpho-Syntax
    • 3 constituency parsers (BitPar [14], ISC [15], Stanford [19])
    • 4 dependency parsers (ISC [15], Mate [17], IMSTrans [18], Stanford [16])
    • morpho-syntactic annotations from the parsers
    • confidence estimations as meta-annotation (based on agreement of the different parsers)
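
The agreement-based confidence estimation mentioned in the last item can be illustrated with a minimal Python sketch. It is not the procedure used to build the corpus, which this page does not spell out; it simply assumes that confidence is the share of parsers voting for the most frequent head of each token. The function name, the parser names and the head values are hypothetical.

```python
from collections import Counter


def agreement_confidence(predictions):
    """Majority head and agreement ratio per token.

    predictions maps a (hypothetical) parser name to a list of head
    indices, one per token of the same sentence.
    """
    lengths = {len(heads) for heads in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all parsers must annotate the same tokens")

    n_parsers = len(predictions)
    result = []
    # zip(*...) groups the head proposed by every parser for one token.
    for token_heads in zip(*predictions.values()):
        head, votes = Counter(token_heads).most_common(1)[0]
        result.append((head, votes / n_parsers))
    return result


# Illustrative data only: four parsers, one sentence with four tokens.
example = {
    "parser_a": [2, 0, 2, 3],
    "parser_b": [2, 0, 2, 2],
    "parser_c": [2, 0, 4, 2],
    "parser_d": [2, 0, 2, 2],
}
scores = agreement_confidence(example)
for position, (head, confidence) in enumerate(scores, start=1):
    print(position, head, confidence)
```

Tokens with a ratio of 1.0 correspond to the clear-cut cases described above, while lower ratios point to tokens on which the tools disagree.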

Manual annotations (gold-standard)

  • Textual un-normalization: re-introducing some features of orality to the text transcript (see [1])
  • POS tagging (see [11])
  • Referential information status [7]
  • Questions under discussion [6]
  • Information structure [8]

 

References:

[1] Eckart, K. and Gärtner, M. (2016). Creating Silver Standard Annotations for a Corpus of Non-Standard Data. In Dipper, S., Neubarth, F., and Zinsmeister, H., editors, Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), volume 16 of BLA: Bochumer Linguistische Arbeitsberichte, pages 90–96, Bochum, Germany.

[2] Mayer, J. (1995). Transcribing German intonation – the Stuttgart system. Technical report, Universität Stuttgart. (online version)

[3] Möhler, G. (2001). Improvements of the PaIntE model for F0 parametrization. Technical report, Institute of Natural Language Processing, University of Stuttgart. Draft version.

[4] Rapp, S. (1998). Automatisierte Erstellung von Korpora für die Prosodieforschung. PhD thesis, IMS, Universität Stuttgart. AIMS 4 (1).

[5] Rebholz-Schuhmann, D., Jimeno-Yepes, A. J., van Mulligen, E. M., Kang, N., Kors, J., Milward, D., Corbett, P., Buyko, E., Tomanek, K., Beisswanger, E., and Hahn, U. (2010). The CALBC silver standard corpus for biomedical named entities - a study in harmonizing the contributions from four independent named entity taggers. In Calzolari, N. (Conference Chair), Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., and Tapias, D., editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European Language Resources Association (ELRA).

[6] Reyle, U. and Riester, A. (2016). Joint information structure and discourse structure analysis in an Underspecified DRT framework. In Hunter, J., Simons, M., and Stone, M., editors, Proceedings of the 20th Workshop on the Semantics and Pragmatics of Dialogue (JerSem), pages 15–24, New Brunswick, NJ, USA.

[7] Riester, A. and Baumann, S. (2017). The RefLex Scheme – Annotation Guidelines, volume 14 of SinSpeC. Working Papers of the SFB 732. University of Stuttgart.

[8] Riester, A., Brunetti, L., and De Kuthy, K. (to appear). Annotation guidelines for Questions under Discussion and information structure. In Adamou, E., Haude, K., and Vanhove, M., editors, Information Structure in Lesser-Described Languages: Studies in Syntax and Prosody. Benjamins, Amsterdam.

[9] Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

[10] Schweitzer, A. (2010). Production and Perception of Prosodic Events – Evidence from Corpus-based Experiments. Doctoral dissertation, Universität Stuttgart.

[11] Seeker, W. (2016). Guidelines for the Annotation of Syntactic Structure in the IMS Interview Corpus.

[12] Stehwien, S. and Vu, N.T. (2017). Prosodic event detection using convolutional neural networks with context information. In Proceedings of Interspeech, pages 2326–2330.

[13] van Santen, J. and Hirschberg, J. (1994). Segmental effects on timing and height of pitch contours. In Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 94), pages 719–722, Yokohama, Japan, 09.

[14] Schmid, H. (2006). Trace prediction and recovery with unlexicalized PCFGs and slash features. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Sydney, Australia, July. Association for Computational Linguistics.

[15] Björkelund, A., Cetinoglu, O., Farkas, R., Mueller, T., and Seeker, W. (2013). (Re)ranking meets morphosyntax: State-of-the-art results from the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 135–145, Seattle, Washington, USA, October. Association for Computational Linguistics.

[16] Chen, D. and Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In Alessandro Moschitti, et al., editors, EMNLP, pages 740–750. ACL.

[17] Bohnet, B. and Nivre, J. (2012). A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1455–1465, Jeju Island, Korea, July. Association for Computational Linguistics.

[18] Björkelund, A. and Nivre, J. (2015). Non-Deterministic Oracles for Unrestricted Non-Projective Transition-Based Dependency Parsing. In Proceedings of the 14th International Conference on Parsing Technologies, pages 76–86, Bilbao, Spain, July. Association for Computational Linguistics.

[19] Klein, D. and Manning, C. D. (2003). Fast exact inference with a factored model for natural language parsing. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 3–10. MIT Press.

 


Reference

Katrin Schweitzer, Kerstin Eckart, Markus Gärtner, Agnieszka Falenska, Arndt Riester, Ina Rösiger, Antje Schweitzer, Sabrina Stehwien, Jonas Kuhn. German Radio Interviews: The GRAIN Release of the SFB732 Silver Standard Collection. In: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC), 7-12 May 2018, Miyazaki (Japan). 


Download

Silver-standard part of the corpus (automatic annotations):

  • Word / phone / syllable boundaries (as ESPS label files: words, syls, phones; or as part of Praat TextGrids)
  • PaIntE-based automatically predicted GToBI(S) pitch accents and boundary tones (as ESPS label files for accents and tones, or as part of the same Praat TextGrids as above)
  • CNN-based automatically predicted pitch accent and boundary placements (as ESPS label files: accents, boundaries)
  • Morphosyntactic annotations with confidence estimations (in modified CoNLL09 format) and a merged version with blended dependency trees (again in CoNLL09 format)
  • Raw textual versions of the original interview transcripts, the constituency parses and their confidence estimations, as well as files with merged content from both pipeline types (text and speech) will be available soon.
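
As a rough illustration of how the two accent layers from the list above might be combined, the following Python sketch reads two label files and keeps only the PaIntE-based accents that have a CNN-based accent nearby. It rests on several assumptions that should be checked against the actual files: that the label files follow the common ESPS/xwaves layout (header lines, a line containing only "#", then one "time color label" entry per line), that times are in seconds, and that a tolerance of 50 ms is a sensible matching window. The function names, sample contents and label strings are hypothetical.

```python
def parse_esps_labels(text):
    """Return (time, label) pairs from the text of an ESPS-style label file.

    Assumed layout: header lines, a line containing only '#', then
    entries of the form '<time> <color> <label>'.
    """
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Everything up to and including the '#' line is header.
            if stripped == "#":
                in_body = True
            continue
        if not stripped:
            continue
        parts = stripped.split(None, 2)
        if len(parts) < 2:
            continue  # skip malformed lines
        label = parts[2] if len(parts) == 3 else ""
        entries.append((float(parts[0]), label))
    return entries


def agreeing_accents(painte, cnn, tolerance=0.05):
    """Keep PaIntE-based accents that have a CNN-based accent within
    `tolerance` seconds (an assumed matching window)."""
    cnn_times = [time for time, _ in cnn]
    return [
        (time, label)
        for time, label in painte
        if any(abs(time - other) <= tolerance for other in cnn_times)
    ]


# Made-up file contents for illustration only.
painte_text = """signal demo
nfields 1
#
    0.310000  121 L*H
    0.920000  121 H*L
    1.500000  121 H*
"""
cnn_text = """signal demo
nfields 1
#
    0.330000  121 accent
    1.490000  121 accent
"""

combined = agreeing_accents(parse_esps_labels(painte_text),
                            parse_esps_labels(cnn_text))
print(combined)
```

Keeping only the accents on which both layers agree trades coverage for precision, in line with the remark in the annotation overview that combining the two layers yields higher precision.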

Gold-standard part of the corpus (manual annotations)

 

Audio files and textual transcriptions

 
Process metadata

  • Curation of the process metadata is almost done. Text files and zip archives for all annotations (and the entire underlying processing chains) will be available for download on this site soon. E-mail us if you want to be informed when the download is ready.