Corpora at the IMS

An overview of the corpora available at the IMS

Below you will find an overview of the corpora developed at the IMS.

Corpora of the IMS

Title Description
A Survey and Experiments on Annotated Corpora for Emotion Classification in Text Aggregated corpus of emotion classification datasets
ANVAN-LS: Lexical Substitution for Evaluating Compositional Distributional Models ANVAN-LS is a lexical substitution dataset for CDSM evaluation sampled from an English-language corpus with manual “all-words” lexical substitution annotation
Analysis of emotion communication channels in fan-fiction A corpus of fan fiction excerpts, annotated with emotion channels and emotion
Appraisal-based Emotion Analysis Corpora and Models for Appraisal-based Emotion Analysis
BASHI BASHI is a corpus consisting of 50 Wall Street Journal (WSJ) articles which adds bridging anaphors and their antecedents to the other gold annotations that have been created as part of the OntoNotes project. Bridging anaphors are context-dependent expressions that do not refer to the same entity as their antecedent, but to a related entity
Biomedical Claims in Tweets A corpus of 1200 Twitter posts with annotations of explicit and implicit biomedical claims
Chess Dataset This corpus consists of annotated chess games that were posted on
Clean Corpus of Historical American English (CCOHA) Cleaned version of the Corpus of Historical American English (COHA)
CoInCo: Concepts in Context An English corpus that adds all-words lexical substitution annotation to a sample of the newswire and fiction genres of the freely available MASC corpus
Comparisons in Product Reviews Sentences from camera reviews annotated with comparisons
DEmaNet Corpus DEmaNet
DIRE dataset Dataset from Boleda et al. IWCS 2017
DIRNDL (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis – is a corpus resource based on hourly broadcast German radio news
Data and Implementation for "Frowning Frodo, Wincing Leia, and a Seriously Great Friendship: Learning to Classify Emotional Relationships of Fictional Characters" Data for NAACL 2019 publication of Evgeny Kim and Roman Klinger
Data and Implementation for German Satire Detection with Adversarial Training Source with documentation
Data for the Intensifiers in the context of emotions Data for the papers: "Florian Strohm and Roman Klinger. An empirical analysis of the role of amplifiers, downtoners, and negations in emotion classification in microblogs.", and "Laura Ana Maria Bostan and Roman Klinger. Exploring fine-tuned embeddings that model intensifiers for emotion analysis."
Determinants of Grader Agreement: A Study of Short Answer Grading Corpora
Europarl Nominal Compound Database The Europarl Nominal Compound Database (ENCD) was automatically extracted from Europarl v7 of OPUS. This database contains English nominal compounds and their equivalents in up to nine languages
Europarl Nominal Compoundhood Ratings The Europarl Nominal Compoundhood Ratings (ENCR) is a selection of 394 sentences from the English portion of the Europarl corpus (Europarl v7, OPUS), annotated with 824 candidate compounds
Event-focused Emotion Corpora for German and English German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources
GRAIN The GRAIN corpus, (G)erman (RA)dio (IN)terviews, is based on weekly broadcast radio interviews and is presented as part of the SFB732 Silver Standard Collection
GRAIN-S GRAIN-S -- Manually annotated (S)yntax for (G)erman (RA)dio (IN)terviews
GerDraCor-Coref - German Drama Corpus for Coreference A corpus with coreference annotations for German dramatic texts
GoodNewsEveryone A corpus of English news headlines annotated with emotions, semantic roles, and reader perception
Huge German Corpus (HGC) The "Huge German Corpus" (HGC) is a collection of German-language texts (newspaper articles and legal texts) prepared for use with the IMS Corpus Workbench (CWB)
IMS Citation Corpus Online appendix to the COLING 2012 paper "Towards a Generic and Flexible Citation Classifier Based on a Faceted Classification Scheme."
IMS GECO database Speech corpus of spontaneous conversations including participants' mutual social ratings and personality factors
IMSCONV database Corpus for investigating convergence in spontaneous speech dialogues
Multilingual parallel TED talk dataset Multilingual parallel TED talk dataset
NLI corpora (Stehwien & Pado 2015) Data for the paper "Generalization in Native Language Identification -- Learners versus Scientists" (Stehwien & Pado CLiC 2015)
Obituary Corpus Obituaries annotated by section
REMAN - Relational Emotion Annotation for Fiction Relational EMotion ANnotation – a corpus with 1720 fictional text excerpts from Project Gutenberg
Referential Distributional Semantics: City and Country Datasets City and country datasets from Gupta et al., EMNLP 2015
Resources for Emotion Analysis A collection of resources created at IMS related to emotion and sentiment analysis
RiQuA – Rich Quotation Analysis Corpus A corpus of English literary texts, annotated for quotations including their social structures.
SCARE - The Sentiment Corpus of App Reviews with Fine-grained Annotations in German Fine-grained annotations for mobile application reviews
SciCorp Corpus of full-text English scientific papers from genetics and computational linguistics
SdeWaC SdeWaC is based on the deWaC web corpus of the WaCky initiative. SdeWaC contains parsable sentences from deWaC documents of the .de domain
SemEval-2020 Task 1: English Test Data English Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
SemEval-2020 Task 1: German Test Data German Test Data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
SemEval-2020 Task 1: Test data Test data for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection, from the University of Stuttgart
Sentiment Relevance Corpus This corpus contains 3847 sentences, taken from 125 documents annotated for Sentiment Relevance. The data is a subset of the v2.0 movie polarity dataset (Pang & Lee, 2004)
Sich20 Annotation & Jupyter Notebook for Pado & Hole 2020 (Distributional Analysis of Polysemous Function Words)
Span ID Meta Learning Code and Material for Performance Prediction on Meta Learning
Stance Sentiment Emotion Corpus (SSEC) An annotation of the SemEval 2016 Twitter stance and sentiment corpus with emotion labels
Stance and Hate/Offensive Speech Detection during the US2020 elections A corpus of tweets from the 2020 US elections annotated for stance and hateful or offensive speech
TIGER Corpus The TIGER Corpus consists of approximately 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes
USAGE Corpus The USAGE corpus consists of annotations of Amazon reviews for different product categories in German and English. The reviews themselves are not part of this data publication
Visual Emotion Corpus Visual Emotion Corpus
Wind-Of-Change Corpora (WOCC) This collection contains the corpora (lemma version) used for the experiments in Schlechtweg et al. (2019)

General Contact IMS

Pfaffenwaldring 5 b, 70569 Stuttgart

