Theoretical Computational Linguistics


Welcome to the chair of Theoretical Computational Linguistics at IMS Stuttgart.

We conduct research in computational linguistics, mostly in the area of lexical or computational semantics, generally following a data-driven approach. The main research areas are:

Lexical and computational semantics
(Prof. Dr. Sebastian Padó, Chair)

  • Acquisition of lexical information: How can we automatically learn and extend lexicons from text that provide reliable information on various aspects of meaning and meaning variation?
  • Semantic representation: What formalisms are available to represent such meaning information in a manner that is ideally both linguistically and cognitively adequate?
  • Cross-lingual linguistic analysis: How can we use bilingual parallel and comparable corpora to learn more about linguistic structures in either language?
  • From words to texts: What is the nature of the interaction between the meaning of lexical units, of phrases, sentences, and complete discourses?
  • Applications of lexical semantics: How can all of the above contribute towards more intelligent and robust natural language processing applications that make a difference for the end user?

Text Mining and Machine Learning
(Dr. Roman Klinger, Senior Lecturer)

  • Emotion and sentiment analysis: How can we associate words and phrases with emotion and sentiment meaning including structured information, for instance aspects, causes, or themes? How can we apply such methods in social media mining or literary studies?
  • Biomedical and health natural language processing: How can we detect biomedical entities (for instance gene names or disease names) and relations (for instance protein-disease relationships) from scientific publications or social media? Can we learn about specific medications directly from patients?
  • Structured learning and probabilistic models for natural language processing: How can NLP tasks be formulated in terms of probabilistic models (or other methods for structured learning) such that different subtasks contribute to each other?
  • Text mining and information retrieval: Which methods help in understanding what is in a document collection? How can we detect meaningful nuggets in unstructured or semistructured text and present this information to users?

(Dr. Diego Frassinelli, Lecturer)

  • Human sentence processing: What can we learn about the nature of the lexicon from psycholinguistic studies?
  • From text to multimodal representation: How do humans integrate images, sounds and gestures while processing linguistic information?
  • Combining experiments with corpora: How can we use behavioral data in combination with corpus data (distributional semantics) to learn more about word processing?






Project staff & Scholarship holders:



Research and Teaching Assistants

  • Sean Papay
  • Elnaz Shafaei Bajestan

Former members and visitors:

  • Laura Aina (U. of Pisa, U. of Amsterdam)
  • Dr. Gemma Boleda (UT Austin, U. Pompeu Fabra, U. of Trento)
  • Manaal Faruqui, PhD (IIT Kharagpur, CMU, Google Inc.)
  • Suhansanu Kumar (IIT Kharagpur)
  • Olga Nikitina (Lomonosov U., Saarland U., Kauz Semantics)
  • Tae-Gil Noh, PhD (Kyungpook U., NEC Labs Europe, OMQ GmbH)
  • Dr. Yves Peirsman (KU Leuven, Stanford U., Wolters Kluwer)
  • Dr. Jan Snajder (U. of Zagreb)
  • Dr. Alessandra Zarcone (U. of Pisa, Saarland U.)
  • Rossella Varvara (U. of Trento)
  • Dr. Britta Zeller (U. of Heidelberg, Molecular Health GmbH)

CRETA (Center for Reflected Text Analysis, 2016-2018)

CRETA is a BMBF-funded center for digital humanities whose goal is to collectively develop, test and use methods for reflected text analysis across text-oriented disciplines, connecting humanities subjects with computer science methods. The center will focus on methodological building blocks that are or can be used in more than one discipline and that will allow critically reflected insights into the topics under investigation. Our group's involvement is primarily in using language technology to further literary studies, notably by assigning emotions to characters in narrative text.

Involved personnel: Roman Klinger (project lead), Sebastian Padó (project lead), Evgeny Kim
More information:

KABI (Confidence Estimation for Biomedical Information Extraction, 2016-2018)

KABI is a project funded by the program “RiSC – Research Seed Capital” of the State Ministry of Baden-Württemberg for Sciences, Research and Arts, proposed by Roman Klinger. In the life sciences, most information is only available as free text in scientific publications. Automatic methods that extract such knowledge and provide it in structured databases face a dilemma: especially when potentially new information is detected in text, it is unclear whether this information is actually correct or was wrongly extracted, for instance because the text is formulated in an uncommon way. In this project, methods will be developed that help to estimate the reliability of information extracted from biomedical publications.

Involved personnel: Roman Klinger (project lead), Camilo Thorne
More information

Distributional Characterization of Derivation (SFB 732 B9, 2014-2018)

Derivational morphology is an important process of word formation. Work in computational linguistics has usually focused on the orthographic level, modeling derivation as a string transformation. The semantic level, where orthographic derivation patterns such as -er, -ung correspond to a variety of semantic shifts, has received less attention in the field. 

The goal of this project is to model the semantics of derivational patterns using distributional methods. We will work in the recently developed framework of compositional distributional semantic models (CDSMs) which assumes that derivation is essentially a compositional process in which derivational patterns act as functors (represented as linear maps) that are applied to base terms (represented as vectors).
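The idea of a derivational pattern acting as a linear map on a base-term vector can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors, the matrix values, and the names `apply_pattern`, `cosine`, `pattern_er`, `base_vec`, and `observed_vec` are hypothetical and are not taken from the project's actual models or data.

```python
import math

def apply_pattern(matrix, base):
    """Apply a derivational pattern (a linear map, given as a list of rows)
    to a base-term vector and return the predicted vector of the derived word."""
    return [sum(w * x for w, x in zip(row, base)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy distributional vector for a base term (invented values).
base_vec = [1.0, 0.0, 2.0]

# Hypothetical matrix standing in for a pattern such as -er.
# In a real CDSM it would be estimated from corpus data, not written by hand.
pattern_er = [
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Predicted vector for the derived word: pattern applied to base.
predicted = apply_pattern(pattern_er, base_vec)

# Invented "corpus-observed" vector of the derived word, used only to show
# how a prediction could be compared against an observation.
observed_vec = [0.4, 1.1, 1.9]
similarity = cosine(predicted, observed_vec)
```

Comparing `predicted` with `observed_vec` by cosine similarity is one common way to judge such a prediction; whether the project evaluates its models this way is not stated here.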

Project web site


Incrementality in Compositional Distributional Semantics (SFB 732 D10, 2014-2018)

The goal of this project is to contribute to research on tensor-based compositional distributional semantic models by developing a syntax-semantics interface with three properties: (a) it will be dependency-based rather than based on constituents; (b) it will be incremental, that is, construct semantics in a left-to-right manner; (c) it will incorporate a notion of plausibility for (partial) analyses based on expectations at the level of individual composition operations. The first property is important to develop syntax-semantics interfaces for languages with a more free word order. The second and third are well-known properties of human sentence processing.

Project web site


EXCITEMENT -- Exploring Customer Interactions with Textual Entailment (Heidelberg, 2012-2014)

There are two interleaved high-level goals for this project. The first is to set up, for the first time, a generic architecture and a comprehensive implementation for a multilingual textual inference platform and to make it available to the scientific and technological communities. The second goal of the project is to develop a new generation of inference-based industrial text exploration applications for customer interactions, which will enable businesses to better analyze and make sense of their diverse and often unpredicted client content. These goals will be achieved for three languages – English, German and Italian – and for three customer interaction channels – speech (transcriptions), email and social media. This is an EC STREP project undertaken in collaboration with Bar Ilan University, DFKI, FBK, and the companies AlmaWave, NICE, and OMQ.

Project web site


Semantics beyond the sentence: Coherence in language processing (Heidelberg, 2012-2014)

The goal of the doctoral program is to extend semantic analysis to the discourse level and to approximate coherence-based interpretation through three mutually supporting research directions: (a) analysing semantic phenomena at the discourse level and representing them as “semantic graphs”; (b) using these graphs to improve semantic analysis; (c) evaluating (a) and (b) in NLP applications. The proposed PhD topics are integrated into and linked by the three directions. Further ties are ensured by the joint use of text collections (corpora).

Program web site


The full list of projects situated at IMS can be found here.


Our publications are listed in the IMS Bibliography.

  • 2016. Corpus for the Analysis of Irony and Sarcasm in Twitter (Ling & Klinger 2016). See Roman Klinger's data page.
  • 2016. SCARE - The Sentiment Corpus of App Reviews with Fine-grained Annotations in German (Sänger, Leser, Kemmerer, Adolphs, Klinger). See Roman Klinger's data page.
  • 2016. German Emotion Dictionaries (Klinger, Samat, Reiter 2016). See Roman Klinger's data page.
  • 2015. ACL Anthology Native Language Identification Corpus (Stehwien & Pado 2015). See the IMS resources page.
  • 2015. FreeBase City & Country datasets (Gupta et al. 2015). See the IMS resources page.
  • 2014. USAGE Corpus for fine-grained sentiment analysis in German and English (Klinger & Cimiano 2014). See Roman Klinger's data page.
  • 2014. Concepts in Context. (Kremer et al. 2014). See the IMS resources page.
  • 2014. Multilingual Syntax-Based DSM (Utt and Pado 2014). See the IMS resources page.
  • 2013. Croatian Distributional Memory. (Snajder et al. 2013). See the TakeLab data page.
  • 2013. German Distributional Families. (Zeller et al. 2013). See the IMS resources page.
  • 2013. German Social Media Data. (Zeller and Pado 2013). See Britta Zeller's page.
  • 2012. German Distributional Memory. (Utt and Pado 2012). See Jason Utt's data page.
  • 2012. Locational inference annotation. (Feizabadi and Pado 2012). Guidelines as PDF, FrameNet motion verb list. Contact us for the data.
  • 2012. Regular polysemy evaluation dataset. (Boleda, Pado, and Utt 2012). Dataset used for experiments available for download: 28kB zip archive.
  • 2012. Parallel literary corpus with T/V pronoun labels. (Faruqui and Pado 2012). Dataset used for experiments available for download.
  • 2010. Textual Entailment Data with Discourse Annotation. (Mirkin, Dagan, and Pado 2010). The dataset and guidelines are stored externally. Please continue to
  • 2010. Manual Named Entity annotation for German EUROPARL data. German classifiers for the Stanford CRF-based NER system (optimized in April 2010 and reported in Faruqui and Pado 2010) and manually annotated EUROPARL data as an out-of-domain test set. See the German NER page.
  • 2010. Selectional Preferences for German and Spanish. (Peirsman and Pado 2010). Contact me.
  • 2009. Projection of semantic roles. The 1000-sentence bilingual English-German corpus with role-semantic annotation (Pado and Lapata 2009) is now available for download.
  • 2008. Semi-supervised SRL for event nouns. The specification of Pado, Pennacchiotti, and Sporleder 2008 is here.
  • 2007. Projection of frame-semantic classifications. Projected FrameNet predicate classes (Pado 2007) are available for German and French. Contact me.
  • Prof. Alessandro Lenci, University of Pisa
  • Dr. Jan Snajder, University of Zagreb
  • Prof. Gabriella Vigliocco, University College London
  • Dr. Gemma Boleda, University Pompeu Fabra, Barcelona
  • Prof. Ingo Plag, University of Düsseldorf
  • Prof. Hans Werner Müller & Dr. Nicole Brazda, University Hospital Düsseldorf
  • Prof. Philipp Cimiano, Bielefeld University
  • Prof. Dr. Stefan Evert, FAU Erlangen-Nürnberg
  • Prof. Ulf Leser, HU Berlin
  • Chefkoch GmbH
  • Semalytix GmbH
Sabine Dieterle
Phone +49 (0) 711/685-81379
Fax +49 (0) 711/685-81366
Universität Stuttgart
Institut für Maschinelle Sprachverarbeitung
Pfaffenwaldring 5b
70569 Stuttgart

Information on theses at the Chair for Theoretical Computational Linguistics can be found here.