
Theoretical Computational Linguistics


Welcome to the chair of Theoretical Computational Linguistics at IMS Stuttgart.

We conduct research in computational linguistics, mostly in the area of lexical or computational semantics, generally following a data-driven approach. The main research areas are:

Lexical and computational semantics
(Prof. Dr. Sebastian Padó, Chair)

  • Acquisition of lexical information: How can we automatically learn and extend lexicons from text that provide reliable information on various aspects of meaning and meaning variation?
  • Semantic representation: What formalisms are available to represent such meaning information in a manner that is ideally both linguistically and cognitively adequate?
  • Cross-lingual linguistic analysis: How can we use bilingual parallel and comparable corpora to learn more about linguistic structures in either language?
  • From words to texts: What is the nature of the interaction between the meaning of lexical units, of phrases, sentences, and complete discourses?
  • Applications of lexical semantics: How can all of the above contribute towards more intelligent and robust natural language processing applications that make a difference for the end user?


Text Mining and Machine Learning
(Dr. Roman Klinger, Senior Lecturer)

  • Emotion and sentiment analysis: How can we associate words and phrases with emotion and sentiment meaning including structured information, for instance aspects, causes, or themes? How can we apply such methods in social media mining or literary studies?
  • Biomedical and health natural language processing: How can we detect biomedical entities (for instance gene names or disease names) and relations (for instance protein-disease relationships) from scientific publications or social media? Can we learn about specific medications directly from patients?
  • Structured learning and probabilistic models for natural language processing: How can NLP tasks be formulated in terms of probabilistic models (or other methods for structured learning) such that different subtasks contribute to each other?
  • Text mining and information retrieval: Which methods help in understanding what is in a document collection? How can we detect meaningful nuggets in unstructured or semistructured text and present this information to users?

 
Psycholinguistics
(Dr. Diego Frassinelli, Lecturer)

  • Human sentence processing: What can we learn about the nature of the lexicon from psycholinguistic studies?
  • From text to multimodal representation: How do humans integrate images, sounds and gestures while processing linguistic information?
  • Combining experiments with corpora: How can we use behavioral data in combination with corpus data (distributional semantics) to learn more about word processing?

 

 


Staff

Professor:

Faculty:

Project staff & Scholarship holders:

Secretary:

 

Research and Teaching Assistants


Former members and visitors:

  • Laura Aina (U. of Pisa, U. of Amsterdam)
  • Dr. Gemma Boleda (UT Austin, U. Pompeu Fabra, U. of Trento)
  • Manaal Faruqui, PhD (IIT Kharagpur, CMU, Google Inc.)
  • Parvin Sadat Feizabadi, M.Sc.
  • Arun Kumar (University of Catalunya)
  • Suhansanu Kumar (IIT Kharagpur)
  • Olga Nikitina (Lomonosov U., Saarland U., Kauz Semantics)
  • Tae-Gil Noh, PhD (Kyungpook U., NEC Labs Europe, OMQ GmbH)
  • Dr. Yves Peirsman (KU Leuven, Stanford U., Wolters Kluwer)
  • Dr. Christian Scheible (Trusted Shops)
  • Prof. Dr. Jan Snajder (U. of Zagreb)
  • Dr. Alessandra Zarcone (U. of Pisa, Saarland U.)
  • Dr. Rossella Varvara (U. of Trento)
  • Dr. Britta Zeller (U. of Heidelberg, Molecular Health GmbH)
Projects

SEAT (Structured Multi-Domain Emotion Analysis from Text, 2018–2020)

Emotion analysis in natural language processing aims at associating text with emotions, for instance anger, fear, joy, surprise, disgust, or sadness. This task extends sentiment analysis and adds further qualitative value in applications, for instance in social media analysis or in the analysis of fictional stories and news articles.

Existing research has so far mainly focused on the association of text with specific emotion models from psychological research. The development of methods for detecting phrases in text which denote the emotion experiencer (the character or person who feels the emotion), the emotion theme (the cause of the development of an emotion), and the modifiers of an emotion (intensifiers and diminishers) has been neglected.

In this project, we aim to fill this gap. We will develop manually annotated corpora from different domains (news, novels, social media) in German and English. Based on these resources, we will develop models that automatically recognize and extract such information. We work on three levels: First, we connect words with emotions (with distributional and lexical methods), including grammatical variants. Second, we analyze these mentions in context together with modifiers, the experiencer, and the theme (cause) of the emotion. Third, we model this information in context, i.e., beyond isolated mentions. All methods will be analyzed regarding their domain and language independence.

Involved personnel: Roman Klinger (project lead), Laura Bostan, Evgeny Kim

MARDY (Modeling Argumentation Dynamics, 2018–2021)

 

tbd

QUOTE (Comprehensive Analysis of Quotation, 2017–2020) 

In many kinds of prose texts, both literary and newswire texts, reported speech plays an important role as a source of information about characters, their attitudes, and their relationships. Going further, such information can aid in the analysis of patterns of behavior and the construction of social networks.

While readers do not have any problem in assembling representations for complete situations from individual instances of reported speech, this is still a challenging task for computers. Current state-of-the-art methods are generally organized as "pipelines" which start from individual instances of reported speech and proceed incrementally to more global properties of the situation or characters. Since individual instances of reported speech are often short and uninformative, a pipeline procedure often causes prediction errors which cannot be rectified in retrospect.

In this project, we develop joint inference methods to model the various aspects of reported speech (Who is the speaker? The hearer? What is the content? What is the relationship between speaker and hearer?) together instead of individually. The resulting joint model takes account of the interdependencies between these aspects. Thus, information from the different aspects can complement each other. The result of this part of the project is a solid starting place (in terms of natural language processing methods) for the application of such methods for the automatic analysis of reported speech in digital humanities and social sciences.

This algorithmic goal is complemented by a goal from corpus and computational linguistics, namely elucidating the relationship between reported speech and other aspects of semantic analysis. In particular, there appears to be a close relationship between reported speech and (a subset of) semantic roles. Yet, no comprehensive formal analysis has been carried out so far. We will provide a linguistic characterization of the relationship and exploit it algorithmically to further improve the recognition of reported speech as discussed above. The result of this part of the project is the (at least partial) consolidation of two strands of research that have largely been treated as independent so far.
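
The difference between a pipeline and joint inference can be illustrated with a toy sketch. It is not the project's model: the candidate names, the scores, and the functions `pair_score`, `pipeline_decode`, and `joint_decode` are all invented for illustration, and the only interdependency modeled is the assumption that speaker and hearer must be different persons.

```python
from itertools import product

# Hypothetical candidates for the speaker and hearer of one quotation.
CANDIDATES = ["Anna", "Ben", "Clara"]

# Invented local scores (illustration only): how plausible each candidate
# looks as speaker or as hearer when each role is judged in isolation.
speaker_score = {"Anna": 0.5, "Ben": 0.4, "Clara": 0.1}
hearer_score = {"Anna": 0.6, "Ben": 0.3, "Clara": 0.1}


def pair_score(speaker, hearer):
    """Score a full analysis; speaker and hearer may not be the same person."""
    if speaker == hearer:
        return float("-inf")
    return speaker_score[speaker] + hearer_score[hearer]


def pipeline_decode():
    """Fix the speaker first, then choose the best hearer given that choice."""
    speaker = max(CANDIDATES, key=speaker_score.get)
    hearer = max(CANDIDATES, key=lambda h: pair_score(speaker, h))
    return speaker, hearer


def joint_decode():
    """Search over all speaker/hearer pairs and keep the best total score."""
    return max(product(CANDIDATES, repeat=2), key=lambda p: pair_score(*p))


if __name__ == "__main__":
    print("pipeline:", pipeline_decode())
    print("joint:   ", joint_decode())
```

In this toy setting the pipeline commits to "Anna" as speaker and can no longer use her as hearer, whereas the joint search finds the higher-scoring analysis with "Ben" as speaker and "Anna" as hearer. Exhaustive search is only feasible here because the example is tiny.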

Involved personnel: Sebastian Padó (project lead), Sean Papay

 

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories (2017-2019)

Newspapers were the first big data for a mass audience. Their dramatic expansion over the long nineteenth century created a global culture of abundant information. Yet the significance of the newspaper has largely been defined in national terms in literary-historical scholarship of the period, and newspapers are predominantly collected, digitized, and accessed through nationally-focused institutions. "Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914" (OcEx) brings together leading efforts in computational periodicals research to examine patterns of information flow across national and linguistic boundaries in nineteenth-century newspapers and to link insights across large-scale corpora of digitized newspapers from national collections. For scholars of nineteenth-century periodical culture and intellectual history, OcEx reframes how we understand the historical emergence of a globally-connected information network. It uncovers the ways that the international was refracted through the local as news, advice, vignettes, popular science, poetry, fiction, and more, all circulating around the globe and through multiple translations. By revealing the global networks through which texts and topics traveled in the period, OcEx promises to create an abundance of new evidence about how readers around the world perceived each other through the newspaper, evidence that will be of great interest to scholars in various fields. Computational linguistics and visualization provide a number of building blocks (recognizing translation, paraphrasing, text reuse, etc.) that can play enabling roles in scholarly investigations, with both historical and contemporary implications. At the same time, such methods raise fundamental questions regarding the validity and reliability of their results (such as the effects of noise in optical character recognition).
Finally, by linking research across large-scale digital newspaper collections, OcEx will offer a model for national libraries and others developing large-scale data for digital scholarship. In tracing the ways texts, topics, and concepts crossed national and linguistic boundaries, Oceanic Exchanges seeks to break through the conceptual, institutional, and political barriers which have limited the promise of big data in the humanities: by bringing together historical newspaper experts from different countries and disciplines around common questions; by actively crossing the national boundaries that have previously separated digitized newspaper corpora, as well as those dividing public and private collections, through computational analysis; and by illustrating and making the global connectedness of nineteenth-century newspapers interactively explorable in ways hidden by typical organizations of digital cultural heritage along national lines.

Involved personnel at IMS: Sebastian Padó, N.N.
Website: http://oceanicexchanges.org/

CRETA (Center for Reflected Text Analysis, 2016-2018)

CRETA is a BMBF-funded center for digital humanities whose goal is to collectively develop, test and use methods for reflected text analysis across text-oriented disciplines, connecting humanities subjects with computer science methods. The center will focus on methodological building blocks that are or can be used in more than one discipline and that will allow critically reflected insights into the topics under investigation. Our group's involvement is primarily in using language technology to further literary studies, notably by assigning emotions to characters in narrative text.

 

Involved personnel: Roman Klinger (project lead), Sebastian Padó (project lead), Evgeny Kim
More information: http://creta.uni-stuttgart.de

KABI (Confidence Estimation for Biomedical Information Extraction, 2016-2018)

KABI is a project funded by the program “RiSC – Research Seed Capital” of the State Ministry of Baden-Württemberg for Sciences, Research and Arts, proposed by Roman Klinger. In the life sciences, most information is only available as free text in scientific publications. Automatic methods that extract such knowledge and provide it in structured databases face a dilemma: especially when potentially new information is detected in text, it is unclear whether this information is actually correct or whether it was wrongly extracted, for instance because the text is formulated in an uncommon way. In this project, methods will be developed which help to estimate the reliability of information extracted from biomedical publications.

Involved personnel: Roman Klinger (project lead), Camilo Thorne
More information

Distributional Characterization of Derivation (SFB 732 B9, 2014-2018)

Derivational morphology is an important process of word formation. Work in computational linguistics has usually focused on the orthographic level, modeling derivation as a string transformation. The semantic level, where orthographic derivation patterns such as -er, -ung correspond to a variety of semantic shifts, has received less attention in the field. 

The goal of this project is to model the semantics of derivational patterns using distributional methods. We will work in the recently developed framework of compositional distributional semantic models (CDSMs) which assumes that derivation is essentially a compositional process in which derivational patterns act as functors (represented as linear maps) that are applied to base terms (represented as vectors).
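
The idea of a derivational pattern acting as a linear map on a base-term vector can be sketched as follows. This is a minimal illustration, not the project's implementation: the three-dimensional vectors, the matrix `ER_PATTERN`, and the helper names `apply_pattern` and `cosine` are invented for the example. In practice such a matrix would be estimated from observed pairs of base and derived word vectors.

```python
import math


def apply_pattern(matrix, base):
    """Apply a derivational pattern (a linear map) to a base-term vector."""
    return [sum(w * x for w, x in zip(row, base)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy distributional vectors for two base verbs (invented values).
base_vectors = {
    "bake": [1.0, 0.2, 0.0],
    "teach": [0.8, 0.1, 0.3],
}

# Hypothetical matrix standing in for an agentive "-er" pattern.
ER_PATTERN = [
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]

# Predicted vectors for the derived terms "baker" and "teacher".
predicted_baker = apply_pattern(ER_PATTERN, base_vectors["bake"])
predicted_teacher = apply_pattern(ER_PATTERN, base_vectors["teach"])

if __name__ == "__main__":
    print("baker:  ", predicted_baker)
    print("teacher:", predicted_teacher)
    print("similarity:", cosine(predicted_baker, predicted_teacher))
```

A predicted vector can then be compared, for example via `cosine`, with the corpus-observed vector of the derived word to judge how well the pattern captures the semantic shift.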

Project web site

 

Incrementality in Compositional Distributional Semantics (SFB 732 D10, 2014-2018)

The goal of this project is to contribute to research on tensor-based compositional distributional semantic models by developing a syntax-semantics interface with three properties: (a) it will be dependency-based rather than based on constituents; (b) it will be incremental, that is, construct semantics in a left-to-right manner; (c) it will incorporate a notion of plausibility for (partial) analyses based on expectations at the level of individual composition operations. The first property is important for developing syntax-semantics interfaces for languages with freer word order. The second and third are well-known properties of human sentence processing.

Project web site

 

EXCITEMENT -- Exploring Customer Interactions with Textual Entailment (Heidelberg, 2012-2014)

There are two interleaved high-level goals for this project. The first is to set up, for the first time, a generic architecture and a comprehensive implementation for a multilingual textual inference platform and to make it available to the scientific and technological communities. The second goal of the project is to develop a new generation of inference-based industrial text exploration applications for customer interactions, which will enable businesses to better analyze and make sense of their diverse and often unpredicted client content. These goals will be achieved for three languages (English, German, and Italian) and for three customer interaction channels (speech transcriptions, email, and social media). This is an EC STREP project undertaken in collaboration with Bar Ilan University, DFKI, FBK, and the companies AlmaWave, NICE, and OMQ.

Project web site

 

Semantics beyond the sentence: Coherence in language processing (Heidelberg, 2012-2014)

The goal of the doctoral program is to extend semantic analysis to the discourse level and to approximate coherence-based interpretation through three mutually supporting research directions: (a) analysing semantic phenomena at the discourse level and representing them as “semantic graphs”; (b) using these graphs to improve semantic analysis; (c) evaluating (a) and (b) in NLP applications. The proposed PhD topics are integrated into and linked by the three directions. Further ties are ensured by the joint use of text collections (corpora).

Program web site

 

The full list of projects situated at IMS can be found here.

Publications

Our publications are listed in the IMS Bibliography.

Resources
  • 2016. Corpus for the Analysis of Irony and Sarcasm in Twitter (Ling & Klinger 2016). See Roman Klinger's data page.
  • 2016. SCARE - The Sentiment Corpus of App Reviews with Fine-grained Annotations in German (Sänger, Leser, Kemmerer, Adolphs, Klinger). See Roman Klinger's data page.
  • 2016. German Emotion Dictionaries (Klinger, Samat, Reiter 2016). See Roman Klinger's data page.
  • 2015. ACL Anthology Native Language Identification Corpus (Stehwien & Pado 2015). See the IMS resources page.
  • 2015. FreeBase City & Country datasets (Gupta et al. 2015). See the IMS resources page.
  • 2014. USAGE Corpus for fine-grained sentiment analysis in German and English (Klinger & Cimiano 2014). See Roman Klinger's data page.
  • 2014. Concepts in Context. (Kremer et al. 2014). See the IMS resources page.
  • 2014. Multilingual Syntax-Based DSM (Utt and Pado 2014). See the IMS resources page.
  • 2013. Croatian Distributional Memory. (Snajder et al. 2013). See the TakeLab data page.
  • 2013. German Distributional Families. (Zeller et al. 2013). See the IMS resources page.
  • 2013. German Social Media Data. (Zeller and Pado 2013). See Britta Zeller's page.
  • 2012. German Distributional Memory. (Utt and Pado 2012). See Jason Utt's data page.
  • 2012. Locational inference annotation. (Feizabadi and Pado 2012). Guidelines as PDF, FrameNet motion verb list. Contact us for the data.
  • 2012. Regular polysemy evaluation dataset. (Boleda, Pado, and Utt 2012). Dataset used for experiments available for download: 28kB zip archive.
  • 2012. Parallel literary corpus with T/V pronoun labels. (Faruqui and Pado 2012). Dataset used for experiments available for download.
  • 2010. Textual Entailment Data with Discourse Annotation. (Mirkin, Dagan, and Pado 2010). The dataset and guidelines are stored externally. Please continue to http://www.cs.biu.ac.il/~nlp/downloads/discourse-for-entailment.html.
  • 2010. Manual Named Entity annotation for German EUROPARL data. German classifiers for the Stanford CRF-based NER systems (optimized in April 2010 and reported in Faruqui and Pado 2010) and manually annotated EUROPARL data as out-of-domain test set. See the German NER page.
  • 2010. Selectional Preferences for German and Spanish. (Peirsman and Pado 2010). Contact me.
  • 2009. Projection of semantic roles. The 1000-sentence bilingual English-German corpus with role-semantic annotation (Pado and Lapata 2009) is now available for download.
  • 2008. Semi-supervised SRL for event nouns. The specification of Pado, Pennacchiotti, and Sporleder 2008 is here.
  • 2007. Projection of frame-semantic classifications. Projected FrameNet predicate classes (Pado 2007) are available for German and French. Contact me.
Collaborations
  • Prof. Alessandro Lenci, University of Pisa
  • Dr. Jan Snajder, University of Zagreb
  • Prof. Gabriella Vigliocco, University College London
  • Dr. Gemma Boleda, University Pompeu Fabra, Barcelona
  • Prof. Ingo Plag, University of Düsseldorf
  • Prof. Hans Werner Müller & Dr. Nicole Brazda, University Hospital Düsseldorf
  • Prof. Philipp Cimiano, Bielefeld University
  • Prof. Dr. Stefan Evert, FAU Erlangen-Nürnberg
  • Prof. Dr. Ulf Leser, HU Berlin
  • Chefkoch GmbH
  • Semalytix GmbH
Contact
Secretary's office
Sabine Dieterle
Secretary
Phone +49 (0) 711/685-81379
Fax +49 (0) 711/685-81366
Email
Postal address
Universität Stuttgart
Institut für Maschinelle Sprachverarbeitung
Pfaffenwaldring 5b
70569 Stuttgart
Germany
Links

Information on theses at the Chair for Theoretical Computational Linguistics can be found here.