Department Theoretical Computational Linguistics

Department Theoretical Computational Linguistics, Head Prof. Dr. Sebastian Padó

Welcome to the Chair of Theoretical Computational Linguistics at IMS Stuttgart. The group has been led by Prof. Sebastian Padó since 2013.

We conduct research in computational linguistics, mostly in the area of lexical or computational semantics, generally following a data-driven approach.

Team

Research interests of group members

Lexical and computational semantics

Prof. Dr. Sebastian Padó, Chair

Acquisition of lexical information: How can we automatically learn and extend lexicons from text that provide reliable information on various aspects of meaning and meaning variation?
Semantic representation: What formalisms are available to represent such meaning information in a manner that is ideally both linguistically and cognitively adequate?
Cross-lingual linguistic analysis: How can we use bilingual parallel and comparable corpora to learn more about linguistic structures in either language?
From words to texts: What is the nature of the interaction between the meaning of lexical units, of phrases, sentences, and complete discourses?
Applications of lexical semantics: How can all of the above contribute towards more intelligent and robust natural language processing applications that make a difference for the end user?

NLP in social context

Dr. Agnieszka Faleńska

Bias and fairness in NLP: How can we detect and mitigate biases in NLP models? In what ways do demographic variables such as gender, ethnicity, or socioeconomic status influence the outputs of NLP systems? How do inequalities manifest in primary data, and what impact do they have on the outcomes of NLP models?
Detecting and handling harmful language: How can NLP systems reliably identify harmful or offensive speech? What challenges arise when NLP systems attempt to detect subtle forms of harm, such as microaggressions or implicit bias?
Computational modeling of linguistic variability: How can we model linguistic variability in a way that preserves meaning while acknowledging differences in expression? How do we design NLP systems to adapt to users' linguistic preferences, including non-standard expressions?
NLP for Computational Social Science: How can NLP contribute to cross-disciplinary research by offering new ways to model and analyze social science data? How can NLP methods be used to analyze large-scale social data and uncover insights into social behavior, communication patterns, or public opinion?

Projects

CEAT: Computational Event Evaluation based on Appraisal Theories for Emotion Analysis (2021–2024)

Emotion analysis has typically been formulated as a text classification task in which predefined emotion labels are assigned to textual units. The label set commonly follows the set of basic emotions as proposed by Ekman (Anger, Fear, Joy, Surprise, Sadness, Disgust) or Plutchik (adding Trust and Anticipation), or the valence-arousal-dominance model. This constitutes a gap between the state of research in psychology and in computational linguistics: appraisal theories are widely accepted in psychology, but have not so far been used for emotion analysis in text.

With CEAT, we fill this gap and develop computational models of the cognitive appraisal of events and, to a lesser degree, of bodily symptoms and action tendencies. To represent the cognitive appraisal, we build on the work of Smith and Ellsworth (1985), who show that the variables pleasantness, responsibility, certainty, attention, effort and situational control are sufficient to discriminate between a set of 15 emotions.

In this project, we create two approaches to assign these appraisal dimensions to textual event descriptions: the first builds on semantic parsing, the second works in a deep learning setting. Based on these dimensions, we then predict the emotion associated with the textual fragment. This will lead to models that can automatically assign an emotion to an event description, even if no emotion words or self-reports of feeling are available.

FIBISS: Automatic Fact Checking for Biomedical Information in Social Media and Scientific Literature

Most research on methods and models for automatic fact checking, which aim to distinguish misinformation and disinformation from correct information, focuses on the news domain: news items, including those shared on social media, are checked for their truthfulness. Such methods have not yet been developed for the biomedical domain. Challenges include the richness of (established) sources of information, the complexity of the information, and the differences between the language of experts and that of medical laypeople. In this project, we develop information extraction systems for layperson and expert language, map the information extracted from each onto the other, and finally check its truthfulness against established sources. The project therefore combines methods from transfer learning, information extraction, and fact checking for the biomedical domain, especially in social media.

Involved personnel: Roman Klinger (project lead), Amelie Wührl

SEAT (Structured Multi-Domain Emotion Analysis from Text, 2018–2020)

Emotion analysis in natural language processing aims at associating text with emotions, for instance with anger, fear, joy, surprise, disgust or sadness. This task is an extension of sentiment analysis that adds further qualitative value in applications, for instance in social media analysis or in the analysis of fictional stories or news articles.

Existing research has so far mainly focused on the association of text with specific emotion models from psychological research. The development of methods for detecting phrases in text which denote the emotion experiencer (the character or person who feels the emotion), the emotion theme (the cause of the development of an emotion) as well as the modifiers of an emotion (intensifiers and diminishers) has been neglected.

In this project, we aim at filling this gap. We will develop manually annotated corpora from different domains (news, novels, social media) in German and English. Based on these resources, we develop models which are able to automatically recognize and extract such information. We work on three levels: Firstly, we connect words with emotions (with distributional and lexical methods), including grammatical variants. Secondly, we analyze these mentions in context with modifiers, the feeler and the theme (cause) of the emotion. Thirdly, we model this information in context, i.e., beyond separate mentions. All methods will be analyzed regarding their domain and language independence.

Involved personnel: Roman Klinger (project lead), Laura Bostan, Evgeny Kim

MARDY (Modeling Argumentation Dynamics, 2018–2021)

This interdisciplinary collaboration project involving Computational Linguistics, Machine Learning and Political Science has the aim of developing new computational models and methods for analyzing argumentation in political discourse – specifically capturing the dynamics of discursive exchanges on controversial issues over time. The goal is to develop tools to support analysis of the possible impact of arguments advanced by different political actors.

Involved personnel: Jonas Kuhn, Sebastian Padó (project leads at IMS), Erenay Dayanik, André Blessing

QUOTE (Comprehensive Analysis of Quotation, 2017–2020)

In many kinds of prose texts, both literary and newswire texts, reported speech plays an important role as a source of information about characters, their attitudes, and their relationships. Going further, such information can aid in the analysis of patterns of behavior and the construction of social networks.

While readers do not have any problem in assembling representations for complete situations from individual instances of reported speech, this is still a challenging task for computers. Current state-of-the-art methods are generally organized as "pipelines" which start from individual instances of reported speech and proceed incrementally to more global properties of the situation or characters. Since individual instances of reported speech are often short and uninformative, a pipeline procedure often causes prediction errors which cannot be rectified in retrospect.

In this project, we develop joint inference methods to model the various aspects of reported speech (Who is the speaker? The hearer? What is the content? What is the relationship between speaker and hearer?) together instead of individually. The resulting joint model takes account of the interdependencies between these aspects. Thus, information from the different aspects can complement each other. The result of this part of the project is a solid starting place (in terms of natural language processing methods) for the application of such methods for the automatic analysis of reported speech in digital humanities and social sciences.

This algorithmic goal is complemented by a goal from corpus and computational linguistics, namely elucidating the relationship between reported speech and other aspects of semantic analysis. In particular, there appears to be a close relationship between reported speech and (a subset of) semantic roles. Yet, no comprehensive formal analysis has been carried out so far. We will provide a linguistic characterization of the relationship and exploit it algorithmically to further improve the recognition of reported speech as discussed above. The result of this part of the project is the (at least partial) consolidation of two strands of research that have largely been treated as independent so far.

Involved personnel: Sebastian Padó (project lead), Sean Papay

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories (2017–2019)

Newspapers were the first big data for a mass audience. Their dramatic expansion over the long nineteenth century created a global culture of abundant information. Yet the significance of the newspaper has largely been defined in national terms in literary-historical scholarship of the period, and newspapers are predominantly collected, digitized, and accessed through nationally-focused institutions. "Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914" (OcEx) brings together leading efforts in computational periodicals research to examine patterns of information flow across national and linguistic boundaries in nineteenth-century newspapers and to link insights across large-scale corpora of digitized newspapers from national collections. For scholars of nineteenth-century periodical culture and intellectual history, OcEx reframes how we understand the historical emergence of a globally-connected information network. It uncovers the ways that the international was refracted through the local as news, advice, vignettes, popular science, poetry, fiction, and more, all circulating around the globe and through multiple translations. By revealing the global networks through which texts and topics traveled in the period, OcEx promises to create an abundance of new evidence about how readers around the world perceived each other through the newspaper, evidence that will be of great interest to scholars in various fields. Computational linguistics and visualization provide a number of building blocks (recognizing translation, paraphrasing, text reuse, etc.) that can play enabling roles in scholarly investigations, with both historical and contemporary implications. At the same time, such methods raise fundamental questions regarding the validity and reliability of their results (such as the effects of noise in optical character recognition).
Finally, by linking research across large-scale digital newspaper collections, OcEx will offer a model for national libraries and others developing large-scale data for digital scholarship. In tracing the ways texts, topics, and concepts crossed national and linguistic boundaries, Oceanic Exchanges seeks to break through the conceptual, institutional, and political barriers which have limited the promise of big data in the humanities: by bringing together historical newspaper experts from different countries and disciplines around common questions; by actively crossing the national boundaries that have previously separated digitized newspaper corpora, as well as those dividing public and private collections, through computational analysis; and by illustrating and making the global connectedness of nineteenth-century newspapers interactively explorable in ways hidden by typical organizations of digital cultural heritage along national lines.

Involved personnel at IMS: Sebastian Padó, Martin Riedl
Website: http://oceanicexchanges.org/

CRETA (Center for Reflected Text Analysis, 2016–2018)

CRETA is a BMBF-funded center for digital humanities whose goal is to collectively develop, test and use methods for reflected text analysis across text-oriented disciplines, connecting humanities subjects with computer science methods. The center will focus on methodological building blocks that are or can be used in more than one discipline and that will allow critically reflected insights into the topics under investigation. Our group's involvement is primarily in using language technology to further literary studies, notably by assigning emotions to characters in narrative text.

Involved personnel: Roman Klinger (project lead), Sebastian Padó (project lead), Evgeny Kim
More information: http://creta.uni-stuttgart.de

KABI (Confidence Estimation for Biomedical Information Extraction, 2016–2018)

KABI is a project funded by the program “RiSC – Research Seed Capital” of the State Ministry of Baden-Württemberg for Sciences, Research and Arts, proposed by Roman Klinger. In the life sciences, most information is only available as free text in scientific publications. Automatic methods to extract such knowledge and to provide it in structured databases are challenged by a dilemma: especially if potentially new information is detected in text, it is unclear whether this information is actually correct or whether it was wrongly extracted, for instance because the text is formulated in an uncommon way. In this project, methods will be developed which help to estimate the reliability of information extracted from biomedical publications.

Involved personnel: Roman Klinger (project lead), Camilo Thorne

Distributional Characterization of Derivation (SFB 732 B9, 2014–2018)

Derivational morphology is an important process of word formation. Work in computational linguistics has usually focused on the orthographic level, modeling derivation as a string transformation. The semantic level, where orthographic derivation patterns such as -er, -ung correspond to a variety of semantic shifts, has received less attention in the field.

The goal of this project is to model the semantics of derivational patterns using distributional methods. We will work in the recently developed framework of compositional distributional semantic models (CDSMs) which assumes that derivation is essentially a compositional process in which derivational patterns act as functors (represented as linear maps) that are applied to base terms (represented as vectors).
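
The idea of a derivational pattern acting as a linear map on a base-term vector can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-dimensional vectors, the matrix values, and the names `apply_pattern`, `cosine` and `PATTERN_ER` are hypothetical and do not come from the project itself. In practice such a matrix would be estimated from corpus-derived vectors of base and derived words rather than written by hand.

```python
import math


def apply_pattern(matrix, base_vector):
    """Apply a derivational pattern (a linear map) to a base-term vector."""
    return [sum(w * x for w, x in zip(row, base_vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy distributional vectors with invented values (not real corpus data).
base_vec = [1.0, 0.0, 2.0]              # hypothetical vector of a base verb
observed_derived_vec = [2.0, 1.0, 2.0]  # hypothetical vector of its -er noun

# Hypothetical matrix standing in for the functor of the pattern "-er".
PATTERN_ER = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Compose: predicted derived vector = pattern matrix times base vector.
predicted_derived_vec = apply_pattern(PATTERN_ER, base_vec)

# Compare the prediction with the observed vector of the derived word.
similarity = cosine(predicted_derived_vec, observed_derived_vec)
```

Comparing the predicted vector with the observed vector of the derived word, here by cosine similarity, is one common way to judge how well a single linear map captures the semantic shift of a pattern.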

Incrementality in Compositional Distributional Semantics (SFB 732 D10, 2014–2018)

The goal of this project is to contribute to research on tensor-based compositional distributional semantic models by developing a syntax-semantics interface with three properties: (a) it will be dependency-based rather than based on constituents; (b) it will be incremental, that is, construct semantics in a left-to-right manner; (c) it will incorporate a notion of plausibility for (partial) analyses based on expectations at the level of individual composition operations. The first property is important to develop syntax-semantics interfaces for languages with a more free word order. The second and third are well-known properties of human sentence processing.

EXCITEMENT: Exploring Customer Interactions with Textual Entailment (Heidelberg, 2012–2014)

There are two interleaved high-level goals for this project. The first is to set up, for the first time, a generic architecture and a comprehensive implementation for a multilingual textual inference platform and to make it available to the scientific and technological communities. The second goal of the project is to develop a new generation of inference-based industrial text exploration applications for customer interactions, which will enable businesses to better analyze and make sense of their diverse and often unpredicted client content. These goals will be achieved for three languages (English, German and Italian) and for three customer interaction channels (speech transcriptions, email and social media). This is an EC STREP project undertaken in collaboration with Bar Ilan University, DFKI, FBK, and the companies AlmaWave, NICE, and OMQ.

Semantics beyond the sentence: Coherence in language processing (Heidelberg, 2012–2014)

The goal of the doctoral program is to extend semantic analysis to the discourse level and to approximate coherence-based interpretation through three mutually supporting research directions: (a) analysing semantic phenomena at the discourse level and representing them as “semantic graphs”; (b) using these graphs to improve semantic analysis; (c) evaluating (a) and (b) in NLP applications. The proposed PhD topics are integrated into and linked by the three directions. Further ties are ensured by the joint use of text collections (corpora).

Collaborations

Prof. Gemma Boleda, University Pompeu Fabra, Barcelona
Prof. Hanno Ehrlicher, University of Tübingen
Prof. Sebastian Haunss, University of Bremen
Prof. Roman Klinger, University of Bamberg
Prof. Alessandro Lenci, University of Pisa
Prof. Jan Snajder, University of Zagreb

Theses at Department Theoretical …

How do you find a thesis topic, and how do you prepare for it? From finding a topic to registration …