Department of Theoretical Computational Linguistics

Department of Theoretical Computational Linguistics, head: Prof. Dr. Sebastian Padó

Welcome to the Chair of Theoretical Computational Linguistics at the IMS of the University of Stuttgart. The group has been led by Prof. Sebastian Padó since 2013.

We conduct research in computational linguistics, primarily in lexical and computational semantics, generally following a data-driven approach.

Team

Main research areas

Lexical and computational semantics

Prof. Dr. Sebastian Padó, Chair

Acquisition of lexical information: How can we automatically learn and extend lexicons from text that provide reliable information on various aspects of meaning and meaning variation?
Semantic representation: What formalisms are available to represent such meaning information in a manner that is ideally both linguistically and cognitively adequate?
Cross-lingual linguistic analysis: How can we use bilingual parallel and comparable corpora to learn more about linguistic structures in either language?
From words to texts: What is the nature of the interaction between the meaning of lexical units, of phrases, sentences, and complete discourses?
Applications of lexical semantics: How can all of the above contribute towards more intelligent and robust natural language processing applications that make a difference for the end user?

NLP in social context

Dr. Agnieszka Faleńska

Bias and fairness in NLP: How can we detect and reduce bias in NLP models? To what extent do demographic variables such as gender, ethnicity, or socioeconomic status affect the output of NLP systems? How do inequalities manifest in primary data, and what influence do they have on the results of NLP models?
Detecting and handling harmful language: How can NLP systems reliably identify harmful or offensive language? What challenges arise when NLP systems attempt to detect subtle forms of harm such as microaggressions or implicit bias?
Computational modeling of linguistic variability: How can we model linguistic variability so that meaning is preserved and differences in expression are acknowledged? How do we design NLP systems that adapt to users' linguistic preferences, including non-standard expressions?
NLP for computational social science: How can NLP contribute to interdisciplinary research by offering new ways to model and analyze social science data? How can NLP methods be used to analyze large social datasets and gain insights into social behavior, communication patterns, or public opinion?

Third-party funded projects

EPIC: Expertise and Politicization in COVID Discourse (2025-2027)

Public discourse on specialist topics takes place in a linguistically multidimensional space. Two central dimensions are (a) how expert-specific vs. generally accessible the discussions are; and (b) how politicized these discussions are, i.e., how prominent political aspects are compared to subject-specific ones. These two dimensions strongly affect how such discourses are perceived in society at large. This situation gives rise to a number of research questions (RQs) at the interface between political science and (computational) linguistics:

(RQ1) What are the linguistic means by which the two dimensions are expressed? How can these two dimensions be measured computationally so that the measurements are independent of genre and topic and as reliable as possible even for short inputs (e.g., individual sentences)?
(RQ2) How dynamic are these two dimensions over time, within specific established forums and across different forums?
(RQ3) How consistent is the behavior of individual discourse participants in multi-party (multilogue) communication? Is inconsistent behavior related to the strategic interests of these participants?

In this project, we analyze public communication in Germany on one of the most important topics of recent years, the COVID pandemic, from these perspectives.

Collaborative project with Prof. Sebastian Haunss, University of Bremen.

WR-AI-TING: Creative Writing with AI Tools in School and Museum Contexts – Design Options for Dealing with the Potentials and Risks of Digital Innovations in Cultural Education (2024-2025)

WR-AI-TING addresses the potentials and risks of artificial intelligence (AI) in cultural education, using AI-supported literary and creative writing scenarios as an example. The rapid pace of technical progress in language AI raises questions about its future role in cultural education processes. Important aspects of the digital transformations brought about by language AI include dealing with the (lack of) "meaning" of AI-generated "works", their authorship and creativity, and the acceptance of AI in artistic and creative contexts. The project also addresses how far useful AI support may depend on the users' language background or language competence, as well as questions of the spatial accessibility of AI-supported cultural education offerings, by investigating creative writing in museum contexts, in school settings, and in virtual worlds.

Consortium project led by the Institut für Wissensmedien, Tübingen (Peter Gerjets).

MULTIVIEW: Classification and Contextualization of Perspectives in Documents (2024-2026)

Today's society faces the challenge of not having enough time to find its way through the overwhelming amount of information in various online sources such as websites, blogs, forums, and more. Research on methods for sifting through this large amount of content and delivering relevant information to users is therefore crucial. However, it is no longer enough to focus on relevant information alone; the main goal must be to deliver content that is both relevant and diverse. A set of texts that covers a broad spectrum of perspectives is more informative because it has the potential to broaden knowledge and understanding of a topic.

Research on the classification of perspectives opens up a new line of research, since this area is still unexplored in NLP. By the end of this project, we will therefore have shed light on the identification of perspectives in entire text passages. In addition, we will have developed and evaluated a pipeline that automates the recommendation of documents according to the criteria above. Finally, we will have conducted a user study to assess the quality of the recommendations in an application scenario.

CEAT: Computational Event Evaluation based on Appraisal Theories for Emotion Analysis (2021–2024)

Emotion analysis has so far usually been formulated as a text classification task in which predefined classes are assigned to text segments. The classes typically correspond to the basic emotions proposed by Ekman (anger, fear, joy, surprise, sadness, disgust) or Plutchik (additionally trust and anticipation). A further alternative is the valence-arousal-dominance model as a reference system. These approaches, however, reflect a gap in the state of research between psychology and computational linguistics: in the former field, appraisal theories are accepted, but they have so far never been used for text analysis.

With the CEAT project, we narrow this gap in the state of research between the disciplines. We build computational models based on the cognitive appraisal of events and, to a lesser extent, on descriptions of bodily reactions and the motivational component of emotions. As the basis for modeling cognitive appraisal, we use the work of Smith/Ellsworth (1985), who showed that the variables of how pleasant an event is, how responsible one feels, how certain one is, how much attention one pays to the event, and how much situational control one has are sufficient to discriminate between 15 emotions.

In this project, we build two models that assign these appraisal dimensions to textual event descriptions, one based on semantic parsing and the other based on deep neural networks. These dimensions are then used to predict the emotion that is likely to be associated with the described event. These models will make it possible for the first time to assign emotions to event descriptions even when emotion words or direct mentions of the emotion are not available.

Involved personnel: Roman Klinger (PI), Laura Oberländer, Enrica Troiano

FIBISS: Automatic Fact Checking for Biomedical Information in Social Media and Scientific Literature

Research on methods for automatic fact checking, i.e., computational models that can distinguish correct information from misinformation or disinformation, focuses largely on the news domain. News items, including those shared on social media, are checked for their truthfulness. Such methods have not yet been developed for the biomedical domain. Particular challenges here include the abundance of existing (established) information sources, the complexity of the information they contain, and the difference between the language used by experts and by medical laypeople. In this project, we develop information extraction systems for lay and expert language, along with methods to automatically map the extracted information onto each other, to align information automatically in this shared semantic space, and finally to check its truthfulness against established sources. The project thus combines methods from transfer learning, information extraction, and fact checking for the biomedical domain, particularly in social media.

Involved personnel: Roman Klinger (project lead), Amelie Wührl

SEAT (Structured Multi-Domain Emotion Analysis from Text, 2018–2020)

Emotion analysis in natural language processing aims at associating text with emotions, for instance with anger, fear, joy, surprise, disgust or sadness. This task extends sentiment analysis, which adds further qualitative value in applications, for instance in social media analysis, in the analysis of fictional stories or news articles.

Existing research has so far mainly focused on the association of text with specific emotion models from psychological research. The development of methods for detecting phrases in text which denote the emotion experiencer (the character or person who feels the emotion), the emotion theme (the cause of the development of an emotion) as well as the modifiers of an emotion (intensifiers and diminishers) has been neglected.

In this project, we aim at filling this gap. We will develop manually annotated corpora from different domains (news, novels, social media) in German and English. Based on these resources, we develop models which are able to automatically recognize and extract such information. We work on different levels: Firstly, we connect words with emotions (with distributional and lexical methods), including grammatical variants. Secondly, we analyze these mentions in context with modifiers, the feeler and the theme (cause) of the emotion. Thirdly, we model this information in context, i.e., beyond separate mentions. All methods will be analyzed regarding their domain and language independence.

Involved personnel: Roman Klinger (project lead), Laura Bostan, Evgeny Kim

MARDY (Modeling Argumentation Dynamics, 2018–2021)

This interdisciplinary collaboration project involving Computational Linguistics, Machine Learning and Political Science has the aim of developing new computational models and methods for analyzing argumentation in political discourse – specifically capturing the dynamics of discursive exchanges on controversial issues over time. The goal is to develop tools to support analysis of the possible impact of arguments advanced by different political actors.

Involved personnel: Jonas Kuhn, Sebastian Padó (project leads at IMS), Erenay Dayanik, André Blessing

QUOTE (Comprehensive Analysis of Quotation, 2017–2020)

In many kinds of prose texts, both literary and newswire texts, reported speech plays an important role as a source of information about characters, their attitudes, and their relationships. Going further, such information can aid in the analysis of patterns of behavior and the construction of social networks.

While readers do not have any problem in assembling representations for complete situations from individual instances of reported speech, this is still a challenging task for computers. Current state-of-the-art methods are generally organized as "pipelines" which start from individual instances of reported speech and proceed incrementally to more global properties of the situation or characters. Since individual instances of reported speech are often short and uninformative, a pipeline procedure often causes prediction errors which cannot be rectified in retrospect.

In this project, we develop joint inference methods to model the various aspects of reported speech (Who is the speaker? The hearer? What is the content? What is the relationship between speaker and hearer?) together instead of individually. The resulting joint model takes account of the interdependencies between these aspects. Thus, information from the different aspects can complement each other. The result of this part of the project is a solid starting place (in terms of natural language processing methods) for the application of such methods for the automatic analysis of reported speech in digital humanities and social sciences.

This algorithmic goal is complemented by a goal from corpus and computational linguistics, namely elucidating the relationship between reported speech and other aspects of semantic analysis. In particular, there appears to be a close relationship between reported speech and (a subset of) semantic roles. Yet, no comprehensive formal analysis has been carried out so far. We will provide a linguistic characterization of the relationship and exploit it algorithmically to further improve the recognition of reported speech as discussed above. The result of this part of the project is the (at least partial) consolidation of two strands of research that have largely been treated as independent so far.

Involved personnel: Sebastian Padó (project lead), Sean Papay

Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories (2017-2019)

Newspapers were the first big data for a mass audience. Their dramatic expansion over the long nineteenth century created a global culture of abundant information. Yet the significance of the newspaper has largely been defined in national terms in literary-historical scholarship of the period, and newspapers are predominantly collected, digitized, and accessed through nationally-focused institutions. "Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914" (OcEx) brings together leading efforts in computational periodicals research to examine patterns of information flow across national and linguistic boundaries in nineteenth century newspapers and to link insights across large-scale corpora of digitized newspapers from national collections. For scholars of nineteenth century periodical culture and intellectual history, OcEx reframes how we understand the historical emergence of a globally-connected information network. It uncovers the ways that the international was refracted through the local as news, advice, vignettes, popular science, poetry, fiction, and more, all circulating around the globe and through multiple translations. By revealing the global networks through which texts and topics traveled in the period, OcEx promises to create an abundance of new evidence about how readers around the world perceived each other through the newspaper, evidence that will be of great interest to scholars in various fields. Computational linguistics and visualization provide a number of building blocks (recognizing translation, paraphrasing, text reuse, etc.) that can play enabling roles in scholarly investigations, with both historical and contemporary implications. At the same time, such methods raise fundamental questions regarding the validity and reliability of their results (such as the effects of noise in optical character recognition). 
Finally, by linking research across large-scale digital newspaper collections, OcEx will offer a model for national libraries and others developing large-scale data for digital scholarship. In tracing the ways texts, topics, and concepts crossed national and linguistic boundaries, Oceanic Exchanges seeks to break through the conceptual, institutional, and political barriers which have limited the promise of big data in the humanities: by bringing together historical newspaper experts from different countries and disciplines around common questions; by actively crossing the national boundaries that have previously separated digitized newspaper corpora, as well as those dividing public and private collections, through computational analysis; and by illustrating and making the global connectedness of nineteenth-century newspapers interactively explorable in ways hidden by typical organizations of digital cultural heritage along national lines.

Involved personnel at IMS: Sebastian Padó, Martin Riedl
Website: http://oceanicexchanges.org/

CRETA (Center for Reflected Text Analysis, 2016-2018)

CRETA is a BMBF-funded center for digital humanities whose goal is to collectively develop, test and use methods for reflected text analysis across text-oriented disciplines, connecting humanities subjects with computer science methods. The center will focus on methodological building blocks that are or can be used in more than one discipline and that will allow critically reflected insights into the topics under investigation. Our group's involvement is primarily in using language technology to further literary studies, notably by assigning emotions to characters in narrative text.

Involved personnel: Roman Klinger (project lead), Sebastian Padó (project lead), Evgeny Kim
More information: http://creta.uni-stuttgart.de

KABI (Confidence Estimation for Biomedical Information Extraction, 2016-2018)

KABI is a project funded by the program “RiSC – Research Seed Capital” of the State Ministry of Baden-Württemberg for Sciences, Research and Arts, proposed by Roman Klinger. In the life sciences, most information is only available as free text in scientific publications. Automatic methods to extract such knowledge and to provide it in structured databases face a dilemma: especially if potentially new information is detected in text, it is unclear whether this information is actually correct or whether it was extracted wrongly, for instance because the text is formulated in an uncommon way. In this project, methods will be developed which help to estimate the reliability of information extracted from biomedical publications.

Involved personnel: Roman Klinger (project lead), Camilo Thorne

Collaborations

Prof. Gemma Boleda, University Pompeu Fabra, Barcelona
Prof. Hanno Ehrlicher, University of Tübingen
Prof. Sebastian Haunss, University of Bremen
Prof. Roman Klinger, University of Bamberg
Prof. Alessandro Lenci, University of Pisa
Prof. Jan Snajder, University of Zagreb

Theses in the department …

How do you find a topic for a Bachelor's or Master's thesis, and how do you prepare for it? …