Probabilistic Linguistic Models

Hinrich Schütze


Date: 2005-02-22

1. Problem Statement - Fragestellung

Much has been written about the need to combine symbolic and statistical methods in linguistics. We wholeheartedly agree with this view. However, a change of perspective is suggested in this proposal. Instead of viewing a theory with both symbolic and statistical elements as a combination, it is of interest to point out that every probabilistic model has a symbolic core. This symbolic core can be simple as in the case of Markov models where it corresponds to a directed graph with n times n edges where n is the number of nodes. It is more complex for a probabilistic context-free grammar (PCFG) - in that case it is a (symbolic) context-free grammar. But all probabilistic models have such a symbolic core. The alternative view then is that we should consider linguistic models that ``probabilize'' some of their symbolic components. This would mean that linguistic models come in two guises: traditional symbolic models and ``enhanced'' models that probabilize some of their components. This proposal argues that this is a promising area of linguistic research and offers a rich set of topics to prospective graduate students.
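The idea of ``probabilizing'' a symbolic core can be illustrated with a minimal sketch. The toy grammar, the rule counts, and the function names below are hypothetical and chosen only for illustration; relative-frequency estimation is just one of several ways in which the probabilities could be obtained.

```python
from collections import defaultdict

# Symbolic core: a toy context-free grammar (hypothetical rules, for illustration only).
SYMBOLIC_RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("she",)),
    ("NP", ("beer",)),
    ("VP", ("V", "NP")),
    ("VP", ("V",)),
    ("V", ("drinks",)),
]

# Hypothetical counts of how often each rule was used in some corpus.
RULE_COUNTS = [10, 6, 4, 7, 3, 10]


def probabilize(rules, counts):
    """Give each rule its relative frequency among rules with the same left-hand side."""
    totals = defaultdict(int)
    for (lhs, _), count in zip(rules, counts):
        totals[lhs] += count
    return {rule: count / totals[rule[0]] for rule, count in zip(rules, counts)}


def derivation_probability(pcfg, derivation):
    """Probability of a derivation: product of the probabilities of the rules used."""
    p = 1.0
    for rule in derivation:
        p *= pcfg[rule]
    return p


pcfg = probabilize(SYMBOLIC_RULES, RULE_COUNTS)

# Derivation of "she drinks beer" in the toy grammar.
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("she",)),
    ("VP", ("V", "NP")),
    ("V", ("drinks",)),
    ("NP", ("beer",)),
]
print(derivation_probability(pcfg, derivation))
```

The list of rules is the symbolic core and remains a perfectly ordinary context-free grammar; the probabilities are an additional layer on top of it. With the counts assumed here, the derivation receives a probability of about 0.168.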

For simplicity, I will refer to models that are partially probabilized as ``probabilistic models''. It would be clearer (but too cumbersome) to call them ``symbolic models which probabilize at least one part of their symbolic core.''

2. State of the Art - Stand der Forschung

2.1. Examples of probabilistic linguistic models

Historically, the only linguistic subdiscipline that took probabilistic models seriously was quantitative sociolinguistics (Labov 1994). Explanations were probabilistic (for example, the deletion rates of certain consonants in the pronunciation of English words) and were empirically validated against data sets compiled in extensive field work. It is true that the models used were mathematically simplistic and limited to mostly phonetic variation. But there is much linguistic insight in this early work, and credit is due to a group of researchers who stuck to their methodology even though it was frowned upon by the mainstream.

Language variation was resistant to the anti-probabilistic bias of mainstream linguistics because variation is hard to explain with a purely symbolic theory. To some extent, variation shares this characteristic with three other subdisciplines: language acquisition, historical change and typology. In these subdisciplines language often moves gradually from one state to another - or related states coexist. Symbolic theories posit two distinct states to explain these phenomena. An example is that at some point English (or the English of individual speakers) switched from a state of analyzing ``he is going to drink a beer'' as a statement about a walking action to a state where it was an expression of future tense. Or children at some point switch from a state in which ``I explained him the game'' is grammatically correct to a state where it is not. Still another example is the constraint that topics precede foci. It is grammatically mandatory in some languages, but coexists as a weaker regularity with other ``laws'' in languages like English (where it surfaces, for example, in the locative construction). It is descriptively adequate to posit two distinct states and a mechanism that switches between them. But there is a potential here for more explanatory linguistic models - probabilistic models that unify the two states and explain a state transition as a change of parameters within the same model. Several examples of linguistic explanations along these lines, selected more or less arbitrarily, follow.

Bybee (2001) explains several observations about liaison in French using exemplar theory. In exemplar theory, all linguistic material is stored permanently in the form of exemplars after it has been either produced or perceived. Production and perception then use exemplars to process language. For example, a word that has never been produced before (e.g., ``flume'') will be produced in analogy to the exemplars of previously produced similar words (``gloom'', ``room'', ``floor''). Exemplar theory can model many phenomena in French liaison. Liaisons between frequent words have persisted longer than liaisons between infrequent words. If one of two variants of a word becomes extinct, then it is the one that is less frequent (word-initial ``z'' must be pronounced for ``(z)yeux'', word-final ``t'' must be omitted for ``apparaissant''). Exemplar theory formalizes the tension between (1) generalizations that have a tendency to spread (the number of silent word-final consonants increases over time) and (2) cases that resist the generalization either permanently (the ``z'' in ``(z)yeux'') or temporarily (frequent constructions, e.g., those with object pronouns, maintain liaison). This resistance can be explained by the basic mechanism of exemplar theory: a dense cloud of similar exemplars protects against the encroachment of spreading change. If there is no such dense cloud, the change is predicted to take effect.

Haspelmath (2004) explains the universal that extroverted verbs express reflexives in a more complex way than introverted verbs by the principle of economic motivation. If one applies shaving (an introverted verb) to oneself, then this is expected and can be expressed with an unstressed particle if reflexivity is expressed at all. On the other hand, hating (an extroverted verb) is usually not ``self-directed'', so reflexivity needs to be expressed clearly and with more phonological material. Haspelmath suggests that this type of expectation is a fact about the world and proposes a frequentist account (a particular type of probabilistic model).

In the tradition of Rumelhart and McClelland (1986), Schütze (1997) shows that probabilistic linguistic models can explain the acquisition of complex English subcategorization frames. He gives evidence that non-probabilistic models such as the one proposed by Pinker (1989) are not explanatorily adequate. The idea behind the learning model is that learning captures broad generalizations first, which may be initially applied too broadly, resulting in overgeneralization. Exceptions are learned subsequently. Negative evidence is implicit in a combination of frequency and model fit. Simplifying somewhat, there are two types of negative evidence. Negative evidence for frequent items consists of simple absence. The frequent verb ``explain'' does not participate in the dative alternation because we don't experience it in this construction. Negative evidence for infrequent items consists of absence in the class. The infrequent verb ``bond'' does not form the past tense ``bont'' in analogy to ``sent'' because this pattern is absent from the class of English verbs that ``bond'' is a member of.

Some of the specific examples discussed here concern the language as a whole, not the language of an individual. But in each case there are examples of the same phenomenon that apply to an individual speaker. For example, variation in liaison occurs in individuals as well as historically.

Most current probabilistic accounts emphasize the explanatory importance of frequency. They are really frequentist models (a simple subclass of probabilistic models) and often lack mathematical sophistication. There is a great opportunity here to improve the state of the art by applying better models and more rigor. This process has started in exemplar theory (Kirchner, 1999; Pierrehumbert, 2001), but is at an early stage.

2.2. Probabilistic Models vs. Optimality theory

2.2.1. Standard optimality theory

Optimality theory has some of the same goals as probabilistic linguistics. One can view optimality theory as a framework that allows the statement of broad, explanatory generalizations and arbitration between them when they conflict. Similarly, probabilistic models capture broad generalizations in their symbolic core and use the apparatus of probability theory to mediate between them. Many phenomena receive an explanation in optimality theory that is elegant and concise and probabilistic models may not be able to improve on these accounts.

However, there is also a large class of phenomena that cannot be explained well in standard optimality theory. First, there are often several acceptable linguistic forms, not just one. Optimality theory relies on ranking and loses much of its appeal when we replace the simple winner-take-all approach with something more complex that allows several winners.

Equally important is the phenomenon of ``ganging up''. The violation of several lesser constraints is often worse than the violation of one big constraint. Optimality theory in its current form cannot formalize this. See Manning (2002) for discussion.
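The contrast between strict ranking and ``ganging up'' can be made concrete with a small sketch. The constraints, violation counts, and weights below are hypothetical, and the weighted sum is only one possible way of letting constraints interact more flexibly than ranking; it is not meant as a rendering of any particular published proposal.

```python
# Hypothetical constraints, listed from highest- to lowest-ranked.
CONSTRAINTS = ["C1", "C2", "C3"]

# Hypothetical violation counts for two candidate forms.
CANDIDATES = {
    "A": {"C1": 1, "C2": 0, "C3": 0},  # one violation of the big constraint
    "B": {"C1": 0, "C2": 2, "C3": 2},  # several violations of lesser constraints
}

# Illustrative weights for the weighted alternative (an assumption, not from the text).
WEIGHTS = {"C1": 3.0, "C2": 1.0, "C3": 1.0}


def ot_winner(candidates, ranking):
    """Strict domination: compare violation profiles lexicographically by rank."""
    return min(candidates, key=lambda c: tuple(candidates[c][k] for k in ranking))


def weighted_winner(candidates, weights):
    """Weighted violations: the candidate with the lowest total penalty wins."""
    return min(
        candidates,
        key=lambda c: sum(weights[k] * v for k, v in candidates[c].items()),
    )


print(ot_winner(CANDIDATES, CONSTRAINTS))    # B: C1 alone decides under strict ranking
print(weighted_winner(CANDIDATES, WEIGHTS))  # A: penalty 3.0 is lower than B's 4.0
```

Under strict ranking, no number of violations of C2 and C3 can outweigh a single violation of C1, so candidate B always wins. Once violations are weighted and summed, the lesser constraints can jointly outweigh the big one and candidate A wins.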

2.2.2. Stochastic optimality theory

These two limitations of classical optimality theory, its winner-take-all property and its difficulty in formalizing ``ganging up'', don't apply to stochastic OT. Stochastic OT (Boersma 1998) and other non-standard OT variants (Smolensky and Legendre 2005, Keller and Asudeh 2002) are probabilistic linguistic models in the sense in which the term is used here. Extensions of optimality theory that allow for more flexible interaction of constraints (more flexible than ranking) seem particularly promising for linguistic theories that can explain phenomena not yet explained by current theories.
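How a stochastic variant of OT can yield several winners may be sketched as follows. The sketch assumes the commonly described evaluation scheme in which each constraint has a numerical ranking value and Gaussian noise is added at every evaluation; the ranking values, the noise level, and the candidates are hypothetical. It addresses only the first limitation (several acceptable forms), not ``ganging up''.

```python
import random

# Hypothetical ranking values; the closer they are, the more often the ranking flips.
RANKING_VALUES = {"C1": 100.0, "C2": 98.0}

# Two hypothetical candidates, each violating exactly one constraint.
CANDIDATES = {
    "form_a": {"C1": 0, "C2": 1},
    "form_b": {"C1": 1, "C2": 0},
}


def evaluate_once(candidates, ranking_values, noise_sd, rng):
    """Perturb ranking values with Gaussian noise, then pick the winner by strict ranking."""
    noisy = {c: v + rng.gauss(0.0, noise_sd) for c, v in ranking_values.items()}
    ranking = sorted(noisy, key=noisy.get, reverse=True)
    return min(candidates, key=lambda cand: tuple(candidates[cand][c] for c in ranking))


def output_distribution(candidates, ranking_values, noise_sd=2.0, trials=10000, seed=0):
    """Estimate how often each candidate wins over repeated noisy evaluations."""
    rng = random.Random(seed)
    counts = {cand: 0 for cand in candidates}
    for _ in range(trials):
        counts[evaluate_once(candidates, ranking_values, noise_sd, rng)] += 1
    return {cand: n / trials for cand, n in counts.items()}


dist = output_distribution(CANDIDATES, RANKING_VALUES)
print(dist)
```

With the values assumed here, ``form_a'' wins in the majority of evaluations, but ``form_b'' also surfaces in a substantial minority of them. Setting the noise to zero recovers the winner-take-all behavior of standard OT.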

2.3. Explanatory adequacy of probabilistic models

In this section, we review some common arguments against the theoretical adequacy of probabilistic models for language.

2.3.1. Colorless green ideas

The most famous and most infamous argument against probabilistic models is Chomsky's. Chomsky argued that there are both grammatical and ungrammatical sentences that we have never seen. If grammaticality is a function of how frequently a sentence occurs, then a probabilistic model cannot distinguish between grammatical and ungrammatical sentences (Chomsky, 1957). This is a convincing argument against a particular type of probabilistic model, one that estimates the probability of a sentence as its relative frequency. But it is not an argument against other types of models. For a more general class of models, Markov models, Chomsky showed that they also do not model language correctly. Again, this means that Markov models are not adequate and is not an argument against probabilistic linguistics in general. For example, the argument has nothing to say about PCFGs (although it is hard to argue that PCFGs are linguistically explanatory). See Abney (1996) for a discussion of Chomsky's arguments.
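The point that the argument only hits relative-frequency models can be seen in a minimal sketch. The toy corpus and the function name are hypothetical; the two unseen test sentences are the familiar word strings from Chomsky's example.

```python
from collections import Counter

# Hypothetical toy corpus of observed sentences.
CORPUS = [
    "the dog sleeps",
    "the dog sleeps",
    "the cat sleeps",
]


def relative_frequency_model(corpus):
    """Return a function that maps a sentence to its relative frequency in the corpus."""
    counts = Counter(corpus)
    total = len(corpus)
    return lambda sentence: counts[sentence] / total


p = relative_frequency_model(CORPUS)

grammatical_unseen = "colorless green ideas sleep furiously"
ungrammatical_unseen = "furiously sleep ideas green colorless"

print(p("the dog sleeps"))
print(p(grammatical_unseen), p(ungrammatical_unseen))
```

Both unseen sentences receive probability zero, so this model cannot tell the grammatical one from the ungrammatical one. A model that assigns probabilities through a structured symbolic core is not forced to treat all unseen sentences alike.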

2.3.2. Empirical validation

Statistical models are sometimes accused of not being explanatory because any data set can be fitted by either fiddling with the parameters of the model or by declaring counterexamples to be exceptions. This is a valid concern. Probabilistic models have more knobs to tweak than their symbolic equivalents -- each parameter is a knob. And each parameter can take on an infinite number of values. So empirical validation is definitely a challenge for probabilistic models.

But empirical validation is a challenge for other linguistic theories too. Witness this discussion between two generative grammarians:

The corresponding argument between two probabilistic linguists is: Probabilistic theories of language make statements about distributions and overall regularities. Individual counterexamples cannot be used to falsify them. A debate about the adequacy of a probabilistic linguistic model must therefore be a debate about its symbolic core; about the way it models learning; about the overall distribution of data in the language under study; and about similar properties. Just as we don't let the generative grammarian off the hook when she appeals to her own dialect, we shouldn't let the probabilistic linguist get away with dismissing exceptions too easily. Empirical validation is a hard problem in all sciences and needs to be approached with great care.

2.3.3. Integers, rationals, reals

I haven't been able to find this argument in print, but many linguists are uncomfortable with including numbers in a linguistic theory. There is less resistance to integers because of the success of optimality theory. Ranking is an important explanatory device and it is equivalent in representational power to integers. But rationals and reals are met with great suspicion. I've even heard the argument that rationals are more acceptable than reals because rationals can be viewed as ratios of integers whereas no such reduction to integers is possible for reals.

Perhaps the perceived problem is that counting and integers are natural parts of language (e.g., every language has words for them), but rationals and reals are not. But this can only be an argument about the subject of our research in linguistic science (the languages we study like English and Chinese), not about our scientific metalanguage. Be that as it may, there is no scientific basis for a priori exclusion of numbers from linguistic explanations.

2.3.4. A correlation is not an explanation

There are many probabilistic investigations of language that are descriptively oriented. For example, Zipf's law states that the frequency of a word is roughly inversely proportional to its rank (where words are ranked according to frequency), so that rank and frequency are linearly related on a logarithmic scale. It is not clear whether this ``law'' explains anything or whether it is in turn in need of explanation. In fact, Zipf proposed an informal probabilistic account of his law. But Zipf's law, in its descriptive form, is often regarded as a typical example of probabilistic linguistics.
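In its usual descriptive form, Zipf's law can be written as follows. The symbols are introduced here only for illustration: $f(r)$ is the frequency of the word of rank $r$, $C$ is a corpus-dependent constant, and $s$ is an exponent that is close to 1 in the classical statement.

```latex
\[
  f(r) \approx \frac{C}{r^{s}}
  \qquad\Longleftrightarrow\qquad
  \log f(r) \approx \log C - s \log r
\]
```

The right-hand form shows why the relationship appears as an approximately straight line when rank and frequency are both plotted on logarithmic scales. The formula describes a regularity; by itself it says nothing about why the regularity holds.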

Again, this objection to probabilistic explanations in language is a valid objection to particular instances, like Zipf's law. But it does not apply to probabilistic explanations in general, for example, to those in Section 2.1.

2.3.5. It's behaviorism

Some scientific subcommunities espouse linguistic behaviorism. By behaviorism I mean a theory of language that attempts to explain linguistic phenomena by stimulus-response learning (Pulvermüller, 1999). Anybody who understands language at the level of the average linguist cannot agree with this view. Language cannot be equated with salivation. In general, probabilistic linguistic theories involve some kind of learning that may look like stimulus-response learning to the uninitiated. Hence the suspicion that theories with probabilistic or statistical elements are behaviorist and cannot be accepted.

The fallacy here is that, by definition, learning involves stimuli and responses. Perhaps memorization is a form of learning that merely collects stimuli, but pure memorization is not learning if there is no potential of acting on the memories. So the distinction between behaviorist and non-behaviorist learning is not that one involves stimuli and responses and the other does not. The difference is that non-behaviorist theories admit to the possibility that there is some form of prior (or innate) knowledge about the learning problem - the ``bias'' as it is technically called in machine learning. Behaviorism rejects strong ``nativism''. It denies the existence of complex innate knowledge or at least deems its study unscientific. The behaviorist learner is an association machine that associates entities with other entities without any active intervention. (This is somewhat of a caricature, but one that is not too far from the truth.)

The debate about the nature of innate human knowledge is largely independent of the position one takes on probabilistic vs. non-probabilistic models. If anything, the debate is likely to be more informed among practitioners of probabilistic linguistics since they learned in Introductory Machine Learning that learning without bias is impossible. So the question is not: Is there a bias, yes or no? It must be: What is the bias (the innate knowledge)?

In summary, behaviorist models can be viewed as a subclass of probabilistic models, but in general probabilistic models are not behaviorist.

2.3.6. It's engineering

Most probabilistic models for language in use today have little explanatory adequacy. It is true that fundamental assumptions are often justified linguistically. For example, data-oriented parsing analyzes the structure of a sentence in analogy to the structure of known sentences, an exemplar-based approach that is quite similar in spirit to exemplar theory (Bod et al., 2003). But it is unclear how much data-oriented parsing has to say about linguistic theory apart from this very important, but also very basic insight.

There are also examples of more linguistically oriented models, but they are few and far between compared to the mass of work firmly anchored in the engineering sciences. Other uses of statistics and probability theory as an auxiliary science can also be assigned to this category, hypothesis testing being the most prominent example: probability serves an important function, but it is not part of the theoretical apparatus.

Perhaps surprisingly, I would also question the role of corpus linguistics in this context. There can be no doubt that corpus linguistics is absolutely essential for theoretical linguistics. I would claim that many advances in linguistics in the last decade have been motivated and supported by corpus-based work. One example is the theoretical understanding of subcategorization in English, which has evolved considerably because of the now widespread use of corpus-based methods (Manning, 2002). Another one is that our theoretical understanding of the lexicon has changed substantially, partly because of corpus-based lexical resources like WordNet (Miller et al., 1990) and FrameNet (Fillmore and Baker, 2001). And there are many more.

But corpus linguistics has mostly had the role of an auxiliary science. It has not directly contributed to theoretical advances.

My claim that there is a dearth of work in probabilistic linguistics (work that is strong probabilistically as well as linguistically) is not an indictment of this approach. It just means that a lot of excellent research is being done in engineering and corpus linguistics and that it has goals other than contributing directly to linguistic theory. Far from being an argument against this emerging field, it suggests there is a great opportunity for probabilistic linguists to do innovative research in an area that is just beginning to evolve.

2.3.7. Performance vs. Competence

If we define linguistics as the study of linguistic competence and if competence is the core of grammar that has no strong interactions with other cognitive abilities, then probabilistic models have little to contribute to linguistics. Linguists who view the performance-competence distinction as a central tenet of linguistics (as opposed to a research strategy that directs attention to a subset of linguistic phenomena that is of particular importance) are unlikely to find much of interest in probabilistic explanations.

2.3.8. Language is not random

The prototypical probabilistic device is a coin. We toss it and it randomly comes up heads or tails. How can this be a valid model of language? Clearly, the sentences we produce are not strings of randomly selected words.

The pedestrian defense against this view is to point out that we mostly work with conditional probabilities. For example, if I see a jay-walker crossing the street and a big truck approaching, I might shout either ``Stop!'' or ``Watch out!'' It seems plausible that there is some randomness in which one I choose.

A more philosophical answer might be that heads or tails depends deterministically on the way I toss the coin. A probabilistic model is the most explanatory model of the tossing without making any assertions about lofty concepts like free will or determinism. I view randomness as a subjective interpretation of the model that neither adds to nor subtracts from its explanatory power.

2.3.9. Can probability be the truth about language?

Intuitions vary considerably, but most would agree that many linguistic phenomena are best explained non-probabilistically. In English, the subject noun phrase precedes the verb phrase. The most insightful and most explanatory way of stating this scientific fact in a theory of linguistics will always be S $\Rightarrow$ NP VP or a variant thereof. If some explanations are necessarily symbolic, isn't it a problem to have a mix of probabilistic and symbolic explanations? Can language be both probabilistic and symbolic?

An analogy from physics may help. The gas law states:

$$pV=nRT$$

where $p$ is the pressure, $V$ the volume, $n$ the number of moles, $R$ the gas constant, and $T$ the temperature. Ultimately, the gas law can be derived from the kinetics of individual molecules. But clearly the right level of explanation, the formulation that allows us best to understand its main insight and also the formulation that allows us best to make predictions is the level at which it is stated. There can be no more concise and explanatory statement about the set of phenomena we want to capture here than the six-character string ``pV=nRT''.
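For readers who want to see the molecular-level derivation alluded to here, the standard textbook route through kinetic theory can be sketched as follows. The additional symbols are not part of the text above: $N$ is the number of molecules, $m$ the mass of one molecule, $\langle v^2 \rangle$ the mean squared speed, $k_B$ the Boltzmann constant, and $N_A$ the Avogadro constant.

```latex
\begin{align*}
pV &= \tfrac{1}{3} N m \langle v^2 \rangle
   && \text{(pressure from molecular collisions with the walls)} \\
\tfrac{1}{2} m \langle v^2 \rangle &= \tfrac{3}{2} k_B T
   && \text{(mean kinetic energy per molecule)} \\
\Rightarrow \quad pV &= N k_B T = nRT
   && \text{(using } N = n N_A \text{ and } R = N_A k_B \text{)}
\end{align*}
```

The first two lines are statements about averages over very many randomly moving molecules; the last line no longer mentions molecules at all.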

This physics example is an analogy on two levels. First, it shows that there is no dichotomy between probabilistic and symbolic explanations. The symbolic explanation (the ``better'' explanation in this case) emerges from the probabilistic one. And there is nothing novel about pointing this out: Emergence of symbols and symbolic relationships is a staple of connectionism (Rumelhart et al., 1986; McClelland et al., 1986).

Secondly, the ardent probabilist may be tempted to claim that her explanation is better because it's more basic and closer to the ultimate truth of the basic laws (in this case, molecular kinetics). But as we know, molecules consist of atoms, atoms consist of particles, etc. The idea that one level of explanation is superior to another on metaphysical grounds is typical of the logical positivist program that all of science can be axiomatized like mathematics and then derived from axioms. The discussion in this article is in the spirit of Dupré who rejects the positivist program and accepts the diversity and disorder of the world (Dupré, 1993). If we let different theories coexist with each other, then there is no a priori reason to prefer one level of explanation to another - or to object to mixing them for that matter. Each has to fend for itself with arguments like explanatory adequacy and predictiveness without the positivist belief that there is a single truth, a unified theory that explains all phenomena.

3. Results of this Research Group - Eigene Vorarbeiten

There is a tradition of work on probabilistic models at the Institut für Maschinelle Sprachverarbeitung reaching back more than 10 years. Topics have included part of speech tagging (Schmid, 1994), head-lexicalized PCFGs (Carroll and Rooth, 1998), learning semantic roles from corpora (Rooth et al., 1999; Beil et al., 1999), probabilistic morphology (Schmid, 2005), and statistical models of collocations (Evert, 2004). This work was mainly concerned with solving computational problems in applications like machine-readable dictionaries and grammar development. Linguistic explanation usually was a secondary goal. Still, there is rich expertise in probabilistic models in the research group that will be invaluable to students working on the project proposed here.

Research results in the area of child language acquisition were discussed earlier (Schütze, 1997).

4. Topics for Doctoral and Postdoctoral Students

A list of topics follows. These are merely suggestions. Any topic in the area of probabilistic linguistics would be appropriate as long as it is well-founded both linguistically and mathematically. Additional topics are listed in the following section.

5. Position within the Graduiertenkolleg - Verknuepfung mit anderen Projekten des Graduiertenkollegs

In principle, the question posed in the beginning of this proposal arises in each of the other projects in the Graduiertenkolleg: Is this linguistic phenomenon best explained by a purely symbolic theory or does a model that partially probabilizes its symbolic core yield better explanations? In working with students, preference would be given to those linguistic phenomena that are the subject of one of the other projects. The student could then contribute a probabilistic perspective to the research conducted in the second project. This would further collaboration of students and advisors within the Graduiertenkolleg. Some examples follow.

I am grateful to Artemis Alexiadou, Bernd Moebius, Hans Kamp, and Jonas Kuhn for comments on earlier drafts.

6. References

Steven Abney.
Statistical methods and linguistics.
In Judith Klavans and Philip Resnik, editors, The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 1-26. The MIT Press, 1996.

Franz Beil, Glenn Carroll, Detlef Prescher, Stefan Riezler, and Mats Rooth.
Inside-outside estimation of a lexicalized PCFG for German.
In Proc. of ACL, 1999.

Rens Bod, Remko Scha, and Khalil Sima'an.
Data-Oriented Parsing.
CSLI Publications, 2003.

Paul Boersma.
Functional phonology: Formalizing the interactions between articulatory and perceptual drives.
PhD thesis, University of Amsterdam, 1998.

Joan Bresnan.
The emergence of the unmarked pronoun.
In Geraldine Legendre, Sten Vikner, and Jane Grimshaw, editors, Optimality-theoretic Syntax. The MIT Press, 2000.

Joan Bybee.
Frequency effects on French liaison.
In Joan Bybee and Paul Hopper, editors, Frequency and the Emergence of Linguistic Structure, pages 337-359. John Benjamins, Amsterdam, 2001.

Glenn Carroll and Mats Rooth.
Valence induction with a head-lexicalized PCFG.
In Proc. of EMNLP, Granada, Spain, 1998.

Noam Chomsky.
Syntactic Structures.
Mouton, The Hague, 1957.

John Dupré.
The Disorder of Things.
Harvard University Press, 1993.

Stefan Evert.
The statistical analysis of morphosyntactic distributions.
In Proc. of LREC, pages 1539-1542, Lisbon, Portugal, 2004.

Charles J. Fillmore and Collin F. Baker.
Frame semantics for text understanding.
In Proc. of WordNet and Other Lexical Resources Workshop, NAACL, 2001.

Martin Haspelmath.
A frequentist explanation of some universals of reflexive marking.
Handout, 2004.

Frank Keller and Ash Asudeh.
Probabilistic learning algorithms and optimality theory.
Linguistic Inquiry, 33(2):225-244, 2002.

Robert Kirchner.
Preliminary thoughts on phonologization within an exemplar-based speech processing system.
Technical report, UCLA Working Papers in Linguistics, Los Angeles CA, 1999.

Jonas Kuhn.
Optimality-Theoretic Syntax - A Declarative Approach.
CSLI Publications, 2003.

William Labov.
Principles of linguistic change. Volume 1: Internal factors.
Blackwell, 1994.

Chris Manning.
Probabilistic syntax.
In Rens Bod, Jennifer Hay, and Stefanie Jannedy, editors, Probabilistic Linguistics. MIT Press, Cambridge MA, 2002.

James L. McClelland, David E. Rumelhart, and the PDP Research Group, editors.
Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models.
The MIT Press, Cambridge, MA, 1986.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller.
Introduction to WordNet: An on-line lexical database.
International Journal of Lexicography, 3(4):235-244, 1990.

Janet Pierrehumbert.
Exemplar dynamics: Word frequency, lenition, and contrast.
In Joan Bybee and Paul Hopper, editors, Frequency and the Emergence of Linguistic Structure, pages 137-157. John Benjamins, Amsterdam, 2001.

Steven Pinker.
Learnability and Cognition.
The MIT Press, Cambridge MA, 1989.

Friedemann Pulvermüller.
Words in the brain's language.
Behavioral and Brain Sciences, 22:253-336, 1999.

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard S. Crouch, John T. Maxwell III, and Mark Johnson.
Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques.
In ACL, pages 271-278, 2002.

Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil.
Inducing a semantically annotated lexicon via EM-based clustering.
In Proc. of ACL, 1999.

Antje Rossdeutscher and Hans Kamp.
Remarks on lexical structure and DRS construction.
Theoretical Linguistics, 20(2/3):97-164, 1994.

D. E. Rumelhart and J. L. McClelland.
On learning the past tenses of English verbs.
In James L. McClelland, David E. Rumelhart, and the PDP Research Group, editors, Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models, pages 216-271. The MIT Press, Cambridge, MA, 1986.

David E. Rumelhart, James L. McClelland, and the PDP research group, editors.
Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1: Foundations.
The MIT Press, Cambridge, MA, 1986.

Helmut Schmid.
Probabilistic part-of-speech tagging using decision trees.
In Proc. of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49, 1994.

Helmut Schmid.
Disambiguation of morphological structure using a PCFG.
Submitted, 2005.

Hinrich Schütze.
Ambiguity Resolution in Language Learning.
CSLI Publications, Stanford, CA, 1997.

Paul Smolensky and Geraldine Legendre.
The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar.
MIT Press, 2005.

Whitney Tabor.
Syntactic Innovation: A Connectionist Model.
PhD thesis, Stanford University, 1994.