Presentation Transcript


  1. Linguistics Methodology meets Language Reality: the quest for robustness, scalability, and portability in (spoken) language applications Bob Carpenter SpeechWorks International

  2. The Standard Cliché(s) • Moore’s Cliché: • Exponential growth in computing power and memory will continue to open up new possibilities • The Internet Cliché: • With the advent and growth of the world-wide web, an ever increasing amount of information must be managed

  3. More Standard Clichés • The Convergence Cliché: • Data, voice and video networking will be integrated over a universal network that: • includes land lines and wireless; • includes broadband and narrowband • likely implementation is IP (internet protocol) • The Interface Cliché: • The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces • Speech will become as common as graphics

  4. Some Comp Ling Clichés • The Standard Linguist’s Cliché • But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. • Noam Chomsky, 1969 [essay on Quine] • The Standard Engineer’s Cliché • Anytime a linguist leaves the group the recognition rate goes up. • Fred Jelinek, 1988 [address to DARPA]

  5. The “Theoretical Abstraction” • mature, monolingual, native language speaker • idealized to complete knowledge of language • static, homogeneous language community • all speakers learn identical grammars • “competence” (vs. “performance”) • “performance” is a natural class • wetware “implementation” follows theory in divorcing “knowledge of language” from processing • assumes the existence and innateness of a “language faculty”

  6. The Explicit Methodology • “Empirical” Basis is binary grammaticality judgements • “intuitive” (to a “properly” trained linguist) • innateness and the “language faculty” • appropriate for phonetics through dialogue • in practice, very little agreement at boundaries and no standard evaluations of theories vs. data • Models of particular languages • by grammars that generate formal languages • low priority for transformationalists • high priority for monostratalists/computationalists

  7. The Holy Grail of Linguistics • A grammar meta-formalism in which • all and only natural language grammars (idealized as above) can be expressed • assumed to correspond to the “language faculty” • Grail is sought by every major camp of linguists • Explains why all major linguistic theories look alike from any perspective outside of a linguistics department • The expedient abstractions have become an end in themselves

  8. But, Applications Require • Robustness • acoustic and linguistic variation • disfluencies and noise • Scalability • from embedded devices to palmtops to clients to servers • across tasks from simple to complex • system-initiative form-filling to mixed initiative dialogue • Portability • simple adaptation to new tasks and new domains • preferably automated as much as possible

  9. The $64,000 Question • How do humans handle unrestricted language so effortlessly in real time? • Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue • Psycholinguistics has uncovered some baselines: • lexicon (and syntax?): highly parallel • time course of processing: totally online • information integration: <= 200ms for all sources • But is short on explanations

  10. (AI) Success by Stupidity • Jaime Carbonell’s Argument (ECAI, mid 1990s) • Systems appear “intelligent” because they’re too limited to do anything wrong: the “right” answer is hardcoded • Typical in Computational NL Grammars • lexicon limited to the demo • rules limited to common ones (eg: no heavy shift) • Scaling up usually destroys this limited “success” • 1,000,000s of “grammatical” readings with large grammars

  11. My Favorite Experiments: I • Mike Tanenhaus et al. (Univ. Rochester) • Head-Mounted Eye Tracking: “Pick up the yellow plate” • Eyes track semantic resolution; ~200 ms tracking time • Clearly shows that understanding is online

  12. My Favorite Experiments (II) • Garden Paths are Context Sensitive • Crain & Steedman (U.Connecticut & U. Edinburgh) • if noun is not unique in context, postmodification is much more likely than if noun picks out unique individual • Garden Paths are Frequency and Agreement Sensitive • Tanenhaus et al. • The horse raced past the barn fell. (raced likely past) • The horses brought into the barn fell. (brought likely participle, and less likely activity for horses)

  13. Stats: Explanation or Stopgap • A Common View • Statistics are some kind of approximation of underlying factors requiring further explanation. • Steve Abney’s Analogy (AT&T Labs) • Statistical Queueing Theory • Consider traffic flows through a toll gate on a highway. • Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc. • Statistics is more insightful [explanatory] in this case as it captures emergent generalizations • It is a reductionist error to insist on a low-level account

  14. Competence vs. Performance • What is computed vs. how it is computed • The what can be traditional grammatical structure • Not all structures are computed, regardless of the how • Define the what probabilistically, independently of the how

  15. Algebraic vs. Statistical • False Dichotomy • All statistical systems have an algebraic basis, even if trivial • The Good News: • Best statistical systems have best linguistic conditioning (most “explanatory” in traditional sense) • Statistical estimators far less significant than the appropriate linguistic conditioning • Rest of the talk provides examples of this

  16. Bayesian Statistical Modeling • Concerned with prior and posterior probabilities • Allows updates of reasoning • Bayes’ Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A) • Eg: Source/Channel Model for Speech Recognition • Ws: sequence of words • As: sequence of acoustic observations • Compute ArgMax_Ws P(Ws|As) = ArgMax_Ws P(As|Ws) P(Ws) / P(As) = ArgMax_Ws P(As|Ws) P(Ws) • P(As|Ws): acoustic model • P(Ws): language model
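
A minimal sketch of the source/channel argmax in code, assuming toy log-domain scores for a handful of candidate word sequences (the candidates and numbers are hypothetical, not from the talk):

```python
# Noisy-channel decoding sketch: pick ArgMax_Ws P(As|Ws) P(Ws).
# Scores are log probabilities, so the product becomes a sum; P(As) is
# constant across candidates and drops out of the argmax.
candidates = {
    # word sequence: (log P(As|Ws) acoustic score, log P(Ws) language score)
    "flights from boston today": (-12.0, -8.5),
    "flights from austin today": (-12.3, -8.1),
    "lights for boston to pay":  (-11.8, -14.0),
}

def decode(cands):
    return max(cands, key=lambda ws: sum(cands[ws]))

print(decode(candidates))  # -> 'flights from austin today'
```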

  17. Simple Bayesian Update Example • Monty Hall’s Let’s Make a Deal • Three curtains with a prize behind one, no other info • Contestant chooses one of three • Monty then opens the curtain of one of the others that does not have the prize • if you choose curtain 2, then one of curtain 1 or 3 must not contain the prize • Monty then lets you either keep your first guess, or change to the remaining curtain he didn’t open. • Should you switch, stay, or doesn’t it matter?

  18. Answer • Yes! You should switch. • Why? Consider the possibilities (the slide tabulates where the prize is vs. which curtain you select): • Switch: P(win) = 2/3 • Stay: P(win) = 1/3
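
A quick simulation confirms the 2/3 vs. 1/3 split; a sketch, not part of the original slides:

```python
import random

def play(switch, doors=3, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(doors)
        pick = random.randrange(doors)
        # Monty opens a door that is neither your pick nor the prize.
        opened = next(d for d in range(doors) if d != pick and d != prize)
        if switch:
            # Move to the one remaining unopened door.
            pick = next(d for d in range(doors) if d != pick and d != opened)
        wins += (pick == prize)
    return wins / trials

print("switch:", play(switch=True))   # ~0.667
print("stay:  ", play(switch=False))  # ~0.333
```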

  19. Defaults via Bayesian Inference • Bayesian Inference provides an explanation for “rationality” of default reasoning • Reason by choosing an action to maximize expected payoff given some knowledge • ArgMax_Action Payoff(Action) * P(Action|Knowledge) • Given additional information update to Knowledge’ • ArgMax_Action Payoff(Action) * P(Action|Knowledge’) • Chosen action may be different, as in Let’s Make a Deal • Inferences are not logically sound, but are “rational” • Bayesian framework integrates partiality and uncertainty of background knowledge
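
A small sketch of this decision rule. The slide's ArgMax over payoff times probability is spelled out here as an expectation over outcomes; the payoff table and the two belief states are entirely made up for illustration:

```python
# Choose the action that maximizes expected payoff under current beliefs.
def best_action(payoff, p_outcome):
    # payoff[action][outcome]; p_outcome[outcome] is P(outcome | knowledge).
    def expected(action):
        return sum(payoff[action][o] * p for o, p in p_outcome.items())
    return max(payoff, key=expected)

payoff = {"carry_umbrella": {"rain": 1.0, "dry": -0.2},
          "leave_it":       {"rain": -1.0, "dry": 0.3}}

prior   = {"rain": 0.1, "dry": 0.9}   # background knowledge only
updated = {"rain": 0.7, "dry": 0.3}   # after new information (dark clouds)

print(best_action(payoff, prior))    # -> leave_it
print(best_action(payoff, updated))  # -> carry_umbrella (default overridden)
```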

  20. Example: Allophonic Variation • English Pronunciation (M. Riley & A. Ljolje, AT&T) • Derived from TIMIT with phoneme/phone labels • orthographic: bottle • phonological: / b aa t ax l / (ARPAbet phonemes) • phonetic: 0.75 [ b aa dx el ] (TIMITbet phones) • 0.13 [ b aa t el ] • 0.10 [ b aa dx ax l ] • 0.02 [ b aa t ax l ] • Allophonic variation is non-deterministic

  21. Eg: Allophonic Variation (cont’d) • Simple statistical model (simplified w/o insertion) • Estimate probability of phones given phonemes by the chain rule: • P(a1,…,aM | p1,…,pM) = P(a1 | p1,…,pM) * P(a2 | p1,…,pM, a1) * … * P(aM | p1,…,pM, a1,…,aM-1) • Approximate phoneme context to +/- K phones • Approximate phone history to 0 or 1 phones • 0: P(aJ | pJ-K,…,pJ,…,pJ+K) • 1: P(aJ | pJ-K,…,pJ,…,pJ+K, aJ-1) • Uses word boundary marker and stress
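
A rough sketch of estimating the history-0 model from aligned phoneme/phone pairs. The two training alignments below are toy data with a hypothetical deletion symbol "-"; real training would use a TIMIT-style corpus:

```python
from collections import Counter, defaultdict

K = 1  # phoneme context window: +/- K phonemes around the target

def contexts(phonemes, k=K, pad="#"):
    # Yield (p_{j-k}, ..., p_j, ..., p_{j+k}) for each position j.
    padded = [pad] * k + phonemes + [pad] * k
    for j in range(len(phonemes)):
        yield tuple(padded[j : j + 2 * k + 1])

# Toy aligned training data: (phoneme sequence, realized phone sequence).
training = [
    (["b", "aa", "t", "ax", "l"], ["b", "aa", "dx", "el", "-"]),
    (["b", "aa", "t", "ax", "l"], ["b", "aa", "t",  "el", "-"]),
]

counts = defaultdict(Counter)
for phonemes, phones in training:
    for ctx, phone in zip(contexts(phonemes), phones):
        counts[ctx][phone] += 1

def p_phone(phone, ctx):
    # Relative-frequency estimate of P(a_j | p_{j-K..j+K}).
    total = sum(counts[ctx].values())
    return counts[ctx][phone] / total if total else 0.0

print(p_phone("dx", ("aa", "t", "ax")))  # 0.5 on this toy data
```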

  22. Eg: Allophonic Variation (concl’d) • Cluster phonological features using decision trees • Sparse data smoothed by decision trees over standard features (+/- stop, voicing, aspiration, etc.) • Conditional entropy: without context 1.5 bits, with context 0.8 bits • Most likely allophone correct 85.5%, in top 5, 99% • Average 17 pronunciations/word to get 95% • Robust: handles multiple pronunciations • Scalable: to whole of English pronunciation • Portable: easy to move to new dialects with training • K. Knight (ISI): similar techniques for Japanese pronunciation of English words!
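
Conditional-entropy figures of this kind come straight from a counts table like the one in the previous sketch; a small self-contained version with a made-up table:

```python
import math
from collections import Counter

def conditional_entropy(counts):
    # counts: context -> Counter of realized phones.
    # H(phone | context) = sum_ctx P(ctx) * H(phone | ctx), in bits.
    total = sum(sum(c.values()) for c in counts.values())
    h = 0.0
    for phones in counts.values():
        n_ctx = sum(phones.values())
        h_ctx = -sum((n / n_ctx) * math.log2(n / n_ctx)
                     for n in phones.values())
        h += (n_ctx / total) * h_ctx
    return h

toy = {("aa", "t", "ax"): Counter({"dx": 3, "t": 1}),
       ("#", "b", "aa"):  Counter({"b": 4})}
print(round(conditional_entropy(toy), 3))  # 0.406 bits on this toy table
```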

  23. Example: Co-articulation • HMMs have been applied to speech since the mid-70s • Two major recent improvements, the first being simply more training data and cycles • Second is: Context-dependent triphones • Instead of one HMM per phoneme/phone, use one per context-dependent triphone • example: t-r+u ‘an r preceded by t and followed by u’ • crucially clustered by phonological features to overcome sparsity
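
A tiny illustration of the triphone naming convention in the slide's t-r+u example; the boundary symbol is an assumption added just to make it runnable, and the feature-based clustering step is omitted:

```python
def triphones(phones, boundary="sil"):
    # Map each phone to a context-dependent unit left-phone+right,
    # e.g. 't-r+u' is 'an r preceded by t and followed by u'.
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(triphones(["t", "r", "u"]))
# ['sil-t+r', 't-r+u', 'r-u+sil']
```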

  24. Exploratory Data Analysis (Trendier: data mining; Trendiest: information harvesting) • Specious Argument: A statistical model won’t help explain linguistic processes. • Counter 1: Abney’s anti-reductionist argument • But even if you don’t believe that: • Counter 2: In “other sciences” (pace linguistic tradition), statistics is used to discover regularities • Allophone example: “had your” pronunciation • / d / is 51% likely to be realized as [ jh ], 37% as [ d ] • if / d / is realized as [ jh ], / y / deletes 84% of the time • if / d / is realized as [ d ], / y / deletes 10% of the time

  25. Balancing Gricean Maxims • Grice gives us conflicting maxims: • quantity (exactly as informative as required) • quality (try to make your contribution true) • manner (be perspicuous; eg. avoid ambiguity, be brief) • Manner pulls in opposite directions • quality without ambiguity lengthens statements • quantity and (part of) manner require brevity • Balance by estimating a multidimensional “goodness” metric for generation

  26. Gricean Balance (cont’d) • Consider the problem of aggregation in generation • Every student ran slowly or every student walked quickly. Aggregates to: • Every student ran slowly or walked quickly. • This reduces sentence length, shortens clause length, and increases ambiguity. • These tradeoffs need to be balanced
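
A minimal sketch of a multidimensional "goodness" metric over generation candidates; the features, weights, and candidate annotations below are purely illustrative:

```python
# Score candidates by a weighted sum of Gricean pressures (lower is better).
# The weights are hypothetical and would be estimated from data.
WEIGHTS = {"length": 0.1, "ambiguity": 1.0, "missing_info": 2.0}

def goodness(candidate):
    features = {
        "length": len(candidate["text"].split()),
        "ambiguity": candidate["n_readings"] - 1,   # readings beyond one
        "missing_info": candidate["omitted_facts"],
    }
    return sum(WEIGHTS[f] * v for f, v in features.items())

candidates = [
    {"text": "Every student ran slowly or walked quickly.",
     "n_readings": 2, "omitted_facts": 0},
    {"text": "Every student ran slowly or every student walked quickly.",
     "n_readings": 1, "omitted_facts": 0},
]
print(min(candidates, key=goodness)["text"])
# With these weights the longer, unambiguous sentence wins; shifting the
# weights toward brevity flips the choice -- that is the balancing act.
```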

  27. Collins’ Head/Dependency Parser • Michael Collins 1998 UPenn PhD thesis • Parses WSJ with ~90% constituent precision/recall • Generative model of tree probabilities • Clever Linguistic Decomposition and Training • P(RootCat, HeadTag, HeadWord) • P(DaughterCat | MotherCat, HeadTag, HeadWord) • P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord) • P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)

  28. Eg: Collins’ Parser (cont’d) • Distance encodes heaviness • Adjunct vs. Complement modifiers distinguished • Head Words and Tags model lexical variation and word-word attachment preferences • Also conditions punctuation, coordination, UDCs • 12,000 word vocabulary plus unknown word attachment model (by Collins) and tag model (by A. Ratnaparkhi, another 1998 UPenn thesis) • Smoothed by backing off words to categories • Trivial statistical estimators; power is conditioning
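
A schematic of how such factored, backed-off scores combine; the event structure loosely follows the decomposition on the previous slide, but the probability tables and the backoff weight are hypothetical, not Collins' actual estimators:

```python
import math

def smoothed(p_word_cond, p_tag_cond, word_key, tag_key, lam=0.7):
    # Back off from a word-conditioned estimate to a tag-conditioned one.
    return (lam * p_word_cond.get(word_key, 0.0)
            + (1 - lam) * p_tag_cond.get(tag_key, 0.0))

# Hypothetical tables (a real model estimates these from a treebank).
p_dtr_given_word = {("VP", "S", "VBD", "rose"): 0.6}
p_dtr_given_tag  = {("VP", "S", "VBD"): 0.5}

def log_p_daughter(dtr, mother, head_tag, head_word):
    # One factor: P(DaughterCat | MotherCat, HeadTag, HeadWord).
    p = smoothed(p_dtr_given_word, p_dtr_given_tag,
                 (dtr, mother, head_tag, head_word),
                 (dtr, mother, head_tag))
    return math.log(p) if p > 0 else float("-inf")

# A tree's score is the sum of such log factors, one per generation event.
print(log_p_daughter("VP", "S", "VBD", "rose"))
```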

  29. Computational Complexity • Wide-coverage linguistic grammars generate millions of readings • But Collins’ parser runs faster than real time on a notebook on unseen sentences of length up to 100 • How? Pruning. • Collins found that tighter statistical estimates of tree likelihoods, from more features and more complex grammars, made the parser run faster because a tighter beam could be used • (E. Charniak & S. Caraballo at Brown have really pushed the envelope here)
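
A sketch of the beam pruning alluded to here, applied to competing chart edges over the same span; the edge scores and beam width are made up:

```python
import math

def beam_prune(edges, beam=math.log(1e-4)):
    # Keep only edges whose log prob is within `beam` of the best edge
    # over the same span; a tighter (less negative) beam prunes more.
    best = max(score for _, score in edges)
    return [(edge, score) for edge, score in edges if score >= best + beam]

edges = [("NP[0,3]", -2.1), ("VP[0,3]", -15.0), ("S[0,3]", -3.0)]
print(beam_prune(edges))
# [('NP[0,3]', -2.1), ('S[0,3]', -3.0)] -- the VP edge falls outside the beam
```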

  30. Complexity (cont’d) • Collins’ parser is not complete in the usual sense • But neither are humans (eg. garden paths) • Can trade speed for accuracy in statistical parsers • Syntax is not processed autonomously • Humans can’t parse without context, semantics, etc. • Even phone or phoneme detection is very challenging, especially in a noisy environment • Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online • Question is how to combine it with other factors

  31. N-best and Word Graphs • Speech recognizers can return n-best histories • flights from Boston today • flights from Austin today • flights for Boston to pay • lights for Boston to pay • Can also return a packed word graph of histories; the log prob of each path (the sum of its arc log probs) equals the joint acoustics / word-string log prob • [Slide figure: the hypotheses above packed into a single word graph]
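
A minimal word-graph representation along these lines; the arc scores below are made-up joint acoustic/LM log probs, and the path enumeration shows how the packed graph encodes the n-best list (and more):

```python
# A packed word graph: each arc is (from_state, to_state, word, log_prob).
ARCS = [
    (0, 1, "flights", -1.0), (0, 1, "lights", -3.0),
    (1, 2, "from", -0.5),    (1, 2, "for", -1.5),
    (2, 3, "Boston", -0.8),  (2, 3, "Austin", -1.1),
    (3, 5, "today", -0.6),   (3, 4, "to", -1.2),
    (4, 5, "pay", -0.9),
]
FINAL = {5}

def paths(state=0, words=(), score=0.0):
    # Enumerate complete paths with their total (summed) log probs.
    if state in FINAL:
        yield " ".join(words), score
    for frm, to, word, lp in ARCS:
        if frm == state:
            yield from paths(to, words + (word,), score + lp)

for words, score in sorted(paths(), key=lambda x: -x[1])[:4]:
    print(f"{score:7.2f}  {words}")   # top 4 hypotheses by total log prob
```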

  32. Probabilistic Graph Processing • The architecture we’re exploring in the context of spoken dialogue systems involves: • Speech recognizers that produce probabilistic word graph output • A tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities • A parser that transforms a word/tag graph into a graph-based chart (as in CKY or chart parsing) • Allows each module to rescore output of previous module’s decision • Apply this architecture to speech act detection, dialogue act selection, and in generation
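
A skeleton of that module chain, with a flat list of hypotheses standing in for the graphs and purely illustrative scoring tables; the point is only that each stage adds its own log score rather than making a hard decision:

```python
def recognize(audio):
    # Stand-in recognizer: (word string, acoustic log prob) hypotheses.
    return [("prices rose sharply", -4.0), ("prices rows sharply", -3.8)]

def tag(hyps):
    # Rescore each hypothesis with a (hypothetical) tag-model log prob.
    tag_lp = {"prices rose sharply": -1.0, "prices rows sharply": -6.0}
    return [(words, score + tag_lp[words]) for words, score in hyps]

def parse(hyps):
    # Rescore again with a (hypothetical) parse-model log prob.
    parse_lp = {"prices rose sharply": -0.5, "prices rows sharply": -4.0}
    return [(words, score + parse_lp[words]) for words, score in hyps]

best = max(parse(tag(recognize(None))), key=lambda h: h[1])
print(best)  # ('prices rose sharply', -5.5): later stages override the
             # recognizer's first choice without ever discarding it early.
```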

  33. Prices rose sharply after hours: 15-best as a word/tag graph + minimization • [Slide figure: the 15-best tag sequences packed into a minimized word/tag graph; arc labels include prices:NN/NNS, rose:VBD/VBP/NN/NNP, sharply:RB, after:IN/RB, hours:NNS]

  34. Challenge: Beat n-grams • Backed-off trigram models estimated from 300M words of WSJ provide the best language models • We know there is more to language than two words of history • Challenge is to find out how to model it.
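
For concreteness, a crude backed-off trigram estimate; this uses a simple "stupid backoff"-style weight rather than the properly normalized Katz scheme, and the counts are toy data:

```python
from collections import Counter

def backoff_trigram(c3, c2, c1, alpha=0.4):
    # c3/c2/c1: Counters of trigram, bigram, and unigram counts.
    total = sum(c1.values())
    def p(w3, w2, w1):
        # Estimate P(w3 | w1, w2), backing off trigram -> bigram -> unigram.
        if c3[(w1, w2, w3)] > 0:
            return c3[(w1, w2, w3)] / c2[(w1, w2)]
        if c2[(w2, w3)] > 0:
            return alpha * c2[(w2, w3)] / c1[w2]
        return alpha * alpha * c1[w3] / total
    return p

c3 = Counter({("prices", "rose", "sharply"): 2})
c2 = Counter({("prices", "rose"): 3, ("rose", "sharply"): 2})
c1 = Counter({"prices": 5, "rose": 4, "sharply": 2})

p = backoff_trigram(c3, c2, c1)
print(p("sharply", "rose", "prices"))  # seen trigram: 2/3
print(p("sharply", "rose", "stocks"))  # backs off to the bigram: 0.4 * 2/4
```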

  35. Conclusions • Need ranking of hypotheses for applications • Beam can reduce processing time to linear • need good statistics to do this • More linguistic features are better for stat models • can induce the relevant ones and weights from data • linguistic rules emerge from these generalizations • Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty • ideal is totally online (model is compatible with this) • approximation allows simpler modules to do first pruning

  36. Plugs Run, don’t walk, to read: • Steve Abney. 1996. Statistical methods and linguistics. In J. L. Klavans and P. Resnik, eds., The Balancing Act. MIT Press. • Mark Seidenberg and Maryellen MacDonald. 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science. • Dan Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall. • Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
