LING / C SC 439/539 Statistical Natural Language Processing


Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 29 • 5/1/2013

  2. Recommended reading • Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. • Fernando Pereira. 2000. Formal Grammar and Information Theory: Together Again? Philosophical Transactions of the Royal Society,  358, 1239-1253. • Lillian Lee. "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. Computer Science: Reflections on the Field, Reflections from the Field, pp. 111-118, 2004.

  3. Recommended reading • Chater, N., & Manning, C. (2006). Probabilistic models of language processing and acquisition. Trends in Cognitive Sciences, 10, 287-291. • Lappin, S. and S. Shieber (2007), Machine Learning Theory and Practice as a Source of Insight into Universal Grammar, Journal of Linguistics 43(2), pp. 393-427. • Rens Bod. 2009. Constructions at Work or at Rest? Cognitive Linguistics 20(1).

  4. Outline • Free Food • Grades • Thoughts on Linguistics and NLP • Future study in NLP • Evaluations

  5. Free food • Cheesecake and tiramisu, “homemade style” • Eegee’s slush drinks • Made with real fruit • No fat • No caffeine • No high fructose corn syrup • 100% Vitamin C • Please be greedy. • Unrelated to course evaluations

  6. Outline • Free Food • Grades • Thoughts on Linguistics and NLP • Future study in NLP • Evaluations

  7. Programming assignments • Turn in PA #5 and all late assignments by Thursday 5/9 at 11:59 p.m. • If everything below applies, you should talk to me: • You have made progress on other assignments but are running short on time, and are capable of finishing them • You are willing to put in a great deal of effort in order to avoid receiving a low grade • The reason you have not done well so far is mostly a lack of time, rather than conceptual or technical difficulties

  8. Current approximate grade distribution • I have handed out your current grade as of Sunday 4/28 (does not include WA #6) • Grade distribution for class: • (shown in class)

  9. Outline • Free Food • Grades • Thoughts on Linguistics and NLP • Future study in NLP • Evaluations

  10. Goals • Produce systems that understand and use human language • Practical applications: • Translation • Question answering • Sentiment analysis • And other kinds of text understanding • (Speech recognition)

  11. Tasks • Break down complex problems into linguistic tasks: • N-grams • Smoothing • POS tagging • Parsing • Word sense disambiguation • Named entity recognition • Coreference resolution • etc.

  12. Key research questions in approaches to language • Generative linguistics • What sort of representations are needed in Universal Grammar in order to provide an elegant account of the similarities and differences in languages of the world, and human intuitions about linguistic structure? • Non-statistical computational linguistics • What kind of mathematical models are capable of generating the structures we see in human languages, and also can be efficiently parsed? • Chomsky hierarchy: regular, context-free, and beyond • Statistical NLP • What kind of weighted/probabilistic grammars/features can be extracted from a corpus with statistical reliability, such that the result is a reasonably good approximation of the linguistic data and can be tested against a corpus?

  13. Generative linguistics • Computationally, accounts for: • Representation of language • Goals • Describe similarities and differences in human languages • Explain human intuitions • Explain human learning • Connections to processing and learning • Much research on taking existing aspects of linguistic theory as is, and accommodating them into a framework for learning and processing • Example: parameter-setting algorithms • But there is very little influence in the other direction, from computational linguistics or psycholinguistics back to theoretical linguistics

  14. Pre-statistical computational linguistics • Computationally, accounts for: • Representation of language • Processing of language • Symbolic grammars • From formal language theory: regular expressions, CFGs, etc., and their variants • Must be able to be recognized efficiently (parseable in polynomial time) • Hand-built systems for classification • Complicated series of rules to account for different cases

  15. Grammars in statistical NLP • Computationally, accounts for: • Representation of language • Processing of language • Learning from data • Representation of language (i.e., the grammar) should be: • Powerful/strong/complicated enough to represent many constructions of a language • Simple enough to be extracted from a corpus, which is sparse • Have a statistical learning algorithm associated with it

  16. General-purpose machine learning algorithms • Separate data points according to their features • Apply to any type of data, including language (see the sketch below) • Supervised classification • Perceptron, Decision Trees, Naïve Bayes, Neural Networks, Maximum Entropy, Support Vector Machines, Memory-Based Learning, etc. • Unsupervised learning • k-means clustering, hierarchical clustering, minimum description length
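
As a rough illustration of this point (not from the slides): a minimal sketch of a general-purpose supervised classifier applied to text through bag-of-words features. It assumes scikit-learn is installed; the tiny training set is invented purely for illustration.

    # Minimal sketch: a general-purpose learner (Naive Bayes) applied to text
    # via bag-of-words count features. Assumes scikit-learn; toy data is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["great plot and acting", "terrible pacing, dull script",
                   "a wonderful film", "boring and predictable"]
    train_labels = ["pos", "neg", "pos", "neg"]

    vectorizer = CountVectorizer()             # text -> sparse word-count features
    X_train = vectorizer.fit_transform(train_texts)

    clf = MultinomialNB()                      # any off-the-shelf classifier could go here
    clf.fit(X_train, train_labels)

    X_test = vectorizer.transform(["a dull but wonderful film"])
    print(clf.predict(X_test))

The same pipeline would work with a perceptron, SVM, or decision tree in place of Naive Bayes; the algorithm is general-purpose, and only the choice of features is language-specific.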

  17. Structured learning • Assume that a set of data points are related to each other in a particular structure • Structured supervised classification • Assume the data has a particular structure: HMMs for sequences, PCFGs (parsed with CKY) for trees; a small decoding sketch follows • Structured unsupervised learning • Sequence segmentation • Different representations of morphology • Topic and vector space models of semantics • CFG for syntax
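
To make the sequence case concrete, here is a minimal Viterbi decoder over a hand-specified toy HMM. The tag set, words, and probabilities are all invented; a real tagger would estimate them from a tagged corpus.

    # Sketch: Viterbi decoding for a toy HMM POS tagger (all numbers invented).
    import math

    tags = ["DT", "NN", "VB"]
    start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
    trans = {"DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
             "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
             "VB": {"DT": 0.50, "NN": 0.40, "VB": 0.10}}
    emit  = {"DT": {"the": 0.90, "dog": 0.05, "barks": 0.05},
             "NN": {"the": 0.05, "dog": 0.85, "barks": 0.10},
             "VB": {"the": 0.05, "dog": 0.05, "barks": 0.90}}

    def viterbi(words):
        # chart[i][t] = (best log-prob of a tag sequence ending in t at position i, backpointer)
        chart = [{t: (math.log(start[t] * emit[t][words[0]]), None) for t in tags}]
        for w in words[1:]:
            prev = chart[-1]
            chart.append({t: max((prev[s][0] + math.log(trans[s][t] * emit[t][w]), s)
                                 for s in tags) for t in tags})
        best = max(tags, key=lambda t: chart[-1][t][0])
        path = [best]
        for col in reversed(chart[1:]):        # follow backpointers to recover the sequence
            best = col[best][1]
            path.append(best)
        return list(reversed(path))

    print(viterbi(["the", "dog", "barks"]))    # -> ['DT', 'NN', 'VB']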

  18. Use of corpora in statistical NLP • Practical benefits • Train machine learning algorithms • Quantitative evaluation • Linguistic benefits • Higher coverage of range of linguistic phenomena than in hand-built grammars • Examples of how ambiguous forms are used in context • Examples of language as people actually use it: includes very complicated sentences, and ill-formed sentences

  19. Limitations of Statistical NLP • Sparse data problem and Zipf’s Law (see the sketch below) • Overtraining / model selection • Domain specificity: systems perform best on the style of language they were developed on • New languages • Cost/time of annotation • Could use semi-supervised and unsupervised learning
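
A quick way to see the sparse-data problem on any corpus is to look at its rank-frequency curve; the sketch below does this for an arbitrary plain-text file ("corpus.txt" is just a placeholder name).

    # Sketch: rank-frequency counts for a corpus; under Zipf's law, rank * freq
    # is roughly constant, and a large share of word types occur only once.
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        counts = Counter(f.read().lower().split())

    for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
        print(rank, word, freq, rank * freq)

    singletons = sum(1 for c in counts.values() if c == 1)
    print("types seen exactly once:", singletons, "out of", len(counts))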

  20. Important to use real data • Some researchers use artificial data: • Saffran’s study on infant word segmentation: only 6 “words”, all 3 syllables long • Grünwald: “it should be noted here that in experiments with toy grammars much better grammar rules were formed.” • Goldsmith: assumes that most words appear in most of their morphological forms in a corpus • Conclusions formed on the basis of successful experiments on fake data do not necessarily extend to the real world • We want to discover the precise set of representations and learning algorithms that can successfully learn from data that is sparse and Zipf-distributed. • For representations and algorithms that do not work under such circumstances, we cannot infer that they are adequate models of the representations/processes that the human mind is using.

  21. NLP and Machine Learning • To some extent, NLP is machine learning • Apply general-purpose machine learning methodology for any SNLP problem: • Get a corpus. • Choose some features. • Apply machine learning algorithm, which will pick up on the linguistically relevant features through their statistics of occurrence in the corpus.

  22. Sparse data in machine learning, and relation to linguistics • Deal with sparse data • Strategies: smoothing, backoff, factoring according to independence assumptions • Example: Markov model • Involves the Markov (limited-context) assumption, the chain rule, and smoothing • Do this because it is infeasible to estimate p(sequence) directly • Goal: find a representation of your data that can be learned efficiently and with statistical reliability, and that adequately models the linguistic problem (see the sketch below) • Linguistic interpretations: • Find a model of the linguistic data, but one constrained by data statistics and computationally efficient learning • Ultimate claim: a top-performance model would reveal the actual structure of the data
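
A minimal sketch of the Markov-model example: a bigram model with add-one (Laplace) smoothing, combining the chain rule with the limited-context assumption. The three-sentence "corpus" is invented; a real model would be estimated from a large corpus and would use better smoothing.

    # Sketch: bigram language model with add-one smoothing (toy corpus invented).
    from collections import Counter
    import math

    corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["a", "dog", "sleeps"]]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])                  # counts of conditioning contexts
        bigrams.update(zip(toks[:-1], toks[1:]))    # counts of adjacent pairs

    V = len(set(unigrams) | {"</s>"})               # vocabulary size for smoothing

    def p(w, prev):
        # add-one smoothed estimate of P(w | prev)
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    def logprob(sent):
        toks = ["<s>"] + sent + ["</s>"]
        # chain rule under the bigram (limited-context) assumption
        return sum(math.log(p(w, prev)) for prev, w in zip(toks[:-1], toks[1:]))

    print(logprob(["the", "dog", "sleeps"]))
    print(logprob(["the", "sleeps", "dog"]))        # worse word order, lower score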

  23. Feature selection • Get high performance in machine learning through a combination of: • Good choice of features • Powerful machine learning algorithm (able to find generalizations in data points) • But don’t discount the importance of good features • If the features aren’t sufficient to separate data points in the first place, then even the fanciest, newest ML algorithm won’t be useful • http://v1v3kn.tumblr.com/post/47193952400/the-trouble-with-svms

  24. Choice of features • Feature templates • “next word”, “previous POS”, “whether word is capitalized”, etc. • Each template generates specific instantiations of features: “next word is to”, “previous POS is NN” (see the sketch below) • Machine learning will automatically decide on the relevant features for the problem, through their frequency of occurrence in the data set • Benefit: you don’t have to think that hard
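
A small sketch of how feature templates expand into instantiated features for a single token; the template names and the example sentence are illustrative, not taken from any particular tagger.

    # Sketch: instantiating feature templates for the token at position i.
    def token_features(words, pos_tags, i):
        return {
            "word=" + words[i]: 1,
            "next_word=" + (words[i + 1] if i + 1 < len(words) else "</s>"): 1,
            "prev_pos=" + (pos_tags[i - 1] if i > 0 else "<s>"): 1,
            "is_capitalized": int(words[i][0].isupper()),
        }

    words = ["She", "flew", "to", "Tucson"]
    pos_tags = ["PRP", "VBD", "TO", "NNP"]
    print(token_features(words, pos_tags, 3))
    # {'word=Tucson': 1, 'next_word=</s>': 1, 'prev_pos=TO': 1, 'is_capitalized': 1}

Each template (“next word”, “previous POS”, ...) yields one instantiated feature per token; the learner then weights whichever instantiations turn out to be frequent and informative.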

  25. Is your performance with a simple set of features good? • Is the cup half-full or half-empty? • Half-full: “Look how well we can do with a simple set of features!” • Half-empty: “The problem is extremely complex and a huge amount of effort will be needed in order to truly solve the problem.” • Because of Zipf’s law, there will be a huge number of cases to be accounted for, each requiring a particular set of features • Because of Zipf’s law, it is hard to distinguish between noise and relevant features that are rare

  26. Good feature choice requires careful linguistic analysis • Error analysis • Look at errors made by the program • Find patterns, implement new features • Example: implement a feature that identifies how many other movies by the same director were mentioned in a movie review (sketched below) • Good feature engineering looks rule-based! • Hand-crafted semantic classes • Detailed features might apply in only a small number of cases in the data
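
As a rough sketch of the error-analysis example above (the lookup table, review text, and function name are all invented for illustration):

    # Sketch: count how many *other* films by the same director a review mentions.
    films_by_director = {
        "Hitchcock": ["Psycho", "Vertigo", "Rear Window", "The Birds"],
    }

    def other_films_mentioned(review_text, director, reviewed_film):
        others = [f for f in films_by_director.get(director, []) if f != reviewed_film]
        return sum(1 for film in others if film.lower() in review_text.lower())

    review = "Vertigo is slower than Psycho or The Birds, but just as unsettling."
    print(other_films_mentioned(review, "Hitchcock", "Vertigo"))   # -> 2

Note how rule-like this is: the feature depends on a hand-curated resource and applies to only a small slice of the data.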

  27. Statistical NLP and human learning • Cognitive modeling: • Want to understand how children learn language • Want to understand how adults process language • For modeling child language acquisition, additional constraints on learning system: • Model process of human learning over time • Constrained by quantity of data (young children only have so much linguistic experience) • Constrained by type of data (spoken language, perhaps child-directed)

  28. Different approaches to computational cognitive modeling • Representational: • Computations are performed using structures / rules / constraints / principles / parameters like those specifically proposed in linguistic theories. • Emergent: • Anti-representational: opposed to the hypothesis of computational mechanisms specifically designed for language • Alternative: • Lots of data is input to the brain. • Apply a general-purpose learning algorithm, such as a Neural Network. System discovers statistical associations and “works”. • Therefore, complicated linguistic representations are not necessary. Because the desired behavior is approximately computed, any apparent linguistic structure is just an emergent property of the system.

  29. Personal preference for representational approach • My grad school advisor said to me: “I like models that go ka-chunk.” • Results of emergentist models have not been satisfactory to linguists. • If a model doesn’t work well, it is easy to inspect the inner structure of the system to find out where errors are being made, and to propose alternative representations or processing mechanisms associated with those representations, in order to get better performance. • Interested in modeling how language is computed and what is literally computed, independent of the hardware on which it is computed • In the research literature, modifications of emergentist models often end up with additional assumptions about linguistic representation.

  30. Different empiricist approaches • Generative linguistics / rationalist perspective: • Humans are born with knowledge of language, in the form of principles / constraints / rules for the construction and processing of linguistic forms • Empiricist perspective: • Knowledge comes from observation of the world • Emergentist: • Knowledge comes from learning and representations do not need to be specifically encoded • “Structured empiricist”: • You must have some basic model of linguistic structure and representation (finite-state, context-free, etc.) • Would claim that much of our knowledge of linguistic “principles / rules / constraints” can be gained by counting frequencies of observed forms

  31. Probabilistic grammars in Statistical NLP • Papers: • Abney 1996 • Pereira 2000 • Lee 2004 (2001) • Chater & Manning 2006 • Lappin & Shieber 2007 • “Structured empiricist” perspective • Claims that much of our knowledge of linguistic “principles / rules / constraints” can be gained by counting frequencies of observed forms. • Therefore do not need to assume innate linguistic knowledge • When I read these papers I get a strong sense of “cup half-full”.

  32. Empiricist, non-computational approaches to linguistics • Example: construction grammar • http://en.wikipedia.org/wiki/Construction_grammar • The term construction grammar (CxG) covers a family of theories, or models, of grammar that are based on the idea that the primary unit of grammar is the grammatical construction rather than the atomic syntactic unit and the rule that combines atomic units, and that the grammar of a language is made up of taxonomies of families of constructions.

  33. Generative syntax vs. construction grammar • Example: 2 sentence types: • Barack killed Osama. X V-ed Y • Osama was killed by Barack. Y was V-ed by X • Generative approach: apply movement • Begin with X V-ed Y • Move Y to beginning • Insert “was” and “by” • Construction grammar: • Two kinds of constructions • Fill in slots V, X and Y with lexical items
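
A minimal sketch of the construction-grammar side of this contrast: the two sentence types are stored as templates whose X, V, and Y slots are filled by the same lexical items (the template strings are illustrative only).

    # Sketch: two stored constructions with slots, filled by the same lexical items.
    active  = "{X} {V}ed {Y}."
    passive = "{Y} was {V}ed by {X}."

    slots = {"X": "Barack", "V": "kill", "Y": "Osama"}
    print(active.format(**slots))     # Barack killed Osama.
    print(passive.format(**slots))    # Osama was killed by Barack.

On the generative analysis, by contrast, the passive is derived by movement plus insertion of "was" and "by", rather than stored as a separate template.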

  34. Construction grammar and language acquisition • Adele Goldberg, Michael Tomasello, etc. • In essence: • A child hears lots of sentences • A child finds generalizations between sentences, to discover “slots” that may be filled in • No principled distinction between different linguistic units: phonemes, morphemes, words, phrases, etc.

  35. Formal theories of language • Human language is a very complex phenomenon • Understand better through formal theories • “Theory” = simplified model that focuses on particular aspects of the problem • “Formal” = define so precisely that you can implement it as a computer program

  36. Need an algorithmic formulation • Bod’s (2009) criticism of Construction Grammar


  39. Summary • Linguistics is important in NLP. • Very hard to find appropriate representations that can be learned efficiently from sparse data, that get good performance. • May ultimately influence linguistic theory! • If it appears that you are doing well, look further: • Is apparently good performance just a result of getting the common cases correct? • What information is necessary to get the other cases correct?

  40. Outline • Free Food • Grades • Thoughts on Linguistics and NLP • Future study in NLP • Evaluations

  41. You did not understand everything • Statistical NLP combines methods from many different fields in the study of language • Linguistics • Computer science • Math and statistics • Engineering: operations research, information theory, etc. • Cognitive science • Study of other fields is strongly recommended, and necessary if you want to do a Ph.D. in topics of modern statistical NLP • Technical classes you should take: Automata theory, Probability and statistics, Scientific computing, Convex optimization, Artificial intelligence, Machine learning

  42. Other NLP courses • LING / PSYC / C SC 438/538, Introduction to Computational Linguistics • Fall 2013 (Fong) • http://dingo.sbs.arizona.edu/~sandiway/ling538-12/index.html • LING 581, Advanced Computational Linguistics • Spring 2014 (Fong) • http://dingo.sbs.arizona.edu/~sandiway/ling581-13/index.html • LING 439/539 • Fall 2013 (Hammond) • Will be taught differently from my course • LING 478/578 Speech Technology • ISTA 455/555, Applied Natural Language Processing • Fall 2013 (Surdeanu) • YouTube videos of course by Chris Manning and Dan Jurafsky

  43. Also take Linguistics classes that are non-computational • Appreciate the subtleties of language • Gain knowledge about a range of possible theories that you could implement on a computer • Get a computational linguistics job that involves working with other languages

  44. Linguists: learn more computer science • Take classes not directly related to NLP • Gain intuitions about how computers work, and about the limits on what computers can do (and, hence, on theories of language) • Algorithms, Compilers, Hardware, etc. • Learn more programming languages • Java or C++ • MATLAB or R • Python or Perl • Learn to use Unix/Linux operating systems

  45. Be up to date • Join the Corpora List • http://linguistlist.org/lists/join-list.cfm?List=21 • Postings about conferences, jobs, questions about corpora, general discussions about NLP • Read papers in NLP • http://aclweb.org/anthology-new/

  46. Learn to use NLP software • http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits • Also shows the languages that the software was written in • Stanford NLP tools: http://nlp.stanford.edu/software/ • LingPipe: http://alias-i.com/lingpipe/ • OpenNLP: http://incubator.apache.org/opennlp/ • Many others

  47. Get a job • Computational linguistics jobs • http://linguistlist.org/jobs/browse-jobs.cfm • http://languagelog.ldc.upenn.edu/nll/?p=1067 • Software engineering jobs for people who know something about NLP

  48. Outline • Free Food • Grades • Thoughts on Linguistics and NLP • Future study in NLP • Evaluations
