Richard Sproat

URL: http://www.cslu.ogi.edu/~sproatr/Courses/CompLing/

CS506/606: Computational Linguistics, Fall 2009, Unit 1
This Unit
  • Overview of the course
  • What is computational linguistics?
  • First linguistic problem: grammatical part-of-speech tagging
    • The problem
    • The source-channel model
    • Language modeling
    • Estimation
    • Finite-state methods
    • First homework: a WFST-based implementation of a source-channel tagger.
Format of the course
  • Lectures
  • Homeworks
    • 2-3 homeworks, which will be 70% of the grade
    • The homeworks will be work.
    • It is assumed you know how to program
  • Individual projects (30% of the grade)
    • You must discuss your project with me by the end of the third week
    • The final week will consist of (short) project presentations
Final projects
  • The project can be on any topic related to the course, e.g.:
    • Implement a parsing algorithm
    • Design a morphological analyzer for a non-trivial amount of morphology for a language
    • Build a sense-disambiguation system
    • Design a word-segmentation method for some written language that doesn't delimit words with spaces
    • Do a serious literature review of some area of the field
Readings for the course
  • Textbooks:
    • Brian Roark, Richard Sproat. Computational Approaches to Syntax and Morphology. Oxford University Press, 2007.
    • Daniel Jurafsky, James H. Martin. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second Edition. Prentice Hall, 2008.
  • A few readings from online sources
Prerequisites for this course
  • Not many really:
    • You must know how to program:
      • Any programming language is fine
      • Should also know simple shell scripting
    • You will need access to a computer (duh).
    • You will need linux or a linux-like environment:
      • For Windows users I recommend Cygwin (www.cygwin.com)
What you should expect to get out of this course
  • An understanding of a range of linguistic issues:
    • The course is organized around “units”, each of which deals with one linguistic problem and one or more computational solutions
    • Some problems are “practical”, some of more theoretical interest
  • Some sense of variation across languages and the kinds of things to expect when one deals with computational problems in various languages
  • A feel for some of the kinds of computational solutions people have proposed
[Diagram: fields placed along a spectrum from “more rigorous” (physics, chemistry, biology) through neuropsychology and psychology to “less rigorous”/“more flakey” (literary criticism), with the question of where linguistics and computational linguistics fall.]

What defines the rigor of a field?
  • Whether results are reproducible
  • Whether theories are testable/falsifiable
  • Whether there are a common set of methods for similar problems
  • Whether approaches to problems can yield interesting new questions/answers
[Diagram: the same spectrum from “more rigorous” to “less rigorous”, now placing engineering, linguistics, sociology, and literary criticism.]

The true situation with linguistics

[Diagram: subfields of linguistics spread along the rigor spectrum: experimental phonetics, historical linguistics, psycholinguistics, some areas of sociolinguistics (e.g. Bill Labov), and “theoretical” linguistics (e.g. lexical-functional grammar) toward the more rigorous end; other areas of sociolinguistics (e.g. Deborah Tannen) and “theoretical” linguistics (e.g. minimalist syntax) toward the less rigorous end.]

Okay, enough already: what is computational linguistics?
  • Text normalization/segmentation
  • Morphological analysis
  • Automatic word pronunciation prediction
  • Transliteration
  • Word-class prediction: e.g. part of speech tagging
  • Parsing
  • Semantic role labeling
  • Machine translation
  • Dialog systems
  • Topic detection
  • Summarization
  • Text retrieval
  • Bioinformatics
  • Language modeling for automatic speech recognition
  • Computer-aided language learning (CALL)
Computational linguistics
  • Often thought of as natural language engineering
  • But there is also a serious scientific component to it.
Goals of Computational Linguistics / Natural Language Processing
  • To get computers to deal with language the way humans do:
    • They should be able to understand language and respond appropriately in language
    • They should be able to learn human language the way children do
    • They should be able to perform linguistic tasks that skilled humans can do, such as translation
  • Yeah, right
Some interesting themes…
  • Finite-state methods:
    • Many application areas
    • Raises many interesting questions about how “regular” language is
  • Grammar induction:
    • Linguists have done a poor job at their stated goal of explaining how humans learn grammar
  • Computational models of language change:
    • Historical evidence for language change is only partial. There are many changes in language for which we have no direct evidence.
Why CL may seem ad hoc
  • Wide variety of areas (as in linguistics)
  • If it’s natural language engineering, the goal is often just to build something that works
  • Techniques tend to change in somewhat faddish ways…
    • For example: machine learning approaches fall in and out of favor
Machine learning in CL
  • In general it’s a plus since it has meant that evaluation has become more rigorous
  • But it’s important that the field not turn into applied machine learning
  • For this to be avoided, people need to continue to focus on what linguistic features are important
  • Fortunately, this seems to be happening
A well-worn example

[Image: astronauts Poole (Gary Lockwood) and Bowman (Keir Dullea) trying to elude the HAL 9000 computer.]

The HAL 9000
  • Perfect speech recognition
  • Perfect language understanding
  • Perfect synthesis:
    • Here’s the current reality:
  • Perfect modeling of discourse
  • (Vision)
  • (World knowledge)
  • And “experts” in the 1960s thought this would all be possible
Another example

[Image: the Gorn uses the Universal Translator in the Star Trek episode “Metamorphosis”.]
Are these even reasonable goals?
  • These are nice goals but they have more to do with science fiction than with science fact
  • Realistically we don’t have to go this far to have stuff that is useful:
    • Spelling correctors, grammar checkers, MT systems, tools for linguistic analysis, …
    • Limited speech interaction systems:
      • Early systems like AT&T’s VRCP (Voice Recognition Call Processing):
        • Please say collect, third party or calling card
      • More recent examples: Goog411, United Airlines flight info
Named Entity Recognition
  • Build a system that can find the names in a text:

Israeli Leader Suffers Serious Stroke

By STEVEN ERLANGER

JERUSALEM, Thursday, Jan. 5 - Israeli Prime Minister Ariel Sharon suffered a serious stroke Wednesday night after being taken to the hospital from his ranch in the Negev desert, and he underwent brain surgery early today to stop cerebral bleeding, a hospital official said.

Mr. Sharon's powers as prime minister were transferred to Vice Premier Ehud Olmert, said the cabinet secretary, Yisrael Maimon.

Name Transliteration
  • Handle cross-language transliteration
Abbreviation Expansion
  • Recover the underlying words in cases such as:
Interpret text into scenes

the very huge fried egg is on the table-vp23846. the very large american party hat is six inches above the egg. the chinstrap of the hat is invisible. the table is on the white tile floor. the french door is behind the table. the tall white wall is behind the french door. a white wooden chair is to the right of the table. it is facing left. it is sunrise. the impi-61 photograph is on the wall. it is three inches left of the door. it is three feet above the ground. the photograph is eighteen inches wide. a white table-vp23846 is one foot to the right of the chair. the big white teapot is on the table.

Interpret Text into Scenes

the glass bowling ball is behind the bowling pin. the

ground is silver. a goldfish is inside the bowling ball.

Interpret Text into Scenes

the humongous blue transparent ice cube is on the silver mountain range. the humongous green transparent ice cube is next to the blue ice cube. the humongous red transparent ice cube is on top of the green ice cube. the humongous yellow transparent ice cube is to the left of the green ice cube. the tiny santa claus is inside the red ice cube. the tiny christmas tree is inside the blue ice cube. the four tiny reindeer are inside the green ice cube. the tiny blue sleigh is inside the yellow ice cube. the small snowman-vp21048 is three feet in front of the green ice cube. the sky is pink.

Interpret Text into Scenes

the donut shop is on the dirty ground. the donut of the donut shop is silver. a green a tarmac road is to the right of the donut shop. the road is 1000 feet long and 50 feet wide. a yellow volkswagen bus is eight feet to the right of the donut shop. it is on the road. a restaurant waiter is in front of the donut shop. a red volkswagen beetle is eight feet in front of the volkswagen bus. the taxi is ten feet behind the volkswagen bus. the convertible is to the left of the donut shop. it is facing right. the shoulder of the road has a dirt texture. the grass of the road has a dirt texture.

Interpret Text into Scenes

The shiny blue goldfish is on the watery ground. The shiny red colorful-vp3982 is six inches away from the shiny blue goldfish. The polka dot colorful-vp3982 is to the right of the shiny blue goldfish. The polka dot colorful-vp3982 is five inches away from the shiny blue goldfish. The transparent orange colorful-vp3982 is above the shiny blue goldfish.The striped colorful-vp3982 is one foot away from the transparent orange colorful-vp3982. The huge silver wall is facing the shiny blue goldfish. The shiny blue goldfish is facing the silver wall. The silver wall is five feet away from the shiny blue goldfish.

How does the NLP in WordsEye work?
  • Statistical part-of-speech tagger
  • Simple morphological analyzer
  • Statistical parser
  • Reference resolution model based on world model
  • Semantic hierarchy (similar to WordNet)
Part-of-speech tagging
  • Part of speech (POS) tagging is simply the problem of placing words into equivalence classes.
  • The notion of part-of-speech tags can be attributed to Dionysius Thrax, the 1st-century BC Greek grammarian who classified Greek words into eight classes:
    • noun, verb, pronoun, preposition, adverb, conjunction, participle and article.
  • Tagging is arguably easiest in languages with rich (inflectional) morphology (e.g. Spanish), for two reasons:
    • It’s more obvious what the basic set of tags should be, since words fall into clearly delineated inflectional classes.
    • The morphology gives important cues to what the part of speech is:
      • cantaremos is highly likely to be a verb given the ending -ar-emos.
  • It’s arguably hardest in languages with minimal (inflectional) morphology:
    • there are fewer cues in English than there are in Spanish
    • for some languages like Chinese, cues are almost completely absent
    • linguists can’t even agree on whether (e.g.) Chinese distinguishes verbs from adjectives.
Part-of-speech tags
  • Linguists typically distinguish a relatively small set of basic categories (like Dionysius Thrax)—sometimes just 4 in the case of Chomsky’s [±N,±V] proposal.
    • But usually these analyses assume an additional set of morphosyntactic features.
  • Computational models of tagging usually involve a larger set, which in many cases can be thought of as the linguists’ small set, plus the features squished into one term:
    • eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN
  • Tagset size has a clear effect on performance of taggers:
    • “the Penn Treebank project collapsed many tags compared to the original Brown tagset, and got better results.” (http://www.ilc.cnr.it/EAGLES96/morphsyn/node18.html)
  • But choosing the right size tagset depends upon the intended application.
    • As far as I know, there is no demonstration of what is the “optimal” tagset.
The Penn Treebank tagset
  • 46 tags, collapsed from the Brown Corpus tagset
  • Some details:
    • to/TO not disambiguated
    • verbs and auxiliaries (have, be) not distinguished (though these were in the Brown tagset).
  • Some links:
    • http://www.computing.dcu.ie/~acahill/tagset.html
    • http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html
    • http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html
  • Link for the original Brown corpus tags:
    • http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html
  • Motivations for the Penn tagset modifications
    • “the Penn Treebank tagset is based on that of the Brown Corpus. However the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown tagset by paring it down considerably” (Marcus, Santorini and Marcinkiewicz, 1993).
    • eliminated distinctions that were lexically recoverable: thus no separate tags for be, do, have
    • as well as distinctions that were syntactically recoverable (e.g. the distinction between subject and object pronouns)
Problematic cases
  • Even with a well-designed tagset, there are cases that even experts find difficult to agree on.
    • adjective or participle?
      • a seen event, a rarely seen event, an unseen event
      • a child seat, *a very child seat, *this seat is child
      • but: that’s a very MIT paper, she’s sooooooo California
  • Some cases are difficult to get right in the absence of further knowledge: preposition or particle?
    • he threw out the garbage
    • he threw the garbage out
    • he threw the garbage out the door
    • *he threw the garbage the door out
Typical examples used to motivate tagging
  • Can they can cans?
  • May may leave
  • He does not shoot does
  • You might use all your might
  • I am arriving at 3 am
The source-channel model
  • The basic idea: a lot of problems in computational linguistics can be construed as the problem of reconstructing an underlying “truth” given possibly noisy observations.
  • This is very much like the problem that Claude Shannon (the “father of Information Theory”) set out to solve for communication over a phone line.
    • Input I is clean speech
    • The channel (the phone line) corrupts I and produces O — what you hear at the other end
    • Can we reconstruct I from O?
  • Answer: you can if you have an estimate of the probability of the possible I’s and an estimate of the probability of generating O given I:

    Î = argmax over I of P(I|O) = argmax over I of P(I) P(O|I)

  • The first term P(I) is the language model and the second term P(O|I) is the channel model.
The source-channel model
  • For the tagging problem:
    • Want to maximize P(T|W)
    • From Bayes’ rule we know that:

      P(T|W) = P(W|T) P(T) / P(W)

    • Since P(W) is a constant for any given sentence, maximizing P(T|W) amounts to maximizing P(W|T) P(T).
Class-based language models
  • Suppose your corpus does not have every Monday but it does have every DAY-OF-WEEK for all the other days of the week.
  • A class-based language model can model this situation (see the sketch below):
  • P(wi | Ci) P(Ci | C0, C1 … Ci-1)
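To make the every Monday example concrete, here is a minimal sketch of a class-based bigram model in Python. The toy corpus, the class inventory, and the uniform within-class word distribution are all illustrative assumptions, not anything from the slides:

```python
from collections import defaultdict

# Toy corpus: "every <day>" is attested for several days, but never Monday.
corpus = [("every", "Tuesday"), ("every", "Wednesday"), ("every", "Friday")]
# Hypothetical class assignments.
word_class = {"every": "EVERY", "Monday": "DAY-OF-WEEK",
              "Tuesday": "DAY-OF-WEEK", "Wednesday": "DAY-OF-WEEK",
              "Friday": "DAY-OF-WEEK"}

members = defaultdict(set)
for w, c in word_class.items():
    members[c].add(w)

class_bigram = defaultdict(int)   # c(C1, C2)
class_unigram = defaultdict(int)  # c(C1) as a bigram history
for w1, w2 in corpus:
    c1, c2 = word_class[w1], word_class[w2]
    class_bigram[(c1, c2)] += 1
    class_unigram[c1] += 1

def p(w2, w1):
    """P(w2|w1) ~= P(C2|C1) * P(w2|C2), with a uniform emission P(w2|C2)."""
    c1, c2 = word_class[w1], word_class[w2]
    return (class_bigram[(c1, c2)] / class_unigram[c1]) / len(members[c2])

print(p("Monday", "every"))  # 0.25: non-zero although "every Monday" is unseen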
What are classes?
  • A word can be in its own class
  • Part-of-speech
  • Semantic classes (DAY-OF-WEEK)
Hidden Markov Models (HMMs)

[Diagram: a toy HMM with states <s>, N, V, </s>; arcs carry transition costs, and states emit words with emission costs, as follows.]

Transition probabilities:
  P(N|<s>) = 0.5    P(V|<s>) = 0.5
  P(N|N) = 0.1      P(V|N) = 0.8    P(</s>|N) = 0.1
  P(N|V) = 0.7      P(V|V) = 0.1    P(</s>|V) = 0.2

Emission probabilities:
  P(dog|N) = 0.9    P(eats|N) = 0.1
  P(dog|V) = 0.1    P(eats|V) = 0.9

For the tag sequence <s> N V N </s> generating dog eats dog:

  1.0 * 0.5 * 0.9 * 0.8 * 0.9 * 0.7 * 0.9 * 0.1 = 0.02

Note: set probabilities of starting in any state other than <s> to 0
Why “hidden”?
  • But if we see dog eats dog we don’t actually know what underlying tag sequence it came from:
    • The true sequence is hidden
  • Another possibility would be:
    • <s> V V V </s>:
    • 1.0 * 0.5 * 0.1 * 0.1 * 0.9 * 0.1 * 0.1 * 0.2 = 0.000009
  • So we need to consider all possibilities if we want an estimate of the probability of the sentence given the model (see the sketch below)
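A minimal sketch of that “sum over all possibilities” (the forward algorithm), using the toy HMM’s numbers; the dictionary encoding is my own assumption about representation, not the course’s:

```python
# Transition and emission probabilities from the toy HMM slide.
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}
TAGS = ["N", "V"]

def forward(words):
    """P(words) = the sum over all hidden tag paths (forward algorithm)."""
    alpha = {t: TRANS[("<s>", t)] * EMIT[(t, words[0])] for t in TAGS}
    for w in words[1:]:
        alpha = {j: sum(alpha[i] * TRANS[(i, j)] for i in TAGS) * EMIT[(j, w)]
                 for j in TAGS}
    return sum(alpha[j] * TRANS[(j, "</s>")] for j in TAGS)

# Sums all eight paths, including <s> N V N </s> (~0.02)
# and <s> V V V </s> (0.000009).
print(forward("dog eats dog".split()))
```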
Part-of-speech tagging

For tagging, though, we don’t want the probability of the observed sequence: we want the part-of-speech sequence that maximizes that probability.

Viterbi algorithm pseudocode

[Pseudocode slide, not fully recoverable: the Viterbi recursion keeps, for each class j at time t, the best score and a backpointer; the best-scoring path is then reconstructed by following the backpointers.]
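A runnable rendering of that pseudocode for the toy HMM above (a sketch: the encoding is assumed, and a real tagger would work in log space to avoid underflow):

```python
# Transition and emission probabilities from the toy HMM slide.
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}
TAGS = ["N", "V"]

def viterbi(words):
    # delta[t][j]: score of the best tag sequence ending in tag j at time t
    # psi[t][j]:   backpointer for class j at time t
    delta = [{j: TRANS[("<s>", j)] * EMIT[(j, words[0])] for j in TAGS}]
    psi = [{}]
    for w in words[1:]:
        delta.append({})
        psi.append({})
        for j in TAGS:
            best = max(TAGS, key=lambda i: delta[-2][i] * TRANS[(i, j)])
            psi[-1][j] = best
            delta[-1][j] = delta[-2][best] * TRANS[(best, j)] * EMIT[(j, w)]
    # Fold in the transition to </s>, then follow the backpointers.
    last = max(TAGS, key=lambda j: delta[-1][j] * TRANS[(j, "</s>")])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))

print(viterbi("dog eats dog".split()))  # ['N', 'V', 'N'], score ~0.02
```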

This Section
  • Introduction to formal languages
    • Regular languages
    • Finite (state) automata
    • Right linear grammars
  • Regular relations
    • Finite (state) transducers
Finite state methods
  • Used from the 1950’s onwards
  • Went out of fashion a bit during the 1980’s
  • Then a revival in the 1990’s with the advent of weighted finite-state methods
Formal languages
  • A language is a set (finite or infinite) of strings that can be formed out of an alphabet
  • An alphabet is a set (finite or infinite): letters, words of English, Chinese characters, beer bottles, varieties of Capsicum peppers.
Some Languages
  • English
  • Python
  • The set of palindromes over a given alphabet
  • Zero or more a’s followed by zero or more b’s
  • All words in an English dictionary ending in -ism
Regular Languages
  • A regular language is a language with a finite alphabet that can be constructed out of one or more of the following operations:
    • Set union
    • Concatenation
    • Transitive closure (Kleene star)
Some things that are regular languages
  • Zero or more a’s followed by zero or more b’s
  • The set of words in an English dictionary
  • English?
Some things that are not regular languages
  • Zero or more a’s followed by exactly the same number of b’s
  • The set of palindromes over (say) the English alphabet
  • The set of well-formed Bambara phrasal reduplications (C. Culy, 1985)
Some special regular languages
  • The universal language (Σ*)
  • The empty language (Ø)

Note: the empty language Ø (no strings at all) is not the same as the language containing just the empty string, {ε}.
Some closure properties of regular languages
  • Intersection
  • Complementation
  • Difference
  • Reversal
  • Power
Regular expressions
  • Regular expressions are a formal way of specifying a regular language
Caveats
  • Perl and other languages (see J&M, Chapter 2) have lots of stuff in their “regular expression” syntax. Strictly speaking, not all of these correspond to regular expressions in the formal sense, since they don’t describe regular languages.
  • For example, arbitrary substring copying is not expressible as a regular language, though one can do this in Perl (or Python …):

/(.+)\1/
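A quick illustration of that non-regular “regular expression” in Python (the example strings are mine):

```python
import re

# The backreference \1 matches an exact copy of whatever (.+) captured,
# i.e. the copy language {ww}, which is provably not regular.
print(bool(re.search(r"^(.+)\1$", "abcabc")))  # True: "abc" + "abc"
print(bool(re.search(r"^(.+)\1$", "abcabd")))  # False: not of the form ww
```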

Finite state automata: formal definition

A finite-state automaton is a quintuple (Q, Σ, δ, q0, F): a finite set of states Q, a finite alphabet Σ, a transition relation δ ⊆ Q × Σ × Q, a start state q0 ∈ Q, and a set of final states F ⊆ Q.

Every regular language can be recognized by a finite-state automaton.

Every finite-state automaton recognizes a regular language. (Kleene’s theorem)
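As a concrete illustration, a minimal sketch in Python of a (deterministic) automaton for a language mentioned earlier, zero or more a’s followed by zero or more b’s; the state numbering and table encoding are assumptions of mine:

```python
# DFA for a*b*: state 0 has seen only a's, state 1 has started on b's.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "b"): 1,
}
START, FINALS = 0, {0, 1}

def accepts(s):
    state = START
    for sym in s:
        if (state, sym) not in DELTA:  # no transition: reject
            return False
        state = DELTA[(state, sym)]
    return state in FINALS

assert accepts("aabbb") and accepts("") and not accepts("ba")
```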

Deterministic versus non-deterministic finite automata
  • The definition of finite-state automata given above was for non-deterministic finite automata (NDFA): δ is a relation, meaning that from any state and given any symbol, one can in principle transition to any number of states.
  • In deterministic finite automata (DFA), every state/symbol pair maps to a unique state. In other words, δ is a function.
Any NFA can be represented by a DFA

http://www.cs.duke.edu/csed/jflap/tutorial/fa/nfa2dfa/index.html

Subset construction for determinization

http://www.cs.duke.edu/csed/jflap/tutorial/fa/nfa2dfa/index.html
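A minimal sketch of the subset construction in Python (epsilon-free NFA assumed for brevity; the encoding is mine, not the tutorial’s):

```python
from itertools import chain

def determinize(nfa_delta, start, finals):
    """Subset construction: nfa_delta maps (state, symbol) -> set of states;
    returns DFA transitions whose states are frozensets of NFA states."""
    symbols = {sym for (_, sym) in nfa_delta}
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for sym in symbols:
            # The DFA successor of S on sym is the union of the NFA
            # successors of every member of S on sym.
            T = frozenset(chain.from_iterable(
                nfa_delta.get((q, sym), ()) for q in S))
            if not T:
                continue
            dfa_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_delta, start_set, dfa_finals

# Example: an NFA over {a, b} for strings ending in "ab".
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
delta, start, accepting = determinize(nfa, 0, {2})
```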

NFAs vs DFAs
  • If NDFA’s are typically smaller and simpler than their equivalent DFA’s, why do we care about DFA’s?
    • Answer: efficiency
Kleene’s theorem
  • Kleene’s Theorem, part 1: To each regular expression there corresponds a NDFA.
  • Kleene’s Theorem, part 2: To each NDFA there corresponds a regular expression.
Pumping Lemma

http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages


Right- (Left-) Linear Grammars
  • (Assuming familiarity with the notions grammar, non-terminal, terminal . . . )
    • A grammar is right-linear if each rule is of the form:

A → w(B)

where A is a non-terminal, w is a string of terminals, and (B) is an optional single non-terminal

  • Right- (Left-) Linear Grammars generate regular languages (example below)
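For example, here is one right-linear grammar (my own illustration, in the A → w(B) format) for “zero or more a’s followed by zero or more b’s”:

  S → aS
  S → T
  T → bT
  T → ε

Every rule has at most one non-terminal, at the right edge, so the generated language is regular.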
Some closure properties for regular relations
  • Power (aⁿ)
  • Reversal
  • Inversion (a⁻¹)
  • Composition: R1 ∘ R2
Important consequence of closure under inversion
  • Since regular relations are closed under inversion, one can write a set of rules that derive a surface form from a more abstract form, and then invert the resulting transducer to produce a transducer that will analyze surface forms into abstract forms.
Some things that are not generally true of relations/transducers
  • Determinization: FST’s are not generally determinizable
  • Difference: relations are not generally closed under difference
Semirings

(The term “tropical” is in honor of Imre Simon.)
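For reference, the two semirings discussed here have the following standard definitions (each a set with a “collect” operation, an “extend” operation, and their identities; this formulation is the usual one, e.g. in OpenFst, not copied from the slide):

  • “times/plus” (probability) semiring: (R≥0, +, ×, 0, 1)
  • tropical semiring: (R≥0 ∪ {∞}, min, +, ∞, 0)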

Interpretation
  • In the “times/plus” semiring, weights (typically probabilities) are multiplied along paths, and likewise during intersection.
    • The weight of a set of paths is the sum of the individual path weights.
    • The cheapest path is the one with the highest weight.
  • In the “tropical” semiring, weights (typically negative log probabilities) are summed along paths, and likewise during intersection.
    • The weight of a set of paths is the minimum of the individual path weights.
    • The cheapest path is the one with the lowest weight.
  • The cheapest (or best) path is computed by a shortest-path algorithm – cf. the Viterbi algorithm
This Section
  • N-gram models
  • Sparse data
  • Smoothing:
    • “Add One”
    • Witten-Bell
    • Good-Turing
  • Backoff
  • Other issues:
    • Good-Turing and Word Frequency Distributions
    • Good-Turing and Morphological Productivity
  • Implementation of language models as weighted automata
N-gram models
  • Remember the chain rule:
    • P(w1 w2 w3 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)
  • Problem is we can’t model all these conditional probabilities
  • N-gram models approximate P(w1 w2 w3 … wn) by setting a bound on the amount of previous context: each P(wi|w1 … wi-1) is replaced by P(wi|wi-N+1 … wi-1)
  • This is the Markov assumption, and n-grams are often termed Markov models (see the sketch below)
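For instance, a minimal bigram (order-1 Markov) model with maximum-likelihood counts; the toy corpus is an illustrative assumption:

```python
from collections import Counter

tokens = "<s> the cat sat on the mat </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(w2, w1):
    """MLE estimate P(w2 | w1) = c(w1 w2) / c(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# P(the cat sat ...) ~= P(the|<s>) * P(cat|the) * P(sat|cat) * ...
print(p_bigram("cat", "the"))  # 0.5: "the" occurs twice, once before "cat"
```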
Approximating Shakespeare
  • As we increase the value of N, the accuracy of the n-gram model increases
  • Generating sentences with random unigrams:
    • Every enter now severally so, let
    • Hill he late speaks; or! a more to leg less first you enter
  • With bigrams:
    • What means, sir. I confess she? then all sorts, he is trim, captain.
    • Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
  • Trigrams:
    • Sweet prince, Falstaff shall die.
    • This shall forbid it should be branded, if renown made it empty.
  • Tetragrams:
    • What! I will go seek the traitor Gloucester.
    • Will you not tell me who I am?
Approximating Shakespeare
  • There are 884,647 tokens, with 29,066 word form types, in Shakespeare’s works
  • Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table).
  • Tetragrams are worse: What’s coming out looks like Shakespeare because it is Shakespeare.
  • The zeroes in the table are causing problems: we are being forced down a path of selecting only the tetragrams that Shakespeare used — not a very good model of Shakespeare, in fact
  • This is the sparse data problem
Sparse data
  • In fact the sparse data problem extends beyond zeroes:
    • the occurs about 28,000 times in Shakespeare, so by the MLE:
    • P(the) = 28000/884647 = .032
  • womenkind occurs once, so:
    • P(womenkind) = 1/884647 = .0000011
  • Do we believe this?
N-gram training sensitivity
  • If we repeated the Shakespeare experiment but trained on a Wall Street Journal corpus, there would be little overlap in the output
  • This has major implications for corpus selection or design
Some useful empirical observations: a review
  • A small number of events occur with high frequency
  • A large number of events occur with low frequency
  • You can quickly collect statistics on the high frequency events
  • You might have to wait an arbitrarily long time to get valid statistics on low frequency events
  • Some of the zeroes in the table are really zeroes. But others are simply low frequency events you haven’t seen yet.
  • Whatever are we to do?
Smoothing: general issues
  • Smoothing techniques manipulate the counts of the seen and unseen cases, replacing each count c by an adjusted count c* (a sketch of the simplest scheme follows below).
  • Alternatively we can view smoothing as producing an adjusted probability P* from an original probability P.
  • More sophisticated smoothing techniques try to arrange it so that the probability estimates of the higher counts are not changed too much, since we tend to trust those.
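A minimal sketch of “Add One” (Laplace) smoothing for bigrams; the toy corpus and vocabulary are illustrative assumptions:

```python
from collections import Counter

# Add-one smoothing replaces each bigram count c by c + 1 and grows the
# denominator by the vocabulary size V, so probabilities still sum to one.
tokens = "<s> the cat sat on the mat </s>".split()
V = len(set(tokens))  # 7
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_addone(w2, w1):
    """Smoothed P(w2 | w1) = (c(w1 w2) + 1) / (c(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_addone("cat", "the"))  # seen bigram:   (1 + 1) / (2 + 7)
print(p_addone("sat", "the"))  # unseen bigram: (0 + 1) / (2 + 7), non-zero
```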
Kneser-Ney modeling
  • Lower-order ngrams are only used when higher-order ngrams are lacking
    • So build these lower-order ngrams to suit that situation
  • New York is frequent
    • York is not too frequent except after New
    • If the previous word is New then we don’t care about the unigram estimate of York
    • If the previous word is not New then we don’t want to be counting all those cases when New occurs before York
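The standard way to “build those lower-order ngrams to suit” is Kneser-Ney’s continuation probability (this is the textbook formulation, not spelled out on the slide):

  Pcontinuation(w) = |{w′ : c(w′ w) > 0}| / |{(u, v) : c(u v) > 0}|

That is, the unigram weight of a word is proportional to the number of distinct words it follows, not to its raw frequency: York, which almost always follows New, gets a low continuation probability despite its high count.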
Estimation techniques: miscellanea
  • What if you have reason to doubt your counts? In what used to be recent but is now not-so-recent work (Riley, Roark and Sproat, 2003), we’ve tried to generalize Good-Turing to the case where the counts are “fractional”, as in the (lattice) output of a speech recognizer.
  • Chen and Goodman (1998) http://citeseer.nj.nec.com/22209.html is an oft-cited study of these various techniques (and many others) and how effective they are.
  • By the way, we haven’t said anything about how one measures effectiveness.
    • There are a couple of ways:
      • Actually use the n-gram language model in a real system (such as an ASR system)
      • Measure the perplexity on some held-out corpus (defined below)
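For the record, the perplexity of a model on a held-out corpus w1 … wN is defined as

  PP = P(w1 … wN)^(-1/N)

the inverse probability of the held-out text, normalized for length; lower perplexity means the model predicts the text better.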
Smoothing isn’t just for ngrams
  • The Good-Turing estimate of the probability mass of the unseen cases is related to the growth of the vocabulary
  • It gives you a measure of how likely it is that there are “more where that came from”
  • Hence it can be used to measure the productivity of a process
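Concretely, the Good-Turing estimate of the total probability mass of unseen events is

  P0 = N1 / N

where N1 is the number of types seen exactly once (the hapax legomena) and N is the total number of tokens: the more hapaxes, the more probability mass is reserved for “more where that came from”, which is what makes it usable as a productivity measure.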
Related points
  • Baayen and Sproat (1996) showed that the best predictor of the prior probability of a given usage of an unseen morphologically complex word is the most frequent usage among the hapax legomena (see http://acl.ldc.upenn.edu/J/J96/J96-2001.pdf).
  • Sproat and Shih (1996) showed that root compounds in Chinese are productive using a Good-Turing estimate
Summary
  • N-gram models are an approximation to the correct model as given by the chain rule
  • N-gram models are relatively easy to use, but suffer from severe sparse data problems
  • There are a variety of techniques for ameliorating sparse data problems
  • These techniques relate more generally to word frequency distributions and are useful in areas beyond n-gram modeling
Backoff

[Diagram, not fully recoverable: a backoff language model as a weighted automaton, in which the state for the history wi-2 wi-1 has a failure arc, carrying the backoff weight, to the lower-order history state wi-1.]
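In the usual Katz-style formulation (an assumption here, since the slide gives only the diagram):

  Pbo(wi | wi-2 wi-1) = P*(wi | wi-2 wi-1)            if c(wi-2 wi-1 wi) > 0
                      = α(wi-2 wi-1) Pbo(wi | wi-1)   otherwise

where P* is a discounted estimate and the backoff weight α is what labels the failure arc.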

Hidden Markov Models (HMMs)

[This slide repeats the earlier toy HMM: states <s>, N, V, </s>, with the same transition and emission probabilities and the same example computation for dog eats dog.]

An equivalent WFST

[Diagram: states <s>, N, V, </s>, with arcs such as V:dog/P(V|<s>)P(dog|V) and V:eats/P(V|<s>)P(eats|V) leaving <s>.]

  • Arcs are labeled with tag:word pairs
  • States represent last seen tag
  • Arc costs are combined transition and emission costs
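A minimal sketch of building those arcs from the toy HMM (plain tuples stand in for a real WFST library such as the one used for Homework 1; the arc format and names are my assumptions):

```python
# Toy HMM parameters (from the earlier slide).
TRANS = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "N"): 0.1, ("N", "V"): 0.8, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
EMIT = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}

# One arc per (previous tag, tag, word): states are the last tag seen,
# labels are tag:word pairs, and the weight combines the transition and
# emission probabilities. Arcs into </s> carry only the transition weight.
arcs = []
for (prev, tag), p_trans in TRANS.items():
    if tag == "</s>":
        arcs.append((prev, tag, None, p_trans))
        continue
    for (t, word), p_emit in EMIT.items():
        if t == tag:
            arcs.append((prev, tag, f"{tag}:{word}", p_trans * p_emit))

for arc in sorted(arcs, key=str):
    print(arc)
```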
Homework 1
  • See: http://www.cslu.ogi.edu/~sproatr/Courses/CompLing/Homework/homework1.html