Introduction to Computational Linguistics (LING/C.SC/PSYC.438/538)

LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong

Syllabus • Details: • 538: there are no formal pre-requisites for this class • 438: LING 388 or a course in one of the following: formal languages, syntax, data structures, or compilers. • Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science • Office: Douglass 311 (ph. 626-6567) • Hours: • by appt. or walk-in • after class (best if you have Qs about the class) • Email: sandiway@email.arizona.edu • Meet here in computer lab • 5:45pm–7:00pm • Social Sciences 224 • Lab has to be closed after 7pm but you are free to use the facilities during normal hours • Course objectives: • introduction to computational linguistics • survey a range of topics • introduction to programming • hands-on experience: computer lab exercises • Expected learning outcomes: • acquire ability to write short programs • familiarity with basic concepts, techniques and applications • be equipped to take more advanced classes in computational linguistics • Grading • 438 • homeworks 65% • (note: all homeworks are required but not necessarily graded) • homework may also be in the form of readings, with a (graded) pop quiz next lecture • midterm exam 35% • 538 • homeworks 50% • (homeworks will be a superset of exercises for 438) • midterm exam 25% • term report and presentation 25%

Class Demographics

Syllabus • Absences • tell me ahead of time so we can make special arrangements • ok: religious, dean-sanctioned, medical, etc... • Required text • Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall 2008 • Special equipment • none • all software required for the course is freely available off the net • Classroom etiquette • you may ask questions • use your own laptop or lab computer • you may collaborate on classroom exercises • switch off cellphones • no communication during an exam • Homeworks • you may discuss the question with other students • however, you must write it up yourself (in your own words) • cite (web) references and your classmates (in the case of discussion) • Student Code of Academic Integrity: plagiarism etc. • http://dos.web.arizona.edu/uapolicies • Other stuff • See http://web.arizona.edu/%7Epolicy/Undergraduate%20Course%20Syllabus%20Policy.pdf • Revisions to the syllabus • “the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.”

Textbook (J&M) • 2008 (2nd edition) • Nearly 1000 pages (full year’s worth…) • 25 chapters • Divided into 5 parts: • Words • Speech – not this course • Syntax • Semantics and Pragmatics • Applications

Syllabus • Coverage • Intro to programming • we’re going to use Perl • (Python is another popular language) • Topics: selected chapters from J&M • Chapters 1–6, skip Speech part (7–11), 12–25 • Guest lectures • Applications of computational linguistics • Tentatively… • Nirav Merchant, Director of Biotechnology Computing (BIO5) • TBA • research opportunities here on campus

Book chapters • 1. Introduction • 1.1. Knowledge in Speech and Language Processing • 1.2. Ambiguity • 1.3. Models and Algorithms • 1.4. Language, Thought, and Understanding • 1.5. The State of the Art • 1.6. Some Brief History • 1.6.1. Foundational Insights: 1940s and 1950s • 1.6.2. The Two Camps: 1957–1970 • 1.6.3. Four Paradigms: 1970–1983 • 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993 • 1.6.5. The Field Comes Together: 1994–1999 • 1.6.6. The Rise of Machine Learning: 2000–2008 • 1.6.7. On Multiple Discoveries • 1.6.8. A Final Brief Note on Psychology • 1.7. Summary • Bibliographical and Historical Notes • I. Words • 2. Regular Expressions and Automata • 2.1. Regular Expressions • 2.1.1. Basic Regular Expression Patterns • 2.1.2. Disjunction, Grouping, and Precedence • 2.1.3. A Simple Example • 2.1.4. A More Complex Example • 2.1.5. Advanced Operators • 2.1.6. Regular Expression Substitution, Memory, and ELIZA • 2.2. Finite-State Automata • 2.2.1. Use of an FSA to Recognize Sheeptalk • 2.2.2. Formal Languages • 2.2.3. Another Example • 2.2.4. Non-Deterministic FSAs • 2.2.5. Use of an NFSA to Accept Strings • 2.2.6. Recognition as Search • 2.2.7. Relation of Deterministic and Non-Deterministic Automata • 2.3. Regular Languages and FSAs • 2.4. Summary • Bibliographical and Historical Notes • Exercises • 3. Words and Transducers • 3.1. Survey of (Mostly) English Morphology • 3.1.1. Inflectional Morphology • 3.1.2. Derivational Morphology • 3.1.3. Cliticization • 3.1.4. Non-Concatenative Morphology • 3.1.5. Agreement • 3.2. Finite-State Morphological Parsing • 3.3. Construction of a Finite-State Lexicon • 3.4. Finite-State Transducers • 3.4.1. Sequential Transducers and Determinism • 3.5. FSTs for Morphological Parsing • 3.6. Transducers and Orthographic Rules • 3.7. The Combination of an FST Lexicon and Rules • 3.8. Lexicon-Free FSTs: The Porter Stemmer • 3.9. Word and Sentence Tokenization • 3.9.1. Segmentation in Chinese • 3.10. Detection and Correction of Spelling Errors • 3.11. Minimum Edit Distance • 3.12. Human Morphological Processing • 3.13. Summary • Bibliographical and Historical Notes • Exercises • 4. N-Grams • 4.1. Word Counting in Corpora • 4.2. Simple (Unsmoothed) N-Grams • 4.3. Training and Test Sets • 4.3.1. N-Gram Sensitivity to the Training Corpus • 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks • 4.4. Evaluating N-Grams: Perplexity • 4.5. Smoothing • 4.5.1. Laplace Smoothing • 4.5.2. Good-Turing Discounting • 4.5.3. Some Advanced Issues in Good-Turing Estimation • 4.6. Interpolation • 4.7. Backoff • 4.7.1. Advanced: Details of Computing Katz Backoff α and P* • 4.8. Practical Issues: Toolkits and Data Formats • 4.9. Advanced Issues in Language Modeling • 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing • 4.9.2. Class-Based N-Grams • 4.9.3. Language Model Adaptation and Web Use • 4.9.4. Using Longer-Distance Information: A Brief Summary • 4.10. Advanced: Information Theory Background • 4.10.1. Cross-Entropy for Comparing Models • 4.11. Advanced: The Entropy of English and Entropy Rate Constancy • 4.12. Summary • Bibliographical and Historical Notes • Exercises

Lecture slides • Download from my homepage • http://dingo.sbs.arizona.edu/~sandiway/#courses • available from class time (and afterwards) • In .pptx and .pdf formats

Homework: Reading • Chapter 1 from J&M • introduction and history • available online • http://www.cs.colorado.edu/%7Emartin/SLP/Updates/1.pdf • Whole book is available as an e-book • www.coursesmart.com

Homework: Perl • Install Perl on your laptop • If you don’t already have it, it’s freely available here • http://www.activestate.com/ (free and for-fee versions)

Learning Perl • Learn Perl • Books… • Online resources • http://learn.perl.org/ • Next time, we begin with ... • http://perldoc.perl.org/perlintro.html

Language and Computers • Enormous amounts of data stored • world-wide web (WWW) • corporate databases • your own hard drive • Major categories of data • numeric • language: words, text, sound • pictures, video

Language and Computers • We know what we want from computer software • “killer applications” • those that can make sense of language data • retrieve language data: information retrieval (IR) • summarize knowledge contained in language data • answer questions (QA), make logical inferences • translate from one language into another • recognize speech: transcribe • etc...

Language and Computers • In other words, we’d like computers to be smart about language • possess “intelligence” • pass the Turing Test …

Language and Computers • In other words, we’d like computers to be smart about language • possess intelligence • well, perhaps not too smart… From 2001… (HAL)

Language and Computers • (Un)fortunately, we’re not there yet… • gap between what computers can do and what we want them to be able to do • Often quoted (but not verified): "The spirit is willing, but the flesh is weak" was translated into Russian and then back into English; the result was "The vodka is good, but the meat is rotten." • But with Google Translate or Babelfish, it’s not difficult to find (funny) examples…

Language and Computers • and how can we tell if the translation is right anyway? • http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg

Language and Computers • Human interpreters make errors too …

Language and Computers • Google Translate • two personal experiences…

Language and Computers • 4,000,000 yen or 40,000 yen?

Language and Computers • Non-compositionality Puzzle 40,000 yen + less than

Language and Computers • Non-compositionality Puzzle

Language and Computers • What happened? • 4万円 + 以下 • 40,000 yen + less than • 万円 rendered as “million yen” (error: 万 = 10,000, so 4万円 = 40,000 yen)
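The arithmetic behind the puzzle can be sketched compositionally. This is a toy illustration in Python (the course itself uses Perl), and it is not a claim about how Google Translate works: the names `parse_yen` and `gloss` are hypothetical, only the multiplier 万 is handled, and 以下 is glossed as "less than" to match the slide.

```python
# Toy compositional reading of "4万円 + 以下".
# All names here are hypothetical; this is not how any real translation system works.

MULTIPLIERS = {"万": 10_000}        # 万 (man) means ten thousand, not a million
MODIFIERS = {"以下": "less than"}   # gloss taken from the slide


def parse_yen(amount: str) -> int:
    """Turn a string such as '4万円' into an integer number of yen."""
    if not amount.endswith("円"):
        raise ValueError("expected an amount ending in 円")
    digits = amount[:-1]             # drop the currency marker 円
    multiplier = 1
    if digits and digits[-1] in MULTIPLIERS:
        multiplier = MULTIPLIERS[digits[-1]]
        digits = digits[:-1]         # drop the multiplier character
    return int(digits) * multiplier


def gloss(amount: str, modifier: str) -> str:
    """Combine the meaning of the amount with the meaning of the modifier."""
    return f"{MODIFIERS[modifier]} {parse_yen(amount):,} yen"


print(gloss("4万円", "以下"))  # less than 40,000 yen
```

Built up from the meanings of its parts, the phrase comes out as 40,000 yen; reading 万円 as "million yen" would give 4,000,000 instead.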

Language and Computers

Language and Computers • I paid a visit to the Peabody Essex Museum (Salem MA) • Yin Yu Tang: A Chinese Home • … so what do those 3 characters (Yin Yu Tang), the name of the house, actually mean?

Language and Computers … excuse my poor attempt at handwriting input

Language and Computers • First character 蔭 translates as Yam. • Capitalization is important: it’s not an edible tuber. It’s returning a name. • But this is not a translation! Even if it’s just romanization, shouldn’t it be Yin? • BTW, Yam here suggests Cantonese pronunciation. Why?

Language and Computers • almost useful…? • Gee, seems like one needs to already know Chinese to get Google Translate to yield a translation of the first character 蔭 as shade. (I cheated by entering the collocation.) • Entering all 3 characters gets us back to square zero. • Entering the 1st two characters returns a title/name.

Language and Computers • almost useful…? • Is it possible to use Google Translate to figure out that the name of the house means “Mansion of plentiful shade”? • Can you come up with a hypothesis for why this happens? i.e. what technology do you think Google Translate uses?
