
LING/C SC/PSYC 438/538



  1. LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong

  2. Syllabus
  • Details:
    • 538: introductory level, so there are no formal prerequisites for this class
    • 438: LING 388 or familiarity with one or more of the following: formal languages, syntax, data structures, or compilers
  • Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science
  • Office: Douglass 311 (ph. 626-6567)
  • Hours: by appt. or walk-in; after class (best if you have questions about the class)
  • Email: sandiway@email.arizona.edu
  • Meet:
    • Tuesdays in ECE 229, 2–3:15pm
    • Thursdays in Shantz 338, 2–3:15pm
  • No class on Thursday, November 28th (Thanksgiving)
  • Course objectives:
    • introduction to computational linguistics
    • survey of a range of topics
    • introduction to programming
    • hands-on experience: computer lab exercises
  • Expected learning outcomes:
    • acquire the ability to write short programs
    • familiarity with basic concepts, techniques, and applications
    • be equipped to take more advanced classes in computational linguistics, e.g. 581 (Spring)

  3. Syllabus
  • Grading:
    • 438: homeworks 100% (note: all homeworks are required)
    • 538: homeworks 75% (the 538 homeworks will be a superset of the 438 exercises); chapter presentation 25%
  • Homework submissions:
    • email only: sandiway@email.arizona.edu
    • due by midnight of the due date (typically one week; homeworks will be presented in class)
  • Homeworks:
    • you may discuss questions with other students
    • however, you must write up your answers yourself (in your own words)
    • cite (web) references and your classmates (in the case of discussion)
  • Student Code of Academic Integrity (plagiarism etc.): http://deanofstudents.arizona.edu/codeofacademicintegrity
  • Revisions to the syllabus: "the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor."

  4. Syllabus
  • Absences:
    • tell me ahead of time so we can make special arrangements
    • I expect you to attend lectures (though attendance will not be taken)
  • Required text: Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall, 2008
  • Special equipment: none; all software required for the course is freely available off the net
  • Classroom etiquette:
    • ask questions
    • use your own laptop or a lab computer
    • you may collaborate on classroom exercises
  • Topics (16 weeks):
    • Programming language: Perl
    • Regular expressions
    • Automata (finite-state)
    • Transducers (finite-state)
    • Programming language: Prolog (definite clause grammars)
    • Part-of-speech tagging
    • Stemming (morphology)
    • Edit distance (spelling)
    • Grammars (regular, context-free)
    • Parsing (syntax trees, algorithms)
    • N-grams (probability, smoothing)
    • and more …

  5. Course website
  • Download lecture slides from my homepage: http://dingo.sbs.arizona.edu/~sandiway/#courses
  • available from class time (and afterwards; look for corrections/updates)
  • in .pptx (animations) and .pdf formats

  6. Course website

  7. Miss a lecture?
  • Available for review:
    • linked via the course homepage to http://ua.hosted.panopto.com/
    • access to low-res video, laptop screen, slides, and a searchable index

  8. Textbook (J&M)
  • 2008 (2nd edition)
  • nearly 1000 pages (maybe more than a full year’s worth…)
  • 25 chapters
  • divided into 5 parts:
    • Words
    • Speech – not this course
    • Syntax
    • Semantics and Pragmatics
    • Applications

  9. Book chapters
  • 1. Introduction
    • 1.1. Knowledge in Speech and Language Processing
    • 1.2. Ambiguity
    • 1.3. Models and Algorithms
    • 1.4. Language, Thought, and Understanding
    • 1.5. The State of the Art
    • 1.6. Some Brief History
      • 1.6.1. Foundational Insights: 1940s and 1950s
      • 1.6.2. The Two Camps: 1957–1970
      • 1.6.3. Four Paradigms: 1970–1983
      • 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993
      • 1.6.5. The Field Comes Together: 1994–1999
      • 1.6.6. The Rise of Machine Learning: 2000–2008
      • 1.6.7. On Multiple Discoveries
      • 1.6.8. A Final Brief Note on Psychology
    • 1.7. Summary
    • Bibliographical and Historical Notes
  • I. Words
  • 2. Regular Expressions and Automata
    • 2.1. Regular Expressions
      • 2.1.1. Basic Regular Expression Patterns
      • 2.1.2. Disjunction, Grouping, and Precedence
      • 2.1.3. A Simple Example
      • 2.1.4. A More Complex Example
      • 2.1.5. Advanced Operators
      • 2.1.6. Regular Expression Substitution, Memory, and ELIZA
    • 2.2. Finite-State Automata
      • 2.2.1. Use of an FSA to Recognize Sheeptalk
      • 2.2.2. Formal Languages
      • 2.2.3. Another Example
      • 2.2.4. Non-Deterministic FSAs
      • 2.2.5. Use of an NFSA to Accept Strings
      • 2.2.6. Recognition as Search
      • 2.2.7. Relation of Deterministic and Non-Deterministic Automata
    • 2.3. Regular Languages and FSAs
    • 2.4. Summary
    • Bibliographical and Historical Notes
    • Exercises
  • 3. Words and Transducers
    • 3.1. Survey of (Mostly) English Morphology
      • 3.1.1. Inflectional Morphology
      • 3.1.2. Derivational Morphology
      • 3.1.3. Cliticization
      • 3.1.4. Non-Concatenative Morphology
      • 3.1.5. Agreement
    • 3.2. Finite-State Morphological Parsing
    • 3.3. Construction of a Finite-State Lexicon
    • 3.4. Finite-State Transducers
      • 3.4.1. Sequential Transducers and Determinism
    • 3.5. FSTs for Morphological Parsing
    • 3.6. Transducers and Orthographic Rules
    • 3.7. The Combination of an FST Lexicon and Rules
    • 3.8. Lexicon-Free FSTs: The Porter Stemmer
    • 3.9. Word and Sentence Tokenization
      • 3.9.1. Segmentation in Chinese
    • 3.10. Detection and Correction of Spelling Errors
    • 3.11. Minimum Edit Distance
    • 3.12. Human Morphological Processing
    • 3.13. Summary
    • Bibliographical and Historical Notes
    • Exercises
  • 4. N-Grams
    • 4.1. Word Counting in Corpora
    • 4.2. Simple (Unsmoothed) N-Grams
    • 4.3. Training and Test Sets
      • 4.3.1. N-Gram Sensitivity to the Training Corpus
      • 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks
    • 4.4. Evaluating N-Grams: Perplexity
    • 4.5. Smoothing
      • 4.5.1. Laplace Smoothing
      • 4.5.2. Good-Turing Discounting
      • 4.5.3. Some Advanced Issues in Good-Turing Estimation
    • 4.6. Interpolation
    • 4.7. Backoff
      • 4.7.1. Advanced: Details of Computing Katz Backoff α and P*
    • 4.8. Practical Issues: Toolkits and Data Formats
    • 4.9. Advanced Issues in Language Modeling
      • 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing
      • 4.9.2. Class-Based N-Grams
      • 4.9.3. Language Model Adaptation and Web Use
      • 4.9.4. Using Longer-Distance Information: A Brief Summary
    • 4.10. Advanced: Information Theory Background
      • 4.10.1. Cross-Entropy for Comparing Models
    • 4.11. Advanced: The Entropy of English and Entropy Rate Constancy
    • 4.12. Summary
    • Bibliographical and Historical Notes
    • Exercises

  10. Syllabus
  • Coverage:
    • intro to programming: we’re going to use Perl
    • Python is another (perhaps more) popular language
  • Topics: selected chapters from J&M
    • Chapters 1–6, then 12–25; we skip the Speech part (Chapters 7–11)

  11. Homework: Reading
  • Chapter 1 from J&M: introduction and history
  • available online: http://www.cs.colorado.edu/~martin/SLP/Updates/1.pdf
  • the whole book is available as an e-book: www.coursesmart.com

  12. Homework: Install Perl
  • Install Perl on your laptop
  • it should be pre-installed on Macs and Linux (Ubuntu); check your machine
  • on Windows PCs, if you don’t already have it, it’s freely available at http://www.activestate.com/ (don’t pay; get the free version)
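
  A quick way to check whether Perl is already present (a generic check, not part of the original slides) is to open a terminal and run:

    perl -v
    # prints the installed Perl version; any Perl 5 release is fine for this course

    perl -e 'print "Hello from Perl\n";'
    # a one-line sanity test: if this prints the greeting, Perl is working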

  13. Homework: Install Perl
  • Ubuntu: (screenshot in the original slides)
  • Mac: (screenshot in the original slides)

  14. Homework: Install Perl http://learn.perl.org/installing/

  15. Learning Perl
  • Learn Perl:
    • books…
    • online resources: http://learn.perl.org/
  • Next time, we begin with http://perldoc.perl.org/perlintro.html (a short preview follows below)
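
  As a preview of the ground perlintro covers, here is a minimal first script (a sketch for orientation, not taken from the slides; the file name hello.pl is arbitrary):

    #!/usr/bin/perl
    use strict;      # require variables to be declared
    use warnings;    # report dubious constructs

    # Scalar variables and string interpolation
    my $course = "LING/C SC/PSYC 438/538";
    print "Welcome to $course\n";

    # A first regular expression match
    my $sentence = "We will program in Perl this semester.";
    if ($sentence =~ /Perl/) {
        print "Found 'Perl' in the sentence.\n";
    }

  Save it as hello.pl and run it from a terminal with: perl hello.pl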

  16. Language and Computers
  • Enormous amounts of data are stored on:
    • the world-wide web (WWW)
    • corporate databases
    • your own hard drive
  • Major categories of data:
    • numeric
    • language: words, text, sound
    • pictures, video

  17. Language and Computers
  • We know what we want from computer software: “killer applications”, those that can make sense of language data:
    • retrieve language data: information retrieval (IR)
    • summarize knowledge contained in language data, e.g. sentiment analysis from online product reviews
    • answer questions (QA), make logical inferences
    • translate from one language into another
    • recognize speech: transcription
    • etc.

  18. Language and Computers
  • In other words, we’d like computers to be smart about language:
    • possess “intelligence”
    • pass the Turing Test …

  19. Language and Computers
  • In other words, we’d like computers to be smart about language:
    • possess intelligence
    • well, perhaps not too smart… (HAL, from 2001: A Space Odyssey)

  20. Language and Computers
  • (Un)fortunately, we’re not there yet…
  • there is a gap between what computers can do and what we want them to be able to do
  • Often quoted (but not verified): when “The spirit is strong, but the flesh is weak” was translated into Russian and then back to English, the result was “The vodka is good, but the meat is rotten.”
  • but with Google Translate or Babelfish, it’s not difficult to find (funny) examples…

  21. Language and Computers • and how can we tell if the translation is right anyway? • http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg

  22. Language and Computers

  23. Language and Computers
  • Obama: “At a certain point, I’ve just concluded that for me personally it is important for me to go ahead and affirm that I think same-sex couples should be able to get married.”
  • Is this sentence complicated? Why?

  24. Language and Computers

  25. Language and Computers Executive Summarization

  26. Language and Computers
  • Do you trust Google Translate?
  • a real case: 4,000,000 yen or 40,000 yen?

  27. Language and Computers
  • Puzzle: the translation of 4万円以下
  • there are no spaces, so this is a segmentation task:
    • 4
    • 万円 = 10,000 yen
    • 以下 = less than/below/not exceeding
  • i.e., 4万円以下 should come out as “40,000 yen or less”
  • Now fixed (almost) with auto-detect on

  28. Language and Computers • Non-compositionality Puzzle

  29. Language and Computers
  • What happened? 4万円以下 can be segmented in two ways:
    • 4 | 万円 (10,000 yen) | 以下 (less than/below/not exceeding) – the intended reading
    • 4 | 万円以下 – apparently rendered as “million yen”
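
  The segmentation step can be made concrete with a short sketch (not from the lecture) of greedy longest-match segmentation over a toy lexicon; the lexicon entries and glosses here are illustrative assumptions, not Google’s actual dictionary:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                      # the source code contains UTF-8 literals
    binmode(STDOUT, ":utf8");      # print UTF-8 cleanly

    # Toy lexicon: hypothetical entries and glosses, for illustration only.
    my %lexicon = (
        "4"    => "4",
        "万円" => "10,000 yen",
        "以下" => "less than/below/not exceeding",
    );

    # Greedy longest-match segmentation: at each position, take the longest
    # lexicon entry that matches, then continue from the next character.
    sub segment {
        my ($s) = @_;
        my @out;
        while (length $s) {
            my $match;
            for my $len (reverse 1 .. length $s) {
                my $prefix = substr($s, 0, $len);
                if (exists $lexicon{$prefix}) { $match = $prefix; last; }
            }
            return () unless defined $match;   # give up on an unknown character
            push @out, $match;
            $s = substr($s, length($match));
        }
        return @out;
    }

    my @segments = segment("4万円以下");
    print join(" | ", map { "$_ = $lexicon{$_}" } @segments), "\n";
    # prints: 4 = 4 | 万円 = 10,000 yen | 以下 = less than/below/not exceeding

  Note that if the toy lexicon also contained an entry for 万円以下 as a single unit, the greedy matcher would prefer it, which is one way a mis-segmentation like the one above can arise.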

  30. Language and Computers
  • Problems still remain (as of August 27, 2013): another glitch, this time an order of magnitude off in the other direction: 10,000 → 1,000
  • but that’s still better than “million”

  31. Language and Computers
  • a visit to the Peabody Essex Museum (Massachusetts)
  • a Qing-dynasty, Huizhou-style house: 蔭餘堂 (simplified 荫余堂)
  • … so what do those 3 characters (Yin Yu Tang), the name of the house, actually mean?

  32. Language and Computers
  • Meaning of 荫余堂/蔭餘堂 (simplified/traditional spelling)
  • the pinyin romanization is not the translation I’m looking for…

  33. Language and Computers
  • Meaning of 荫余堂/蔭餘堂
  • meaning in language is (mostly) compositional

  34. Language and Computers • Meaning of 荫余堂/蔭餘堂

  35. Language and Computers • Meaning of 荫余堂/蔭餘堂

  36. Language and Computers • Meaning of 荫余堂/蔭餘堂

  37. Applications
  • the technology is still in development
    • even if we are willing to pay...
  • machine translation has been worked on since shortly after World War II (the 1950s)
    • it is still not perfected today
  • why? what properties of human languages make it hard?

  38. Natural Language Properties
  • which properties are going to be difficult for computers to deal with?
  • grammar (rules for putting words together into sentences)
    • How many rules are there? 100, 1,000, 10,000, more …
    • Which portions are learnt, and which are innate?
    • Do we have all the rules written down somewhere?
  • lexicon (dictionary)
    • How many words do we need to know? 1,000, 10,000, 100,000 …
  • meaning and inference (semantic interpretation, commonsense world knowledge)

  39. Computers vs. Humans
  • Knowledge of language:
    • computers are way faster than humans: they kill us at arithmetic and chess, and now Jeopardy as well …
    • but human beings are so good at language that we often take our ability for granted
    • language is processed without conscious thought
    • and we do pretty complex things with it

  40. Examples
  • Knowledge:
    • Which report did you file without reading? (a parasitic gap sentence)
    • We take for granted this process of “filling in” or recovering the missing information

  41. Examples
  • Changes in interpretation:
    • John is too stubborn to talk to (John is the one being talked to)
    • John is too stubborn to talk to Bill (John is the one doing the talking)

  42. Examples
  • Ambiguity:
    • where can I see the bus stop?
    • stop: a verb, or part of the noun-noun compound bus stop?
    • context (discourse or situation) disambiguates

  43. Examples
  • Ungrammaticality:
    • *Which book did you file the report without reading? (* = ungrammatical)
  • ungrammatical vs. incomprehensible:
    • Colorless green ideas sleep furiously. (grammatical, though nonsensical)
    • Furiously sleep ideas green colorless. (ungrammatical)

  44. Examples
  • The human parser has quirks:
    • Ian told the man that he hired a story
    • Ian told the man that he hired a secretary
  • Garden-pathing: a temporary ambiguity
    • tell: someone something vs. …
  • More subtle differences:
    • The reporter who the senator attacked admitted the error
    • The reporter who attacked the senator admitted the error
