
LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538. Lecture 1 Sandiway Fong. Syllabus. Details: 538: introductory level, no formal pre-requisites 438: LING 388 or familiarity with one or more of the following: formal languages, syntax, data structures, or compilers


Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 1 Sandiway Fong

  2. Syllabus Details: • 538: introductory level, no formal pre-requisites • 438: LING 388 or familiarity with one or more of the following: formal languages, syntax, data structures, or compilers • Instructor: Sandiway Fong, Depts. of Linguistics and Computer Science • Office: Douglass 311 (ph. 626-6567) • Hours: • by appt. or walk-in • after class (best if you have quick Qs) • Email: sandiway@email.arizona.edu • Meet: Tuesdays/Thursdays in AME S314, 2-3:15pm • No class on • November 11th (Veterans Day) • November 27th (Thanksgiving)

  3. Syllabus • Course objectives: • introduction to computational linguistics • survey a range of topics • introduction to programming • Expected learning outcomes: • acquire ability to write short programs • familiarity with basic concepts, techniques and applications • be equipped to take more advanced classes in computational linguistics, e.g. 581 (Spring)

  4. Syllabus • Grading • 438 • homeworks 100% • note: all homeworks are required • 538 • homeworks 75% • (homeworks will be a superset of the exercises for 438) • chapter presentation 25% • Homework submissions • email only • sandiway@email.arizona.edu • by midnight of due date • typically: one week • (homeworks will be presented in class)

  5. Syllabus • Homeworks • you may discuss questions with other students • however, you must write it up yourself (in your own words) • cite (web) references and your classmates (in the case of discussion) • Student Code of Academic Integrity: plagiarism etc. • http://deanofstudents.arizona.edu/codeofacademicintegrity • Revisions to the syllabus • “the information contained in the course syllabus, other than the grade and absence policies, may be subject to change with reasonable advance notice, as deemed appropriate by the instructor.”

  6. Syllabus • Absences • tell me ahead of time so we can make special arrangements • I expect you to attend lectures (though attendance will not be taken) • Required text • Speech and Language Processing, Jurafsky & Martin, 2nd edition, Prentice Hall 2008 • Special equipment • none • all software required for the course is freely available off the net • Classroom etiquette • ask questions • use your own laptop or lab computer • Topics (16 weeks) • Programming Language: Perl • Regular Expressions • Automata (Finite State) • Transducers (Finite State) • Programming Language: Prolog (definite clause grammars) • Part of Speech Tagging • Stemming (Morphology) • Edit Distance (Spelling) • Grammars (Regular, Context-free) • Parsing (Syntax trees, algorithms) • N-grams (Probability, Smoothing) • and more …

  7. Course website • Download lecture slides from my homepage • http://dingo.sbs.arizona.edu/~sandiway/#courses • available from class time (and afterwards, look for corrections/updates) • in .pptx (animations) and .pdf formats

  8. Course website

  9. Miss a lecture? • Available for review: • linked via course homepage to http://ua.lecturecast.arizona.edu/ • access to low-res video, laptop screen, slides, index (searchable)

  10. Textbook (J&M) 2008 (2nd edition) Nearly 1000 pages (maybe more than a full year’s worth…) 25 chapters Divided into 5 parts Words Speech – not this course Syntax Semantics and Pragmatics Applications

  11. Book chapters • 1. Introduction • 1.1. Knowledge in Speech and Language Processing • 1.2. Ambiguity • 1.3. Models and Algorithms • 1.4. Language, Thought, and Understanding • 1.5. The State of the Art • 1.6. Some Brief History • 1.6.1. Foundational Insights: 1940s and 1950s • 1.6.2. The Two Camps: 1957–1970 • 1.6.3. Four Paradigms: 1970–1983 • 1.6.4. Empiricism and Finite-State Models Redux: 1983–1993 • 1.6.5. The Field Comes Together: 1994–1999 • 1.6.6. The Rise of Machine Learning: 2000–2008 • 1.6.7. On Multiple Discoveries • 1.6.8. A Final Brief Note on Psychology • 1.7. Summary • Bibliographical and Historical Notes • I. Words • 2. Regular Expressions and Automata • 2.1. Regular Expressions • 2.1.1. Basic Regular Expression Patterns • 2.1.2. Disjunction, Grouping, and Precedence • 2.1.3. A Simple Example • 2.1.4. A More Complex Example • 2.1.5. Advanced Operators • 2.1.6. Regular Expression Substitution, Memory, and ELIZA • 2.2. Finite-State Automata • 2.2.1. Use of an FSA to Recognize Sheeptalk • 2.2.2. Formal Languages • 2.2.3. Another Example • 2.2.4. Non-Deterministic FSAs • 2.2.5. Use of an NFSA to Accept Strings • 2.2.6. Recognition as Search • 2.2.7. Relation of Deterministic and Non-Deterministic Automata • 2.3. Regular Languages and FSAs • 2.4. Summary • Bibliographical and Historical Notes • Exercises • 3. Words and Transducers • 3.1. Survey of (Mostly) English Morphology • 3.1.1. Inflectional Morphology • 3.1.2. Derivational Morphology • 3.1.3. Cliticization • 3.1.4. Non-Concatenative Morphology • 3.1.5. Agreement • 3.2. Finite-State Morphological Parsing • 3.3. Construction of a Finite-State Lexicon • 3.4. Finite-State Transducers • 3.4.1. Sequential Transducers and Determinism • 3.5. FSTs for Morphological Parsing • 3.6. Transducers and Orthographic Rules • 3.7. The Combination of an FST Lexicon and Rules • 3.8. Lexicon-Free FSTs: The Porter Stemmer • 3.9. Word and Sentence Tokenization • 3.9.1. Segmentation in Chinese • 3.10. 
Detection and Correction of Spelling Errors • 3.11. Minimum Edit Distance • 3.12. Human Morphological Processing • 3.13. Summary • Bibliographical and Historical Notes • Exercises • 4. N-Grams • 4.1. Word Counting in Corpora • 4.2. Simple (Unsmoothed) N-Grams • 4.3. Training and Test Sets • 4.3.1. N-Gram Sensitivity to the Training Corpus • 4.3.2. Unknown Words: Open Versus Closed Vocabulary Tasks • 4.4. Evaluating N-Grams: Perplexity • 4.5. Smoothing • 4.5.1. Laplace Smoothing • 4.5.2. Good-Turing Discounting • 4.5.3. Some Advanced Issues in Good-Turing Estimation • 4.6. Interpolation • 4.7. Backoff • 4.7.1. Advanced: Details of Computing Katz Backoff α and P* • 4.8. Practical Issues: Toolkits and Data Formats • 4.9. Advanced Issues in Language Modeling • 4.9.1. Advanced Smoothing Methods: Kneser-Ney Smoothing • 4.9.2. Class-Based N-Grams • 4.9.3. Language Model Adaptation and Web Use • 4.9.4. Using Longer-Distance Information: A Brief Summary • 4.10. Advanced: Information Theory Background • 4.10.1. Cross-Entropy for Comparing Models • 4.11. Advanced: The Entropy of English and Entropy Rate Constancy • 4.12. Summary • Bibliographical and Historical Notes • Exercises

  12. Syllabus • Coverage • Intro to programming • we’re going to use Perl • Python is another (perhaps more) popular language • Topics: selected chapters from J&M • Chapters 1–6, skip Speech part (7–11), 12–25

  13. Homework: Reading • Chapter 1 from J&M • introduction and history • available online • http://www.cs.colorado.edu/~martin/SLP/Updates/1.pdf • Whole book is available as an e-book • www.coursesmart.com

  14. Homework: Install Perl • Install Perl on your laptop • should be pre-installed on Macs and Linux (Ubuntu), check your machine • on Windows PCs, if you don’t already have it, it’s freely available here • http://www.activestate.com/ (don’t pay, get the free version)

  15. Homework: Install Perl • Ubuntu: run "perl -v" (prints the version) and "which perl" (prints the install location) • Mac:
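
The same check (is there a perl on the PATH, and what version does it report?) can be sketched in Python, the other language the syllabus mentions. This is a minimal sketch, not part of the original slides: the function names find_perl and perl_version are made up for illustration, and it assumes Python 3.7 or later is installed.

```python
import shutil
import subprocess


def find_perl():
    """Return the path of the first `perl` on PATH, or None if absent."""
    return shutil.which("perl")


def perl_version(perl_path):
    """Run `perl -v` and return whatever it prints to standard output."""
    result = subprocess.run(
        [perl_path, "-v"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_perl()
    if path is None:
        print("perl not found on PATH; see http://learn.perl.org/installing/")
    else:
        print("perl found at", path)
        print(perl_version(path))
```

shutil.which plays the role of "which perl", and the subprocess call plays the role of "perl -v".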

  16. Homework: Install Perl Other methods See http://learn.perl.org/installing/

  17. Learning Perl • Learn Perl • Books… • Online resources • http://learn.perl.org/ • Next time, we begin with ... • http://perldoc.perl.org/perlintro.html

  18. Language and Computers • Enormous amounts of data stored • world-wide web (WWW) • corporate databases • your own hard drive • Major categories of data • numeric • Language: words, text, sound • pictures, video

  19. Language and Computers • We know what we want from computer software • “killer applications” • those that can make sense of language data • retrieve language data: (IR) • summarize knowledge contained in language data • sentiment analysis from online product reviews • answer questions (QA), make logical inferences • translate from one language into another • recognize speech: transcribe • etc...

  20. Language and Computers • In other words, we’d like computers to be smart about language • possess “intelligence” • pass the Turing Test …

  21. Language and Computers • In other words, we’d like computers to be smart about language • possess intelligence • well, perhaps not too smart… From 2001… (HAL)

  22. Language and Computers • (Un)fortunately, we’re not there yet… • gap between what computers can do and what we want them to be able to do • Often quoted (but not verified): "The spirit is willing, but the flesh is weak" was translated into Russian and then back into English; the result was "The vodka is good, but the meat is rotten." • But with Google Translate or Babelfish, it’s not difficult to find (funny) examples…

  23. Language and Computers • and how can we tell if the translation is right anyway? • http://fun.drno.de/pics/english/only-in-china/TranslateServerError.jpg

  24. Language and Computers

  25. Language and Computers • Obama: "At a certain point, I've just concluded that for me personally it is important for me to go ahead and affirm that I think same-sex couples should be able to get married." Is this sentence complicated? Why?

  26. Language and Computers

  27. Language and Computers Executive Summarization

  28. Language and Computers Do you trust Google Translate? • a real case: 4,000,000 yen or 40,000 yen?

  29. Language and Computers • Puzzle: translation of 4万円以下 • no spaces: segmentation task • 4 | 万円 (10,000 yen) | 以下 (less than/below/not exceeding) • Now fixed (almost) with auto-detect on
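
The arithmetic behind the puzzle can be sketched as follows: 4 followed by 万 (10,000) and 円 (yen) is 4 × 10,000 = 40,000 yen, not 4,000,000. This is an illustrative sketch only; the MULTIPLIERS table and the yen_amount function are hypothetical names, and the parser handles just the simple "digits, one optional multiplier, 円" pattern.

```python
# Values of a few Japanese numeral characters used as multipliers.
MULTIPLIERS = {"十": 10, "百": 100, "千": 1_000, "万": 10_000}


def yen_amount(text):
    """Parse strings like '4万円' (digits, optional multiplier, then 円)."""
    if not text.endswith("円"):
        raise ValueError("expected an amount ending in 円")
    body = text[:-1]
    multiplier = 1
    if body and body[-1] in MULTIPLIERS:
        multiplier = MULTIPLIERS[body[-1]]
        body = body[:-1]
    return int(body) * multiplier


print(yen_amount("4万円"))  # 4 * 10,000 = 40000
```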

  30. Language and Computers • Non-compositionality Puzzle

  31. Language and Computers • What happened? 4万円以下 • can be segmented as follows: • 4 | 万円 (10,000 yen) | 以下 (less than/below/not exceeding) • 4 | 万円以下 (Million yen)
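
Why segmentation is a real choice can be sketched with a toy dictionary-based segmenter that lists every way of splitting the string into known entries. The lexicon below is an assumption made up for illustration (it is not the dictionary of Google Translate or any real system), and the function name segmentations is hypothetical.

```python
# Toy lexicon: entries and glosses are illustrative assumptions,
# not the dictionary of any real translation system.
LEXICON = {
    "4": "4",
    "万": "10,000",
    "円": "yen",
    "万円": "10,000 yen",
    "以下": "less than/below/not exceeding",
}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


for seg in segmentations("4万円以下"):
    print(" | ".join(seg))
```

Even with five entries the string has two segmentations (4 | 万 | 円 | 以下 and 4 | 万円 | 以下), so a system must pick one, and a larger lexicon only adds candidates.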

  32. Language and Computers • Problems still remain (as of August 27, 2013): another glitch, this time an order of magnitude in the other direction (10,000 -> 1,000), but better than “million”

  33. Language and Computers • a visit to the Peabody Essex Museum (Massachusetts) • Qing dynasty Huīzhōu (徽州) -style house … so what do those 3 characters (Yin Yu Tang), the name of the house, actually mean? 蔭餘堂 (simplified 荫余堂)

  34. Language and Computers • Meaning of 荫余堂/蔭餘堂 (simplified/traditional spelling) the strange romanization is not the translation I’m looking for…

  35. Language and Computers • Meaning of 荫余堂/蔭餘堂 Meaning in language is (mostly) compositional

  36. Language and Computers • Meaning of 1st character: 荫/蔭

  37. Language and Computers • Meaning of 2nd character: 余/餘

  38. Language and Computers • Meaning of 余/餘

  39. Language and Computers • Meaning of 3rd character: 堂

  40. Language and Computers • Meaning of 蔭餘堂 • shade I Church • shady remainder Hall
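
The character-by-character approach can be sketched as a lookup followed by combination: take one gloss per character and join them in order. The glosses below are only the ones that appear in the two outputs on the slide; which gloss came from which lookup is an assumption, and compositional_readings is a hypothetical name.

```python
from itertools import product

# Candidate glosses per character, taken from the two outputs on the slide
# ("shade I Church" and "shady remainder Hall"); a real dictionary lists more.
GLOSSES = {
    "蔭": ["shade", "shady"],
    "餘": ["I", "remainder"],
    "堂": ["Church", "Hall"],
}


def compositional_readings(word, glosses=GLOSSES):
    """Combine one gloss per character, in order, into candidate readings."""
    options = [glosses[ch] for ch in word]
    return [" ".join(choice) for choice in product(*options)]


for reading in compositional_readings("蔭餘堂"):
    print(reading)
```

Two glosses for each of three characters already give 2 × 2 × 2 = 8 candidate readings, and nothing in the lookup says which one a reader would accept as the name of a house.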

  41. Applications • technology is still in development • even if we are willing to pay... • machine translation has been worked on since after World War II (1950s) • still not perfected today • why? • what are the properties of human languages that make it hard?

  42. Natural Language Properties • which properties are going to be difficult for computers to deal with? • grammar (Rules for putting words together into sentences) • How many rules are there? • 100, 1000, 10000, more … • Portions learnt or innate • Do we have all the rules written down somewhere? • lexicon (Dictionary) • How many words do we need to know? • 1000, 10000, 100000 … • meaning and inference (semantic interpretation, commonsense world knowledge)
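
What "rules for putting words together" plus a lexicon look like to a program can be sketched with a toy grammar. Every rule and word below is an illustrative assumption, far smaller than anything needed for real English, and the function name expand is hypothetical.

```python
from itertools import product

# Toy grammar and lexicon; every rule and word here is an illustrative assumption.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": ["the", "a"],
    "N": ["dog", "cat"],
    "V": ["sees", "sleeps"],
}


def expand(symbol):
    """Return all word sequences (as lists) derivable from symbol."""
    if symbol in LEXICON:
        return [[word] for word in LEXICON[symbol]]
    sentences = []
    for right_hand_side in RULES[symbol]:
        parts = [expand(child) for child in right_hand_side]
        for combo in product(*parts):
            sentences.append([w for part in combo for w in part])
    return sentences


sentences = [" ".join(words) for words in expand("S")]
print(len(sentences), "sentences, e.g.:", sentences[0])
```

Three rule sets and six words already yield 40 sentences, including odd ones such as "the dog sleeps the cat", which hints at why the number of rules needed (and the exceptions to them) is hard to pin down.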

  43. Computers vs. Humans • Knowledge of language • Computers are way faster than humans • They kill us at arithmetic and chess, and now Jeopardy as well … • But human beings are so good at language, we often take our ability for granted • Processed without conscious thought • Do pretty complex things
