
Introduction & Tokenization




  1. Introduction & Tokenization Ling570 Shallow Processing Techniques for NLP September 28, 2011

  2. Roadmap • Course Overview • Tokenization • Homework #1

  3. Course Overview

  4. Course Information • Course web page: • http://courses.washington.edu/ling570 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised • Catalyst tools: • GoPost discussion board for class issues • CollectIt Dropbox for homework submission and TA comments • Gradebook for viewing all grades

  5. GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run? • Key location for class participation • Post questions or answers • Your discussion space: Sanghoun & I will not jump in often

  6. GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! • Mechanics: • Please use your UW NetID as your user id • Please post early and often! • Don’t wait until the last minute • Notifications: • Decide how you want to receive GoPost postings

  7. Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost • Please send email from your UW account • Include Ling570 in the subject • If you don’t receive a reply in 24 hours, please follow-up

  8. Homework Submission • All homework should be submitted through CollectIt • tar cvf hw1.tar hw1_dir • Homework due 11:45 Wednesdays • Late homework receives 10%/day penalty (incremental) • Most major programming languages accepted • C/C++/C#, Java, Python, Perl, Ruby • If you want to use something else, please check first • Please follow naming, organization guidelines in HW • Expect to spend 10-20 hours/week, including HW docs

  9. Grading • Assignments: 90% • Class participation: 10% • No midterm or final exams • Grades in Catalyst Gradebook • TA feedback returned through CollectIt • Incomplete: only if all work completed up to the last two weeks • UW policy

  10. Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class • Please remind me to: • Record the meeting (look for the red dot) • Repeat in-class questions • Note: Instructor’s screen is projected in class • Assume that chat window is always public

  11. Contact Info • Gina: Email: levow@uw.edu • Office hour: • Fridays: 10-11 (before Treehouse meeting) • Location: Padelford B-201 • Or by arrangement • Available by Skype or Adobe Connect • All DL students should arrange a short online meeting • TA: Sanghoun Song: Email: sanghoun@uw.edu • Office hour: Time: TBD, see GoPost • Location:

  12. Online Option • Please check you are registered for correct section • CLMS online: Section A • State-funded: Section B • CLMS in-class: Section C • NLT/SCE online (or in-class): Section D • Online attendance for in-class students • Not more than 3 times per term (e.g. missed bus, ice) • Please enter meeting room 5-10 minutes before start of class • Try to stay online throughout class

  13. Online Tip • If you see: • You are not logged into Connect. The problem is one of the following: the permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. • you do not have a Connect account but need to have one. For UWEO students: • If you have just created your UW NetID or just enrolled in a course • ….. • Clear your cache, close and restart your browser

  14. Course Description

  15. Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/… • Operating Systems: Basic Unix/Linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming, … • Automata, regular expressions, … • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, …

  16. Textbook • Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 • Available from UW Bookstore, Amazon, etc. • Reference: Manning and Schütze, Foundations of Statistical Natural Language Processing

  17. Topics in Ling570 • Unit #1: Formal Languages and Automata (2-3 weeks) • Formal languages • Finite-state Automata • Finite-state Transducers • Morphological analysis • Unit #2: Ngram Language Models and HMMs • Ngram Language Models and Smoothing • Part-of-speech (POS) tagging: • HMM • Ngram

  18. Topics in Ling570 • Unit #3: Classification (2-3 weeks) • Intro to classification • POS tagging with classifiers • Chunking • Named Entity (NE) recognition • Other topics (2 weeks) • Intro, tokenization • Clustering • Information Extraction • Summary

  19. Roadmap • Motivation: • Applications • Language and Thought • Knowledge of Language • Cross-cutting themes • Ambiguity, Evaluation, & Multi-linguality • Course Overview

  20. Motivation: Applications • Applications of Speech and Language Processing • Call routing • Information retrieval • Question-answering • Machine translation • Dialog systems • Spam tagging • Spell- and grammar-checking • Sentiment Analysis • Information extraction…

  21. Shallow vs Deep Processing • Shallow processing (Ling 570) • Usually relies on surface forms (e.g., words) • Less elaborate linguistic representations • E.g. Part-of-speech tagging; Morphology; Chunking

  22. Shallow vs Deep Processing • Shallow processing (Ling 570) • Usually relies on surface forms (e.g., words) • Less elaborate linguistic representations • E.g. Part-of-speech tagging; Morphology; Chunking • Deep processing (Ling 571) • Relies on more elaborate linguistic representations • Deep syntactic analysis (Parsing) • Rich spoken language understanding (NLU)

  23. Shallow or Deep? • Applications of Speech and Language Processing • Call routing • Information retrieval • Question-answering • Machine translation • Dialog systems • Spam tagging • Spell- and grammar-checking • Sentiment Analysis • Information extraction…

  24. Language & Intelligence • Turing Test (Turing 1950): Operationalize intelligence • Two contestants: human, computer • Judge: human • Test: Interact via text questions • Question: Can you tell which contestant is human? • Crucially requires language use and understanding

  25. Limitations of Turing Test • ELIZA (Weizenbaum 1966) • Simulates Rogerian therapist • User: You are like my father in some ways • ELIZA: WHAT RESEMBLANCE DO YOU SEE • User: You are not very aggressive • ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE…

  26. Limitations of Turing Test • ELIZA (Weizenbaum 1966) • Simulates Rogerian therapist • User: You are like my father in some ways • ELIZA: WHAT RESEMBLANCE DO YOU SEE • User: You are not very aggressive • ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... • Passes the Turing Test!! (sort of)

  27. Limitations of Turing Test • ELIZA (Weizenbaum 1966) • Simulates Rogerian therapist • User: You are like my father in some ways • ELIZA: WHAT RESEMBLANCE DO YOU SEE • User: You are not very aggressive • ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... • Passes the Turing Test!! (sort of) • “You can fool some of the people....”

  28. Limitations of Turing Test • ELIZA (Weizenbaum 1966) • Simulates Rogerian therapist • User: You are like my father in some ways • ELIZA: WHAT RESEMBLANCE DO YOU SEE • User: You are not very aggressive • ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE... • Passes the Turing Test!! (sort of) • “You can fool some of the people....” • Simple pattern matching technique • Very shallow processing
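The "simple pattern matching technique" on this slide can be illustrated with a minimal Python sketch. The two rules and the fallback reply below are hypothetical, written only to reproduce the exchange quoted on the slide; they are not Weizenbaum's original script, and a real ELIZA script has many more rules plus pronoun swapping.

```python
import re

# Hypothetical rules for illustration only (not Weizenbaum's original script).
# Each rule pairs a regular expression with a reply template.
RULES = [
    (re.compile(r"\byou are like (.+)", re.IGNORECASE),
     "WHAT RESEMBLANCE DO YOU SEE"),
    # The optional intensifier is skipped so only the adjective is echoed back.
    (re.compile(r"\byou are not (?:very |really )?(.+)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM NOT {0}"),
]

# Assumed fallback used when no rule matches.
DEFAULT_REPLY = "PLEASE GO ON"


def respond(utterance):
    """Return the reply of the first rule whose pattern matches the utterance."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured text back in upper case, as in the slide.
            return template.format(*(g.upper() for g in match.groups()))
    return DEFAULT_REPLY


if __name__ == "__main__":
    print(respond("You are like my father in some ways"))
    print(respond("You are not very aggressive"))
```

Nothing here models meaning: the reply is chosen purely from the surface string, which is why the slide calls this very shallow processing.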

  29. Turing Test Revived • “On the web, no one knows you’re a….” • Problem: ‘bots’ • Automated agents swamp services • Challenge: Prove you’re human • Test: Something human can do, ‘bot can’t

  30. Turing Test Revived • “On the web, no one knows you’re a….” • Problem: ‘bots’ • Automated agents swamp services • Challenge: Prove you’re human • Test: Something human can do, ‘bot can’t • Solution: CAPTCHAs • Distorted images: trivial for human; hard for ‘bot

  31. Turing Test Revived • “On the web, no one knows you’re a….” • Problem: ‘bots’ • Automated agents swamp services • Challenge: Prove you’re human • Test: Something human can do, ‘bot can’t • Solution: CAPTCHAs • Distorted images: trivial for human; hard for ‘bot • Key: Perception, not reasoning

  32. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that.

  33. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Phonetics & Phonology (Ling 450/550) • Sounds of a language, acoustics • Legal sound sequences in words

  34. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Morphology (Ling 570) • Recognize, produce variation in word forms • Singular vs. plural: Door + sg: -> door; Door + plural -> doors • Verb inflection: Be + 1st person, sg, present -> am

  35. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Part-of-speech tagging (Ling 570) • Identify word use in sentence • Bay (Noun) --- Not verb, adjective

  36. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Syntax • (Ling 566: analysis; Ling 570 – chunking; Ling 571- parsing) • Order and group words in sentence • I’m I do , sorry that afraid Dave I can’t.

  37. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. • HAL: I'm sorry, Dave. I'm afraid I can't do that. • Semantics (Ling 571) • Word meaning: • individual (lexical), combined (compositional) • ‘Open’ : AGENT cause THEME to become open; • ‘pod bay doors’ : (pod bay) doors

  38. Knowledge of Language • What does HAL (of 2001, A Space Odyssey) need to know to converse? • Dave: Open the pod bay doors, HAL. (request) • HAL: I'm sorry, Dave. I'm afraid I can't do that. (statement) • Pragmatics/Discourse/Dialogue (Ling 571, maybe) • Interpret utterances in context • Speech act (request, statement) • Reference resolution: I = HAL; that = ‘open doors’ • Politeness: I’m sorry, I’m afraid I can’t

  39. Cross-cutting Themes • Ambiguity • How can we select among alternative analyses? • Evaluation • How well does this approach perform: • On a standard data set? • When incorporated into a full system? • Multi-linguality • Can we apply this approach to other languages? • How much do we have to modify it to do so?

  40. Ambiguity • “I made her duck” • Means....

  41. Ambiguity • “I made her duck” • Means.... • I caused her to duck down

  42. Ambiguity • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has

  43. Ambiguity • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her

  44. Ambiguity • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her • I cooked the duck she owned

  45. Ambiguity • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her • I cooked the duck she owned • I magically turned her into a duck

  46. Ambiguity: POS • “I made her duck” • Means.... • I caused her to duck down • I made the (carved) duck she has • I cooked duck for her • I cooked the duck she owned • I magically turned her into a duck • POS tags: her = Poss or Pron; duck = V or N
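One way to see how part-of-speech ambiguity multiplies is to enumerate every tag sequence a lexicon allows. The sketch below assumes a hypothetical toy lexicon that uses the tag names from the slide (V, N, Poss, Pron); it only lists the candidates and does not choose among them, which is the job of the taggers covered later in the course.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "I": ["Pron"],
    "made": ["V"],
    "her": ["Poss", "Pron"],
    "duck": ["N", "V"],
}


def tag_sequences(words):
    """Return every tagging of the sentence that the lexicon permits."""
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


if __name__ == "__main__":
    for tagging in tag_sequences("I made her duck".split()):
        print(tagging)
```

With two choices each for "her" and "duck", the four-word sentence already has four candidate taggings, and the count grows multiplicatively with every additional ambiguous word.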

  47. Ambiguity: Syntax • “I made her duck” • Means.... • I made the (carved) duck she has • (VP (V made) (NP (POSS her) (N duck)))

  48. Ambiguity: Syntax • “I made her duck” • Means.... • I made the (carved) duck she has • (VP (V made) (NP (POSS her) (N duck))) • I cooked duck for her • (VP (V made) (NP (PRON her)) (NP (N duck)))

  49. Ambiguity • Pervasive • Pernicious • Particularly challenging for computational systems • Problem we will return to again and again in class

  50. Tokenization • Given input text, split into words or sentences • Tokens: words, numbers, punctuation • Example: • Sherwood said reaction has been "very positive." • Sherwood said reaction has been " very positive . " • Why tokenize?
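The Sherwood example can be reproduced with a minimal regular-expression tokenizer in Python. This is a sketch under a simplifying assumption, namely that a token is either a run of word characters or a single punctuation mark; it is not the tokenizer specified for the homework.

```python
import re

# Assumed token definition: a run of word characters (letters, digits,
# underscore), or any single character that is neither word nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    sentence = 'Sherwood said reaction has been "very positive."'
    print(" ".join(tokenize(sentence)))
```

The rule is deliberately naive: it splits "can't" into three tokens and "3.14" into "3", ".", "14", so contractions, decimals, and abbreviations need extra handling in a real tokenizer.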
