SIMS 290-2: Applied Natural Language Processing

Marti Hearst, August 30, 2004.



  1. SIMS 290-2: Applied Natural Language Processing Marti Hearst August 30, 2004

  2. Today • Motivation: SIMS student projects • Course Goals • Why NLP is difficult • How to solve it? Corpus-based statistical approaches • What we’ll do in this course

  3. ANLP Motivation: SIMS Masters Projects • Breaking Story (2002) • Summarize trends in news feeds • Needs categories and entities assigned to all news articles http://dream.sims.berkeley.edu/newshound/ • BriefBank (2002) • System for entering legal briefs • Needs a topic category system for browsing http://briefbank.samuelsonclinic.org/ • Chronkite (2003) • Personalized RSS feeds • Needs categories and entities assigned to all web pages • Paparrazi (2004) • Analysis of blog activity • Needs categories assigned to blog content

  4. Goals of this Course • Learn about the problems and possibilities of natural language analysis: • What are the major issues? • What are the major solutions? • How well do they work? • How do they work? (But to a lesser extent than CS 295-4) • At the end you should: • Agree that language is subtle and interesting! • Feel some ownership over the algorithms • Be able to assess NLP problems • Know which solutions to apply when, and how • Be able to read papers in the field

  5. Today • Motivation: SIMS student projects • Course Goals • Why NLP is difficult • How to solve it? Corpus-based statistical approaches • What we’ll do in this course

  6. We’ve passed the year 2001, but we are not close to realizing the dream (or nightmare …)

  7. Dave Bowman: “Open the pod bay doors, HAL” HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

  8. Why is NLP difficult? • Computers are not brains • There is evidence that much of language understanding is built into the human brain • Computers do not socialize • Much of language is about communicating with people • Key problems: • Representation of meaning • Language presupposes knowledge about the world • Language only reflects the surface of meaning • Language presupposes communication between people

  9. Hidden Structure • English plural pronunciation • Toy + s → toyz ; add z • Book + s → books ; add s • Church + s → churchiz ; add iz • Box + s → boxiz ; add iz • Sheep + s → sheep ; add nothing • What about new words? • Bach + ’s → boxs ; why not boxiz? Adapted from Robert Berwick's 6.863J
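
The pattern on this slide can be sketched as a rule keyed on the final sound of the stem rather than on its spelling. This is a minimal sketch under stated assumptions: the sound labels in `SIBILANTS` and `VOICELESS`, the `IRREGULAR` table, and the function name `plural_suffix` are all illustrative and do not come from the slides, and the caller is assumed to supply the final sound by hand.

```python
# Illustrative sound classes; the labels are an ad hoc notation, not IPA.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}   # hissing sounds: add "iz"
VOICELESS = {"p", "t", "k", "f", "th"}          # voiceless non-sibilants: add "s"
IRREGULAR = {"sheep": ""}                       # memorized exception: add nothing


def plural_suffix(word, final_sound):
    """Return the spoken plural suffix for a word, given its final sound."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if final_sound in SIBILANTS:
        return "iz"
    if final_sound in VOICELESS:
        return "s"
    return "z"  # vowels and voiced consonants


examples = [("toy", "oy"), ("book", "k"), ("church", "ch"),
            ("box", "s"), ("sheep", "p"), ("Bach", "k")]
for word, sound in examples:
    print(word, "+", repr(plural_suffix(word, sound)))
```

Under this sketch a new word such as Bach, treated here as ending in a k-like sound, takes "s" rather than "iz", because the choice depends on the sound and not on the letters.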

  10. Language subtleties • Adjective order and placement • A big black dog • A big black scary dog • A big scary dog • A scary big dog • A black big dog • Antonyms • Which sizes go together? • Big and little • Big and small • Large and small • Large and little

  11. World Knowledge is subtle • He arrived at the lecture. • He chuckled at the lecture. • He arrived drunk. • He chuckled drunk. • He chuckled his way through the lecture. • He arrived his way through the lecture. Adapted from Robert Berwick's 6.863J

  12. Words are ambiguous (have multiple meanings) • I know that. • I know that block. • I know that blocks the sun. • I know that block blocks the sun. Adapted from Robert Berwick's 6.863J

  13. Headline Ambiguity • Iraqi Head Seeks Arms • Juvenile Court to Try Shooting Defendant • Teacher Strikes Idle Kids • Kids Make Nutritious Snacks • British Left Waffles on Falkland Islands • Red Tape Holds Up New Bridges • Bush Wins on Budget, but More Lies Ahead • Hospitals are Sued by 7 Foot Doctors Adapted from Robert Berwick's 6.863J

  14. The Role of Memorization • Children learn words quickly • As many as 9 words/day • Often only need one exposure to associate meaning with word • Can make mistakes, e.g., overgeneralization “I goed to the store.” • Exactly how they do this is still under study

  15. The Role of Memorization • Dogs can do word association too! • Rico, a border collie in Germany • Knows the names of each of 100 toys • Can retrieve items called out to him with over 90% accuracy. • Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child. http://www.nature.com/news/2004/040607/pf/040607-8_pf.html

  16. But there is too much to memorize! • establish • establishment: the Church of England as the official state church • disestablishment • antidisestablishment • antidisestablishmentarian • antidisestablishmentarianism is a political philosophy that is opposed to the separation of church and state. Adapted from Robert Berwick's 6.863J

  17. Rules and Memorization • Current thinking in psycholinguistics is that we use a combination of rules and memorization • However, this is very controversial • Mechanism: • If there is an applicable rule, apply it • However, if there is a memorized version, that takes precedence. (Important for irregular words.) • Artists paint “still lifes” • Not “still lives” • Past tense of • think → thought • blink → blinked • This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.
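
The mechanism on this slide (a memorized form takes precedence, otherwise the rule applies) can be sketched in a few lines. This is only an illustration of that lookup order, not a model from the psycholinguistics literature; the table `IRREGULAR_PAST`, the function name `past_tense`, and the simplified "add -ed" rule are assumptions made for the example.

```python
# A tiny illustrative sample of memorized irregular forms.
IRREGULAR_PAST = {"think": "thought", "go": "went"}


def past_tense(verb, memory=IRREGULAR_PAST):
    """Return the past tense: a memorized form wins, else apply the rule."""
    if verb in memory:
        return memory[verb]
    # Simplified regular rule: add "d" after a final "e", otherwise "ed".
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"


print(past_tense("think"))          # memorized form
print(past_tense("blink"))          # regular rule
print(past_tense("go", memory={}))  # no memorized form: the rule overgeneralizes
```

With an empty memory the rule alone produces "goed", the kind of overgeneralization children make before the irregular form is memorized.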

  18. Representation of Meaning • I know that block blocks the sun. • How do we represent the meanings of “block”? • How do we represent “I know”? • How does that differ from “I know that.”? • Who is “I”? • How do we indicate that we are talking about earth’s sun vs. some other planet’s sun? • When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?

  19. How to tackle these problems? • The field was stuck for quite some time. • A new approach started around 1990 • Well, not really new, but the first time around, in the 50’s, they didn’t have the text, disk space, or GHz • Main idea: combine memorizing and rules • How to do it: • Get large text collections (corpora) • Compute statistics over the words in those collections • Surprisingly effective • Even better now with the Web
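
As a rough illustration of "compute statistics over the words", the simplest statistic is a word frequency count. The sketch below uses only the Python standard library; the regular-expression tokenizer and the one-line sample text are assumptions standing in for a real tokenizer and a real corpus.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, pull out alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# A toy "corpus"; a real one would be millions of words read from files.
corpus = "I know that block blocks the sun. I know that."
counts = word_counts(corpus)
print(counts["know"], counts["block"], counts["blocks"])
```

Note that this naive count treats "block" and "blocks" as unrelated words; deciding when forms should be merged is one of the choices a corpus-based approach has to make.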

  20. Corpus-based Example: Pre-Nominal Adjective Ordering • Important for translation and generation • Examples: • big fat Greek wedding • fat Greek big wedding • Some approaches try to characterize this as semantic rules, e.g.: • Age < color, value < dimension • Data-intensive approaches • Assume adjective ordering is independent of the noun they modify • Compare how often you see {a, b} vs {b, a} Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
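
The direct-comparison idea (count how often a precedes b versus b precedes a, ignoring the noun) can be sketched as follows. This is a toy illustration, not the procedure from the cited paper: the function names and the hand-made `observed` sequences are assumptions, and a real system would take its counts from a large corpus or from web hit counts.

```python
from collections import Counter


def count_ordered_pairs(sequences):
    """Count each ordered adjective pair (a, b) where a appears before b."""
    counts = Counter()
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                counts[(a, b)] += 1
    return counts


def preferred_order(a, b, counts):
    """Return the more frequent ordering, or None on a tie or an unseen pair."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    if ab == ba:
        return None
    return (a, b) if ab > ba else (b, a)


# Hand-made adjective sequences standing in for corpus observations.
observed = [
    ["big", "fat", "Greek"],
    ["big", "black"],
    ["big", "Greek"],
    ["Greek", "big"],
    ["big", "Greek"],
]
counts = count_ordered_pairs(observed)
print(preferred_order("Greek", "big", counts))
print(preferred_order("black", "Greek", counts))
```

The second call returns None because the pair never occurs in either order in the toy data, which is exactly the unseen-pair problem that more elaborate methods try to address.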

  21. Corpus-based Example: Pre-Nominal Adjective Ordering • Data-intensive approaches • Compare how often you see {a, b} vs {b, a} • What happens when you encounter an unseen pair? • Shaw and Hatzivassiloglou ’99 use transitive closures • Malouf ’00 uses a back-off bigram model • P(<a,b>|{a,b}) vs. P(<b,a>|{a,b}) • He also uses morphological analysis, semantic similarity calculations and positional probabilities • Keller and Lapata ’04 use just the very simple algorithm • But they use the web as their training set • Gets 90% accuracy on 1000 sequences • As good as or better than the complex algorithms Keller & Lapata, “The Web as Baseline”, HLT-NAACL’04
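
To give a feel for how transitivity can cover an unseen pair, the sketch below takes a set of known orderings and adds every ordering they imply. It is only loosely in the spirit of the transitive-closure idea mentioned on this slide and should not be read as the published method; the function name `transitive_closure` and the two `known` pairs are assumptions made for the example.

```python
def transitive_closure(pairs):
    """Given pairs (a, b) meaning 'a precedes b', add every implied pair."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a before b and b before d implies a before d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Two orderings assumed to have been observed; (big, Greek) has not been seen.
known = {("big", "fat"), ("fat", "Greek")}
closure = transitive_closure(known)
print(("big", "Greek") in closure)
```

The repeated pairwise scan is quadratic per pass, which is fine for a handful of adjectives; a larger vocabulary would call for a proper graph reachability algorithm.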

  22. Real-World Applications of NLP • Spelling Suggestions/Corrections • Grammar Checking • Synonym Generation • Information Extraction • Text Categorization • Automated Customer Service • Speech Recognition (limited) • Machine Translation • In the (near?) future: • Question Answering • Improving Web Search Engine results • Automated Metadata Assignment • Online Dialogs Adapted from Robert Berwick's 6.863J

  23. NLP in the Real World • Synonym generation for • Suggesting advertising keywords • Suggesting search result refinement and expansion

  24. Synonym Generation

  25. Synonym Generation

  26. Synonym Generation

  27. Synonym Generation

  28. What We’ll Do in this Course • Read research papers and tutorials • Use NLTK (Natural Language ToolKit) to try out various algorithms • Some homeworks will be to do some NLTK exercises • Three mini-projects • Two involve a selected collection • The third is your choice, can also be on the selected collection

  29. What We’ll Do in this Course • Adopt a large text collection • Use a wide range of NLP techniques to process it • Release the results for others to use

  30. Which Text Collection?

  31. How to analyze a big collection? • Your ideas go here

  32. Python • A terrific language • Interpreted • Object-oriented • Easy to interface to other things (web, DBMS, Tk) • Good stuff from: Java, Lisp, Tcl, Perl • Easy to learn • I learned it this summer by reading Learning Python • FUN!

  33. Questions?
