
Seminar: Efficient NLP Session 2, NLP behind Broccoli

Seminar: Efficient NLP Session 2, NLP behind Broccoli. November 2nd, 2011. Elmar Haußmann, Chair for Algorithms and Data Structures, Department of Computer Science, University of Freiburg. Agenda: Motivation and Problem Definition; Rule based Approach; Machine Learning based Approach


Presentation Transcript


  1. Seminar: Efficient NLP Session 2, NLP behind Broccoli • November 2nd, 2011 • Elmar Haußmann • Chair for Algorithms and Data Structures • Department of Computer Science • University of Freiburg

  2. Agenda • Motivation and Problem Definition • Rule based Approach • Machine Learning based Approach • Conclusion / Current and Future Work Nov. 2, 2011 2 NLP behind Broccoli

  3. Motivation and Problem Definition • The idea of semantic full-text search: search in full text, but combined with “structured information” • Broccoli performs the following NLP tasks: • Entity recognition: based on the links inside Wikipedia articles and heuristics • Anaphora resolution: based on simple, yet efficient heuristics • Contextual Sentence Decomposition: this talk

  4. Motivation and Problem Definition • The motivation for Contextual Sentence Decomposition, the “heavy” NLP task behind Broccoli • Example query: plant edible leaves • Result sentence: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  5. Motivation and Problem Definition • Many false positives are caused by words that appear in the same sentence but belong to a different context • Apply natural language processing to decompose the sentence based on context and search the resulting “sentences” independently • Result sentence: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  6. Motivation and Problem Definition • Original sentence: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.” • Decomposed sentence: • The usable parts of rhubarb are the medicinally used roots • The usable parts of rhubarb are the edible stalks • its leaves are toxic
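
The effect of the decomposition on the example query can be sketched in a few lines of Python. This is only an illustration, not Broccoli's search: it assumes plain keyword co-occurrence, uses just the words "edible" and "leaves" (matching "plant" would require the entity recognition step, which is not modelled here), and the helper names `words` and `matches` are made up for this sketch.

```python
import re

def words(text):
    """Return the set of lower-cased words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(query_words, text):
    """True if every query word occurs somewhere in the text."""
    return set(query_words) <= words(text)

original = ("The usable parts of rhubarb are the medicinally used roots "
            "and the edible stalks, however its leaves are toxic.")
decomposed = [
    "The usable parts of rhubarb are the medicinally used roots",
    "The usable parts of rhubarb are the edible stalks",
    "its leaves are toxic",
]
query = ["edible", "leaves"]

# The full sentence contains both words, so it is a (false-positive) hit.
hit_original = matches(query, original)
# No single sub-sentence contains both words, so the hit disappears.
hit_decomposed = any(matches(query, s) for s in decomposed)
print(hit_original, hit_decomposed)
```

Searching each sub-sentence independently removes the match because "edible" and "leaves" end up in different contexts.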

  7. Motivation and Problem Definition: Problem Definition • Contextual Sentence Decomposition is the process of performing (1) Sentence Constituent Identification, followed by (2) Sentence Constituent Recombination

  8. Motivation and Problem Definition: Sentence Constituent Identification • Identify specific parts of the sentence • Differentiate 4 types of constituents: • Relative clauses: “Albert Einstein, who was born in Ulm, ...” • Appositions: “Albert Einstein, a well-known scientist, ...” • List items: “Albert Einstein published papers on Brownian motion, the photoelectric effect and special relativity.” • Separators: “Albert Einstein was recognized as a leading scientist and in 1921 he received the Nobel Prize in Physics.”

  9. Motivation and Problem Definition • Original sentence with identified constituents: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.” • Highlighting legend: list item, separator

  10. Motivation and Problem Definition: Sentence Constituent Recombination • Recombine identified constituents into sub-sentences • Split sentences at separators • Attach relative clauses and appositions to the noun (phrase) they describe • Apply the “distributive law” to list items

  11. Motivation and Problem Definition • Original sentence: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.” • Decomposed sentence: • its leaves are toxic • The usable parts of rhubarb are the medicinally used roots • The usable parts of rhubarb are the edible stalks
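
A minimal sketch of the recombination step, assuming the constituents have already been identified by hand. The input format (a list of clauses already split at the separator, where a nested list stands for list items) and the function name `recombine` are assumptions made for this example. Only separator splitting and the "distributive law" are shown; attaching relative clauses and appositions is left out.

```python
from itertools import product

def recombine(clauses):
    """Build sub-sentences from clauses that were split at separators.

    Each clause is a list of parts. A part is either a plain string or a
    list of alternative list items; the "distributive law" emits one
    sub-sentence per combination of list items.
    """
    result = []
    for clause in clauses:
        # Treat a plain string as a slot with exactly one alternative.
        slots = [part if isinstance(part, list) else [part] for part in clause]
        for combination in product(*slots):
            result.append(" ".join(combination))
    return result

# Hand-annotated version of the rhubarb sentence (separator already removed).
annotated = [
    ["The usable parts of rhubarb are",
     ["the medicinally used roots", "the edible stalks"]],
    ["its leaves are toxic"],
]
sub_sentences = recombine(annotated)
for s in sub_sentences:
    print(s)
```

Using a Cartesian product means a clause with two independent enumerations would yield every pairing, which is one possible reading of the distributive law.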

  12. Motivation and Problem Definition: Remarks • Given the identified constituents, recombination is comparatively simple; identification is the challenging part • Constituents may be nested, e.g. a relative clause can contain an enumeration • The resulting sub-sentences are often grammatically correct but are not required to be • The approach must be feasible in terms of efficiency (English Wikipedia: ~30 GB of raw text)

  13. Motivation and Problem Definition: And… Natural Language is Tricky • Ambiguous, even for humans: • “Time flies like an arrow; fruit flies like a banana.” • “Flying planes can be dangerous.” • “I once shot an elephant in my pajamas. How he got into my pajamas, I'll never know.” • Focus: the large part of less complicated sentences

  14. Motivation and Problem Definition: …Natural Language is Tricky • Even if the meaning is clear to a human: arbitrarily deep nesting and syntactic ambiguity • Difficult sentence: “Panofsky was known to be friends with Wolfgang Pauli, one of the main contributors to quantum physics and atomic theory, as well as Albert Einstein, born in Ulm and famous for his discovery of the law of the photoelectric effect and theories of relativity.” • Apposition similar to an element of an enumeration • Relative clause contains an enumeration and starts in reduced form

  15. Agenda • Motivation and Problem Definition • Rule based Approach • Machine Learning based Approach • Conclusion / Current and Future Work

  16. Rule based Approach: Idea • Devise hand-crafted rules by closely inspecting sentence structure • Example: a relative clause is set off by a comma, starts with the word “who” and extends to the next comma • Sentence containing a relative clause: “Kofi Annan, who is the current U.N. Secretary General, has spent much of his tenure working to promote peace in the Third World.”
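
The example rule can be written down almost literally as a regular expression. This is a sketch of that single rule only, not the rule set used in Broccoli; the pattern and the function name `find_who_clauses` are assumptions for illustration.

```python
import re

# A comma, then a clause starting with the word "who", up to the next comma.
REL_CLAUSE = re.compile(r",\s*(who\b[^,]*),")

def find_who_clauses(sentence):
    """Return all "who" relative clauses that are set off by commas."""
    return [m.group(1).strip() for m in REL_CLAUSE.finditer(sentence)]

sentence = ("Kofi Annan, who is the current U.N. Secretary General, has spent "
            "much of his tenure working to promote peace in the Third World.")
clauses = find_who_clauses(sentence)
print(clauses)
```

The rule breaks as soon as the relative clause itself contains a comma (for example a nested enumeration), which is the nesting problem mentioned in the remarks.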

  17. Rule based Approach: Basic Approach • Identify “stop-words” • For each marked word, decide if it starts a constituent and which one • Determine the corresponding constituent ends • Original sentence with marked stop-words: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  18. Rule based Approach: Determine Constituent Starts • Original sentence with identified stop-words: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  19. Rule based Approach: Determine Constituent Starts • If a verb follows but a noun precedes it: separator • Original sentence with identified separator: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  20. Rule based Approach: Determine Constituent Starts • If a verb follows but a noun precedes it: separator • If it is not a relative clause or apposition: the next word is a list item start • Original sentence with identified list item start: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  21. Rule based Approach: Determine Constituent Starts • If a verb follows but a noun precedes it: separator • If it is not a relative clause or apposition: the next word is a list item start • The first list item starts at the noun phrase preceding the already discovered list item start • Original sentence with all identified list item starts: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  22. Rule based Approach: Determine Constituent Ends • For each start assign a matching end • Original sentence with all identified list item starts: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  23. Rule based Approach: Determine Constituent Ends • For each start assign a matching end • A list item extends to the next constituent start or the sentence end • Original sentence with identified constituents: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  24. Rule based Approach: Determine Constituent Ends • For each start assign a matching end • A list item extends to the next constituent start or the sentence end • Original sentence with identified constituents: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”
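
The start and end rules from the last slides can be sketched as follows. Several things here are assumptions and not taken from the talk: the stop-word list, the hand-assigned part-of-speech tags (a real system would run a tagger), the restriction of "a verb follows" to a finite verb before the next stop-word, and the tag set used to walk back over a noun phrase. Relative clauses and appositions are not handled.

```python
# Hand-assigned Penn-Treebank-style tags (assumption; normally from a tagger).
TAGGED = [
    ("The", "DT"), ("usable", "JJ"), ("parts", "NNS"), ("of", "IN"),
    ("rhubarb", "NN"), ("are", "VBP"), ("the", "DT"), ("medicinally", "RB"),
    ("used", "VBN"), ("roots", "NNS"), ("and", "CC"), ("the", "DT"),
    ("edible", "JJ"), ("stalks", "NNS"), (",", ","), ("however", "RB"),
    ("its", "PRP$"), ("leaves", "NNS"), ("are", "VBP"), ("toxic", "JJ"),
]
STOP_WORDS = {",", "and", "or", "but", "however"}    # assumed list
FINITE_VERBS = {"VBP", "VBZ", "VBD", "MD"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
NP_TAGS = NOUNS | {"DT", "JJ", "RB", "VBN", "PRP$"}  # crude noun-phrase test

def stop_word_runs(tagged):
    """Return (start, end) token spans of maximal runs of stop-words."""
    runs, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][0].lower() in STOP_WORDS:
            j += 1
        if j > i:
            runs.append((i, j))
            i = j
        else:
            i += 1
    return runs

def identify(tagged):
    """Return (separators, list_items) as lists of (start, end) token spans."""
    runs = stop_word_runs(tagged)
    separators, item_starts = [], []
    for k, (s, e) in enumerate(runs):
        nxt = runs[k + 1][0] if k + 1 < len(runs) else len(tagged)
        following = [tag for _, tag in tagged[e:nxt]]
        noun_before = s > 0 and tagged[s - 1][1] in NOUNS
        if noun_before and any(t in FINITE_VERBS for t in following):
            separators.append((s, e))        # a verb follows, a noun precedes
            continue
        item_starts.append(e)                # next word starts a list item
        first = s                            # walk back over the noun phrase
        while first > 0 and tagged[first - 1][1] in NP_TAGS:
            first -= 1
        if first < s and first not in item_starts:
            item_starts.append(first)        # start of the first list item
    # A list item ends at the next stop-word run, item start or sentence end.
    limits = sorted([s for s, _ in runs] + item_starts + [len(tagged)])
    items = [(st, min(l for l in limits if l > st)) for st in sorted(item_starts)]
    return separators, items

def text(span):
    """Join the tokens of a (start, end) span back into a string."""
    return " ".join(word for word, _ in TAGGED[span[0]:span[1]])

separators, items = identify(TAGGED)
print([text(s) for s in separators], [text(i) for i in items])
```

On the rhubarb sentence this marks ", however" as the separator and the two noun phrases around "and" as list items, which is the input the recombination step needs.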

  25. Agenda • Motivation and Problem Definition • Rule based Approach • Machine Learning based Approach • Conclusion / Current and Future Work

  26. Machine Learning based Approach: Idea • Use supervised learning to train classifiers that identify the start and end of constituents • Train Support Vector Machines for each constituent start and end • Original sentence: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  27. Machine Learning based Approach: Basic Approach • Apply classifiers in turn to each word: 1. Apply separator classifier 2. Apply list item start classifier 3. Apply list item end classifier • Ideally this would already give a correct solution
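
How one such classifier is applied to each word can be sketched with a linear decision function, which is what a trained linear SVM evaluates at prediction time. The talk does not state the features or the kernel, so everything concrete below is hypothetical: the window features, the hand-set weights standing in for a trained model, and the names `token_features` and `decide`.

```python
def token_features(tokens, i):
    """Sparse features for token i: the word itself and its two neighbours."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {f"word={tokens[i].lower()}", f"prev={prev_word}", f"next={next_word}"}

def decide(weights, bias, features):
    """Linear decision function: sum of active feature weights plus bias > 0."""
    return sum(weights.get(f, 0.0) for f in features) + bias > 0

# Hypothetical weights for a "list item start" classifier (not trained).
LIST_ITEM_START = ({"prev=and": 1.5, "word=the": 0.2}, -1.0)

tokens = "the medicinally used roots and the edible stalks".split()
weights, bias = LIST_ITEM_START
starts = [i for i in range(len(tokens))
          if decide(weights, bias, token_features(tokens, i))]
print(starts)
```

Each decision looks only at a small window around one word, so nothing guarantees that the predicted starts and ends pair up into a valid structure.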

  28. Machine Learning based Approach • However, classifiers are not perfect • Some additional ends and beginnings might be identified • Decisions are local and do not consider admissible constituent structure

  29. Machine Learning based Approach • Train classifiers that identify whether a span of the sentence denotes a valid constituent (e.g. apply list item classifier) • Still, identified constituents might overlap • Structural constraints must be satisfied

  30. Machine Learning based Approach • Reduce to the maximum weight independent set (MWIS) problem • Determine the MWIS using enumeration, or a greedy approach for large problem sizes
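
A sketch of the reduction, under assumptions the slides do not spell out: candidate constituents are (start, end) token spans weighted by classifier confidence, two candidates conflict when they overlap without one containing the other (nesting is allowed), and the switch-over size `exact_limit` is arbitrary. All function names are made up for this example.

```python
from itertools import combinations

def conflict(a, b):
    """Two spans conflict if they overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    overlap = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return overlap and not nested

def independent(spans):
    """True if no pair of spans in the collection conflicts."""
    return all(not conflict(a, b) for a, b in combinations(spans, 2))

def mwis_exact(candidates):
    """Enumerate all subsets; exponential, so only for few candidates."""
    spans = list(candidates)
    best, best_weight = [], 0.0
    for r in range(1, len(spans) + 1):
        for subset in combinations(spans, r):
            weight = sum(candidates[s] for s in subset)
            if weight > best_weight and independent(subset):
                best, best_weight = list(subset), weight
    return sorted(best)

def mwis_greedy(candidates):
    """Take spans by descending weight, skipping any that conflict."""
    chosen = []
    for span in sorted(candidates, key=candidates.get, reverse=True):
        if all(not conflict(span, c) for c in chosen):
            chosen.append(span)
    return sorted(chosen)

def mwis(candidates, exact_limit=15):
    """Exact enumeration for small inputs, greedy approximation otherwise."""
    if len(candidates) <= exact_limit:
        return mwis_exact(candidates)
    return mwis_greedy(candidates)

# Token spans in the rhubarb sentence; (9, 12) is a wrong local classification.
candidates = {(6, 10): 0.9, (11, 14): 0.8, (9, 12): 0.6}
selected = mwis(candidates)
print(selected)
```

The greedy variant is only an approximation: with weights `{(2, 6): 1.0, (0, 3): 0.7, (5, 8): 0.7}` it keeps the single heaviest span, while enumeration finds the heavier pair of outer spans.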

  31. Machine Learning based Approach • Final result adheres to structural constraints • More resistant to wrong “local” classifications • Original sentence with identified constituents: “The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.”

  32. Agenda • Motivation and Problem Definition • Rule based Approach • Machine Learning based Approach • Conclusion / Current and Future Work

  33. Evaluation / Conclusion: Evaluation on three levels • Compare identification using a ground truth • Compare resulting decomposition using a ground truth • Evaluate influence on search quality against ground truth
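
For the first level, one plausible way to compare identified constituents with a ground truth is exact-match precision and recall. The slides do not say which measure was used at this level, so the matching criterion, the (type, start, end) representation, and the example data below are assumptions.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall of predicted items against gold."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Constituents as (type, start token, end token); made-up example data.
gold = {("list_item", 6, 10), ("list_item", 11, 14), ("separator", 14, 16)}
predicted = gold | {("list_item", 9, 12)}   # one spurious constituent

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

The same function applies to the other two levels by changing what counts as an item: sub-sentences for the decomposition, result hits for search quality.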

  34. Evaluation / Conclusion: Results • Rule based approach viable, clear improvement • Machine Learning based approach viable, currently less effective • Search quality increases depend on the exact query, but go up to doubling precision, with hardly any loss in recall • Contextual Sentence Decomposition is an integral part of Semantic Full-Text Search

  35. Evaluation / Conclusion: Current Work • Increasing the quality of decomposition by: • efficient additional NLP (deep parsers? …) • improvements of rules • a better understanding of what extent of decomposition is reasonable and necessary

  36. Thank you for your attention!
