
Natural Language Processing for the Web


Presentation Transcript


  1. Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, 939-7118 Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR, 939-7116 Office Hours: Thurs 12-1, 8-9

  2. Logistics • Class evaluation • Please do • If there were topics you particularly liked, please say so • If there were topics you particularly disliked, please say so • Anything you particularly liked or disliked about the class format • Project presentations • Need eight people to go first, April 29th • Not necessary to have all results • 2nd date: May 13, 7:10pm UNLESS…. • Sign up by end of class or I will sign you up: http://www.cs.columbia.edu/~kathy/NLPWeb/finalpresentations.htm

  3. Machine Reading • Goal: to read all texts on the web, extract all knowledge, and represent it in DB/KB format • DARPA program on machine reading

  4. Issues • Background theory and text facts may be inconsistent • -> probabilistic representation • Beliefs may only be implicit • -> need inference • Supervised learning not an option due to variety of relations on the web • -> IE not a valid solution • May require many steps of entailment • -> Need more general approach than textual entailment

  5. Initial Approaches • Systems that learn relations using examples (supervised) • Systems that learn how to learn patterns using a seed set: Snowball (semi-supervised) • Systems that can label their own training examples using domain-independent patterns: KnowItAll (self-supervised)
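To make the semi-supervised (Snowball-style) idea concrete, here is a minimal bootstrapping sketch: seed pairs yield surface patterns, which in turn yield new pairs. This is only a toy illustration over an in-memory corpus, not the actual Snowball system (which uses named-entity tags and pattern/tuple confidence scores); the function names, the "acquired" relation, and the example sentences are invented.

```python
# Toy bootstrapping loop in the Snowball spirit: seed pairs -> surface
# patterns -> new pairs. Illustrative only; names and data are made up.
import re

def find_patterns(corpus, seeds, max_len=4):
    """Collect the text between known (x, y) pairs as candidate patterns."""
    patterns = set()
    for x, y in seeds:
        for sent in corpus:
            if x in sent and y in sent:
                between = sent.split(x, 1)[1].split(y, 1)[0].strip()
                if 0 < len(between.split()) <= max_len:   # keep short contexts only
                    patterns.add(between)
    return patterns

def apply_patterns(corpus, patterns):
    """Use the learned patterns to extract new (x, y) pairs."""
    pairs = set()
    for pat in patterns:
        regex = re.compile(r"([A-Z]\w+)\s+" + re.escape(pat) + r"\s+([A-Z]\w+)")
        for sent in corpus:
            for m in regex.finditer(sent):
                pairs.add((m.group(1), m.group(2)))
    return pairs

corpus = [
    "Google acquired YouTube in 2006.",
    "Facebook acquired Instagram for $1 billion.",
    "Microsoft acquired LinkedIn in 2016.",
]
seeds = {("Google", "YouTube")}
for _ in range(2):                      # two bootstrapping iterations
    seeds |= apply_patterns(corpus, find_patterns(corpus, seeds))
print(sorted(seeds))
# [('Facebook', 'Instagram'), ('Google', 'YouTube'), ('Microsoft', 'LinkedIn')]
```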

  6. KnowItAll • Requires no hand-tagged data • A generic pattern: <Class> such as <Mem> • Learns Seattle, New York City, London as examples of cities • Learns new patterns, e.g. “headquartered in <city>”, to find more cities • Problem: relation-specific, requiring bootstrapping for each relation
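As a concrete example of the generic class pattern, here is a minimal regex sketch of "<Class> such as <Member>" extraction over a few toy sentences. It illustrates the pattern idea only, not KnowItAll itself (which issues web queries and assesses extractions statistically); the function name and example data are invented.

```python
# Sketch of the domain-independent pattern "<Class> such as <Member1>, <Member2> ...".
import re

PATTERN = re.compile(r"(\w+) such as ([^.]+)")

def extract_members(sentences, target_class="cities"):
    """Return candidate members of `target_class` found via the 'such as' pattern."""
    members = set()
    for sent in sentences:
        for m in PATTERN.finditer(sent):
            if m.group(1).lower() == target_class:
                # Split the member span on commas and "and", keep capitalized names.
                for name in re.split(r",|\band\b", m.group(2)):
                    name = name.strip()
                    if name and name[0].isupper():
                        members.add(name)
    return members

sentences = [
    "He has lived in cities such as Seattle, New York City and London.",
    "The company opened offices in cities such as Boston.",
]
print(sorted(extract_members(sentences)))
# ['Boston', 'London', 'New York City', 'Seattle']
```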

  7. TextRunner “The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web.” • Self-supervised learner • Given a small corpus as an example • Uses the Stanford parser • Retains tuples if: • All entities are found in the parse • The dependency path between the two entities is shorter than a certain length • The path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause) • Neither e1 nor e2 is a pronoun • Learns a classifier that tags tuples as “trustworthy” • Each tuple is converted to a feature vector • Features = POS sequence • Number of stop words in r • Number of tokens in r • The learned classifier contains no relation-specific or lexical features • Single-pass extractor • No parsing, just POS tagging and a lightweight NP chunker • Entities = NP chunks • Relation = the words in between, heuristically eliminating words like prepositions • Generates one or more candidate tuples per sentence and retains those the classifier determines are trustworthy • Redundancy-based assessor • Assigns a probability to each tuple based on a probabilistic model of redundancy
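To make the single-pass extraction step concrete, here is a toy approximation using NLTK's regular-expression NP chunker: NP chunks become candidate entities, and the words between two adjacent chunks, minus prepositions and determiners, become the relation string. This is a sketch, not TextRunner's actual extractor; it omits the trustworthiness classifier and the assessor, the chunk grammar and function name are assumptions, and the demo uses a hand-tagged sentence (in practice the tags would come from a POS tagger such as nltk.pos_tag, which requires the tagger models to be installed).

```python
# Toy approximation of a single-pass, chunker-based extractor (not the real system).
import nltk

CHUNKER = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")   # simple NP grammar (assumed)
SKIP_TAGS = {"IN", "DT", "TO"}          # drop prepositions, determiners, "to"

def extract_tuples(tagged):
    """tagged: list of (word, POS) pairs for one sentence."""
    tree = CHUNKER.parse(tagged)
    # Flatten the chunk tree into entity ("E") and relation-word ("R") segments.
    segments = []
    for node in tree:
        if isinstance(node, nltk.Tree):                  # an NP chunk = candidate entity
            segments.append(("E", [w for w, _ in node.leaves()]))
        else:                                            # a (word, tag) outside any chunk
            segments.append(("R", [node]))
    tuples = []
    for i, (kind, _) in enumerate(segments):
        if kind != "E":
            continue
        for j in range(i + 1, len(segments)):            # find the next entity to the right
            if segments[j][0] == "E":
                rel = [w for k in range(i + 1, j)
                         for (w, t) in segments[k][1] if t not in SKIP_TAGS]
                if rel:
                    tuples.append((" ".join(segments[i][1]),
                                   " ".join(rel),
                                   " ".join(segments[j][1])))
                break
    return tuples

tagged = [("Edison", "NNP"), ("invented", "VBD"), ("the", "DT"),
          ("phonograph", "NN"), (".", ".")]
print(extract_tuples(tagged))   # [('Edison', 'invented', 'the phonograph')]
```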

  8. TextRunner Capabilities • Tuple outputs are placed in a graph • TextRunner operates at large scale, processing 90 million web pages and producing 1 billion tuples with an estimated 70% accuracy • Problems: inconsistencies, polysemy, synonymy, entity duplication
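As a rough sketch of what "tuples placed in a graph" might look like, the snippet below stores each (arg1, relation, arg2) triple as a labeled edge between entity strings and uses a crude count-based score as a stand-in for the redundancy-based probability. The class and the scoring formula are invented for illustration; they are not TextRunner's data structures or its actual probabilistic model of redundancy.

```python
# Minimal sketch: a tuple graph keyed by entity strings, with a count-based
# confidence heuristic standing in for the redundancy-based assessor.
from collections import defaultdict

class TupleGraph:
    def __init__(self):
        self.edges = defaultdict(list)    # entity -> list of (relation, other_entity)
        self.counts = defaultdict(int)    # (e1, rel, e2) -> number of extractions seen

    def add(self, e1, rel, e2):
        if self.counts[(e1, rel, e2)] == 0:       # record the edge only once
            self.edges[e1].append((rel, e2))
        self.counts[(e1, rel, e2)] += 1

    def confidence(self, e1, rel, e2):
        """Crude stand-in: more independent extractions -> higher confidence."""
        k = self.counts[(e1, rel, e2)]
        return 1.0 - 0.5 ** k                     # saturates toward 1 as k grows

g = TupleGraph()
g.add("Edison", "invented", "the phonograph")
g.add("Edison", "invented", "the phonograph")     # seen in a second sentence
g.add("Edison", "was born in", "Ohio")
print(g.edges["Edison"])
print(round(g.confidence("Edison", "invented", "the phonograph"), 2))  # 0.75
```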

  9. How close are we to realizing the dream of machine reading?
