
Natural Language Processing for the Web

Prof. Kathleen McKeown

722 CEPSR, 939-7118

Office Hours: Wed, 1-2; Tues 4-5

TA:

Yves Petinot

719 CEPSR, 939-7116

Office Hours: Thurs 12-1, 8-9

Logistics
  • Class evaluation
    • Please do
    • If there were topics you particularly liked, please say so
    • If there were topics you particularly disliked, please say so
    • Anything you particularly liked or disliked about class format
  • Project presentations
    • Need eight people to go first, April 29th
    • Not necessary to have all results
    • 2nd date: May 13, 7:10pm UNLESS….
    • Sign up by end of class or I will sign you up: http://www.cs.columbia.edu/~kathy/NLPWeb/finalpresentations.htm
Machine Reading
  • Goal: read all texts on the web, extract all knowledge, and represent it in DB/KB format
  • DARPA program on machine reading
Issues
  • Background theory and text facts may be inconsistent
    • -> probabilistic representation
  • Beliefs may only be implicit
    • -> need inference
  • Supervised learning not an option due to variety of relations on the web
    • -> IE not a valid solution
  • May require many steps of entailment
    • -> Need more general approach than textual entailment
Initial Approaches
  • Systems that learn relations using examples (supervised)
  • Systems that learn how to learn patterns using a seed set: Snowball (semi-supervised)
  • Systems that can label their own training examples using domain-independent patterns: KnowItAll (self-supervised)
KnowItAll
  • Requires no hand-tagged data
  • A generic pattern
    • <Class> such as <Mem>
    • Learn Seattle, New York City, London as examples of cities
    • Learn new patterns, e.g., "headquartered in <city>", to find more cities (see the pattern-matching sketch after this list)
  • Problem: relation-specific; requires bootstrapping for each relation
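
The pattern-matching idea above can be sketched as follows. This is a minimal illustration under stated assumptions: the input text, the regular expressions, and the function names are invented for the example, and KnowItAll itself applies such patterns at web scale rather than to a single string.

```python
import re

# Illustrative input; KnowItAll actually applies its patterns to web-scale text.
TEXT = ("Tourists flock to cities such as Seattle, New York City, and London. "
        "Amazon is headquartered in Seattle.")

# A proper-noun-like phrase: one or more capitalized words (a simplifying assumption).
MEMBER = r"[A-Z]\w+(?:\s+[A-Z]\w+)*"

def members_from_generic_pattern(class_name: str, text: str) -> list[str]:
    """Apply the generic '<Class> such as <Mem>' pattern to harvest class members."""
    pattern = rf"{class_name}\s+such\s+as\s+({MEMBER}(?:(?:,\s*|,?\s+and\s+){MEMBER})*)"
    members = []
    for match in re.finditer(pattern, text):
        # Split the captured list on commas and "and".
        members.extend(m for m in re.split(r",?\s+and\s+|,\s*", match.group(1)) if m)
    return members

def members_from_learned_pattern(text: str) -> list[str]:
    """Apply a learned, class-specific pattern such as 'headquartered in <city>'."""
    return re.findall(rf"headquartered in ({MEMBER})", text)

print(members_from_generic_pattern("cities", TEXT))  # ['Seattle', 'New York City', 'London']
print(members_from_learned_pattern(TEXT))            # ['Seattle']
```
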
TextRunner

“The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web.”

  • Self-supervised learner
    • Given a small corpus as an example
    • Uses Stanford parser
    • Retains tuples if:
      • Finds all entities in the parse
      • Keeps tuples if the dependency path between the two entities is shorter than a certain length
      • The path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause)
      • Neither e1 nor e2 is a pronoun
    • Learns a classifier that tags tuples as “trustworthy”
      • Each tuple converted to a feature vector
        • Feature = POS sequence
        • Number of stop words in r
        • Number of tokens in r
      • Learned classifier contains no relation-specific or lexical features
  • Single-pass extractor
    • No parsing; uses POS tagging and a lightweight NP chunker
    • Entities = NP chunks
    • Relation = the words in between, heuristically eliminating words like prepositions
    • Generates one or more candidate tuples per sentence and retains those that the classifier determines are trustworthy (a simplified extraction sketch follows this list)
  • Redundancy-based Assessor
    • Assigns a probability to each tuple based on a probabilistic model of redundancy
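
The single-pass extraction step can be sketched roughly as below. This is a simplified illustration, not TextRunner's implementation: NLTK's tokenizer, tagger, and RegexpParser stand in for TextRunner's own lightweight chunker, the skip lists are invented for the example, and the learned trustworthiness classifier and redundancy-based assessor are omitted.

```python
import nltk

# A minimal, illustrative chunk grammar; TextRunner used its own lightweight NP chunker.
CHUNKER = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
SKIP_TAGS = {"IN", "TO", "DT"}                 # heuristic: drop prepositions and determiners
SKIP_WORDS = {"is", "are", "was", "were"}      # illustrative stop words kept out of the relation

def extract_candidate_tuples(sentence: str):
    """Emit (e1, relation, e2) candidates from adjacent NP chunks in a single sentence."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # needs NLTK tokenizer/tagger models
    tree = CHUNKER.parse(tagged)
    # Flatten the chunk tree into NP chunks (entities) and plain (word, tag) tokens.
    items = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            items.append(("NP", " ".join(word for word, _ in node.leaves())))
        else:
            items.append(("TOK", node))
    candidates = []
    np_positions = [i for i, (kind, _) in enumerate(items) if kind == "NP"]
    for i, j in zip(np_positions, np_positions[1:]):      # adjacent entity pairs
        relation = []
        for _, (word, tag) in items[i + 1 : j]:
            # Heuristically eliminate prepositions, determiners, and stop words.
            if tag not in SKIP_TAGS and word.lower() not in SKIP_WORDS:
                relation.append(word)
        if relation:                                       # keep only tuples with a non-empty relation
            candidates.append((items[i][1], " ".join(relation), items[j][1]))
    return candidates

print(extract_candidate_tuples("Edison invented the phonograph in New Jersey."))
# Illustrative output: [('Edison', 'invented', 'the phonograph')]
```
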
TextRunner Capabilities
  • Tuple outputs are placed in a graph (a small graph-building sketch follows this list)
  • TextRunner operates at large scale, processing 90 million web pages and producing 1 billion tuples, with an estimated 70% accuracy
  • Problems: inconsistencies, polysemy, synonymy, entity duplication
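
As a small illustration of placing tuple outputs in a graph, the sketch below stores entities as nodes and relation strings as labeled edges. The networkx representation and the example tuples are assumptions for illustration, not TextRunner's actual storage.

```python
import networkx as nx

# Illustrative tuples of the form (arg1, relation, arg2); TextRunner's real output is web-scale.
tuples = [
    ("Edison", "invented", "the phonograph"),
    ("Edison", "born in", "Ohio"),
    ("Tesla", "worked for", "Edison"),
]

# Entities become nodes; each extracted relation becomes a labeled directed edge.
# A MultiDiGraph preserves multiple distinct relations between the same pair of entities.
G = nx.MultiDiGraph()
for arg1, rel, arg2 in tuples:
    G.add_edge(arg1, arg2, relation=rel)

# Query: everything asserted with Edison as the first argument.
for _, arg2, data in G.out_edges("Edison", data=True):
    print(f"(Edison, {data['relation']}, {arg2})")
```
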