Crawling the Hidden Web

Crawling the Hidden Web. Sriram Raghavan and Hector Garcia-Molina, Stanford University. Introduction. What’s the problem? Current-day crawlers retrieve only the Publicly Indexable Web (PIW). Why is it a problem? Large amounts of high-quality information are ‘hidden’ behind search forms.


Crawling the Hidden Web

Sriram Raghavan

Hector Garcia-Molina

@ Stanford University

Introduction
  • What’s the problem?
    • Current-day crawlers retrieve only the Publicly Indexable Web (PIW)
  • Why is it a problem?
    • Large amounts of high-quality information are ‘hidden’ behind search forms
    • The hidden Web is estimated to be about 500 times as large as the PIW
Introduction (cont’d)
  • What’s the solution?
    • Design a crawler capable of extracting content from the hidden Web
    • A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE)
  • Why is HiWE a solution?
Challenges and Simplifications
  • Challenges
    • Parse, process and interact with search forms
    • Fill out forms for submission
  • Simplifications
    • Application-dependent
    • With user assistance
    • Address only content retrieval; the resource discovery step is assumed to be already done
Performance Metrics
  • Coverage Metric
  • Submission Efficiency
  • Lenient Submission Efficiency
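The slide names the metrics without defining them. As a rough sketch, assume the strict submission efficiency is the fraction of form submissions whose response page contained search results, and the lenient variant is the fraction of submissions that were semantically valid even if they returned nothing. Both definitions, and the function and parameter names, are assumptions made for illustration. The coverage metric is left out because the slide gives nothing to base it on.

```python
def submission_efficiency(n_success: int, n_total: int) -> float:
    """Assumed strict metric: submissions that returned results / all submissions."""
    if n_total == 0:
        return 0.0  # no submissions yet, so report zero efficiency
    return n_success / n_total


def lenient_submission_efficiency(n_valid: int, n_total: int) -> float:
    """Assumed lenient metric: semantically valid submissions / all submissions."""
    if n_total == 0:
        return 0.0
    return n_valid / n_total


if __name__ == "__main__":
    # Hypothetical crawl: 100 submissions, 30 with results, 45 valid.
    print(submission_efficiency(30, 100))
    print(lenient_submission_efficiency(45, 100))
```

Under these assumed definitions the lenient score is never lower than the strict one, because every submission that returns results is also a valid submission.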
Design Issues
  • Internal Form Representation
  • Task-specific Database
  • Matching Function
  • Response Analysis
HiWE – Task-Specific Database
  • Label Value-Set (LVS) Tables
  • Value Set
    • A fuzzy set of element values
    • A membership function assigns a weight in [0, 1] to each member of the set
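One way to picture an LVS table is as a mapping from each label to a dictionary of values and their weights. The sketch below is only illustrative: the class names `LVSEntry` and `LVSTable`, their methods, and the rule of keeping the highest weight seen for a value are all assumptions, not details taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LVSEntry:
    """One row of the table: a label plus a fuzzy set of values."""
    label: str
    values: Dict[str, float] = field(default_factory=dict)  # value -> weight

    def add_value(self, value: str, weight: float) -> None:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        # Assumption: when a value is seen again, keep its highest weight.
        self.values[value] = max(weight, self.values.get(value, 0.0))

    def membership(self, value: str) -> float:
        """Membership function: weight of a value, 0.0 if it is not in the set."""
        return self.values.get(value, 0.0)


class LVSTable:
    """Label Value-Set table keyed by label."""

    def __init__(self) -> None:
        self._entries: Dict[str, LVSEntry] = {}

    def entry(self, label: str) -> LVSEntry:
        """Return the entry for a label, creating an empty one if needed."""
        return self._entries.setdefault(label, LVSEntry(label))

    def lookup(self, label: str) -> Optional[LVSEntry]:
        return self._entries.get(label)


if __name__ == "__main__":
    table = LVSTable()
    table.entry("State").add_value("California", 1.0)
    print(table.lookup("State").membership("California"))
```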

HiWE – Populating the LVS Table
  • Explicit Initialization
  • Built-in Entries
  • Wrapped Data Sources
  • Crawling Experience
HiWE – Computing Weights
  • Values from explicit initialization and built-in categories have weight 1
  • Values from external data sources are assigned weights in [0, 1] by their wrappers
  • Values gathered by crawlers
    • Label extracted and matched: add the new values to the matching entry
    • Label extracted but not matched: add a new entry (L, V)
    • Label cannot be extracted: find the closest entry and add the new values to it
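The three crawling-experience cases can be sketched as a single update routine. Everything specific here is an assumption for illustration: the table is a plain dictionary, `CRAWLED_WEIGHT` and `MATCH_THRESHOLD` are hypothetical constants, label matching uses Python's `difflib` rather than whatever matcher HiWE uses, and "closest entry" is read as the entry sharing the most values with those observed.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional

LVS = Dict[str, Dict[str, float]]  # label -> {value: weight}

CRAWLED_WEIGHT = 0.5   # hypothetical weight for values learned while crawling
MATCH_THRESHOLD = 0.8  # hypothetical cutoff for treating two labels as a match


def _best_label(table: LVS, label: str) -> Optional[str]:
    """Return the existing label most similar to `label`, if similar enough."""
    best, best_score = None, 0.0
    for existing in table:
        score = SequenceMatcher(None, existing.lower(), label.lower()).ratio()
        if score > best_score:
            best, best_score = existing, score
    return best if best_score >= MATCH_THRESHOLD else None


def _closest_by_values(table: LVS, values: Iterable[str]) -> Optional[str]:
    """Return the label whose value set overlaps most with `values`."""
    observed = set(values)
    best, best_overlap = None, 0
    for label, known in table.items():
        overlap = len(observed.intersection(known))
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best


def absorb(table: LVS, label: Optional[str], values: Iterable[str]) -> Optional[str]:
    """Add crawled values to the table; return the label that received them."""
    values = list(values)
    if label is not None:
        target = _best_label(table, label)   # case 1: label matches an entry
        if target is None:                   # case 2: label is new
            target = label
            table[target] = {}
    else:
        target = _closest_by_values(table, values)  # case 3: no label
        if target is None:
            return None  # nothing close enough; values are dropped
    entry = table[target]
    for value in values:
        entry[value] = max(entry.get(value, 0.0), CRAWLED_WEIGHT)
    return target
```

Taking the maximum of the old and new weight means a value that was explicitly initialized with weight 1 is never downgraded by a later crawl; that choice is also an assumption.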
HiWE – Matching Function
  • Enumerate values for finite domain elements
  • Label matching
    • step 1: string normalization
    • step 2: string matching
  • Evaluate value assignment
    • Fuzzy Conjunction
    • Average
    • Probabilistic
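The two label-matching steps and the three ways of scoring a value assignment can be sketched as follows. The slide gives only the names, so the details are assumptions: normalization is taken to mean lowercasing, stripping punctuation, and collapsing whitespace; string matching uses `difflib.SequenceMatcher` as a stand-in for whatever approximate matcher the crawler uses; and the three scores are read in their usual senses (minimum, arithmetic mean, and one minus the product of the complements).

```python
import math
import re
from difflib import SequenceMatcher
from typing import Sequence


def normalize_label(label: str) -> str:
    """Step 1 (assumed): lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(text.split())


def label_similarity(a: str, b: str) -> float:
    """Step 2 (stand-in): similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize_label(a), normalize_label(b)).ratio()


def rank_fuzzy(weights: Sequence[float]) -> float:
    """Fuzzy conjunction: an assignment is only as good as its weakest value."""
    return min(weights)


def rank_average(weights: Sequence[float]) -> float:
    """Average: mean weight of the assigned values."""
    return sum(weights) / len(weights)


def rank_probabilistic(weights: Sequence[float]) -> float:
    """Probabilistic: 1 minus the product of (1 - weight) over all values."""
    return 1.0 - math.prod(1.0 - w for w in weights)


if __name__ == "__main__":
    weights = [0.5, 1.0]  # hypothetical weights of the values chosen for one form
    print(rank_fuzzy(weights), rank_average(weights), rank_probabilistic(weights))
```

Under these readings the fuzzy conjunction is the most conservative of the three, since a single low-weight value pulls the whole assignment down, while the probabilistic score is the most optimistic.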
HiWE – Extraction from Pages
  • Prune the form page, keeping only the forms
  • Approximately lay out the pruned page using a layout engine
  • Use the layout engine to identify candidate labels for form elements
  • Rank the candidates and choose the best one
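The final ranking step can be illustrated with a small sketch that works on boxes produced by some layout engine. The slide does not say how candidates are ranked, so this version makes a simplifying assumption: it ranks purely by distance between box centers and ignores text that lies both to the right of and below the form element. The `Box` type, the `pick_label` function, and the `max_distance` cutoff are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """A laid-out rectangle: a text fragment or a form element (empty text)."""
    text: str
    x: float
    y: float
    width: float
    height: float

    @property
    def center_x(self) -> float:
        return self.x + self.width / 2

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def pick_label(element: Box, candidates: List[Box],
               max_distance: float = 200.0) -> Optional[str]:
    """Return the text of the nearest plausible candidate label, if any."""
    best_text, best_distance = None, max_distance
    for cand in candidates:
        # Assumption: labels sit to the left of or above the element,
        # so skip text that is both to the right and below.
        if cand.center_x > element.center_x and cand.center_y > element.center_y:
            continue
        d = math.hypot(cand.center_x - element.center_x,
                       cand.center_y - element.center_y)
        if d < best_distance:
            best_text, best_distance = cand.text, d
    return best_text


if __name__ == "__main__":
    field_box = Box("", 120, 10, 100, 20)
    texts = [Box("Author", 20, 10, 80, 20), Box("Search tips", 20, 300, 80, 20)]
    print(pick_label(field_box, texts))
```

A fuller ranking would likely weigh more than position, but distance alone is enough to show how a laid-out page turns label extraction into a geometric lookup.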
Future Work
  • Recognize and respond to dependencies between form elements
  • Support partially filled-out forms
  • Propose an application-specific approach to hidden Web crawling
  • Implement a prototype crawler – HiWE
  • Set the stage for designing a variety of hidden Web crawlers