Crawling the Hidden Web

Sriram Raghavan

Hector Garcia-Molina

@ Stanford University


Introduction

  • What’s the problem?

    • Current-day crawlers retrieve only the Publicly Indexable Web (PIW)

  • Why is it a problem?

    • Large amounts of high quality information are ‘hidden’ behind search forms

    • The hidden Web is 500 times as large as the PIW


Introduction (cont’d)

  • What’s the solution?

    • Design a crawler capable of extracting content from the hidden Web

    • A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE)

  • Why is HiWE a solution?



Challenges and Simplifications

  • Challenges

    • Parse, process and interact with search forms

    • Fill out forms for submission

  • Simplifications

    • Application-dependent

    • With user assistance

    • Address only content retrieval; the resource discovery step is assumed to be done



Performance Metrics

  • Coverage Metric

  • Submission Efficiency

  • Lenient Submission Efficiency
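
As a rough illustration, the two submission-efficiency metrics can be computed from simple crawl counts. The sketch below assumes that the strict variant counts only submissions whose response page contains at least one result, while the lenient variant counts every semantically valid submission. The `SubmissionLog` type and its field names are hypothetical, and coverage is left out because it requires knowing the size of the underlying database.

```python
from dataclasses import dataclass


@dataclass
class SubmissionLog:
    """Hypothetical per-crawl counters (names are assumptions)."""
    total: int       # forms submitted by the crawler
    successful: int  # submissions whose response page held >= 1 result
    valid: int       # semantically correct submissions, with or without results


def strict_efficiency(log: SubmissionLog) -> float:
    """Fraction of submissions that returned at least one result."""
    return log.successful / log.total if log.total else 0.0


def lenient_efficiency(log: SubmissionLog) -> float:
    """Fraction of submissions that were semantically valid."""
    return log.valid / log.total if log.total else 0.0
```

The lenient figure is never lower than the strict one, since a valid query may legitimately return no results.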


Design Issues

  • Internal Form Representation

  • Task-specific Database

  • Matching Function

  • Response Analysis



HiWE – Form Representation


HiWE – Sample Forms


HiWE – Task-Specific Database

  • Label Value-Set (LVS) Tables

  • Value Set

    V is a fuzzy set of element values

    A membership function assigns a weight in [0, 1] to each member of the set
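
One way to picture an LVS table is as a mapping from each label to a fuzzy value set, where every value carries a weight in [0, 1]. The sketch below is a minimal illustration under that assumption; the `LVSTable` class, its method names, and the rule of keeping the higher weight on duplicates are all hypothetical, not HiWE's actual implementation.

```python
from typing import Dict


class LVSTable:
    """Hypothetical Label Value-Set table: label -> {value: weight}."""

    def __init__(self) -> None:
        self.entries: Dict[str, Dict[str, float]] = {}

    def add(self, label: str, value: str, weight: float = 1.0) -> None:
        """Add a value to a label's fuzzy set with the given weight."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        values = self.entries.setdefault(label, {})
        # Assumption: if the value is already present, keep the higher weight.
        values[value] = max(values.get(value, 0.0), weight)

    def membership(self, label: str, value: str) -> float:
        """Membership function: weight of value in the label's set, else 0."""
        return self.entries.get(label, {}).get(value, 0.0)
```

For example, after `table.add("state", "California")` and `table.add("state", "Calif.", 0.6)`, the value "Calif." belongs to the "state" set with weight 0.6, and any unseen value has membership 0.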


HiWE – Populating the LVS Table

  • Explicit Initialization

  • Built-in Entries

  • Wrapped Data Sources

  • Crawling Experience


HiWE – Computing Weights

  • Values from explicit initialization and built-in categories have weight 1

  • Values from external data sources are assigned weights in [0, 1] by their wrappers

  • Values gathered by the crawler

    • Label extracted and matched: add the new values to the matching entry

    • Label extracted but not matched: add a new entry (L, V)

    • Label cannot be extracted: find the closest entry and add the new values to it
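
The three crawler cases above can be sketched as follows. This is an illustration only: the table is a plain dictionary of label to {value: weight}, labels are matched by exact lookup, "closest entry" is taken to be the entry whose value set overlaps most with the element's values, and the weights assigned (1.0 when the label is known, the overlap score otherwise) are assumptions rather than the rules HiWE necessarily uses. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Table = Dict[str, Dict[str, float]]


def closest_entry(table: Table, values: List[str]) -> Tuple[Optional[str], float]:
    """Return the label whose value set overlaps most with `values`."""
    best_label, best_score = None, 0.0
    for label, value_set in table.items():
        # Average membership of the element's values in this entry.
        score = sum(value_set.get(v, 0.0) for v in values) / len(values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


def learn_from_element(table: Table, label: Optional[str], values: List[str]) -> None:
    """Update the table with the values of one finite-domain form element."""
    if not values:
        return
    if label is not None and label in table:
        # Case 1: label extracted and matched, so extend the existing entry.
        for v in values:
            table[label][v] = 1.0
    elif label is not None:
        # Case 2: label extracted but not matched, so create a new entry (L, V).
        table[label] = {v: 1.0 for v in values}
    else:
        # Case 3: no label, so attach the values to the closest entry,
        # weighted by how well that entry matched (assumed rule).
        target, score = closest_entry(table, values)
        if target is None:
            return
        for v in values:
            table[target][v] = max(table[target].get(v, 0.0), score)
```

In the third case, values that were already in the chosen entry keep their existing weight, and only the unseen values receive the lower, overlap-based weight.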


HiWE – Matching Function

  • Enumerate values for finite domain elements

  • Label matching

    • step 1: string normalization

    • step 2: string matching

  • Evaluate value assignment

    • Fuzzy Conjunction

    • Average

    • Probabilistic
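
The label-matching steps and the three ways of evaluating a value assignment can be sketched as below. Several details are assumptions made for illustration: the stop-word list, the use of plain Levenshtein edit distance for string matching, the `max_distance` threshold, and all function names. The aggregation functions follow the usual definitions (minimum, mean, and one minus the product of complements) applied to the weights of the values chosen for each form element.

```python
import math
import re
from typing import Iterable, List, Optional

# Hypothetical stop-word list used during normalization.
STOP_WORDS = {"a", "an", "the", "of", "for", "your", "please", "enter", "select"}


def normalize(label: str) -> str:
    """Step 1: lowercase, strip punctuation, drop stop words."""
    words = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_label(form_label: str, known: Iterable[str], max_distance: int = 2) -> Optional[str]:
    """Step 2: return the closest known label, or None if none is close enough."""
    target = normalize(form_label)
    best, best_dist = None, max_distance + 1
    for candidate in known:
        dist = edit_distance(target, normalize(candidate))
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


def fuzzy_conjunction(weights: List[float]) -> float:
    """Rank an assignment by its weakest value."""
    return min(weights)


def average(weights: List[float]) -> float:
    """Rank an assignment by the mean weight of its values."""
    return sum(weights) / len(weights)


def probabilistic(weights: List[float]) -> float:
    """Treat weights as independent probabilities that a value is useful."""
    return 1.0 - math.prod(1.0 - w for w in weights)
```

Fuzzy conjunction is the most conservative of the three, since a single low-weight value drags the whole assignment down, whereas the probabilistic score rises as soon as any one value has a high weight.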



HiWE – Extraction from Pages

  • Prune the form page, keeping only the forms

  • Approximately lay out the pruned page using a layout engine

  • Use the layout engine to identify candidate labels for each form element

  • Rank the candidates and choose the best one
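
The final ranking step might look like the sketch below, which assumes the layout engine has already produced approximate screen coordinates for the form element and for each candidate piece of text. The `Box` type, the distance-based score, and the fixed penalty for text that is not to the left of or above the element are all hypothetical choices made for illustration; the actual ranking heuristics are not given in the slides.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Hypothetical laid-out item: text plus approximate screen position."""
    text: str
    x: float  # grows to the right
    y: float  # grows downward


def rank_candidates(element: Box, candidates: List[Box]) -> List[Box]:
    """Order candidate labels from most to least plausible for `element`."""

    def score(c: Box) -> float:
        distance = math.hypot(element.x - c.x, element.y - c.y)
        # Assumed heuristic: labels usually sit to the left of or above
        # the element, so penalize candidates placed anywhere else.
        well_placed = c.x <= element.x and c.y <= element.y
        return distance + (0.0 if well_placed else 50.0)

    return sorted(candidates, key=score)


def best_label(element: Box, candidates: List[Box]) -> Optional[str]:
    """Return the text of the top-ranked candidate, if there is one."""
    ranked = rank_candidates(element, candidates)
    return ranked[0].text if ranked else None
```

Working from laid-out positions rather than from the HTML source order is what lets this kind of ranking cope with forms arranged in tables, where a label and its element can be far apart in the markup but adjacent on screen.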


HiWE – Extraction from Pages (cont’d)


HiWE – Experiments


HiWE – Experiments (cont’d)


HiWE – Experiments (cont’d)


HiWE – Experiments (cont’d)


HiWE – Experiments (cont’d)

93% accuracy


Future Work

  • Recognize and respond to the dependencies between form elements

  • Support partially filled-out forms


Conclusion

  • Propose an application-specific approach to hidden Web crawling

  • Implement a prototype crawler – HiWE

  • Set the stage for designing a variety of hidden Web crawlers

