Crawling the Hidden Web



Sriram Raghavan

Hector Garcia-Molina

@ Stanford University


Introduction

  • What’s the problem?

    • Current-day crawlers retrieve only the Publicly Indexable Web (PIW): pages reachable by following hyperlinks alone

  • Why is it a problem?

    • Large amounts of high quality information are ‘hidden’ behind search forms

    • The hidden Web is estimated to be about 500 times as large as the PIW


Introduction (cont’d)

  • What’s the solution?

    • Design a crawler capable of extracting content from the hidden Web

    • A generic operational model of a hidden Web crawler, Hidden Web Exposer (HiWE)

  • Why is HiWE a solution?


User Form Interaction


Challenges and Simplifications

  • Challenges

    • Parse, process and interact with search forms

    • Fill out forms for submission

  • Simplifications

    • Application-dependent (task-specific) crawling

    • With user assistance

    • Address only content retrieval; assume the resource discovery step is already done


Crawler Form Interaction


Performance Metrics

  • Coverage Metric

  • Submission Efficiency

  • Lenient Submission Efficiency
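The two efficiency metrics can be illustrated with a short sketch. It assumes that strict submission efficiency is the fraction of form submissions whose response page contains at least one result, and that lenient submission efficiency is the fraction of submissions that are semantically valid even if they return nothing. The class and field names are hypothetical. Coverage is left out because it requires knowing how many relevant records the hidden database holds, which a crawler generally cannot observe.

```python
from dataclasses import dataclass


@dataclass
class SubmissionLog:
    """Counts gathered over one crawl (all field names are illustrative)."""
    total: int         # form submissions attempted
    with_results: int  # submissions whose response page had at least one result
    valid: int         # submissions judged semantically correct


def strict_efficiency(log: SubmissionLog) -> float:
    # Assumed definition: successful submissions / all submissions.
    return log.with_results / log.total if log.total else 0.0


def lenient_efficiency(log: SubmissionLog) -> float:
    # Assumed definition: semantically valid submissions / all submissions.
    return log.valid / log.total if log.total else 0.0


log = SubmissionLog(total=200, with_results=150, valid=180)
print(strict_efficiency(log))   # 0.75
print(lenient_efficiency(log))  # 0.9
```

The lenient figure is never lower than the strict one as long as every submission that returns results is also counted as valid.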


Design Issues

  • Internal Form Representation

  • Task-specific Database

  • Matching Function

  • Response Analysis


HiWE Architecture


HiWE – Form Representation


HiWE – Sample Forms


HiWE – Task-Specific Database

  • Label Value-Set (LVS) Tables

  • Value Set

    • A fuzzy set of element values

    • A membership function assigns a weight in [0, 1] to each member of the set
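One way to picture an LVS table is as a mapping from a label to a fuzzy set of values, where the fuzzy set is itself a mapping from each value to its membership weight. The sketch below is an illustration under that assumption, not HiWE's actual data structure. The class name `LVSTable`, its methods, and the rule of keeping the larger weight when a value is added twice are all hypothetical.

```python
from collections import defaultdict


class LVSTable:
    """Label Value-Set table: label -> {value: membership weight in [0, 1]}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def add(self, label, value, weight):
        if not 0.0 <= weight <= 1.0:
            raise ValueError("membership weight must lie in [0, 1]")
        # Assumption: when a value is seen again, keep the larger weight.
        current = self._rows[label].get(value, 0.0)
        self._rows[label][value] = max(current, weight)

    def values(self, label):
        # Return a copy of the fuzzy set; an unknown label yields an empty set.
        return dict(self._rows.get(label, {}))


table = LVSTable()
table.add("state", "California", 1.0)
table.add("state", "Calif.", 0.6)
print(table.values("state"))
```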


HiWE – Populating the LVS Table

  • Explicit Initialization

  • Built-in Entries

  • Wrapped Data Sources

  • Crawling Experience


HiWE – Computing Weights

  • Values from explicit initialization and built-in categories have weight 1

  • Values from external data sources are assigned weights in [0, 1] by their wrappers

  • Values gathered by crawlers

    • Label extracted and matched: add the new values to the matching entry

    • Label extracted but not matched: add a new entry (L, V)

    • Label cannot be extracted: find the closest existing entry and add the new values to it
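The three cases for values gathered during crawling can be sketched as a single update routine. The version below is a simplification under stated assumptions: the table is a plain dictionary, label matching is an exact key lookup, "closest entry" is decided by Jaccard overlap between value sets, and crawled values receive a fixed default weight. The function names `jaccard` and `absorb` and the default weight of 0.5 are hypothetical, not taken from the slides.

```python
def jaccard(a, b):
    """Overlap between two collections of values, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def absorb(table, label, values, weight=0.5):
    """Merge crawled values into table (label -> {value: weight}).

    Returns the label of the entry that received the values, or None.
    """
    if label is not None:
        # Cases 1 and 2: reuse the matching entry, or create a new (L, V) entry.
        row = table.setdefault(label, {})
    else:
        # Case 3: no label was extracted, so pick the entry with the closest values.
        if not table:
            return None
        label = max(table, key=lambda known: jaccard(table[known], values))
        row = table[label]
    for value in values:
        row[value] = max(row.get(value, 0.0), weight)
    return label


table = {"state": {"California": 1.0, "Texas": 1.0}}
absorb(table, "state", ["Ohio"])            # label matched
absorb(table, "city", ["Austin"])           # new entry
absorb(table, None, ["Texas", "Nevada"])    # closest entry is "state"
```

Values that were already present keep their higher weight, so explicitly initialized entries are not downgraded by later crawling.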


HiWE – Matching Function

  • Enumerate values for finite domain elements

  • Label matching

    • step 1: string normalization

    • step 2: string matching

  • Evaluate value assignment

    • Fuzzy Conjunction

    • Average

    • Probabilistic
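The label-matching steps and the three ways of scoring a value assignment can be sketched together. This is an illustration under assumptions: normalization here is only lowercasing and punctuation removal, string matching uses the similarity ratio from Python's `difflib` rather than any particular edit-distance measure, and the 0.8 threshold is arbitrary. The scoring functions take the membership weights of the values chosen for each form element: fuzzy conjunction is read as the minimum weight, average as the mean, and probabilistic as one minus the product of the complements.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(label):
    """Step 1: lowercase, drop punctuation, collapse whitespace."""
    label = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(label.split())


def best_label(form_label, known_labels, threshold=0.8):
    """Step 2: return the most similar known label, or None below the threshold."""
    target = normalize(form_label)
    scored = [(SequenceMatcher(None, target, normalize(k)).ratio(), k)
              for k in known_labels]
    score, label = max(scored, default=(0.0, None))
    return label if score >= threshold else None


def rank_fuzzy(weights):
    return min(weights)


def rank_average(weights):
    return sum(weights) / len(weights)


def rank_probabilistic(weights):
    return 1.0 - math.prod(1.0 - w for w in weights)


print(best_label("State:", ["state", "company name"]))  # state
weights = [0.5, 0.5]
print(rank_fuzzy(weights), rank_average(weights), rank_probabilistic(weights))
```

The fuzzy conjunction is the most conservative of the three, since a single low-weight value pulls the whole assignment down, while the probabilistic score rises as soon as any one value is trusted.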


Configuring HiWE


HiWE – Extraction from Pages

  • Prune form page and only keep forms

  • Approximately lay out the pruned page using a layout engine

  • Use the layout to identify candidate labels for each form element

  • Rank the candidates and choose the best one
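The last two steps can be sketched once the layout engine has assigned coordinates to text fragments and form elements. The example below is a toy version under several assumptions: positions are single (x, y) points with y growing downward, candidates are text fragments roughly to the left of or above the element, and ranking uses distance alone with a penalty for text that sits above rather than beside the element. The `Box` type, the pixel tolerances, and the penalty factor are hypothetical. A fuller ranking would likely also weigh cues such as font and text length.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float
    y: float


def candidate_labels(element, texts, max_dist=200.0):
    """Return (score, box) pairs for text lying left of or above the element."""
    candidates = []
    for t in texts:
        left_of = t.x < element.x and abs(t.y - element.y) < 10
        above = t.y < element.y and abs(t.x - element.x) < 50
        dist = math.hypot(t.x - element.x, t.y - element.y)
        if (left_of or above) and dist <= max_dist:
            # Assumption: text on the same row is a likelier label than text above.
            score = dist if left_of else 2.0 * dist
            candidates.append((score, t))
    return candidates


def pick_label(element, texts):
    candidates = candidate_labels(element, texts)
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1].text


field = Box("", 100, 50)
texts = [Box("Company", 20, 50), Box("Search jobs", 100, 0), Box("Footer", 100, 300)]
print(pick_label(field, texts))  # Company
```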


HiWE – Extraction from Pages (cont’d)


HiWE – Experiments

93% accuracy


Future Work

  • Recognize and respond to the dependencies between form elements

  • Support partially filled-out forms


Conclusion

  • Propose an application-specific approach to hidden Web crawling

  • Implement a prototype crawler – HiWE

  • Set the stage for designing a variety of hidden Web crawlers

