CRAWLING THE HIDDEN WEB

Authors: S. Raghavan & H. Garcia-Molina

Presenter: Nga Chung

OUTLINE
  • Introduction
  • Challenges
  • Approach
  • Experimental Results
  • Contributions
  • Pros and Cons
  • Related Work
INTRODUCTION
  • Hidden Web
    • Content stored in databases that can only be retrieved through a user query, e.g. medical research databases, flight schedules, product listings, news archives
    • Social media content such as blog posts and comments
  • So why should we care?
    • Estimates of the scale of the web (55–60 billion pages) do not include the deep web or pages behind security walls [2]
    • A 2001 estimate put the Hidden Web at 500 times the size of the publicly indexed web
    • Mike Bergman: “The long-term impact of Deep Web search had more to do with transforming business than with satisfying the whims of Web surfers.” [5]
CHALLENGES
  • From a search engine perspective
    • Locate the hidden databases
    • Identify which databases to search for a given user query
  • From a crawler’s perspective
    • Interact with a search form
      • Search interfaces can be form-based, faceted/guided navigation, or free-text, and are intended for human users [3]
    • Know what keywords to put into the form fields
    • Filter search results returned from search queries
    • Define metrics to measure the crawler’s performance
HIDDEN WEB EXPOSER ARCHITECTURE (HiWE)

[Architecture diagram: HiWE components include the URL List, Crawl Manager, Parser, Form Analyzer, Form Processor, Form Submission, Response Analyzer, LVS Manager, the Label Value Set (LVS) table, the Task-Specific Database, external Data Sources, and the WWW, with feedback from the Response Analyzer flowing back into the LVS.]
FORM ANALYSIS
  • How does a crawler interact with a search form?
    • Crawler builds an “internal form representation”
      • F = ({E1, E2, …, En}, S, M), where {E1, …, En} is the set of n form elements, S is submission information (e.g. the submission URL), and M is meta-information (e.g. the URL of the form page, the web site hosting the form, links to the form)
      • Label(Ei) is the descriptive text for a field, e.g. “Date”
      • Domain(Ei) is the set of possible values for a field, which can be finite (select box) or infinite (text box)

FORM ANALYSIS

  • Example: a used-car search form
    • Label(E1) = Make, Domain(E1) = {Acura, Lexus, …}
    • Label(E5) = Your ZIP, Domain(E5) = {s | s is a text string}
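To make the representation concrete, here is a minimal Python sketch of such an internal form representation for the form above; the class names and URLs are illustrative, not from the paper:

    from dataclasses import dataclass
    from typing import Dict, List, Optional, Set

    @dataclass
    class FormElement:
        label: str                          # descriptive text, e.g. "Make"
        domain: Optional[Set[str]] = None   # finite value set, or None for an
                                            # infinite domain (free-text box)

    @dataclass
    class FormRepresentation:
        elements: List[FormElement]         # {E1, ..., En}
        submission: str                     # S: submission info (submit URL)
        meta: Dict[str, str]                # M: meta-info (form-page URL, ...)

    # The used-car form from the slide; URLs are hypothetical.
    car_form = FormRepresentation(
        elements=[
            FormElement(label="Make", domain={"Acura", "Lexus"}),
            FormElement(label="Your ZIP"),  # domain: any text string
        ],
        submission="http://example.com/search",
        meta={"page_url": "http://example.com/used-cars"},
    )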

TASK SPECIFIC DATABASE
  • How does a crawler know what keywords to put into the form fields?
    • Crawler has a “task-specific database”
      • For instance, if the task is to search archives pertaining to the automobile industry, the database will contain lists of all car makes and models.
    • Database has a Label Value Set (LVS) table
      • Each row contains
        • L – a label e.g. “Car Make”
        • V = {v1, …, vn} – a graded set of values, e.g. {‘Toyota’, ‘Honda’, ‘Mercedes-Benz’, …}
      • A membership function MV assigns a weight to each member of the set V
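A minimal sketch of an LVS table in Python, inferred from the row structure above (the labels, values, and weights are illustrative):

    # One row per label L: a graded value set V with membership weights M_V(v).
    lvs_table = {
        "Car Make": {"Toyota": 1.0, "Honda": 1.0, "Mercedes-Benz": 0.9},
        "Year": {"2009": 1.0, "2010": 1.0},
    }

    def values_for(label):
        """Return the weighted value set for a label, or an empty dict."""
        return lvs_table.get(label, {})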
TASK SPECIFIC DATABASE
  • LVS table can be populated through
    • Explicit initialization by human intervention
    • Built-in entries for commonly used categories e.g. dates
    • Querying external data sources e.g. Open Directory Project
    • The crawler’s own encounters with forms that have finite-domain fields

[Screenshot: Open Directory Project category hierarchy, e.g. Regional > North America > United States]
TASK SPECIFIC DATABASE
  • Computing the weights MV(v)
    • Case 1: Precomputed
    • Case 2: Computed by the respective data source’s wrapper
    • Case 3: Computed from crawling experience, as shown in the flowchart below

[Flowchart: when the crawler encounters a finite-domain element E, it tries to extract a label. If a label is extracted and found in the LVS table, the matching entry (L, V) is replaced with (L, V ∪ Domain(E)); if a label is extracted but not found, a new entry is added to the LVS; if no label can be extracted, the crawler finds the entry whose value set most closely resembles Domain(E) and adds Domain(E) to that set.]
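The Case 3 flowchart, sketched in Python; weights are omitted for brevity (the table maps labels to plain value sets here), and set overlap is an assumed stand-in for the paper’s similarity measure:

    def update_lvs(lvs_table, domain, label=None):
        """Fold a finite-domain form element E into the LVS table."""
        if label is not None:
            if label in lvs_table:
                # Label found: replace (L, V) with (L, V ∪ Domain(E)).
                lvs_table[label] |= domain
            else:
                # Label extracted but unknown: add a new entry.
                lvs_table[label] = set(domain)
        else:
            if not lvs_table:                  # nothing to match against yet
                return
            # No label extracted: add Domain(E) to the entry whose value
            # set most closely resembles it (overlap as similarity proxy).
            best = max(lvs_table, key=lambda l: len(lvs_table[l] & domain))
            lvs_table[best] |= domain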

MATCHING FUNCTION
  • The “matching function” maps values from the task-specific database to form fields
  • Step 1: Label matching
    • Normalize the form label and use a string-matching algorithm to compute the minimum edit distance between the form label and all LVS labels

[Diagram: label matching pairs each form field with the closest LVS label, e.g. E1 = Car Make matches LVS label “Car Make” (value v1 = Toyota), and E2 = Car Model matches “Car Model” (value v2 = Prius)]
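A sketch of Step 1 using a standard Levenshtein edit distance; the normalization step here (lowercasing and trimming) is an assumption:

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def match_label(form_label, lvs_labels):
        """Return the LVS label closest to the (normalized) form label."""
        norm = form_label.strip().lower()
        return min(lvs_labels, key=lambda l: edit_distance(norm, l.lower()))

    match_label("Car Make", ["Car Make", "Car Model"])  # -> "Car Make"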

MATCHING FUNCTION
  • Step 2: Value assignment
    • Take all possible combinations of value assignments, rank them, and choose the best set to use for form submission
    • There are three ranking functions
      • Fuzzy conjunction (minimum of the weights)
      • Average
      • Probabilistic
    • Example: a form with two fields, car make and year
      • Jaguar, 2009, where MV1(Jaguar) = 0.5 and MV2(2009) = 1
        • ρfuz = min(0.5, 1) = 0.5
        • ρavg = ½ (0.5 + 1) = 0.75
        • ρprob = 1 – [(1 – 0.5) × (1 – 1)] = 1
      • Toyota, 2010, where MV1(Toyota) = 1 and MV2(2010) = 1
        • ρfuz = min(1, 1) = 1
        • ρavg = ½ (1 + 1) = 1
        • ρprob = 1 – [(1 – 1) × (1 – 1)] = 1
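The three ranking functions, sketched in Python and reproducing the worked example above (ρfuz taken as the minimum weight, per the fuzzy-conjunction interpretation):

    def rho_fuz(weights):
        """Fuzzy conjunction: the minimum membership weight."""
        return min(weights)

    def rho_avg(weights):
        """Arithmetic mean of the membership weights."""
        return sum(weights) / len(weights)

    def rho_prob(weights):
        """Probabilistic: 1 - product of (1 - w) over all weights."""
        p = 1.0
        for w in weights:
            p *= 1 - w
        return 1 - p

    weights = [0.5, 1.0]      # MV1(Jaguar) = 0.5, MV2(2009) = 1
    print(rho_fuz(weights))   # 0.5
    print(rho_avg(weights))   # 0.75
    print(rho_prob(weights))  # 1.0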
LAYOUT-BASED INFORMATION EXTRACTION (LITE)

[Diagram of the LITE label extraction method and a table of its results]

RESPONSE ANALYSIS
  • How does the crawler determine whether a response page contains results or an error message?
    • Identify the significant portion of the response page by removing the header, footer, etc. and keeping the content in the middle of the page
    • Check whether that content matches predefined error messages, e.g. “No results,” “No matches”
    • Store a hash of the significant portion; if the same hash recurs very often across submissions, assume it is the hash of an error page
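A minimal sketch of this heuristic; the error phrases, hash choice, and repeat threshold are assumptions:

    import hashlib
    from collections import Counter

    ERROR_PHRASES = ("no results", "no matches")  # predefined error messages
    seen_hashes = Counter()

    def looks_like_error(significant_text, repeat_threshold=10):
        """Flag a response whose significant portion matches a known error
        phrase or recurs suspiciously often across form submissions."""
        text = significant_text.lower()
        if any(phrase in text for phrase in ERROR_PHRASES):
            return True
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        seen_hashes[digest] += 1
        # The same "results" for many different queries is probably a
        # generic error page.
        return seen_hashes[digest] >= repeat_threshold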
METRICS
  • How do we measure the efficiency of the hidden web crawler?
    • Define submission efficiency (SE)
      • Ntotal = total number of forms submitted
      • Nsuccess = number of submissions that resulted in a response page containing search results
      • Nvalid = number of semantically correct submissions (e.g. inputting “Orange” for a form element labeled “Vegetable” is semantically incorrect)
      • Strict efficiency SEstrict = Nsuccess / Ntotal; lenient efficiency SElenient = Nvalid / Ntotal
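A trivial sketch computing both variants from counts tracked during a crawl (the example counts are hypothetical):

    def submission_efficiency(n_success, n_valid, n_total):
        """Strict and lenient submission efficiency as ratios in [0, 1]."""
        return n_success / n_total, n_valid / n_total

    submission_efficiency(n_success=705, n_valid=754, n_total=849)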
EXPERIMENT
  • Task: a market analyst is interested in building an archive of information about the semiconductor industry over the past 10 years
  • LVS table populated from online sources such as Semiconductor Research Corporation, Lycos Companies Online
EXPERIMENTAL RESULTS – RANKING FUNCTION
  • The crawler was executed three times, once with each ranking function
  • ρfuz and ρavg both achieved submission efficiency above 80%
  • ρfuz does better (88.8% vs. 83.1% for ρavg), but fewer forms are submitted than with ρavg

EXPERIMENTAL RESULTS – MINIMUM FORM SIZE
  • Effect of minimum form size: the crawler performs better on larger forms
  • [Chart: submission efficiency of 78.9%, 88.77%, and 88.96%, increasing with the minimum form size]

CONTRIBUTIONS
  • Introduces HiWE, one of the first publicly available techniques for crawling the hidden web
  • Introduces LITE, a technique for extracting form data that incorporates the physical layout of the HTML page
    • Prior techniques were based on pattern recognition over the underlying HTML
PROS
  • Defines a clear performance metric for analyzing the crawler’s efficiency
  • Points out known limitations of the technique from which future work can proceed
  • Directs readers to a technical report that provides a more detailed explanation of the HiWE implementation
CONS
  • Not fully automatic; requires human intervention
  • Task-specific
    • Requires creation of an LVS table per task
  • The technique has many limitations
    • Can only retrieve search results from HTML-based forms
    • Cannot support forms driven by JavaScript events, e.g. onclick, onselect
  • No mention of whether forms submitted through HTTP POST were stored/indexed
RELATED WORK
  • USC ISI work on extracting data from the Web (1999–2001) [7, 8]
    • Describes relevant information on a web page with a formal grammar and automatically adapts to web page changes
  • Research at UCLA (2005) [4]
    • Adaptive approach: automatically generates queries by examining the results of previous queries
  • Google’s Deep-Web Crawler (2008) [1]
    • Selects only a small number of input combinations that provide good coverage of the content in the underlying database, and adds the resulting HTML pages to a search engine index
  • DeepPeep [6]
    • Tracks 45,000 forms across 7 domains and allows users to search for these forms
REFERENCES

[1] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, & A. Halevy, “Google’s Deep-Web Crawl,” Proceedings of the VLDB Endowment, 2008. Available: http://www.cs.cornell.edu/~lucja/Publications/I03.pdf. [Accessed June 13, 2010]

[2] C. Mattmann, “Characterizing the Web,” Available: http://sunset.usc.edu/classes/cs572_2010/Characterizing_the_Web.ppt. [Accessed May 19, 2010]

[3] C. Mattmann, “Query Models,” Available: http://sunset.usc.edu/classes/cs572_2010/Query_Models.ppt. [Accessed June 10, 2010]

[4] A. Ntoulas, P. Zerfos, & J. Cho, “Downloading Textual Hidden Web Content by Keyword Queries,” Proceedings of the Joint Conference on Digital Libraries, June 2005. Available: http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf. [Accessed June 13, 2010]

[5] A. Wright, “Exploring a ‘Deep Web’ That Google Can’t Grasp,” The New York Times, February 22, 2009. Available: http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th&emc=th. [Accessed June 1, 2010]

[6] DeepPeep beta. Available: http://www.deeppeep.org/index.jsp

[7] C. A. Knoblock, K. Lerman, S. Minton, & I. Muslea, “Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach,” IEEE Data Engineering Bulletin, 1999. Available: http://www.isi.edu/~muslea/PS/deb-2k.pdf. [Accessed June 28, 2010]

[8] C. A. Knoblock, S. Minton, & I. Muslea, “Hierarchical Wrapper Induction for Semistructured Information Sources,” Journal of Autonomous Agents and Multi-Agent Systems, 2001. Available: http://www.isi.edu/~muslea/PS/jaamas-2k.pdf. [Accessed June 28, 2010]