Crawling the hidden web

Crawling the Hidden Web. Authors: Sriram Raghavan, Hector Garcia-Molina. Presented by: Jorge Zamora. Outline: Hidden Web; Crawler Operation Model; HiWE – Hidden Web Exposer; LITE – Layout-based Information Extraction; Experimental Results; Relation to class lectures; Pros/Cons; Conclusion.



Crawling the Hidden Web

Authors: Sriram Raghavan

Hector Garcia-Molina

Presented by: Jorge Zamora


Outline

  • Hidden Web

  • Crawler Operation Model

  • HiWE – Hidden Web Exposer

  • LITE – Layout-based Information Extraction

  • Experimental Results

  • Relation to class lectures

  • Pros/Cons

  • Conclusion



Hidden Web

  • PIW – Publicly Indexable Web

  • Deep Web

    • 500 times the PIW

  • Hidden Crawler

    • Parse, process and interact with forms

  • Task specific approach

  • Two Steps

    • Resource Discovery

    • Content Extraction





Hidden Crawler – Operation Model

  • Internal form representation

    F = ({E1, E2, …, En}, S, M)

  • Task specific database

    • Formulates search queries

  • Matching Function

    Match(({E1, …, En}, S, M), D) = {[E1 ← v1, …, En ← vn]}

  • Response Analysis

    • Success and error pages, Storage, Tuning
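
The form representation and matching function above can be sketched in Python. This is only an illustration under stated assumptions: the class names, the dictionary-shaped database keyed by label, and the rule that a finite-domain element only accepts values it offers are all hypothetical, not the crawler's actual design.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional


@dataclass
class FormElement:
    label: str                          # descriptive text for the element
    domain: Optional[List[str]] = None  # finite domain, or None for free text


@dataclass
class Form:
    elements: List[FormElement]                         # {E1, ..., En}
    submission: Dict[str, str]                          # S: e.g. target URL, method
    meta: Dict[str, str] = field(default_factory=dict)  # M: meta-information


def match(form: Form, database: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every value assignment [E1 <- v1, ..., En <- vn]."""
    candidates = []
    for element in form.elements:
        values = database.get(element.label, [])
        if element.domain is not None:
            # Assumption: a finite-domain element only takes values it offers.
            values = [v for v in values if v in element.domain]
        candidates.append(values)
    labels = [e.label for e in form.elements]
    return [dict(zip(labels, combo)) for combo in product(*candidates)]


# Hypothetical form and task-specific database D.
form = Form(
    elements=[FormElement("company"), FormElement("type", ["news", "report"])],
    submission={"url": "http://example.com/search", "method": "GET"},
)
database = {"company": ["Intel", "AMD"], "type": ["report", "patent"]}
assignments = match(form, database)
```

Each dictionary in `assignments` is one candidate form submission; an element with no usable value yields no assignments at all.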



Hidden Crawler – Performance

  • Challenge

    • Wanted a metric that does not depend significantly on D, the task-specific database

  • Submission Efficiency

    • Ntotal = total number of forms the crawler submits

    • SEstrict = Nsuccess / Ntotal

      • Penalizes the crawler for submissions that might be correct but did not yield any results

    • SElenient = Nvalid / Ntotal

      • Penalizes the crawler only if the form submission is semantically incorrect

      • Difficult to evaluate - must evaluate every form submission.
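
As a small worked example of the two metrics, the sketch below computes both ratios from raw counts. The function name and the sample counts are made up for illustration; they are not figures from the experiments.

```python
def submission_efficiency(n_total: int, n_success: int, n_valid: int):
    """Return (SE_strict, SE_lenient) for a crawl.

    n_total   -- forms submitted
    n_success -- submissions that returned at least one result
    n_valid   -- submissions that were semantically correct
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_success / n_total, n_valid / n_total


# Hypothetical counts: 200 submissions, 150 with results, 180 valid.
se_strict, se_lenient = submission_efficiency(200, 150, 180)
```

With these counts the strict metric is 0.75 and the lenient one 0.9: every successful submission is also valid, so the lenient value is never lower than the strict one.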



HiWE

  • Hidden Web Exposer

  • Prototype Hidden Web Crawler built at Stanford

  • Basic idea

    • Extract some kind of descriptive information, or label, for each element in the form

    • A task-specific database contains a finite set of categories with associated labels and values

    • A matching algorithm attempts to match form labels with database values to produce value assignment sets



HiWE – Conceptual Parts



HiWE – Form Representation

  • F = ({E1, E2, …, En}, S, ∅)

    • Dom(Ei)

    • Label(Ei)



HiWE – Task specific Database

  • Organized as a finite set of concepts or categories

  • Each concept has one or more labels and associated values

  • Each row in the LVS (Label Value Set) table is of the form (L, V)

    • L is a label

    • V = {v1, …, vn} is a fuzzy set of values

    • vi represents a value

    • The fuzzy set V has an associated membership function Mv

    • Mv(vi) is the crawler's confidence in the assignment of vi
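
One simple way to picture the LVS table is a mapping from each label L to a fuzzy set V, stored as value-to-weight pairs. The sketch below assumes that layout; the labels, values, and weights are invented for illustration.

```python
from typing import Dict

# label L -> fuzzy set V, stored as {value vi: membership Mv(vi) in [0, 1]}
LVSTable = Dict[str, Dict[str, float]]

# Hypothetical rows for a semiconductor-industry task.
lvs: LVSTable = {
    "company": {"Intel": 1.0, "AMD": 0.8},
    "document type": {"press release": 1.0, "white paper": 0.6},
}


def membership(table: LVSTable, label: str, value: str) -> float:
    """Mv(vi): confidence that `value` is a good assignment for `label`.

    Values missing from the fuzzy set get membership 0.
    """
    return table.get(label, {}).get(value, 0.0)
```

For example, `membership(lvs, "company", "AMD")` returns 0.8, while a value that was never recorded returns 0.0.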



HiWE – Matching Function

  • Label Matching

    • All labels are normalized

      • Common case, stemming, stop-word removal

    • String matching

      • Uses minimum edit distances, taking word orderings into account

    • If the best match requires more than σ edit operations, the match is set to nil

  • Ranking Value Assignments

    • Minimum acceptable rank – ρmin

    • Fuzzy conjunction – ρfuz

    • Average – ρavg

    • Probabilistic – ρprob
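
The label matching and ranking steps can be sketched as follows. Several simplifications are assumptions made here: the stop-word list is illustrative, stemming and word-order handling are omitted, and the three ranking functions are read from their names as minimum, mean, and 1 − Π(1 − w) over the membership weights of the chosen values.

```python
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "for", "in", "to"}  # illustrative only


def normalize(label: str) -> str:
    """Lower-case, strip punctuation, drop stop words (no stemming here)."""
    words = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(w for w in words if w not in STOP_WORDS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_label(form_label, lvs_labels, sigma):
    """Closest LVS label, or None (nil) if it needs more than sigma edits."""
    target = normalize(form_label)
    scored = [(edit_distance(target, normalize(l)), l) for l in lvs_labels]
    if not scored:
        return None
    distance, label = min(scored)
    return label if distance <= sigma else None


def rho_fuz(weights):   # fuzzy conjunction: weakest value decides
    return min(weights)


def rho_avg(weights):   # average confidence over all elements
    return sum(weights) / len(weights)


def rho_prob(weights):  # probability that at least one value is right
    return 1 - math.prod(1 - w for w in weights)
```

A value assignment would then be submitted only if its rank under the chosen function is at least ρmin.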



HiWE – Populating LVS Table

  • Explicit Initialization

  • Built-in entries

    • Dates, Times, names of months, days of the week

  • Wrapped data Sources

    • Set of Labels, new entries

    • Set of Values, search similar, expand existing

  • Crawling Experience

    • Finite domain elements

    • Can be used to fill out the second form more efficiently



HiWE – Computing Weights

  • Explicit initialization

    • Fixed, predefined weights (usually 1) representing maximum confidence in human supplied values

  • External data sources or crawler activity

    • Positive boost – Successful

    • Negative boost – Unsuccessful

  • Initial weights obtained from external data sources are computed by the wrapper



HiWE – Computing Weights

  • Finite domain

    • Case 1 – Crawler Extracts label, Label Match found

      • Unions the values in Dom(E) into the value set of the matched row

      • Boost the weights/confidence of the existing values

    • Case 2 – Crawler Extracts label, Label Match = nil

      • New row is added in LVS table

    • Case 3 – Cannot extract label

      • Identify the value set that most closely resembles Dom(E)

      • Once located, add the values in Dom(E) to that value set
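
The three finite-domain cases can be sketched as one update routine over a dictionary-shaped LVS table. The boost size, the initial weight for new values, and the use of Jaccard overlap as the "most closely resembles" test are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Overlap between two collections of values, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def update_lvs(lvs, label, domain, matched_label, boost=0.25, initial=0.5):
    """Fold the finite domain Dom(E) of a form element into the LVS table.

    lvs           -- {label: {value: weight}}
    label         -- extracted label, or None if extraction failed
    matched_label -- LVS label found by label matching, or None (nil)
    """
    if label is not None and matched_label is not None:
        # Case 1: union Dom(E) into the matched row, boosting known values.
        row = lvs[matched_label]
        for v in domain:
            row[v] = min(1.0, row[v] + boost) if v in row else initial
    elif label is not None:
        # Case 2: no matching label, so add a new row.
        lvs[label] = {v: initial for v in domain}
    elif lvs:
        # Case 3: no label; find the row whose values best resemble Dom(E).
        closest = max(lvs, key=lambda l: jaccard(lvs[l], domain))
        if jaccard(lvs[closest], domain) > 0:
            for v in domain:
                lvs[closest].setdefault(v, initial)


# Hypothetical table and three form elements, one per case.
table = {"month": {"Jan": 0.5, "Feb": 1.0}}
update_lvs(table, "Month", ["Jan", "Mar"], "month")  # case 1
update_lvs(table, "state", ["CA", "NY"], None)       # case 2
update_lvs(table, None, ["CA", "TX"], None)          # case 3
```

After these calls, "Jan" has been boosted to 0.75, "Mar" has joined the month row, and "TX" has been added to the new state row because that row overlaps the unlabeled domain.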



HiWE – Explicit Configuration



LITE

  • Layout-based information extraction

  • Used in automatically extracting semantic information from search forms.

  • In addition to text, uses the physical layout of the page to aid in extraction

  • Not always reflected in HTML markup



LITE – Usage in HiWE

  • Used in Label Extraction

  • Implemented by page pruning. Isolate elements that directly influence the layout of the form elements and labels



LITE – Steps

  • Approximates the layout of the pruned page, discarding images, font styles and style sheets

  • Identifies the pieces of text closest to the form element as candidates

  • Ranks each candidate, taking into account position, font size, font style and number of words

  • Chooses the highest-ranked candidate as the label associated with the element
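
The candidate selection and ranking steps might look like the sketch below. It assumes the layout step has already produced (x, y) positions for each piece of text; the scoring weights, the word limit, and all names are hypothetical, not LITE's actual heuristics.

```python
import math
from dataclasses import dataclass


@dataclass
class TextPiece:
    text: str
    x: float                 # position in the approximated layout
    y: float
    font_size: float = 12.0
    bold: bool = False


def distance(piece, ex, ey):
    return math.hypot(piece.x - ex, piece.y - ey)


def score(piece, ex, ey, max_words=6):
    """Higher is better: near, prominent, short text wins (made-up weights)."""
    s = -distance(piece, ex, ey)
    s += 0.5 * (piece.font_size - 12.0)            # larger font helps a little
    s += 5.0 if piece.bold else 0.0                # bold text looks like a label
    extra_words = max(0, len(piece.text.split()) - max_words)
    return s - 10.0 * extra_words                  # long sentences are unlikely labels


def extract_label(pieces, element_x, element_y, k=4):
    """Take the k closest pieces as candidates and return the best-ranked text."""
    nearest = sorted(pieces, key=lambda p: distance(p, element_x, element_y))[:k]
    if not nearest:
        return None
    return max(nearest, key=lambda p: score(p, element_x, element_y)).text


# Hypothetical page: a text box at (100, 100) with three nearby text pieces.
pieces = [
    TextPiece("Company name", 40, 100),
    TextPiece("Year", 40, 140),
    TextPiece("Search our archive of press releases and white papers here", 100, 60),
]
label = extract_label(pieces, element_x=100, element_y=100)
```

Here the long sentence is physically closest to the element but is penalized for its length, so "Company name" is chosen as the label.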



Experiment - Parameters

  • Task 1 is shown: “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”



Results – Value Ranking

  • The crawler was executed three times with the same parameters and initialization values, but with a different ranking function each time

  • ρavg (average) might be a better choice for maximum content extraction

  • ρfuz (fuzzy conjunction) is the most efficient

  • ρprob (probabilistic) submits the most forms but performs poorly



Results – Form Size

  • [Chart: number of form submissions; bar values and percentage labels are not recoverable from the transcript]



Results – Crawler additions to LVS



Results – LITE Label Extraction

  • Forms with 1 to 10 elements

  • Manually analyzed to derive the correct label

  • Also ran other label extraction heuristics

    • Purely textual analysis

    • Common ways forms are laid out

  • LITE extracted the correct label 93% of the time, vs. 72% and 83% for the other heuristics



Relation to Class Notes

  • Content driven Crawler

    • Different crawlers for different purposes

  • Contains Similar crawler Metrics

    • Crawling speed

    • Scalability

    • Page importance

    • Freshness

  • Data Transfer

    • Stored after crawled



Cons

  • Freshness/Recrawling isn’t addressed

  • Task specific, human configuration

  • Login-based sites, cookie jar implementation

  • Didn’t discuss hidden fields or CAPTCHAs

  • Didn’t run Task 1 without LITE for comparison

  • Does not use the “name” attribute of form elements

  • Required fields vs. not required

  • Wild cards, incomplete forms

  • Form element dependencies



Pros

  • First Hidden Crawler Report

  • Not run at runtime

    • VS. shopping and travel sites that do.

  • Gets better over time



Conclusion / Thoughts

  • Hidden web is much bigger now.

  • Hidden web is now reached through Google Analytics and Google Ads

  • We now also have AJAX-based forms. How do we deal with them?



Thank You

Questions?
