Crawling the hidden web


Crawling the Hidden Web. Authors: Sriram Raghavan, Hector Garcia-Molina. Presented by: Jorge Zamora. Outline: Hidden Web; Crawler Operation Model; HiWE – Hidden Web Exposer; LITE – Layout-based Information Extraction; Experimental Results; Relation to class lectures; Pros/Cons; Conclusion.

Presentation Transcript
Crawling the Hidden Web

Authors: Sriram Raghavan

Hector Garcia-Molina

Presented by: Jorge Zamora


Outline

  • Hidden Web

  • Crawler Operation Model

  • HiWE – Hidden Web Exposer

  • LITE – Layout-based Information Extraction

  • Experimental Results

  • Relation to class lectures

  • Pros/Cons

  • Conclusion

Hidden Web

  • PIW – Publicly Indexable Web

  • Deep Web

    • Estimated at 500 times the size of the PIW

  • Hidden Crawler

    • Parse, process and interact with forms

  • Task specific approach

  • Two Steps

    • Resource Discovery

    • Content Extraction

Hidden Crawler – Operation Model

  • Internal form representation

    F = ({E1, E2, …, En}, S, M)

  • Task specific database

    • Formulates search queries

  • Matching Function

    Match(({E1, …, En}, S, M), D) = {[E1 ← v1, …, En ← vn]}

  • Response Analysis

    • Success and error pages, Storage, Tuning
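The operation model above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the slide does not define S and M, so the sketch assumes S holds submission details (action URL, method) and M holds meta-information about the form. The names `Form` and `match`, the sample labels, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Form:
    elements: list                            # E1..En, identified here by their labels
    submission: dict                          # S (assumed: action URL, HTTP method)
    meta: dict = field(default_factory=dict)  # M (assumed: meta-information)


def match(form, database):
    """Return candidate value assignments [E1 <- v1, ..., En <- vn].

    `database` stands in for the task-specific database D and maps a label
    to a list of candidate values.
    """
    per_element = [database.get(label, []) for label in form.elements]
    if any(not values for values in per_element):
        return []  # some element has no candidate values, so no assignment
    return [dict(zip(form.elements, combo)) for combo in product(*per_element)]


form = Form(["company", "year"], {"action": "/search", "method": "GET"})
database = {"company": ["Intel", "AMD"], "year": ["1999", "2000"]}
assignments = match(form, database)
```

Each entry in `assignments` is one complete way to fill out the form; a real crawler would rank and filter these before submitting them.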

Hidden Crawler – Performance

  • Challenge

    • Wanted a metric that does not depend heavily on the task-specific database D

  • Submission Efficiency

    • Ntotal = total number of forms the crawler submits

    • SEstrict = Nsuccess/Ntotal

      • Penalizes the crawler for submissions that may be correct but yield no results

    • SElenient = Nvalid/Ntotal

      • Penalizes the crawler only if the form submission is semantically incorrect

      • Difficult to evaluate: every form submission must be checked
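As a worked example of the two metrics, the following sketch computes both ratios from raw counts. The function name and the counts are made up for illustration; they are not results from the paper.

```python
def submission_efficiency(n_total, n_success, n_valid):
    """Return (SE_strict, SE_lenient) for one crawl.

    n_total   -- forms submitted
    n_success -- submissions that returned results
    n_valid   -- submissions that were semantically correct
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_success / n_total, n_valid / n_total


# Hypothetical counts, for illustration only.
strict, lenient = submission_efficiency(n_total=200, n_success=150, n_valid=180)
```

With these counts the strict metric is 0.75 and the lenient metric is 0.9; the gap is the share of submissions that were sensible but returned nothing.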

HiWE

  • Hidden Web Exposer

  • Prototype Hidden Web Crawler built at Stanford

  • Basic idea

    • extract some kind of descriptive information or label for each element in the form

    • Task-specific database contains a finite set of categories with associated labels

    • Matching algorithm attempts to match form labels with database values to produce value assignment sets

HiWE – Conceptual Parts

HiWE – Form Representation

  • F = ({E1, E2, …, En}, S, ∅)

    • Dom(Ei)

    • Label(Ei)

HiWE – Task specific Database

  • Organized as a finite set of concepts or categories

  • Each concept has one or more labels and associated values

  • Each Row in the LVS table is of the form (L, V),

    • L is a label

    • V = {v1, …, vn} is a fuzzy set of values

    • vi represents a value

    • Fuzzy set V has an associated membership function Mv

    • Mv(vi) is the crawler's confidence in the assignment of vi
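One way to picture the LVS table is as a dictionary from each label L to a fuzzy set V, with the membership function stored alongside each value. This is a minimal sketch under that assumption; the labels, values, weights, and the helper name `membership` are all invented for illustration.

```python
# Each LVS row (L, V): label -> fuzzy set stored as {value: membership weight}.
lvs = {
    "company": {"Intel": 1.0, "AMD": 1.0, "Acme Chips": 0.6},
    "month": {"January": 1.0, "February": 1.0},
}


def membership(table, label, value):
    """Mv(value): the crawler's confidence that `value` fits `label`.

    Unknown labels or values get 0.0 (an assumption of this sketch).
    """
    return table.get(label, {}).get(value, 0.0)


confidence = membership(lvs, "company", "Acme Chips")
```

Storing weights next to values keeps the later boosting steps simple: raising or lowering confidence is a single dictionary update.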

HiWE – Matching Function

  • Label Matching

    • All labels are normalized

      • Common case, Stemming, Stop word removal

    • String Matching

      • with min edit distances, word orderings

    • Threshold σ: if the closest label needs more than σ edit operations, the match is set to nil

  • Ranking Value Assignments

    • Minimum acceptable rank – ρmin

    • Fuzzy conjunction – ρfuz

    • Average – ρavg

    • Probabilistic – ρprob
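The label-matching step above can be sketched as normalization followed by an edit-distance lookup with a threshold. This is an illustration under stated assumptions: the stop-word list is made up, stripping a trailing "s" stands in for real stemming, word ordering is not modeled, and the default `sigma=2` is arbitrary.

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "by", "for", "in"}  # illustrative list


def normalize(label):
    """Lower-case, drop punctuation and stop words, apply a crude stem."""
    words = re.findall(r"[a-z0-9]+", label.lower())
    words = [w for w in words if w not in STOP_WORDS]
    # Stand-in for stemming: strip a trailing "s" from longer words.
    words = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    return " ".join(words)


def edit_distance(a, b):
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def match_label(form_label, lvs_labels, sigma=2):
    """Return the closest LVS label, or None (nil) if it is too far away."""
    target = normalize(form_label)
    best, best_dist = None, None
    for candidate in lvs_labels:
        dist = edit_distance(target, normalize(candidate))
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    if best is None or best_dist > sigma:
        return None
    return best
```

For example, `match_label("The Company:", ["company", "year"])` normalizes to an exact match, while a label such as "Document type" is too far from both candidates and yields `None`.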

HiWE – Populating LVS Table

  • Explicit Initialization

  • Built-in entries

    • Dates, Times, names of months, days of the week

  • Wrapped data Sources

    • Set of Labels, new entries

    • Set of Values, search similar, expand existing

  • Crawling Experience

    • Finite domain elements

    • Can be used to fill out the second form more efficiently

HiWE – Computing Weights

  • Explicit initialization

    • Fixed, predefined weights (usually 1) representing maximum confidence in human supplied values

  • External data sources or crawler activity

    • Positive boost – Successful

    • Negative boost – Unsuccessful

  • Initial weights obtained from external data sources are computed by the wrapper

HiWE – Computing Weights

  • Finite domain

    • Case 1 – Crawler Extracts label, Label Match found

      • Unions the values in Dom(E) into the matched row's value set

      • Boost the weights/confidence of the existing values

    • Case 2 – Crawler Extracts label, Label Match = nil

      • New row is added in LVS table

    • Case 3 – Can not extract label

      • Identify the value set that most closely resembles Dom(E)

      • Once located, add values in Dom(E) to value set
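The three finite-domain cases can be sketched as a single update routine over a dictionary-based LVS table. Several details are assumptions of this sketch rather than anything stated on the slide: labels are matched exactly, Jaccard overlap measures how closely a value set resembles Dom(E), and `boost`, `new_weight`, and `min_overlap` are hypothetical parameters.

```python
def update_lvs(lvs, label, domain, boost=0.25, new_weight=0.5, min_overlap=0.3):
    """Fold a finite-domain element's values Dom(E) into the LVS table.

    `lvs` maps label -> {value: weight}. `label` is the matched LVS label,
    a new label, or None when no label could be extracted.
    Returns the label of the row that was updated, or None.
    """
    domain = set(domain)

    def jaccard(row):
        keys = set(row)
        union = keys | domain
        return len(keys & domain) / len(union) if union else 0.0

    if label is not None and label in lvs:
        row = lvs[label]                                   # Case 1: match found
    elif label is not None:
        lvs[label] = {v: new_weight for v in domain}       # Case 2: new row
        return label
    else:
        # Case 3: no label, so look for the most similar existing value set.
        label = max(lvs, key=lambda name: jaccard(lvs[name]), default=None)
        if label is None or jaccard(lvs[label]) < min_overlap:
            return None
        row = lvs[label]

    for value in domain:
        if value in row:
            row[value] = min(1.0, row[value] + boost)  # boost existing values
        else:
            row[value] = new_weight                    # union in new values
    return label


table = {"month": {"January": 0.5, "February": 1.0}}
update_lvs(table, "month", ["January", "March"])   # Case 1
update_lvs(table, "state", ["CA", "NY"])           # Case 2
matched = update_lvs(table, None, ["CA", "NY", "TX"])  # Case 3
```

In the usage lines, the third call has no label, finds that the "state" row overlaps most with the element's domain, and adds "TX" to it.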

HiWE – Explicit Configuration

LITE

  • Layout-based information extraction

  • Used in automatically extracting semantic information from search forms.

  • In addition to text, uses the physical layout of the page to aid in extraction

  • Physical layout is not always reflected in the HTML markup

LITE – Usage in HiWE

  • Used in Label Extraction

  • Implemented by page pruning. Isolate elements that directly influence the layout of the form elements and labels

LITE – Steps

  • Approximates the layout of the pruned page, discarding images, font styles and style sheets

  • Identifies the pieces of text closest to the form element as candidates

  • Ranks each candidate, taking into account position, font size, font style and number of words

  • Chooses the highest-ranked candidate as the label for the element
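The candidate selection and ranking steps can be sketched as follows. This is not LITE's actual scoring: the slide lists the features (position, font size, font style, number of words) but not how they are combined, so every weight below is invented, and `TextPiece`, `score`, and `extract_label` are hypothetical names. Coordinates are assumed to come from an approximate layout of the pruned page.

```python
import math
from dataclasses import dataclass


@dataclass
class TextPiece:
    text: str
    x: float            # approximate layout position of the text
    y: float
    font_size: int = 12
    bold: bool = False


def score(piece, ex, ey):
    """Heuristic rank of `piece` as the label of the element at (ex, ey)."""
    s = -math.hypot(piece.x - ex, piece.y - ey)    # closer is better
    if piece.x <= ex and abs(piece.y - ey) < 10:
        s += 20                                    # left of element, same row
    elif piece.y < ey:
        s += 10                                    # above the element
    s += 0.5 * piece.font_size                     # larger text is favoured
    if piece.bold:
        s += 5                                     # so is emphasised text
    s -= 2 * max(0, len(piece.text.split()) - 3)   # labels tend to be short
    return s


def extract_label(pieces, ex, ey, k=4):
    """Rank the k nearest text pieces and return the best one's text."""
    nearest = sorted(pieces, key=lambda p: math.hypot(p.x - ex, p.y - ey))[:k]
    if not nearest:
        return None
    return max(nearest, key=lambda p: score(p, ex, ey)).text


pieces = [
    TextPiece("Company:", 140, 100, bold=True),
    TextPiece("Enter the name of the company you are looking for", 200, 140),
    TextPiece("Search", 400, 300),
]
label = extract_label(pieces, ex=200, ey=100)
```

Here the short bold text on the same row as the element outranks the longer help text below it, even though the help text is physically closer.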

Experiment - Parameters

  • Task 1 (shown) targets “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”

Results – Value Ranking

  • The crawler was executed three times with the same parameters and initialization values, but with a different ranking function each time

  • ρavg might be a better choice for maximum content extraction

  • ρfuz is the most efficient

  • ρprob submits the most forms but performs poorly
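To make the comparison concrete, here is a sketch of three ways to combine per-element confidence weights into one rank for a value assignment. The slides name the ranking functions but do not define them, so the definitions below are assumptions: a fuzzy conjunction taken as the minimum, an arithmetic mean, and a probabilistic combination of independent confidences. The sample weights are made up, and each function expects a non-empty list.

```python
import math


def rho_fuz(weights):
    """Fuzzy conjunction: the assignment is only as good as its weakest value."""
    return min(weights)


def rho_avg(weights):
    """Average of the individual value confidences."""
    return sum(weights) / len(weights)


def rho_prob(weights):
    """Probabilistic: chance that at least one value is right, if independent."""
    return 1 - math.prod(1 - w for w in weights)


weights = [0.5, 0.75]  # hypothetical membership weights for a two-field form
ranks = (rho_fuz(weights), rho_avg(weights), rho_prob(weights))
```

For these weights the three ranks are 0.5, 0.625 and 0.875. Under these definitions the probabilistic rank is never lower than the other two, so with a fixed minimum rank it lets the most assignments through, which fits the observation that it submits the most forms.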

Results – Form Size

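[Chart: number of form submissions, with success percentages, by form size. The data labels (counts from 1404 to 3735 and percentages from 78.9% to 90%) cannot be matched to their bars in this transcript.]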

Results – Crawler additions to LVS

Results – LITE Label Extraction

  • Forms with 1 to 10 elements

  • Manually analyzed to derive the correct label

  • Also ran other label extraction heuristics

    • Purely textual analysis

    • Common ways forms are laid out

  • LITE achieved 93% accuracy vs. 72% and 83% for the other heuristics

Relation to Class Notes

  • Content driven Crawler

    • Different crawlers for different purposes

  • Contains Similar crawler Metrics

    • Crawling speed

    • Scalability

    • Page importance

    • Freshness

  • Data Transfer

    • Stored after crawled

Cons

  • Freshness/Recrawling isn’t addressed

  • Task specific, human configuration

  • Login Based, Cookie JAR implementation

  • Didn’t discuss hidden fields or CAPTCHAs

  • Didn’t run task 1 results without LITE.

  • Not using the “name” element tag in form elements

  • Required fields vs. not required

  • Wild cards, incomplete forms

  • Form element dependencies

Pros

  • First Hidden Crawler Report

  • Crawl is not run at query time

    • Unlike shopping and travel sites, which query hidden databases at runtime

  • Gets better over time

Conclusion / Thoughts

  • Hidden web is much bigger now.

  • Hidden web is now reached through Google Analytics and Google Ads

  • Now we also have AJAX-based forms. How do we deal with them?

Thank You

Questions?