Crawling the Hidden Web

by Michael Weinberg ([email protected])

Internet DB Seminar, The Hebrew University of Jerusalem,
School of Computer Science and Engineering, December 2001


Agenda

  • Hidden Web - what is it all about?

  • Generic model for a hidden Web crawler

  • HiWE (Hidden Web Exposer)

  • LITE – Layout-based Information Extraction Technique

  • Results from experiments conducted to test these techniques

Web Crawlers

  • Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit

  • Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)

  • PIW – the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication

The Hidden Web

  • Recent studies show that a significant fraction of Web content in fact lies outside the PIW

  • Large portions of the Web are ‘hidden’ behind search forms in searchable databases

  • HTML pages are dynamically generated in response to queries submitted via the search forms

  • Also referred to as the ‘Deep’ Web

The Hidden Web Growth

  • The Hidden Web continues to grow, as organizations with large amounts of high-quality information place their content online, providing Web-accessible search facilities over existing databases

  • For example:

    • Census Bureau

    • Patents and Trademarks Office

    • News media companies

  • InvisibleWeb.com lists over 10,000 such databases

Surface Web

[figure]


Deep Web

[figure]


Deep Web Content Distribution

[figure]


Deep Web Stats

  • The Deep Web is 500 times larger than the PIW!

  • Contains 7,500 terabytes of information (March 2000)

  • More than 200,000 Deep Web sites exist

  • Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information

  • 95% of the Deep Web is publicly accessible (no fees)

  • Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today

The Problem

  • Hidden Web contains large amounts of high-quality information

  • The information is buried on dynamically generated sites

  • Search engines that use traditional crawlers never find this information

The Solution

  • Build a hidden Web crawler

  • Can crawl and extract content from hidden databases

  • Enable indexing, analysis, and mining of hidden Web content

  • The content extracted by such crawlers can be used to categorize and classify the hidden databases

Challenges

  • Significant technical challenges in designing a hidden Web crawler

  • Should interact with forms that were designed primarily for human consumption

  • Must provide input in the form of search queries

  • How do we equip the crawler with input values for use in constructing search queries?

  • To address these challenges, we adopt the task-specific, human-assisted approach

Task-Specificity

  • Extract content based on the requirements of a particular application or task

  • For example, consider a market analyst interested in press releases, articles, etc… pertaining to the semiconductor industry, and dated sometime in the last ten years

Human-Assistance

  • Human assistance is critical to ensure that the crawler issues queries that are relevant to the particular task

  • For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest

  • The crawler will be able to gather additional potential company and product names as it processes a number of pages

Two Steps

  • There are two steps in achieving our goal:

    • Resource discovery – identify sites and databases that are likely to be relevant to the task

    • Content extraction – actually visit the identified sites to submit queries and extract the hidden pages

  • In this presentation we do not directly address the resource discovery problem

Hidden Web Crawlers

User Form Interaction

[diagram: a user interacting with a web query front-end backed by a hidden database]

  (1) Download the form page
  (2) View the form
  (3) Fill out the form
  (4) Submit the form
  (5) Download the response page from the hidden database
  (6) View the result

Operation Model

  • Our model of a hidden Web crawler consists of four components:

    • Internal Form Representation

    • Task-specific database

    • Matching function

    • Response Analysis

  • Form Page – the page containing the search form

  • Response Page – the page received in response to a form submission

Generic Operational Model

[diagram: the crawler downloads the form page; form analysis builds the
Internal Form Representation; the Match step combines it with the
task-specific database to produce a set of value assignments; filled forms
are submitted to the web query front-end of the hidden database; downloaded
response pages go through Response Analysis and into the repository]

Internal Form Representation

  • Form F = ({E1, E2, …, En}, S, M)

  • {E1, …, En} is a set of n form elements

  • S – submission information associated with the form:

    • submission URL

    • Internal identifiers for each form element

  • M – meta-information about the form:

    • web-site hosting the form

    • set of pages pointing to this form page

    • other text on the page besides the form
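
To make the representation concrete, here is a minimal sketch in Python of the form model F = ({E1, …, En}, S, M); the class and field names are illustrative, not taken from HiWE's actual code.

```python
# Illustrative sketch of the internal form representation F = ({E1..En}, S, M).
# All names are hypothetical, not from the HiWE implementation.
from dataclasses import dataclass, field

@dataclass
class FormElement:
    identifier: str                 # internal identifier used in S on submission
    domain: set[str] | None = None  # finite domain, or None for free-text input
    label: str | None = None        # descriptive label, if one was extracted

@dataclass
class Form:
    elements: list[FormElement]     # {E1, ..., En}
    submission_url: str             # part of S, together with element identifiers
    meta: dict = field(default_factory=dict)  # M: hosting site, referring pages, page text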

Task-specific Database

  • The crawler is equipped with a task-specific database D

  • Contains the necessary information to formulate queries relevant to the particular task

  • In the ‘market analyst’ example, D could contain a list of semiconductor company and product names

  • The actual format and organization of D are specific to a particular crawler implementation

  • HiWE uses a set of labeled fuzzy sets

Matching Function

  • Matching algorithm properties:

    • Input: the internal form representation and the current contents of the database D

    • Output: a set of value assignments Match(F, D) = {E1 ← v1, …, En ← vn}

    • The assignment Ei ← vi associates value vi with form element Ei
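
As an interface, the matching step might look like the following sketch (names are hypothetical); the concrete strategy HiWE uses to build the assignments appears on later slides.

```python
# Hypothetical interface for the matching step; HiWE's concrete strategy
# (label matching against the LVS table) is described on later slides.
def match(form_elements: list[dict], database: dict) -> list[dict[str, str]]:
    """Given the internal form representation and the task-specific database D,
    return a set of value assignments: one {element_id: value} dict per
    candidate form submission."""
    raise NotImplementedError
```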

Response Analysis

  • Module that stores the response page in the repository

  • Attempts to distinguish between pages containing search results and pages containing error messages

  • This feedback is used to tune the matching function

Traditional Performance Metric

  • Performance metrics for traditional crawlers:

    • Crawling speed

    • Scalability

    • Page importance

    • Freshness

  • These metrics are relevant to hidden Web crawlers, but they do not capture the fundamental challenges in dealing with the Hidden Web

New Performance Metrics

  • Coverage metric:

    • Coverage = (‘relevant’ pages extracted) / (‘relevant’ pages present in the targeted hidden databases)

    • Problem: difficult to estimate how much of the hidden content is relevant to the task

New Performance Metrics

  • Ntotal : the total number of forms that the crawler submits

  • Nsuccess : the number of submissions that result in a response page with one or more search results

  • Submission efficiency: SE = Nsuccess / Ntotal

  • Problem: the crawler is penalized if the database didn’t contain any relevant search results

New Performance Metrics

  • Nvalid : the number of semantically correct form submissions

  • Strict submission efficiency: SEstrict = Nvalid / Ntotal

  • Penalizes the crawler only if a form submission is semantically incorrect

  • Problem: difficult to evaluate, since deciding whether a submission is semantically correct requires manual inspection

Design Issues

  • What information about each form element should the crawler collect?

  • What meta-information is likely to be useful?

  • How should the task-specific database be organized, updated and accessed?

  • What Match function is likely to maximize submission efficiency?

  • How to use the response analysis module to tune the Match function?

HiWE: Hidden Web Exposer

Basic Idea

  • Extract descriptive information (label) for each element of a form

  • Task-specific database is organized in terms of categories, each of which is also associated with labels

  • The matching function attempts to match form labels to database categories to compute a set of candidate value assignments

HiWE Architecture

[diagram: the Crawl Manager drives the crawl over the WWW using the URL List
(URL 1 … URL N); the Parser extracts hypertext links from crawled pages; the
Form Analyzer, Form Processor and Response Analyzer implement form processing,
submission and response handling; the LVS Manager maintains the LVS Table of
(Label, Value-Set) pairs, fed by custom data sources and by feedback from the
Response Analyzer]


HiWE’s Main Modules

  • URL List:

    • contains all the URLs the crawler has discovered so far

  • Crawl Manager:

    • controls the entire crawling process

  • Parser:

    • extracts hypertext links from the crawled pages and adds them to the URL list

  • Form Analyzer, Form Processor, Response Analyzer:

    • Together implement the form processing and submission operations

HiWE’s Main Modules

  • LVS Manager:

    • Manages additions and accesses to the LVS table

  • LVS table:

    • HiWE’s implementation of the task-specific database

HiWE’s Form Representation

  • Form F = ({E1, …, En}, S, M)

    • The third component, M, is the empty set, since the current implementation of HiWE does not collect any meta-information about the form

  • For each element Ei, HiWE collects a domain Dom(Ei) and a label label(Ei)

HiWE’s Form Representation

  • Domain of an element:

    • Set of values which can be associated with the corresponding form element

    • May be a finite set (e.g., the domain of a selection list)

    • May be an infinite set (e.g., the domain of a text box)

  • Label of an element:

    • The descriptive information associated with the element, if any

    • Most forms include some descriptive text to help users understand the semantics of the element

Form Representation - Figure

Element E1:
  Label(E1) = "Document Type"
  Dom(E1) = {Articles, Press Releases, Reports}

Element E2:
  Label(E2) = "Company Name"
  Dom(E2) = {s | s is a text string}

Element E3:
  Label(E3) = "Sector"
  Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}

HiWE’s Task-specific Database

  • Task-specific information is organized in terms of a finite set of concepts or categories

  • Each concept has one or more labels and an associated set of values

  • For example, the label ‘Company Name’ could be associated with the set of values {‘IBM’, ‘Microsoft’, ‘HP’, …}

HiWE’s Task-specific Database

  • The concepts are organized in a table called the Label Value Set (LVS)

  • Each entry in the LVS table is of the form (L, V):

    • L : a label

    • V : a fuzzy set of values

    • The fuzzy set V has an associated membership function M_V that assigns a weight in the range [0,1] to each member of the set

    • M_V(v) is a measure of the crawler’s confidence that the assignment of v to E is semantically meaningful
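
One natural way to realize this in code is a dict from label to fuzzy value set, where each fuzzy set maps a value to its membership weight; the sample entries below are illustrative, not from the paper.

```python
# Illustrative LVS table: label -> fuzzy value set (value -> weight in [0, 1]).
lvs_table: dict[str, dict[str, float]] = {
    "company name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "state":        {"California": 1.0, "Nevada": 0.9},
}

def membership(label: str, value: str) -> float:
    """M_V(v): the crawler's confidence that assigning v is semantically meaningful."""
    return lvs_table.get(label, {}).get(value, 0.0)
```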

HiWE’s Matching Function

  • For elements with a finite domain:

    • The set of possible values is fixed and can be exhaustively enumerated

    • For example, with Label(E1) = "Document Type" and Dom(E1) = {Articles, Press Releases, Reports}, the crawler can first retrieve all relevant articles, then all relevant press releases, and finally all relevant reports

HiWE’s Matching Function

  • For elements with an infinite domain:

    • HiWE textually matches the labels of these elements with labels in the LVS table

    • For example, if a textbox element has the label “Enter State”, which best matches an LVS entry with the label “State”, the values associated with that LVS entry (e.g., “California”) can be used to fill the textbox

    • How do we match Form labels with LVS labels?

Label Matching

  • Two steps in matching Form labels with LVS labels:

    • 1. Normalization: includes conversion to a common case and standard style

    • 2. Use of an approximate string matching algorithm to compute minimum edit distances

    • HiWE employs D. Lopresti and A. Tomkins’ string matching algorithm, which takes word reordering into account
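
A simplified stand-in for this matcher is sketched below: it normalizes both labels and uses plain Levenshtein distance, whereas HiWE's actual matcher uses the Lopresti–Tomkins block edit model, which also handles word reordering.

```python
# Simplified stand-in for HiWE's label matcher: normalization plus plain
# Levenshtein distance (the real matcher uses the Lopresti-Tomkins block
# edit model, which also accounts for word reordering).
def normalize(label: str) -> str:
    return " ".join(label.lower().split())  # common case, collapsed whitespace

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def label_match(form_label: str, lvs_labels: list[str], sigma: int):
    """Return the closest LVS label, or None if all are more than sigma away."""
    key = normalize(form_label)
    best = min(lvs_labels, key=lambda l: edit_distance(key, normalize(l)))
    return best if edit_distance(key, normalize(best)) <= sigma else None
```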

Label Matching

  • Let LabelMatch(Ei) denote the LVS entry with the minimum edit distance to label(Ei)

  • A threshold σ bounds the acceptable distance

  • If all LVS entries are more than σ edit operations away from label(Ei), then LabelMatch(Ei) = nil

Label Matching

  • For each element Ei, compute a pair (Di, wi):

    • If Ei has an infinite domain and (L, V) is the closest matching LVS entry, then Di = V and wi(v) = M_V(v)

    • If Ei has a finite domain, then Di = Dom(Ei) and wi(v) = 1 for every v

  • The set of value assignments is computed as the product of all the Di’s: D1 × D2 × … × Dn

  • Too many assignments?
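
In code, the cross product can be taken directly with itertools.product; the per-element candidate sets below are illustrative.

```python
from itertools import product

# Illustrative (Di, wi) pairs: a finite-domain element gets wi = 1 for every
# value; an infinite-domain element inherits the matched LVS fuzzy set.
per_element: dict[str, dict[str, float]] = {
    "Document Type": {"Articles": 1.0, "Press Releases": 1.0, "Reports": 1.0},
    "Company Name":  {"IBM": 1.0, "HP": 0.8},
}

# Cross product D1 x D2: 3 * 2 = 6 candidate value assignments.
assignments = [dict(zip(per_element, combo))
               for combo in product(*per_element.values())]
# Too many in general -- hence the ranking functions on the next slides.
```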

Ranking Value Assignments

  • HiWE employs an aggregation function to compute a rank for each value assignment

  • Uses a configurable parameter ρmin, the minimum acceptable value assignment rank

  • The intent is to improve submission efficiency by only using ‘high-quality’ value assignments

  • We will show three possible aggregation functions; a combined code sketch follows the third

Fuzzy Conjunction

  • The rank of a value assignment is the minimum of the weights of all the constituent values: ρfuz = min_i wi(vi)

  • Very conservative in assigning ranks: it assigns a high rank only if each individual weight is high

Average

  • The rank of a value assignment is the average of the weights of the constituent values: ρavg = (1/n) Σ_i wi(vi)

  • Less conservative than fuzzy conjunction

Probabilistic

  • This ranking function treats weights as probabilities

  • wi(vi) is the likelihood that the choice of vi is useful, and 1 − wi(vi) is the likelihood that it is not

  • The likelihood of a value assignment being useful is: ρprob = 1 − Π_i (1 − wi(vi))

  • Assigns a low rank only if all the individual weights are very low
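
A minimal sketch of all three aggregation functions over the weights w1…wn of an assignment's constituent values:

```python
# Minimal sketches of the three aggregation functions.
def rank_fuzzy_conjunction(weights: list[float]) -> float:
    return min(weights)                # high only if every individual weight is high

def rank_average(weights: list[float]) -> float:
    return sum(weights) / len(weights) # less conservative than fuzzy conjunction

def rank_probabilistic(weights: list[float]) -> float:
    p_all_useless = 1.0
    for w in weights:                  # treat w_i as P(choice of v_i is useful)
        p_all_useless *= 1.0 - w
    return 1.0 - p_all_useless         # low only if all weights are very low
```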

Populating the LVS Table

  • HiWE supports a variety of mechanisms for adding entries to the LVS table:

    • Explicit Initialization

    • Built-in entries

    • Wrapped data sources

    • Crawling experience

Explicit Initialization

  • Supply labels and associated value sets at startup time

  • Useful for equipping the crawler with the labels it is most likely to encounter

  • In the ‘semiconductor’ example, we supply HiWE with a list of relevant company names and associate the list with the labels ‘Company’ and ‘Company Name’

Built-in Entries

  • HiWE has built-in entries for commonly used concepts:

    • Dates and Times

    • Names of months

    • Days of week

Wrapped Data Sources

  • LVS Manager can query data sources through a well-defined interface

  • The data source must be ‘wrapped’ by a program that supports two kinds of queries:

    • Given a set of labels, return a value set

    • Given a set of values, return other values that belong to the same value set
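
In Python terms, the wrapper contract might be expressed as a Protocol like the following (method names are hypothetical; the slide only specifies the two kinds of queries):

```python
from typing import Protocol

# Hypothetical wrapper contract; HiWE's real interface is not spelled out here.
class WrappedDataSource(Protocol):
    def values_for_labels(self, labels: set[str]) -> dict[str, float]:
        """Given a set of labels, return a value set (value -> weight)."""

    def expand_values(self, values: set[str]) -> dict[str, float]:
        """Given a set of values, return other values from the same value set."""
```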

HiWE Architecture (revisited)

[diagram repeated from the HiWE Architecture slide above]

Crawling Experience

  • Finite domain form elements are a useful source of labels and associated value sets

  • HiWE adds this information to the LVS table

  • Effective when a similar label is associated with a finite domain element in one form and with an infinite domain element in another

Computing Weights

  • A new value added to the LVS table must be assigned a suitable weight

  • Explicit-initialization and built-in values have fixed weights

  • Values obtained from external data sources or through the crawler’s own activity are assigned weights that vary with time

Initial Weights

  • For external data sources - computed by the respective wrappers

  • For values directly gathered by the crawler:

    • Consider a finite domain element E with domain Dom(E)

    • Dom(E) is treated as a fuzzy set whose membership weight is w(v) = 1 iff v ∈ Dom(E)

    • Three cases arise when incorporating Dom(E) into the LVS table

Updating LVS – Case 1

  • The crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V):

    • Replace the entry (L, V) by (L, V ∪ Dom(E)), where the fuzzy union keeps the maximum of a value’s weights in V and Dom(E)

    • Intuitively, Dom(E) provides new elements to the value set and ‘boosts’ the weights of existing elements

Updating LVS – Case 2

  • Crawler successfully extracts label(E) but LabelMatch(E) = nil:

    • A new entry (label(E), Dom(E)) is created in the LVS table

Updating LVS – Case 3

  • The crawler cannot extract label(E):

    • For each entry (L, V), compute a score measuring the overlap between Dom(E) and V (e.g., the weighted fraction of Dom(E) already present in V)

    • Identify the entry (Lmax, Vmax) with the maximum score, and the value smax of that score

    • Replace the entry (Lmax, Vmax) with (Lmax, Vmax ∪ Dom(E))

    • New values are added with confidence (weight) smax
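
Putting the three cases together, a sketch of the update logic might look as follows; the Case 3 score (the weighted fraction of Dom(E) already present in V) is one plausible reading of the slide rather than a formula from the paper, and exact dict lookup stands in for LabelMatch.

```python
# Sketch of the three LVS update cases. Fuzzy sets are value -> weight dicts;
# exact dict lookup stands in for LabelMatch, and the Case 3 score is one
# plausible reading of the slide, not a formula from the paper.
def update_lvs(lvs: dict[str, dict[str, float]],
               label: str | None, dom: set[str]) -> None:
    if label is not None and label in lvs:      # Case 1: label extracted and matched
        V = lvs[label]
        for v in dom:
            V[v] = max(V.get(v, 0.0), 1.0)      # fuzzy union: add new values, boost old
    elif label is not None:                     # Case 2: label extracted, no match
        lvs[label] = {v: 1.0 for v in dom}      # fresh (label(E), Dom(E)) entry
    else:                                       # Case 3: no label could be extracted
        def score(V: dict[str, float]) -> float:
            return sum(V.get(v, 0.0) for v in dom) / len(dom)
        best = max(lvs, key=lambda L: score(lvs[L]))
        s_max = score(lvs[best])
        for v in dom:                           # merge Dom(E) into the best entry;
            lvs[best][v] = max(lvs[best].get(v, 0.0), s_max)  # new values get weight s_max
```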

Configuring HiWE

  • Initialization of the crawling activity includes:

    • Set of sites to crawl

    • Explicit initialization for the LVS table

    • Set of data sources

    • Label matching threshold (σ)

    • Minimum acceptable value assignment rank (ρmin)

    • Value assignment aggregation function

Introducing LITE

  • Layout-based Information Extraction Technique

  • The physical layout of a page is also used to aid in extraction

  • For example, a piece of text that is physically adjacent to a form element is very likely a description of that element

  • Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page

Layout-based Information Extraction Technique

[figure]

The Challenge

  • Accurate extraction of the labels and domains of form elements

  • Elements that are visually close on the screen may be separated arbitrarily in the actual HTML text

  • Even when HTML provides a facility for semantic relationships, it’s not used in a majority of pages

  • Accurate page layout is a complex process

  • Even a crude approximate layout of portions of a page can yield very useful semantic information

Form Analysis in HiWE

  • LITE-based heuristic:

    • Prune the form page and isolate elements which directly influence the layout

    • Approximately lay out the pruned page using a custom layout engine

    • Identify the pieces of text that are physically closest to each form element (these are the label candidates)

    • Rank each candidate using a variety of measures

    • Choose the highest ranked candidate as the label
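
As a toy illustration of the physical-proximity idea, assume the partial layout has already assigned (x, y) positions to text pieces and form elements (the coordinates and the distance measure below are made up; the actual heuristic ranks candidates using a variety of measures, as in the list above).

```python
import math

# Toy illustration of LITE's proximity heuristic: once the partial layout has
# assigned (x, y) positions, take the text physically closest to the element.
def pick_label(element_pos: tuple[float, float],
               candidates: list[tuple[str, tuple[float, float]]]) -> str:
    def dist(pos: tuple[float, float]) -> float:
        return math.hypot(element_pos[0] - pos[0], element_pos[1] - pos[1])
    return min(candidates, key=lambda c: dist(c[1]))[0]

# e.g. a textbox at (120, 40) with nearby text pieces:
# pick_label((120, 40), [("Company Name", (90, 40)), ("Search", (120, 95))])
# -> "Company Name"
```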

Pruning Before Partial Layout

[figure]

LITE - Figure

  • Key Idea in LITE:

    Physical page layout embeds significant semantic information

[diagram: form page → DOM Parser → DOM Representation → Prune (via the DOM API)
→ Pruned Page → Partial Layout → Labels & Domain Values, plus the List of
Elements and Submission Info → Internal Form Representation]

Experiments

  • A number of experiments were conducted to study the performance of HiWE

  • We will see how performance depends on:

    • Minimum form size

    • Crawler input to LVS table

    • Different ranking functions

Parameter Values for Task 1

  • Task 1:

    News articles, reports, press releases and white papers relating to the semiconductor industry, dated sometime in the last ten years

Variation of Performance with Minimum Form Size

[figure]


Effect of Crawler Input to LVS

[figure]


Different Ranking Functions

  • When using ρfuz and ρavg, the crawler’s submission efficiency is mostly above 80%

  • ρprob performs poorly

  • ρavg submits more forms than ρfuz (it is less conservative)

Label Extraction

  • The LITE-based heuristic achieved an overall accuracy of 93%

  • The test set was manually analyzed

Conclusion

  • Addressed the problem of extending current-day crawlers to build repositories that include pages from the ‘Hidden Web’

  • Presented a simple operation model of a hidden web crawler

  • Described the implementation of a prototype crawler – HiWE

  • Introduced a technique for Layout-based information extraction

Bibliography

  • S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford University, 2001

  • BrightPlanet.com white papers

  • D. Lopresti and A. Tomkins. Block edit models for approximate string matching
