
Deep-Web Crawling: “Enlightening the dark side of the web”

Daniele Alfarone ○ Erasmus student ○ Milan (Italy)



Structure

  • Introduction

    • What is the Deep-Web

    • How to crawl it

  • Google’s Approach

    • Problem statement

    • Main algorithms

    • Performance evaluation

  • Improvements

    • Main limitations

    • Some ideas for improvement

  • Conclusions



What is the Deep-Web?

The Deep-Web is the content “hidden” behind HTML forms.



Hidden content

This content cannot be reached by traditional crawlers

The Deep-Web contains 10 times more data than the currently searchable content.



How do webmasters deal with it?

  • Not only search engines are interested: websites themselves want to be more accessible to crawlers

  • Websites publish pages with long lists of static links so that traditional crawlers can index them



But search engines cannot expect every website to do the same…

How can search engines crawl the Deep-Web?

Developing vertical search engines, each focused on a specific topic (e.g. flights, jobs)

But…

  • Coverage is limited to the topics for which a vertical search engine has been built

  • It is difficult to maintain semantic mappings between individual data sources and a common DB

  • Boundaries between different domains are fuzzy



Are there smarter approaches?

Currently the Web contains more than 10 million “high-quality” HTML forms, and this number is still growing exponentially

[Chart: number of websites since 1990; about 7% have a high-quality form]

Any approach that involves human effort cannot scale: we need a fully automatic approach with no site-specific coding

  • Solution: the surfacing approach (a minimal sketch follows this list)

    1. Choose a set of queries to submit to the web form

    2. Store the URL of each result page obtained

    3. Pass all the URLs to the crawler
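
To make the surfacing steps concrete, here is a minimal Python sketch (not Google's code): it enumerates value combinations for a hypothetical GET form and records the result-page URLs so a traditional crawler can fetch them later. The form action, field names and values are invented for the example.

```python
# Minimal sketch of the surfacing idea: enumerate value combinations for a
# GET form and record the result-page URLs for the regular crawler.
# The form action, field names and values below are hypothetical.
from itertools import product
from urllib.parse import urlencode

form_action = "http://example.com/search"      # hypothetical form action
template = {                                   # one query template:
    "category": ["books", "music", "movies"],  # a choice input
    "q": ["history", "python", "jazz"],        # a text box with guessed keywords
}

surfaced_urls = []
for values in product(*template.values()):
    query = dict(zip(template.keys(), values))
    surfaced_urls.append(f"{form_action}?{urlencode(query)}")

for url in surfaced_urls:                      # 3 x 3 = 9 URLs handed to the crawler
    print(url)
```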



Part 2: Google’s approach

  • Problem statement

  • Main algorithms

  • Performance evaluation



Solving the surfacing problem: Google’s approach

The problem is divided into two sub-problems: (1) which form inputs to fill, i.e. selecting query templates, and (2) which values to assign to those inputs.



HTML form example

[Form screenshot highlighting free-text inputs and choice inputs]



HTML form example

[Same form screenshot highlighting presentation inputs and selection inputs]



Which form inputs to fill: query templates

Defined by Google as: “the list of input types to be filled to create a set of queries”

Query Template #1



Query Template #2
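
To make the notion concrete, the following Python sketch (an illustration, not Google's code) treats a query template as a subset of a form's inputs: every combination of values for the chosen inputs yields one submission, while the inputs left out keep their default value. All input names and values are hypothetical.

```python
# Sketch: a query template is a chosen subset of a form's inputs; every
# combination of values for those inputs yields one form submission, while
# inputs outside the template keep their default (first) value.
from itertools import product

form_inputs = {                                          # hypothetical form
    "state": ["Any", "Alabama", "Alaska", "Arizona"],    # select menu, default "Any"
    "type":  ["All", "House", "Apartment"],              # select menu, default "All"
    "sort":  ["Relevance", "Price"],                     # presentation input, never templated
}

def generate_queries(template, form_inputs):
    """Yield one value assignment per submission for the given template."""
    chosen = {name: form_inputs[name] for name in template}
    defaults = {name: values[0] for name, values in form_inputs.items()
                if name not in template}
    for combo in product(*chosen.values()):
        yield {**defaults, **dict(zip(chosen.keys(), combo))}

# Query Template #1: only "state"         -> 4 submissions
# Query Template #2: "state" and "type"   -> 4 x 3 = 12 submissions
print(len(list(generate_queries(["state"], form_inputs))))           # 4
print(len(list(generate_queries(["state", "type"], form_inputs))))   # 12
```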



How to create informative query templates

  • discard presentation inputs

    • currently a big challenge

  • choose the optimal dimension for the template

    • too big: increases crawling traffic and produces pages without results

    • too small: every submission will return a large number of results, and the website may:

      • limit the number of results

      • only allow browsing the results through pagination (which is not always easy to follow)



Informativeness tester

How does Google evaluate whether a template is informative?

  • Query templates are evaluated based on the distinctness of the web pages resulting from the generated form submissions

  • To estimate the number of distinct web pages, the results are clustered based on the similarity of their content

A template is informative if:

  (# distinct pages) / (# pages) > 25%
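
A minimal sketch of such a test, using a very crude content signature in place of Google's actual clustering: result pages are grouped by a hash of their word set, and the distinct fraction is compared against the 25% threshold.

```python
# Sketch of an informativeness check: estimate the number of distinct result
# pages with a rough content signature and apply the 25% threshold.
import hashlib
import re

def signature(html: str) -> str:
    """Very rough content signature: hash of the page's set of longer words."""
    words = sorted(set(re.findall(r"[a-z]{4,}", html.lower())))
    return hashlib.md5(" ".join(words).encode()).hexdigest()

def is_informative(result_pages, threshold=0.25):
    """result_pages: HTML strings returned by the template's submissions."""
    if not result_pages:
        return False
    distinct = len({signature(page) for page in result_pages})
    return distinct / len(result_pages) > threshold
```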



How to scale to big web forms?

Given a form with N inputs, the number of possible templates is 2^N - 1.

To avoid running the informativeness tester on all possible templates, Google developed an algorithm called Incremental Search for Informative Query Templates (ISIT); a minimal sketch of the incremental idea follows.
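
The sketch below assumes a `test_informative` callable that wraps an informativeness tester like the one sketched above: start from single-input templates, keep only the informative ones, and extend each survivor by one more input instead of enumerating all 2^N - 1 templates.

```python
# Sketch of the incremental search: grow templates one input at a time and
# prune uninformative ones, so only a small fraction of the 2^N - 1
# candidates is ever tested.
def isit(inputs, test_informative, max_dimension=3):
    informative = []
    candidates = [frozenset([name]) for name in inputs]       # dimension-1 templates
    for _ in range(max_dimension):
        survivors = [t for t in candidates if test_informative(t)]
        informative.extend(survivors)
        # extend every informative template by one input it does not contain yet
        candidates = {t | {name} for t in survivors for name in inputs if name not in t}
    return informative
```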



ISIT example

[Diagram: incremental construction of query templates; uninformative templates are marked with an X]



Generating input values

  • Assigning values to a select menu is as easy as selecting all of its possible values

  • Generating meaningful values for text boxes is a big challenge

    Text boxes are used in different ways in web forms:

  • Generic text boxes: to retrieve all documents in a database that match the words typed (e.g. title or author of a book)

  • Typed text boxes: used as a selection predicate on a specific attribute in the WHERE clause of a SQL query (e.g. zip codes, US states, prices)



Values for generic text boxes

The keyword-selection loop works as follows (a sketch is given below):

  1. Initial seed keywords are extracted from the form page

  2. A query template with only the generic text box is submitted

  3. Additional keywords are extracted from the resulting pages

  4. Keywords not representative of the page are discarded (TF-IDF rank)

The loop runs until a sufficient number of keywords has been extracted.
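
A minimal sketch of this loop, assuming two hypothetical helpers: `submit(keyword)` submits the form with only the generic text box filled and returns the result page, and `extract_keywords(page)` pulls representative candidate words from that page (e.g. by TF-IDF rank).

```python
# Sketch of the iterative keyword extraction for a generic text box.
def harvest_keywords(seed_keywords, submit, extract_keywords,
                     target_size=500, max_rounds=10):
    keywords = set(seed_keywords)           # step 1: seeds from the form page
    frontier = list(seed_keywords)
    for _ in range(max_rounds):
        if len(keywords) >= target_size:
            break                           # enough keywords extracted
        next_frontier = []
        for keyword in frontier:
            page = submit(keyword)          # step 2: submit the text-box-only template
            for candidate in extract_keywords(page):   # steps 3-4: extract and filter
                if candidate not in keywords:
                    keywords.add(candidate)
                    next_frontier.append(candidate)
        frontier = next_frontier
    return keywords
```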



Values for typed text boxes

The number of types that can appear in HTML forms across different domains is limited (e.g. city, date, price, zip code)

Forms with typed text boxes will produce reasonable result pages only when given type-appropriate values

To recognize the correct type, the form is submitted with known values of each candidate type, and the type with the highest distinctness fraction is taken as correct (see the sketch below)
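
A minimal sketch of this type-detection step, assuming hypothetical `submit` and `distinct_fraction` helpers (the latter could reuse the signature-based estimate sketched earlier); the candidate types and probe values are illustrative.

```python
# Sketch of type detection for a typed text box: probe the form with known
# sample values of each candidate type and keep the type whose result pages
# are the most distinct.
SAMPLE_VALUES = {
    "zip":   ["10001", "94103", "60601"],
    "city":  ["Boston", "Denver", "Austin"],
    "price": ["100", "500", "1000"],
    "date":  ["2009-01-01", "2009-06-15", "2009-12-31"],
}

def detect_type(submit, distinct_fraction):
    scores = {}
    for type_name, values in SAMPLE_VALUES.items():
        pages = [submit(value) for value in values]
        scores[type_name] = distinct_fraction(pages)
    return max(scores, key=scores.get)      # highest distinctness wins
```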



Performance evaluation: query templates with only select menus

As the number of inputs increases, the number of possible templates grows exponentially, but the number of templates actually tested grows only linearly, as does the number found to be informative



Performance evaluation: mixed query templates

In a test over 1 million HTML forms, the URLs were generated from templates that had:

  • only one text box (57%)

  • one or more select menus (37%)

  • one text box and one or more select menus (6%)

Today, one query out of 10 on Google.com includes “surfaced” results



Part 3: Improvements

  • Main limitations

  • Some ideas for improvement



1. POST forms are discarded

  • The output of Google’s whole Deep-Web crawl is a list of URLs for each form considered.

  • The result pages from a form submitted with method=“POST” don’t have a unique URL

  • Google skips these forms, relying on the fact that the RFC specifications recommend POST forms only for operations that write to the website database (e.g. comments in a forum, sign-up to a website)

    But …

    In reality websites make massive use of POST forms, for:

  • URL Shortening

  • Maintaining the state of a form after its submission



How can we crawl POST forms?

Two approaches could remove this limitation:

  • POST forms can be crawled by sending the server a complete HTTP request rather than just a URL. The problem then becomes how to link (in the SERP) the page obtained by submitting the POST form.

  • An approach that would solve all the problems stated is to simply convert the POST form to its GET equivalent (see the sketch below). An analysis is required to assess what percentage of websites also accept GET parameters for POST forms.
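
A minimal sketch of the second idea, assuming the third-party `requests` library: the POST submission is rewritten as a GET URL and the two responses are compared with a crude heuristic to guess whether the site also accepts the GET equivalent.

```python
# Sketch: try to convert a POST submission into a GET URL and check whether
# the server still answers with a similar result page.
from urllib.parse import urlencode
import requests  # third-party library, assumed to be available

def try_post_as_get(action_url, form_values):
    get_url = f"{action_url}?{urlencode(form_values)}"
    post_resp = requests.post(action_url, data=form_values, timeout=10)
    get_resp = requests.get(get_url, timeout=10)
    # crude heuristic: same status code and roughly the same page size
    same_status = post_resp.status_code == get_resp.status_code == 200
    similar_size = abs(len(get_resp.text) - len(post_resp.text)) <= 0.2 * max(len(post_resp.text), 1)
    return get_url if same_status and similar_size else None
```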



2. Select menus with bad default values

  • When instantiating a query template, select menus not included in the template are assigned their default value, on the assumption that it is a wildcard value like “Any” or “All”.

  • This assumption is probably too strong: in several select menus the default option is simply the first one of the list.

e.g. for a select menu of U.S. states we would expect “All”, but we may find “Alabama”. If a bad option like “Alabama” is selected, a high percentage of the database will remain undiscovered.



How can we recognize a bad default value?

Idea: submit the form once for each possible value of the select menu and count the results …

… if the number of results for the (potentially) default value is close to the sum of all the other results, it is probably a “real” (wildcard) default value (see the sketch below).

  • Once we recognize a bad default value, we force the inclusion of the select menu in every template for the given form.
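
A minimal sketch of this check, assuming a hypothetical `count_results(value)` helper that submits the form with the given select-menu value and parses the result count from the page; the 80% tolerance is an arbitrary choice for the example.

```python
# Sketch: the default option is treated as a real wildcard only if its result
# count is close to the sum of the counts of all the other options.
def default_is_wildcard(options, default, count_results, tolerance=0.8):
    counts = {option: count_results(option) for option in options}
    others = sum(count for option, count in counts.items() if option != default)
    if others == 0:
        return False
    return counts[default] >= tolerance * others

# If this returns False (e.g. the default is just "Alabama"), the select menu
# is forced into every query template generated for the form.
```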



3. Managing mandatory inputs

Often the HTML forms indicate to the user which inputs are mandatory (e.g.: with asterisks or red borders).

Recognizing the mandatory inputs can offer some benefits:

  • Reduce the number of URLs generated by ISIT: only the templates that contain all the mandatory fields are passed to the informativeness tester

  • Avoid assigning the default value (which is not always correct) to inputs that can simply be left out because they are not mandatory


4 filling text boxes exploiting javascript suggestions

4. Filling text boxes exploiting Javascript suggestions

An alternative approach for filling text boxes is to exploit the auto-completion suggestions that some websites propose via Javascript.



Algorithm to extract the suggestions

  • Type into the text box every possible 3-letter prefix (with the English alphabet: 26^3 = 17,576 submissions); a sketch of this enumeration is given below

  • For each 3-letter combination, retrieve all the auto-completion suggestions using a Javascript simulator

  • All suggestions can be assumed to be valid inputs; we do not need to filter them by relevance

  • A relevance filter is applied only if the website is not particularly interesting
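
A minimal sketch of the prefix enumeration, assuming a hypothetical `get_suggestions(prefix)` wrapper around a Javascript simulator or headless browser that returns the auto-completion list for one prefix.

```python
# Sketch: type every 3-letter prefix into the box and collect all the
# auto-completion suggestions as candidate input values.
from itertools import product
from string import ascii_lowercase

def collect_suggestions(get_suggestions):
    values = set()
    for letters in product(ascii_lowercase, repeat=3):   # 26^3 = 17,576 prefixes
        prefix = "".join(letters)
        values.update(get_suggestions(prefix))
    return values
```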



5. Input correlations not taken into account

  • Google uses the same set of values to fill an input for all templates that contain that input.

  • Often some inputs are correlated, e.g. the text box “US city” and the select menu “US state”, or two text boxes representing a range

    Advantages of taking correlation into account:

  • More relevant keywords for text boxes: e.g. in a correlation between a text box and a select menu, we can submit the form with different select-menu values and extract relevant keywords for the associated text box

  • Fewer zero-result pages are generated, resulting in less load for the search engine crawler and for the website servers



How to recognize a correlation?

To detect correlations between any two input types we can:

  • Use the informativeness test, assuming that values are correlated only if the query results are informative

  • Recognize particular types of correlations: e.g. if we have two select menus where filling the first one restricts the possible values of the second one (US state/city, car brand/model), we can use a Javascript simulator to manage the correlation (see the sketch below)
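
A minimal sketch of how such a dependent pair of select menus could be probed with a browser automation tool (Selenium driving Chrome is used here as one possible "Javascript simulator"); the page URL and the field names "state" and "city" are hypothetical, and the fixed sleep is a crude stand-in for a proper wait.

```python
# Sketch: for every value of the first select menu, read which values the
# dependent second menu offers, so correlated value pairs can be submitted.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("http://example.com/search")                 # hypothetical form page

state_menu = Select(driver.find_element(By.NAME, "state"))
state_values = [o.get_attribute("value") for o in state_menu.options]

correlated = {}
for value in state_values:
    Select(driver.find_element(By.NAME, "state")).select_by_value(value)
    time.sleep(1)                                        # crude wait for the city menu to reload
    city_menu = Select(driver.find_element(By.NAME, "city"))
    correlated[value] = [o.text for o in city_menu.options]

driver.quit()
# `correlated` now maps each state to the city values that make sense with it
```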



Conclusions

  • Deep-Web crawling is one of the most interesting challenges for today’s search engines

  • Google has already implemented the surfacing approach, obtaining encouraging results

    But …

  • There are still several limitations

  • Some ideas to overcome them have been illustrated



References

  • J. Madhavan et al. (2008). Google’s Deep-Web Crawl. http://www.cs.washington.edu/homes/alon/files/vldb08deepweb.pdf

  • J. Madhavan et al. (2009). Harnessing the deep web: Present and future. http://arxiv.org/ftp/arxiv/papers/0909/0909.1785.pdf

  • W3C. Hypertext Transfer Protocol - HTTP/1.1: GET and POST method definitions. http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

  • E. Charalambous. How postback works in ASP.NET. http://www.xefteri.com/articles/show.cfm?id=18



Thank you for your attention :)

Questions?

