Crawling deep web content through query forms



Crawling Deep Web Content Through Query Forms

Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng and Xiao Liu

Speaker: Lu Jiang

Xi’an Jiaotong University

P.R.China



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



What is the Deep Web

  • Deep Web (or Hidden Web) refers to World Wide Web content that is not part of the Surface Web, i.e., the part directly indexed by search engines.



(Figure: Data retrieval in the Deep Web [Michael K. Bergman, 2001].)

Why the Deep Web

  • Organizes high-quality content

  • Significant piece of the Web



What is the problem?

  • Ordinary crawlers retrieve content only from the Surface Web.

  • Challenge: make the Deep Web accessible to web search.

  • A practical solution: Deep Web crawling



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



Related Work

  • The prior-knowledge-based query methods:

    • generate queries under the guidance of prior knowledge

    • E.g. Hidden Web Exposer [Raghavan, 2001]

  • The non-prior-knowledge methods:

    • generate new queries by analyzing the data records returned by previous queries

    • E.g. Deep Web crawler [Ntoulas, 2005]



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



The idea of the MEP

  • Previous work is based on either the generic text box or the entire query form.

    • For the generic text box: the harvest rate (capability of obtaining new records) of the queries is relatively low and homogeneous.

    • For the entire form: the rate of incorrectly filled-out forms is excessive.

  • A proper granularity of pattern is required.



What is the MEP

  • Query Form. A query form F is a query interface to a Deep Web database, defined as the set of all its elements, F = {e1, e2, ..., en}, where ei is an element of F such as a check box, text box or radio button.

  • Executable Pattern (EP). A subset P of F is an executable pattern if the Deep Web database returns the corresponding results after a query with value assignments to the elements of P is issued.

  • Minimum Executable Pattern (MEP). Given an executable pattern P, then P is a MEP iff no proper subset of P is an executable pattern.
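These definitions suggest a smallest-first enumeration of form-element subsets. The sketch below illustrates that idea; the `is_executable` probe, which stands in for actually submitting a filled-in subset of the form and checking whether results come back, is an assumption for illustration and not part of the paper.

```python
from itertools import combinations

def find_meps(form_elements, is_executable):
    """Enumerate Minimum Executable Patterns (MEPs) of a query form.

    form_elements: the elements of form F.
    is_executable: probe reporting whether a subset of elements is an EP.
    A subset is a MEP iff it is an EP and no proper subset of it is an EP.
    """
    meps = []
    # Check subsets smallest-first; any strict superset of a known MEP
    # can never itself be minimal.
    for size in range(1, len(form_elements) + 1):
        for subset in combinations(form_elements, size):
            s = frozenset(subset)
            if any(m <= s for m in meps):
                continue  # contains a smaller EP, so not minimal
            if is_executable(s):
                meps.append(s)
    return meps
```

With a toy probe where any subset containing "keywords", or containing both "author" and "year", is executable, `find_meps` returns exactly those two patterns.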



MEP Classification

  • Two types of MEP:

    • If there is an infinite-domain element (text box) in the MEP, the MEP is called an infinite-domain MEP (IMEP).

    • If all its elements are finite-domain (radio buttons, check boxes), the MEP is called a finite-domain MEP (FMEP).
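The classification reduces to a single check over element types. A minimal sketch, assuming made-up type labels (`"textbox"`, `"radio"`, `"checkbox"`) that are not the paper's notation:

```python
def classify_mep(mep):
    """Classify a MEP by the domains of its elements.

    `mep` maps element names to input types. A text box accepts
    arbitrary strings (infinite domain); radio buttons and check
    boxes offer a fixed set of choices (finite domain).
    """
    infinite_types = {"textbox"}
    if any(t in infinite_types for t in mep.values()):
        return "IMEP"  # at least one infinite-domain element
    return "FMEP"      # every element has a finite domain
```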



What is the MEP

(Figure: example query forms, annotated with 6 FMEPs, 5 IMEPs, and 1 IMEP.)



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



What is a Query

  • The i-th query qi to the database is implemented using a MEP mep and its corresponding keyword vector kv.

    • E.g. qi(mep(keywords), "art").

    • The harvest rate of a query is its capability of obtaining new records.



Overall Algorithm

Data Accumulation Phase

Prediction Phase



How does a Crawler Work

A query obtains x new records while accessing y records; its harvest rate = x/y.

E.g. q(mep(keywords), "art").

The harvest rate and the extracted records are used to evaluate query candidates.

The iteration goes on until the stop condition is satisfied.

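The issue-evaluate-iterate loop on this slide can be sketched as below. The three callables (issuing a query, choosing the next (mep, kv) candidate, and the stop condition) are placeholders standing in for the paper's components, not its actual interfaces.

```python
def harvest_rate(new_records, accessed_records):
    """Harvest rate of a query: fraction of accessed records that are new."""
    return new_records / accessed_records if accessed_records else 0.0

def crawl(issue_query, pick_best_candidate, stop_condition):
    """Adaptive crawling loop: keep issuing the most promising
    (MEP, keyword-vector) query, fold its results into the
    downloaded set, and stop when the stop condition holds."""
    downloaded = set()   # records obtained so far
    history = []         # (mep, kv, harvest rate) per issued query
    while not stop_condition(history):
        mep, kv = pick_best_candidate(downloaded, history)
        records = issue_query(mep, kv)               # records the query returns
        new = set(records) - downloaded              # x: new records
        rate = harvest_rate(len(new), len(records))  # x / y
        downloaded |= new
        history.append((mep, kv, rate))
    return downloaded, history
```

For instance, with the slide's numbers, a query returning 200 new records out of 250 accessed has harvest rate 200/250 = 0.8.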



Overall Algorithm

Prediction Phase



Pattern Harvest Rate

  • The pattern harvest rate of a MEP depends on the pattern itself, rather than on the choice of keyword vectors.

    • E.g. MEP(Keywords) and MEP(Abstract)

  • Two approaches to predict the value.

    • Continuous prediction

    • Weighted prediction



Keyword Vector Harvest Rate

  • The keyword vector harvest rate represents the conditional harvest rate of kv among all candidate keyword vectors of the given mep.

    • E.g. given MEP(keywords), find out which kv will bring the most new records.

  • The estimation of the kv harvest rate consists of two parts:

    • Calculate how many records containing kv have been downloaded (SampleDF), via sampling

    • Estimate how many records containing kv reside in the Deep Web (keyword capability), via Zipf's law

    • Keyword vector harvest rate = keyword capability − SampleDF
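Selecting the next keyword vector then amounts to maximizing this difference. A minimal sketch, assuming the candidate estimates have already been computed (the dictionary shape is an illustration, not the paper's data structure):

```python
def best_keyword_vector(candidates):
    """Pick the keyword vector expected to bring the most new records.

    candidates: mapping kv -> (estimated keyword capability,
    SampleDF measured on the downloaded sample).
    """
    def expected_gain(kv):
        capability, sample_df = candidates[kv]
        return capability - sample_df  # keyword capability - SampleDF
    return max(candidates, key=expected_gain)
```

E.g. a kv reaching many records, few of which are already downloaded, wins over one whose records are mostly downloaded already.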



Convergence Analysis

  • When should crawling of a Deep Web database terminate, especially when the size of the target database is unknown?

  • If we assume mk is constant (mk = m), we have ak = 1 − (1 − m)^k.

  • S is the number of records in the Deep Web database.

  • ak is the cumulative fraction of new records after the k-th query.

  • mk is the fraction of records returned by the k-th query.

Crawler Bottleneck!
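Under the constant-mk assumption, each query adds m of whatever fraction is still missing, i.e. ak = ak−1 + m(1 − ak−1), which closes to ak = 1 − (1 − m)^k; this closed form is my reading of the slide's assumption, since the formula itself is not shown. The sketch below makes the bottleneck concrete: coverage converges, but ever more slowly.

```python
def coverage_after(k, m):
    """Cumulative fraction a_k of the database retrieved after k queries,
    assuming each query returns a constant fraction m of the S records:
    a_k = a_{k-1} + m * (1 - a_{k-1})  =>  a_k = 1 - (1 - m) ** k."""
    return 1.0 - (1.0 - m) ** k

def queries_until(target, m):
    """Smallest k with a_k >= target; grows sharply as target -> 1,
    which is the crawler bottleneck."""
    k = 0
    while coverage_after(k, m) < target:
        k += 1
    return k
```

With m = 0.1, reaching 90% coverage already takes 22 queries, and each further percentage point costs more queries than the last.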



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



Effectiveness



Comparison with the State-of-the-Art Method

We believe the MEP method using multiple MEPs outperforms the method using any single one of them.



Outline

  • Background

  • Related work

  • Minimum Executable Pattern

  • Adaptive Crawling Algorithm

  • Experimental results

  • Conclusions



Conclusion

  • The novel concept of the MEP provides a foundation for studying Deep Web crawling through query forms.

  • The adaptive crawling method and its related prediction algorithm offer an efficient way to crawl Deep Web content through query forms.



Thank You!



Appendix

Here comes the Appendix



MEP Generation Algorithm



Examples of Prediction



Comparison with the LVS Method



Continuous Prediction

  • The current harvest rate of a MEP depends entirely on the harvest rate of the latest query issued via that MEP.

Issue a query via mep1: 200 new records obtained while accessing 250 records.

New-record rate = 200/250 = 0.8

mep1 = 0.8/(0.33+0.33+0.8) = 0.55

mep2 = 0.33/(0.33+0.33+0.8) = 0.22

mep3 = 0.33/(0.33+0.33+0.8) = 0.22

Issue a query via mep1: 30 new records obtained while accessing 100 records.

New-record rate = 30/100 = 0.3

mep1 = 0.3/(0.22+0.22+0.3) = 0.40

mep2 = 0.22/(0.22+0.22+0.3) = 0.29

mep3 = 0.22/(0.22+0.22+0.3) = 0.29

Predicted harvest rates:

          mep1   mep2   mep3
initial   0.33   0.33   0.33
step 1    0.55   0.22   0.22
step 2    0.40   0.29   0.29
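The update in this example is: replace the queried MEP's score with its latest observed rate, then renormalise so the scores sum to 1. A minimal sketch of that step (note the slide truncates intermediate values to two decimals, e.g. 0.33/1.46 ≈ 0.226 is shown as 0.22, so later digits differ slightly from the exact computation):

```python
def continuous_prediction(rates, queried_mep, observed_rate):
    """Continuous prediction step: the queried MEP's score is replaced
    by the harvest rate of its latest issued query, then all scores
    are renormalised to sum to 1."""
    updated = dict(rates)
    updated[queried_mep] = observed_rate
    total = sum(updated.values())
    return {mep: r / total for mep, r in updated.items()}
```

Starting from equal scores of 0.33 and observing rate 0.8 for mep1 reproduces the first row of the example.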



Weighted Prediction

  • The current harvest rate of a MEP depends on all previous harvest rates of the queries issued via that MEP.



SampleDF Calculation

  • SampleDF is the document frequency of the observed keyword vector kv in the sample corpus {d1,...,ds}.

  • kvxk is the corresponding Boolean vector of kv in dk, and similarly mepx is the Boolean vector of mep.

    • If the i-th dimension of kv is contained in the document, the corresponding dimension of kvxk is assigned 1, and 0 otherwise.

    • If the i-th element of mep has an infinite domain, the corresponding position of mepx is assigned 1, and 0 otherwise.



SampleDF Calculation Example

  • kv = (a, b); mep = (Student ID, Exam ID, Subject)

  • Four documents D1, D2, D3 and D4:

    • D1 has both Student ID a and Exam ID b

    • D2 has only Student ID a

    • D3 has only Exam ID b

    • D4 has neither Student ID a nor Exam ID b

  • mepx = (1,1,0)

  • D1: kvx1 = (1,1,0), cos<(1,1,0),(1,1,0)> = 1

  • D2: kvx2 = (1,0,0), cos<(1,0,0),(1,1,0)> = 0.707

  • D3: kvx3 = (0,1,0), cos<(0,1,0),(1,1,0)> = 0.707

  • D4: kvx4 = (0,0,0), cos<(0,0,0),(1,1,0)> = 0

  • SampleDF((a,b) | mep) = 1 + 0.707 + 0.707 + 0 = 2.414
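The example sums, over the sampled documents, the cosine similarity between each document's Boolean vector and the MEP's Boolean vector. A short sketch that reproduces the worked numbers:

```python
import math

def sample_df(kvx_vectors, mepx):
    """SampleDF as in the example: sum over sampled documents of the
    cosine similarity between each document's Boolean vector kvx_k
    and the MEP's Boolean vector mepx. The all-zero vector is given
    similarity 0, matching D4 in the example."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return sum(cos(kvx, mepx) for kvx in kvx_vectors)
```

For the four documents above this yields 1 + 0.707 + 0.707 + 0 ≈ 2.414.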



Keyword Capability Estimation

  • Keyword capability denotes the capability of obtaining records (it differs from the harvest rate).

  • |Dt| is the size of the Cartesian product of the values of the finite-domain elements in the MEP.

  • For an FMEP: f = 1

  • For an IMEP: the Zipf-Mandelbrot law is used to estimate f

Keyword capability =
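The slide names the Zipf-Mandelbrot law but does not show its formula or how f enters the capability estimate, so the sketch below is an assumption-heavy illustration: the standard Zipf-Mandelbrot form f(r) = C / (r + b)^a, with placeholder parameters, and capability taken as frequency times database size.

```python
def zipf_mandelbrot(rank, C=1.0, a=1.0, b=2.7):
    """Zipf-Mandelbrot law: the relative frequency of the keyword
    ranked `rank` is f = C / (rank + b) ** a. The parameters C, a, b
    are fitted to the sample; the defaults here are placeholders."""
    return C / (rank + b) ** a

def keyword_capability(rank, db_size, C=1.0, a=1.0, b=2.7):
    """Estimated number of database records containing the keyword:
    its estimated frequency f times the database size. This product
    form is an assumption, as the slide's formula is not shown."""
    return zipf_mandelbrot(rank, C, a, b) * db_size
```

The key property used by the crawler is that frequency falls off smoothly with rank, so capability can be extrapolated for keywords beyond the sample.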

