
Crawling the Hidden Web


Presentation Transcript


  1. Crawling the Hidden Web. Sriram Raghavan, Hector Garcia-Molina, Computer Science Department, Stanford University. Reviewed by Pankaj Kumar

  2. Introduction • What are web crawlers? Programs that traverse the Web graph in a structured manner, retrieving web pages. • Are they really crawling the whole web graph? Their target is the Publicly Indexable Web (PIW). • They are missing something… Crawling Hidden Web
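To make the PIW-crawling idea concrete, here is a minimal breadth-first crawler sketch in Python; the seed URL, page limit, and regex-based link extraction are illustrative simplifications, not the crawler used in the paper.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=10):
    """Breadth-first traversal of the publicly indexable Web (PIW):
    fetch a page, extract its hyperlinks, and enqueue the unseen ones."""
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    link_re = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                      # unreachable page: skip it
        pages[url] = html                 # retrieve the page for indexing
        for href in link_re.findall(html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages   # content behind search forms on these pages stays hidden
```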

  3. What about results that can only be obtained through: • Search forms • Web pages that require authorization • Let’s face the truth: • The hidden web is large relative to the PIW • High-quality information is present out there. Example – Patents & Trademark Office, news media Crawling Hidden Web

  4. Now… the goal: • To create a web crawler that can crawl and extract information from hidden databases. • To enable indexing, analysis, and mining of hidden web content. • But the path is not easy: • Automatic parsing and processing of form-based search interfaces. • Supplying input values for the search-form queries. Crawling Hidden Web

  5. Our approach: • Task-specificity – • Resource discovery (not the focus of this paper) • Content extraction • Human assistance – it is critical, as it • enables the crawler to use relevant values, and • gathers additional potential values. Crawling Hidden Web

  6. Hidden Web Crawlers • A new operational model – developed at Stanford University. • First of all… • How a user interacts with a web form: Crawling Hidden Web

  7. Now, how a crawler should interact with a web form: • Wait… what is this all about? Let’s understand the terminology first; that will help us. Crawling Hidden Web

  8. Terminology: • Form Page: the actual web page containing the form. • Response Page: the page received in response to a form submission. • Internal Form Representation: created by the crawler for a given web form F: F = ({E1, E2, …, En}, S, M), where the Ei are the form elements, S is the submission information, and M is meta-information about the form. • Task-specific Database: the information that the crawler needs to fill out forms. • Matching Function: implements the “Match” algorithm to produce value assignments for the form elements: Match(({E1, E2, …, En}, S, M), D) = [E1 ← v1, E2 ← v2, …, En ← vn] • Response Analysis: receives the response page and stores it in the crawler’s repository. Crawling Hidden Web
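A rough sketch of how these pieces could be modeled in code; the class and field names are hypothetical and only approximate the paper’s F = ({E1, …, En}, S, M) representation and the Match(F, D) function, with the task-specific database D reduced to a plain dict.

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    """One element Ei of a form: its extracted label and its domain."""
    label: str                                    # e.g. "Company Name"
    domain: list = field(default_factory=list)    # finite domain (select box) or [] for free text

@dataclass
class FormRepresentation:
    """Internal form representation F = ({E1, ..., En}, S, M)."""
    elements: list          # the Ei's
    submission_info: dict   # S: submission URL, HTTP method, ...
    meta_info: dict         # M: URL of the form page, etc.

def match(form, task_db):
    """Match(F, D): assign a value vi to each element Ei by looking its
    label up in the task-specific database D (here a simple dict)."""
    assignment = {}
    for elem in form.elements:
        values = task_db.get(elem.label.lower())
        if values:
            assignment[elem.label] = values[0]    # pick the best-ranked value
    return assignment
```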

  9. Submission Efficiency (Performance): Let Ntotal = the total number of forms submitted by the crawler, Nsuccess = the number of submissions that result in a response page containing one or more search results, and Nvalid = the number of semantically correct form submissions. Then: • Strict Submission Efficiency: SEstrict = Nsuccess / Ntotal • Lenient Submission Efficiency: SElenient = Nvalid / Ntotal Crawling Hidden Web
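Both metrics are straightforward ratios; a small sketch with made-up example counts makes that explicit:

```python
def submission_efficiency(n_total, n_success, n_valid):
    """Compute both efficiency metrics from the crawler's submission counts."""
    se_strict = n_success / n_total    # fraction of submissions returning results
    se_lenient = n_valid / n_total     # fraction that were semantically correct
    return se_strict, se_lenient

# Example: 200 submissions, 140 returned results, 170 were semantically valid.
print(submission_efficiency(200, 140, 170))   # (0.7, 0.85)
```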

  10. HiWE: Hidden Web Exposer • HiWE Architecture: Crawling Hidden Web

  11. But how does this fit in our operational model? • Form Representation • Task-specific Database (LVS Table) • Matching Function • Computing Weights Crawling Hidden Web
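A toy illustration of an LVS (Label Value Set) lookup with fuzzy label matching and weighted values; the table contents and the use of difflib for approximate matching are assumptions made for the sketch, not HiWE’s actual implementation.

```python
import difflib

# Illustrative LVS table: each label maps to a fuzzy set of values,
# i.e. value -> weight in [0, 1].
lvs_table = {
    "company": {"IBM": 1.0, "Intel": 0.9, "AMD": 0.8},
    "category": {"press release": 1.0, "white paper": 0.7},
}

def lookup(label, table, cutoff=0.6):
    """Find the LVS entry whose label best matches the (normalized) form label,
    and return its values sorted by descending weight."""
    label = label.strip().lower()
    best = difflib.get_close_matches(label, table.keys(), n=1, cutoff=cutoff)
    if not best:
        return []
    values = table[best[0]]
    return sorted(values, key=values.get, reverse=True)

print(lookup("Company Name", lvs_table))   # ['IBM', 'Intel', 'AMD']
```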

  12. LITE: Layout-based Information Extraction Technique. What is it? A technique in which the page layout aids label extraction: • Prune the form page. • Approximately lay out the pruned page using a custom layout engine. • Identify and rank the candidate labels. • The highest-ranked candidate is the label associated with the form element. Crawling Hidden Web
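A simplified stand-in for the ranking step: here candidates are ranked purely by their distance from the form element in the approximately laid-out page, whereas the real LITE heuristic also uses other visual cues; the positions and label texts below are invented for the example.

```python
import math

def rank_label_candidates(element_pos, candidates):
    """Rank candidate labels for a form element by their distance from it in the
    (approximate) page layout; the closest candidate wins.
    element_pos and each candidate position are (x, y) coordinates."""
    def distance(cand):
        (x, y), _text = cand
        return math.hypot(x - element_pos[0], y - element_pos[1])
    return [text for _pos, text in sorted(candidates, key=distance)]

# Candidates pruned from the form page, with their laid-out positions.
candidates = [((120, 80), "Company Name"),
              ((10, 10), "Search our site"),
              ((120, 140), "Sector")]
print(rank_label_candidates((200, 82), candidates)[0])   # 'Company Name'
```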

  13. Experiments • Task Description: Collect Web pages containing “News articles, reports, press releases, and white papers relating to the semiconductor industry, dated sometime in the last ten years”. • Parameter values: Crawling Hidden Web

  14. Effect of the value-assignment ranking function (ρfuzz, ρavg, and ρprob): • Label extraction accuracy: • LITE: 93% • Heuristic based purely on textual analysis: 72% • Heuristic based on extensive manual observation: 83% Crawling Hidden Web

  15. Effect of α: • Effect of crawler input to LVS table: Crawling Hidden Web

  16. Pros and cons… • Pros • More information is crawled • The quality of the information is very high • Results are more focused • Crawler inputs increase the number of successful submissions • Cons • Crawling becomes slower • The task-specific database can limit the accuracy of results • Unable to process simple dependencies between form elements • Lack of support for partially filled-out forms Crawling Hidden Web

  17. Where does our course fit in here? • In content extraction • Given the set of resources, i.e. sites and databases, automate the information retrieval • In label matching (the matching function) • Label normalization • Edit-distance calculation • In the LITE-based heuristic for extracting labels • Identify and rank candidates • In maintaining the crawler’s repository Crawling Hidden Web
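For the edit-distance calculation mentioned above, a self-contained sketch of label normalization followed by the classic Levenshtein dynamic program; the normalization rules are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: the minimum number of
    insertions, deletions, and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def normalize(label):
    """Simple label normalization: lower-case and strip punctuation/whitespace."""
    return "".join(ch for ch in label.lower() if ch.isalnum() or ch == " ").strip()

print(edit_distance(normalize("Company Name:"), normalize("company name")))  # 0
```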

  18. Related work… • J. Madhavan et al., “Google’s Deep Web Crawl”, VLDB, 2008 • J. Madhavan et al., “Harnessing the Deep Web: Present and Future”, CIDR, Jan. 2009 • Manuel Álvarez, Juan Raposo, Fidel Cacheda, and Alberto Pan, “A Task-specific Approach for Crawling the Deep Web”, Aug. 2006 • Lu Jiang, Zhaohui Wu, Qian Feng, Jun Liu, and Qinghua Zheng, “Efficient Deep Web Crawling Using Reinforcement Learning” • Manuel Álvarez et al., “Crawling the Content Hidden Behind Web Forms” • Yongquan Dong and Qingzhong Li, “A Deep Web Crawling Approach Based on Query Harvest Model”, 2012 • Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, “Downloading Hidden Web Content” • Rosy Madaan, Ashutosh Dixit, A.K. Sharma, and Komal Kumar Bhatia, “A Framework for Incremental Hidden Web Crawler”, 2010 • Ping Wu, Ji-Rong Wen, Huan Liu, and Wei-Ying Ma, “Query Selection Techniques for Efficient Crawling of Structured Web Sources” • http://deepweb.us/ Crawling Hidden Web

  19. So… what’s the conclusion? • Traditional crawlers’ limitations • Issues related to extending crawlers to access the “Hidden Web” • The need for a narrow application focus • Promising results from HiWE • Limitations of HiWE: • Inability to handle simple dependencies between form elements • Lack of support for partially filled-out forms Crawling Hidden Web
