
Deep-Web Crawling and Related Work

  1. Deep-Web Crawling and Related Work Matt Honeycutt CSC 6400

  2. Outline • Basic background information • Google’s Deep-Web Crawl • Web Data Extraction Based on Partial Tree Alignment • Bootstrapping Information Extraction from Semi-structured Web Pages • Crawling Web Pages with Support for Client-Side Dynamism • DeepBot: A Focused Crawler for Accessing Hidden Web Content

  3. Background • Publicly-Indexable Web (PIW) • Web pages exposed by standard search engines • Pages link to one another • Deep-web • Content behind HTML forms • Database records • Estimated to be much larger than PIW • Estimated to be of higher quality than PIW

  4. Google’s Deep-Web Crawl J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, A. Halevy

  5. Summary • Describes the process implemented by Google • Goal is to ‘surface’ deep-web content for indexing • Contributions: • Informativeness test • Query selection techniques and an algorithm for generating appropriate text inputs

  6. About the Google Crawler • Estimates that there are ~10 million high-quality HTML forms • Index representative deep-web content across many forms, driving search traffic to the deep-web • Two problems: • Which inputs to fill in? • What values to use?

  7. Example Form

  8. Query Templates • Correspond to SQL-like queries: select * from D where P • First problem is to select the best templates • Second problem is to select the best values for those templates • Want to ignore presentation-related fields
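
For illustration, a query template can be viewed as a choice of binding inputs over the form, and its submissions as the Cartesian product of candidate values for those inputs. The sketch below uses a hypothetical form; it is not the paper's implementation.

```python
from itertools import product

# Hypothetical form: each input maps to its candidate values.
form_inputs = {
    "make":  ["honda", "toyota", "ford"],
    "model": ["", "civic", "corolla"],
    "zip":   [""],   # non-binding (free) input in this template
}

def template_submissions(form_inputs, binding_inputs):
    """Enumerate all submissions of a query template: the Cartesian product
    of candidate values over the binding inputs; non-binding inputs keep
    their default (empty) value."""
    names = list(binding_inputs)
    for values in product(*(form_inputs[n] for n in names)):
        yield dict(zip(names, values))

# A 2-dimensional template over 'make' and 'model', loosely the SQL-like
# query: select * from D where make = ? and model = ?
for submission in template_submissions(form_inputs, ["make", "model"]):
    print(submission)
```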

  9. Incremental Search for Informative Query Templates • Classify templates as either informative or uninformative • A template is informative if its submissions generate sufficiently distinct result pages • Build more complex templates from simpler informative ones • Signatures are computed for each result page to measure distinctness

  10. Informativeness Test • A template T is informative if its submissions yield a sufficiently large proportion of distinct result pages, as measured by page signatures • Heuristically limit to templates with 10,000 or fewer possible submissions and no more than 3 dimensions • Informativeness can be estimated from a sample of the possible queries (e.g., 200)
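
A minimal sketch of the informativeness test on slides 9 and 10, assuming a page-signature function and a ratio threshold; the threshold value and the exact form of the test are assumptions here, not the paper's constants.

```python
def is_informative(submissions, fetch_page, signature, tau=0.25, sample=200):
    """Judge a query template informative when its submissions yield
    sufficiently many distinct result-page signatures.  tau and the plain
    distinct/total ratio are illustrative assumptions; the test is run on a
    bounded sample of submissions rather than on all of them."""
    signatures, count = set(), 0
    for submission in submissions:
        if count >= sample:
            break
        page = fetch_page(submission)      # issue the form GET request
        signatures.add(signature(page))    # content-based page signature
        count += 1
    return count > 0 and len(signatures) / count >= tau
```

The incremental search of slide 9 would then only extend templates that pass this test, adding one binding input at a time and re-testing the larger templates.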

  11. Results

  12. Observations • URLs generated from larger templates are not as useful • ISIT generates far fewer URLs than the full Cartesian product (CP) but still achieves high coverage • The most common reason for failing to find an informative template: JavaScript • Ignoring JavaScript errors, informative templates were found for 80% of the forms tested

  13. Generating Input Values • Text boxes may be typed or untyped • Special rules handle the small number of common typed inputs • Generic keyword lists don’t work; the best keywords are site-specific • Select seed keywords from the form page, then iterate, selecting candidate keywords from result pages using TF-IDF • Results are clustered and representative keywords are chosen for each cluster, ranked by page length • Once candidate keywords have been selected, treat text inputs as select inputs
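
The sketch below shows the TF-IDF core of one keyword-selection iteration, under simplifying assumptions (naive tokenization, no clustering, no ranking by page length); it is not the paper's implementation.

```python
import re
from collections import Counter
from math import log

def tfidf_keywords(result_pages, top_k=20):
    """Score the words appearing on the result pages by TF-IDF and keep the
    top-scoring ones as candidate query keywords for the next iteration."""
    docs = [re.findall(r"[a-z]+", page.lower()) for page in result_pages]
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            scores[word] = max(scores[word],
                               count * log(len(docs) / df[word]))
    return [word for word, _ in scores.most_common(top_k)]

# Seed with words scraped from the form page, query the form with them,
# call tfidf_keywords on the result pages, cluster and pick representatives,
# and repeat until the candidate keyword set stabilizes.
```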

  14. Identifying Typed Inputs

  15. Conclusions • Describes the innovations of “the first large-scale deep-web surfacing system” • Results are already integrated into Google • Informativeness test is a useful building block • No need to cover individual sites completely • Heuristics for common input types are useful • Future work: support for JavaScript and handling dependencies between inputs • Limitation: only supports GET requests

  16. Web Data Extraction Based on Partial Tree Alignment Yanhong Zhai, Bing Liu

  17. Summary • Novel technique for extracting data from record lists: DEPTA (Data Extraction based on Partial Tree Alignment) • Automatically identifies records and aligns their fields • Overcomes limitations of existing techniques

  18. Example

  19. Approach • Step 1: Build tag tree • Step 2: Segment page to identify data regions • Step 3: Identify data records within the regions • Step 4: Align records to identify fields • Step 5: Extract fields into common table

  20. Building the Tag Tree and Finding Data Regions • Compute bounding rectangles for each HTML element • Associate elements with their parents based on visual containment to build the tag tree • Next, compare tag strings with edit distance to find data regions • Finally, identify individual records within the regions
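
A sketch of the tag-string comparison step, using plain Levenshtein edit distance over sequences of tag names; DEPTA's actual generalized-node comparison is more elaborate, and the similarity threshold below is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag strings (sequences of tag names)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute
        prev = curr
    return prev[-1]

def same_region(tags_a, tags_b, threshold=0.3):
    """Treat two sibling tag strings as part of the same data region when
    their normalized edit distance is small (the threshold is illustrative)."""
    longest = max(len(tags_a), len(tags_b), 1)
    return edit_distance(tags_a, tags_b) / longest <= threshold

print(same_region(["tr", "td", "td", "td"], ["tr", "td", "td"]))  # True
```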

  21. Identifying Regions

  22. Partial Tree Alignment • General tree matching is expensive • Simple Tree Matching is faster, though not as accurate • The longest record tree becomes the seed • Fields that cannot yet be aligned are inserted into the seed as it expands • Finally, field values are extracted and inserted into the table
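
The Simple Tree Matching routine the slide refers to is the classic dynamic program sketched below over a minimal node type; this is an illustration, not DEPTA's code.

```python
class Node:
    """Minimal tag-tree node for the sketch: a tag name and child nodes."""
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def simple_tree_matching(a, b):
    """Simple Tree Matching: counts the maximum number of matching node
    pairs between two trees without letting matches cross levels.
    Classic dynamic program over the children of each node."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # M[i][j] = best match using the first i children of a and first j of b
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1], M[i - 1][j],
                          M[i - 1][j - 1]
                          + simple_tree_matching(a.children[i - 1],
                                                 b.children[j - 1]))
    return M[m][n] + 1

a = Node("tr", [Node("td"), Node("td"), Node("td")])
b = Node("tr", [Node("td"), Node("td")])
print(simple_tree_matching(a, b))   # 3: the root plus two matched children
```

Partial tree alignment then takes the record with the longest tag tree as the seed and grows it: fields from other records that cannot be aligned to an existing seed node are inserted into the seed once their position becomes unambiguous.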

  23. Seed Expansion

  24. Conclusions • Surpasses previous work (MDR) • Capable of extracting data very accurately • Recall: 98.18% • Precision: 99.68%

  25. Bootstrapping Information Extraction from Semi-structured Web Pages A. Carlson, C. Schafer

  26. Summary • Method for extracting structured records from web pages • Method requires very little training and achieves good results in two domains

  27. Introduction • Extracting structured fields enables advanced information retrieval scenarios • Much previous work has been site-specific or required substantial manual labeling • Heuristic-based approaches have not had great success • Uses semi-supervised learning to extract fields from web pages • User only has to label 2-5 pages for each of 4-6 sites

  28. Technical Approach • Human specifies domain schema • Labels training records from representative sites • Utilizes partial tree alignment to acquire additional records for each site • New records are automatically labeled • Learns regression model that predicts mappings from fields to schema columns

  29. Mapping Fields to Columns • Calculate a score between each field and each schema column • Scores compare the field’s observed contexts with the contexts seen in training • The most probable mapping is accepted if its score exceeds a threshold

  30. Example Context Extraction

  31. Feature Types • Precontext 3-grams • Lowercase value tokens • Lowercase value 3-grams • Value token type categories
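
A loose sketch of extracting the four feature types above for one field occurrence; whether the 3-grams are over tokens or characters, and the exact token-type categories, are assumptions made for illustration.

```python
def token_category(token):
    """Coarse token-type category; the category set here is an assumption."""
    if token.isdigit():
        return "NUMBER"
    if token[:1].isupper():
        return "CAPITALIZED"
    return "LOWER"

def field_features(precontext_tokens, value_tokens):
    """Extract the four feature types listed above for a field occurrence.
    The precontext 3-gram is taken as the last three tokens before the
    field, and value 3-grams as character trigrams of each value token."""
    features = set()
    features.add(("precontext_3gram",
                  tuple(t.lower() for t in precontext_tokens[-3:])))
    for tok in value_tokens:
        low = tok.lower()
        features.add(("value_token", low))
        features.add(("token_category", token_category(tok)))
        for i in range(len(low) - 2):
            features.add(("value_3gram", low[i:i + 3]))
    return features

print(field_features(["Release", "date", ":"], ["March", "2009"]))
```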

  32. Example Features

  33. Scoring • Field-to-column mappings are scored by comparing feature distributions • One distribution is computed from the training contexts, the other from the observed contexts • Completely dissimilar field/column pairs are fully divergent; identical pairs have zero divergence • Per-feature similarities are combined using a “stacked” linear regression model • The model’s weights are learned during training
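
A sketch of the divergence-based scoring: two feature-count distributions are compared per feature type, and the per-type similarities are combined with a learned linear model. Jensen-Shannon divergence is used here only as a plausible stand-in; the paper's exact divergence and regression setup are not reproduced.

```python
from collections import Counter
from math import log

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two feature-count distributions
    (mappings of feature -> count): zero for identical distributions,
    maximal for completely disjoint ones, matching the slide's description."""
    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    P, Q = normalize(Counter(p)), normalize(Counter(q))
    M = {k: 0.5 * (P.get(k, 0) + Q.get(k, 0)) for k in set(P) | set(Q)}
    def kl(A, B):
        return sum(a * log(a / B[k]) for k, a in A.items() if a > 0)
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)

def mapping_score(per_type_divergences, weights, bias=0.0):
    """Combine per-feature-type similarities with a learned linear model
    ('stacked' regression); weights and bias are assumed to come from
    training and are not the paper's values."""
    return bias + sum(w * (1.0 - d)
                      for w, d in zip(weights, per_type_divergences))
```

The resulting score plays the role described on slide 29: the highest-scoring field/column mapping is accepted only if it clears a threshold.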

  34. Results

  35. Crawling Web Pages with Support for Client-Side Dynamism Manuel Alvarez, Alberto Pan, Juan Raposo, Justo Hidalgo

  36. Summary • An advanced crawler based on browser automation • NSEQL – a language for specifying browser navigation actions • Stores each URL together with the navigation route needed to reach it

  37. Limitations of Typical Crawlers • Built on low-level HTTP APIs • Limited or no support for client-side scripts • Limited support for sessions • Can only see what’s in the HTML

  38. Their Crawler’s Features • Built on “mini web browsers” – MSIE Browser Control • Handles client-side JavaScript • Routes fully support sessions • Limited form-handling capabilities

  39. NSEQL

  40. Identifying New Routes • Routes can come from links, forms, and JavaScript • ‘href’ attributes are extracted from normal anchor tags • Tags with JavaScript click events are identified and “clicked” • The crawler captures the resulting navigation actions and inspects them for new routes
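
The paper's crawler drives real MSIE browser controls through NSEQL; since NSEQL itself is not reproduced here, the sketch below shows the same route-discovery idea using Selenium as a stand-in: harvest hrefs from anchors, "click" elements carrying JavaScript handlers, and record where the script navigates.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                # stands in for the MSIE control
driver.get("https://example.com/search")   # hypothetical starting page

# Plain anchors: harvest hrefs directly.
routes = [a.get_attribute("href")
          for a in driver.find_elements(By.TAG_NAME, "a")
          if a.get_attribute("href")]

# Elements with JavaScript click handlers: 'click' them in the browser and
# record the URL the script navigates to as a new route.
n = len(driver.find_elements(By.CSS_SELECTOR, "[onclick]"))
for i in range(n):
    elems = driver.find_elements(By.CSS_SELECTOR, "[onclick]")  # re-find:
    if i >= len(elems):                     # going back invalidates handles
        break
    before = driver.current_url
    elems[i].click()
    if driver.current_url != before:
        routes.append(driver.current_url)
        driver.back()

driver.quit()
```

The real crawler captures the navigation actions themselves rather than diffing URLs, which is also what lets it replay session-dependent routes.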

  41. Results and Conclusions • Large-scale websites are crawler-friendly • Many medium-scale, deep-web sites aren’t • Crawlers should handle client-side script • The presented crawler has been applied to real-world applications

  42. DeepBot: A Focused Crawler for Accessing Hidden Web Content Manuel Alvarez, Juan Raposo, Alberto Pan

  43. Summary • Presents a focused deep-web crawler • Extension of previous work • Crawls links and handles search forms

  44. Architecture

  45. Domain Definitions • Attributes a1…aN • Each attribute has name, aliases, specificity index • Queries q1…qN • Each query contains 1 or more (attribute,value) pairs • Relevance threshold
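
A minimal encoding of such a domain definition as data structures; the attribute names, values, and thresholds below are made up for illustration and are not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    """One domain attribute: a name, alternative labels (aliases), and a
    specificity index indicating how strongly it identifies the domain."""
    name: str
    aliases: list
    specificity: float

@dataclass
class DomainDefinition:
    attributes: list            # a1 ... aN
    queries: list               # each query: a list of (attribute, value) pairs
    relevance_threshold: float  # minimum form score before DeepBot queries it

# Illustrative instance for a book-shopping task; all values are made up.
books = DomainDefinition(
    attributes=[Attribute("title",  ["book title", "name"], 0.3),
                Attribute("author", ["author name", "writer"], 0.8),
                Attribute("isbn",   ["isbn number"], 1.0)],
    queries=[[("author", "some author name")],
             [("title", "some book title")]],
    relevance_threshold=0.7,
)
```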

  46. Example Definition

  47. Evaluating Forms • Obtains bounding coordinates of all form fields and potential labels • Distances and angles computed between fields and labels

  48. Evaluating Forms • If label l is within min-distance of field f, l is added to f’s list • Ties are broken using angle • Lists are pruned so that labels appear in only one list and all fields have at least one possible label
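
A sketch of the geometric association step on slides 47 and 48: each field keeps the labels whose centres fall within a distance cut-off, with angle used to break ties. The cut-off value and the exact angle preference are assumptions.

```python
from math import hypot, atan2

def center(box):
    """Centre of a bounding box given as (left, top, right, bottom)."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)

def candidate_labels(field_box, label_boxes, min_distance=150):
    """Return the labels whose centres lie within min_distance of the field,
    ordered so that closer labels come first and, on near ties, labels that
    sit to the left of or above the field (the usual layout) are preferred."""
    fx, fy = center(field_box)
    scored = []
    for label, box in label_boxes.items():
        lx, ly = center(box)
        dist = hypot(lx - fx, ly - fy)
        if dist <= min_distance:
            angle = atan2(fy - ly, fx - lx)   # direction from label to field
            scored.append((dist, abs(angle), label))
    return [label for _, _, label in sorted(scored)]

print(candidate_labels((200, 100, 300, 120),
                       {"Title:": (120, 100, 190, 120),
                        "Author:": (120, 140, 195, 160)}))
```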

  49. Evaluating Forms • Text similarity measures are used to link domain attributes to form fields • The form’s relevance is computed from the matched attributes • If the form’s score exceeds the relevance threshold, DeepBot executes the domain queries against it
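
A sketch of turning the matched labels into a form-relevance score, using a generic string-similarity measure as a stand-in for the paper's text-similarity measures; the match threshold and the normalization are assumptions. Attributes are passed as (name, aliases, specificity) tuples to keep the sketch self-contained.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Generic string similarity, standing in for the paper's measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def form_relevance(field_labels, attributes, match_threshold=0.8):
    """Match each domain attribute (by name or alias) against the candidate
    labels of the form's fields and accumulate its specificity when a good
    match is found; the result is normalized by the total specificity."""
    score = 0.0
    for name, aliases, specificity in attributes:
        best = max((text_similarity(term, label)
                    for term in [name] + aliases
                    for labels in field_labels.values()
                    for label in labels), default=0.0)
        if best >= match_threshold:
            score += specificity
    total = sum(spec for _, _, spec in attributes)
    return score / total if total else 0.0

# If form_relevance(...) exceeds the domain's relevance threshold, DeepBot
# fills the matched fields with the query values and submits the form.
```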

  50. Results and Conclusions • Evaluated on three domain tasks: book, music, and movie shopping • Achieves very high precision and recall • Errors due to: • Missing aliases • Forms with too few fields to achieve minimum support • Sources that did not label fields
