
Deep Web Crawling and Mining

Presented by: Group 17, AIA 8803 Course, Feb 28, 2008. Deep Web content is World Wide Web content that is not part of the surface Web indexed by search engines (Bergman, 2001).

mscully

Presentation Transcript


  1. Deep Web Crawling and Mining • Presented by: Group 17 • AIA 8803 Course • Feb 28, 2008

  2. What’s the Problem? • Large amount of Deep Web content • "Deep Web" refers to World Wide Web content that is not part of the surface Web indexed by search engines (Bergman, 2001) • In 2000, the deep Web was estimated to contain approximately 7,500 terabytes of data and 550 billion individual documents • Characteristics of Deep Web data: • Mostly generated by backend databases • Intrinsic: hidden behind the database schema

  3. Our solution • Deep web crawling • Iterative querying • Deep web mining • Attribute labeling • Advanced search • Database construction • Object-level search • Comparison

  4. Deep Web Crawling • Why is it difficult in a dynamic web space? • Hidden Web, Deep Web • Different from a traditional web crawler, which traverses a hyperlink graph with BFS or DFS to crawl web pages • Seed-based crawler • Seed → Crawl → New Seed → Crawl → …

  5. A Crawler Example • Initial seed: car • New seeds: Lincoln, Deluxe, TracRac, Truck, SUV
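
The seed → crawl → new seed loop can be sketched roughly as follows. This is a minimal illustration rather than the group's actual crawler: `RECORDS`, `query_site`, and `crawl` are hypothetical names, the records are invented, and a real crawler would submit each seed to a site's search form over HTTP instead of filtering an in-memory list.

```python
from collections import deque

# Hypothetical stand-in for a deep-web site: a backend database that is
# reachable only through a keyword query. The records are invented.
RECORDS = [
    {"id": 1, "text": "Lincoln car Deluxe"},
    {"id": 2, "text": "Truck TracRac rack"},
    {"id": 3, "text": "SUV car"},
    {"id": 4, "text": "Truck SUV"},
    {"id": 5, "text": "bicycle helmet"},
]

def query_site(keyword):
    """Simulate submitting `keyword` to the site's search form."""
    kw = keyword.lower()
    return [r for r in RECORDS if kw in r["text"].lower().split()]

def crawl(initial_seed, max_queries=100):
    """Iterative querying: every result page contributes new seed terms."""
    pending = deque([initial_seed.lower()])
    issued = set()      # seeds already sent as queries
    harvested = {}      # record id -> record
    while pending and len(issued) < max_queries:
        seed = pending.popleft()
        if seed in issued:
            continue
        issued.add(seed)
        for record in query_site(seed):
            harvested[record["id"]] = record
            # Each word of a returned record becomes a candidate seed.
            for term in record["text"].lower().split():
                if term not in issued:
                    pending.append(term)
    return harvested, issued

pages, seeds = crawl("car")
```

Starting from "car", this toy run reaches records 1 to 4 but never record 5, because no harvested page mentions its terms. Content that is disconnected from the seed vocabulary stays hidden, which is one reason the choice of initial seeds matters.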

  6. Deep Web Mining • What we have: • A large number of web pages gathered by the crawler • Machine Learning / Data Mining techniques • What we need: • A structured database for web applications

  7. Deep Web Mining • Problem • Different web sites may have different layouts

  8. Deep Web Mining • Conditional Random Fields (CRFs) • An undirected graphical model • X (gray nodes): observations • Features extracted from the crawled web pages • Y (white nodes): hidden states • Labels • Product name, price, customer rating, etc. • A CRF models the conditional probability p(y|x) • Key advantage • Can exploit rich, correlated feature sets
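
To make p(y|x) concrete, here is a minimal sketch of a linear-chain CRF evaluated by brute force. Everything in it is an assumption for illustration: the two labels, the single "looks like a price" feature, and the hand-set weights are invented, and the slides do not say which CRF structure or features the group used.

```python
import itertools
import math

LABELS = ["name", "price"]

def looks_like_price(token):
    return token.startswith("$")

def score(labels, tokens, weights):
    """Unnormalized log-score: weighted sum of features along the chain."""
    total = 0.0
    prev = "START"
    for token, label in zip(tokens, labels):
        # Observation feature: token shape paired with its label.
        total += weights.get(("obs", looks_like_price(token), label), 0.0)
        # Transition feature: previous label paired with current label.
        total += weights.get(("trans", prev, label), 0.0)
        prev = label
    return total

def conditional_probability(labels, tokens, weights):
    """p(y|x) = exp(score(y, x)) / Z(x), Z(x) summed over all labelings."""
    z = sum(math.exp(score(y, tokens, weights))
            for y in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens, weights)) / z

# Hand-set weights for illustration; a real CRF learns them from labeled pages.
WEIGHTS = {
    ("obs", True, "price"): 2.0,
    ("obs", False, "name"): 1.0,
    ("trans", "name", "price"): 0.5,
}

tokens = ["TracRac", "$199"]
probs = {y: conditional_probability(y, tokens, WEIGHTS)
         for y in itertools.product(LABELS, repeat=len(tokens))}
best = max(probs, key=probs.get)
```

Enumerating every label sequence is exponential in the sequence length, so this only works for toy inputs; practical CRF implementations compute Z(x) and the best labeling with dynamic programming.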

  9. Web database from mining • Data fusion will be necessary where multiple copies of data exist across sites
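
A rough sketch of what such a fusion step might look like, assuming extracted records arrive as dictionaries. The records, the field names, and the matching rule in `object_key` (same name up to case and whitespace) are all hypothetical; real data fusion typically needs fuzzier matching and conflict resolution.

```python
from collections import defaultdict

# Invented records, as if extracted from three different sites.
extracted = [
    {"site": "site-a", "name": "TracRac Truck Rack", "price": 199.0, "rating": 4.5},
    {"site": "site-b", "name": "tracrac  truck rack", "price": 189.0, "rating": None},
    {"site": "site-c", "name": "Lincoln Deluxe", "price": 42000.0, "rating": 4.0},
]

def object_key(name):
    """Naive matching rule (an assumption): normalize case and whitespace."""
    return " ".join(name.lower().split())

def fuse(records):
    """Group records that describe the same object, keeping per-site values."""
    fused = defaultdict(lambda: {"prices": {}, "ratings": []})
    for rec in records:
        obj = fused[object_key(rec["name"])]
        obj["prices"][rec["site"]] = rec["price"]
        if rec["rating"] is not None:
            obj["ratings"].append(rec["rating"])
    return dict(fused)

database = fuse(extracted)
```

Keeping the per-site prices instead of collapsing them to one value preserves the information that a later comparison step would need.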

  10. What We Have • Web object extraction and mining • Structured databases of web objects • Next Step • Improve state-of-the-art Web search • Make some money

  11. Building Advanced Web Search Applications • 1. Object-level web search: combine different features or attributes of the same Web object across different Web sites to respond to a user query • DBLP (manual but high precision) • CiteSeer (automatic but lower precision) • The challenge is how to build a precise and automatic object-level search platform • 2. Comparison web search: compare attributes (e.g. price, performance, etc.) of Web objects across different sites or sources
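
Both search modes can be illustrated over a small object-level store. This is only a sketch under stated assumptions: `OBJECTS`, `object_search`, and `compare_prices` are hypothetical names, the data is invented, and matching is plain keyword containment rather than whatever ranking a production system would use.

```python
# Invented object-level records: one entry per Web object, prices keyed by site.
OBJECTS = {
    "tracrac truck rack": {"site-a": 199.0, "site-b": 189.0},
    "lincoln deluxe": {"site-c": 42000.0},
}

def object_search(query):
    """Object-level search: return objects whose name contains every query word."""
    words = query.lower().split()
    return {name: prices for name, prices in OBJECTS.items()
            if all(w in name.split() for w in words)}

def compare_prices(query):
    """Comparison search: per matching object, (site, price) pairs, cheapest first."""
    return {name: sorted(prices.items(), key=lambda item: item[1])
            for name, prices in object_search(query).items()}

result = compare_prices("truck rack")
```

The query returns one answer per object rather than one per page, with the attribute values from each source side by side.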

  12. Building a LAMP Server • "LAMP" system: Linux, Apache, MySQL and PHP • 1. Low acquisition cost • 2. Ubiquity of its components

  13. Fancy restaurant (dynamic web server) • Apache: chef. • PHP: waiter. • MySQL: stockroom of ingredients • When a patron (or Web site visitor) comes to your restaurant, he or she sits down and orders a meal with specific requirements. • The waiter (PHP) takes those specific requirements back to the kitchen and passes them off to the chef (Apache). • The chef then goes to the stockroom (MySQL) to retrieve the ingredients (or data) to prepare the meal and presents the final dish to the patron, exactly the way he or she ordered the meal.

  14. Thank you. Q&A
