Presentation Transcript

  1. Mining the Deep Web • Michael Hunter, Reference Librarian, Hobart and William Smith Colleges • For Western New York Library Resources Council Member Libraries’ Staff • Sponsored by the Western New York Library Resources Council

  2. For today . . . • From Web to Deep Web • Search Services: Genres and Differences • The Topography of the Internet • Mining the Deep Web: Techniques and Tips • Hands-on Session • Evaluating Deep Web Resources • Using Proprietary Software

  3. Web to Deep Web • 1991 – Gopher • Menu-based text only • You had to KNOW the sites • 1992 – Veronica • Menus of menus • Difficult to access

  4. Web to Deep Web • 1991 - Hyper-Text Markup Language • Linkage capability leads you to related information elsewhere • “Classic” Web Site • Relatively stable content of static, separate documents or files • Typically no larger than 1,000 documents navigated via static directory structures

  5. Web to Deep Web • 1994 – Lycos launched • First crawler-based search engine with database of 54,000 html documents (CMU) • Growth of html documents unprecedented and unanticipated • 2000 (April) “The Web is doubling in size every 8 months” (FAST)

  6. Web to Deep Web • 1996 – Three phenomena pivotal for the development of the Deep Web: • HTML-based database technology introduced • Bluestone’s Sapphire/Web, Oracle • Commercialization of the Web • Growth of home PC-users and e-commerce • Web Servers adapted to embrace “dynamic” serving of data • Microsoft’s ASP, Unix PHP and others

  7. Web to Deep Web • 1998 – Deep Web comes of age • Larger sites redesigned with a database orientation rather than a static directory structure • U.S. Bureau of the Census • Securities and Exchange Commission • Patent and Trademark Office

  8. Search Services: Genres and Differences • Exclusively crawler-created • Search engines • Meta search engines • Human created and/or influenced • Directories • Specialized search engines • Subject metasites • Deep Web gateway sites

  9. [Diagram: crawlers (CR) visit web servers (WS) and feed a central DATABASE. Legend: CR - Crawler, WS - Web Server]

  10. [Diagram: Users 1 through 7 send queries to the Search Engine, which answers from its DATABASE]

  11. Search Services: Exclusively Crawler Created • Database compiled through automated, link-dependent crawling and site submission • Unable to access • Dynamically-created pages • Proprietary, non-html filetypes • Multimedia • Software • Password-protected sites • Sites prohibiting crawlers (robots.txt exclusion)
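
The robots.txt exclusion mentioned on this slide can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleCrawler`, and the `example.com` URLs below are hypothetical and chosen only for illustration; a real crawler would fetch the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not from any real site):
# every crawler is asked to stay out of the /catalog/ area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A page under the excluded path is off limits to a well-behaved crawler.
print(allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/catalog/search"))
# A page outside the excluded path may be crawled.
print(allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/about.html"))
```

Pages excluded this way never enter the search engine's database, even though a person can browse to them directly.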

  12. Dynamically-created Web pages • Created at the moment of the query using the most recent version of the database. • Database-driven • Require interaction • Amazon.com • What titles are available? At what price? • Are there recent reviews? What about shipping? • Used widely in e-commerce, news, statistical and other time-sensitive sites.

  13. Dynamically-created Web pages • Why can’t crawlers download them? • Technically they can interact, within the limits of programming capability • Doing so is very costly and time-consuming for general search services

  14. Dynamically-created Web pages • How can a crawler detect a dynamically-created page? • From any of the following in the URL: ?, %, $, =, ASP, PHP, CFM and others

  15. proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309
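
The URL test described on the preceding slides can be sketched as a simple string check. This is a minimal illustration, assuming a crawler only looks for the listed characters and file extensions; the function name `looks_dynamic` and the `example.com` URLs are hypothetical, and real crawlers may apply other rules.

```python
# Characters and extensions the slides list as signs of a dynamic page.
DYNAMIC_MARKERS = ("?", "%", "$", "=")
DYNAMIC_EXTENSIONS = (".asp", ".php", ".cfm")


def looks_dynamic(url: str) -> bool:
    """Guess whether a URL points to a dynamically-created page."""
    lowered = url.lower()
    if any(marker in lowered for marker in DYNAMIC_MARKERS):
        return True
    return any(ext in lowered for ext in DYNAMIC_EXTENSIONS)


# The ProQuest URL from the slide contains "?" and "=".
proquest = ("proquest.umi.com/pqdweb?Did=000000209668731"
            "&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309")
print(looks_dynamic(proquest))
# A plain static page (hypothetical URL) has none of the markers.
print(looks_dynamic("http://www.example.com/about.html"))
```

A crawler using a rule like this would skip the ProQuest record and keep the static page.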

  16. Proprietary Filetypes • PDF • Spreadsheets • Word-processed documents • Google does it! Why can’t you?

  17. Google’s Deep Web Components: Non-html filetypes (1.75%) • Adobe Portable Document Format (pdf) • Adobe PostScript (ps) • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk • Lotus WordPro (lwp) • MacWrite (mw) • Microsoft Excel (xls) • Microsoft PowerPoint (ppt) • Microsoft Word (doc) • Microsoft Works (wks, wps, wdb) • Microsoft Write (wri) • Rich Text Format (rtf) • Text (ans, txt) • SEARCH SYNTAX: “california power shortage” filetype:pdf

  18. Google Non-html Filetypes: Warning! • FOR NON-HTML FILES • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file • INSTEAD, click the “View as HTML” option; no applications will be opened and no risk of virus or worm • NOTE: Titles for non-html files are frequently not descriptive of content
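
The search syntax on this slide can be assembled programmatically. The sketch below builds the quoted phrase plus `filetype:` operator and, as an assumption for illustration, places it in the `q` parameter of a Google search URL; the helper names `filetype_query` and `search_url` are hypothetical.

```python
from urllib.parse import urlencode


def filetype_query(phrase: str, extension: str) -> str:
    """Build an exact-phrase query limited to one file extension."""
    return f'"{phrase}" filetype:{extension}'


def search_url(query: str) -> str:
    """Percent-encode the query into a search URL (q parameter assumed)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = filetype_query("california power shortage", "pdf")
print(query)
print(search_url(query))
```

Swapping the extension (for example `ppt` or `xls`) restricts the same phrase search to another of the filetypes listed above.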

  19. “homeland security” filetype:ppt

  20. Search Services: Human created or influenced • Directories – general and specialized • Specialized search engines • Subject metasites or gateways • Deep Web gateways

  21. Search Services: Human created or influenced • Content of sites is examined and categorized or crawling is human-focused and refined • CAN include sites with dynamically created pages • CAN be limited to database-driven sites (Deep Web) • CAN include non-html files • NOTE: Some specialized search engines may include little human influence, e.g. Search.edu

  22. The Topography of the Internet, or The Layers of the Web • Mapping the web is challenging • Unregulated in nature • Influences from all over the globe • Fulfills many purposes, from personal to commercial • Changes rapidly and unexpectedly • Divisions and terminology are inherently ambiguous, e.g. “Deep” vs “Invisible” Web

  23. May I suggest a biological, nautical metaphor, perhaps the ocean? • SURFACE WEB • SHALLOW WEB • OPAQUE WEB • DEEP WEB

  24. Surface Web • Static html documents • Crawler-accessible

  25. Shallow Web • Static html documents loaded on servers that use ColdFusion or Lotus Domino or other similar software • A different URL for the same page is created each time it is served. • Crawlers skip these to avoid multiple copies of the same page in their database • Technically human accessible via directories, Deep Web gateways or links from other sites

  26. Opaque Web • Static html documents • Technically crawler accessible • 2 types: • Downloaded and indexed by crawler • Not downloaded or indexed by crawler

  27. Opaque Web • Downloaded and indexed by crawler • Buried in search results you never look at • A casualty of “relevance” ranking • Not downloaded or indexed by crawler due to programmed download limits • Document buried deep in the site • Part of a large document that did not get downloaded (Typical crawl per page is 110 K or less) • Document added since last crawler visit (Even the best revisit on an average of every 2 weeks, depending on amount of change at a site)
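
The effect of a per-page download limit can be illustrated with a small sketch. It assumes the crawler simply keeps the first 110 KB of each document (reading the slide's "110 K" as 110 × 1024 bytes, which is an assumption) and indexes only that portion; the function names and sample documents are hypothetical.

```python
# Assumed interpretation of the slide's "110 K" crawl limit.
CRAWL_LIMIT_BYTES = 110 * 1024


def indexed_portion(document: bytes, limit: int = CRAWL_LIMIT_BYTES) -> bytes:
    """Return the part of a document a size-limited crawler would keep."""
    return document[:limit]


def is_findable(document: bytes, phrase: bytes,
                limit: int = CRAWL_LIMIT_BYTES) -> bool:
    """True if the phrase falls inside the downloaded portion."""
    return phrase in indexed_portion(document, limit)


phrase = b"deep web statistics"
filler = b"x" * 200_000

# Phrase near the top of a large document: downloaded and searchable.
print(is_findable(phrase + filler, phrase))
# Same phrase at the end of the document: beyond the limit, never indexed.
print(is_findable(filler + phrase, phrase))
```

The second document is technically crawler-accessible, yet the phrase at its end stays out of the index, which is what places such content in the Opaque Web.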

  28. Opaque Web • Access to the Opaque Web • Specialized search engines • General and specialized directories • Subject metasites • These services typically index more thoroughly and more often than large, general search engines

  29. Deep Web: Two Categories • Technically inaccessible to crawlers • Technically accessible to crawlers

  30. Deep Web • Technically inaccessible to crawlers • Dynamically created pages • Databases • Non-textual files • Password protected sites • Sites prohibiting crawlers

  31. Deep Web • Technically accessible to crawlers • Textual files in non-html formats (Google does it!) • Pages excluded from crawler by editorial policy or bias

  32. Mining the Deep Web: Techniques and Tips

  33. How large is the Deep Web? • White Paper by Michael K. Bergman published in the Journal of Electronic Publishing in 2000. • http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp • Currently a scarcity of unbiased research due to its fluid nature, dynamic content and multiple points of access

  34. How large is the Deep Web? Bergman Study • Over 150,000 databases • Over 95% publicly available • Perhaps 500 times larger than the Surface Web • Growth rate currently greater than the Surface Web

  35. What’s in the Deep Web? • Information likely to be stored in a database • People, address, phone number locators • Patents • Laws • Dictionary definitions • Items for sale or auction • Technical reports • Other specialized data

  36. What’s in the Deep Web? • Information that is new and dynamically changing • News • Job postings • Travel schedules and prices • Financial data • Library catalogs and databases • Topical coverage is extremely varied.

  37. Mining the Deep Web: A world different from search engines . . . • Hunter’s Maxim for Searching the Deep Web: Plan to first locate the category of information you want, then browse. Don’t be too specific in your searches. Cast a wide net. Brush up on your Gopher-type search skills (if you were searching the ‘Net back then). We’ve become accustomed to search engine free-text searching. This is a different world.

  38. Basic Strategies for Mining the Deep Web • Using directories, general and specialized • Using general search engines • Using specialized (subject-focused) search engines • Using subject metasites (link-oriented) • Using Deep Web gateway sites (database-oriented) NOTE: Many sites contain elements of all of the above, in varying degrees and combinations

  39. Using directories • Yahoo! > “web directories” > 840 category matches • Yahoo! > database > 22 categories and 7423 site matches • Google Directory > link collections > 493,000 • Databases may also be found under general subject categories • Also use research directories such as Infomine, LII, WWWVL and others

  40. Using general search engines • Combine subject terms with one or more of these possibilities: • directory • crawler • search engine • database • webring or web ring • link collection • blog

  41. Using general search engines • Google (11/4/02) • “toxic chemicals database” > 45 • “punk rock search engine” > 77 • “science fiction webring” > 97 (web rings are cooperative subject metasites, maintained by experts or aficionados) • Remember, when using a search engine you must match words on the page.
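
The strategy of pairing a subject with resource-type words can be sketched as a small query generator. The term list comes from the preceding slide; the function name `gateway_queries` is hypothetical, and wrapping each pair in quotation marks follows the quoted examples on this slide.

```python
# Resource-type terms listed on the "Using general search engines" slide.
RESOURCE_TERMS = [
    "directory",
    "crawler",
    "search engine",
    "database",
    "webring",
    "web ring",
    "link collection",
    "blog",
]


def gateway_queries(subject: str) -> list[str]:
    """Combine a subject with each resource term as an exact-phrase query."""
    return [f'"{subject} {term}"' for term in RESOURCE_TERMS]


for query in gateway_queries("toxic chemicals"):
    print(query)
```

Because a search engine must match words on the page, running several of these variants is a way to cast a wider net than any single phrase would.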

  42. Using specialized (subject-focused) search engines • AKA • Limited-area engines • Targeted search engines • Expert search services • Vertical Portals • Vortals

  43. Using specialized (subject-focused) search engines • Non-html textual files • http://searchpdf.adobe.com/ • Google • Non-textual files • Image, MP3 search engines • Media search at Google, et al. • Software • Blogs • Blogdex http://blogdex.media.mit.edu/

  44. Web logs or blogs • Online personal journals • Postings are often centered around a particular topic or issue and may contain links to recent relevant information • Frequently updated • Differ from newsgroups in that they are generally by one author

  45. Web logs or blogs • How do you search them? • Blogdex http://blogdex.media.mit.edu • Open Directory http://dmoz.org Computers / Internet / On the Web / Weblogs • Are they part of the Deep Web? • Yes and No