Mining the Deep Web

Michael Hunter, Reference Librarian, Hobart and William Smith Colleges • For Western New York Library Resources Council Member Libraries’ Staff • Sponsored by the Western New York Library Resources Council

For today . . . • From Web to Deep Web • Search Services: Genres and Differences • The Topography of the Internet • Mining the Deep Web: Techniques and Tips • Hands-on Session • Evaluating Deep Web Resources • Using Proprietary Software

Web to Deep Web • 1991 – Gopher • Menu-based text only • You had to KNOW the sites • 1992 – Veronica • Menus of menus • Difficult to access

Web to Deep Web • 1991 - Hyper-Text Markup Language • Linkage capability leads you to related information elsewhere • “Classic” Web Site • Relatively stable content of static, separate documents or files • Typically no larger than 1,000 documents navigated via static directory structures

Web to Deep Web • 1994 – Lycos launched • First crawler-based search engine with database of 54,000 html documents (CMU) • Growth of html documents unprecedented and unanticipated • 2000 (April) “The Web is doubling in size every 8 months” (FAST)

Web to Deep Web • 1996 – Three phenomena pivotal for the development of the Deep Web: • HTML-based database technology introduced • Bluestone’s Sapphire/Web, Oracle • Commercialization of the Web • Growth of home PC-users and e-commerce • Web Servers adapted to embrace “dynamic” serving of data • Microsoft’s ASP, Unix PHP and others

Web to Deep Web • 1998 – Deep Web comes of age • Larger sites redesigned with a database orientation rather than static directory structure • U.S. Bureau of the Census • Securities and Exchange Commission • Patent and Trademark Office

Search Services: Genres and Differences • Exclusively crawler-created • Search engines • Meta search engines • Human created and/or influenced • Directories • Specialized search engines • Subject metasites • Deep Web gateway sites

[Diagram: Web servers (WS), crawlers (CR) and the search engine DATABASE. Legend: CR - Crawler, WS - Web Server]

[Diagram: Users 1 through 7 and the Search Engine DATABASE]

Search Services: Exclusively Crawler Created • Database compiled through automated, link-dependent crawling and site submission • Unable to access • Dynamically-created pages • Proprietary, non-html filetypes • Multimedia • Software • Password-protected sites • Sites prohibiting crawlers (robots.txt exclusion)
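The robots.txt exclusion in the last bullet can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URLs and the `ExampleCrawler` agent name are all hypothetical; a real crawler would fetch the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /catalog/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/
"""


def allowed(url: str, agent: str = "ExampleCrawler") -> bool:
    """Return True if the robots.txt rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html",
                "http://example.org/catalog/search"):
        print(url, "->", "crawl" if allowed(url) else "skip")
```

A well-behaved crawler runs a check like this before each download, which is why pages under an excluded path never reach the search engine's database.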

Dynamically-created Web pages • Created at the moment of the query using the most recent version of the database. • Database-driven • Require interaction • Amazon.com • What titles are available? At what price? • Are there recent reviews? What about shipping? • Used widely in e-commerce, news, statistical and other time-sensitive sites.

Dynamically-created Web pages • Why can’t crawlers download them? • Technically they can interact, within the limits of programming capability • Doing so is very costly and time-consuming for general search services

Dynamically-created Web pages • How can a crawler detect a dynamically-created page? • From any of the following in the URL: ? , % , $ , = , ASP , PHP , CFM and others

proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309
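The URL test described above can be sketched as a small heuristic. This is an illustration only, not how any particular search engine is known to work: the function name `looks_dynamic` and the `example.org` URLs are hypothetical, and the marker list covers just the characters and extensions named on the slide.

```python
from urllib.parse import urlsplit

# Markers taken from the slide; real crawlers may use other rules.
DYNAMIC_CHARS = set("?%$=")
DYNAMIC_EXTENSIONS = (".asp", ".php", ".cfm")


def looks_dynamic(url: str) -> bool:
    """Guess whether a URL points at a dynamically created page."""
    if any(ch in DYNAMIC_CHARS for ch in url):
        return True
    # Add a scheme-less prefix so urlsplit separates host and path correctly.
    path = urlsplit(url if "//" in url else "//" + url).path.lower()
    return path.endswith(DYNAMIC_EXTENSIONS)


if __name__ == "__main__":
    samples = [
        "proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1",
        "http://example.org/catalog/item.asp",
        "http://example.org/about/index.html",
    ]
    for url in samples:
        print(url, "->", "dynamic" if looks_dynamic(url) else "static")
```

A crawler applying such a rule would skip the ProQuest URL above because of the `?` and `=` characters, even though the page behind it exists.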

Proprietary Filetypes • PDF • Spreadsheets • Word-processed documents • Google does it! Why can’t you?

Google’s Deep Web Components: Non-html filetypes (1.75%) • Adobe Portable Document Format (pdf) • Adobe PostScript (ps) • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk • Lotus WordPro (lwp) • MacWrite (mw) • Microsoft Excel (xls) • Microsoft PowerPoint (ppt) • Microsoft Word (doc) • Microsoft Works (wks, wps, wdb) • Microsoft Write (wri) • Rich Text Format (rtf) • Text (ans, txt) • SEARCH SYNTAX: “california power shortage” filetype:pdf

Google Non-html Filetypes: Warning! • FOR NON-HTML FILES • Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file • INSTEAD, click the “View as HTML” option; no applications will be opened and no risk of virus or worm • NOTE: Titles for non-html files are frequently not descriptive of content

“homeland security” filetype:ppt
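The filetype search syntax can be generated for several formats at once. The following is a minimal sketch; the helper name `filetype_queries` is hypothetical, and the extensions are a few of those listed for Google above.

```python
def filetype_queries(phrase: str, extensions: list[str]) -> list[str]:
    """Build one quoted-phrase query per file extension, e.g. "..." filetype:pdf."""
    return [f'"{phrase}" filetype:{ext}' for ext in extensions]


if __name__ == "__main__":
    # Paste each printed line into the search box, one search per filetype.
    for query in filetype_queries("homeland security", ["ppt", "pdf", "xls"]):
        print(query)
```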

Search Services: Human created or influenced • Directories – general and specialized • Specialized search engines • Subject metasites or gateways • Deep Web gateways

Search Services: Human created or influenced • Content of sites is examined and categorized or crawling is human-focused and refined • CAN include sites with dynamically created pages • CAN be limited to database-driven sites (Deep Web) • CAN include non-html files • NOTE: Some specialized search engines may include little human influence, e.g. Search.edu

The Topography of the Internet, or The Layers of the Web • Mapping the web is challenging • Unregulated in nature • Influences from all over the globe • Fulfills many purposes, from personal to commercial • Changes rapidly and unexpectedly • Divisions and terminology are inherently ambiguous, e.g. “Deep” vs “Invisible” Web

May I suggest a biological, nautical metaphor, perhaps the ocean? • SURFACE WEB • SHALLOW WEB • OPAQUE WEB • DEEP WEB

Surface Web • Static html documents • Crawler-accessible

Shallow Web • Static html documents loaded on servers that use ColdFusion or Lotus Domino or other similar software • A different URL for the same page is created each time it is served. • Crawlers skip these to avoid multiple copies of the same page in their database • Technically human accessible via directories, Deep Web gateways or links from other sites

Opaque Web • Static html documents • Technically crawler accessible • 2 types: • Downloaded and indexed by crawler • Not downloaded or indexed by crawler

Opaque Web • Downloaded and indexed by crawler • Buried in search results you never look at • A casualty of “relevance” ranking • Not downloaded or indexed by crawler due to programmed download limits • Document buried deep in the site • Part of a large document that did not get downloaded (Typical crawl per page is 110 K or less) • Document added since last crawler visit (Even the best revisit on an average of every 2 weeks, depending on amount of change at a site)

Opaque Web • Access to the Opaque Web • Specialized search engines • General and specialized directories • Subject metasites • These services typically index more thoroughly and more often than large, general search engines

Deep Web: Two Categories • Technically inaccessible to crawlers • Technically accessible to crawlers

Deep Web • Technically inaccessible to crawlers • Dynamically created pages • Databases • Non-textual files • Password protected sites • Sites prohibiting crawlers

Deep Web • Technically accessible to crawlers • Textual files in non-html formats (Google does it!) • Pages excluded from crawler by editorial policy or bias

Mining the Deep Web: Techniques and Tips

How large is the Deep Web? • White Paper by Michael K. Bergman published in the Journal of Electronic Publishing in 2000. • http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp • Currently a scarcity of unbiased research due to its fluid nature, dynamic content and multiple points of access

How large is the Deep Web? Bergman Study • Over 150,000 databases • Over 95% publicly available • Perhaps 500 times larger than the Surface Web • Growth rate currently greater than the Surface Web

What’s in the Deep Web? • Information likely to be stored in a database • People, address, phone number locators • Patents • Laws • Dictionary definitions • Items for sale or auction • Technical reports • Other specialized data

What’s in the Deep Web? • Information that is new and dynamically changing • News • Job postings • Travel schedules and prices • Financial data • Library catalogs and databases • Topical coverage is extremely varied.

Mining the Deep Web: A world different from search engines . . . Hunter’s Maxim for Searching the Deep Web • Plan to first locate the category of information you want, then browse. • Don’t be too specific in your searches. Cast a wide net. • Brush up on your Gopher-type search skills (if you were searching the ‘Net back then). • We’ve become accustomed to search engine free-text searching. This is a different world.

Basic Strategies for Mining the Deep Web • Using directories, general and specialized • Using general search engines • Using specialized (subject-focused) search engines • Using subject metasites (link-oriented) • Using Deep Web gateway sites (database-oriented) NOTE: Many sites contain elements of all of the above, in varying degrees and combinations

Using directories • Yahoo! > “web directories” > 840 category matches • Yahoo! > database > 22 categories and 7423 site matches • Google Directory > link collections > 493,000 • Databases may also be found under general subject categories • Also use research directories such as Infomine, LII, WWWVL and others

Using general search engines • Combine subject terms with one or more of these possibilities: • directory • crawler • search engine • database • webring or web ring • link collection • blog

Using general search engines • Google (11/4/02) “toxic chemicals database” > 45 “punk rock search engine” > 77 “science fiction webring” > 97 (web rings are cooperative subject metasites, maintained by experts or aficionados) • Remember, when using a search engine you must match words on the page.
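Pairing a subject with each of the genre terms can be automated. This is a minimal sketch, assuming exact-phrase queries like the Google examples above; the function name `genre_queries` is hypothetical, and the term list is the one given on the preceding slide.

```python
# Genre terms from the slide; extend the list as needed.
GENRE_TERMS = [
    "directory", "crawler", "search engine", "database",
    "webring", "web ring", "link collection", "blog",
]


def genre_queries(subject: str, genres: list[str] = GENRE_TERMS) -> list[str]:
    """Return one quoted phrase per genre term, e.g. "toxic chemicals database"."""
    return [f'"{subject} {genre}"' for genre in genres]


if __name__ == "__main__":
    for query in genre_queries("toxic chemicals"):
        print(query)
```

Because a search engine must match words on the page, an exact phrase that returns nothing is worth retrying without the quotation marks.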

Using specialized (subject-focused) search engines • AKA • Limited-area engines • Targeted search engines • Expert search services • Vertical Portals • Vortals

Using specialized (subject-focused) search engines • Non-html textual files • http://searchpdf.adobe.com/ • Google • Non-textual files • Image, MP3 search engines • Media search at Google, et al. • Software • Blogs • Blogdex http://blogdex.media.mit.edu/

Web logs or blogs • Online personal journals • Postings are often centered around a particular topic or issue and may contain links to recent relevant information • Frequently updated • Differ from newsgroups in that they are generally by one author

Web logs or blogs • How do you search them? • Blogdex http://blogdex.media.mit.edu • Open Directory http://dmoz.org Computers / Internet / On the Web / Weblogs • Are they part of the Deep Web? • Yes and No
