The Invisible Web

The Invisible Web. Chris Sherman Editor, SearchDay SearchEngineWatch.com Information Online 2003. Overview. How Search Engines Work What is the Invisible Web? Tactics for Searching the Invisible Web Future Trends. The Parts of a Search Engine. Three main parts of every search engine:

Presentation Transcript


  1. The Invisible Web. Chris Sherman, Editor, SearchDay, SearchEngineWatch.com. Information Online 2003, Sydney, Australia, January 23, 2003

  3. Overview • How Search Engines Work • What is the Invisible Web? • Tactics for Searching the Invisible Web • Future Trends

  4. The Parts of a Search Engine • Three main parts of every search engine: • The Crawler (aka spider) • The Indexer • The Search Engine Database

  5. How Search Engines Work [Diagram: The Web → Crawler (URL1, URL2, URL3, URL4) → Indexer → Search Engine Database. Your Browser sends the query "Eggs?" to the database and receives ranked results: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%. Sample page: "All About Eggs" by S. I. Am.]
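
The indexer and database stages can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: the page texts, the `build_index` and `search` names, and the simple all-words-must-match rule are hypothetical and are not taken from the presentation, which does not describe how any real engine indexes or ranks pages.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical page texts standing in for what a crawler handed over.
PAGES = {
    "URL1": "All about eggs",
    "URL2": "Eggo waffles",
    "URL3": "Green eggs and ham",
}

index = build_index(PAGES)
print(search(index, "eggs"))
print(search(index, "green eggs"))
```

A query can only match what the index contains, so a page the crawler never delivered cannot appear in any result.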

  6. How Crawlers Work • Crawlers are like hyper-caffeinated browsers • Seeded with a set of URLs • Download Web pages, then: • Extract all links on every page for further crawling • Hand the page off to the indexer
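
The crawl loop on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the `crawl` function, its `fetch` and `index_page` callbacks, and the in-memory `FAKE_WEB` are all hypothetical names, and the "web" is a dictionary rather than real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, index_page, max_pages=100):
    """Breadth-first crawl: download, hand off to the indexer, queue new links."""
    frontier = deque(seeds)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:  # page could not be downloaded
            continue
        crawled.append(url)
        index_page(url, html)  # hand the page off to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


# Hypothetical four-page web held in memory instead of fetched over HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
    "http://example.com/orphan.html": "nobody links to me",
}

indexed = {}
order = crawl(["http://example.com/"], FAKE_WEB.get, indexed.__setitem__)
print(order)
```

The page `orphan.html` is never reached because no crawled page links to it, which is the situation the deck calls a disconnected page.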

  7. The Bow Tie Model • 30% in the core • 24% origination pages • 24% termination pages • 22% disconnected pages -- these are effectively invisible to search engines (Source: IBM)
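
The four groups can be made concrete on a toy link graph. The sketch below assumes one page known to be in the core and uses forward and backward reachability from it; the `bow_tie` function, the `LINKS` graph, and the page names are hypothetical, and the original study's finer categories are collapsed into "disconnected" as on the slide.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start by following edges in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links, core_page):
    """Classify pages relative to the core that contains core_page."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    reverse = {}
    for src, targets in links.items():
        for dst in targets:
            reverse.setdefault(dst, []).append(src)
    downstream = reachable(links, core_page)  # pages the core can reach
    upstream = reachable(reverse, core_page)  # pages that can reach the core
    core = downstream & upstream
    return {
        "core": core,
        "origination": upstream - core,
        "termination": downstream - core,
        "disconnected": pages - upstream - downstream,
    }


# Hypothetical six-page web: keys link to the pages in their lists.
LINKS = {
    "portal": ["news"],
    "news": ["portal", "archive"],
    "new_site": ["portal"],
    "archive": [],
    "island": ["island_2"],
    "island_2": [],
}

parts = bow_tie(LINKS, "portal")
print(parts)
```

A crawler seeded inside the core would reach the core and termination pages, but never the origination or disconnected ones.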

  8. What is the Invisible Web? • “Stuff” that search engine crawlers (spiders) cannot -- or will not -- add to their databases • Estimated to be 2 to 50 times larger than the visible Web • Resources often much higher quality than the visible Web

  9. What is the Invisible Web? • Certain file formats (PDF, Flash, Office files, streaming media) • Why? They aren’t HTML text • Most real-time data (stock quotes, weather, airline flight info) • Why? Ephemeral & storage intensive

  10. What is the Invisible Web? • Dynamically generated pages (CGI, JavaScript, ASP, or most pages with “?” in URL) • Why? Spider traps • Web-accessible databases • Why? Spiders can’t type
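
The "?" rule of thumb on this slide can be written as a small check. This is a rough heuristic sketch, not a description of how any crawler actually decides: the `looks_dynamic` name, the suffix list, and the `/cgi-bin/` path test are assumptions added for illustration.

```python
from urllib.parse import urlsplit

# Assumed markers of dynamically generated pages; extend as needed.
DYNAMIC_SUFFIXES = (".cgi", ".asp")


def looks_dynamic(url):
    """Guess whether a URL points at a dynamically generated page."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.query:  # anything after "?" in the URL
        return True
    if "/cgi-bin/" in path:  # conventional CGI script directory
        return True
    return path.endswith(DYNAMIC_SUFFIXES)


for url in (
    "http://example.com/about.html",
    "http://example.com/search?q=eggs",
    "http://example.com/cgi-bin/lookup",
    "http://example.com/catalog.asp",
):
    print(url, looks_dynamic(url))
```

A cautious crawler might skip URLs flagged this way to avoid spider traps, at the cost of missing content that is in fact safe to fetch.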

  11. The Opaque Web • Visible pages “hidden” behind dynamic navigation codes • Mostly graphic, non-text pages • “Disconnected” pages

  12. The URL Test

  13. The URL Test

  14. The URL Test

  15. The URL Test

  16. The URL Test

  17. The URL Test

  18. Invisible Web Searching: Core Tactics • The first step in determining the best approach for searching the Invisible Web is to have a clear idea of what you’re seeking. • Limit your search to appropriate tools for the particular type of information you’re looking for.

  19. Use Invisible Web Pathfinders • Intelliseek • http://www.invisibleweb.com • Invisible-web.net • http://www.invisible-web.net/ • Librarians’ Index to the Internet • http://www.lii.org

  20. Finding Non-HTML File Formats • Google & AlltheWeb: use the filetype operator • filetype:pdf • filetype:doc • Use specialized engines • searchpdf.adobe.com • ResearchIndex
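
A query using the filetype operator is just the search terms plus a `filetype:` token. The helper below is a small sketch; the `with_filetype` name is hypothetical, and the URL-encoding step only shows how such a query would look inside a query string, without assuming any particular engine's URL format.

```python
from urllib.parse import quote_plus


def with_filetype(terms, extension):
    """Append a filetype: restriction to a plain-text search query."""
    extension = extension.lower().lstrip(".")
    return f"{terms.strip()} filetype:{extension}"


query = with_filetype("invisible web", "pdf")
encoded = quote_plus(query)  # form suitable for a URL query string

print(query)
print(encoded)
```

Passing ".DOC" or "doc" gives the same `filetype:doc` token, which keeps the operator in the lowercase form shown on the slide.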

  21. Finding Real-Time Information • Weather Underground • Google News Search • Yahoo Finance • J-Track Spacecraft Tracker

  22. Finding Images • Google/FAST/AltaVista Image Search • Google Catalogs • Visoo • Webseek @ Columbia

  23. Finding Streaming Media Files • Speechbot • Singingfish • MSN Music • British Pathe • WindowsMedia.com v.9 player

  24. Future Trends: The Invisible Web Revealed • More “difficult” content indexed • Flash, dynamic content • “Data centric” search engines • ResearchIndex • Agent-brokered database search • Form crawlers

  25. Conclusion • Searching the Invisible Web isn’t hard. It just takes a different mindset. • It’s crucial to develop your own personal collection. • Expect the unexpected: the boundary between visible and invisible is changing as we speak.

  26. More Info • CyberAge Books, ISBN 0-910965-51-X • http://www.invisible-web.net

  27. More Ranting • SearchDay Newsletter • http://searchenginewatch.com/searchday/ • Searchwise • http://www.searchwise.net • csherman@searchwise.net