The Invisible Web INLS 200-001, Week 5, Session 9 Instructor: Sanghee Oh
Today • Logistics • Group Presentation • BSIS major/minor orientation (Stephanie at SILS) • Topics • Visible vs. Invisible Web • Activities • Invisible Web Search Exercise
What makes sites/pages “invisible”? • Different search engine algorithms order results differently. • No search engine indexes everything on the web. • Some sites deliberately exclude bots. • Sites change & move frequently, so every search engine db is out of date. • Some file formats can’t be read by spiders. • Some sites and parts of sites require a login. • Some sites and parts of sites are created dynamically.
Search Engine Comparison • 4 searches conducted in 10 search engines. • 50% of sites were found by only 1 engine. • Another 30% of sites were found by only 2 engines.
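A minimal sketch of how overlap figures like these could be computed, assuming you have already collected the set of URLs each engine returned. The engine names and URLs below are made-up toy data, not the study's results, and `overlap_shares` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Set


def overlap_shares(results_by_engine: Dict[str, Set[str]]) -> Dict[int, float]:
    """Return, for each k, the share of distinct URLs found by exactly k engines."""
    # Count how many engines returned each URL.
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        engines_per_url.update(urls)

    total = len(engines_per_url)
    # Count how many URLs were found by exactly 1 engine, 2 engines, and so on.
    urls_per_k = Counter(engines_per_url.values())
    return {k: urls_per_k[k] / total for k in sorted(urls_per_k)}


# Toy data (hypothetical): three engines, five distinct URLs.
toy_results = {
    "engine_a": {"u1", "u2", "u3"},
    "engine_b": {"u3", "u4"},
    "engine_c": {"u2", "u3", "u5"},
}
shares = overlap_shares(toy_results)
print(shares)
```

In the toy data, three of the five URLs appear in only one engine's results, which is the kind of low overlap the slide describes.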
Search Engine Comparison • Different search engines have different-sized indexes • Database Total Size Estimates • Sometimes different search engines are actually using the same index • Search Engine Relationship Chart
Some sites deliberately exclude bots. • Robots Exclusion Protocol • There are a lot of robots out there. • They generate a lot of traffic • Can be excluded from: • Individual pages using a META tag • Sections of a site using a file named robots.txt • Example: http://sils.unc.edu/robots.txt
User-agent: Googlebot
Disallow: /~shoh/

User-agent: Wild Ferret Web Hopper
Disallow: /~shoh
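As a rough illustration of how a well-behaved robot might apply these rules, the sketch below feeds the slide's example rules to Python's standard `urllib.robotparser`. The rules are pasted in as a string rather than fetched from the live site, and the page paths and the `SomeOtherBot` name are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the slide's example; nothing is fetched from the network.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /~shoh/

User-agent: Wild Ferret Web Hopper
Disallow: /~shoh
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch() queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "http://sils.unc.edu"

# Googlebot is excluded from the /~shoh/ section but not from the rest of the site.
print(parser.can_fetch("Googlebot", BASE + "/~shoh/index.html"))     # False
print(parser.can_fetch("Googlebot", BASE + "/index.html"))           # True
# A robot that is not named in the file is not restricted by these rules.
print(parser.can_fetch("SomeOtherBot", BASE + "/~shoh/index.html"))  # True
```

Note that the protocol is voluntary: it only keeps out robots that choose to check the file before crawling.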
File formats • What search engines can read: • HTML, SGML, XML • Text • Word • Excel • PowerPoint • RDF • PDF • PostScript • What search engines can’t read: • Everything else.
Dynamically-created Pages • The page does not exist until it is created in response to a specific query. • The content on the page is pulled from a database. • For example: Amazon.com • Bots can only follow links; they can’t conduct searches. • The URL or filename often gives some indication: • ?, .jsp, .asp, cgi, .php
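The URL hints on this slide can be turned into a rough check. The sketch below is a heuristic only: `looks_dynamic` is a hypothetical helper, reading "cgi" as either a `.cgi` extension or a `/cgi-bin/` directory is an assumption, and the example.com URLs are placeholders. Plenty of dynamic pages have clean URLs and would not be caught.

```python
from urllib.parse import urlparse

# File extensions taken from the slide's list of hints.
DYNAMIC_EXTENSIONS = (".jsp", ".asp", ".php", ".cgi")


def looks_dynamic(url: str) -> bool:
    """Guess whether a URL points to a dynamically created page."""
    parts = urlparse(url)
    path = parts.path.lower()
    if parts.query:            # a "?" followed by parameters
        return True
    if "/cgi-bin/" in path:    # conventional directory for CGI scripts
        return True
    return path.endswith(DYNAMIC_EXTENSIONS)


for url in (
    "http://example.com/search?q=tobacco",
    "http://example.com/catalog/item.php",
    "http://example.com/cgi-bin/lookup",
    "http://example.com/about.html",
):
    print(url, looks_dynamic(url))
```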
Why does the Invisible Web Exist? • Most of it is made up of the contents of thousands of specialized searchable databases made available via the web. • Rarely are such pages stored anywhere: it is easier and cheaper to dynamically generate the answer page for each query than to store all the possible pages containing all the possible answers to all the possible queries people could make to the database. • In-depth information available • Up-to-date information available
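To make the idea of generating the answer page for each query concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table, the film titles, and the function names are all invented for illustration; a real site would sit behind a web server and a much larger database.

```python
import sqlite3
from html import escape


def make_db() -> sqlite3.Connection:
    """Create a tiny in-memory database standing in for a site's catalog."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE films (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO films VALUES (?, ?)",
        [("Example Film A", 1999), ("Example Film B", 2004), ("Another Title", 2004)],
    )
    return conn


def render_results(conn: sqlite3.Connection, term: str) -> str:
    """Build the HTML answer page for one query; no page is stored anywhere."""
    rows = conn.execute(
        "SELECT title, year FROM films WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(title)} ({year})</li>" for title, year in rows)
    return f"<h1>Results for {escape(term)}</h1><ul>{items}</ul>"


page = render_results(make_db(), "Example")
print(page)
```

The page exists only as the return value of `render_results`, so a bot that just follows links never triggers it and never sees its content.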
Examples from our everyday life • Google Search • Internet Movie Database • U.S. Census Bureau • CNN.com • Mapquest
Finding Information in the Invisible Web • Find someone who has already collected pointers to databases and provided a list of resources • e.g. librarians, domain experts, etc. • Search for topic and “database” • e.g., research databases • Know your subject matter and start from a well-known piece of literature • e.g., citation search
Sites you should know about • The Invisible Web Directory: invisible-web.net • UNC library’s database page: www.lib.unc.edu • Google Scholar: scholar.google.com • Scirus: www.scirus.com • NCLive: www.nclive.org • Internet Public Library: www.ipl.org • Librarians’ Index to the Internet: lii.org • refdesk.com • FirstGov: www.firstgov.gov
Exercise • The purpose of this exercise is to practice your online search engine skills. For this exercise we will explore some documents produced from the tobacco litigation settlement. (1) tobaccodocuments.org (2) legacy.library.ucsf.edu • Both index the same original content, but were constructed independently
Next • Email me the names of the databases that your group chooses and your preferred presentation date by Sep 22. • Readings: • UNC Libraries Catalog Search Tips • WorldCat: Read Introduction • Antelman, K., Lynema, E. & Pace, A. K. (2006). Toward a 21st Century Library Catalog. Information Technology and Libraries, 25(3), 128-139. Available at: http://www.lib.ncsu.edu/staff/kaantelm/antelman_lynema_pace.pdf