Web Characterization Week 9 LBSC 690 Information Technology
Outline • What is the Web? • What’s on the Web? • What is the nature of the Web? • Preserving the Web
Defining the Web • HTTP, HTML, or URL? • Static, dynamic or streaming? • Public, protected, or internal?
Economics of the Web in 1995 • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos (now Google), Yahoo
Nature of the Web • Over one billion pages by 1999 • Growing at 25% per month! • Google indexed about 3 billion pages in 2003 • Unstable • Changing at 1% per week • Redundant • 30-40% (near) duplicates • e.g., Unix man page tree
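To get a feel for the growth and change rates on this slide, here is a minimal arithmetic sketch. It assumes the rates compound uniformly (25% growth every month, and an independent 1% of pages changing every week), which the slide does not state; the function names are made up for illustration.

```python
import math


def annual_growth_factor(monthly_rate: float) -> float:
    """Factor by which a quantity grows in 12 months of compounding."""
    return (1.0 + monthly_rate) ** 12


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed to double at a constant compounding monthly rate."""
    return math.log(2.0) / math.log(1.0 + monthly_rate)


def fraction_unchanged(weekly_change: float, weeks: int) -> float:
    """Share of pages still unchanged after `weeks`, assuming each week
    an independent `weekly_change` share of the remaining pages changes."""
    return (1.0 - weekly_change) ** weeks


if __name__ == "__main__":
    print(f"Yearly growth factor at 25%/month: {annual_growth_factor(0.25):.1f}x")
    print(f"Doubling time at 25%/month: {doubling_time_months(0.25):.1f} months")
    print(f"Unchanged after 52 weeks at 1%/week: {fraction_unchanged(0.01, 52):.0%}")
```

Under these assumptions a 25% monthly rate is far more than a 300% yearly rate, which is why a rate like this cannot be sustained for long.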
Source: Michael Lesk, How Much Information is there in the World?
What’s a Web “Site”? • OCLC counts any server at port 80 • Misses many servers at other ports • Some servers host unrelated content • Geocities • Some content requires specialized servers • rtsp
World Trade in 2001 Source: World Trade Organization
Global Internet User Population, 2000 and 2005 (charts; labeled segments include English and Chinese) Source: Global Reach
Widely Spoken Languages Source: http://www.g11n.com/faq.html
Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
Web Page Languages Source: Jack Xu, Excite@Home, 1999
European Web Size: Exponential Growth Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
Live Streams • Almost 2,000 Internet-accessible radio and television stations Source: www.real.com, Feb 2000
Streaming Media • SingingFish indexes 35 million streams • 60% of queries are for music • Then movies • Then sports • Then news
Web Crawl Challenges • Temporary server interruptions • Discovering “islands” and “peninsulas” • Duplicate and near-duplicate content • Dynamic content • Link rot • Server and network loads • Have I seen this page before?
Duplicate Detection • Structural • Identical directory structure (e.g., mirrors, aliases) • Syntactic • Identical bytes • Identical markup (HTML, XML, …) • Semantic • Identical content • Similar content (e.g., with a different banner ad) • Related content (e.g., translated)
Robots Exclusion Protocol • Based on voluntary compliance by crawlers • Exclusion by site • Create a robots.txt file at the server’s top level • Indicate which directories not to crawl • Exclusion by document (in HTML head) • Not implemented by all crawlers <meta name="robots" content="noindex,nofollow">
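As an illustration of site-level exclusion, the sketch below shows how a compliant crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the crawler name `ExampleCrawler`, and the example.org URLs are all invented for the example; a real crawler would download the file from the server's top level instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to skip two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, url: str,
              agent: str = "ExampleCrawler") -> bool:
    """True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://example.org/index.html",
                "http://example.org/private/notes.html"):
        print(url, "->", "crawl" if may_crawl(rules, url) else "skip")
```

Nothing enforces these rules: the check only has an effect because the crawler chooses to run it, which is what voluntary compliance means here.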
The Deep Web • Dynamic pages, generated from databases • Not easily discovered using crawling • Perhaps 400-500 times larger than surface Web • Fastest growing source of new information
Deep Web • 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |
Hands on: The Wayback Machine • Internet Archive • Stored Alexa.com Web crawls since 1997 • http://archive.org • Check out Maryland’s Web site in 1997 • Check out the history of your favorite site
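For the hands-on exercise, a snapshot can also be requested by address. The sketch below builds a Wayback Machine URL of the form `https://web.archive.org/web/<timestamp>/<page URL>`, which the archive normally resolves to the capture closest to that timestamp. The helper name and the example.org address are placeholders, not taken from the slide; substitute the site you want to look at.

```python
def wayback_url(page_url: str, year: int) -> str:
    """Build a Wayback Machine address asking for the capture nearest
    to 1 January of `year` (timestamp format: YYYYMMDDhhmmss)."""
    timestamp = f"{year:04d}0101000000"
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Placeholder site; paste the printed address into a browser.
    print(wayback_url("http://example.org/", 1997))
```

If no capture exists near the requested date, the archive may show a much later snapshot or none at all, so check the date in the address bar after the page loads.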
Discussion Point • Can we save everything? • Should we? • Do people have a right to remove things?