Searching the Web

The Web

  • Why is it important:

    • “Free” ubiquitous information resource

    • Broad coverage of topics and perspectives

    • Becoming dominant information collection

    • Growth and jobs

  • Web access methods

    • Search (e.g. Google)

    • Directories (e.g. Yahoo!)

    • Other …


Web Characteristics

  • Distributed data

    • 80 million web sites (hostnames responding) in April 2006

    • 40 million active web sites (don’t redirect, …)

  • High volatility

    • Servers come and go …

  • Large volume

    • One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)

    • “Dark Web” – content not indexed and not crawlable; estimated to be 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)


Web Characteristics

  • Unstructured data

    • Lots of duplicated content (30% estimate)

    • Semantic duplication much higher

  • Quality of data

    • No required editorial process

    • Many typos and misspellings (impacts IR)

  • Heterogeneous data

    • Different media

    • Different languages

  • These characteristics are not going to change


Web Content Types

[Chart of Web content types. Source: How Much Information? 2003]


Search Engine Architecture

[Diagram: Users interact with an Interface backed by a Query Engine, which answers queries from an Index; a Crawler walks the Web and feeds an Indexer that builds the Index, all running on “lots and lots of computers”.]


Search Engine Architecture

[The same diagram annotated with the textbook chapters that cover each part: Evaluation (Chapter 3), Chapter 10, Chapters 2, 4, & 5, Chapter 8, and Chapters 6 & 7.]




Hubs and Authorities

  • Hubs

    • Have lots of links to other pages

  • Authorities

    • Have lots of links that point to them

  • Can use feedback to rank hubs and authorities (sketched below)

    • Better hubs have links to good authorities

    • Better authorities have links from good hubs
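
The feedback idea above is essentially the HITS algorithm. Below is a minimal Python sketch, assuming the link graph is already available as a dictionary mapping each page to the list of pages it links to (a toy format chosen for illustration):

    # Iterative hub/authority scoring over a toy link graph.
    def hits(links, iterations=20):
        pages = set(links) | {p for targets in links.values() for p in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Better authorities have links from good hubs.
            auth = {p: 0.0 for p in pages}
            for src, targets in links.items():
                for dst in targets:
                    auth[dst] += hub[src]
            # Better hubs have links to good authorities.
            hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
            # Normalize so the scores stay bounded.
            auth_len = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            hub_len = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / auth_len for p, v in auth.items()}
            hub = {p: v / hub_len for p, v in hub.items()}
        return hub, auth

    # Example: "a" links to "b" and "c"; "b" links to "c", so "c" scores highest as an authority.
    hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": []})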


Crawling the Web

  • Creating a Web Crawler (Web Spider)

  • Simplest Technique (sketched after this list)

    • Start with a set of URLs

    • Extract URLs pointed to in original set

    • Continue using either breadth-first or depth-first

  • Works for one crawler but hard to coordinate for many crawlers

    • Partition Web by domain name, ip address, or other technique

    • Each crawler has its own set but shares a to-do list
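
A minimal Python sketch of the simplest technique above; the requests library, the page limit, and the crude href regex are illustrative assumptions, and a real crawler would add robots.txt checks, politeness, and better error handling:

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    def crawl(seeds, max_pages=100):
        to_do = deque(seeds)              # the shared to-do list of URLs
        visited = set()
        while to_do and len(visited) < max_pages:
            url = to_do.popleft()         # popleft() gives breadth-first; pop() would give depth-first
            if url in visited:
                continue
            visited.add(url)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            # Extract URLs pointed to by this page and add them to the to-do list.
            for href in re.findall(r'href="([^"#]+)"', html):
                to_do.append(urljoin(url, href))
        return visited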


Crawling the Web

  • Need to recrawl

    • Indexed content is always out of date

    • Sites come and go, and some disappear for a time only to reappear later

  • Order of URLs traversed makes a difference

    • Breadth-first matches the hierarchic organization of content

    • Depth-first gets to deeper content faster

    • Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)


Server and Author Control of Crawling

  • Avoid crawling sites that do not want to be crawled

    • Legal issue

    • Robot exclusion protocol (server-level control; see the sketch after this list)

      • a file that indicates which portions of a web site should not be visited by crawlers

      • http://.../robots.txt

    • Robot META tag (Author level control)

      • used to indicate if a file (page) should be indexed or analyzed for links

      • few crawlers implement this

      • <meta name="robots" content="noindex, nofollow">

      • http://www.robotstxt.org/wc/exclusion.html
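
As a sketch, the robot exclusion protocol can be honored with Python's standard urllib.robotparser; the user-agent name and URLs below are placeholders, and the robots META tag still has to be checked separately after each page is fetched:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # server-level control file
    rp.read()

    # Ask before fetching; "MyCrawler" is a placeholder user-agent string.
    if rp.can_fetch("MyCrawler", "http://www.example.com/search"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")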


Example robots.txt Files

  • TAMU Library

    User-agent: *

    Disallow: /portal/site/chinaarchive/template.REGISTER/

    Disallow: /portal/site/Library/template.REGISTER/

  • Google

    User-agent: *

    Allow: /searchhistory/

    Disallow: /search

    Disallow: /groups

    Disallow: /images

    Disallow: /catalogs

    Disallow: /catalogues

  • New York Times

    User-agent: *

    Disallow: /pages/college/

    Allow: /pages/

    Allow: /2003/

    User-agent: Mediapartners-Google*

    Disallow:

  • CSDL

    User-agent: *

    Disallow: /FLORA/arch_priv/

    Disallow: /FLORA/private/


Crawling Goals

  • Crawling technique may depend on goal

  • Types of crawling goals:

    • Create large broad index

    • Creating a focused topic or domain-specific index

      • Target topic-relevant sites

      • Index preset terms

    • Creating a subset of content to model characteristics of (part of) the Web

      • Need to survey appropriately

      • Cannot use simple depth-first or breadth-first

    • Create up-to-date index

      • Use estimated change frequencies
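
One plausible way to use estimated change frequencies (an assumption, not a prescribed method) is to recrawl first wherever the most change has likely accumulated since the last visit:

    import time

    def recrawl_order(pages, now=None):
        """pages: dict of url -> {"last_crawl": epoch seconds, "changes_per_day": float}"""
        now = now if now is not None else time.time()
        def expected_changes(url):
            info = pages[url]
            days_since = (now - info["last_crawl"]) / 86400.0
            return info["changes_per_day"] * days_since
        # Recrawl the pages with the most expected accumulated change first.
        return sorted(pages, key=expected_changes, reverse=True)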


Crawling Challenges

  • Identifying and keeping track of links

    • Which to visit

    • Which have been visited

  • Issues

    • Relative vs. absolute link descriptions (see the normalization sketch after this list)

    • Alternate server names

    • Dynamically generated pages

    • Server-side scripting

    • Links buried in scripts
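
Several of these issues (relative links, alternate server names) reduce to normalizing URLs before checking whether they have been visited. A minimal sketch; the lower-casing and default-port rules are common conventions, not requirements stated here:

    from urllib.parse import urljoin, urlparse, urlunparse

    def normalize(base_url, href):
        absolute = urljoin(base_url, href)       # resolve relative links against the page URL
        parts = urlparse(absolute)
        host = parts.netloc.lower()              # host names are case-insensitive
        if parts.scheme == "http" and host.endswith(":80"):
            host = host[: -len(":80")]           # drop the default port
        return urlunparse((parts.scheme, host, parts.path or "/", "", parts.query, ""))

    # normalize("http://Example.COM:80/a/b.html", "../c.html") -> "http://example.com/c.html"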


Crawling Architecture

  • Crawler components (wired together in the sketch below)

    • Worker threads – attempt to retrieve data for a URL

    • DNS resolver – resolves domain names into IP addresses

    • Protocol modules – download content using the appropriate protocol

    • Link extractor – finds and normalizes URLs

    • URL filter – determines which URLs to add to the to-do list

    • URL to-do agent – keeps the list of URLs to visit
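
A sketch of how these components might be wired together, assuming Python threads, the requests library as the protocol module, and a naive regex link extractor; DNS resolution is left to the HTTP library, and the politeness controls from the next slide are omitted:

    import re
    import threading
    from queue import Queue
    from urllib.parse import urljoin

    import requests

    to_do = Queue()                                    # URL to-do agent

    def extract_links(url, html):                      # link extractor (normalizes to absolute URLs)
        return [urljoin(url, h) for h in re.findall(r'href="([^"#]+)"', html)]

    def url_filter(link):                              # URL filter: keep only http(s) links
        return link.startswith(("http://", "https://"))

    def worker():                                      # worker thread
        while True:
            url = to_do.get()
            try:
                html = requests.get(url, timeout=5).text   # protocol module (HTTP only here)
                for link in extract_links(url, html):
                    if url_filter(link):
                        to_do.put(link)
            except requests.RequestException:
                pass
            finally:
                to_do.task_done()

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    # Seed with to_do.put("http://example.com/") and wait with to_do.join().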


Crawling Issues

  • Avoid overloading servers

    • A brute-force approach can become a denial-of-service attack

    • Weak politeness guarantee: only one thread at a time is allowed to contact a given server

    • Stronger politeness guarantee: maintain a queue per server that feeds URLs into the to-do list based on priority and load factors (sketched below)

  • Broken links, time outs

    • How many times to try?

    • How long to wait?

    • How to recognize crawler traps? (server-side programs that generate “infinite” links)
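
A sketch of the stronger politeness guarantee with one queue per server; the fixed 2-second delay is an arbitrary placeholder, and the priority and load factors mentioned above are left out:

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class PoliteFrontier:
        def __init__(self, min_delay=2.0):
            self.queues = defaultdict(deque)        # one queue of URLs per server (host)
            self.next_allowed = defaultdict(float)  # earliest time each server may be contacted again
            self.min_delay = min_delay

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            now = time.time()
            for host, queue in self.queues.items():
                if queue and now >= self.next_allowed[host]:
                    self.next_allowed[host] = now + self.min_delay
                    return queue.popleft()
            return None    # nothing is polite to fetch right now; the caller should wait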


Web Tasks

  • Precision is the key

    • Goal: first 10-100 results should satisfy user

    • Requires ranking that matches user’s need

    • Recall is not important

      • Completeness of index is not important

      • Comprehensive crawling is not important


Browsing

  • Web directories

    • Human-organized taxonomies of Web sites

    • Small portion (less than 1%) of Web pages

      • Remember that recall (completeness) is not important

      • Directories point to logical web sites rather than pages

    • Directory search returns both categories and sites

    • People generally browse rather than search once they identify categories of interest


Metasearch

  • Search a number of search engines

  • Advantages

    • Do not need to build their own crawler or index

    • Cover more of the Web than any of their component search engines

  • Difficulties

    • Need to translate the query into each engine’s query language

    • Need to merge results into a meaningful ranking


Metasearch II

  • Merging Results

    • Voting scheme based on component search engines (sketched below)

      • No model of component ranking schemes needed

    • Model-based merging

      • Need understanding of relative ranking, potentially by query type

  • Why they are not used for the Web

    • Bias towards coverage (i.e. recall), which is not important for most Web queries

    • Merging results is largely ad-hoc, so search engines tend to do better

  • Big application: the Dark Web
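
A sketch of a voting-style merge, assuming each component engine returns a ranked list of URLs; the Borda-count scoring is one common choice, not necessarily what any given metasearch engine uses:

    from collections import defaultdict

    def merge(result_lists, top_k=10):
        """result_lists: one ranked list of URLs per component search engine."""
        votes = defaultdict(float)
        for results in result_lists:
            n = len(results)
            for rank, url in enumerate(results):
                votes[url] += n - rank          # earlier in a list means more votes
        return sorted(votes, key=votes.get, reverse=True)[:top_k]

    # merge([["a", "b", "c"], ["b", "a", "d"]]) ranks "a" and "b" ahead of "c" and "d".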


Using Structure in Search

  • Languages to search content and structure

    • Query languages over labeled graphs

      • PHIQL: Used in Microplis and PHIDIAS hypertext systems

      • Web-oriented: W3QL, WebSQL, WebLog, WQL


Using Structure in Search

  • Other use of structure in search

    • Relevant pages have neighbors that also tend to be relevant

    • Search approaches that collect (and filter) the neighbors of returned pages


Web Query Characteristics

  • Few terms and operators

    • Average 2.35 terms per query

      • 25% of queries have a single term

    • Average 0.41 operators per query

  • Queries get repeated

    • Average 3.97 instances of each query

    • This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)

  • Query sessions are short

    • Average 2.02 queries per session

    • Average of 1.39 pages of results examined

  • Data from a 1998 study

    • How different today?

