
Searching the Web



  1. Searching the Web

  2. The Web • Why is it important: • “Free” ubiquitous information resource • Broad coverage of topics and perspectives • Becoming dominant information collection • Growth and jobs • Web access methods: • Search (e.g. Google) • Directories (e.g. Yahoo!) • Other …

  3. Web Characteristics • Distributed data • 80 million web sites (hostnames responding) in April 2006 • 40 million active web sites (don’t redirect, …) • High volatility • Servers come and go … • Large volume • One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages) • “Dark Web” – content not indexed, not crawlable; estimated at 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)

  4. Web Characteristics • Unstructured data • Lots of duplicated content (30% estimate) • Semantic duplication much higher • Quality of data • No required editorial process • Many typos and misspellings (impacts IR) • Heterogeneous data • Different media • Different languages • These characteristics are not going to change

  5. Web Content Types • [Chart of Web content types] • Source: How Much Information? 2003

  6. Search Engine Architecture • “Lots and lots of computers” • [Diagram: Users query an Interface backed by the Query Engine; the Query Engine answers from an Index built by the Indexer, which is fed pages by the Crawler traversing the Web]

  7. Search Engine Architecture • [The same diagram, with the components annotated by the relevant textbook chapters: Chapter 3 (Evaluation), Chapter 10, Chapters 2, 4 & 5, Chapter 8, and Chapters 6 & 7]

  8. Search Engine Architecture • [Repeat of the chapter-annotated architecture diagram, without the Evaluation annotation]

  9. Hubs and Authorities • Hubs • Have lots of links to other pages • Authorities • Have lots of links that point to them • Can use feedback to rank hubs and authorities • Better hubs have links to good authorities • Better authorities have links from good hubs
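A minimal sketch of this feedback computation (a HITS-style iteration), assuming the link graph is available as a dictionary mapping each page to the pages it links to; the names, iteration count, and graph representation are illustrative assumptions, not part of the original slides.

def hubs_and_authorities(graph, iterations=20):
    """Iteratively score hubs and authorities over a small link graph."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority grows with the hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # A page's hub score grows with the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so scores remain comparable from one iteration to the next.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Example: "a" links to "b" and "c"; "b" and "d" both link to "c".
# hubs, auths = hubs_and_authorities({"a": ["b", "c"], "b": ["c"], "d": ["c"]})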

  10. Crawling the Web • Creating a Web Crawler (Web Spider) • Simplest technique: • Start with a set of URLs • Extract the URLs pointed to by the original set • Continue using either breadth-first or depth-first traversal • Works for one crawler but is hard to coordinate across many crawlers • Partition the Web by domain name, IP address, or another scheme • Each crawler has its own partition but shares a to-do list
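A minimal sketch of the simplest technique described above, assuming pages are plain HTML fetched over HTTP and that a naive regular expression is good enough to extract links; the seed URLs, page limit, and timeout are illustrative assumptions.

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seeds, max_pages=100):
    to_do = deque(seeds)              # shared to-do list of URLs still to visit
    visited = set()
    while to_do and len(visited) < max_pages:
        url = to_do.popleft()         # popleft() = breadth-first; pop() would be depth-first
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                  # unreachable page: skip and move on
        # Extract the URLs pointed to by this page and queue them for later visits.
        for href in re.findall(r'href="([^"]+)"', html):
            to_do.append(urljoin(url, href))
    return visited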

  11. Crawling the Web • Need to recrawl • Indexed content is always out of date • Sites come and go, and sometimes disappear for periods of time only to reappear later • Order of URLs traversed makes a difference • Breadth-first matches the hierarchic organization of content • Depth-first gets to deeper content faster • Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)

  12. Server and Author Control of Crawling • Avoid crawling sites that do not want to be crawled • Legal issue • Robot exclusion protocol (server-level control) • A file that indicates which portions of a web site should not be visited by crawlers • http://.../robots.txt • Robot META tag (author-level control) • Used to indicate whether a file (page) should be indexed or analyzed for links • Few crawlers implement this • <meta name="robots" content="noindex, nofollow"> • http://www.robotstxt.org/wc/exclusion.html
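A small sketch of checking the robot exclusion protocol before fetching a page, using Python's standard urllib.robotparser; the host and path are illustrative, and the robots META tag would still have to be checked after the page is downloaded.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://www.example.com/robots.txt")      # illustrative host
rp.read()                                                      # fetch and parse the exclusion file
allowed = rp.can_fetch("*", "http://www.example.com/search")   # check before visiting this path
print("May crawl /search:", allowed)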

  13. Example robots.txt Files
  • TAMU Library
      User-agent: *
      Disallow: /portal/site/chinaarchive/template.REGISTER/
      Disallow: /portal/site/Library/template.REGISTER/
      …
  • Google
      User-agent: *
      Allow: /searchhistory/
      Disallow: /search
      Disallow: /groups
      Disallow: /images
      Disallow: /catalogs
      Disallow: /catalogues
      …
  • New York Times
      User-agent: *
      Disallow: /pages/college/
      …
      Allow: /pages/
      Allow: /2003/
      …
      User-agent: Mediapartners-Google*
      Disallow:
  • CSDL
      User-agent: *
      Disallow: /FLORA/arch_priv/
      Disallow: /FLORA/private/

  14. Crawling Goals • Crawling technique may depend on the goal • Types of crawling goals: • Create a large, broad index • Create a focused topic- or domain-specific index • Target topic-relevant sites • Index preset terms • Create a subset of content to model characteristics of (part of) the Web • Need to sample the Web appropriately • Cannot use simple depth-first or breadth-first traversal • Create an up-to-date index • Use estimated change frequencies

  15. Crawling Challenges • Identifying and keeping track of links • Which to visit • Which have been visited • Issues • Relative vs. absolute link descriptions • Alternate server names • Dynamically generated pages • Server-side scripting • Links buried in scripts
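A sketch of one way to deal with relative link descriptions and alternate server names: resolve every link against its base URL and reduce it to a canonical form before checking the visited set. The www-stripping rule and the dropped fragment are illustrative assumptions, not a complete solution.

from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base_url, href):
    absolute = urljoin(base_url, href)       # resolve relative links against the page they appear on
    parts = urlsplit(absolute)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):              # treat www.example.com and example.com as the same server
        host = host[4:]
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))  # drop the fragment

# normalize("http://example.com/a/b.html", "../c.html") -> "http://example.com/c.html"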

  16. Crawling Architecture • Crawler components: • Worker threads – attempt to retrieve data for a URL • DNS resolver – resolves domain names into IP addresses • Protocol modules – download content using the appropriate protocol • Link extractor – finds and normalizes URLs • URL filter – determines which URLs to add to the to-do list • URL to-do agent – keeps the list of URLs to visit
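A rough sketch mapping these components onto functions, with DNS resolution left to the HTTP library and the filter rule reduced to an illustrative placeholder; all helper names are assumptions made for this sketch.

import re
from urllib.parse import urljoin
from urllib.request import urlopen

def protocol_module(url):
    # Downloads content in the appropriate protocol (HTTP only in this sketch).
    return urlopen(url, timeout=5).read().decode("utf-8", "ignore")

def link_extractor(base_url, html):
    # Finds and normalizes the URLs mentioned in a page.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def url_filter(url, visited):
    # Determines which URLs should be added to the to-do list (illustrative rule).
    return url.startswith("http") and url not in visited

def worker_step(url, to_do, visited):
    # One unit of work a worker thread would perform for a single URL.
    visited.add(url)
    for link in link_extractor(url, protocol_module(url)):
        if url_filter(link, visited):
            to_do.append(link)        # the URL to-do agent keeps this list of URLs to visit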

  17. Crawling Issues • Avoid overloading servers • A brute-force approach can become a denial-of-service attack • Weak politeness guarantee: only one thread allowed to contact a given server • Stronger politeness guarantee: maintain per-server queues that feed URLs into the to-do list based on priority and load factors • Broken links, time-outs • How many times to retry? • How long to wait? • How to recognize crawler traps? (server-side programs that generate “infinite” links)
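A sketch of a weak politeness guarantee expressed as a minimum delay between requests to the same server; the 2-second gap and the injected fetch function are illustrative assumptions.

import time
from urllib.parse import urlsplit

last_hit = {}                                    # host -> time of the most recent request

def polite_fetch(url, fetch, min_delay=2.0):
    host = urlsplit(url).hostname
    wait = min_delay - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)                         # never hit the same server more than once per min_delay
    last_hit[host] = time.time()
    return fetch(url)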

  18. Web Tasks • Precision is the key • Goal: first 10-100 results should satisfy user • Requires ranking that matches user’s need • Recall is not important • Completeness of index is not important • Comprehensive crawling is not important

  19. Browsing • Web directories • Human-organized taxonomies of Web sites • Cover a small portion (less than 1%) of Web pages • Remember that recall (completeness) is not important • Directories point to logical web sites rather than individual pages • Directory search returns both categories and sites • People generally browse rather than search once they identify categories of interest

  20. Metasearch • Search a number of search engines at once • Advantages • No need to build their own crawler and index • Cover more of the Web than any of their component search engines • Difficulties • Need to translate the query into each engine’s query language • Need to merge results into a meaningful ranking

  21. Metasearch II • Merging results • Voting scheme based on the component search engines’ rankings • No model of component ranking schemes needed • Model-based merging • Needs an understanding of relative rankings, potentially by query type • Why metasearch engines are not used for the Web • Bias towards coverage (i.e. recall), which is not important for most Web queries • Merging results is largely ad hoc, so individual search engines tend to do better • Big application: the Dark Web
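A minimal sketch of a voting-style merge over the ranked lists returned by the component engines (a Borda-like count), which needs no model of how each engine ranks; the list depth and example results are illustrative.

def merge_results(ranked_lists, depth=10):
    scores = {}
    for results in ranked_lists:                 # one ranked list per component search engine
        for rank, url in enumerate(results[:depth]):
            # A result earns more votes the higher each engine ranks it.
            scores[url] = scores.get(url, 0) + (depth - rank)
    return sorted(scores, key=scores.get, reverse=True)

# merge_results([["a", "b", "c"], ["b", "a", "d"]]) -> ["a", "b", "c", "d"]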

  22. Using Structure in Search • Languages to search content and structure • Query languages over labeled graphs • PHIQL: Used in Microplis and PHIDIAS hypertext systems • Web-oriented: W3QL, WebSQL, WebLog, WQL

  23. Using Structure in Search • Other uses of structure in search • Relevant pages tend to have neighbors that are also relevant • Search approaches that collect (and filter) the neighbors of returned pages

  24. Web Query Characteristics • Few terms and operators • Average of 2.35 terms per query • 25% of queries have a single term • Average of 0.41 operators per query • Queries get repeated • Average of 3.97 instances of each query • This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”) • Query sessions are short • Average of 2.02 queries per session • Average of 1.39 pages of results examined • Data from a 1998 study • How different is it today?
