Searching the Web

The Web

  • Why is it important:

    • “Free” ubiquitous information resource

    • Broad coverage of topics and perspectives

    • Becoming dominant information collection

    • Growth and jobs

  • Web access methods

    Search (e.g. Google)

    Directories (e.g. Yahoo!)

    Other …

Web Characteristics

  • Distributed data

    • 80 million web sites (hostnames responding) in April 2006

    • 40 million active web sites (don’t redirect, …)

  • High volatility

    • Servers come and go …

  • Large volume

    • One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)

    • “Dark Web” – content not indexed, not crawlable; estimated to be 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)

Web Characteristics

  • Unstructured data

    • Lots of duplicated content (30% estimate)

    • Semantic duplication much higher

  • Quality of data

    • No required editorial process

    • Many typos and misspellings (impacts IR)

  • Heterogeneous data

    • Different media

    • Different languages

  • These characteristics are not going to change

Web Content Types

[Chart omitted: breakdown of Web content by media type. Source: How Much Information? 2003]

Search Engine Architecture

[Diagram omitted: a query engine backed by "lots and lots of computers"]

Search Engine Architecture

[Diagram omitted: the query engine and other architecture components annotated with the textbook chapters that cover them – Chapter 3, Chapter 10, Chapters 2, 4, & 5, Chapter 8, and Chapters 6 & 7]

Hubs and Authorities

  • Hubs

    • Have lots of links to other pages

  • Authorities

    • Have lots of links that point to them

  • Can use feedback to rank hubs and authorities

    • Better hubs have links to good authorities

    • Better authorities have links from good hubs
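
  • A minimal sketch of this hub/authority feedback (the idea behind the HITS algorithm); the tiny link graph and the fixed iteration count are illustrative assumptions, not from the slides:

    # Iteratively refine hub and authority scores over a small link graph.
    # graph maps each page to the pages it links to.
    graph = {
        "hub1": ["auth1", "auth2"],
        "hub2": ["auth1", "auth2", "auth3"],
        "auth1": [],
        "auth2": ["auth1"],
        "auth3": [],
    }

    hubs = {p: 1.0 for p in graph}
    auths = {p: 1.0 for p in graph}

    for _ in range(20):  # a fixed number of iterations keeps the sketch simple
        # Better authorities have links from good hubs.
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
        # Better hubs have links to good authorities.
        hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
        # Normalize so the scores do not grow without bound.
        a_norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}

    print(sorted(auths.items(), key=lambda kv: -kv[1]))  # highest-scoring authorities first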

Crawling the Web

  • Creating a Web Crawler (Web Spider)

  • Simplest Technique (sketched after this list)

    • Start with a set of URLs

    • Extract URLs pointed to in original set

    • Continue using either breadth-first or depth-first

  • Works for one crawler but hard to coordinate for many crawlers

    • Partition the Web by domain name, IP address, or another technique

    • Each crawler has its own set but shares a to-do list
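
  • A minimal sketch of the simplest technique above, using only the Python standard library; the page limit and regex-based link extraction are illustrative assumptions:

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=50):
        """Start with a set of URLs and follow extracted links breadth-first (sketch)."""
        to_do = deque(seed_urls)      # the shared to-do list
        visited = set()
        while to_do and len(visited) < max_pages:
            url = to_do.popleft()     # popleft() = breadth-first; pop() would be depth-first
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue              # broken link or time-out: skip and move on
            # Extract URLs pointed to by this page and add them to the to-do list.
            for href in re.findall(r'href="([^"]+)"', html):
                to_do.append(urljoin(url, href))
        return visited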

Crawling the Web

  • Need to recrawl

    • Indexed content is always out of date

    • Sites come and go, and some disappear for periods of time only to reappear

  • Order of URLs traversed makes a difference

    • Breadth first matches hierarchic organization of content

    • Depth first gets to deeper content faster

    • Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)

Server and Author Control of Crawling

  • Avoid crawling sites that do not want to be crawled

    • Legal issue

    • Robot exclusion protocol (Server level control)

      • A file that indicates which portions of a web site should not be visited by crawlers (a usage sketch follows this list)

      • http://.../robots.txt

    • Robot META tag (Author level control)

      • Used to indicate whether a file (page) should be indexed or analyzed for links

      • Few crawlers implement this

      • <meta name="robots" content="noindex, nofollow">
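
  • A sketch of honoring the robot exclusion protocol with Python's standard urllib.robotparser; the example URLs are placeholders:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")   # the server-level control file
    rp.read()

    # Ask before fetching: may this user agent visit the page?
    if rp.can_fetch("*", "http://example.com/search"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")

  • Note this handles only the server-level robots.txt; the author-level robots META tag has to be checked in the page itself after downloading it.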


Example robots.txt Files

  • TAMU Library

    User-agent: *

    Disallow: /portal/site/chinaarchive/template.REGISTER/

    Disallow: /portal/site/Library/template.REGISTER/

  • Google

    User-agent: *

    Allow: /searchhistory/

    Disallow: /search

    Disallow: /groups

    Disallow: /images

    Disallow: /catalogs

    Disallow: /catalogues

  • New York Times

    User-agent: *

    Disallow: /pages/college/

    Allow: /pages/

    Allow: /2003/

    User-agent: Mediapartners-Google*


  • CSDL

    User-agent: *

    Disallow: /FLORA/arch_priv/

    Disallow: /FLORA/private/

Crawling Goals

  • Crawling technique may depend on goal

  • Types of crawling goals:

    • Create large broad index

    • Creating a focused topic or domain-specific index

      • Target topic-relevant sites

      • Index preset terms

    • Creating a subset of content to model characteristics of (part of) the Web

      • Need to survey appropriately

      • Cannot use simple depth-first or breadth-first

    • Create up-to-date index

      • Use estimated change frequencies

Crawling Challenges

  • Identifying and keeping track of links

    • Which to visit

    • Which have been visited

  • Issues

    • Relative vs. absolute link descriptions (normalization sketched after this list)

    • Alternate server names

    • Dynamically generated pages

    • Server-side scripting

    • Links buried in scripts
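
  • A sketch of how a link extractor might turn relative links into absolute URLs and fold trivially different forms of the same address together; the specific normalization rules (lower-casing, dropping "www.", dropping fragments) are illustrative assumptions:

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def normalize(base_url, href):
        """Resolve a relative link against its page and canonicalize it (sketch)."""
        absolute = urljoin(base_url, href)          # relative -> absolute
        scheme, netloc, path, query, _fragment = urlsplit(absolute)
        netloc = netloc.lower()                     # host names are case-insensitive
        if netloc.startswith("www."):
            netloc = netloc[4:]                     # fold one common alternate server name
        if not path:
            path = "/"
        return urlunsplit((scheme.lower(), netloc, path, query, ""))  # drop the fragment

    # The same page reached two different ways maps to one to-do entry.
    print(normalize("http://Example.com/a/", "../b.html"))
    print(normalize("http://www.example.com", "/b.html"))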

Crawling Architecture

  • Crawler components (a skeleton sketch follows this list)

    Worker threads – attempt to retrieve data for a URL

    DNS resolver – resolves domain names into IP addresses

    Protocol modules – download content using the appropriate protocol

    Link extractor – finds and normalizes URLs

    URL filter – determines which URLs to add to the to-do list

    URL to-do agent – keeps the list of URLs to visit
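
  • A skeleton that maps the components above onto functions, just to show how they hand work to each other; the organization and the regex link finder are illustrative assumptions, not a real crawler's design:

    import re
    import socket
    from urllib.parse import urljoin, urlsplit

    def dns_resolver(url):
        # Resolve the URL's domain name into an IP address (normally cached).
        return socket.gethostbyname(urlsplit(url).netloc)

    def link_extractor(base_url, html):
        # Find URLs in the page and normalize relative links to absolute ones.
        return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

    def url_filter(urls, visited):
        # Decide which URLs are worth adding to the to-do list.
        return [u for u in urls if u.startswith("http") and u not in visited]

    def worker(url, to_do, visited, fetch):
        # Worker thread body: retrieve data for one URL via the protocol module
        # (`fetch`), then feed new links back to the URL to-do agent.
        dns_resolver(url)
        html = fetch(url)                 # protocol module: HTTP, HTTPS, ...
        visited.add(url)
        to_do.extend(url_filter(link_extractor(url, html), visited))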

Crawling Issues

  • Avoid overloading servers

    • A brute-force approach can become a denial-of-service attack

    • Weak politeness guarantee: only one thread is allowed to contact a given server at a time

    • Stronger politeness guarantee: maintain a queue for each server and release URLs into the to-do list based on priority and load factors (sketched after this list)

  • Broken links, time outs

    • How many times to try?

    • How long to wait?

    • How to recognize crawler traps? (server-side programs that generate “infinite” links)
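
  • A sketch of the per-server queue idea above: a URL is released only when its server has not been contacted recently. The delay value and data structures are illustrative assumptions:

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    MIN_DELAY = 2.0                       # seconds between hits to the same server (assumed)

    per_host_queue = defaultdict(deque)   # one queue of pending URLs per server
    last_hit = {}                         # when each server was last contacted

    def enqueue(url):
        per_host_queue[urlsplit(url).netloc].append(url)

    def next_url():
        """Return a URL whose server is not being overloaded, or None if all must wait."""
        now = time.monotonic()
        for host, queue in per_host_queue.items():
            if queue and now - last_hit.get(host, float("-inf")) >= MIN_DELAY:
                last_hit[host] = now
                return queue.popleft()
        return None                       # every server with pending URLs is still cooling down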

Web Tasks

  • Precision is the key

    • Goal: first 10-100 results should satisfy user

    • Requires ranking that matches user’s need

    • Recall is not important

      • Completeness of index is not important

      • Comprehensive crawling is not important
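
  • For reference (not from the slides), the two measures being contrasted can be written as:

    \text{precision@}k = \frac{|\,\text{relevant} \cap \text{top-}k\text{ results}\,|}{k}
    \qquad
    \text{recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|}

  • The goals above say to maximize the first of these for k around 10 to 100 and not to worry about the second.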


Web Directories

  • Human-organized taxonomies of Web sites

  • Small portion (less than 1%) of Web pages

    • Remember that recall (completeness) is not important

    • Directories point to logical web sites rather than pages

  • Directory search returns both categories and sites

  • People generally browse rather than search once they identify categories of interest


Metasearch

  • Search a number of search engines

  • Advantages

    • Do not build their own crawler and index

    • Cover more of the Web than any of their component search engines

  • Difficulties

    • Need to translate the query into each engine's query language

    • Need to merge results into a meaningful ranking

Metasearch II

  • Merging Results (a small sketch follows this list)

    • Voting scheme based on component search engines

      • No model of component ranking schemes needed

    • Model-based merging

      • Need understanding of relative ranking, potentially by query type

  • Why they are not used for the Web

    • Bias towards coverage (i.e. recall), which is not important for most Web queries

    • Merging results is largely ad hoc, so individual search engines tend to do better

  • Big application: the Dark Web
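
  • A sketch of the voting-style merging mentioned above, using a simple Borda count over the component engines' result lists (no model of the component ranking schemes is needed); the result lists are made up for illustration:

    from collections import defaultdict

    def borda_merge(result_lists):
        """Merge ranked result lists from several engines by a simple vote (sketch)."""
        scores = defaultdict(float)
        for results in result_lists:
            n = len(results)
            for rank, url in enumerate(results):
                scores[url] += n - rank        # higher positions earn more votes
        return sorted(scores, key=scores.get, reverse=True)

    engine_a = ["u1", "u2", "u3"]
    engine_b = ["u2", "u4", "u1"]
    print(borda_merge([engine_a, engine_b]))   # u2 and u1 rank highest: both engines return them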

Using Structure in Search

  • Languages to search content and structure

    • Query languages over labeled graphs

      • PHIQL: Used in Microplis and PHIDIAS hypertext systems

      • Web-oriented: W3QL, WebSQL, WebLog, WQL

Using Structure in Search

  • Other use of structure in search

    • Relevant pages have neighbors that also tend to be relevant

    • Search approaches that collect (and filter) neighbors to returned pages

Web Query Characteristics

  • Few terms and operators

    • Average 2.35 terms per query

      • 25% of queries have a single term

    • Average 0.41 operators per query

  • Queries get repeated

    • Average 3.97 instances of each query

    • This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)

  • Query sessions are short

    • Average 2.02 queries per session

    • Average of 1.39 pages of results examined

  • Data from a 1998 study

    • How different today?