Searching the Web

The Web

  • Why is it important:

    • “Free” ubiquitous information resource

    • Broad coverage of topics and perspectives

    • Becoming dominant information collection

    • Growth and jobs

  • Web access methods

    • Search (e.g. Google)

    • Directories (e.g. Yahoo!)

    • Other …


Web Characteristics

  • Distributed data

    • 80 million web sites (hostnames responding) in April 2006

    • 40 million active web sites (don’t redirect, …)

  • High volatility

    • Servers come and go …

  • Large volume

    • One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)

    • “Dark Web” – content not indexed and not crawlable; estimated to be 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)


Web Characteristics

  • Unstructured data

    • Lots of duplicated content (30% estimate)

    • Semantic duplication much higher

  • Quality of data

    • No required editorial process

    • Many typos and misspellings (impacts IR)

  • Heterogeneous data

    • Different media

    • Different languages

  • These characteristics are not going to change


Web Content Types

[Chart of Web content types. Source: How Much Information? 2003]


Search Engine Architecture

[Diagram: Users interact with an Interface backed by a Query Engine, which answers queries from an Index; a Crawler walks the Web and feeds an Indexer that builds the Index, all running on “lots and lots of computers”.]


Search Engine Architecture

[The same diagram annotated with the textbook chapters that cover each part: Evaluation (Chapter 3), Chapter 10, Chapters 2, 4, & 5, Chapter 8, and Chapters 6 & 7.]




Hubs and Authorities

  • Hubs

    • Have lots of links to other pages

  • Authorities

    • Have lots of links that point to them

  • Can use feedback to rank hubs and authorities (sketched below)

    • Better hubs have links to good authorities

    • Better authorities have links from good hubs
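
The feedback idea above is essentially the HITS algorithm. Below is a minimal Python sketch, assuming the link graph is already available as a dictionary mapping each page to the list of pages it links to (a toy format chosen for illustration):

    # Iterative hub/authority scoring over a toy link graph.
    def hits(links, iterations=20):
        pages = set(links) | {p for targets in links.values() for p in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Better authorities have links from good hubs.
            auth = {p: 0.0 for p in pages}
            for src, targets in links.items():
                for dst in targets:
                    auth[dst] += hub[src]
            # Better hubs have links to good authorities.
            hub = {p: sum(auth[dst] for dst in links.get(p, ())) for p in pages}
            # Normalize so the scores stay bounded.
            auth_len = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            hub_len = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / auth_len for p, v in auth.items()}
            hub = {p: v / hub_len for p, v in hub.items()}
        return hub, auth

    # Example: "a" links to "b" and "c"; "b" links to "c", so "c" scores highest as an authority.
    hub, auth = hits({"a": ["b", "c"], "b": ["c"], "c": []})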


Crawling the Web

  • Creating a Web Crawler (Web Spider)

  • Simplest Technique (sketched after this list)

    • Start with a set of URLs

    • Extract URLs pointed to in original set

    • Continue using either breadth-first or depth-first

  • Works for one crawler but hard to coordinate for many crawlers

    • Partition Web by domain name, ip address, or other technique

    • Each crawler has its own set but shares a to-do list
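
A minimal Python sketch of the simplest technique above; the requests library, the page limit, and the crude href regex are illustrative assumptions, and a real crawler would add robots.txt checks, politeness, and better error handling:

    import re
    from collections import deque
    from urllib.parse import urljoin

    import requests

    def crawl(seeds, max_pages=100):
        to_do = deque(seeds)              # the shared to-do list of URLs
        visited = set()
        while to_do and len(visited) < max_pages:
            url = to_do.popleft()         # popleft() gives breadth-first; pop() would give depth-first
            if url in visited:
                continue
            visited.add(url)
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            # Extract URLs pointed to by this page and add them to the to-do list.
            for href in re.findall(r'href="([^"#]+)"', html):
                to_do.append(urljoin(url, href))
        return visited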


Crawling the Web

  • Need to recrawl

    • Indexed content is always out of date

    • Sites come and go, and some disappear for a time only to reappear later

  • Order of URLs traversed makes a difference

    • Breadth-first matches the hierarchic organization of content

    • Depth-first gets to deeper content faster

    • Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)


Server and Author Control of Crawling

  • Avoid crawling sites that do not want to be crawled

    • Legal issue

    • Robot exclusion protocol (server-level control; see the sketch after this list)

      • a file that indicates which portions of a web site should not be visited by crawlers

      • http://.../robots.txt

    • Robot META tag (Author level control)

      • used to indicate if a file (page) should be indexed or analyzed for links

      • few crawlers implement this

      • <meta name="robots" content="noindex, nofollow">

      • http://www.robotstxt.org/wc/exclusion.html
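
As a sketch, the robot exclusion protocol can be honored with Python's standard urllib.robotparser; the user-agent name and URLs below are placeholders, and the robots META tag still has to be checked separately after each page is fetched:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # server-level control file
    rp.read()

    # Ask before fetching; "MyCrawler" is a placeholder user-agent string.
    if rp.can_fetch("MyCrawler", "http://www.example.com/search"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")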


Example robots.txt Files

  • TAMU Library

    User-agent: *

    Disallow: /portal/site/chinaarchive/template.REGISTER/

    Disallow: /portal/site/Library/template.REGISTER/

  • Google

    User-agent: *

    Allow: /searchhistory/

    Disallow: /search

    Disallow: /groups

    Disallow: /images

    Disallow: /catalogs

    Disallow: /catalogues

  • New York Times

    User-agent: *

    Disallow: /pages/college/

    Allow: /pages/

    Allow: /2003/

    User-agent: Mediapartners-Google*

    Disallow:

  • CSDL

    User-agent: *

    Disallow: /FLORA/arch_priv/

    Disallow: /FLORA/private/


Crawling Goals

  • Crawling technique may depend on goal

  • Types of crawling goals:

    • Create large broad index

    • Creating a focused topic or domain-specific index

      • Target topic-relevant sites

      • Index preset terms

    • Creating a subset of content to model characteristics of (part of) the Web

      • Need to survey appropriately

      • Cannot use simple depth-first or breadth-first

    • Create up-to-date index

      • Use estimated change frequencies
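
One plausible way to use estimated change frequencies (an assumption, not a prescribed method) is to recrawl first wherever the most change has likely accumulated since the last visit:

    import time

    def recrawl_order(pages, now=None):
        """pages: dict of url -> {"last_crawl": epoch seconds, "changes_per_day": float}"""
        now = now if now is not None else time.time()
        def expected_changes(url):
            info = pages[url]
            days_since = (now - info["last_crawl"]) / 86400.0
            return info["changes_per_day"] * days_since
        # Recrawl the pages with the most expected accumulated change first.
        return sorted(pages, key=expected_changes, reverse=True)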


Crawling Challenges

  • Identifying and keeping track of links

    • Which to visit

    • Which have been visited

  • Issues

    • Relative vs. absolute link descriptions (see the normalization sketch after this list)

    • Alternate server names

    • Dynamically generated pages

    • Server-side scripting

    • Links buried in scripts
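
Several of these issues (relative links, alternate server names) reduce to normalizing URLs before checking whether they have been visited. A minimal sketch; the lower-casing and default-port rules are common conventions, not requirements stated here:

    from urllib.parse import urljoin, urlparse, urlunparse

    def normalize(base_url, href):
        absolute = urljoin(base_url, href)       # resolve relative links against the page URL
        parts = urlparse(absolute)
        host = parts.netloc.lower()              # host names are case-insensitive
        if parts.scheme == "http" and host.endswith(":80"):
            host = host[: -len(":80")]           # drop the default port
        return urlunparse((parts.scheme, host, parts.path or "/", "", parts.query, ""))

    # normalize("http://Example.COM:80/a/b.html", "../c.html") -> "http://example.com/c.html"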


Crawling Architecture

  • Crawler components (wired together in the sketch below)

    • Worker threads – attempt to retrieve data for a URL

    • DNS resolver – resolves domain names into IP addresses

    • Protocol modules – download content using the appropriate protocol

    • Link extractor – finds and normalizes URLs

    • URL filter – determines which URLs to add to the to-do list

    • URL to-do agent – keeps the list of URLs to visit
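
A sketch of how these components might be wired together, assuming Python threads, the requests library as the protocol module, and a naive regex link extractor; DNS resolution is left to the HTTP library, and the politeness controls from the next slide are omitted:

    import re
    import threading
    from queue import Queue
    from urllib.parse import urljoin

    import requests

    to_do = Queue()                                    # URL to-do agent

    def extract_links(url, html):                      # link extractor (normalizes to absolute URLs)
        return [urljoin(url, h) for h in re.findall(r'href="([^"#]+)"', html)]

    def url_filter(link):                              # URL filter: keep only http(s) links
        return link.startswith(("http://", "https://"))

    def worker():                                      # worker thread
        while True:
            url = to_do.get()
            try:
                html = requests.get(url, timeout=5).text   # protocol module (HTTP only here)
                for link in extract_links(url, html):
                    if url_filter(link):
                        to_do.put(link)
            except requests.RequestException:
                pass
            finally:
                to_do.task_done()

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    # Seed with to_do.put("http://example.com/") and wait with to_do.join().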


Crawling Issues

  • Avoid overloading servers

    • A brute-force approach can become a denial-of-service attack

    • Weak politeness guarantee: only one thread at a time is allowed to contact a given server

    • Stronger politeness guarantee: maintain a queue per server that feeds URLs into the to-do list based on priority and load factors (sketched below)

  • Broken links, time outs

    • How many times to try?

    • How long to wait?

    • How to recognize crawler traps? (server-side programs that generate “infinite” links)
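
A sketch of the stronger politeness guarantee with one queue per server; the fixed 2-second delay is an arbitrary placeholder, and the priority and load factors mentioned above are left out:

    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    class PoliteFrontier:
        def __init__(self, min_delay=2.0):
            self.queues = defaultdict(deque)        # one queue of URLs per server (host)
            self.next_allowed = defaultdict(float)  # earliest time each server may be contacted again
            self.min_delay = min_delay

        def add(self, url):
            self.queues[urlparse(url).netloc].append(url)

        def next_url(self):
            now = time.time()
            for host, queue in self.queues.items():
                if queue and now >= self.next_allowed[host]:
                    self.next_allowed[host] = now + self.min_delay
                    return queue.popleft()
            return None    # nothing is polite to fetch right now; the caller should wait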


Web Tasks

  • Precision is the key

    • Goal: first 10-100 results should satisfy user

    • Requires ranking that matches user’s need

    • Recall is not important

      • Completeness of index is not important

      • Comprehensive crawling is not important


Browsing

  • Web directories

    • Human-organized taxonomies of Web sites

    • Small portion (less than 1%) of Web pages

      • Remember that recall (completeness) is not important

      • Directories point to logical web sites rather than pages

    • Directory search returns both categories and sites

    • People generally browse rather than search once they identify categories of interest


Metasearch

  • Search a number of search engines

  • Advantages

    • Do not need to build their own crawler or index

    • Cover more of the Web than any of their component search engines

  • Difficulties

    • Need to translate the query into each engine’s query language

    • Need to merge results into a meaningful ranking


Metasearch II

  • Merging Results

    • Voting scheme based on component search engines (sketched below)

      • No model of component ranking schemes needed

    • Model-based merging

      • Need understanding of relative ranking, potentially by query type

  • Why they are not used for the Web

    • Bias towards coverage (i.e. recall), which is not important for most Web queries

    • Merging results is largely ad-hoc, so search engines tend to do better

  • Big application: the Dark Web
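
A sketch of a voting-style merge, assuming each component engine returns a ranked list of URLs; the Borda-count scoring is one common choice, not necessarily what any given metasearch engine uses:

    from collections import defaultdict

    def merge(result_lists, top_k=10):
        """result_lists: one ranked list of URLs per component search engine."""
        votes = defaultdict(float)
        for results in result_lists:
            n = len(results)
            for rank, url in enumerate(results):
                votes[url] += n - rank          # earlier in a list means more votes
        return sorted(votes, key=votes.get, reverse=True)[:top_k]

    # merge([["a", "b", "c"], ["b", "a", "d"]]) ranks "a" and "b" ahead of "c" and "d".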


Using Structure in Search

  • Languages to search content and structure

    • Query languages over labeled graphs

      • PHIQL: Used in Microplis and PHIDIAS hypertext systems

      • Web-oriented: W3QL, WebSQL, WebLog, WQL


Using Structure in Search

  • Other use of structure in search

    • Relevant pages have neighbors that also tend to be relevant

    • Search approaches that collect (and filter) the neighbors of returned pages


Web Query Characteristics

  • Few terms and operators

    • Average 2.35 terms per query

      • 25% of queries have a single term

    • Average 0.41 operators per query

  • Queries get repeated

    • Average 3.97 instances of each query

    • This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)

  • Query sessions are short

    • Average 2.02 queries per session

    • Average of 1.39 pages of results examined

  • Data from a 1998 study

    • How different today?

