web search engines n.
Skip this Video
Loading SlideShow in 5 Seconds..
Web search engines PowerPoint Presentation
Download Presentation
Web search engines

Loading in 2 Seconds...

play fullscreen
1 / 58

Web search engines - PowerPoint PPT Presentation

  • Uploaded on

Web search engines. Paolo Ferragina Dipartimento di Informatica Università di Pisa. The Web: Size: more than tens of billions of pages Language and encodings: hundreds… Distributed authorship: SPAM, format-less,… Dynamic: in one year 35% survive, 20% untouched The User:

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'Web search engines' - dima

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
web search engines

Web search engines

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

two main difficulties
The Web:

Size: more than tens of billions of pages

Language and encodings:hundreds…

Distributed authorship: SPAM, format-less,…

Dynamic: in one year 35% survive, 20% untouched

The User:

Query composition: short (2.5 terms avg) and imprecise

Query results: 85% users look at just one result-page

Several needs: Informational, Navigational, Transactional

Two main difficulties

Extracting “significant data” is difficult !!

Matching “user needs” is difficult !!

evolution of search engines
Evolution of Search Engines
  • First generation-- use only on-page, web-text data
    • Word frequency and language
  • Second generation-- use off-page, web-graph data
    • Link (or connectivity) analysis
    • Anchor-text (How people refer to a page)
  • Third generation-- answer “the need behind the query”
    • Focus on “user need”, rather than on query
    • Integrate multiple data-sources
    • Click-through data

1995-1997 AltaVista, Excite, Lycos, etc

1998: Google

Google, Yahoo,

MSN, ASK,………

Fourth generation  Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]





III° generation

II° generation

IV° generation

quality of a search engine

Quality of a search engine

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 8

is it good
Is it good ?
  • How fast does it index
    • Number of documents/hour
    • (Average document size)
  • How fast does it search
    • Latency as a function of index size
  • Expressiveness of the query language
measures for a search engine
Measures for a search engine
  • All of the preceding criteria are measurable
  • The key measure: user happiness

…useless answers won’t make a user happy

happiness elusive to measure
Happiness: elusive to measure
  • Commonest approach is given by the relevance of search results
    • How do we measure it ?
  • Requires 3 elements:
    • A benchmark document collection
    • A benchmark suite of queries
    • A binary assessment of either Relevant or Irrelevant for each query-doc pair
evaluating an ir system
Evaluating an IR system
  • Standard benchmarks
    • TREC: National Institute of Standards and Testing (NIST) has run large IR testbed for many years
    • Other doc collections: marked by human experts, for each query and for each doc, Relevant or Irrelevant
  • On the Web everything is more complicated since we cannot mark the entire corpus !!
precision vs recall
Precision vs. Recall
  • Precision: % docs retrieved that are relevant [issue “junk” found]
  • Recall: % docs relevant that are retrieved [issue “info” found]




how to compute them
How to compute them
  • Precision: fraction of retrieved docs that are relevant
  • Recall: fraction of relevant docs that are retrieved
  • Precision P = tp/(tp + fp)
  • Recall R = tp/(tp + fn)
some considerations
Some considerations
  • Can get high recall (but low precision) by retrieving all docs for all queries!
  • Recall is a non-decreasing function of the number of docs retrieved
  • Precision usually decreases
precision recall curve
We measures Precision at various levels of Recall

Note: it is an AVERAGE over many queries

Precision-Recall curve







a common picture
A common picture







f measure
F measure
  • Combined measure (weighted harmonic mean):
  • People usually use balanced F1 measure
    • i.e., with  = ½ thus 1/F = ½ (1/P + 1/R)
  • Use this if you need to optimize a single measure that balances precision and recall.
the web graph properties

The web-graph: properties

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 19.1 and 19.2

the web s characteristics
The Web’s Characteristics
  • Size
    • 1 trillion of pages is available (Google 7/08)
      • 50 billion static pages
    • 5-40K per page => terabytes & terabytes
    • Size grows every day!!
  • Change
    • 8% new pages, 25% new links change weekly
    • Life time of about 10 days
some definitions
Some definitions
  • Weakly connected components (WCC)
    • Set of nodes such that from any node can go to any node via an undirected path.
  • Strongly connected components (SCC)
    • Set of nodes such that from any node can go to any node via a directed path.



find the core
Find the CORE
  • Iterate the following process:
    • Pick a random vertex v
    • Compute all nodes reached from v: O(v)
    • Compute all nodes that reach v: I(v)
    • Compute SCC(v):= I(v) ∩ O(v)
    • Check whether it is the largest SCC

If the CORE is about ¼ of the vertices, after 20 iterations, Pb to not find the core < 1% (given that the graph is available).

compute sccs
Compute SCCs
  • Classical Algorithm:
    • DFS(G)
    • Transpose G in GT
    • DFS(GT) following vertices in decreasing order of the time their visit ended at step 1.
    • Every tree is a SCC.

DFS is hard to compute on disk: no locality


Classical Approach


foreach vertex v do



foreach vertex v do

if (color[v]==WHITE)






d[u] time  time +1

foreach v in succ[u] do

if (color[v]=WHITE) then

p[v] u



color[u] BLACK

f[u]  time  time + 1

semi external dfs
Semi-External DFS
  • Bit array of nodes (visited or not)
  • Array of successors
  • Stack of the DFS-recursion

Key observation: If bit-array fits in internal memory than a DFS takes |V| + |E|/B disk accesses.

what about million billion nodes
What about million/billion nodes?


Key observation: A forest is a DFS forest if and only if there are no FORWARD CROSS edges among the non-tree edges

Algorithm ? Construct incrementally a tentative DFS forest which minimizes the # of those edges (overall), in passes...

a semi external dfs
A Semi-External DFS
  • Bit array of nodes (visited or not)
  • Array of successors (stack of the DFS-recursion)

Key assumption: We assume that 2n edges, and the auxiliary data structures, fit in memory.

Rearrange nodes in adj-lists, the ones with large subtrees go to the front

observing web graph
Observing Web Graph
  • We do not know which percentage of it we know
  • The only way to discover the graph structure of the web is via large scale crawls
  • Warning: the picture might be distorted by
    • Size limitation of the crawl
    • Crawling rules
    • Perturbations of the "natural" process of birth and death of nodes and links
why is it interesting
Why is it interesting?
  • Largest artifact ever conceived by the human
  • Exploit its structure of the Web for
      • Crawl strategies
      • Search
      • Spam detection
      • Discovering communities on the web
      • Classification/organization
  • Predict the evolution of the Web
      • Sociological understanding
many other large graphs
Many other large graphs…
  • Physical network graph
    • V = Routers
    • E = communication links
  • The “cosine” graph (undirected, weighted)
    • V = static web pages
    • E = semantic distance between pages
  • Query-Log graph (bipartite, weighted)
    • V = queries and URL
    • E = (q,u) u is a result for q, and has been clicked by some user who issued q
  • Social graph (undirected, unweighted)
    • V = users
    • E = (x,y) if x knows y (facebook, address book, email,..)
the size of the web

The size of the web

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 19.5

what is the size of the web
What is the size of the web ?
  • Issues
    • The web is really infinite
      • Dynamic content, e.g., calendar
    • Static web contains syntactic duplication, mostly due to mirroring (~30%)
    • Some servers are seldom connected
  • Who cares?
    • Media, and consequently the user
    • Engine design
what can we attempt to measure
The relative sizes of search engines

Document extension: e.g. engines index pages not yet crawled, by indexing anchor-text.

Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)

The coverage of a search engine relative to another particular crawling process.

What can we attempt to measure?
relative size from overlap given two engines a and b

Sec. 19.5


Relative Size from OverlapGiven two engines A and B

Sample URLs randomly from A

Check if contained in B and vice versa

AÇ B= (1/2) * Size A

AÇ B= (1/6) * Size B

(1/2)*Size A = (1/6)*Size B

\ Size A / Size B =

(1/6)/(1/2) = 1/3

Each test involves: (i) Sampling URL (ii) Checking URL

sampling urls
Sampling URLs
  • Ideal strategy: Generate a random URL and check for containment in each index.
  • Problem: Random URLs are hard to find!
  • Approach 1: Generate a random URL surely contained in a given search engine
  • Approach 2: Random walks or random IP addresses
1 random url in se via random queries
#1: Random URL in SE via random queries
  • Generate random query:
    • Lexicon:400,000+ words from a web crawl
    • Conjunctive Queries: w1 and w2

e.g., vocalists AND rsi

  • Get 100 result URLs from engine A
  • Choose a random URL as the candidate to check for presence in search engine B (next slide)
  • This induces a probability weight W(p) for each page.
  • Conjecture: W(SEA) / W(SEB) ~ |SEA| / |SEB|
url checking
URL checking
  • Download D at address URL.
    • Get list of words.
    • Use 8 low frequency words as AND query to B
    • Check if D is present in result set.
  • Problems:
    • Near duplicates
    • Engine time-outs
    • Is 8-word query good enough?
advantages disadvantages
Advantages & disadvantages
  • Statistically sound under the induced weight.
  • Biases induced by random query
    • Query bias: Favors content-rich pages in the language(s) of the lexicon
    • Ranking bias [Solution: Use conjunctive queries & fetch all]
    • Query restriction bias:engine might not deal properly with 8 words conjunctive query
    • Malicious bias: Sabotage by engine
    • Operational Problems: Time-outs, failures, engine inconsistencies, index modification.
2 random ip addresses
#2: Random IP addresses
  • Find a web server at the given IP address
    • If there’s one
  • Collect all pages from server
  • From this, choose a page at random
advantages disadvantages1
Advantages & disadvantages
  • Advantages
    • Clean statistics
    • Independent of crawling strategies
  • Disadvantages
    • Many hosts might share one IP, or not accept requests
    • No guarantee all pages are linked to root page, and thus can be collected.
    • Power law for # pages/hosts generates bias towards sites with few pages.
  • No sampling solution is perfect.
  • Lots of new ideas ...

....but the problem is getting harder

  • Quantitative studies are fascinating and a good research problem
the web graph storage

The web-graph: storage

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Reading 20.4


Directed graph G = (V,E)

    • V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT)

Three key properties:

  • Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
the in degree distribution
The In-degree distribution

Altavista crawl, 1999

WebBase Crawl 2001

Indegree follows power law distribution

This is true also for: out-degree, size components,...


Directed graph G = (V,E)

    • V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

  • Skewed distribution: Pb that a node has x links is 1/xa, a ≈ 2.1
  • Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).
  • Similarity: pages close in lexicographic order tend to share many outgoing lists
a picture of the web graph
A Picture of the Web Graph



21 millions of pages, 150millions of links

url sorting



the library webgraph
The library WebGraph

Uncompressed adjacency list

Successor list S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1}

For negative entries:

Adjacency list with compressed gaps



copy lists close nodes sharemany successors

Reference chains

possibly limited

Copy-lists: close nodes sharemany successors

Uncompressed adjacency list

Adjacency list with copy lists


Each bit of the copy-list informs whether the corresponding successor of y is also a successor of the reference x;

The reference index is chosen in [0,W] that gives the best compression.


copy blocks rle copy list
Copy-blocks = RLE(Copy-list)

Adjacency list with copy lists.

Adjacency list with copy blocks

(RLE on bit sequences)


The first copy block is 0 if the copy list starts with 0;

Each RLE-length is decremented by one for all blocks

The last block is omitted (we know the length from Outd);


extra nodes compressing intervals

This is a Java and C++ lib

(≈3 bits/edge)


Extra-nodes:Compressing Intervals

Adjacency list with copy blocks.


in extra-nodes

0 = (15-15)*2 (positive)

2 = (23-19)-2 (jump >= 2)

600 = (316-16)*2

12 = (22-16)*2 (positive)

3018 = 3041-22-1 (difference)

Intervals: encoded by left extreme and length

Int. length: decremented by Lmin = 2

Residuals: differences between residuals,

or the source (the first)