Random Sampling from a Search Engine's Index

Ziv Bar-Yossef and Maxim Gurevich

Department of Electrical Engineering, Technion

Presentation at group meeting, Oct. 24

Allen, Zhenjiang Lin

Outline
  • Introduction
    • Search Engine Samplers
    • Motivation
  • The Bharat-Broder Sampler (WWW’98)
  • Infrastructure of Proposed Methods
    • Search Engines as Hypergraphs
    • Monte Carlo Simulation Methods – Rejection Sampling
  • The Pool-based Sampler
  • The Random Walk Sampler
  • Experimental Results
  • Conclusions
Search Engine Samplers

[Diagram: a sampler submits queries to the search engine's public interface and receives only the top k results. The engine's index holds the set D of documents indexed from the Web; the sampler's goal is to output a random document x ∈ D.]

Motivation
  • Useful tool for search engine evaluation:
    • Freshness
      • Fraction of up-to-date pages in the index
    • Topical bias
      • Identification of overrepresented/underrepresented topics
    • Spam
      • Fraction of spam pages in the index
    • Security
      • Fraction of pages in index infected by viruses/worms/trojans
    • Relative Size
      • Number of documents indexed compared with other search engines
Size Wars

  • August 2005: "We index 20 billion documents."
  • September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."

So, who's right?

Why Does Size Matter, Anyway?
  • Comprehensiveness
    • A good crawler covers the most documents possible
  • Narrow-topic queries
    • E.g., get homepage of John Doe
  • Prestige
    • A marketing advantage
Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorni05]
  • Sample pages uniformly at random from the search engine’s index
  • Two alternatives
    • Absolute size estimation
      • Sample until collision
      • A collision is expected after k ~ √N random samples (birthday paradox)
      • Return k² as the size estimate (see the sketch after this list)
    • Relative size estimation
      • Check how many samples from search engine A are present in search engine B and vice versa
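
As a sketch of both estimators, assuming a hypothetical near-uniform document sampler and hypothetical membership checks (none of these helpers are defined in the slides):

```python
import random

def estimate_index_size(sample_uniform_doc):
    """Estimate the index size by sampling until the first collision (birthday paradox).

    `sample_uniform_doc` is a hypothetical callable returning one near-uniform
    sample (e.g., a URL) from the search engine's index.
    """
    seen = set()
    k = 0
    while True:
        doc = sample_uniform_doc()
        k += 1
        if doc in seen:          # first collision after roughly sqrt(N) samples
            return k * k         # so k^2 estimates the index size N
        seen.add(doc)

def relative_size(samples_a, contains_b, samples_b, contains_a):
    """Estimate |A| / |B| from cross-containment rates of uniform samples.

    `contains_b(doc)` / `contains_a(doc)` are hypothetical membership checks
    (e.g., a URL query against the other engine).
    """
    frac_a_in_b = sum(contains_b(d) for d in samples_a) / len(samples_a)
    frac_b_in_a = sum(contains_a(d) for d in samples_b) / len(samples_b)
    # |A| * frac_a_in_b ~ |A ∩ B| ~ |B| * frac_b_in_a, hence:
    return frac_b_in_a / frac_a_in_b
```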
Related Work
  • Random Sampling from a Search Engine's Index [BharatBroder98, CheneyPerry05, GulliSignorni05]
  • Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
  • Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
  • Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
The Bharat-Broder Sampler: Preprocessing Step

[Diagram: a lexicon L of terms t1, t2, … is extracted from a large corpus C, together with each term's corpus frequency freq(t1, C), freq(t2, C), …]
The Bharat-Broder Sampler

[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, submits the conjunctive query "t1 AND t2" to the search engine, and returns a random document from the top k results.]

  • Samples are uniform only if:
    • all queries return the same number of results (≤ k), and
    • all documents are of the same length.
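
A minimal sketch of the BB sampler, assuming a lexicon list and a hypothetical `search(query, k)` wrapper around the engine's public interface:

```python
import random

def bharat_broder_sample(lexicon, search, k=100):
    """One Bharat-Broder sample: a random result of a random two-term AND query.

    `lexicon` is a list of terms; `search(query, k)` is a hypothetical wrapper
    around the search engine's public interface returning up to k result URLs.
    """
    while True:
        t1, t2 = random.sample(lexicon, 2)       # two random terms
        results = search(f"{t1} AND {t2}", k)    # only the top k results are visible
        if results:                              # retry on empty result sets
            return random.choice(results)
```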
The Bharat-Broder Sampler: Drawbacks
  • Documents have varying lengths
    • Bias towards long documents
  • Some queries have more than k matches
    • Bias towards documents with high static rank
Two novel samplers
  • A pool-based sampler
    • Guaranteed to produce near-uniform samples
    • Needs a lexicon / query pool
  • A random walk sampler
    • After sufficiently many steps, guaranteed to produce near-uniform samples
    • Does not need an explicit lexicon / pool at all!

Focus of this talk: the pool-based sampler.

Search Engines as Hypergraphs

  • results(q) = { documents returned on query q }
  • queries(x) = { queries that return x as a result }
  • P = query pool = a set of queries
  • Query pool hypergraph:
    • Vertices: indexed documents
    • Hyperedges: { results(q) | q ∈ P }

[Example hypergraph: the queries "news", "google", "maps", "bbc" span the documents www.cnn.com, news.google.com, www.google.com, news.bbc.co.uk, www.foxnews.com, www.mapquest.com, maps.google.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk, maps.yahoot.com; each query's hyperedge contains the documents it returns.]
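
The hypergraph view maps directly onto simple data structures. A minimal sketch, using a hypothetical toy index consistent with the figure (only the "news" and "bbc" hyperedges are spelled out):

```python
from collections import defaultdict

# Hypothetical toy hyperedges: results(q) for two of the pool queries.
results = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"},
    "bbc":  {"news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"},
}

def invert(results):
    """Compute queries(x) = { q : x in results(q) } by inverting the hyperedges."""
    queries = defaultdict(set)
    for q, docs in results.items():
        for x in docs:
            queries[x].add(q)
    return queries

print(invert(results)["news.bbc.co.uk"])   # {'news', 'bbc'}
```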

Query Cardinalities and Document Degrees

  • Query cardinality: card(q) = |results(q)|
  • Document degree: deg(x) = |queries(x)|
  • Examples (from the hypergraph above):
    • card("news") = 4, card("bbc") = 3
    • deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2

[Figure: the same example hypergraph as on the previous slide.]
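
Continuing the toy index above, cardinalities and degrees are just set sizes; the asserts mirror the examples on this slide:

```python
# Same toy results(q) map as in the previous sketch.
results = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk", "www.foxnews.com"},
    "bbc":  {"news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"},
}

card = {q: len(docs) for q, docs in results.items()}   # card(q) = |results(q)|

deg = {}                                               # deg(x) = |queries(x)|
for docs in results.values():
    for x in docs:
        deg[x] = deg.get(x, 0) + 1

assert card["news"] == 4 and card["bbc"] == 3
assert deg["www.cnn.com"] == 1 and deg["news.bbc.co.uk"] == 2
```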

Sampling documents uniformly
  • Sampling documents from D uniformly: Hard
  • Sampling documents from D non-uniformly: Easier
  • Will show later: we can sample documents proportionally to their degrees, i.e., p(x) ∝ deg(x)
Sampling documents by degree

  • In the example hypergraph the degrees sum to 13, so:
    • p(news.bbc.co.uk) = 2/13
    • p(www.cnn.com) = 1/13

[Figure: the same example hypergraph as before.]

Monte Carlo Simulation
  • We need: Samples from the uniform distribution
  • We have: Samples from the degree distribution
  • Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
  • Yes!

Monte Carlo simulation methods:
  • Rejection Sampling
  • Importance Sampling
  • Metropolis-Hastings
  • Maximum-Degree

Rejection Sampling Algorithm

Goal: sample values from an arbitrary probability distribution f(x) by using an instrumental (trial) distribution g(x).

The algorithm (due to John von Neumann) is as follows:

  • Sample x from g(x) and u from U(0,1).
  • Check whether u < f(x) / (M·g(x)).
    • If this holds, accept x as a realization of f(x);
    • if not, reject x and repeat the sampling step.

Here M > 1 is an appropriate bound on f(x) / g(x).

Correctness: the density of accepted values is proportional to g(x) · f(x) / (M·g(x)) = f(x) / M, i.e., to f(x). The acceptance probability f(x) / (M·g(x)) ≤ 1 holds iff M ≥ f(x) / g(x) for all x ∈ D.
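
A generic sketch of this algorithm in Python; f, g_pdf, g_sample and M are all caller-supplied assumptions, and the 1D example at the end is purely illustrative:

```python
import random

def rejection_sample(f, g_sample, g_pdf, M):
    """Draw one sample from density f using trial density g and a bound M >= f(x)/g(x).

    f and g_pdf are callables returning densities; g_sample returns one draw
    from g. All three are assumptions supplied by the caller.
    """
    while True:
        x = g_sample()
        u = random.random()                 # u ~ U(0, 1)
        if u < f(x) / (M * g_pdf(x)):       # accept with probability f(x)/(M*g(x))
            return x

# Example: sample from the triangular density f(x) = 2x on [0, 1]
# using the uniform trial g(x) = 1 on [0, 1]; f(x)/g(x) <= 2, so M = 2.
sample = rejection_sample(lambda x: 2 * x, random.random, lambda x: 1.0, M=2.0)
```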

Rejection Sampling: An Example
  • Sampling u.a.r. from the unit square, density g(x): easy
  • Sampling u.a.r. from the unit disc, density f(x): hard
  • f(x) = F inside the disc and g(x) = G inside the square (both constants); set M = F/G.
  • Generate a candidate point x from the square, i.e., from g(x).
  • If x is inside the disc, f(x) = F ≠ 0, so f(x)/(M·g(x)) = 1: accept x.
  • If x is in the square but outside the disc, f(x) = 0, so f(x)/(M·g(x)) = 0: reject x.
  • Therefore, the accepted x is sampled u.a.r. from the unit disc.
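
The disc-in-square example is easy to run directly; this sketch accepts a candidate exactly when it falls inside the disc, which is the acceptance rule derived above:

```python
import random

def sample_unit_disc():
    """Sample a point uniformly at random from the unit disc by rejection
    from the enclosing square [-1, 1] x [-1, 1]."""
    while True:
        x = random.uniform(-1.0, 1.0)        # candidate from the square (trial g)
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:             # inside the disc: accept
            return (x, y)                    # outside: reject and retry

# The acceptance rate is area(disc) / area(square) = pi / 4 ≈ 0.785.
```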
Monte Carlo Simulation

  • π: target distribution
    • In our case: π = uniform on D
  • p: trial distribution
    • In our case: p = degree distribution
  • Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x)
    • In our case: w(x) ∝ 1 / deg(x)

[Diagram: a Monte Carlo simulator turns a p-sampler into a π-sampler: it consumes samples (x1, w(x1)), (x2, w(x2)), … drawn from p together with their bias weights and outputs a sample x approximately distributed according to π.]

Bias Weights

  • Unnormalized forms of π and p: π(x) = π̂(x) / Zπ and p(x) = p̂(x) / Zp
  • Zπ, Zp: (unknown) normalization constants
  • Examples:
    • π = uniform: π̂(x) = 1, Zπ = |D|
    • p = degree distribution: p̂(x) = deg(x), Zp = Σx deg(x)
  • Bias weight: w(x) = π̂(x) / p̂(x) = 1 / deg(x)
Rejection Sampling [von Neumann]

  • C: envelope constant
    • C ≥ w(x) for all x
  • The algorithm:
    • accept := false
    • while (not accept)
      • generate a sample x from p
      • toss a coin whose heads probability is w(x) / C
      • if the coin comes up heads, accept := true
    • return x
  • In our case: C = 1 and the acceptance probability is 1 / deg(x)
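
In our setting the loop specializes as follows; `sample_by_degree` and `deg` are hypothetical helpers standing in for the components described on the next slides:

```python
import random

def uniform_from_degree_sampler(sample_by_degree, deg):
    """Turn degree-distribution samples into (near-)uniform samples.

    `sample_by_degree()` is a hypothetical sampler returning documents with
    probability proportional to deg(x); `deg(x)` returns a document's degree.
    With bias weight w(x) = 1/deg(x) and envelope constant C = 1, a sample is
    accepted with probability w(x)/C = 1/deg(x), as on the slide above.
    """
    while True:
        x = sample_by_degree()
        if random.random() < 1.0 / deg(x):   # accept with probability 1/deg(x)
            return x
```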
Pool-Based Sampler

  • Degree distribution: p(x) = deg(x) / Σx' deg(x')

[Diagram: the pool-based sampler submits queries q1, q2, … to the search engine and receives results(q1), results(q2), …; a degree-distribution sampler produces documents with weights (x1, 1/deg(x1)), (x2, 1/deg(x2)), …, and rejection sampling turns these into a uniform sample x.]

Sampling documents by degree

  • Select a random query q
  • Select a random x ∈ results(q)
  • Documents with high degree are more likely to be sampled
  • But if we sample q uniformly, we "oversample" documents that belong to narrow queries, because queries have different weights (cardinalities)
  • We therefore need to sample q proportionally to its cardinality

[Figure: the same example hypergraph as before.]

Sampling documents by degree (2)

  • Select a query q proportionally to its cardinality: Pr[q] = card(q) / Σq' card(q')
  • Select a random x ∈ results(q)
  • Analysis: p(x) = Σq ∈ queries(x) Pr[q] · (1 / card(q)) = deg(x) / Σq' card(q'), so p(x) ∝ deg(x)

[Figure: the same example hypergraph as before.]
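
A sketch of this two-step procedure, assuming full knowledge of results(q) (e.g., the toy dict from the earlier sketches); the real sampler avoids that assumption via rejection sampling, as shown next:

```python
import random

def sample_doc_by_degree(results):
    """Sample a document with probability proportional to deg(x).

    `results` maps each pool query q to its result set. Step 1: pick q with
    probability proportional to card(q); step 2: pick x uniformly from
    results(q). Then p(x) = sum over q in queries(x) of
    [card(q)/Σcard] * [1/card(q)] = deg(x)/Σcard, i.e., p(x) ∝ deg(x).
    """
    queries = list(results)
    cards = [len(results[q]) for q in queries]
    q = random.choices(queries, weights=cards, k=1)[0]   # Pr[q] ∝ card(q)
    return random.choice(list(results[q]))               # uniform within results(q)
```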

Degree Distribution Sampler

[Diagram: a cardinality-distribution sampler outputs a query q sampled from the cardinality distribution; q is sent to the search engine, which returns results(q); sampling x uniformly from results(q) then yields a document sampled from the degree distribution.]

Sampling queries by cardinality
  • Sampling queries from pool uniformly: Easy
  • Sampling queries from pool by cardinality: Hard
    • Requires knowing cardinalities of all queries in the search engine
  • Use Monte Carlo methods to simulate biased sampling via uniform sampling:
    • Target distribution: the cardinality distribution
    • Trial distribution: uniform distribution on the query pool
Sampling queries by cardinality
  • Bias weight of the cardinality distribution relative to the uniform distribution: w(q) ∝ card(q)
    • Can be computed using a single search engine query
  • Use rejection sampling:
    • Envelope constant for rejection sampling: any C with C ≥ card(q) for all q ∈ P
    • Queries are sampled uniformly from the pool
    • Each query q is accepted with probability card(q) / C
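
A sketch of this rejection-sampling step; `card(q)` is a hypothetical helper that issues one engine query to read off the result count, and C is any valid envelope constant:

```python
import random

def sample_query_by_cardinality(pool, card, C):
    """Sample a query from the pool with probability proportional to its cardinality.

    `pool` is a list of queries, `card(q)` is a hypothetical helper that asks
    the search engine for the number of results of q (one query per call),
    and C is an envelope constant with C >= card(q) for every q in the pool.
    """
    while True:
        q = random.choice(pool)              # uniform trial sample from the pool
        if random.random() < card(q) / C:    # accept with probability card(q)/C
            return q
```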
Complete Pool-Based Sampler

[Diagram: a uniform query sampler draws queries from the pool and passes pairs (q, card(q)) to a first rejection-sampling stage, which yields queries sampled from the cardinality distribution; the search engine returns (q, results(q)); the degree-distribution sampler picks a document uniformly from results(q), producing documents sampled from the degree distribution with weights (x, 1/deg(x)); a second rejection-sampling stage then outputs x, a uniform document sample.]
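
Putting the pieces together, a much-simplified end-to-end sketch; `search(q, k)` is a hypothetical engine wrapper, overflow handling is crude, and the degree computation is done by brute force purely for illustration:

```python
import random

def pool_based_sample(pool, search, k=100):
    """End-to-end sketch of the pool-based sampler under the assumptions above.

    `pool` is a list of queries; `search(q, k)` is a hypothetical wrapper that
    returns up to k result URLs for q. A real implementation would handle
    overflowing queries and compute deg(x) far more carefully.
    """
    C = k  # envelope constant: non-overflowing queries have card(q) <= k

    while True:
        # Outer rejection sampling: query sampled from the cardinality distribution.
        q = random.choice(pool)
        results = search(q, k)
        card = len(results)
        if card == 0 or card >= k:               # skip empty and (possibly) overflowing queries
            continue
        if random.random() >= card / C:          # accept q with probability card(q)/C
            continue

        # Degree-distribution sampler: uniform document within results(q).
        x = random.choice(results)

        # Inner rejection sampling: accept x with probability 1/deg(x).
        deg = sum(1 for q2 in pool if x in search(q2, k))   # expensive; illustration only
        if random.random() < 1.0 / deg:
            return x
```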

Dealing with Overflowing Queries
  • Problem: Some queries may overflow (card(q) > k)
    • Bias towards highly ranked documents
  • Solutions:
    • Select a pool P in which overflowing queries are rare (e.g., phrase queries)
    • Skip overflowing queries
    • Adapt rejection sampling to deal with approximate weights

Theorem: Samples of the PB sampler are at most ε-away from uniform (ε = overflow probability of P).

Creating the query pool

[Diagram: candidate queries q1, q2, … are extracted from a large corpus C to form the query pool P.]

  • Example: P = all 3-word phrases that occur in C
    • If “to be or not to be” occurs in C, P contains:
      • “to be or”, “be or not”, “or not to”, “not to be”
  • Choose P that “covers” most documents in D
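
A sketch of pool construction for the 3-word-phrase example; tokenization here is deliberately naive:

```python
def three_word_phrases(corpus_docs):
    """Build a query pool of all 3-word phrases occurring in a corpus.

    `corpus_docs` is an iterable of document strings, standing in for the
    training corpus C; a whitespace split replaces real tokenization.
    """
    pool = set()
    for text in corpus_docs:
        words = text.lower().split()
        for i in range(len(words) - 2):
            pool.add(" ".join(words[i:i + 3]))   # e.g., "to be or", "be or not", ...
    return pool

print(sorted(three_word_phrases(["to be or not to be"])))
# ['be or not', 'not to be', 'or not to', 'to be or']
```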
A random walk sampler
  • Define a graph G over the indexed documents
    • (x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
  • Run a random walk on G
    • Limit distribution = degree distribution
    • Use MCMC methods to make limit distribution uniform.
      • Metropolis-Hastings
      • Maximum-Degree
  • Does not need a preprocessing step
  • Less efficient than the pool-based sampler
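
A sketch of such a walk with a Metropolis-Hastings correction toward the uniform target; `queries_of` and `results_of` are hypothetical helpers, and a practical implementation would estimate them through the public interface rather than assume them:

```python
import random

def mh_random_walk_sample(start_doc, queries_of, results_of, steps=1000):
    """Metropolis-Hastings random walk sketch with a uniform target distribution.

    `queries_of(x)` and `results_of(q)` give the pool queries matching document x
    and the documents returned for query q. The proposal from x picks a random
    q in queries_of(x), then a random y in results_of(q); the MH correction
    accepts y with probability min(1, P(y -> x) / P(x -> y)), which makes the
    limit distribution uniform.
    """
    def proposal_prob(x, y):
        # Probability that the walk proposes y when standing at x.
        qx = queries_of(x)
        return sum(1.0 / (len(qx) * len(results_of(q)))
                   for q in qx if y in results_of(q))

    x = start_doc
    for _ in range(steps):
        q = random.choice(list(queries_of(x)))
        y = random.choice(list(results_of(q)))
        accept = min(1.0, proposal_prob(y, x) / proposal_prob(x, y))
        if random.random() < accept:
            x = y
    return x
```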
Relative Sizes of Google, MSN and Yahoo!

  • Google = 1
  • Yahoo! = 1.28
  • MSN Search = 0.73

Conclusions
  • Two new search engine samplers
    • Pool-based sampler
    • Random walk sampler
  • Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
  • The samplers show little or no bias in experiments.