
### Random Sampling from a Search Engine’s Index

Ziv Bar-Yossef and Maxim Gurevich

Department of Electrical Engineering, Technion

Presentation at group meeting, Oct. 24

Allen, Zhenjiang Lin

Outline

- Introduction
  - Search Engine Samplers
  - Motivation
  - The Bharat-Broder Sampler (WWW’98)
- Infrastructure of Proposed Methods
  - Search Engines as Hypergraphs
  - Monte Carlo Simulation Methods – Rejection Sampling
- The Pool-based Sampler
- The Random Walk Sampler
- Experimental Results
- Conclusions

Search Engine Samplers

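[Diagram: the search engine builds an index D of indexed documents from the web and exposes a public interface. The sampler sends queries to that interface, receives the top k results for each, and outputs a random document x ∈ D.]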

Motivation

- Useful tool for search engine evaluation:
  - Freshness
    - Fraction of up-to-date pages in the index
  - Topical bias
    - Identification of overrepresented/underrepresented topics
  - Spam
    - Fraction of spam pages in the index
  - Security
    - Fraction of pages in index infected by viruses/worms/trojans
  - Relative Size
    - Number of documents indexed compared with other search engines

Size Wars

August 2005: “We index 20 billion documents.”

September 2005: “We index 8 billion documents, but our index is 3 times larger than our competition’s.”

So, who’s right?

Why Does Size Matter, Anyway?

- Comprehensiveness
  - A good crawler covers the most documents possible
- Narrow-topic queries
  - E.g., get homepage of John Doe
- Prestige
  - A marketing advantage

Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorni05]

- Sample pages uniformly at random from the search engine’s index
- Two alternatives
  - Absolute size estimation
    - Sample until collision
    - Collision expected after $k \sim N^{1/2}$ random samples (birthday paradox)
    - Return $k^2$
  - Relative size estimation
    - Check how many samples from search engine A are present in search engine B and vice versa
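
The two estimators above can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, `draw` stands for any uniform sampler, k is taken to count every draw including the colliding one, and the ratio used for the relative estimate is one natural way to combine the two overlap fractions.

```python
from typing import Callable, Hashable, Sequence


def estimate_absolute_size(draw: Callable[[], Hashable],
                           max_draws: int = 1_000_000) -> int:
    """Draw samples until one repeats, then return k**2.

    k counts every draw, including the one that collided (an assumption;
    the slide only says to return k squared).
    """
    seen = set()
    for k in range(1, max_draws + 1):
        x = draw()
        if x in seen:
            return k * k
        seen.add(x)
    raise RuntimeError("no collision within max_draws samples")


def estimate_relative_size(samples_a: Sequence[Hashable],
                           in_b: Callable[[Hashable], bool],
                           samples_b: Sequence[Hashable],
                           in_a: Callable[[Hashable], bool]) -> float:
    """Estimate |A| / |B| from uniform samples of each index.

    frac_a_in_b estimates |A ∩ B| / |A| and frac_b_in_a estimates
    |A ∩ B| / |B|, so their ratio estimates |A| / |B|.
    Raises ZeroDivisionError if no sample from A is found in B.
    """
    frac_a_in_b = sum(1 for x in samples_a if in_b(x)) / len(samples_a)
    frac_b_in_a = sum(1 for x in samples_b if in_a(x)) / len(samples_b)
    return frac_b_in_a / frac_a_in_b
```

Both estimators are only as good as the uniformity of the samples fed to them, which is why the rest of the talk concentrates on the sampler itself.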

Related Work

- Random Sampling from a Search Engine’s Index [BharatBroder98, CheneyPerry05, GulliSignorni05]
- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
- Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

The Bharat-Broder Sampler

[Diagram: the BB sampler picks two random terms t1, t2 from a lexicon L, submits the query “t1 AND t2” to the search engine, and returns a random document from the top k results.]

- Samples are uniform only if:
  - all queries return the same number of results, and that number is at most k
  - all documents are of the same length
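
A minimal sketch of one Bharat-Broder sampling step is given below. The in-memory `index` (term to set of documents) is a hypothetical stand-in for a real search engine, and sorting the matches is only a placeholder for the engine's ranking, which is not specified here.

```python
import random
from typing import Dict, List, Optional, Set


def bb_sample(lexicon: List[str],
              index: Dict[str, Set[str]],
              k: int,
              rng: random.Random) -> Optional[str]:
    """One Bharat-Broder step: random conjunctive query, random top-k hit."""
    # Pick two distinct random terms from the lexicon.
    t1, t2 = rng.sample(lexicon, 2)
    # Hypothetical engine: documents containing both terms ("t1 AND t2").
    matches = sorted(index.get(t1, set()) & index.get(t2, set()))
    # Only the top k results are visible; sorting stands in for ranking.
    top_k = matches[:k]
    if not top_k:
        return None  # the query returned nothing; the caller may retry
    return rng.choice(top_k)
```

Long documents contain more term pairs and so match more queries, and truncation to the top k favors highly ranked documents, which is where the two biases of this sampler come from.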

The Bharat-Broder Sampler: Drawbacks

- Documents have varying lengths
  - Bias towards long documents
- Some queries have more than k matches
  - Bias towards documents with high static rank

Two novel samplers

- A pool-based sampler (focus of this talk)
  - Guaranteed to produce near-uniform samples
  - Needs a lexicon / query pool
- A random walk sampler
  - After sufficiently many steps, guaranteed to produce near-uniform samples
  - Does not need an explicit lexicon / pool at all!

Search Engines as Hypergraphs

- results(q) = { documents returned on query q }
- queries(x) = { queries that return x as a result }
- P = query pool = a set of queries
- Query pool hypergraph:
  - Vertices: indexed documents
  - Hyperedges: { results(q) | q ∈ P }

[Figure: example hypergraph with four hyperedges, one for each of the queries “news”, “google”, “maps”, and “bbc”, over the documents www.cnn.com, news.google.com, www.google.com, news.bbc.co.uk, www.foxnews.com, www.mapquest.com, maps.google.com, en.wikipedia.org/wiki/BBC, www.bbc.co.uk, and maps.yahoot.com.]

Query Cardinalities and Document Degrees

- Query cardinality: card(q) = |results(q)|
- Document degree: deg(x) = |queries(x)|
- Examples:
- card(“news”) = 4, card(“bbc”) = 3
- deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
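
The definitions above can be made concrete on the example pool. The exact membership of each result set below is an assumption reconstructed from the figure labels; it was chosen to agree with the cardinalities and degrees stated on the slide.

```python
from typing import Dict, Set

# Assumed toy query pool: query -> results(q). Memberships are inferred
# from the figure labels, not given explicitly in the talk.
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}


def results(q: str) -> Set[str]:
    """Documents returned on query q (a hyperedge)."""
    return POOL[q]


def queries(x: str) -> Set[str]:
    """Pool queries that return document x."""
    return {q for q, docs in POOL.items() if x in docs}


def card(q: str) -> int:
    """Query cardinality: |results(q)|."""
    return len(results(q))


def deg(x: str) -> int:
    """Document degree: |queries(x)|."""
    return len(queries(x))
```

Summing card(q) over the pool and summing deg(x) over the documents both count the same query-document incidences, so the two totals always agree.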

Sampling documents uniformly

- Sampling documents from D uniformly: Hard
- Sampling documents from D non-uniformly: Easier
- Will show later: can sample documents proportionally to their degrees: $p(x) = \deg(x) / \sum_{x' \in D} \deg(x')$

Sampling documents by degree

- p(news.bbc.co.uk) = 2/13
- p(www.cnn.com) = 1/13

Monte Carlo Simulation

- We need: Samples from the uniform distribution
- We have: Samples from the degree distribution
- Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
- Yes!

Monte Carlo Simulation Methods

- Rejection Sampling
- Importance Sampling
- Metropolis-Hastings
- Maximum-Degree

Rejection Sampling Algorithm

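Rejection sampling draws values from an arbitrary probability distribution f(x) by using an instrumental distribution g(x) that is easy to sample from.

The algorithm (due to John von Neumann) is as follows:

- Sample x from g(x) and u from U(0,1).
- Check whether u < f(x) / (M g(x)).
  - If this holds, accept x as a realization of f(x);
  - if not, reject the value of x and repeat the sampling step.

M > 1 is an appropriate bound on f(x) / g(x).

Proof:

The probability that x is drawn and accepted in one round is $p_{RS}(x) = g(x) \cdot \frac{f(x)}{M g(x)} = \frac{f(x)}{M}$, which is proportional to f(x), so accepted samples are distributed according to f.

The acceptance probability is valid because $\frac{f(x)}{M g(x)} \le 1 \iff M \ge \frac{f(x)}{g(x)}$, $\forall x \in D$.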
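
The accept/reject loop can be sketched generically as below. The function name and its parameters are illustrative choices, not part of the talk; `f` and `g` are density (or probability mass) functions and `draw_from_g` is any sampler for g.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def rejection_sample(f: Callable[[T], float],
                     g: Callable[[T], float],
                     draw_from_g: Callable[[], T],
                     M: float,
                     rng: random.Random) -> T:
    """Return one sample distributed according to f.

    Requires M >= f(x) / g(x) for every x that draw_from_g can return.
    """
    while True:
        x = draw_from_g()            # candidate from the instrumental g
        u = rng.random()             # u ~ U(0, 1)
        if u < f(x) / (M * g(x)):    # accept with probability f / (M g)
            return x
        # otherwise reject x and repeat the sampling step
```

Each round accepts with overall probability 1/M, so the expected number of candidates per returned sample is M; a tighter bound M means fewer wasted draws.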

Rejection Sampling: An Example

- Sampling u.a.r. from the square: g(x), Easy
- Sampling u.a.r. from the disc inscribed in the square: f(x), Hard
- Since f(x) = F on the disc and g(x) = G on the square (both constant), set M = F/G;
- Generate a candidate point x from the square according to g(x);
- If x is in the disc, f(x) = F ≠ 0, thus f(x) / (M g(x)) = 1: accept x;
- If x is in square \ disc, f(x) = 0, thus f(x) / (M g(x)) = 0: reject x;
- Therefore, x is sampled u.a.r. from the disc.
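
A minimal sketch of this example, assuming the square is [-1, 1] × [-1, 1] and the disc is the unit disc centered at the origin (the slide does not fix coordinates):

```python
import random
from typing import Tuple


def sample_disc(rng: random.Random) -> Tuple[float, float]:
    """Sample a point uniformly from the unit disc by rejection."""
    while True:
        # Candidate drawn uniformly from the enclosing square.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Inside the disc the acceptance probability is 1, outside it is 0,
        # so no extra coin toss is needed.
        if x * x + y * y <= 1.0:
            return (x, y)
```

Because the acceptance probability is either 0 or 1 here, the test reduces to a membership check; the fraction of accepted candidates is the area ratio π/4.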

Monte Carlo Simulation

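- π: Target distribution
  - In our case: π = uniform on D
- p: Trial distribution
  - In our case: p = degree distribution
- Bias weight of p(x) relative to π(x): $w(x) \propto \pi(x) / p(x)$
  - In our case: $w(x) = 1/\deg(x)$

[Diagram: the π-sampler wraps a p-sampler. The p-sampler emits samples from p together with their weights, (x1, w(x1)), (x2, w(x2)), …, and the Monte Carlo simulator turns them into a sample x from π.]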

Bias Weights

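- Unnormalized forms of π and p: $\pi(x) = \hat{\pi}(x) / Z_\pi$, $p(x) = \hat{p}(x) / Z_p$
  - $Z_\pi, Z_p$: (unknown) normalization constants
- Examples:
  - π = uniform: $\hat{\pi}(x) = 1$
  - p = degree distribution: $\hat{p}(x) = \deg(x)$
- Bias weight: $w(x) = \hat{\pi}(x) / \hat{p}(x) = 1/\deg(x)$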

Rejection Sampling [von Neumann]

- C: envelope constant
  - C ≥ w(x) for all x
- The algorithm:
  - accept := false
  - while (not accept)
    - generate a sample x from p
    - toss a coin whose heads probability is $w(x)/C$
    - if coin comes up heads, accept := true
  - return x
- In our case: C = 1 and acceptance prob = 1/deg(x)
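
The loop above, specialized to our case, can be sketched on the toy pool from the hypergraph slides. The pool memberships are an assumption inferred from the figure, and `sample_by_degree` is a hypothetical stand-in for the degree distribution sampler: it picks a (query, document) incidence uniformly, which makes each document's probability proportional to its degree.

```python
import random
from typing import Dict, List, Set, Tuple

# Assumed toy pool (memberships inferred from the example figure).
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}
DOCS: List[str] = sorted(set().union(*POOL.values()))
# All (query, document) incidences; sorted for reproducibility.
PAIRS: List[Tuple[str, str]] = [(q, x) for q in sorted(POOL)
                                for x in sorted(POOL[q])]


def deg(x: str) -> int:
    """Number of pool queries that return x."""
    return sum(1 for docs in POOL.values() if x in docs)


def sample_by_degree(rng: random.Random) -> str:
    """Trial distribution p: probability proportional to deg(x)."""
    return rng.choice(PAIRS)[1]


def sample_uniform(rng: random.Random) -> str:
    """Target distribution: uniform on DOCS, via rejection sampling."""
    while True:
        x = sample_by_degree(rng)
        # Heads probability w(x) / C with w(x) = 1/deg(x) and C = 1.
        if rng.random() < 1.0 / deg(x):
            return x
```

A document of degree d is proposed d times as often but accepted only 1/d of the time, so the two effects cancel and every document is returned with equal probability.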

Pool-Based Sampler

- Degree distribution: $p(x) = \deg(x) / \sum_{x'} \deg(x')$

[Diagram: the pool-based sampler sends queries q1, q2, … to the search engine and receives results(q1), results(q2), …. Its degree distribution sampler outputs documents sampled from the degree distribution with corresponding weights, (x1, 1/deg(x1)), (x2, 1/deg(x2)), …, and rejection sampling turns them into a uniform sample x.]

Sampling documents by degree

- Select a random query q
- Select a random x ∈ results(q)
- Documents with high degree are more likely to be sampled
- If we sample q uniformly, we oversample documents that belong to narrow queries, because each document in a small result set is picked with higher probability
- We need to sample q proportionally to its cardinality
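
The oversampling effect can be computed exactly on the toy pool. The pool memberships are an assumption inferred from the example figure, and the function name is illustrative.

```python
from fractions import Fraction
from typing import Dict, Set

# Assumed toy pool (memberships inferred from the example figure).
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}


def doc_probs_uniform_query(pool: Dict[str, Set[str]]) -> Dict[str, Fraction]:
    """Exact output distribution when q is uniform and x is uniform in results(q)."""
    probs: Dict[str, Fraction] = {}
    p_query = Fraction(1, len(pool))          # every query equally likely
    for docs in pool.values():
        p_doc = Fraction(1, len(docs))        # uniform within results(q)
        for x in docs:
            probs[x] = probs.get(x, Fraction(0)) + p_query * p_doc
    return probs
```

In this pool www.cnn.com and www.google.com both have degree 1, yet the first gets probability 1/16 (its only query has 4 results) and the second 1/12 (its only query has 3), so the output is not the degree distribution.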

Sampling documents by degree (2)

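- Select a query q proportionally to its cardinality
- Select a random x ∈ results(q)
- Analysis: $\Pr(x) = \sum_{q \in \mathrm{queries}(x)} \frac{\mathrm{card}(q)}{\sum_{q' \in P} \mathrm{card}(q')} \cdot \frac{1}{\mathrm{card}(q)} = \frac{\deg(x)}{\sum_{q' \in P} \mathrm{card}(q')} = \frac{\deg(x)}{\sum_{x' \in D} \deg(x')}$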
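
The two-step procedure above can be checked exactly on the toy pool: choosing q with probability proportional to card(q) and then x uniformly from results(q) should give each document probability deg(x) divided by the total degree. The pool memberships are an assumption inferred from the example figure.

```python
from fractions import Fraction
from typing import Dict, Set

# Assumed toy pool (memberships inferred from the example figure).
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}


def doc_probs_cardinality_query(pool: Dict[str, Set[str]]) -> Dict[str, Fraction]:
    """Exact output distribution when q is chosen proportionally to card(q)."""
    total_card = sum(len(docs) for docs in pool.values())
    probs: Dict[str, Fraction] = {}
    for docs in pool.values():
        p_query = Fraction(len(docs), total_card)  # card(q) / sum of cards
        p_doc = Fraction(1, len(docs))             # uniform within results(q)
        for x in docs:
            probs[x] = probs.get(x, Fraction(0)) + p_query * p_doc
    return probs
```

Each query containing x contributes exactly 1 / (total cardinality), so the card(q) factors cancel and only the number of such queries, deg(x), remains.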

Degree Distribution Sampler

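[Diagram: inside the degree distribution sampler, a cardinality distribution sampler outputs a query q sampled from the cardinality distribution. The query q is sent to the search engine, which returns results(q), and a document x is sampled uniformly from results(q). The output x is a document sampled from the degree distribution.]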

Sampling queries by cardinality

- Sampling queries from pool uniformly: Easy
- Sampling queries from pool by cardinality: Hard
  - Requires knowing cardinalities of all queries in the search engine
- Use Monte Carlo methods to simulate biased sampling via uniform sampling:
  - Target distribution: the cardinality distribution
  - Trial distribution: uniform distribution on the query pool

Sampling queries by cardinality

- Bias weight of cardinality distribution relative to the uniform distribution: $w(q) = \mathrm{card}(q)$
  - Can be computed using a single search engine query
- Use rejection sampling:
  - Envelope constant for rejection sampling: $C = k$
  - Queries are sampled uniformly from the pool
  - Each query q is accepted with probability $\mathrm{card}(q) / k$
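
A minimal sketch of this step, assuming the acceptance probability is card(q)/k with k the result limit of the engine. The `card` callable is a hypothetical stand-in for issuing one search engine query, and queries with no results or more than k results are simply skipped here, which is a simplifying assumption.

```python
import random
from typing import Callable, Sequence


def sample_query_by_cardinality(pool: Sequence[str],
                                card: Callable[[str], int],
                                k: int,
                                rng: random.Random) -> str:
    """Return a query with probability proportional to its cardinality."""
    while True:
        q = rng.choice(pool)       # trial distribution: uniform on the pool
        c = card(q)                # one search-engine query per candidate
        if c == 0 or c > k:
            continue               # empty or overflowing query: skip it
        if rng.random() < c / k:   # accept with probability w(q) / C
            return q
```

Only the cardinality of the current candidate is needed, so the sampler never has to know the cardinalities of all queries in the pool in advance.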

Complete Pool-Based Sampler

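[Diagram: the complete pool-based sampler. A uniform query sampler emits uniform query samples with their cardinalities, (q, card(q)), …, obtained from the search engine. Rejection sampling turns them into queries sampled from the cardinality distribution, passed on with their results as (q, results(q)), …. The degree distribution sampler then emits documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …, and a second rejection sampling step outputs a uniform document sample x.]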

Dealing with Overflowing Queries

- Problem: Some queries may overflow (card(q) > k)
  - Bias towards highly ranked documents
- Solutions:
  - Select a pool P in which overflowing queries are rare (e.g., phrase queries)
  - Skip overflowing queries
  - Adapt rejection sampling to deal with approximate weights

Theorem: Samples of the PB sampler are at most ε-away from uniform (ε = overflow probability of P).
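
Putting the stages together, the whole pool-based sampler with skipped overflowing queries can be sketched as below. The toy pool, the `search` function, and the way deg(x) is obtained (by scanning the toy pool) are all assumptions for illustration; acceptance probabilities card(q)/k and 1/deg(x) are assumed for the two rejection steps.

```python
import random
from typing import Dict, List, Set

# Assumed toy pool (memberships inferred from the example figure).
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}
DOCS: List[str] = sorted(set().union(*POOL.values()))


def search(q: str) -> List[str]:
    """Hypothetical search engine: all matches for q, in a fixed order."""
    return sorted(POOL.get(q, set()))


def deg(x: str) -> int:
    """Number of pool queries that return x (computed from the toy pool)."""
    return sum(1 for docs in POOL.values() if x in docs)


def pool_based_sample(pool: List[str], k: int, rng: random.Random) -> str:
    """Return one near-uniform document sample."""
    while True:
        q = rng.choice(pool)                  # uniform query from the pool
        res = search(q)
        if not res or len(res) > k:
            continue                          # skip empty/overflowing queries
        if rng.random() >= len(res) / k:
            continue                          # reject: keeps q ~ cardinality
        x = rng.choice(res)                   # x ~ degree distribution
        if rng.random() < 1.0 / deg(x):       # second rejection step
            return x                          # x ~ uniform
```

For example, `pool_based_sample(sorted(POOL), 4, random.Random(0))` returns one of the ten toy documents; over many calls each document appears about equally often.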

Creating the query pool

[Diagram: a query pool P = {q1, q2, …} is extracted from a large corpus C.]

- Example: P = all 3-word phrases that occur in C
  - If “to be or not to be” occurs in C, P contains:
    - “to be or”, “be or not”, “or not to”, “not to be”
- Choose P that “covers” most documents in D
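
Building such a phrase pool can be sketched as follows. Splitting on whitespace is a simplifying assumption; the talk does not say how the corpus is tokenized.

```python
from typing import Iterable, Set


def build_phrase_pool(corpus: Iterable[str], n: int = 3) -> Set[str]:
    """Collect every n-word phrase that occurs in the corpus texts."""
    pool: Set[str] = set()
    for text in corpus:
        words = text.split()  # naive whitespace tokenization (assumption)
        # Slide a window of n consecutive words over the text.
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool
```

For the example on the slide, `build_phrase_pool(["to be or not to be"])` yields exactly the four phrases listed above.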

A random walk sampler

- Define a graph G over the indexed documents
  - (x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
- Run a random walk on G
  - Limit distribution = degree distribution
  - Use MCMC methods to make the limit distribution uniform:
    - Metropolis-Hastings
    - Maximum-Degree
- Does not need a preprocessing step
- Less efficient than the pool-based sampler
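
A Metropolis-Hastings variant of the walk can be sketched on the toy pool. Several things here are assumptions rather than details from the slides: the pool memberships, the proposal rule (pick a random query of the current document, then a random document from its results), and the acceptance probability min(1, deg(x)/deg(y)), which is the standard Metropolis-Hastings correction for a uniform target under that proposal.

```python
import random
from fractions import Fraction
from typing import Dict, List, Set

# Assumed toy pool (memberships inferred from the example figure).
POOL: Dict[str, Set[str]] = {
    "news": {"www.cnn.com", "news.google.com", "news.bbc.co.uk",
             "www.foxnews.com"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"www.mapquest.com", "maps.google.com", "maps.yahoot.com"},
    "bbc": {"news.bbc.co.uk", "en.wikipedia.org/wiki/BBC", "www.bbc.co.uk"},
}
DOCS: List[str] = sorted(set().union(*POOL.values()))


def queries(x: str) -> List[str]:
    """Pool queries that return x, in a fixed order."""
    return sorted(q for q, docs in POOL.items() if x in docs)


def deg(x: str) -> int:
    return len(queries(x))


def mh_step(x: str, rng: random.Random) -> str:
    """One Metropolis-Hastings step from document x."""
    q = rng.choice(queries(x))           # random query that returns x
    y = rng.choice(sorted(POOL[q]))      # random document returned by q
    # Accept the move with probability min(1, deg(x) / deg(y)).
    if rng.random() < min(1.0, deg(x) / deg(y)):
        return y
    return x                             # rejected: stay at x


def mh_walk(start: str, steps: int, rng: random.Random) -> str:
    """Run the walk for a fixed number of steps and return the end point."""
    x = start
    for _ in range(steps):
        x = mh_step(x, rng)
    return x


def mh_transition(x: str, y: str) -> Fraction:
    """Exact one-step probability of moving from x to a different y."""
    proposal = sum(Fraction(1, deg(x)) * Fraction(1, len(POOL[q]))
                   for q in queries(x) if y in POOL[q])
    return proposal * min(Fraction(1), Fraction(deg(x), deg(y)))
```

The corrected transition probabilities are symmetric, `mh_transition(x, y) == mh_transition(y, x)`, which is why the uniform distribution is stationary for this walk; how many steps are needed before the end point is close to uniform is not addressed by this sketch.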

Conclusions

- Two new search engine samplers
  - Pool-based sampler
  - Random walk sampler
- Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
- Samplers show little or no bias in experiments.
