- 80 Views
- Uploaded on
- Presentation posted in: General

Web Algorithmics

Web Algorithmics

Web Search Engines

Retrieve docs that are “relevant” for the user query

- Doc: file word or pdf, web page, email, blog, e-book,...
- Query: paradigm “bag of words”
Relevant ?!?

The Web:

Language and encodings:hundreds…

Distributed authorship: SPAM, format-less,…

Dynamic: in one year 35% survive, 20% untouched

The User:

Query composition: short (2.5 terms avg) and imprecise

Query results: 85% users look at just one result-page

Several needs: Informational, Navigational, Transactional

Extracting “significant data” is difficult !!

Matching “user needs” is difficult !!

- First generation-- use only on-page, web-text data
- Word frequency and language

- Second generation-- use off-page, web-graph data
- Link (or connectivity) analysis
- Anchor-text (How people refer to a page)

- Third generation-- answer “the need behind the query”
- Focus on “user need”, rather than on query
- Integrate multiple data-sources
- Click-through data

1995-1997 AltaVista, Excite, Lycos, etc

1998: Google

Google, Yahoo,

MSN, ASK,………

Fourth generation Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]

+$

-$

Web Algorithmics

The structure of a Search Engine

?

Page archive

Crawler

Web

Page

Analizer

Indexer

Query

resolver

Ranker

Control

text

auxiliary

Structure

Query

- Consider Wikipedia En:
- Collection size ≈ 10 Gbytes
- # docs ≈4 * 106
- #terms in total > 1 billion(avg term len = 6 chars)
- #terms distinct = several millions

- Which kind of data structure do we build to support word-based searches ?

#docs ≈ 4M

#terms > 1M

Space ≈ 4Tb !

1 if play contains word, 0 otherwise

2

4

6

10

32

1

2

3

5

8

13

21

34

- A term like Calpurnia may use log2 N bits per occurrence
- A term like the should take about 1 bit per occurrence

Brutus

the

Calpurnia

13

16

Currently they get 13% original text

- Sort the docIDs
- Store gaps between consecutive docIDs:
- Brutus: 33, 47, 154, 159, 202 …
33, 14, 107, 5, 43 …

Two advantages:

- Brutus: 33, 47, 154, 159, 202 …
- Space: store smaller integers (clustering?)
- Speed: query requires just a scan

Length-1

- v > 0 and Length = log2 v +1
e.g., v=9 represented as <000,1001>.

- g-code for v takes 2 log2 v +1 bits
(ie. factor of 2 from optimal)

- Optimal for Pr(v) = 1/2v2, and i.i.d integers

[q times 0s] 1

Log k bits

- It is a parametric code: depends on k
- Quotient q=(v-1)/k, and the rest is r= v – k * q – 1
- Useful when integers concentrated around k
- How do we choose k ?
- Usually k 0.69 * mean(v) [Bernoulli model]
- Optimal for Pr(v) = p (1-p)v-1, where mean(v)=1/p, and i.i.d ints

Use b (e.g. 2) bits to encode 128 numbers or create exceptions

3

42

2

3

3

1

1

…

3

3

23

1

2

11

10

11

11

01

01

…

11

11

01

10

42

23

a block of 128 numbers

Translate data: [base, base + 2b-1] [0,2b-1]

Encode exceptions: ESC or pointers

Choose b to encode 90% values, or trade-off:

b waste more bits, b more exceptions

D = 1 1 1 2 2 2 2 4 3 1 1 1

M = 1 2 3 5 7 9 11 15 18 19 20 21

lo=9+1=10, hi=21, num=6

lo=1, hi=9-1=8, num=5

- Recursive coding preorder traversal of a balanced binary tree
- At every step we know (initially, they are encoded):
- num= |M| = 12, Lidx=1, low= 1, Ridx=12, hi= 21

- Take the middle element: h= (Lidx+Ridx)/2=6 M[6]=9,
- left_size= h – Lidx = 5, right_size= Ridx-h = 6
- low+left_size =1+5 = 6 ≤ M[h] ≤ hi– right_size=(21 – 6)= 15
- We can encode 9 in log2 (15-6+1) = 4 bits

2

4

6

13

32

1

2

3

5

8

13

21

34

- Retrieve all pages matching the query

Brutus

the

Caesar

4

13

17

2

4

6

13

32

1

2

3

5

8

13

21

34

Best order for query processing ?

- Shorter lists first…

Brutus

The

Calpurnia

4

13

17

Query: BrutusANDCalpurniaANDThe

- Expand the posting lists with word positions
- to:
- 2:1,17,74,222,551;4:8,16,190,429,433;7:13,23,191; ...

- be:
- 1:17,19; 4:17,191,291,430,434;5:14,19,101; ...

- to:
- Larger space occupancy, 5÷8% on Web

2

4

6

13

32

1

2

3

5

8

13

21

34

- Retrieve all pages matching the query
- Order pages according to various scores:
- Term position & freq (body, title, anchor,…)
- Link popularity
- User clicks or preferences

Brutus

the

Caesar

4

13

17

?

Page archive

Crawler

Web

Page

Analizer

Indexer

Query

resolver

Ranker

Control

text

auxiliary

Structure

Query

Web Algorithmics

Text-based Ranking

(1° generation)

æ

ö

n

=

ç

÷

idf

log

where nt = #docs containing term t

n = #docs in the indexed collection

è

ø

t

n

t

=

tf

Frequency of term t in doc d = #occt / |d|

t,d

Vector Space model

Postulate: Documents that are “close together”

in the vector space talk about the same things.

Euclidean distance sensible to vector length !!

Easy to Spam

t3

cos(a)= v w / ||v|| * ||w||

d2

d3

Sophisticated algos

to find top-k docs

for a query Q

d1

a

t1

d5

t2

d4

The user query is a very short doc

- Preprocess: Assign to each term, its m best documents
- Search:
- If |Q| = q terms, merge their preferred lists ( mq answers).
- Compute COS between Q and these docs, and choose the top k.

Need to pick m>k to work well empirically.

Now SE use tf-idf PLUS PageRank (PLUS other weights)

Web Algorithmics

Link-based Ranking

(2° generation)

- First generation: using link counts as simple measures of popularity.
- Undirected popularity:
- Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5).

- Directed popularity:
- Score of a page = number of its in-links (es. 3).

- Undirected popularity:

Easy to SPAM

- Each link has its own importance!!
- PageRank is
- independent of the query
- many interpretations…

Any node

d

1-d

Neighbors

Principal

eigenvector

Fixed value

B(i) : set of pages linking to i.

#out(i): number of outgoing links from i.

Any node

d

“In the steady state” each page has a long-term visit rate

- use this as the page’s score.

1-d

Neighbors

- Graph (intuitive interpretation)
- Co-citation

- Matrix (easy for computation)
- Eigenvector computation or a linear system solution

- Markov Chain (useful to prove convergence)
- a sort of Usage Simulation

- Preprocessing:
- Given graph of links, build matrix L
- Compute its principal eigenvector r
- r[i] is the pagerank of page i
We are interested in the relative order

- Query processing:
- Retrieve pages containing query terms
- Rank them by their Pagerank
The final order is query-independent