Web algorithmics


Web Algorithmics

Web Search Engines


Goal of a Search Engine

Retrieve docs that are “relevant” for the user query

  • Doc: Word or PDF file, web page, email, blog, e-book, ...

  • Query: paradigm “bag of words”

    Relevant ?!?


Two main difficulties

The Web:

  • Language and encodings: hundreds…
  • Distributed authorship: SPAM, format-less, …
  • Dynamic: in one year, 35% of pages survive and 20% are untouched

The User:

  • Query composition: short (2.5 terms on average) and imprecise
  • Query results: 85% of users look at just one result page
  • Several needs: informational, navigational, transactional

Two main difficulties

Extracting “significant data” is difficult !!

Matching “user needs” is difficult !!


Evolution of Search Engines

  • First generation-- use only on-page, web-text data

    • Word frequency and language

  • Second generation-- use off-page, web-graph data

    • Link (or connectivity) analysis

    • Anchor-text (How people refer to a page)

  • Third generation-- answer “the need behind the query”

    • Focus on “user need”, rather than on query

    • Integrate multiple data-sources

    • Click-through data

1995-1997: AltaVista, Excite, Lycos, etc.

1998: Google

Today: Google, Yahoo, MSN, ASK, …

Fourth generation → Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]


This is a search engine!!!


Wolfram Alpha


Clusty


Yahoo! Correlator


Web Algorithmics

The structure of a Search Engine


The structure

[Architecture diagram: the Crawler fetches pages from the Web into a Page archive; a Page analyzer and an Indexer build the text and auxiliary index structures; at query time, a Query resolver and a Ranker answer the user query, coordinated by a Control module]


Generating the snippets!


The big fight: find the best ranking...


Ranking: Google vs Google.cn


Problem: Indexing

  • Consider Wikipedia En:
    • Collection size ≈ 10 Gbytes
    • #docs ≈ 4 × 10^6
    • #terms in total > 1 billion (avg term length = 6 chars)
    • #distinct terms = several millions
  • Which kind of data structure do we build to support word-based searches?


DB-based solution: Term-Doc matrix

[Figure: Term-Doc incidence matrix, with rows = terms (> 1M) and columns = docs (≈ 4M); the entry is 1 if the doc contains the word, 0 otherwise]

Space ≈ 4 Tb!


Current solution: Inverted index

[Figure: posting lists: Brutus → 2, 4, 6, 10, 32; the → 1, 2, 3, 5, 8, 13, 21, 34; Calpurnia → 13, 16]

  • A term like Calpurnia may use log2 N bits per occurrence
  • A term like the should take about 1 bit per occurrence

Currently they get ≈ 13% of the original text.


Gap-coding for postings

  • Sort the docIDs

  • Store gaps between consecutive docIDs:

    • Brutus: 33, 47, 154, 159, 202 …

      33, 14, 107, 5, 43 …

      Two advantages:

  • Space: store smaller integers (clustering?)

  • Speed: query requires just a scan
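The gap transform above is a one-line prefix-sum in each direction; a minimal sketch (function names are illustrative, not from the slides):

```python
def to_gaps(docids):
    """Convert a sorted list of docIDs into gaps between consecutive IDs."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Recover the original docIDs with a single prefix-sum scan of the gaps."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

# The slide's example list for Brutus:
print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]
print(from_gaps([33, 14, 107, 5, 43]))    # [33, 47, 154, 159, 202]
```

Decoding never needs random access: a query scans the list once, rebuilding docIDs on the fly, which is the "speed" advantage mentioned above.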


γ-code for integer encoding

  • For v > 0, Length = ⌊log2 v⌋ + 1; the code is <unary(Length − 1), binary(v)>
    e.g., v = 9 is represented as <000, 1001>.
  • The γ-code for v takes 2⌊log2 v⌋ + 1 bits
    (i.e., a factor of 2 from optimal)
  • Optimal for Pr(v) = 1/(2v²), and i.i.d. integers
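The γ-code above can be sketched directly from its definition (bit strings are used here instead of a real bit stream, for clarity):

```python
def gamma_encode(v):
    """γ-code: Length-1 zeros (unary), then v in binary (leading 1 included)."""
    assert v > 0
    b = bin(v)[2:]                    # binary representation, e.g. 9 -> '1001'
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    """Decode one γ-coded integer: count the leading zeros, then read
    that many + 1 bits as the binary value."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))            # '0001001'  i.e. <000, 1001>, 2*3+1 = 7 bits
print(gamma_decode('0001001'))    # 9
```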


Rice code (simplification of Golomb code)

  • It is a parametric code: it depends on k (a power of two)
  • Quotient q = ⌊(v − 1)/k⌋, and the rest is r = v − k·q − 1
  • Code for v: q zeros followed by a 1 (unary code of q), then r in log2 k bits
  • Useful when the integers are concentrated around k
  • How do we choose k?
    • Usually k ≈ 0.69 × mean(v) [Bernoulli model]
    • Optimal for Pr(v) = p(1 − p)^(v−1), where mean(v) = 1/p, and i.i.d. integers
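A minimal sketch of the Rice code as defined above, again with bit strings standing in for a real bit stream (restricted to k ≥ 2 to keep the remainder field non-empty):

```python
def rice_encode(v, k):
    """Rice code with parameter k (a power of two, k >= 2):
    unary quotient, then the remainder in log2(k) fixed bits."""
    assert v > 0 and k >= 2 and k & (k - 1) == 0
    q, r = (v - 1) // k, (v - 1) % k      # r equals v - k*q - 1
    nbits = k.bit_length() - 1            # log2(k)
    return '0' * q + '1' + format(r, 'b').zfill(nbits)

def rice_decode(bits, k):
    q = bits.index('1')                   # count the leading zeros
    nbits = k.bit_length() - 1
    r = int(bits[q + 1:q + 1 + nbits], 2)
    return k * q + r + 1

print(rice_encode(7, 4))        # '0110': q = 1 -> '01', r = 2 -> '10'
print(rice_decode('0110', 4))   # 7
```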


PForDelta coding

Use b (e.g., 2) bits to encode each of 128 numbers, or create exceptions.

[Figure: a block of 128 numbers, e.g. 3 42 2 3 3 1 1 3 3 23 1 2 …, packed into b = 2-bit codes; the out-of-range values 42 and 23 are stored as exceptions]

  • Translate data: [base, base + 2^b − 1] → [0, 2^b − 1]
  • Encode exceptions: ESC or pointers
  • Choose b to encode 90% of the values, or trade off:
    a larger b wastes more bits, a smaller b yields more exceptions


Interpolative coding

D = 1 1 1 2 2 2 2 4 3 1 1 1
M = 1 2 3 5 7 9 11 15 18 19 20 21

  • Recursive coding ≈ preorder traversal of a balanced binary tree
  • At every step we know (initially, they are encoded):
    • num = |M| = 12, Lidx = 1, low = 1, Ridx = 12, hi = 21
  • Take the middle element: h = ⌊(Lidx + Ridx)/2⌋ = 6 → M[6] = 9
  • left_size = h − Lidx = 5, right_size = Ridx − h = 6
  • low + left_size = 1 + 5 = 6 ≤ M[h] ≤ hi − right_size = 21 − 6 = 15
  • We can encode 9 in ⌈log2(15 − 6 + 1)⌉ = 4 bits
  • Then recurse on the left half (lo = 1, hi = 9 − 1 = 8, num = 5) and on the right half (lo = 9 + 1 = 10, hi = 21, num = 6)
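The recursion above can be sketched as follows; instead of writing real bits, this version records for each element the number of bits its tightened interval allows, which is enough to check the slide's arithmetic:

```python
from math import ceil, log2

def interpolative_encode(M, lo, hi, out):
    """Binary interpolative coding sketch: emit the middle element of the
    sorted list M using only as many bits as the interval allows, then
    recurse on the two halves with tightened bounds."""
    if not M:
        return
    h = (len(M) - 1) // 2            # middle index (the slide's h = 6 on 12 elements)
    left, mid, right = M[:h], M[h], M[h + 1:]
    a = lo + len(left)               # mid is at least lo + |left|
    b = hi - len(right)              # ... and at most hi - |right|
    width = ceil(log2(b - a + 1)) if b > a else 0
    out.append((mid, width))         # a real coder would write (mid - a) in `width` bits
    interpolative_encode(left, lo, mid - 1, out)
    interpolative_encode(right, mid + 1, hi, out)

M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]
out = []
interpolative_encode(M, 1, 21, out)
print(out[0])                        # (9, 4): the root element 9 fits in 4 bits
```

When the interval collapses (b = a) the element is forced and costs 0 bits, which is where interpolative coding beats gap-based codes on clustered docIDs.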


Query processing

  • Retrieve all pages matching the query

[Figure: posting lists: Brutus → 2, 4, 6, 13, 32; the → 1, 2, 3, 5, 8, 13, 21, 34; Caesar → 4, 13, 17]
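Retrieving the matching pages for an AND query boils down to intersecting sorted posting lists; the classic merge-based scan, in a minimal sketch:

```python
def intersect(p1, p2):
    """Merge-based AND of two sorted posting lists: one linear scan."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

brutus = [2, 4, 6, 13, 32]
the = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, the))   # [2, 13]
```

The cost is proportional to the sum of the two list lengths, which is why gap-coded lists that support fast sequential decoding are a good fit.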


Some optimization

Best order for query processing?

  • Shorter lists first…

[Figure: posting lists: Brutus → 2, 4, 6, 13, 32; The → 1, 2, 3, 5, 8, 13, 21, 34; Calpurnia → 4, 13, 17]

Query: Brutus AND Calpurnia AND The


Phrase queries

  • Expand the posting lists with word positions
    • to: 2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; ...
    • be: 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
  • Larger space occupancy, ≈ 5-8% on the Web
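With positional postings, the phrase "to be" matches wherever "be" occurs one position after "to" in the same doc; a minimal sketch over the slide's example lists:

```python
def phrase_match(pos1, pos2):
    """Positional AND for a two-word phrase: for each doc in both lists,
    keep the positions p of word 1 such that word 2 occurs at p + 1.
    pos1/pos2 map docID -> sorted list of word positions."""
    hits = {}
    for doc in pos1.keys() & pos2.keys():   # docs containing both words
        p2 = set(pos2[doc])
        match = [p for p in pos1[doc] if p + 1 in p2]
        if match:
            hits[doc] = match
    return hits

# The slide's positional postings for "to" and "be":
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to, be))   # {4: [16, 190, 429, 433]}
```

Only doc 4 contains both words, and there "to be" starts at positions 16, 190, 429, and 433.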


Query processing

  • Retrieve all pages matching the query
  • Order pages according to various scores:
    • Term position & frequency (body, title, anchor, …)
    • Link popularity
    • User clicks or preferences

[Figure: posting lists for Brutus, the, Caesar]


The structure

[Architecture diagram repeated: Crawler, Page archive, Page analyzer, Indexer, Query resolver, Ranker, Control]


Web Algorithmics

Text-based Ranking

(1st generation)


A famous “weight”: tf-idf

idf_t = log(n / n_t)

where n_t = #docs containing term t, and n = #docs in the indexed collection.

tf_{t,d} = frequency of term t in doc d = #occ_{t,d} / |d|

The tf-idf weight of term t in doc d is then w_{t,d} = tf_{t,d} × idf_t.

Vector Space model


A graphical example

Postulate: documents that are “close together” in the vector space talk about the same things.

Euclidean distance is sensitive to vector length!! Easy to spam.

Use the angle instead: cos(α) = v · w / (||v|| × ||w||)

[Figure: docs d1, …, d5 plotted on term axes t1, t2, t3; α is the angle between two doc vectors]

Sophisticated algos find the top-k docs for a query Q. The user query is a very short doc.
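A toy end-to-end sketch of the model above: tf-idf vectors built exactly as defined on the previous slide, compared with the cosine measure (the three documents and their tokenization are invented for illustration):

```python
from math import log, sqrt

def tfidf_vectors(docs):
    """Build tf-idf vectors for a toy collection; docs is a list of token
    lists. tf = #occ / |d| and idf = log(n / n_t), as on the slides."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}     # n_t per term
    vecs = [[d.count(t) / len(d) * log(n / df[t]) for t in vocab] for d in docs]
    return vocab, vecs

def cosine(v, w):
    """cos of the angle between two vectors: dot product over norms."""
    dot = sum(a * b for a, b in zip(v, w))
    nv = sqrt(sum(a * a for a in v))
    nw = sqrt(sum(b * b for b in w))
    return dot / (nv * nw) if nv and nw else 0.0

docs = [["brutus", "killed", "caesar"],
        ["caesar", "ruled", "rome"],
        ["brutus", "was", "a", "senator"]]
vocab, vecs = tfidf_vectors(docs)
query_vec = vecs[0]                 # treat doc 0 as the "query": a very short doc
scores = [cosine(query_vec, v) for v in vecs]
```

A query is scored exactly like a document, so the same machinery ranks all docs by their angle to the query vector.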


Approximate top-k results

  • Preprocess: assign to each term its m best documents
  • Search:
    • If |Q| = q terms, merge their preferred lists (≤ mq answers).
    • Compute COS between Q and these docs, and choose the top k.

Need to pick m > k for this to work well empirically.

Now SEs use tf-idf PLUS PageRank (PLUS other weights).
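The preprocess/search scheme above is often called "champion lists"; a sketch under invented toy weights (the term-to-weight maps and the additive scorer are illustrative assumptions, not from the slides):

```python
import heapq

def build_champion_lists(scores, m):
    """Preprocess: for each term, keep only its m highest-weighted docs.
    `scores` maps term -> {docID: weight} (e.g. tf-idf weights)."""
    return {t: set(heapq.nlargest(m, s, key=s.get)) for t, s in scores.items()}

def approx_top_k(query_terms, champions, full_score, k):
    """Search: union the champion lists of the q query terms (<= m*q docs),
    score only those candidates, and return the best k."""
    candidates = set().union(*(champions.get(t, set()) for t in query_terms))
    return heapq.nlargest(k, candidates, key=full_score)

# Hypothetical per-term weights:
scores = {"brutus": {1: 0.9, 2: 0.5, 3: 0.1},
          "caesar": {2: 0.8, 3: 0.7, 4: 0.2}}
champions = build_champion_lists(scores, m=2)   # brutus -> {1, 2}, caesar -> {2, 3}

def full_score(d):                  # toy scorer: sum of per-term weights
    return sum(s.get(d, 0.0) for s in scores.values())

top = approx_top_k(["brutus", "caesar"], champions, full_score, k=2)
print(top)                          # [2, 1]
```

The result is approximate: doc 4 is never scored because it is in nobody's champion list, which is exactly the trade-off the slide describes (hence m > k).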


Web Algorithmics

Link-based Ranking

(2nd generation)


Query-independent ordering

  • First generation: using link counts as simple measures of popularity.
    • Undirected popularity:
      • Each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3+2=5).
    • Directed popularity:
      • Score of a page = number of its in-links (e.g., 3).

Easy to SPAM


Second generation: PageRank

  • Each link has its own importance!!

  • PageRank is

    • independent of the query

    • many interpretations…


Basic Intuition…

[Figure: random surfer: with probability d it moves to one of the current page's neighbors, with probability 1−d it jumps to any node]


Google’s Pagerank

r(i) = d × Σ_{j ∈ B(i)} r(j) / #out(j) + (1 − d) / N

  • B(i): set of pages linking to i.
  • #out(i): number of outgoing links from i.
  • N: total number of pages.
  • The vector r is the principal eigenvector of the underlying matrix; the term (1 − d) / N is a fixed value.
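The fixed point of the formula above can be computed by power iteration; a minimal sketch on an invented toy graph (dangling pages simply drop their mass here, a simplification real implementations handle explicitly):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for r(i) = d * sum_{j in B(i)} r(j)/#out(j) + (1-d)/N.
    `links` maps each page to the list of pages it points to."""
    pages = list(links)
    N = len(pages)
    r = {p: 1.0 / N for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / N for p in pages}   # the fixed (1-d)/N share
        for j, outs in links.items():
            if not outs:
                continue                        # dangling node: mass dropped in this sketch
            share = d * r[j] / len(outs)        # r(j)/#out(j), damped by d
            for i in outs:
                nxt[i] += share
        r = nxt
    return r

# Toy graph: page 'a' is pointed to by everybody, so it should rank highest.
links = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a", "b"]}
r = pagerank(links)
```

Since only the relative order matters for ranking, iterating until the order stabilizes is enough in practice.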


Three different interpretations

“In the steady state” each page has a long-term visit rate: use this as the page’s score.

[Figure: random surfer: probability d to a neighbor, 1−d to any node]

  • Graph (intuitive interpretation)

    • Co-citation

  • Matrix (easy for computation)

    • Eigenvector computation or a linear system solution

  • Markov Chain (useful to prove convergence)

    • a sort of Usage Simulation


Pagerank: use in Search Engines

  • Preprocessing:

    • Given graph of links, build matrix L

    • Compute its principal eigenvector r

    • r[i] is the pagerank of page i

      We are interested in the relative order

  • Query processing:

    • Retrieve pages containing query terms

    • Rank them by their Pagerank

      The final order is query-independent

