Web Algorithmics

Web Search Engines

Goal of a Search Engine

Retrieve docs that are “relevant” for the user query

  • Doc: Word or PDF file, web page, email, blog, e-book, ...
  • Query: “bag of words” paradigm

Relevant ?!?

Two main difficulties

The Web:

  • Language and encodings: hundreds…
  • Distributed authorship: SPAM, format-less,…
  • Dynamic: in one year 35% of pages survive, 20% remain untouched

Extracting “significant data” is difficult !!

The User:

  • Query composition: short (2.5 terms on avg) and imprecise
  • Query results: 85% of users look at just one result page
  • Several needs: informational, navigational, transactional

Matching “user needs” is difficult !!

Evolution of Search Engines
  • First generation-- use only on-page, web-text data
    • Word frequency and language
  • Second generation-- use off-page, web-graph data
    • Link (or connectivity) analysis
    • Anchor-text (How people refer to a page)
  • Third generation-- answer “the need behind the query”
    • Focus on “user need”, rather than on query
    • Integrate multiple data-sources
    • Click-through data

1995-1997: AltaVista, Excite, Lycos, etc. (first generation)

1998: Google (second generation)

Third generation: Google, Yahoo, MSN, ASK, …

Fourth generation: Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]

Web Algorithmics

The structure of a Search Engine

The structure

[Architecture diagram: a Crawler fetches pages from the Web into a Page archive; a Page analyzer and the Indexer build the text and auxiliary (structure) indexes, under a Control module; at query time the Query resolver and the Ranker answer the user query.]
Problem: Indexing
  • Consider Wikipedia En:
    • Collection size ≈ 10 Gbytes
    • #docs ≈ 4 × 10^6
    • #terms in total > 1 billion (avg term length = 6 chars)
    • #distinct terms = several millions
  • Which kind of data structure do we build to support word-based searches ?
DB-based solution: Term-Doc matrix

  • #docs ≈ 4M, #terms > 1M → Space ≈ 4Tb !
  • Entry (t, d) is 1 if doc d contains term t, 0 otherwise

Current solution: Inverted index

Brutus → 2, 4, 6, 10, 32
the → 1, 2, 3, 5, 8, 13, 21, 34
Calpurnia → 13, 16

  • A term like Calpurnia may use log2 N bits per occurrence
  • A term like the should take about 1 bit per occurrence
  • Current compressed indexes take about 13% of the original text size

Gap-coding for postings

  • Sort the docIDs
  • Store gaps between consecutive docIDs:
    • Brutus: 33, 47, 154, 159, 202 … → 33, 14, 107, 5, 43 …

Two advantages:

  • Space: store smaller integers (clustering?)
  • Speed: query requires just a scan
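The two steps above can be sketched as follows (illustrative Python, not the slides' code; gap lists stand in for real compressed streams):

```python
def to_gaps(docids):
    """Encode a sorted docID list as gaps between consecutive IDs."""
    prev = 0
    gaps = []
    for d in docids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_gaps(gaps):
    """Decode gaps back to absolute docIDs with a single scan."""
    docids, cur = [], 0
    for g in gaps:
        cur += g
        docids.append(cur)
    return docids

print(to_gaps([33, 47, 154, 159, 202]))   # → [33, 14, 107, 5, 43]
```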
γ-code for integer encoding

  • For v > 0, Length = ⌊log2 v⌋ + 1
  • Write Length − 1 zeros, followed by the binary representation of v
    • e.g., v = 9 is represented as <000, 1001>
  • The γ-code for v takes 2 ⌊log2 v⌋ + 1 bits (i.e., a factor of 2 from optimal)
  • Optimal for Pr(v) = 1/(2v²), and i.i.d. integers
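A minimal sketch of the γ-encoder and decoder (illustrative Python; strings of '0'/'1' stand in for real bit streams):

```python
def gamma_encode(v):
    """Elias gamma: (Length - 1) zeros, then the binary representation of v."""
    assert v > 0
    binary = bin(v)[2:]                  # Length = floor(log2 v) + 1 bits
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    """Count the leading zeros to learn Length, then read the Length-bit value."""
    length = bits.index("1") + 1
    return int(bits[length - 1: 2 * length - 1], 2)

print(gamma_encode(9))   # → '0001001', i.e. the slide's <000, 1001>
```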
Rice code (simplification of Golomb code)

  • It is a parametric code: depends on k (a power of 2)
  • Quotient q = ⌊(v − 1)/k⌋, and the rest is r = v − k·q − 1
  • Encode q in unary ([q times 0s] followed by a 1), and r in log2 k bits
  • Useful when the integers are concentrated around k
  • How do we choose k ?
    • Usually k ≈ 0.69 · mean(v) [Bernoulli model]
    • Optimal for Pr(v) = p (1 − p)^(v−1), where mean(v) = 1/p, and i.i.d. integers
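A sketch of the Rice coder under these definitions (illustrative Python; k is assumed a power of 2, bit strings used for clarity):

```python
def rice_encode(v, k):
    """Unary quotient ([q zeros] then a 1) followed by a log2(k)-bit remainder."""
    q = (v - 1) // k
    r = v - k * q - 1
    log_k = k.bit_length() - 1           # works because k is a power of 2
    rest = format(r, "b").zfill(log_k) if log_k else ""
    return "0" * q + "1" + rest

def rice_decode(bits, k):
    q = bits.index("1")                  # number of leading zeros
    log_k = k.bit_length() - 1
    r = int(bits[q + 1: q + 1 + log_k], 2) if log_k else 0
    return k * q + r + 1

print(rice_encode(9, 4))   # → '00100' (q = 2, r = 0)
```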
PForDelta coding

  • Use b (e.g., 2) bits to encode each value in a block of 128 numbers, or create exceptions
  • Example block: 3, 42, 2, 3, 3, 1, 1, 3, 3, 23, 1, 2 → with b = 2 the in-range values become 11, 10, 11, 11, 01, 01, 11, 11, 01, 10, while 42 and 23 are stored as exceptions

Translate data: [base, base + 2^b − 1] → [0, 2^b − 1]

Encode exceptions: ESC or pointers

Choose b to encode 90% of the values, or trade-off:

larger b wastes more bits, smaller b creates more exceptions

Interpolative coding

D = 1 1 1 2 2 2 2 4 3 1 1 1 (gaps)
M = 1 2 3 5 7 9 11 15 18 19 20 21 (prefix sums)

  • Recursive coding = preorder traversal of a balanced binary tree
  • At every step we know (initially, they are encoded): num = |M| = 12, Lidx = 1, lo = 1, Ridx = 12, hi = 21
  • Take the middle element: h = ⌊(Lidx + Ridx)/2⌋ = 6 → M[6] = 9
  • left_size = h − Lidx = 5, right_size = Ridx − h = 6
  • lo + left_size = 1 + 5 = 6 ≤ M[h] ≤ hi − right_size = 21 − 6 = 15
  • We can encode 9 in ⌈log2 (15 − 6 + 1)⌉ = 4 bits
  • Recurse on the left half (lo = 1, hi = 9 − 1 = 8, num = 5) and on the right half (lo = 9 + 1 = 10, hi = 21, num = 6)
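The recursion can be sketched as follows (illustrative Python that returns, for each value, the number of bits its feasible range requires; 0 bits means the value is forced by its bounds):

```python
from math import ceil, log2

def interp_bits(M, lo, hi):
    """Preorder traversal: encode the middle element within [lo+left, hi-right]."""
    if not M:
        return []
    h = (len(M) - 1) // 2                # middle (matches the slide's 1-based h = 6)
    left, mid, right = M[:h], M[h], M[h + 1:]
    lo_mid = lo + len(left)              # mid >= lo + left_size
    hi_mid = hi - len(right)             # mid <= hi - right_size
    bits = ceil(log2(hi_mid - lo_mid + 1)) if hi_mid > lo_mid else 0
    return ([(mid, bits)]
            + interp_bits(left, lo, mid - 1)
            + interp_bits(right, mid + 1, hi))

M = [1, 2, 3, 5, 7, 9, 11, 15, 18, 19, 20, 21]
print(interp_bits(M, 1, 21)[0])   # → (9, 4): 9 fits in log2(15 - 6 + 1) = 4 bits
```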
Query processing

  • Retrieve all pages matching the query

Brutus → 2, 4, 6, 13, 32
the → 1, 2, 3, 5, 8, 13, 21, 34
Caesar → 4, 13, 17
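Retrieval reduces to intersecting sorted posting lists; a minimal sketch (illustrative Python, lists as reconstructed above):

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists with one linear scan."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

brutus = [2, 4, 6, 13, 32]
the = [1, 2, 3, 5, 8, 13, 21, 34]
caesar = [4, 13, 17]
print(intersect(intersect(brutus, the), caesar))   # → [13]
```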

Some optimization

Best order for query processing ?

  • Shorter lists first…

Brutus → 2, 4, 6, 13, 32
The → 1, 2, 3, 5, 8, 13, 21, 34
Calpurnia → 4, 13, 17

Query: Brutus AND Calpurnia AND The
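The heuristic can be sketched as follows (illustrative Python): sort the lists by length so that intermediate results stay as small as possible.

```python
from functools import reduce

def intersect(p1, p2):
    """Linear-scan intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(posting_lists):
    """Process the shortest lists first."""
    return reduce(intersect, sorted(posting_lists, key=len))

q = [[2, 4, 6, 13, 32],              # Brutus
     [1, 2, 3, 5, 8, 13, 21, 34],    # The
     [4, 13, 17]]                    # Calpurnia
print(and_query(q))   # → [13]
```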

Phrase queries

  • Expand the posting lists with word positions
    • to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
    • be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
  • Larger space occupancy, ≈ 5–8% on the Web
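With positional postings, a phrase query like “to be” becomes an adjacency check on positions (illustrative Python, using the lists above as doc → positions maps):

```python
def phrase_match(first, second):
    """Docs where some position p of `first` has p + 1 among `second`'s positions."""
    hits = {}
    for doc in first.keys() & second.keys():
        nxt = set(second[doc])
        matches = [p for p in first[doc] if p + 1 in nxt]
        if matches:
            hits[doc] = matches
    return hits

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to, be))   # → {4: [16, 190, 429, 433]}
```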
Query processing

  • Retrieve all pages matching the query
  • Order pages according to various scores:
    • Term position & freq (body, title, anchor,…)
    • Link popularity
    • User clicks or preferences

Brutus → 2, 4, 6, 13, 32
the → 1, 2, 3, 5, 8, 13, 21, 34
Caesar → 4, 13, 17

The structure

[The search-engine architecture diagram is shown again as a recap.]

Web Algorithmics

Text-based Ranking

(1st generation)

A famous “weight”: tf-idf

tf_{t,d} = frequency of term t in doc d = #occ_{t,d} / |d|

idf_t = log( n / n_t ), where n_t = #docs containing term t, and n = #docs in the indexed collection

w_{t,d} = tf_{t,d} × idf_t

Vector Space model
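The two factors can be computed as follows (illustrative Python on a made-up toy collection; docs are lists of tokens):

```python
from math import log

def tf_idf_weights(docs):
    """w[t,d] = tf(t,d) * idf(t), with tf = #occ / |d| and idf = log(n / n_t)."""
    n = len(docs)
    n_t = {}                                     # document frequency of each term
    for d in docs:
        for t in set(d):
            n_t[t] = n_t.get(t, 0) + 1
    return [{t: (d.count(t) / len(d)) * log(n / n_t[t]) for t in set(d)}
            for d in docs]

docs = [["brutus", "kills", "caesar"],
        ["caesar", "crosses", "the", "rubicon"],
        ["the", "ides", "of", "march"]]
w = tf_idf_weights(docs)
print(round(w[0]["brutus"], 3))   # tf = 1/3, idf = log(3/1) → 0.366
```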

A graphical example

[Diagram: docs d1…d5 as vectors over term axes t1, t2, t3; a is the angle between two vectors.]

  • Postulate: documents that are “close together” in the vector space talk about the same things.
  • Euclidean distance is sensitive to vector length !! Easy to spam.
  • Use the angle instead: cos(a) = v · w / (||v|| · ||w||)
  • Sophisticated algos find the top-k docs for a query Q
  • The user query is a very short doc
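The cosine measure can be sketched as follows (illustrative Python), showing why scaling a vector, e.g. by keyword stuffing that multiplies all term counts, does not change the score:

```python
from math import sqrt

def cosine(v, w):
    """cos(a) = v . w / (||v|| * ||w||)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = lambda x: sqrt(sum(a * a for a in x))
    return dot / (norm(v) * norm(w))

d = [1.0, 2.0, 0.0]
stuffed = [10.0, 20.0, 0.0]        # same direction, 10x the length
print(cosine(d, stuffed))          # ≈ 1.0: the angle, unlike the length, is unchanged
```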

Approximate top-k results

  • Preprocess: assign to each term its m best documents
  • Search:
    • If |Q| = q terms, merge their preferred lists (≤ m·q answers)
    • Compute COS between Q and these docs, and choose the top k

Need to pick m > k to work well empirically.

Today search engines use tf-idf PLUS PageRank (PLUS other weights).
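The search step can be sketched as follows (illustrative Python; the preferred lists and scores below are made-up toy data):

```python
def approx_top_k(query_terms, best_docs, score, k):
    """Merge each query term's m best docs, score only those, keep the top k."""
    candidates = set()
    for t in query_terms:
        candidates.update(best_docs[t])          # the preferred list of term t
    return sorted(candidates, key=lambda d: score[d], reverse=True)[:k]

best_docs = {"brutus": [3, 7], "caesar": [7, 9]}  # m = 2 docs per term
score = {3: 0.2, 7: 0.9, 9: 0.5}                  # e.g. cosine with the query
print(approx_top_k(["brutus", "caesar"], best_docs, score, k=2))   # → [7, 9]
```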

Web Algorithmics

Link-based Ranking

(2nd generation)

Query-independent ordering

  • First generation: using link counts as simple measures of popularity.
    • Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g., 3 + 2 = 5).
    • Directed popularity: score of a page = number of its in-links (e.g., 3).

Easy to SPAM

Second generation: PageRank
  • Each link has its own importance!!
  • PageRank is
    • independent of the query
    • many interpretations…
Google’s Pagerank

r(i) = d · Σ_{j ∈ B(i)} r(j) / #out(j) + (1 − d) / n

  • r is the principal eigenvector of the (modified) link matrix
  • (1 − d)/n is a fixed value added to every page
  • B(i): set of pages linking to i.
  • #out(i): number of outgoing links from i.
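The fixed point of this equation can be found by power iteration, sketched below (illustrative Python on a made-up 3-page graph; dangling nodes spread their rank uniformly, one common convention):

```python
def pagerank(out_links, d=0.85, iters=50):
    """Iterate r(i) = d * sum_{j in B(i)} r(j)/#out(j) + (1-d)/n to a fixed point."""
    n = len(out_links)
    r = {i: 1.0 / n for i in out_links}
    for _ in range(iters):
        new = {i: (1 - d) / n for i in out_links}
        for j, outs in out_links.items():
            if outs:
                for i in outs:
                    new[i] += d * r[j] / len(outs)
            else:                           # dangling node: spread rank uniformly
                for i in new:
                    new[i] += d * r[j] / n
        r = new
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r = pagerank(graph)
```
Here C, with two in-links, ends up with the highest rank.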

Three different interpretations

  • Graph (intuitive interpretation)
    • Co-citation
  • Matrix (easy for computation)
    • Eigenvector computation or a linear system solution
  • Markov Chain (useful to prove convergence)
    • a sort of Usage Simulation
    • A random surfer follows one of the current page’s links to its neighbors with probability d, and jumps to any node with probability 1 − d; “in the steady state” each page has a long-term visit rate: use this as the page’s score.
Pagerank: use in Search Engines
  • Preprocessing:
    • Given graph of links, build matrix L
    • Compute its principal eigenvector r
    • r[i] is the pagerank of page i

We are interested in the relative order

  • Query processing:
    • Retrieve pages containing query terms
    • Rank them by their Pagerank

The final order is query-independent
