Algorithms for large data sets


Algorithms for Large Data Sets. Ziv Bar-Yossef. Lecture 5 April 23, 2006. Ranking Algorithms. PageRank [Page, Brin, Motwani, Winograd 1998]. Motivating principles Rank of p should be proportional to the rank of the pages that point to p


Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 5

April 23, 2006

PageRank [Page, Brin, Motwani, Winograd 1998]

  • Motivating principles

    • Rank of p should be proportional to the rank of the pages that point to p

      • Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva

    • Rank of p should depend on the number of pages “co-cited” with p

      • Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth

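PageRank, Attempt #1

  • $r(p) = \sum_{q:\, q \to p} \frac{r(q)}{\mathrm{out}(q)}$, where $\mathrm{out}(q)$ is the number of outlinks of q

  • Additional Conditions:

    • r is non-negative: $r \ge 0$

    • r is normalized: $\|r\|_1 = 1$

  • B = normalized adjacency matrix: $B_{q,p} = 1/\mathrm{out}(q)$ if $q \to p$, and $B_{q,p} = 0$ otherwise

  • Then: $r^T B = r^T$

    • r is a non-negative normalized left eigenvector of B with eigenvalue 1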

PageRank, Attempt #1

  • Solution exists only if B has eigenvalue 1

  • Problem: B may not have 1 as an eigenvalue

    • Because some of its rows are 0.

    • Example:
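
A small instance, chosen here for illustration and not necessarily the one on the original slide: two pages with a single link $1 \to 2$, so page 2 has no outlinks and its row of B is zero.

```latex
B = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},
\qquad
\det(B - \lambda I) = \lambda^2
\;\Longrightarrow\; \lambda = 0 \text{ is the only eigenvalue},
\qquad
r^T B = (0,\; r_1) = (r_1,\; r_2) \;\Longrightarrow\; r = 0 .
```

The only vector satisfying the eigenvalue-1 condition is the zero vector, which cannot be normalized to $\|r\|_1 = 1$.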

PageRank, Attempt #2

  • $r(p) = \alpha \sum_{q:\, q \to p} \frac{r(q)}{\mathrm{out}(q)}$, i.e., $r^T = \alpha\, r^T B$

  • $\alpha$ = normalization constant

    • r is a non-negative normalized left eigenvector of B with eigenvalue $1/\alpha$

PageRank, Attempt #2

  • Any nonzero eigenvalue $\lambda$ of B may give a solution

    • $\alpha = 1/\lambda$

    • r = any non-negative normalized left eigenvector of B with eigenvalue $\lambda$

  • Which solution to pick?

    • Pick a “principal eigenvector” (i.e., corresponding to maximal $\lambda$)

  • How to find a solution?

    • Power iterations

PageRank, Attempt #2

  • Problem #1: Maximal eigenvalue may have multiplicity > 1

    • Several possible solutions

    • Happens, for example, when graph is disconnected

  • Problem #2: Rank accumulates at sinks.

    • Only sinks, or nodes from which no sink can be reached, can have nonzero rank mass.

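PageRank, Final Definition

  • $r(p) = \alpha \left( \sum_{q:\, q \to p} \frac{r(q)}{\mathrm{out}(q)} + e(p) \right)$

  • e = “rank source” vector

    • Standard setting: $e(p) = \epsilon/n$ for all p ($\epsilon < 1$)

  • $\mathbf{1}$ = the all 1’s vector

  • Then: $r^T = \alpha\, r^T (B + \mathbf{1} e^T)$, since $r^T \mathbf{1} = \|r\|_1 = 1$

    • r is a non-negative normalized left eigenvector of $(B + \mathbf{1} e^T)$ with eigenvalue $1/\alpha$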

PageRank, Final Definition

  • Any nonzero eigenvalue of $(B + \mathbf{1} e^T)$ may give a solution

  • Pick r to be a principal left eigenvector of $(B + \mathbf{1} e^T)$

  • Will show:

    • Principal eigenvalue has multiplicity 1, for any graph

    • There exists a non-negative left eigenvector

  • Hence, PageRank always exists and is uniquely defined

  • Due to rank source vector, rank no longer accumulates at sinks
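
The final definition can be tried out with a few lines of power iteration. The following is a minimal sketch, not the lecture's own code: the function name `pagerank_rank_source`, the parameter `eps` (total mass of the rank source, so that each page receives `eps / n`), and the four-page graph are all assumptions made for illustration. Every page in the toy graph has at least one outlink.

```python
def pagerank_rank_source(out_links, eps=0.15, iters=100):
    """Iterate r^T <- r^T (B + 1 e^T), renormalizing to ||r||_1 = 1 each step.

    out_links: dict mapping every page to the list of pages it links to.
    eps: assumed total mass of the rank source, e(p) = eps / n.
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}           # start from the uniform vector
    for _ in range(iters):
        # (r^T 1) e^T term: because ||r||_1 = 1, every page receives eps / n.
        nxt = {p: eps / n for p in pages}
        # r^T B term: q passes r(q) / out(q) to each page it links to.
        for q, targets in out_links.items():
            for p in targets:
                nxt[p] += r[q] / len(targets)
        total = sum(nxt.values())             # estimate of the eigenvalue 1/alpha
        r = {p: v / total for p, v in nxt.items()}
    return r


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a, d -> c.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_rank_source(graph)
```

Page d has no inlinks, so all of its rank comes from the rank source; page c, which is linked from every other page, ends up with the highest rank.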

An Alternative View of PageRank: The Random Surfer Model

  • When visiting a page p, a “random surfer”:

    • With probability 1 - d, selects a random outlink p → q and goes to visit q. (“focused browsing”)

    • With probability d, jumps to a random web page q. (“loss of interest”)

    • If p has no outlinks, assume it has a self loop.

  • P: probability transition matrix: $P_{p,q} = \frac{1-d}{\mathrm{out}(p)} + \frac{d}{n}$ if $p \to q$, and $P_{p,q} = \frac{d}{n}$ otherwise

PageRank & Random Surfer Model

Therefore, r is a principal left eigenvector of $(B + \mathbf{1} e^T)$ if and only if it is a principal left eigenvector of P.
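
One way to see the equivalence is sketched below under two assumptions that are not spelled out in the transcript: B is taken with a self loop added to every page without outlinks (so each row of B sums to 1), and the rank source is uniform, $e = (\epsilon/n)\mathbf{1}$, where the symbol $\epsilon$ for its total mass is a naming choice made here.

```latex
\begin{aligned}
P &= (1-d)\,B + \frac{d}{n}\,\mathbf{1}\mathbf{1}^T, \\
B + \mathbf{1}e^T &= B + \frac{\epsilon}{n}\,\mathbf{1}\mathbf{1}^T
  = (1+\epsilon)\left( \frac{1}{1+\epsilon}\,B
      + \frac{\epsilon}{(1+\epsilon)\,n}\,\mathbf{1}\mathbf{1}^T \right)
  = (1+\epsilon)\,P
  \quad \text{with } d = \frac{\epsilon}{1+\epsilon}.
\end{aligned}
```

The two matrices differ only by the scalar factor $1+\epsilon$, so they have the same left eigenvectors, and the principal eigenvalue 1 of P corresponds to the eigenvalue $1+\epsilon$ of $(B + \mathbf{1}e^T)$.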



PageRank & Markov Chains

  • PageRank vector is normalized principal left eigenvector of $(B + \mathbf{1} e^T)$.

  • Hence, PageRank vector is also a principal left eigenvector of P

  • Conclusion: PageRank is the unique stationary distribution of the random surfer Markov Chain.

  • PageRank(p) = r(p) = probability of random surfer visiting page p at the limit.

  • Note: “Random jump” guarantees Markov Chain is ergodic.
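
The random surfer view can be checked numerically. The sketch below builds the transition matrix from the three rules of the model (follow a random outlink with probability 1 - d, jump uniformly with probability d, self loop for pages without outlinks) and approximates the stationary distribution by repeated multiplication. The helper names `transition_matrix` and `stationary`, the value d = 0.15, and the three-page graph are illustrative assumptions.

```python
def transition_matrix(out_links, d=0.15):
    """Row-stochastic matrix P of the random surfer chain."""
    pages = sorted(out_links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = [[d / n] * n for _ in range(n)]        # random jump to any page
    for p in pages:
        targets = out_links[p] or [p]          # no outlinks: assume a self loop
        for q in targets:
            P[index[p]][index[q]] += (1 - d) / len(targets)
    return pages, P


def stationary(P, iters=200):
    """Approximate the stationary distribution by iterating r^T <- r^T P."""
    n = len(P)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [sum(r[i] * P[i][j] for i in range(n)) for j in range(n)]
    return r


# Hypothetical graph: a -> b, b -> a, b -> c, and c has no outlinks.
pages, P = transition_matrix({"a": ["b"], "b": ["a", "c"], "c": []})
r = stationary(P)
```

Because every entry of P is at least d / n > 0, the chain is ergodic and the iteration converges from any starting distribution.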

HITS: Hubs and Authorities [Kleinberg, 1997]

  • HITS: Hyperlink Induced Topic Search

  • Main principle: every page p is associated with two scores:

    • Authority score: how “authoritative” a page is about the query’s topic

      • Ex: query: “IR”; authorities: scientific IR papers

      • Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites

    • Hub score: how good the page is as a “resource list” about the query’s topic

      • Ex: query: “IR”; hubs: surveys and books about IR

      • Ex: query: “automobile manufacturers”; hubs: KBB, car link lists

Mutual Reinforcement

HITS principles:

  • p is a good authority, if it is linked by many good hubs.

  • p is a good hub, if it points to many good authorities.

HITS: Algebraic Form

  • a: authority vector

  • h: hub vector

  • A: adjacency matrix

  • Then: $a \propto A^T h$ and $h \propto A a$

  • Therefore: $a \propto A^T A\, a$ and $h \propto A A^T h$

  • a is principal eigenvector of $A^T A$

  • h is principal eigenvector of $A A^T$
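
The mutual-reinforcement updates translate directly into an iteration. This is a minimal sketch under stated assumptions, not Kleinberg's reference implementation: the function name `hits`, the choice of Euclidean normalization after each round, and the four-page adjacency matrix are all illustrative.

```python
import math


def hits(A, iters=50):
    """Alternate a <- A^T h and h <- A a, normalizing both to unit length."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # Authority of p: sum of hub scores of pages q that link to p.
        a = [sum(A[q][p] * h[q] for q in range(n)) for p in range(n)]
        # Hub score of p: sum of authority scores of pages q that p links to.
        h = [sum(A[p][q] * a[q] for q in range(n)) for p in range(n)]
        na = math.sqrt(sum(x * x for x in a))
        nh = math.sqrt(sum(x * x for x in h))
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(A)
```

In this example page 2 is the strongest authority (two hubs point to it) and page 0 is the strongest hub (it points to both authorities).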

Co-Citation and Bibliographic Coupling

  • $A^T A$: co-citation matrix

    • $(A^T A)_{p,q}$ = # of pages that link both to p and to q.

    • Thus: authority scores propagate through co-citation.

  • $A A^T$: bibliographic coupling matrix

    • $(A A^T)_{p,q}$ = # of pages that both p and q link to.

    • Thus: hub scores propagate through bibliographic coupling.
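
The two counting interpretations are easy to verify on a small graph. The sketch below multiplies out both products for a hypothetical four-page adjacency matrix; the helpers `transpose` and `matmul` are plain-Python stand-ins written for this example.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    # Entry (i, j) is the dot product of row i of X with column j of Y.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2, 1 -> 3, 2 -> 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
cocitation = matmul(transpose(A), A)   # entry (p, q): pages linking to both p and q
coupling = matmul(A, transpose(A))     # entry (p, q): pages that both p and q link to
```

Here pages 2 and 3 are co-cited by pages 0 and 1, and pages 0 and 1 are coupled through their two shared targets. The diagonal entries are the in-degrees and out-degrees, respectively.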





Principal Eigenvector Computation

  • E: n × n matrix

  • $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$: eigenvalues of E

    • Suppose $\lambda_1 > 0$

  • $v_1,\dots,v_n$: corresponding eigenvectors

  • Eigenvectors form an orthonormal basis

  • Input:

    • The matrix E

    • A unit vector u, which is not orthogonal to $v_1$

  • Goal: compute $\lambda_1$ and $v_1$

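Why Does It Work?

  • Theorem: As $t \to \infty$, $w \to c \cdot v_1$ (c is a constant), where w is the normalized vector obtained after t power iterations starting from u

  • Convergence rate: Proportional to $(|\lambda_2|/\lambda_1)^t$

  • The larger the “spectral gap” $|\lambda_1| - |\lambda_2|$, the faster the convergence.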
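
A minimal sketch of the power method, assuming a symmetric input matrix (so that the eigenvectors are orthonormal, as on the previous slide). The function name `power_method`, the iteration count, and the 2 × 2 example are illustrative choices; the eigenvalue is read off with a Rayleigh quotient once the vector has converged.

```python
import math


def power_method(E, u, iters=100):
    """Return estimates of (lambda_1, v_1) for a symmetric matrix E."""
    n = len(E)
    w = list(u)
    for _ in range(iters):
        w = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]              # keep w a unit vector
    Ew = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(w[i] * Ew[i] for i in range(n))  # Rayleigh quotient w^T E w
    return lam, w


# Eigenvalues of this matrix are 3 and 1, with v_1 = (1, 1) / sqrt(2).
E = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_method(E, [1.0, 0.0])           # (1, 0) is not orthogonal to v_1
```

Writing $u = \sum_i c_i v_i$, each multiplication by E scales the i-th component by $\lambda_i$, so after t steps the $v_1$ component dominates the others by a factor of $(\lambda_1/\lambda_i)^t$. For this example the error shrinks like $(1/3)^t$.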

Spectral Methods in Information Retrieval

Outline

  • Motivation: synonymy and polysemy

  • Latent Semantic Indexing (LSI)

  • Singular Value Decomposition (SVD)

  • LSI via SVD

  • Why does LSI work?

  • HITS and SVD

Synonymy and Polysemy

  • Synonymy: multiple terms with (almost) the same meaning

    • Ex: cars, autos, vehicles

    • Harms recall

  • Polysemy: a term with multiple meanings

    • Ex: java (programming language, coffee, island)

    • Harms precision

Traditional Solutions

  • Query expansion

    • Synonymy: OR on all synonyms

      • Manual/automatic use of thesauri

      • Too few synonyms: recall still low

      • Too many synonyms: harms precision

    • Polysemy: AND on term and additional specializing terms

      • Ex: +java +“programming language”

      • Too broad terms: precision still low

      • Too narrow terms: harms recall

Syntactic Space

  • D: document collection, |D| = n

  • T: term space, |T| = m

  • $A_{t,d}$: “weight” of t in d (e.g., TFIDF)

  • $A^T A$: pairwise document similarities

  • $A A^T$: pairwise term similarities





Syntactic Indexing

  • Index keys: terms

  • Limitations

    • Synonymy

      • (Near)-identical rows

    • Polysemy

    • Space inefficiency

      • Matrix usually is not full rank

  • Gap between syntax and semantics:

    Information need is semantic but index and query are syntactic.
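
The gap can be made concrete with a toy term-document matrix. The sketch below uses raw term counts as weights (an assumption; the slides suggest TFIDF) and four hypothetical two-word documents. The helper names are illustrative.

```python
def term_document_matrix(docs):
    """Rows are terms, columns are documents; weights are raw term counts."""
    terms = sorted({t for d in docs for t in d})
    return terms, [[d.count(t) for d in docs] for t in terms]


def doc_similarity(A, i, j):
    """Entry (i, j) of A^T A: inner product of document columns i and j."""
    return sum(row[i] * row[j] for row in A)


docs = [
    ["car", "repair"],       # same subject as the next document
    ["auto", "mechanic"],
    ["java", "coffee"],      # different subject from the next document
    ["java", "compiler"],
]
terms, A = term_document_matrix(docs)
synonymy = doc_similarity(A, 0, 1)   # 0: related documents share no terms
polysemy = doc_similarity(A, 2, 3)   # 1: unrelated documents share "java"
```

Syntactic similarity misses the related pair entirely and reports a match for the unrelated pair, which is the recall and precision damage attributed to synonymy and polysemy.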

Semantic Space

  • C: concept space, |C| = r

  • $B_{c,d}$: “weight” of c in d

  • Change of basis

  • Compare to wavelet and Fourier transforms





Latent Semantic Indexing (LSI) [Deerwester et al. 1990]

  • Index keys: concepts

  • Documents & query: mixtures of concepts

  • Given a query, finds the most similar documents

  • Bridges the syntax-semantics gap

  • Space-efficient

    • Concepts are orthogonal

    • Matrix is full rank

  • Questions

    • What is the concept space?

    • What is the transformation from the syntax space to the semantic space?

    • How to filter “noise concepts”?

Singular Values

  • A: m×n real matrix

  • Definition: $\sigma \ge 0$ is a singular value of A if there exists a pair of vectors u, v s.t.

    $Av = \sigma u$ and $A^T u = \sigma v$

    u and v are called singular vectors.

  • Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$.

    • Corresponding singular vectors: x that maximizes $\|Ax\|_2$ and $y = Ax / \|A\|_2$.

  • Note: $A^T A v = \sigma^2 v$ and $A A^T u = \sigma^2 u$

    • $\sigma^2$ is eigenvalue of $A^T A$ and $A A^T$

    • u eigenvector of $A A^T$

    • v eigenvector of $A^T A$
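
The note follows in one line from the definition. The derivation below uses only the two defining relations $Av = \sigma u$ and $A^T u = \sigma v$.

```latex
A^T A\, v = A^T(\sigma u) = \sigma\,(A^T u) = \sigma^2 v,
\qquad
A A^T u = A(\sigma v) = \sigma\,(A v) = \sigma^2 u .
```

So $\sigma^2$ is an eigenvalue of both products, with v the eigenvector for $A^T A$ (an n × n matrix) and u the eigenvector for $A A^T$ (an m × m matrix).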

Singular Value Decomposition (SVD)

  • Theorem: For every m×n real matrix A, there exists a singular value decomposition:

    $A = U \Sigma V^T$

    • $\sigma_1 \ge \dots \ge \sigma_r > 0$ (r = rank(A)): singular values of A

    • $\Sigma = \mathrm{Diag}(\sigma_1,\dots,\sigma_r)$

    • U: column-orthonormal m×r matrix ($U^T U = I$)

    • V: column-orthonormal n×r matrix ($V^T V = I$)







Singular Values vs. Eigenvalues

$A = U \Sigma V^T$

  • $\sigma_1,\dots,\sigma_r$: singular values of A

    • $\sigma_1^2,\dots,\sigma_r^2$: non-zero eigenvalues of $A^T A$ and $A A^T$

  • $u_1,\dots,u_r$: columns of U

    • Orthonormal basis for span(columns of A)

    • Left singular vectors of A

    • Eigenvectors of $A A^T$

  • $v_1,\dots,v_r$: columns of V

    • Orthonormal basis for span(rows of A)

    • Right singular vectors

    • Eigenvectors of $A^T A$

LSI as SVD

  • $A = U \Sigma V^T \Rightarrow U^T A = \Sigma V^T$

  • $u_1,\dots,u_r$: concept basis

  • $B = \Sigma V^T$: LSI matrix

  • $A_d$: d-th column of A

  • $B_d$: d-th column of B

  • $B_d = U^T A_d$

  • $B_d[c] = u_c^T A_d$

Noisy Concepts

$B = U^T A = \Sigma V^T$

  • $B_d[c] = \sigma_c\, v_d[c]$

  • If $\sigma_c$ is small, then $B_d[c]$ is small for all d

  • k = largest i s.t. $\sigma_i$ is “large”

  • For all c = k+1,…,r, and for all d, c is a low-weight concept in d

  • Main idea: filter out all concepts c = k+1,…,r

    • Space efficient: # of index terms = k (vs. r or m)

    • Better retrieval: noisy concepts are filtered out across the board

Low-rank SVD

$B = U^T A = \Sigma V^T$

  • $U_k = (u_1,\dots,u_k)$

  • $V_k = (v_1,\dots,v_k)$

  • $\Sigma_k$ = upper-left k×k sub-matrix of $\Sigma$

  • $A_k = U_k \Sigma_k V_k^T$

  • $B_k = \Sigma_k V_k^T$

  • rank($A_k$) = rank($B_k$) = k
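
For k = 1 the truncation can be sketched without a linear-algebra library: power iteration on $A^T A$ yields the top right singular vector, from which the singular value and the left singular vector follow. This is an illustrative sketch only; the function names, the iteration count, and the 3 × 3 term-document matrix (two terms shared by two documents, plus one unrelated term and document) are assumptions.

```python
import math


def top_singular_triple(A, iters=200):
    """Estimate (sigma_1, u_1, v_1) by power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        AtAv = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in AtAv))
        v = [x / norm for x in AtAv]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Av))   # sigma_1 = ||A v_1||_2
    u = [x / sigma for x in Av]                 # u_1 = A v_1 / sigma_1
    return sigma, u, v


def rank_one(sigma, u, v):
    """A_1 = sigma_1 * u_1 * v_1^T."""
    return [[sigma * ui * vj for vj in v] for ui in u]


A = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
sigma, u, v = top_singular_triple(A)
A1 = rank_one(sigma, u, v)
# Squared Frobenius error of the rank-1 approximation.
err = sum((A[i][j] - A1[i][j]) ** 2 for i in range(3) for j in range(3))
```

The singular values of this matrix are 2 and 1, so the rank-1 approximation keeps the dominant 2 × 2 block and drops the isolated entry; the squared error equals the square of the discarded singular value.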

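Low Dimensional Embedding

  • Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$

  • Fact: $\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$

  • Therefore, if $\|A - A_k\|_F$ is small, then for “most” d,d’, $\langle (A_k)_d, (A_k)_{d'} \rangle \approx \langle A_d, A_{d'} \rangle$.

  • $A_k$ preserves pairwise similarities among documents ⇒ at least as good as A for retrieval.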

Computing SVD

  • Compute singular values of A, by computing eigenvalues of $A^T A$

  • Compute U, V by computing eigenvectors of $A A^T$ and $A^T A$, respectively

  • Running time not too good: $O(m^2 n + m n^2)$

    • Not practical for huge corpora

  • Sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998]

HITS and SVD

  • A: adjacency matrix of a web (sub-)graph G

  • a: authority vector

  • h: hub vector

  • a is principal eigenvector of $A^T A$

  • h is principal eigenvector of $A A^T$

  • Therefore: a and h give $A_1$: the rank-1 SVD of A

  • Generalization: using $A_k$, we can get k authority and hub vectors, corresponding to other topics in G.

Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]

  • LSI summary

    • Documents are embedded in low dimensional space (m → k)

    • Pairwise similarities are preserved

    • More space-efficient

  • But why is retrieval better?

    • Synonymy

    • Polysemy

Generative Model

  • A corpus model M = (T, C, W, D)

    • T: Term space, |T| = m

    • C: Concept space, |C| = k

      • Concept: distribution over terms

    • W: Topic space

      • Topic: distribution over concepts

    • D: Document distribution

      • Distribution over W × N

  • A document d is generated as follows:

    • Sample a topic w and a length n according to D

    • Repeat n times:

      • Sample a concept c from C according to w

      • Sample a term t from T according to c
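
The generation procedure can be sketched as a small sampler. Everything concrete below is hypothetical: the two concepts, the two topics, and in particular the document distribution D, which is assumed here to be uniform over topics and over lengths 1 to `max_len`.

```python
import random

# Hypothetical toy model: each concept is a distribution over terms.
CONCEPTS = {
    "cars": {"car": 0.5, "auto": 0.3, "engine": 0.2},
    "coffee": {"java": 0.6, "espresso": 0.4},
}
# Each topic is a distribution over concepts.
TOPICS = {
    "motoring": {"cars": 1.0},
    "mixed": {"cars": 0.5, "coffee": 0.5},
}


def draw(dist, rng):
    """Sample one key of dist with probability equal to its value."""
    items = list(dist)
    return rng.choices(items, weights=[dist[x] for x in items], k=1)[0]


def sample_document(rng, max_len=8):
    # Assumed D: uniform over topics and over lengths 1..max_len.
    topic = rng.choice(sorted(TOPICS))
    length = rng.randint(1, max_len)
    terms = []
    for _ in range(length):
        concept = draw(TOPICS[topic], rng)          # sample a concept according to w
        terms.append(draw(CONCEPTS[concept], rng))  # sample a term according to c
    return topic, terms


rng = random.Random(0)
topic, doc = sample_document(rng)
```

A "motoring" document draws only from the cars concept, so synonyms such as "car" and "auto" appear in the same kind of document even when they never co-occur in a single one.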

Simplifying Assumptions

  • Every document has a single topic (W = C)

  • For every two concepts c,c’, $\|c - c'\| \ge 1 - \epsilon$

  • The probability of every term under a concept c is at most some constant.

LSI Works

  • A: m×n term-document matrix, representing n documents generated according to the model

  • Theorem [Papadimitriou et al. 1998]

    With high probability, for every two documents d,d’ (writing $A^k_d$ for the d-th column of $A_k$),

    • If topic(d) = topic(d’), then $A^k_d$ and $A^k_{d'}$ are nearly parallel

    • If topic(d) ≠ topic(d’), then $A^k_d$ and $A^k_{d'}$ are nearly orthogonal

Proof

  • For simplicity, assume $\epsilon = 0$

  • Want to show:

    • If topic(d) = topic(d’), $A^k_d \parallel A^k_{d'}$

    • If topic(d) ≠ topic(d’), $A^k_d \perp A^k_{d'}$

  • $D_c$: documents whose topic is the concept c

  • $T_c$: terms in supp(c)

    • Since $\|c - c'\| = 1$, $T_c \cap T_{c'} = \emptyset$

  • A has non-zeroes only in blocks: $B_1,\dots,B_k$, where

    $B_c$: sub-matrix of A with rows in $T_c$ and columns in $D_c$

  • $A^T A$ is a block diagonal matrix with blocks $B_1^T B_1,\dots,B_k^T B_k$

  • (i,j)-th entry of $B_c^T B_c$: term similarity between i-th and j-th documents whose topic is the concept c

  • $B_c^T B_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$

Proof (cont.)

  • $G_c$ is a “random” graph

  • First and second eigenvalues of $B_c^T B_c$ are well separated

  • For all c,c’, second eigenvalue of $B_c^T B_c$ is smaller than first eigenvalue of $B_{c'}^T B_{c'}$

  • Top k eigenvalues of $A^T A$ are the principal eigenvalues of $B_c^T B_c$ for c = 1,…,k

  • Let $u_1,\dots,u_k$ be the corresponding left singular vectors of A

  • For every document d on topic c, $A_d$ is orthogonal to all $u_1,\dots,u_k$, except for $u_c$.

  • $A^k_d$ is a scalar multiple of $u_c$.

Extensions [Azar et al. 2001]

  • A more general generative model

  • Explain also improved treatment of polysemy