
# Algorithms for Large Data Sets


### Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 5

April 23, 2006

http://www.ee.technion.ac.il/courses/049011

PageRank [Page, Brin, Motwani, Winograd 1998]

• Motivating principles

• Rank of p should be proportional to the rank of the pages that point to p

• Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva

• Rank of p should depend on the number of pages “co-cited” with p

• Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth

• r is non-negative: r ≥ 0

• r is normalized: ||r||1 = 1

• B = normalized adjacency matrix: Bp,q = 1/out-deg(p) if p links to q, and 0 otherwise

• Then: r = r·B (with r viewed as a row vector)

• r is a non-negative normalized left eigenvector of B with eigenvalue 1

• Solution exists only if B has eigenvalue 1

• Problem: B may not have 1 as an eigenvalue

• Because some of its rows are all 0 (pages with no outlinks).

• Example:

• Revised equation: r = λ·(r·B)

• λ = normalization constant

• r is a non-negative normalized left eigenvector of B with eigenvalue 1/λ

• Any nonzero eigenvalue μ of B may give a solution

• λ = 1/μ

• r = any non-negative normalized left eigenvector of B with eigenvalue μ

• Which solution to pick?

• Pick a “principal eigenvector” (i.e., corresponding to maximal μ)

• How to find a solution?

• Power iterations

• Problem #1: Maximal eigenvalue may have multiplicity > 1

• Several possible solutions

• Happens, for example, when graph is disconnected

• Problem #2: Rank accumulates at sinks.

• Only sinks, or nodes from which a sink cannot be reached, can have nonzero rank mass.

• e = “rank source” vector

• Standard setting: e(p) = ε/n for all p (ε < 1)

• 1 = the all 1’s vector

• Then: r = λ·r·(B + 1eT)

• r is a non-negative normalized left eigenvector of (B + 1eT) with eigenvalue 1/λ

• Any nonzero eigenvalue of (B + 1eT) may give a solution

• Pick r to be a principal left eigenvector of (B + 1eT)

• Will show:

• Principal eigenvalue has multiplicity 1, for any graph

• There exists a non-negative left eigenvector

• Hence, PageRank always exists and is uniquely defined

• Due to rank source vector, rank no longer accumulates at sinks

An Alternative View of PageRank: The Random Surfer Model

• When visiting a page p, a “random surfer”:

• With probability 1 - d, selects a random outlink p → q and goes to visit q. (“focused browsing”)

• With probability d, jumps to a random web page q. (“loss of interest”)

• If p has no outlinks, assume it has a self loop.

• P: probability transition matrix: Pp,q = (1 - d)·Bp,q + d/n (where B includes the self loops at sinks)

Therefore, r is a principal left eigenvector of (B + 1eT) if and only if it is a principal left eigenvector of P.

Suppose:

Then:

• PageRank vector is normalized principal left eigenvector of (B + 1eT).

• Hence, PageRank vector is also a principal left eigenvector of P

• Conclusion: PageRank is the unique stationary distribution of the random surfer Markov Chain.

• PageRank(p) = r(p) = the probability that the random surfer visits page p, in the limit.

• Note: “Random jump” guarantees Markov Chain is ergodic.
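
A minimal sketch of this random-surfer view in Python with NumPy; the toy graph and jump probability d below are illustrative assumptions, not data from the lecture:

```python
import numpy as np

# Toy web graph (illustrative): page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
n, d = len(links), 0.15          # d = probability of a random jump

# Row-stochastic transition matrix P of the random surfer.
P = np.zeros((n, n))
for p, outlinks in links.items():
    if outlinks:                 # focused browsing: follow a random outlink
        for q in outlinks:
            P[p, q] += (1 - d) / len(outlinks)
    else:                        # sink: assume a self loop, as on the slide
        P[p, p] += 1 - d
    P[p, :] += d / n             # loss of interest: jump to a random page

# Power iteration: r <- r P converges to the stationary distribution.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = r @ P
print("PageRank:", r)            # stays normalized since P is row-stochastic
```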

HITS: Hubs and Authorities [Kleinberg, 1997]

• HITS: Hyperlink-Induced Topic Search

• Main principle: every page p is associated with two scores:

• Authority score: how “authoritative” a page is about the query’s topic

• Ex: query: “IR”; authorities: scientific IR papers

• Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites

• Hub score: how good the page is as a “resource list” about the query’s topic

• Ex: query: “IR”; hubs: surveys and books about IR

• Ex: query: “automobile manufacturers”; hubs: KBB, car link lists

HITS principles:

• p is a good authority, if it is linked by many good hubs.

• p is a good hub, if it points to many good authorities.

• a: authority vector

• h: hub vector

• Then: a = ATh and h = Aa (up to normalization)

• Therefore: a = ATAa and h = AATh (up to normalization)

• a is principal eigenvector of ATA

• h is principal eigenvector of AAT
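
A minimal sketch of this mutually reinforcing iteration; the 0/1 link matrix is a made-up query subgraph, and a and h converge (up to scaling) to the principal eigenvectors of ATA and AAT:

```python
import numpy as np

# Hypothetical query subgraph: A[p, q] = 1 iff page p links to page q.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

a = np.ones(A.shape[0])          # authority scores
h = np.ones(A.shape[0])          # hub scores
for _ in range(50):
    a = A.T @ h                  # good authorities are linked by good hubs
    h = A @ a                    # good hubs point to good authorities
    a /= np.linalg.norm(a)       # normalize so the scores stay bounded
    h /= np.linalg.norm(h)

print("authorities:", a)
print("hubs:", h)
```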

Co-Citation and Bibliographic Coupling

• ATA: co-citation matrix

• (ATA)p,q = # of pages that link both to p and to q.

• Thus: authority scores propagate through co-citation.

• AAT: bibliographic coupling matrix

• (AAT)p,q = # of pages that both p and q link to.

• Thus: hub scores propagate through bibliographic coupling.

[Diagram: co-citation (pages that link to both p and q) vs. bibliographic coupling (pages that both p and q link to)]
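
Both matrices can be read off directly from the link matrix; a small illustrative check, with a made-up link matrix A:

```python
import numpy as np

# A[p, q] = 1 iff page p links to page q (illustrative link matrix).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

co_citation = A.T @ A      # entry (p, q): # pages that link to both p and q
bib_coupling = A @ A.T     # entry (p, q): # pages that both p and q link to
print(co_citation)
print(bib_coupling)
```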

• E: n × n matrix

• |1| > |2| ≥ |3| … ≥ |n| : eigenvalues of E

• Suppose 1 > 0

• v1,…,vn: corresponding eigenvectors

• Eigenvectors are form an orthornormal basis

• Input:

• The matrix E

• A unit vector u, which is not orthogonal to v1

• Goal: compute λ1 and v1

• Power iterations: w0 = u, wt = E·wt-1 (normalized at each step)

• Theorem: As t → ∞, wt → c · v1 (c is a constant)

• Convergence rate: Proportional to (λ2/λ1)t

• The larger the “spectral gap” λ1 - λ2, the faster the convergence.
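
A minimal sketch of power iterations for a symmetric E (so the eigenvectors form an orthonormal basis, as assumed above); the matrix and starting vector are illustrative:

```python
import numpy as np

def power_iteration(E, u, iters=1000):
    """Estimate the principal eigenpair (lambda_1, v_1) of E, starting from a
    unit vector u that is not orthogonal to v_1."""
    w = u / np.linalg.norm(u)
    for _ in range(iters):
        w = E @ w
        w /= np.linalg.norm(w)       # re-normalize to avoid over/underflow
    lam = w @ E @ w                  # Rayleigh quotient estimates lambda_1
    return lam, w

E = np.array([[2.0, 1.0],
              [1.0, 3.0]])           # illustrative symmetric matrix
lam1, v1 = power_iteration(E, np.array([1.0, 0.0]))
print(lam1, v1)
```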

Spectral Methods in Information Retrieval

• Motivation: synonymy and polysemy

• Latent Semantic Indexing (LSI)

• Singular Value Decomposition (SVD)

• LSI via SVD

• Why does LSI work?

• HITS and SVD

• Synonymy: multiple terms with (almost) the same meaning

• Ex: cars, autos, vehicles

• Harms recall

• Polysemy: a term with multiple meanings

• Ex: java (programming language, coffee, island)

• Harms precision

• Query expansion

• Synonymy: OR on all synonyms

• Manual/automatic use of thesauri

• Too few synonyms: recall still low

• Too many synonyms: harms precision

• Polysemy: AND on term and additional specializing terms

• Ex: +java +”programming language”

• Too broad terms: precision still low

• Too narrow terms: harms recall


• D: document collection, |D| = n

• T: term space, |T| = m

• At,d: “weight” of t in d (e.g., TFIDF)

• ATA: pairwise document similarities

• AAT: pairwise term similarities

[Diagram: A is an m×n term-document matrix (rows = terms, columns = documents)]
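
A small sketch of this vector space model, using an illustrative term-document matrix with raw counts standing in for TF-IDF weights:

```python
import numpy as np

# Illustrative m x n matrix A: rows = terms, columns = documents,
# A[t, d] = "weight" of term t in document d (raw counts here).
terms = ["car", "auto", "java", "coffee"]
A = np.array([[2, 0, 0],                   # "car"    appears only in document 0
              [0, 3, 0],                   # "auto"   appears only in document 1
              [0, 0, 2],                   # "java"
              [0, 0, 1]], dtype=float)     # "coffee"

doc_similarities = A.T @ A    # entry (d, d'): similarity of documents d and d'
term_similarities = A @ A.T   # entry (t, t'): similarity of terms t and t'
print(doc_similarities)
print(term_similarities)
```

Note that documents 0 and 1 get similarity 0 even though “car” and “auto” are synonyms; this is the syntax-semantics gap discussed below.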

• Index keys: terms

• Limitations

• Synonymy

• (Near)-identical rows

• Polysemy

• Space inefficiency

• Matrix usually is not full rank

• Gap between syntax and semantics:

Information need is semantic but index and query are syntactic.


• C: concept space, |C| = r

• Bc,d: “weight” of c in d

• Change of basis

• Compare to wavelet and Fourier transforms

[Diagram: B is an r×n concept-document matrix (rows = concepts, columns = documents)]

Latent Semantic Indexing (LSI) [Deerwester et al. 1990]

• Index keys: concepts

• Documents & query: mixtures of concepts

• Given a query, finds the most similar documents

• Bridges the syntax-semantics gap

• Space-efficient

• Concepts are orthogonal

• Matrix is full rank

• Questions

• What is the concept space?

• What is the transformation from the syntax space to the semantic space?

• How to filter “noise concepts”?

• A: m×n real matrix

• Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u,v s.t.

Av = σu and ATu = σv

u and v are called singular vectors.

• Ex: σ = ||A||2 = max||x||2 = 1 ||Ax||2.

• Corresponding singular vectors: v = the x that maximizes ||Ax||2, and u = Ax / ||A||2.

• Note: ATAv = σ2v and AATu = σ2u

• σ2 is an eigenvalue of ATA and AAT

• u is an eigenvector of AAT

• v is an eigenvector of ATA

• Theorem: For every m×n real matrix A, there exists a singular value decomposition:

A = U Σ VT

• σ1 ≥ … ≥ σr > 0 (r = rank(A)): singular values of A

• Σ = Diag(σ1,…,σr)

• U: column-orthonormal m×r matrix (UT U = I)

• V: column-orthonormal n×r matrix (VT V = I)

[Diagram: A = U × Σ × VT]

A = U Σ VT

• σ1,…,σr: singular values of A

• σ12,…,σr2: non-zero eigenvalues of ATA and AAT

• u1,…,ur: columns of U

• Orthonormal basis for span(columns of A)

• Left singular vectors of A

• Eigenvectors of AAT

• v1,…,vr: columns of V

• Orthonormal basis for span(rows of A)

• Right singular vectors

• Eigenvectors of ATA
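
These relations are easy to verify numerically; a minimal sketch with NumPy on an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))                 # illustrative m x n matrix, m = 5, n = 3

# Thin SVD: A = U diag(sigma) V^T with sigma sorted in decreasing order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(A, U @ np.diag(sigma) @ Vt))       # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(len(sigma))))      # columns of U orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(len(sigma))))    # columns of V orthonormal

# sigma_i^2 are the non-zero eigenvalues of A^T A (and of A A^T).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(sigma ** 2, eigvals))
```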

• A = U Σ VT ⇒ UTA = Σ VT

• u1,…,ur : concept basis

• B = Σ VT : LSI matrix

• Ad: d-th column of A

• Bd: d-th column of B

B = UTA = Σ VT

• Bd[c] = σc·vd[c]

• If σc is small, then Bd[c] is small for all d

• k = largest i s.t. σi is “large”

• For all c = k+1,…,r, and for all d, c is a low-weight concept in d

• Main idea: filter out all concepts c = k+1,…,r

• Space efficient: # of index terms = k (vs. r or m)

• Better retrieval: noisy concepts are filtered out across the board

B = UTA = Σ VT

• Uk = (u1,…,uk)

• Vk = (v1,…,vk)

• Σk = upper-left k×k sub-matrix of Σ

• Ak = Uk Σk VkT

• Bk = Σk VkT

• rank(Ak) = rank(Bk) = k

• Frobenius norm: ||A||F = sqrt(Σi,j (Ai,j)2)

• Fact: ||A - Ak||F2 = σk+12 + … + σr2

• Therefore, if σk+12 + … + σr2 is small, then for “most” d,d’, the similarity (Ak)dT(Ak)d’ is close to AdTAd’.

• Ak preserves pairwise similarities among documents ⇒ at least as good as A for retrieval.
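
A short numerical sketch of the truncation and of the Frobenius-norm fact above, on a random illustrative A with a hypothetical choice k = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                        # illustrative term-document matrix
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                         # keep only the k "large" concepts
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # A_k = U_k Sigma_k V_k^T
B_k = np.diag(sigma[:k]) @ Vt[:k, :]              # B_k = Sigma_k V_k^T (LSI index)

print(np.linalg.matrix_rank(A_k))             # rank(A_k) = k
err = np.linalg.norm(A - A_k, "fro") ** 2
print(np.isclose(err, np.sum(sigma[k:] ** 2)))    # ||A - A_k||_F^2 = dropped sigma^2
```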

• Compute singular values of A by computing the eigenvalues of ATA

• Compute U,V by computing eigenvectors of ATA and AAT

• Running time not too good: O(m2 n + m n2)

• Not practical for huge corpora

• Sub-linear time algorithms for estimating Ak [Frieze, Kannan, Vempala 1998]

• A: adjacency matrix of a web (sub-)graph G

• a: authority vector

• h: hub vector

• a is principal eigenvector of ATA

• h is principal eigenvector of AAT

• Therefore: a and h give A1: the rank-1 SVD of A

• Generalization: using Ak, we can get k authority and hub vectors, corresponding to other topics in G.

Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]

• LSI summary

• Documents are embedded in low dimensional space (m → k)

• Pairwise similarities are preserved

• More space-efficient

• But why is retrieval better?

• Synonymy

• Polysemy

• A corpus model M = (T, C, W, D)

• T: Term space, |T| = m

• C: Concept space, |C| = k

• Concept: distribution over terms

• W: Topic space

• Topic: distribution over concepts

• D: Document distribution

• Distribution over W × N

• A document d is generated as follows:

• Sample a topic w and a length n according to D

• Repeat n times:

• Sample a concept c from C according to w

• Sample a term t from T according to c
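
A small sketch of this generative process; the term space, concepts, and topic distributions below are made-up placeholders for T, C, W, and D:

```python
import numpy as np

rng = np.random.default_rng(0)

terms = ["car", "auto", "java", "coffee"]            # term space T (illustrative)
concepts = np.array([[0.5, 0.5, 0.0, 0.0],           # concept 0: distribution over terms
                     [0.0, 0.0, 0.6, 0.4]])          # concept 1
topics = np.array([[0.9, 0.1],                       # topic 0: distribution over concepts
                   [0.2, 0.8]])                      # topic 1

def sample_document():
    # Sample a topic w and a length n according to D (here: uniform topic, n in [5, 10)).
    w = topics[rng.integers(len(topics))]
    n = rng.integers(5, 10)
    doc = []
    for _ in range(n):
        c = rng.choice(len(concepts), p=w)           # sample a concept c according to w
        t = rng.choice(len(terms), p=concepts[c])    # sample a term t according to c
        doc.append(terms[t])
    return doc

print(sample_document())
```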

• Every document has a single topic (W = C)

• For every two concepts c,c’, ||c – c’|| ≥ 1 - ε

• The probability of every term under a concept c is at most some small constant.

• A: m×n term-document matrix, representing n documents generated according to the model

With high probability, for every two documents d,d’,

• If topic(d) = topic(d’), then the vectors (Ak)d and (Ak)d’ are (nearly) parallel, so their similarity is high

• If topic(d) ≠ topic(d’), then (Ak)d and (Ak)d’ are (nearly) orthogonal, so their similarity is close to 0

• For simplicity, assume ε = 0

• Want to show: (Ak)d and (Ak)d’ are parallel when topic(d) = topic(d’) and orthogonal otherwise

• Dc: documents whose topic is the concept c

• Tc: terms in supp(c)

• Since ||c – c’|| = 1, Tc ∩ Tc’ = Ø

• A has non-zeroes only in blocks: B1,…,Bk, where

Bc: sub-matrix of A with rows in Tc and columns in Dc

• ATA is a block diagonal matrix with blocks B1TB1,…, BkTBk

• (i,j)-th entry of BcTBc: term similarity between the i-th and j-th documents whose topic is the concept c

• BcTBc: adjacency matrix of a bipartite (multi-)graph Gc on Dc

• Gc is a “random” graph

• First and second eigenvalues of BcTBc are well separated

• For all c,c’, the second eigenvalue of BcTBc is smaller than the first eigenvalue of Bc’TBc’

• Top k eigenvalues of ATA are the principal eigenvalues of BcTBc for c = 1,…,k

• Let u1,…,uk be corresponding eigenvectors

• For every document d on topic c, Ad is orthogonal to all u1,…,uk, except for uc.

• (Ak)d is a scalar multiple of uc.
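
A numerical sketch of this block-diagonal argument, on a made-up A with two topics over disjoint term supports (the ε = 0 case); the data and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Terms 0-2 belong to concept 1, terms 3-5 to concept 2 (disjoint supports).
# Documents 0-3 are on topic 1, documents 4-7 on topic 2, so A is block-structured.
A = np.zeros((6, 8))
A[:3, :4] = rng.random((3, 4))        # block B_1
A[3:, 4:] = rng.random((3, 4))        # block B_2

k = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Columns of A_k for same-topic documents are parallel (each is a scalar
# multiple of the u_c of its topic); different topics give orthogonal columns.
print(np.isclose(abs(cosine(A_k[:, 0], A_k[:, 1])), 1.0))        # same topic
print(np.isclose(cosine(A_k[:, 0], A_k[:, 5]), 0.0, atol=1e-9))  # different topics
```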

Extensions [Azar et al. 2001]

• A more general generative model

• Also explains the improved treatment of polysemy