Algorithms for Large Data Sets

Ziv Bar-Yossef

Lecture 5

April 23, 2006

http://www.ee.technion.ac.il/courses/049011

- Motivating principles
- Rank of p should be proportional to the rank of the pages that point to p
- Recommendations from Bill Gates & Steve Jobs vs. from Moishale and Ahuva

- Rank of p should depend on the number of pages “co-cited” with p
- Compare: Bill Gates recommends only me vs. Bill Gates recommends everyone on earth

- Additional Conditions:
- r is non-negative: $r \ge 0$
- r is normalized: $\|r\|_1 = 1$

- B = normalized adjacency matrix: $B_{q,p} = 1/\mathrm{outdeg}(q)$ if $q \to p$, and $B_{q,p} = 0$ otherwise

- Then:
- r is a non-negative normalized left eigenvector of B with eigenvalue 1

- Solution exists only if B has eigenvalue 1
- Problem: B may not have 1 as an eigenvalue
- Because some of its rows are 0.
- Example:
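
One possible illustration (not necessarily the example from the original slide): take a two-page graph in which page 1 links to page 2 and page 2 has no outlinks. Writing r as a column vector, the eigenvector condition is $r^T B = r^T$.

```latex
B = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},
\qquad
\det(B - \lambda I) = \lambda^2,
\qquad
r^T B = (0,\; r_1) = (r_1,\; r_2) \;\Longrightarrow\; r_1 = r_2 = 0 .
```

The only eigenvalue of this B is 0, so the only solution of $r^T B = r^T$ is $r = 0$, which cannot be normalized to $\|r\|_1 = 1$.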

- $\alpha$ = normalization constant
- r is a non-negative normalized left eigenvector of B with eigenvalue $1/\alpha$

- Any nonzero eigenvalue $\lambda$ of B may give a solution
- $\alpha = 1/\lambda$
- r = any non-negative normalized left eigenvector of B with eigenvalue $\lambda$

- Which solution to pick?
- Pick a “principal eigenvector” (i.e., corresponding to maximal $\lambda$)

- How to find a solution?
- Power iterations

- Problem #1: Maximal eigenvalue may have multiplicity > 1
- Several possible solutions
- Happens, for example, when graph is disconnected

- Problem #2: Rank accumulates at sinks.
- Only sinks, or nodes from which no sink can be reached, can have nonzero rank mass.

- e = “rank source” vector
- Standard setting: $e(p) = \varepsilon/n$ for all p ($\varepsilon < 1$)

- $\mathbf{1}$ = the all-ones vector

- Then:
- r is a non-negative normalized left eigenvector of $(B + \mathbf{1}e^T)$ with eigenvalue $1/\alpha$
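
The eigenvector statement can be derived in one line. This sketch assumes the defining equation on the original slide has the standard form $r^T = \alpha\,(r^T B + e^T)$, where $\alpha$ is the normalization constant; the equation itself did not survive in the text, so this form is an assumption.

```latex
r^T = \alpha\,(r^T B + e^T)
    = \alpha\,\bigl(r^T B + (r^T \mathbf{1})\, e^T\bigr)   % r >= 0 and ||r||_1 = 1 give r^T 1 = 1
    = \alpha\, r^T \,(B + \mathbf{1} e^T)
\;\Longrightarrow\;
r^T (B + \mathbf{1} e^T) = \tfrac{1}{\alpha}\, r^T .
```

The normalization $\|r\|_1 = 1$ is what allows the additive rank source to be absorbed into the matrix.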

- Any nonzero eigenvalue of $(B + \mathbf{1}e^T)$ may give a solution
- Pick r to be a principal left eigenvector of $(B + \mathbf{1}e^T)$
- Will show:
- Principal eigenvalue has multiplicity 1, for any graph
- There exists a non-negative left eigenvector

- Hence, PageRank always exists and is uniquely defined
- Due to rank source vector, rank no longer accumulates at sinks

- When visiting a page p, a “random surfer”:
- With probability $1 - d$, selects a random outlink $p \to q$ and goes to visit q (“focused browsing”).
- With probability d, jumps to a random web page q (“loss of interest”).
- If p has no outlinks, assume it has a self loop.

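- P: probability transition matrix: $P = (1-d)\,B + \frac{d}{n}\,\mathbf{1}\mathbf{1}^T$

Suppose: $\varepsilon = \frac{d}{1-d}$, so that $e = \frac{d}{(1-d)\,n}\,\mathbf{1}$

Then: $P = (1-d)\,(B + \mathbf{1}e^T)$

Therefore, r is a principal left eigenvector of $(B + \mathbf{1}e^T)$ if and only if it is a principal left eigenvector of P.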

- The PageRank vector is the normalized principal left eigenvector of $(B + \mathbf{1}e^T)$.
- Hence, the PageRank vector is also a principal left eigenvector of P.
- Conclusion: PageRank is the unique stationary distribution of the random surfer Markov chain.
- PageRank(p) = r(p) = probability that the random surfer visits page p in the limit.
- Note: the “random jump” guarantees that the Markov chain is ergodic.
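
The random surfer model can be simulated directly by power iteration. The sketch below is a minimal illustration, not the lecture's own code: the function name `pagerank`, the jump probability `d = 0.15`, the iteration count, and the 4-page graph are all assumptions chosen for the example.

```python
def pagerank(outlinks, d=0.15, iters=100):
    """Power iteration for the random-surfer Markov chain.

    outlinks[p] lists the pages that p links to; d is the jump probability
    (0.15 is an assumed, commonly used value, not taken from the slides).
    """
    n = len(outlinks)
    # A page with no outlinks is given a self loop, as in the model above.
    links = [targets if targets else [p] for p, targets in enumerate(outlinks)]
    r = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # Mass arriving through random jumps: d * (total mass 1) / n per page.
        nxt = [d / n] * n
        for p, targets in enumerate(links):
            # Mass following a uniformly chosen outlink of p.
            share = (1.0 - d) * r[p] / len(targets)
            for q in targets:
                nxt[q] += share
        r = nxt
    return r


# Hypothetical 4-page graph: 0 -> 1, 2;  1 -> 2;  2 -> 0;  3 -> 2
ranks = pagerank([[1, 2], [2], [0], [2]])
```

In this toy graph page 3 has no inlinks, so its rank is exactly the jump mass d/n, while page 2, which collects links from every other page, ends up with the highest rank.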

- HITS: Hyperlink Induced Topic Search
- Main principle: every page p is associated with two scores:
- Authority score: how “authoritative” a page is about the query’s topic
- Ex: query: “IR”; authorities: scientific IR papers
- Ex: query: “automobile manufacturers”; authorities: Mazda, Toyota, and GM web sites

- Hub score: how good the page is as a “resource list” about the query’s topic
- Ex: query: “IR”; hubs: surveys and books about IR
- Ex: query: “automobile manufacturers”; hubs: KBB, car link lists

HITS principles:

- p is a good authority, if it is linked by many good hubs.
- p is a good hub, if it points to many good authorities.

- a: authority vector
- h: hub vector
- A: adjacency matrix

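- Then: $a = A^T h$ and $h = A a$ (up to normalization)

- Therefore: $a \propto A^T A\, a$ and $h \propto A A^T h$

- a is a principal eigenvector of $A^T A$
- h is a principal eigenvector of $A A^T$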
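
The mutual reinforcement between hubs and authorities can be sketched as an iteration. This example assumes the usual HITS update rules $a \leftarrow A^T h$ and $h \leftarrow A a$ followed by L2 normalization; the function name `hits` and the 4-page graph are made up for illustration.

```python
import math


def hits(adj, iters=50):
    """Iterate a <- A^T h, h <- A a with L2 normalization.

    adj[p][q] == 1 means that page p links to page q.
    """
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # Authority update: sum the hub scores of the pages linking to q.
        a = [sum(adj[p][q] * h[p] for p in range(n)) for q in range(n)]
        # Hub update: sum the authority scores of the pages p links to.
        h = [sum(adj[p][q] * a[q] for q in range(n)) for p in range(n)]
        na = math.sqrt(sum(x * x for x in a))
        nh = math.sqrt(sum(x * x for x in h))
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h


# Hypothetical graph: pages 0 and 1 are link lists, pages 2 and 3 are content pages.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(adj)
```

Here page 2 is linked by both hubs and becomes the top authority, and page 0 points to both authorities and becomes the top hub.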

- $A^T A$: co-citation matrix
- $(A^T A)_{p,q}$ = # of pages that link to both p and q.
- Thus: authority scores propagate through co-citation.

- $A A^T$: bibliographic coupling matrix
- $(A A^T)_{p,q}$ = # of pages that both p and q link to.
- Thus: hub scores propagate through bibliographic coupling.
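
The two counting interpretations can be checked on a small graph. The helper names `cocitation` and `coupling` and the adjacency matrix below are illustrative assumptions.

```python
def cocitation(adj):
    """(A^T A)[p][q]: number of pages that link to both p and q."""
    n = len(adj)
    return [[sum(adj[s][p] * adj[s][q] for s in range(n)) for q in range(n)]
            for p in range(n)]


def coupling(adj):
    """(A A^T)[p][q]: number of pages that both p and q link to."""
    n = len(adj)
    return [[sum(adj[p][t] * adj[q][t] for t in range(n)) for q in range(n)]
            for p in range(n)]


# Hypothetical graph: 0 -> 2, 3;  1 -> 2, 3;  2 -> 3
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
C = cocitation(adj)
K = coupling(adj)
```

Pages 2 and 3 are co-cited by pages 0 and 1, so `C[2][3]` is 2; pages 0 and 1 share both of their outlinks, so `K[0][1]` is 2. The diagonal entries are the in-degrees and out-degrees, respectively.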

- E: n × n matrix
- $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$: eigenvalues of E
- Suppose $\lambda_1 > 0$

- $v_1,\dots,v_n$: corresponding eigenvectors
- The eigenvectors form an orthonormal basis
- Input:
- The matrix E
- A unit vector u that is not orthogonal to $v_1$

- Goal: compute $\lambda_1$ and $v_1$

- Theorem: As $t \to \infty$, the iterate $w \to c \cdot v_1$ (c is a constant)
- Convergence rate: proportional to $(|\lambda_2|/\lambda_1)^t$
- The larger the “spectral gap” $\lambda_1 - |\lambda_2|$, the faster the convergence.
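
A minimal sketch of power iteration, assuming the usual formulation in which the iterate w starts at u and is repeatedly multiplied by E and renormalized. The function name, the iteration count, and the 2 × 2 test matrix are assumptions for illustration.

```python
import math


def power_iteration(E, u, iters=200):
    """Repeatedly apply E and renormalize; returns (eigenvalue estimate, vector)."""
    n = len(E)
    w = u[:]
    for _ in range(iters):
        w = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # For a unit vector w, the Rayleigh quotient w^T E w estimates lambda_1.
    Ew = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(x * y for x, y in zip(w, Ew))
    return lam, w


# Symmetric example with eigenvalues 3 and 1; v_1 is proportional to (1, 1).
E = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(E, [1.0, 0.0])
```

For this matrix the error shrinks by a factor of about 1/3 per step, which matches the $(|\lambda_2|/\lambda_1)^t$ rate stated above.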

- Motivation: synonymy and polysemy
- Latent Semantic Indexing (LSI)
- Singular Value Decomposition (SVD)
- LSI via SVD
- Why does LSI work?
- HITS and SVD

- Synonymy: multiple terms with (almost) the same meaning
- Ex: cars, autos, vehicles
- Harms recall

- Polysemy: a term with multiple meanings
- Ex: java (programming language, coffee, island)
- Harms precision

- Query expansion
- Synonymy: OR on all synonyms
- Manual/automatic use of thesauri
- Too few synonyms: recall still low
- Too many synonyms: harms precision

- Polysemy: AND on term and additional specializing terms
- Ex: +java +"programming language"
- Too broad terms: precision still low
- Too narrow terms: harms recall

- D: document collection, |D| = n
- T: term space, |T| = m
- $A_{t,d}$: “weight” of t in d (e.g., TF-IDF)
- $A^T A$: pairwise document similarities
- $A A^T$: pairwise term similarities

[Figure: the term-document matrix A, with m rows (terms) and n columns (documents)]
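
As a small illustration of the term-document matrix, the sketch below builds A from three toy documents. The weighting tf · log(n/df) is one common TF-IDF variant chosen here as an assumption; the function name and the documents are made up.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, A) with A[t][d] = tf(t, d) * log(n / df(t))."""
    counts = [Counter(doc.split()) for doc in docs]
    terms = sorted(set(t for c in counts for t in c))
    n = len(docs)
    # df[t]: number of documents that contain term t.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    A = [[c[t] * math.log(n / df[t]) for c in counts] for t in terms]
    return terms, A


docs = ["car engine car", "auto engine", "java coffee"]
terms, A = term_document_matrix(docs)

# Pairwise document similarities: entry (d, e) of A^T A.
sim = [[sum(A[t][d] * A[t][e] for t in range(len(terms)))
        for e in range(len(docs))]
       for d in range(len(docs))]
```

The first two documents are similar only through the shared term "engine"; the synonyms "car" and "auto" contribute nothing to their similarity, which is the synonymy problem described earlier.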

- Index keys: terms
- Limitations
- Synonymy
- (Near)-identical rows

- Polysemy
- Space inefficiency
- Matrix usually is not full rank

- Gap between syntax and semantics:
Information need is semantic but index and query are syntactic.

- C: concept space, |C| = r
- $B_{c,d}$: “weight” of c in d
- Change of basis
- Compare to wavelet and Fourier transforms

[Figure: the concept-document matrix B, with r rows (concepts) and n columns (documents)]

- Index keys: concepts
- Documents & query: mixtures of concepts
- Given a query, finds the most similar documents
- Bridges the syntax-semantics gap
- Space-efficient
- Concepts are orthogonal
- Matrix is full rank

- Questions
- What is the concept space?
- What is the transformation from the syntax space to the semantic space?
- How to filter “noise concepts”?

- A: m×n real matrix
- Definition: $\sigma \ge 0$ is a singular value of A if there exists a pair of vectors u, v s.t.
$Av = \sigma u$ and $A^T u = \sigma v$

u and v are called singular vectors.

- Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$.
- Corresponding singular vectors: x that maximizes $\|Ax\|_2$ and $y = Ax / \|A\|_2$.

- Note: $A^T A v = \sigma^2 v$ and $A A^T u = \sigma^2 u$
- $\sigma^2$ is an eigenvalue of $A^T A$ and of $A A^T$
- v is an eigenvector of $A^T A$
- u is an eigenvector of $A A^T$
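
The defining relations can be verified numerically. The sketch below finds the largest singular value by power iteration on $A^T A$ and then checks $Av = \sigma u$ and $A^T u = \sigma v$. The helper names and the 3 × 2 matrix are assumptions, and the all-ones start vector is assumed not to be orthogonal to the top right singular vector.

```python
import math


def matvec(M, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def top_singular_triple(A, iters=500):
    """Largest singular value sigma of A with its singular vectors u and v."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        v = matvec(At, matvec(A, v))           # multiply by A^T A
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))  # sigma = ||Av||_2
    u = [x / sigma for x in Av]                # chosen so that Av = sigma * u
    return sigma, u, v


A = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
sigma, u, v = top_singular_triple(A)
Atu = matvec(transpose(A), u)                  # should equal sigma * v
```

Here v has one entry per column of A (it is an eigenvector of $A^T A$) and u has one entry per row (it is an eigenvector of $A A^T$).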

- Theorem: For every m×n real matrix A, there exists a singular value decomposition:
$A = U \Sigma V^T$

- $\sigma_1 \ge \dots \ge \sigma_r > 0$ (r = rank(A)): singular values of A
- $\Sigma = \mathrm{Diag}(\sigma_1,\dots,\sigma_r)$
- U: column-orthonormal m×r matrix ($U^T U = I$)
- V: column-orthonormal n×r matrix ($V^T V = I$)

$A = U \Sigma V^T$

- $\sigma_1,\dots,\sigma_r$: singular values of A
- $\sigma_1^2,\dots,\sigma_r^2$: non-zero eigenvalues of $A^T A$ and $A A^T$

- $u_1,\dots,u_r$: columns of U
- Orthonormal basis for span(columns of A)
- Left singular vectors of A
- Eigenvectors of $A A^T$

- $v_1,\dots,v_r$: columns of V
- Orthonormal basis for span(rows of A)
- Right singular vectors of A
- Eigenvectors of $A^T A$

- $A = U \Sigma V^T \;\Rightarrow\; U^T A = \Sigma V^T$
- $u_1,\dots,u_r$: concept basis
- $B = \Sigma V^T$: LSI matrix
- $A_d$: d-th column of A
- $B_d$: d-th column of B
- $B_d = U^T A_d$
- $B_d[c] = u_c^T A_d$

$B = U^T A = \Sigma V^T$

- $B_d[c] = \sigma_c \, v_c[d]$, where $v_c[d]$ is the d-th entry of $v_c$
- If $\sigma_c$ is small, then $B_d[c]$ is small for all d
- k = largest i s.t. $\sigma_i$ is “large”
- For all c = k+1,…,r, and for all d, c is a low-weight concept in d
- Main idea: filter out all concepts c = k+1,…,r
- Space efficient: # of index terms = k (vs. r or m)
- Better retrieval: noisy concepts are filtered out across the board

- $U_k = (u_1,\dots,u_k)$
- $V_k = (v_1,\dots,v_k)$
- $\Sigma_k$ = upper-left k×k sub-matrix of $\Sigma$
- $A_k = U_k \Sigma_k V_k^T$
- $B_k = \Sigma_k V_k^T$
- rank($A_k$) = rank($B_k$) = k

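- Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$
- Fact: $\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$
- Therefore, if $\|A - A_k\|_F$ is small, then for “most” d, d’, $(A_k^T A_k)_{d,d'} \approx (A^T A)_{d,d'}$.
- $A_k$ preserves pairwise similarities among documents, so it is at least as good as A for retrieval.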
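
The rank-k approximation and the Frobenius-norm fact can be checked on a tiny matrix. The sketch below builds $A_k$ by peeling off one singular triple at a time with power iteration (deflation). This is a simple technique chosen for illustration, not necessarily the method intended in the lecture, and it assumes k ≤ rank(A) and distinct leading singular values. All function names and the test matrix are assumptions.

```python
import math


def matvec(M, x):
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def top_triple(A, iters=200):
    """Top singular triple of A via power iteration on A^T A."""
    At = transpose(A)
    v = [1.0 / (j + 1) for j in range(len(A[0]))]  # arbitrary start vector
    for _ in range(iters):
        v = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    return sigma, [x / sigma for x in Av], v


def rank_k_approximation(A, k):
    """Return (A_k, [sigma_1..sigma_k]); assumes k <= rank(A)."""
    m, n = len(A), len(A[0])
    residual = [row[:] for row in A]
    Ak = [[0.0] * n for _ in range(m)]
    sigmas = []
    for _ in range(k):
        sigma, u, v = top_triple(residual)
        sigmas.append(sigma)
        for i in range(m):
            for j in range(n):
                Ak[i][j] += sigma * u[i] * v[j]        # add sigma * u v^T
                residual[i][j] -= sigma * u[i] * v[j]  # deflate
    return Ak, sigmas


def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))


A = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.1]]  # singular values 2, 1, 0.1
A1, s = rank_k_approximation(A, 1)
err = frobenius([[A[i][j] - A1[i][j] for j in range(3)] for i in range(3)])
```

With k = 1 the squared error equals the sum of the squares of the discarded singular values, $1^2 + 0.1^2$.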

- Compute singular values of A by computing eigenvalues of $A^T A$
- Compute U, V by computing eigenvectors of $A A^T$ and $A^T A$, respectively
- Running time not too good: $O(m^2 n + m n^2)$
- Not practical for huge corpora

- Sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998]

- A: adjacency matrix of a web (sub-)graph G
- a: authority vector
- h: hub vector
- a is a principal eigenvector of $A^T A$
- h is a principal eigenvector of $A A^T$
- Therefore: a and h give $A_1$, the rank-1 SVD of A
- Generalization: using $A_k$, we can get k authority and hub vectors, corresponding to other topics in G.

- LSI summary
- Documents are embedded in a low-dimensional space ($m \to k$)
- Pairwise similarities are preserved
- More space-efficient

- But why is retrieval better?
- Synonymy
- Polysemy

- A corpus model $M = (T, C, W, D)$
- T: Term space, |T| = m
- C: Concept space, |C| = k
- Concept: distribution over terms

- W: Topic space
- Topic: distribution over concepts

- D: Document distribution
- Distribution over $W \times \mathbb{N}$ (topic and document length)

- A document d is generated as follows:
- Sample a topic w and a length n according to D
- Repeat n times:
- Sample a concept c from C according to w
- Sample a term t from T according to c
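
The generation procedure can be sketched as a short sampler. The representation of topics and concepts as dictionaries, the uniform choice of topic and length (one simple instance of the document distribution D), and all concrete terms and probabilities are assumptions made for illustration.

```python
import random


def generate_document(topics, concepts, length_range, rng):
    """Sample one document from a toy corpus model.

    topics:   list of topics, each a dict {concept name: probability}
    concepts: dict {concept name: dict {term: probability}}
    """
    # Sample a topic w and a length n. Here they are drawn independently and
    # uniformly, which is only one possible document distribution D.
    w = rng.choice(topics)
    n = rng.randint(*length_range)
    doc = []
    for _ in range(n):
        # Sample a concept c according to the topic w.
        c = rng.choices(list(w), weights=list(w.values()))[0]
        # Sample a term t according to the concept c.
        term_probs = concepts[c]
        t = rng.choices(list(term_probs), weights=list(term_probs.values()))[0]
        doc.append(t)
    return doc


concepts = {
    "cars": {"car": 0.5, "auto": 0.3, "engine": 0.2},
    "coffee": {"java": 0.6, "espresso": 0.4},
}
topics = [{"cars": 1.0}, {"coffee": 1.0}]  # each topic is a single concept
rng = random.Random(0)
doc = generate_document(topics, concepts, (5, 10), rng)
```

Because each topic here puts all its weight on one concept, every generated document draws its terms from a single concept's vocabulary.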

- Every document has a single topic (W = C)
- For every two concepts c, c’, $\|c - c'\| \ge 1 - \varepsilon$
- The probability of every term under a concept c is at most some constant $\tau$.

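- A: m×n term-document matrix, representing n documents generated according to the model
- Theorem [Papadimitriou et al. 1998]
With high probability, for every two documents d, d’,

- If topic(d) = topic(d’), then $A^k_d$ and $A^k_{d'}$ are nearly parallel
- If topic(d) ≠ topic(d’), then $A^k_d$ and $A^k_{d'}$ are nearly orthogonal

($A^k_d$ denotes the d-th column of $A_k$.)

- For simplicity, assume $\varepsilon = 0$
- Want to show:
- If topic(d) = topic(d’), $A^k_d \parallel A^k_{d'}$
- If topic(d) ≠ topic(d’), $A^k_d \perp A^k_{d'}$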

- $D_c$: documents whose topic is the concept c
- $T_c$: terms in supp(c)
- Since $\|c - c'\| = 1$, $T_c \cap T_{c'} = \emptyset$

- A has non-zeroes only in blocks $B_1,\dots,B_k$, where
$B_c$: sub-matrix of A with rows in $T_c$ and columns in $D_c$

- $A^T A$ is a block diagonal matrix with blocks $B_1^T B_1,\dots,B_k^T B_k$
- (i,j)-th entry of $B_c^T B_c$: term similarity between i-th and j-th documents whose topic is the concept c
- $B_c^T B_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$
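
The block structure is easy to see on a toy matrix. In the sketch below, the matrix entries and the assignment of terms and documents to two concepts are made up; `gram` is a hypothetical helper that computes $A^T A$.

```python
def gram(A):
    """Return A^T A for a matrix given as a list of rows."""
    n = len(A[0])
    return [[sum(row[d] * row[e] for row in A) for e in range(n)]
            for d in range(n)]


# Rows are terms, columns are documents.
# Terms 0-1 and documents 0-1 belong to concept 1;
# terms 2-3 and documents 2-3 belong to concept 2.
A = [
    [2, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 4, 1],
]
G = gram(A)
```

Documents on different concepts share no terms, so every entry of `G` that crosses the two blocks is zero, while the entries inside a block are the within-concept document similarities.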

- $G_c$ is a “random” graph
- First and second eigenvalues of $B_c^T B_c$ are well separated
- For all c, c’, the second eigenvalue of $B_c^T B_c$ is smaller than the first eigenvalue of $B_{c'}^T B_{c'}$
- Top k eigenvalues of $A^T A$ are the principal eigenvalues of $B_c^T B_c$ for c = 1,…,k
- Let $u_1,\dots,u_k$ be corresponding eigenvectors
- For every document d on topic c, $A_d$ is orthogonal to all $u_1,\dots,u_k$, except for $u_c$.
- $A^k_d$ is a scalar multiple of $u_c$.

- A more general generative model
- Also explain the improved treatment of polysemy

End of Lecture 5