Latent Semantic Indexing (mapping onto a smaller space of latent concepts)


Presentation Transcript


  1. Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18

  2. Speeding up cosine computation • What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances? • Now: O(nm) • Then: O(km+kn), where k << n,m • Two methods: • “Latent semantic indexing” • Random projection

  3. A sketch • LSI is data-dependent • Create a k-dim subspace by eliminating redundant axes • Pull together “related” axes – hopefully • car and automobile (but what about polysemy?) • Random projection is data-independent • Choose a k-dim subspace that guarantees good stretching properties with high probability between pairs of points

  4. Notions from linear algebra • Matrix A, vector v • Matrix transpose (At) • Matrix product • Rank • Eigenvalue λ and eigenvector v: Av = λv

  5. Overview of LSI • Pre-process docs using a technique from linear algebra called Singular Value Decomposition • Create a new (smaller) vector space • Queries handled (faster) in this new space

  6. Singular-Value Decomposition • Recall the m×n matrix of terms × docs, A • A has rank r ≤ m,n • Define the term-term correlation matrix T = AAt • T is a square, symmetric m×m matrix • Let P be the m×r matrix of the eigenvectors of T • Define the doc-doc correlation matrix D = AtA • D is a square, symmetric n×n matrix • Let R be the n×r matrix of the eigenvectors of D

  7. A’s decomposition • There exist matrices P (for T, m×r) and R (for D, n×r) formed by orthonormal columns (unit dot-product) • It turns out that A = P S Rt, where S is an r×r diagonal matrix holding, in decreasing order, the singular values of A (the square roots of the eigenvalues of T = AAt) • Dimensions: A (m×n) = P (m×r) · S (r×r) · Rt (r×n)

  8. Dimensionality reduction • For some k << r, zero out all but the k biggest eigenvalues in S [the choice of k is crucial] • Denote by Sk this new version of S, having rank k • Typically k is about 100, while r (A’s rank) is > 10,000 • Ak = P Sk Rt; the columns of P and the rows of Rt beyond the k-th are useless due to the 0-columns/0-rows of Sk, so effectively Ak (m×n) = Pk (m×k) · Sk (k×k) · Rkt (k×n)
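
To make the truncation concrete, here is a minimal numpy sketch (the toy term-document matrix, its values, and k = 2 are illustrative assumptions, not taken from the slides). It computes the full SVD and then zeroes out all but the k largest singular values to obtain Ak; the final lines preview the guarantee stated on the next slide.

```python
import numpy as np

# Toy term-document matrix A (m=5 terms x n=4 docs); values are illustrative.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 2.],
              [0., 0., 1., 1.],
              [1., 0., 2., 0.]])

# Full SVD: A = P @ diag(s) @ Rt, singular values s in decreasing order.
P, s, Rt = np.linalg.svd(A, full_matrices=False)

# Zero out all but the k biggest singular values: the rank-k approximation Ak.
k = 2
A_k = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

print("singular values:", s)
print("||A - A_k||_2  =", np.linalg.norm(A - A_k, 2))
print("sigma_{k+1}    =", s[k])   # the two numbers above coincide
```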

  9. Guarantee • Ak is a pretty good approximation to A: • Relative distances are (approximately) preserved • Of all m×n matrices of rank k, Ak is the best approximation to A wrt the following measures: • min{B : rank(B)=k} ||A-B||2 = ||A-Ak||2 = σk+1 • min{B : rank(B)=k} ||A-B||F2 = ||A-Ak||F2 = σk+12 + σk+22 + ... + σr2 • Frobenius norm: ||A||F2 = σ12 + σ22 + ... + σr2

  10. Reduction • R, P are formed by the orthonormal eigenvectors of the matrices D, T • Xk = Sk Rt is the doc-matrix, k×n, hence reduced to k dims • Take the doc-correlation matrix: it is D = AtA = (P S Rt)t (P S Rt) = (S Rt)t (S Rt) • Approximate S with Sk, thus getting AtA ≈ Xkt Xk (both are n×n matrices) • We use Xk to define how to project A and q: • Xk = Sk Rt; substituting Rt = S-1 Pt A we get Xk = Sk S-1 Pt A = Pkt A • In fact, Sk S-1 Pt = Pkt, which is a k×m matrix • This means that to reduce a doc/query vector it is enough to multiply it by Pkt • Cost of sim(q,d), for all d, is O(kn+km) instead of O(mn)
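
The projection step can be sketched in a few lines of numpy, reusing the illustrative toy matrix and k = 2 from the sketch above; the query vector q is likewise hypothetical. It projects all docs and the query with Pkt and computes the cosine in the reduced space.

```python
import numpy as np

# Same illustrative A and k as in the SVD sketch above.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 2.],
              [0., 0., 1., 1.],
              [1., 0., 2., 0.]])
P, s, Rt = np.linalg.svd(A, full_matrices=False)
k = 2

Pk = P[:, :k]                       # m x k: top-k columns of P
X_k = Pk.T @ A                      # k x n: all docs projected (equals Sk Rkt)
q = np.array([1., 0., 0., 0., 1.])  # hypothetical query vector over the m terms
q_k = Pk.T @ q                      # project the query the same way

# Cosine similarity in k dims: O(km) to project q, then O(kn) over all docs.
sims = (X_k.T @ q_k) / (np.linalg.norm(X_k, axis=0) * np.linalg.norm(q_k))
print("sim(q, d_j) for all docs:", sims)
```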

  11. Which are the concepts? • The c-th concept = the c-th row of Pkt (which is k×m) • Denote it by Pkt[c], whose size is m = #terms • Pkt[c][i] = strength of association between the c-th concept and the i-th term • Projected document: d’j = Pkt dj • d’j[c] = strength of concept c in dj • Projected query: q’ = Pkt q • q’[c] = strength of concept c in q

  12. Random Projections Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

  13. An interesting math result • Lemma (Johnson-Lindenstrauss, ’84) • Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IRk such that for every pair of points u,v in P it holds: (1 - ε) ||u - v||2 ≤ ||f(u) – f(v)||2 ≤ (1 + ε) ||u - v||2, where k = O(ε-2 log n) • f() is called a JL-embedding • Setting v = 0 we also get a bound on f(u)’s stretching!!!

  14. What about the cosine-distance? • [Formulas omitted: the slide bounds the dot-product, and hence the cosine, by substituting the stretching guarantees for f(u) and f(v) into the JL formula above.]

  15. How to compute a JL-embedding? • Set R = (ri,j) to be a random m×k matrix, where the components are independent random variables with E[ri,j] = 0 and Var[ri,j] = 1 (e.g. standard Gaussians, or ±1 with equal probability)
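
A minimal sketch of such a JL-embedding (the dimensions m and k, the Rademacher ±1 distribution, and the 1/√k scaling are standard illustrative choices, not prescribed by the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10_000, 400                    # original / target dimensions (illustrative)

# R has i.i.d. entries with E[r_ij] = 0 and Var[r_ij] = 1 (here: +/-1 each w.p. 1/2).
R = rng.choice([-1.0, 1.0], size=(k, m))
f = lambda x: (R @ x) / np.sqrt(k)    # the JL-embedding into k dims

u = rng.standard_normal(m)
v = rng.standard_normal(m)
ratio = np.linalg.norm(f(u) - f(v))**2 / np.linalg.norm(u - v)**2
print("distortion:", ratio)           # close to 1, as the lemma promises
```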

  16. Finally... • Random projections hide large constants • k ≥ (1/ε)2 · log n, so it may be large… • it is simple and fast to compute • LSI is intuitive and may scale to any k • optimal under various metrics • but costly to compute

  17. Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

  18. Sec. 19.6 Duplicate documents • The web is full of duplicated content • Few exact duplicates • Many cases of near-duplicates • E.g., the last-modified date is the only difference between two copies of a page

  19. Natural Approaches • Fingerprinting: • only works for exact matches, slow • Checksum – no worst-case collision probability guarantees • MD5 – cryptographically-secure string hashes • Edit-distance • metric for approximate string-matching • expensive – even for one pair of strings • impossible – for 10^32 web documents • Random Sampling • sample substrings (phrases, sentences, etc.) • hope: similar documents → similar samples • But – even samples of the same document will differ

  20. Exact-Duplicate Detection • Obvious techniques • Checksum – no worst-case collision probability guarantees • MD5 – cryptographically-secure string hashes • relatively slow • Karp-Rabin’s Scheme • Rolling hash: split the doc into many pieces • Algebraic technique – arithmetic on primes • Efficient, and other nice properties…
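
A minimal sketch of the rolling-hash core of Karp-Rabin's scheme (the base B, the Mersenne prime P, and byte-level windows are illustrative choices): each q-byte window of the document is hashed in O(1) time per step, so splitting a doc into many pieces is cheap.

```python
# Illustrative parameters: base B and a large (Mersenne) prime P.
B, P = 256, (1 << 61) - 1

def rolling_hashes(doc: bytes, q: int):
    """Yield the Karp-Rabin hash of every q-byte window of doc, in O(1) per step."""
    h, top = 0, pow(B, q - 1, P)
    for i, byte in enumerate(doc):
        h = (h * B + byte) % P                   # append the new byte
        if i >= q - 1:
            yield h
            h = (h - doc[i - q + 1] * top) % P   # drop the leftmost byte

print(list(rolling_hashes(b"abracadabra", 4))[:3])
```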

  21. Near-Duplicate Detection • Problem • Given a large collection of documents • Identify the near-duplicate documents • Web search engines • Proliferation of near-duplicate documents • Legitimate – mirrors, local copies, updates, … • Malicious – spam, spider-traps, dynamic URLs, … • Mistaken – spider errors • 30% of web-pages are near-duplicates [1997]

  22. Desiderata • Storage: only small sketches of each document • Computation: the fastest possible • Stream Processing: • once the sketch is computed, the source is unavailable • Error Guarantees • problem scale → small biases have large impact • need formal guarantees – heuristics will not do

  23. Basic Idea [Broder 1997] • Shingling • dissect the document into q-grams (shingles) • represent documents by shingle-sets • reduce the problem to set intersection [Jaccard] • They are near-duplicates if their large shingle-sets intersect enough
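
A minimal shingling sketch (word-level q-grams with q = 3 and the two example sentences are illustrative; character-level q-grams are an equally valid choice). It also computes the Jaccard measure of the next slide directly on the shingle-sets.

```python
def shingles(text: str, q: int = 3) -> set:
    """Dissect a document into word-level q-grams (shingles)."""
    words = text.lower().split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

SA = shingles("a rose is a rose is a rose")
SB = shingles("a rose is a flower which is a rose")
jaccard = len(SA & SB) / len(SA | SB)
print(f"sim(SA, SB) = {jaccard:.2f}")   # 3 shared shingles out of 7 -> 0.43
```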

  24. Similarity of Documents • Doc A → shingle-set SA, Doc B → shingle-set SB • Jaccard measure – similarity of SA, SB: sim(SA,SB) = |SA ∩ SB| / |SA ∪ SB| • Claim: A & B are near-duplicates if sim(SA,SB) is high

  25. Basic Idea [Broder 1997] • Shingling • dissect the document into q-grams (shingles) • represent documents by shingle-sets • reduce the problem to set intersection [Jaccard] • They are near-duplicates if their large shingle-sets intersect enough • We need to cope with “Set Intersection”: • fingerprints of shingles (for space/time efficiency) • min-hash to estimate intersection sizes (for further time and space efficiency)

  26. Documents → Sets of 64-bit fingerprints • Pipeline: Doc → (shingling) → multiset of shingles → (fingerprinting) → multiset of fingerprints • Fingerprints: • use Karp-Rabin fingerprints over q-gram shingles (of 8q bits) • fingerprint space [0, …, U-1] • in practice, use 64-bit fingerprints, i.e., U = 2^64 • Prob[collision] ≈ (8q)/2^64 << 1 • This reduces the space for storing the multi-sets and the time to intersect them, but…

  27. Sec. 19.6 Speeding-up: Sketch of a document • Intersecting shingle-sets is too costly • Create a “sketch vector” (of size ~200) for each document, from its shingle-set • Documents that share ≥ t (say 80%) corresponding vector elements are near-duplicates

  28. Sketching by Min-Hashing • Consider SA, SB ⊆ P • Pick a random permutation π of P (such as ax+b mod |P|) • Define α = π-1( min{π(SA)} ), β = π-1( min{π(SB)} ) • i.e., the minimal element of each set under the permutation π • Lemma: Pr[α = β] = |SA ∩ SB| / |SA ∪ SB|
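
A minimal min-hash sketch in Python (the prime, the k = 200 permutations, and the toy integer sets are illustrative; with real data the set elements would be the 64-bit shingle fingerprints). Averaging the indicator "the two min-hashes agree" over k independent permutations estimates the Jaccard similarity, per the lemma above.

```python
import random

P_PRIME = (1 << 61) - 1   # prime defining the permutations pi(x) = (a*x + b) mod P

def minhash_sketch(S, k=200, seed=0):
    """k min-hash values of set S under k shared random permutations."""
    rnd = random.Random(seed)   # same seed -> same permutations for every document
    coeffs = [(rnd.randrange(1, P_PRIME), rnd.randrange(P_PRIME)) for _ in range(k)]
    return [min((a * x + b) % P_PRIME for x in S) for a, b in coeffs]

def estimate_jaccard(skA, skB):
    return sum(x == y for x, y in zip(skA, skB)) / len(skA)

A, B = set(range(0, 150)), set(range(50, 200))
print("true Jaccard:", len(A & B) / len(A | B))    # 100/200 = 0.5
print("estimate    :", estimate_jaccard(minhash_sketch(A), minhash_sketch(B)))
```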

  29. Strengthening it… • Similarity sketch sk(A) = the k minimal elements under π(SA) • Is k fixed, or a fixed ratio of |SA|, |SB|? • We might also take k permutations and the min of each • Similarity sketches sk(A): • succinct representation of the fingerprint sets SA • allow efficient estimation of sim(SA,SB) • basic idea is to use min-hash of fingerprints • Note: we can reduce the variance by using a larger k

  30. Sec. 19.6 Computing Sketch[i] for Doc1 • Start with the 64-bit fingerprints f(shingles) of Document 1, as points on the number line [0, 2^64) • Permute them on the number line with πi • Pick the min value

  31. Sec. 19.6 Test if Doc1.Sketch[i] = Doc2.Sketch[i] • Compute the min values A (for Document 1) and B (for Document 2) on the number line [0, 2^64) • Are these equal? • Test for 200 random permutations: π1, π2, …, π200

  32. Sec. 19.6 However… • A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection) • Claim: this happens with probability Size_of_intersection / Size_of_union

  33. Sum up… • Brute-force: compare sk(A) vs. sk(B) for all pairs of documents A and B • Locality sensitive hashing (LSH): • compute sk(A) for each document A • use LSH over all sketches; briefly: • take h elements of sk(A) as an ID (may induce false positives) • create t such IDs (to reduce the false negatives) • if one ID matches another (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them
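
A minimal sketch of this LSH step over min-hash sketches (the function name, h = 4 and t = 50 are illustrative; h·t must equal the sketch length, here 200). Documents colliding on at least one h-element band become candidate pairs, which are then verified against the full sketches.

```python
from collections import defaultdict

def lsh_candidates(sketches: dict, h: int = 4, t: int = 50) -> set:
    """sketches: doc_id -> min-hash sketch of length h*t. Returns candidate pairs."""
    candidates = set()
    for band in range(t):                              # t IDs per document
        buckets = defaultdict(list)
        for doc_id, sk in sketches.items():
            key = tuple(sk[band * h:(band + 1) * h])   # h elements as the ID
            buckets[key].append(doc_id)
        for ids in buckets.values():                   # same ID -> probable near-dups
            candidates.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return candidates
```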

  34. Search Engines: “Semantic” searches?

  35. Classical approach: the Vector Space model • Documents and queries are term vectors over the dictionary of terms; e.g. “Diego Maradona won against Mexico” → terms {against, Diego, Maradona, Mexico, won} • Similarity(v,w) ≈ cos(α), the angle between the two term vectors v and w • Mainly term-based: polysemy and synonymy issues
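
For concreteness, a tiny cosine computation over term vectors (the dictionary, the weights, and the second text are illustrative):

```python
import numpy as np

# Dictionary of terms: {against, Diego, Maradona, Mexico, won}.
v = np.array([1., 1., 1., 1., 1.])   # "Diego Maradona won against Mexico"
w = np.array([0., 1., 1., 0., 0.])   # a hypothetical text with only "Diego Maradona"
cos_alpha = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(f"Similarity(v, w) = cos(alpha) = {cos_alpha:.2f}")   # ~0.63
```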

  36. A new approach: massive graphs of entities and relations (May 2012)

  37. A typical issue: polysemy • “the paparazzi photographed the star” • “the astronomer photographed the star”

  38. Another issue: synonymy • “He is using Microsoft’s browser” • “He is a fan of Internet Explorer”

  39. http://tagme.di.unipi.it

  40. TAGME • Input: “Diego Maradona won against Mexico” • PARSING detects the spots and their candidate entities: • “Maradona” → Diego A. Maradona, Diego Maradona jr., Maradona Stadium, Maradona Film, … • “Mexico” → Mexico nation, Mexico state, Mexico football team, Mexico baseball team, … • “won” → Korean won, Win-loss record, Only won, ... • DISAMBIGUATION by a voting scheme selects one entity per spot • PRUNING by 2 simple features (a ρ-score per annotation), possibly outputting “No Annotation”

  41. Why is it more powerful? • “obama asks iran for RQ-170 sentinel drone back” → Barack Obama, Iran, Lockheed Martin RQ-170 Sentinel • “us president issues Ahmadinejad ultimatum” → President of the United States, Mahmoud Ahmadinejad, Ultimatum

  42. Text as a sub-graph of topics • Topics: Barack Obama, Mahmoud Ahmadinejad, President of the United States, RQ-170 drone, Iran, Ultimatum

  43. Text as a sub-graph of topics • Graph analysis allows us to find similarities between texts and entities even if they do not match syntactically (i.e., at the concept level) • Use any relatedness measure over the graph, e.g. [Milne & Witten, 2008] • Topics: Barack Obama, Mahmoud Ahmadinejad, President of the United States, RQ-170 drone, Iran, Ultimatum

  44. Search Results Clustering • TOPICS: • Jaguar Cars • Panthera Onca • Mac OS X • Atari Jaguar • Jacksonville Jags • Fender Jaguar • …

  45. Releasing open-source… • Paper at ACM WSDM 2012 • Paper at IEEE Software 2012 • Paper at ECIR 2012 • Please design your killer app... http://acube.di.unipi.it/tagme
