
Locality Sensitive Hashing


Presentation Transcript


  1. Locality Sensitive Hashing • Basics and applications

  2. A well-known problem • Given a large collection of documents • Identify the near-duplicate documents • Web search engines • Proliferation of near-duplicate documents • Legitimate – mirrors, local copies, updates, … • Malicious – spam, spider-traps, dynamic URLs, … • 30% of web-pages are near-duplicates [1997]

  3. Natural Approaches • Fingerprinting: • only works for exact matches • Karp-Rabin (rolling hash) – collision probability guarantees • MD5 – cryptographically-secure string hashes • Edit-distance • metric for approximate string-matching • expensive – even for one pair of documents • impossible – for billions of web documents • Random Sampling • sample substrings (phrases, sentences, etc.) • hope: similar documents ⇒ similar samples • But – even samples of the same document will differ

  4. Basic Idea: Shingling [Broder 1997] • dissect the document into q-grams (shingles). T = I live and study in Pisa, …. If we set q=3 the 3-grams are: <I live and> <live and study> <and study in> <study in Pisa> … • represent documents by their sets of hash[shingle] ⇒ the problem reduces to set intersection among sets of integers
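As a concrete illustration, here is a minimal Python sketch of word-level shingling with each q-gram hashed to a 64-bit integer; the particular hash (SHA-1 truncated to 64 bits) and the example text are illustrative assumptions, not the slides' exact construction.

```python
import hashlib

def shingles(text, q=3):
    """Split a document into word-level q-grams and hash each to a 64-bit integer."""
    words = text.split()
    grams = [" ".join(words[i:i + q]) for i in range(len(words) - q + 1)]
    # hash[shingle]: keep only the low 64 bits of a SHA-1 digest (an arbitrary choice).
    return {int(hashlib.sha1(g.encode()).hexdigest(), 16) & (2**64 - 1) for g in grams}

T = "I live and study in Pisa and I like it"
print(sorted(shingles(T)))   # the document is now a set of integers
```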

  5. Basic Idea: Shingling [Broder 1997] • From Doc A and Doc B build the shingle sets SA and SB • Measure their overlap by set intersection, i.e. the Jaccard similarity sim(SA,SB) = |SA ∩ SB| / |SA ∪ SB| • Claim: A & B are near-duplicates if sim(SA,SB) is high
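A toy Jaccard computation on hashed shingle sets (the integer sets below are made-up values, just to show the formula):

```python
def jaccard(sa, sb):
    """Jaccard similarity |SA ∩ SB| / |SA ∪ SB| of two shingle sets."""
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hashed shingle sets of two near-duplicate documents (toy integer values).
SA = {12, 35, 77, 90, 101}
SB = {12, 35, 77, 90, 205}
print(jaccard(SA, SB))       # 4 shared / 6 in the union ≈ 0.67
```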

  6. Sec. 19.6 Sketching of a document • From each shingle-set we build a “sketch vector” (of ~200 components) • Postulate: documents that share ≥ t components of their sketch-vectors are claimed to be near-duplicates

  7. Sketching by Min-Hashing • Consider SA, SB ⊆ P = {0,…,p-1} • Pick a random permutation π of the whole set P (such as π(x) = ax+b mod p) • Pick the minimal element of SA: a = min{π(SA)} • Pick the minimal element of SB: b = min{π(SB)} • Lemma: Pr[a = b] = |SA ∩ SB| / |SA ∪ SB| = sim(SA,SB)
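A minimal Python sketch of one min-hash, assuming a random linear permutation π(x) = (a·x + b) mod P over a large prime P; the prime, the seed and the toy sets are assumptions for illustration.

```python
import random

P = 2**61 - 1                      # a large prime; hashed shingles live in {0, ..., P-1}

def random_permutation(rng):
    """pi(x) = (a*x + b) mod P, a random linear permutation of {0, ..., P-1}."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: (a * x + b) % P

def min_hash(s, pi):
    """Minimal element of the set s under the permutation pi."""
    return min(pi(x) for x in s)

rng = random.Random(0)
pi = random_permutation(rng)
SA = {12, 35, 77, 90, 101}
SB = {12, 35, 77, 90, 205}
# Lemma: Pr[min_hash(SA, pi) == min_hash(SB, pi)] = sim(SA, SB), over the choice of pi.
print(min_hash(SA, pi) == min_hash(SB, pi))
```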

  8. Strengthening it… • Similarity sketch sk(A): the d minimal elements under π(SA) • Or take d permutations and the min of each • Note: we can reduce the variance by using a larger d • Typically d is a few hundred minima (~200)

  9. Sec. 19.6 Computing Sketch[i] for Doc1 • Start with the 64-bit values f(shingles), i.e. integers in [0, 2^64) • Permute them with π_i • Pick the min value as Sketch[i]

  10. Sec. 19.6 • Test if Doc1.Sketch[i] = Doc2.Sketch[i], i.e. whether the two 64-bit minima (values in [0, 2^64)) are equal • Claim: this happens with probability Size_of_intersection / Size_of_union • Use 200 random permutations (200 minima), thus creating one 200-dim vector per document, and evaluate the fraction of shared components
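Putting the pieces together, a small Python sketch that builds 200-component sketch vectors and estimates similarity as the fraction of shared components; the prime, the seed and the toy sets are illustrative choices, not the slides' exact setup.

```python
import random

P = 2**61 - 1                             # a large prime universe for the hashed shingles
D = 200                                   # number of permutations = sketch length

def make_permutations(d=D, seed=0):
    """d random linear permutations pi_i(x) = (a_i*x + b_i) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(d)]

def sketch(shingle_set, perms):
    """One min-hash per permutation -> a D-dimensional sketch vector."""
    return [min((a * x + b) % P for x in shingle_set) for a, b in perms]

def estimated_similarity(sk1, sk2):
    """The fraction of shared components estimates the Jaccard similarity."""
    return sum(u == v for u, v in zip(sk1, sk2)) / len(sk1)

perms = make_permutations()
SA = {12, 35, 77, 90, 101}                # toy hashed shingle sets
SB = {12, 35, 77, 90, 205}
print(estimated_similarity(sketch(SA, perms), sketch(SB, perms)))   # ≈ 4/6
```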

  11. It’s even more difficult… • So we have squeezed a few KBs of data (a web page) into a few hundred bytes • But you still need a brute-force comparison (quadratic time) to compute all nearly-duplicate documents • This is too much even if it is executed in RAM

  12. Locality Sensitive Hashing: the case of the Hamming distance • How to quickly compute the fraction of different components in d-dim vectors • How to quickly compute the Hamming distance between d-dim vectors • Fraction of different components = HammingDist/d

  13. A warm-up • Consider the case of binary (sketch) vectors, thus living in the hypercube {0,1}^d • Hamming distance D(p,q) = # coordinates on which p and q differ • Define the hash function h by choosing a set I of k random coordinates: h(p) = p|I = projection of p on I • Example: if p=01011 (d=5) and we pick I={1,4} (with k=2), then h(p)=01 • Note the similarity with the Bloom Filter
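A minimal Python sketch of the coordinate-projection hash h(p) = p|I; coordinates are 0-indexed here (unlike the 1-indexed example above), and the seed and vector are illustrative.

```python
import random

def make_projection(d, k, rng):
    """Pick a set I of k random coordinates of {0, ..., d-1}; h(p) = p restricted to I."""
    I = sorted(rng.sample(range(d), k))
    def h(p):
        return tuple(p[i] for i in I)
    return h, I

rng = random.Random(0)
p = [0, 1, 0, 1, 1]                 # the bit-vector 01011, d = 5
h, I = make_projection(d=5, k=2, rng=rng)
print(I, h(p))                      # the chosen coordinates and the projected bits
```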

  14. A key property • Pr[picking an equal component] = (d - D(p,q))/d, hence Pr[h(p)=h(q)] = ((d - D(p,q))/d)^k • We can vary this probability by changing k • [Figure: Pr of a match as a function of the distance, for k=1 and k=2: larger k makes the curve drop faster] • What about False Negatives?

  15. Reiterate • Repeat L times the k-projections hi • Declare a «match» if at least one hi matches • Example: d=5, k=2, p = 01011 and q = 00101 • I1 = {2,4}: h1(p) = 11 and h1(q) = 00 • I2 = {1,4}: h2(p) = 01 and h2(q) = 00 • I3 = {1,5}: h3(p) = 01 and h3(q) = 01 • We set g( ) = < h1( ), h2( ), h3( ) > ⇒ p and q match !!
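The same idea in a few lines of Python: L independent k-coordinate projections, with a match declared if at least one of them agrees (seed and vectors are illustrative, and coordinates are 0-indexed).

```python
import random

def make_projections(d, k, L, seed=0):
    """L independent k-coordinate projections h_1, ..., h_L."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(d), k)) for _ in range(L)]

def lsh_match(p, q, projections):
    """Declare a match if at least one projection h_i agrees on p and q."""
    return any(all(p[i] == q[i] for i in I) for I in projections)

p = [0, 1, 0, 1, 1]   # 01011
q = [0, 0, 1, 0, 1]   # 00101
print(lsh_match(p, q, make_projections(d=5, k=2, L=3)))
```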

  16. Measuring the error prob. • The g() consists of L independent hashes hi • Pr[g(p) matches g(q)] = 1 - Pr[hi(p) ≠ hi(q) for all i=1,…,L] = 1 - (1 - s^k)^L, where s = (d - D(p,q))/d is the probability that a single coordinate is equal • As a function of s this is an S-shaped curve, whose threshold sits around s ≈ (1/L)^(1/k)
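A few lines of Python evaluating this S-curve; the values of k, L and s are arbitrary, chosen only to show the shape.

```python
def match_probability(s, k, L):
    """Pr[g(p) matches g(q)] = 1 - (1 - s^k)^L, with s = (d - D(p,q)) / d."""
    return 1.0 - (1.0 - s ** k) ** L

# The S-curve: pairs with low per-coordinate equality s almost never match,
# pairs with high s almost always do; the threshold sits near (1/L)**(1/k).
for s in (0.2, 0.5, 0.8, 0.95):
    print(s, round(match_probability(s, k=5, L=20), 3))
```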

  17. Find groups of similar items • SOL 1: Buckets provide the candidate similar items; «Merge» similar sets if they share items • Points falling in the same bucket are possibly similar objects • [Figure: a point p is hashed by h1(p), h2(p), …, hL(p) into the tables T1, T2, …, TL]

  18. Find groups of similar items • SOL 1: Buckets provide the candidate similar items • SOL 2: Sort items by the hi(), and pick as candidate similar the equal ones; repeat L times, for all hi() • «Merge» candidate sets if they share items • Check the candidates !!! • What about clustering ?
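A minimal Python sketch of SOL 1: hash every point into L tables and report as candidates the pairs that collide in at least one bucket. The point set and parameters are toy assumptions; candidates must still be verified with the true distance.

```python
from collections import defaultdict
import random

def build_tables(points, d, k, L, seed=0):
    """Hash every point into L tables; points sharing a bucket are candidate similar items."""
    rng = random.Random(seed)
    projections = [sorted(rng.sample(range(d), k)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for name, p in points.items():
        for T, I in zip(tables, projections):
            T[tuple(p[i] for i in I)].append(name)
    return tables, projections

def candidate_pairs(tables):
    """All pairs colliding in at least one bucket; they still need an exact check."""
    pairs = set()
    for T in tables:
        for bucket in T.values():
            for i in range(len(bucket)):
                for j in range(i + 1, len(bucket)):
                    pairs.add(tuple(sorted((bucket[i], bucket[j]))))
    return pairs

points = {"a": [0, 1, 0, 1, 1], "b": [0, 1, 0, 1, 0], "c": [1, 0, 1, 0, 0]}
tables, projections = build_tables(points, d=5, k=2, L=3)
print(candidate_pairs(tables))
```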

  19. LSH versus K-means • What about optimality? K-means is locally optimal [recently, some researchers showed how to introduce some guarantees] • What about the Sim-cost? K-means compares items in Θ(d) time and space [notice that d may be millions or billions] • What about the cost per iteration and their number? Typically K-means requires few iterations, each costing K × n comparisons, i.e. I × K × n × d time overall • What about K? In principle one has to try K = 1, …, n • LSH needs sort(n) time, hence, on disk, few passes over the data, and with guaranteed error bounds

  20. Also on-line queries • Given a query q, check the buckets of hj(q) for j=1,…,L • [Figure: q is hashed by h1(q), h2(q), …, hL(q) into the tables T1, T2, …, TL]
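A small Python sketch of the on-line query: the same L projections index a few toy points, and a query inspects only its own L buckets. All names, seeds and values here are illustrative.

```python
from collections import defaultdict
import random

rng = random.Random(0)
d, k, L = 5, 2, 3
projections = [sorted(rng.sample(range(d), k)) for _ in range(L)]

# Index a few toy points into the L tables T_1, ..., T_L.
points = {"a": [0, 1, 0, 1, 1], "b": [0, 1, 0, 1, 0], "c": [1, 0, 1, 0, 0]}
tables = [defaultdict(list) for _ in range(L)]
for name, p in points.items():
    for T, I in zip(tables, projections):
        T[tuple(p[i] for i in I)].append(name)

def query(q):
    """Check only the buckets h_j(q), j = 1, ..., L, and return the candidates found there."""
    candidates = set()
    for T, I in zip(tables, projections):
        candidates.update(T.get(tuple(q[i] for i in I), []))
    return candidates

print(query([0, 1, 0, 0, 1]))   # candidates near the query, to be verified exactly
```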

  21. Locality Sensitive Hashing and its applications • More problems, indeed

  22. Another classic problem • The problem: given U users, the goal is to find groups of similar users (or users similar to a query user Q) • Features = personal data, preferences, purchases, navigational behavior, followers/following or +1, … • A feature is typically a numerical value: binary or real • Hamming distance: # different components

  23. More than Hamming distance • Example: q is the query point, P* is its Nearest Neighbor • [Figure: the query point q and its nearest neighbor P* among a set of points]

  24. Approximation helps • [Figure: the query q, its nearest neighbor p*, and a ball of radius r around q]

  25. A slightly different problem: Approximate Nearest Neighbor • Given an error parameter ε>0 • For query q and nearest-neighbor p’, return p such that D(q,p) ≤ (1+ε) D(q,p’) • Justification • Mapping objects to a metric space is heuristic anyway • Get tremendous performance improvement

  26. A workable approach • Given an error parameter ε>0 and a distance threshold t>0 • (t,ε)-Approximate NN Query • If there is no point p with D(q,p) < t, return FAILURE • Else, return any p’ with D(q,p’) < (1+ε)t • Application: Approximate Nearest Neighbor • Assume the maximum distance is T • Run in parallel for t = 1, (1+ε), (1+ε)^2, …, T • Time/space – O(log_{1+ε} T) overhead

  27. Locality Sensitive Hashing and its applications • The analysis

  28. LSH Analysis • For a fixed threshold r, we distinguish between • Near: D(p,q) < r • Far: D(p,q) > (1+ε)r • A locality-sensitive hash h should guarantee • Near points are hashed together with Pr[h(a)=h(b)] ≥ P1 • Far points may be mapped together, but Pr[h(a)=h(c)] ≤ P2 • where, of course, we have that P1 > P2 • [Figure: a is near b and far from c]

  29. What about the Hamming distance? • Family: hi(.) = p|c1,…,ck, where the coordinates ci are chosen at random • If D(a,b) ≤ r, then Pr[hi(a) = hi(b)] = (1 – D(a,b)/d)^k ≥ (1 – r/d)^k = (p1)^k = P1 • If D(a,c) > (1+ε)r, then Pr[hi(a) = hi(c)] = (1 – D(a,c)/d)^k < (1 – r(1+ε)/d)^k = (p2)^k = P2 • where, of course, we have that p1 > p2 (and thus P1 > P2)

  30. LSH Analysis • The LSH-algorithm with the L mappings hi() correctly solves the (r,ε)-NN problem on a query point q if the following hold: • The total number of points FAR from q and belonging to the visited buckets hi(q) is at most 3L (a constant per bucket, on average). • If p* is NEAR to q, then hi(p*) = hi(q) for some i (p* is in a visited bucket) • Theorem. Take k = log_{1/p2} n and L = n^ρ with ρ = ln p1 / ln p2; then the two properties above hold with probability at least 0.298. Repeating the process Θ(1/δ) times, we ensure a probability of success of at least 1-δ. • Space ≈ nL = n^{1+ρ}, where ρ = (ln p1 / ln p2) < 1 • Query time ≈ L buckets accessed, which are n^ρ
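As a quick illustration of how these parameters scale, a few lines of Python computing k, L and ρ from n, p1 and p2; the numeric values below are made up.

```python
import math

def lsh_parameters(n, p1, p2):
    """k = log_{1/p2} n and L = n^rho with rho = ln p1 / ln p2 (rho < 1 when p1 > p2)."""
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    rho = math.log(p1) / math.log(p2)
    L = math.ceil(n ** rho)
    return k, L, rho

# Toy numbers: for the Hamming family p1 = 1 - r/d and p2 = 1 - (1 + eps) * r / d.
k, L, rho = lsh_parameters(n=1_000_000, p1=0.9, p2=0.6)
print(k, L, rho)    # space ~ n^(1 + rho), query time ~ L = n^rho buckets
```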

  31. Proof • p* is a point near to q: D(q,p*) < r • FAR(q) = set of points p s.t. D(q,p) > (1+ε) r • BUCKETi(q) = set of points p s.t. hi(p) = hi(q) • Let us define the following events: • E1 = the number of far points in the visited buckets is ≤ 3L • E2 = p* occurs in some visited bucket, i.e. ∃ j s.t. hj(q) = hj(p*)

  32. Bad collisions: more than 3L • Let p be a point in FAR(q): for a fixed j, Pr[hj(p) = hj(q)] ≤ P2 = (p2)^k • Given that k = log_{1/p2} n, a far point p satisfies hj(p) = hj(q), for a fixed j, with probability (p2)^{log_{1/p2} n} = 1/n • Hence the expected number X of far points in the L visited buckets is at most L · |FAR(q)| / n ≤ L • By Markov's inequality Pr[X > 3E[X]] ≤ 1/3, it follows that Pr[not E1] = Pr[X > 3L] ≤ 1/3

  33. Good collision: p* occurs • For any hj, Pr[hj(p*) = hj(q)] ≥ P1 = (p1)^k = (p1)^{log_{1/p2} n} = n^{-ln p1 / ln p2} = n^{-ρ} • Given that L = n^ρ (with ρ = ln p1 / ln p2), this is exactly 1/L • So we have that Pr[not E2] = Pr[not finding p* in q’s buckets] = Π_j (1 - Pr[hj(p*) = hj(q)]) ≤ (1 – 1/L)^L ≤ 1/e • Finally Pr[E1 and E2] ≥ 1 – Pr[not E1 OR not E2] ≥ 1 – (Pr[not E1] + Pr[not E2]) ≥ 1 - (1/3) - (1/e) ≈ 0.298
