
Scalable Techniques for Clustering the Web


Presentation Transcript


  1. Scalable Techniques for Clustering the Web Taher H. Haveliwala Aristides Gionis Piotr Indyk Stanford University {taherh,gionis,indyk}@cs.stanford.edu

  2. Project Goals • Generate fine-grained clustering of web based on topic • Similarity search (“What’s Related?”) • Two major issues: • Develop appropriate notion of similarity • Scale up to millions of documents

  3. Prior Work • Offline: detecting replicas • [Broder-Glassman-Manasse-Zweig’97] • [Shivakumar-Garcia-Molina’98] • Online: finding/grouping related pages • [Zamir-Etzioni’98] • [Manjara] • Link based methods • [Dean-Henzinger’99, Clever]

  4. Prior Work: Online, Link • Online: cluster results of search queries • Does not work for clustering the entire web offline • Link based approaches are limited • What about relatively new pages? • What about less popular pages?

  5. Prior Work: Copy detection • Designed to detect duplicates/near-replicas • Do not scale when the notion of similarity is broadened to ‘topical’ similarity • Creation of document-document similarity matrix is the core challenge: join bottleneck

  6. Pairwise similarity • Consider relation Docs(id, sentence) • Must compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.sentence = D2.sentence GROUP BY D1.id, D2.id HAVING count(*) > τ • What if we change ‘sentence’ to ‘word’?

  7. Pairwise similarity • Relation Docs(id, word) • Compute: SELECT D1.id, D2.id FROM Docs D1, Docs D2 WHERE D1.word = D2.word GROUP BY D1.id, D2.id HAVING count(*) > τ • For 25M urls, could take months to compute!
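The join bottleneck is easy to see in code: evaluating the query above amounts to scoring every pair of documents. A minimal sketch in Python, assuming each document's bag is already available as a set of words (the function and data layout are illustrative, not the actual system):

    from itertools import combinations

    def naive_similar_pairs(bags, threshold):
        """bags: dict mapping url -> set of words."""
        pairs = []
        for (u, bu), (v, bv) in combinations(bags.items(), 2):
            if len(bu & bv) > threshold:  # shared-word count, as in the SQL join
                pairs.append((u, v))
        return pairs

At 25M urls this loop visits roughly 3×10^14 pairs, which is where the months-long estimate comes from.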

  8. Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters

  9. Document representation • Bag of words model • Bag for each page p consists of • Title of p • Anchor text of all pages pointing to p (Also include window of words around anchors)

  10. Bag Generation [Figure: pages such as http://www.foobar.com/ and http://www.baz.com/ contain links to http://www.music.com/ surrounded by text like “...click here for a great music page...” and “...this music is great...”; the words around those anchors contribute to the bag for www.music.com]

  11. Bag Generation • Union of ‘anchor windows’ is a concise description of a page. • Note that using anchor windows, we can cluster more documents than we’ve crawled: • In general, a set of N documents refers to c·N urls, for some constant c > 1 (see the sketch below)
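As a rough illustration of how such a bag could be assembled, here is a sketch in Python; the window length and the (before, anchor, after) layout are assumptions made for illustration, and slide 37 notes the window length is still being tuned:

    WINDOW = 8  # hypothetical: words kept on each side of an anchor

    def bag_for(title_words, anchors):
        """Build a page's bag from its title plus the anchor windows of
        all links pointing to it. anchors: list of (before, anchor, after)
        word lists taken from referring pages."""
        bag = list(title_words)
        for before, anchor, after in anchors:
            bag += before[-WINDOW:] + anchor + after[:WINDOW]
        return bag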

  12. Standard IR • Remove stopwords (~ 750) • Remove high frequency & low frequency terms • Use stemming • Apply TFIDF scaling
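A minimal sketch of this preprocessing, assuming log-scaled TFIDF and illustrative frequency cutoffs (the slide does not specify the exact TFIDF variant, the cutoff values, or the stemmer, so tokens are assumed pre-stemmed here):

    import math
    from collections import Counter

    STOPWORDS = {"the", "a", "of", "and"}  # stand-in for the ~750-word list

    def tfidf_bags(bags, min_df=2, max_df_ratio=0.5):
        """bags: dict url -> list of stemmed tokens.
        Returns dict url -> {term: tfidf weight}."""
        n = len(bags)
        df = Counter(w for bag in bags.values() for w in set(bag))
        out = {}
        for url, bag in bags.items():
            tf = Counter(w for w in bag if w not in STOPWORDS)
            out[url] = {w: c * math.log(n / df[w])
                        for w, c in tf.items()
                        if min_df <= df[w] <= max_df_ratio * n}  # drop rare & common terms
        return out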

  13. Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters

  14. Similarity • Similarity metric for pages U1, U2, that were assigned bags B1, B2, respectively • sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2| • Threshold is set to 20%
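This metric is the Jaccard coefficient of the two bags treated as sets; a direct transcription in Python:

    def sim(b1, b2):
        """Jaccard similarity of two word sets."""
        return len(b1 & b2) / len(b1 | b2)

    # sim({"jazz", "music", "radio"}, {"music", "radio", "news"}) == 0.5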

  15. Reality Check: pages most similar to www.foodchannel.com (with similarity score)
  www.epicurious.com/a_home/a00_home/home.html 0.37
  www.gourmetworld.com 0.36
  www.foodwine.com 0.325
  www.cuisinenet.com 0.3125
  www.kitchenlink.com 0.3125
  www.yumyum.com 0.3
  www.menusonline.com 0.3
  www.snap.com/directory/category/0,16,-324,00.html 0.2875
  www.ichef.com 0.2875
  www.home-canning.com 0.275

  16. Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters

  17. Pair Generation • Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20% • Ignore all url pairs with sim < 20% • How do we avoid the join bottleneck?

  18. Locality Sensitive Hashing • Idea: use special kind of hashing • Locality Sensitive Hashing (LSH) provides a solution: • Min-wise hash functions [Broder’98] • LSH [Indyk-Motwani’98], [Cohen et al’2000] • Properties: • Similar urls are hashed together w.h.p. • Dissimilar urls are not hashed together

  19. Locality Sensitive Hashing [Figure: urls hashed into buckets; music-related urls (music.com, opera.com, sing.com) collide with each other while sports.com and golf.com land in a different bucket]

  20. Hashing • Two steps • Min-hash (MH): a way to consistently sample words from bags • Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not

  21. Step 1: Min-hash • Step 1: Generate m min-hash signatures for each url (m = 80) • For i = 1...m • Generate a random order hi on words • mhi(u) = argmin {hi(w) | w ∈ Bu} • Pr(mhi(u) = mhi(v)) = sim(u, v)

  22. Step 1: Min-hash • Round 1: ordering = [cat, dog, mouse, banana] • Set A: {mouse, dog} → MH-signature = dog • Set B: {cat, mouse} → MH-signature = cat

  23. Step 1: Min-hash • Round 2: ordering = [banana, mouse, cat, dog] • Set A: {mouse, dog} → MH-signature = mouse • Set B: {cat, mouse} → MH-signature = mouse
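A minimal sketch of Step 1 with m = 80, simulating the m random orderings with per-round salted hashes, a common substitute for explicit permutations (the system's actual hash family is not given on the slides):

    import hashlib

    M = 80  # number of min-hash rounds, as on slide 21

    def h(i, word):
        """Stand-in for the i-th random ordering on words."""
        return hashlib.md5(f"{i}:{word}".encode()).hexdigest()

    def minhash_signature(bag):
        """bag: nonempty set of words -> list of m min-hash values."""
        return [min(bag, key=lambda w: h(i, w)) for i in range(M)]

This also explains the probability claim: under any ordering, two sets pick the same signature exactly when the minimum word over their union lies in their intersection, so the collision probability equals sim(u, v).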

  24. Step 2: LSH • Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3) • For i = 1...l • Randomly select k min-hash indices and concatenate them to form the i-th LSH signature

  25. Step 2: LSH • Generate candidate pair if u and v have an LSH signature in common in any round • Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k

  26. Step 2: LSH • Set A: {mouse, dog, horse, ant}: MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog → LSH134 = horse-ant-dog, LSH234 = mouse-ant-dog • Set B: {cat, ice, shoe, mouse}: MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe → LSH134 = cat-ice-shoe, LSH234 = mouse-ice-shoe
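A minimal sketch of Step 2 with l = 125 and k = 3, building on the min-hash sketch above; fixing the random index sets with a seed so that every document uses the same k indices per round is an illustrative detail, not necessarily the original scheme:

    import random

    M, L, K = 80, 125, 3

    rng = random.Random(0)  # fixed seed: all documents share the same index sets
    INDEX_SETS = [rng.sample(range(M), K) for _ in range(L)]

    def lsh_signatures(mh):
        """mh: the m min-hash values of one document -> l LSH signatures."""
        return ["-".join(mh[i] for i in idx) for idx in INDEX_SETS]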

  27. Step 2: LSH • Bottom line - probability of collision per LSH signature: • 10% similarity → 0.1% • 1% similarity → 0.0001%
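The arithmetic behind these figures: a pair with Jaccard similarity s matches on one LSH signature with probability s^k, and on at least one of the l signatures with probability 1 - (1 - s^k)^l:

    k, l = 3, 125
    for s in (0.20, 0.10, 0.01):
        per_sig = s ** k                    # one LSH signature collides
        any_round = 1 - (1 - per_sig) ** l  # at least one of l collides
        print(f"s={s:.0%}: per-signature {per_sig:.4%}, any round {any_round:.2%}")
    # s=20%: per-signature 0.8000%, any round 63.36%
    # s=10%: per-signature 0.1000%, any round 11.76%
    # s=1%: per-signature 0.0001%, any round 0.01%

So pairs at the 20% threshold generate a candidate with high probability, while 1%-similar pairs almost never do.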

  28. Step 2: LSH Round 1 [Figure: buckets keyed by LSH signatures; sports.com, golf.com, and party.com share bucket ‘sport-team-win’; music.com and opera.com share ‘music-sound-play’; sing.com falls alone into ‘sing-music-ear’]

  29. Step 2: LSH Round 2 [Figure: sports.com and golf.com share bucket ‘game-team-score’; music.com and sing.com now share ‘audio-music-note’; opera.com falls into ‘theater-luciano-sing’]

  30. Sort & Filter • Using all buckets from all LSH rounds, generate candidate pairs • Sort candidate pairs on first field • Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of their MH-signatures • Ready for “What’s Related?” queries...
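A minimal in-memory sketch of this step, reusing the signature helpers sketched above; at 25M urls the real pipeline would sort candidate pairs on disk rather than hold them in a Python set:

    from collections import defaultdict

    def candidate_pairs(lsh_sigs):
        """lsh_sigs: dict url -> list of LSH signatures."""
        buckets = defaultdict(list)
        for url, sigs in lsh_sigs.items():
            for s in sigs:
                buckets[s].append(url)
        pairs = set()
        for urls in buckets.values():
            pairs.update((u, v) for u in urls for v in urls if u < v)
        return pairs

    def filter_pairs(pairs, mh, threshold=0.20):
        """Keep (u, v) only if at least 20% of min-hash values agree."""
        return sorted((u, v) for u, v in pairs
                      if sum(a == b for a, b in zip(mh[u], mh[v]))
                      >= threshold * len(mh[u]))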

  31. Overview • Choose document representation • Choose similarity metric • Compute pairwise document similarities • Generate clusters

  32. Clustering • The set of document pairs represents the document-document similarity matrix with 20% similarity threshold • Clustering algorithms • S-Link: connected components • C-Link: maximal cliques • Center: approximation to C-Link

  33. Center • Scan through pairs (they are sorted on first component) • For each run [(u, v1), ... , (u, vn)] • if u is not marked • cluster = u + unmarked neighbors of u • mark u and all neighbors of u
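A minimal sketch of Center in Python, assuming the pair list contains each pair in both orientations so that every url's neighbors form one contiguous run after sorting (the slides leave this detail implicit):

    from itertools import groupby

    def center_clusters(sorted_pairs):
        """sorted_pairs: list of (u, v) sorted on the first component."""
        marked, clusters = set(), []
        for u, run in groupby(sorted_pairs, key=lambda p: p[0]):
            neighbors = [v for _, v in run]  # consume the run before the check
            if u in marked:
                continue
            clusters.append([u] + [v for v in neighbors if v not in marked])
            marked.add(u)
            marked.update(neighbors)
        return clusters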

  34. Center

  35. Results: 20 million urls processed on a Pentium-II 450 MHz

  36. Sample Cluster
  feynman.princeton.edu/~sondhi/205main.html
  hep.physics.wisc.edu/wsmith/p202/p202syl.html
  hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
  pdg.lbl.gov/mc_particle_id_contents.html
  physics.ucsc.edu/courses/10.html
  town.hall.org/places/SciTech/qmachine
  www.as.ua.edu/physics/hetheory.html
  www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
  www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
  www.phy.duke.edu/Courses/271/Synopsis.html
  ... (total of 27 urls)

  37. Ongoing/Future Work • Tune anchor-window length • Develop system to measure quality • What is ground truth? • How do you judge clustering of millions of pages?
