
Scalable Techniques for Clustering the Web

Taher H. Haveliwala

Aristides Gionis

Piotr Indyk

Stanford University

{taherh,gionis,indyk}@cs.stanford.edu


Project Goals

  • Generate a fine-grained clustering of the web based on topic

  • Similarity search (“What’s Related?”)

  • Two major issues:

    • Develop appropriate notion of similarity

    • Scale up to millions of documents


Prior Work

  • Offline: detecting replicas

    • [Broder-Glassman-Manasse-Zweig’97]

    • [Shivakumar-G. Molina’98]

  • Online: finding/grouping related pages

    • [Zamir-Etzioni’98]

    • [Manjara]

  • Link based methods

    • [Dean-Henzinger’99, Clever]


Prior Work: Online, Link

  • Online: cluster results of search queries

    • does not work for clustering entire web offline

  • Link based approaches are limited

    • What about relatively new pages?

    • What about less popular pages?


Prior Work: Copy detection

  • Designed to detect duplicates/near-replicas

  • Do not scale when the notion of similarity is broadened to ‘topical’ similarity

  • Creation of document-document similarity matrix is the core challenge: join bottleneck


Pairwise similarity

  • Consider relation Docs(id, sentence)

  • Must compute:

    SELECT D1.id, D2.id

    FROM Docs D1, Docs D2

    WHERE D1.sentence = D2.sentence

    GROUP BY D1.id, D2.id

    HAVING count(*) > threshold

  • What if we change ‘sentence’ to ‘word’?


Pairwise similarity

  • Relation Docs(id, word)

  • Compute:

    SELECT D1.id, D2.id

    FROM Docs D1, Docs D2

    WHERE D1.word = D2.word

    GROUP BY D1.id, D2.id

    HAVING count(*) > threshold

  • For 25M urls, could take months to compute!


Overview

  • Choose document representation

  • Choose similarity metric

  • Compute pairwise document similarities

  • Generate clusters


Document representation

  • Bag of words model

  • Bag for each page p consists of

    • Title of p

    • Anchor text of all pages pointing to p (Also include window of words around anchors)


Bag Generation

[Figure: anchor-window example - anchor text such as “MusicWorld” and the surrounding words on referring pages (“...click here for a great music page...”, “...this music is great...”) are collected into the bag of the page being linked to]


Bag Generation

  • Union of ‘anchor windows’ is a concise description of a page.

  • Note that using anchor windows, we can cluster more documents than we’ve crawled:

    • In general, a set of N documents refers to cN urls
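
A minimal sketch of how such a bag could be assembled in Python; the helper names, the window size, and the input format are illustrative assumptions rather than details from the original system:

```python
from collections import Counter

WINDOW = 8  # words of context kept on each side of an anchor (illustrative value)

def anchor_window(words, start, end, window=WINDOW):
    """Return the anchor text plus up to `window` words on either side."""
    return words[max(0, start - window):end + window]

def build_bag(title_words, incoming_links):
    """
    Build the bag of words for a target url.
    incoming_links: list of (referring_page_words, anchor_start, anchor_end)
    giving where a link to the target appears on each referring page.
    """
    bag = Counter(title_words)                 # title of the page itself
    for words, start, end in incoming_links:   # anchor windows from referring pages
        bag.update(anchor_window(words, start, end))
    return bag

# Two pages link to the target url; their anchor windows feed its bag.
page1 = "click here for a great music page".split()
page2 = "this music is great".split()
print(build_bag(["music", "world"], [(page1, 0, 2), (page2, 0, 4)]).most_common(3))
```

Note that the target page's own content is never needed, which is why urls that have only been linked to, not crawled, can still be clustered.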


Standard IR

  • Remove stopwords (~ 750)

  • Remove high frequency & low frequency terms

  • Use stemming

  • Apply TFIDF scaling
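
A rough Python sketch of this pipeline; the tiny stopword set, the crude suffix-stripping stemmer, and the frequency cutoffs are stand-ins (a production version would use the ~750-word stopword list and a Porter-style stemmer):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "for", "and", "to", "in"}  # stand-in for the ~750-word list

def crude_stem(word):
    # Placeholder for a real stemmer; just strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(bags, min_df=2, max_df_ratio=0.5):
    """bags: dict url -> Counter of raw words. Returns url -> {term: tfidf weight}."""
    n = len(bags)
    # Stopword removal + stemming.
    stemmed = {u: Counter(crude_stem(w) for w in bag.elements()
                          if w.lower() not in STOPWORDS)
               for u, bag in bags.items()}
    # Document frequencies; drop very rare and very common terms.
    df = Counter(t for bag in stemmed.values() for t in set(bag))
    keep = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}
    # TFIDF scaling.
    return {u: {t: tf * math.log(n / df[t]) for t, tf in bag.items() if t in keep}
            for u, bag in stemmed.items()}
```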


Overview

  • Choose document representation

  • Choose similarity metric

  • Compute pairwise document similarities

  • Generate clusters


Similarity

  • Similarity metric for pages U1, U2 that were assigned bags B1, B2, respectively:

    • sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|

  • Threshold is set to 20%
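
In code this is just the Jaccard coefficient over the two word sets; a minimal sketch, with the 20% threshold from the slide:

```python
def sim(bag1, bag2):
    """Jaccard similarity between two bags, treated as word sets."""
    s1, s2 = set(bag1), set(bag2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

THRESHOLD = 0.20  # pairs below this similarity are ignored
print(sim({"great", "music", "page"}, {"music", "page", "fun"}))  # 0.5
```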


Reality Check

www.foodchannel.com:

www.epicurious.com/a_home/a00_home/home.html .37

www.gourmetworld.com .36

www.foodwine.com .325

www.cuisinenet.com .3125

www.kitchenlink.com .3125

www.yumyum.com .3

www.menusonline.com .3

www.snap.com/directory/category/0,16,-324,00.html .2875

www.ichef.com .2875

www.home-canning.com .275


Overview

  • Choose document representation

  • Choose similarity metric

  • Compute pairwise document similarities

  • Generate clusters


Pair Generation

  • Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%

  • Ignore all url pairs with sim < 20%

  • How do we avoid the join bottleneck?


Locality Sensitive Hashing

  • Idea: use special kind of hashing

  • Locality Sensitive Hashing (LSH) provides a solution:

    • Min-wise hash functions [Broder’98]

    • LSH [Indyk, Motwani’98], [Cohen et al’2000]

  • Properties:

    • Similar urls are hashed together w.h.p.

    • Dissimilar urls are not hashed together


Locality Sensitive Hashing

[Figure: LSH buckets - music.com, opera.com, and sing.com hash to one bucket; sports.com and golf.com hash to another]


Hashing

  • Two steps

    • Min-hash (MH): a way to consistently sample words from bags

    • Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not


Step 1: Min-hash

  • Step 1: Generate m min-hash signatures for each url (m = 80)

    • For i = 1...m

      • Generate a random order h_i on words

      • mh_i(u) = argmin {h_i(w) | w ∈ B_u}

  • Pr(mh_i(u) = mh_i(v)) = sim(u, v)
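
A small sketch of this step; instead of materializing m random orderings of the whole vocabulary, each ordering is simulated with a seeded hash of the word, a common implementation shortcut that is an assumption here rather than the slide's exact scheme:

```python
import hashlib

M = 80  # number of min-hash signatures per url, as on the slide

def h(word, seed):
    """Position of `word` in the seed-th pseudo-random ordering of the vocabulary."""
    return int(hashlib.md5(f"{seed}:{word}".encode()).hexdigest(), 16)

def min_hash(bag, m=M):
    """m-element min-hash signature of a bag (set of words)."""
    return [min(bag, key=lambda w: h(w, i)) for i in range(m)]

a = {"mouse", "dog"}
b = {"cat", "mouse"}
sig_a, sig_b = min_hash(a), min_hash(b)
# The fraction of agreeing positions estimates sim(a, b) = |a ∩ b| / |a ∪ b| = 1/3.
print(sum(x == y for x, y in zip(sig_a, sig_b)) / M)
```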


Step 1: Min-hash

Round 1:

ordering = [cat, dog, mouse, banana]

Set A:

{mouse, dog}

MH-signature = dog

Set B:

{cat, mouse}

MH-signature = cat


Step 1: Min-hash

Round 2:

ordering = [banana, mouse, cat, dog]

Set A:

{mouse, dog}

MH-signature = mouse

Set B:

{cat, mouse}

MH-signature = mouse


Step 2: LSH

  • Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)

    • For i = 1...l

      • Randomly select k min-hash indices and concatenate the corresponding min-hash values to form the i-th LSH signature


Step 2: LSH

  • Generate candidate pair if u and v have an LSH signature in common in any round

  • Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k
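
A sketch of how the l signatures might be formed from a url's min-hash values, continuing the previous sketch; the fixed seeding and names such as ROUND_INDICES are my own:

```python
import random

L, K, M = 125, 3, 80  # LSH rounds, min-hashes per signature, total min-hashes (slide values)

# The same k randomly chosen min-hash indices are used for every url in a given round.
rng = random.Random(0)
ROUND_INDICES = [rng.sample(range(M), K) for _ in range(L)]

def lsh_signatures(mh_sig):
    """Turn one url's m-element min-hash signature into l concatenated LSH signatures."""
    return [tuple(mh_sig[j] for j in idx) for idx in ROUND_INDICES]

def is_candidate(mh_u, mh_v):
    """u and v become a candidate pair if any round gives them identical LSH signatures."""
    return any(su == sv for su, sv in zip(lsh_signatures(mh_u), lsh_signatures(mh_v)))
```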


Step 2: LSH

Set A:

{mouse, dog, horse, ant}

MH1 = horse

MH2 = mouse

MH3 = ant

MH4 = dog

LSH134 = horse-ant-dog

LSH234 = mouse-ant-dog

Set B:

{cat, ice, shoe, mouse}

MH1 = cat

MH2 = mouse

MH3 = ice

MH4 = shoe

LSH134 = cat-ice-shoe

LSH234 = mouse-ice-shoe


Step 2: LSH

  • Bottom line - the probability that a pair collides on any one LSH signature is sim^k (here k = 3):

    • 10% similarity → 0.1%

    • 1% similarity → 0.0001%


Step 2: LSH

Round 1

[Figure: urls bucketed by shared LSH signature in round 1 - e.g., sports.com and golf.com under sport-team-win, music.com and opera.com under music-sound-play, sing.com under sing-music-ear]


Step 2: LSH

Round 2

[Figure: a different bucketing in round 2 - e.g., sports.com and golf.com under game-team-score, music.com and sing.com under audio-music-note, opera.com under theater-luciano-sing]


Sort & Filter

  • Using all buckets from all LSH rounds, generate candidate pairs

  • Sort candidate pairs on first field

  • Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of MH-signatures

  • Ready for “What’s Related?” queries...
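
A sketch of this stage, reusing ROUND_INDICES and M from the LSH sketch above; candidate pairs come from per-round buckets rather than a full join, then are sorted and filtered on the full min-hash signatures:

```python
from collections import defaultdict

def candidate_pairs(mh_sigs):
    """mh_sigs: dict url -> min-hash signature. Returns the set of candidate url pairs."""
    pairs = set()
    for idx in ROUND_INDICES:                        # one hash table per LSH round
        buckets = defaultdict(list)
        for url, sig in mh_sigs.items():
            buckets[tuple(sig[j] for j in idx)].append(url)
        for urls in buckets.values():                # urls sharing an LSH signature
            for i in range(len(urls)):
                for j in range(i + 1, len(urls)):
                    pairs.add(tuple(sorted((urls[i], urls[j]))))
    return pairs

def filter_pairs(pairs, mh_sigs, threshold=0.20):
    """Keep (u, v) only if their min-hash signatures agree on >= threshold of positions."""
    kept = []
    for u, v in sorted(pairs):                       # sorted on the first field
        agreement = sum(a == b for a, b in zip(mh_sigs[u], mh_sigs[v])) / M
        if agreement >= threshold:
            kept.append((u, v))
    return kept
```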


Overview

  • Choose document representation

  • Choose similarity metric

  • Compute pairwise document similarities

  • Generate clusters


Clustering

  • The set of document pairs represents the document-document similarity matrix with a 20% similarity threshold

  • Clustering algorithms

    • S-Link: connected components

    • C-Link: maximal cliques

    • Center: approximation to C-Link


Center

  • Scan through pairs (they are sorted on first component)

  • For each run [(u, v1), ... , (u, vn)]

    • if u is not marked

      • cluster = u + unmarked neighbors of u

    • mark u and all neighbors of u
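
A minimal sketch of the Center heuristic as I read the pseudocode above, assuming the filtered pairs arrive sorted on their first component; the grouping helper is illustrative:

```python
from itertools import groupby

def center_clusters(sorted_pairs):
    """sorted_pairs: iterable of (u, v) sorted on u. Returns a list of clusters."""
    marked = set()
    clusters = []
    for u, run in groupby(sorted_pairs, key=lambda p: p[0]):  # one run per potential center
        if u in marked:
            continue
        neighbors = [v for _, v in run if v not in marked]
        cluster = [u] + neighbors
        marked.update(cluster)   # the center and its neighbors can no longer start clusters
        clusters.append(cluster)
    return clusters

pairs = [("a", "b"), ("a", "c"), ("b", "d"), ("e", "f")]
print(center_clusters(pairs))  # [['a', 'b', 'c'], ['e', 'f']]
```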



Results

20 Million urls on Pentium-II 450


Sample Cluster

feynman.princeton.edu/~sondhi/205main.html

hep.physics.wisc.edu/wsmith/p202/p202syl.html

hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html

pdg.lbl.gov/mc_particle_id_contents.html

physics.ucsc.edu/courses/10.html

town.hall.org/places/SciTech/qmachine

www.as.ua.edu/physics/hetheory.html

www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html

www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html

www.phy.duke.edu/Courses/271/Synopsis.html

. . . (total of 27 urls) . . .


Ongoing/Future Work

  • Tune anchor-window length

  • Develop system to measure quality

    • What is ground truth?

    • How do you judge clustering of millions of pages?