Scalable Techniques for Clustering the Web

Taher H. Haveliwala

Aristides Gionis

Piotr Indyk

Stanford University

{taherh,gionis,indyk}@cs.stanford.edu

Project Goals
  • Generate fine-grained clustering of web based on topic
  • Similarity search (“What’s Related?”)
  • Two major issues:
    • Develop appropriate notion of similarity
    • Scale up to millions of documents
Prior Work
  • Offline: detecting replicas
    • [Broder-Glassman-Manasse-Zweig’97]
    • [Shivakumar-G. Molina’98]
  • Online: finding/grouping related pages
    • [Zamir-Etzioni’98]
    • [Manjara]
  • Link based methods
    • [Dean-Henzinger’99, Clever]
Prior Work: Online, Link
  • Online: cluster results of search queries
    • does not work for clustering entire web offline
  • Link based approaches are limited
    • What about relatively new pages?
    • What about less popular pages?
Prior Work: Copy detection
  • Designed to detect duplicates/near-replicas
  • Do not scale when notion of similarity is modified to ‘topical’ similarity
  • Creation of document-document similarity matrix is the core challenge: join bottleneck
Pairwise similarity
  • Consider relation Docs(id, sentence)
  • Must compute:

    SELECT D1.id, D2.id
    FROM Docs D1, Docs D2
    WHERE D1.sentence = D2.sentence
    GROUP BY D1.id, D2.id
    HAVING count(*) > τ

  • What if we change ‘sentence’ to ‘word’?
Pairwise similarity
  • Relation Docs(id, word)
  • Compute:

    SELECT D1.id, D2.id
    FROM Docs D1, Docs D2
    WHERE D1.word = D2.word
    GROUP BY D1.id, D2.id
    HAVING count(*) > τ

  • For 25M urls, could take months to compute!
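Why the word-level self-join blows up (an illustrative back-of-envelope figure, not from the slides): a word occurring in f documents contributes f² rows to the join before grouping, so a single common word appearing in just 1% of 25M urls (250,000 documents) already generates 250,000² ≈ 6.3 × 10¹⁰ intermediate pairs.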
Overview
  • Choose document representation
  • Choose similarity metric
  • Compute pairwise document similarities
  • Generate clusters
Document representation
  • Bag of words model
  • Bag for each page p consists of
    • Title of p
    • Anchor text of all pages pointing to p (Also include window of words around anchors)
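A minimal sketch (in Python) of how such a bag could be assembled, assuming the anchor windows for each target page have already been extracted into (anchor_text, window_text) pairs; the data layout and function below are illustrative, not the authors' actual pipeline.

    from collections import Counter

    def build_bag(title, anchor_windows):
        """Build the bag of words for a page p.

        title          : title text of p
        anchor_windows : list of (anchor_text, window_text) pairs, one per
                         in-link, where window_text is the text within a
                         fixed window of words around the anchor
        """
        bag = Counter(title.lower().split())
        for anchor_text, window_text in anchor_windows:
            bag.update(anchor_text.lower().split())
            bag.update(window_text.lower().split())
        return bag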
Bag Generation

[Diagram: pages on www.foobar.com and www.baz.com link to http://www.music.com/ with anchors such as “MusicWorld” and “Enter our site”; the anchor-window text (“...click here for a great music page...”, “...this music is great...”) contributes to the bag for www.music.com, while other text on the linking pages (“...what I had for lunch...”, “...click here for great sports page...”) does not.]

Bag Generation
  • Union of ‘anchor windows’ is a concise description of a page.
  • Note that using anchor windows, we can cluster more documents than we’ve crawled:
    • In general, a set of N documents refers to cN urls
Standard IR
  • Remove stopwords (~ 750)
  • Remove high frequency & low frequency terms
  • Use stemming
  • Apply TFIDF scaling
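A compact sketch of this preprocessing over the url-to-bag map; the stopword set, the document-frequency cutoffs, and the log form of the TFIDF weight are assumptions (the slides do not pin down the exact variants), and stemming is omitted here.

    import math
    from collections import Counter

    def tfidf_scale(bags, stopwords, min_df=2, max_df_ratio=0.5):
        """bags: dict mapping url -> Counter of term frequencies.
        Drops stopwords and very rare / very frequent terms, then applies
        tf * log(N / df) scaling to the surviving terms."""
        n = len(bags)
        df = Counter()                      # document frequency per term
        for bag in bags.values():
            df.update(bag.keys())
        return {
            url: {
                term: tf * math.log(n / df[term])
                for term, tf in bag.items()
                if term not in stopwords and min_df <= df[term] <= max_df_ratio * n
            }
            for url, bag in bags.items()
        }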
Overview
  • Choose document representation
  • Choose similarity metric
  • Compute pairwise document similarities
  • Generate clusters
Similarity
  • Similarity metric for pages U1, U2 that were assigned bags B1, B2, respectively:
    • sim(U1, U2) = |B1 ∩ B2| / |B1 ∪ B2|
  • Threshold is set to 20%
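This is the Jaccard coefficient on the word sets of the two bags; a direct (if non-scalable) sketch of the metric and the 20% cutoff:

    def jaccard(bag_u, bag_v):
        """Jaccard similarity of two bags, treated as word sets."""
        s_u, s_v = set(bag_u), set(bag_v)
        if not s_u or not s_v:
            return 0.0
        return len(s_u & s_v) / len(s_u | s_v)

    def related(bag_u, bag_v, threshold=0.20):
        return jaccard(bag_u, bag_v) >= threshold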
Reality Check

www.foodchannel.com:
  www.epicurious.com/a_home/a00_home/home.html  .37
  www.gourmetworld.com  .36
  www.foodwine.com  .325
  www.cuisinenet.com  .3125
  www.kitchenlink.com  .3125
  www.yumyum.com  .3
  www.menusonline.com  .3
  www.snap.com/directory/category/0,16,-324,00.html  .2875
  www.ichef.com  .2875
  www.home-canning.com  .275

Overview
  • Choose document representation
  • Choose similarity metric
  • Compute pairwise document similarities
  • Generate clusters
Pair Generation
  • Find all pairs of pages (U1, U2) satisfying sim(U1, U2) ≥ 20%
  • Ignore all url pairs with sim < 20%
  • How do we avoid the join bottleneck?
Locality Sensitive Hashing
  • Idea: use special kind of hashing
  • Locality Sensitive Hashing (LSH) provides a solution:
    • Min-wise hash functions [Broder’98]
    • LSH [Indyk, Motwani’98], [Cohen et al’2000]
  • Properties:
    • Similar urls are hashed together w.h.p
    • Dissimilar urls are not hashed together
Locality Sensitive Hashing

[Diagram: music.com, opera.com, and sing.com hash to one bucket; sports.com and golf.com hash to another.]

Hashing
  • Two steps
    • Min-hash (MH): a way to consistently sample words from bags
    • Locality sensitive hashing (LSH): similar pages get hashed to the same bucket while dissimilar ones do not
Step 1: Min-hash
  • Step 1: Generate m min-hash signatures for each url (m = 80)
    • For i = 1...m
      • Generate a random ordering h_i on words
      • mh_i(u) = argmin { h_i(w) | w ∈ B_u }
  • Pr(mh_i(u) = mh_i(v)) = sim(u, v)
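A sketch of this step with the stated m = 80; using a salted hash of each word as a stand-in for its rank under a random ordering h_i is an implementation shortcut assumed here, not the slides' exact construction.

    import hashlib

    M = 80  # number of min-hash signatures per url (from the slides)

    def min_hash_signatures(bag, m=M):
        """Return the m min-hash signatures of a (non-empty) bag of words.
        In round i the word minimizing a salted hash plays the role of
        argmin { h_i(w) | w in B_u }, so each signature is itself a word."""
        return [
            min(bag, key=lambda w: hashlib.md5(f"{i}:{w}".encode()).hexdigest())
            for i in range(m)
        ]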
Step 1: Min-hash

Round 1: ordering = [cat, dog, mouse, banana]
  Set A = {mouse, dog}  →  MH-signature = dog
  Set B = {cat, mouse}  →  MH-signature = cat

Step 1: Min-hash

Round 2: ordering = [banana, mouse, cat, dog]
  Set A = {mouse, dog}  →  MH-signature = mouse
  Set B = {cat, mouse}  →  MH-signature = mouse

Step 2: LSH
  • Step 2: Generate l LSH signatures for each url, using k of the min-hash values (l = 125, k = 3)
    • For i = 1...l
      • Randomly select k min-hash indices and concatenate them to form i’th LSH signature
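A sketch of this step with the stated l = 125 and k = 3; fixing the random seed so every url uses the same k index choices in each round, and joining the sampled min-hash words with '-', mirror the slide examples but are otherwise implementation assumptions.

    import random

    L_ROUNDS, K = 125, 3  # from the slides

    def lsh_signatures(mh_sigs, l=L_ROUNDS, k=K, seed=0):
        """Build l LSH signatures, each the concatenation of k randomly
        selected min-hash values. The same seed (hence the same index
        choices) must be used for every url so signatures are comparable."""
        rng = random.Random(seed)
        index_choices = [rng.sample(range(len(mh_sigs)), k) for _ in range(l)]
        return ["-".join(mh_sigs[j] for j in idx) for idx in index_choices]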
Step 2: LSH
  • Generate candidate pair if u and v have an LSH signature in common in any round
  • Pr(lsh(u) = lsh(v)) = Pr(mh(u) = mh(v))^k
Step 2: LSH

Set A = {mouse, dog, horse, ant}
  MH1 = horse, MH2 = mouse, MH3 = ant, MH4 = dog
  LSH134 = horse-ant-dog,  LSH234 = mouse-ant-dog

Set B = {cat, ice, shoe, mouse}
  MH1 = cat, MH2 = mouse, MH3 = ice, MH4 = shoe
  LSH134 = cat-ice-shoe,  LSH234 = mouse-ice-shoe

Step 2: LSH
  • Bottom line, probability of collision on a given LSH signature (sim^k, with k = 3):
    • 10% similarity → 0.1%
    • 1% similarity → 0.0001%
Step 2: LSH

[Diagram, Round 1: urls (sports.com, golf.com, party.com, music.com, opera.com, sing.com, ...) are bucketed by LSH signatures such as sport-team-win, music-sound-play, and sing-music-ear.]

Step 2: LSH

[Diagram, Round 2: the same urls (sports.com, golf.com, music.com, sing.com, opera.com, ...) are re-bucketed under a new set of LSH signatures such as game-team-score, audio-music-note, and theater-luciano-sing.]

Sort & Filter
  • Using all buckets from all LSH rounds, generate candidate pairs
  • Sort candidate pairs on first field
  • Filter candidate pairs: keep pair (u, v) only if u and v agree on at least 20% of their MH-signatures (sketched below)
  • Ready for “What’s Related?” queries...
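A sketch of the candidate-generation and filtering stage; the in-memory dictionaries below stand in for the disk-based sorts the slides imply at 20M+ url scale.

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(lsh_sigs_by_url):
        """Bucket urls by LSH signature; any two urls that share a
        signature in any round become a candidate pair."""
        buckets = defaultdict(set)
        for url, sigs in lsh_sigs_by_url.items():
            for sig in sigs:
                buckets[sig].add(url)
        pairs = set()
        for urls in buckets.values():
            pairs.update(combinations(sorted(urls), 2))
        return pairs

    def filter_pairs(pairs, mh_sigs_by_url, threshold=0.20):
        """Keep (u, v) only if the min-hash signatures of u and v agree
        on at least the threshold fraction of positions."""
        kept = []
        for u, v in sorted(pairs):          # sorted on the first field
            mu, mv = mh_sigs_by_url[u], mh_sigs_by_url[v]
            agreement = sum(a == b for a, b in zip(mu, mv)) / len(mu)
            if agreement >= threshold:
                kept.append((u, v))
        return kept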
Overview
  • Choose document representation
  • Choose similarity metric
  • Compute pairwise document similarities
  • Generate clusters
Clustering
  • The set of document pairs represents the document-document similarity matrix with 20% similarity threshold
  • Clustering algorithms
    • S-Link: connected components
    • C-Link: maximal cliques
    • Center: approximation to C-Link
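For the simplest of the three, S-Link, a minimal sketch: connected components of the thresholded similarity graph via union-find (the Center heuristic is sketched after the next slide).

    from collections import defaultdict

    def slink_clusters(pairs):
        """S-Link clustering: connected components of the graph whose
        edges are the filtered similarity pairs, via union-find."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for u, v in pairs:
            parent[find(u)] = find(v)           # union the two components

        components = defaultdict(list)
        for node in list(parent):
            components[find(node)].append(node)
        return list(components.values())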
Center
  • Scan through pairs (they are sorted on first component)
  • For each run [(u, v1), ... , (u, vn)]
    • if u is not marked
      • cluster = u + unmarked neighbors of u
    • mark u and all neighbors of u
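A sketch of the Center heuristic over the filtered pairs (already sorted on the first field); grouping each run into an adjacency list, and marking only when u actually starts a new cluster, are reading choices here rather than the authors' exact bookkeeping.

    def center_clusters(sorted_pairs):
        """Center clustering: scan urls in sorted order; an unmarked url u
        becomes a cluster center, absorbs its unmarked neighbors, and u and
        all of its neighbors are marked so they cannot start new clusters."""
        neighbors = {}
        for u, v in sorted_pairs:               # one "run" per distinct u
            neighbors.setdefault(u, []).append(v)

        marked = set()
        clusters = []
        for u, nbrs in neighbors.items():
            if u in marked:
                continue
            clusters.append([u] + [v for v in nbrs if v not in marked])
            marked.add(u)
            marked.update(nbrs)
        return clusters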
Results

20 Million urls on Pentium-II 450

Sample Cluster

feynman.princeton.edu/~sondhi/205main.html
hep.physics.wisc.edu/wsmith/p202/p202syl.html
hepweb.rl.ac.uk/ppUK/PhysFAQ/relativity.html
pdg.lbl.gov/mc_particle_id_contents.html
physics.ucsc.edu/courses/10.html
town.hall.org/places/SciTech/qmachine
www.as.ua.edu/physics/hetheory.html
www.landfield.com/faqs/by-newsgroup/sci/sci.physics.relativity.html
www.pa.msu.edu/courses/1999spring/PHY492/desc_PHY492.html
www.phy.duke.edu/Courses/271/Synopsis.html
. . . (total of 27 urls) . . .

Ongoing/Future Work
  • Tune anchor-window length
  • Develop system to measure quality
    • What is ground truth?
    • How do you judge clustering of millions of pages?