clustering applications at yahoo l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Clustering Applications at Yahoo! PowerPoint Presentation
Download Presentation
Clustering Applications at Yahoo!

Loading in 2 Seconds...

play fullscreen
1 / 32

Clustering Applications at Yahoo! - PowerPoint PPT Presentation


  • 355 Views
  • Uploaded on

Clustering Applications at Yahoo!. Deepayan Chakrabarti (deepay@yahoo-inc.com). Outline. Clustering, in itself, is often not the primary problem Exploratory analysis is rarely needed Methods are often tied to the “final” goal. Data Problem Method Issues. Data Problem Method Issues.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Clustering Applications at Yahoo!' - Gabriel


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
clustering applications at yahoo

Clustering Applications at Yahoo!

Deepayan Chakrabarti (deepay@yahoo-inc.com)

outline
Outline
  • Clustering, in itself, is often not the primary problem
    • Exploratory analysis is rarely needed
  • Methods are often tied to the “final” goal

Data

Problem

Method

Issues

Data

Problem

Method

Issues

Data

Problem

Method

Issues

outline3
Outline
  • Advertiser-keyword graph(+ social networks)
    • Graph Partitioning
  • CTR predictions for ads on webpages
    • Co-clustering
  • Query refinement and suggestions
    • Local search methods
  • Conclusions
outline4
Outline
  • Advertiser-keyword graph(+ social networks)
    • Graph Partitioning
  • CTR predictions for ads on webpages
    • Co-clustering
  • Query refinement and suggestions
    • Local search methods
  • Conclusions
graph partitioning applications
Graph Partitioning (Applications)
  • Find clusters of advertisers and keywords
    • Keyword suggestions
    • Running experiments on some “natural” clusters
  • Similar:
    • Y! Answers
    • Flickr

bids

Bidded Keyword

Advertiser

~10M nodes

graph partitioning applications6
Graph Partitioning (Applications)
  • Find clusters of IM users
    • Targeted advertising
    • Exploratory analysis
  • Clusters of the Web Graph
    • Distributed pagerank computation

Who-messages-whom IM graph

~100M nodes

graph partitioning methods
Graph Partitioning (Methods)
  • Basic “Global” spectral partitioning [Ng+/01]
    • Find 2nd eigenvector of the graph Laplacian
    • This embeds all nodes on the real line
    • Split the line in two, to get two clusters
      • Can approximate the optimal conductance cut
    • For more clusters:
      • Use k eigenvectors (for known k), or
      • Split in two, and recurse on each cluster
graph partitioning methods8
Graph Partitioning (Methods)
  • However, this has problems [Lang/05, Leskovec+/08]:
    • Min. conductance or quotient cuts lead to “small chunks”

Better balance  worse cuts

Large-sized low-quotient cuts are actually just unions of whiskers

“Whiskers”

graph partitioning methods9
Graph Partitioning (Methods)
  • However, this has problems:
    • Min. conductance or quotient cuts lead to “small chunks”
      • Balance is not very strongly encouraged
    • Recursive partitioning takes too long
      • Each eigenvector computation yields only a small cluster being broken off
  • Two alternatives:
    • More balanced cuts  recursion is faster
    • Unbalanced cuts, but much faster computation
graph partitioning methods10

Perfect balance on real line: NP-Hard

Perfect balance on hypersphere: SDP formulation [Lang/05]

Graph Partitioning (Methods)
  • Achieving better balance
    • Combine algorithms (e.g., [Anderson+/08])
      • METIS (more balanced cuts), followed by
      • Flow-based improvement (conductance)
    • Stronger balance constraints

Many nodes

One node

Spectral embedding

graph partitioning methods11
Graph Partitioning (Methods)
  • Faster Computation via “local” graph partitioning [Spielman+/04, Anderson+/06]
    • Pick seeds randomly
    • Build local clusters around seed
    • Bite off cluster, and repeat
    • Time for each iteration is proportional to the size of the local cluster (scalability)
    • Better for large graphs, or when not all clusters are needed
graph partitioning issues
Graph Partitioning (Issues)
  • Many complex networks are very different from planar/mesh-like networks
    • Good small cuts
    • Good large cuts are hard to find, and may not even exist
      • Hard to have a good hierarchical partitioning
  • Is conductance really the best objective?
    • METIS+flow finds lower conductance cuts, but
    • Local spectral methods finds more “tightly-knit” cuts [Leskovec+/08]
    • Is there a good compromise?
graph partitioning issues13
Graph Partitioning (Issues)
  • Speed and Scalability
    • Most implementations have trouble with large graphs, or graphs with some extremely high-degree nodes
    • Map-reduce style partitioning algorithms?
outline14
Outline
  • Advertiser-keyword graph(+ social networks)
    • Graph Partitioning
  • CTR predictions for ads on webpages
    • Co-clustering
  • Query refinement and suggestions
    • Local search methods
  • Conclusions
co clustering applications
Co-clustering (Applications)

Ads

  • The Content Match problem
    • Predict click-thru rate (CTR)
    • Extreme sparsity
      • Few views
      • Even fewer clicks

Webpages

clicks

= CTR

views

co clustering methods
Co-clustering (Methods)

Ad clusters

Cell CTR

= row effect + column effect + block effect

  • Approximate cell CTR using block CTR [Dhillon+/03]
  • Minimize divergence between the original matrix and its reconstruction
  • Combats sparsity

Webpage clusters

co clustering issues
Co-clustering (Issues)
  • Picking the right number of clusters
    • Use MDL [Chakrabarti+/2004]
  • Handling new ads/pages
    • Combine co-clustering with feature-based prediction models [Agarwal+/2007]
  • Handling extra information
    • E.g., if each page and ad can be categorized into a taxonomy [Chakrabarti+/2007]
    • Can the taxonomy be automatically modified?
co clustering issues18
Co-clustering (Issues)
  • Iterative process (hard for map-reduce)
    • Factor-3 approximation by just clustering webpages and ads separately [Dasgupta+/2008]
outline19
Outline
  • Advertiser-keyword graph(+ social networks)
    • Graph Partitioning
  • CTR predictions for ads on webpages
    • Co-clustering
  • Query refinement and suggestions
    • Local search methods
  • Conclusions
query refinement
Query Refinement
  • User inputs ambiguous query (“madonna”)
  • Search engine asks:“Did you mean: songs, videos, pictures?”

Refinement = cluster of terms

lyrics “song title” album …

query refinement21
Query Refinement
  • Suppose we could relate queries with keywords

Related Keywords

Queries

Madonna

Beatles

songs

lyrics

albums

pictures

photos

Honda

Ford

query refinement problem
Query Refinement (Problem)
  • Different from plain bipartite graph partitioning
    • Don’t confuse users!
      • only 3 or fewer clusters for each query
      • only a few easily-distinguishable clusters overall
  • Clustering quality now also depends on
    • the query workload
    • the algo that picks the “top-3” clusters for any query
query refinement method
Query Refinement (Method)
  • Can optimally pick top-k clusters for any query [Wang+/2009]
    • for a wide range of matching functions
    • only if clusters are disjoint
  • Iteratively improve clustering via local search
    • Move a keyword to a new cluster
    • Update top-k clusters for all queries in workload
    • Repeat
query refinement issues
Query Refinement (Issues)
  • Iterative  slow
    • Each iteration has to go over the entire historical query logs
    • Optimality guarantees?
    • Modeling issues:
      • How do we present a cluster to the user?
      • Cluster naming?
      • How does a user interact with a cluster?
outline25
Outline
  • Advertiser-keyword graph(+ social networks)
    • Graph Partitioning
  • CTR predictions for ads on webpages
    • Co-clustering
  • Query refinement and suggestions
    • Local search methods
  • Conclusions
conclusions
Conclusions
  • Standalone clustering applications are rare!
    • Constraints on clustering
      • What use will it serve (as in query refinement)?
      • What is the desired “natural” cluster size?
    • Clustering for predictions (e.g., CTR)
      • Missing values
      • Extreme sparsity
    • Combining clustering with explore-exploit
conclusions27
Conclusions
  • Even for standalone clustering:
    • Scalability concerns
      • 10s to 100s of millions of nodes
      • Skewed degree distributions
      • Algorithms amenable to map-reduce
    • The right “balance” relaxation
      • Tradeoffs between cluster “compactness”, conductance, and partition sizes
      • Is it even reasonable to require balance?
references
References
  • On Spectral Clustering: Analysis and an Algorithm, by Ng, Jordan, & Weiss, in NIPS 2002
  • Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, by Spielman & Teng, STOC 2004
  • Fixing two weaknesses of the Spectral Method, by Lang, NIPS 2005
  • Local Graph Partitioning using PageRank Vectors, by Anderson, Chung, & Lang, SODA 2008
  • An algorithm for improving graph partitions, by Anderson & Lang, SODA 2008
  • Finding dense and isolated submarkets in a sponsored search spending graph, by Lang & Anderson, CIKM 2007
  • Clustering of bipartite advertiser-keyword graph, by Carrasco, Fain, Lang, & Zhukov, IEEE Computer Society, 2003
  • Statistical Properties of Community Structure in Large Social and Information Networks, by Leskovec, Lang, Dasgupta, & Mahoney, WWW 2008
  • Approximate Algorithms for Co-Clustering, by Anagnostopoulos, Dasgupta, & Kumar, PODS 2008
  • Information-theoretic co-clustering, by Dhillon, Mallela, & Modha, in KDD 2003
  • Fully Automatic Cross-Associations, by Chakrabarti, Papadimitriou, & Faloutsos, in KDD 2004
  • Predictive discrete latent factor models for large scale dyadic data, by Merugu, & Agarwal, in KDD 2007
  • Estimating Rates of Rare Events at Multiple Resolutions, by Agarwal, Broder, Chakrabarti, Diklic, Josifovski, & Sayyadian, in KDD 2007
  • Mining Broad Latent Query Aspects from Search Sessions, by Wang, Chakrabarti, & Punera, in KDD 2009
graph partitioning applications29
Graph Partitioning (Applications)
  • Find clusters of users posting in similar groups
    • Enhance group activity
    • Are their special groups? Can we build special tools for them?
  • Similar:
    • Y! Answers
    • Flickr

Yahoo! Groups

Users

~100M users, ~10M groups

graph partitioning issues30
Graph Partitioning (Issues)
  • Many complex networks are very different from planar/mesh-like networks
    • Good small cuts
    • Good large cuts are hard to find, and may not even exist
      • Hard to have a good hierarchical partitioning
  • Good large cuts
    • If they exist, most algorithms find them
    • If not: METIS+flow finds lower conductance cuts, but local spectral methods finds more “coherent” cuts
    • Is there a good compromise?
query refinement31
Query Refinement
  • Given an ambiguous query, present user with up to 5 refinements
    • madonna  songs, videos, pictures
  • Each refinement represents a cluster of terms
    • songs = (lyrics, “song title”, album, …)
query refinement32
Query Refinement
  • How can we relate “madonna” and “songs”?

Search“madonna songs”

Search “madonna”

click

no click

Beatles

Honda

Ford

lyrics

albums

pictures

photos

songs

Madonna

Queries

Reformulations