1 / 74

Large Graph Mining: Power Tools and a Practitioner’s guide

Large Graph Mining: Power Tools and a Practitioner’s guide. Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU. Outline. Introduction – Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs

mili
Download Presentation

Large Graph Mining: Power Tools and a Practitioner’s guide

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Large Graph Mining:Power Tools and a Practitioner’s guide Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU Faloutsos, Miller, Tsourakakis

  2. Outline Introduction – Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs Task 5: Mining graphs over time Task 6: Virus/influence propagation Task 7: Spectral graph theory Task 8: Tera/peta graph mining: hadoop Observations – patterns of real graphs Conclusions Faloutsos, Miller, Tsourakakis

  3. Acknowledgement: Most of the foils in ‘Task 3’ are by Hanghang TONG www.cs.cmu.edu/~htong Faloutsos, Miller, Tsourakakis

  4. Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis

  5. Motivation: Link Prediction ? A B i i i i Should we introduce Mr. A to Mr. B? Faloutsos, Miller, Tsourakakis

  6. Motivation - recommendations Products / movies customers ‘smith’ ?? Terminator 2 Faloutsos, Miller, Tsourakakis

  7. Answer: proximity • ‘yes’, if ‘A’ and ‘B’ are ‘close’ • ‘yes’, if ‘smith’ and ‘terminator 2’ are ‘close’ QUESTIONS in this part: • How to measure ‘closeness’/proximity? • How to do it quickly? • What else can we do, given proximity scores? Faloutsos, Miller, Tsourakakis

  8. How close is ‘A’ to ‘B’? a.k.a Relevance, Closeness, ‘Similarity’… Faloutsos, Miller, Tsourakakis

  9. Why is it useful? Recommendation And many more Image captioning [Pan+] Conn. / CenterPiece subgraphs [Faloutsos+], [Tong+], [Koren+] and Link prediction [Liben-Nowell+], [Tong+] Ranking [Haveliwala], [Chakrabarti+] Email Management [Minkov+] Neighborhood Formulation [Sun+] Pattern matching [Tong+] Collaborative Filtering [Fouss+] … Faloutsos, Miller, Tsourakakis

  10. Region Automatic Image Captioning Image Test Image Keyword Sea Sun Sky Wave Cat Forest Tiger Grass Q: How to assign keywords to the test image? A: Proximity! [Pan+ 2004] Faloutsos, Miller, Tsourakakis

  11. Center-Piece Subgraph(CePS) Input Output CePS guy CePS Original Graph Q: How to find hub for the black nodes? A: Proximity! [Tong+ KDD 2006] Faloutsos, Miller, Tsourakakis

  12. Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis

  13. How close is ‘A’ to ‘B’? • Should be close, if they have • many, • short • ‘heavy’ paths Faloutsos, Miller, Tsourakakis

  14. Why not shortest path? Some ``bad’’ proximities A: ‘pizza delivery guy’ problem Faloutsos, Miller, Tsourakakis

  15. Why not max. netflow? Some ``bad’’ proximities A: No penalty for long paths Faloutsos, Miller, Tsourakakis

  16. What is a ``good’’ Proximity? … • Multiple Connections • Quality of connection • Direct & In-directed Conns • Length, Degree, Weight… Faloutsos, Miller, Tsourakakis

  17. Random walk with restart 10 9 12 2 8 1 11 3 4 6 5 7 [Haveliwala’02] Faloutsos, Miller, Tsourakakis

  18. Random walk with restart 0.03 0.04 10 9 0.10 12 2 0.08 0.02 0.13 8 1 0.13 11 3 0.04 4 0.05 6 5 0.13 7 0.05 Nearby nodes, higher scores Ranking vector More red, more relevant Faloutsos, Miller, Tsourakakis

  19. Why RWR is a good score? j i : adjacency matrix. c: damping factor all paths from i to j with length 1 all paths from i to j with length 2 all paths from i to j with length 3 Faloutsos, Miller, Tsourakakis

  20. Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • variants • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis

  21. Variant: escape probability Define Random Walk (RW) on the graph Esc_Prob(CMUParis) Prob (starting at CMU, reaches Paris before returning toCMU) the remaining graph CMU Paris Faloutsos, Miller, Tsourakakis Esc_Prob = Pr (smile before cry)

  22. Other Variants Other measure by RWs Community Time/Hitting Time [Fouss+] SimRank [Jeh+] Equivalence of Random Walks Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chobotarev+] details Faloutsos, Miller, Tsourakakis

  23. Other Variants Other measure by RWs Community Time/Hitting Time [Fouss+] SimRank [Jeh+] Equivalence of Random Walks Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chobotarev+] details All are “related to” or “similar to” random walk with restart! Faloutsos, Miller, Tsourakakis

  24. Map of proximity measurements details Regularized Un-constrained Quad Opt. Norma lize RWR Katz 4 ssp decides 1 esc_prob Esc_Prob + Sink Hitting Time/ Commute Time relax X out-degree Harmonic Func. Constrained Quad Opt. Effective Conductance “voltage = position” String System Faloutsos, Miller, Tsourakakis Physical Models Mathematic Tools

  25. Notice: Asymmetry (even in undirected graphs) C B C-> A : high A-> C: low A D E Faloutsos, Miller, Tsourakakis

  26. Summary of Proximity Definitions Goal: Summarize multiple relationships Solutions Basic: Random Walk with Restarts [Haweliwala’02] [Pan+ 2004][Sun+ 2006][Tong+ 2006] Properties: Asymmetry [Koren+ 2006][Tong+ 2007] [Tong+ 2008] Variants: Esc_Prob and many others. [Faloutsos+ 2004] [Koren+ 2006][Tong+ 2007] Faloutsos, Miller, Tsourakakis

  27. Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis

  28. Preliminary: Sherman–Morrison Lemma details If: x = + = +3 + Faloutsos, Miller, Tsourakakis

  29. Preliminary: Sherman–Morrison Lemma details If: x = + Then: Faloutsos, Miller, Tsourakakis

  30. Sherman – Morrison Lemma – intuition: • Given a small perturbation on a matrix A -> A’ • We can quickly update its inverse Faloutsos, Miller, Tsourakakis

  31. SM: The block-form details A B C D Or… And many other variants… Also known as Woodbury Identity Faloutsos, Miller, Tsourakakis

  32. SM Lemma: Applications details • RLS (Recursive least squares) • and almost any algorithm in time series! • Kalman filtering • Incremental matrix decomposition • … • … and all the fast solutions we will introduce! Faloutsos, Miller, Tsourakakis

  33. Reminder: PageRank • With probability 1-c, fly-out to a random node • Then, we have p = c Bp + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1 Faloutsos, Miller, Tsourakakis

  34. p = c Bp + (1-c)/n 1 The only difference Restart p Starting vector Adjacency matrix Ranking vector Faloutsos, Miller, Tsourakakis

  35. Computing RWR 10 9 12 2 8 1 11 3 4 6 5 7 p = c Bp + (1-c)/n 1 Restart p Starting vector Adjacency matrix Ranking vector 1 n x 1 n x n n x 1 Faloutsos, Miller, Tsourakakis

  36. Q: Given query i, how to solve it? Query ? ? Starting vector Ranking vector Ranking vector Adjacency matrix Faloutsos, Miller, Tsourakakis

  37. OntheFly: 10 9 12 2 8 1 11 3 0.04 0.03 10 9 0.10 12 4 0.13 0.08 2 0.02 8 1 11 0.13 3 6 0.04 5 4 0.05 6 5 0.13 7 7 0.05 No pre-computation/ light storage Slow on-line response O(mE) Faloutsos, Miller, Tsourakakis

  38. PreCompute 0.04 0.03 10 9 0.10 12 0.13 0.08 2 0.02 8 1 11 0.13 3 0.04 4 0.05 6 5 0.13 7 0.05 10 9 12 2 8 1 11 R: 3 4 6 5 7 c x Q Q Faloutsos, Miller, Tsourakakis

  39. PreCompute: 10 9 12 2 8 1 11 3 0.04 0.03 10 9 0.10 12 4 0.13 0.08 2 0.02 8 1 11 0.13 3 6 0.04 5 4 0.05 6 5 0.13 7 7 0.05 Fast on-line response Heavy pre-computation/storage cost O(n ) 3 Faloutsos, Miller, Tsourakakis O(n ) 2

  40. Q: How to Balance? On-line Off-line Faloutsos, Miller, Tsourakakis

  41. How to balance? Idea (‘B-Lin’) • Break into communities • Pre-compute all, within a community • Adjust (with S.M.) for ‘bridge edges’ H. Tong, C. Faloutsos, & J.Y. Pan. Fast Random Walk with Restart and Its Applications. ICDM, 613-622, 2006. Faloutsos, Miller, Tsourakakis

  42. B_Lin: Basic Idea[Tong+ ICDM 2006] 10 10 9 9 12 12 2 2 8 8 1 1 11 11 3 3 4 4 10 10 9 9 6 6 5 5 12 12 2 2 8 8 1 1 11 11 7 7 3 3 4 4 6 6 5 5 7 7 0.04 10 10 0.03 9 9 10 9 12 12 0.10 2 2 12 0.13 0.08 2 8 8 1 1 0.02 11 11 8 3 3 1 11 0.13 3 0.04 4 4 4 6 6 5 5 0.05 6 5 0.13 7 7 7 0.05 Find Community Combine Fix the remaining Faloutsos, Miller, Tsourakakis

  43. B_Lin: details details ~ + ~ ~ ~ W 1: within community W Cross-community Faloutsos, Miller, Tsourakakis

  44. B_Lin: details details ~ ~ W W1 -1 Easy to be inverted LRA difference -1 ~ ~ I – c I – c –cUSV SM Lemma! Faloutsos, Miller, Tsourakakis

  45. B_Lin: summary • Pre-Computational Stage • Q: • A: A few small, instead of ONE BIG, matrices inversions • On-Line Stage • Q: Efficiently recover one column of Q • A: A few, instead of MANY, matrix-vector multiplications Efficiently compute and store Q Faloutsos, Miller, Tsourakakis

  46. Query Time vs. Pre-Compute Time Log Query Time • Quality: 90%+ • On-line: • Up to 150x speedup • Pre-computation: • Two orders saving Log Pre-compute Time Faloutsos, Miller, Tsourakakis

  47. Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis

  48. gCaP: Automatic Image Caption Q { } Cat Forest Grass Tiger {?, ?, ?,} … { } Sea Sun Sky Wave A: Proximity! [Pan+ KDD2004] Faloutsos, Miller, Tsourakakis

  49. Sea Sun Sky Wave Cat Forest Tiger Grass Region Image Test Image Keyword Faloutsos, Miller, Tsourakakis

  50. Region Image Test Image {Grass, Forest, Cat, Tiger} Sea Sun Sky Wave Cat Forest Tiger Grass Keyword Faloutsos, Miller, Tsourakakis

More Related