Patterns amongst Competing Task Frequencies: Super-Linearities, and the Almond-DG model. Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos. PAKDD, 15-17 April 2013, Gold Coast, Australia.


Presentation Transcript


  1. Patterns amongst Competing Task Frequencies: Super-Linearities, and the Almond-DG model. Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos. PAKDD, 15-17 April 2013, Gold Coast, Australia

  2. Questions we answer (1) • Patterns: If Bob executes task x n_x times, how many times does he execute task y? • Modeling: Which 2-d distribution fits 2-d clouds of points? [Figure: 2-d cloud of points, one axis per task count] Danai Koutra (CMU)

  3. Questions we answer (2) • Patterns: If Bob executes task x n_x times, how many times does he execute task y? • Modeling: Which 2-d distribution fits 2-d clouds of points? [Figure: 2-d cloud of points, one axis per task count; labeled point: ‘Smith’ (100 calls, 700 sms)]

  4. Let’s peek... at our contributions • Patterns: power laws between competing tasks; log-logistic distributions for many tasks • Modeling: Almond-DG distribution for 2-d real datasets • Practical Use: spot outliers; what-if scenarios [Plot: ln(tweets) vs. ln(comments)]

  5. Let’s peek... at our contributions • Patterns: power laws between competing tasks; log-logistic distributions for many tasks • Modeling: Almond-DG distribution for 2-d real datasets • Practical Use: spot outliers; what-if scenarios

  6. Let’s peek... at our contributions • Patterns: power laws between competing tasks; log-logistic distributions for many tasks • Modeling: Almond-DG distribution for 2-d real datasets • Practical Use: spot outliers; what-if scenarios

  7. Roadmap • Data • Observed Patterns • Related Work • Proposed Distribution • Goodness of Fit • Conclusions

  8. Data 1: Tencent Weibo • micro-blogging website in China • 2.2 million users • Tasks extracted: tweets, retweets, comments, mentions, followees

  9. Data 2: Phonecall Dataset • phone-call records • 3.1 million users • Tasks extracted: calls, messages, voice friends, SMS friends, total minutes of phone calls

  10. Roadmap • Data • Observed Patterns (Super Linear Relative Frequency; Log-logistic Marginals) • Proposed Distribution • Goodness of Fit • Conclusions

  11. Pattern 1 - SuRF: Super Linear Relative Frequency (1) Intuition: 2x tweets, 16x retweets [Plot: ln(tweets) vs. ln(retweets); slope 0.23; labeled point: ‘Smith’ (1100 retweets, 7 tweets)]

  12. Pattern 1 - SuRF: Super Linear Relative Frequency (1) Intuition: 2x tweets, 16x retweets [Plot: ln(tweets) vs. ln(retweets); slope 0.23; labeled point: ‘Smith’ (1100 retweets, 7 tweets)] • Logarithmic Binning Fit [Akoglu’10]: 15 log buckets; E[Y|X=x] per bucket; linear regression on conditional means
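The three steps of the logarithmic binning fit listed on slide 12 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_binning_fit`, the use of the per-bucket mean of x as the bucket's representative value, and the decision to drop non-positive counts are all assumptions made here for the example.

```python
import numpy as np

def log_binning_fit(x, y, n_buckets=15):
    """Fit ln E[Y | X] ~ slope * ln(x) + intercept over log-spaced buckets of x.

    Hypothetical helper for illustration; needs at least two distinct
    positive x values so that at least two buckets are non-empty.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x > 0) & (y > 0)              # logarithms need positive counts
    x, y = x[keep], y[keep]

    # Bucket edges equally spaced in log(x).
    edges = np.logspace(np.log10(x.min()), np.log10(x.max()), n_buckets + 1)
    # Bucket index in 0..n_buckets-1 for every point.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_buckets - 1)

    log_x, log_mean_y = [], []
    for b in range(n_buckets):
        in_bucket = idx == b
        if in_bucket.any():               # skip empty buckets
            log_x.append(np.log(x[in_bucket].mean()))
            log_mean_y.append(np.log(y[in_bucket].mean()))  # conditional mean

    # Ordinary least squares on the conditional means, in log-log space.
    slope, intercept = np.polyfit(log_x, log_mean_y, 1)
    return slope, intercept
```

Called as `log_binning_fit(retweets, tweets)`, the returned slope would play the role of the number printed next to the fitted line on the SuRF plots.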

  13. Pattern 1 – SuRF (2) Intuition: 2x tweets, 4x comments [Plot: ln(tweets) vs. ln(comments); slope 0.304]

  14. Pattern 1 – SuRF (3) Intuition: 2x tweets, 4x mentions [Plot: ln(tweets) vs. ln(mentions); slope 0.33]

  15. Pattern 1 – SuRF (4) Intuition: 2x followees, 16x retweets [Plot: ln(followees) vs. ln(retweets); slope 0.25]

  16. Pattern 1 – SuRF (5) Intuition: super-linearity; more calls, even more minutes [Plot: ln(total_mins) vs. ln(calls_no); slope 1.18]

  17. Pattern 1 – SuRF (6a) Intuition: 2x friends, 3x phonecalls [Plot: ln(voice_friends) vs. ln(calls_no); slope 0.79]

  18. Pattern 1 – SuRF (6b) Telemarketers? [Plot: ln(voice_friends) vs. ln(calls_no); slope 0.79]

  19. Pattern 1 – SuRF (7) Intuition: 2x friends, 5x sms [Plot: ln(sms_friends) vs. ln(sms_no); slope 0.21]

  20. Contributions revisited (1) • Patterns: power laws between competing tasks; log-logistic distributions for many tasks • Modeling: Almond-DG distribution for 2-d real datasets • Practical Use: spot outliers; what-if scenarios [Plot: ln(tweets) vs. ln(comments)]

  21. Roadmap • Data • Observed Patterns (Super Linear Relative Frequency; Log-logistic Marginals) • Proposed Distribution • Goodness of Fit • Conclusions

  22. Pattern 2: log-logistic marginals (1) Marginal PDF: NOT power law [Plot: ln(frequency) vs. ln(retweets)]

  23. Pattern 2: log-logistic marginals (2) Marginal PDF: NOT power law [Plot: ln(frequency) vs. ln(comments)]

  24. Pattern 2: log-logistic marginals (3) Marginal PDF: power law [Plot: ln(frequency) vs. ln(mentions)]

  25. Pattern 2: log-logistic marginals (3) Marginal PDF: power law. How to capture both? [Plot: ln(frequency) vs. ln(mentions)]

  26. Contributions revisited (2) • Patterns: We observe power law relationships between competing tasks and log-logistic distributions for many tasks • Modeling: We propose the Almond-DG distribution for fitting 2-d real world datasets • Practical Use: spot outliers; what-if scenarios

  27. Roadmap • Data • Observed Patterns • Proposed Distribution (Problem Definition; Almond-DG; Background: copulas) • Goodness of Fit • Conclusions

  28. Problem definition Given: cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency [Figure: cloud of points and its marginals]

  29. Solutions in the Literature? • Multivariate Logistic [Malik & Abraham, 1973] • Multivariate Pareto Distribution [Mardia, 1962] • Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks

  30. Solutions in the Literature? • Multivariate Logistic [Malik & Abraham, 1973] • Multivariate Pareto Distribution [Mardia, 1962] • Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks. BUT none of them captures the marginals AND the dependency / correlation!

  31. Roadmap • Related Work • Data • Observed Patterns • Proposed Distribution (Problem Definition; Almond-DG; Background: copulas) • Goodness of Fit • Conclusions

  32. Problem definition Given: cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency [Figure: cloud of points and its marginals]

  33. STEP 1: How to model the marginal distributions? • A: Log-logistic! • Q: Why? • A: Because it mimics Pareto, captures the top concavity, and matches reality [Plot: marginal PDF, ln(frequency) vs. ln(retweets)]

  34. Reminder: Log-logistic (1) BACKGROUND • CDF: $F(x; \alpha, \beta) = \frac{1}{1 + (x/\alpha)^{-\beta}}$, with $x \ge 0$ and $\alpha, \beta > 0$ • Intuition: The longer you survive the disease, the even longer you survive. • memoryless: ✗ (not memoryless) • 2 parameters: scale (α) and shape (β) [Plot legend: α = 1, β = …]
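As a quick check of the CDF on slide 34, the sketch below evaluates it and its inverse with NumPy. The function names `loglogistic_cdf` and `loglogistic_ppf` are hypothetical, chosen for this example only; the inverse follows from solving F(x) = u for x.

```python
import numpy as np

def loglogistic_cdf(x, alpha, beta):
    """F(x; alpha, beta) = 1 / (1 + (x / alpha) ** (-beta)) for x > 0, else 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    pos = x > 0                       # the CDF is 0 at and below zero
    out[pos] = 1.0 / (1.0 + (x[pos] / alpha) ** (-beta))
    return out

def loglogistic_ppf(u, alpha, beta):
    """Inverse CDF: x = alpha * (u / (1 - u)) ** (1 / beta), for 0 < u < 1."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)
```

Evaluating `loglogistic_cdf(alpha, alpha, beta)` gives 0.5 for any shape β, which is why the scale parameter coincides with the median.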

  35. Reminder: Log-logistic (2a) BACKGROUND • In log-log scales, it looks like a hyperbola [Plot: PDF; β = shape parameter, α = scale parameter = median]

  36. Reminder: Log-logistic (2b) BACKGROUND • In log-log scales, it looks like a hyperbola. By truncating the top concavity, we get a power law. [Plot: PDF; β = shape parameter, α = scale parameter = median]

  37. Parameter Estimation: Log-logistic (3) BACKGROUND • linear log-odds plots, with odds = Prob(X ≤ x) / Prob(X > x) [Plot: -ln(odds) vs. ln(mentions), real data vs. theory; α = 2.07, β = 1.27]
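The linear log-odds plot suggests a simple estimator: for a log-logistic variable, ln(odds) = β ln(x) − β ln(α), so a straight-line fit recovers both parameters. The sketch below is one way to do that and is not taken from the paper; the function name `fit_loglogistic_log_odds`, the sign convention (it regresses +ln(odds), not −ln(odds)), and the (i − 0.5)/n plotting positions for the empirical CDF are assumptions.

```python
import numpy as np

def fit_loglogistic_log_odds(samples):
    """Estimate (alpha, beta) by regressing empirical log-odds on ln(x)."""
    x = np.sort(np.asarray(samples, dtype=float))
    x = x[x > 0]                                  # ln(x) needs positive values
    n = x.size
    # Empirical CDF at each sorted sample, kept strictly inside (0, 1).
    F = (np.arange(1, n + 1) - 0.5) / n
    log_odds = np.log(F / (1.0 - F))              # ln[P(X <= x) / P(X > x)]
    # Straight line: log_odds = beta * ln(x) - beta * ln(alpha).
    beta, intercept = np.polyfit(np.log(x), log_odds, 1)
    alpha = np.exp(-intercept / beta)
    return alpha, beta
```

On heavily tied integer counts the plotting positions are only approximate, so the estimates should be read as a starting point rather than a final fit.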

  38. Problem definition Given: cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency [Figure: cloud of points and its marginals, with ✔ marks]

  39. STEP 2a: How to model the dependency? • A: we borrow an idea from survival models, financial risk management, decision analysis • COPULAS!

  40. Copulas in a nutshell BACKGROUND • Model dependence between r.v.’s (e.g., X = count of one task, Y = count of another)

  41. Copulas in a nutshell BACKGROUND • Model dependence between r.v.’s (e.g., X = count of one task, Y = count of another) • Create a multivariate distribution s.t.: the marginals are preserved; the correlation (+, -, none) is captured [Figure: cloud of points with its marginals]

  42. STEP 2b: Which copula? • A: among the many copulas: Gaussian, Clayton, Frank, Joe, Independence, Gumbel, … [Side annotations: Archimedean family; explicit formula; 1 parameter]

  43. Applications of Gumbel’s copula BACKGROUND Modeling of: • the dependence between loss and lawyer’s fees in order to calculate reinsurance premiums • the rainfall frequency as a joint distribution of volume, peak, duration etc. • …

  44. Gumbel’s copula: Example 1 BACKGROUND • Uniform marginals • No dependence [Figure: example scatter plot]

  45. Gumbel’s copula: Example 2 BACKGROUND • Skewed marginals • No correlation [Figure: example scatter plot]

  46. Gumbel’s copula: Example 3 BACKGROUND • Skewed marginals • ρ = 0.7 [Figure: example scatter plot]

  47. Problem definition Given: cloud of points. Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency [Figure: cloud of points and its marginals, with ✔ marks]

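  48. Proposed Continuous Distribution: Almond [formula shown as an image], where $\theta = (1 - \rho)^{-1}$ captures the dependence and ρ = Spearman’s coefficient [Figure: example panels for ρ = 0, 0.4, 0.7 and ρ = 0, 0.2, 0.7; parameter settings αx = αy = 1, βx = βy = 1 and αx = 6.5, αy = 2.1, βx = 1.6, βy = 1.27]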
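The transcript does not carry the Almond formula itself, so the following is only a sketch of what the preceding slides suggest: log-logistic marginals (STEP 1) joined by Gumbel's copula (STEP 2b), with θ = (1 − ρ)^(-1). The standard Gumbel copula is C(u, v) = exp(−[(−ln u)^θ + (−ln v)^θ]^(1/θ)); treating that combination as the Almond CDF is an assumption of this example, and the names `loglogistic_cdf` and `almond_cdf` are hypothetical.

```python
import numpy as np

def loglogistic_cdf(x, alpha, beta):
    """Log-logistic CDF for x > 0 (scale alpha, shape beta)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + (x / alpha) ** (-beta))

def almond_cdf(x, y, alpha_x, beta_x, alpha_y, beta_y, rho):
    """Assumed joint CDF: Gumbel copula over two log-logistic marginals.

    Requires x > 0, y > 0 and 0 <= rho < 1 so that theta >= 1.
    """
    theta = 1.0 / (1.0 - rho)                 # theta = (1 - rho)^(-1), as on the slide
    u = loglogistic_cdf(x, alpha_x, beta_x)   # marginal of X
    v = loglogistic_cdf(y, alpha_y, beta_y)   # marginal of Y
    s = (-np.log(u)) ** theta + (-np.log(v)) ** theta
    return np.exp(-s ** (1.0 / theta))
```

With ρ = 0 the copula reduces to the product u·v (independence); larger ρ pulls the joint CDF toward min(u, v), which is the behaviour the example panels illustrate.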

  49. Proposed Discrete Distribution: Almond-DG 1. We discretize the values of Almond: (floor(X), floor(Y)) 2. and truncate them, i.e., keep the pairs with X ≥ 1 and Y ≥ 1. [Figure: Discrete #’s …]

  50. Contributions revisited (3) • Patterns: We observe power laws between competing tasks and log-logistic distributions for many tasks • Modeling: Almond-DG distribution for 2-d real datasets • Practical Use: spot outliers; what-if scenarios
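The two steps on slide 49 translate directly into array operations. The sketch below assumes the continuous Almond samples are already available as an (n, 2) array; the function name `discretize_and_truncate` is hypothetical, and the truncation is applied to the floored values.

```python
import numpy as np

def discretize_and_truncate(points):
    """Floor each (X, Y) pair, then keep only pairs with X >= 1 and Y >= 1."""
    pts = np.floor(np.asarray(points, dtype=float)).astype(int)   # step 1: discretize
    keep = (pts[:, 0] >= 1) & (pts[:, 1] >= 1)                    # step 2: truncate
    return pts[keep]
```

For example, the pairs (0.4, 3.2), (1.9, 2.0), (5.5, 0.99) and (7.0, 7.7) reduce to (1, 2) and (7, 7): the first and third are dropped because one coordinate floors to 0.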
