
Detecting Novel Associations in Large Data Sets



Presentation Transcript


  1. A pragmatic discussion of Detecting Novel Associations in Large Data Sets by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti. Sean Patrick Murphy, sayhitosean@gmail.com

  2. Getting Started
  • Blog overview - http://theoreticalecology.wordpress.com/2011/12/16/the-maximal-information-coefficient/
  • MINE code (Java-based with Python and R wrappers) - http://www.exploredata.net/Downloads/MINE-Application
  • MINE homepage - http://www.exploredata.net/
  • Science article and supplemental information - http://www.sciencemag.org/content/334/6062/1518.abstract
  • Andrew Gelman's discussion - http://andrewgelman.com/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/

  3. So who actually read the paper?

  4. Outline • Motivation • Explanation • Application

  5. Motivation The Problem • 10,000+ variables • Hundreds, thousands, or millions of observations • Your boss wants you to find all possible relationships between every pair of variables … • Where do you start?

  6. Motivation Scatter Plots?

  7. Motivation 50 variables → 1225 different scatter plots to examine! (50 choose 2 = 50 × 49 / 2 = 1225)

  8. Motivation Other Options? • Correlation Matrix • Factor Analysis/Principal Component Analysis • Audience recommendations?

  9. Motivation Possible Problems • A large number of possible relationships • Each has a different statistical test • Need to have a hypothesis about the relationship that might be present in the data

  10. Motivation Desired Properties • Generality – the coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions. • Equitability – the coefficient should give similar scores to equally noisy relationships, regardless of the form of the dependency between the variables.

  11. Explanation Enter the Maximal Information Coefficient (MIC)

  12. Explanation Algorithm Intuition

  13. Explanation We have a dataset D of paired observations (x, y). [Scatter plot of D.]

  14. Explanation

  15. Explanation Definition of mutual information (for discrete random variables): I(X; Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) )
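As a concrete reference point, here is a minimal Python sketch of that definition, computing MI (in bits) from a table of co-occurrence counts. The function name and interface are illustrative, not taken from the MINE code:

```python
import numpy as np

def mutual_information(counts):
    """MI (in bits) of the empirical joint distribution given by a 2-D count table."""
    p_xy = counts / counts.sum()            # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    mask = p_xy > 0                         # convention: 0 * log 0 = 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

# Perfectly dependent binary variables carry exactly 1 bit of mutual information:
print(mutual_information(np.array([[10.0, 0.0], [0.0, 10.0]])))  # -> 1.0
```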

  16. Explanation [Three candidate grid placements over the same data, scored MI = 0.5, MI = 0.6, and MI = 0.7; the placement achieving the maximum mutual information is kept.]

  17. Explanation Characteristic Matrix We have to normalize the maximal mutual information of each x-by-y grid by min{log x, log y} to enable comparison across grids of different sizes.

  18. Explanation [Example 2x3 grid placements with MI = 0.65, 0.56, and 0.71; the maximum over all 2x3 grids, normalized, becomes the (2,3) entry of the characteristic matrix.]
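To make the construction concrete, here is a rough Python sketch of one entry of the characteristic matrix, reusing mutual_information() from the sketch above. The real algorithm maximizes MI over all placements of an x-by-y grid; this sketch evaluates only a single equal-frequency placement, so it gives a lower bound on the true entry. All names are illustrative:

```python
import numpy as np

def grid_counts(x, y, nx, ny):
    """Bin the data with one heuristic grid placement: equal-frequency bins."""
    x_edges = np.quantile(x, np.linspace(0, 1, nx + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, ny + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    return counts

def characteristic_entry(x, y, nx, ny):
    """Normalized MI for an nx-by-ny grid (nx, ny >= 2): MI / min(log nx, log ny).

    The normalization keeps the entry in [0, 1], comparable across grid sizes.
    """
    return mutual_information(grid_counts(x, y, nx, ny)) / np.log2(min(nx, ny))
```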

  19. Explanation Characteristic Matrix

  20. Explanation Characteristic Matrix

  21. Explanation The highest value in the characteristic matrix is the Maximal Information Coefficient (MIC). Every entry of the characteristic matrix is between 0 and 1, inclusive. MIC(X,Y) = MIC(Y,X) – symmetric. MIC is invariant under order-preserving transformations of the axes. The surface shown is just a 3D representation of the characteristic matrix.

  22. Explanation How Big is the Characteristic Matrix? • Technically, infinite in size • This is unwieldy • So we bound the grid sizes considered: xy < B(n) = n^0.6, where n = number of data points • The exponent 0.6 is an empirically set value
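For intuition about how many grid shapes that bound actually admits, a small sketch (the exponent 0.6 is the paper's empirical choice; the function name is illustrative):

```python
import itertools

def grid_shapes(n, exponent=0.6):
    """All (x, y) grid dimensions with x, y >= 2 and x * y < B(n) = n**exponent."""
    b = n ** exponent
    limit = int(b // 2) + 1   # the other dimension is at least 2
    return [(x, y) for x, y in itertools.product(range(2, limit + 1), repeat=2)
            if x * y < b]

print(len(grid_shapes(1000)))   # number of grid shapes searched when n = 1000
```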

  23. Explanation How Do We Compute the Maximum Information for a Particular x-by-y Grid? • Heuristic-based dynamic programming • Pseudocode in the supplemental materials • Only an approximate solution, but it seems to work • The authors acknowledge a better algorithm should be found • At the moment this is mostly irrelevant, as the authors have released a Java implementation of the algorithm

  24. Application Useful Properties of the MIC Statistic With probability approaching 1 as sample size grows: • MIC assigns scores that tend to 1 for all never-constant noiseless functional relationships • MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships) • MIC assigns scores that tend to 0 for statistically independent variables

  25. Application MIC

  26. Application

  27. Application So what does the MIC mean? • Uncorrected p-value tables are available to download for various sample sizes • The null hypothesis is that the variables are statistically independent • http://www.exploredata.net/Downloads/P-Value-Tables
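If the precomputed tables do not cover your sample size, one standard alternative (my suggestion, not something from the paper) is a permutation test against that same independence null. A sketch, assuming some callable mic(x, y) such as the ones discussed here:

```python
import numpy as np

def mic_pvalue(x, y, mic, n_perm=1000, seed=0):
    """Permutation p-value: how often does shuffling y (which destroys any
    dependence on x) yield a MIC at least as large as the observed one?"""
    rng = np.random.default_rng(seed)
    observed = mic(x, y)
    exceed = sum(mic(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)   # add-one correction avoids p = 0
```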

  28. Application MINE = Maximal Information-based Nonparametric Exploration. Hopefully this part is self-explanatory by now. Nonparametric vs. parametric could be a session unto itself; here, we do not rely on the assumption that the data in question are drawn from a specific probability distribution (such as the normal distribution). The MINE statistics leverage the extra information captured by the characteristic matrix to offer more insight into the relationships between variables.

  29. Application Maximum Asymmetry Score (MAS ≤ MIC) – measures deviation from monotonicity. Minimum Cell Number (MCN) – measures the complexity of an association in terms of the number of cells required. Maximum Edge Value (MEV ≤ MIC) – measures closeness to being a function (vertical line test).
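For experimenting with these statistics from Python without the Java tool, the third-party minepy package exposes them; a sketch, assuming minepy is installed (pip install minepy) and that its API matches the version I have in mind:

```python
import numpy as np
from minepy import MINE   # third-party reimplementation, not the authors' Java tool

x = np.linspace(0, 1, 500)
y = np.sin(4 * np.pi * x)      # a noiseless, non-monotonic functional relationship

mine = MINE(alpha=0.6, c=15)   # alpha=0.6 mirrors the B(n) = n^0.6 bound
mine.compute_score(x, y)
print(mine.mic())              # close to 1 for a noiseless function
print(mine.mas())              # large here: the sine deviates from monotonicity
print(mine.mev())              # close to MIC: the data pass the vertical line test
print(mine.mcn(0))             # grid complexity needed to capture the association
```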

  30. Application [Example plots annotated with MAS (monotonicity), MEV (vertical line test), and MCN (complexity).]

  31. Application Usage (this takes too long as written … change it first)
  R: MINE("MLB2008.csv", "one.pair", var1.id=2, var2.id=12)
  Java: java -jar MINE.jar MLB2008.csv -onePair 2 12
  Seeks relationships between salary and home runs, 338 pairs
  http://www.exploredata.net/Usage-instructions

  32. Application Notes • Does not work on textual data (must be numeric) • Long execution times • Outputs MIC and the other MINE statistics mentioned above, not the characteristic matrix • Output is a .csv with one row per variable pair

  33. Application Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License You are free: • to copy, distribute, and transmit the work Under the following conditions: • Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). • Noncommercial — You may not use this work for commercial purposes. • No Derivative Works — You may not alter, transform, or build upon this work.

  34. Application Now What? A data triage pipeline: complex data set → MIC → ranked list of variable relationships to examine in more depth with the tool(s) of your choice.
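A minimal sketch of such a triage pipeline, assuming the minepy package from the earlier sketch and a pandas DataFrame of numeric columns (rank_pairs is an illustrative name):

```python
from itertools import combinations

import pandas as pd
from minepy import MINE

def rank_pairs(df):
    """Score every pair of columns by MIC and return the pairs sorted best-first."""
    mine = MINE(alpha=0.6, c=15)
    rows = []
    for a, b in combinations(df.columns, 2):
        mine.compute_score(df[a].to_numpy(), df[b].to_numpy())
        rows.append((a, b, mine.mic()))
    scores = pd.DataFrame(rows, columns=["var1", "var2", "mic"])
    return scores.sort_values("mic", ascending=False)
```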

  35. Lingering Questions • Can this be extended to higher-dimensional relationships? • Just how approximate is the current MIC algorithm? • Who wants to develop an open source implementation? • What other MINE statistics are waiting to be discovered? • Execution time – the algorithm is embarrassingly parallel and easily HADOOPified (see the sketch below) • Many of the tests reported in the paper introduced only vertical noise into the data • There is also some question as to its power vs. Pearson correlation and distance correlation (dcor): http://www-stat.stanford.edu/~tibs/reshef/comment.pdf
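Since each variable pair is scored independently, the pair loop parallelizes trivially. A sketch with Python's multiprocessing, again assuming minepy (a Hadoop job would distribute the same per-pair function across machines):

```python
from itertools import combinations
from multiprocessing import Pool

from minepy import MINE

def score_pair(pair):
    """Score one (x, y) pair; pairs share nothing, so workers need no coordination."""
    x, y = pair
    mine = MINE(alpha=0.6, c=15)
    mine.compute_score(x, y)
    return mine.mic()

def parallel_mic(columns, processes=8):
    # columns: list of 1-D numpy arrays; call under `if __name__ == "__main__":`
    pairs = list(combinations(columns, 2))
    with Pool(processes) as pool:
        return pool.map(score_pair, pairs)
```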

  36. Comment by N. Simon and R. Tibshirani - http://www-stat.stanford.edu/~tibs/reshef/script.R [Power vs. noise level curves comparing the methods.]

  37. Backup Slides
