Designing M-estimators for expression analysis: PLIER

1. 10/11/04 Affymetrix Designing M-estimators for expression analysis: PLIER. Earl Hubbell, Principal Statistician, Affymetrix

2. 10/11/04 Affymetrix Outline: Drive our intuition (basic data); Formalize the intuition; Check functionality; Look at results; Bonus tricks and stunts

3. 10/11/04 Affymetrix Wafers, Chips, and Features We talk a lot about "wafers", "chips", and "features". This is a graphical representation. We've mirrored the technology in the semiconductor industry, photolithography, with life sciences. We do manufacture on wafers. These wafers are 5-inch-by-5-inch pieces of glass. In our whole-genome products we get 49 individual chips out of each wafer, and on each one of those chips there are over 1,300,000 unique features, and each one of those features has millions of identical DNA probes on them.

4. 10/11/04 Affymetrix Expression Probes

5. 10/11/04 Affymetrix Components of Stray Signal

6. 10/11/04 Affymetrix Components of Bound Target Signal and Noise

7. 10/11/04 Affymetrix Hybridization is mostly linear, with some stray signal & saturation

8. 10/11/04 Affymetrix One probe (pair): PM-MM* reduces bias

9. 10/11/04 Affymetrix Probes not very informative about concentration near background!

10. 10/11/04 Affymetrix Probes have systematic differences

11. 10/11/04 Affymetrix "Affinity" compensates for first-order probe differences

12. 10/11/04 Affymetrix "Likelihood" summarizes knowledge of expression

13. 10/11/04 Affymetrix A pause before jumping into equations: "... statistics, whatever their mathematical sophistication and elegance, cannot make bad variables into good ones." H.T. Reynolds, "Analysis of Nominal Data"

14. 10/11/04 Affymetrix Fun with Statistics. Money: What should I estimate? M-estimators: Statistics by Optimization. Model: Linking Intensity to Concentration. Mismatches: Faking Subtraction. Mayhem: Does it work? More: Tricks!

15. 10/11/04 Affymetrix Estimator Goals: Handle zero/near-zero concentrations; Handle "arithmetic" noise at low end; Minimum bias (avoid sample trouble) [can always variance stabilize later]; Resist outliers; Avoid lots of parameters!

16. 10/11/04 Affymetrix How to estimate? (-5+373+473)/3 = 280.3 ("Mean"); 280.3 is the value minimizing (x+5)^2+(x-373)^2+(x-473)^2; median(-5,373,473) = 373; 373 is the value minimizing |x+5|+|x-373|+|x-473|
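
The two minimization claims on this slide can be checked numerically. The following is a minimal sketch using a brute-force grid search; the grid range and the 0.1 step are arbitrary choices made for illustration, not part of the talk.

```python
from statistics import mean, median

data = [-5, 373, 473]

def sum_of_squares(x, values):
    """Squared-error cost; its minimizer is the mean."""
    return sum((x - v) ** 2 for v in values)

def sum_of_abs(x, values):
    """Absolute-error cost; its minimizer is the median."""
    return sum(abs(x - v) for v in values)

# Brute-force search over candidates -10.0, -9.9, ..., 500.0.
grid = [i / 10 for i in range(-100, 5001)]
best_l2 = min(grid, key=lambda x: sum_of_squares(x, data))
best_l1 = min(grid, key=lambda x: sum_of_abs(x, data))

print(mean(data), best_l2)    # mean vs. grid minimizer of the squared cost
print(median(data), best_l1)  # median vs. grid minimizer of the absolute cost
```

The single low value (-5) drags the squared-error minimizer well below the other two observations, while the absolute-error minimizer stays at 373.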

17. 10/11/04 Affymetrix M-estimator: Optimizes some function of the data, sum(f(y,xi)), for y; y is then an estimate of some interesting property of the data (we hope); Looks like "Maximum Likelihood" estimates (but can tune for utility)

18. 10/11/04 Affymetrix Designing the M-estimator PLIER: M-estimator minimizes some function of the data and the estimator(s); Our case: sum( f(PM,MM, a,c,z) ); Choose f to model "reasonable" error; Choose tail of f to handle outliers; PLIER: "Probe Logarithmic Intensity ERror"

19. 10/11/04 Affymetrix Assumptions (approximations?): Concentration never negative! c>=0; Linear link between true signal & concentration: T ~ a*c; Background (not constant) adds to signal: I ~ T+B; Background same for PM and MM; Multiplicative intensity error: log(I) ~ normal(log(T+B),s^2)

20. 10/11/04 Affymetrix Assumption: Multiplicative Error. Widely agreed that replicate observations of probes (PM,MM) are approximately log-normal, i.e. PM varies by 10% of PM; Does not imply that derived quantities (PM-MM or PM-B) are also log-normal! I.e. PM-MM varies by ~7% of (PM+MM), not by 10% of (PM-MM)!
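
The "~7% of (PM+MM)" figure can be reproduced with first-order error propagation. This sketch assumes the PM and MM errors are independent, each with a standard deviation of 10% of its own intensity; independence and the example intensities are assumptions made here, not statements from the slides.

```python
import math

def diff_noise(pm, mm, rel_sd=0.10):
    """SD of PM-MM when PM and MM vary independently by rel_sd of their own value."""
    # Variances of independent errors add: sd^2 = (rel_sd*PM)^2 + (rel_sd*MM)^2.
    return rel_sd * math.hypot(pm, mm)

# Hypothetical probe pair with PM only slightly above MM.
pm, mm = 1100.0, 1000.0
sd = diff_noise(pm, mm)
frac_of_sum = sd / (pm + mm)    # noise relative to PM+MM
frac_of_diff = sd / (pm - mm)   # noise relative to PM-MM

print(round(frac_of_sum, 4), round(frac_of_diff, 4))
```

For PM close to MM the noise is about 0.1/sqrt(2) of the sum, while relative to the small difference it exceeds 100%, which is why PM-MM cannot be treated as log-normal.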

21. 10/11/04 Affymetrix No obvious need for arithmetic noise for raw intensities

22. 10/11/04 Affymetrix Simplified model [PM-MM]: PM = a*c+MM; MM = a2*c+B; If B can vary wildly (experiment-experiment, probe-probe), left with PM-MM = a*c; Incorporating multiplicative error: e1*PM-e2*MM = a*c

23. 10/11/04 Affymetrix Key concept: good fits have small multiplicative errors. Trying to minimize log(e1)^2+log(e2)^2; The actual minimum is a complicated function, so we (I) don't want to solve for it; And we don't have to: M-estimators can be chosen for computational convenience; Therefore, let log(e1)^2=log(e2)^2

24. 10/11/04 Affymetrix How good is the fit?: 2 possible. log(e1)=log(e2) ("log transform"): no solution for MM>PM, always worse fit than log(e1)=-log(e2) ("PLIER"), which gives e*PM-MM/e = a*c => e = [a*c+sqrt((a*c)^2+4*PM*MM)]/(2*PM); log(e) exists for any PM,MM>0, any a,c; effective error model changes from "arithmetic" near zero to "multiplicative" far from zero
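
Setting e1 = e and e2 = 1/e in e1*PM-e2*MM = a*c gives the quadratic PM*e^2 - (a*c)*e - MM = 0, whose positive root is the formula on this slide. The sketch below evaluates that root and checks it against the model equation; the function name `plier_error` and the example intensities are hypothetical.

```python
import math

def plier_error(pm, mm, a, c):
    """Positive root e of PM*e^2 - (a*c)*e - MM = 0, i.e. e*PM - MM/e = a*c."""
    y = a * c
    return (y + math.sqrt(y * y + 4.0 * pm * mm)) / (2.0 * pm)

# Hypothetical probe pair with MM > PM, where PM-MM is negative.
pm, mm, a, c = 80.0, 120.0, 1.5, 20.0
e = plier_error(pm, mm, a, c)

print(e, math.log(e))           # e is positive, so log(e) is defined
print(e * pm - mm / e, a * c)   # left and right sides of the model equation
```

Because the square root is always larger than zero when PM and MM are positive, e stays positive even when a*c is zero, which is what lets the model handle MM > PM.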

25. 10/11/04 Affymetrix PM "-" MM Goodness of fit

26. 10/11/04 Affymetrix Define center of f: Residual r=log(e); Under log-normal assumption, fit for least r^2; But we should fix the tails (where outliers show up, and the approximation breaks down)

27. 10/11/04 Affymetrix Robustness: Want to "discount" outliers compared to sum-of-squares; Off-the-shelf: Geman-McClure transformation f(r,z) = r^2/(1+r^2/z); Looks like least-squares for r small; Bounds the contribution of each residual to at most z
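
Both properties of the Geman-McClure transformation are easy to see numerically. This is a small sketch; the value z = 4 is an arbitrary choice for illustration, not a setting taken from the talk.

```python
def geman_mcclure(r, z):
    """f(r, z) = r^2 / (1 + r^2 / z): about r^2 for small r, below z for any r."""
    return r * r / (1.0 + r * r / z)

z = 4.0  # arbitrary tuning constant for this illustration
for r in (0.1, 1.0, 10.0, 100.0):
    # Compare the plain squared residual with its transformed value.
    print(r, r * r, geman_mcclure(r, z))
```

Small residuals are scored almost exactly as in least squares, while a residual of 100 contributes less than z instead of 10,000.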

28. 10/11/04 Affymetrix Transformation f(r) and its Influence Function

30. 10/11/04 Affymetrix PLIER "on a t-shirt": y = a*c; e = [y+sqrt(y^2+4*PM*MM)]/(2*PM); r = log(e); f(r,z) = r^2/(1+r^2/z); argmin(sum(f(r,z))) over all a,c >=0 yields PLIER estimate of affinity and concentration
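
The t-shirt recipe translates almost line for line into an objective function over a probes-by-experiments table. This is a minimal sketch, not the released implementation: the function name, the nested-list layout (`pm[i][j]` for probe i in experiment j), the value z = 10 and the noise-free example table are all assumptions made for illustration.

```python
import math

def plier_objective(pm, mm, a, c, z):
    """Sum of f(r, z) over all probes i and experiments j."""
    total = 0.0
    for i, a_i in enumerate(a):
        for j, c_j in enumerate(c):
            y = a_i * c_j                                   # predicted signal
            e = (y + math.sqrt(y * y + 4.0 * pm[i][j] * mm[i][j])) / (2.0 * pm[i][j])
            r = math.log(e)                                 # residual
            total += r * r / (1.0 + r * r / z)              # Geman-McClure
    return total

# Hypothetical noise-free table: PM = a*c + MM with a constant MM of 50.
a_true = [0.5, 1.0, 2.0]
c_true = [10.0, 100.0]
mm = [[50.0 for _ in c_true] for _ in a_true]
pm = [[a_i * c_j + 50.0 for c_j in c_true] for a_i in a_true]

at_truth = plier_objective(pm, mm, a_true, c_true, z=10.0)
at_start = plier_objective(pm, mm, [1.0] * 3, [0.0] * 2, z=10.0)
print(at_truth, at_start)
```

On noise-free data the objective vanishes at the generating affinities and concentrations and is positive at the a = 1, c = 0 starting guess, so a minimizer has something to descend toward.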

31. 10/11/04 Affymetrix Optimizing (finding minima): Many ways to find best fit; Easiest to explain is cyclic coordinate descent, aka "polishing" the data; Can start anywhere (but best to start with a good guess)

32. 10/11/04 Affymetrix Finding affinity/concentration [Don't I need to know one to start?]

33. 10/11/04 Affymetrix Observed PM/MM values

34. 10/11/04 Affymetrix Compare observed to predicted (find where to improve predictions)

35. 10/11/04 Affymetrix "Polishing" the table: Guess initial values (a = 1.0, c=0); Find best concentrations (with current affinities); Find best affinities (for current concentrations); Repeat until minimized (or bored) [remember: values non-negative!]
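
The polishing steps above can be sketched as alternating one-dimensional minimizations. This is an illustration under stated assumptions rather than the released algorithm: each coordinate is updated with a golden-section search (which assumes the one-dimensional cost is unimodal), the search interval [0, max PM] is a crude bound chosen here, and z = 10, the fixed sweep count and the example table are hypothetical.

```python
import math

def residual(pm, mm, y):
    """PLIER residual r = log(e) for a predicted signal y = a*c."""
    return math.log((y + math.sqrt(y * y + 4.0 * pm * mm)) / (2.0 * pm))

def loss(r, z):
    """Geman-McClure transformation of a residual."""
    return r * r / (1.0 + r * r / z)

def objective(pm, mm, a, c, z):
    return sum(loss(residual(pm[i][j], mm[i][j], a[i] * c[j]), z)
               for i in range(len(a)) for j in range(len(c)))

def minimize_1d(fn, lo, hi, iters=60):
    """Golden-section search on [lo, hi]; assumes fn is unimodal there."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
    f1, f2 = fn(x1), fn(x2)
    for _ in range(iters):
        if f1 < f2:   # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - g * (hi - lo)
            f1 = fn(x1)
        else:         # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + g * (hi - lo)
            f2 = fn(x2)
    return (lo + hi) / 2.0

def polish(pm, mm, z=10.0, sweeps=20):
    n_probes, n_expts = len(pm), len(pm[0])
    a = [1.0] * n_probes                 # initial guess from the slide
    c = [0.0] * n_expts
    top = max(max(row) for row in pm)    # crude search bound (assumption)
    for _ in range(sweeps):
        for j in range(n_expts):         # best concentrations, affinities fixed
            c[j] = minimize_1d(
                lambda x: sum(loss(residual(pm[i][j], mm[i][j], a[i] * x), z)
                              for i in range(n_probes)), 0.0, top)
        for i in range(n_probes):        # best affinities, concentrations fixed
            a[i] = minimize_1d(
                lambda x: sum(loss(residual(pm[i][j], mm[i][j], x * c[j]), z)
                              for j in range(n_expts)), 0.0, top)
    return a, c

# Hypothetical noise-free table: PM = a*c + MM with a constant MM of 50.
a_true, c_true = [0.5, 1.0, 2.0], [10.0, 100.0, 400.0]
mm = [[50.0] * len(c_true) for _ in a_true]
pm = [[a_i * c_j + 50.0 for c_j in c_true] for a_i in a_true]
a_hat, c_hat = polish(pm, mm)
print(objective(pm, mm, [1.0] * 3, [0.0] * 3, 10.0), objective(pm, mm, a_hat, c_hat, 10.0))
```

Searching only on [0, top] keeps every value non-negative, as the slide requires. Only the products a*c are identified by the model, so the fitted affinities and concentrations are determined up to a shared scale factor; the sketch does not normalize them.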

36. 10/11/04 Affymetrix How does it work on real data? Gold-standard data generated by spiking in known transcript; Example is one of the transcripts (6th); Look at residuals to find outliers

37. 10/11/04 Affymetrix Latin Square Experimental Design

38. 10/11/04 Affymetrix Model fit: A=1.0, C=0.0

39. 10/11/04 Affymetrix Residuals: Fit Concentration

40. 10/11/04 Affymetrix Residuals: Fit Probes

41. 10/11/04 Affymetrix Fit concentration and affinities - data fits except for outliers [clearly revealed]

42. 10/11/04 Affymetrix What are the outliers?

43. 10/11/04 Affymetrix Final results (value)

44. 10/11/04 Affymetrix Know everything (approx)

45. 10/11/04 Affymetrix Trick: P/A calls by fit

46. 10/11/04 Affymetrix Trick: models are good for residuals!

47. 10/11/04 Affymetrix Optimization (harder to illustrate): Current implementation uses descent optimization (Newton-like); Start with a good initial guess (median polish); Improve by descent; Try jumps to escape local minima

48. 10/11/04 Affymetrix Evaluating performance: MvA plots (unbiased/biased); Receiver Operating Characteristic (ROC); Area Under Curve (AUC) (global/stratified); Benchmark results

49. 10/11/04 Affymetrix MvA plots: Scatterplot turned 45 degrees; Plotting A vs B; M = log(A)-log(B); A = (log(A)+log(B))/2 ("average"); Allows easy visualization of changes
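
The MvA coordinates are a direct transformation of each pair of signals. The sketch below uses base-2 logarithms, a common convention that is an assumption here since the slide does not state a base; the function name and signal values are hypothetical.

```python
import math

def mva(signal_a, signal_b):
    """Return (M, A): log-ratio and average log-intensity of two positive signals."""
    log_a, log_b = math.log2(signal_a), math.log2(signal_b)
    return log_a - log_b, (log_a + log_b) / 2.0

# Hypothetical signals for three probesets in experiments A and B.
pairs = [(8.0, 2.0), (100.0, 100.0), (50.0, 400.0)]
for sig_a, sig_b in pairs:
    m, avg = mva(sig_a, sig_b)
    print(sig_a, sig_b, m, avg)
```

With base-2 logarithms, M = 1 corresponds to a 2-fold change and unchanged probesets sit on the horizontal line M = 0.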

50. 10/11/04 Affymetrix MvA (bias added for stabilization)

51. 10/11/04 Affymetrix Receiver Operating Characteristic: ROC curves measure separation of distributions for two states, "changed" or "unchanged" between pair(s) of experiments; Depends on the variation of the signal within an experiment, and the separation between the two states; Note that just measuring variation or just measuring separation can be misleading! One popular method of defining "changed" is a fold-change threshold; ROC curves can be summarized by "area under curve"
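
The area under an ROC curve equals the probability that a randomly chosen "changed" item scores higher than a randomly chosen "unchanged" one, counting ties as one half. This sketch computes it by direct pairwise comparison; the function name and the scores (for example absolute log fold-changes) are hypothetical.

```python
def roc_auc(changed, unchanged):
    """AUC as the fraction of (changed, unchanged) pairs ranked correctly."""
    wins = 0.0
    for x in changed:
        for y in unchanged:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5   # ties count as half a win
    return wins / (len(changed) * len(unchanged))

# Hypothetical scores for spiked (changed) and background (unchanged) probesets.
changed_scores = [3.0, 4.0, 5.0]
unchanged_scores = [1.0, 2.0, 3.0]
print(roc_auc(changed_scores, unchanged_scores))
```

An AUC of 1.0 means the two distributions are fully separated and 0.5 means the score carries no information; both the spread within each group and the gap between groups move the value, which is the point made on the slide.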

52. 10/11/04 Affymetrix Overall performance good (ROC)

53. 10/11/04 Affymetrix Specific performance regimes are of interest: Low, medium, high concentrations; Relatively small fold-changes (2-fold, 4-fold); Thresholds defined by fold-change; Thresholds defined by change relative to variation ("t-like statistic")

55. 10/11/04 Affymetrix Output characteristics of some standard methods. MAS 5.0: Not variance stabilized(*), some bias, runs on single chips; PLIER: Not variance stabilized(*), minimal bias, reduced variance, runs on multiple chips; RMA: Variance stable, noticeable bias, low variance, runs on multiple chips. (*)[Can always apply stabilizing transformation]

61. 10/11/04 Affymetrix Works fine on U133 too

62. 10/11/04 Affymetrix Bonus: M-estimator tricks. Handle PM-only (PM+MM, PM-B) just fine by replacing error model in f; Play Bayesian games (affinity penalties, concentration penalties)

63. 10/11/04 Affymetrix M-estimator: PM only. PM-B = a*c [background estimate "perfect"]; e*PM-B = a*c; e = (a*c+B)/PM; proceed in the same framework using e; Note that B can be zero for (a*c>0)
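
Swapping in the PM-only error model changes only how e is computed; the residual r = log(e) then feeds the same framework. A minimal sketch with hypothetical names and intensities:

```python
import math

def pm_only_error(pm, background, a, c):
    """Multiplicative error e satisfying e*PM - B = a*c."""
    return (a * c + background) / pm

# Hypothetical probe: PM = 150 with background 50 is explained exactly by a*c = 100.
e_exact = pm_only_error(pm=150.0, background=50.0, a=2.0, c=50.0)
r_exact = math.log(e_exact)

# With B = 0 the residual is still defined as long as a*c > 0.
e_no_bg = pm_only_error(pm=150.0, background=0.0, a=2.0, c=50.0)
r_no_bg = math.log(e_no_bg)

print(r_exact, r_no_bg)
```

If both B and a*c were zero, e would be zero and log(e) undefined, which is the reason for the slide's note that B can be zero only when a*c > 0.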

64. 10/11/04 Affymetrix PM-only: global background biased

65. 10/11/04 Affymetrix Can play "Bayesian" games: Probe affinities likely to be "log-normal" distributed; Add a penalty term to avoid overweighting any single probe; Good when insufficient data; sum(log(e)^2) + (penalty)*sum(log(a)^2) [Can do the same for concentration]
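
The penalized objective on this slide adds a log-normal style prior on the affinities to the fit term. This sketch takes precomputed residuals r = log(e) as input; the function name, the penalty weight and the example values are hypothetical, and affinities must be strictly positive for log(a) to exist.

```python
import math

def penalized_objective(residuals, affinities, penalty):
    """sum(r^2) + penalty * sum(log(a)^2), with r = log(e) and all a > 0."""
    fit = sum(r * r for r in residuals)
    prior = penalty * sum(math.log(a) ** 2 for a in affinities)
    return fit + prior

# Hypothetical residuals and affinities; an affinity of 1.0 incurs no penalty.
residuals = [0.1, -0.2]
affinities = [1.0, math.e]
print(penalized_objective(residuals, affinities, penalty=0.5))
```

The penalty pulls each affinity toward 1 on the log scale, so a probe with very little supporting data cannot take an extreme value just to absorb noise.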

66. 10/11/04 Affymetrix Bayesian prior on probes

67. 10/11/04 Affymetrix PLIER: M-estimators form a very flexible framework for analysis; Can handle PM-B, PM-MM, PM-only approaches in same framework; Handles zero/near-zero concentration & affinities in model directly; Seems to produce good results

68. 10/11/04 Affymetrix PLIER: obtaining an implementation. PLIER algorithm SDK is now available under a GPL open source license. The code is available as C++ without Windows dependencies. Documentation is included at the site. All of us at Affymetrix hope that releasing PLIER in this manner promotes all of the values that the Bioconductor community embraces. http://www.affymetrix.com/support/developer/index.affx

69. 10/11/04 Affymetrix Thanks: David Kulp, Sejal Shah, Simon Cawley, David Finkelstein, Mike Lelivelt, Teresa Webster, Rui Mei, Suzanne Dee, Stefan Bekiranov, Xiaojun Di, Alex Cheung, Steve Lincoln. Many, many others!
