Recent Developments 15min
Previous Topics
https://exp-platform.com/2018StrataABTutorial/
• Machine learning in Causal Inference
• Variance Reduction and Regression Adjustment
• Beyond Average Treatment Effect, a.k.a. Effect Heterogeneity
Topics
• Bayes Factor Bounds
• Treatment Effect Estimation and Empirical Bayes: learn prior from historical data
If time permits
• Continuous Monitoring and Adaptive Experimentation
Bound Bayes Factor
• Observe a Z-score (or t-stat when sample size is large)
• We don’t know $H_1$, but need $P(Z \mid H_1)$ to compute the Bayes Factor $P(Z \mid H_1) / P(Z \mid H_0)$
• Simple bounds can be derived under different assumptions regarding $H_1$, i.e. the distribution of the treatment effect under the alternative
Opportunistic $H_1$: Likelihood ratio bound
• The $H_1$ that favors $H_1$ most puts the mean $\mu$ of Z exactly where the data point: Z has the largest likelihood when $\mu = Z$ (MLE)
• If we want $H_1$ to be symmetric around 0, e.g. two-sided test, the likelihood ratio bound puts $\mu$ at Z and -Z with 50% probability each
• The likelihood ratio bound is the simplest, and the most aggressive in favor of $H_1$
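A minimal sketch of the two likelihood ratio bounds described above, assuming $Z \sim N(\mu, 1)$ with $H_0: \mu = 0$. The function names are illustrative and not from the original slides.

```python
import math


def lr_bound_one_sided(z):
    """Largest possible Bayes factor: H1 puts all its mass on mu = z (the MLE).

    N(z; z, 1) / N(z; 0, 1) = exp(z^2 / 2).
    """
    return math.exp(z * z / 2.0)


def lr_bound_symmetric(z):
    """Bayes factor when H1 puts 50% mass on mu = +z and 50% on mu = -z.

    N(z; -z, 1) / N(z; 0, 1) = exp(-3 z^2 / 2), so the second term is tiny
    for moderately large |z| and the bound is roughly half the one-sided one.
    """
    return 0.5 * (math.exp(z * z / 2.0) + math.exp(-1.5 * z * z))


if __name__ == "__main__":
    for z in (1.96, 2.58, 3.0):
        print(z, lr_bound_one_sided(z), lr_bound_symmetric(z))
```

Even under this most favorable alternative, a Z-score of 1.96 yields a Bayes factor below 7, which is modest evidence compared with the impression a p-value of 0.05 often gives.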
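Power bound and UMPBT
• Power bound: assume a specific power under $H_1$ (some chosen number) when testing at a specific significance level $\alpha$ (another chosen number)
• UMPBT (Uniformly Most Powerful Bayesian Test): put $H_1$ at 1.96 and -1.96 (for $\alpha = 0.05$, two-sided) with equal probability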
Local bound
• More realistically, if $H_0$ sets the effect to be 0, the effect under $H_1$ should be more likely to be close to 0 than far away from 0
• Local $H_1$: $H_1$ is such that the distribution of $-\log(\text{p-value})$ under $H_1$ has a decreasing hazard rate
• For a two-sided test, think of it as a unimodal symmetric distribution with mode at 0 whose density decays at both tails under a certain condition
• Originally derived assuming the p-value under $H_1$ follows Beta($\xi$, 1)
• Bayes Factor bound = $1 / (-e \, p \log p)$ for $p < 1/e$
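The local bound can be evaluated directly from a p-value. The sketch below assumes the Sellke et al. (2001) form of the bound, $1 / (-e \, p \log p)$ for $p < 1/e$ and 1 otherwise; the function name is illustrative.

```python
import math


def local_bf_bound(p_value):
    """Upper bound on the Bayes factor in favor of H1 under a local H1.

    Uses 1 / (-e * p * ln p) for p < 1/e; for larger p the bound is 1,
    i.e. the data cannot favor H1 at all under this class of alternatives.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    if p_value >= 1.0 / math.e:
        return 1.0
    return 1.0 / (-math.e * p_value * math.log(p_value))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005, 0.001):
        print(p, round(local_bf_bound(p), 2))
```

Under these assumptions a p-value of 0.05 corresponds to a Bayes factor of at most about 2.5, and a p-value of 0.005 to at most about 14.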
Not knowing the genuine true prior should not prevent you from using Bayes Factors and hypothesis classification
Treatment Effect Estimation
Regression/Posterior Mean: $E(\mu \mid \text{Data})$
Benefit:
• Robust against multiple testing and post-inference selection/Winner’s curse (Why?)
• In principle, Data need to include observations for other metrics, and other replications of the same experiment
• For any given metric X, Data only need to include all metrics with correlated effects
• Usually only a small number of metrics have highly correlated effects, so conditioning on each metric alone is good enough
• Reduced variance: the posterior variance is usually smaller than the sampling variance of the MLE
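As an illustration of the shrinkage and reduced variance mentioned above, here is a minimal sketch for the simplest case: a normal prior on the true effect and a normally distributed observed delta. The normal-normal model and all names are assumptions for illustration; the slides also consider Laplace and t priors, which do not have this closed form.

```python
def normal_posterior(observed_delta, std_error, prior_sd, prior_mean=0.0):
    """Posterior mean and variance of the true effect under a normal prior.

    Model (assumed): effect ~ N(prior_mean, prior_sd^2),
                     observed_delta | effect ~ N(effect, std_error^2).
    """
    shrink = prior_sd ** 2 / (prior_sd ** 2 + std_error ** 2)
    post_mean = prior_mean + shrink * (observed_delta - prior_mean)
    # shrink < 1, so the posterior variance is below the sampling variance.
    post_var = shrink * std_error ** 2
    return post_mean, post_var


if __name__ == "__main__":
    mean, var = normal_posterior(observed_delta=2.0, std_error=1.0, prior_sd=1.0)
    print(mean, var)  # 1.0 0.5
```

With equal prior and sampling variance the observed delta is shrunk halfway toward the prior mean, and the posterior variance is half the sampling variance.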
Simulation study: Prior ~ t(df = 3), trained with Laplace prior
[Figure: Posterior mean/MLE. Note the decreasing coefficient.]
[Figure: Coverage of 95% “confidence interval”. Left: posterior mean and variance. Right: MLE and its sampling variance. Annotation: reduced variance.]
Simulation Study: Estimation curve and true effects
[Figure: estimation curves for different traffic sizes. Axis labels: True Effect, Observed Delta. Annotation: fatter than normal tail, less shrinkage at tails.]
Prior Learning
• When you have a set of historical data
• Parametric prior model
  • Normal, Laplace, t, etc.
  • Two group model: the effect is 0 with probability p, and comes from another distribution with probability 1-p
• Maximize marginal likelihood: classic empirical Bayes, EM algorithm
  • Fit the unknown parameters first, then apply them to the posterior mean and posterior variance.
  • Some parameters, such as the degrees of freedom of t, are very hard to fit.
• Regression based: treat treatment effect estimation as the goal, and directly fit the parameters to maximize prediction accuracy
  • Issue: unknown ground truth of the effect.
  • Solution 1: Experiment splitting (a paper at this conference).
  • Solution 2: Estimate the testing error using SURE (Stein’s unbiased risk estimate)
  • Neither approach requires a parametric model for the prior; you only need to model the posterior mean directly.
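The marginal likelihood route can be sketched for the simplest parametric prior. The example below assumes a zero-mean normal prior, so that each historical observed delta is marginally $N(0, \tau^2 + \sigma_i^2)$, and fits the prior standard deviation $\tau$ by a plain grid search on simulated data. The simulation setup, grid, and function names are all illustrative assumptions, not the procedure used in the talk.

```python
import math
import random


def neg_log_marginal(tau, deltas, std_errors):
    """Negative log marginal likelihood when delta_i ~ N(0, tau^2 + se_i^2)."""
    total = 0.0
    for d, se in zip(deltas, std_errors):
        v = tau * tau + se * se
        total += 0.5 * (math.log(2.0 * math.pi * v) + d * d / v)
    return total


def fit_prior_sd(deltas, std_errors, grid=None):
    """Return the grid value of tau that maximizes the marginal likelihood."""
    if grid is None:
        grid = [0.01 * k for k in range(1, 301)]  # tau in 0.01 .. 3.00
    return min(grid, key=lambda tau: neg_log_marginal(tau, deltas, std_errors))


def simulate_history(n, true_tau, seed=0):
    """Simulate n historical experiments with known standard errors."""
    rng = random.Random(seed)
    std_errors = [rng.uniform(0.2, 1.0) for _ in range(n)]
    effects = [rng.gauss(0.0, true_tau) for _ in range(n)]
    deltas = [rng.gauss(mu, se) for mu, se in zip(effects, std_errors)]
    return deltas, std_errors


if __name__ == "__main__":
    deltas, std_errors = simulate_history(n=2000, true_tau=0.5)
    print("fitted prior sd:", fit_prior_sd(deltas, std_errors))
```

The fitted $\tau$ can then be plugged into the posterior mean and variance formulas for a new experiment. Heavier-tailed priors such as Laplace or t would need a numerical marginal likelihood instead of this closed form.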
Treatment Effect Prediction Error for pval<10%
Simulation, t with df=3. MSE for selected pval<0.1 (about 19%).
[Figure: MSE by method]
• MLE
• Boosting tree: predict effect using observed delta and sample size. Expect sample size to have a nonlinear effect here. Weighted by sample size
• Weighted linear regression using observed delta alone
• Assume Normal prior with mean 0
• Laplace (double exponential) prior
• Weighted linear regression but with one more sensitive metric
Continuous Monitoring and Adaptive Experimentation
• Continuous Monitoring and Early stopping
• Interim traffic reallocation: Multi-armed bandit
• Contextual multi-armed bandit and continuous exploration and exploitation in high dimension configuration space
• Bayesian global optimization in high dimension configuration space
Should you consider these?
• Do you have a clear and simple goal and constraints?
• Do you want to watch out for unexpected movements/results?
• Is the signal/reward instantly observable or relatively short term? Do you randomize by page-view or want to track users over a time period? Do you worry about switching users’ experience?
• Do you expect the treatment effect to be relatively stable over time? Do you want to learn the intraweek pattern of the effect?
• Is your goal pure exploration, aimed at identifying the winning variant/arm? Do you care about the opportunity cost during the experiment period, e.g. regret?
• Do you have a large number of variants or a high dimensional configuration space to optimize for?
Continuous Monitoring
• Bayes Factor and posterior based methods support continuous monitoring and early stopping
• The interpretation of Bayes Factor and posterior probability holds and requires no adjustment! (Genuine or empirical Bayes prior, or Bayes Factor bound)
• mSPRT (mixture sequential probability ratio test) used by Optimizely:
  • Looks at the Bayes Factor assuming the effect has a normal prior. The unknown prior variance needs to be specified manually or fit from historical data (based on Gold/Silver/Bronze tiers of customers)
  • The test procedure has a one-to-one mapping to posterior based methods. The KDD 2017 version stops when BF > $1/\alpha$
  • The procedure has a frequentist flavor and does not try to control FDR. Instead it tries to control Type-I error and needs extra steps for FDR control
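A minimal sketch of monitoring with a normal-prior Bayes factor, in the spirit of the mSPRT description above. It assumes i.i.d. observations of the treatment-control difference with known standard deviation `sigma`, a $N(0, \tau^2)$ prior on the effect under $H_1$, and a caller-supplied stopping `threshold`. These names and the simplified one-sample setup are illustrative assumptions, not Optimizely's implementation.

```python
import math


def normal_prior_bayes_factor(mean_diff, n, sigma, tau):
    """Bayes factor of H1 (effect ~ N(0, tau^2)) against H0 (effect = 0).

    The running mean of n observations is N(0, sigma^2/n) under H0 and
    N(0, sigma^2/n + tau^2) marginally under H1; the BF is their density ratio.
    """
    v0 = sigma * sigma / n
    v1 = v0 + tau * tau
    return math.sqrt(v0 / v1) * math.exp(mean_diff ** 2 * tau * tau / (2.0 * v0 * v1))


def monitor(observations, sigma, tau, threshold):
    """Check the Bayes factor after every observation; stop once it exceeds threshold.

    Returns (stopping_index, bayes_factor); stopping_index is None if the
    threshold was never crossed.
    """
    total = 0.0
    bf = 1.0
    for n, x in enumerate(observations, start=1):
        total += x
        bf = normal_prior_bayes_factor(total / n, n, sigma, tau)
        if bf > threshold:
            return n, bf
    return None, bf


if __name__ == "__main__":
    # Hypothetical threshold of 20, e.g. 1/alpha with alpha = 0.05.
    print(monitor([1.0] * 50, sigma=1.0, tau=1.0, threshold=20.0))
    print(monitor([0.0] * 50, sigma=1.0, tau=1.0, threshold=20.0))
```

Because the Bayes factor is recomputed from all data seen so far, checking it after every observation does not change its interpretation, which is the property the slide highlights.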
Global Optimization/Best Arm Identification
• Opportunity cost is not part of the concern
• Pure-exploration multi-armed bandit: no topology or relation between the arms; assumes the rewards are independent of each other
• Typically arms are configured in a continuous/ordered space, the reward surface tends to be smooth, and rewards at two configurations are more correlated if they are closer
• Bayesian global optimization = multi-armed bandit + Gaussian Process
  • Closed formula
  • Traffic reallocation based on heuristics such as knowledge gradient or expected improvement
  • Some of the structure in the GP can be shared across tasks, using offline data
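To make the expected improvement heuristic concrete, the sketch below evaluates its standard closed form for a maximization problem, given a Gaussian posterior mean and standard deviation at a candidate configuration. It assumes the GP posterior has already been computed elsewhere and that `best` is the best value observed so far; the function and parameter names are illustrative.

```python
import math


def expected_improvement(post_mean, post_sd, best):
    """Expected improvement over `best` for a N(post_mean, post_sd^2) posterior.

    EI = (mu - best) * Phi(z) + sd * phi(z), with z = (mu - best) / sd.
    """
    if post_sd <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(post_mean - best, 0.0)
    z = (post_mean - best) / post_sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (post_mean - best) * cdf + post_sd * pdf


if __name__ == "__main__":
    # Two candidates with the same posterior mean: the more uncertain one
    # has the larger expected improvement, which drives exploration.
    print(expected_improvement(0.0, 0.5, best=0.2))
    print(expected_improvement(0.0, 2.0, best=0.2))
```

Traffic would then be steered toward configurations with high expected improvement. Noisy observations, as in online experiments, call for variants of this criterion that this sketch does not cover.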
References: Bayes Factor and Estimation
• “Redefine statistical significance”, Benjamin et al., 2017, Nature Human Behaviour
• “Calibration of p Values for Testing Precise Null Hypotheses”, Sellke et al., 2001, The American Statistician
• “Objective Bayesian two sample hypothesis testing for online controlled experiments”, Deng, 2015, WWW
• “Three Recommendations for Improving the Use of p-Values”, Benjamin and Berger, 2019, The American Statistician
• “Uniformly most powerful Bayesian tests”, Johnson, 2013, Annals of Statistics
• “On p-Values and Bayes Factors”, Held and Ott, 2018, Annual Review of Statistics and Its Application
• “Improving Treatment Effect Estimators Through Experiment Splitting”, Coey and Cunningham, 2019, WebConf
References: Continuous Monitoring
• “Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing”, Deng et al., 2016, IEEE DSAA
• “A Sequential Test for Selecting the Better Variant”, Ju et al., 2019, WSDM
• “Peeking at A/B Tests: Why It Matters, and What to Do About It”, Johari et al., 2017, KDD
• “Optional Stopping with Bayes Factors: a categorization and extension of folklore results, with an application to invariant situations”, Hendriksen et al., 2018, preprint
References: Adaptive Experimentation
• “Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems”, Bubeck and Cesa-Bianchi, 2012, Now Publishers
• “Constrained Bayesian Optimization with Noisy Experiments”, Letham et al., 2018, Bayesian Analysis
• “Bayesian optimization for policy search via mixed online-offline experimentation”, Letham and Bakshy, JMLR
Questions? http://exp-platform.com