Conditional and Reference Class Linear Regression


Presentation Transcript


  1. Conditional and Reference Class Linear Regression
  Brendan Juba, Washington University in St. Louis
  Based on the ITCS’17 paper and joint works with: Hainline, Le, and Woodruff; Calderon, Li, Li, and Ruan. AISTATS’19; arXiv:1806.02326

  2. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  3. How can we determine which data is useful and relevant for making data-driven inferences?

  4. Conditional Linear Regression
  • Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R
  • Find:
    • a k-DNF c(x) (recall: an OR of “terms” of size ≤ k; terms are ANDs of Boolean “literals”)
    • parameters w ∈ R^d
  • Such that on the condition c(x), the linear rule <w,y> predicts z.
  [figure: c(x) = • ∨ •, a disjunction of terms selecting the subset of the data on which <w,y> fits z]
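As a concrete illustration of the objects in this problem, here is a minimal Python sketch (mine, not from the talk) of a k-DNF condition over Boolean attributes and the empirical conditional squared loss it induces; the term representation is an assumption made for illustration.

```python
import numpy as np

# A term is a tuple of signed literals (sign, index): sign=+1 requires x[index]=1,
# sign=-1 requires x[index]=0. A k-DNF is a list (OR) of such terms of size <= k.

def term_satisfied(term, x):
    return all((x[i] == 1) if sign > 0 else (x[i] == 0) for sign, i in term)

def dnf_satisfied(dnf, x):
    return any(term_satisfied(t, x) for t in dnf)

def conditional_sq_loss(dnf, w, X_bool, Y, z):
    """Empirical version of E[(<w,y> - z)^2 | c(x) = 1]: average squared residual
    of the linear rule <w,y> over the examples selected by the condition c."""
    mask = np.array([dnf_satisfied(dnf, x) for x in X_bool])
    if not mask.any():
        return float("inf")
    resid = Y[mask] @ w - z[mask]
    return float(np.mean(resid ** 2))
```

For example, the condition c(x) = (x_0 ∧ ¬x_2) ∨ x_1 would be written (with 0-based indices) as [((1, 0), (-1, 2)), ((1, 1),)].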

  5. Reference Class Regression
  • Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R, and a point of interest x*
  • Find:
    • a k-DNF c(x)
    • parameters w ∈ R^d
  • Such that on the condition c(x), the linear rule <w,y> predicts z, and c(x*) = 1.

  6. Motivation
  • Rosenfeld et al. 2015: some sub-populations have strong risk factors for cancer that are insignificant in the full population
  • “Intersectionality” in the social sciences: subpopulations may behave differently
  • Good experiments isolate a set of conditions in which a desired effect has a simple model
  • It’s useful to find these “segments” or “conditions” in which simple models fit
  • And we don’t expect to be able to model all cases simply…

  7. Results: algorithms for conditional linear regression
  We can solve this problem for k-DNF conditions on n Boolean attributes and regression on d real attributes when:
  • w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
    • loss ε ⇒ Õ(ε n^{k/2}) for general k-DNF
    • loss ε ⇒ Õ(ε T log log n) for T-term k-DNF
  • For general coefficients w with σ-subgaussian residuals and max over terms t of c of ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
    • l_2 loss ε ⇒ Õ(ε T (log log n + log σ)) for T-term k-DNF
  Technique: “list regression” (BBV’08, J’17, CSV’17)

  8. Why only k-DNF?
  • Theorem (informal): algorithms that find w and c satisfying E[(<w,y>−z)^2 | c(x)] ≤ α(n)ε whenever a conjunction c* exists would enable PAC-learning of DNF (in the usual, “distribution-free” model).
  • Sketch: encode the DNF’s “labels” for x as follows
    • Label 1 ⟹ z ≡ 0 (easy to predict with w = 0)
    • Label 0 ⟹ z has high variance (high prediction error)
  • Terms of the DNF are conjunctions c* with easy z
  • A c selecting x with easy-to-predict z gives a weak learner for DNF

  9. Why only approximate?
  • The same construction shows: algorithms for conditional linear regression solve the agnostic learning task for c(x) on Boolean x
  • State of the art for k-DNFs suggests we should expect a poly(n) blow-up of the loss, α(n) (n = number of Boolean attributes)
  • ZMJ’17: achieve an Õ(n^{k/2}) blow-up for k-DNF for the corresponding agnostic learning task
  • JLM’18: achieve an Õ(T log log n) blow-up for T-term k-DNF
  • Formal evidence of hardness? Open question!

  10. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  11. Main technique – “list regression”
  • Given examples, we can find a polynomial-size list of candidate linear predictors that includes an approximately optimal linear rule w’ on the unknown subset S = {(x(j), y(j), z(j)) : c*(x(j)) = 1, j = 1,…,m} for the unknown condition c*(x).
  • We then learn a condition c for each w in the list, and take a pair (w’, c’) for which c’ satisfies a μ-fraction of the data.

  12. Sparse max-norm “list regression” (J.’17)
  • Fix a set of s coordinates i_1,…,i_s.
  • For the unknown subset S, the optimal linear rule using i_1,…,i_s is the solution to the following linear program:
    minimize ε subject to −ε ≤ w_{i_1} y_{i_1}(j) + … + w_{i_s} y_{i_s}(j) − z(j) ≤ ε for j ∈ S.
  • With s+1 dimensions, the optimum is attained at a basic feasible solution given by s+1 tight constraints.

  13. Sparse max-norm “list regression” (J.’17)
  • The optimal linear rule using i_1,…,i_s is given by the solution to the system
    w_{i_1} y_{i_1}(j_r) + … + w_{i_s} y_{i_s}(j_r) − z(j_r) = σ_r ε for r = 1,…,s+1,
    for some j_1,…,j_{s+1} ∈ S and σ_1,…,σ_{s+1} ∈ {−1,+1}.
  • Enumerate (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1} and solve for w in each case.
  • This includes all (j_1,…,j_{s+1}) ∈ S^{s+1} (regardless of S!)
  • The list has size d^s m^{s+1} 2^{s+1} = poly(d, m) for constant s.
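A hedged Python sketch of the enumeration described on this slide, assuming the real attributes are given as a matrix Y and the targets as a vector z (names are mine): each choice of s coordinates, s+1 rows, and a sign pattern yields one (s+1)×(s+1) linear system in (w, ε), and every solution goes on the candidate list. Enumerating combinations rather than all of [m]^{s+1} is a shortcut: repeated rows only produce singular systems.

```python
import itertools
import numpy as np

def sparse_candidate_list(Y, z, s):
    """Enumerate (i_1..i_s, j_1..j_{s+1}, sigma_1..sigma_{s+1}) and solve
    w_{i_1} y_{i_1}(j_r) + ... + w_{i_s} y_{i_s}(j_r) - z(j_r) = sigma_r * eps
    for (w, eps); returns a list of (coords, w, eps) candidates."""
    m, d = Y.shape
    candidates = []
    for coords in itertools.combinations(range(d), s):
        for rows in itertools.combinations(range(m), s + 1):
            for signs in itertools.product((-1.0, 1.0), repeat=s + 1):
                # Unknowns are (w_1, ..., w_s, eps); one equation per chosen row.
                A = np.column_stack([Y[np.ix_(rows, coords)], -np.array(signs)])
                b = z[list(rows)]
                try:
                    sol = np.linalg.solve(A, b)
                except np.linalg.LinAlgError:
                    continue  # degenerate choice of rows: skip
                candidates.append((coords, sol[:-1], abs(sol[-1])))
    return candidates
```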

  14. Summary – algorithm for max-norm conditional linear regression (J.’17)
  For each (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1}:
  • Solve w_{i_1} y_{i_1}(j_r) + … + w_{i_s} y_{i_s}(j_r) − z(j_r) = σ_r ε for r = 1,…,s+1
  • If ε > ε* (given), continue to the next iteration.
  • Initialize c to the k-DNF over all terms of size k   [learn the condition c using “labels” provided by w]
  • For j = 1,…,m, if |<w, y(j)> − z(j)| > ε:
    • for each term T in c with T(x(j)) = 1, remove T from c
  • If #{j : c(x(j)) = 1} > μ’m (μ’ initially 0):   [choose the condition c that covers the most data]
    • set μ’ = #{j : c(x(j)) = 1}/m, w’ = w, c’ = c
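A runnable sketch of this loop, reusing term_satisfied and dnf_satisfied from the earlier sketch and sparse_candidate_list from above; the helper names and the use of terms of size up to k are my own choices, while ε*, μ’, and the structure follow the slide.

```python
from itertools import combinations

def all_terms(n, k):
    """All terms of size <= k over n Boolean variables, as tuples of signed literals."""
    lits = [(+1, i) for i in range(n)] + [(-1, i) for i in range(n)]
    return [t for size in range(1, k + 1)
            for t in combinations(lits, size)
            if len({i for _, i in t}) == size]      # no variable used twice

def max_norm_conditional_regression(X_bool, Y, z, s, k, eps_star):
    n, m = len(X_bool[0]), len(z)
    best_frac, best = 0.0, None                     # best_frac plays the role of mu'
    for coords, w, eps in sparse_candidate_list(Y, z, s):
        if eps > eps_star:                          # candidate cannot meet the target loss
            continue
        c = set(all_terms(n, k))                    # start from the k-DNF over all terms
        for j, x in enumerate(X_bool):
            if abs(Y[j, list(coords)] @ w - z[j]) > eps:
                # point j is badly predicted by w: drop every term it satisfies
                c = {t for t in c if not term_satisfied(t, x)}
        frac = sum(dnf_satisfied(c, x) for x in X_bool) / m
        if frac > best_frac:                        # keep the condition covering the most data
            best_frac, best = frac, (coords, w, c)
    return best
```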

  15. Extension to l_p-norm list regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Consider the matrix S = [y(j), z(j)] over the rows j = 1,…,m with c*(x(j)) = 1
  • The l_p norm of S(w,−1) approximates the l_p loss of w on c*
  • Since w* ∈ R^d is s-sparse, there exists a small “sketch” matrix S’ such that ‖Sv‖_p ≈ ‖S’v‖_p for all vectors v supported on these s coordinates
  • (Cohen-Peng ’15): moreover, the rows of S’ can be taken to be O(1)-rescaled rows of S
  • New algorithm: search over approximate weights and minimize the l_p loss to find candidates for w (a sketch follows below)
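The row-sampling step itself (computing the Lewis-weight rescalings of Cohen-Peng) is beyond this sketch. The fragment below only illustrates, under that assumption, how a candidate w could be fit on a small surrogate row set by directly minimizing the l_p loss over the s chosen coordinates; the function and parameter names are mine.

```python
import numpy as np
from scipy.optimize import minimize

def lp_fit_on_sketch(Y_sketch, z_sketch, p):
    """Fit w on a small set of (possibly rescaled) rows standing in for the
    sketch S': minimize sum_j |<w, y_j> - z_j|^p by direct numerical search.
    Reasonable here because the sparse setting keeps the dimension s small."""
    s = Y_sketch.shape[1]
    loss = lambda w: float(np.sum(np.abs(Y_sketch @ w - z_sketch) ** p))
    res = minimize(loss, x0=np.zeros(s), method="Nelder-Mead")
    return res.x
```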

  16. l_p-norm conditional linear regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)ε
  • Use |<w,y(i)>−z(i)|^p as the weight/label for the i-th point
  • Easy Õ(T log log n) approximation for T-term k-DNF (a greedy sketch follows below):
    • only consider terms t with E[|<w,y>−z|^p | t(x)] · Pr[t(x)] ≤ ε · Pr[c*(x)]
    • greedy algorithm for partial set cover (terms = sets of points), covering a (1−γ)Pr[c*(x)]-fraction
    • obtains a cover of size T log m – a small k-DNF c’
    • Haussler ’88: to estimate T-term k-DNFs, we only require m = O(Tk log n) points ⇒ Õ(ε T log log n) loss on c’
  • Reference class: add any surviving term satisfied by x*
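A hedged Python sketch of this two-step extraction: first discard terms whose weighted residual mass exceeds the budget ε·Pr[c*(x)], then greedily cover the target fraction of points. The data-structure choices (dicts of point sets and residual sums) and the way the budget and target count are passed in are my assumptions; the weighted ratio-objective variant appears on slide 27.

```python
def extract_kdnf(term_points, term_resid, eps_budget, target_count):
    """term_points: dict mapping each term to the set of point indices it satisfies;
    term_resid: dict mapping each term to the sum of |<w,y_i> - z_i|^p over those points.
    Step 1: keep only terms whose residual mass is within the budget
            (the empirical analogue of E[|<w,y>-z|^p | t] * Pr[t] <= eps * Pr[c*]).
    Step 2: greedy partial set cover with the surviving terms."""
    good = {t: pts for t, pts in term_points.items()
            if term_resid[t] <= eps_budget}
    covered, chosen = set(), []
    while len(covered) < target_count:
        # pick the surviving term covering the most uncovered points
        t = max(good, key=lambda t: len(good[t] - covered), default=None)
        if t is None or not (good[t] - covered):
            break                       # cannot cover any more points
        chosen.append(t)
        covered |= good[t]
    return chosen                       # the terms of the extracted k-DNF c'
```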

  17. l_p-norm conditional linear regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)ε
  • Use |<w,y(i)>−z(i)|^p as the weight/label for the i-th point
  • ZMJ’17 / Peleg ’07: a more sophisticated algorithm achieves an Õ(n^{k/2}) approximation for general k-DNF (plug in directly to obtain conditional linear regression)
  • J., Li ’19: the same guarantee can be obtained for the reference-class variant

  18. l_2-norm conditional linear regression vs. selective linear regression
  [figure: LIBSVM benchmarks; red: conditional regression, black: selective regression. Panels: Boston (m=506, d=13), Bodyfat (m=252, d=14), Space_GA (m=3107, d=6), Cpusmall (m=8192, d=12)]

  19. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning (joint work with Calderon, Li, Li, and Ruan; arXiv)
  • Recap and challenges for future work

  20. CSV’17-style List Learning
  • Basic algorithm: alternate between a soft relaxation of fitting the unknown S and outlier detection and reduction
  • Improve accuracy by clustering the output w’s and “recentering”
  • We reformulate CSV’17 in terms of terms (rather than individual points)

  21. Relaxation of fitting the unknown c
  • For fixed weights (“inlier” indicators) u(1),…,u(T) ∈ [0,1]
  • Each term t has its own parameters w(t)
  • Solve: min_{w,Y} Σ_t u(t)|t| l_t(w(t)) + λ tr(Y)
    (Y: enclosing ellipsoid; λ: a carefully chosen constant)
  [figure: the w(t)’s lie inside the enclosing ellipsoid Y, with the outlier w’s falling outside]
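A hedged cvxpy sketch of this relaxation for squared loss: each term keeps its own w(t), the shared PSD matrix plays the role of the enclosing ellipsoid, and λ is taken as given. Encoding "w(t) lies in the ellipsoid described by Y" via a Schur-complement constraint is my reading of the picture, not a transcription of the paper's program.

```python
import cvxpy as cp
import numpy as np

def relaxation_step(terms, u, lam, d):
    """terms: dict t -> (A_t, b_t), the rows (y_i, z_i) belonging to term t;
    u: dict t -> current inlier weight in [0,1]; lam: regularization constant.
    Minimizes sum_t u(t)|t| l_t(w(t)) + lam * tr(Y) over the w(t) and Y."""
    Y_ell = cp.Variable((d, d), PSD=True)          # enclosing ellipsoid
    w = {t: cp.Variable(d) for t in terms}
    objective = lam * cp.trace(Y_ell)
    constraints = []
    for t, (A, b) in terms.items():
        # u(t)|t| * (mean squared loss on term t) = u(t) * sum of squared residuals
        objective += u[t] * cp.sum_squares(A @ w[t] - b)
        # Schur-complement encoding of w(t) w(t)^T <= Y_ell (assumed ellipsoid constraint)
        w_col = cp.reshape(w[t], (d, 1))
        constraints.append(cp.bmat([[Y_ell, w_col], [w_col.T, np.eye(1)]]) >> 0)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return {t: w[t].value for t in terms}, Y_ell.value
```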

  22. Outlier detection and reduction
  • Fix parameters w(1),…,w(T) ∈ R^d
  • Give each term t its own formula indicators c_{t’}(t)
  • Must find a “coalition” c(t) of ≥ μ’-fraction (|c(t)| = Σ_{t’∈c(t)} |t’|, μ’ = (1/m) Σ_{t∈c} |t|) such that w(t) ≈ ŵ(t) = (1/|c(t)|) Σ_{t’∈c(t)} c_{t’}(t) |t’| w(t’)
  • Reduce the inlier indicator u(t) by a factor of 1 − |l_t(w(t)) − l_t(ŵ(t))| / max_{t’} |l_{t’}(w(t’)) − l_{t’}(ŵ(t’))| (see the sketch below)
  • Intuition: points fit by parameters in a small ellipsoid have a good coalition for which the objective value changes little (for smooth loss l); outliers cannot find a good coalition and are damped/removed.
  [figure: inlier w(t)’s sit near their coalition averages ŵ(t); the outlier’s ŵ is far from its w]
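A minimal sketch of just the damping rule from the bullet above (finding the coalitions and the averages ŵ(t) is assumed to have been done elsewhere); the array names are mine.

```python
import numpy as np

def damp_inlier_weights(u, loss_own, loss_coalition):
    """u[t]: current inlier indicator of term t; loss_own[t] = l_t(w(t));
    loss_coalition[t] = l_t(w_hat(t)). Each u[t] is multiplied by
    1 - |l_t(w(t)) - l_t(w_hat(t))| / max_t' |l_t'(w(t')) - l_t'(w_hat(t'))|,
    so the term whose loss moves the most under recentering is damped hardest."""
    delta = np.abs(loss_own - loss_coalition)
    worst = delta.max()
    if worst == 0:
        return u                        # nothing looks like an outlier
    return u * (1.0 - delta / worst)
```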

  23. Clustering and recentering
  • Full algorithm (initially, one data cluster): alternate between running the basic algorithm on the cluster centers and clustering its outputs
  • Next iteration: run the basic algorithm with parameters of radius R/2 centered on each cluster

  24. Overview of analysis (1/3): loss bounds from the basic algorithm
  • Follows the same essential outline as CSV’17
  • Guarantee for the basic algorithm: we find ŵ such that, given ‖w*‖ ≤ R,
    E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R max_{w, t∈c*} ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ / √μ)
  • where ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ ⟶ ‖(w − w*)(Cov(y|t) − Cov(y|c*))‖
  • The bound is O(R² max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ) for all R ≥ 1/poly(γμm/σT) (errors σ-subgaussian on c*)

  25. Overview of analysis (2/3): from loss to accuracy via convexity
  • We can find ŵ such that, given ‖w*‖ ≤ R,
    E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R² ‖Cov(y|t) − Cov(y|c*)‖ / √μ)
  • This implies that for all significant t in c*,
    ‖w(t) − w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
    where κ is the convexity parameter of l(w)
  • Iteratively reduce R with clustering and recentering…

  26. Overview of analysis (3/3): reduce R by clustering & recentering
  • In each iteration, for all significant t in c*,
    ‖w(t) − w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
    where κ is the convexity parameter of l(w)
  • Padded decompositions (FRT’04): we can find a list of clusterings such that one cluster in each contains w(t) for all significant t in c*, with high probability
  • Moreover, if κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ), then we obtain a new cluster center ŵ in our list such that ‖ŵ − w*‖ ≤ R/2, with high probability
  • So we can iterate, reducing R → 1/poly(γμm/σT)
  • m large ⟹ ‖ŵ(t) − w*‖ ⟶ 0

  27. Finishing up: obtaining a k-DNF from ŵ
  • Cast as a weighted partial set cover instance (terms = sets of points) using Σ_{i: t(x(i))=1} l_i(w) as the weight of term t, with the ratio objective (cost/size); see the greedy sketch below
  • ZMJ’17: with the ratio objective, we still obtain an O(log μm) approximation
  • Recall: we chose ŵ to optimize Σ_{t∈c} Σ_{i: t(x(i))=1} l_i(w) (= the cost of the cover c in the set cover instance) – this adds only a T-factor
  • Recall: we only consider terms satisfied with probability ≥ γμ/T – so we use at most T/γ terms
  • Haussler ’88, again: we only need ~O((Tk/γ) log n) points
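A hedged sketch of the weighted partial-cover step with the ratio objective (cost per newly covered point), complementing the coverage-only greedy shown after slide 16; the data structures and the stopping count are my assumptions.

```python
def ratio_greedy_cover(term_points, term_cost, target_count):
    """term_points: dict term -> set of point indices it satisfies;
    term_cost: dict term -> sum_{i: t(x_i)=1} l_i(w), the term's weight.
    Repeatedly pick the term minimizing cost / (# newly covered points)
    until at least target_count points are covered."""
    covered, chosen = set(), []
    while len(covered) < target_count:
        best_t, best_ratio = None, float("inf")
        for t, pts in term_points.items():
            gain = len(pts - covered)
            if gain == 0:
                continue
            ratio = term_cost[t] / gain
            if ratio < best_ratio:
                best_t, best_ratio = t, ratio
        if best_t is None:
            break                       # no term covers any new point
        chosen.append(best_t)
        covered |= term_points[best_t]
    return chosen                       # the terms of the output k-DNF
```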

  28. Summary: guarantee for general conditional linear regression
  Theorem. Suppose that D is a distribution over x ∈ {0,1}^n, y ∈ R^d, and z ∈ R such that for some T-term k-DNF c* and w* ∈ R^d, <w*,y> − z is σ-subgaussian on D|c* with E[(<w*,y>−z)² | c*(x)] ≤ ε, Pr[c*(x)] ≥ μ, and E[(<w,y>−z)² | c*(x)] is κ-strongly convex in w with κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ). Then there is a polynomial-time algorithm that uses examples from D to find w and c such that, with probability 1−δ, E[(<w,y>−z)² | c(x)] ≤ Õ(εT(log log n + log σ)) and Pr[c(x)] ≥ (1−γ)μ.

  29. Comparison to sparse conditional linear regression on benchmarks
  [figure: Space_GA (m=3107, d=6), Cpusmall (m=8192, d=12)]
  Small benchmarks (Bodyfat, m=252, d=14; Boston, m=506, d=13): does not converge
  Code: https://github.com/wumming/lud

  30. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  31. Summary: new algorithms for conditional linear regression
  We can solve conditional linear regression for k-DNF conditions on n Boolean attributes and regression on d real attributes when:
  • w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
    • loss ε ⇒ Õ(ε n^{k/2}) for general k-DNF
    • loss ε ⇒ Õ(ε T log log n) for T-term k-DNF
  • For general coefficients w with σ-subgaussian residuals and max over terms t of c of ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
    • l_2 loss ε ⇒ Õ(ε T (log log n + log σ)) for T-term k-DNF
  Technique: “list regression” (BBV’08, J’17, CSV’17)

  32. Open problems
  • Remove the covariance requirement!
  • Improve the large-formula error bounds to O(n^{k/2} ε) for general (dense) regression
  • Algorithms without semidefinite programming? Without padded decompositions?
    • Note: the algorithms for sparse regression already satisfy 1–3.
  • Formal evidence for the hardness of polynomial-factor approximations for agnostic learning?
  • Conditional supervised learning for other hypothesis classes?

  33. References
  • Hainline, Juba, Le, Woodruff. Conditional sparse l_p-norm regression with optimal probability. In AISTATS, PMLR 89:1042–1050, 2019.
  • Calderon, Juba, Li, Li, Ruan. Conditional linear regression. arXiv:1806.02326 [cs.LG], 2018.
  • Juba. Conditional sparse linear regression. In ITCS, 2017.
  • Rosenfeld, Graham, Hamoudi, Butawan, Eneh, Kahn, Miah, Niranjan, Lovat. MIAT: A novel attribute selection approach to better predict upper gastrointestinal cancer. In DSAA, pp. 1–7, 2015.
  • Balcan, Blum, Vempala. A discriminative framework for clustering via similarity functions. In STOC, pp. 671–680, 2008.
  • Charikar, Steinhardt, Valiant. Learning from untrusted data. In STOC, pp. 47–60, 2017.
  • Fakcharoenphol, Rao, Talwar. A tight bound on approximating arbitrary metrics by tree metrics. JCSS 69(3):485–497, 2004.
  • Zhang, Mathew, Juba. An improved algorithm for learning to perform exception-tolerant abduction. In AAAI, pp. 1257–1265, 2017.
  • Juba, Li, Miller. Learning abduction under partial observability. In AAAI, pp. 1888–1896, 2018.
  • Cohen, Peng. l_p row sampling by Lewis weights. In STOC, 2015.
  • Haussler. Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework. Artificial Intelligence, 36:177–221, 1988.
  • Peleg. Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems. J. Discrete Algorithms, 5:55–64, 2007.
