Conditional and Reference Class Linear Regression


Presentation Transcript


  1. Conditional and Reference Class Linear Regression
  Brendan Juba, Washington University in St. Louis
  Based on the ITCS’17 paper and joint works with: Hainline, Le, and Woodruff; Calderon, Li, Li, and Ruan. AISTATS’19; arXiv:1806.02326

  2. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  3. How can we determine which data is useful and relevant for making data-driven inferences?

  4. Conditional Linear Regression
  • Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R
  • Find:
    • a k-DNF c(x) (recall: an OR of “terms” of size ≤ k; terms are ANDs of Boolean “literals”)
    • parameters w ∈ R^d
  • Such that on the condition c(x), the linear rule <w,y> predicts z.
  [figure: c(x) = • ∨ •, a disjunction of terms selecting the subset of the data on which <w,y> fits z]
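As a concrete illustration of the objects in this problem, here is a minimal Python sketch (mine, not from the talk) of a k-DNF condition over Boolean attributes and the empirical conditional squared loss it induces; the term representation is an assumption made for illustration.

```python
import numpy as np

# A term is a tuple of signed literals (sign, index): sign=+1 requires x[index]=1,
# sign=-1 requires x[index]=0. A k-DNF is a list (OR) of such terms of size <= k.

def term_satisfied(term, x):
    return all((x[i] == 1) if sign > 0 else (x[i] == 0) for sign, i in term)

def dnf_satisfied(dnf, x):
    return any(term_satisfied(t, x) for t in dnf)

def conditional_sq_loss(dnf, w, X_bool, Y, z):
    """Empirical version of E[(<w,y> - z)^2 | c(x) = 1]: average squared residual
    of the linear rule <w,y> over the examples selected by the condition c."""
    mask = np.array([dnf_satisfied(dnf, x) for x in X_bool])
    if not mask.any():
        return float("inf")
    resid = Y[mask] @ w - z[mask]
    return float(np.mean(resid ** 2))
```

For example, the condition c(x) = (x_0 ∧ ¬x_2) ∨ x_1 would be written (with 0-based indices) as [((1, 0), (-1, 2)), ((1, 1),)].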

  5. Reference Class Regression
  • Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R, and a point of interest x*
  • Find:
    • a k-DNF c(x)
    • parameters w ∈ R^d
  • Such that on the condition c(x), the linear rule <w,y> predicts z, and c(x*) = 1.

  6. Motivation
  • Rosenfeld et al. 2015: some sub-populations have strong risk factors for cancer that are insignificant in the full population
  • “Intersectionality” in the social sciences: subpopulations may behave differently
  • Good experiments isolate a set of conditions in which a desired effect has a simple model
  • It’s useful to find these “segments” or “conditions” in which simple models fit
  • And we don’t expect to be able to model all cases simply…

  7. Results: algorithms for conditional linear regression
  We can solve this problem for k-DNF conditions on n Boolean attributes and regression on d real attributes when:
  • w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
    • loss ε ⇒ Õ(ε n^{k/2}) for general k-DNF
    • loss ε ⇒ Õ(ε T log log n) for T-term k-DNF
  • For general coefficients w with σ-subgaussian residuals and max over terms t of c of ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
    • l_2 loss ε ⇒ Õ(ε T (log log n + log σ)) for T-term k-DNF
  Technique: “list regression” (BBV’08, J’17, CSV’17)

  8. Why only k-DNF?
  • Theorem (informal): algorithms that find w and c satisfying E[(<w,y>−z)^2 | c(x)] ≤ α(n)ε whenever a conjunction c* exists would enable PAC-learning of DNF (in the usual, “distribution-free” model).
  • Sketch: encode the DNF’s “labels” for x as follows
    • Label 1 ⟹ z ≡ 0 (easy to predict with w = 0)
    • Label 0 ⟹ z has high variance (high prediction error)
  • Terms of the DNF are conjunctions c* with easy z
  • A c selecting x with easy-to-predict z gives a weak learner for DNF

  9. Why only approximate?
  • The same construction shows: algorithms for conditional linear regression solve the agnostic learning task for c(x) on Boolean x
  • State of the art for k-DNFs suggests we should expect a poly(n) blow-up of the loss, α(n) (n = number of Boolean attributes)
  • ZMJ’17: achieve an Õ(n^{k/2}) blow-up for k-DNF for the corresponding agnostic learning task
  • JLM’18: achieve an Õ(T log log n) blow-up for T-term k-DNF
  • Formal evidence of hardness? Open question!

  10. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  11. Main technique – “list regression”
  • Given examples, we can find a polynomial-size list of candidate linear predictors that includes an approximately optimal linear rule w’ on the unknown subset S = {(x(j), y(j), z(j)) : c*(x(j)) = 1, j = 1,…,m} for the unknown condition c*(x).
  • We then learn a condition c for each w in the list, and take a pair (w’, c’) for which c’ satisfies a μ-fraction of the data.

  12. Sparse max-norm “list regression” (J.’17)
  • Fix a set of s coordinates i_1,…,i_s.
  • For the unknown subset S, the optimal linear rule using i_1,…,i_s is the solution to the following linear program:
    minimize ε subject to −ε ≤ w_{i_1} y_{i_1}(j) + … + w_{i_s} y_{i_s}(j) − z(j) ≤ ε for j ∈ S.
  • With s+1 dimensions, the optimum is attained at a basic feasible solution given by s+1 tight constraints.

  13. Sparse max-norm “list regression” (J.’17)
  • The optimal linear rule using i_1,…,i_s is given by the solution to the system
    w_{i_1} y_{i_1}(j_r) + … + w_{i_s} y_{i_s}(j_r) − z(j_r) = σ_r ε for r = 1,…,s+1,
    for some j_1,…,j_{s+1} ∈ S and σ_1,…,σ_{s+1} ∈ {−1,+1}.
  • Enumerate (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1} and solve for w in each case.
  • This includes all (j_1,…,j_{s+1}) ∈ S^{s+1} (regardless of S!)
  • The list has size d^s m^{s+1} 2^{s+1} = poly(d, m) for constant s.
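A hedged Python sketch of the enumeration described on this slide, assuming the real attributes are given as a matrix Y and the targets as a vector z (names are mine): each choice of s coordinates, s+1 rows, and a sign pattern yields one (s+1)×(s+1) linear system in (w, ε), and every solution goes on the candidate list. Enumerating combinations rather than all of [m]^{s+1} is a shortcut: repeated rows only produce singular systems.

```python
import itertools
import numpy as np

def sparse_candidate_list(Y, z, s):
    """Enumerate (i_1..i_s, j_1..j_{s+1}, sigma_1..sigma_{s+1}) and solve
    w_{i_1} y_{i_1}(j_r) + ... + w_{i_s} y_{i_s}(j_r) - z(j_r) = sigma_r * eps
    for (w, eps); returns a list of (coords, w, eps) candidates."""
    m, d = Y.shape
    candidates = []
    for coords in itertools.combinations(range(d), s):
        for rows in itertools.combinations(range(m), s + 1):
            for signs in itertools.product((-1.0, 1.0), repeat=s + 1):
                # Unknowns are (w_1, ..., w_s, eps); one equation per chosen row.
                A = np.column_stack([Y[np.ix_(rows, coords)], -np.array(signs)])
                b = z[list(rows)]
                try:
                    sol = np.linalg.solve(A, b)
                except np.linalg.LinAlgError:
                    continue  # degenerate choice of rows: skip
                candidates.append((coords, sol[:-1], abs(sol[-1])))
    return candidates
```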

  14. Summary – algorithm for max-norm conditional linear regression (J.’17)
  For each (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1}:
  • Solve w_{i_1} y_{i_1}(j_r) + … + w_{i_s} y_{i_s}(j_r) − z(j_r) = σ_r ε for r = 1,…,s+1
  • If ε > ε* (given), continue to the next iteration.
  • Initialize c to the k-DNF over all terms of size k   [learn the condition c using “labels” provided by w]
  • For j = 1,…,m, if |<w, y(j)> − z(j)| > ε:
    • for each term T in c with T(x(j)) = 1, remove T from c
  • If #{j : c(x(j)) = 1} > μ’m (μ’ initially 0):   [choose the condition c that covers the most data]
    • set μ’ = #{j : c(x(j)) = 1}/m, w’ = w, c’ = c
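A runnable sketch of this loop, reusing term_satisfied and dnf_satisfied from the earlier sketch and sparse_candidate_list from above; the helper names and the use of terms of size up to k are my own choices, while ε*, μ’, and the structure follow the slide.

```python
from itertools import combinations

def all_terms(n, k):
    """All terms of size <= k over n Boolean variables, as tuples of signed literals."""
    lits = [(+1, i) for i in range(n)] + [(-1, i) for i in range(n)]
    return [t for size in range(1, k + 1)
            for t in combinations(lits, size)
            if len({i for _, i in t}) == size]      # no variable used twice

def max_norm_conditional_regression(X_bool, Y, z, s, k, eps_star):
    n, m = len(X_bool[0]), len(z)
    best_frac, best = 0.0, None                     # best_frac plays the role of mu'
    for coords, w, eps in sparse_candidate_list(Y, z, s):
        if eps > eps_star:                          # candidate cannot meet the target loss
            continue
        c = set(all_terms(n, k))                    # start from the k-DNF over all terms
        for j, x in enumerate(X_bool):
            if abs(Y[j, list(coords)] @ w - z[j]) > eps:
                # point j is badly predicted by w: drop every term it satisfies
                c = {t for t in c if not term_satisfied(t, x)}
        frac = sum(dnf_satisfied(c, x) for x in X_bool) / m
        if frac > best_frac:                        # keep the condition covering the most data
            best_frac, best = frac, (coords, w, c)
    return best
```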

  15. Extension to l_p-norm list regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Consider the matrix S = [y(j), z(j)] over the rows j = 1,…,m with c*(x(j)) = 1
  • The l_p norm of S(w,−1) approximates the l_p loss of w on c*
  • Since w* ∈ R^d is s-sparse, there exists a small “sketch” matrix S’ such that ‖Sv‖_p ≈ ‖S’v‖_p for all vectors v supported on these s coordinates
  • (Cohen-Peng ’15): moreover, the rows of S’ can be taken to be O(1)-rescaled rows of S
  • New algorithm: search over approximate weights and minimize the l_p loss to find candidates for w (a sketch follows below)
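The row-sampling step itself (computing the Lewis-weight rescalings of Cohen-Peng) is beyond this sketch. The fragment below only illustrates, under that assumption, how a candidate w could be fit on a small surrogate row set by directly minimizing the l_p loss over the s chosen coordinates; the function and parameter names are mine.

```python
import numpy as np
from scipy.optimize import minimize

def lp_fit_on_sketch(Y_sketch, z_sketch, p):
    """Fit w on a small set of (possibly rescaled) rows standing in for the
    sketch S': minimize sum_j |<w, y_j> - z_j|^p by direct numerical search.
    Reasonable here because the sparse setting keeps the dimension s small."""
    s = Y_sketch.shape[1]
    loss = lambda w: float(np.sum(np.abs(Y_sketch @ w - z_sketch) ** p))
    res = minimize(loss, x0=np.zeros(s), method="Nelder-Mead")
    return res.x
```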

  16. l_p-norm conditional linear regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)ε
  • Use |<w,y(i)>−z(i)|^p as the weight/label for the i-th point
  • Easy Õ(T log log n) approximation for T-term k-DNF (a greedy sketch follows below):
    • only consider terms t with E[|<w,y>−z|^p | t(x)] · Pr[t(x)] ≤ ε · Pr[c*(x)]
    • greedy algorithm for partial set cover (terms = sets of points), covering a (1−γ)Pr[c*(x)]-fraction
    • obtains a cover of size T log m – a small k-DNF c’
    • Haussler ’88: to estimate T-term k-DNFs, we only require m = O(Tk log n) points ⇒ Õ(ε T log log n) loss on c’
  • Reference class: add any surviving term satisfied by x*
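A hedged Python sketch of this two-step extraction: first discard terms whose weighted residual mass exceeds the budget ε·Pr[c*(x)], then greedily cover the target fraction of points. The data-structure choices (dicts of point sets and residual sums) and the way the budget and target count are passed in are my assumptions; the weighted ratio-objective variant appears on slide 27.

```python
def extract_kdnf(term_points, term_resid, eps_budget, target_count):
    """term_points: dict mapping each term to the set of point indices it satisfies;
    term_resid: dict mapping each term to the sum of |<w,y_i> - z_i|^p over those points.
    Step 1: keep only terms whose residual mass is within the budget
            (the empirical analogue of E[|<w,y>-z|^p | t] * Pr[t] <= eps * Pr[c*]).
    Step 2: greedy partial set cover with the surviving terms."""
    good = {t: pts for t, pts in term_points.items()
            if term_resid[t] <= eps_budget}
    covered, chosen = set(), []
    while len(covered) < target_count:
        # pick the surviving term covering the most uncovered points
        t = max(good, key=lambda t: len(good[t] - covered), default=None)
        if t is None or not (good[t] - covered):
            break                       # cannot cover any more points
        chosen.append(t)
        covered |= good[t]
    return chosen                       # the terms of the extracted k-DNF c'
```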

  17. l_p-norm conditional linear regression (joint work with Hainline, Le, Woodruff; AISTATS’19)
  • Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)ε
  • Use |<w,y(i)>−z(i)|^p as the weight/label for the i-th point
  • ZMJ’17 / Peleg ’07: a more sophisticated algorithm achieves an Õ(n^{k/2}) approximation for general k-DNF (plug in directly to obtain conditional linear regression)
  • J., Li ’19: the same guarantee can be obtained for the reference-class variant

  18. l_2-norm conditional linear regression vs. selective linear regression
  [figure: LIBSVM benchmarks; red: conditional regression, black: selective regression. Panels: Boston (m=506, d=13), Bodyfat (m=252, d=14), Space_GA (m=3107, d=6), Cpusmall (m=8192, d=12)]

  19. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning (joint work with Calderon, Li, Li, and Ruan; arXiv)
  • Recap and challenges for future work

  20. CSV’17-style List Learning
  • Basic algorithm: alternate between a soft relaxation of fitting the unknown S and outlier detection and reduction
  • Improve accuracy by clustering the output w’s and “recentering”
  • We reformulate CSV’17 in terms of terms (rather than individual points)

  21. Relaxation of fitting the unknown c
  • For fixed weights (“inlier” indicators) u(1),…,u(T) ∈ [0,1]
  • Each term t has its own parameters w(t)
  • Solve: min_{w,Y} Σ_t u(t)|t| l_t(w(t)) + λ tr(Y)
    (Y: enclosing ellipsoid; λ: a carefully chosen constant)
  [figure: the w(t)’s lie inside the enclosing ellipsoid Y, with the outlier w’s falling outside]
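A hedged cvxpy sketch of this relaxation for squared loss: each term keeps its own w(t), the shared PSD matrix plays the role of the enclosing ellipsoid, and λ is taken as given. Encoding "w(t) lies in the ellipsoid described by Y" via a Schur-complement constraint is my reading of the picture, not a transcription of the paper's program.

```python
import cvxpy as cp
import numpy as np

def relaxation_step(terms, u, lam, d):
    """terms: dict t -> (A_t, b_t), the rows (y_i, z_i) belonging to term t;
    u: dict t -> current inlier weight in [0,1]; lam: regularization constant.
    Minimizes sum_t u(t)|t| l_t(w(t)) + lam * tr(Y) over the w(t) and Y."""
    Y_ell = cp.Variable((d, d), PSD=True)          # enclosing ellipsoid
    w = {t: cp.Variable(d) for t in terms}
    objective = lam * cp.trace(Y_ell)
    constraints = []
    for t, (A, b) in terms.items():
        # u(t)|t| * (mean squared loss on term t) = u(t) * sum of squared residuals
        objective += u[t] * cp.sum_squares(A @ w[t] - b)
        # Schur-complement encoding of w(t) w(t)^T <= Y_ell (assumed ellipsoid constraint)
        w_col = cp.reshape(w[t], (d, 1))
        constraints.append(cp.bmat([[Y_ell, w_col], [w_col.T, np.eye(1)]]) >> 0)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return {t: w[t].value for t in terms}, Y_ell.value
```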

  22. Outlier detection and reduction
  • Fix parameters w(1),…,w(T) ∈ R^d
  • Give each term t its own formula indicators c_{t’}(t)
  • Must find a “coalition” c(t) of ≥ μ’-fraction (|c(t)| = Σ_{t’∈c(t)} |t’|, μ’ = (1/m) Σ_{t∈c} |t|) such that w(t) ≈ ŵ(t) = (1/|c(t)|) Σ_{t’∈c(t)} c_{t’}(t) |t’| w(t’)
  • Reduce the inlier indicator u(t) by a factor of 1 − |l_t(w(t)) − l_t(ŵ(t))| / max_{t’} |l_{t’}(w(t’)) − l_{t’}(ŵ(t’))| (see the sketch below)
  • Intuition: points fit by parameters in a small ellipsoid have a good coalition for which the objective value changes little (for smooth loss l); outliers cannot find a good coalition and are damped/removed.
  [figure: inlier w(t)’s sit near their coalition averages ŵ(t); the outlier’s ŵ is far from its w]
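A minimal sketch of just the damping rule from the bullet above (finding the coalitions and the averages ŵ(t) is assumed to have been done elsewhere); the array names are mine.

```python
import numpy as np

def damp_inlier_weights(u, loss_own, loss_coalition):
    """u[t]: current inlier indicator of term t; loss_own[t] = l_t(w(t));
    loss_coalition[t] = l_t(w_hat(t)). Each u[t] is multiplied by
    1 - |l_t(w(t)) - l_t(w_hat(t))| / max_t' |l_t'(w(t')) - l_t'(w_hat(t'))|,
    so the term whose loss moves the most under recentering is damped hardest."""
    delta = np.abs(loss_own - loss_coalition)
    worst = delta.max()
    if worst == 0:
        return u                        # nothing looks like an outlier
    return u * (1.0 - delta / worst)
```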

  23. Clustering and recentering
  • Full algorithm (initially, one data cluster): alternate between running the basic algorithm on the cluster centers and clustering its outputs
  • Next iteration: run the basic algorithm with parameters of radius R/2 centered on each cluster

  24. Overview of analysis (1/3): loss bounds from the basic algorithm
  • Follows the same essential outline as CSV’17
  • Guarantee for the basic algorithm: we find ŵ such that, given ‖w*‖ ≤ R,
    E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R max_{w, t∈c*} ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ / √μ)
  • where ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ ⟶ ‖(w − w*)(Cov(y|t) − Cov(y|c*))‖
  • The bound is O(R² max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ) for all R ≥ 1/poly(γμm/σT) (errors σ-subgaussian on c*)

  25. Overview of analysis (2/3): from loss to accuracy via convexity
  • We can find ŵ such that, given ‖w*‖ ≤ R,
    E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R² ‖Cov(y|t) − Cov(y|c*)‖ / √μ)
  • This implies that for all significant t in c*,
    ‖w(t) − w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
    where κ is the convexity parameter of l(w)
  • Iteratively reduce R with clustering and recentering…

  26. Overview of analysis (3/3): reduce R by clustering & recentering
  • In each iteration, for all significant t in c*,
    ‖w(t) − w*‖² ≤ O(R² T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
    where κ is the convexity parameter of l(w)
  • Padded decompositions (FRT’04): we can find a list of clusterings such that one cluster in each contains w(t) for all significant t in c*, with high probability
  • Moreover, if κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ), then we obtain a new cluster center ŵ in our list such that ‖ŵ − w*‖ ≤ R/2, with high probability
  • So we can iterate, reducing R → 1/poly(γμm/σT)
  • m large ⟹ ‖ŵ(t) − w*‖ ⟶ 0

  27. Finishing up: obtaining a k-DNF from ŵ
  • Cast as a weighted partial set cover instance (terms = sets of points) using Σ_{i: t(x(i))=1} l_i(w) as the weight of term t, with the ratio objective (cost/size); see the greedy sketch below
  • ZMJ’17: with the ratio objective, we still obtain an O(log μm) approximation
  • Recall: we chose ŵ to optimize Σ_{t∈c} Σ_{i: t(x(i))=1} l_i(w) (= the cost of the cover c in the set cover instance) – this adds only a T-factor
  • Recall: we only consider terms satisfied with probability ≥ γμ/T – so we use at most T/γ terms
  • Haussler ’88, again: we only need ~O((Tk/γ) log n) points
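A hedged sketch of the weighted partial-cover step with the ratio objective (cost per newly covered point), complementing the coverage-only greedy shown after slide 16; the data structures and the stopping count are my assumptions.

```python
def ratio_greedy_cover(term_points, term_cost, target_count):
    """term_points: dict term -> set of point indices it satisfies;
    term_cost: dict term -> sum_{i: t(x_i)=1} l_i(w), the term's weight.
    Repeatedly pick the term minimizing cost / (# newly covered points)
    until at least target_count points are covered."""
    covered, chosen = set(), []
    while len(covered) < target_count:
        best_t, best_ratio = None, float("inf")
        for t, pts in term_points.items():
            gain = len(pts - covered)
            if gain == 0:
                continue
            ratio = term_cost[t] / gain
            if ratio < best_ratio:
                best_t, best_ratio = t, ratio
        if best_t is None:
            break                       # no term covers any new point
        chosen.append(best_t)
        covered |= term_points[best_t]
    return chosen                       # the terms of the output k-DNF
```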

  28. Summary: guarantee for general conditional linear regression
  Theorem. Suppose that D is a distribution over x ∈ {0,1}^n, y ∈ R^d, and z ∈ R such that for some T-term k-DNF c* and w* ∈ R^d, <w*,y> − z is σ-subgaussian on D|c* with E[(<w*,y>−z)² | c*(x)] ≤ ε, Pr[c*(x)] ≥ μ, and E[(<w,y>−z)² | c*(x)] is κ-strongly convex in w with κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ). Then there is a polynomial-time algorithm that uses examples from D to find w and c such that, with probability 1−δ, E[(<w,y>−z)² | c(x)] ≤ Õ(εT(log log n + log σ)) and Pr[c(x)] ≥ (1−γ)μ.

  29. Comparison to sparse conditional linear regression on benchmarks
  [figure: Space_GA (m=3107, d=6), Cpusmall (m=8192, d=12)]
  Small benchmarks (Bodyfat, m=252, d=14; Boston, m=506, d=13): does not converge
  Code: https://github.com/wumming/lud

  30. Outline
  • Introduction and motivation
  • Overview of sparse regression via list-learning
  • General regression via list-learning
  • Recap and challenges for future work

  31. Summary: new algorithms for conditional linear regression
  We can solve conditional linear regression for k-DNF conditions on n Boolean attributes and regression on d real attributes when:
  • w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
    • loss ε ⇒ Õ(ε n^{k/2}) for general k-DNF
    • loss ε ⇒ Õ(ε T log log n) for T-term k-DNF
  • For general coefficients w with σ-subgaussian residuals and max over terms t of c of ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
    • l_2 loss ε ⇒ Õ(ε T (log log n + log σ)) for T-term k-DNF
  Technique: “list regression” (BBV’08, J’17, CSV’17)

  32. Open problems
  • Remove the covariance requirement!
  • Improve the large-formula error bounds to O(n^{k/2} ε) for general (dense) regression
  • Algorithms without semidefinite programming? Without padded decompositions?
    • Note: the algorithms for sparse regression already satisfy 1–3.
  • Formal evidence for the hardness of polynomial-factor approximations for agnostic learning?
  • Conditional supervised learning for other hypothesis classes?

  33. References
  • Hainline, Juba, Le, Woodruff. Conditional sparse l_p-norm regression with optimal probability. In AISTATS, PMLR 89:1042–1050, 2019.
  • Calderon, Juba, Li, Li, Ruan. Conditional linear regression. arXiv:1806.02326 [cs.LG], 2018.
  • Juba. Conditional sparse linear regression. In ITCS, 2017.
  • Rosenfeld, Graham, Hamoudi, Butawan, Eneh, Kahn, Miah, Niranjan, Lovat. MIAT: A novel attribute selection approach to better predict upper gastrointestinal cancer. In DSAA, pp. 1–7, 2015.
  • Balcan, Blum, Vempala. A discriminative framework for clustering via similarity functions. In STOC, pp. 671–680, 2008.
  • Charikar, Steinhardt, Valiant. Learning from untrusted data. In STOC, pp. 47–60, 2017.
  • Fakcharoenphol, Rao, Talwar. A tight bound on approximating arbitrary metrics by tree metrics. JCSS 69(3):485–497, 2004.
  • Zhang, Mathew, Juba. An improved algorithm for learning to perform exception-tolerant abduction. In AAAI, pp. 1257–1265, 2017.
  • Juba, Li, Miller. Learning abduction under partial observability. In AAAI, pp. 1888–1896, 2018.
  • Cohen, Peng. l_p row sampling by Lewis weights. In STOC, 2015.
  • Haussler. Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework. Artificial Intelligence, 36:177–221, 1988.
  • Peleg. Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems. J. Discrete Algorithms, 5:55–64, 2007.
