
Constructing hard functions from learning algorithms

  1. Constructing hard functions from learning algorithms
     Igor Carboni Oliveira (Columbia University). Joint work with Adam Klivans (UT Austin) and Pravesh Kothari (UT Austin).

  2. Introduction and Motivation
     Hard problems in algorithm design (upper bounds):
     • Learn a class of functions C (example: learn ACC circuits).
     • Check satisfiability of circuits in C (solve ACC-SAT).
     Hard problem in complexity theory (lower bounds): prove circuit lower bounds. Example: find a function in EXP not in P/poly. More generally, exhibit an explicit function f (i.e., f in P, EXP, NP, etc.) not contained in some non-uniform class (AC0, ACC, TC0, etc.).
     Theme of this talk: algorithms for hard computational problems yield circuit lower bounds.

  3. It is actually possible to *prove* circuit lower bounds by designing non-trivial algorithms.
     Theorem [Karp-Lipton 1980]: if there exists an efficient algorithm for 3-SAT, then EXP ⊈ P/poly.
     Theorem [Kabanets-Impagliazzo 2004]: if we can derandomize Polynomial Identity Testing, then (i) NEXP ⊈ P/poly, or (ii) the Permanent is not computed by polynomial-size arithmetic circuits.
     Theorem [Williams 2011]: let C be a class of circuits (ACC, TC0, NC1, etc.). If there exists an algorithm for C-SAT that runs in deterministic time 2^n · poly(m) / n^{ω(1)}, where n is the number of inputs of the circuit and m is the size of the circuit, then NEXP ⊈ C.
     Corollary: NEXP ⊈ ACC (a new circuit lower bound!).

  4. Interesting fact: circuit lower bounds often precede the development of new algorithms.
     Example 1: [Håstad 1987] Switching Lemma (PARITY ∉ AC0). Used by [Linial-Mansour-Nisan 1993] to prove that AC0 can be learned in quasi-polynomial time.
     Example 2: [Paturi-Pudlák-Zane 1999] Satisfiability Coding Lemma. Implies tight lower bounds for depth-3 circuits and improved upper bounds for the k-SAT problem.

  5. Intuition: circuit lower bound proofs often reveal new structural properties of a circuit class, allowing functions in the class to be learned/tested in some non-trivial way.
     [Fortnow-Klivans 2006] formalizes this intuition. Important conceptual result: they prove that some circuit lower bound is actually necessary for learning.

  6. Outline of this talk: improved circuit lower bounds from learning algorithms.
     Part 1) Lower bounds from PAC learning algorithms.
     Part 2) Lower bounds from exact learning algorithms.
     Part 3) Lower bounds from learning algorithms using statistical queries.
     Remark: all results have meaningful parameterized versions, where we allow the learning algorithm to run in time T(n, size(c), …). For simplicity, we will usually discuss lower bounds from efficient learning algorithms.

  7. Review: the PAC model under the uniform distribution with membership queries.
     Think of C as your favorite circuit class: AC0, ACC, TC0, NC1, etc.
     Membership query oracle MQ_c: given any x ∈ {0,1}^n, returns c(x).
     Definition. Let C be any class of boolean functions. An algorithm A PAC learns C if, for every c ∈ C and every ε > 0, given n, ε, size(c) as input and membership query access to c, algorithm A outputs with high probability (over its internal randomness) a hypothesis h such that Pr[c(x) ≠ h(x)] < ε (under the uniform distribution). We measure the running time of A as a function T = T(n, 1/ε, size(c)).
     Observations: 1) C = ∪_n C_n, where C_n is a set of boolean functions over {0,1}^n, for each input size n. 2) We associate to each c ∈ C its size size(c) under some efficient representation. 3) We assume the hypothesis h can be evaluated in time T.
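To fix interfaces for the rest of the talk, here is a minimal Python sketch of a membership-query oracle and of estimating the error of a hypothesis under the uniform distribution. All names (MQOracle, evaluate_error) are ours and purely illustrative, not from the paper.

```python
import random

class MQOracle:
    """Membership-query oracle MQ_c: given x in {0,1}^n, returns c(x)."""
    def __init__(self, c):
        self.c = c            # the unknown concept: a function on n-bit tuples
        self.queries = 0      # how many queries have been asked so far

    def query(self, x):
        self.queries += 1
        return self.c(x)

def evaluate_error(h, c, n, samples=10000):
    """Estimate Pr_{x ~ uniform}[c(x) != h(x)] by random sampling."""
    bad = 0
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n))
        bad += (h(x) != c(x))
    return bad / samples
```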

  8. Previous Results (PAC Model)
     Theorem [Fortnow-Klivans 2006]. Let C be any concept class. Assume that C is PAC learnable with membership queries under the uniform distribution in polynomial time. Then BPEXP ⊈ C[poly] (where C[poly] denotes the polynomial-size concepts from C).
     Fortnow-Klivans: if we can learn C = TC0 efficiently, then we solve an important open problem in complexity theory!
     BPEXP = the exponential-time version of BPP: languages decidable by randomized algorithms running in exponential time.
     TC0 = "small" constant-depth circuits with Majority gates.
     Open problem: prove that BPEXP ⊈ P/poly, or at least that BPEXP ⊈ TC0.

  9. New Results (PAC Model)
     Theorem 1. Let C ⊆ P/poly be any concept class. Assume that C is PAC learnable with membership queries under the uniform distribution in polynomial time. Then either:
     • PSPACE ⊈ C[poly]; or
     • PSPACE ⊆ BPP.
     In other words, we obtain a stronger circuit lower bound (for PSPACE instead of BPEXP), unless something very unlikely is true (randomness is extremely powerful). This result implies the original theorem proven in [Fortnow-Klivans 2006].

  10. THM 1: if C is PAC-learnable in poly time, then either PSPACE ⊈ C[poly] or PSPACE ⊆ BPP.
     Original proof [Fortnow-Klivans]: uses a Karp-Lipton collapse for EXP + Toda's theorem + Valiant's theorem (the complexity of the Permanent) + an algorithmic construction based on [Impagliazzo-Wigderson 2001]. While we use the same high-level idea, we simplify the original proof and obtain stronger consequences.
     Proof sketch (of Theorem 1). Assume that C is efficiently learnable and that PSPACE ⊆ C[poly]. We need to prove that PSPACE is contained in BPP. It is enough to show that some PSPACE-complete language L is in BPP. Plan: L is in C[poly], and we can efficiently learn concepts in C[poly]; simulate the learning algorithm to obtain a BPP algorithm for L.

  11. THM 1: if C is PAC-learnable in poly time, then either PSPACE ⊈ C[poly] or PSPACE ⊆ BPP.
     Problem 1: the PAC learner provides a hypothesis that is correct on 99% of the inputs, while a BPP algorithm must be correct on every input (with high probability).
     Solution: there is a PSPACE-complete language L that is self-correctible. In other words, if we can compute L on most inputs, then we can compute it correctly on every input with high probability [Beaver-Feigenbaum 1990].

  12. THM 1: if C is PAC-learnable in poly time, then either PSPACE ⊈ C[poly] or PSPACE ⊆ BPP.
     Problem 2: the learning algorithm asks membership queries about the unknown concept; in our case the concept c_n: {0,1}^n -> {0,1} is L restricted to inputs of size n. How can we answer membership queries if this is exactly what we are trying to do (compute L)?
     Idea: the canonical PSPACE-complete language QBF is downward self-reducible. In other words, we can compute QBF(x) in polynomial time if we know how to compute QBF on smaller instances.
     Good, but to implement this approach we need a single PSPACE-complete language that is both self-correctible and downward self-reducible! Fortunately:
     Theorem [Trevisan-Vadhan 2007]. There exists a language L* such that: L* is PSPACE-complete; L* is self-correctible; L* is downward self-reducible.

  13. THM 1: if C is PAC-learnable in poly time, then either PSPACE ⊈ C[poly] or PSPACE ⊆ BPP.
     In summary:
     1) We have an efficient learning algorithm for C[poly], and by assumption L* is in C[poly]. We need to prove that L* is in BPP.
     2) If we can compute L* on instances of size less than n, then we can simulate the learning algorithm on the concept c_n = L* ∩ {0,1}^n: membership queries are answered using the downward self-reducibility of L*.
     3) The learning algorithm outputs (w.h.p.) a hypothesis h that is correct on most inputs (since it learns C[poly], and L* is computed by some concept in C[poly]).
     4) We use the self-correctibility of L* to obtain a procedure that computes L* on every input of size n with high probability.
     5) Thus we can compute L*(x), for x of size n, by unfolding this process over n stages. Each stage runs in polynomial time (learner, self-correction, downward self-reducibility), and the overall error is controlled with a union bound. PSPACE ⊆ BPP, and hence PSPACE = BPP!
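The control flow of this staged simulation can be sketched as follows. This is only our schematic framing: run_learner, downward_reduce, self_correct, and base_case are placeholders for the components guaranteed by the learning assumption and by [Trevisan-Vadhan 2007], not code from the paper.

```python
def bpp_algorithm_for_Lstar(x, run_learner, downward_reduce, self_correct, base_case):
    """Unfold the n-stage simulation from the summary above.

    run_learner(m, mq)     - assumed efficient PAC learner with MQ access
    downward_reduce(w, f)  - computes L*(w) given an evaluator f for smaller inputs
    self_correct(y, h)     - turns 'correct on most inputs' into 'correct on y w.h.p.'
    base_case(y)           - L* on constant-size inputs, hardcoded
    """
    n = len(x)
    evaluators = {0: base_case}
    for m in range(1, n + 1):
        smaller = evaluators[m - 1]
        # Downward self-reducibility: answer MQs on {0,1}^m via L* on smaller inputs.
        def mq(w, smaller=smaller):
            return downward_reduce(w, smaller)
        # The learner outputs h correct on most of {0,1}^m (w.h.p.).
        h = run_learner(m, mq)
        # Self-correction: 'correct on most inputs' -> 'correct everywhere w.h.p.'
        evaluators[m] = (lambda h: (lambda y: self_correct(y, h)))(h)
    return evaluators[n](x)
```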

  14. Theorem 1. Let C ⊆ P/poly be any concept class. Assume that C is PAC learnable with membership queries under the uniform distribution in polynomial time. Then either:
     • PSPACE ⊈ C[poly]; or
     • PSPACE ⊆ BPP.
     Natural question: why do we have this "or" in the conclusion? Observe that the second alternative does not depend on C, which is not very appealing.
     Application of (a relativized version of) Theorem 1: a new proof of the Karp-Lipton collapse for PSPACE.
     Conjecture 1. Let C ⊆ P/poly be any concept class. Assume that C is PAC learnable with membership queries under the uniform distribution in polynomial time. Then PSPACE ⊈ C[poly].
     We prove that if this extra condition can be removed, then we obtain an *unconditional* proof that BPP ≠ PSPACE.

  15. Review: exact learning from membership and equivalence queries.
     Definition. Let C be any class of boolean functions. An algorithm A exactly learns C if, for every c ∈ C, when given access to a membership query oracle MQ_c and an equivalence query oracle EQ_c, algorithm A outputs a final hypothesis h such that h(x) = c(x) for every x ∈ {0,1}^n. We measure the running time of A as a function T = T(n, size(c)).
     Membership query oracle MQ_c: given any x ∈ {0,1}^n, returns c(x).
     Equivalence query oracle EQ_c: given (the representation of) a function g: {0,1}^n -> {0,1}, outputs "yes" if g ≡ c, or an input w such that g(w) ≠ c(w) otherwise.
     *Important*: the learning algorithm is deterministic.
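For intuition, an equivalence-query oracle can be implemented by brute force when n is small. A toy sketch (make_eq_oracle is our name, not the paper's):

```python
from itertools import product

def make_eq_oracle(c, n):
    """Toy EQ_c oracle: checks all 2^n inputs, so only sensible for small n."""
    def eq(g):
        for x in product((0, 1), repeat=n):
            if g(x) != c(x):
                return ("no", x)   # counterexample w with g(w) != c(w)
        return ("yes", None)
    return eq
```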

  16. Previous Results (Exact Learning)
     Theorem [Fortnow-Klivans 2006]. Let C ⊆ P/poly be any concept class. If there is an efficient exact learning algorithm for C using membership and equivalence queries, then EXP^NP ⊈ C[poly]. Techniques: Karp-Lipton collapses, the Permanent, Toda's theorem, ideas from [Kabanets-Impagliazzo 2004], time hierarchy theorems.
     Theorem [Harkins-Hitchcock 2011]. Let C ⊆ P/poly be any concept class. Under the same assumption (efficient exact learnability), it follows that EXP ⊈ C[poly]. No need for an NP oracle! Techniques: betting games (a generalization of resource-bounded measure).

  17. New Results (Exact Learning)
     Theorem 2. Let C be any concept class. Suppose there is an exact learning algorithm for C that makes fewer than 2^n membership and equivalence queries and runs in time T = T(n, s(n)) when learning concepts of size s(n). Then DTIME(T^2) ⊈ C[s(n)]. (The proof uses a simple diagonalization argument!)
     Corollary. Let C be any concept class. If there is an efficient exact learning algorithm for C using membership and equivalence queries, then DTIME(t(n)) ⊈ C[poly] for any super-polynomial function t(n).
     Example: if we can exactly learn ACC circuits in time 2^{n^{o(1)}} · poly(m), where n is the number of inputs and m the number of gates, then EXP ⊈ ACC (compare with Williams' result: NEXP ⊈ ACC).
     Corollary. If SIZE(n) is exactly learnable using MQs and EQs in polynomial time, then P = BPP.

  18. How to obtain better lower bounds?
     Previous approach (our PAC result): run the learner and answer its queries according to some hard function.
     Fortnow-Klivans' proof for exact learning: a similar approach, but it needs an NP oracle to provide the answers to the EQ_c oracle.
     Our approach: simulate the learning algorithm for T steps. Do not care much about the answers provided to the oracle queries; instead, construct a function based on these answers.

  19. What do we have? An algorithm that learns any f ∈ C in T steps (using MQ_f and EQ_f). What do we want? To construct a hard function not in C.
     Win-win situation:
     - If our answers are *not* consistent with any f ∈ C, we have a hard function for free.
     - Otherwise, our answers during the simulation are consistent with some concept f ∈ C. The learner exactly learns f (in T steps); find a new input z not queried by the learner and negate f(z), obtaining a function not in C.

  20. THM 2: if there is an algorithm that learns every f ∈ C in time T (using MQ_f and EQ_f), then DTIME(poly(T)) ⊈ C.
     Proof sketch: suppose algorithm A learns every f ∈ C in time T. We construct an algorithm B, running in time poly(T), that computes a function g ∉ C.
     Algorithm B. Input: x ∈ {0,1}^n. (B initially ignores its input x.)
     1) Set S := ∅ and let g: S -> {0,1} be a partial function (S ⊆ {0,1}^n), built up during the simulation.
     2) Simulate the learning algorithm A for T steps.
        - If A invokes MQ(w) on some string w: if w ∈ S, answer g(w); otherwise set S := S ∪ {w}, define g(w) := 1, and answer 1.
        - If A invokes EQ(θ) for some function θ: find the first string w ∉ S; set S := S ∪ {w} and g(w) := 1 - θ(w); answer "no" with counterexample w.
     3) Finish the simulation. [Diagonalization] If the learner outputs a hypothesis h, find the first z ∉ S; set S := S ∪ {z} and g(z) := 1 - h(z).
     4) Output g(x) if x ∈ S; otherwise output 1.
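Here is a runnable Python toy version of Algorithm B, in our own framing: the learner is modeled as a function taking (n, mq, eq) and returning a final hypothesis (or None), and the explicit "T steps" cutoff is omitted. It returns the constructed hard function g rather than a single bit.

```python
from itertools import product

def algorithm_B(n, learner):
    g = {}                                   # partial function g on S = g.keys()

    def first_outside_S():
        for w in product((0, 1), repeat=n):  # lexicographic order over {0,1}^n
            if w not in g:
                return w
        raise RuntimeError("learner queried all 2^n points")

    def mq(w):
        if w not in g:
            g[w] = 1                         # fix an arbitrary value: g(w) := 1
        return g[w]

    def eq(theta):
        w = first_outside_S()
        g[w] = 1 - theta(w)                  # make theta wrong at w
        return w                             # answer "no" with counterexample w

    h = learner(n, mq, eq)                   # simulate the learner
    if h is not None:                        # diagonalize against the hypothesis
        z = first_outside_S()
        g[z] = 1 - h(z)

    return lambda x: g.get(x, 1)             # g(x) if x in S, else 1
```

By construction, any f ∈ C that agrees with g on the queried set S would be exactly learned by the simulated run, and the final diagonalization step then forces a disagreement; so no f ∈ C equals g.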

  21. (The pseudocode of Algorithm B is as on the previous slide.)
     Remember the win-win situation: either there is no f ∈ C consistent with our answers (encoded by g), and we are done; or g agrees over S with some f ∈ C. Each query increases |S| by at most 1 and the learner asks fewer than 2^n queries, so the new point z used in the diagonalization step always exists. The learner fails to exactly learn g, so g is not in C. Done!
     Fact 1. Algorithm B runs in time poly(T).
     Fact 2. The function g does not depend on the actual input x (it does depend on n = |x|).
     Fact 3. No function f ∈ C is consistent with g over S ⊆ {0,1}^n; therefore g ∉ C, and Algorithm B computes a hard function. QED.

  22. Review: learning from correlational statistical queries (CSQ learning) [Bshouty-Feldman 2002].
     For convenience, we consider boolean functions mapping {-1,1}^n -> {-1,1}. Given functions f, g: {-1,1}^n -> {-1,1}, we let <f, g> := (1/2^n) Σ_x f(x) g(x).
     Definition [CSTAT oracle]. Let C be a concept class. For any c ∈ C, a correlational statistical oracle for c with tolerance τ > 0, denoted CSTAT(c, τ), takes as input a bounded function ψ: {-1,1}^n -> [-1,1] and returns some v ∈ [-1,1] such that |v - <c, ψ>| < τ.
     Definition [CSQ learning algorithm]. A (deterministic) algorithm A learns a class C of boolean functions in time T = T(n, 1/ε, 1/τ) and query complexity Q = Q(n, 1/ε, 1/τ) if, for every concept c ∈ C and all ε > τ > 0, algorithm A makes at most Q queries to CSTAT(c, τ) and uses at most T steps to return a hypothesis h such that Pr[h(x) ≠ c(x)] < ε (under the uniform distribution).
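A toy implementation of the CSTAT oracle from the definition above (all names ours; it computes <c, ψ> exactly by brute force over {-1,1}^n and then perturbs the value within the tolerance τ):

```python
import random
from itertools import product

def make_cstat(c, n, tau):
    """Toy CSTAT(c, tau): returns some v in [-1,1] with |v - <c,psi>| < tau."""
    points = list(product((-1, 1), repeat=n))
    def cstat(psi):
        inner = sum(c(x) * psi(x) for x in points) / len(points)   # <c, psi>
        v = inner + random.uniform(-0.99 * tau, 0.99 * tau)        # stay within tolerance
        return max(-1.0, min(1.0, v))                              # clamp to [-1, 1]
    return cstat
```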

  23. Some remarks about the CSQ model
     • CSQ learning is a restriction of the well-known statistical query (SQ) learning model: any statistical query can be simulated by two target-independent queries and two correlational queries.
     • A weak average-case hardness result for an explicit function (parity) can be obtained by a simple argument based on the SQ-dimension [BFJKMR 1994]. More specifically, one can show that if a class C is (weakly) learnable in the CSQ model in polynomial time under the uniform distribution, then it has polynomial SQ-dimension, and it follows that there exists a parity function PAR on log n inputs such that, for any concept c ∈ C: Pr[c(x) ≠ PAR(x)] > 1/n^{log n}.
     • Our result shows how to find an explicit function that is (1/poly)-far from C.

  24. New Results (Correlational Statistical Learning)
     Theorem 3. Let 1/poly(n) < τ < ε < 1/2. Suppose there is an algorithm A that runs in time T = T(n, 1/ε, 1/τ) and learns any c ∈ C under the uniform distribution in the CSQ model to accuracy 1 - ε, using at most Q = Q(n, 1/ε, 1/τ) queries to CSTAT(c, τ). Then there exists a function f ∈ DTIME(poly(Q, T, 1/τ)) such that Pr[f(x) ≠ c(x)] = Ω(τ) for every concept c ∈ C.
     Remark: the theorem holds for any ε < 1/2, so even a weak learner yields a hard-on-average function.

  25. THM 3: if there is an algorithm that CSQ-learns any c ∈ C in polynomial time, then we can construct an explicit f that is Ω(τ)-far from C.
     Proof sketch. Recall that all functions map {-1,1}^n -> {-1,1}. We construct the hard function f in two stages.
     Part 1) Use the learning algorithm to obtain a family G = {g_1, g_2, …, g_{Q+1}} of Q+1 functions such that for every c ∈ C there exists some g ∈ G with |<c, g>| ≥ τ ("every function in C is correlated with some function in G").
     Part 2) Construct, in polynomial time, a function f such that for every g ∈ G we have |<f, g>| ≤ τ/4 ("f is not correlated with any function in G"). This is a standard construction based on combinatorial discrepancy.
     Combining (1) and (2): f is Ω(τ)-far from every c ∈ C!

  26. THM 3: if there is an algorithm that CSQ-learns any c ∈ C in polynomial time, then we can construct an explicit f that is Ω(τ)-far from C.
     Proof of Part 1: use the learning algorithm to obtain a family G = {g_1, …, g_{Q+1}} of Q+1 functions such that for every c ∈ C there exists g ∈ G with |<c, g>| ≥ τ ("every function in C is correlated with some function in G").
     We follow [Feldman 2008]. Recall CSTAT(c, τ): given a query function ψ from the learner, it returns some v ∈ [-1,1] such that |v - <c, ψ>| < τ.
     Key observation: if we run the learner and return 0 on a call to CSTAT(·, τ), the answer is wrong if and only if |<c, ψ>| ≥ τ.
     Simulate the learner for at most T steps or Q queries, always answering 0 to each query, and let G be the set of query functions together with the final hypothesis h (in case it exists). Win-win situation: for any c, either some answer is wrong (done: the corresponding ψ satisfies |<c, ψ>| ≥ τ), or all answers are valid. In the latter case, h and c are ε-close by the learning assumption, and therefore h and c are correlated!
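A minimal sketch of this simulation, using the same toy interfaces as before (our framing, not the paper's code): answer 0 to every CSTAT query and collect the query functions, plus the final hypothesis, into G.

```python
def build_G(n, csq_learner, max_queries):
    """Part 1: run the learner, answer 0 everywhere, record its queries."""
    G = []

    def cstat(psi):
        if len(G) >= max_queries:
            raise RuntimeError("query budget Q exceeded")
        G.append(psi)        # remember the query function psi
        return 0.0           # always answer 0

    h = csq_learner(n, cstat)
    if h is not None:
        G.append(h)          # include the (+/-1-valued) final hypothesis
    return G                 # at most Q + 1 functions
```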

  27. THM 3: if there is an algorithm that CSQ-learns any c ∈ C in polynomial time, then we can construct an explicit f that is Ω(τ)-far from C.
     Sketch of Part 2: construct, in polynomial time, a function f such that for every g ∈ G we have |<f, g>| ≤ τ/4 ("f is not correlated with any function in G").
     Technique [Chattopadhyay-Klivans-Kothari 2012]: a connection between discrepancy minimization and average-case hardness.
     Definition. Let G be a class of functions mapping a finite set S to {-1,1}, and let χ: S -> {-1,1} be a coloring of S. The discrepancy of χ with respect to g ∈ G is disc_χ(g) = |Σ_{w: g(w)=1} χ(w)|. The discrepancy of χ with respect to the class G is disc[χ, G] = max_{g ∈ G} disc_χ(g).

  28. We want a *deterministic* f such that, for every g ∈ G, |<f, g>| ≤ τ/4.
     Key observation: if a coloring χ minimizes the discrepancy of both g and -g (a set and its complement), then |<χ, g>_S| must be small. Formally: disc_χ(g) < τ|S| and disc_χ(-g) < τ|S| together imply |<χ, g>_S| < 2τ.
     A random coloring has low discrepancy! So it is enough to obtain a low-discrepancy coloring f = χ of S = {-1,1}^n with respect to G' = G ∪ -G.
     Theorem (deterministic construction of a low-discrepancy coloring [Sivakumar 2002]). There exists a deterministic algorithm, running in time poly(|G'|, |S|), that produces a coloring χ such that disc[χ, G'] = O(√(|S| log |G'|)).
     One last problem: |S| = 2^n is exponential. Solution: break {-1,1}^n into blocks of size O((1/τ^2) log |G'|); given an input x to the algorithm computing the hard function, find its corresponding block B ⊆ {-1,1}^n and run the deterministic coloring on B only.
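A schematic sketch of this block decomposition, in our framing. Here color_block is a placeholder for the deterministic low-discrepancy coloring of [Sivakumar 2002] applied to the restriction of G' to a single block, and x_index is the position of x in the lexicographic order on {-1,1}^n; the point is that f(x) can be evaluated by coloring only the small block containing x.

```python
import math

def hard_function_value(x_index, n, G_prime, tau, color_block):
    """Evaluate f at the x_index-th point of {-1,1}^n via its block."""
    # Block size b = O((1/tau^2) * log |G'|), as on the slide.
    b = math.ceil((1.0 / tau**2) * math.log(max(2, len(G_prime))))
    block_id = x_index // b
    block = range(block_id * b, min((block_id + 1) * b, 2**n))
    # Deterministic low-discrepancy coloring of this block w.r.t. G' (placeholder).
    chi = color_block(block, G_prime)        # list of +/-1 values, one per point
    return chi[x_index - block_id * b]       # f(x) := chi(x)
```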

  29. Summary
     • We obtain improved circuit lower bounds from the existence of learning algorithms in many well-studied models (using different approaches).
     • In particular, if (exact) learning is easy, then BPP = P.
     • We give further evidence that developing non-trivial learning algorithms for more expressive concept classes will require new techniques.

  30. Some directions for future work
     1) Is it possible to obtain new circuit lower bounds by constructing non-trivial learning algorithms?
     2) Can we obtain improved circuit lower bounds from randomized learning algorithms? For instance, does efficient PAC learning imply BPSUBEXP ⊈ C?
     3) Can we show that circuit lower bounds imply non-trivial {learning, satisfiability} algorithms? We now know that non-trivial {derandomization, learning, satisfiability} algorithms imply circuit lower bounds, and also that circuit lower bounds imply non-trivial derandomization. Can we give evidence that Williams' approach is, in some sense, necessary for obtaining new circuit lower bounds?
