
Harmonic Analysis in Learning Theory

Jeff Jackson, Duquesne University


  1. Harmonic Analysis in Learning Theory Jeff Jackson Duquesne University

  2. Themes • Harmonic analysis is central to learning theoretic results in wide variety of models • Results generally strongest known for learning with respect to uniform distribution • Work on learning problems has led to some new harmonic results • Spectral properties of Boolean function classes • Algorithms for approximating Boolean functions

  3. Uniform Learning Model • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Example oracle EX(f) returns uniform random examples <x, f(x)> • Learning algorithm A is given accuracy ε > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  4. Circuit Classes • Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC) • DNF: depth-2 circuit with OR at root • (Figure: alternating AND/OR circuit with d levels over variables v1 … vn; negations allowed)

  5. Decision Trees (Figure: a decision tree with internal nodes labeled v3, v2, v1, v4 and 0/1 leaves)

  6. Decision Trees (Figure: evaluating the tree on x = 11001; the root tests v3, and x3 = 0, so take the 0-branch)

  7. Decision Trees (Figure: continuing on x = 11001; at the next node, x1 = 1, so take the 1-branch)

  8. Decision Trees (Figure: the path for x = 11001 ends at a leaf labeled 1, so f(x) = 1)
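
The traversal on slides 5-8 is easy to make concrete. Below is a minimal Python sketch of decision-tree evaluation; the exact tree in the figure is not fully recoverable from the transcript, so the tree built here is only an illustrative stand-in.

```python
# Minimal decision-tree evaluation sketch.  A tree is either ("leaf", value)
# or ("node", i, low, high): test bit x_i, go to `low` on 0 and `high` on 1.
def evaluate(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, i, low, high = tree
    return evaluate(high if x[i] == 1 else low, x)

# Hypothetical tree in the spirit of slides 5-8: the root tests x3 (index 2),
# and the 0-branch leads to a node testing x1 (index 0).
tree = ("node", 2,
        ("node", 0, ("leaf", 0), ("leaf", 1)),   # taken when x3 = 0
        ("node", 3, ("leaf", 1), ("leaf", 0)))   # taken when x3 = 1

x = (1, 1, 0, 0, 1)        # x = 11001: x3 = 0, then x1 = 1
print(evaluate(tree, x))   # prints 1, matching f(x) = 1 on slide 8
```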

  9. Function Size • Each function representation has a natural size measure: • CDC, DNF: # of gates • DT: # of leaves • Size s_F(f) of f with respect to class F is size of smallest representation of f within F • For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)

  10. Efficient Uniform Learning Model • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Example oracle EX(f) returns uniform random examples <x, f(x)> • Learning algorithm A is given accuracy ε > 0 and must run in time poly(n, s_F, 1/ε) • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  11. Harmonic-Based Uniform Learning • [LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniform learnable • [BT]: monotone Boolean functions are uniform learnable in time roughly 2^(√n · log n) • Monotone: For all x, i: f(x|xi=0) ≤ f(x|xi=1) • Also exponential in 1/ε (so assumes ε constant) • But independent of any size measure

  12. Notation • Assume f: {0,1}^n → {-1,1} • For all a in {0,1}^n, χ_a(x) ≡ (-1)^(a · x) • For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) ≡ E_{x~U}[f(x) · χ_a(x)] • Sometimes write, e.g., f̂({1}) for f̂(10…0)
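
As a concrete reading of this notation, here is a short Python sketch that computes χ_a(x) and estimates f̂(a) = E_x[f(x)·χ_a(x)] from uniform random examples. The target function, n, and sample size are illustrative assumptions, not anything from the slides.

```python
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for 0/1 vectors a, x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def estimate_coefficient(f, a, n, samples=20000):
    """Empirical estimate of f_hat(a) = E_x[f(x) chi_a(x)] under uniform x."""
    total = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        total += f(x) * chi(a, x)
    return total / samples

# Illustrative target: majority of the first 3 bits, output encoded in {-1, +1}
# (+1 when a majority of those bits are 1).
n = 5
f = lambda x: 1 if x[0] + x[1] + x[2] >= 2 else -1

# f_hat({1}) in the slide's set notation is the coefficient at a = 10...0.
a = [1, 0, 0, 0, 0]
print(estimate_coefficient(f, a, n))   # close to -0.5 with this encoding
```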

  13. Fourier Properties of Classes • [LMN]: if f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) } ( |a| ≡ # of 1’s in a ), then Σ_{a∉S} f̂²(a) < ε • [BT]: if f is a monotone Boolean function and S = { a : |a| < √n / ε }, then Σ_{a∉S} f̂²(a) < ε

  14. Spectral Properties

  15. Proof Techniques • [LMN]: Håstad’s Switching Lemma + harmonic analysis • [BT]: Based on [KKL] • Define AS(f) ≡ n · Pr_{x,i}[f(x|xi=0) ≠ f(x|xi=1)] • If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε • For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n • Note: This is tight for MAJ
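
The average sensitivity AS(f) defined above can be estimated directly by sampling a random input and a random coordinate. A small sketch follows, with an illustrative majority target to echo the remark that the √n bound is tight for MAJ; the sample size is an arbitrary demo value.

```python
import random

def average_sensitivity(f, n, samples=20000):
    """Estimate AS(f) = n * Pr_{x,i}[ f(x with x_i=0) != f(x with x_i=1) ]."""
    flips = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        i = random.randrange(n)
        x0 = list(x); x0[i] = 0
        x1 = list(x); x1[i] = 1
        if f(x0) != f(x1):
            flips += 1
    return n * flips / samples

# Illustrative monotone target: majority of all n bits (n odd).
n = 15
maj = lambda x: 1 if sum(x) > n // 2 else -1

print(average_sensitivity(maj, n), n ** 0.5)   # AS(MAJ) is on the order of sqrt(n)
```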

  16. Function Approximation • For all Boolean f, f = Σ_a f̂(a) · χ_a and (Parseval) Σ_a f̂²(a) = 1 • For S ⊆ {0,1}^n, define f_S ≡ Σ_{a∈S} f̂(a) · χ_a • [LMN]: Pr_x[f(x) ≠ sign(f_S(x))] ≤ E_x[(f - f_S)²] = Σ_{a∉S} f̂²(a)

  17. “The” Fourier Learning Algorithm • Given: ε (and perhaps s, d) • Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε • Draw sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a ∈ S • Chernoff bounds: ~ n^k/ε sample size sufficient • Output h ≡ sign(Σ_{a∈S} f̂(a) χ_a) • Run time ~ n^(2k)/ε
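
A compact sketch of this low-degree algorithm for small n, estimating every coefficient of weight less than k by brute-force enumeration. The target function, k, and sample sizes below are illustrative choices rather than the Chernoff-tight parameters from the slide.

```python
import itertools, random

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def low_degree_learn(examples, n, k):
    """Slide 17's algorithm: estimate f_hat(a) for all |a| < k from labeled
    examples, and return h = sign of the low-degree Fourier approximation."""
    coeffs = {}
    m = len(examples)
    for ones in range(k):                                # |a| = 0, 1, ..., k-1
        for support in itertools.combinations(range(n), ones):
            a = tuple(1 if i in support else 0 for i in range(n))
            coeffs[a] = sum(y * chi(a, x) for x, y in examples) / m
    def h(x):
        value = sum(c * chi(a, x) for a, c in coeffs.items())
        return 1 if value >= 0 else -1
    return h

# Illustrative target on n = 8 bits: majority of the first 5 bits.
n, k = 8, 3
f = lambda x: 1 if sum(x[:5]) >= 3 else -1
examples = []
for _ in range(3000):
    x = tuple(random.randint(0, 1) for _ in range(n))
    examples.append((x, f(x)))

h = low_degree_learn(examples, n, k)
test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(2000)]
print(sum(f(x) != h(x) for x in test) / len(test))   # empirical error; small here
```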

  18. Halfspaces • [KOS]: Halfspaces are efficiently uniform learnable (given ε is constant) • Halfspace: ∃ w ∈ R^(n+1) s.t. f(x) = sign(w · (x∘1)) • If S = {a : |a| < (21/ε)² } then Σ_{a∉S} f̂²(a) < ε • Apply LMN algorithm • Similar result applies for arbitrary function applied to constant number of halfspaces • Intersection of halfspaces is a key learning problem

  19. Halfspace Techniques • [O] (cf. [BKS], [BJTa]): • Noise sensitivity of f at γ is probability that corrupting each bit of x with probability γ changes f(x) • NS_γ(f) = ½(1 - Σ_a (1-2γ)^|a| f̂²(a)) • [KOS]: • If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3 NS_γ(f) • If f is a halfspace then NS_ε(f) < 9√ε
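
Noise sensitivity is also easy to estimate by sampling: pick a uniform x, flip each bit independently with probability γ, and check whether f changes. The halfspace below (random small positive weights, threshold at half the total weight) and γ are illustrative choices.

```python
import random

def noise_sensitivity(f, n, gamma, samples=20000):
    """Estimate NS_gamma(f): flip each bit of a uniform x independently with
    probability gamma and check whether the value of f changes."""
    changed = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        y = [xi ^ (1 if random.random() < gamma else 0) for xi in x]
        if f(x) != f(y):
            changed += 1
    return changed / samples

# Illustrative halfspace: weighted vote of the bits against a threshold.
n = 12
w = [random.randint(1, 5) for _ in range(n)]
theta = sum(w) / 2
halfspace = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1

gamma = 0.05
print(noise_sensitivity(halfspace, n, gamma), 9 * gamma ** 0.5)  # well below 9*sqrt(gamma)
```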

  20. Monotone DT • [OS]: Monotone functions are efficiently learnable given: • ε is constant • s_DT(f) is used as the size measure • Techniques: • Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f)) • [BT]: If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε • Friedgut: ∃ T, |T| ≤ 2^(AS(f)/ε), s.t. Σ_{A⊄T} f̂²(A) < ε

  21. Weak Approximators • KKL also show that if f is monotone, there is an i such that -f̂({i}) ≥ log²n / n • Therefore Pr[f(x) = -χ_{i}(x)] ≥ ½ + log²n / 2n • In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f • If A outputs a weak approximator for every f in F, then F is weakly learnable

  22. Uniform Learning Model • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Example oracle EX(f) returns uniform random examples <x, f(x)> • Learning algorithm A is given accuracy ε > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  23. Weak Uniform Learning Model • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Example oracle EX(f) returns uniform random examples <x, f(x)> • Learning algorithm A • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n,s)

  24. Efficient Weak Learning Algorithm for Monotone Boolean Functions • Draw set of ~ n² examples <x, f(x)> • For i = 1 to n: estimate f̂({i}) • Output h ≡ -χ_{i*} for the i* maximizing -f̂({i}) (see the sketch below)
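
A sketch of this weak learner, under the slide-12 encoding (so that the monotonicity convention of slide 11 makes -f̂({i}) the relevant advantage, as on slide 21). The majority target and sample sizes are illustrative.

```python
import random

def weak_learn_monotone(examples, n):
    """Slide 24's weak learner: estimate each first-order coefficient
    f_hat({i}) from the sample, then output h = -chi_{i*} for the i*
    maximizing -f_hat({i})."""
    m = len(examples)
    coeffs = [sum(y * (-1 if x[i] else 1) for x, y in examples) / m
              for i in range(n)]                        # estimates of f_hat({i})
    i_star = max(range(n), key=lambda i: -coeffs[i])    # maximize -f_hat({i})
    return lambda x: -(-1 if x[i_star] else 1)          # h(x) = -chi_{i_star}(x)

# Illustrative monotone target: majority of the bits, encoded into {-1, +1}.
n = 9
f = lambda x: 1 if sum(x) > n // 2 else -1

examples = []
for _ in range(10 * n * n):                             # ~n^2 examples
    x = tuple(random.randint(0, 1) for _ in range(n))
    examples.append((x, f(x)))

h = weak_learn_monotone(examples, n)
test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5000)]
print(sum(f(x) == h(x) for x in test) / len(test))      # noticeably above 1/2
```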

  25. Weak Approximation for MAJ of Constant-Depth Circuits • Note that adding a single MAJ to a CDC destroys the LMN spectral property • [JKS]: MAJ of CDC’s is quasi-efficiently quasi-weak uniform learnable • If f is a MAJ of CDC’s of depth d, and if the number of gates in f is s, then there is a set A ⊆ {1..n} such that • |A| < log^d s ≡ k • Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)

  26. Weak Learning Algorithm • Compute k = log^d s • Draw ~ s·n^k examples <x, f(x)> • Repeat for |A| < k: estimate f̂(A) • Until an A is found s.t. f̂(A) > 1/(2s·n^k) • Output h ≡ χ_A • Run time ~ n^polylog(s) (see the sketch below)
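
A sketch of this search loop: enumerate the parities χ_A with |A| < k, estimate each coefficient from the sample, and stop at the first one that clears the threshold. The target here (a parity corrupted on a small region) is an illustrative stand-in for a MAJ of CDCs, and k and the threshold are arbitrary demo values.

```python
import itertools, random

def chi(A, x):
    """chi_A(x) = (-1)^(sum of x_i over i in A)."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def find_weak_parity(examples, n, k, threshold):
    """Slide 26: search all parities chi_A with |A| < k, and return the first
    one whose estimated coefficient f_hat(A) exceeds the threshold."""
    m = len(examples)
    for ones in range(k):
        for A in itertools.combinations(range(n), ones):
            est = sum(y * chi(A, x) for x, y in examples) / m
            if est > threshold:
                return A
    return None

# Illustrative target: agrees with the parity chi_{1,2} except when
# x4 = x5 = x6 = 1, so that parity is a weak (in fact fairly good) approximator.
n, k = 8, 3
f = lambda x: -chi((0, 1), x) if x[3] == x[4] == x[5] == 1 else chi((0, 1), x)

examples = []
for _ in range(4000):
    x = tuple(random.randint(0, 1) for _ in range(n))
    examples.append((x, f(x)))

print(find_weak_parity(examples, n, k, threshold=0.1))  # (0, 1): coefficient 1 - 2/8 = 0.75
```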

  27. Weak Approximator Proof Techniques • “Discriminator Lemma” (HMPST) • Implies one of the CDC’s is a weak approximator to f • LMN spectral characterization of CDC • Harmonic analysis • Beigel result used to extend weak learning to CDC with polylog MAJ gates

  28. Boosting • In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [F], …) • Need to learn weakly with respect to near-uniform distributions • For near-uniform distribution D, find weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s) • Final h typically MAJ of weak approximators
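
To illustrate the boosting loop itself (reweight the distribution toward misclassified examples, then combine the weak hypotheses by weighted majority), here is a generic AdaBoost-style sketch over a small explicit domain. This is not the specific boosters [S] or [F] cited above; the toy weak learner and target are assumptions made only for the demo.

```python
import itertools, math

def adaboost(f, points, weak_learner, rounds):
    """Generic AdaBoost-style loop (illustrative only): reweight toward the
    points the current weak hypothesis gets wrong, combine by weighted MAJ."""
    D = {x: 1.0 / len(points) for x in points}   # start at the uniform distribution
    hyps, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(f, D)
        err = sum(D[x] for x in points if h(x) != f(x))
        err = min(max(err, 1e-9), 0.5 - 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        hyps.append(h); alphas.append(alpha)
        for x in points:                          # up-weight the mistakes
            D[x] *= math.exp(-alpha * f(x) * h(x))
        Z = sum(D.values())
        for x in points:                          # renormalize
            D[x] /= Z
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) >= 0 else -1

def best_literal(f, D):
    """Toy weak learner: the single-variable literal with least D-weighted error."""
    n = len(next(iter(D)))
    candidates = [(lambda x, i=i, s=s: s * (1 if x[i] else -1))
                  for i in range(n) for s in (1, -1)]
    return min(candidates, key=lambda h: sum(D[x] for x in D if h(x) != f(x)))

# Toy target on all of {0,1}^6: majority of the first three bits.
points = list(itertools.product((0, 1), repeat=6))
f = lambda x: 1 if x[0] + x[1] + x[2] >= 2 else -1

H = adaboost(f, points, best_literal, rounds=75)
print(sum(H(x) != f(x) for x in points) / len(points))   # prints 0.0
```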

  29. Strong Learning for MAJ of Constant-Depth Circuits • [JKS]: MAJ of CDC is quasi-efficiently uniform learnable • Show that for near-uniform distributions, some parity function is a weak approximator • Beigel result again extends to CDC with poly-log MAJ gates • [KP] + boosting: there are distributions for which no parity is a weak approximator

  30. Uniform Learning from a Membership Oracle • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Membership oracle MEM(f): the learning algorithm A sends x and receives f(x) • A is given accuracy ε > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  31. Uniform Membership Learning of Decision Trees • [KM] • L_1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f) • If S = {a : |f̂(a)| ≥ ε/L_1(f)} then Σ_{a∉S} f̂²(a) < ε • [GL]: Algorithm (membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~ n/θ^6 • So can efficiently uniform membership learn DT • Output h same form as LMN: h ≡ sign(Σ_{a∈S} f̂(a) χ_a)
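
A heavily simplified sketch of a KM/GL-style search with a membership oracle: grow prefixes of a, estimate the Fourier weight sitting under each prefix with the standard two-query estimator, and prune light prefixes. The "oracle" here is just a Python function, and n, θ, the target, and the sample sizes are illustrative, not the parameters from [KM] or [GL].

```python
import random

def chi_bits(a, x):
    """chi_a(x) = (-1)^(a . x) for equal-length 0/1 tuples."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def prefix_weight(mem, n, alpha, samples=2000):
    """Estimate W(alpha) = sum over suffixes b of f_hat(alpha b)^2, using an
    unbiased estimator that needs only membership queries."""
    k = len(alpha)
    total = 0.0
    for _ in range(samples):
        y = tuple(random.randint(0, 1) for _ in range(n - k))
        x1 = tuple(random.randint(0, 1) for _ in range(k))
        x2 = tuple(random.randint(0, 1) for _ in range(k))
        total += (mem(x1 + y) * chi_bits(alpha, x1) *
                  mem(x2 + y) * chi_bits(alpha, x2))
    return total / samples

def km_heavy_coefficients(mem, n, theta):
    """Simplified KM/GL search: grow prefixes one bit at a time, pruning any
    prefix whose Fourier weight falls below ~theta^2.  Surviving length-n
    prefixes are the candidate heavy coefficients."""
    live = [()]
    for _ in range(n):
        nxt = []
        for alpha in live:
            for bit in (0, 1):
                beta = alpha + (bit,)
                if prefix_weight(mem, n, beta) >= theta * theta / 2:
                    nxt.append(beta)
        live = nxt
    return live

# Illustrative target with four heavy coefficients: the majority of the first
# three bits has coefficients of magnitude 1/2 on {1}, {2}, {3}, and {1,2,3}.
n = 6
f = lambda x: 1 if x[0] + x[1] + x[2] >= 2 else -1
for a in km_heavy_coefficients(f, n, theta=0.4):
    print(a)   # the three singletons over the first three bits and (1, 1, 1, 0, 0, 0)
```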

  32. Uniform Membership Learning of DNF • [J] • ∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF) • Modified [GL] can efficiently locate such a χ_a given an oracle for near-uniform D • Boosters can provide such an oracle when uniform learning • Boosting provides strong learning • [BJTb] (see also [KS]): • Modified Levin algorithm finds χ_a in time ~ n·s²

  33. Uniform Learning from a Classification Noise Oracle • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Classification noise oracle EX_η(f): for uniform random x, returns <x, f(x)> with probability 1-η and <x, ¬f(x)> with probability η • Learning algorithm A is given accuracy ε > 0 and error rate η > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  34. Uniform Learning from a Statistical Query Oracle • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Statistical query oracle SQ(f): the learning algorithm A sends a query ( q(·,·), τ ) and receives a value within τ of E_U[q(x, f(x))] • A is given accuracy ε > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  35. SQ and Classification Noise Learning • [K] • If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η)) • Empirically, almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ poly in other parameters) • Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}

  36. Uniform SQ Hardness for PAR • [BFJKMR] • Harmonic analysis shows that for any q, χ_a: E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a∘1) • Thus adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a∘1)| < τ • Parseval: |q̂(b∘1)| < τ for all but 1/τ² Fourier coefficients • So ‘bad’ query eliminates only poly coefficients • Even PAR_(log n) not efficiently SQ learnable
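
The first identity is easy to check numerically for small n, under the natural encoding in which the label bit b represents the value f(x) = (-1)^b. The particular random query q and parity a below are arbitrary illustrative choices.

```python
import itertools, random

def chi(a, z):
    return -1 if sum(ai & zi for ai, zi in zip(a, z)) % 2 else 1

n = 4
random.seed(0)
# Arbitrary query q on {0,1}^(n+1); the last bit b encodes the label (-1)^b.
q = {z: random.uniform(-1, 1) for z in itertools.product((0, 1), repeat=n + 1)}
a = (1, 0, 1, 0)                       # the hidden parity chi_a

cube_n = list(itertools.product((0, 1), repeat=n))
cube_n1 = list(itertools.product((0, 1), repeat=n + 1))

# Left side: E_x[ q(x, chi_a(x)) ], with the label chi_a(x) encoded as a bit.
lhs = sum(q[x + ((0,) if chi(a, x) == 1 else (1,))] for x in cube_n) / len(cube_n)

# Right side: q_hat(0^(n+1)) + q_hat(a o 1), Fourier coefficients over {0,1}^(n+1).
def q_hat(c):
    return sum(q[z] * chi(c, z) for z in cube_n1) / len(cube_n1)

rhs = q_hat((0,) * (n + 1)) + q_hat(a + (1,))
print(lhs, rhs)   # the two values agree
```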

  37. Uniform Learning from an Attribute Noise Oracle • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} • Attribute noise oracle EX_{D_N}(f): for uniform random x, returns <x⊕r, f(x)> with r ~ D_N • Learning algorithm A is given accuracy ε > 0 and the noise model D_N • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  38. Uniform Learning with Independent Attribute Noise • [BJTa]: • LMN algorithm produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)] • Example application • Assume noise process D_N is a product distribution: D_N(r) = ∏_i (p_i·r_i + (1-p_i)·(1-r_i)) • Assume p_i < 1/polylog n, 1/ε at most quasi-poly(n) (mild restrictions) • Then modified LMN uniform learns attribute-noisy AC0 in quasi-poly time

  39. Agnostic Learning Model • Target function f : {0,1}^n → {0,1} is an arbitrary Boolean function • Example oracle EX(f) returns uniform random examples <x, f(x)> • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] is minimized

  40. Near-Agnostic Learning via LMN • [KKM]: • Let f be an arbitrary Boolean function • Fix any set S ⊆ {0,1}^n and fix ε • Let g be any function s.t. • Σ_{a∉S} ĝ²(a) < ε and • Pr[f ≠ g] is minimized (call this η) • Then for h learned by LMN by estimating coefficients of f over S: • Pr[f ≠ h] < 4η + ε

  41. Average Case Uniform Learning Model • Boolean function class F (e.g., DNF) • Target function f : {0,1}^n → {0,1} drawn at random according to a distribution D over F • Example oracle EX(f) returns uniform random examples <x, f(x)> • Learning algorithm A is given accuracy ε > 0 • Output: hypothesis h : {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε

  42. Average Case Learning of DT • [JSa]: • D : uniform over complete, non-redundant log-depth DT’s • DT efficiently uniform learnable on average • Output is a DT (proper learning)

  43. Average Case Learning of DT • Technique • [KM]: All Fourier coefficients of a DT with min depth d are rational with denominator 2^d • In an average-case tree, the coefficient f̂({i}) for at least one variable v_i has odd numerator • So log(denominator) is the min depth of the tree • Try all variables at root and find depth of child trees, choosing the root with shallowest children • Recurse on child trees to choose their roots
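
The [KM] fact driving this technique is easy to check exactly on a small tree: each first-order coefficient comes out as a rational whose denominator is a power of 2 bounded by 2^depth. The tree below is an illustrative example; this sketch only verifies the denominator property, not the full root-finding recursion.

```python
import itertools
from fractions import Fraction

# Tree format: ("leaf", v) with v in {-1, +1}, or ("node", i, low, high)
# testing bit x_i and branching to `low` on 0 and `high` on 1.
def evaluate(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, i, low, high = tree
    return evaluate(high if x[i] == 1 else low, x)

# Illustrative depth-3 tree on 4 variables.
tree = ("node", 0,
        ("node", 1, ("leaf", -1), ("leaf", 1)),
        ("node", 2,
         ("node", 3, ("leaf", 1), ("leaf", -1)),
         ("leaf", 1)))

n = 4
cube = list(itertools.product((0, 1), repeat=n))

def first_order_coefficient(i):
    """Exact f_hat({i}) = E_x[f(x) * (-1)^(x_i)] as a rational number."""
    total = sum(Fraction(evaluate(tree, x) * (1 - 2 * x[i])) for x in cube)
    return total / len(cube)

for i in range(n):
    c = first_order_coefficient(i)
    print(i, c, c.denominator)   # denominators are powers of 2, at most 2^depth = 8
```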

  44. Average Case Learning of DNF • [JSb]: • D : s terms, each term uniform from terms of length log s • Monotone DNF with < n² terms and DNF with < n^1.5 terms properly and efficiently uniform learnable on average • Harmonic property • In average-case DNF, sign of f̂({i,j}) (usually) indicates whether v_i and v_j are in a common term or not

  45. Summary • Most uniform-learning results depend on harmonic analysis • Learning theory provides motivation for new harmonic observations • Even very “weak” harmonic results can be useful in learning-theory algorithms

  46. Some Open Problems • Efficient uniform learning of monotone DNF • Best to date for small s_DNF is [S], time ~ n·s^(log s) (based on [BT], [M], [LMN]) • Non-uniform learning • Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN] • Key issue in real-world applicability

  47. Open Problems (cont’d) • Weaker dependence on ε • Several algorithms fully exponential (or worse) in 1/ε • Additional proper learning results • Allows for interpretation of learned hypothesis
