
Fourier Analysis and Boolean Function Learning

Jeff Jackson

Duquesne University

www.mathcs.duq.edu/~jackson


Themes

  • Fourier analysis is central to learning-theoretic results in a wide variety of models

    • These results are generally the strongest known for learning Boolean function classes with respect to the uniform distribution

  • Work on learning problems has led to some new harmonic results

    • Spectral properties of Boolean function classes

    • Algorithms for approximating Boolean functions


Uniform Learning Model

[Diagram] A learning algorithm A is given an accuracy parameter ε > 0 and access to an example oracle EX(f) that returns uniform random examples <x, f(x)> of a target function f : {0,1}^n → {0,1} from a Boolean function class F (e.g., DNF). A must output a hypothesis h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Circuit Classes

  • Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)

  • DNF: depth-2 circuit with OR at root

[Figure] A depth-d circuit of alternating OR (∨) and AND (∧) gates over input variables v1, v2, ..., vn; negations allowed.


Decision Trees

[Figure] A decision tree: internal nodes (v3 at the root, then v1, v2, v4) each test one variable, and each leaf is labeled 0 or 1.

Decision Trees

[Figure] Evaluating the same tree on x = 11001: the root tests v3, and since x3 = 0 the evaluation follows that branch.


Decision Trees

[Figure] The path next reaches the node testing v1, and since x1 = 1 it follows the corresponding branch.


Decision Trees

[Figure] The path ends at a leaf labeled 1, so f(x) = 1 for x = 11001.
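To make the walk above concrete, here is a minimal sketch of representing and evaluating such a tree in Python. The exact node layout is an assumption (the figure only shows nodes v1 through v4 and the path for x = 11001), so treat it as illustrative.

```python
# A tiny decision-tree representation: ("leaf", value) or ("node", var, low, high),
# where 'var' is the 1-based index of the tested variable, 'low' is taken when
# x_var = 0 and 'high' when x_var = 1.

def evaluate(tree, x):
    """Evaluate a decision tree on an assignment x (a dict from index to bit)."""
    if tree[0] == "leaf":
        return tree[1]
    _, var, low, high = tree
    return evaluate(high if x[var] == 1 else low, x)

# Hypothetical tree consistent with the slides: the root tests v3, the 0-branch
# tests v1, and the remaining nodes (v2, v4) sit on the 1-branch.
tree = ("node", 3,
        ("node", 1, ("leaf", 0), ("leaf", 1)),                            # x3 = 0 side
        ("node", 2, ("leaf", 0), ("node", 4, ("leaf", 1), ("leaf", 0))))  # x3 = 1 side

x = {1: 1, 2: 1, 3: 0, 4: 0, 5: 1}   # x = 11001
print(evaluate(tree, x))             # x3 = 0, then x1 = 1 -> leaf 1, so f(x) = 1
```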


Function Size

  • Each function representation has a natural size measure:

    • CDC, DNF: # of gates

    • DT: # of leaves

  • The size s_F(f) of f with respect to class F is the size of the smallest representation of f within F

    • For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)


Efficient Uniform Learning Model

[Diagram] Same as the uniform learning model, but the algorithm must run in time poly(n, s_F, 1/ε): given ε > 0 and uniform random examples <x, f(x)> from EX(f) for a target f : {0,1}^n → {0,1} in F, A outputs h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Harmonic-Based Uniform Learning

  • [LMN]: constant-depth circuits are quasi-efficiently (n^{polylog(s/ε)}-time) uniform learnable

  • [BT]: monotone Boolean functions are uniform learnable in time roughly 2^{√n log n}

    • Monotone: For all x, i: f(x|xi=0) ≤ f(x|xi=1)

    • Also exponential in 1/ε (so assumes ε constant)

    • But independent of any size measure


Notation

  • Assume f : {0,1}^n → {-1,1}

  • For all a in {0,1}^n, χ_a(x) ≡ (-1)^{a·x}

  • For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) ≡ E_{x~U}[f(x) χ_a(x)]

  • Sometimes write, e.g., f̂({1}) for f̂(10…0)

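As a concrete companion to this notation, the following is a small brute-force sketch (exponential in n, intended only for intuition) that computes f̂(a) = E_{x~U}[f(x) χ_a(x)] by direct enumeration; the toy target function is hypothetical.

```python
from itertools import product

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a · x) for bit tuples a, x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """f̂(a) = E_{x ~ U}[f(x) * chi_a(x)], computed exactly by enumeration."""
    total = sum(f(x) * chi(a, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n

# Toy example (hypothetical): f = AND of the first two of n = 3 bits, in ±1 form.
n = 3
f = lambda x: 1 if (x[0] and x[1]) else -1

for a in product((0, 1), repeat=n):
    c = fourier_coefficient(f, a, n)
    if c != 0:
        print(a, c)   # only indices over the first two variables are nonzero
```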


Fourier Properties of Classes

  • [LMN]: if f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) }, then Σ_{a∉S} f̂²(a) < ε   ( |a| ≡ # of 1’s in a )

  • [BT]: if f is a monotone Boolean function and S = { a : |a| < √n / ε }, then Σ_{a∉S} f̂²(a) < ε



Proof Techniques

  • [LMN]: Håstad’s Switching Lemma + harmonic analysis

  • [BT]: Based on [KKL]

    • Define AS(f) ≡ n · Pr_{x,i}[f(x|x_i=0) ≠ f(x|x_i=1)]  (average sensitivity)

    • If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε

    • For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n

    • Note: this is tight for MAJ



Function Approximation

  • For all Boolean f, Σ_a f̂²(a) = 1  (Parseval)

  • For S ⊆ {0,1}^n, define g ≡ Σ_{a∈S} f̂(a) χ_a

  • [LMN]: Pr_x[f(x) ≠ sign(g(x))] ≤ E[(f - g)²] = Σ_{a∉S} f̂²(a)


“The” Fourier Learning Algorithm

  • Given: ε (and perhaps s, d, ...)

  • Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε

  • Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a ∈ S

    • Chernoff bounds: ~n^k/ε sample size suffices

  • Output h ≡ sign(Σ_{a∈S} f̂(a) χ_a)

  • Run time ~n^{2k}/ε  (a sketch of the algorithm follows below)

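Here is a minimal sketch of the low-degree algorithm just outlined: estimate every coefficient of degree < k from uniform random examples and output the sign of the truncated expansion. The sample size and the illustrative target (a majority of the first 3 of n = 6 bits) are arbitrary choices for the demo, not the bounds quoted above.

```python
import random
from itertools import combinations

def chi(A, x):
    """chi_A(x) = (-1)^(sum of x_i over i in A) for an index set A and bit tuple x."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def low_degree_learn(sample, n, k):
    """Estimate all Fourier coefficients of degree < k from labeled examples and
    return the hypothesis h = sign(sum over A of est(A) * chi_A)."""
    m = len(sample)
    est = {A: sum(y * chi(A, x) for x, y in sample) / m
           for d in range(k) for A in combinations(range(n), d)}
    def h(x):
        return 1 if sum(c * chi(A, x) for A, c in est.items()) >= 0 else -1
    return h

# Illustrative target: majority of the first 3 of n = 6 bits, in ±1 form.
n, k = 6, 3
target = lambda x: 1 if sum(x[:3]) >= 2 else -1

random.seed(0)
examples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(2000)]
h = low_degree_learn([(x, target(x)) for x in examples], n, k)

test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(1000)]
print("empirical error:", sum(h(x) != target(x) for x in test) / len(test))
```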


Halfspaces

  • [KOS]: Halfspaces are efficiently uniform learnable (given that ε is constant)

    • Halfspace: ∃ w ∈ R^{n+1} s.t. f(x) = sign(w · (x∘1))

    • If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε

    • Apply the LMN algorithm

  • A similar result applies to an arbitrary function of a constant number of halfspaces

    • Intersections of halfspaces are a key learning problem



Halfspace Techniques

  • [O] (cf. [BKS], [BJTa]):

    • The noise sensitivity of f at γ is the probability that flipping each bit of x independently with probability γ changes f(x)

    • NS_γ(f) ≡ ½(1 - Σ_a (1-2γ)^{|a|} f̂²(a))

  • [KOS]:

    • If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3 NS_γ(f)

    • If f is a halfspace then NS_γ(f) < 9√γ

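A small Monte Carlo sketch of the noise-sensitivity quantity defined above: flip each bit independently with probability γ and check whether f changes. The halfspace, dimension, and trial count are illustrative.

```python
import random

def noise_sensitivity(f, n, gamma, trials=20000):
    """Estimate NS_gamma(f): the probability that flipping each bit of a uniform
    random x independently with probability gamma changes the value of f."""
    changed = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]
        y = [b ^ 1 if random.random() < gamma else b for b in x]
        changed += (f(x) != f(y))
    return changed / trials

# Illustrative halfspace: f(x) = 1 iff x has at least n/2 ones, else -1.
n = 20
f = lambda x: 1 if sum(x) >= n / 2 else -1

random.seed(1)
for gamma in (0.01, 0.05, 0.1):
    print(gamma, noise_sensitivity(f, n, gamma))   # grows roughly like sqrt(gamma)
```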


Monotone DT

  • [OS]: Monotone functions are efficiently learnable given that:

    • ε is constant

    • s_DT(f) is used as the size measure

  • Techniques:

    • Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))

    • [BT]: If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε

    • Friedgut: ∃ T with |T| ≤ 2^{AS(f)/ε} s.t. Σ_{A⊄T} f̂²(A) < ε



Weak Approximators

  • [KKL] also show that if f is monotone, there is an i such that -f̂({i}) ≥ log₂n / n

  • Therefore Pr[f(x) = -χ_{i}(x)] ≥ ½ + log₂n / (2n)

  • In general, an h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f

  • If A outputs a weak approximator for every f in F, then F is weakly learnable





Weak Uniform Learning Model

[Diagram] Same as the uniform learning model, except the hypothesis only needs a slight edge over random guessing: A outputs h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n,s) for some polynomial p.


Efficient Weak Learning Algorithm for Monotone Boolean Functions

  • Draw a set of ~n² examples <x, f(x)>

  • For i = 1 to n:

    • Estimate f̂({i})

  • Output h ≡ -χ_{i*}, where i* = argmax_i (-f̂({i}))  (a sketch follows below)

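A direct sketch of the weak learner above: estimate the n degree-1 coefficients from uniform examples and output the single anti-correlated parity with the largest edge. The monotone target and sample size are illustrative.

```python
import random

def weak_learn_monotone(sample, n):
    """Estimate f̂({i}) = E[f(x) * (-1)^x_i] for each i and return the hypothesis
    h(x) = -chi_{i*}(x), where i* has the most negative (largest -f̂) coefficient."""
    m = len(sample)
    est = [sum(y * (-1) ** x[i] for x, y in sample) / m for i in range(n)]
    i_star = min(range(n), key=lambda i: est[i])
    return lambda x: -((-1) ** x[i_star])

# Illustrative monotone target: f(x) = 1 iff x_0 OR (x_1 AND x_2), in ±1 form.
n = 10
target = lambda x: 1 if (x[0] or (x[1] and x[2])) else -1

random.seed(2)
examples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5000)]
h = weak_learn_monotone([(x, target(x)) for x in examples], n)

test = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5000)]
print("agreement:", sum(h(x) == target(x) for x in test) / len(test))  # well above 1/2
```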


Weak Approximation for MAJ of Constant-Depth Circuits

  • Note that adding a single MAJ gate to a CDC destroys the LMN spectral property

  • [JKS]: MAJ of CDCs is quasi-efficiently, quasi-weakly uniform learnable

    • If f is a MAJ of CDCs of depth d, and the number of gates in f is s, then there is an A ∈ {0,1}^n such that

      • |A| < log^d s ≡ k

      • Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4sn^k)


Weak Learning Algorithm

  • Compute k = log^d s

  • Draw ~sn^k examples <x, f(x)>

  • Repeat for each A with |A| < k:

    • Estimate f̂(A)

  • Until an A is found with f̂(A) > 1/(2sn^k)

  • Output h ≡ χ_A

  • Run time ~n^{polylog(s)}

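The search just described can be sketched as follows: scan index sets of size < k, estimate each coefficient from examples, and stop at the first one above the threshold. The threshold, sample size, and the illustrative target (a majority of three parities, standing in for a MAJ of small circuits) are assumptions made for the demo.

```python
import random
from itertools import combinations

def chi(A, x):
    """chi_A(x) = (-1)^(sum of x_i over i in A)."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def weak_parity_learn(sample, n, k, threshold):
    """Scan all index sets A with |A| < k and return the first A whose estimated
    Fourier coefficient exceeds the threshold; chi_A is then a weak approximator."""
    m = len(sample)
    for d in range(k):
        for A in combinations(range(n), d):
            if sum(y * chi(A, x) for x, y in sample) / m > threshold:
                return A
    return None

# Illustrative target: a majority of three parities, which correlates 1/2 with each.
n, k = 8, 3
parities = [(0, 1), (2, 3), (4, 5)]
target = lambda x: 1 if sum(chi(A, x) for A in parities) > 0 else -1

random.seed(5)
examples = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(4000)]
sample = [(x, target(x)) for x in examples]
print(weak_parity_learn(sample, n, k, threshold=0.25))   # finds (0, 1)
```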


Weak Approximator Proof Techniques

  • “Discriminator Lemma” (HMPST)

    • Implies one of the CDC’s is a weak approximator to f

  • LMN spectral characterization of CDC

  • Harmonic analysis

  • A result of Beigel is used to extend weak learning to CDCs with polylog-many MAJ gates


Boosting

  • In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], …)

    • Need to learn weakly with respect to near-uniform distributions

      • For a near-uniform distribution D, find a weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s)

    • The final h is typically a MAJ of the weak approximators


Strong Learning for MAJ of Constant-Depth Circuits

  • [JKS]: MAJ of CDC is quasi-efficiently uniform learnable

    • Show that for near-uniform distributions, some parity function is a weak approximator

    • Beigel’s result again extends this to CDCs with polylog-many MAJ gates

  • [KP] + boosting: there are distributions for which no parity is a weak approximator


Uniform Learning from a Membership Oracle

[Diagram] Instead of random examples, the learning algorithm A may query a membership oracle MEM(f): A submits any x of its choosing and receives f(x). Given accuracy ε > 0, A must output h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Uniform Membership Learning of Decision Trees

  • [KM]

    • L1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)

    • If S = {a : |f̂(a)| ≥ ε/L1(f)} then Σ_{a∉S} f̂²(a) < ε

    • [GL]: an algorithm (using the membership oracle) finds {a : |f̂(a)| ≥ θ} in time ~n/θ⁶

    • So DT can be efficiently uniform membership learned

    • The output h has the same form as in LMN: h ≡ sign(Σ_{a∈S} f̂(a) χ_a)

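A sketch in the spirit of the [KM]/[GL] search: for a prefix α of length k, the weight Σ_b f̂(α∘b)² equals E_{y,y',z}[f(y∘z) f(y'∘z) χ_α(y⊕y')], which can be estimated with membership queries, and prefixes whose estimated weight falls below θ²/2 are pruned. Sample sizes and the toy target are illustrative, and no attempt is made to match the ~n/θ⁶ bound.

```python
import random

def chi(a, y):
    """(-1)^(a · y) for equal-length bit tuples."""
    return -1 if sum(ai & yi for ai, yi in zip(a, y)) % 2 else 1

def weight_estimate(f, n, prefix, samples=2000):
    """Estimate W(prefix) = sum over suffixes b of f̂(prefix∘b)^2 via the identity
    W(prefix) = E_{y,y',z}[ f(y∘z) f(y'∘z) chi_prefix(y ⊕ y') ]."""
    k = len(prefix)
    total = 0.0
    for _ in range(samples):
        y1 = tuple(random.randint(0, 1) for _ in range(k))
        y2 = tuple(random.randint(0, 1) for _ in range(k))
        z = tuple(random.randint(0, 1) for _ in range(n - k))
        total += f(y1 + z) * f(y2 + z) * chi(prefix, tuple(a ^ b for a, b in zip(y1, y2)))
    return total / samples

def km_search(f, n, theta):
    """Return candidate indices a with |f̂(a)| >= theta by prefix refinement."""
    prefixes = [()]
    for _ in range(n):
        prefixes = [p + (b,) for p in prefixes for b in (0, 1)
                    if weight_estimate(f, n, p + (b,)) >= theta ** 2 / 2]
    return prefixes

# Illustrative target: the single parity chi_{110000}, whose only heavy coefficient
# is at a = (1, 1, 0, 0, 0, 0).
n = 6
target = lambda x: -1 if (x[0] ^ x[1]) else 1

random.seed(3)
print(km_search(target, n, theta=0.5))   # expected: [(1, 1, 0, 0, 0, 0)]
```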


Uniform Membership Learning of DNF

  • [J]

    • ∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6s_DNF)

    • A modified [GL] can efficiently locate such a χ_a given an oracle for a near-uniform D

      • Boosters can provide such an oracle when uniform learning

    • Boosting then provides strong learning

  • [BJTb], [KS], [F]

    • For near-uniform D, can find χ_a in time ~ns²


Uniform Learning from a Random Walk Oracle

[Diagram] Examples <x, f(x)> now come from a random walk oracle RW(f): successive example points form a random walk on {0,1}^n rather than being independent uniform draws. Given accuracy ε > 0, A must output h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Random Walk DNF Learning

  • [BMOS]

    • Noise sensitivity and related values can be accurately estimated using a random walk oracle

      • NS_γ(f) ≡ ½(1 - Σ_a (1-2γ)^{|a|} f̂²(a))

      • T_b(f) ≡ Σ_a b^{|a|} f̂²(a)

    • Estimate of Tb(f) is efficient if |b| logarithmic

    • Only need logarithmic |b| to learn DNF [BF]



Random Walk Parity Learning

  • [JW] (unpublished)

    • Effectively, [BMOS] is limited to finding “heavy” Fourier coefficients f̂(a) for logarithmic |a|

    • Using a “breadth-first” variation of KM, any a with |f̂(a)| > θ can be located in time O(n^{log 1/θ})

    • A “heavy” coefficient corresponds to a parity function that weakly approximates f



Uniform Learning from a Classification Noise Oracle

[Diagram] The oracle EX_η(f) draws a uniform random x and returns the correct example <x, f(x)> with probability 1-η and the mislabeled example <x, ¬f(x)> with probability η. Given accuracy ε > 0 and error rate η > 0, A must output h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Uniform Learning from a Statistical Query Oracle

[Diagram] The algorithm A interacts with a statistical query oracle SQ(f): A submits a query (q(·,·), τ) and receives a value within τ of E_U[q(x, f(x))]. Given accuracy ε > 0, A must output h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


SQ and Classification Noise Learning

  • [K]

    • If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η))

    • Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ is poly in the other parameters)

      • Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}


Uniform SQ Hardness for PAR

  • [BFJKMR]

    • Harmonic analysis shows that for any q and χ_a: E_U[q(x, χ_a(x))] = q̂(0^{n+1}) + q̂(a∘1)

    • Thus an adversarial SQ response to (q, τ) is q̂(0^{n+1}) whenever |q̂(a∘1)| < τ

    • Parseval: |q̂(b∘1)| < τ for all but 1/τ² Fourier coefficients

    • So a ‘bad’ query eliminates only polynomially many candidate coefficients

    • Even PAR_{log n} is not efficiently SQ learnable

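The identity in the first bullet above can be checked numerically. Below, the label of the target parity χ_a is treated as an extra 0/1 bit appended to the example, so q is a function on {0,1}^{n+1} as in the slide; the particular query q and index a are hypothetical.

```python
from itertools import product

def chi(c, z):
    """(-1)^(c · z) for equal-length bit tuples."""
    return -1 if sum(ci & zi for ci, zi in zip(c, z)) % 2 else 1

def fourier(q, m, c):
    """Exact Fourier coefficient of q : {0,1}^m -> R at index c."""
    return sum(q(z) * chi(c, z) for z in product((0, 1), repeat=m)) / 2 ** m

n = 4
a = (1, 0, 1, 0)                                            # index of the target parity
label = lambda x: sum(ai & xi for ai, xi in zip(a, x)) % 2  # 0/1 label of chi_a on x

# Hypothetical query: q = AND of the first example bit and the label bit, in ±1 form.
q = lambda z: 1 if (z[0] and z[-1]) else -1

lhs = sum(q(x + (label(x),)) for x in product((0, 1), repeat=n)) / 2 ** n
rhs = fourier(q, n + 1, (0,) * (n + 1)) + fourier(q, n + 1, a + (1,))
print(lhs, rhs)   # the two agree: E_U[q(x, label)] = q̂(0^{n+1}) + q̂(a∘1)
```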


Uniform Learning from an Attribute Noise Oracle

[Diagram] The oracle EX_{D_N}(f) draws a uniform random x and a noise vector r ~ D_N and returns <x⊕r, f(x)>: the label is computed on the clean x, but the learner sees the corrupted attributes x⊕r. Given accuracy ε > 0 and noise model D_N, A must output h : {0,1}^n → {0,1} such that Pr_{x~U}[f(x) ≠ h(x)] < ε.


Uniform Learning with Independent Attribute Noise

  • [BJTa]:

    • The LMN algorithm then produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)]  (see the sketch below)

  • Example application

    • Assume the noise process D_N is a product distribution:

      • D_N(x) = ∏_i (p_i x_i + (1-p_i)(1-x_i))

    • Assume p_i < 1/polylog(n) and 1/ε at most quasi-poly(n) (mild restrictions)

    • Then a modified LMN uniform learns attribute-noisy AC0 in quasi-polynomial time

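A small sketch of the attenuation effect in the first bullet above: under independent attribute noise with flip probabilities p_i, the empirical estimate of f̂(a) shrinks by exactly E_{r~D_N}[χ_a(r)] = ∏_{i: a_i=1}(1-2p_i). The flip probabilities and target are illustrative, and simply dividing the factor out (as done here when the p_i are known) is a simplification rather than the [BJTa] procedure.

```python
import random

def chi(A, x):
    """chi_A(x) = (-1)^(sum of x_i over i in A)."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def noisy_estimate(target, n, A, p, m=20000):
    """Empirical E[f(x) * chi_A(x ⊕ r)]: the learner sees only the corrupted x ⊕ r,
    where bit i is flipped independently with probability p[i]."""
    total = 0
    for _ in range(m):
        x = [random.randint(0, 1) for _ in range(n)]
        noisy_x = [xi ^ (1 if random.random() < p[i] else 0) for i, xi in enumerate(x)]
        total += target(x) * chi(A, noisy_x)
    return total / m

n = 4
p = [0.1, 0.05, 0.2, 0.0]                        # illustrative per-bit flip probabilities
target = lambda x: -1 if (x[0] ^ x[1]) else 1    # f = chi_{1100}, so f̂({0,1}) = 1
A = (0, 1)

random.seed(4)
raw = noisy_estimate(target, n, A, p)
attenuation = 1.0
for i in A:
    attenuation *= 1 - 2 * p[i]                  # E_{r~D_N}[chi_A(r)]
print("raw estimate:", raw)                      # ≈ (1 - 0.2) * (1 - 0.1) = 0.72
print("corrected:", raw / attenuation)           # ≈ f̂(A) = 1
```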


Agnostic Learning Model

[Diagram] The target f : {0,1}^n → {0,1} is now an arbitrary Boolean function. Given accuracy ε > 0 and uniform random examples <x, f(x)> from EX(f), A must output a hypothesis h in a fixed class H such that Pr_{x~U}[f(x) ≠ h(x)] ≤ opt_H + ε, where opt_H is the error of the best hypothesis in H.


Agnostic Learning of Halfspaces

  • [KKMS]

    • An agnostic learning algorithm for H = the class of halfspaces

    • The algorithm is not Fourier-based (it uses L1 regression)

  • However, a somewhat weaker result can be obtained by simple Fourier analysis


Near-Agnostic Learning via LMN

  • [KKMS]:

    • Let f be an arbitrary Boolean function

    • Fix any set S ⊆ {0,1}^n of Fourier indices and fix ε

    • Let g be any function s.t.

      • Σ_{a∉S} ĝ²(a) < ε and

      • Pr[f ≠ g] (call it η) is minimized over all such g

    • Then for the h learned by LMN by estimating the coefficients of f over S:

      • Pr[f ≠ h] < 4η + ε



Summary

  • Most uniform-learning results for Boolean function classes depend on harmonic analysis

  • Learning theory provides motivation for new harmonic observations

  • Even very “weak” harmonic results can be useful in learning-theory algorithms


Some Open Problems

  • Efficient uniform learning of monotone DNF

    • Best to date for small s_DNF is [Ser], time ~n·s^{log s} (based on [BT], [M], [LMN])

  • Non-uniform learning

    • Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN]

    • Key issue in real-world applicability


Open Problems (cont’d)

  • Weaker dependence on ε

    • Several algorithms are fully exponential (or worse) in 1/ε

  • Additional proper learning results

    • Allows for interpretation of learned hypothesis


References

  • Beigel: When Do Extra Majority Gates Help? ...

  • [BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...

  • [BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.

  • [BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...

  • [BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...

  • [BMOS] Bshouty, Mossel, O’Donnell, Servedio. Learning DNF from Random Walks.

  • [BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.

  • [F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...

  • [FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.

  • [FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...

  • Friedgut: Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.

  • [HMPST] Hajnal, Maass, Pudlak, Szegedy, Turan. Threshold Circuits of Bounded Depth.

  • [J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...

  • [JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.

  • [JW] Jackson, Wimmer. In prep.

  • [KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.

  • [KKMS] Kalai, Klivans, Mansour, Servedio. On Agnostic Boosting and Parity Learning.

  • [K] Kearns. Efficient Noise-tolerant learning from Statistical Queries.

  • [KM] Kushilevitz, Mansour. Learning Decision Trees using the Fourier Spectrum.

  • [KOS] Klivans, O’Donnell, Servedio. Learning Intersections and Thresholds of Halfspaces.

  • [KP] Krause, Pudlak. On Computing Boolean Functions by Sparse Real Polynomials.

  • [KS] Klivans, Servedio. Boosting and Hard-core Sets.

  • [LMN] Linial, Mansour, Nisan. Constant-depth Circuits, Fourier Transform, and Learnability.

  • [M] Mansour. An O(n^{log log n}) Learning Algorithm for DNF...

  • [O] O’Donnell. Hardness Amplification within NP.

  • [OS] O’Donnell, Servedio. Learning Monotone Functions from Random Examples in Polynomial Time.

  • [S] Schapire. The Strength of Weak Learnability.

  • [Ser] Servedio. On Learning Monotone DNF under Product Distributions.

