
Noise Tolerant Learning

Noise Tolerant Learning. Presented by Aviad Maizels. Based on: "Noise-tolerant learning, the parity problem, and the statistical query model" by Avrim Blum, Adam Kalai and Hal Wasserman, and "A Generalized Birthday Problem" by David Wagner.


Presentation Transcript


  1. Noise Tolerant Learning Presented by Aviad Maizels. Based on: "Noise-tolerant learning, the parity problem, and the statistical query model" by Avrim Blum, Adam Kalai and Hal Wasserman; "A Generalized Birthday Problem" by David Wagner; "Hard-core predicates for any one-way function" by O. Goldreich and L. A. Levin; "Simulated Annealing and Boltzmann Machines" by Emile Aarts and Jan Korst

  2. void Agenda() { do { • A few sentences about Codes • The opposite problem • Learning with noise • The k-sum problem • Can we do it faster ?? • Annealing } while (!understandable); }

  3. void fast_introduction_to_LECC() { [Figure: a binary symmetric channel: 0 and 1 each pass through correctly with probability 1-p and are flipped with probability p.] The communication channel may disrupt the original data. Proposed solution: encode messages to give some protection against errors.

  4. void fast_introduction_to_LECC() (Continued – terminology) Linear codes: • fixed-size block code • additive closure. A code is tagged with two parameters (n,k): • k – data size • n – encoded-word size. [Figure: source → encoder → channel; the message msg = u1u2…uk is encoded into the codeword x1x2…xn, and noise is added on the channel.]

  5. void fast_introduction_to_LECC() (Continued – terminology) [Figure: a systematic codeword of length n: the first k bits are data, the remaining n-k bits are redundancy.] • Systematic code – the original data appears directly inside the codeword. • Generator matrix (G) – a matrix s.t. multiplying a message by it outputs the encoded word. • Its number of rows equals the space dimension (k). • Every codeword can be represented as a linear combination of G's rows.
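A minimal sketch of encoding with a generator matrix (my own illustration, not from the slides): assuming a systematic generator matrix G = [I_k | A] over GF(2), encoding is just a vector-matrix product mod 2.

    import numpy as np

    def encode(msg, G):
        """Encode a length-k message with a k x n generator matrix over GF(2)."""
        return (np.array(msg) @ np.array(G)) % 2

    # Hypothetical (n,k) = (3,2) code with G = [I_2 | A] in systematic form.
    G = [[1, 0, 1],
         [0, 1, 1]]
    print(encode([1, 1], G))  # -> [1 1 0]; the first k bits repeat the data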

  6. void fast_introduction_to_LECC() (Continued – terminology) [Figure: the vectors of {0,1}^3 drawn as the corners of a cube, illustrating distances between vectors.] • Hamming distance – the number of places in which two vectors differ; denoted dist(x,y). • Hamming weight – the number of places of a vector that differ from zero; denoted wt(x). • Minimum distance of a linear code – the minimum weight of any non-zero codeword.
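A small illustrative snippet for the two definitions (my own sketch; vectors are Python lists of bits):

    def hamming_weight(x):
        """Number of positions of x that differ from zero."""
        return sum(1 for b in x if b != 0)

    def hamming_distance(x, y):
        """Number of positions in which x and y differ."""
        return sum(1 for a, b in zip(x, y) if a != b)

    # For a linear code dist(x, y) = wt(x + y), so the minimum distance
    # equals the minimum weight over all non-zero codewords.
    print(hamming_distance([1, 0, 1], [0, 0, 1]))  # 1
    print(hamming_weight([1, 0, 1]))               # 2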

  7. void fast_introduction_to_LECC() (Continued – terminology) • Perfect code (t) – every vector has Hamming distance <= t from a unique codeword. [Figure: channel → decoder → target; the received word is x + e, where e = e1e2…en is the error vector, and the decoder outputs msg'.]

  8. void fast_introduction_to_LECC() (Continued – terminology) • Complete decoding – the acceptance regions around the codewords together contain all vectors of length n. ... }

  9. void the_opposite_problem() { • Decoding linear (n,k) codes in the presence of random noise, in poly(n) time, when k is larger than O(log n). • The case k = O(log n) is trivial (there are only poly(n) codewords to try). In !(coding-theory) terms: • Given a finite set of codewords (examples) of length n, their labels, and a new vector x, find/learn the label of x, in the presence of random noise, in poly(n) time.

  10. void the_opposite_problem() (Continued – Main idea) Without noise: • Any vector can be written as a linear combination of previously seen examples. • The vector's label can be deduced in the same way. So… all we need is to find a basis in order to deduce the label of any new example. Q: Is it the same in the presence of noise?

  11. void the_opposite_problem() (Continued – Main idea) Well… no. Summing examples actually boosts the noise: given a noise rate of η < 1/2, the sum of s examples has a noise rate of 1/2 - (1/2)(1-2η)^s. Plan: write basis vectors as a sum of a small number of examples, and write the new sample as a linear combination of the above. }
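A tiny numeric check (my own sketch) of how quickly the noise rate of a sum of s examples approaches 1/2:

    def noise_of_sum(eta, s):
        """Noise rate of the XOR of s labels, each flipped independently with prob eta."""
        return 0.5 - 0.5 * (1 - 2 * eta) ** s

    for s in (1, 2, 4, 8, 16, 32):
        print(s, round(noise_of_sum(0.1, s), 6))
    # With eta = 0.1 the bias (1 - 2*eta)**s = 0.8**s decays geometrically,
    # so the summed label quickly becomes close to a fair coin.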

  12. void learning_with_noise() { • Concept – a boolean function over the input space. • Concept class – a set of concepts. • World model: • a fixed noise rate η < 1/2, • a fixed probability distribution D over the input space, • the algorithm may ask for a labeled example (x,l), • and… an unknown concept c.

  13. void learning_with_noise() { [Figure: a k-bit example x = 1010111… fed to the concept c.] • Goal: find an ε-approximation of c, i.e. a function h s.t. Pr_{x←D}[h(x) = c(x)] >= 1-ε. • Parity function: defined by a corresponding vector v ∈ {0,1}^n; the function is then given by the rule φ_v(x) = v·x mod 2, i.e. the XOR of the bits of x in the positions where v is 1.
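A minimal sketch of the parity concept (my own illustration; inputs are bit-lists):

    def parity_concept(v):
        """Return the parity function defined by v: x -> <v, x> mod 2."""
        def phi(x):
            return sum(vi & xi for vi, xi in zip(v, x)) % 2
        return phi

    c = parity_concept([1, 0, 1, 1])
    print(c([1, 1, 1, 0]))  # (1&1) ^ (0&1) ^ (1&1) ^ (1&0) = 0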

  14. void learning_with_noise() (Continued – Preliminaries) • Efficiently learnable: a concept class C is E.L. in the presence of random classification noise under distribution D if: • there exists an algorithm A s.t. for all ε > 0, δ > 0, η < 1/2 and every concept c ∈ C, • A produces an ε-approximation of c with probability at least 1-δ when given access to D-random examples, • and A runs in time polynomial in n, 1/ε, 1/δ and 1/(1/2-η).

  15. void learning_with_noise() (Continued – Goal) • We'll show that: the length-k parity problem, for noise rate η < 1/2, can be solved with computation time and total number of examples 2^{O(k/log k)}. Observe the behavior of the noise when we add up examples:

  16. void learning_with_noise() (Continued – Noise behavior) [Figure: two noisy example strings, e.g. 1010111… and 1111011…, being added bit-wise.] Let p_i be the probability that label bit i is noisy and q_i the probability that it is correct, so p_i + q_i = 1. Denote the bias s_i = q_i - p_i = 1 - 2p_i = 2q_i - 1, with s_i ∈ [-1,1]. For the sum (XOR) of the two bits: p_3 = p_1q_2 + p_2q_1 and q_3 = p_1p_2 + q_1q_2, hence s_3 = q_3 - p_3 = s_1·s_2.

  17. void learning_with_noise() (Continued – Idea) Main idea: draw many more examples than needed, so that basis vectors can be found as a sum of a relatively small number of examples. • If η < 1/2, the sum of ω(log n) labels will be polynomially indistinguishable from random. • We can repeat the process to boost reliability.

  18. void learning_with_noise() (Continued – Definitions) [Figure: a k-bit example 1010111… viewed as a blocks of b bits each, labeled 1, 2, …, a.] A few more definitions: • k = a·b • V_i – the subspace of {0,1}^{ab} consisting of vectors whose last i blocks are zeroed • i-sample – a set of independent vectors that are uniformly distributed over V_i

  19. void learning_with_noise() (Continued – Main construction) Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s - 2^b in time O(s). Behold: • i-sample = {x1,…,xs}. • Partition the x's according to their values in block (a-i) (we get at most 2^b partitions). • For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard it. Result: vectors z1,…,zm, m >= s - 2^b, where: • block (a-i) is now zeroed out as well • the z_j are independent and uniformly distributed over V_{i+1}
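A rough sketch of this partition-and-add step (my own illustration under simplifying assumptions: examples are (vector, label) pairs, vectors are tuples of bits split into blocks of b bits):

    import random

    def reduce_block(sample, block, b):
        """Zero out the given block (0-indexed) of every example: within each
        group that agrees on that block, add a randomly chosen representative
        to all other members and discard the representative."""
        buckets = {}
        for vec, label in sample:
            key = vec[block * b:(block + 1) * b]
            buckets.setdefault(key, []).append((vec, label))
        out = []
        for members in buckets.values():
            rep_vec, rep_label = members.pop(random.randrange(len(members)))
            for vec, label in members:
                summed = tuple(u ^ v for u, v in zip(vec, rep_vec))
                out.append((summed, label ^ rep_label))
        return out  # at most 2**b vectors (the representatives) are lost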

  20. void learning_with_noise() (Continued – Algorithm) Algorithm (finding the 1st bit): • Ask for a·2^b labeled examples • Apply the construction (a-1) times to get an (a-1)-sample • With probability about 1-1/e, the vector (1,0,…,0) is a member of the (a-1)-sample. If it is not there, we do it again with fresh labeled examples (the expected number of repetitions is constant). Note: we have written (1,0,…,0) as a sum of 2^{a-1} examples, causing the noise rate to boost to 1/2 - (1/2)(1-2η)^{2^{a-1}}.

  21. void learning_with_noise() (Continued – Observations) Observations: • We found the first bit of our new sample using a number of examples and an amount of computation polynomial in 2^b • We can shift all examples to determine the remaining bits • Fixing a = (1/2)·log k and b = 2k/log k gives the desired 2^{O(k/log k)} bound for a constant noise rate η. }

  22. void the_k_sum_problem() { The key to improving the above algorithm is to find a better way to solve a problem similar to "k-sum". Problem: given k lists L1,…,Lk of elements drawn uniformly and independently from {0,1}^n, find x1 ∈ L1,…,xk ∈ Lk s.t. x1 ⊕ x2 ⊕ … ⊕ xk = 0. Note: a solution to the "k-sum" problem exists with good probability if |L1|·|L2|·…·|Lk| >> 2^n (similar to the birthday paradox).

  23. void the_k_sum_problem() (Continued – Wagner's Algorithm – Definitions) Preliminary definitions and observations: • low_l(x) – the l least significant bits of x • L1 ⋈_l L2 – the list of all pairs from L1 × L2 that agree on their l least significant bits • If low_l(x1 ⊕ x2) = 0 and low_l(x3 ⊕ x4) = 0 then low_l(x1 ⊕ x2 ⊕ x3 ⊕ x4) = 0, and Pr[x1 ⊕ x2 ⊕ x3 ⊕ x4 = 0] = 2^l/2^n • The join (⋈_l) operation: • hash join: stores one list and scans through the other; (|L1| + |L2|) steps, O(|L1| + |L2|) storage • merge join: sorts and scans the two sorted lists; O(max(|L1|,|L2|)·log(max(|L1|,|L2|))) time
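A possible hash-join sketch (my own illustration; integers stand for n-bit strings and low(x, l) takes the l least significant bits):

    def low(x, l):
        """The l least significant bits of x."""
        return x & ((1 << l) - 1)

    def join_xor(L1, L2, l):
        """Return the XORs x1 ^ x2 over all pairs (x1, x2) in L1 x L2 that
        agree on their l low bits, so every returned value has low(., l) == 0."""
        buckets = {}
        for x1 in L1:
            buckets.setdefault(low(x1, l), []).append(x1)
        out = []
        for x2 in L2:
            for x1 in buckets.get(low(x2, l), []):
                out.append(x1 ^ x2)
        return out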

  24. void the_k_sum_problem() (Continued – Wagner's Algorithm – Simple case) [Figure: a join tree with leaves L1, L2, L3, L4; L1 ⋈_l L2 and L3 ⋈_l L4 are formed at the middle level, and the root joins these two into {(x1,…,x4): x1 ⊕ … ⊕ x4 = 0}.] The 4-list case: • Extend the lists until each contains 2^l elements • Generate a new list L12 of values x1 ⊕ x2 with low_l(x1 ⊕ x2) = 0, and a new list L34 in the same way • Search for matches between L12 and L34

  25. void the_k_sum_problem() (Continued – Wagner's Algorithm) Observation: • Pr[low_l(xi ⊕ xj) = 0] = 1/2^l when 1 <= i < j <= 4 and xi, xj are chosen uniformly at random • E[|Lij|] = (|Li|·|Lj|)/2^l = 2^{2l}/2^l = 2^l • The expected number of matches between L12 and L34 that yield the desired solutions is |L12|·|L34|/2^{n-l} (taking l ≈ n/3 gives us at least 1) Complexity: • O(2^{n/3}) time and space
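A compact sketch of the 4-list case under the same toy conventions (my own illustration; the parameters n and l below are only for demonstration, with l ≈ n/3 so that about one solution is expected):

    import random

    def low(x, l):
        return x & ((1 << l) - 1)

    def four_list_xor(L1, L2, L3, L4, l):
        """Wagner-style 4-list algorithm: look for x1,x2,x3,x4 with x1^x2^x3^x4 == 0."""
        def join_pairs(A, B):
            buckets = {}
            for a in A:
                buckets.setdefault(low(a, l), []).append(a)
            return [(a, b) for b in B for a in buckets.get(low(b, l), [])]

        L12 = {a ^ b: (a, b) for a, b in join_pairs(L1, L2)}   # low l bits are zero
        for c, d in join_pairs(L3, L4):
            if (c ^ d) in L12:                                  # full n-bit collision
                a, b = L12[c ^ d]
                return a, b, c, d
        return None  # may happen; the expected number of solutions is only ~1

    n, l = 24, 8
    lists = [[random.getrandbits(n) for _ in range(1 << l)] for _ in range(4)]
    print(four_list_xor(*lists, l))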

  26. void the_k_sum_problem() (Continued – Wagner's Algorithm) Refinements: • We don't need the low l bits to be zero; we can fix them to any value α (i.e. require low_l(x1 ⊕ x2) = α) • The value 0 in x1 ⊕ … ⊕ xk = 0 can be replaced with any constant c of our choice (by replacing Lk with Lk' = Lk ⊕ c) • If k > k', the complexity of the "k-sum" problem can be no larger than the complexity of the "k'-sum" problem (just pick arbitrary x_{k'+1},…,x_k, define c = x_{k'+1} ⊕ … ⊕ x_k and use the "k'-sum" algorithm to find a solution of x1 ⊕ … ⊕ x_{k'} = c) ⇒ we can solve the "k-sum" problem with complexity at most O(2^{n/3}) for all k >= 4

  27. void the_k_sum_problem() (Continued – Wagner's Algorithm) Extending the 4-list case: • Create a complete binary tree of depth log k over the k lists. • After h levels of joins (depth h in the tree), the surviving partial sums are zero on their low h·l bits, where l = n/(1 + log k). So we get an algorithm that requires O(k · 2^{n/(1+log k)}) time and space. Note: if k is not a power of 2, we take k' to be the largest power of 2 less than k and afterwards use the list-elimination trick above. }

  28. void can_we_do_it_better_?() { But… maybe there's a problem with the approach? • How many samples do we really need to get a solution with good probability? • Do we even need a basis? • Can we do it without scanning the whole space? • Do we need the best solution? Answers (in order): Yes • Yes • k + log k - log(-ln(1-ε)) • Yes & no… • Yes • No

  29. void can_we_do_it_better_?() (Continued – Sampling space) To have a solution we need k linearly independent vectors in our sampling space S. So… we'll want Pr[S contains k linearly independent vectors] >= 1-ε, where ε ∈ [0,1] ⇒ |sampling space| = O(k + log k + f(ε)) }
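A small empirical sketch (my own, with illustrative parameters) of how many uniform samples are needed before S contains k linearly independent vectors, using Gaussian elimination over GF(2):

    import random

    def rank_gf2(vectors, k):
        """Rank over GF(2) of bit-vectors encoded as k-bit integers."""
        pivots = [0] * k          # pivots[i] holds a vector whose highest set bit is i
        rank = 0
        for v in vectors:
            for i in reversed(range(k)):
                if not (v >> i) & 1:
                    continue
                if pivots[i] == 0:
                    pivots[i] = v
                    rank += 1
                    break
                v ^= pivots[i]
        return rank

    k, trials = 32, 200
    for m in (k, k + 4, k + 8):
        ok = sum(rank_gf2([random.getrandbits(k) for _ in range(m)], k) == k
                 for _ in range(trials))
        print(m, ok / trials)
    # Only a few samples beyond k are needed before S spans {0,1}^k with high
    # probability, in line with the O(k + log k + f(ε)) estimate above.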

  30. void annealing() { The physical process of heating up a solid until it melts, followed by cooling it down into a state of perfect lattice. The analogous combinatorial problem: finding, among a potentially very large number of solutions, a solution with minimal cost. • Note: we don't even need the minimal-cost solution, just one whose noise rate is below our threshold

  31. void annealing() (Continued – Combinatorial optimization) Some definitions: • The set of solutions of the combinatorial problem is taken as the set of states S' • Note: in our case these states are built from our sample S (see the following slides) • The price function is the energy E: S' → R that we minimize • The transition probability between neighboring states depends on their energy difference and an external temperature T

  32. void annealing() (Continued – Pseudo-code algorithm) • Set T to a high temperature • Choose an arbitrary initial state c • Loop: • Select a neighbor c' of c; set ΔE = E(c') - E(c) • If ΔE < 0 then move to c', else move to c' with probability exp(-ΔE/T) • Repeat the two steps above several more times • Decrease T • Wait long enough and cross fingers… (preferably more than 2)
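A generic sketch of this loop (my own illustration; the energy, neighbor and cooling parameters are placeholders to be supplied by the application):

    import math
    import random

    def simulated_annealing(initial, energy, neighbor, t0=10.0, alpha=0.95,
                            steps_per_phase=100, phases=50):
        """Generic simulated annealing loop following the pseudo-code above."""
        c, t = initial, t0
        for _ in range(phases):
            for _ in range(steps_per_phase):
                c2 = neighbor(c)
                delta = energy(c2) - energy(c)
                if delta < 0 or random.random() < math.exp(-delta / t):
                    c = c2
            t *= alpha            # cooling schedule: decrease T after each phase
        return c

    # Toy usage: minimize (x - 3)^2 over the integers.
    print(simulated_annealing(0, lambda x: (x - 3) ** 2,
                              lambda x: x + random.choice((-1, 1))))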

  33. void annealing() (Continued – Problems) Problems: • Not all states can yield our new sample (only the ones containing at least one vector from S\basis) • The probability that a "capable" state will yield the zero vector is 1/2^k • The probability that any 1 <= j <= k vectors from S will yield a solution: • Note: when |S| ≈ k this expression approaches zero

  34. void annealing() (Continued – Reduction) Idea: • Sample a little more than is needed: |S| = c·k for a small constant c • Assign each vector its Hamming weight and sort S by it. Reduction: • Spawning the next generation: all the states which include a vector whose Hamming weight is <= 2·wt(l)

  35. void annealing() (Continued – Convergence & Complexity ??) Complexity: O(τ · L · ln|S'|), where L denotes the number of steps needed to reach quasi-equilibrium in each phase and τ denotes the computation time of a single transition • ln(|S'|) denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule

  36. Game Over "I don't even see the code anymore… all I can see now are blondes, brunettes, redheads…" - Cypher ("The Matrix")

  37. void appendix() ([GL]) Theorem: suppose we have oracle access to a random process b_x: {0,1}^n → {0,1}, so that Pr[b_x(r) = b(x,r)] >= 1/2 + ε, where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r. Then we can, in time polynomial in n/ε, output a list of strings that contains x with probability at least 1/2.

  38. void appendix() (Continued – [GL] – highway) How?? 1st way (to extract xi): suppose s(x) = Pr[b_x(r) = b(x,r)] >= 3/4 + ε (hmmm??). The probability that both b_x(r) = b(x,r) and b_x(r ⊕ ei) = b(x,r ⊕ ei) hold is at least 1/2 + 2ε, in which case xi = b_x(r) ⊕ b_x(r ⊕ ei); but…

  39. void appendix() (Continued – [GL] – better way) 2nd way: Idea: guess b(x,r) by ourselves. Problem: we need to guess polynomially many r's. Solution: generate polynomially many r's so that they are "sufficiently" random but we can still guess them with non-negligible probability.

  40. void appendix() (Continued – [GL] – better way) Construction: • Select l strings uniformly in {0,1}^n and denote them s^1,…,s^l • Guess the values σ_1,…,σ_l for b(x,s^1),…,b(x,s^l); the probability that all guesses are correct is 2^{-l} • Assign to each non-empty subset J ⊆ {1,…,l} the string r_J = ⊕_{j∈J} s^j • Note that b(x, r_J) = ⊕_{j∈J} b(x, s^j) = ⊕_{j∈J} σ_j • Try all possibilities for σ_1,…,σ_l and output a list of 2^l candidate strings z ∈ {0,1}^n
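A small sketch of the r_J construction (my own illustration; strings are n-bit integers and sigma_guess is one guessed assignment of the bits σ_1,…,σ_l):

    from itertools import combinations
    import random

    def build_r_and_sigma(s_list, sigma_guess):
        """For every non-empty subset J of {1,...,l}, form r_J as the XOR of the
        chosen s^j and sigma_J as the XOR of the guessed bits, so that
        sigma_J = b(x, r_J) whenever every guess sigma_j = b(x, s^j) is correct."""
        l = len(s_list)
        out = []
        for size in range(1, l + 1):
            for J in combinations(range(l), size):
                r, sig = 0, 0
                for j in J:
                    r ^= s_list[j]
                    sig ^= sigma_guess[j]
                out.append((J, r, sig))
        return out

    n, l = 16, 3
    s_list = [random.getrandbits(n) for _ in range(l)]
    for J, r, sig in build_r_and_sigma(s_list, (0, 1, 1)):
        print(J, format(r, "016b"), sig)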
