
Function Learning and Neural Nets


Presentation Transcript


  1. Function Learning and Neural Nets

  2. Setting • Learn a function with: • Continuous-valued examples, e.g., the pixels of an image • Continuous-valued output, e.g., the likelihood that the image is a ‘7’ • Known as regression • [Regression can be turned into classification via thresholds]

  3. Function-Learning (Regression) Formulation • Goal function f • Training set: (x(i), y(i)), i = 1,…,n, with y(i) = f(x(i)) • Inductive inference: find a function h that fits the points well • Same Keep-It-Simple bias [Figure: sample points (x(i), y(i)) on a plot of f(x) vs. x]

  4. Least-Squares Fitting • Hypothesize a class of functions f(x,θ) parameterized by θ • Minimize the squared loss E(θ) = Σi (f(x(i),θ) − y(i))² [Figure: candidate curve f(x,θ) fit to sample points of f(x)]

  5. Linear Least-Squares • f(x,θ) = x∙θ • The value of θ that minimizes E(θ) is θ = [Σi x(i)y(i)] / [Σi (x(i))²] • Derivation: E(θ) = Σi (x(i)θ − y(i))² = Σi ((x(i))²θ² − 2x(i)y(i)θ + (y(i))²), so E′(θ) = Σi (2(x(i))²θ − 2x(i)y(i)) = 0 gives θ = [Σi x(i)y(i)] / [Σi (x(i))²] [Figure: line f(x,θ) fit to sample points of f(x)]
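
A minimal numerical sketch of this closed-form solution in numpy; the data arrays below are illustrative, not from the slides:

```python
import numpy as np

# 1-D least squares through the origin: f(x, theta) = theta * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

# Closed-form minimizer of E(theta) = sum_i (theta * x(i) - y(i))^2:
theta = np.sum(x * y) / np.sum(x * x)
print(theta)  # ~2.04, the slope that minimizes the squared loss
```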

  6. Linear Least-Squares with Constant Offset • f(x,θ0,θ1) = θ0 + θ1x • E(θ0,θ1) = Σi (θ0 + θ1x(i) − y(i))² • At the minimum, dE/dθ0(θ0*,θ1*) = 0 and dE/dθ1(θ0*,θ1*) = 0, so: 0 = 2Σi (θ0* + θ1*x(i) − y(i)) and 0 = 2Σi x(i)(θ0* + θ1*x(i) − y(i)) • Solving: θ1* = [N Σi x(i)y(i) − (Σi x(i))(Σi y(i))] / [N Σi (x(i))² − (Σi x(i))²] and θ0* = (1/N) Σi (y(i) − θ1*x(i)) [Figure: line f(x,θ) with offset fit to sample points of f(x)]
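
The two-parameter solution can be sketched the same way (again with illustrative data):

```python
import numpy as np

# Linear least squares with constant offset: f(x) = theta0 + theta1 * x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.9, 5.1, 7.0])
N = len(x)

# The closed-form solution from the slide:
theta1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
theta0 = np.mean(y - theta1 * x)   # (1/N) * sum_i (y(i) - theta1 * x(i))
```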

  7. Multi-Dimensional Least-Squares • Let x include attributes (x1,…,xN) • Let θ include coefficients (θ1,…,θN) • Model: f(x,θ) = x1θ1 + … + xNθN

  8. Multi-Dimensional Least-Squares • f(x,θ) = x1θ1 + … + xNθN • The best θ is given by θ = (AᵀA)⁻¹Aᵀb • where A is the matrix with the x(i) as rows and b is the vector of the y(i)
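
A sketch of the normal-equations solution in numpy; the small A and b are illustrative:

```python
import numpy as np

# Rows of A are the examples x(i); b holds the targets y(i).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])    # a first column of 1s plays the role of a constant offset
b = np.array([0.9, 3.1, 4.9])

# theta = (A^T A)^(-1) A^T b, computed with solve() rather than an explicit inverse:
theta = np.linalg.solve(A.T @ A, A.T @ b)
# A numerically robust alternative: theta, *_ = np.linalg.lstsq(A, b, rcond=None)
```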

  9. Nonlinear Least-Squares • E.g., quadratic: f(x,θ) = θ0 + θ1x + θ2x² • E.g., exponential: f(x,θ) = exp(θ0 + θ1x) • Or any combination: f(x,θ) = exp(θ0 + θ1x) + θ2 + θ3x • Fitting can be done using gradient descent, as sketched below [Figure: linear, quadratic, and other fits to the same f(x) data]
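
One way gradient descent applies here, sketched for the exponential model; the step size and iteration count are arbitrary illustrative choices, and convergence depends on the data and on alpha:

```python
import numpy as np

# Gradient descent for the nonlinear model f(x, theta) = exp(theta0 + theta1 * x).
def fit_exp(x, y, alpha=1e-3, iters=5000):
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        f = np.exp(t0 + t1 * x)              # model predictions
        err = f - y
        # Chain rule: dE/dtheta0 = sum 2*err*f,  dE/dtheta1 = sum 2*err*f*x
        t0 -= alpha * np.sum(2 * err * f)
        t1 -= alpha * np.sum(2 * err * f * x)
    return t0, t1
```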

  10. Aside: Feature Transforms • Common model: weighted sums of nonlinear functions f1(x),…,fN(x) • Linear in the feature space • Polynomial: g(x,θ) = θ0 + θ1x + … + θdx^d • In general: g(x,θ) = θ1f1(x) + … + θNfN(x) • The least-squares fit can be solved exactly by considering the transformed dataset (x′,y) with x′ = (f1(x),…,fN(x)), as in the sketch below • More on this later…
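
A sketch of this trick for the polynomial case: map x to the feature vector (1, x, x²) and reuse ordinary linear least squares (data illustrative):

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.1, 1.0, 2.2, 7.1])

# Transformed dataset: x' = (f1(x), ..., fN(x)) with f1 = 1, f2 = x, f3 = x^2.
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # exact linear least-squares fit
```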

  11. Gradient direction is orthogonal to the level sets (contours) of f and points in the direction of steepest increase

  12. Gradient descent: iteratively move in the direction of the negative gradient, θ ← θ − α∇f(θ)

  20. Gradient Descent for Least Squares • Let θ = (θ1,…,θn) • Error: E(θ) = Σi (f(x(i),θ) − y(i))² • Gradient: ∇E(θ) = Σi 2(f(x(i),θ) − y(i)) ∇θ f(x(i),θ) • Update rule: θ ← θ − α∇E(θ) • Here (f(x(i),θ) − y(i)) is the error at example i, and ∇θ f(x(i),θ) is a vector whose element k indicates how quickly the prediction at example i changes with respect to a change in θk
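
A minimal batch-gradient-descent sketch for this objective; alpha and the iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(A, b, alpha=0.01, iters=1000):
    """Batch gradient descent for E(theta) = ||A theta - b||^2."""
    theta = np.zeros(A.shape[1])
    for _ in range(iters):
        residual = A @ theta - b       # f(x(i), theta) - y(i) for every example i
        grad = 2 * A.T @ residual      # sums the per-example gradients
        theta -= alpha * grad          # step against the gradient
    return theta
```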

  22. Gradient Descent Example: Linear Fitting • f(x,θ) = x1θ1 + … + xNθN, so ∇θ f(x(i),θ) = x(i) • Update rule: θ ← θ − α Σi 2x(i)(x(i)∙θ − y(i))

  23. Perceptron (the goal function f is a boolean one) • y = f(x,w) = g(Σi=1,…,n wixi), with g a 0/1 threshold (step) function • In two dimensions the decision boundary is the line w1x1 + w2x2 = 0 [Figure: unit with inputs x1…xn, weights wi, summation Σ, and threshold g; a plot of g(u) vs. u; +/− examples separated by the line]

  24. A single perceptron can learn • A disjunction of boolean literals, e.g., x1 ∨ x2 ∨ x3 • The majority function • XOR? No: XOR is not linearly separable, so a single perceptron cannot represent it [Figure: unit with inputs x1…xn, weights wi, summation Σ, and threshold g]

  26. Perceptron Learning Rule • θ ← θ + α x(i)(y(i) − g(θᵀx(i))) • (g outputs either 0 or 1; y is either 0 or 1) • If the output is correct, the weights are unchanged • If g is 0 but y is 1, the weights are increased in proportion to the attribute values, raising g’s output • If g is 1 but y is 0, the weights are decreased, lowering g’s output • Converges if the data are linearly separable, but oscillates otherwise
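
A sketch of this rule; the OR data and the step size are illustrative choices:

```python
import numpy as np

def g(u):
    return 1.0 if u >= 0 else 0.0      # 0/1 threshold

def perceptron_train(X, y, alpha=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            theta += alpha * xi * (yi - g(theta @ xi))  # no change if prediction correct
    return theta

# Learns OR (a disjunction); on XOR labels it would oscillate, as the slide notes.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # leading 1 = bias
theta = perceptron_train(X, np.array([0.0, 1.0, 1.0, 1.0]))
```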

  27. Perceptron (the goal function f is a boolean one) • y = f(x,w) = g(Σi=1,…,n wixi) • What happens when the examples are not linearly separable? [Figure: +/− examples that no single line separates]

  28. Unit (Neuron) • y = g(Σi=1,…,n wixi) • g(u) = 1/[1 + exp(−au)] (sigmoid) [Figure: unit with inputs x1…xn, weights wi, summation Σ, and activation g]
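
As a sketch, such a unit is a one-liner (the gain a is left as a parameter, matching the slide):

```python
import numpy as np

def unit(x, w, a=1.0):
    """A single neuron: y = g(sum_i w_i x_i), with g(u) = 1 / (1 + exp(-a*u))."""
    return 1.0 / (1.0 + np.exp(-a * np.dot(w, x)))
```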

  29. Neural Network • Network of interconnected neurons • Acyclic (feed-forward) vs. recurrent networks [Figure: two units chained together]

  30. Two-Layer Feed-Forward Neural Network [Figure: inputs feeding a hidden layer via weights w1j, and the hidden layer feeding the output layer via weights w2k]

  31. Networks with hidden layers • Can represent XOR and other nonlinear functions (see the sketch below) • Common neuron types: linear, soft perceptron (sigmoid), radial basis functions • As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features • How to train hidden layers?
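
To see why a hidden layer buys XOR, here is a sketch with hand-picked weights for threshold units (these particular weights are one choice among many):

```python
import numpy as np

def step(u):
    return 1.0 if u >= 0 else 0.0

def xor_net(x1, x2):
    x = np.array([1.0, x1, x2])             # leading 1 is the bias input
    h1 = step(np.array([-0.5, 1, 1]) @ x)   # hidden unit computing OR
    h2 = step(np.array([-1.5, 1, 1]) @ x)   # hidden unit computing AND
    return step(np.array([-0.5, 1, -1]) @ np.array([1.0, h1, h2]))  # OR and not AND

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0.0, 1.0, 1.0, 0.0]
```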

  32. Backpropagation (Principle) • New example: y(k) = f(x(k)) • φ(k) = the network’s prediction computed with weights w(k−1) on inputs x(k) • Error function: E(k)(w(k−1)) = (φ(k) − y(k))² • Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
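
A minimal sketch of one backpropagation update for a two-layer net (sigmoid hidden layer, linear output, squared error on a single example); the shapes and the step size alpha are illustrative assumptions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(W1, w2, x, y, alpha=0.1):
    # Forward pass
    h = sigmoid(W1 @ x)       # hidden activations
    phi = w2 @ h              # network prediction phi(k)
    err = phi - y             # E = err**2
    # Backward pass: last layer first, then the layer below it
    grad_w2 = 2 * err * h
    grad_W1 = 2 * err * np.outer(w2 * h * (1 - h), x)   # sigmoid'(u) = h * (1 - h)
    W1 = W1 - alpha * grad_W1
    w2 = w2 - alpha * grad_w2
    return W1, w2
```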

  33. Stochastic Gradient Descent • Gradient descent uses a batch update: all examples are incorporated in each step • Stochastic gradient descent uses a single example on each step • Update rule: pick an example i (either at random or in order) and a step size α, then set θ ← θ + α x(i)(y(i) − g(x(i),θ)) • This reduces the error on the i’th example… but does it converge?
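
A sketch of this loop for the linear model g(x,θ) = θ∙x; the step size and step count are illustrative:

```python
import numpy as np

def sgd(X, y, alpha=0.01, steps=10000):
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(X))                        # pick example i at random
        theta += alpha * X[i] * (y[i] - X[i] @ theta)   # the update rule above
    return theta
```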

  34. Understanding Backpropagation • Minimize E(θ) • Gradient descent: compute the gradient of E, then take a step proportional to the (negative) gradient [Figure: E(θ) curve with its gradient and a downhill step]

  37. Learning algorithm • Given many examples (x(1),y(1)),…,(x(N),y(N)) and a learning rate α • Init: set k = 1 (or k = rand(1,N)) • Repeat: • Tweak the weights with a backpropagation update on example (x(k), y(k)) • Set k = k+1 (or k = rand(1,N))
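
This loop can be sketched by reusing the hypothetical backprop_step from the earlier sketch; the random placeholder data and initialization scales are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # placeholder examples x(1..N)
Y = rng.normal(size=100)                      # placeholder targets y(1..N)
W1 = rng.normal(scale=0.1, size=(5, 3))       # 5 hidden units
w2 = rng.normal(scale=0.1, size=5)

for epoch in range(20):
    for k in range(len(X)):                   # or: k = rng.integers(len(X))
        W1, w2 = backprop_step(W1, w2, X[k], Y[k], alpha=0.05)
```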

  38. Understanding Backpropagation • An example of stochastic gradient descent • Decompose E(θ) = e1(θ) + e2(θ) + … + eN(θ), where ek = (g(x(k),θ) − y(k))² • On each iteration, take a step to reduce the current ek [Figure: successive steps along the gradients of e1, e2, e3, …]

  44. Stochastic Gradient Descent • The objective function values (measured over all examples) settle into a local minimum over time • The step size must be reduced over time, e.g., O(1/t); one such schedule is sketched below
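
One possible O(1/t) schedule, as a sketch; the constant alpha0 is an arbitrary choice:

```python
def step_size(t, alpha0=0.5):
    """Step size decaying as O(1/t), for t = 1, 2, ..."""
    return alpha0 / t
```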

  45. Caveats • Choosing a convergent learning rate α can be hard in practice [Figure: E(θ) vs. θ]

  46. Example from B553: Image Encoding • Two-layer network: one hidden radial-basis-function layer with 50 neurons (200 parameters) • 1000 training examples [Figure: the fitted NN output beside the original 12x18 image]

  47. Comments and Issues • How to choose the size and structure of networks? • If the network is too large, there is a risk of over-fitting (data caching) • If the network is too small, the representation may not be rich enough • Role of representation: e.g., learning the concept of an odd number

  48. Benefits / Drawbacks of NNs • Benefits • Easy to generate complex nonlinear function classes • Incremental learning via stochastic gradient descent • Good performance on many problems • Predictions evaluated quickly • Drawbacks • Difficult to characterize the hypothesis space • Low interpretability • Relatively slow training, local minima

  49. Performance of Function Learning • Overfitting: too many parameters • Regularization: penalize large parameter values • Minimize E(θ) + λC(θ), where C(θ) measures the cost of a parameter setting (independent of the data) and λ is a regularization parameter • Efficient optimization: if E(θ) is nonconvex, we can only guarantee finding a local minimum • Batch updates are expensive; stochastic updates converge slowly
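
For one common choice, C(θ) = ‖θ‖², the linear least-squares case keeps a closed form (ridge regression); a sketch, with λ as a chosen hyperparameter:

```python
import numpy as np

def ridge(A, b, lam=0.1):
    """Minimize ||A theta - b||^2 + lam * ||theta||^2 in closed form."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```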

  50. Readings • R&N 18.8-9
