
We have a TA!


Presentation Transcript


  1. We have a TA! • Li-Lun Wang • Contact information to follow • CS446-Fall 06

  2. Linear Discriminators • I don’t know {whether, weather} to laugh or cry • How can we make this a learning problem? • We will look for a function F: Sentences → {whether, weather} (F is partial; don’t care ≠ …) • We need to define the domain of this function better. • An option: for each word w in English, define a Boolean feature xw: [xw = 1] iff w is in the sentence • This maps a sentence to a point in {0,1}^50,000 • In this space: some points are whether points, some are weather points • Learning protocol? Supervised? Unsupervised?
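
A minimal sketch of the Boolean feature mapping described on slide 2, assuming a toy vocabulary (a real setup would use on the order of 50,000 English words); the vocabulary and sentence below are illustrative only.

```python
# Sketch of slide 2's feature map: x_w = 1 iff word w appears in the sentence,
# giving a point in {0,1}^|vocabulary|. The vocabulary here is a toy stand-in
# for the ~50,000-word English vocabulary mentioned on the slide.
vocabulary = ["i", "don't", "know", "whether", "weather", "to", "laugh", "or", "cry", "rain"]

def sentence_to_features(sentence, vocabulary):
    """Map a sentence to its Boolean bag-of-words vector."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

x = sentence_to_features("I don't know whether to laugh or cry", vocabulary)
print(x)  # [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
```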

  3. What’s Good? • Learning problem: Find a function that best separates the data • What function? • What’s best? • How to find it? • A possibility: Define the learning problem to be: Find a (linear) function that best separates the data
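
To make "a (linear) function that separates the data" concrete, here is a hedged sketch of a linear threshold classifier; the weights and bias are arbitrary placeholders, not values learned from data.

```python
# A linear discriminator: classify by the sign of a weighted sum plus a bias.
# The weights and bias below are arbitrary placeholders, not learned values.
def linear_discriminator(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

w = [0.5, -1.0, 2.0]
b = -0.25
print(linear_discriminator([1, 0, 1], w, b))  # 1: the point falls on the positive side
```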

  4. Linear Separability • Are all functions Linearly Separable? • Is this function Linearly Separable? • Does it matter?

  5. Exclusive-OR (XOR) • (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2) • In general: a parity function. • xi ∈ {0,1} • f(x1, x2, …, xn) = 1 iff Σ xi is even • This function is not linearly separable.
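
A small brute-force illustration of the claim on slide 5 that no linear threshold function reproduces two-input parity. The grid of candidate weights and thresholds is an arbitrary choice for this sketch; it is a demonstration, not a proof.

```python
# Check every linear threshold w1*x1 + w2*x2 >= t on a coarse grid against the
# two-input parity labels f(x1, x2) = 1 iff x1 + x2 is even (slide 5).
import itertools

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = {p: 1 if sum(p) % 2 == 0 else 0 for p in points}

grid = [i / 4 for i in range(-8, 9)]   # candidate weights and thresholds in [-2, 2]
separable = any(
    all((w1 * x1 + w2 * x2 >= t) == bool(labels[(x1, x2)]) for x1, x2 in points)
    for w1, w2, t in itertools.product(grid, repeat=3)
)
print(separable)  # False: no candidate on this grid matches the parity labels
```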

  6. Sometimes Functions Can Be Made Linear • Original space: X = (x1, x2, …, xn) • Discriminator in X: x1 x2 x4 ∨ x2 x4 x5 ∨ x1 x3 x7 • Input transformation to a new space: Y = {y1, y2, …} = {xi, xi xj, xi xj xk} • New discriminator: y3 ∨ y4 ∨ y7 – functionally simpler
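
A hedged sketch of the transformation on slide 6: expanding the input into monomial features so that a DNF discriminator becomes a simple disjunction of new features. The index convention and the example input are illustrative.

```python
# Expand x in {0,1}^n into all monomials x_i, x_i*x_j, x_i*x_j*x_k (slide 6's
# Y space). In that space a target like x1x2x4 OR x2x4x5 OR x1x3x7 is just a
# disjunction of three of the new features, which a linear threshold can express.
from itertools import combinations

def expand(x):
    """Return a dict mapping each index tuple to the value of its monomial."""
    feats = {}
    for degree in (1, 2, 3):
        for idx in combinations(range(len(x)), degree):
            val = 1
            for i in idx:
                val *= x[i]
            feats[idx] = val
    return feats

x = [1, 1, 0, 1, 0, 0, 0]     # x1 = x2 = x4 = 1 (1-based indexing as on the slide)
y = expand(x)
print(y[(0, 1, 3)])           # 1: the monomial x1*x2*x4 fires in the new space
```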

  7. Feature Space • Data are not separable in one dimension • Not separable if we insist on using a specific class of functions

  8. Expanded Feature Space • Data are separable in ⟨x, x²⟩ space • Note: no new information (nonlinear transform) • Key issue: what features to use • Computationally, this can be done implicitly (kernels)
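
A one-dimensional illustration of slides 7-8, with made-up data: points that a single threshold on x cannot separate become separable after the nonlinear map x → (x, x²).

```python
# Class-1 points sit in the middle of the line with class-0 points on both
# sides, so no single threshold on x separates them. After x -> (x, x**2),
# the rule x**2 < 1 (a linear separator in the new space) classifies all of them.
xs     = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [0,     0,    1,   1,   1,   0,   0]

expanded = [(x, x * x) for x in xs]
preds = [1 if x2 < 1.0 else 0 for _, x2 in expanded]
print(preds == labels)  # True
```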

  9. Self-organize into Groups of 4 or 5 • Assignment 1 • The Badges Game… • Prediction or Modeling? • Representation - features • Background Knowledge • When did learning take place? • Learning Protocol? • What is the problem? • Algorithms

  10. A General Framework for Learning • Goal: predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X • Estimate a functional relationship y ≈ f(x) from a set {(x, y)i}, i = 1, …, n • Most relevant - Classification: y ∈ {0,1} (or y ∈ {1, 2, …, k}) • (But within the same framework we can also talk about Regression, y ∈ ℝ) • What do we want f(x) to satisfy? • We want to minimize the Loss (Risk): L(f()) = E_{X,Y}( [f(x) ≠ y] ) • Where E_{X,Y} denotes the expectation with respect to the true distribution. Simply: the number of mistakes; […] is an indicator function
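
To make the risk L(f()) = E_{X,Y}( [f(x) ≠ y] ) concrete, here is a sketch that evaluates it exactly for a tiny, made-up discrete distribution; in practice the true distribution is unknown, which is the point of the next slide.

```python
# The risk is the total probability mass of (x, y) pairs on which f errs.
# The distribution and the classifier below are illustrative only.
dist = {                      # P(x, y) for a one-bit input and binary label
    ((0,), 0): 0.500,
    ((0,), 1): 0.125,
    ((1,), 0): 0.125,
    ((1,), 1): 0.250,
}

def f(x):                     # a fixed classifier: predict y = x
    return x[0]

risk = sum(p for (x, y), p in dist.items() if f(x) != y)
print(risk)                   # 0.25: the expected value of the 0-1 indicator loss
```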

  11. A General Framework for Learning (II) • We want to minimize the Loss: L(f()) = E_{X,Y}( [f(X) ≠ Y] ) • Where E_{X,Y} denotes the expectation with respect to the true distribution. • We cannot do that. Why not? • Instead, we try to minimize the empirical classification error. • For a set of training examples {(Xi, Yi)}, i = 1, …, n, try to minimize the observed loss • (Issue I: when is this good enough? Not now) • This minimization problem is typically NP-hard. • To alleviate this computational problem, minimize a new function - a convex upper bound of the classification error function • I(f(x), y) = [f(x) ≠ y] = {1 when f(x) ≠ y; 0 otherwise}
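
Slide 11 does not name a particular convex upper bound; as an illustration, the sketch below uses the hinge loss (with labels in {−1, +1}), which bounds the 0-1 error pointwise, so minimizing it also pushes down the empirical classification error. The scores and labels are made up.

```python
# Empirical 0-1 error versus one standard convex surrogate (hinge loss).
# The specific surrogate is an assumption; the slide only asks for some
# convex upper bound of the classification error.
def zero_one(score, y):
    return 0.0 if score * y > 0 else 1.0        # a mistake iff sign(score) != y

def hinge(score, y):
    return max(0.0, 1.0 - y * score)            # convex and >= zero_one pointwise

scores = [2.0, 0.5, -0.5, 1.0]
labels = [1, 1, 1, -1]

emp_error = sum(zero_one(s, y) for s, y in zip(scores, labels)) / len(labels)
surrogate = sum(hinge(s, y) for s, y in zip(scores, labels)) / len(labels)
print(emp_error, surrogate)   # 0.5 1.0 -- the surrogate upper-bounds the error
```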

  12. Learning as an Optimization Problem • A loss function L(f(x), y) measures the penalty incurred by a classifier f on example (x, y). • There are many different loss functions one could define: • Misclassification error: L(f(x), y) = 0 if f(x) = y; 1 otherwise • Squared loss: L(f(x), y) = (f(x) − y)² • Input-dependent loss: L(f(x), y) = 0 if f(x) = y; c(x) otherwise. • A continuous convex loss function also allows a conceptually simple optimization algorithm.
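
The three example losses from slide 12, written out directly; c is left as a caller-supplied, input-dependent penalty, as on the slide, and the example values are illustrative.

```python
# The three loss functions listed on slide 12.
def misclassification_loss(fx, y):
    return 0 if fx == y else 1

def squared_loss(fx, y):
    return (fx - y) ** 2

def input_dependent_loss(fx, y, x, c):
    return 0 if fx == y else c(x)   # c(x) is a caller-supplied penalty

print(misclassification_loss(1, 0))                        # 1
print(squared_loss(3, 1))                                  # 4
print(input_dependent_loss(1, 0, [1, 0], lambda x: 5.0))   # 5.0
```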

  13. How to Learn? • Local search: • Start with a linear threshold function. • See how well you are doing. • Correct it. • Repeat until you converge. • There are other ways that do not search directly in the hypothesis space • Directly compute the hypothesis?
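
A minimal sketch of the local-search recipe on slide 13, assuming a perceptron-style additive correction; the slide itself only says "correct", so that specific update rule (and the toy data) is an illustrative choice, not the course's prescribed algorithm.

```python
# Local search over linear threshold functions: start somewhere, check the
# data, correct on mistakes, repeat until convergence (or an epoch cap).
def train(data, n_features, epochs=20):
    w = [0.0] * n_features            # start with a linear threshold function
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:             # y in {0, 1}
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
            if pred != y:             # see how well you are doing; correct on error
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
                b += sign
                mistakes += 1
        if mistakes == 0:             # converged: every example classified correctly
            break
    return w, b

data = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]
print(train(data, n_features=2))      # e.g. ([1.0, 0.0], -1.0) on this toy data
```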
