
More Classifiers


Presentation Transcript


  1. More Classifiers

  2. Agenda • Key concepts for all classifiers • Precision vs recall • Biased sample sets • Linear classifiers • Intro to neural networks

  3. Recap: Decision Boundaries • With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples [Figure: decision tree with tests x1>=20, x2>=10, x2>=15 and the corresponding axis-aligned regions in the (x1, x2) plane]

  4. Beyond Error Rates

  5. Beyond Error Rate • Predicting security risk • Predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not 5 million of them) • Searching for images • Returning irrelevant images is worse than omitting relevant ones

  6. Biased Sample Sets • Often there are orders of magnitude more negative examples than positive • E.g., all images of Kris on Facebook • If I classify all images as “not Kris” I’ll have >99.99% accuracy • Examples of Kris should count much more than non-Kris!
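
A quick numerical check of the accuracy claim on this slide; the counts are made up for illustration (roughly one image of Kris per 100,000 images):

```python
# Illustration of the accuracy trap on a biased sample set.
# Counts are hypothetical: 1 positive (Kris) among 100,000 images.
n_negative, n_positive = 99_999, 1

# A classifier that labels every image "not Kris" is right on all negatives
# and wrong on the single positive.
accuracy = n_negative / (n_negative + n_positive)
print(accuracy)  # 0.99999 -> >99.99% accuracy, while it finds no Kris images at all
```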

  7. False Positives [Figure: true decision boundary vs. learned decision boundary in the (x1, x2) plane]

  8. False Positives • An example incorrectly predicted to be positive [Figure: a new query lying between the learned and true decision boundaries in the (x1, x2) plane]

  9. False Negatives • An example incorrectly predicted to be negative [Figure: a new query lying between the learned and true decision boundaries in the (x1, x2) plane]

  10. Precision vs. Recall • Precision • # of relevant documents retrieved / # of total documents retrieved • Recall • # of relevant documents retrieved / # of total relevant documents • Numbers between 0 and 1

  11. Precision vs. Recall • Precision • # of true positives / (# true positives + # false positives) • Recall • # of true positives / (# true positives + # false negatives) • A precise classifier is selective • A classifier with high recall is inclusive
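
A minimal sketch of the two definitions above, computing precision and recall from true and predicted labels (the names y_true and y_pred are illustrative, not from the slides):

```python
def precision_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # (0.666..., 0.666...)
```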

  12. Reducing False Positive Rate [Figure: true decision boundary vs. learned decision boundary in the (x1, x2) plane]

  13. Reducing False Negative Rate [Figure: true decision boundary vs. learned decision boundary in the (x1, x2) plane]

  14. Precision-Recall curves • Measure Precision vs Recall as the decision boundary is tuned [Figure: precision–recall plot showing a perfect classifier at the corner and actual performance as a curve below it]

  15. Precision-Recall curves • Measure Precision vs Recall as the decision boundary is tuned [Figure: points along the curve labeled “penalize false negatives”, “equal weight”, and “penalize false positives”]

  16. Precision-Recall curves • Measure Precision vs Recall as the decision boundary is tuned [Figure: precision–recall curve]

  17. Precision-Recall curves • Measure Precision vs Recall as the decision boundary is tuned [Figure: two precision–recall curves; the curve closer to the corner reflects better learning performance]

  18. Option 1: Classification Thresholds • Many learning algorithms (e.g., probabilistic models, linear models) give real-valued output v(x) that needs thresholding for classification • v(x) > t => positive label given to x • v(x) < t => negative label given to x • May want to tune threshold to get fewer false positives or false negatives
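
A minimal sketch of threshold tuning, assuming real-valued scores v(x) are already available; the scores and labels below are made up for illustration:

```python
def classify(scores, t):
    # v(x) > t -> positive (1), otherwise negative (0)
    return [1 if v > t else 0 for v in scores]

scores = [0.92, 0.70, 0.45, 0.15]   # hypothetical v(x) values
labels = [1, 1, 0, 0]               # hypothetical true labels

# Raising t makes the classifier more selective (fewer false positives);
# lowering t makes it more inclusive (fewer false negatives).
for t in (0.1, 0.5, 0.8):
    print(t, classify(scores, t))
```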

  19. Option 2: Weighted datasets • Weighted datasets: attach a weight w to each example to indicate how important it is • Instead of counting “# of errors”, count “sum of weights of errors” • Or construct a resampled dataset D’ where each example is duplicated in proportion to its weight w • As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out
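
A minimal sketch of both ideas on this slide: summing the weights of errors instead of counting them, and building a resampled dataset D’ by duplicating examples in proportion to their weights. All names and the scale factor are illustrative assumptions:

```python
def weighted_error(examples, weights, predict):
    # Sum the weights of misclassified examples rather than counting them.
    return sum(w for (x, y), w in zip(examples, weights) if predict(x) != y)

def resample(examples, weights, scale=10):
    # Duplicate each example roughly in proportion to its weight.
    d_prime = []
    for (x, y), w in zip(examples, weights):
        d_prime.extend([(x, y)] * round(w * scale))
    return d_prime

examples = [((1.0, 2.0), 1), ((0.5, 0.2), 0)]
weights = [0.9, 0.1]   # the positive example counts 9x as much here
print(weighted_error(examples, weights, predict=lambda x: 0))  # 0.9
print(len(resample(examples, weights)))                        # 10
```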

  20. Linear classifiers: Motivation • Decision trees produce axis-aligned decision boundaries • Can we accurately classify data like this? [Figure: examples in the (x1, x2) plane that are not cleanly separable by axis-aligned splits]

  21. Plane Geometry • Any line in 2D can be expressed as the set of solutions (x,y) to the equation ax+by+c=0 (an implicit surface) • ax+by+c > 0 is one side of the line • ax+by+c < 0 is the other • ax+by+c = 0 is the line itself [Figure: a line in the (x, y) plane]

  22. Plane Geometry • In 3D, a plane can be expressed as the set of solutions (x,y,z) to the equation ax+by+cz+d=0 • ax+by+cz+d > 0 is one side of the plane • ax+by+cz+d < 0 is the other side • ax+by+cz+d = 0 is the plane itself [Figure: a plane in (x, y, z) space]

  23. Linear Classifier • In d dimensions, c0 + c1 x1 + … + cd xd = 0 is a hyperplane • Idea: • Use c0 + c1 x1 + … + cd xd > 0 to denote positive classifications • Use c0 + c1 x1 + … + cd xd < 0 to denote negative classifications
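
A minimal sketch of this rule: classify by the sign of c0 + c1·x1 + … + cd·xd. The coefficients and the query point below are made up for illustration:

```python
def linear_classify(c, x):
    # c = [c0, c1, ..., cd] (c0 is the offset), x = [x1, ..., xd]
    value = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return "positive" if value > 0 else "negative"

# Hypothetical hyperplane -30 + x1 + x2 = 0 and a query point (20, 15).
print(linear_classify([-30.0, 1.0, 1.0], [20.0, 15.0]))  # positive (value = 5)
```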

  24. Perceptron • y = f(x, w) = g(Σi=1,…,n wi xi) • Decision boundary in two dimensions: w1 x1 + w2 x2 = 0 [Figure: unit with inputs x1 … xn, weights wi, a summation node, and step activation g(u); positive and negative examples separated by the line w1 x1 + w2 x2 = 0]

  25. A Single Perceptron can learn • A disjunction of boolean literals • The majority function [Figure: perceptron unit with inputs x1 … xn, weights wi, summation, and activation g]

  26. A Single Perceptron can learn • A disjunction of boolean literals • The majority function • XOR? [Figure: perceptron unit with inputs x1 … xn, weights wi, summation, and activation g]

  27. Perceptron Learning Rule • θ ← θ + α x(i) (y(i) − g(θT x(i))) • (g outputs either 0 or 1, y is either 0 or 1, α is the learning rate) • If the output is correct, the weights are unchanged • If g is 0 but y is 1, the weight on attribute i is increased • If g is 1 but y is 0, the weight on attribute i is decreased • Converges if the data is linearly separable, but oscillates otherwise
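
A minimal sketch of this update with a 0/1 step function g; the learning rate, the toy data, and the bias handling (a constant 1 feature) are illustrative assumptions:

```python
def g(u):
    # Step activation: outputs 0 or 1.
    return 1 if u > 0 else 0

def perceptron_update(theta, x, y, alpha=0.1):
    # theta <- theta + alpha * x * (y - g(theta . x))
    pred = g(sum(t * xi for t, xi in zip(theta, x)))
    return [t + alpha * xi * (y - pred) for t, xi in zip(theta, x)]

# One pass over an AND-like dataset; the first feature is a constant 1 (bias).
theta = [0.0, 0.0, 0.0]
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
for x, y in data:
    theta = perceptron_update(theta, x, y)
print(theta)
```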

  28. Perceptron • y = f(x, w) = g(Σi=1,…,n wi xi) [Figure: positive and negative examples arranged so that no single line separates them; which boundary should the perceptron learn?]

  29. Unit (Neuron) • y = g(Σi=1,…,n wi xi) • g(u) = 1 / [1 + exp(−a u)] [Figure: unit with inputs x1 … xn, weights wi, summation, and sigmoid activation g]
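
A minimal sketch of this sigmoid unit; the weights, inputs, and slope a below are illustrative:

```python
import math

def sigmoid_unit(w, x, a=1.0):
    # y = g(sum_i w_i x_i) with g(u) = 1 / (1 + exp(-a u))
    u = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a * u))

print(sigmoid_unit([0.5, -0.3], [1.0, 2.0]))  # ~0.475
```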

  30. Neural Network • Network of interconnected neurons • Acyclic (feed-forward) vs. recurrent networks [Figure: two units, each with inputs x1 … xn, weights wi, summation, and activation g, connected together]

  31. Two-Layer Feed-Forward Neural Network [Figure: inputs feeding a hidden layer through weights w1j, and the hidden layer feeding the output layer through weights w2k]

  32. Networks with hidden layers • Can represent XORs, other nonlinear functions • Common neuron types: • Soft perceptron (sigmoid), radial basis functions, linear, … • As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features • How to train hidden layers?
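
A minimal sketch of the XOR claim above: two step-function hidden units plus one output unit compute XOR, which a single perceptron cannot. The specific weights are one hand-picked solution, not taken from the slides:

```python
def step(u):
    return 1 if u > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit ~ OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit ~ AND
    return step(h1 - h2 - 0.5)  # output ~ OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))  # 0 0->0, 0 1->1, 1 0->1, 1 1->0
```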

  33. Backpropagation (Principle) • Treat the problem as one of minimizing errors between the example label and the network output, given the example and network weights as input • Error(xi, yi, w) = (yi − f(xi, w))² • Sum this error term over all examples • E(w) = Σi Error(xi, yi, w) = Σi (yi − f(xi, w))² • Minimize errors using an optimization algorithm • Stochastic gradient descent is typically used
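
A minimal sketch of this objective, taking f to be a single sigmoid unit purely for illustration (the network, data, and weights are assumptions):

```python
import math

def f(x, w):
    # Stand-in network: a single sigmoid unit.
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def E(w, data):
    # E(w) = sum_i (y_i - f(x_i, w))^2
    return sum((y - f(x, w)) ** 2 for x, y in data)

data = [([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
print(E([0.2, -0.1], data))
```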

  34–35. Gradient direction is orthogonal to the level sets (contours) of E and points in the direction of steepest increase [Figure: contours of E with the gradient drawn perpendicular to them]

  36–42. Gradient descent: iteratively move in the direction of the negative gradient, w ← w − α ∇E(w) [Figure: successive steps descending the contours of E toward a minimum]

  43. Stochastic Gradient Descent • For each example (xi,yi), take a gradient descent step to reduce the error for (xi,yi) only.

  44. Stochastic Gradient Descent • Objective function values (measured over all examples) settle into a local minimum over time • Step size must be reduced over time, e.g., O(1/t)
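
A minimal sketch of SGD with an O(1/t) step size, again using a single sigmoid unit and made-up data as a stand-in network; the derivative used is d/dwi (y − f)² = −2 (y − f) f (1 − f) xi for the sigmoid f:

```python
import math

def f(x, w):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def sgd(data, w, epochs=200):
    t = 1
    for _ in range(epochs):
        for x, y in data:           # one gradient step per example
            out = f(x, w)
            grad = [-2 * (y - out) * out * (1 - out) * xi for xi in x]
            step = 1.0 / t          # step size reduced over time, O(1/t)
            w = [wi - step * gi for wi, gi in zip(w, grad)]
            t += 1
    return w

data = [([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
print(sgd(data, [0.0, 0.0]))
```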

  45. Neural Networks: Pros and Cons • Pros • Bioinspiration is nifty • Can represent a wide variety of decision boundaries • Complexity is easily tunable (number of hidden nodes, topology) • Easily extendable to regression tasks • Cons • Haven’t gotten close to unlocking the power of the human (or cat) brain • Complex boundaries need lots of data • Slow training • Mostly lukewarm feelings in mainstream ML (although the “deep learning” variant is en vogue now)

  46. Next Class • Another guest lecture
