
ICS 178: Introduction to Machine Learning & Data Mining



Presentation Transcript


  1. ICS 178: Introduction to Machine Learning & Data Mining. Instructor: Max Welling. Lecture 6: Logistic Regression

  2. Logistic Regression • This is also regression, but with targets Y ∈ {0, 1}; i.e., it is classification! • We will fit a regression function to P(Y=1|X). (Figure: linear regression vs. logistic regression.)
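The two models contrasted on this slide appear only as figure labels in the transcript. A plausible reconstruction of what was shown, using the parameters A, b from the later slides (our notation, not verbatim from the lecture):

```latex
\text{linear regression:}\quad y(x) = A^{\top}x + b
\qquad\qquad
\text{logistic regression:}\quad P(Y=1 \mid x) = \sigma\!\left(A^{\top}x + b\right)
= \frac{1}{1 + e^{-(A^{\top}x + b)}}
```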

  3. Sigmoid function f(x). (Figure: the sigmoid f(x) plotted together with data-points having Y=1 and data-points having Y=0.)
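A minimal NumPy sketch of the sigmoid f(x) shown on this slide (the function name `sigmoid` is ours, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function f(z) = 1 / (1 + exp(-z)).
    Squashes any real input into (0, 1), so the output can be read as P(Y=1|x)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs land near 0 (the Y=0 side), large positive inputs near 1 (the Y=1 side).
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approx. [0.0067, 0.5, 0.9933]
```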

  4. In 2 Dimensions • A, b determine: 1) the orientation, 2) the thickness (margin), and 3) the offset of the decision surface. (Figure: the sigmoid f(x) over a two-dimensional input space.)
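To make the slide's geometric reading explicit (again in our A, b notation): the decision surface is the set of points where the linear score is zero, the direction of A fixes its orientation, the magnitude of A controls how quickly the sigmoid rises across it (the thickness or margin), and b shifts the surface away from the origin.

```latex
\text{decision surface:}\quad \{\,x : A^{\top}x + b = 0\,\},
\qquad
P(Y=1 \mid x) = \tfrac{1}{2}\ \text{exactly on this surface.}
```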

  5. Cost Function • We want a different error measure that is better suited for 0/1 data. • This can be derived from maximizing the probability of the data again.
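The error measure itself was shown on the board rather than in the transcript. The standard choice obtained by maximizing the probability of the data is the negative log-likelihood (cross-entropy), which for p_n = P(Y=1|x_n) reads:

```latex
E(A, b) = -\sum_{n=1}^{N}\Big[\, Y_n \log p_n + (1 - Y_n)\log(1 - p_n) \,\Big],
\qquad p_n = \sigma\!\left(A^{\top}x_n + b\right)
```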

  6. Learning A, b • Again, we take the derivatives of the error with respect to the parameters. • This time, however, we can't solve them analytically, so we use gradient descent.

  7. Gradients for Logistic Regression • After the math (worked out on the whiteboard) we find the gradients, reconstructed below. Note: the first term in each equation (multiplied by Y) only sums over data with Y=1, while the second term (multiplied by (1-Y)) only sums over data with Y=0. Follow the gradient until the change in A, b falls below a small threshold (e.g. 1e-6).
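The gradient equations referred to here were derived on the whiteboard and are not in the transcript. Assuming the cross-entropy error above, the gradients have exactly the two-term structure the note describes (a Y-weighted term over the Y=1 data and a (1-Y)-weighted term over the Y=0 data) and simplify to:

```latex
\frac{\partial E}{\partial A}
= -\sum_{n}\Big[\, Y_n (1 - p_n)\,x_n \;-\; (1 - Y_n)\, p_n\, x_n \,\Big]
= -\sum_{n} (Y_n - p_n)\, x_n,
\qquad
\frac{\partial E}{\partial b} = -\sum_{n} (Y_n - p_n)
```

A minimal batch gradient-descent sketch using these gradients; the learning rate, stopping threshold, and iteration cap are illustrative defaults, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, Y, lr=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient descent for logistic regression.
    X: (N, D) array of inputs; Y: (N,) array of labels in {0, 1}."""
    D = X.shape[1]
    A, b = np.zeros(D), 0.0
    for _ in range(max_iter):
        p = sigmoid(X @ A + b)       # p_n = P(Y=1 | x_n) under the current A, b
        grad_A = -(Y - p) @ X        # dE/dA = -sum_n (Y_n - p_n) x_n
        grad_b = -np.sum(Y - p)      # dE/db = -sum_n (Y_n - p_n)
        A_new, b_new = A - lr * grad_A, b - lr * grad_b
        # stop once the change in A, b falls below a small threshold (e.g. 1e-6)
        if max(np.max(np.abs(A_new - A)), abs(b_new - b)) < tol:
            A, b = A_new, b_new
            break
        A, b = A_new, b_new
    return A, b
```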

  8. Classification • Once we have found the optimal values for A, b, we classify future data with the rule reconstructed below. • Least squares and logistic regression are parametric methods, since all the information in the data is stored in the parameters A, b; i.e., after learning you can toss out the data. • Also, the decision surface is always linear; its complexity does not grow with the amount of data. • We have imposed our prior knowledge that the decision surface should be linear.
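The classification rule on the slide was an equation image; the usual rule for this model is to threshold the predicted probability at 1/2, which is equivalent to checking the sign of the linear score:

```latex
\hat{Y}(x) =
\begin{cases}
1 & \text{if } \sigma\!\left(A^{\top}x + b\right) \geq \tfrac{1}{2}
  \;\Longleftrightarrow\; A^{\top}x + b \geq 0,\\[2pt]
0 & \text{otherwise.}
\end{cases}
```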

  9. A Real Example (collaboration with S. Cole) • Fingerprints are matched against a database. • Each match is scored. • Using logistic regression we try to predict whether a future match is real or false. • Human fingerprint examiners claim 100% accuracy. Is this true?

  10. Exercise • You have laid your hands on a dataset where each data-case has a single attribute and a class label (0 or 1). You train a logistic regression classifier. • A new data-case is presented. What do you do to decide which class it falls in (use an equation or pseudo-code)? • How many parameters are there to tune for this problem? • Explain what these parameters mean in terms of the function P(Y=1|X).
