Machine Learning, CS 165B, Spring 2012.
Course outline: Introduction (Ch. 1); Concept learning (Ch. 2); Decision trees (Ch. 3); Ensemble learning; Neural Networks (Ch. 4); Linear classifiers; Support Vector Machines; Bayesian Learning (Ch. 6); Bayesian Networks; Clustering.
Midterm on Wednesday
3 coins
C is the count of heads
M = 1 iff all coins match
What kind of classifier is logistic regression?
[Figure: classifier families: Neighbor, Decision Tree, Nonlinear Functions, Linear Functions]
Discriminant Functions: Sometimes, transform the data and then learn a linear function
[Figure: axes f1, f2 and e1, e2]
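Dimensionality reduction: 'spectral decomposition' of the matrix:
$$A = \lambda_1 u_1 v_1^T + \lambda_2 u_2 v_2^T + \dots \quad (r \text{ terms})$$
where A is m x n, each $u_i$ is m x 1, and each $v_i^T$ is 1 x n.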
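The decomposition of an m x n matrix into r rank-one terms can be checked numerically. The sketch below is only an illustration: it assumes NumPy, uses a random matrix as made-up data, and takes the singular values returned by `np.linalg.svd` as the weights of the terms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))            # made-up m x n matrix (m = 5, n = 3)

# Thin SVD: columns of U are the u_i, rows of Vt are the v_i^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.linalg.matrix_rank(A)           # number of nonzero terms

# Each term is (scalar) * (m x 1 column) * (1 x n row), i.e. an m x n rank-one matrix
terms = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r)]
A_rebuilt = sum(terms)

print(np.allclose(A, A_rebuilt))
```

Summing all r terms reproduces A up to floating-point error; dropping trailing terms gives a lower-rank approximation.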
Dimensionality reduction: approximation / dim. reduction by keeping the first few terms (how many?):
$$A \approx \lambda_1 u_1 v_1^T + \lambda_2 u_2 v_2^T + \dots$$
where A is m x n; assume: $\lambda_1 \ge \lambda_2 \ge \dots$
Dimensionality reduction: a heuristic: keep 80-90% of 'energy' (= sum of squares of the $\lambda_i$'s)
$$A \approx \lambda_1 u_1 v_1^T + \lambda_2 u_2 v_2^T + \dots$$
where A is m x n; assume: $\lambda_1 \ge \lambda_2 \ge \dots$
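One way the energy heuristic could be turned into a rule for choosing the number of terms is sketched below. The function name `rank_for_energy`, the 0.9 default, and the example values are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def rank_for_energy(singular_values, fraction=0.9):
    """Smallest k such that the first k squared values hold `fraction` of the energy.

    Assumes the values are already sorted in decreasing order.
    """
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, energy.size)         # guard against round-off at fraction = 1.0

# Made-up values: energies are 100, 25, 1, 0.25 (total 126.25)
s = np.array([10.0, 5.0, 1.0, 0.5])
k = rank_for_energy(s, 0.9)
print(k)  # 2, since the first term alone holds only about 79% of the energy
```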
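Frobenius norm:
$$\|A\|_F = \sqrt{\sum_{i,j} A[i,j]^2} = \sqrt{\sum_i \lambda_i^2}$$
Optimality of SVD. Theorem: [Eckart and Young] Among all m x n matrices B of rank at most k, we have that:
$$\|A - A_k\|_2 \le \|A - B\|_2$$
$$\|A - A_k\|_F \le \|A - B\|_F$$
where $A_k$ is the sum of the first k terms of the decomposition.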
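The optimality statement can be spot-checked numerically. The sketch below assumes NumPy and compares the rank-k truncation of a random matrix against one arbitrary rank-k competitor B; it illustrates the inequality for that single B and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))            # made-up data
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # first k terms of the decomposition

# Frobenius norm of A equals sqrt(sum of squared singular values)
fro_A = np.linalg.norm(A, "fro")

# Truncation errors: sqrt of the discarded energy (Frobenius), s[k] (2-norm)
err_fro = np.linalg.norm(A - A_k, "fro")
err_2 = np.linalg.norm(A - A_k, 2)

# An arbitrary competitor of rank at most k
B = rng.normal(size=(6, k)) @ rng.normal(size=(k, 4))
print(err_fro <= np.linalg.norm(A - B, "fro"), err_2 <= np.linalg.norm(A - B, 2))
```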
When projected onto the line joining the class means, the classes are not well separated.
Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.
Find the best direction w for accurate classification.
A measure of the separation between the projected points is the difference of the sample means.
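If $m_i$ is the d-dimensional sample mean from $D_i$ (with $n_i$ samples), given by
$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x,$$
the sample mean of the projected points $Y_i$ is given by
$$\tilde{m}_i = \frac{1}{n_i} \sum_{y \in Y_i} y = \frac{1}{n_i} \sum_{x \in D_i} w^T x = w^T m_i,$$
and the difference of the projected sample means is:
$$|\tilde{m}_1 - \tilde{m}_2| = |w^T (m_1 - m_2)|$$
Define scatter for the projection:
$$\tilde{s}_i^2 = \sum_{y \in Y_i} (y - \tilde{m}_i)^2$$
Choose w in order to maximize
$$J(w) = \frac{|\tilde{m}_1 - \tilde{m}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
where $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the total within-class scatter.
Define scatter matrices $S_i$ (i = 1, 2) and $S_w$ by
$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T, \qquad S_w = S_1 + S_2$$
A vector w that maximizes J(w) must satisfy
$$S_B w = \lambda S_w w, \qquad S_B = (m_1 - m_2)(m_1 - m_2)^T$$
for some scalar $\lambda$. In the case that $S_w$ is nonsingular,
$$w = S_w^{-1} (m_1 - m_2)$$
up to an arbitrary scale factor.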
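A minimal NumPy sketch of the Fisher direction follows, assuming the standard closed form in which w is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The function names and the synthetic Gaussian classes are assumptions made for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector proportional to Sw^{-1} (m1 - m2); assumes Sw is nonsingular."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)       # scatter matrix of class 1
    S2 = (X2 - m2).T @ (X2 - m2)       # scatter matrix of class 2
    w = np.linalg.solve(S1 + S2, m1 - m2)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w): squared gap of projected means over total within-class scatter."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return between / within

# Made-up data: two elongated, strongly correlated classes
rng = np.random.default_rng(0)
cov = np.array([[4.0, 3.5], [3.5, 4.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fisher = fisher_direction(X1, X2)
w_means = X1.mean(axis=0) - X2.mean(axis=0)   # line joining the class means
w_means = w_means / np.linalg.norm(w_means)

print(fisher_criterion(w_fisher, X1, X2), fisher_criterion(w_means, X1, X2))
```

On data like this, the criterion along the Fisher direction is at least as large as along the line joining the means, which matches the remark that tighter projected classes can matter more than widely separated projected means.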
Map from x to z using nonlinear basis functions and use a linear discriminant in z-space
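As an illustration of the idea, the sketch below uses a hypothetical basis map `phi` that appends the product x1*x2 to the inputs. The XOR-style labels cannot be separated by a line in x-space, but a hand-picked linear discriminant separates them in z-space. The map, the weights, and the data are all assumptions chosen for the example.

```python
import numpy as np

def phi(X):
    """Hypothetical basis map: (x1, x2) -> z = (x1, x2, x1 * x2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# XOR-style data: the label is the sign of x1 * x2
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([1, -1, -1, 1])

Z = phi(X)
w = np.array([0.0, 0.0, 1.0])          # hand-picked weights in z-space
b = 0.0
pred = np.sign(Z @ w + b)              # linear discriminant in z-space

print(pred)
```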
w is orthogonal to the decision surface
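$w_0 = b$, so the discriminant is $g(x) = w^T x + b$
D = distance of decision surface from origin
Consider any point x on the decision surface. Then $D = w^T x / \|w\| = -b / \|w\|$
d(x) = distance of x from decision surface
$x = x_p + d(x)\, w / \|w\|$, where $x_p$ is the projection of x onto the decision surface
$w^T x + b = w^T x_p + d(x)\, w^T w / \|w\| + b$
$g(x) = (w^T x_p + b) + d(x)\, \|w\|$, and $w^T x_p + b = 0$ because $x_p$ lies on the decision surface
$d(x) = g(x) / \|w\| = w^T x / \|w\| - D$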
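The distance formulas can be verified on a small worked case. The sketch below assumes NumPy; the weight vector, bias, and test point are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def signed_distance(x, w, b):
    """d(x) = g(x) / ||w||, with g(x) = w^T x + b."""
    return (w @ x + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])               # ||w|| = 5
b = -10.0
D = -b / np.linalg.norm(w)             # distance of the surface from the origin: 2.0

x = np.array([6.0, 8.0])
d = signed_distance(x, w, b)           # (18 + 32 - 10) / 5 = 8.0

# Projection of x onto the decision surface: step back d units along w / ||w||
x_p = x - d * w / np.linalg.norm(w)    # (1.2, 1.6)

print(D, d, x_p, w @ x_p + b)
```

The last printed value is g(x_p), which is zero up to round-off, confirming that x_p lies on the decision surface.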