Machine Learning CS 165B, Spring 2012. Course outline: Introduction (Ch. 1), Concept learning (Ch. 2), Decision trees (Ch. 3), Ensemble learning, Neural Networks (Ch. 4), Linear classifiers, Support Vector Machines, Bayesian Learning (Ch. 6), Bayesian Networks, Clustering.
Midterm on Wednesday
Example: toss 3 coins. Let C = the count of heads, and M = 1 iff all coins match.
What kind of classifier is logistic regression?
[Figure: classifier taxonomy: nonlinear functions (e.g. nearest neighbor, decision trees) vs. linear functions]
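The answer the slide is driving at: logistic regression is a linear classifier, because the decision boundary sigmoid(wTx + b) = 0.5 is exactly the hyperplane wTx + b = 0. A minimal sketch (the weights w, b here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(X, w, b):
    """Predict class 1 iff sigmoid(w.x + b) >= 0.5, i.e. iff w.x + b >= 0."""
    return (X @ w + b >= 0).astype(int)

# Hypothetical weights and points, for illustration only
w = np.array([2.0, -1.0])
b = -0.5
X = np.array([[1.0, 0.0], [0.0, 2.0]])

# The probability threshold 0.5 and the linear threshold 0 agree
assert np.array_equal(logistic_predict(X, w, b),
                      (sigmoid(X @ w + b) >= 0.5).astype(int))
```

So even though the output is a nonlinear (sigmoid) function of x, the class boundary itself is linear.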
Sometimes, transform the data and then learn a linear function
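A toy illustration of that idea, assuming a 1-D dataset where class 1 occupies the middle interval: no single threshold on x separates the classes, but after mapping x to z = (x, x²) a linear rule does.

```python
import numpy as np

# 1-D data: class 1 iff |x| < 1 -- not separable by one threshold on x
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = (np.abs(x) < 1).astype(int)

# Transform each point to z = (x, x^2); then a linear rule in z-space works
Z = np.stack([x, x ** 2], axis=1)
w, b = np.array([0.0, -1.0]), 1.0      # linear discriminant: -z2 + 1 > 0, i.e. x^2 < 1
pred = (Z @ w + b > 0).astype(int)
assert np.array_equal(pred, y)
```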
Gene expression
Face images
Handwritten digits
[Figure: 2-D data plotted on the original axes f1, f2 and on the principal (eigenvector) axes e1, e2]
[Figure: projecting the data onto the first principal axis v1 captures the variance (‘spread’) on the v1 axis]
‘Spectral decomposition’ of the matrix:

    A = U Λ V^T = [u1  u2  ...] · diag(λ1, λ2, ...) · [v1^T; v2^T; ...]
Equivalently, the ‘spectral decomposition’ writes the m x n matrix A as a sum of r rank-1 terms, where each ui is m x 1 and each vi^T is 1 x n:

    A = λ1 u1 v1^T + λ2 u2 v2^T + ...   (r terms)
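The decomposition above can be checked numerically with NumPy's SVD (here the λi are the singular values and the ui, vi the singular vectors); a minimal sketch on a small made-up matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0]])        # m = 2, n = 3

U, lam, Vt = np.linalg.svd(A, full_matrices=False)

# A equals the sum of r rank-1 terms: lam_i * u_i * v_i^T
A_sum = sum(lam[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(lam)))
assert np.allclose(A, A_sum)
```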
Approximation / dimensionality reduction: keep only the first few terms (how many?). Assume λ1 ≥ λ2 ≥ ...; then

    A ≈ λ1 u1 v1^T + λ2 u2 v2^T + ...   (first k terms)
A heuristic: keep 80-90% of the ‘energy’ (= sum of squares of the λi). Assume λ1 ≥ λ2 ≥ ...:

    A ≈ λ1 u1 v1^T + λ2 u2 v2^T + ...
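The energy heuristic can be sketched as: with the λi already sorted in decreasing order (as SVD returns them), take the smallest k whose cumulative squared sum reaches the target fraction. The example values below are made up.

```python
import numpy as np

def choose_k(singular_values, energy=0.9):
    """Smallest k whose first k terms keep `energy` of sum(lam_i^2)."""
    e = np.asarray(singular_values) ** 2
    frac = np.cumsum(e) / e.sum()
    return int(np.searchsorted(frac, energy) + 1)

lam = np.array([10.0, 5.0, 1.0, 0.5])   # assumed lam_1 >= lam_2 >= ...
# energies 100, 25, 1, 0.25; cumulative fractions ~0.79, ~0.99, ...
k = choose_k(lam, 0.9)
```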
The Frobenius norm of A:

    ||A||_F = sqrt( Σi,j A[i,j]² ) = sqrt( Σi λi² )

Theorem [Eckart and Young]: Among all m x n matrices B of rank at most k, the truncated expansion A_k (the first k terms) is the best approximation:

    ||A − A_k||_2 ≤ ||A − B||_2    and    ||A − A_k||_F ≤ ||A − B||_F
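The theorem can be sanity-checked numerically: the rank-k SVD truncation A_k should never lose to any other rank-k matrix B (here a random one) in Frobenius error; the matrix and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
k = 2

U, lam, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k, :]    # rank-k truncation

# Any other rank-k matrix B (here: random) cannot do better (Frobenius norm)
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
assert np.linalg.norm(A - A_k) <= np.linalg.norm(A - B)

# The dropped energy identity: ||A - A_k||_F^2 = sum of the dropped lam_i^2
assert np.isclose(np.linalg.norm(A - A_k) ** 2, np.sum(lam[k:] ** 2))
```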
[Figure: two-class data with principal component directions PC 1 and PC 2]
When projected onto the line joining the class means, the classes are not well separated.
Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.
Find the best direction w for accurate classification.
A measure of the separation between the projected points is the difference of the sample means.
If mi is the d-dimensional sample mean of Di, given by

    mi = (1/ni) Σ x∈Di x,

then the sample mean of the projected points Yi is given by

    m̃i = (1/ni) Σ y∈Yi y = (1/ni) Σ x∈Di wTx = wT mi,

and the difference of the projected sample means is:

    |m̃1 − m̃2| = |wT(m1 − m2)|
Define the scatter for the projection:

    s̃i² = Σ y∈Yi (y − m̃i)²

Choose w in order to maximize

    J(w) = |m̃1 − m̃2|² / (s̃1² + s̃2²)

s̃1² + s̃2² is called the total within-class scatter.
Define scatter matrices Si (i = 1, 2) and Sw by

    Si = Σ x∈Di (x − mi)(x − mi)T,    Sw = S1 + S2

We obtain

    s̃1² + s̃2² = wT Sw w    and    |m̃1 − m̃2|² = wT SB w,

where SB = (m1 − m2)(m1 − m2)T. In terms of SB and Sw, J(w) can be written as:

    J(w) = (wT SB w) / (wT Sw w)
A vector w that maximizes J(w) must satisfy the generalized eigenvalue problem

    SB w = λ Sw w

In the case that Sw is nonsingular, the solution is

    w = Sw⁻¹ (m1 − m2)
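The closed form for nonsingular Sw is straightforward to implement; a minimal sketch on made-up two-class Gaussian data (the class centers and seed are assumptions for illustration):

```python
import numpy as np

def fisher_direction(D1, D2):
    """Fisher direction w = Sw^{-1} (m1 - m2), assuming Sw is nonsingular."""
    m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
    S1 = (D1 - m1).T @ (D1 - m1)     # scatter matrix of class 1
    S2 = (D2 - m2).T @ (D2 - m2)     # scatter matrix of class 2
    Sw = S1 + S2                     # total within-class scatter matrix
    return np.linalg.solve(Sw, m1 - m2)

# Hypothetical 2-D classes
rng = np.random.default_rng(1)
D1 = rng.standard_normal((50, 2)) + np.array([2.0, 0.0])
D2 = rng.standard_normal((50, 2)) + np.array([-2.0, 0.0])
w = fisher_direction(D1, D2)

# The projected class means are separated along w (wT Sw^{-1} w > 0 for PD Sw)
assert w @ (D1.mean(axis=0) - D2.mean(axis=0)) > 0
```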
Map from x to z using nonlinear basis functions and use a linear discriminant in z-space
w is orthogonal to the decision surface
The bias (threshold) weight is w0 = b.
D = distance of the decision surface from the origin.
Consider any point x on the decision surface. Then wTx + b = 0, so D = wTx / ||w|| = −b / ||w||.
d(x) = distance of x from the decision surface.
Write x = xp + d(x) w/||w||, where xp is the projection of x onto the surface. Then

    wTx + b = wTxp + d(x) wTw/||w|| + b

so

    g(x) = (wTxp + b) + d(x) ||w|| = d(x) ||w||   (since g(xp) = 0)

and therefore

    d(x) = g(x) / ||w|| = wTx / ||w|| − D
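The signed distance d(x) = g(x)/||w|| can be computed directly; a sketch with assumed weights w, b chosen so the arithmetic is easy to follow:

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance of x from the surface w.x + b = 0, i.e. g(x)/||w||."""
    return (w @ x + b) / np.linalg.norm(w)

# Hypothetical hyperplane: 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5
w = np.array([3.0, 4.0])
b = -5.0

# A point on the surface has distance 0: 3*3 + 4*(-1) - 5 = 0
x_on = np.array([3.0, -1.0])
assert np.isclose(signed_distance(x_on, w, b), 0.0)

# Distance of the surface from the origin: D = -b/||w|| = 1
assert np.isclose(-b / np.linalg.norm(w), 1.0)
```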