G54DMT – Data Mining Techniques and Applications (http://www.cs.nott.ac.uk/~jqb/G54DMT). Dr. Jaume Bacardit ([email protected]). Topic 3: Data Mining, Lecture 1: Classification Algorithms. Some slides taken from Jiawei Han, "Data Mining: Concepts and Techniques", Chapter 6.
[Figure: the classification workflow — a Learning Algorithm builds Models from the Training Set; an Inference Engine then applies the models to a New Instance to produce an Annotated Instance]
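The workflow in the figure can be sketched with any learner in place of the learning algorithm; the sketch below uses a 1-nearest-neighbour classifier as an illustrative stand-in (the data points and labels are made up for the example).

```python
# Minimal sketch of the figure's workflow, using 1-nearest-neighbour
# as a stand-in learning algorithm (any classifier would do).
import math

# Training set: (instance, label) pairs in the unit square
training_set = [([0.1, 0.2], "0"), ([0.9, 0.8], "1"),
                ([0.1, 0.9], "1"), ([0.9, 0.1], "0")]

def learn(train):
    # "Learning" for 1-NN is just storing the training set as the model
    return train

def infer(model, instance):
    # Inference engine: annotate the new instance with the label
    # of its closest training tuple
    _, label = min(model, key=lambda t: math.dist(t[0], instance))
    return label

model = learn(training_set)
print(infer(model, [0.85, 0.9]))  # closest to (0.9, 0.8) -> "1"
```

The separation between `learn` and `infer` mirrors the two halves of the figure: model construction happens once on the training set, annotation happens per new instance.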
[Figure: a decision tree partitioning the unit square — root split on X at 0.5; each branch then splits on Y at 0.5, yielding four quadrant leaves labelled 0 or 1]
[Figure: the same kind of partition of the unit square, expressed as an ordered rule set; the class assigned by each rule is shown graphically in the figure]

If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then …
If (X>0.75 and Y>0.75) then …
If (X<0.25 and Y<0.25) then …
Everything else: default rule
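An ordered rule set like this is evaluated top-down: the first rule whose condition matches fires, and the default rule catches everything else. A minimal sketch (the labels "corner" and "other" are illustrative placeholders, since the slide only shows the consequents graphically):

```python
# Ordered rule list: (condition, label). The labels are made-up
# placeholders, not taken from the original slide.
rules = [
    (lambda x, y: (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25), "corner"),
    (lambda x, y: x > 0.75 and y > 0.75, "corner"),
    (lambda x, y: x < 0.25 and y < 0.25, "corner"),
]

def classify(x, y, default="other"):
    for condition, label in rules:
        if condition(x, y):  # first matching rule fires
            return label
    return default           # default rule: everything else

print(classify(0.1, 0.9))  # matches the first rule
print(classify(0.5, 0.5))  # no rule matches -> default rule
```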
[Figure: decision tree for buys_computer]

age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
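The tree above reads directly as nested conditionals:

```python
# The buys_computer decision tree from the figure as nested ifs.
def buys_computer(age, student, credit_rating):
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:          # the 31..40 branch is a pure "yes" leaf
        return "yes"
    else:                    # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer(25, "yes", "fair"))      # yes
print(buys_computer(45, "no", "excellent"))  # no
print(buys_computer(35, "no", "excellent"))  # yes
```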
Attribute Selection Measure: Information Gain (ID3/C4.5)
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Expected information needed to classify a tuple in D:
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Expected information after partitioning on age:
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
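The calculation above is a few lines of code (with full precision the gain comes out as 0.247; the slide's 0.246 comes from rounding the intermediate values):

```python
# Computing the information gain for "age" on the buys_computer data.
import math

def info(*counts):
    # Expected information (entropy) I(c1, c2, ...) in bits
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_D = info(9, 5)                     # whole data set: 9 yes, 5 no
info_age = (5/14 * info(2, 3)           # age <= 30: 2 yes, 3 no
            + 4/14 * info(4, 0)         # age 31..40: 4 yes, 0 no
            + 5/14 * info(3, 2))        # age > 40: 3 yes, 2 no
gain_age = info_D - info_age

print(round(info_D, 3))    # 0.94
print(round(gain_age, 3))  # 0.247
```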
Bayes' theorem: posterior = likelihood × prior / evidence

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, it suffices to maximize P(X|Ci) P(Ci) over the classes Ci.

Under the naive assumption of class-conditional independence, P(X|Ci) = Πk P(xk|Ci), and each P(xk|Ci) is estimated from the counts of attribute value xk among the training tuples of class Ci.
Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"

Data sample X = (age <=30, income = medium, student = yes, credit_rating = fair)
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

P(X|Ci): P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) × P(Ci): P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007

Therefore, X belongs to class buys_computer = "yes"
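The whole worked example is reproduced by multiplying the conditional probabilities per class and comparing the products:

```python
# Reproducing the naive Bayes computation for sample X
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
priors = {"yes": 9/14, "no": 5/14}
likelihoods = {  # P(attribute value | class), from the counts above
    "yes": [2/9, 4/9, 6/9, 6/9],
    "no":  [3/5, 2/5, 1/5, 2/5],
}

scores = {}
for c in priors:
    p_x_given_c = 1.0
    for p in likelihoods[c]:
        p_x_given_c *= p           # naive independence assumption
    scores[c] = p_x_given_c * priors[c]

print(max(scores, key=scores.get))  # yes
print(round(scores["yes"], 3))      # 0.028
print(round(scores["no"], 3))       # 0.007
```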
Avoiding the zero-probability problem: the Laplacian correction. Example: suppose a class has 1000 tuples, with income = low: 0, income = medium: 990, income = high: 10. Adding 1 to each of the 3 counts gives:

Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
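The correction is one line per estimate: add 1 to each count and the number of distinct values to the denominator, so no conditional probability is ever exactly zero:

```python
# Laplacian (add-one) correction of the probability estimates.
from fractions import Fraction

counts = {"low": 0, "medium": 990, "high": 10}   # raw counts in the class
k = len(counts)                                   # number of attribute values
n = sum(counts.values())                          # 1000 tuples

corrected = {v: Fraction(c + 1, n + k) for v, c in counts.items()}
print(corrected["low"])     # 1/1003
print(corrected["medium"])  # 991/1003
print(corrected["high"])    # 11/1003
```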
[Figure: a 2-D scatter plot of two linearly separable classes of points, marked x and o]
[Figure: a single neuron — input vector x = (x0, …, xn) is multiplied by weight vector w = (w0, …, wn); the weighted sum Σi wi xi, offset by the bias μk, passes through the activation function f to produce the output y, e.g. y = f(Σi wi xi − μk)]
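The neuron in the figure is a weighted sum, a bias, and an activation function; a minimal sketch using the sign function as the activation:

```python
# Forward pass of a single neuron: weighted sum, minus bias (mu_k),
# through an activation function (sign, as an example choice of f).
def neuron(x, w, bias):
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias  # weighted sum - mu_k
    return 1 if s >= 0 else -1                       # activation f = sign

print(neuron([1.0, 0.5], [0.4, 0.6], bias=0.2))  # 0.4 + 0.3 - 0.2 = 0.5 -> 1
print(neuron([1.0, 0.5], [0.4, 0.6], bias=1.0))  # 0.7 - 1.0 = -0.3 -> -1
```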
[Figure: a multilayer feed-forward network — the input vector X enters the input layer, passes through a hidden layer, and the output layer produces the output vector]
[Figure: two separating hyperplanes for the same data, one with a small margin and one with a large margin m; the points lying on the margin boundaries are the support vectors]
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple and yi its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
A separating hyperplane can be written as

W · X + b = 0

where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias). For 2-D data with X = (x1, x2), writing b as w0, the hyperplane is

w0 + w1 x1 + w2 x2 = 0

and the two hyperplanes defining the sides of the margin are

H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ –1 for yi = –1
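H1 and H2 combine into the single constraint yi (w0 + w1 x1 + w2 x2) ≥ 1. A quick sketch checking it for a hypothetical hyperplane (the weights below are made up for illustration):

```python
# Check the combined margin constraint y * (w0 + w1*x1 + w2*x2) >= 1
# for a hypothetical hyperplane x1 + x2 = 3 (made-up weights).
w0, w1, w2 = -3.0, 1.0, 1.0

def satisfies_margin(x1, x2, y):
    # H1 (y = +1) and H2 (y = -1) combine into this single inequality
    return y * (w0 + w1 * x1 + w2 * x2) >= 1

print(satisfies_margin(3.0, 1.0, +1))  # on H1: -3 + 3 + 1 = 1 -> True
print(satisfies_margin(1.0, 1.0, -1))  # on H2: -(-1) = 1 -> True
print(satisfies_margin(1.5, 1.5, +1))  # inside the margin: 0 -> False
```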
The weights are found by solving a quadratic programming problem: maximize

L(α) = Σi αi − ½ Σi Σj αi αj yi yj K(Xi, Xj)

subject to Σi αi yi = 0 and αi ≥ 0.

K(Xi, Xj) is the kernel function; in the original formulation it is simply the dot product Xi · Xj.

With a soft margin (penalty parameter C for misclassified points), the same objective is maximized subject to Σi αi yi = 0 and 0 ≤ αi ≤ C.
SVM
- Relatively new concept
- Deterministic algorithm
- Nice generalization properties
- Hard to learn – learned in batch mode using quadratic programming techniques
- Using kernels, can learn very complex functions

Neural Network
- Relatively old
- Nondeterministic algorithm
- Generalizes well but doesn't have a strong mathematical foundation
- Can easily be learned in incremental fashion
- To learn complex functions, use a multilayer perceptron (not that trivial)