Presentation Transcript


Classification II (Ταξινόμηση II)

The slides are based on P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006


Brief Recap

Definition

Classification: we are given a collection of records (the training set); each record consists of a set of attributes, one of which is the class.

Goal: find a model for the class attribute as a function of the values of the other attributes, so that previously unseen records are assigned a class as accurately as possible.

  • Usually the given data set is divided into:
  • a training set, used to build the model, and
  • a test set, used to validate it.


Uses of classification

  • Descriptive modeling: as an explanatory tool, to distinguish between objects of different classes.
  • Predictive modeling: to predict the class of new, unlabeled records.

  • Best suited for:
  • binary or nominal class attributes, less so for ordinal ones.
  • Not suited for a continuous (numeric) target value y;
    in that case, regression is used to predict y.


General approach

[Figure: a learning algorithm performs Induction on the training set to learn a model; the model is then applied to the test set (Deduction).]

Training set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

  • 1. Model construction:
  •    build the model from the training set
  •    (where the class of each record is known)
  • 2. Model usage and evaluation on the test set:
  • Accuracy rate: the percentage of test-set records that are classified correctly by the model
  •    (beware of over-fitting)


Data preparation

  • 1. Data cleaning
  •    preprocessing to reduce noise and handle missing values
  • 2. Relevance analysis
  •    removal of irrelevant or redundant attributes (feature selection)
  • 3. Data transformation
  •    generalization / normalization, e.g.
  •    a numeric income attribute mapped to {low, medium, high}
  •    numeric values normalized to [0, 1)


Evaluation of classification methods

  • Predictive accuracy
  • Speed
  •    cost of building / using the model
  • Robustness (to noise and missing values)
  • Scalability (to data that does not fit in memory)
  • Interpretability: how easily the model can be understood
  • Goodness of rules (quality), e.g. size of the tree, compactness of the rules


Classification techniques

  • Decision trees
  • Rule-based methods
  • Memory-based reasoning
  • Naïve Bayes and Bayesian belief networks
  • Support Vector Machines


Decision Trees


Decision tree: example

  • Internal nodes correspond to attributes; each one splits the records into children based on the attribute's value
  • Leaves correspond to class labels

Splitting attributes:

  Refund?
  ├─ Yes → NO
  └─ No  → MarSt?
            ├─ Married → NO
            └─ Single, Divorced → TaxInc?
                                   ├─ < 80K → NO
                                   └─ > 80K → YES

Model: decision tree


Building the decision tree

  • The tree is built from the training records
  •    (categorical attributes; continuous attributes are discretized)
  • Basic (greedy) strategy:
  • 1. Start with all training records at the root
  • 2. Partition the records into children based on a selected attribute
  • 3. Repeat step 2 recursively on each child
  •    (top-down, recursive, divide-and-conquer construction)
  • 4. When construction ends, prune the tree (tree pruning)
  •    to remove branches that reflect noise or outliers


Another example

[Figure: decision tree splitting on age?, with branches <= 30, 30...40 (→ yes) and > 40.]


More than one tree can fit the same data:

  Tree 1:
  Refund?
  ├─ Yes → NO
  └─ No  → MarSt?
            ├─ Married → NO
            └─ Single, Divorced → TaxInc?
                                   ├─ < 80K → NO
                                   └─ > 80K → YES

  Tree 2:
  MarSt?
  ├─ Married → NO
  └─ Single, Divorced → Refund?
                         ├─ Yes → NO
                         └─ No  → TaxInc?
                                   ├─ < 80K → NO
                                   └─ > 80K → YES


Tree induction

  • Exponentially many trees can be built from a given set of attributes; finding the optimal one is computationally infeasible.
  • Induction algorithms therefore use a greedy strategy:
    • Hunt's Algorithm (one of the earliest)
    • CART
    • ID3, C4.5
    • SLIQ, SPRINT


Tree induction: Hunt's algorithm

The tree is built recursively (general structure). Let Dt be the set of training records that reach node t.

General procedure:

  • If Dt contains records that all belong to the same class yt,
    then t is a leaf node labeled yt
  • If Dt is empty (contains no records),
    then t is a leaf node labeled with the default class
  • If Dt contains records from more than one class, use an attribute test
    to split the records into smaller subsets, and apply the procedure recursively to each subset


Tree induction: Hunt's algorithm (example on the tax data)

[Figure: successive refinement of the tree — start with a single leaf "Don't Cheat"; then split on Refund (Yes → Don't Cheat, No → ...); then split the "No" branch on Marital Status (Married → Don't Cheat, Single/Divorced → ...); finally split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]


Tree induction: design issues

How should the records be split at each node?
When should the splitting stop?

Greedy strategy:
  • split the records based on the attribute test that optimizes a chosen criterion (a locally optimal choice)


How to split: it depends on the attribute type

  • Nominal
  • Ordinal
  • Continuous

  • and on the kind of split:
  • binary (2-way) split
  • multi-way split


Splitting of nominal attributes

  • Multi-way split: one partition (child) for each distinct value, e.g.

      CarType → Family | Sports | Luxury

  • Binary split: the values are divided into two subsets (partitioning), e.g.

      CarType → {Sports, Luxury} | {Family}
      CarType → {Family, Luxury} | {Sports}

  • For ordinal attributes (e.g. Size), a binary split must respect the ordering:
    {Small, Medium} | {Large} is allowed, {Small, Large} | {Medium} is not.

  • In general, for an attribute with k values there are 2^(k-1) − 1 possible binary splits;
    which split should be chosen? (A small enumeration sketch follows below.)
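For illustration, a minimal Python sketch (not part of the slides; the helper name binary_splits and the CarType values are just examples) that enumerates the 2^(k-1) − 1 binary partitions of a nominal attribute's values:

from itertools import combinations

def binary_splits(values):
    """Yield all 2^(k-1) - 1 binary partitions (left, right) of a set of nominal values."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Fixing the first value on the "left" side avoids counting mirror partitions twice.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            yield left, right

for left, right in binary_splits(["Family", "Sports", "Luxury"]):
    print(left, "|", right)
# {'Family'} | {'Sports', 'Luxury'}
# {'Family', 'Sports'} | {'Luxury'}
# {'Family', 'Luxury'} | {'Sports'}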


Splitting of continuous attributes

  • Binary split: (A < v) or (A >= v), e.g.  Taxable Income > 80K? → Yes / No
  • Multi-way split: discretization into disjoint ranges, e.g.
      Taxable Income → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K


Continuous attributes: finding the best split point

  • Usually a binary split is used,
  •    based on a single value v
  • The split on v creates two ranges, A < v and A >= v
  • How is v chosen (the best split point)?
  •    sort the values of A
  •    candidate split points are the midpoints (ai + ai+1)/2 between successive values ai and ai+1
  •    each candidate is evaluated with the chosen impurity criterion (e.g., the GINI index)

  (example: Taxable Income > 80K? → Yes / No)
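As an illustration of this procedure, here is a small Python sketch (the helper names gini and best_split_point are illustrative, not from the slides) that tries every midpoint of the sorted Taxable Income values from the earlier table and keeps the one with the lowest weighted Gini:

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_point(values, labels):
    """Evaluate the midpoint between every pair of successive sorted values
    and return the split point with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_score = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # no midpoint between equal values
        v = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cls    = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split_point(income, cls))   # (97.5, 0.3) on this data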


Which split is best?

Example: 20 records, 10 of class C0 and 10 of class C1. Three candidate splitting attributes:

  • Own Car?     →  Yes: C0 6, C1 4  |  No: C0 4, C1 6
  • Car Type?    →  Family: C0 1, C1 3  |  Sports: C0 8, C1 0  |  Luxury: C0 1, C1 7
  • Student ID?  →  one child per id (c1 ... c20), each with a single record (C0 1, C1 0 or C0 0, C1 1)

Which of the 3 splits is preferable?
→ We need a criterion for comparing them.


Finding the best split (continued)

  • Greedy approach:
  •    nodes with a homogeneous class distribution are preferred
  • We need a measure of node impurity

  C0: 5, C1: 5  →  non-homogeneous, high degree of impurity
  C0: 9, C1: 1  →  homogeneous, low degree of impurity

  Example nodes (classes C1 / C2):
    N1: C1 = 0, C2 = 6   (impurity ~ 0)
    N2: C1 = 1, C2 = 5
    N3: C1 = 2, C2 = 4
    N4: C1 = 3, C2 = 3

  I(N1) < I(N2) < I(N3) < I(N4)


How is the impurity measure used?

For each node n we compute its impurity I(n).

Let a parent node with N records be split into k children u_i,
and let N(u_i) be the number of records in child u_i (so that Σ_i N(u_i) = N).

Gain of the split:

  Δ = I(parent) − Σ_{i=1..k} (N(u_i) / N) · I(u_i)

We choose the split with the largest gain, i.e. with the smallest weighted impurity of the children
(the impurity of each child is weighted by the fraction of records it receives).


Tree induction: Hunt's algorithm (pseudocode)

Algorithm GenDecTree(Sample S, Attlist A)

  • create a node N
  • If all samples are of the same class C, then label N with C; terminate;
  • If A is empty, then label N with the most common class C in S (majority voting); terminate;
  • Select a ∈ A with the highest gain; label N with a;
  • For each value v of a:
    • Grow a branch from N with condition a = v;
    • Let Sv be the subset of samples in S with a = v;
    • If Sv is empty, then attach a leaf labeled with the most common class in S;
    • Else attach the node generated by GenDecTree(Sv, A − {a})
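A minimal Python sketch of the pseudocode above, under the assumption that records are dicts with a 'class' key and that a scoring function gain(samples, attribute) is supplied; the names gen_dec_tree and most_common_class are illustrative, not from the slides:

from collections import Counter

def most_common_class(samples):
    return Counter(s["class"] for s in samples).most_common(1)[0][0]

def gen_dec_tree(samples, attributes, gain):
    """Recursive induction following the GenDecTree pseudocode.
    `samples`: list of dicts with a 'class' key; `attributes`: list of attribute names;
    `gain(samples, a)`: scores a candidate split attribute."""
    classes = {s["class"] for s in samples}
    if len(classes) == 1:                       # all samples in the same class
        return {"label": classes.pop()}
    if not attributes:                          # no attributes left: majority vote
        return {"label": most_common_class(samples)}
    best = max(attributes, key=lambda a: gain(samples, a))
    node = {"attribute": best, "branches": {}}
    for v in {s[best] for s in samples}:        # one branch per observed value of `best`
        subset = [s for s in samples if s[best] == v]
        rest = [a for a in attributes if a != best]
        node["branches"][v] = gen_dec_tree(subset, rest, gain)
    return node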


Measures of node impurity

  • Gini index
  • Entropy
  • Misclassification error


Splitting criterion: GINI index

Gini index for node t:

  GINI(t) = 1 − Σ_j [ p(j | t) ]²

  • p(j | t) is the relative frequency of class j at node t
  • c is the number of classes
  • minimum value (0.0) when all records belong to a single class (most useful information)
  • maximum value (1 − 1/c) when the records are equally distributed over the classes (least useful information)

Examples (classes C1 / C2):
  C1 = 0, C2 = 6  →  Gini = 0.000
  C1 = 1, C2 = 5  →  Gini = 0.278
  C1 = 2, C2 = 4  →  Gini = 0.444
  C1 = 3, C2 = 3  →  Gini = 0.500


Splitting based on the GINI index

  • Used by CART, SLIQ, SPRINT, IBM Intelligent Miner
  • When a node p is split into k children, the quality of the split is computed as

      GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

    where n_i = number of records at child i and n = number of records at node p.
  • The attribute giving the smallest GINI_split is chosen
    (equivalently, the one with the largest gain); a small sketch follows below.
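A small sketch of the computation in Python (the helper names gini and gini_split are illustrative, not from the slides), applied to the Own Car? and Car Type? splits of the earlier example:

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(i) over the children of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Counts are [C0, C1] per child:
print(gini_split([[6, 4], [4, 6]]))           # Own Car?  -> 0.48
print(gini_split([[1, 3], [8, 0], [1, 7]]))   # Car Type? -> 0.1625 (better split)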


Computing the GINI index: example (buys_computer data)

Class P: buys_computer = yes
Class N: buys_computer = no

Candidate binary splits on income:
  D1: {low, medium}   D2: {high}
  D3: {low}           D4: {medium, high}

The GINI_split of each partition is computed and compared,
and similarly for the other attributes (age, student, credit_rating).


Splitting criterion: Entropy

Entropy at node t:

  Entropy(t) = − Σ_j p(j | t) · log₂ p(j | t)

  • p(j | t) is the relative frequency of class j at node t
  • c is the number of classes
  • logarithms are base 2
  • maximum value log₂(c), when the records are equally distributed over the classes (least useful information)
  • minimum value (0.0), when all records belong to a single class (most useful information)

Examples (classes C1 / C2):
  C1 = 0, C2 = 6  →  Entropy = 0.000   (Gini = 0.000)
  C1 = 1, C2 = 5  →  Entropy = 0.650   (Gini = 0.278)
  C1 = 2, C2 = 4  →  Entropy = 0.920   (Gini = 0.444)
  C1 = 3, C2 = 3  →  Entropy = 1.000   (Gini = 0.500)


Splitting based on entropy: information gain

When a node p is split into k children:

  GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

where n_i = number of records at child i and n = number of records at node p.

  • Used by ID3 and C4.5
  • The information gain measures the reduction in entropy achieved by the split;
    the split with the largest gain is chosen (a small sketch follows below).
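A corresponding sketch in Python (the helper names entropy and information_gain are illustrative), applied to the Own Car? split of the earlier 10/10 example:

from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), from the class counts at a node."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c) if n else 0.0

def information_gain(parent, children):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Own Car? split: parent has 10 C0 and 10 C1 records.
print(information_gain([10, 10], [[6, 4], [4, 6]]))   # ~ 0.029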




Problem with impurity-based measures

[Same example as before: Own Car?, Car Type? and Student ID? splits of 10 C0 / 10 C1 records.]

Impurity measures tend to favor attributes with a large number of distinct values,
because each child is purer. For example, splitting on student-id gives perfectly pure children,
but such a split is useless for prediction → problem!


Splitting criterion: Gain ratio

  • Compensates for the bias of information gain towards attributes with many values:

      GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σ_{i=1..k} (n_i / n) · log(n_i / n)

  • SplitINFO is the entropy of the partition itself:
    splits into many small partitions are penalized
  • Used by C4.5

Example: a node with N records
  split into 3 equal-size children → SplitINFO = − 3 · (1/3) log(1/3) = log 3
  split into 2 equal-size children → SplitINFO = − 2 · (1/2) log(1/2) = log 2 = 1
  so, other things being equal, the 2-way split is preferred (see the sketch below).

Summary of the splitting criteria (for a node t with class frequencies p(j | t)):

    • Entropy: Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)
    • Misclassification error: Error(t) = 1 − max_j p(j | t)
    • Gini: GINI(t) = 1 − Σ_j [ p(j | t) ]²

Splitting criterion: Misclassification error

Classification error at node t:

  Error(t) = 1 − max_j p(j | t)

  • maximum value 1 − 1/c, when the records are equally distributed over the classes (least useful information)
  • minimum value (0.0), when all records belong to a single class (most useful information)

Examples (classes C1 / C2):
  C1 = 0, C2 = 6  →  Error = 0.000   (Gini = 0.000, Entropy = 0.000)
  C1 = 1, C2 = 5  →  Error = 0.167   (Gini = 0.278, Entropy = 0.650)
  C1 = 2, C2 = 4  →  Error = 0.333   (Gini = 0.444, Entropy = 0.920)
  C1 = 3, C2 = 3  →  Error = 0.500   (Gini = 0.500, Entropy = 1.000)
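For reference, a short Python sketch (the helper name impurities is illustrative) that reproduces the three measures for the example nodes above:

from math import log2

def impurities(counts):
    """Return (error, gini, entropy) for the class counts at a node."""
    n = sum(counts)
    p = [c / n for c in counts]
    error = 1 - max(p)
    gini = 1 - sum(q * q for q in p)
    ent = -sum(q * log2(q) for q in p if q)
    return error, gini, ent

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    e, g, h = impurities(counts)
    print(counts, f"error={e:.3f} gini={g:.3f} entropy={h:.3f}")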


Comparison of the splitting criteria for a two-class problem

[Figure: error, Gini and entropy plotted as functions of p, the fraction of records of the first class at the node (p for +, 1 − p for −). All three measures are maximized at p = 0.5 (uniform distribution) and are zero at p = 0 and p = 1.]


Note on comparing the criteria

  • The measures are broadly consistent, but they do not always prefer the same one of two candidate splits 1 and 2.


Tree induction: Hunt's algorithm (pseudocode, revisited)

Algorithm GenDecTree(Sample S, Attlist A)

  • create a node N
  • If all samples are of the same class C, then label N with C; terminate;
  • If A is empty, then label N with the most common class C in S (majority voting); terminate;
  • Select a ∈ A with the highest information gain (or Gini gain, or error reduction); label N with a;
  • For each value v of a:
    • Grow a branch from N with condition a = v;
    • Let Sv be the subset of samples in S with a = v;
    • If Sv is empty, then attach a leaf labeled with the most common class in S;
    • Else attach the node generated by GenDecTree(Sv, A − {a})


Decision trees: general remarks

  • Finding the optimal decision tree is NP-complete; practical algorithms use greedy heuristics.
  • Classifying a new record is extremely fast, O(h) where h is the height of the tree.
  • Accuracy is comparable to other techniques for many simple data sets.


Decision trees: general remarks (continued)

  • Robust to noise (especially when methods for avoiding overfitting are employed),
    and relatively insensitive to redundant or irrelevant attributes.


Expressiveness

  • Decision trees give an expressive representation of discrete-valued functions,
    but they do not generalize well to some Boolean functions.
    Example: the parity function, which is 0 (1) when the number of Boolean attributes with value 1 is even (odd),
    requires a complete tree with 2^d nodes, where d is the number of attributes.


Decision Boundary

Splitting the records on one attribute at a time partitions the space into axis-parallel rectangular regions.
The border line between two neighboring regions of different classes is called the decision boundary.


Decision boundaries (the boundary is parallel to the axes)

[Figure: 2-d points in the unit square (axes x and y) classified by the tree
  x < 0.43?
  ├─ Yes → y < 0.47?
  └─ No  → y < 0.33?
with four leaves whose class counts are 4:0, 0:4, 0:3 and 4:0; each leaf corresponds to an axis-parallel rectangle of the (x, y) space.]


Oblique decision trees

The test condition may involve more than one attribute, e.g.  x + y < 1  (Class = + on one side of the line, Class = − on the other).
More expressive, but finding the best test condition is computationally expensive.


Decision trees: pros and cons

  • Pros: inexpensive to construct, fast at classifying new records, easy to interpret for small trees.
  • Cons: the axis-parallel decision boundaries limit expressiveness for some data sets.

Search strategy

  • The algorithm presented uses a greedy, top-down, recursive partitioning strategy.
  • Other strategies?
    • Bottom-up (from the leaves towards the root)
    • Bi-directional


Tree replication

The same subtree may appear at several different places in the tree, which makes the model larger and harder to interpret.

Data fragmentation

The number of records gets smaller as we move down the tree; at the leaves it may become too small to support a statistically reliable decision.

Overfitting


Overfitting

  • Training error (also called resubstitution or apparent error): the misclassifications on the training records
  • Generalization error: the expected error of the model on previously unseen records
  • Overfitting: the tree fits the training data very well (low training error) but has a much higher generalization error

Causes of overfitting

  • 1. Noise in the training data
  • 2. Lack of representative training examples
  • 3. The multiple-comparison nature of the induction procedure


Overfitting: example data

Two classes: class 1 (500 circular points) and class 2 (500 triangular points).

Class 1 (circular points):
  0.5 ≤ sqrt(x1² + x2²) ≤ 1

Class 2 (triangular points):
  sqrt(x1² + x2²) > 1  or
  sqrt(x1² + x2²) < 0.5


"Everything should be made as simple as possible, but not simpler" — Einstein

Overfitting

[Figure: training and test error as a function of the number of nodes in the tree;
30% of the data used for training and 70% for testing; Gini index as splitting criterion; no pruning.]

Overfitting: as the tree grows too large, the training error keeps decreasing while the test (generalization) error starts to increase.
Underfitting: when the tree is too small, both the training and the generalization error are large.


Overfitting due to noise

[Figure: the decision boundary is distorted by mislabeled (noise) points.]


Overfitting due to insufficient examples

[Figure: lack of representative training points near the boundary leads to wrong predictions in that region.]

Overfitting: remarks

  • Overfitting results in decision trees that are more complex than necessary.
  • The training error no longer provides a good estimate of how well the tree will perform on new records.
  • Because of overfitting, we need other ways of estimating the generalization error.


How to address overfitting

  • Two basic approaches:
    • Pre-pruning: stop growing the tree early
    • Post-pruning: grow the full tree and then trim it
  • In both cases we need a way to estimate the generalization error.


Addressing overfitting: Pre-pruning (early stopping rule)

  • Stop the algorithm before a fully grown tree is formed.
  • More restrictive stopping conditions:
    • stop if the improvement of the splitting criterion (e.g., Gini or information gain) falls below a threshold.
      (−) it is hard to choose the right threshold,
      (−) the tree may stop growing too early.


Estimating the generalization error

    • Re-substitution errors: errors on the training set ( Σ e(t) over the leaves t )
    • Generalization errors: errors on new records ( Σ e'(t) )

      Methods of estimating the generalization error:

      1. Optimistic approach:

      e'(t) = e(t)


    Example: optimistic estimate

    [Figure: two alternative trees over the same 24 training records, with the counts of "+" and "−" records shown at each leaf.]

    Optimistic (training-error) estimates: one tree has 4/24 = 0.167, the other 6/24 = 0.25.


    Occam's Razor

    • Given two models with similar generalization errors, the simpler one is preferred.
    • A complex model is more likely to have been fitted accidentally to noise in the data.


    2. Pessimistic approach

    In general, e'(T) = Σ_i e(t_i) + Σ_i V(t_i), where the t_i are the k leaves of the tree and V(t_i) is a penalty term per leaf.

    • For each leaf t:   e'(t) = e(t) + 0.5
    • For the whole tree:   e'(T) = e(T) + k × 0.5   (k: number of leaves)
    • Example: a tree with 30 leaves and 10 training errors (out of 1000 records):
    • Training error = 10/1000 = 1%
    • Pessimistic generalization error = (10 + 30 × 0.5)/1000 = 2.5%
    • The 0.5 per leaf acts as a penalty for model complexity (a small numeric check follows below).
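A one-line numeric check of the slide's example, as a Python sketch (the helper name pessimistic_error is illustrative):

def pessimistic_error(train_errors, num_leaves, num_records, penalty=0.5):
    """Pessimistic estimate e'(T) = (e(T) + k * penalty) / N of the generalization error rate."""
    return (train_errors + num_leaves * penalty) / num_records

# The slide's example: 30 leaves, 10 training errors, 1000 records.
print(pessimistic_error(10, 30, 1000))   # 0.025, i.e. 2.5%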


    Example: pessimistic estimate

    For the two trees of the previous example (7 and 4 leaves respectively):

      (4 + 7 × 0.5)/24 = 0.3125
      (6 + 4 × 0.5)/24 = 0.3333

    If the penalty were different from 0.5 per leaf, which tree would be preferred?


    Addressing overfitting: Post-pruning

    Grow the full tree, then trim its nodes bottom-up; if the estimated generalization error decreases, replace the subtree with a leaf labeled with the majority class.

    Example (pessimistic estimates, a subtree with 30 records split into 4 leaves):

    Training error before splitting = 10/30
    Pessimistic error before splitting = (10 + 0.5)/30 = 10.5/30
    Training error after splitting = 9/30
    Pessimistic error after splitting = (9 + 4 × 0.5)/30 = 11/30

    10.5/30 < 11/30  →  PRUNE the subtree!


    3. Reduced error pruning (REP)

    • Uses a validation set to estimate the generalization error:
      2/3 of the training data is used for building the tree and
      1/3 is held out as a validation set.


    Examples of post-pruning

    Case 1: a node whose two children have class counts (C0: 11, C1: 3) and (C0: 2, C1: 4)
    Case 2: a node whose two children have class counts (C0: 14, C1: 3) and (C0: 2, C1: 2)

    For each case, should the node be pruned (its children replaced by a single leaf)?

    • using the optimistic error estimate?
    • using the pessimistic error estimate?
    • using reduced error pruning (REP)?

    (compare the decisions for case 1 and case 2)


    Handling missing attribute values

    • Missing values affect:
      • how the impurity measures are computed
      • how the records with missing values are distributed to the children
      • how a test record with missing values is classified

    Computing the impurity measure with a missing value

    One record (with class Yes) has a missing Refund value; the remaining 9 records are used for the split on Refund.

    Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

    Split on Refund:
    Entropy(Refund = Yes) = 0
    Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
    Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551

    Gain = 0.9 × (0.8813 − 0.551) ≈ 0.297

    (the factor 0.9 is the fraction of records whose Refund value is not missing; a small check is sketched below)
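A short Python check of the computation above (the helper name entropy is illustrative; class counts are those on the slide):

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c) if n else 0.0

# Parent: 3 Yes / 7 No (10 records, one Refund value missing);
# Refund = Yes child: 0 Yes / 3 No;  Refund = No child: 2 Yes / 4 No.
parent, yes_child, no_child = [3, 7], [0, 3], [2, 4]
children_entropy = 3/10 * entropy(yes_child) + 6/10 * entropy(no_child)
gain = 9/10 * (entropy(parent) - children_entropy)   # weighted by the fraction of non-missing values
print(round(entropy(parent), 4), round(children_entropy, 3), round(gain, 4))
# 0.8813 0.551 0.2973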


    Distributing a record with a missing value

    To which child should the record with the missing Refund value be sent?

    Probability that Refund = Yes is 3/9 (3 of the 9 known values have Refund = Yes)
    Probability that Refund = No is 6/9

    → the record is assigned to both children, with weights 3/9 and 6/9 respectively.

    [Figure: the Refund node before and after distributing the record; from the known values, the Refund = No child holds Class = Yes: 2 and Class = No: 4 records.]

    Classifying a test record with a missing value

    [Figure: the tree Refund → (Yes: NO | No: MarSt → (Married: NO | Single, Divorced: TaxInc → (< 80K: NO | > 80K: YES)))]

    For a test record whose MarSt value is missing, the weighted counts at the MarSt node give:

    Probability that MarSt = Married is 3.67/6.67
    Probability that MarSt = {Single, Divorced} is 3/6.67

    so the record is propagated along both branches with these probabilities, and its class is decided from the combined leaf distributions.


    Model selection: choosing among candidate models based on their estimated generalization error.

