Ταξινόμηση II (Classification II)

Presentation Transcript


Ταξινόμηση II (Classification II)

The slides are based on: P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.



Brief recap

Definition: classification (classification)

  • Input: a collection of records (the training set); each record consists of a set of attributes (attributes), and one designated attribute is the class (class).
  • Output: a model (model) for the class attribute as a function of the values of the other attributes.
  • Goal: previously unseen records should be assigned a class as accurately as possible.
  • The model is built from a training set (training set) and its accuracy is estimated on a separate test set (test set).

  • Two kinds of use:
  • Descriptive modeling (descriptive modeling): the model serves as an explanatory tool for distinguishing among the classes.
  • Predictive modeling (predictive modeling): the model predicts the class of new, unseen records.
  • The class attribute:
  • is categorical: nominal (nominal) vs ordinal (ordinal)
  • is discrete (binary or not); for a continuous target y, the corresponding task is regression (regression).


Figure: the classification framework. A learning algorithm performs induction (Induction) on the training set to build a model; applying the model to the test set is deduction (Deduction).

Training set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
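For concreteness, a minimal sketch of this induction/deduction loop in Python, using scikit-learn as the learning algorithm (an assumption; the slides do not prescribe any library) and the two tables above:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set from the table above (income in thousands).
train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(criterion="gini").fit(X_train, train["Class"])  # induction
print(model.predict(X_test))                                                   # deduction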

Classification: a two-step process

  • 1. Model construction: build the model from the training set.
  • (each training record is assumed to belong to a predefined class, as given by its class label)
  • 2. Model usage: classify future or unknown records.
  • Accuracy rate: the percentage of test-set records that the model classifies correctly.
    • (accuracy must be measured on a test set independent of the training set, otherwise over-fitting is rewarded)


Preparing the data for classification

  • 1. Data cleaning (data cleaning)
    • preprocessing to reduce noise and handle missing values
  • 2. Relevance analysis (Relevance analysis)
    • remove irrelevant or redundant attributes (feature selection)
  • 3. Data transformation (Data transformation)
    • generalize and/or normalize the data
      • e.g., generalize a numeric income to {low, medium, high}
      • e.g., normalize numeric values into [0,1)


Evaluating classification methods

  • Predictive accuracy - Predictive accuracy
  • Speed (speed)
    • time to construct the model / time to use the model
  • Robustness: handling noise and missing values
  • Scalability: efficiency for large, disk-resident data sets
  • Interpretability: the insight the model provides
  • Goodness of rules (quality): e.g., size of the tree, compactness of the rules


Classification techniques

  • Decision trees (decision trees)
  • Rule-based methods (Rule-based Methods)
  • Memory-based reasoning (Memory based reasoning)
  • Naïve Bayes and Bayesian Belief Networks

  • Support Vector Machines



Example of a decision tree

  • Internal nodes correspond to attribute tests (splits); each leaf assigns a class label.
  • Splitting attributes (Splitting Attributes) here: Refund, MarSt, TaxInc.

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

Model: a decision tree fitted to the training data above.


Decision tree induction

  • For the same set of records, more than one tree may fit well; we look for a good one.
    • (exhaustive search is infeasible - see below)
  • Basic steps for tree construction (greedy):
  • 1. Start with all the training examples at the root.
  • 2. Partition the examples recursively, choosing one split attribute at a time.
  • 3. Repeat step 2 on every partition
  • (top-down, recursive, divide-and-conquer approach)
  • 4. Afterwards: tree pruning (tree pruning)
  • - remove branches that reflect noise or outliers


Example: a multi-way split on age? with branches <=30, 30...40, >40 (the 30...40 branch leads to the leaf yes).

For the same training data, more than one tree can fit:

Tree 1:

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

Tree 2:

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES


Decision tree induction: algorithms

  • Exponentially many decision trees can be built from the same set of attributes; finding the best one is infeasible.
  • Practical induction (induction) algorithms follow a greedy strategy:
    • Hunt's Algorithm (one of the earliest)
    • CART
    • ID3, C4.5
    • SLIQ, SPRINT


Decision tree induction: Hunt's algorithm

Let Dt be the set of training records that reach node t, and y1, ..., yc the class labels.

  • General procedure (recursive):
    • If Dt contains records that all belong to the same class yt,
      then t is a leaf node labeled yt.
    • If Dt is the empty set (no records),
      then t is a leaf node labeled with the default class.
    • If Dt contains records of more than one class, use an attribute test condition to
      split the records into smaller subsets, and apply the procedure recursively to each child.


Example: Hunt's algorithm on the tax data (classes: Cheat / Don't Cheat)

  • Step 1: a single leaf labeled Don't Cheat (the majority class).
  • Step 2: split on Refund: Yes -> Don't Cheat; No -> Don't Cheat (still impure).
  • Step 3: the Refund = No branch is split on Marital Status: Married -> Don't Cheat; Single, Divorced -> Cheat (still impure).
  • Step 4: the Single, Divorced branch is split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.


Tree induction: issues

How should the training records be split?

Greedy strategy: at each step, split the records using the attribute test that optimizes a (local) criterion.
    • (locally optimal decisions do not guarantee a globally optimal tree)


Specifying the test condition:

    • depends on the attribute type
      • Nominal
      • Ordinal
      • Continuous
  • and on the number of ways to split:
    • binary split (2-way split)
    • multi-way split (Multi-way split)


Splitting on nominal attributes:

  • Multi-way split: use as many partitions as distinct values, e.g. CarType with branches Family / Sports / Luxury.
  • Binary split: divide the values into two subsets and find the best partitioning (partitioning), e.g. CarType as {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.
  • With k distinct values there are 2^(k-1) - 1 different binary partitions (a small sketch enumerating them follows below).
  • For ordinal attributes the split must not violate the order: for Size, is {Small, Large} vs {Medium} a valid split? (No: it separates Medium from both of its neighbours.)
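A minimal Python sketch that enumerates the 2^(k-1) - 1 binary partitions of a nominal attribute (names are illustrative):

from itertools import combinations

def binary_partitions(values):
    values = list(values)
    seen = []
    # Take every non-empty proper subset; keep one of each complementary pair.
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if right not in seen:          # skip the mirror image
                seen.append(left)
                yield set(left), set(right)

for l, r in binary_partitions(["Family", "Sports", "Luxury"]):
    print(l, "|", r)                       # 2^(3-1) - 1 = 3 partitions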


Splitting on continuous attributes:

  • Binary decision: (A < v) or (A >= v), e.g. Taxable Income > 80K? with branches Yes / No; consider all possible split points v and choose the best one.
  • Discretization into an ordinal attribute (multi-way split), e.g. Taxable Income? with branches < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K.


Best split for a continuous attribute

  • A binary split on a continuous attribute A uses a value v:
    • two children, A < v and A >= v (e.g., Taxable Income > 80K? Yes / No)
  • How is v (the best split point) chosen?
    • Sort the values a1 <= a2 <= ... of A.
    • The candidate split points are the midpoints (ai + ai+1)/2 between successive values ai and ai+1 (a small sketch follows below).
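A minimal Python sketch of the candidate split points (names are illustrative; the income values come from the training table above):

def candidate_split_points(values):
    # Midpoints between consecutive distinct sorted values.
    xs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # Taxable Income (K)
print(candidate_split_points(incomes))
# -> [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]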


Determining the best split

Before splitting: 10 records of class 0 and 10 records of class 1. Candidate test conditions:

  • Own Car? (Yes / No): children with class counts (C0: 6, C1: 4) and (C0: 4, C1: 6)
  • Car Type? (Family / Sports / Luxury): children (C0: 1, C1: 3), (C0: 8, C1: 0), (C0: 1, C1: 7)
  • Student ID? (c1 ... c20): twenty children, each with a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)

Which of the 3 test conditions is best? -> How do we compare them?


Determining the best split (cont.)

  • Greedy approach: prefer splits whose resulting nodes have a homogeneous class distribution (homogeneous class distribution).
  • We need a measure of node impurity (node impurity)!!

  • (C0: 5, C1: 5): non-homogeneous, high degree of impurity
  • (C0: 9, C1: 1): homogeneous, low degree of impurity

Example: four nodes N1, N2, N3, N4 with class counts

N1: (C1: 0, C2: 6) - impurity ~ 0
N2: (C1: 1, C2: 5)
N3: (C1: 2, C2: 4)
N4: (C1: 3, C2: 3) - maximum impurity

so that I(N1) < I(N2) < I(N3) < I(N4).


Determining the best split: gain

  • For every node n we compute an impurity measure I(n).
  • Suppose a test condition splits the parent node (parent), with N records, into k children u1, ..., uk, where child ui holds N(ui) records (so that Σi N(ui) = N).
  • The gain of the split compares the impurity before and after:

  Δ = I(parent) - Σ_{i=1..k} (N(ui) / N) · I(ui)

  • Choose the test condition with the maximum gain Δ, i.e. the minimum weighted impurity of the children (a small sketch follows below).
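A minimal Python sketch of Δ (helper names are illustrative; the Gini index, defined formally a few slides later, serves as the example impurity measure I):

def split_gain(impurity, parent_counts, children_counts):
    # Δ = I(parent) - Σ_i N(u_i)/N * I(u_i)
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in children_counts)
    return impurity(parent_counts) - weighted

def gini(counts):                      # example impurity measure I
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Own Car? from the example above: parent (10, 10) -> children (6, 4) and (4, 6)
print(split_gain(gini, [10, 10], [[6, 4], [4, 6]]))   # ~0.02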


Decision tree induction: Hunt-style pseudocode

Algorithm GenDecTree(Sample S, Attlist A)

  • create a node N
  • If all samples are of the same class C, then label N with C; terminate;
  • If A is empty, then label N with the most common class C in S (majority voting); terminate;
  • Select a ∈ A with the highest gain; label N with a;
  • For each value v of a:
    • Grow a branch from N with condition a = v;
    • Let Sv be the subset of samples in S with a = v;
    • If Sv is empty, then attach a leaf labeled with the most common class in S;
    • Else attach the node generated by GenDecTree(Sv, A − {a})
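A runnable Python sketch of GenDecTree, assuming information gain as the "highest gain" criterion; all names and the three-record example are illustrative:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[a], []).append(y)
    return entropy(labels) - sum(len(g) / len(labels) * entropy(g)
                                 for g in groups.values())

def gen_dec_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                    # all samples of the same class C
        return labels[0]
    if not attrs:                                # A is empty: majority voting
        return Counter(labels).most_common(1)[0][0]
    a = max(attrs, key=lambda x: gain(rows, labels, x))   # attribute with highest gain
    tree = {}
    for v in {r[a] for r in rows}:               # one branch per observed value of a
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        xs, ys = zip(*sub)
        tree[(a, v)] = gen_dec_tree(list(xs), list(ys), attrs - {a})
    return tree

rows = [{"Refund": "Yes", "MarSt": "Single"},
        {"Refund": "No",  "MarSt": "Married"},
        {"Refund": "No",  "MarSt": "Single"}]
print(gen_dec_tree(rows, ["No", "No", "Yes"], {"Refund", "MarSt"}))

Note that the sketch branches only on values that actually occur in S, so the empty-Sv case of the pseudocode never arises here.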


Best split: measures of node impurity

  • Gini index (Gini Index)
  • Entropy (Entropy)
  • Misclassification error (Misclassification error)


Best split: GINI index

The Gini index at node t:

  GINI(t) = 1 - Σj [p(j | t)]²

  • p(j | t) is the relative frequency of class j at node t
  • c is the number of classes
  • Minimum (0.0) when all records belong to a single class (most informative)
  • Maximum (1 - 1/c) when the records are equally distributed among all classes (least informative)

Examples (class counts (C1, C2)):

(0, 6): Gini = 0.000
(1, 5): Gini = 0.278
(2, 4): Gini = 0.444
(3, 3): Gini = 0.500

(A small sketch follows below.)
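A minimal Python sketch of GINI(t) over class counts (function name illustrative), reproducing the four values above:

def gini(counts):
    # GINI(t) = 1 - Σ_j p(j|t)^2
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5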


Best split: GINI index (cont.)

  • Used in CART, SLIQ, SPRINT, IBM Intelligent Miner.
  • When a node p with n records is split into k children (partitions), where child i holds ni records:

  GINIsplit = Σ_{i=1..k} (ni / n) · GINI(i)

  • Choose the attribute (test condition) that minimizes GINIsplit.


Best split with GINI: example

  • Class P: buys_computer = yes; Class N: buys_computer = no.
  • For the attribute income, examine the binary partitions, e.g.
    • D1: {low, medium} vs D2: {high}
    • D3: {low} vs D4: {medium, high}
    and keep the partition with the smallest GINIsplit.
  • Proceed similarly for the other attributes (age, student, credit_rating).


Best split: entropy

The entropy at node t:

  Entropy(t) = - Σj p(j | t) log2 p(j | t)

  • p(j | t) is the relative frequency of class j at node t, and c is the number of classes
  • Maximum (log2 c) when the records are equally distributed among all classes (least informative)
  • Minimum (0.0) when all records belong to a single class (most informative)

Examples (class counts (C1, C2)), alongside the Gini values:

(0, 6): Entropy = 0.000, Gini = 0.000
(1, 5): Entropy = 0.650, Gini = 0.278
(2, 4): Entropy = 0.920, Gini = 0.444
(3, 3): Entropy = 1.000, Gini = 0.500

(A small sketch follows below.)
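A minimal Python sketch of Entropy(t) on the same four distributions (function name illustrative):

from math import log2

def entropy(counts):
    # Entropy(t) = -Σ_j p(j|t) log2 p(j|t); empty classes contribute 0.
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 3))   # 0.0, 0.65, 0.918, 1.0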


Best split: information gain

When a node p with n records is split into k children, where child i holds ni records, the information gain is

  GAINsplit = Entropy(p) - Σ_{i=1..k} (ni / n) · Entropy(i)

  • ni = number of records of child i, n = number of records of node p.
  • Measures the reduction in entropy achieved by the split; choose the split that maximizes the gain (information gain).
  • Used in ID3 and C4.5.


Best split: information gain (cont.)

  • Disadvantage: the gain tends to prefer splits that produce a large number of partitions, each small but pure.
  • (Recall the earlier Own Car? / Car Type? / Student ID? example: splitting on student-id produces single-record, perfectly pure partitions -> maximum gain, yet the split is useless for prediction!)


Best split: gain ratio

  • To counter the preference for many small partitions, normalize the gain:

  GainRATIO = GAINsplit / SplitINFO, where SplitINFO = - Σ_{i=1..k} (ni / n) log2 (ni / n)

  • SplitINFO grows with the number of partitions, so it penalizes splits into many small pieces; used in C4.5.
  • Example: if a node with N records is split into 3 equal partitions, SplitINFO = -log2(1/3) = log2 3; into 2 equal partitions, SplitINFO = -log2(1/2) = log2 2 = 1.
  • (A small sketch follows below.)
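A minimal Python sketch of SplitINFO and GainRATIO (names and the equal partition sizes are illustrative):

from math import log2

def split_info(partition_sizes):
    # SplitINFO = -Σ_i (n_i/n) log2(n_i/n)
    n = sum(partition_sizes)
    return -sum(p / n * log2(p / n) for p in partition_sizes if p)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

print(split_info([8, 8]))      # 2 equal partitions -> log2(2) = 1.0
print(split_info([8, 8, 8]))   # 3 equal partitions -> log2(3) ~ 1.585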


Best split: misclassification error

The classification error (classification error) at node t:

  Error(t) = 1 - maxj p(j | t)

  • Maximum (1 - 1/c) when the records are equally distributed among all classes (least informative)
  • Minimum (0.0) when all records belong to a single class (most informative)

Examples (class counts (C1, C2)), compared with the other two measures:

(0, 6): Error = 0.000, Gini = 0.000, Entropy = 0.000
(1, 5): Error = 0.167, Gini = 0.278, Entropy = 0.650
(2, 4): Error = 0.333, Gini = 0.444, Entropy = 0.920
(3, 3): Error = 0.500, Gini = 0.500, Entropy = 1.000

(A small sketch follows below.)
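A minimal Python sketch of Error(t) on the same four distributions (function name illustrative):

def error(counts):
    # Error(t) = 1 - max_j p(j|t)
    n = sum(counts)
    return 1 - max(counts) / n

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(error(counts), 3))   # 0.0, 0.167, 0.333, 0.5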


Comparison of the splitting criteria:

For a two-class problem, where a fraction p of the records belongs to class + and 1 - p to class -, all three measures are maximized at p = 0.5 (uniform class distribution) and minimized at p = 0 and p = 1.


Note:

  • The three measures are consistent but not identical; for example, a split into children N1 and N2 can reduce the Gini index while leaving the misclassification error unchanged.


Decision tree induction: Hunt-style pseudocode (revisited, with the measure filled in)

Algorithm GenDecTree(Sample S, Attlist A)

  • create a node N
  • If all samples are of the same class C, then label N with C; terminate;
  • If A is empty, then label N with the most common class C in S (majority voting); terminate;
  • Select a ∈ A with the highest information gain (or gini, error); label N with a;
  • For each value v of a:
    • Grow a branch from N with condition a = v;
    • Let Sv be the subset of samples in S with a = v;
    • If Sv is empty, then attach a leaf labeled with the most common class in S;
    • Else attach the node generated by GenDecTree(Sv, A − {a})


Practical issues

  • Finding the optimal decision tree is an NP-complete problem; practical algorithms are heuristic (greedy).
  • Once built, a tree classifies an unknown record extremely fast: O(h), where h is the height of the tree.
  • (Accuracy is comparable to other classification techniques for many simple data sets.)


Expressiveness

  • Decision trees can represent any function of discrete-valued attributes, but for some functions the tree cannot be small.
  • Example: the parity function, whose value is 0 (1) for an even (odd) number of 1-valued boolean attributes, requires a complete tree with 2^d nodes for d attributes.


Decision Boundary

  • The tree partitions the attribute space into disjoint regions; the border line (Border line) between two neighboring regions of different classes is known as the decision boundary (decision boundary).
  • When each test condition involves a single attribute, the decision boundaries are parallel to the axes.

Figure: data in the unit square [0,1] × [0,1] for attributes x and y, classified by a tree with root split x < 0.43? and second-level splits y < 0.33? and y < 0.47?; the four leaves have class counts (4, 0), (0, 4), (0, 4), (3, 0). Each split is a line parallel to one of the axes, so the decision regions are axis-parallel rectangles.


Oblique decision trees (Oblique)

  • The test condition may involve more than one attribute, e.g. x + y < 1 separating Class = + from the other class.
  • More expressive representation, and the boundary need not be axis-parallel; but finding good test conditions is computationally expensive.


Decision trees - pros and cons

  • Pros (Pros): inexpensive to construct, extremely fast classification, easy to interpret.
  • Cons (Cons): restricted to axis-parallel decision boundaries (decision boundaries); greedy construction does not guarantee the best tree.


Practical issues: search strategy

  • The algorithms presented so far grow the tree greedily, top-down.
  • Other strategies?
    • Bottom-up (start from the leaves and work upwards)
    • Bi-directional


Tree Replication (replication)

  • The same subtree may be duplicated in several branches of the tree, which makes the model larger and harder to interpret (a weakness of one-attribute-at-a-time splits).


Data Fragmentation

  • The number of records gets smaller as we move down the tree; near the leaves, a decision may rest on too few records to be statistically significant.

Overfitting

  • Training error (training, resubstitution, apparent error): the errors on the training set.
  • Generalization error (generalization): the expected error on previously unseen records.
  • Overfitting: the tree fits the training data too closely, so the training error keeps decreasing while the generalization error increases.


Overfitting: causes

    • 1. Noise in the training data
    • 2. Lack of representative samples (too few training records)
  • 3. The multiple comparison procedure (so many candidate splits are compared that a spurious one can win)


Overfitting: example

Two classes: class 1 (500 circular points) and class 2 (500 triangular points).

Class 1 (circular points):

0.5 <= sqrt(x1² + x2²) <= 1

Class 2 (triangular points):

sqrt(x1² + x2²) > 1 or

sqrt(x1² + x2²) < 0.5


"Everything should be made as simple as possible, but not simpler" - Einstein

Overfitting and underfitting

In the experiment shown (30% of the data for training, 70% for testing), the training error (measured, e.g., by Gini) keeps shrinking as the tree grows, but beyond some size the test error rises again: overfitting. Pruning (pruning) addresses this.

Underfitting: the model is too simple, so both the training and the generalization errors are large.


Overfitting due to noise

Figure: noisy training records distort the decision boundaries; the training error is zero, but the test error is high.

Overfitting due to insufficient examples

Figure: with too few representative training records, the tree mislabels whole regions of the space.

Notes on overfitting

  • Overfitting results in decision trees that are more complex than necessary.
  • The training error no longer provides a good estimate of how well the tree will do on previously unseen records; we need other ways of estimating the generalization error.

  • Addressing overfitting:
    • Pre-pruning
    • Post-pruning
    • Both require a way to estimate the generalization error:


Overfitting:

Pre-Pruning (Early Stopping Rule)

  • Idea: stop the algorithm before the tree is fully grown.
  • Typical stopping conditions for a node: stop if all records belong to the same class, or if all attribute values are the same. More restrictive conditions:
    • stop if the number of records falls below a threshold, or if expanding the node does not improve the impurity measure (e.g., Gini or information gain) by more than a threshold.

      (-) with a threshold that is too high, underfitting

      (-) with a threshold that is too low, overfitting remains


    • Re-substitution errors: errors on the training set ( Σ e(t) )
    • Generalization errors: errors on unseen records ( Σ e'(t) )

      Methods for estimating the generalization error:

      1. Optimistic approach: use the training error directly:

      e'(t) = e(t)


    Example: two trees, TL (left) and TR (right), over the same 24 training records. With the optimistic approach,

    e'(TL) = 4/24 = 0.167 and e'(TR) = 6/24 = 0.25,

    so TL is preferred.


    Occam's Razor

    • Given two models with similar generalization errors, prefer the simpler one.
    • A more complex model has a greater chance of being fitted (Fitted) accidentally to errors in the data.


    2. Pessimistic approach - charge each leaf a complexity penalty:

    • For a leaf t: e'(t) = e(t) + 0.5
    • For a tree T: e'(T) = e(T) + k × 0.5 (k: the number of leaves)
    • Example: a tree with 30 leaves and 10 training errors out of 1000 records:
    • Training error = 10/1000 = 1%
    • Generalization error = (10 + 30 × 0.5)/1000 = 2.5%
    • The 0.5 penalty per leaf accounts for the added model complexity.

    For the earlier example: e'(TL) = (4 + 7 × 0.5)/24 = 0.3125

    and e'(TR) = (6 + 4 × 0.5)/24 = 0.3333.

    With a penalty of 0.5, when is a split worth making?


    Post-Pruning

    Grow the full tree, then trim it bottom-up: if the estimated generalization error improves after trimming a subtree, replace the subtree by a leaf labeled with the majority class of its training records. Example (pessimistic estimate, 30 records):

    Training error before splitting = 10/30

    Pessimistic error (before) = (10 + 0.5)/30 = 10.5/30

    Training error after splitting (4 leaves) = 9/30

    Pessimistic error (after)

    = (9 + 4 × 0.5)/30 = 11/30

    11/30 > 10.5/30 -> PRUNE the split! (A small sketch follows below.)
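    A minimal Python sketch of this post-pruning test (names illustrative; the 0.5-per-leaf penalty is the pessimistic estimate defined above):

    def pessimistic_error(errors, leaves, n):
        # e'(T) = (e(T) + 0.5 * k) / n for a (sub)tree with k leaves
        return (errors + 0.5 * leaves) / n

    subtree = pessimistic_error(errors=9,  leaves=4, n=30)   # (9 + 4*0.5)/30 = 11/30
    leaf    = pessimistic_error(errors=10, leaves=1, n=30)   # (10 + 0.5)/30 = 10.5/30
    print("PRUNE" if leaf <= subtree else "KEEP")            # -> PRUNE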


    3. Reduced error pruning (REP):

    • Estimate the generalization error on a held-out set:

      use 2/3 of the training data for building the tree,

      and hold out 1/3 as a validation set ( validation set ) for estimating the error.


    Overfitting: post-pruning examples

    Case 1: a node splits into children with class counts (C0: 11, C1: 3) and (C0: 2, C1: 4).

    Case 2: a node splits into children with class counts (C0: 14, C1: 3) and (C0: 2, C1: 2).

    For each case:

    • Would the optimistic approach prune the split?
    • Would the pessimistic approach prune it?
    • Would REP prune it?

    Compare the answers for case 1 and case 2.


  • Handling missing attribute values: missing values affect tree construction in three ways: how the impurity measures are computed, how records with missing values are distributed to the children, and how a test record with a missing value is classified.

    Computing the impurity measure with a missing value (Refund missing for 1 of 10 records; 3 records of class Yes, 7 of class No):

    Entropy(Parent) = -0.3 log(0.3) - (0.7) log(0.7) = 0.8813

    Split on Refund:

    Entropy(Refund=Yes) = 0

    Entropy(Refund=No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183

    Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551

    Gain = 0.9 × (0.8813 - 0.551) ≈ 0.297 (the factor 0.9 accounts for the 9 of 10 records whose Refund value is known)

    (A small sketch follows below.)
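    A minimal Python sketch reproducing the computation (helper name illustrative):

    from math import log2

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c)

    parent = entropy([3, 7])                                     # 0.8813
    children = 3/10 * entropy([0, 3]) + 6/10 * entropy([2, 4])   # 0.551
    print(round(0.9 * (parent - children), 4))                   # Gain ~ 0.2973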


    Distributing records with missing values:

    • Among the 9 records with a known Refund value, 3 have Refund=Yes and 6 have Refund=No (of the latter, Class=Yes: 2, Class=No: 4).
    • Probability that Refund=Yes is 3/9 (3 of the 9 known values are Refund=Yes).
    • Probability that Refund=No is 6/9.
    • Assign the record with the missing value to both children: to Refund=Yes with weight 3/9 and to Refund=No with weight 6/9.


    Classifying a test record with a missing attribute (Refund = ?):

    Refund?
      Yes -> NO
      No  -> MarSt?
               Married -> NO
               Single, Divorced -> TaxInc?
                                     < 80K -> NO
                                     > 80K -> YES

    • Send the record down both Refund branches with the corresponding weights; at the MarSt node the accumulated weights give:
      • P(MarSt = Married) = 3.67/6.67
      • P(MarSt = {Single, Divorced}) = 3/6.67
    • The predicted class is the one with the largest total probability mass.


    Model selection (model selection):

