Fuzzy Interpretation of Discretized Intervals, Dr. Xindong Wu

Andrea Porter

April 11, 2002


Plan For Presentation

  • Introduction to Problem, HCV

  • Discretization Techniques/Fuzzy Borders

  • A Hybrid Solution for HCV

  • Experiments and Results

  • Conclusion


Introduction

  • Real-world data contains both numerical and nominal attributes, so an induction system must be able to deal with both types of data.

  • Existing systems discretize numerical domains into intervals and treat the intervals as nominal values during induction.

  • Problems occur when test examples are not covered by the training data: no rule applies (no-match) or rules of several classes apply (multiple-match).

  • The proposed solution is a hybrid approach that uses fuzzy intervals for the no-match problem.


HCV

  • An attribute-based rule induction algorithm that uses the extension matrix approach

    • Divide the positive examples (PE) into intersecting groups

    • In each group, find a heuristic conjunctive rule that covers all PE of the group and no negative examples (NE)

  • HCV can find a rule in the form of variable-valued logic

  • Its rules are more compact than the decision trees/rules of ID3 and C4.5


Variable Valued Logic and Selectors

  • Represents decisions where variables can take a range of values

  • Selector:

    [ X # R ]

    X = attribute

    # = relational operator ( = , <, >, . . . )

    R = reference, a list of one or more values

    e.g. [Windy = true][Temp > 90]
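
A selector can be modelled as a small data structure. The following is a minimal sketch, not HCV's actual C++ representation: the `Selector` class, the `OPS` table, and `conjunction_satisfied` are hypothetical names introduced here for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Relational operators a selector may use (an illustrative subset).
OPS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda value, ref: value in ref,  # reference given as a list of values
}


@dataclass(frozen=True)
class Selector:
    attribute: str  # X
    op: str         # the relational operator, written # on the slide
    reference: Any  # R

    def satisfied_by(self, example: Dict[str, Any]) -> bool:
        return OPS[self.op](example[self.attribute], self.reference)


def conjunction_satisfied(selectors: List[Selector], example: Dict[str, Any]) -> bool:
    """A conjunctive rule holds only if every one of its selectors holds."""
    return all(s.satisfied_by(example) for s in selectors)


# The slide's example rule: [Windy = true][Temp > 90]
rule = [Selector("Windy", "=", True), Selector("Temp", ">", 90)]
print(conjunction_satisfied(rule, {"Windy": True, "Temp": 95}))
print(conjunction_satisfied(rule, {"Windy": True, "Temp": 80}))
```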


HCV Software

  • C++ implementation

  • Can work with noisy and real-valued domains as well as nominal and noise-free databases

  • Provides a set of deduction facilities for the user to test the accuracy of the produced rules on test examples


Example DB


C4.5 Results vs. HCV

  • C4.5: The T class

    • X2 = b

    • X1 = 0 & X3 = a

    • X1 = 0 & X3 = b

    • X1 = 0 & X2 = a

  • C4.5: The F class

    • X1 = 1 & X2 = a

    • X1 = 1 & X2 = c

    • X2 = c & X3 = c

  • HCV: The T class

    • X2 = b

    • X1 = 0 & X2 = a

    • X1 = 0 & X4 = 0


Deduction of Induction Results

  • Induction generates knowledge from existing data

  • Deduction applies induction results to interpret new data.

  • With real-world data, induction cannot be assumed to be perfect

  • Three cases:

    1) no-match (measure of fit)

    2) single-match

    3) multiple-match (estimate of probability)


Discretization

  • Occurs during rule induction

  • Numerical domains are discretized into intervals, which are then treated like nominal values.

  • The challenge is to find the right borders for the intervals

  • Possible Methods:

    1) Simplest Class-Separating Method

    2) Information Gain Heuristic (implemented in HCV)


Simplest Class-Separating Method

  • Interval borders are placed between each adjacent pair of examples that have different classes.

  • If the attribute is very informative, the method is efficient and useful.

  • If the attribute is not informative, the method produces too many intervals.
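
A minimal sketch of this method, assuming each border is placed at the midpoint between the two neighbouring values (the slide only says "between"); `class_separating_borders` is a hypothetical helper name.

```python
def class_separating_borders(values, classes):
    """Return a border between every adjacent pair of sorted examples
    whose classes differ. Midpoint placement is an assumption."""
    pairs = sorted(zip(values, classes))
    borders = []
    for (x1, c1), (x2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and x1 != x2:
            borders.append((x1 + x2) / 2)
    return borders


# Informative attribute: the classes separate cleanly, one border suffices.
print(class_separating_borders([1, 2, 3, 10, 11, 12], list("aaabbb")))

# Uninformative attribute: the classes alternate, so every gap becomes a border.
print(class_separating_borders([1, 2, 3, 4, 5, 6], list("ababab")))
```

The second call illustrates the weakness noted above: six examples already yield five borders.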


Information Gain Heuristic

Use the information gain heuristic (IGH) to find more informative borders.

  • With the examples sorted by value, the candidate borders are $x = (x_i + x_{i+1})/2$ for $i = 1, \dots, n-1$

  • $x$ is a possible cut point if $x_i$ and $x_{i+1}$ are of different classes.

  • Use IGH to find the best $x$

  • Recursively split on the left and right of the chosen cut point

  • To stop recursive splitting:

    1) stop if the information gain is the same on all possible cut points.

    2) stop if the number of examples to split is less than a predefined number

    3) limit the number of intervals
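
The procedure can be sketched as follows. This is an illustration under stated assumptions, not HCV's implementation: information gain is computed with the usual class-entropy formula, the interval limit is approximated by a recursion-depth limit (`max_depth`), and the "same gain everywhere" stop is applied only when there is more than one candidate cut. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def entropy(classes):
    total = len(classes)
    return -sum((n / total) * math.log2(n / total) for n in Counter(classes).values())


def information_gain(pairs, cut):
    """Gain of splitting the (value, class) pairs at `cut`."""
    left = [c for x, c in pairs if x <= cut]
    right = [c for x, c in pairs if x > cut]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
    return entropy([c for _, c in pairs]) - remainder


def candidate_cuts(pairs):
    """Midpoints between adjacent examples of different classes."""
    return [(x1 + x2) / 2
            for (x1, c1), (x2, c2) in zip(pairs, pairs[1:])
            if c1 != c2 and x1 != x2]


def discretize(pairs, min_examples=4, max_depth=3):
    """pairs: (value, class) tuples sorted by value. Returns sorted cut points."""
    if len(pairs) < min_examples or max_depth == 0:   # stop rules 2 and 3
        return []
    cuts = candidate_cuts(pairs)
    if not cuts:
        return []
    gains = [information_gain(pairs, c) for c in cuts]
    if len(cuts) > 1 and max(gains) - min(gains) < 1e-12:   # stop rule 1
        return []
    best = cuts[gains.index(max(gains))]
    left = [p for p in pairs if p[0] <= best]
    right = [p for p in pairs if p[0] > best]
    return (discretize(left, min_examples, max_depth - 1)
            + [best]
            + discretize(right, min_examples, max_depth - 1))


data = sorted(zip([1, 2, 3, 10, 11, 12], list("aaabbb")))
print(discretize(data))
```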


Fuzzy Borders

  • Discretizing a continuous domain with sharp borders does not always allow an accurate interpretation.

  • Instead of sharp borders, use a membership function that measures the degree to which a value belongs to an interval.

  • A value can then be classified into several intervals at the same time (e.g. a single match can become a multiple match)


Fuzzy Borders (2)

  • Fuzzy matching - deduction with fuzzy borders of discretized intervals.

  • Take the interval with the greatest degree as the value’s discrete value.

  • 3 functions to fuzzify borders:

    1) linear

    2) polynomial

    3) arctan

  • Definitions

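    $s$ = spread parameter; $l$ = length of the original interval

    $x_{left}$, $x_{right}$ = left/right sharp borders

[Figure: an interval of length $l$ between the sharp borders $x_{left}$ and $x_{right}$, and the same interval with the spread $sl$ marked at its borders.]
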

Linear Membership Function

$k = \frac{1}{2sl}$

$a = -k\,x_{left} + \frac{1}{2}, \qquad b = k\,x_{right} + \frac{1}{2}$

$lin_{left}(x) = kx + a$

$lin_{right}(x) = -kx + b$

$lin(x) = \max(0, \min(1, lin_{left}(x), lin_{right}(x)))$
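
The linear function translates directly into code. In this sketch the name `linear_membership` and the choice to derive `l` from the two sharp borders are illustrative assumptions; the formulas themselves are the ones given on the slide.

```python
def linear_membership(x, x_left, x_right, s):
    """Degree to which x belongs to the interval [x_left, x_right]
    when its borders are fuzzified linearly with spread parameter s."""
    l = x_right - x_left          # length of the original interval
    k = 1.0 / (2.0 * s * l)       # slope of the fuzzy border
    a = -k * x_left + 0.5
    b = k * x_right + 0.5
    lin_left = k * x + a          # rises through 1/2 at the left sharp border
    lin_right = -k * x + b        # falls through 1/2 at the right sharp border
    return max(0.0, min(1.0, lin_left, lin_right))


# Interval [10, 20] with s = 0.1, so the fuzzy zone reaches sl = 1 on each side.
for x in (9, 10, 10.5, 15, 20, 21):
    print(x, linear_membership(x, 10, 20, 0.1))
```

A value exactly on a sharp border gets degree 1/2, which is what lets two adjacent intervals share it.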


Arctan Membership Function


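Polynomial Membership Function

$poly_{side}(x) = a_{side}x^3 + b_{side}x^2 + c_{side}x + d_{side}, \quad side \in \{left, right\}$

$a_{side} = \frac{1}{4(ls)^3}$

$b_{side} = -3a_{side}x_{side}$

$c_{side} = 3a_{side}(x_{side}^2 - (ls)^2)$

$d_{side} = -a_{side}(x_{side}^3 - 3x_{side}(ls)^2 + 2(ls)^3)$

$$poly(x) = \begin{cases} poly_{left}(x), & \text{if } x_{left} - ls \le x \le x_{left} + ls \\ poly_{right}(x), & \text{if } x_{right} - ls \le x \le x_{right} + ls \\ 1, & \text{if } x_{left} + ls \le x \le x_{right} - ls \\ 0, & \text{otherwise} \end{cases}$$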
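
The following sketch shows the shape such a cubic border can take. It does not reproduce the slide's coefficients, whose signs may not have survived transcription. Instead it assumes a cubic that is 0 at $x_{side} - ls$, 1/2 at the sharp border and 1 at $x_{side} + ls$ (mirrored on the right), with zero slope at both ends of the fuzzy zone. It also assumes $s < 0.5$ so the two fuzzy zones do not overlap. The name `poly_membership` is hypothetical.

```python
def poly_membership(x, x_left, x_right, s):
    """Cubic fuzzy membership for [x_left, x_right] under the assumptions
    stated above; the piecewise cases follow the slide."""
    ls = s * (x_right - x_left)

    def rising(u):
        # u is the signed distance from the sharp border, -ls <= u <= ls.
        return 0.5 + 3.0 * u / (4.0 * ls) - u ** 3 / (4.0 * ls ** 3)

    if x_left - ls <= x <= x_left + ls:
        return rising(x - x_left)
    if x_right - ls <= x <= x_right + ls:
        return rising(x_right - x)      # mirror image of the left border
    if x_left + ls < x < x_right - ls:
        return 1.0
    return 0.0


# Interval [10, 20] with s = 0.1 (ls = 1).
for x in (9, 10, 10.5, 11, 15, 20, 25):
    print(x, poly_membership(x, 10, 20, 0.1))
```

Compared with the linear border, the cubic changes faster near the sharp border and flattens out towards the ends of the fuzzy zone.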


Match Degree

  • Selector method - take the maximum membership degree of the value over all the intervals involved. If two adjacent intervals have the same class, values close to their shared border will have a low membership degree.

  • Conjunction method - combine the membership degrees with fuzzy plus:

    $a \oplus b = a + b - ab$


No-Match Resolution

Largest Class

  • Assign all no-match examples to the largest class, the default class.

  • Works well if the number of classes in the training set is small and one class is clearly larger than the others.

  • Deteriorates if there is a larger number of classes and the examples are evenly distributed among them.


No-Match Resolution

Measure of Fit

Calculate the measure of fit (MF) for each class:

1) Calculate MF for each selector (sel):

$$MF(sel, e) = \begin{cases} 1, & \text{if } sel \text{ is satisfied by } e \\ n/|x|, & \text{otherwise} \end{cases}$$

2) Calculate MF for each conjunctive rule (conj):

$$MF(conj, e) = \prod_{sel \in conj} MF(sel, e) \cdot \frac{n(conj)}{N}$$


No-Match Resolution

Measure of Fit (2)

3) calculate MF for each class c

MF(c, e) = MF(conj1, e) + MF(conj2, e) - MF(conj1,e)MF(conj2,e)

* For more than two rules, apply formula recursively.

* Find maximum MF - determines which class is closest to the example
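
The three steps can be sketched as below. The slides do not define $n$ and $|x|$; this sketch assumes $n$ is the number of values listed in the selector's reference and $|x|$ is the number of possible values of the attribute. The selector values are assumed to be multiplied together within a rule. All names are hypothetical.

```python
from functools import reduce
from math import prod


def mf_selector(satisfied, n_values, domain_size):
    """Step 1: 1 if the selector is satisfied, otherwise n/|x| (assumed meaning)."""
    return 1.0 if satisfied else n_values / domain_size


def mf_conjunction(selector_mfs, n_conj, n_total):
    """Step 2: product of the selector values, weighted by rule coverage n(conj)/N."""
    return prod(selector_mfs) * n_conj / n_total


def mf_class(conjunction_mfs):
    """Step 3: combine all rules of a class with a + b - ab, applied recursively."""
    return reduce(lambda a, b: a + b - a * b, conjunction_mfs, 0.0)


# One rule with a satisfied selector and an unsatisfied one (1 of 4 values listed),
# covering 10 of 40 training examples.
rule_mf = mf_conjunction([mf_selector(True, 1, 3), mf_selector(False, 1, 4)], 10, 40)
print(rule_mf)                   # 0.0625
print(mf_class([rule_mf, 0.2]))  # 0.25
```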


Multiple-Match

  • Caused by over-generalization of the training examples at induction time

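  • Example

    • Test example (X1 = a, X2 = 1)

      • All PE have X1 = a, so the rule for the positive class covers the example

      • All NE have X2 = 1, so the rule for the negative class covers it as well

      • The example matches rules of both classes: a multiple match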


Multiple-Match Resolution

First Hit

  • Use first rule which classifies the example

  • Produces reasonable results if the rules from induction have been ordered according to a measure of reliability

  • Advantages - straightforward, efficient

  • Disadvantages - have to sort rules at induction time


Multiple-Match Resolution

Largest Rule

  • Similar to largest class method from no-match resolution

  • Choose conjunctive rule that covers the most examples in the training set.


Multiple-Match Resolution

Estimation of Probability

  • Assign an EP value to each class based on the size of the satisfied conjunctive rules.

    1) Find EP for each conjunctive rule (conj):

    $$EP(conj, e) = \begin{cases} n(conj)/N, & \text{if } conj \text{ is satisfied by } e \\ 0, & \text{otherwise} \end{cases}$$

    n(conj) = number of examples covered by conj

    N = total number of examples


Multiple-Match Resolution

Estimation of Probability (2)

2) Find EP value for each class:

EP(c, e) = EP(conj1, e) + EP(conj2, e) - EP(conj1,e)EP(conj2,e).

* For more rules, apply formula recursively

* Choose class with highest EP value
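
A minimal sketch of the two EP steps, assuming each class's rules are supplied as `(satisfied, n_conj)` pairs; that input layout and the function names are hypothetical.

```python
def ep_conjunction(satisfied, n_conj, n_total):
    """Step 1: n(conj)/N if the rule is satisfied by the example, else 0."""
    return n_conj / n_total if satisfied else 0.0


def ep_class(conjunction_eps):
    """Step 2: combine the rules of one class with a + b - ab, applied recursively."""
    total = 0.0
    for ep in conjunction_eps:
        total = total + ep - total * ep
    return total


def resolve_multiple_match(rules_by_class, n_total):
    """Return the class with the highest EP value, plus all EP values."""
    scores = {
        cls: ep_class([ep_conjunction(sat, n, n_total) for sat, n in rules])
        for cls, rules in rules_by_class.items()
    }
    return max(scores, key=scores.get), scores


rules = {
    "T": [(True, 30), (True, 10)],   # two satisfied rules
    "F": [(True, 20), (False, 40)],  # one satisfied, one not
}
best, scores = resolve_multiple_match(rules, n_total=100)
print(best, scores)
```

In this example class T scores 0.3 + 0.1 - 0.03 = 0.37 and class F scores 0.2, so T is chosen even though the largest single rule belongs to F.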


Hybrid Interpretation

  • Used because fuzzy borders on their own only add conflicts: they do not reduce the number of rules that are applicable

  • HCV - use sharp borders during induction and use fuzzy borders only during deduction

  • Algorithm:

    * Single match - use the class indicated by the rules

    * Multiple match - use estimation of probability (EP) with sharp borders

    * No match - use fuzzy borders with the polynomial membership function to find the closest rule
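
The dispatch logic of the algorithm can be sketched as follows. The function `hybrid_deduce` and its callable arguments are hypothetical; the EP and fuzzy computations are passed in as callables so that each is evaluated only in the case that needs it.

```python
def hybrid_deduce(sharp_matches, ep_by_class, fuzzy_degree_by_class):
    """
    sharp_matches: set of classes with at least one rule satisfied
        under sharp borders.
    ep_by_class: callable returning {class: EP value} (multiple match).
    fuzzy_degree_by_class: callable returning {class: match degree}
        computed with fuzzy borders (no match).
    """
    if len(sharp_matches) == 1:              # single match
        return next(iter(sharp_matches))
    if len(sharp_matches) > 1:               # multiple match: EP, sharp borders
        scores = ep_by_class()
        return max(sharp_matches, key=lambda cls: scores[cls])
    degrees = fuzzy_degree_by_class()        # no match: fuzzy borders
    return max(degrees, key=degrees.get)


print(hybrid_deduce({"T"}, lambda: {}, lambda: {}))
print(hybrid_deduce({"T", "F"}, lambda: {"T": 0.37, "F": 0.2}, lambda: {}))
print(hybrid_deduce(set(), lambda: {}, lambda: {"T": 0.1, "F": 0.6}))
```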


The Data

  • Used 17 databases from the Machine Learning Database Repository, U. of California, Irvine.

  • Databases selected because:

    1) All include numerical data

    2) All lead to situations where no rules clearly apply.


Results – Predictive Accuracy


Results (cont.)

  • The results shown for C4.5 and NewID are the pruned ones

    • These were usually better than the unpruned ones in this experiment

  • HCV's parameters were not fine-tuned for the individual databases, because this would reduce the generality and applicability of the conclusions


Accuracy Results

  • HCV (hybrid) - 9 databases

  • C4.5 (R 8) - 7 databases

  • C4.5 (R 5) - 6 databases

  • HCV (large) - 3 databases

  • HCV (fuzzy) - 2 databases


HCV Comparison

  • HCV (fuzzy) generally performs better than the simple largest class method

  • HCV’s performance improves significantly when the fuzzy borders (for no match) are combined with probability estimation (for multiple match) in HCV (hybrid)


Conclusions

  • Fuzzy borders are constructed and used at deduction time only when a no match case occurs.

  • This hybrid method performs more accurately than several other current deduction programs.

  • Fuzziness is strongly domain dependent, so HCV allows users to specify their own intervals and fuzzy functions.

