
# Data Mining Chapter 3 Output: Knowledge Representation

Data Mining Chapter 3 Output: Knowledge Representation. Kirk Scott. A summary of ways of representing knowledge, the results of mining: rule sets, decision trees, regression equations, and clusters. Deciding what kind of output you want is the first step towards picking a mining algorithm.


### Data Mining Chapter 3 Output: Knowledge Representation

Kirk Scott

3.1 Tables

• Output can be in the form of tables

• This is kind of lame

• All they’re saying is that instances can be organized to form a lookup table for classification

• The contact lens data can be viewed in this way

• At the end they will consider another way in which the instance set itself is pretty much the result of mining

Fitting a Line

• This would be a linear equation relating cache size to computer performance

• PRP = 37.06 + 2.47 CACH

• This defines the straight line that best fits the instances in the data set

• Figure 3.1, on the following overhead, shows both the data points and the line
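As a minimal sketch, the fitted line can be evaluated directly to get a performance prediction for a given cache size. Only the two coefficients come from the equation above; the function name `predict_prp` and the sample cache sizes are made up for illustration.

```python
def predict_prp(cach: float) -> float:
    """Predicted performance (PRP) for a cache size (CACH), using the fitted line."""
    return 37.06 + 2.47 * cach


# Hypothetical cache sizes, chosen only to show how the line is applied.
for cach in (0, 32, 256):
    print(cach, round(predict_prp(cach), 2))
```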

Finding a Boundary

• A different technique will find a linear decision boundary

• This linear equation in petal length and petal width will separate instances of Iris setosa and Iris versicolor

• 2.0 – 0.5 PETAL_LENGTH – 0.8 PETAL_WIDTH = 0

• An instance of Iris setosa should give a value >0 (below/to the left of the line, on the same side as the origin) and an instance of Iris versicolor should give a value <0

• Figure 3.2, on the following overhead, shows the boundary line and the instances of the two kinds of Iris
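One way to read the boundary is as a function whose sign decides the class. The sketch below assumes that reading; the function names and the two sample flowers are hypothetical, and sending a value of exactly 0 to versicolor is an arbitrary choice made here.

```python
def boundary_value(petal_length: float, petal_width: float) -> float:
    """Left-hand side of the boundary equation from the slide."""
    return 2.0 - 0.5 * petal_length - 0.8 * petal_width


def classify(petal_length: float, petal_width: float) -> str:
    # Positive side of the line: setosa. Zero or negative: versicolor
    # (the tie-break at exactly 0 is an assumption of this sketch).
    if boundary_value(petal_length, petal_width) > 0:
        return "Iris setosa"
    return "Iris versicolor"


# Hypothetical measurements in cm: a small-petaled and a large-petaled flower.
print(classify(1.4, 0.2))
print(classify(4.5, 1.5))
```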

3.3 Trees

Null Values

• If nulls occur, you will have to make a decision based on them in any case

• The occurrence of a null value may be one of the separate branches out of a decision tree node

• At this point the value of assigning a meaning to null becomes apparent (not available, not applicable, not important…)

• Approaches to dealing with uncoded/undistinguished nulls:

• Keep track of the number of instances per branch and classify nulls with the most popular branch

• Alternatively, keep track of the relative frequency of different branches

• In the aggregate results, assign a corresponding proportion of the nulls to the different branches
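The two approaches to undistinguished nulls can be sketched as follows. The branch names and counts are made up, and both helper functions are illustrative rather than taken from the book or from Weka.

```python
def most_popular_branch(branch_counts: dict) -> str:
    """Approach 1: send every null down the branch with the most instances."""
    return max(branch_counts, key=branch_counts.get)


def split_nulls_proportionally(branch_counts: dict, n_nulls: int) -> dict:
    """Approach 2: share the nulls out according to relative branch frequency."""
    total = sum(branch_counts.values())
    return {branch: n_nulls * count / total
            for branch, count in branch_counts.items()}


# Hypothetical counts of non-null instances per branch at one node.
counts = {"sunny": 5, "overcast": 4, "rainy": 1}
print(most_popular_branch(counts))
print(split_nulls_proportionally(counts, 20))
```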

Other Kinds of Comparisons

• Simple decisions compare attribute values and constants

• Some decisions may compare two attributes in the same instance

• Some decisions may be based on a function of >1 attribute per instance

Oblique Splits

• Comparing an attribute to a constant splits data parallel to an axis

• A decision function which doesn’t split parallel to an axis is called an oblique split

• In effect, the boundary between the kinds of irises shown earlier is such a split

Option Nodes

• A single node with alternative splits on different attributes is called an option node

• Instances are classified according to each split and may appear in >1 leaf classification

• The last part of analysis includes deciding what such results indicate

Weka and Hand-Made Decision Trees

• The book suggests that you can get a handle on decision trees by making one yourself

• The book illustrates how Weka includes tools for doing this

• To me this seems out of place until chapter 11 when Weka is introduced

• I will not cover it here

Regression Trees

• For a problem with numeric attributes it’s possible to devise a tree-like classifier

• Working from the bottom up:

• The leaves contain the performance prediction

• The prediction is the average of the performance of all instances that end up classified in that leaf

• The internal nodes contain numeric comparisons of attribute values
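A toy regression tree with a single internal node shows the idea. The threshold, the attribute, and the training pairs below are all invented for illustration; a real tree would have more levels and would be learned from data.

```python
from statistics import mean

# Hypothetical training data: (CACH, observed performance).
training = [(0, 20.0), (4, 30.0), (16, 70.0), (64, 110.0)]


def leaf_of(cach: float, threshold: float = 8) -> str:
    """Internal node: a numeric comparison of one attribute with a constant."""
    return "left" if cach <= threshold else "right"


# Each leaf predicts the average performance of the instances that land in it.
leaf_values = {}
for cach, perf in training:
    leaf_values.setdefault(leaf_of(cach), []).append(perf)
leaf_prediction = {leaf: mean(vals) for leaf, vals in leaf_values.items()}


def predict(cach: float) -> float:
    return leaf_prediction[leaf_of(cach)]


print(predict(2), predict(32))
```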

Model Trees

• A model tree is a hybrid of a decision tree and regression

• In a model tree instances are classified into a given leaf

• Once a classification reaches the leaf, the prediction is made by applying a linear equation to some subset of instance attribute values
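In sketch form, the only change from a regression tree is that a leaf stores a linear equation rather than a single average. The split threshold and the coefficients below are made up purely for illustration.

```python
# Hypothetical leaf models: (intercept, slope) of a linear equation in CACH.
LEAF_MODELS = {"left": (10.0, 2.0), "right": (50.0, 0.5)}


def predict(cach: float, threshold: float = 8) -> float:
    # Step 1: route the instance to a leaf, as in a decision tree.
    leaf = "left" if cach <= threshold else "right"
    # Step 2: apply that leaf's linear equation to the attribute value.
    intercept, slope = LEAF_MODELS[leaf]
    return intercept + slope * cach


print(predict(4), predict(100))
```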

Rule Sets from Trees

• Given a decision tree, you can generate a corresponding set of rules

• Start at the root and trace the path to each leaf, recording the conditions at each node

• The rules in such a set are independent

• Each covers a separate case
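Reading rules off a tree amounts to a root-to-leaf traversal. The sketch below assumes a hypothetical representation in which a tree is either a class label (a string) or a pair of an attribute name and a dictionary of branches; the small tree itself is invented.

```python
# A leaf is a class label; an internal node is (attribute, {value: subtree}).
tree = ("a", {
    "yes": ("b", {"yes": "x", "no": "not x"}),
    "no": "not x",
})


def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) rule per leaf, tracing the path from the root."""
    if isinstance(node, str):
        return [(list(conditions), node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = tree_to_rules(tree)
for conditions, label in rules:
    tests = " and ".join(f"{attr} = {val}" for attr, val in conditions)
    print(f"If {tests} then {label}")
```

Because every instance follows exactly one path, exactly one of the generated rules fires for it, which is why the rules are independent.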

Trees from Rule Sets

• Given a rule set, you can generate a decision tree

• Now we’re interested in going in the opposite direction

• Even a relatively simple rule set can lead to a messy tree

An Example

• Take these rules for example:

• If a and b then x

• If c and d then x

• The result is implicitly binary, either x or not x

• The other variables are also implicitly binary (T or F)

Messiness = Replicated Subtrees

• The tree is messy because it contains replicated subtrees

• If a = yes and b = no, you then have to test c and d

• If a = no, you have to do exactly the same test on c and d

• The gray leaves in the middle and the gray leaves on the right both descend from analogous branches of the tree

• The book states that “decision trees cannot easily express the disjunction implied among the different rules in a set.”

• Translation:

• One rule deals with a and b

• The other rule is disjoint from the first rule; it deals only with c and d

• As seen above, whenever a or b comes out “no”, you have to do the same test on c and d

Another Example of Replicated Subtrees

• Figure 3.6, on the following overhead, illustrates an exclusive or (XOR) function

• Consider the graph:

• (x = 1) XOR (y = 1) → a

• Incidentally, note that you could also write:

• (x <> y) → a, (x = y) → b

• Now consider the tree:

• There’s nothing surprising: First test x, then test y

• The gray leaves on the left and the right at the bottom are analogous

• Now consider the rule set:

• In this example the rule set is not simpler

• This doesn’t negate the fact that the tree has replication
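For comparison, here is the XOR concept written once as a four-rule set with constant comparisons and once as the single attribute-to-attribute comparison noted above. Both functions are an illustrative sketch, with x and y taking the values 0 and 1.

```python
def classify_with_rules(x: int, y: int) -> str:
    """One rule per combination of x and y, comparing each to a constant."""
    if x == 1 and y == 0:
        return "a"
    if x == 0 and y == 1:
        return "a"
    if x == 0 and y == 0:
        return "b"
    return "b"  # remaining case: x == 1 and y == 1


def classify_compact(x: int, y: int) -> str:
    """The same concept as a comparison between the two attributes."""
    return "a" if x != y else "b"


for x in (0, 1):
    for y in (0, 1):
        print(x, y, classify_with_rules(x, y), classify_compact(x, y))
```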

Yet Another Example of a Replicated Subtree

• Consider Figure 3.7, shown on the following overhead

• In this example there are again 4 attributes

• This time they are 3-valued instead of binary

• There are 2 disjoint rules, each including 2 of the variables

• There is a default rule for all other cases

Other Issues with Rule Sets

• We have not seen the data mining algorithms yet, but some do not generate rule sets in a way analogous to reading all of the cases off of a decision tree

• Rule sets (especially those not designed to be applied in a given order) may contain conflicting rules that classify specific cases into different categories

Rule Sets that Produce Multiple Classifications

• In practice you can take two approaches

• Do not classify instances that fall into >1 category

• Count how many times each rule is triggered by a training set and use the most popular of the classification rules when two conflict
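The second approach might look like the sketch below. The two rules, the attribute names, and the tiny training set are all hypothetical, and the tie-break (the first fired rule wins when counts are equal) is an arbitrary choice of this sketch.

```python
from collections import Counter

# Each hypothetical rule is (name, condition, predicted class).
RULES = [
    ("r1", lambda inst: inst["outlook"] == "sunny", "no"),
    ("r2", lambda inst: inst["windy"] is False, "yes"),
]


def count_firings(training_set):
    """Count how many training instances trigger each rule."""
    return Counter(name for inst in training_set
                   for name, cond, _ in RULES if cond(inst))


def classify(inst, firings):
    """Among the rules that fire, use the one triggered most often in training."""
    fired = [(name, cls) for name, cond, cls in RULES if cond(inst)]
    if not fired:
        return None  # no rule applies to this instance
    _, cls = max(fired, key=lambda pair: firings[pair[0]])
    return cls


training_set = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "overcast", "windy": False},
]
firings = count_firings(training_set)
# Both rules fire on this instance; r2 was triggered more often in training.
print(classify({"outlook": "sunny", "windy": False}, firings))
```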

Rule Sets that Don’t Classify Certain Cases

• If a rule set doesn’t classify certain cases, there are again two alternatives:

• Do not classify those instances

• Classify those instances with the most frequently occurring class

The Simplest Case with Rule Sets

• Suppose all variables are Boolean

• I.e., suppose rules only have two possible outcomes, T/F

• Suppose only rules with T outcomes are expressed

• (By definition, all unexpressed cases are F)

• Under the foregoing assumptions:

• The rules are independent

• The order of applying the rules is immaterial

• The outcome is deterministic

• There is no ambiguity
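Under these assumptions a rule set is just a disjunction of its T rules, so the order of evaluation cannot matter. The sketch below checks this for a small made-up rule set over Boolean attributes a, b, c, d by trying every ordering of the rules.

```python
from itertools import permutations

# Hypothetical rules; each one, when it fires, concludes T.
RULES = [
    lambda v: v["a"] and v["b"],
    lambda v: v["c"] and v["d"],
]


def classify(instance, rules) -> bool:
    """T if any rule fires; every case no rule expresses is F."""
    return any(rule(instance) for rule in rules)


instance = {"a": True, "b": True, "c": False, "d": False}
# The result is the same whichever order the rules are applied in.
results = {classify(instance, order) for order in permutations(RULES)}
print(results)
```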

Reality is More Complex

• In practice, there can be ambiguity

• The authors state that the assumption that there are only two cases, T/F, and only T is expressed, is a form of closed world assumption

• In other words, the assumption is that everything is binary

Association Rules

• This subsection is largely repetition

• Any subset of attributes may predict any other subset of attributes

• Association rules are really just a generalization or superset of classification rules

Terminology (Again)

• Coverage = Support = Proportion of cases in the data set to which an association rule applies

• Note that the book seems to define coverage somewhat differently in this section

• Ignore what the book says and let the definition given here be the operational one

Interesting Rules

• Association Rules are considered interesting if:

• They exceed some threshold for support

• They exceed some threshold for confidence
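A small sketch of the threshold test follows. It takes support as the proportion of cases in which the rule's condition holds, and it assumes that confidence is the proportion of those cases in which the conclusion also holds. The data set, the rule, and the threshold values are all made up.

```python
def support_and_confidence(data, antecedent, consequent):
    """Return (support, confidence) for one rule over a list of instances."""
    applies = [row for row in data if antecedent(row)]
    if not applies:
        return 0.0, 0.0
    correct = [row for row in applies if consequent(row)]
    return len(applies) / len(data), len(correct) / len(applies)


def is_interesting(support, confidence, min_support=0.4, min_confidence=0.8):
    # Both thresholds are hypothetical; in practice the analyst sets them.
    return support > min_support and confidence > min_confidence


# Hypothetical data set with four instances.
data = [
    {"A": 1, "B": 0, "X": 0},
    {"A": 1, "B": 0, "X": 0},
    {"A": 0, "B": 0, "X": 1},
    {"A": 1, "B": 1, "X": 0},
]
support, confidence = support_and_confidence(
    data,
    antecedent=lambda r: r["A"] == 1 and r["B"] == 0,
    consequent=lambda r: r["X"] == 0,
)
print(support, confidence, is_interesting(support, confidence))
```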

Strength of Association Rules

• An association rule that implies another association rule is stronger than the rule it implies

• The stronger rule should be reported

• It is not necessary to report the weaker rule(s)

An Example of an Association Rule Implying Another

• The book illustrates this idea with a concrete weather example

• It will be presented here in general form

• Let Rule 1 be given as shown below

• It has a compound conclusion

• Rule 1: If A = 1 and B = 0

• Then X = 0 and Y = 1

• Suppose that Rule 1 meets the thresholds for support and confidence

• Now consider Rule 2:

• Rule 2: If A = 1 and B = 0

• Then X = 0

• This applies to the same proportion of cases as Rule 1

• Therefore it meets the support criterion

• A data set may be mined for rules

• New instances may arrive which the rule set doesn’t correctly classify

• The new instances can be handled by adding “exceptions” to the rules

• It is not necessary to re-do the mining and make wholesale changes to the existing rule set

The Iris Exception Example

• The book illustrates exceptions with a new instance for the iris data set

• In Figure 3.8, the amended rules are expressed in terms of default cases, exceptions, and if/else rules

• Note up front that I’m not too interested in this approach

• The book observes that exceptions may be “psychologically”, if not logically, preferable

• The use of exceptions may better mirror how human beings model the situation

• It may even reflect the thinking of an informed expert more closely than a re-done set of rules

More Expressive Rules

• Simple rules compare attributes with constants

• Possibly more powerful rules may compare attributes with other attributes

• Recall the decision boundary example, giving a linear equation in x and y

• Also recall the XOR example where the condition of interest could be summarized as x <> y

Geometric Figures Example

• The book illustrates the idea with the concept of geometric figures lying down or standing up

• Comparing attributes with fixed values might work for a given data set

• However, it boils down to comparing width and height attributes of instances

• Consider Figure 3.9 on the following overhead

• It may be computationally expensive for an algorithm to compare instance attributes

• Preprocessing data for input may include hardcoding the comparison as an attribute itself

• Note how this implies that the user already understands this relationship in the first place
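Hardcoding the comparison as an attribute could look like the sketch below. The attribute name `wider_than_tall` and the sample figures are hypothetical; the point is that a simple rule comparing the new attribute with a constant is then enough.

```python
def add_orientation_attribute(figure: dict) -> dict:
    """Preprocessing step: store the width/height comparison as its own attribute."""
    return {**figure, "wider_than_tall": figure["width"] > figure["height"]}


def classify(figure: dict) -> str:
    # A simple rule: compare the derived attribute with a constant.
    return "lying" if figure["wider_than_tall"] else "standing"


# Hypothetical figures described only by width and height.
figures = [{"width": 4, "height": 2}, {"width": 1, "height": 3}]
prepared = [add_orientation_attribute(f) for f in figures]
print([classify(f) for f in prepared])
```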

Inductive Logic Programming

• The tasks getHeight() and getWidth() could be functionalized

• Composite geometric figures could be defined

• Recursive rules could be defined for determining whether composite figures were lying or standing

• This branch of data mining is called inductive logic programming

• It will not be pursued further

• This goes back to the idea that the data set is itself, somehow, the result of the mining

• Instead of training and arriving at a set of rules in advance, work on the fly

Prerequisites for Accomplishing This

• You need to define distance in the space of the n attributes of the instances

• Potentially you need to normalize or weight the individual attributes

• In general, you need to know which attributes are important in the problem domain

• In short, the existing data set should already contain correct classifications

• Instance-based methods don’t immediately appear to provide a structural representation, like a rule set

• However, taken together, the different parts of the process form a representation

• The training subset, distance metric, and nearest neighbor rule define boundaries in n-space between instances

• Figure 3.10 (c): This shows that in practice the classification neighborhoods will be simplified to rectangular areas in space

• Figure 3.10 (d): This illustrates the idea that you can have donut shaped classes, with one class’s instances completely contained within another’s
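Putting the prerequisites together, a nearest-neighbor classifier can be sketched as below. This version assumes min-max normalization and a weighted Euclidean distance, which are only one possible choice; the training instances are made up, with petal length and petal width as the two attributes.

```python
import math


def min_max_ranges(instances):
    """(min, max) of each attribute over the stored instances."""
    return [(min(col), max(col)) for col in zip(*instances)]


def normalize(x, ranges):
    """Scale each attribute to [0, 1] so no attribute dominates the distance."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(x, ranges)]


def nearest_neighbor(query, training, weights=None):
    """Classify `query` with the class of the closest stored instance."""
    ranges = min_max_ranges([features for features, _ in training])
    weights = weights or [1.0] * len(ranges)
    q = normalize(query, ranges)

    def distance(features):
        n = normalize(features, ranges)
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, q, n)))

    _, label = min(training, key=lambda pair: distance(pair[0]))
    return label


# Hypothetical, already correctly classified instances.
training = [
    ((1.4, 0.2), "setosa"), ((1.5, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"),
]
print(nearest_neighbor((1.6, 0.25), training))
print(nearest_neighbor((4.4, 1.3), training))
```

The `weights` argument is where knowledge of which attributes matter in the problem domain would enter; leaving it out treats all attributes as equally important.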

3.6 Clusters

• Clustering is not the classification of individual instances

• It is the partitioning of the space

• The book illustrates the ideas with Figure 3.11, shown on the following overhead

• This is followed by brief explanatory comments

• Figure 3.11 (a): This shows mutually exclusive partitions or classes

• Figure 3.11 (b): This shows that instances may be classified in >1 cluster

• Figure 3.11 (c): This shows that the assignment of an instance to a cluster may be probabilistic

• Figure 3.11 (d): A dendrogram is a technique for showing hierarchical relationships among clusters

The End