Data Mining Chapter 3 Output: Knowledge Representation



Data Mining Chapter 3 Output: Knowledge Representation. Kirk Scott. A summary of ways of representing knowledge, the results of mining: rule sets, decision trees, regression equations, and clusters. Deciding what kind of output you want is the first step towards picking a mining algorithm.


### Data Mining Chapter 3 Output: Knowledge Representation

Kirk Scott

• Rule sets
• Decision trees
• Regression equations
• Clusters
• Deciding what kind of output you want is the first step towards picking a mining algorithm

Output can be in the form of tables

• This is kind of lame
• All they’re saying is that instances can be organized to form a lookup table for classification
• The contact lens data can be viewed in this way
• At the end they will consider another way in which the instance set itself is pretty much the result of mining

For problems with numeric attributes you can apply statistical methods

• The computer performance example was given earlier
• The methods will be covered in more detail in chapter 4
• The statistical approach can be illustrated graphically
Fitting a Line
• This would be a linear equation relating cache size to computer performance
• PRP = 37.06 + 2.47 CACH
• This defines the straight line that best fits the instances in the data set
• Figure 3.1, on the following overhead, shows both the data points and the line
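The fitted line can be used directly as a predictor. The following is a minimal Python sketch that uses only the coefficients given above; the function name `predict_prp` and the sample cache sizes are made up for illustration.

```python
def predict_prp(cach: float) -> float:
    """Predicted performance (PRP) for a given cache size (CACH),
    using the line PRP = 37.06 + 2.47 CACH from the slide."""
    return 37.06 + 2.47 * cach


# Hypothetical cache sizes, chosen only to show the line in use.
for cach in (0, 8, 64):
    print(cach, round(predict_prp(cach), 2))
```

The intercept is the prediction for a machine with no cache, and each additional unit of cache adds 2.47 to the predicted performance.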
Finding a Boundary
• A different technique will find a linear decision boundary
• This linear equation in petal length and petal width will separate instances of Iris setosa and Iris versicolor
• 2.0 – 0.5 PETAL_LENGTH – 0.8 PETAL_WIDTH = 0

An instance of Iris setosa should give a value >0 (below/to the left of the line, because its petals are smaller) and an instance of Iris versicolor should give a value <0

• Figure 3.2, on the following overhead, shows the boundary line and the instances of the two kinds of Iris
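Using the boundary as a classifier amounts to checking the sign of the left-hand side of the equation. Here is a small Python sketch; the function names and the two sample measurements are illustrative assumptions, not values taken from the slides.

```python
def boundary_value(petal_length: float, petal_width: float) -> float:
    """Left-hand side of 2.0 - 0.5 PETAL_LENGTH - 0.8 PETAL_WIDTH = 0."""
    return 2.0 - 0.5 * petal_length - 0.8 * petal_width


def classify_iris(petal_length: float, petal_width: float) -> str:
    """Positive side of the boundary is setosa, negative side is versicolor."""
    if boundary_value(petal_length, petal_width) > 0:
        return "Iris setosa"
    return "Iris versicolor"


# Illustrative measurements in cm (assumed, not from the data set listing).
print(classify_iris(1.4, 0.2))  # small petals
print(classify_iris(4.5, 1.5))  # larger petals
```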

The book summarizes the different kinds of decisions (< , =, etc.) that might be coded for a single attribute at each node in a decision tree

• Most are straightforward and don’t need to be repeated here
• Several more noteworthy aspects will be addressed on the following overheads
Null Values
• If nulls occur, you will have to make a decision based on them in any case
• The occurrence of a null value may be one of the separate branches out of a decision tree node
• At this point the value of assigning a meaning to null becomes apparent (not available, not applicable, not important…)

Approaches to dealing with uncoded/undistinguished nulls:

• Keep track of the number of instances per branch and classify nulls with the most popular branch
• Alternatively, keep track of the relative frequency of different branches
• In the aggregate results, assign a corresponding proportion of the nulls to the different branches
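Both approaches to undistinguished nulls can be sketched in a few lines of Python. This is only an illustration: the attribute values are hypothetical, and nulls are represented by `None`.

```python
from collections import Counter


def branch_counts(values):
    """Count the instances that go down each branch, ignoring nulls."""
    return Counter(v for v in values if v is not None)


def most_popular_branch(values):
    """Approach 1: send every null down the most popular branch."""
    return branch_counts(values).most_common(1)[0][0]


def proportional_shares(values):
    """Approach 2: split the nulls across branches by relative frequency."""
    counts = branch_counts(values)
    total = sum(counts.values())
    nulls = sum(1 for v in values if v is None)
    return {branch: nulls * n / total for branch, n in counts.items()}


# Hypothetical attribute column with two null values.
outlook = ["sunny", "sunny", "rainy", None, "sunny", "overcast", None, "rainy"]
print(most_popular_branch(outlook))
print(proportional_shares(outlook))
```

With the second approach the shares are fractional, which is why it applies to the aggregate results rather than to the classification of a single instance.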
Other Kinds of Comparisons
• Simple decisions compare attribute values and constants
• Some decisions may compare two attributes in the same instance
• Some decisions may be based on a function of >1 attribute per instance
Oblique Splits
• Comparing an attribute to a constant splits data parallel to an axis
• A decision function which doesn’t split parallel to an axis is called an oblique split
• In effect, the boundary between the kinds of irises shown earlier is such a split
Option Nodes
• A single node with alternative splits on different attributes is called an option node
• Instances are classified according to each split and may appear in >1 leaf classification
• The last part of analysis includes deciding what such results indicate
• The book suggests that you can get a handle on decision trees by making one yourself
• The book illustrates how Weka includes tools for doing this
• To me this seems out of place until chapter 11 when Weka is introduced
• I will not cover it here
Regression Trees
• For a problem with numeric attributes it’s possible to devise a tree-like classifier
• Working from the bottom up:
• The leaves contain the performance prediction
• The prediction is the average of the performance of all instances that end up classified in that leaf
• The internal nodes contain numeric comparisons of attribute values
Model Trees
• A model tree is a hybrid of a decision tree and regression
• In a model tree instances are classified into a given leaf
• Once a classification reaches the leaf, the prediction is made by applying a linear equation to some subset of instance attribute values

Figure 3.4, on the following overhead, shows (a) a linear model, (b) a regression tree, and (c) a model tree
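The difference between a regression tree and a model tree can be sketched with a one-split tree in Python. Everything numeric here is a made-up assumption for illustration (the training pairs, the split threshold, and the leaf equations); none of it comes from Figure 3.4.

```python
from statistics import mean

# Hypothetical training data: (cache size, measured performance).
training = [(0, 20.0), (4, 30.0), (8, 40.0), (32, 100.0), (64, 180.0), (128, 320.0)]
THRESHOLD = 16  # hypothetical split point on cache size


def leaf_of(cach):
    """Internal node: a numeric comparison routes the instance to a leaf."""
    return "small" if cach <= THRESHOLD else "large"


# Regression tree: each leaf stores the average of its training instances.
leaf_average = {
    leaf: mean(y for x, y in training if leaf_of(x) == leaf)
    for leaf in ("small", "large")
}


def regression_tree_predict(cach):
    return leaf_average[leaf_of(cach)]


# Model tree: each leaf stores a linear equation (intercept, slope).
# The coefficients are assumed, not fitted.
leaf_model = {"small": (20.0, 2.5), "large": (25.0, 2.3)}


def model_tree_predict(cach):
    intercept, slope = leaf_model[leaf_of(cach)]
    return intercept + slope * cach


print(regression_tree_predict(64), model_tree_predict(64))
```

The regression tree gives the same prediction to every instance in a leaf, while the model tree still varies its prediction within the leaf.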

Rule Sets from Trees
• Given a decision tree, you can generate a corresponding set of rules
• Start at the root and trace the path to each leaf, recording the conditions at each node
• The rules in such a set are independent
• Each covers a separate case

The rules don’t have to be applied in a particular order

• The downside is such a rule set is more complex than an ordered set
• It is possible to prune a set derived from a tree to remove redundancy
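Reading the rules off a tree is a simple recursive walk. The Python sketch below assumes a hypothetical tree encoding (a leaf is a class label, an internal node is an attribute with a dictionary of branches); the small weather-style tree is only an example.

```python
# A node is either a leaf (a class label string) or a tuple
# (attribute, {attribute_value: subtree, ...}).
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(node, conditions=()):
    """Trace every root-to-leaf path, recording the condition at each node."""
    if isinstance(node, str):  # reached a leaf: emit one rule
        return [(list(conditions), node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


for conditions, label in tree_to_rules(tree):
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"If {tests} then {label}")
```

Each leaf produces exactly one rule, and no two rules can fire on the same instance, which is why the order of application does not matter.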
Trees from Rule Sets
• Given a rule set, you can generate a decision tree
• Now we’re interested in going in the opposite direction
• Even a relatively simple rule set can lead to a messy tree

A rule set may compactly represent a limited number of explicitly known cases

• The other cases may be implicit in the rule set
• The implicit cases have to be spelled out in the tree
An Example
• Take these rules for example:
• If a and b then x
• If c and d then x
• The result is implicitly binary, either x or not x
• The other variables are also implicitly binary (T or F)

With 4 variables, a, b, c, and d, there can be up to 4 levels in the tree

• A tree for this problem is shown in Figure 3.5 on the following overhead
Messiness = Replicated Subtrees
• The tree is messy because it contains replicated subtrees
• If a = yes and b = no, you then have to test c and d
• If a = no, you have to do exactly the same test on c and d
• The gray leaves in the middle and the gray leaves on the right both descend from analogous branches of the tree

The book states that “decision trees cannot easily express the disjunction implied among the different rules in a set.”

• Translation:
• One rule deals with a and b
• The other rule is disjoint from the first rule; it deals only with c and d
• As seen above, whether a is “no”, or a is “yes” and b is “no”, you have to do the same test on c and d
Another Example of Replicated Subtrees
• Figure 3.6, on the following overhead, illustrates an exclusive or (XOR) function

Consider the graph:

• (x = 1) XOR (y = 1) → a
• Incidentally, note that you could also write:
• (x <> y) → a, (x = y) → b

Now consider the tree:

• There’s nothing surprising: First test x, then test y
• The gray leaves on the left and the right at the bottom are analogous
• Now consider the rule set:
• In this example the rule set is not simpler
• This doesn’t negate the fact that the tree has replication
Yet Another Example of a Replicated Subtree
• Consider Figure 3.7, shown on the following overhead

In this example there are again 4 attributes

• This time they are 3-valued instead of binary
• There are 2 disjoint rules, each including 2 of the variables
• There is a default rule for all other cases

The replication is represented in the diagram in this way:

• Each gray triangle stands for an instance of the complete subtree on the lower left which is shown in gray

The rule set would be equally complex IF there were a rule for each branch of the tree

• It is less complex in this example because of the default rule
Other Issues with Rule Sets
• We have not seen the data mining algorithms yet, but some do not generate rule sets in a way analogous to reading all of the cases off of a decision tree
• Sets (especially those not designed to be applied in a given order) may contain conflicting rules that classify specific cases into different categories
Rule Sets that Produce Multiple Classifications
• In practice you can take two approaches
• Do not classify instances that fall into >1 category
• Count how many times each rule is triggered by a training set and use the most popular of the classification rules when two conflict
Rule Sets that Don’t Classify Certain Cases
• If a rule set doesn’t classify certain cases, there are again two alternatives:
• Do not classify those instances
• Classify those instances with the most frequently occurring class
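The counting approach to conflicts and the default-class approach to unclassified cases can be sketched together in Python. The rules, the training instances, and all of the names below are hypothetical.

```python
from collections import Counter

# Each rule is (name, condition, predicted class). Rules are hypothetical.
rules = [
    ("r1", lambda inst: inst["a"] == 1, "x"),
    ("r2", lambda inst: inst["b"] == 1, "y"),
]

# Hypothetical training set.
training = [
    {"a": 1, "b": 0, "class": "x"},
    {"a": 1, "b": 0, "class": "x"},
    {"a": 0, "b": 1, "class": "y"},
    {"a": 1, "b": 1, "class": "x"},
    {"a": 0, "b": 0, "class": "y"},
]


def fire_counts(rules, instances):
    """Count how many times each rule is triggered by the training set."""
    counts = Counter()
    for inst in instances:
        for name, condition, _ in rules:
            if condition(inst):
                counts[name] += 1
    return counts


def classify(inst, rules, counts, default):
    """Use the most popular triggered rule; fall back to the default class."""
    fired = [(counts[name], label) for name, cond, label in rules if cond(inst)]
    if not fired:
        return default
    return max(fired)[1]  # ties are broken arbitrarily (by label order)


counts = fire_counts(rules, training)
default = Counter(inst["class"] for inst in training).most_common(1)[0][0]
print(classify({"a": 1, "b": 1}, rules, counts, default))  # both rules fire
print(classify({"a": 0, "b": 0}, rules, counts, default))  # no rule fires
```

The other alternative in each case, leaving the instance unclassified, would correspond to returning `None` instead of picking a winner or a default.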
The Simplest Case with Rule Sets
• Suppose all variables are Boolean
• I.e., suppose rules only have two possible outcomes, T/F
• Suppose only rules with T outcomes are expressed
• (By definition, all unexpressed cases are F)

Under the foregoing assumptions:

• The rules are independent
• The order of applying the rules is immaterial
• The outcome is deterministic
• There is no ambiguity
Reality is More Complex
• In practice, there can be ambiguity
• The authors state that the assumption that there are only two cases, T/F, and only T is expressed, is a form of closed world assumption
• In other words, the assumption is that everything is binary, and anything not expressed as true is taken to be false

As soon as this and any other simplifying assumptions are relaxed, things become messier

• In other words, rules become dependent, the order of application matters, etc.
• This is when you can arrive at multiple classifications or no classifications from a rule set
Association Rules
• This subsection is largely repetition
• Any subset of attributes may predict any other subset of attributes
• Association rules are really just a generalization or superset of classification rules

This is because a classification rule is just one of many possible association rules: (all non-class attributes) → (class attribute)

• Because so many association rules are possible, you need criteria for defining interesting ones
Terminology (Again)
• Coverage = Support = Proportion of cases in the data set to which an association rule applies
• Note that the book seems to define coverage somewhat differently in this section
• Ignore what the book says and let the definition given here be the operational one

Accuracy = Confidence = Proportion of cases to which the rule applies where the prediction is correct

Interesting Rules
• Association Rules are considered interesting if:
• They exceed some threshold for support
• They exceed some threshold for confidence
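Using the operational definitions given here (support is the proportion of cases the rule applies to, confidence is the proportion of those cases where the prediction is correct), the test for an interesting rule can be sketched in Python. The data set, the rule, and the threshold values are all made up for illustration.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Support: share of all instances where the antecedent holds.
    Confidence: share of those instances where the consequent also holds."""
    applies = [inst for inst in instances if antecedent(inst)]
    support = len(applies) / len(instances)
    correct = sum(1 for inst in applies if consequent(inst))
    confidence = correct / len(applies) if applies else 0.0
    return support, confidence


def is_interesting(support, confidence, min_support, min_confidence):
    """A rule is interesting if it exceeds both thresholds."""
    return support > min_support and confidence > min_confidence


# Hypothetical data set.
instances = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": "overcast", "play": "yes"},
]

# Rule: if outlook = sunny then play = no.
s, c = support_and_confidence(
    instances,
    antecedent=lambda inst: inst["outlook"] == "sunny",
    consequent=lambda inst: inst["play"] == "no",
)
print(s, c, is_interesting(s, c, min_support=0.5, min_confidence=0.6))
```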
Strength of Association Rules
• An association rule that implies another association rule is stronger than the rule it implies
• The stronger rule should be reported
• It is not necessary to report the weaker rule(s)
An Example of an Association Rule Implying Another
• The book illustrates this idea with a concrete weather example
• It will be presented here in general form
• Let Rule 1 be given as shown below
• It has a compound conclusion
• Rule 1: If A = 1 and B = 0
• Then X = 0 and Y = 1
• Suppose that Rule 1 meets the thresholds for support and confidence

Now consider Rule 2:

• Rule 2: If A = 1 and B = 0
• Then X = 0
• This applies to the same proportion of cases as Rule 1
• Therefore it meets the support criterion
• It can be true in no fewer cases than Rule 1’s conclusion
• This means its confidence can be no less
• Therefore it meets the confidence criterion

Rule 1 can be said to imply Rule 2

• Rule 1 could also be said to imply a Rule 3:
• Rule 3: If A = 1 and B = 0
• Then Y = 1
• Rule 1 is the stronger rule
• Therefore, when reporting association rules, Rule 1 should be reported but Rule 2 and Rule 3 should not be reported
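The argument that Rule 1 implies Rule 2 and Rule 3 can be checked numerically. The Python sketch below uses a made-up data set; the helper name `confidence` is hypothetical. All three rules share the same antecedent, so they apply to the same cases, and every case where Rule 1 is correct is also a case where Rule 2 and Rule 3 are correct.

```python
def confidence(instances, antecedent, consequent):
    """Proportion of the cases the rule applies to where it is correct."""
    applies = [inst for inst in instances if antecedent(inst)]
    return sum(1 for inst in applies if consequent(inst)) / len(applies)


# Hypothetical instances over the attributes A, B, X, Y.
instances = [
    {"A": 1, "B": 0, "X": 0, "Y": 1},
    {"A": 1, "B": 0, "X": 0, "Y": 1},
    {"A": 1, "B": 0, "X": 0, "Y": 0},
    {"A": 1, "B": 0, "X": 1, "Y": 1},
    {"A": 0, "B": 0, "X": 0, "Y": 1},
    {"A": 1, "B": 1, "X": 1, "Y": 0},
]


def shared_antecedent(inst):
    """If A = 1 and B = 0 (the same for Rules 1, 2, and 3)."""
    return inst["A"] == 1 and inst["B"] == 0


conf1 = confidence(instances, shared_antecedent,
                   lambda i: i["X"] == 0 and i["Y"] == 1)  # Rule 1
conf2 = confidence(instances, shared_antecedent, lambda i: i["X"] == 0)  # Rule 2
conf3 = confidence(instances, shared_antecedent, lambda i: i["Y"] == 1)  # Rule 3

print(conf1, conf2, conf3)
```

On any data set the confidence of Rule 2 and of Rule 3 is at least that of Rule 1, so if Rule 1 clears the thresholds the other two do as well.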
Rules with Exceptions
• A data set may be mined for rules
• New instances may arrive which the rule set doesn’t correctly classify
• The new instances can be handled by adding “exceptions” to the rules
• It is not necessary to re-do the mining and make wholesale changes to the existing rule set
The Iris Exception Example
• The book illustrates exceptions with a new instance for the iris data set
• In Figure 3.8, the amended rules are expressed in terms of default cases, exceptions, and if/else rules
• Note up front that I’m not too interested in this approach
• To me, a collection of nested if/elses makes more sense than the mixture presented here

The book observes that exceptions may be “psychologically”, if not logically, preferable

• The use of exceptions may better mirror how human beings model the situation
• It may even reflect the thinking of an informed expert more closely than a re-done set of rules
More Expressive Rules
• Simple rules compare attributes with constants
• Possibly more powerful rules may compare attributes with other attributes
• Recall the decision boundary example, giving a linear equation in x and y
• Also recall the XOR example where the condition of interest could be summarized as x <> y
Geometric Figures Example
• The book illustrates the idea with the concept of geometric figures lying down or standing up
• Comparing attributes with fixed values might work for a given data set
• However, it boils down to comparing width and height attributes of instances
• Consider Figure 3.9 on the following overhead
Dealing with Attribute-Attribute Comparisons
• It may be computationally expensive for an algorithm to compare instance attributes
• Preprocessing data for input may include hardcoding the comparison as an attribute itself
• Note how this implies that the user already understands this relationship in the first place
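Hardcoding the comparison as an attribute is a one-line preprocessing step. In this Python sketch the attribute name `width_gt_height` and the sample figures are hypothetical, and it assumes that a figure counts as lying down when its width exceeds its height.

```python
def add_comparison_attribute(instance):
    """Preprocessing: store the width/height comparison as its own attribute."""
    enriched = dict(instance)
    enriched["width_gt_height"] = instance["width"] > instance["height"]
    return enriched


def orientation(instance):
    """A simple rule that now only compares one attribute with a constant."""
    return "lying" if instance["width_gt_height"] else "standing"


# Hypothetical figures.
figures = [{"width": 2, "height": 4}, {"width": 7, "height": 3}]
for figure in figures:
    print(orientation(add_comparison_attribute(figure)))
```

After preprocessing, the mining algorithm only needs the simple attribute-versus-constant comparisons it already supports.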
Inductive Logic Programming
• The tasks getHeight() and getWidth() could be functionalized
• Composite geometric figures could be defined
• Recursive rules could be defined for determining whether composite figures were lying or standing
• This branch of data mining is called inductive logic programming
• It will not be pursued further
3.5 Instance-Based Representation
• This goes back to the idea that the data set is itself, somehow, the result of the mining
• Instead of training and arriving at a set of rules in advance, work on the fly

For each new instance, find its nearest neighbor in the set and classify it accordingly

• More elaborately, find the k nearest neighbors
• Devise a scheme for picking the class, like the majority class of the k nearest neighbors
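The k-nearest-neighbor scheme with majority voting can be sketched in Python as follows. This is a minimal illustration that assumes numeric attributes and Euclidean distance; the training points and class labels are made up.

```python
import math
from collections import Counter


def k_nearest_class(query, training, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `training` is a list of (attribute_tuple, class_label) pairs."""
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored instances.
training = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

print(k_nearest_class((1.1, 1.0), training, k=3))
print(k_nearest_class((5.1, 5.0), training, k=1))  # plain nearest neighbor
```

There is no training step: all of the work happens when a new instance arrives, which is the "on the fly" idea described above.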
Prerequisites for Accomplishing This
• You need to define distance in the space of the n attributes of the instances
• Potentially you need to normalize or weight the individual attributes
• In general, you need to know which attributes are important in the problem domain
• In short, the existing data set should already contain correct classifications
• It can be advisable to pick a subset of the existing data to use when classifying new instances
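Normalization matters because an attribute measured on a large scale would otherwise dominate the distance. One common choice, used here only as an example, is min-max scaling of each attribute to the range 0 to 1. The Python sketch below uses made-up data.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical instances: the second attribute is on a much larger scale.
rows = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
print(min_max_normalize(rows))
```

Weighting an attribute would be a further step, for example multiplying a normalized column by a factor that reflects its importance in the problem domain.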
Instance-Based Methods and Structural Representation
• Instance-based methods don’t immediately appear to provide a structural representation, like a rule set
• However, taken together, the different parts of the process form a representation
• The training subset, distance metric, and nearest neighbor rule define boundaries in n-space between instances

In effect, this forms a structural representation analogous to something seen before:

• You fall on one side or the other of a decision boundary in space
• The parts of Figure 3.10 are discussed after the figure is presented

Figure 3.10 (a): This shows the decision boundaries between two instances and the rest of the data set

• Figure 3.10 (b): This illustrates how you may only need a subset of the data set in order to form the boundaries if the algorithm is based purely on nearest neighbor considerations

Figure 3.10 (c): This shows that in practice the classification neighborhoods will be simplified to rectangular areas in space

• Figure 3.10 (d): This illustrates the idea that you can have donut shaped classes, with one class’s instances completely contained within another’s
3.6 Clusters
• Clustering is not the classification of individual instances
• It is the partitioning of the space
• The book illustrates the ideas with Figure 3.11, shown on the following overhead
• This is followed by brief explanatory comments
• Figure 3.11 (b): This shows that instances may be classified in >1 cluster
• Figure 3.11 (c): This shows that the assignment of an instance to a cluster may be probabilistic
• Figure 3.11 (d): A dendrogram is a technique for showing hierarchical relationships among clusters

You can ignore the following overheads

• They’re just stored here for future reference
• They were not included in the current version of the presentation of chapter 3