### Data Mining Chapter 3 Output: Knowledge Representation

Kirk Scott

- A summary of ways of representing knowledge, the results of mining:
- Rule sets
- Decision trees
- Regression equations
- Clusters
- Deciding what kind of output you want is the first step towards picking a mining algorithm

3.1 Tables

- Output can be in the form of tables
- This is kind of lame
- All they’re saying is that instances can be organized to form a lookup table for classification
- The contact lens data can be viewed in this way
- At the end they will consider another way in which the instance set itself is pretty much the result of mining

3.2 Linear Models

- For problems with numeric attributes you can apply statistical methods
- The computer performance example was given earlier
- The methods will be covered in more detail in chapter 4
- The statistical approach can be illustrated graphically

Fitting a Line

- This would be a linear equation relating cache size to computer performance
- PRP = 37.06 + 2.47 CACH
- This defines the straight line that best fits the instances in the data set
- Figure 3.1, on the following overhead, shows both the data points and the line
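
As a minimal sketch, the fitted line can be applied directly to predict performance from cache size. The coefficients are the ones quoted above; the function name `predict_prp` and the sample cache sizes are assumptions made for illustration.

```python
def predict_prp(cach: float) -> float:
    """Predicted relative performance (PRP) for a given cache size (CACH),
    using the fitted line PRP = 37.06 + 2.47 CACH."""
    return 37.06 + 2.47 * cach


# Hypothetical cache sizes, chosen only to exercise the equation.
for cach in (0, 8, 64):
    print(f"CACH = {cach:>3}  ->  predicted PRP = {predict_prp(cach):.2f}")
```

The intercept is the prediction for a machine with no cache, and each extra unit of cache adds 2.47 to the predicted performance.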

Finding a Boundary

- A different technique will find a linear decision boundary
- This linear equation in petal length and petal width will separate instances of Iris setosa and Iris versicolor
- 2.0 – 0.5 PETAL_LENGTH – 0.8 PETAL_WIDTH = 0

- An instance of Iris setosa should give a value >0 (below/to the left of the line, since its petals are small) and an instance of Iris versicolor should give a value <0
- Figure 3.2, on the following overhead, shows the boundary line and the instances of the two kinds of Iris
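
A minimal sketch of using the boundary as a classifier follows. The equation is the one quoted above; the function names and the two sample measurements are assumptions chosen for illustration, not rows taken from the data set.

```python
def boundary_value(petal_length: float, petal_width: float) -> float:
    """Left-hand side of 2.0 - 0.5 PETAL_LENGTH - 0.8 PETAL_WIDTH = 0."""
    return 2.0 - 0.5 * petal_length - 0.8 * petal_width


def classify_iris(petal_length: float, petal_width: float) -> str:
    """Positive side of the boundary -> setosa, otherwise versicolor."""
    if boundary_value(petal_length, petal_width) > 0:
        return "Iris setosa"
    return "Iris versicolor"


# Hypothetical measurements: one small-petalled flower, one large-petalled.
print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.5))
```

Only the sign of the expression matters for the classification; its magnitude grows as an instance moves away from the line.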

3.3 Trees

- The book summarizes the different kinds of decisions (< , =, etc.) that might be coded for a single attribute at each node in a decision tree
- Most are straightforward and don’t need to be repeated here
- Several more noteworthy aspects will be addressed on the following overheads

Null Values

- If nulls occur, you will have to make a decision based on them in any case
- The occurrence of a null value may be one of the separate branches out of a decision tree node
- At this point the value of assigning a meaning to null becomes apparent (not available, not applicable, not important…)

- Approaches to dealing with uncoded/undistinguished nulls:
- Keep track of the number of instances per branch and classify nulls with the most popular branch
- Alternatively, keep track of the relative frequency of different branches
- In the aggregate results, assign a corresponding proportion of the nulls to the different branches
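
The two approaches can be sketched as follows, assuming nulls are represented by `None` and that a node splits on a single nominal attribute. The function names and the sample values are hypothetical.

```python
from collections import Counter


def most_popular_branch(values):
    """Approach 1: send every null down the branch taken most often."""
    known = [v for v in values if v is not None]
    return Counter(known).most_common(1)[0][0]


def distribute_nulls(values):
    """Approach 2: share the nulls among the branches in proportion
    to the relative frequency of each branch among the known values."""
    known = [v for v in values if v is not None]
    n_null = len(values) - len(known)
    counts = Counter(known)
    return {
        branch: count + n_null * count / len(known)
        for branch, count in counts.items()
    }


# Hypothetical attribute values arriving at one node; None marks a null.
values = ["sunny", "sunny", "rainy", None, "sunny", None]
print(most_popular_branch(values))
print(distribute_nulls(values))
```

The second function returns fractional counts, which is why it only makes sense for aggregate results and not for routing a single instance.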

Other Kinds of Comparisons

- Simple decisions compare attribute values and constants
- Some decisions may compare two attributes in the same instance
- Some decisions may be based on a function of >1 attribute per instance

Oblique Splits

- Comparing an attribute to a constant splits data parallel to an axis
- A decision function which doesn’t split parallel to an axis is called an oblique split
- In effect, the boundary between the kinds of irises shown earlier is such a split

Option Nodes

- A single node with alternative splits on different attributes is called an option node
- Instances are classified according to each split and may appear in >1 leaf classification
- The last part of analysis includes deciding what such results indicate

Weka and Hand-Made Decision Trees

- The book suggests that you can get a handle on decision trees by making one yourself
- The book illustrates how Weka includes tools for doing this
- To me this seems out of place until chapter 11 when Weka is introduced
- I will not cover it here

Regression Trees

- For a problem with numeric attributes it’s possible to devise a tree-like classifier
- Working from the bottom up:
- The leaves contain the performance prediction
- The prediction is the average of the performance of all instances that end up classified in that leaf
- The internal nodes contain numeric comparisons of attribute values
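
A minimal sketch of this structure, assuming a nested-dictionary tree with one numeric comparison per internal node. The split threshold and the performance values stored in the leaves are made up for illustration.

```python
from statistics import mean

# Hypothetical regression tree: an internal node compares one attribute
# with a constant; a leaf stores the performance values of the training
# instances that ended up there.
regression_tree = {
    "attribute": "CACH",
    "threshold": 32,
    "low": {"values": [20.0, 30.0, 40.0]},
    "high": {"values": [100.0, 140.0]},
}


def predict(node, instance):
    """Walk down the tree, then return the average stored in the leaf."""
    while "values" not in node:
        goes_low = instance[node["attribute"]] <= node["threshold"]
        node = node["low"] if goes_low else node["high"]
    return mean(node["values"])


print(predict(regression_tree, {"CACH": 8}))
print(predict(regression_tree, {"CACH": 64}))
```

Every instance that reaches the same leaf receives the same prediction, so the output of a regression tree is a step function of the attributes.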

Model Trees

- A model tree is a hybrid of a decision tree and regression
- In a model tree instances are classified into a given leaf
- Once a classification reaches the leaf, the prediction is made by applying a linear equation to some subset of instance attribute values
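
The difference from a regression tree can be sketched by storing a linear equation in each leaf instead of an average. The tree layout, attribute names, thresholds, and coefficients below are all hypothetical.

```python
# Hypothetical model tree: internal nodes split on an attribute,
# leaves hold a linear equation over a subset of the attributes.
model_tree = {
    "attribute": "CACH",
    "threshold": 32,
    "low": {"intercept": 10.0, "coefficients": {"CACH": 1.0}},
    "high": {"intercept": 50.0, "coefficients": {"CACH": 2.0, "CHMAX": 0.5}},
}


def predict_model_tree(node, instance):
    """Route the instance to a leaf, then evaluate that leaf's equation."""
    while "intercept" not in node:
        goes_low = instance[node["attribute"]] <= node["threshold"]
        node = node["low"] if goes_low else node["high"]
    return node["intercept"] + sum(
        weight * instance[attribute]
        for attribute, weight in node["coefficients"].items()
    )


print(predict_model_tree(model_tree, {"CACH": 8, "CHMAX": 4}))
print(predict_model_tree(model_tree, {"CACH": 64, "CHMAX": 16}))
```

Two instances in the same leaf can now receive different predictions, because the leaf's equation still depends on their attribute values.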

- Figure 3.4, on the following overhead, shows (a) a linear model, (b) a regression tree, and (c) a model tree

Rule Sets from Trees

- Given a decision tree, you can generate a corresponding set of rules
- Start at the root and trace the path to each leaf, recording the conditions at each node
- The rules in such a set are independent
- Each covers a separate case

- The rules don’t have to be applied in a particular order
- The downside is that such a rule set is more complex than an ordered set
- It is possible to prune a set derived from a tree to remove redundancy
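
Reading rules off a tree can be sketched as a recursive walk that records the condition on each branch. The nested-dictionary tree format and the tiny example tree are assumptions for illustration; no pruning is attempted.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) rule per leaf of the tree."""
    if "class" in node:  # leaf: the path so far becomes a rule
        return [(list(conditions), node["class"])]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attribute"], value)
        rules += tree_to_rules(child, conditions + (test,))
    return rules


# Hypothetical tree over two yes/no attributes.
tree = {
    "attribute": "a",
    "branches": {
        "yes": {
            "attribute": "b",
            "branches": {"yes": {"class": "x"}, "no": {"class": "not x"}},
        },
        "no": {"class": "not x"},
    },
}

rules = tree_to_rules(tree)
for conditions, label in rules:
    tests = " and ".join(f"{attr} = {val}" for attr, val in conditions)
    print(f"If {tests} then {label}")
```

Because each rule corresponds to exactly one leaf, no two rules can fire on the same instance, which is why the order of application does not matter.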

Trees from Rule Sets

- Given a rule set, you can generate a decision tree
- Now we’re interested in going in the opposite direction
- Even a relatively simple rule set can lead to a messy tree

- A rule set may compactly represent a limited number of explicitly known cases
- The other cases may be implicit in the rule set
- The implicit cases have to be spelled out in the tree

An Example

- Take these rules for example:
- If a and b then x
- If c and d then x
- The result is implicitly binary, either x or not x
- The other variables are also implicitly binary (T or F)

- With 4 variables, a, b, c, and d, there can be up to 4 levels in the tree
- A tree for this problem is shown in Figure 3.5 on the following overhead

Messiness = Replicated Subtrees

- The tree is messy because it contains replicated subtrees
- If a = yes and b = no, you then have to test c and d
- If a = no, you have to do exactly the same test on c and d
- The gray leaves in the middle and the gray leaves on the right both descend from analogous branches of the tree

- The book states that “decision trees cannot easily express the disjunction implied among the different rules in a set.”
- Translation:
- One rule deals with a and b
- The other rule is disjoint from the first rule; it deals only with c and d
- As seen above, whenever a or b is “no” you have to do the same test on c and d

Another Example of Replicated Subtrees

- Figure 3.6, on the following overhead, illustrates an exclusive or (XOR) function

- Consider the graph:
- (x = 1) XOR (y = 1) → a
- Incidentally, note that you could also write:
- (x <> y) → a, (x = y) → b

- Now consider the tree:
- There’s nothing surprising: First test x, then test y
- The gray leaves on the left and the right at the bottom are analogous
- Now consider the rule set:
- In this example the rule set is not simpler
- This doesn’t negate the fact that the tree has replication

Yet Another Example of a Replicated Subtree

- Consider Figure 3.7, shown on the following overhead

- In this example there are again 4 attributes
- This time they are 3-valued instead of binary
- There are 2 disjoint rules, each including 2 of the variables
- There is a default rule for all other cases

- The replication is represented in the diagram in this way:
- Each gray triangle stands for an instance of the complete subtree on the lower left which is shown in gray

- The rule set would be equally complex IF there were a rule for each branch of the tree
- It is less complex in this example because of the default rule

Other Issues with Rule Sets

- We have not seen the data mining algorithms yet, but some do not generate rule sets in a way analogous to reading all of the cases off of a decision tree
- Sets (especially those not designed to be applied in a given order) may contain conflicting rules that classify specific cases into different categories

Rule Sets that Produce Multiple Classifications

- In practice you can take two approaches
- Do not classify instances that fall into >1 category
- Count how many times each rule is triggered by a training set and use the most popular of the classification rules when two conflict
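
Both approaches can be sketched in a few lines. The rules, the attribute names, and the training set are hypothetical; the popularity of a rule is taken to be the number of training instances that trigger it.

```python
def count_firings(rules, training_set):
    """Number of training instances that trigger each rule."""
    return [
        sum(1 for instance in training_set if condition(instance))
        for condition, _ in rules
    ]


def classify(instance, rules, firings, abstain_on_conflict=False):
    """Apply every rule; resolve conflicts by abstaining or by popularity."""
    fired = [i for i, (condition, _) in enumerate(rules) if condition(instance)]
    if not fired:
        return None  # no rule applies to this instance
    predicted = {rules[i][1] for i in fired}
    if len(predicted) == 1:
        return predicted.pop()  # all triggered rules agree
    if abstain_on_conflict:
        return None  # approach 1: leave the instance unclassified
    most_popular = max(fired, key=lambda i: firings[i])
    return rules[most_popular][1]  # approach 2: most popular rule wins


# Hypothetical conflicting rules and training data.
rules = [
    (lambda inst: inst["outlook"] == "sunny", "no"),
    (lambda inst: not inst["windy"], "yes"),
]
training_set = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": False},
    {"outlook": "overcast", "windy": False},
    {"outlook": "sunny", "windy": True},
]
firings = count_firings(rules, training_set)
new_instance = {"outlook": "sunny", "windy": False}
print(classify(new_instance, rules, firings))
print(classify(new_instance, rules, firings, abstain_on_conflict=True))
```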

Rule Sets that Don’t Classify Certain Cases

- If a rule set doesn’t classify certain cases, there are again two alternatives:
- Do not classify those instances
- Classify those instances with the most frequently occurring class

The Simplest Case with Rule Sets

- Suppose all variables are Boolean
- I.e., suppose rules only have two possible outcomes, T/F
- Suppose only rules with T outcomes are expressed
- (By definition, all unexpressed cases are F)

- Under the foregoing assumptions:
- The rules are independent
- The order of applying the rules is immaterial
- The outcome is deterministic
- There is no ambiguity

Reality is More Complex

- In practice, there can be ambiguity
- The authors state that the assumption that there are only two cases, T/F, and only T is expressed, is a form of closed world assumption
- In other words, the assumption is that everything is binary

- As soon as this and any other simplifying assumptions are relaxed, things become messier
- In other words, rules become dependent, the order of application matters, etc.
- This is when you can arrive at multiple classifications or no classifications from a rule set

Association Rules

- This subsection is largely repetition
- Any subset of attributes may predict any other subset of attributes
- Association rules are really just a generalization or superset of classification rules

- This is because a classification rule, (all non-class attributes) → (class attribute), is just one of many possible association rules
- Because so many association rules are possible, you need criteria for defining interesting ones

Terminology (Again)

- Coverage = Support = Proportion of cases in the data set to which an association rule applies
- Note that the book seems to define coverage somewhat differently in this section
- Ignore what the book says and let the definition given here be the operational one

- Accuracy = Confidence = Proportion of cases to which the rule applies where the prediction is correct
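
Using the operational definitions given here, both measures can be computed with a short function. This is a sketch: the instances and the rule are hypothetical, and a rule is represented as a pair of predicates for its "if" part and its "then" part.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Support: share of all instances to which the rule applies.
    Confidence: share of those instances where the prediction is correct."""
    covered = [inst for inst in instances if antecedent(inst)]
    support = len(covered) / len(instances)
    if not covered:
        return support, 0.0
    correct = sum(1 for inst in covered if consequent(inst))
    return support, correct / len(covered)


# Hypothetical data set with three binary attributes.
data = [
    {"A": 1, "B": 0, "X": 0},
    {"A": 1, "B": 0, "X": 0},
    {"A": 1, "B": 0, "X": 1},
    {"A": 0, "B": 0, "X": 0},
    {"A": 1, "B": 1, "X": 0},
]

# Rule: if A = 1 and B = 0 then X = 0
support, confidence = support_and_confidence(
    data,
    antecedent=lambda inst: inst["A"] == 1 and inst["B"] == 0,
    consequent=lambda inst: inst["X"] == 0,
)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

Here the rule applies to 3 of the 5 instances and is correct for 2 of those 3.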

Interesting Rules

- Association Rules are considered interesting if:
- They exceed some threshold for support
- They exceed some threshold for confidence

Strength of Association Rules

- An association rule that implies another association rule is stronger than the rule it implies
- The stronger rule should be reported
- It is not necessary to report the weaker rule(s)

An Example of an Association Rule Implying Another

- The book illustrates this idea with a concrete weather example
- It will be presented here in general form
- Let Rule 1 be given as shown below
- It has a compound conclusion
- Rule 1: If A = 1 and B = 0
- Then X = 0 and Y = 1
- Suppose that Rule 1 meets the thresholds for support and confidence

- Now consider Rule 2:
- Rule 2: If A = 1 and B = 0
- Then X = 0
- This applies to the same proportion of cases as Rule 1
- Therefore it meets the support criterion

- Rule 2’s conclusion is less restrictive than Rule 1’s conclusion
- It can be true in no fewer cases than Rule 1’s conclusion
- This means its confidence can be no less
- Therefore it meets the confidence criterion
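
The confidence argument can be written as a single inequality. The notation is an assumption of this sketch: $P(\cdot \mid \cdot)$ stands for the proportion of instances satisfying the left-hand side among those satisfying the right-hand side.

$$
\operatorname{conf}(\text{Rule 2}) = P(X = 0 \mid A = 1, B = 0) \;\ge\; P(X = 0, Y = 1 \mid A = 1, B = 0) = \operatorname{conf}(\text{Rule 1})
$$

Every instance that satisfies both X = 0 and Y = 1 also satisfies X = 0 alone, so the proportion on the left cannot be smaller than the one on the right.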

- Rule 1 can be said to imply Rule 2
- Rule 1 could also be said to imply a Rule 3:
- Rule 3: If A = 1 and B = 0
- Then Y = 1

- However, neither Rule 2 nor Rule 3 can be said to imply Rule 1
- Rule 1 is the stronger rule
- Therefore, when reporting association rules, Rule 1 should be reported but Rule 2 and Rule 3 should not be reported

Rules with Exceptions

- A data set may be mined for rules
- New instances may arrive which the rule set doesn’t correctly classify
- The new instances can be handled by adding “exceptions” to the rules
- Adding exceptions has this advantage:
- It is not necessary to re-do the mining and make wholesale changes to the existing rule set

The Iris Exception Example

- The book illustrates exceptions with a new instance for the iris data set
- In Figure 3.8, the amended rules are expressed in terms of default cases, exceptions, and if/else rules
- Note up front that I’m not too interested in this approach
- Comments will follow

- It is apparent that a default/exception is a reversed if/else
- To me, a collection of nested if/elses makes more sense than the mixture presented here

- The book observes that exceptions may be “psychologically”, if not logically, preferable
- The use of exceptions may better mirror how human beings model the situation
- It may even reflect the thinking of an informed expert more closely than a re-done set of rules

More Expressive Rules

- Simple rules compare attributes with constants
- Possibly more powerful rules may compare attributes with other attributes
- Recall the decision boundary example, giving a linear equation in x and y
- Also recall the XOR example where the condition of interest could be summarized as x <> y

Geometric Figures Example

- The book illustrates the idea with the concept of geometric figures lying down or standing up
- Comparing attributes with fixed values might work for a given data set
- However, it boils down to comparing width and height attributes of instances
- Consider Figure 3.9 on the following overhead

Dealing with Attribute-Attribute Comparisons

- It may be computationally expensive for an algorithm to compare instance attributes
- Preprocessing data for input may include hardcoding the comparison as an attribute itself
- Note how this implies that the user already understands this relationship in the first place
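
Hardcoding the comparison during preprocessing can be sketched as adding one derived Boolean attribute per instance. The attribute name `taller_than_wide` and the sample figures are hypothetical.

```python
def add_orientation_attribute(figure: dict) -> dict:
    """Return a copy of the instance with the width/height comparison
    precomputed, so a learner only has to test one Boolean attribute."""
    enriched = dict(figure)
    enriched["taller_than_wide"] = figure["height"] > figure["width"]
    return enriched


# Hypothetical instances describing geometric figures.
figures = [
    {"width": 2, "height": 6},
    {"width": 7, "height": 3},
]
prepared = [add_orientation_attribute(f) for f in figures]
for instance in prepared:
    print(instance)
```

A rule such as "if taller_than_wide then standing" now compares an attribute with a constant, which is the simple case most algorithms handle.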

Inductive Logic Programming

- The tasks getHeight() and getWidth() could be functionalized
- Composite geometric figures could be defined
- Recursive rules could be defined for determining whether composite figures were lying or standing
- This branch of data mining is called inductive logic programming
- It will not be pursued further

3.5 Instance-Based Representation

- This goes back to the idea that the data set is itself, somehow, the result of the mining
- Instead of training and arriving at a set of rules in advance, work on the fly

- For each new instance, find its nearest neighbor in the set and classify it accordingly
- More elaborately, find the k nearest neighbors
- Devise a scheme for picking the class, like the majority class of the k nearest neighbors
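
A minimal sketch of k-nearest-neighbor classification with a majority vote follows. It assumes numeric attributes and Euclidean distance; the training points and class labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.
    training is a list of (attribute_tuple, class_label) pairs."""
    neighbors = sorted(training, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored instances with two numeric attributes each.
training = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
]

print(knn_classify(training, (1.1, 1.0), k=3))
print(knn_classify(training, (5.1, 5.0), k=1))
```

No model is built in advance: all the work happens when a new instance arrives, which is the "on the fly" character of the method. Ties in the vote are broken arbitrarily in this sketch.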

Prerequisites for Accomplishing This

- You need to define distance in the space of the n attributes of the instances
- Potentially you need to normalize or weight the individual attributes
- In general, you need to know which attributes are important in the problem domain
- In short, the existing data set should already contain correct classifications
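
One common way to normalize attributes before measuring distance is min-max scaling, sketched below. The slides do not prescribe a particular method, so the choice of scaling is an assumption, as are the sample rows.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) to the range 0..1 so that no
    single attribute dominates the distance calculation."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical instances whose second attribute has a much larger scale.
rows = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
print(min_max_normalize(rows))
```

Weighting can be layered on top by multiplying each normalized column by a factor that reflects its importance in the problem domain.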

- Comparing a new instance with all existing instances can be expensive
- It can be advisable to pick a subset of the existing data to use when classifying new instances

Instance-Based Methods and Structural Representation

- Instance-based methods don’t immediately appear to provide a structural representation, like a rule set
- However, taken together, the different parts of the process form a representation
- The training subset, distance metric, and nearest neighbor rule define boundaries in n-space between instances

- In effect, this forms a structural representation analogous to something seen before:
- You fall on one side or the other of a decision boundary in space

- Figure 3.10, on the following overhead, illustrates some related ideas
- They are discussed after the figure is presented

- Figure 3.10 (a): This shows the decision boundaries between two instances and the rest of the data set
- Figure 3.10 (b): This illustrates how you may only need a subset of the data set in order to form the boundaries if the algorithm is based purely on nearest neighbor considerations

- Figure 3.10 (c): This shows that in practice the classification neighborhoods will be simplified to rectangular areas in space
- Figure 3.10 (d): This illustrates the idea that you can have donut shaped classes, with one class’s instances completely contained within another’s

3.6 Clusters

- Clustering is not the classification of individual instances
- It is the partitioning of the space
- The book illustrates the ideas with Figure 3.11, shown on the following overhead
- This is followed by brief explanatory comments

- Figure 3.11 (a): This shows mutually exclusive partitions or classes
- Figure 3.11 (b): This shows that instances may be classified in >1 cluster
- Figure 3.11 (c): This shows that the assignment of an instance to a cluster may be probabilistic
- Figure 3.11 (d): A dendrogram is a technique for showing hierarchical relationships among clusters

The End

- You can ignore the following overheads
- They’re just stored here for future reference
- They were not included in the current version of the presentation of chapter 3
