
Midterm 3 Revision and ID3


Presentation Transcript


  1. Lecture 20 Midterm 3 Revision and ID3 Prof. Sin-Min Lee

  2. Armstrong’s Axioms • We can find F+ by applying Armstrong’s Axioms: • if β ⊆ α, then α → β (reflexivity) • if α → β, then γα → γβ (augmentation) • if α → β and β → γ, then α → γ (transitivity) • These rules are • sound (generate only functional dependencies that actually hold) and • complete (generate all functional dependencies that hold).

  3. Additional rules • If α → β and α → γ, then α → βγ (union) • If α → βγ, then α → β and α → γ (decomposition) • If α → β and γβ → δ, then αγ → δ (pseudotransitivity) • The above rules can be inferred from Armstrong’s axioms.

  4. Example • R = (A, B, C, G, H, I), F = {A → B, A → C, CG → H, CG → I, B → H} • Some members of F+: • A → H, by transitivity from A → B and B → H • AG → I, by augmenting A → C with G to get AG → CG, and then transitivity with CG → I • CG → HI, by augmenting CG → I to infer CG → CGI, augmenting CG → H to infer CGI → HI, and then transitivity

  5. 2. Closure of an attribute set • Given a set of attributes A and a set of FDs F, the closure of A under F is the set of all attributes implied by A • In other words, the largest B such that A → B • Redefining superkeys: • The closure of a superkey is the entire relation schema • Redefining candidate keys: • 1. It is a superkey • 2. No proper subset of it is a superkey

  6. Computing the closure for A • Simple algorithm • 1. Start with B = A. • 2. Go over all functional dependencies β → γ in F+ • 3. If β ⊆ B, then add γ to B • 4. Repeat until B stops changing

  7. Example • R = (A, B, C, G, H, I), F = {A → B, A → C, CG → H, CG → I, B → H} • (AG)+ ? • 1. result = AG • 2. result = ABCG (A → C and A → B) • 3. result = ABCGH (CG → H and CG ⊆ ABCG) • 4. result = ABCGHI (CG → I and CG ⊆ ABCGH) • Is AG a candidate key? • 1. It is a superkey. • 2. A+ = ABCH and G+ = G, so neither A nor G alone is a superkey. YES.
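
A minimal Python sketch of the closure algorithm above, representing attributes as single-character strings and F as a list of (left-hand side, right-hand side) set pairs; the data and the expected results come from this example:

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is already inside the closure, absorb the right-hand side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# R = (A, B, C, G, H, I), F = {A -> B, A -> C, CG -> H, CG -> I, B -> H}
R = {'A', 'B', 'C', 'G', 'H', 'I'}
F = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'C', 'G'}, {'H'}),
     ({'C', 'G'}, {'I'}), ({'B'}, {'H'})]

print(sorted(closure({'A', 'G'}, F)))  # ['A', 'B', 'C', 'G', 'H', 'I']: AG is a superkey
print(closure({'A'}, F) >= R)          # False: A+ = {A, B, C, H}, so A alone is not a superkey
print(closure({'G'}, F) >= R)          # False: G+ = {G}; hence AG is a candidate key
```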

  8. Uses of attribute set closures • Determining superkeys and candidate keys • Determining if A → B is a valid FD • Check if A+ contains B • Can be used to compute F+

  9. 3. Extraneous Attributes • Consider F and a functional dependency A → B. • “Extraneous”: are there any attributes in A or B that can be safely removed, without changing the constraints implied by F? • Example: Given F = {A → C, AB → CD} • C is extraneous in AB → CD since AB → C can be inferred even after deleting C
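
To test the example above in code: C on the right-hand side of AB → CD is extraneous exactly when AB → C is still implied after C is removed. A small sketch, reusing the `closure` helper from the previous block:

```python
# F with C already dropped from the right-hand side of AB -> CD: F' = {A -> C, AB -> D}
F_prime = [({'A'}, {'C'}), ({'A', 'B'}, {'D'})]

# C is extraneous iff C is still in (AB)+ computed under F'.
print('C' in closure({'A', 'B'}, F_prime))  # True, so C is extraneous in AB -> CD
```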

  10. 4. Canonical Cover • A canonical cover for F is a set of dependencies Fc such that • F logically implies all dependencies in Fc, and • Fc logically implies all dependencies in F, and • No functional dependency in Fc contains an extraneous attribute, and • Each left side of a functional dependency in Fc is unique • In some (vague) sense, it is a minimal version of F • Read up algorithms to compute Fc (a rough sketch follows below)
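
A rough sketch of one common way to compute a canonical cover: repeatedly apply the union rule and strip extraneous attributes until nothing changes. It reuses the `closure` helper above and is meant as an illustration of the idea, not necessarily the exact algorithm from the textbook:

```python
def canonical_cover(fds):
    """Compute a canonical cover of `fds`, given as (lhs, rhs) set pairs."""
    fc = [(set(l), set(r)) for l, r in fds]
    changed = True
    while changed:
        changed = False
        # Union rule: merge dependencies that share a left-hand side.
        merged = {}
        for lhs, rhs in fc:
            merged.setdefault(frozenset(lhs), set()).update(rhs)
        if len(merged) != len(fc):
            changed = True
        fc = [(set(k), v) for k, v in merged.items()]
        # Remove one extraneous attribute at a time, then start over.
        for i, (lhs, rhs) in enumerate(fc):
            for a in sorted(lhs):   # extraneous in the left-hand side?
                if rhs <= closure(lhs - {a}, fc):
                    fc[i] = (lhs - {a}, rhs)
                    changed = True
                    break
            if changed:
                break
            for a in sorted(rhs):   # extraneous in the right-hand side?
                rest = [fd for j, fd in enumerate(fc) if j != i] + [(lhs, rhs - {a})]
                if a in closure(lhs, rest):
                    fc[i] = (lhs, rhs - {a})
                    changed = True
                    break
            if changed:
                break
    return [fd for fd in fc if fd[1]]  # drop dependencies whose right-hand side became empty

# Classic example: {A -> BC, B -> C, A -> B, AB -> C} reduces to {A -> B, B -> C}.
print(canonical_cover([({'A'}, {'B', 'C'}), ({'B'}, {'C'}),
                       ({'A'}, {'B'}), ({'A', 'B'}, {'C'})]))
```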

  11. Loss-less Decompositions • Definition: A decomposition of R into (R1, R2) is called lossless if, for all legal instances of r(R): • r = ΠR1(r) ⋈ ΠR2(r) • In other words, projecting on R1 and R2, and joining back, results in the relation you started with • Rule: A decomposition of R into (R1, R2) is lossless iff: • R1 ∩ R2 → R1 or R1 ∩ R2 → R2 • is in F+.
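
The rule above can be checked directly with attribute closures. A small sketch (again assuming the `closure` helper from the earlier block), run on R = (A, B, C) with F = {A → B, B → C}, the example used on the next slide:

```python
def is_lossless(R1, R2, fds):
    """A decomposition of R into (R1, R2) is lossless iff
    (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 holds in F+."""
    common_closure = closure(set(R1) & set(R2), fds)
    return set(R1) <= common_closure or set(R2) <= common_closure

F = [({'A'}, {'B'}), ({'B'}, {'C'})]
print(is_lossless({'A', 'B'}, {'A', 'C'}, F))  # True: A -> AB holds
print(is_lossless({'A', 'B'}, {'B', 'C'}, F))  # True: B -> BC holds
```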

  12. Dependency-preserving Decompositions • Is it easy to check if the dependencies in F hold? • Okay as long as the dependencies can be checked in the same table. • Consider R = (A, B, C) and F = {A → B, B → C} • 1. Decompose into R1 = (A, B) and R2 = (A, C) • Lossless? Yes. • But it makes it hard to check B → C • The data is in multiple tables. • 2. On the other hand, R1 = (A, B) and R2 = (B, C) • is both lossless and dependency-preserving • Really? What about A → C? • If we can check A → B and B → C, then A → C is implied.

  13. Dependency-preserving Decompositions • Definition: • Consider a decomposition of R into R1, …, Rn. • Let Fi be the set of dependencies in F+ that include only attributes in Ri. • The decomposition is dependency preserving if (F1 ∪ F2 ∪ … ∪ Fn)+ = F+

  14. Example • Suppose we have R(A, B, C) with • FD1: A → B • FD2: A → C • FD3: B → C • The decomposition R1(A, B), R2(A, C) is lossless but not dependency preserving. • The decomposition R1(A, B), R2(A, C), R3(B, C) is dependency preserving (see the sketch below).
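
Dependency preservation can be tested without materializing F+: for each α → β in F, grow a result set from α, repeatedly closing it inside each fragment, and check that β ends up covered. A sketch of that standard test (reusing `closure`), applied to the two decompositions above:

```python
def preserves(decomposition, fds, fd):
    """Check whether a single FD alpha -> beta can be enforced on the decomposition."""
    alpha, beta = fd
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for Ri in decomposition:
            # Project the current result onto Ri, close it, and keep the part inside Ri.
            t = closure(result & set(Ri), fds) & set(Ri)
            if not t <= result:
                result |= t
                changed = True
    return set(beta) <= result

def dependency_preserving(decomposition, fds):
    return all(preserves(decomposition, fds, fd) for fd in fds)

F = [({'A'}, {'B'}), ({'A'}, {'C'}), ({'B'}, {'C'})]
print(dependency_preserving([{'A', 'B'}, {'A', 'C'}], F))              # False: B -> C is lost
print(dependency_preserving([{'A', 'B'}, {'A', 'C'}, {'B', 'C'}], F))  # True
```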

  15. BCNF • Given a relation schema R and a set of functional dependencies F, if every FD A → B is either: • 1. trivial, or • 2. A is a superkey of R • then R is in BCNF (Boyce-Codd Normal Form) • Why is BCNF good?

  16. BCNF • What if the schema is not in BCNF ? • Decompose (split) the schema into two pieces. • Careful: you want the decomposition to be lossless

  17. Achieving BCNF Schemas • For all dependencies A → B in F+, check if A is a superkey • By using attribute closure • If not, then • Choose a dependency in F+ that breaks the BCNF rules, say A → B • Create R1 = A ∪ B • Create R2 = A ∪ (R – B – A) • Note that R1 ∩ R2 = A and A → AB (= R1), so this is a lossless decomposition • Repeat for R1 and R2 • By defining F1+ to be all dependencies in F+ that contain only attributes in R1 • Similarly F2+ • (A sketch of this loop follows below.)
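
A greedy sketch of this decomposition loop, reusing the `closure` helper. To keep it short it only looks for violating dependencies among those in F whose attributes all fall inside the current fragment, so in general it can miss violations that only show up in F+ (such as AC → D in Example 2-1 below):

```python
def bcnf_decompose(R, fds):
    """Greedy BCNF decomposition sketch; returns a list of attribute sets."""
    fragments, done = [set(R)], []
    while fragments:
        Ri = fragments.pop()
        violation = None
        for lhs, rhs in fds:
            nontrivial = lhs <= Ri and rhs <= Ri and not rhs <= lhs
            if nontrivial and not Ri <= closure(lhs, fds):  # lhs is not a superkey of Ri
                violation = (lhs, rhs)
                break
        if violation is None:
            done.append(Ri)
        else:
            lhs, rhs = violation
            fragments.append(lhs | rhs)         # R1 = A ∪ B
            fragments.append(Ri - (rhs - lhs))  # R2 = A ∪ (R – B – A)
    return done

# Example 1 from the next slide: R = (A, B, C), F = {A -> B, B -> C}
print(bcnf_decompose({'A', 'B', 'C'}, [({'A'}, {'B'}), ({'B'}, {'C'})]))
# -> the two BCNF fragments {A, B} and {B, C} (order may vary)
```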

  18. Example 1 • R = (A, B, C) • F = {A → B, B → C} • Candidate keys = {A} • BCNF = No: B → C violates. • Decomposing on B → C: • R1 = (B, C) • F1 = {B → C} • Candidate keys = {B} • BCNF = true • R2 = (A, B) • F2 = {A → B} • Candidate keys = {A} • BCNF = true

  19. Example 2-1 • R = (A, B, C, D, E) • F = {A → B, BC → D} • Candidate keys = {ACE} • BCNF = violated by {A → B, BC → D}, etc. • Decomposing on A → B: • R1 = (A, B) • F1 = {A → B} • Candidate keys = {A} • BCNF = true • R2 = (A, C, D, E) • F2 = {AC → D} (from A → B and BC → D by pseudotransitivity) • Candidate keys = {ACE} • BCNF = false (AC → D) • Decomposing R2 on AC → D: • R3 = (A, C, D) • F3 = {AC → D} • Candidate keys = {AC} • BCNF = true • R4 = (A, C, E) • F4 = {} [only trivial] • Candidate keys = {ACE} • BCNF = true • Dependency preservation? We can check A → B (R1) and AC → D (R3), but we lost BC → D, so this is not a dependency-preserving decomposition.

  20. Example 2-2 • R = (A, B, C, D, E) • F = {A → B, BC → D} • Candidate keys = {ACE} • BCNF = violated by {A → B, BC → D}, etc. • Decomposing on BC → D: • R1 = (B, C, D) • F1 = {BC → D} • Candidate keys = {BC} • BCNF = true • R2 = (A, B, C, E) • F2 = {A → B} • Candidate keys = {ACE} • BCNF = false (A → B) • Decomposing R2 on A → B: • R3 = (A, B) • F3 = {A → B} • Candidate keys = {A} • BCNF = true • R4 = (A, C, E) • F4 = {} [only trivial] • Candidate keys = {ACE} • BCNF = true • Dependency preservation? We can check BC → D (R1) and A → B (R3), so this is a dependency-preserving decomposition.

  21. Example 3 • R = (A, B, C, D, E, H) • F = {A → BC, E → HA} • Candidate keys = {DE} • BCNF = violated by {A → BC}, etc. • Decomposing on A → BC: • R1 = (A, B, C) • F1 = {A → BC} • Candidate keys = {A} • BCNF = true • R2 = (A, D, E, H) • F2 = {E → HA} • Candidate keys = {DE} • BCNF = false (E → HA) • Decomposing R2 on E → HA: • R3 = (E, H, A) • F3 = {E → HA} • Candidate keys = {E} • BCNF = true • R4 = (D, E) • F4 = {} [only trivial] • Candidate keys = {DE} • BCNF = true • Dependency preservation? We can check A → BC (R1) and E → HA (R3), so this is a dependency-preserving decomposition.

  22. Classification vs. Prediction • Classification: • When a classifier is built, it predicts categorical class labels of new data, i.e. it classifies unknown data. We also say that it predicts the class labels of the new data • Construction of the classifier (a model) is based on a training set in which the values of a decision attribute (class labels) are given, and the classifier is tested on a test set • Prediction: • A statistical method that models continuous-valued functions, i.e., predicts unknown or missing values

  23. Classification Process: Model Construction • The training data is fed to a classification algorithm, which constructs the classifier (model) • An example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

  24. Testing and Prediction (by a classifier) • The classifier is evaluated on the testing data and then applied to unseen data • Example: for the unseen tuple (Jeff, Professor, 4) the classifier answers the question: Tenured?

  25. J. Ross Quinlan originally developed ID3 at the University of Sydney. He first presented it in 1975; it was later described in the journal Machine Learning, vol. 1, no. 1. ID3 is based on the Concept Learning System (CLS) algorithm. The basic CLS algorithm over a set of training instances C: • Step 1: If all instances in C are positive, then create a YES node and halt. If all instances in C are negative, create a NO node and halt. Otherwise select a feature F with values v1, ..., vn and create a decision node. • Step 2: Partition the training instances in C into subsets C1, C2, ..., Cn according to the values of F. • Step 3: Apply the algorithm recursively to each of the sets Ci. • Note: the trainer (the expert) decides which feature to select.
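
A compact Python sketch of the CLS recursion just described. The `choose_feature` callback stands in for the expert's choice in Step 1 (ID3 replaces it with an information-gain heuristic), and the fall-back to a majority label when no features remain is an added assumption to keep the sketch total:

```python
from collections import Counter

def cls_build(instances, features, choose_feature):
    """instances: list of (feature_dict, label) pairs with labels 'yes'/'no'."""
    labels = [label for _, label in instances]
    if all(lab == 'yes' for lab in labels):
        return 'YES'                              # Step 1: all instances positive
    if all(lab == 'no' for lab in labels):
        return 'NO'                               # Step 1: all instances negative
    if not features:                              # assumption: fall back to the majority label
        return Counter(labels).most_common(1)[0][0].upper()
    f = choose_feature(instances, features)       # Step 1: select a feature F
    branches = {}
    for v in {inst[f] for inst, _ in instances}:  # Step 2: partition C into C1, ..., Cn
        subset = [(inst, lab) for inst, lab in instances if inst[f] == v]
        remaining = [g for g in features if g != f]
        branches[v] = cls_build(subset, remaining, choose_feature)  # Step 3: recurse on Ci
    return (f, branches)                          # decision node: (feature, value -> subtree)
```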

  26. ID3 improves on CLS by adding a feature selection heuristic. ID3 searches through the attributes of the training instances and extracts the attribute that best separates the given examples. If the attribute perfectly classifies the training sets then ID3 stops; otherwise it recursively operates on the n (where n = number of possible values of an attribute) partitioned subsets to get their "best" attribute. The algorithm uses a greedy search, that is, it picks the best attribute and never looks back to reconsider earlier choices.

  27. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

  28. Choosing Attributes and ID3 • The order in which attributes are chosen determines how complicated the tree is. • ID3 uses information theory to determine the most informative attribute. • A measure of the information content of a message is the inverse of the probability of receiving the message: • information1(M) = 1/probability(M) • Taking logs (base 2) makes information correspond to the number of bits required to encode a message: • information(M) = -log2(probability(M))

  29. Information • The information content of a message should be related to the degree of surprise in receiving the message. • Messages with a high probability of arrival are not as informative as messages with low probability. • Learning aims to predict accurately, i.e. reduce surprise. • Probabilities are multiplied to get the probability of two or more things both/all happening. Taking logarithms of the probabilities allows information to be added instead of multiplied.

  30. A measure used from Information Theory in the ID3 algorithm and many others used in decision tree construction is that of Entropy. Informally, the entropy of a dataset can be considered to be how disordered it is. It has been shown that entropy is related to information, in the sense that the higher the entropy, or uncertainty, of some data, then the more information is required in order to completely describe that data. In building a decision tree, we aim to decrease the entropy of the dataset until we reach leaf nodes at which point the subset that we are left with is pure, or has zero entropy and represents instances all of one class (all instances have the same value for the target attribute).

  31. We measure the entropy of a dataset S, with respect to one attribute (in this case the target attribute), with the following calculation: • Entropy(S) = -Σi pi log2 pi, summing over the values of the target attribute, • where pi is the proportion of instances in the dataset that take the ith value of the target attribute. • This probability measure gives us an indication of how uncertain we are about the data. We use a log2 measure because it represents how many bits we would need in order to specify the class (the value of the target attribute) of a random instance.

  32. Using the example of the marketing data, we know that there are two classes in the data, so we use the fractions that each class represents in an entropy calculation: • Entropy(S = [9/14 responses, 5/14 no responses]) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits
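
A quick check of this calculation in Python (the exact value rounds to 0.940 bits):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))  # ~0.940 bits for 9 responses vs. 5 non-responses
```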

  33. Entropy • Different messages have different probabilities of arrival. • The overall level of uncertainty (termed entropy) is: -Σ P log2 P • Frequency can be used as a probability estimate. • E.g. if there are 5 +ve examples and 3 -ve examples in a node, the estimated probability of +ve is 5/8 = 0.625.

  34. Example • The initial decision tree is one node with all examples. • There are 4 +ve examples and 3 -ve examples • i.e. the probability of +ve is 4/7 = 0.57; the probability of -ve is 3/7 = 0.43 • Entropy is: -(0.57 * log2 0.57) - (0.43 * log2 0.43) = 0.99

  35. Evaluate possible ways of splitting. • Try a split on size, which has three values: large, medium and small. • There are four instances with size = large. • There are two large positive examples and two large negative examples. • The probability of +ve is 0.5 • The entropy is: -(0.5 * log2 0.5) - (0.5 * log2 0.5) = 1

  36. There is one small +ve and one small -ve • Entropy is: -(0.5 * log2 0.5) - (0.5 * log2 0.5) = 1 • There is only one medium +ve and no medium -ves, so entropy is 0. • The expected information for a split on size is: (4/7)·1 + (2/7)·1 + (1/7)·0 = 0.86 • The expected information gain is: 0.99 - 0.86 = 0.13
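
The same split evaluation in Python, assuming the toy dataset implied by the example (7 instances: size = large with 2 positive and 2 negative, small with 1 and 1, medium with 1 and 0):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

parent = entropy(4, 3)                                          # ~0.99 for the whole node
splits = [(2, 2), (1, 1), (1, 0)]                               # large, small, medium
expected = sum((p + n) / 7 * entropy(p, n) for p, n in splits)  # ~0.86 expected information
print(round(parent - expected, 2))                              # 0.13 information gain
```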

  37. Now try splitting on colour and shape. • Colour has an information gain of 0.52 • Shape has an information gain of 0.7 • Therefore split on shape. • Repeat for all subtrees.

  38. Decision Tree for PlayTennis • Outlook = Sunny → test Humidity: High → No, Normal → Yes • Outlook = Overcast → Yes • Outlook = Rain → test Wind: Strong → No, Weak → Yes

  39. Decision Tree for PlayTennis • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification • E.g. the root node tests Outlook (branches Sunny, Overcast, Rain); under Sunny, the node Humidity has branches High → No and Normal → Yes

  40. Decision Tree for PlayTennis • Classify the instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak • Following the tree: Outlook = Sunny → Humidity = High → PlayTennis = No
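
The PlayTennis tree above, encoded as nested Python tuples and dicts, classifies the (Sunny, Hot, High, Weak) instance as No; the attribute and value spellings are taken from the slides:

```python
tree = ('Outlook', {
    'Sunny':    ('Humidity', {'High': 'No', 'Normal': 'Yes'}),
    'Overcast': 'Yes',
    'Rain':     ('Wind', {'Strong': 'No', 'Weak': 'Yes'}),
})

def classify(node, instance):
    """Each internal node tests an attribute; each leaf is a classification."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

print(classify(tree, {'Outlook': 'Sunny', 'Temperature': 'Hot',
                      'Humidity': 'High', 'Wind': 'Weak'}))  # -> 'No'
```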

  41. Decision Tree for Conjunction: Outlook=Sunny ∧ Wind=Weak • Outlook = Sunny → test Wind: Strong → No, Weak → Yes • Outlook = Overcast → No • Outlook = Rain → No

  42. Decision Tree for Disjunction: Outlook=Sunny ∨ Wind=Weak • Outlook = Sunny → Yes • Outlook = Overcast → test Wind: Strong → No, Weak → Yes • Outlook = Rain → test Wind: Strong → No, Weak → Yes

  43. Decision Tree for XOR: Outlook=Sunny XOR Wind=Weak • Outlook = Sunny → test Wind: Strong → Yes, Weak → No • Outlook = Overcast → test Wind: Strong → No, Weak → Yes • Outlook = Rain → test Wind: Strong → No, Weak → Yes

  44. Decision Tree • Decision trees represent disjunctions of conjunctions; for the PlayTennis tree above: • (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
