Midterm Review

Presentation Transcript


  1. Midterm Review

  2. Constructing decision trees • Normal procedure: top-down, in a recursive divide-and-conquer fashion • First: an attribute is selected for the root node and a branch is created for each possible attribute value • Then: the instances are split into subsets (one for each branch extending from the node) • Finally: the same procedure is repeated recursively for each branch, using only the instances that reach that branch • The process stops when all instances have the same class, or when most instances have the same class.
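To make the procedure concrete, here is a minimal Python sketch of the recursive construction; the `(attributes, label)` instance representation and the pluggable `select_attribute` heuristic are illustrative assumptions, not code from the course.

```python
# A sketch of top-down, divide-and-conquer tree construction.
# Instances are (attributes: dict, label) pairs; `select_attribute`
# is any scoring heuristic (e.g. information gain, sketched later).
from collections import Counter

def build_tree(instances, attributes, select_attribute, min_purity=1.0):
    labels = [label for _, label in instances]
    majority, count = Counter(labels).most_common(1)[0]
    # Stop if (almost) all instances share one class, or there is nothing left to split on.
    if count / len(labels) >= min_purity or not attributes:
        return majority                                 # leaf node
    best = select_attribute(instances, attributes)      # attribute for this node
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    values = {attrs[best] for attrs, _ in instances}
    for v in values:                                    # one branch per attribute value
        subset = [(a, c) for a, c in instances if a[best] == v]
        tree["branches"][v] = build_tree(subset, remaining, select_attribute, min_purity)
    return tree
```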

  3. Weather data

  4. Which attribute to select? Choose the attribute that results in the lowest entropy of the children nodes. [Panels (a)–(d): the candidate splits on each of the four attributes.]

  5. Example: attribute “Outlook”

  6. Information gain • Usually one doesn’t use the entropy of a node directly; rather, the information gain (the parent’s entropy minus the weighted average entropy of the children) is used. • Clearly, the greater the information gain, the better the purity of the children. Since “Outlook” has the highest gain, we choose it for the root (see the sketch below).
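As a quick check, a small Python sketch computing the entropy and the information gain for “Outlook”, using the class counts implied by the weather data (9 yes / 5 no overall; sunny 2/3, overcast 4/0, rainy 3/2):

```python
# Entropy and information gain on the weather data's "Outlook" attribute.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    total = sum(parent_counts)
    weighted = sum(sum(ch) / total * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - weighted

outlook_children = [[2, 3], [4, 0], [3, 2]]        # sunny, overcast, rainy
print(entropy([9, 5]))                             # ~0.940 bits (the parent node)
print(information_gain([9, 5], outlook_children))  # ~0.247 bits: the largest gain, so Outlook is the root
```

Plugged into the earlier `build_tree` sketch, `select_attribute` would simply return the attribute with the largest such gain.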

  7. Continuing to split

  8. Highly-branching attributes • The weather data with ID code

  9. Tree stump for ID code attribute • Subsets are more likely to be pure if there is a large number of values. • Information gain is biased towards choosing attributes with a large number of values. • What’s the remedy?

  10. Gain ratio • Gain ratio = information gain / split information, where the split information is the entropy of the distribution of instances over the attribute’s branches (its intrinsic information). It penalizes attributes that split the data into many small subsets.

  11. Gain ratios for weather data • Well, in this example of only 14 training instances, “ID code” still has the greater gain ratio, but its advantage is greatly reduced (see the sketch below).
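A sketch of the same comparison in code, reusing `entropy`, `information_gain`, and `outlook_children` from above; the split-information values follow from the 14-instance weather data.

```python
# Gain ratio = information gain / split information.
# For "ID code" every instance gets its own branch, so the children are all pure
# and the gain equals the parent entropy, but the split information is large.
def split_information(children_counts):
    return entropy([sum(ch) for ch in children_counts])

def gain_ratio(parent_counts, children_counts):
    return information_gain(parent_counts, children_counts) / split_information(children_counts)

id_children = [[1, 0]] * 9 + [[0, 1]] * 5       # 14 singleton branches
print(gain_ratio([9, 5], id_children))          # ~0.247: gain 0.940 / split info log2(14) = 3.807
print(gain_ratio([9, 5], outlook_children))     # ~0.157: gain 0.247 / split info 1.577
```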

  12. Numerical attributes • Tests in nodes can be of the form xj > constant • Divides the space into rectangles.

  13. Considering splits • The only thing we need to do differently in our algorithm is to consider splits between consecutive data points in each dimension (see the sketch below).
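A possible sketch of the candidate-split enumeration, using the weather data’s temperature column as the example values:

```python
# Candidate thresholds for a numeric attribute: midpoints between consecutive
# distinct sorted values. Each threshold t defines a test x_j > t.
def candidate_splits(values):
    xs = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]

print(candidate_splits([85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]))
# one candidate threshold between each pair of adjacent distinct temperatures
```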

  14. Bankruptcy Example

  15. Bankruptcy Example • We consider all the possible splits in each dimension, and compute the average entropies of the children.

  16. Bankruptcy Example • Now, we consider all the splits of the remaining part of space. • Note that we have to recalculate all the average entropies again, because the points that fall into the leaf node are taken out of consideration.

  17. Regression Trees • Like decision trees, but with real-valued constant outputs at the leaves.

  18. Splitting • Use average variance of the children to evaluate the quality of splitting on a particular feature. • Here we have a data set, for which I've just indicated the y values.

  19. Splitting • Compute a weighted average variance • We can see that the average variance of splitting on feature 3 is much lower than that of splitting on f7, and so we’d choose to split on f3 (a sketch follows below).
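A minimal sketch of the splitting criterion; the feature names f3/f7 and the y values below are placeholders, since the slide’s actual data set is not in the transcript.

```python
# Weighted average variance of the children produced by a split: the regression-tree
# analogue of the weighted child entropy used for classification trees.
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def weighted_child_variance(left_ys, right_ys):
    n = len(left_ys) + len(right_ys)
    return (len(left_ys) / n) * variance(left_ys) + (len(right_ys) / n) * variance(right_ys)

# Hypothetical outcomes: the split on "f3" separates the y values cleanly,
# the split on "f7" does not, so f3 gives the lower weighted variance.
print(weighted_child_variance([1.0, 1.1, 0.9], [5.0, 5.2, 4.8]))   # small
print(weighted_child_variance([1.0, 5.2, 0.9], [5.0, 1.1, 4.8]))   # large
```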

  20. Stopping • Stop when the variance at the leaf is small enough. • Then, set the value at the leaf to be the mean of the y values of the elements.

  21. Rules: Coverage and Accuracy • Coverage of a rule: the fraction of records that satisfy the antecedent of the rule • Accuracy of a rule: the fraction of records that satisfy both the antecedent and the consequent (over those that satisfy the antecedent) • Example: (Status=Single) → No has Coverage = 40%, Accuracy = 50%.
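A small sketch of the two measures; the ten records are hypothetical, chosen only so that the rule (Status=Single) → No reproduces the slide’s 40% coverage and 50% accuracy.

```python
# Coverage and accuracy of a rule over a set of records (hypothetical data).
def rule_stats(records, antecedent, consequent):
    covered = [r for r in records if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in covered if r["Class"] == consequent]
    return len(covered) / len(records), len(correct) / len(covered)

records = ([{"Status": "Single", "Class": "No"}] * 2
           + [{"Status": "Single", "Class": "Yes"}] * 2
           + [{"Status": "Married", "Class": "No"}] * 6)
print(rule_stats(records, {"Status": "Single"}, "No"))   # (0.4, 0.5)
```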

  22. A simple covering algorithm • Generates a rule by adding tests that maximize the rule’s accuracy. • Here, each new test (growing the rule) reduces the rule’s coverage. • Goal: maximizing accuracy • t: total number of instances covered by the rule • p: positive examples of the class covered by the rule • t − p: number of errors made by the rule ⇒ Select the test that maximizes the ratio p/t.

  23. Pseudo-code for PRISM
      For each class C
        Initialize E to the instance set
        While E contains instances in class C
          Create a rule R with an empty left-hand side that predicts class C
          Until R is perfect (or there are no more attributes to use) do
            For each attribute A not mentioned in R, and each value v,
              Consider adding the condition A = v to the left-hand side of R
            Select A and v to maximize the accuracy p/t
              (break ties by choosing the condition with the largest p)
            Add A = v to R
          Remove the instances covered by R from E
      The RIPPER algorithm is similar; it uses information gain instead of p/t.
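A compact, runnable sketch of the same covering loop; the `(attributes, label)` instance representation is an assumption made for illustration.

```python
# A sketch of PRISM (a separate-and-conquer covering algorithm), following the
# pseudo-code above. Instances are (attributes: dict, label) pairs.
def prism(instances, classes):
    rules = []
    for c in classes:
        E = list(instances)
        while any(label == c for _, label in E):
            conditions = {}                         # left-hand side of the rule R
            covered = E
            while True:
                p = sum(1 for _, label in covered if label == c)
                if p == len(covered):               # R is perfect: covers only class c
                    break
                # Candidate conditions A = v for attributes not yet mentioned in R.
                candidates = {(a, v) for attrs, _ in covered
                              for a, v in attrs.items() if a not in conditions}
                if not candidates:                  # no more attributes to use
                    break
                def score(av):                      # accuracy p/t, ties broken by larger p
                    a, v = av
                    subset = [y for x, y in covered if x[a] == v]
                    pp = sum(1 for y in subset if y == c)
                    return (pp / len(subset), pp)
                a, v = max(candidates, key=score)
                conditions[a] = v                   # add A = v to R
                covered = [(x, y) for x, y in covered if x[a] == v]
            rules.append((dict(conditions), c))
            E = [(x, y) for x, y in E               # remove instances covered by R
                 if not all(x[k] == w for k, w in conditions.items())]
    return rules
```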

  24. Probabilistic Reasoning

  25. Conditional Independence – Naïve Bayes Two assumptions: • Attributes are equally important • Conditionally independent (given the class value) • This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute (if the class is known) • Although based on assumptions that are almost never correct, this scheme works well in practice!

  26. Weather Data Here we don’t really have effects, but rather evidence.

  27. The weather data example P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E) Don’t worry about the 1/P(E); it’s α, the normalization constant.

  28. The weather data example P(play=no | E) = P(Outlook=Sunny | play=no) * P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E)

  29. Normalization constant P(play=yes | E) + P(play=no | E) = 1 0.0053 / P(E) + 0.0206 / P(E) = 1 P(E) = 0.0053 + 0.0206 So, P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5% P(play=no | E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
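The whole calculation in a few lines of Python, reproducing the numbers above:

```python
# Naive Bayes on E = (sunny, cool, high humidity, windy), with the conditional
# probabilities read directly off the weather-data counts from the slides.
yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)     # ~0.0053 (unnormalized)
no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)     # ~0.0206 (unnormalized)
alpha = 1 / (yes + no)                           # the 1/P(E) normalization constant
print(alpha * yes, alpha * no)                   # ~0.205 and ~0.795
```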

  30. The “zero-frequency problem” • What if an attribute value doesn’t occur with every class value (e.g. “Humidity = High” for class “Play=Yes”)? • The probability P(Humidity=High|play=yes) would be zero, and so would the a posteriori probability, no matter how likely the other values are! • Remedy: add 1 to the count for every attribute value–class combination (the Laplace estimator), and add k (the number of possible attribute values) to the denominator. • Without smoothing: P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E). With the Laplace estimator it becomes: ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (9/14) / P(E) = 0.007 / P(E), where 3 is the number of possible values for ‘Outlook’ (and ‘Temperature’) and 2 the number of possible values for ‘Humidity’ and ‘Windy’.
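A minimal sketch of the Laplace estimator, with two of the corrected counts from this slide as examples:

```python
# Laplace-smoothed conditional probability estimate:
# add 1 to the count and k (number of possible attribute values) to the denominator.
def laplace(count, class_total, k):
    return (count + 1) / (class_total + k)

print(laplace(2, 9, 3))   # P(Outlook=Sunny | yes):  (2+1)/(9+3) = 0.25
print(laplace(3, 9, 2))   # P(Humidity=High | yes):  (3+1)/(9+2) ≈ 0.364
```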

  31. Missing values • Training: instance is not included in frequency count for attribute value–class combination • Classification: attribute will be omitted from calculation • Example: P(play=yes | E) = P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0238 / P(E) P(play=no | E) = P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0343 / P(E) After normalization: P(play=yes | E) = 41%, P(play=no | E) = 59%

  32. Dealing with numeric attributes • Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class. • The probability density function of the normal distribution is f(x) = (1 / sqrt(2πσ²)) · exp(−(x − μ)² / (2σ²)). • We approximate μ by the sample mean and σ² by the sample variance.

  33. Weather Data f(temperature=66 | yes) = exp(−(66 − μ)² / (2σ²)) / sqrt(2πσ²), with μ = (83+70+68+64+69+75+75+72+81)/9 = 73 and σ² = ((83−73)² + (70−73)² + (68−73)² + (64−73)² + (69−73)² + (75−73)² + (75−73)² + (72−73)² + (81−73)²)/(9−1) = 38, so f(temperature=66 | yes) = exp(−(66−73)²/(2*38)) / sqrt(2*3.14*38) ≈ .034
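The same calculation in Python, as a check of the slide’s numbers:

```python
# Gaussian estimate f(temperature=66 | yes) from the nine "yes" temperatures.
from math import exp, pi, sqrt

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
mu = sum(temps_yes) / len(temps_yes)                                  # 73.0
var = sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1)    # 38.0 (sample variance)
density = exp(-(66 - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)
print(mu, var, density)                                               # 73.0 38.0 ~0.034
```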

  34. Classifying a new day • A new day E: P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=66 | play=yes) * P(Humidity=90 | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (2/9) * (0.0340) * (0.0221) * (3/9) * (9/14) / P(E) = 0.000036 / P(E) P(play=no | E) = P(Outlook=Sunny | play=no) * P(Temp=66 | play=no) * P(Humidity=90 | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (3/5) * (0.0291) * (0.0380) * (3/5) * (5/14) / P(E) = 0.000136 / P(E) After normalization: P(play=yes | E) = 20.9%, P(play=no | E) = 79.1%

  35. Bayesian Net Semantics Suppose we have the variables X1,…,Xn, ordered according to the topological order of the given Bayes net. The probability for them to have the values x1,…,xn respectively is P(x1,…,xn) = ∏i P(xi | parents(Xi)). Here P(x1,…,xn) is short for P(X1=x1,…, Xn=xn). E.g., P(j ∧ m ∧ a ∧ b ∧ e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e) = …

  36. Inference in Bayesian Networks • Notation: • X denotes the query variable • E denotes the set of evidence variables E1,…,Em, and e is a particular event, i.e. an assignment to the variables in E • Y denotes the set of the remaining (hidden) variables • A typical query asks for the posterior probability P(x | e1,…,em) • E.g., we could ask: what’s the probability of a burglary if both Mary and John call, P(burglary | johncalls, marycalls)?

  37. Classification • We compute and compare P(class | e1,…,em) for each class value. • However, how do we compute it when the network also contains hidden variables Y1,…,Yk, which are neither query nor evidence? We have to sum them out.

  38. Inference by enumeration Example: P(burglary | johncalls, marycalls)? (Abbrev. P(b | j,m))
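A sketch of enumeration for this query; the CPT values below are the standard textbook (AIMA) numbers for the burglary network, assumed here because the network’s figure and tables are not reproduced in the transcript.

```python
# Inference by enumeration for P(b | j, m) in the burglary/alarm network.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}    # P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}    # P(MaryCalls=true | Alarm)

def joint(b, e, a, j=True, m=True):
    """P(b, e, a, j, m) as a product of CPT entries (the network semantics)."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

def unnormalized(b):
    # Sum out the hidden variables Earthquake and Alarm.
    return sum(joint(b, e, a) for e in (True, False) for a in (True, False))

alpha = 1 / (unnormalized(True) + unnormalized(False))
print(alpha * unnormalized(True))   # ~0.284 with these CPTs
```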

  39. Weather data What is the Bayesian Network corresponding to Naïve Bayes?

  40. Play probability table Based on the data… P(play=yes) = 9/14, P(play=no) = 5/14. Let’s correct with Laplace: P(play=yes) = (9+1)/(14+2) = .625, P(play=no) = (5+1)/(14+2) = .375

  41. Outlook probability table Based on the data… P(outlook=sunny|play=yes) = (2+1)/(9+3) = .25 P(outlook=overcast|play=yes) = (4+1)/(9+3) = .417 P(outlook=rainy|play=yes) = (3+1)/(9+3) = .333 P(outlook=sunny|play=no) = (3+1)/(5+3) = .5 P(outlook=overcast|play=no) = (0+1)/(5+3) = .125 P(outlook=rainy|play=no) = (2+1)/(5+3) = .375

  42. Windy probability table Based on the data…let’s find the conditional probabilities for “windy” P(windy=true|play=yes,outlook=sunny) = (1+1)/(2+2) = .5

  43. Windy probability table Based on the data… P(windy=true|play=yes,outlook=sunny) = (1+1)/(2+2) = .5 P(windy=true|play=yes,outlook=overcast) = 0.5 P(windy=true|play=yes,outlook=rainy) = 0.2 P(windy=true|play=no,outlook=sunny) = 0.4 P(windy=true|play=no,outlook=overcast) = 0.5 P(windy=true|play=no,outlook=rainy) = 0.75

  44. Final figure • Classify it: the new instance (outlook=sunny, temp=cool, humidity=high, windy=true).

  45. Classification I P(play=yes|outlook=sunny, temp=cool, humidity=high, windy=true) = α*P(play=yes) *P(outlook=sunny|play=yes) *P(temp=cool|play=yes, outlook=sunny) *P(humidity=high|play=yes, temp=cool) *P(windy=true|play=yes, outlook=sunny) = α*0.625*0.25*0.4*0.2*0.5 = α*0.00625

  46. Classification II P(play=no|outlook=sunny, temp=cool, humidity=high, windy=true) = α*P(play=no) *P(outlook=sunny|play=no) *P(temp=cool|play=no, outlook=sunny) *P(humidity=high|play=no, temp=cool) *P(windy=true|play=no, outlook=sunny) = α*0.375*0.5*0.167*0.333*0.4 = α*0.00417

  47. Classification III P(play=yes|outlook=sunny, temp=cool, humidity=high, windy=true) = α*0.00625 P(play=no|outlook=sunny, temp=cool, humidity=high, windy=true) = α*0.00417 α = 1/(0.00625+0.00417) = 95.969 P(play=yes|outlook=sunny, temp=cool, humidity=high, windy=true) = 95.969*0.00625 = 0.60

  48. Classification IV (missing values or hidden variables) P(play=yes|temp=cool, humidity=high, windy=true) = α*Σoutlook P(play=yes) *P(outlook|play=yes) *P(temp=cool|play=yes,outlook) *P(humidity=high|play=yes, temp=cool) *P(windy=true|play=yes,outlook) = …(next slide)

  49. Classification V (missing values or hidden variables) P(play=yes|temp=cool, humidity=high, windy=true) = α*Σoutlook P(play=yes)*P(outlook|play=yes)*P(temp=cool|play=yes,outlook)*P(humidity=high|play=yes,temp=cool)*P(windy=true|play=yes,outlook) = α*[ P(play=yes)*P(outlook=sunny|play=yes)*P(temp=cool|play=yes,outlook=sunny)*P(humidity=high|play=yes,temp=cool)*P(windy=true|play=yes,outlook=sunny) + P(play=yes)*P(outlook=overcast|play=yes)*P(temp=cool|play=yes,outlook=overcast)*P(humidity=high|play=yes,temp=cool)*P(windy=true|play=yes,outlook=overcast) + P(play=yes)*P(outlook=rainy|play=yes)*P(temp=cool|play=yes,outlook=rainy)*P(humidity=high|play=yes,temp=cool)*P(windy=true|play=yes,outlook=rainy) ] = α*[ 0.625*0.25*0.4*0.2*0.5 + 0.625*0.417*0.286*0.2*0.5 + 0.625*0.33*0.333*0.2*0.2 ] = α*0.01645

  50. Classification VI (missing values or hidden variables) P(play=no|temp=cool, humidity=high, windy=true) = α*Σoutlook P(play=no)*P(outlook|play=no)*P(temp=cool|play=no,outlook)*P(humidity=high|play=no,temp=cool)*P(windy=true|play=no,outlook) = α*[ P(play=no)*P(outlook=sunny|play=no)*P(temp=cool|play=no,outlook=sunny)*P(humidity=high|play=no,temp=cool)*P(windy=true|play=no,outlook=sunny) + P(play=no)*P(outlook=overcast|play=no)*P(temp=cool|play=no,outlook=overcast)*P(humidity=high|play=no,temp=cool)*P(windy=true|play=no,outlook=overcast) + P(play=no)*P(outlook=rainy|play=no)*P(temp=cool|play=no,outlook=rainy)*P(humidity=high|play=no,temp=cool)*P(windy=true|play=no,outlook=rainy) ] = α*[ 0.375*0.5*0.167*0.333*0.4 + 0.375*0.125*0.333*0.333*0.5 + 0.375*0.375*0.4*0.333*0.75 ] = α*0.0208
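A short check of slides 49–50 in code; the final normalized probabilities are not on the slides, they are just the arithmetic on the slides’ numbers.

```python
# Summing out the hidden Outlook variable and normalizing, using the
# Laplace-corrected conditional probabilities from slides 40-43.
yes_terms = [0.625 * 0.25  * 0.4   * 0.2   * 0.5,    # outlook = sunny
             0.625 * 0.417 * 0.286 * 0.2   * 0.5,    # outlook = overcast
             0.625 * 0.333 * 0.333 * 0.2   * 0.2]    # outlook = rainy
no_terms  = [0.375 * 0.5   * 0.167 * 0.333 * 0.4,
             0.375 * 0.125 * 0.333 * 0.333 * 0.5,
             0.375 * 0.375 * 0.4   * 0.333 * 0.75]
p_yes, p_no = sum(yes_terms), sum(no_terms)          # ~0.0165 and ~0.0208
alpha = 1 / (p_yes + p_no)
print(alpha * p_yes, alpha * p_no)                   # ~0.44 vs ~0.56: predict play = no
```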
