
Review


  1. Review

  2. Belief and Probability • The connection between toothaches and cavities is not a logical consequence in either direction. • However, we can assign a degree of belief to such sentences. • We usually obtain this belief from statistical data. • Assigning probability 0 to a sentence corresponds to an unequivocal belief that the sentence is false. • Assigning probability 1 to a sentence corresponds to an unequivocal belief that the sentence is true.

  3. Syntax • Basic element: random variable • Possible worlds are defined by assignments of values to random variables. • Boolean random variables, e.g. Cavity (do I have a cavity?) • Discrete random variables, e.g. Weather is one of <sunny, rainy, cloudy, snow> • Domain values must be exhaustive and mutually exclusive • Elementary propositions, e.g., • Weather = sunny (abbreviated as sunny) • Cavity = false (abbreviated as ¬cavity) • Complex propositions are formed from elementary propositions and the standard logical connectives, e.g., • Weather = sunny ∨ Cavity = false • sunny ∨ ¬cavity

  4. Atomic events • Atomic event: a complete specification of the state of the world. E.g., if the world is described by only two Boolean variables, Cavity and Toothache, then there are 4 distinct atomic events: Cavity = false ∧ Toothache = false, Cavity = false ∧ Toothache = true, Cavity = true ∧ Toothache = false, Cavity = true ∧ Toothache = true • Atomic events are mutually exclusive and exhaustive
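
  As a minimal illustration (a sketch, not part of the original slides), the atomic events for a set of Boolean variables can be enumerated directly; the variable names are just the two from the example above:

  import itertools

  # Enumerate every complete assignment (atomic event) of two Boolean variables.
  # The assignments are mutually exclusive and exhaustive by construction.
  variables = ["Cavity", "Toothache"]
  for values in itertools.product([False, True], repeat=len(variables)):
      print(dict(zip(variables, values)))
  # Prints 4 events: {'Cavity': False, 'Toothache': False}, ..., {'Cavity': True, 'Toothache': True}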

  5. Joint probability • Joint probability distribution for a set of random variables gives the probability of every atomic event on those random variables. • If we consider all the variables then the joint probability distribution is called full joint probability distribution. • A full joint distribution specifies the probability of every atomic event. • Any probabilistic question about a domain can be answered by the full joint distribution.

  6. Prior and Conditional probability • Prior or unconditional probability associated with a proposition is the degree of belief accorded to it in the absence of any other information. P(Cavity = true) = 0.1 (or P(cavity) = 0.1) P(Weather = sunny) = 0.7 (or P(sunny) = 0.7) • Conditional or posterior probabilities, e.g. P(cavity | toothache) = 0.8, i.e., given that a toothache is all I know • Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b) • Product rule: P(a ∧ b) = P(a | b) P(b)
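
  As a small sketch of the definition and product rule in code (the joint values below are invented so that P(cavity) = 0.1 and P(cavity | toothache) = 0.8 match the slide's numbers; they are not from the slides themselves):

  # Hypothetical joint distribution over (Cavity, Toothache), chosen only so the
  # marginals match the slide's example values.
  joint = {
      (True,  True):  0.04,   # cavity ∧ toothache
      (True,  False): 0.06,   # cavity ∧ ¬toothache
      (False, True):  0.01,   # ¬cavity ∧ toothache
      (False, False): 0.89,   # ¬cavity ∧ ¬toothache
  }

  p_cavity_and_toothache = joint[(True, True)]
  p_toothache = joint[(True, True)] + joint[(False, True)]

  # Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b)
  print(p_cavity_and_toothache / p_toothache)   # ≈ 0.8 (up to float rounding)
  # Product rule in reverse: P(a ∧ b) = P(a | b) P(b)
  print(0.8 * p_toothache)                      # ≈ 0.04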

  7. Chain rule • The chain rule is derived by successive application of the product rule, as sketched below: P(x1, …, xn) = ∏i=1..n P(xi | x1, …, xi−1)
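
  A short derivation sketch (standard, not slide-specific), applying the product rule n − 1 times:

  \begin{align*}
  P(x_1,\ldots,x_n) &= P(x_n \mid x_1,\ldots,x_{n-1})\, P(x_1,\ldots,x_{n-1}) \\
                    &= P(x_n \mid x_1,\ldots,x_{n-1})\, P(x_{n-1} \mid x_1,\ldots,x_{n-2})\, P(x_1,\ldots,x_{n-2}) \\
                    &= \cdots \\
                    &= \prod_{i=1}^{n} P(x_i \mid x_1,\ldots,x_{i-1})
  \end{align*}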

  8. Inference by enumeration • Suppose we are given a proposition φ. Start with the full joint probability distribution (the table over Toothache, Catch, and Cavity). • Sum up the probabilities of the atomic events where φ is true.

  9. Inference by enumeration

  10. Inference by enumeration

  11. Inference by enumeration • We can also compute conditional probabilities from the full joint, e.g.: P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)

  12. Normalization Constant • The denominator P(toothache) can be treated as a normalization constant α: compute the unnormalized terms and scale them so that the resulting distribution sums to 1.

  13. Hidden Variables • What does it mean to compute P(cavity ∧ toothache)? • We sum up: 0.108 + 0.012 • I.e., the probabilities of the atomic events that make the proposition cavity ∧ toothache true. So, P(cavity ∧ toothache) = 0.12. • The variable Catch, which we summed out, is called a hidden variable.
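
  A sketch of summing out a hidden variable in code. Only the entries 0.108 and 0.012 appear on the slide; the remaining full-joint values are assumed to be the standard textbook dentist table:

  joint = {
      # (toothache, catch, cavity): probability -- assumed standard table values
      (True,  True,  True):  0.108,
      (True,  False, True):  0.012,
      (True,  True,  False): 0.016,
      (True,  False, False): 0.064,
      (False, True,  True):  0.072,
      (False, False, True):  0.008,
      (False, True,  False): 0.144,
      (False, False, False): 0.576,
  }

  # P(cavity ∧ toothache): sum over both values of the hidden variable Catch.
  p = sum(pr for (toothache, catch, cavity), pr in joint.items()
          if toothache and cavity)
  print(p)   # 0.108 + 0.012 = 0.12 (up to floating-point rounding)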

  14. Independence • We can write: P(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity) = P(cloudy) P(toothache, catch, cavity) • Thus, the 32-element table for the four variables can be constructed from one 8-element table and one 4-element table! • A and B are independent if for each a, b in the domains of A and B respectively, we have P(a | b) = P(a), or P(b | a) = P(b), or P(a, b) = P(a) P(b) • Absolute independence is powerful but rare

  15. Bayes' Rule • Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a) • Bayes' rule: P(a | b) = P(b | a) P(a) / P(b) • Useful for assessing diagnostic probability from causal probability: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

  16. Bayes’ rule (cont’d) Let s be the proposition that the patient has a stiff neck and m be the proposition that the patient has meningitis. P(s | m) = 0.5, P(m) = 1/50000, P(s) = 1/20 P(m | s) = P(s | m) P(m) / P(s) = (0.5) × (1/50000) / (1/20) = 0.0002 That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis.
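
  The same calculation as straight code, using only the numbers given on the slide:

  p_s_given_m = 0.5          # P(stiff neck | meningitis)
  p_m = 1 / 50000            # P(meningitis)
  p_s = 1 / 20               # P(stiff neck)

  # Bayes' rule: P(m | s) = P(s | m) P(m) / P(s)
  p_m_given_s = p_s_given_m * p_m / p_s
  print(p_m_given_s)         # ≈ 0.0002, i.e. 1 in 5000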

  17. More than two variables Now, the Naïve Bayes model makes the following assumption: although Effect1, …, Effectn might not be independent in general, they are independent given the value of Cause. This is called conditional independence. E.g., if I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache. We write this as: P(toothache, catch | cavity) = P(toothache | cavity) P(catch | cavity)

  18. Naïve Bayes What about when the Naïve Bayes assumption doesn’t hold? Then, instead, we have a network of inter-dependencies (a Bayesian network, introduced below). Let’s first review conditional independence. Note that for finding the normalization constant α we also need to compute the unnormalized probability of the other value of the query variable; α is then 1 over the sum of the unnormalized values.

  19. Conditional Independence Equations Equivalent statements: P(toothache, catch | cavity) = P(toothache | cavity) P(catch | cavity) P(toothache | catch, cavity) = P(toothache | cavity) In general: P(Effect1, …, Effectn | Cause) = ∏i P(Effecti | Cause)

  20. Conditional Independence (cont’d) • We can write out the full joint distribution using the chain rule, e.g.: P(toothache, catch, cavity) = P(toothache | catch, cavity) P(catch, cavity) = P(toothache | catch, cavity) P(catch | cavity) P(cavity) = P(toothache | cavity) P(catch | cavity) P(cavity), where the last step uses conditional independence. • In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. • Conditional independence is our most basic and robust form of knowledge about uncertain environments.
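
  A quick numeric check of this factorization. The three factors below are derived from the same assumed dentist table used in the earlier sketch (they are not stated on the slides):

  p_cavity      = 0.2    # P(cavity), assumed from the textbook table
  p_t_given_cav = 0.6    # P(toothache | cavity), assumed
  p_c_given_cav = 0.9    # P(catch | cavity), assumed

  # Chain rule plus conditional independence:
  # P(toothache, catch, cavity) = P(toothache | cavity) P(catch | cavity) P(cavity)
  print(p_t_given_cav * p_c_given_cav * p_cavity)   # ≈ 0.108, the full-joint entry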

  21. Bayesian Networks: Motivation • The full joint probability can be used to answer any question about the domain, • but it becomes intractable as the number of variables grows. • Furthermore, specifying the probabilities of atomic events is rather unnatural and can be very difficult.

  22. Bayesian networks • Syntax: • a set of nodes, one per variable • a directed, acyclic graph (a link means: "directly influences") • a conditional distribution for each node given its parents: P(Xi| Parents (Xi)) • The conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.

  23. Example • Topology of network encodes conditional independence assertions: • Weather is independent of the other variables • Toothache and Catch are conditionally independent given Cavity, which is indicated by the absence of a link between them.

  24. Another Example The topology shows that burglary and earthquakes directly affect the probability of the alarm, but whether Mary or John calls depends only on the alarm. Thus, our assumptions are that they don’t perceive any burglaries directly and they don’t confer before calling.

  25. Semantics • The full joint distribution is defined as the product of the local conditional distributions: P(x1, …, xn) = ∏i=1..n P(xi | parents(xi)) e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e) = …
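
  A sketch of this product for the burglary network. The CPT numbers are not in the transcript, so the values below are the usual textbook ones and should be read as assumptions:

  p_b       = 0.001     # P(burglary)           -- assumed textbook value
  p_e       = 0.002     # P(earthquake)         -- assumed
  p_a_nb_ne = 0.001     # P(alarm | ¬b, ¬e)     -- assumed
  p_j_a     = 0.90      # P(johncalls | alarm)  -- assumed
  p_m_a     = 0.70      # P(marycalls | alarm)  -- assumed

  # P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
  print(p_j_a * p_m_a * p_a_nb_ne * (1 - p_b) * (1 - p_e))   # ≈ 0.000628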

  26. Inference in Bayesian Networks • The basic task for any probabilistic inference system is to compute the posterior probability for a query variable, given some observed events (or effects), that is, some assignment of values to a set of evidence variables. • A typical query: P(X | e1, …, em) • We could ask: what’s the probability of a burglary if both Mary and John call, P(burglary | johncalls, marycalls)?

  27. Inference by enumeration Sum out variables from the joint without actually constructing its explicit representation. Here e (earthquake) and a (alarm) are the values of the hidden variables, and all the possible e’s and a’s have to be considered. Now, rewrite the full joint entries using products of CPT entries: P(b | j, m) = α Σe Σa P(b, e, a, j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)

  28. Numerically… P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a) = … = α × 0.00059 P(¬b | j, m) = α P(¬b) Σe P(e) Σa P(a | ¬b, e) P(j | a) P(m | a) = … = α × 0.0015 P(B | j, m) = α <0.00059, 0.0015> ≈ <0.284, 0.716>. Complete the computation as an exercise.
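
  For the exercise, a sketch of the full enumeration in code. Again the CPT values are the usual textbook ones (assumptions, since the transcript lost the network's tables); with them the result matches the slide's <0.284, 0.716>:

  P_B = {True: 0.001, False: 0.999}                     # assumed CPTs
  P_E = {True: 0.002, False: 0.998}
  P_A = {(True, True): 0.95, (True, False): 0.94,       # P(alarm=True | b, e)
         (False, True): 0.29, (False, False): 0.001}
  P_J = {True: 0.90, False: 0.05}                       # P(johncalls=True | alarm)
  P_M = {True: 0.70, False: 0.01}                       # P(marycalls=True | alarm)

  def unnormalized(b):
      # P(b) * Σe P(e) * Σa P(a | b, e) P(j | a) P(m | a)
      total = 0.0
      for e in (True, False):            # sum out Earthquake
          for a in (True, False):        # sum out Alarm
              p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
              total += P_E[e] * p_a * P_J[a] * P_M[a]
      return P_B[b] * total

  scores = {b: unnormalized(b) for b in (True, False)}
  alpha = 1 / sum(scores.values())
  print(alpha * scores[True], alpha * scores[False])    # ≈ 0.284 0.716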

  29. Machine Learning

  30. Pseudo-code for 1R
  For each attribute,
    For each value of the attribute, make a rule as follows:
      count how often each class appears
      find the most frequent class
      make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
  Choose the rules with the smallest error rate
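
  A compact Python sketch that follows the pseudocode above; it expects the data as a list of dicts, and the attribute and class names are whatever your dataset uses (the weather attribute names used below for illustration are hypothetical spellings):

  from collections import Counter

  def one_r(instances, attributes, class_attr):
      """Return (best attribute, its rules, training errors) per the 1R pseudocode."""
      best = None
      for attr in attributes:
          rules, errors = {}, 0
          for value in set(inst[attr] for inst in instances):
              # Count how often each class appears with this attribute value.
              counts = Counter(inst[class_attr] for inst in instances
                               if inst[attr] == value)
              majority, freq = counts.most_common(1)[0]
              rules[value] = majority                   # rule: value -> majority class
              errors += sum(counts.values()) - freq     # misclassified instances
          if best is None or errors < best[2]:
              best = (attr, rules, errors)              # keep the smallest error rate
      return best

  On the weather data this would be called as one_r(data, ["outlook", "temperature", "humidity", "windy"], "play"), and the chosen attribute corresponds to the rule set discussed on the next slide.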

  31. Evaluating the weather attributes Arbitrarily breaking the tie between the first and third rule sets we pick the first. Oddly enough the game is played when it’s overcast and rainy but not when it’s sunny. Perhaps it’s an indoor pursuit.

  32. Statistical modeling: Probabilities for the weather data

  33. Naïve Bayes for classification • Classification learning: what’s the probability of the class given an instance? • e = instance • h = class value for instance • Naïve Bayes assumption: evidence can be split into independent parts (i.e. attributes of instance!) P(h | e) = P(e | h) P(h) / P(e) = P(e1|h) P(e2|h)… P(en|h) P(h) / P(e)

  34. The weather data example P(Play=yes | e) = P(Outlook=Sunny | Play=yes) * P(Temp=Cool | Play=yes) * P(Humidity=High | Play=yes) * P(Windy=True | Play=yes) * P(Play=yes) / P(e) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(e) = 0.0053 / P(e) Don’t worry about the 1/P(e); it’s α, the normalization constant.

  35. The weather data example P(Play=no | e) = P(Outlook=Sunny | Play=no) * P(Temp=Cool | Play=no) * P(Humidity=High | Play=no) * P(Windy=True | Play=no) * P(Play=no) / P(e) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(e) = 0.0206 / P(e)

  36. Normalization constant α = 1/P(e) = 1/(0.0053 + 0.0206) So, P(Play=yes | e) = 0.0053 / (0.0053 + 0.0206) = 20.5% P(Play=no | e) = 0.0206 / (0.0053 + 0.0206) = 79.5%
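
  The arithmetic of the last three slides as a few lines of code (all counts are taken directly from the slides):

  score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053 (unnormalized)
  score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206 (unnormalized)

  alpha = 1 / (score_yes + score_no)                   # the normalization constant
  print(alpha * score_yes, alpha * score_no)           # ≈ 0.205 0.795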

  37. The “zero-frequency problem” • What if an attribute value doesn’t occur with a given class value (e.g. “Humidity = High” for class “Play=Yes”)? • The probability P(Humidity=High | Play=yes) would be zero! • The posterior probability would then also be zero, no matter how likely the other values are! • Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator), i.e., initialize the counters to 1 instead of 0. • Result: probabilities will never be zero (this also stabilizes the probability estimates).
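
  A minimal sketch of the Laplace estimator (the example numbers below are illustrative, not from the weather table):

  from collections import Counter

  def laplace_estimate(observed_values, value, domain_size):
      # Add 1 to every count so no attribute value ever gets probability zero.
      counts = Counter(observed_values)
      return (counts[value] + 1) / (len(observed_values) + domain_size)

  # E.g. a value never seen among 9 instances of a class, for an attribute with
  # 2 possible values: (0 + 1) / (9 + 2) ≈ 0.09 instead of 0.
  print(laplace_estimate(["normal"] * 9, "high", domain_size=2))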

  38. Constructing Decision Trees • Normal procedure: top down in recursive divide-and-conquer fashion • First: an attribute is selected for root node and a branch is created for each possible attribute value • Then: the instances are split into subsets (one for each branch extending from the node) • Finally: the same procedure is repeated recursively for each branch, using only instances that reach the branch • Process stops if all instances have the same class

  39. Which attribute to select? (Figures compare the trees produced by splitting on each of the four candidate attributes.)

  40. A criterion for attribute selection • Which is the best attribute? • The one which will result in the smallest tree • Heuristic: choose the attribute that produces the “purest” nodes • Popular impurity criterion: entropy of nodes • The lower the entropy, the purer the node. • Strategy: choose the attribute that results in the lowest entropy of the children nodes.
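
  A small sketch of the entropy criterion (the 9 yes / 5 no split is the class distribution of the weather data used above):

  import math
  from collections import Counter

  def entropy(labels):
      # Entropy of a node's class distribution, in bits; 0 means a pure node.
      counts = Counter(labels)
      total = len(labels)
      return sum(-(c / total) * math.log2(c / total) for c in counts.values())

  print(entropy(["yes"] * 9 + ["no"] * 5))   # ≈ 0.940 bits (the root node)
  print(entropy(["yes"] * 4))                # 0.0 -- a pure node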

  41. Example: attribute “Outlook”

  42. The final decision tree • Note: not all leaves need to be pure; sometimes identical instances have different classes ⇒ splitting stops when the data can’t be split any further

  43. Numerical attributes • Tests in nodes can be of the form xj > constant • Divides the space into rectangles.

  44. Considering splits • The only thing we really need to do differently in our algorithm is to consider splitting between each pair of adjacent data points in each dimension. • So, in our bankruptcy domain, we’d consider 9 different splits in the R dimension. • In general, we’d expect to consider m − 1 splits if we have m data points; • but in our data set we have some examples with equal R values.
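
  A sketch of generating candidate numeric splits (the data values below are made up; placing each threshold midway between consecutive distinct values is one common choice, assumed here):

  def candidate_splits(values):
      # Thresholds midway between consecutive distinct sorted values; duplicates
      # (like the equal R values mentioned above) yield fewer than m - 1 splits.
      distinct = sorted(set(values))
      return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

  print(candidate_splits([1, 1, 2, 4, 7]))   # [1.5, 3.0, 5.5]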

  45. Considering splits II • And there are another 6 possible splits in the L dimension, • because L is really an integer, so there are lots of duplicate L values.

  46. Bankruptcy Example

  47. Bankruptcy Example • We consider all the possible splits in each dimension, and compute the average entropies of the children. • And we see that, conveniently, all the points with L not greater than 1.5 are of class 0, so we can make a leaf there.

  48. Bankruptcy Example • Now, we consider all the splits of the remaining part of space. • Note that we have to recalculate all the average entropies again, because the points that fall into the leaf node are taken out of consideration.

  49. Bankruptcy Example • Now the best split is at R > 0.9. And we see that all the points for which that's true are positive, so we can make another leaf.

  50. Bankruptcy Example • Continuing in this way, we finally obtain the complete decision tree.
