Probabilistic Calculus to the Rescue



**Suppose we know the likelihood** of each of the (propositional) worlds (aka the joint probability distribution)

Then we can use the standard rules of probability to compute the likelihood of all queries (as I will remind you).
So, the joint probability distribution is all that you ever need!
In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers).
-- In general, 2^n separate numbers for n propositions (which should add up to 1)

If the joint distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
-- Answer: indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers
-- The local relations between propositions can be seen as “constraining” the form the joint probability distribution can take!

Burglary => Alarm
Earthquake => Alarm
Alarm => John-calls
Alarm => Mary-calls

Probabilistic calculus to the rescue: only 10 (instead of 32) numbers to specify! (1 for B, 1 for E, 4 for A given B and E, and 2 each for J and M given A.)

**How do we learn the Bayes nets?**
• We assumed that both the topology and the CPTs for Bayes nets are given by experts
• What if we want to learn them from data?
• And use them to predict other data?

**Statistics and Probability**
[Figure: a hypothesis H with prior P(H); data D1, D2, ..., DN drawn i.i.d. according to P(d|H); the diagram is labeled "Statistics" and "Probability"]

**True hypothesis eventually dominates…**
The probability of indefinitely producing uncharacteristic data goes to 0.

**Bayesian prediction is optimal**
(Given the hypothesis prior, all other predictions are less likely)

**So, BN learning is just probability estimation!**
(as long as the data is complete, and the topology is given..)

**Works for any topology**
Data:
B=T, E=T, A=F, J=T, M=F
. .
B=F, E=T, A=T, J=F, M=T
[Figure: Bayes net over B, E, A, J, M]
P(J|A) = (# data items where J and A are true) / (# data items where A is true)
So, BN learning is just probability estimation?

**Steps in ML (maximum likelihood) based learning**
• Write down an expression for the likelihood of the data as a function of the parameter(s). Assume an i.i.d.
distribution.
• Write down the derivative of the log likelihood with respect to each parameter
• Find the parameter values such that the derivatives are zero

There are two ways this last step can become complex:
• Individual (partial) derivatives lead to non-linear functions (this depends on the type of distribution the parameters are controlling; the binomial is a very easy case)
• Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)

In general, we will need to use continuous function optimization techniques. One idea is to use gradient descent (on the negative log likelihood) to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum. This is why convex functions are important: if the negative log likelihood is convex, then gradient descent with a suitable step size is guaranteed to find the global minimum, which is the maximum-likelihood estimate.

**Continuous Function Optimization**
• Function optimization involves finding the zeroes of the gradient
• We can use the Newton-Raphson method
• ..but it will need the second derivative…
• ..for a function of n variables, the second derivative is an n x n matrix (called the Hessian)

**Beyond Known Topology & Complete data!**
• So we just noted that if we know the topology of the Bayes net, and we have complete data, then the parameters are un-entangled, and can be learned separately from just data counts.
• Questions: How big a deal is this?
• Can we have known topology?
• Can we have complete data?
• What if there are hidden nodes?

**Sometimes you don’t really know the topology**
Russell’s restaurant waiting habits.

**Classification as a special case of data modeling**
• Until now, we were interested in learning the model of the entire data (i.e., we want to be able to predict each of the attribute values of the data)
• Sometimes, we are most interested in predicting just a subset (or even one) of the attributes of the data
• This will be a “classification” task

**Structure (Topology) Learning**
• Search over different network topologies
• Question: How do we decide which topology is better?
• Idea 1: Check if the independence relations posited by the topology actually hold
• Idea 2: Consider which topology agrees with the data more (i.e., provides higher likelihood)
• But we need to be careful: adding edges to a network cannot reduce the likelihood
• Idea 3: Need to penalize the complexity of the network (either using a prior on network topologies, or using syntactic complexity measures)

**Naïve Bayes Models: The Jean Harlow of Bayes net Learners..**
[Figure: naive Bayes network with WillWait as the class node and the attributes Alt, Bar, …, Est as its children]

**Example**
P(willwait=yes) = 6/12 = 0.5
P(Patrons=“full” | willwait=yes) = 2/6 = 0.333
P(Patrons=“some” | willwait=yes) = 4/6 = 0.667
Similarly, we can show that P(Patrons=“full” | willwait=no) = 0.667

P(willwait=yes | Patrons=full) = P(Patrons=full | willwait=yes) * P(willwait=yes) / P(Patrons=full) = k * 0.333 * 0.5
P(willwait=no | Patrons=full) = k * 0.667 * 0.5
Here k = 1 / P(Patrons=full) is the normalizing constant. Since the two posteriors must sum to 1, k = 2, giving P(willwait=yes | Patrons=full) = 0.333 and P(willwait=no | Patrons=full) = 0.667.

**“I beseech you, in the bowels of Christ, think it possible you may be mistaken.”**
-- Cromwell to the synod of the Church of Scotland, 1650 (aka Cromwell's Rule)

Need for smoothing:
• Suppose I toss a coin twice, and it comes up heads both times
• What is the empirical probability of Rao’s coin coming up tails?
• Suppose I continue to toss the coin another 3000 times, and it comes up heads all these times
• What is the empirical probability of Rao’s coin coming up tails?

What is happening?
We have a “prior” on the coin tosses. We slowly modify that prior in light of evidence. How do we get NBC (the naïve Bayes classifier) to do it?

**Using M-estimates to improve probability estimates**
Zero is FOREVER (a zero estimate for one attribute value zeroes out the whole product, no matter what the other attributes say)
• The simple frequency-based estimation of P(Ai=vj | Ck) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don’t contain rare cases is quite high)
• Solution: use the M-estimate
  P(Ai=vj | Ck) = [#(Ck, Ai=vj) + m*p] / [#(Ck) + m]
• m is the number of virtual samples, with p being the probability that each of those samples has Ai=vj
• If we believe that our sample set is large enough, we can keep m small. Otherwise, keep it large.
• Essentially we are augmenting the #(Ck) normal samples with m more virtual samples drawn according to the prior probability on how Ai takes values
• p is the prior probability of Ai taking the value vj
• If we don’t have any background information, assume a uniform probability (that is, p = 1/d if Ai can take d values)

Also, to avoid underflow errors, add the logarithms of the probabilities (instead of multiplying the probabilities).

**Beyond Known Topology & Complete data!**
• So we just noted that if we know the topology of the Bayes net, and we have complete data, then the parameters are un-entangled, and can be learned separately from just data counts.
• Questions: How big a deal is this?
• Can we have known topology?
• Can we have complete data?
• What if there are hidden nodes?

**Missing Data**
What should we do?
-- Idea: Just consider the complete data as the training data. Go ahead and learn the parameters.
-- But wait, now that we have parameters, we can infer the missing value! (Suppose we infer B to be 1 with probability 0.7 and 0 with probability 0.3.)
-- Fractional samples: the row with the missing B is replaced by two weighted rows, 1 1 0 (weight 0.7) and 1 0 0 (weight 0.3).
-- But wait wait, now that we have inferred the missing value, we can re-estimate the parameters..

Infinite regress? No..
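The loop just described (learn from the complete rows, infer the missing value, split the row into fractional samples, re-estimate, repeat) can be sketched in a few lines. This is only an illustration under stated assumptions: the two-node net B -> C, the made-up rows in `DATA`, and the helper names `estimate`, `posterior_b`, `log_likelihood`, and `em` are all hypothetical and do not come from the slides.

```python
import math

# Hypothetical toy data for a two-node net B -> C; None marks a missing B.
DATA = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1),
        (None, 1), (None, 1), (None, 0)]


def c_given_b(c, b, p_c):
    """P(C=c | B=b), where p_c[b] stores P(C=1 | B=b)."""
    return p_c[b] if c == 1 else 1.0 - p_c[b]


def estimate(weighted_rows):
    """Relative-frequency estimates from (b, c, weight) rows."""
    total = sum(w for _, _, w in weighted_rows)
    n_b = {0: 0.0, 1: 0.0}    # weighted count of rows with B = b
    n_bc = {0: 0.0, 1: 0.0}   # weighted count of rows with B = b and C = 1
    for b, c, w in weighted_rows:
        n_b[b] += w
        n_bc[b] += w * c
    p_b = n_b[1] / total
    p_c = {b: n_bc[b] / n_b[b] for b in (0, 1)}
    return p_b, p_c


def posterior_b(c, p_b, p_c):
    """P(B=1 | C=c) under the current parameters (Bayes rule)."""
    joint1 = p_b * c_given_b(c, 1, p_c)
    joint0 = (1.0 - p_b) * c_given_b(c, 0, p_c)
    return joint1 / (joint1 + joint0)


def log_likelihood(data, p_b, p_c):
    """Log likelihood of the observed values, summing out any missing B."""
    prior = {1: p_b, 0: 1.0 - p_b}
    ll = 0.0
    for b, c in data:
        values = (0, 1) if b is None else (b,)
        ll += math.log(sum(prior[v] * c_given_b(c, v, p_c) for v in values))
    return ll


def em(data, iterations=20):
    # Start from the complete rows only, as in the first idea above.
    params = estimate([(b, c, 1.0) for b, c in data if b is not None])
    history = [log_likelihood(data, *params)]
    for _ in range(iterations):
        rows = []
        for b, c in data:
            if b is None:
                # Infer the missing value and split the row into
                # two fractional samples weighted by the posterior.
                w = posterior_b(c, *params)
                rows += [(1, c, w), (0, c, 1.0 - w)]
            else:
                rows.append((b, c, 1.0))
        params = estimate(rows)  # re-estimate from the weighted rows
        history.append(log_likelihood(data, *params))
    return params, history


(p_b, p_c), history = em(DATA)
print(p_b, p_c, history[-1])
```

The `history` list records the log likelihood of the observed data after each round; in this sketch it never decreases, which is the sense in which the back-and-forth is not an infinite regress.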
**Expectation Maximization**
• The expectation step involves Bayes net inference; we can get by with approximate inference
• The maximization step involves maximization; we can get away with just improvement (i.e., a few steps of gradient ascent)

**Candy Example**
Start with 1000 samples.
Initialize the parameters as [initial parameter values missing]

**Why does EM Work?**
• Logs of sums don’t have easy closed-form optima; use Jensen’s inequality and focus on the sum of logs, which will be a lower bound
• F_t(J) is an arbitrary probability distribution over J
• By Jensen’s inequality, the log of a weighted sum is at least the weighted sum of the logs (because log is concave), so the sum of logs is a lower bound on the log likelihood
• The “size of the step” is determined adaptively by where the max of the lower bound is..
-- In contrast, gradient descent requires a step-size parameter
-- Newton-Raphson requires the second derivative..
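The bound referred to above can be written out as follows. This is a sketch of the standard argument and not necessarily the slide's exact notation: D (the observed data), θ (the parameters), and θ_t (the current estimate) are assumed symbols; only F_t(J) comes from the text.

```latex
\begin{align*}
\log P(D \mid \theta)
  &= \log \sum_{J} P(D, J \mid \theta)
   = \log \sum_{J} F_t(J)\, \frac{P(D, J \mid \theta)}{F_t(J)} \\
  &\ge \sum_{J} F_t(J) \log \frac{P(D, J \mid \theta)}{F_t(J)}
   \qquad \text{(Jensen's inequality, since $\log$ is concave)} \\
\theta_{t+1}
  &= \arg\max_{\theta} \sum_{J} F_t(J) \log P(D, J \mid \theta),
   \qquad F_t(J) = P(J \mid D, \theta_t)
\end{align*}
```

With F_t(J) chosen as the posterior over J under the current parameters θ_t, the bound touches the log likelihood at θ_t, so moving to the maximum of the bound cannot decrease the log likelihood.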