
Bayesian Probabilities




  1. Bayesian Probabilities • Recall that Bayes theorem gives us a simple way to compute likelihoods for hypotheses and thus is useful for problems like diagnosis, recognition, interpretation, etc. • P(h | E) = P(E | h) * P(h) / P(E) • As we discussed, there are several problems when applying Bayes • Bayes theorem works only when we have independent events • We need too many probabilities to account for all of the combinations of evidence in E • Probabilities are derived from statistics, which might include some bias • We can get around the first two problems if we assume independence (sometimes known as Naïve Bayesian probabilities) or if we can construct a suitable Bayesian network • We can also employ a form of learning so that the probabilities better suit the problem at hand
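A minimal sketch of this computation in Python, with made-up numbers (a hypothetical condition with a 1% prior and an imperfect test); only the form of the calculation, not the numbers, comes from the slide.

```python
# Bayes theorem with made-up numbers: P(h | E) = P(E | h) * P(h) / P(E)
p_h = 0.01              # P(h): prior probability of the hypothesis
p_e_given_h = 0.90      # P(E | h): evidence likely if hypothesis holds
p_e_given_not_h = 0.05  # P(E | !h): evidence occasionally appears anyway

# P(E) by total probability, then the posterior P(h | E)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))   # about 0.154
```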

  2. Co-dependent Events • Recall our sidewalk being wet example • The two possible causes were that it rained or that we ran the sprinkler • While the two events do not seem to be dependent, in fact we might not run the sprinkler if it is supposed to rain, or we might run the sprinkler because we postponed running it when rain was forecast and the rain never came • Therefore, we cannot treat rain and sprinkler as independent events; we resolve this by using what is known as the probability chain rule • P(A & B & C) = P(A) * P(B | A) * P(C | A & B) • the & means “each item in the list happens” • if we have 10 co-dependent events, say A1, …, A10, then the last probability in the product is P(A10 | A1 & A2 & … & A9)
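A small Python sketch of the chain rule for this story; the three numbers are placeholders (they happen to match values used on the later sprinkler slide), so treat them as illustration only.

```python
# Chain rule: P(A & B & C) = P(A) * P(B | A) * P(C | A & B)
# Placeholder numbers for the rain / sprinkler / wet-surface story.
p_rain = 0.2                        # P(rain)
p_sprinkler_given_rain = 0.01       # P(sprinkler | rain)
p_wet_given_both = 0.99             # P(wet | rain & sprinkler)

p_all_three = p_rain * p_sprinkler_given_rain * p_wet_given_both
print(p_all_three)                  # 0.00198
```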

  3. More on the Chain Rule • In order to apply the chain rule, I will need a large number of probabilities • Assume that I have events A, B, C, D, E • P(A & B & C & D & E) = P(A) * P(B | A) * P(C | A & B) * P(D | A & B & C) * P(E | A & B & C & D) • P(A & B & D & E) = P(A) * P(B | A) * P(D | A & B) * P(E | A & B & D) • P(A & C & D) = P(A) * P(C | A) * P(D | A & C) • etc. • So I will need P(event i | some combination of other events j) for all i and j • if I have 5 co-dependent events, I need 2^5 = 32 conditional probabilities (along with 5 prior probabilities)

  4. Independence • Because of the problem of co-dependence, we might want to see if two events are independent • Two events are independent if the occurrence of one does not impact the probability of the other • Consider for instance rolling a 6 on a die: the probability of rolling a 6 on the same or another die does not change from 1 in 6 • Drawing a red card from a deck of cards and replacing it (putting it back into the deck) will not change the probability that the next card drawn is also red • However, after drawing the first red card, if you do not replace it, then the probability of drawing a second red card will change • We say that two events, A and B, are independent if and only if P(A ∩ B) = P(A)P(B)

  5. Continued • If we have independent events, then what we need in order to compute our probabilities is simplified • When reasoning about independent events A and B, we do not need the probabilities P(A), P(B) and P(A ∩ B), just P(A) and P(B) • Similarly, if events A and B are independent, then • P(A | B) = P(A) because B being present or absent has no impact on A • With the assumption that events are independent in place, our previous equation of the form: • P(H | e1 & e2 & e3 & …) = P(H) * P(e1 & e2 & e3 & … | H) = P(H) * P(e1 | H) * P(e2 | e1 & H) * P(e3 | e1 & e2 & H) * … • can be rewritten as • P(H | e1 & e2 & e3 & …) = P(H) * P(e1 | H) * P(e2 | H) * P(e3 | H) * … • Thus, independence gets us past the problem of needing an exponential number of probabilities

  6. Naïve Bayesian Classifier • Assume we have q data sets where each data set comprises some elements of the set {d1, d2, …, dn} and each set qi has been classified into one of m categories {c1, c2, …, cm} • Given a new case = {dj, dk, …}, we compute the likelihood that the new case is in each of the categories as follows: • P(ci | case) = P(ci) * P(dj | ci) * P(dk | ci) * … for each i from 1 to m • Where P(ci) is simply the number of data sets classified as ci divided by q • And where P(dj | ci) is the number of times datum dj occurred in a data set classified as category ci, divided by the number of data sets classified as ci • P(ci | case) is a Naïve Bayesian classifier (NBC) for category i • This only works if the data making up any one case are independent – this assumption is not necessarily true, thus the word “naïve” • The good news? This is easy!
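A minimal sketch of such a classifier in Python, built purely from counts as described above; the toy training cases and the small floor used for items never seen with a category (to avoid multiplying by zero) are illustrative choices, not part of the slide.

```python
from collections import Counter, defaultdict

def train_nbc(cases):
    """cases: list of (set of data items, category) pairs."""
    q = len(cases)
    cat_counts = Counter(cat for _, cat in cases)
    item_counts = defaultdict(Counter)      # item_counts[cat][item]
    for items, cat in cases:
        for d in items:
            item_counts[cat][d] += 1
    prior = {c: cat_counts[c] / q for c in cat_counts}
    cond = {c: {d: item_counts[c][d] / cat_counts[c] for d in item_counts[c]}
            for c in cat_counts}
    return prior, cond

def classify(prior, cond, new_case):
    scores = {}
    for c in prior:
        score = prior[c]                    # P(ci)
        for d in new_case:
            score *= cond[c].get(d, 1e-6)   # P(d | ci), tiny floor if unseen
        scores[c] = score
    return max(scores, key=scores.get), scores

# toy usage
cases = [({"d1", "d2"}, "c1"), ({"d2", "d3"}, "c2"), ({"d1", "d3"}, "c1")]
print(classify(*train_nbc(cases), {"d1", "d3"}))
```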

  7. Chain Rule vs. Naïve Bayes • Let’s consider an example • We want to determine the probability that a person will have spoken or typed a particular phrase, say “the man bit a dog” • We will compute the probability of this by examining several hundred or thousand training sentences • Using Naïve Bayes: • P(“the man bit a dog”) = P(“the”) * P(“man”) * P(“bit”) * P(“a”) * P(“dog”) • We compute this just by counting the number of times each word occurs in the training sentences • Using the chain rule • P(“the man bit a dog”) = P(“the”) * P(“man” | “the”) * P(“bit” | “the man”) * P(“a” | “the man bit”) * P(“dog” | “the man bit a”) • P(“bit” | “the man”) is the number of times the word “bit” followed “the man” in all of our training sentences, divided by the number of times “the man” occurred • The probability computed by the chain rule is far smaller but also much more realistic, so Naïve Bayes should be used only with caution and the foreknowledge that we can make an independence assumption
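The sketch below shows how both estimates would be computed from raw counts over a tiny made-up corpus; which estimate comes out smaller depends entirely on the training data, so the corpus and results here are illustration only.

```python
from collections import Counter

training = ["the man bit a dog", "the dog bit a man", "a man saw the dog"]
tokens = [s.split() for s in training]
words = [w for s in tokens for w in s]
unigram, total = Counter(words), len(words)

def naive_prob(sentence):
    # independence assumption: product of single-word frequencies
    p = 1.0
    for w in sentence.split():
        p *= unigram[w] / total
    return p

def chain_prob(sentence):
    # chain rule: condition each word on the full prefix before it
    ws = sentence.split()
    p = unigram[ws[0]] / total
    for i in range(1, len(ws)):
        prefix, nxt = ws[:i], ws[i]
        prefix_count = follow_count = 0
        for s in tokens:
            for j in range(len(s) - i):
                if s[j:j + i] == prefix:
                    prefix_count += 1
                    if s[j + i] == nxt:
                        follow_count += 1
        p *= follow_count / prefix_count if prefix_count else 0.0
    return p

print(naive_prob("the man bit a dog"), chain_prob("the man bit a dog"))
```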

  8. Spam Filters • One of the most common uses of an NBC is to construct a spam filter • the spam filter works by learning a “bag of words” – that is, the words that are typically associated with spam • we take all of the words of every email message and discard any common words (I, of, the, is, etc.) • Now we “train” our spam filter by computing these probabilities: • P(spam) – the fraction of emails in the training set that were spam • P(!spam) – the fraction of emails in the training set that were not spam • P(word1 | spam) – the fraction of spam emails in which word1 appeared • P(word1 | !spam) – the fraction of non-spam emails in which word1 appeared • And so forth for every non-common word

  9. Using Our Spam Filter • A new email comes in • Discard all common words • Compute P(spam | words) and P(!spam | words) • P(spam | word1 & word2 & … & wordn) = P(spam) * P(word1 | spam) * P(word2 | spam) * … * P(wordn | spam) • P(!spam | word1 & word2 & … & wordn) = P(!spam) * P(word1 | !spam) * P(word2 | !spam) * … * P(wordn | !spam) • Which probability is higher? That gives you your answer • Without the naïve assumption, our computation becomes: • P(spam | word1, word2, …, wordn) = P(spam) * P(word1, word2, …, wordn | spam) = P(spam) * P(word1 | spam) * P(word2 | word1 & spam) * … * P(wordn | word1 & … & wordn-1 & spam) • English has well over 100,000 words but many are common, so a spam filter may only deal with, say, 5,000 words – but that would require 2^5000 probabilities if we did not use Naïve Bayes!
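A compact sketch of the whole filter (training and use) under the naïve assumption; the stop-word list, the toy emails, and the add-one smoothing that keeps an unseen word from zeroing out the product are all assumptions added on top of the slides.

```python
from collections import Counter

COMMON = {"i", "of", "the", "is", "a", "to", "and", "at"}   # words to discard

def words(msg):
    return {w for w in msg.lower().split() if w not in COMMON}

def train(emails):
    """emails: list of (text, is_spam) pairs."""
    n_spam = sum(1 for _, is_spam in emails if is_spam)
    n_ham = len(emails) - n_spam
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in emails:
        (spam_counts if is_spam else ham_counts).update(words(text))
    return n_spam, n_ham, spam_counts, ham_counts

def classify(model, text):
    n_spam, n_ham, spam_counts, ham_counts = model
    n = n_spam + n_ham
    s, h = n_spam / n, n_ham / n                 # P(spam), P(!spam)
    for w in words(text):
        # add-one smoothing on the per-word conditionals
        s *= (spam_counts[w] + 1) / (n_spam + 2)
        h *= (ham_counts[w] + 1) / (n_ham + 2)
    return "spam" if s > h else "not spam"

emails = [("win money now", True), ("meeting at noon", False),
          ("cheap money offer", True), ("lunch plans for noon", False)]
print(classify(train(emails), "cheap money now"))   # spam
```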

  10. Classifier Example: Clustering • Clustering is used in data mining to infer boundaries for classes • Here, we assume that we have already clustered a set of data into two classes • We want to use NBC to determine if a new datum, which lies in between the two clusters, is of one class or another • We have two categories, which we will call red and green • P(red) = # of red entries / # of total entries = 20 / 60 • P(green) = # of green entries / # of total entries = 40 / 60 • We add a new datum (shown in white in the figure) and identify which class it is most likely a part of • P(x | green) = # of green entries nearby / # of green entries = 1 / 40 • P(x | red) = # of red entries nearby / # of red entries = 3 / 20 • P(x is green | green entries) = P(green) * P(x | green) = 40 / 60 * 1 / 40 = 1 / 60 • P(x is red | red entries) = P(red) * P(x | red) = 20 / 60 * 3 / 20 = 1 / 20
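Since the decision reduces to a few multiplications, here it is verbatim in Python, assuming the counts read off the slide's figure (20 red and 40 green entries overall, with 3 red and 1 green entries near the new datum x).

```python
# Counts assumed from the slide's figure
p_red, p_green = 20 / 60, 40 / 60
p_x_given_green = 1 / 40     # green entries near x / all green entries
p_x_given_red = 3 / 20       # red entries near x / all red entries

score_green = p_green * p_x_given_green   # 1/60, about 0.017
score_red = p_red * p_x_given_red         # 1/20 = 0.05
print("red" if score_red > score_green else "green")   # red
```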

  11. Bayesian Network • Thus far, our computations have been based on a single mapping from input to output • What if our reasoning process is more involved? • for instance, we have symptoms that map to intermediate conclusions that map to disease hypotheses? • or we have an explicit causal chain? • We can form a network of the values that make up the chain of reasoning; these values are our random variables and each variable will be represented as a node • the network will be a directed acyclic graph – nodes point from cause to effect, and we hope that the resulting directed graph is acyclic (although, as we will see, this may not be the case) • we use prior probabilities for the initial nodes and conditional probabilities to link those nodes to other nodes

  12. Simple Example • Below we have a simple Bayesian network arranged as a DAG • Our causes are construction and/or accident and our effects are orange barrels, bad traffic and/or flashing lights • Each variable (node) will either be true or false • Given the input values of whether we see B, T or L, we want to compute the likelihood that the cause was either C or A • notice that T can be caused by either C or A or both – while this makes our graph somewhat more complicated than a linear graph, it does not contain a cycle

  13. Computing the Cause • We use the chain rule to compute the probability of a chain of states being true (C, A, B, T, L) (we address in a bit what we really want to compute, not this particular chain) • p(C, A, B, T, L) = p(C) * p(A | C) * p(B | C, A) * p(T | B, C, A) * p(L | C, A, B, T) where p(C) is a prior probability and the others are conditional probabilities • with 5 items, we need 2^5 = 32 conditional probabilities • We can reduce some of the above terms • A does not depend on C, so p(A | C) becomes p(A) • B has nothing to do with A, so p(B | C, A) becomes p(B | C) • T has nothing to do with B, so p(T | B, C, A) becomes p(T | C, A) • L has nothing to do with C, B or T, so p(L | C, A, B, T) becomes p(L | A) • So we can reduce p(C, A, B, T, L) to p(C) * p(A) * p(B | C) * p(T | C, A) * p(L | A) • If we could assume independence, then even p(T | C, A) could be simplified; however, since two causes feed into the same effect, this may not be wise • If we did choose to implement this as p(T | C) * p(T | A), then we are assuming independence and we are solving this with Naïve Bayes
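A sketch of the reduced joint p(C) * p(A) * p(B | C) * p(T | C, A) * p(L | A) in Python; the prior and conditional values below are invented (the slide's actual tables are in the figure), but the structure of the computation follows the reduction above.

```python
# Invented numbers for illustration only
p_c = 0.3                                   # p(construction)
p_a = 0.1                                   # p(accident)
p_b_given_c = {True: 0.8, False: 0.1}       # p(barrels | C)
p_l_given_a = {True: 0.9, False: 0.05}      # p(lights | A)
p_t_given_ca = {(True, True): 0.95, (True, False): 0.6,
                (False, True): 0.7, (False, False): 0.1}   # p(traffic | C, A)

def joint(c, a):
    """p(C=c, A=a, B=true, T=true, L=true) under the reduced factorization."""
    return ((p_c if c else 1 - p_c) * (p_a if a else 1 - p_a) *
            p_b_given_c[c] * p_t_given_ca[(c, a)] * p_l_given_a[a])

# Given that we observe barrels, bad traffic and lights, compare the causes
for c in (True, False):
    for a in (True, False):
        print(f"C={c} A={a}: {joint(c, a):.4f}")
```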

  14. A Naïve Bayes Example Network • Assume we have one cause and m effects as shown in the figure below • Compute p(Y) and p(!Y) • the number of times Y occurs in the data and the number of times Y does not occur, out of the n data • Compute p(Xi | Y) • the number of times Xi occurs when Y is true, for each i from 1 to m • Given some collection V, a subset of {X1, X2, …, Xm}, then p(Y | V) = p(Y) * p(v1 | Y) * p(v2 | Y) * …, multiplying over each vi in V • Let’s examine an example • Y – student is a grad student • X1 – student taking CSC course • X2 – student works • X3 – student is serious (dedicated) • X4 – student calls prof by first name

  15. Example Continued • The University has 15,000 students of which 1,500 are graduate students • p(Y) = 1500/15000 = .10 • Out of 1500 graduate students, 60 are taking CSC courses • P(CSC | grad student) = 60 / 1500 = .04 • P(!CSC | grad student) = 1440 / 1500 = .96 • Out of 1500 graduate students, 1250 work full time • P(work | grad student) = 1250 / 1500 = .83 • P(!work | grad student) = 250 / 1500 = .17 • Out of 1500 graduate students, 1400 are serious about their studies • P(serious | grad student) = 1400 / 1500 = .93 • P(!serious | grad student) = 100 / 1500 = .07 • Out of 1500 graduate students, 750 call their profs by their first names • P(first name | grad student) = 750 / 1500 = .5 • P(!first name | grad student) = 750 / 1500 = .5

  16. Example Continued • NOTE: we will similarly have statistics for the non-graduate students (see below) • A given student works, is in CSC, is serious, but does not call his prof by the first name • p(CSC | !grad student) = 250 / 13500 = .02 • p(work fulltime | !grad student) = 2500 / 13500 = .19 • p(serious | !grad student) = 5000 / 13500 = .37 • p(!first name | !grad student) = 12000 / 13500 = .89 • What is the probability that the student is a graduate student? • p(grad student | works & CSC & serious & !first name) = p(grad student) * p(works | grad student) * p(CSC | grad student) * p(serious | grad student) * p(!first name | grad student) = .1 * .83 * .04 * .93 * .5 = .0015 • p(!grad student | works & CSC & serious & !first name) = p(!grad student) * p(works | !grad student) * p(CSC | !grad student) * p(serious | !grad student) * p(!first name | !grad student) = .9 * .19 * .02 * .37 * .89 = .0011 • Since .0015 > .0011, we conclude the student is more likely a grad student
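The same comparison in Python, using exactly the probabilities from the last two slides, to confirm the .0015 versus .0011 result.

```python
p_grad, p_not_grad = 0.1, 0.9

given_grad = {"works": 0.83, "csc": 0.04, "serious": 0.93, "not_first_name": 0.5}
given_not_grad = {"works": 0.19, "csc": 0.02, "serious": 0.37, "not_first_name": 0.89}

evidence = ["works", "csc", "serious", "not_first_name"]

score_grad, score_not = p_grad, p_not_grad
for e in evidence:
    score_grad *= given_grad[e]
    score_not *= given_not_grad[e]

print(round(score_grad, 4), round(score_not, 4))   # 0.0015 0.0011
print("grad student" if score_grad > score_not else "not a grad student")
```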

  17. A Lengthier Example • The network below shows the causes and effects of two independent situations • an earthquake (which can cause a radio announcement and/or your alarm to go off) • and a burglary (which can cause your alarm to go off) • if your alarm goes off, your neighbor may call you to find out why • The joint probability of alarm, radio, earthquake, burglary = p(A, R, E, B) = p(A | R, E, B) * p(R | E, B) * p(E | B) * p(B) • But because A is independent of R once E and B are given, and E and B are independent of each other, we can reduce p(A | R, E, B) to p(A | E, B), p(R | E, B) to p(R | E), and p(E | B) to p(E) • What about the neighbor calling? Notice that it is dependent on alarm, which is dependent on earthquake and burglary

  18. Conditional Independence • In our previous example, we saw that the neighbor calling was not independent of burglary or earthquake, and therefore a joint probability p(C, A, R, E, B) will be far more complicated • However, in such a case, we can make the node (neighbor calling) conditionally independent of burglary and earthquake if we are given either alarm or !alarm • In the figure: on the left, A and B are independent of each other unless we instantiate C, so that A is dependent on B given C is true (or false) • On the right, A and B are dependent unless we instantiate C, so A and B are independent given C is true (or false)

  19. Example • Here, age and gender are independent and smoking and exposure to toxins are independent if we are given age • Next, smoking is dependent on both age and gender and cancer is dependent on both exposure and smoking and there’s nothing we can do about that • But, serum calcium and lung tumor are independent given cancer • So, given age and cancer • p(A, G, E, S, C, L, SC) = p(A) * p(G) * p(E | A) * p(S | A, G) * p(C | E, S) * p(SC | C) * p(L | C)

  20. Instantiating Nodes • What does it mean in our previous example to say “given age” or “given cancer”? • For given cancer, that just means that we are assuming cancer = true • We will in fact also instantiate cancer to false and compute the chain again to see which is more likely; this will tell us the probability that the patient has cancer and the probability that the patient does not have cancer • In this way, we can use the Bayesian network to compute, for a given node or set of nodes, which values are the most likely • Since age is not a Boolean variable, we will have a series of probabilities for different categories of age • E.g., p(age < 18), p(18 <= age < 30), p(30 <= age < 55), along with conditional probabilities for age, e.g., p(smoking | age < 18), p(smoking | 18 <= age < 30), …

  21. What Happens With Dependencies? • In our previous example, what if we do not know the age or whether the patient has cancer? If we cannot instantiate those nodes, we have cycles • Let’s do an easy example first • Recall our “grass is wet” example: we said that running the sprinkler was not independent of rain • The Bayesian network that represents this domain is shown to the right, but it has a cycle, so there is a dependence, and if we do not know whether we ran the sprinkler or not, we cannot remove that dependence • What then should we do?

  22. Solution • We must find some way of removing the cycle from the previous graph • We will take one of the nodes that is causing the cycle and instantiate it to both true and false • Let’s compute p(rain | grass wet) by instantiating sprinkler to both true and false • p(rain | grass wet) = [p(r, g, s) + p(r, g, !s)] / [p(r, g, s) + p(r, g, !s) + p(!r, g, s) + p(!r, g, !s)] = (.2 * .01 * .99 + .2 * .99 * .8) / (.2 * .01 * .99 + .2 * .99 * .8 + .8 * .4 * .9 + .8 * .6 * 0) = .36 • So there is a 36% chance that it rained, given that the grass is wet • grass remains true throughout the denominator because we know that the grass was wet
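The same computation in Python, with the slide's numbers factored into a prior for rain, a sprinkler conditional, and a wet-grass conditional (that factoring is inferred from the products shown above).

```python
p_rain = 0.2
p_sprinkler_given_rain = {True: 0.01, False: 0.4}       # keyed by rain
p_wet_given = {(True, True): 0.99, (True, False): 0.8,  # keyed by (rain, sprinkler)
               (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler):
    """p(rain, sprinkler, grass wet) for the given truth values."""
    pr = p_rain if rain else 1 - p_rain
    ps = (p_sprinkler_given_rain[rain] if sprinkler
          else 1 - p_sprinkler_given_rain[rain])
    return pr * ps * p_wet_given[(rain, sprinkler)]

numerator = joint(True, True) + joint(True, False)
denominator = numerator + joint(False, True) + joint(False, False)
print(round(numerator / denominator, 2))   # 0.36
```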

  23. More Complex Example • Let’s return to our cancer example and see how to resolve it • The cycle occurs because Age points at two nodes and those two nodes point at Cancer • In order to remove the cycle, we must collapse the Exposure to Toxins and Smoking nodes into one • We will also collapse the Age and Gender nodes into one to simplify the fact that Age points to the two separate nodes (which are now collapsed into one) • How do we use a collapsed node? We must instantiate all possible combinations of the values in the node and use these to compute the probability for cancer • Exposure = t, Smoking = t • Exposure = t, Smoking = f • Exposure = f, Smoking = t • Exposure = f, Smoking = f • This becomes far more computationally complex if our node has 10 variables in it • 2^10 combinations instead of 2^2

  24. Junction Trees • With more forethought in our design, we may be able to avoid the problem of collapsing a large number of nodes into a group node by creating what is known as a junction tree • Any Bayesian network can be transformed by • adding links between the parent nodes of any given node • adding links (chords) across any cycle of length more than three so that all cycles are of length three or shorter (this helps complete the graph) • Each cycle is then a clique of no more than 3 nodes • each of which forms a junction, resulting in dependencies of no more than 3 nodes, to restrict the number of probabilities needed

  25. Propagation Algorithm • To this point, we have assumed that our knowledge is static, but what if we are dealing with either incomplete knowledge or changing evidence? • Now we not only need to perform a chain of computations, we may have to feed back into the network based on new posterior knowledge • This requires a bi-directional propagation algorithm • Judea Pearl came up with an approach that, somewhat like a neural network, involves forward and backward passes until the probabilities at each node converge (do not change between iterations) • The idea is initially the same: introduce the prior probabilities and compute intermediate node conditional probabilities • But when we reach our conclusion (the last or bottom nodes), we can take our result and combine it with new evidence (posterior probabilities) and pass it backward through the network, recomputing each node’s probability as before but in reverse

  26. Pearl’s Algorithm • The algorithm is complex and so is only introduced here with a brief example • Bel(B) = p(B | e+, e-) = a * p(B)^T o l(B) • p(B) is computed by starting with the prior probability and l(B) is computed by starting with the posterior probability • Think of p(B) and l(B) as vectors with one entry per possible value of B; T means transpose, o is the element-by-element product of the two vectors, and a is a normalizing constant • What Pearl’s algorithm offers is the ability to change our evidence over time and be able to update probabilities (beliefs) without having to start from scratch in our computations

  27. Example • We have a murder investigation under way with three suspects and the murder weapon, a knife, with fingerprints on it • X: identity of the last user of the weapon (and therefore the most likely person to be the murderer) • Y: the last holder of the weapon • Z: the identity of the person whose fingerprints were found on the weapon • our three murder suspects are a, b, and c • Our Bayesian network is simply X → Y → Z • Based on the fingerprint evidence, we have a prior probability of e+ = • p(X = a) = .8, p(X = b) = .1, p(X = c) = .1 • And the following conditional probabilities • p(Y = q | X = q) = .8, and p(Y = r | X = q) = .1 for each r ≠ q • that is, there is an 80% chance that the last holder of the weapon is the last user (the murderer) and a 10% chance that it is one of the others

  28. Belief Computations • We start with our formula Bel(B) = p(B | e+, e-) = a * p(B)^T o l(B), where a = 1 / p(e) • Passing the prior down through the conditional probabilities gives p(Y) = (.66, .17, .17) • A lab report provides us with l(Y) = (.8, .6, .5), giving more support than before to b and c being the murderers, so now we have to update given e- (a posterior probability) • We can now compute • Bel(Y) = a * (.8, .6, .5) o (.66, .17, .17) = (.74, .14, .12) • At this point, we compute l(X) = (.75, .61, .54) by passing l(Y) back up through the conditional probabilities • and now we update Bel(X) = a * (.75, .61, .54) o (.8, .1, .1) = (.84, .09, .08)
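A sketch of these two belief updates in Python, assuming the 3x3 conditional matrix implied by slide 27 (.8 on the diagonal, .1 everywhere else); p(Y) and l(X) then fall out of simple matrix-vector products, and a just renormalizes the element-by-element product.

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def combine(u, v):
    # element-by-element product followed by normalization (the "a * u o v" step)
    return normalize([a * b for a, b in zip(u, v)])

T = [[0.8, 0.1, 0.1],      # T[i][j] = p(Y = j | X = i), suspects a, b, c
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

p_X = [0.8, 0.1, 0.1]      # prior support for X from the fingerprint evidence
l_Y = [0.8, 0.6, 0.5]      # diagnostic support for Y from the lab report

# pass prior support down:    p(Y)_j = sum_i p(Y=j | X=i) * p(X)_i
p_Y = [sum(T[i][j] * p_X[i] for i in range(3)) for j in range(3)]
# pass diagnostic support up: l(X)_i = sum_j p(Y=j | X=i) * l(Y)_j
l_X = [sum(T[i][j] * l_Y[j] for j in range(3)) for i in range(3)]

print([round(x, 2) for x in p_Y])                # [0.66, 0.17, 0.17]
print([round(x, 2) for x in combine(p_Y, l_Y)])  # Bel(Y), about [0.74, 0.14, 0.12]
print([round(x, 2) for x in l_X])                # [0.75, 0.61, 0.54]
print([round(x, 2) for x in combine(p_X, l_X)])  # Bel(X), about [0.84, 0.09, 0.08]
```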

  29. Continued • Now suspect a produces a strong alibi, reducing the probability that he was the murderer from .8 to .28 • so that p(X) = (.28, .36, .36) • We repeat our belief computations, first by computing p(Y) = (.30, .35, .35) • Bel(X) = a * (.28, .36, .36) o (.75, .61, .54) = (.34, .35, .31) • Bel(Y) = a * (.30, .35, .35) o (.8, .6, .5) = (.38, .34, .28) • At this point, the probabilities tell us that the most likely suspect in the murder is b, because Bel(X = b) > Bel(X = a) by a very slight margin • although our belief in Y says that the fingerprints are still more likely to be a’s than b’s, since Bel(Y = a) > Bel(Y = b)

  30. Tree-based Propagation • Our murder investigation had a single, linear chain of causality, based on fingerprints • What if we had other pieces of evidence aside from fingerprints? Then it would be likely that our chain of causality would not merely be linear, but perhaps tree shaped • We would have to enhance the propagation algorithm to include data fusion, with top-down and bottom-up propagations where nodes can be anticipatory, evidence, judgment, or root nodes • such trees may not have a single root node, but if they are still acyclic, they use a similar top-down, bottom-up propagation

  31. Graph-based Propagation • Unfortunately, it is unlikely that a real-world problem would result in a tree (an acyclic, singly connected structure) • Below is a more common form of causal network with multiple possible conclusions, where D stands for various diseases and O for various observations (symptoms) • When our graph has cycles, we must again resort to either collapsing nodes or instantiating nodes • We would have to instantiate all combinations of 3 nodes below in order to remove all cycles, e.g., for D2, D3, D4 try TTT, TTF, TFT, TFF, FTT, FTF, FFT, FFF • Or, we might try to collapse nodes such as D2, D3 and D4 into a single node • Or we might add links between the disease nodes and observation nodes to create smaller cliques

  32. Approximate Algorithms • Bayesian networks are inherently cyclical • That is, most domains of interest have causes and effects that are co-dependent, leading to graphs with cycles • if we assume independence between nodes, we do not accurately represent the domain • The result is the need to deal with these cycles by instantiating nodes to all possible combinations, which is of course intractable • Junction trees can reduce the amount of intractability but do not remove it (and there is no single strategy that will minimize a Bayes network in general) • Pearl’s algorithm introduces even greater complexity when the graphs have cycles • There are a number of approximation algorithms available, but each is applicable only to a particular structured network – that is, there is no single approximation algorithm that will either reduce the complexity or provide an accurate result in all cases

  33. Dynamic Bayesian Networks • Cause-effect situations may also be temporal • at time i, an event arises and causes an event at time i+1 • the Bayesian belief network is static; it captures a situation at a single point in time • we need a dynamic network instead • The dynamic Bayesian network (DBN) is similar to our previous networks except that each edge represents not merely a dependency, but a temporal change • when you take the branch from state i to state i+1, you are not only indicating that state i can cause i+1 but that i occurred at a time prior to i+1 • In the figure, each node represents a sound at a particular time interval • NOTE: the DBN is really a form of hidden Markov model, so we defer discussion of this until later

  34. Bayesian Forms of Learning • There are three forms of Bayesian learning • learning probabilities • learning structure • supervised learning of probabilities • In the first form, we merely want to learn the probabilities needed for Bayesian reasoning • this can be done merely by counting occurrences • take all the training data and compute every necessary probability • we might adopt the naïve stance that data are conditionally independent • P(d | h) = P(a1, a2, a3, …, an | h) = P(a1 | h) * P(a2 | h) * … * P(an | h) • this assumption is used for Naïve Bayesian Classifiers and this is how our spam filters can learn over time, by recounting the probabilities from time to time

  35. Another Example of Naïve Bayesian Learning • We want to learn, given some conditions, whether to play tennis or not • see the table on the next page • The available data tell us, from previous occurrences, what the conditions were and whether we played tennis or not under those conditions • there are 14 previous days’ worth of data • To compute our prior probabilities, we just do • P(tennis) = days we played tennis / total days = 9 / 14 • P(!tennis) = days we didn’t play tennis / total days = 5 / 14 • The evidential probabilities are computed by adding up the number of Tennis = yes and Tennis = no days for that evidence, for instance • P(wind = strong | tennis) = 3 / 9 = .33 and P(wind = strong | !tennis) = 3 / 5 = .60

  36. Continued • We have a problem in computing our evidential probabilities • we do not have enough data to tell us whether we played in some of the various combinations of conditions • did we play when it was overcast, mild, normal humidity and weak winds? No, so we have no probability for that combination • do we use 0% if we have no probability? • We must rely on the Naïve Bayesian assumption of conditional independence to get around this problem • p(tennis | Sunny & Hot & Weak) = p(tennis) * p(sunny | tennis) * p(hot | tennis) * p(weak | tennis) = 9 / 14 * 2 / 9 * 2 / 9 * 6 / 9 = 12 / 567
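The same computation in Python; the tennis-side numbers are taken from the slides, while the !tennis conditionals for sunny and hot are assumed from the same standard 14-day data set, so treat them as illustrative.

```python
p_yes, p_no = 9 / 14, 5 / 14

given_yes = {"sunny": 2 / 9, "hot": 2 / 9, "weak": 6 / 9}   # from the slides
given_no = {"sunny": 3 / 5, "hot": 2 / 5, "weak": 2 / 5}    # assumed counts

evidence = ["sunny", "hot", "weak"]
score_yes, score_no = p_yes, p_no
for e in evidence:
    score_yes *= given_yes[e]
    score_no *= given_no[e]

print(round(score_yes, 4), round(score_no, 4))   # about 0.0212 vs 0.0343
print("play tennis" if score_yes > score_no else "do not play")
```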

  37. Learning Structure • For a Bayesian network, how do we know what states should exist in our structure? How do we know what links should exist between states? • There are two forms of learning here • learning the states that should exist • learning which transitions should exist between states • Learning states is less common as we usually have a model in mind before we get started • we already know the causes we want to model and the data tell us what effects we have probabilities for • what the data might not tell us is all of the possible links between the nodes, so we might have to learn the transitions • Learning transitions is more common and more interesting

  38. Learning Transitions • One approach is to start with a fully connected graph • we learn the transition probabilities using the Baum-Welch algorithm and remove any links whose probabilities are 0 (or negligible) • but this approach will be impractical • we will discuss Baum-Welch with HMMs • Another approach is to create a model using neighbor-merging • start with each observation of each test case representing its own node • as each new test case is introduced, merge nodes that have the same observation at time t • the network begins to collapse • Another approach is to use V-merging • here, we not only collapse states that are the same, but also states that share the same transitions • for instance, if in case j state s(i-1) goes to s(i) goes to s(i+1) and we match that in case k, then we collapse that entire set of transitions into a single set of transitions • notice there is nothing probabilistic about learning the structure this way

  39. Example • Given a collection of research articles, learn the structure of a paper’s header • that is, the fields that go into a paper’s header • Data came in three forms from approximately 5700 papers: labeled (by a human), unlabeled, and distantly labeled (data from bibtex entries, which contain all of the relevant data but have extra fields that were to be discarded) • the transition probabilities were learned by simple counting

  40. Applications for Bayes • As we saw, spam filters are commonly implemented using Naïve Bayes classifiers, but what other applications use Bayes? • Inference/diagnosis – as we saw in many of the examples presented here, we might use Bayes to identify the most likely cause of various effects that we see; the cause might be a diagnostic conclusion, a classification, or, more generally, an assignment of credit (blame) • As an example, Pathfinder is a medical diagnostic system for lymph node disease with 60 diseases and over 130 features, where features are not always binary (many are real valued) • Prediction – given our Bayesian network, we might want to predict what the most likely result will be given starting conditions

  41. Continued • Design – nodes represent the components that might go into the designed product, and the features that each component provides or goals that it fulfills • Design testing can also be handled this way • Similarly, decision making can easily be modeled, with the intent being to maximize the expected utility • Story understanding – words and concepts are stored in a network, a new story is introduced, and the most likely concept is recognized probabilistically • Word sense disambiguation (what meaning a given word has) is supported via intermediate nodes in the network
