
Bayesian Learning. Pt 2. 6.7–6.12 Machine Learning. Promethea Pythaitha.



  • Bayes Optimal Classifier.

  • Gibbs Algorithm.

  • Naïve Bayes Classifier.

  • Bayesian Belief networks.

  • EM algorithm.



Bayesian Optimal classifier.

  • So far we have asked: which is the most-likely-to-be-correct hypothesis?

    • Which h is the MAP hypothesis?

      • Recall MAP = maximum a-posteriori hypothesis.

        • Which hypothesis has the highest likelihood of correctness given the sample data we have seen.



  • What we usually want is the classification of a specific instance x not in the training data D.

    • One way: Find hMAP and return its prediction for x.

    • Or decide what is the most probable classification for x.



  • Boolean classification:

  • Have hypotheses h1 through h6 with posterior probabilities:

  • And h1 classifies x as - , the rest as +.

  • Then net support for – is 25%, and for + is 75%.

  • The “Bayes Optimal Classification” is + even though the hMAP says - .



  • Bayes Optimal Classifier.

    • Classifies an instance by taking the average of all the hypotheses’ predictions, weighted by the credibility (posterior probability) of each hypothesis.

    • Eqn 6.18.
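As a sketch of eqn 6.18's weighted vote (not from the slides): the posterior values below are hypothetical, chosen only to match the 25% / 75% split in the earlier example.

```python
from collections import defaultdict

def bayes_optimal_classify(posteriors, predictions, values):
    """Eqn 6.18: v = argmax_vj  sum_hi P(vj | hi) * P(hi | D)."""
    support = defaultdict(float)
    for h, p_h in posteriors.items():
        for v in values:
            support[v] += predictions[h].get(v, 0.0) * p_h
    return max(values, key=lambda v: support[v])

# Hypothetical posteriors matching the slide's 25% / 75% split:
posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15,
              "h4": 0.15, "h5": 0.15, "h6": 0.15}
predictions = {"h1": {"-": 1.0}}                       # h1 votes "-"
predictions.update({h: {"+": 1.0}                      # the rest vote "+"
                    for h in ("h2", "h3", "h4", "h5", "h6")})
print(bayes_optimal_classify(posteriors, predictions, ["+", "-"]))  # +
```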



Any system that classifies instances using this method is a “Bayes Optimal Classifier.”

  • No other classification method, given the same hypothesis space and prior knowledge, can outperform this method – on average!!!

  • A particularly interesting result is that the predictions made by a BOC can correspond to hypotheses that are not even in its hypothesis space.

    • This helps deal with the limitations imposed by overly restricted hypothesis spaces.



Best performance!!!----- At what cost?

  • Bayes Optimal Classification is the best – on average – but it can be very computationally costly.

    • First it has to learn all the posterior probabilities for the hypotheses in H.

    • Then it has to poll the hypotheses to find what each one predicts for x’s classification.

    • Then it has to compute this big weighted sum (eqn 6.18)

      • But remember, hypothesis spaces get very large….

        • Recall the reason why we used specific and general boundaries in the candidate-elimination (CELA) algorithm.



Gibbs Algorithm.

    • One way to avoid some of the computation is:

    • 1: Select h from H based on the posterior-probability distribution.

      • So more “credible” hypotheses are selected with higher probability than others.

    • 2: Return h(x).

    • This saves the time of computing the results hi(x) for all hi in H, and doing the big sum…. But it is less optimal.
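The two steps above can be sketched in a few lines; the hypothesis space and posteriors below are made up for illustration.

```python
import random

def gibbs_classify(posteriors, hypotheses, x, rng=random):
    """Gibbs algorithm: draw one h ~ P(h|D), then return h(x)."""
    hs = list(posteriors)
    h = rng.choices(hs, weights=[posteriors[k] for k in hs], k=1)[0]
    return hypotheses[h](x)

# Hypothetical two-hypothesis space; h2 is sampled 3x as often as h1.
posteriors = {"h1": 0.25, "h2": 0.75}
hypotheses = {"h1": lambda x: "-", "h2": lambda x: "+"}
print(gibbs_classify(posteriors, hypotheses, x=None))
```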



How well can it do?

  • If we compute expected misclassification error of the Gibbs algorithm over target concepts drawn at random based on the a-priori probability distribution assumed by the learner,

  • Then this error will be at most twice that for the B.O.C.

  • ** On Average.



Naïve Bayes Classifier. [NBC]

  • Highly practical Bayesian learner.

    • Can, under the right conditions, rank as well as neural nets or decision trees.

  • Applies to any learning task where each instance x is described by a conjunction of attributes (or attribute-value pairs) and where target function f(x) can take any value in finite set V.

  • We are given training data D, and asked to classify a new instance x = <a1, a2, …, an>



Bayesian approach:

  • Classify a new instance by assigning the most probable target value: vMAP, given the attribute values of the instance.

  • Eqn 6.19.



  • To get the classification we need to find the vj that maximizes

    P(a1, a2, …, an| vj )*P(vj )

  • Second term: EASY.

    • It’s simply # instances with classification vj over total # instances.

  • First term: Hard!!!

    • Need a HUGE set of training data.

      • Suppose we have 10 attributes (with 2 possible values each) and 15 classifications → 2^10 × 15 = 15,360 combinations to estimate,

        • And we have, say, 200 instances with known classifications.

        • ⇒ Cannot get a reliable estimate!!!



The Naïve assumption.

  • One way to ‘fix’ the problem is to assume the attributes are conditionally independent.

  • Assume P(a1, a2, …, an| vj) = Πi P(ai| vj)

  • Then the Naïve Bayes Classifier uses this for the prediction:

  • Eqn 6.20.



Naïve Bayes Algorithm.

  • 1: Learn the P(ai | vj) and P(vj) for all a’s and v’s (based on the training data).

    • In our example this is 10 × 2 × 15 = 300 estimates.

      • Sample size of 200 is plenty!

  • ** This set of numbers is the learned hypothesis.

  • 2: Use this hypothesis to find vNB.

    • IF our “naïve assumption” of conditional independence is true, then vNB = vMAP.
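The two steps above can be sketched as counting followed by an argmax; this is a minimal version, and the function names and tiny training set (a cut-down form of the stellar-remnant data used later) are mine, not the slides'.

```python
from collections import Counter, defaultdict

def train_nb(data):
    """Step 1: learn P(vj) and P(ai | vj) by counting over training data."""
    n = len(data)
    class_counts = Counter(v for _, v in data)
    cond = defaultdict(Counter)   # cond[(i, v)][a] = #{attr i == a, class v}
    for attrs, v in data:
        for i, a in enumerate(attrs):
            cond[(i, v)][a] += 1
    priors = {v: c / n for v, c in class_counts.items()}
    return priors, cond, class_counts

def classify_nb(x, priors, cond, class_counts):
    """Step 2: v_NB = argmax_vj P(vj) * prod_i P(ai | vj)  (eqn 6.20)."""
    def score(v):
        s = priors[v]
        for i, a in enumerate(x):
            s *= cond[(i, v)][a] / class_counts[v]
        return s
    return max(priors, key=score)

# Hypothetical one-row-per-class training set:
data = [(("Average", "Small"), "WD"), (("Average", "Medium"), "NS"),
        (("Supergiant", "Large"), "BH"), (("Tiny", "Small"), "BD")]
priors, cond, counts = train_nb(data)
print(classify_nb(("Average", "Medium"), priors, cond, counts))  # NS
```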



Bayesian Learning vs. Other Machine Learning methods.

  • In Bayesian Learning, there is no explicit search through a hypothesis space.

    • Does not produce an inference rule (D-tree) or a weight vector (NN).

    • Instead it forms the hypothesis by observing frequencies of various data combinations.



Example.

  • There are four possible end-states of stellar evolution:

    • 1: White-Dwarf.

    • 2: Neutron Star.

    • 3: Black-hole.

    • 4: Brown dwarf.



White Dwarf.

  • About the size of the Earth, the mass of our Sun. [Up to 1.44 times solar mass]

    • The little white dot in the center of each image is the White Dwarf.



Neutron Stars.

  • About the size of the Gallatin Valley.

  • About twice the mass of our Sun (up to 2.9 Solar Masses)

  • Don’t go too close! Their gravitational and electromagnetic fields are strong enough to stretch you into spaghetti, rip out every metal atom in your body, and finally spread you across the whole surface!!

  • Form in Type II supernovae. The neutron star is a tiny speck at the center of that cloud.



Black-Holes. (3 to 50 Solar masses)

  • The ultimate cosmic sink-hole, even devours light!!

  • Time-dilation, etc. come into effect near the event-horizon.



Brown-Dwarfs.

  • Stars that never got hot enough to start fusion (<.1 Solar masses)



Classification:

  • Because it is hard to get data and impossible to observe these from close up, we need an accurate way of identifying these remnants.

  • Two ways:

    • 1: Computer model of Stellar structure

      • Create a program that models a star, and has to estimate the equations-of-state governing the more bizarre remnants (such as Neutron Stars) as they involve super-nuclear densities and are not well predicted by Quantum mechanics.

        • Learning algorithms (such as NN’s) are sometimes used to tune the model based on known stellar remnants.

    • 2: group an unclassified remnant with others having similar attributes.



  • The latter is more like a Bayesian Classifier.

    • Define (for masses of progenitor stars)

      • .1 to 10 Solar Masses = Average.

      • 10 to 40 Solar Masses = Giant

      • 40 to 150 Solar Masses = Supergiant

      • 0 to .1 Solar Masses = Tiny.

    • Define (for masses of remnants)

      • < 1.44 Solar masses = Small

      • 1.44 to 2.9 Solar masses = Medium

      • > 2.9 Solar masses = Large.

    • Define classifications:

      • WD = White Dwarf.

      • NS = Neutron Star.

      • BH = Black hole

      • BD = Brown Dwarf.



Some Training Data:



  • If we find a new stellar remnant with attributes <Average, Medium>, we could certainly put its mass into a stellar model that has been fine-tuned by our Neural Net, or we could simply use a Bayesian Classification:

  • Either would give the same result:

    • Comparing with data we have, and matching attributes, this has to be a Neutron star.

    • Similarly we can predict <Tiny, Small> → Brown Dwarf.

    • <Supergiant, Large> → Black Hole.



Quantitative example. Table 3.2, pg 59.



Quantitative example.

  • See table 3.2 pg59.

  • Possible target values = {no, yes}

  • P(yes) = 9/14 = .64

  • P(no) = 1 − P(yes) = 5/14 = .36

  • Want to know: PlayTennis? If <sunny, cool, high, strong>

  • Need P(sunny|no), P(sunny|yes), etc…

    • P(sunny|no) = #sunny’s in no category / #no’s = 3/5.

    • P(sunny|yes) = 2/9. etc…

    • NB classification: NO.

      • Support for no = P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) =.0206.

      • Support for yes = …= .0053.
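The two supports can be checked numerically; the conditional probabilities beyond P(sunny|no) and P(sunny|yes) are read off table 3.2 in the same way.

```python
# Support for each class for x = <sunny, cool, high, strong>:
p_yes, p_no = 9/14, 5/14
support_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)  # P(sunny|no)..P(strong|no)
support_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)  # P(sunny|yes)..P(strong|yes)
print(round(support_no, 4), round(support_yes, 4))   # 0.0206 0.0053
```

Since 0.0206 > 0.0053, the NB classification is NO, as the slide says.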



Estimating probabilities.

  • Usually

    • P(event) = (# times event occurred)/(total # trials)

  • Fair coin: 50/50 heads/tails.

    • So out of two tosses, we expect 1 head, 1 tail.

    • Don’t bet your life on it!!!

      • What about 10 tosses?

      • P(all tails) = ½^10 = 1/1024.

      • More likely than winning the lottery, which DOES happen!



Small sample bias.

  • Using the simple ratio induces a bias:

    • ex: P(heads) = 0/10 = 0.

      • NOT TRUE!!

        • Sample not representative of population.

      • And will dominate NB classifier.

        • Multiplication by 0.



M-estimate.

  • P(event e) = [#times e occurred + m*p]/[# trials +m]

  • m = “equivalent sample size.”

  • p = prior estimate of P(e).

    • Essentially assuming m virtual trials following the prior distribution (usually uniform), in addition to the real ones.

    • Reduces the small-sample bias.
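The formula above is a one-liner; applied to the earlier 0-heads-in-10-tosses example (with a uniform prior p = 0.5 and m = 10 assumed virtual trials), it no longer returns zero.

```python
def m_estimate(count, trials, p_prior, m):
    """M-estimate of probability: (n_c + m*p) / (n + m)."""
    return (count + m * p_prior) / (trials + m)

# 0 heads in 10 tosses; m = 10 virtual trials with prior p = 0.5:
print(m_estimate(0, 10, 0.5, 10))  # 0.25, not 0
```

With m = 0 the estimate reduces to the plain ratio, so m controls how strongly the prior is trusted.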



Text Classification using Naïve Bayes.

  • Used a Naïve Bayes approach to decide “like” or “dislike” based on words in the text.

    • Simplifying assumptions:

      • 1: Position of word did not matter.

      • 2: Most common 100 words were removed from consideration.

  • Overall performance = 89% accuracy

    • Versus 5% for random guessing.



Bayesian Belief network.

  • The assumption of Conditional independence may not be true!!

    • EX:

      • v = lives in community “k”

        • Suppose we know that a survey has been done indicating 90% of the people there are young-earth creationists.

      • h1 = Is a Young-earth Creationist.

      • h2 = Discredits Darwinian Evolution.

      • h3 = Likes Carrots.

      • Clearly (h1|v) and (h2|v) are not independent, but (h3|v) is unaffected.



Reality.

  • In any set of attributes, some will be conditionally independent.

  • And some will not.



Bayesian Belief Networks.

  • Allow conditional independence rules to be stated for certain subsets of the attributes.

  • Best of both worlds:

    • More realistic than assuming all attributes are conditionally independent.

    • Computationally cheaper than if we ignore the possibility of independent attributes.



  • Formally a Bayesian belief network describes a probability distribution over a set of variables.

    • If we have variables Y1, …, Yn

    • Yk has domain Vk

      • then the Bayesian belief network describes a joint probability distribution over V1 × V2 × … × Vn.



Representation.

  • Each variable in the instance is a node in the BBN.

  • Every node has:

  • 1: Network arcs, which assert that the variable is conditionally independent of its non-descendants, given its parents.

  • 2: A conditional probability table, which defines the distribution of the variable given the values of its parents.
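Given the arcs and the tables, the network's joint probability factors as P(y1,…,yn) = Πi P(yi | Parents(Yi)). A minimal sketch, using a hypothetical two-node net Storm → Campfire with made-up probabilities:

```python
def joint_probability(assignment, parents, cpt):
    """P(y1,...,yn) = prod_i P(y_i | Parents(Y_i)).
    parents[var]: tuple of parent names; cpt[var] maps
    (parent values..., own value) -> conditional probability."""
    p = 1.0
    for var, value in assignment.items():
        key = tuple(assignment[u] for u in parents[var]) + (value,)
        p *= cpt[var][key]
    return p

# Hypothetical net with made-up numbers:
parents = {"Storm": (), "Campfire": ("Storm",)}
cpt = {"Storm":    {("T",): 0.3, ("F",): 0.7},
       "Campfire": {("T", "T"): 0.2, ("T", "F"): 0.8,
                    ("F", "T"): 0.05, ("F", "F"): 0.95}}
print(joint_probability({"Storm": "T", "Campfire": "T"}, parents, cpt))
```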



  • Strongly reminiscent of a Neural Net structure. Here we have conditional probabilities instead of weights.

    • See fig 6.3. pg 186.



  • Storm affects probability of someone lighting a campfire – not the other way around.

  • BBN’s allow us to state causality rules!!

  • Once we have learned the BBN, we can calculate the probability distribution of any attribute.

    • pg 186.



  • Learning the Network:

  • If the structure is known and all attributes are visible, then learn the probabilities as in the NB classifier.

  • If the structure is known but not all attributes are visible, we have hidden variables.

    • Train using NN-type Gradient ascent

    • Or EM algorithm.

  • If structure is unknown… various methods.



Gradient ascent training.

  • Maximize P(D|h) by going in direction of steepest ascent: Gradient(ln P(D|h))

  • Define ‘weight’ wijk as the conditional probability that Yi takes value yij with parents in the configuration uik.

  • General form (eqn 6.25):

  • wijk ← wijk + η Σd∈D [ Ph(yij, uik | d) / wijk ]



  • Weight update:

  • Since for given i and k the wijk’s must sum to 1, and after the additive update they can exceed 1, we update all the wijk’s and then renormalize:

  • wijk ← wijk / (Σj wijk)
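The renormalization step is simple to state in code; this sketch (my own, not from the slides) represents the weights for one fixed (i, k) as a dict over the values j.

```python
def renormalize(w_ik):
    """After the additive gradient step, the w_ijk for fixed i, k may
    sum to more than 1; rescale: w_ijk <- w_ijk / sum_j w_ijk."""
    total = sum(w_ik.values())
    return {j: w / total for j, w in w_ik.items()}

print(renormalize({"j1": 0.6, "j2": 0.6}))  # {'j1': 0.5, 'j2': 0.5}
```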



  • Works very well in practice, though it can get stuck on local optima, much like Gradient descent for NN’s!!



EM algorithm.

  • Alternate way of learning hidden variables given training data.

  • In general, use the mean value for an attribute when its value is not known.

  • “The EM algorithm can be used even for variables whose values have never been directly observed, provided the general form of the probability distribution governing those variables is known.”



Learning the means of several unknown variables.(Normal distribution)

  • Guess h = <μ1,μ2>

    • Must know variance.

  • Estimate the probability of data point i coming from each distribution, assuming h is correct.

  • Update h.



  • What complicates the situation is that we have more than one unknown variable.

  • The general idea is:

    • Pick randomly your estimate of the means.

    • Assuming they are correct, and using the standard deviation (which must be known and equal for all variables), figure out which data points are most likely to have come from which distribution. Those points will have the largest effect on the revision of that mean.



  • Then redefine each approximate mean as a weighted sample mean, where all data points are considered, but their effect on the kth distribution mean is weighted by how likely they were to come from it.

  • Loop till we get convergence to a set of means.



    • In general, we use a quality test to measure the fit of the sample data with the learned distribution means.

      • For normally distributed variables, we could use a hypothesis test:

        • How certain can we be that true-mean = estimated value, given the sample data.

    • If the quality function has only one maximum, this method will find it.

      • Otherwise, it can get stuck on a local maximum, but in practice works quite well.



    Example: Estimating means of 2 Gaussians.

    • Suppose we have two unknown variables, and we want their means μ1 and μ2.

      (true–mean 1, true-mean 2)

    • All we have is a few sample data points from these distributions {blue and green dots}

      We cannot see the actual distributions.

    • We know the standard deviation beforehand.



    Review of the normal distribution:

    • The inflection point is at one standard deviation: σ from the mean.

    • Centered at the mean, the probability of getting a point at most 1 σ away is 68%

      • This means the probability of drawing a point to the right of μ+1*σ is at most .5(1-.68) = 16%

    • At most 2 σ from the mean is 95%

      • This means the probability of drawing a point to the right of μ+2*σ is at most .5(1-.95) = 2.5%

    • At most 3 σ from the mean is 99.7%

      • This means the probability of drawing a point to the right of μ+3*σ is at most .5(1-.997) = 0.15%
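The three tail probabilities above can be verified from the normal CDF via the error function; this check is mine, not from the slides.

```python
import math

def upper_tail(k):
    """P(X > mu + k*sigma) for a normal variable:
    1 - Phi(k) = 0.5 * (1 - erf(k / sqrt(2)))."""
    return 0.5 * (1 - math.erf(k / math.sqrt(2)))

for k in (1, 2, 3):
    print(k, round(upper_tail(k), 4))
# k=1 -> ~0.1587 (the slide's 16%), k=2 -> ~0.0228, k=3 -> ~0.0013
```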



    • In the above, the “to the right” probability is the same as the “to the left” probability, but I am doing one-sided tails for a reason.

    • The important part is to note how quickly the “tails” of the normal distribution “fall off”.

      In other words, the probability of getting a specific data point drops drastically as we go away from the mean.



    Back to our example.

    • We randomly pick our estimates for the true means: h = <μ1, μ2>

    • Then we drop them on our data set and, for each data point i, find the probability that it came from distribution 1 and the probability that it came from distribution 2.



    • Recalling the discussion of how fast the probabilities drop as we move away from the mean of a normal distribution, we get approximately this:

    • There’s about a 60% chance that the blue dots came from Dist. 1, and less than .001% that they came from Dist. 2.

    • Similarly, there’s about a 50% chance that the green dots came from Dist. 2, and less than .001% that they came from Dist. 1.

      • These figures are estimates for qualitative understanding!

      • Real values can be found using the rigorous definition that is following shortly.



    • So we now calculate new estimates μ1’, μ2’ for true-mean1 and true-mean2.

      • Basically, we say:

        • Assume the blue points did come from Dist. 1.

          Then the mean μ1 should be more to the left.

        • Similarly, the mean μ2 should be more to the right since it seems the green points should be coming from Dist. 2.

      • The fact is we actually modify μ1 using all the data points (even those that we do not think come from distribution1) but we weight each point’s effect by how likely it is that the point came from distribution 1.



    • Now replace h = < μ1, μ2> with h’ = < μ1’, μ2’>

    • and repeat until we get convergence of h.



    Formally.

    • Initialize h = <μ1, μ2> using random values for the means.

    • Until h converges: DO

      • Calculate E(zij) for each hidden variable zij.

        • E(zij) = Likelihood that xi comes from distribution j.

      • Calculate our evolved hypothesis h’ = <μ1’,μ2’>

        assuming zij = E(zij) as calculated using h.

      • Replace h with h’.
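The loop above can be sketched directly; this is a minimal version where, unlike the slide's random initialization, the initial means are passed in explicitly so the run is reproducible, and the data points and sigma are made up.

```python
import math

def em_two_gaussians(xs, sigma, init, iters=50):
    """EM for the means of two Gaussians with known, equal sigma.
    E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
    M-step: mu_j = sum_i E[z_ij]*x_i / sum_i E[z_ij]."""
    mu = list(init)  # initial guesses for <mu1, mu2>
    for _ in range(iters):
        # E-step: responsibility of each distribution for each point.
        z = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            z.append([wj / total for wj in w])
        # M-step: weighted sample means; every point contributes,
        # weighted by how likely it came from that distribution.
        mu = [sum(z[i][j] * xs[i] for i in range(len(xs))) /
              sum(z[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

# Made-up sample points clustered near 1 and 5; sigma assumed known.
data = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
mu1, mu2 = sorted(em_two_gaussians(data, sigma=0.5, init=(0.0, 6.0)))
print(round(mu1, 1), round(mu2, 1))  # 1.0 5.0
```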



    • It can be proved that this will find a locally optimal h, given the training data.

      • If our quality function only has a global maximum, it is guaranteed to find it.

      • Otherwise, can get stuck on a local optimum.

        • Just like Gradient Ascent for BBN’s

          • Or gradient descent for NNs,

          • Etc.

    • Very reminiscent of an EA!!

      • For example: For a GA, have a population.

        • Optimize based on that population.

          • Create descendants and choose best….

        • Update the population.



    General EM algorithm.

    • Let X = {x1,…,xm} : observed data in m trials.

    • Let Z = {z1,…, zm}: unobserved data in m trials.

    • Let Y = X ∪ Z = the full data.

    • It is correct to treat Z as a random variable with a probability distribution completely determined by X and some parameters Θ to be learned.



    • Denote the current Θ as h.

    • We are looking for the hypothesis h’ that maximizes E(ln P(Y|h’)).

      • This is the maximum-likelihood hypothesis.

      • Like I said before, we want to use the current hypothesis to estimate the Quality of states reachable from h:

      • Q(h’|h) = E [ln P(Y|h’) | h, X]

      • Here, Q(h’|h) depends on h since we are calculating it using h.



    Algorithm:

    • Until we have convergence in h: DO

      • 1: Estimation: E

        • Calculate Q(h’|h) using current hypothesis h and observed X to estimate the probability distribution over the whole Y.

        • Q(h’|h)  E[ln P(Y|h’) |h, X]

      • 2: Maximization: M

        • Replace h with h’ that will maximize Q.

