Bayesian Learning. Pt 2. 6.7- 6.12 Machine Learning Promethea Pythaitha.

- Bayes Optimal Classifier.
- Gibbs Algorithm.
- Naïve Bayes Classifier.
- Bayesian Belief networks.
- EM algorithm.

- So far we have asked: which hypothesis is most likely to be correct?
- That is, which h is the MAP hypothesis?
- Recall MAP = maximum a posteriori.
- It is the hypothesis with the highest probability of being correct, given the sample data we have seen.

- What we usually want is the classification of a specific instance x not in the training data D.
- One way: Find hMAP and return its prediction for x.
- Or decide what is the most probable classification for x.

- Boolean classification:
- Have hypotheses h1, …, h6 with posterior probabilities P(hi|D), where h1 is the single most probable hypothesis (hMAP) with P(h1|D) = 25%.
- h1 classifies x as - , the rest as +.
- Then net support for - is 25%, and for + is 75%.
- The “Bayes Optimal Classification” is + even though hMAP says - .

- Bayes Optimal Classifier.
- Classifies an instance by taking the average of all the hypotheses' predictions, weighted by the credibility (posterior probability) of each hypothesis.
- Eqn 6.18.

Any system that classifies instances this way is a “Bayes Optimal Classifier”.
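The weighted vote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each hypothesis is treated as deterministic (P(v|h) is 1 for the label h predicts and 0 otherwise), so the rule reduces to summing posteriors per label and taking the largest. The posteriors for h2 to h6 are hypothetical. Only P(h1|D) = 0.25 and the 25% / 75% split come from the example above.

```python
from collections import defaultdict

def bayes_optimal_classify(posteriors, predictions):
    """Sum the posterior P(h|D) of every hypothesis behind each label
    and return the label with the largest total support."""
    support = defaultdict(float)
    for h, p in posteriors.items():
        support[predictions[h]] += p
    best = max(support, key=support.get)
    return best, dict(support)

# Hypothetical posteriors: h2..h6 are made up so that they sum to 0.75.
posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15,
              "h4": 0.15, "h5": 0.15, "h6": 0.15}
predictions = {"h1": "-", "h2": "+", "h3": "+",
               "h4": "+", "h5": "+", "h6": "+"}

h_map = max(posteriors, key=posteriors.get)      # single most probable hypothesis
label, support = bayes_optimal_classify(posteriors, predictions)
print("hMAP:", h_map, "predicts", predictions[h_map])
print("Bayes optimal label:", label, support)
```

With these numbers hMAP is h1 and votes - , while the combined vote favors +.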

- No other classification method, given the same hypothesis space and prior knowledge, can outperform this method on average!!!
- A particularly interesting result is that the predictions made by a BOC can correspond to hypotheses that are not even in its hypothesis space.
- This helps deal with the limitations imposed by overly restricted hypothesis spaces.

- Bayes Optimal Classification is the best – on average – but it can be very computationally costly.
- First it has to learn all the posterior probabilities for the hypotheses in H.
- Then it has to poll the hypotheses to find what each one predicts for x’s classification.
- Then it has to compute this big weighted sum (eqn 6.18).
- But remember, hypothesis spaces get very large….
- Recall the reason why we used specific and general boundaries in the CELA algorithm.

- One way to avoid some of the computation is the Gibbs algorithm:
- 1: Select h from H at random, according to the posterior-probability distribution.
- So more “credible” hypotheses are selected with higher probability than others.

- 2: Return h(x).
- This saves the time of computing the results hi(x) for all hi in H, and doing the big sum…. But it is less optimal.

- If we compute expected misclassification error of the Gibbs algorithm over target concepts drawn at random based on the a-priori probability distribution assumed by the learner,
- Then this error will be at most twice that for the B.O.C.
- ** On Average.
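A minimal sketch of the Gibbs algorithm, assuming the posteriors P(h|D) are already available as a dictionary. The function name, the hypothesis names, and the numbers are all hypothetical and serve only to show the sampling step.

```python
import random

def gibbs_classify(posteriors, predictions, rng=random):
    """Draw ONE hypothesis at random, with probability proportional to its
    posterior, and return that hypothesis's prediction for the instance."""
    hypotheses = list(posteriors)
    weights = [posteriors[h] for h in hypotheses]
    chosen = rng.choices(hypotheses, weights=weights, k=1)[0]
    return predictions[chosen]

# Hypothetical posteriors and per-hypothesis predictions for one instance x.
posteriors = {"h1": 0.25, "h2": 0.45, "h3": 0.30}
predictions = {"h1": "-", "h2": "+", "h3": "+"}

rng = random.Random(0)   # seeded only so repeated runs give the same draws
print(gibbs_classify(posteriors, predictions, rng))
```

Unlike the Bayes optimal classifier, a single call looks at only one hypothesis, so repeated calls on the same x can return different labels.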

- The Naïve Bayes classifier is a highly practical Bayesian learner.
- Can, under the right conditions, perform as well as neural nets or decision trees.
- Applies to any learning task where each instance x is described by a conjunction of attributes (or attribute-value pairs) and where the target function f(x) can take any value in a finite set V.
- We are given training data D, and asked to classify a new instance x = <a1, a2, …, an>

- Classify a new instance by assigning the most probable target value: vMAP, given the attribute values of the instance.
- Eqn 6.19.

- To get the classification we need to find the vj that maximizes
P(a1, a2, …, an| vj )*P(vj )

- Second term: EASY.
- It’s simply # instances with classification vj over total # instances.

- First term: Hard!!!
- Need a HUGE set of training data.
- Suppose we have 10 attributes (with 2 possibilities each) and 15 classifications: 2^10 × 15 = 15,360 possibilities.
- And we have, say, 200 instances with known classifications.
- Cannot get a reliable estimate!!!

- One way to ‘fix’ the problem is to assume the attributes are conditionally independent.
- Assume P(a1, a2, …, an| vj) = Πi P(ai| vj)
- Then the Naïve Bayes Classifier uses this for the prediction:
- Eqn 6.20.
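The equation images did not survive in this transcript. Reconstructed from the definitions above (and assumed to match the textbook's numbering), the MAP target value and its Naïve Bayes simplification can be written as:

```latex
% Eqn 6.19: most probable target value given the attribute values
v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)
        = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}
        = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)

% Eqn 6.20: substitute the conditional-independence assumption
v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)
```

The denominator drops out because it does not depend on vj.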

- 1: Learn the P(vj) and P(ai| vj) for all a’s and v’s (based on training data).
- In our example there are 10(2)*15 = 300 values of P(ai| vj).
- Sample size of 200 is plenty!

- ** This set of numbers is the learned hypothesis.
- 2: Use this hypothesis to find vNB.
- IF our “naïve assumption” of conditional independence is true, then vNB = vMAP.

- In Naïve Bayes learning, there is no explicit search through a hypothesis space.
- Does not produce an inference rule (D-tree) or a weight vector (NN).
- Instead it forms the hypothesis by observing frequencies of various data combinations.

- There are four possible end-states of stellar evolution:
- 1: White-Dwarf.
- 2: Neutron Star.
- 3: Black-hole.
- 4: Brown dwarf.

- White dwarf: about the size of the Earth, with about the mass of our Sun (up to 1.44 solar masses).
- The little white dot in the center of each is the White Dwarf.

- Neutron star: about the size of the Gallatin Valley.
- About twice the mass of our Sun (up to 2.9 solar masses).
- Don’t go too close! Their gravitational and electromagnetic fields are strong enough to stretch you into spaghetti, rip out every metal atom in your body, and finally spread you across the whole surface!!
- Form in Type II supernovae.

The Neutron star is a tiny speck at the center of that cloud.

- Black hole: the ultimate cosmic sink-hole, even devours light!!
- Time-dilation, etc. come into effect near the event-horizon.

- Brown dwarf: stars that never got hot enough to start fusion (< .1 solar masses).

- Because it is hard to get data and impossible to observe these from close up, we need an accurate way of identifying these remnants.
- Two ways:
- 1: Computer model of stellar structure.
- Create a program that models a star. It has to estimate the equations of state governing the more bizarre remnants (such as Neutron Stars), as they involve super-nuclear densities and are not well predicted by quantum mechanics.
- Learning algorithms (such as NN’s) are sometimes used to tune the model based on known stellar remnants.

- 2: Group an unclassified remnant with others having similar attributes.

- The latter is more like a Bayesian Classifier.
- Define (for masses of progenitor stars)
- .1 to 10 Solar Masses = Average.
- 10 to 40 Solar Masses = Giant
- 40 to 150 Solar Masses = Supergiant
- 0 to .1 Solar Masses = Tiny.

- Define (for masses of remnants)
- < 1.44 Solar masses = Small
- 1.44 to 2.9 Solar masses = Medium
- > 2.9 Solar masses = Large.

- Define classifications:
- WD = White Dwarf.
- NS = Neutron Star.
- BH = Black hole
- BD = Brown Dwarf.

- If we find a new stellar remnant with attributes <Average, Medium>, we could certainly put its mass into a stellar model that has been fine-tuned by our Neural Net, or we could simply use a Bayesian classification.

- Either would give the same result:
- Comparing with data we have, and matching attributes, this has to be a Neutron star.
- Similarly we can predict <Tiny, Small> → Brown Dwarf.
- <Supergiant, Large> → Black Hole.

- See table 3.2 pg59.
- Possible target values = {no, yes}
- P(yes) = 9/14 = .64
- P(no) = 1-P(yes) = 5/14 = .36
- Want to know: PlayTennis? If <sunny, cool, high, strong>
- Need P(sunny|no), P(sunny|yes), etc…
- P(sunny|no) = #sunny’s in no category / #no’s = 3/5.
- P(sunny|yes) = 2/9. etc…
- Support for no = P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = .0206.
- Support for yes = … = .0053.
- NB classification: NO.
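The PlayTennis numbers can be reproduced by plain frequency counting. The sketch below is illustrative only. The 14 rows are transcribed from the well-known PlayTennis table (an assumption worth checking against Table 3.2), and the helper names `train` and `support` are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (Outlook, Temperature, Humidity, Wind, PlayTennis); assumed to match Table 3.2.
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def train(rows):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    for *attrs, label in rows:
        for i, value in enumerate(attrs):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts

def support(instance, class_counts, cond_counts):
    """P(v) * product of P(a_i | v) for each label v, using raw frequency ratios."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = Fraction(n, total)
        for i, value in enumerate(instance):
            score *= Fraction(cond_counts[(i, label)][value], n)
        scores[label] = score
    return scores

class_counts, cond_counts = train(DATA)
scores = support(("sunny", "cool", "high", "strong"), class_counts, cond_counts)
prediction = max(scores, key=scores.get)
print({label: round(float(s), 4) for label, s in scores.items()}, prediction)
```

Dividing the support for no by the sum of both supports gives a conditional probability of roughly .795 for no.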

- Usually
- P(event) = (# times event occurred)/(total # trials)

- Fair coin: 50/50 heads/tails.
- So out of two tosses, we expect 1 head, 1 tail.
- Don’t bet your life on it!!!
- What about 10 tosses?
- P(all tails) = (½)^10 = 1/1024.
- More likely than winning the lottery, which DOES happen!

- Using the simple ratio induces a bias:
- ex: P(heads) = 0/10 = 0.
- NOT TRUE!!
- Sample not representative of population.

- And a zero estimate will dominate the NB classifier.
- Multiplication by 0.

- m-estimate of probability: P(event e) = [# times e occurred + m*p]/[# trials + m]
- m = “equivalent sample size.”
- p = prior estimate of P(e).
- Essentially assumes m virtual trials, distributed according to p (usually uniform), in addition to the real ones.
- Reduces the small-sample bias.
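As a quick illustration, the m-estimate applied to the ten-tails coin example looks like this. The values m = 2 and p = 0.5 are assumptions chosen only for the example, and the function name is hypothetical.

```python
def m_estimate(n_event, n_trials, p, m):
    """m-estimate of probability: add m virtual trials distributed according to p."""
    return (n_event + m * p) / (n_trials + m)

raw = 0 / 10                               # simple ratio: 0 heads in 10 tosses
smoothed = m_estimate(0, 10, p=0.5, m=2)   # (0 + 1) / (10 + 2)
print(raw, round(smoothed, 4))
```

The smoothed estimate is small but non-zero, so it no longer wipes out the whole Naïve Bayes product. Setting m = 0 recovers the simple ratio.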

- Text classification example: used a Naïve Bayes approach to decide “like” or “dislike” based on words in the text.
- Simplifying assumptions:
- 1: Position of word did not matter.
- 2: Most common 100 words were removed from consideration.

- Overall performance = 89% accuracy
- Versus 5% for random guessing.

- The assumption of Conditional independence may not be true!!
- EX:
- v = lives in community “k”
- Suppose we know that a survey has been done indicating 90% of the people there are young-earth creationists.

- h1 = Is a Young-earth Creationist.
- h2 = Discredits Darwinian Evolution.
- h3 = Likes Carrots.
- Clearly h1 and h2 are not independent given v, but h3 is unaffected by either.

- In any set of attributes, some will be conditionally independent.
- And some will not.

- Bayesian belief networks allow conditional independence rules to be stated for certain subsets of the attributes.
- Best of both worlds:
- More realistic than assuming all attributes are conditionally independent.
- Computationally cheaper than if we ignore the possibility of independent attributes.

- Formally a Bayesian belief network describes a probability distribution over a set of variables.
- If we have variables Y1, …, Yn
- Yk has domain Vk
- then the Bayesian belief network describes a joint probability distribution over V1 × V2 × … × Vn.

- Each variable in the instance is a node in the BBN.
- Every node has:
- 1: Network arcs, which assert that the variable is conditionally independent of its non-descendants, given its parents.
- 2: A conditional probability table, which defines the distribution of the variable given the values of its parents.

- Strongly reminiscent of a Neural Net structure. Here we have conditional probabilities instead of weights.
- See fig 6.3. pg 186.

- Storm affects probability of someone lighting a campfire – not the other way around.
- BBN’s allow us to state causality rules!!
- Once we have learned the BBN, we can calculate the probability distribution of any attribute (pg 186).
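To make the last point concrete, here is a minimal sketch of computing a marginal from a tiny network in which Campfire has two parents. Every number below is hypothetical and is not taken from fig 6.3. The second parent, called `bus` here, is also an assumption added for illustration. The joint probability is the product of each node's probability given its parents.

```python
from itertools import product

# All probabilities are hypothetical, chosen only for illustration.
p_storm = 0.2
p_bus = 0.5
# P(Campfire = True | Storm, Bus), keyed by the parents' values.
p_campfire_given = {
    (True, True): 0.4,
    (True, False): 0.1,
    (False, True): 0.8,
    (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(storm, bus, campfire) = P(storm) * P(bus) * P(campfire | storm, bus)."""
    p_parents = (p_storm if storm else 1 - p_storm) * (p_bus if bus else 1 - p_bus)
    p_c = p_campfire_given[(storm, bus)]
    return p_parents * (p_c if campfire else 1 - p_c)

# Marginal P(Campfire = True): sum the joint over all parent configurations.
p_campfire = sum(joint(s, b, True) for s, b in product([True, False], repeat=2))
print(round(p_campfire, 4))
```

Enumerating every configuration like this is only practical for very small networks.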

- Learning the Network:
- If structure is known and all attributes are visible, then learn probabilities like in NB classifier.
- If structure is known, but not all attributes are visible, have hidden values.
- Train using NN-type Gradient ascent
- Or EM algorithm.

- If structure is unknown… various methods.

- Maximize P(D|h) by going in the direction of steepest ascent: Gradient(ln P(D|h))
- Define ‘weight’ wijk as the conditional probability that Yi takes value yij when its parents are in the configuration uik.
- General form:
- Eqn 6.25.

- Weight update:
- For given i and k, the wijk’s must sum to 1 over j, and each must stay between 0 and 1. After a gradient step this may no longer hold, so we update all wijk’s and then normalize.
- wijk ← wijk / (Σj wijk)
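A sketch of one update for a single (i, k) pair, that is, one column of a conditional probability table. It assumes the gradient of ln P(D|h) with respect to wijk is the sum over training examples d of P(yij, uik | d) divided by wijk, which is the usual form of this result. The posterior sums passed in would come from inference in the network and are hypothetical numbers here, as are the function name and learning rate.

```python
def update_cpt_column(weights, posterior_sums, learning_rate):
    """One gradient-ascent step on the weights w_ijk for fixed i and k,
    followed by renormalization so they again sum to 1.

    weights[j]        -- current w_ijk (must be > 0)
    posterior_sums[j] -- sum over examples d of P(y_ij, u_ik | d), assumed given
    """
    stepped = [w + learning_rate * s / w
               for w, s in zip(weights, posterior_sums)]
    total = sum(stepped)
    return [w / total for w in stepped]

# Hypothetical inputs for a binary variable Yi under one parent configuration.
new_weights = update_cpt_column([0.5, 0.5], [3.0, 1.0], learning_rate=0.1)
print([round(w, 4) for w in new_weights])
```

Because the posterior sums are non-negative, the stepped weights stay positive, and dividing by their total puts each one back in the interval from 0 to 1.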

- Works very well in practice, though it can get stuck on local optima, much like Gradient descent for NN’s!!

- EM algorithm: an alternate way of learning in the presence of hidden variables, given training data.
- In general, use the expected (mean) value for an attribute when its value is not known.
- “The EM algorithm can be used even for variables whose value is never even directly observed, provided the general form of the probability distribution governing those variables is known.”

- Guess h = <μ1,μ2>
- Must know variance.

- Estimate the probability of data point i coming from each distribution, assuming h is correct.
- Update h.

- What complicates the situation is that we have more than one unknown variable.
- The general idea is:
- Pick your initial estimates of the means at random.
- Assuming they are correct, and using the standard deviation (which must be known and equal for all variables) figure out which data points are most likely to have come from which distributions. They will have the largest effect on the revision of that mean.

- Then redefine each approximate mean as a weighted sample mean, where all data points are considered, but their effect on the kth distribution mean is weighted by how likely they were to come from it.

- In general, we use a quality test to measure the fit of the sample data with the learned distribution means.
- For normally distributed variables, we could use a hypothesis test:
- How certain can we be that true-mean = estimated value, given the sample data?

- If the quality function has only one maximum, this method will find it.
- Otherwise, it can get stuck on a local maximum, but in practice it works quite well.

- Suppose we have two unknown variables, and we want their means μ1 and μ2.
(true–mean 1, true-mean 2)

- All we have is a few sample data points from these distributions {blue and green dots}
We cannot see the actual distributions.

- We know the standard deviation beforehand.

- The inflection point is at one standard deviation σ from the mean.
- The probability of getting a point at most 1 σ away from the mean is about 68%.
- This means the probability of drawing a point to the right of μ+1*σ is about .5(1-.68) = 16%

- At most 2 σ from the mean is about 95%.
- This means the probability of drawing a point to the right of μ+2*σ is about .5(1-.95) = 2.5%

- At most 3 σ from the mean is about 99.7%.
- This means the probability of drawing a point to the right of μ+3*σ is about .5(1-.997) = 0.15%

- In the above, the “to the right probability” is the same for the left, but I am doing one-sided for a reason.
- The important part is to note how quickly the “tails” of the normal distribution “fall off”.
In other words, the probability of getting a specific data point drops drastically as we go away from the mean.
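The 68 / 95 / 99.7 figures and the one-sided tail probabilities can be checked with the error function from Python's standard library. This is just a numerical check, and the helper names are hypothetical.

```python
import math

def within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

def right_tail(k):
    """P(X > mu + k * sigma): half of what lies outside the central band."""
    return 0.5 * (1 - within(k))

for k in (1, 2, 3):
    print(k, round(within(k), 4), round(right_tail(k), 5))
```

The right-tail value shrinks by roughly a factor of seven from 1σ to 2σ and by a further factor of about seventeen from 2σ to 3σ, which is the rapid fall-off described above.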

- We randomly pick our estimates for the true means: h = <μ1, μ2>
- Then we drop them on our data set, and for each ith data point, find the probability that it came from distribution 1, and that from distribution 2.

- Recalling how fast the probabilities fall off as we move away from the mean of a normal distribution, we get approximately this:
- There’s about a 60% chance that the blue dots came from Dist. 1, and less than .001% that they came from Dist. 2.
- Similarly there’s about a 50% chance that the green dots came from Dist. 2, and less than .001% that they came from Dist. 1.
- These figures are estimates for qualitative understanding!
- Real values can be found using the rigorous definition that follows shortly.

- So we now calculate new estimates μ1’, μ2’ for true-mean1 and true-mean2.
- Basically, we say:
- Assume the blue points did come from Dist. 1. Then the mean μ1 should be more to the left.

- Similarly, the mean μ2 should be more to the right since it seems the green points should be coming from Dist. 2.

- In fact, we modify μ1 using all the data points (even those that we do not think come from distribution 1), but we weight each point’s effect by how likely it is that the point came from distribution 1.

- Now replace h = < μ1, μ2> with h’ = < μ1’, μ2’>
- and repeat till we get convergence of h.

- Initialize h = <μ1, μ2> using random values for the means.
- Until h converges: DO
- Calculate E(zij) for each hidden variable zij.
- E(zij) = probability, given the current h, that xi comes from distribution j.

- Calculate our evolved hypothesis h’ = <μ1’,μ2’>, assuming zij = E(zij) as calculated using h.

- Replace h with h’.
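The loop above can be sketched as follows. This is an illustrative implementation under the stated assumptions only: two normal distributions with a known, shared standard deviation, each equally likely to generate a point. The data, starting means, tolerance, and function name are hypothetical. The starting means are passed in rather than drawn at random so the run is repeatable.

```python
import math

def em_two_means(xs, mu_init, sigma, tol=1e-9, max_iter=1000):
    """Estimate the means of a mixture of normals with known, equal sigma."""
    mu = list(mu_init)
    for _ in range(max_iter):
        # E step: E[z_ij] = probability that x_i came from distribution j, given mu.
        resp = []
        for x in xs:
            dens = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: each new mean is a sample mean weighted by E[z_ij].
        new_mu = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(len(mu))
        ]
        if max(abs(a - b) for a, b in zip(mu, new_mu)) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical sample: three points near 0 and three near 10.
means = em_two_means([-1.0, 0.0, 1.0, 9.0, 10.0, 11.0], (2.0, 6.0), sigma=1.0)
print([round(m, 3) for m in means])
```

Every point contributes to both means, but a point far from a mean gets a weight close to zero for it. If a point is extremely far from all current means, the densities can underflow to zero, so a production version would work with log probabilities.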

- It can be proved that this will find a locally optimal h, given the training data.
- If our quality function has only a single maximum, it is guaranteed to find it.
- Otherwise, it can get stuck on a local optimum.
- Just like Gradient Ascent for BBN’s
- Or Gradient ascent for NN’s,
- Etc.

- Very reminiscent of an EA!!
- For example: For a GA, have a population.
- Optimize based on that population.
- Create descendants and choose best….

- Update the population.

- Let X = {x1,…,xm} : observed data in m trials.
- Let Z = {z1,…, zm}: unobserved data in m trials.
- Let Y = X ∪ Z = full data.
- It is correct to treat Z as a random variable with a probability distribution completely determined by X and some parameters Θ to be learned.

- Denote the current Θ as h.
- We are looking for the hypothesis h’ that maximizes E(ln P(Y|h’)).
- This is the maximum-likelihood hypothesis.
- Like I said before, we want to use the current hypothesis to estimate the Quality of states reachable from h:
- Q(h’|h) = E [ln P(Y|h’) | h, X]
- Here, Q(h’|h) depends on h since we are calculating the expectation using h.

- Until we have convergence in h: DO
- 1: Estimation: E
- Calculate Q(h’|h) using current hypothesis h and observed X to estimate the probability distribution over the whole Y.
- Q(h’|h) ← E[ln P(Y|h’) |h, X]

- 2: Maximization: M
- Replace h with h’ that will maximize Q.
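For the earlier two-means problem, the two steps take a concrete form. The block below is a reconstruction under the same assumptions as before (two normal distributions, known shared variance σ², each equally likely to generate a point), not a quotation of the slides:

```latex
% E step: expected value of each hidden indicator under the current h = <mu_1, mu_2>
E[z_{ij}] = \frac{\exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)}
                 {\sum_{n=1}^{2} \exp\!\left(-\frac{(x_i - \mu_n)^2}{2\sigma^2}\right)}

% Quality function for a candidate h' = <mu'_1, mu'_2>
Q(h' \mid h) = E[\ln P(Y \mid h') \mid h, X]
             = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}}
               - \frac{1}{2\sigma^2} \sum_{j=1}^{2} E[z_{ij}]\,(x_i - \mu'_j)^2 \right)

% M step: the h' that maximizes Q
\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}
```

Setting the derivative of Q with respect to each μ’j to zero gives the weighted sample mean in the last line.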
