# Bayesian learning finalized (with high probability)

## Everything’s random...
• Basic Bayesian viewpoint:
• Treat (almost) everything as a random variable
• Data/independent var: X vector
• Class/dependent var: Y
• Parameters: Θ
• E.g., mean, variance, correlations, multinomial params, etc.
• Use Bayes’ Rule to assess probabilities of classes
• Allows us to say: “It is very unlikely that the mean height is 2 light years”
## Uncertainty over params
• Maximum likelihood treats parameters as (unknown) constants
• Job is just to pick the constants so as to maximize data likelihood
• Full-blown Bayesian modeling treats params as random variables
• PDF over parameter variables tells us how certain/uncertain we are about the location of that parameter
• Also allows us to express prior beliefs (probabilities) about params
## Example: Coin flipping
• Have a “weighted” coin -- want to figure out θ = Pr[heads]
• Maximum likelihood:
• Flip coin a bunch of times, measure #heads; #tails
• Use estimator to return a single value for θ (see the sketch below)
• This is called a point estimate
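As a concrete illustration, here is a minimal Python sketch of the maximum-likelihood point estimate; the flip sequence is a made-up example, not data from the slides:

```python
# Maximum-likelihood point estimate for a weighted coin.
flips = "HHTHTHHHTH"              # made-up observed flips
heads = flips.count("H")
tails = flips.count("T")

# MLE picks the single value of theta that maximizes the data likelihood:
theta_hat = heads / (heads + tails)
print(f"point estimate: theta = {theta_hat:.2f}")   # 0.70 for this sequence
```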
## Example: Coin flipping
• Have a “weighted” coin -- want to figure out θ = Pr[heads]
• Bayesian posterior estimation:
• Start w/ distribution over what θ might be
• Flip coin a bunch of times, measure #heads; #tails
• Update distribution, but never reduce to a single number (see the sketch below)
• Always keep around Pr[θ | data]: posterior estimate
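A minimal sketch of the Bayesian alternative, keeping a whole posterior distribution over θ instead of a single number; the grid resolution and flip sequence are illustrative assumptions:

```python
import numpy as np

# Discretize theta on a grid and start from a uniform prior over (0, 1).
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Made-up flip data.
flips = "HHTHTHHHTH"
heads, tails = flips.count("H"), flips.count("T")

# Likelihood of the observed counts under each candidate theta.
likelihood = theta**heads * (1 - theta)**tails

# Bayes' rule: posterior is proportional to likelihood * prior; then normalize.
posterior = likelihood * prior
posterior /= posterior.sum()

# Pr[theta | data] is kept around as a full distribution, not a point.
print("posterior mean:", (theta * posterior).sum())
```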
## Example: Coin flipping

[Figure: distribution over θ before any data -- 0 flips total]

## Example: Coin flipping

[Figure: posterior distribution over θ after 100 flips total]

## How does it work?
• Think of parameters as just another kind of random variable
• Now your data distribution is Pr[X | Θ]
• This is the generative distribution
• A.k.a. observation distribution, sensor model, etc.
• What we want is some model of the parameter as a function of the data: Pr[Θ | X]
• Get there with Bayes’ rule: Pr[Θ | X] = Pr[X | Θ] Pr[Θ] / Pr[X]
## What does that mean?
• Let’s look at the parts:
• Generative distribution Pr[X | Θ]
• Describes how data is generated by the underlying process
• Usually easy to write down (well, easier than the other parts, anyway)
• Same old PDF/PMF we’ve been working with
• Can be used to “generate” new samples of data that “look like” your training data (see the sketch below)
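For instance, a small sketch of “generating” new coin-flip data from the generative distribution; the parameter value 0.7 and the sample size are assumptions for illustration:

```python
import random

def generate_flips(theta, n, seed=0):
    """Sample n flips from the generative model Pr[X | theta]."""
    rng = random.Random(seed)
    return ["H" if rng.random() < theta else "T" for _ in range(n)]

# New samples that "look like" data from a coin with Pr[heads] = 0.7.
print(generate_flips(theta=0.7, n=10))
```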
## What does that mean?
• The parameter prior or a priori distribution: Pr[Θ]
• Allows you to say “this value of Θ is more likely than that one is...”
• Allows you to express beliefs/assumptions/preferences about the parameters of the system
• Also takes over when the data is sparse (small N)
• In the limit of large data, prior should “wash out”, letting the data dominate the estimate of the parameter (see the sketch below)
• Can let Pr[Θ] be “uniform” (a.k.a., “uninformative”) to minimize its impact
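A small numerical sketch of the “wash out” effect: two quite different priors over θ lead to nearly the same posterior mean once the data set is large. The grid, the priors, and the counts are all illustrative assumptions:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)

# Two different priors: uniform vs. strongly peaked around theta = 0.5.
uniform_prior = np.ones_like(theta)
peaked_prior = np.exp(-((theta - 0.5) ** 2) / 0.005)

def posterior_mean(prior, heads, tails):
    post = prior * theta**heads * (1 - theta)**tails
    post /= post.sum()
    return (theta * post).sum()

for heads, tails in [(7, 3), (700, 300)]:       # small N vs. large N
    print(f"{heads + tails:4d} flips:",
          round(posterior_mean(uniform_prior, heads, tails), 3),
          round(posterior_mean(peaked_prior, heads, tails), 3))
```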
## What does that mean?
• The data prior: Pr[X]
• Expresses the probability of seeing data set X independent of any particular model
• Huh?

## What does that mean?
• The data prior: Pr[X]
• Expresses the probability of seeing data set X independent of any particular model
• Can get it from the joint data/parameter model: Pr[X] = ∫ Pr[X | Θ] Pr[Θ] dΘ
• In practice, often don’t need it explicitly (why? see the sketch below)
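A sketch of that marginalization on a discrete grid, and of why Pr[X] can usually be left implicit: it does not depend on Θ, so dividing by it is just the normalization step of the posterior. The grid and the counts are illustrative assumptions:

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)        # uniform prior over the grid
heads, tails = 7, 3                             # made-up counts

likelihood = theta**heads * (1 - theta)**tails

# Data prior / marginal likelihood: Pr[X] = sum over theta of Pr[X | theta] Pr[theta]
pr_X = (likelihood * prior).sum()

# Dividing by Pr[X] is exactly the normalization of the posterior,
# which is why it rarely needs to be computed explicitly.
posterior = likelihood * prior / pr_X
print(pr_X, posterior.sum())                    # posterior sums to 1
```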
## What does that mean?
• Finally, the posterior (or a posteriori) distribution: Pr[Θ | X]
• Lit., “from what comes after” (Latin)
• Essentially, “What we believe about the parameter after we look at the data”
• As compared to the “prior” or “a priori” (lit., “from what is before”) parameter distribution Pr[Θ]
## Example: coin flipping
• A (biased) coin lands heads-up w/ prob p and tails-up w/ prob 1-p
• Parameter of the system is p
• Goal is to find Pr[p | sequence of coin flips]
• (Technically, we want a PDF, f(p | flips))
• Q: what family of PDFs is appropriate?

Normalization constant: “Beta function”

Pr[tails]

Example: coin flipping
• We need a PDF that generates possible values of p
• p∈[0,1]
• Commonly used distribution is beta distribution:
## The Beta Distribution

[Figure: beta distribution PDFs for various (α, β); image courtesy of Wikimedia Commons]
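A minimal sketch that evaluates the beta PDF at a few points for several hyperparameter settings, in the spirit of the figure above; the (α, β) values are illustrative:

```python
from math import gamma

def beta_pdf(p, alpha, beta):
    """Beta density f(p | alpha, beta) for p in (0, 1)."""
    B = gamma(alpha) * gamma(beta) / gamma(alpha + beta)   # Beta function B(alpha, beta)
    return p**(alpha - 1) * (1 - p)**(beta - 1) / B

# A few shapes: uniform, symmetric around 0.5, and skewed toward heads.
for alpha, beta in [(1, 1), (5, 5), (8, 2)]:
    values = [round(beta_pdf(p, alpha, beta), 3) for p in (0.25, 0.5, 0.75)]
    print(f"alpha={alpha}, beta={beta}: f(0.25), f(0.5), f(0.75) =", values)
```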

## Generative distribution
• f(p | α, β) is the prior distribution for p
• Parameters α and β are hyperparameters
• Govern shape of f()
• Still need the generative distribution: Pr[h, t | p]
• h, t: number of heads, tails
• Use a binomial distribution: Pr[h, t | p] = C(h+t, h) p^h (1-p)^t (see the sketch below)
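A short sketch of that likelihood term, assuming the binomial form above; the counts and candidate values of p are made up for illustration:

```python
from math import comb

def binomial_likelihood(h, t, p):
    """Pr[h heads, t tails | p] under the binomial model."""
    return comb(h + t, h) * p**h * (1 - p)**t

# Likelihood of 7 heads / 3 tails under two candidate values of p.
print(binomial_likelihood(7, 3, 0.5))   # ~0.117
print(binomial_likelihood(7, 3, 0.7))   # ~0.267 -- higher, as expected
```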
## Posterior
• Now, by Bayes’ rule: f(p | h, t) ∝ Pr[h, t | p] f(p | α, β) ∝ p^(h+α-1) (1-p)^(t+β-1)
• That is, the posterior is again a beta distribution, Beta(α+h, β+t) (see the sketch below)
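Because the beta prior is conjugate to the binomial likelihood, the update reduces to adding the observed counts to the hyperparameters. A minimal sketch; the prior and the counts are illustrative assumptions:

```python
def update_beta(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + binomial coin-flip data -> Beta posterior."""
    return alpha + heads, beta + tails

# Mild prior belief that the coin is roughly fair...
alpha, beta = 2, 2
# ...updated with some made-up flips.
alpha_post, beta_post = update_beta(alpha, beta, heads=7, tails=3)

posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, posterior_mean)   # Beta(9, 5), mean ~ 0.64
```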
## Exercise
• Suppose you want to estimate the average air speed of an unladen (African) swallow
• Let’s say that airspeeds of individual swallows, x, are Gaussian-distributed with mean μ and variance 1: f(x | μ) = N(μ, 1)
• Let’s say, also, that we think the mean is “around” 50 kph, but we’re not sure exactly what it is. Our uncertainty (variance) about the mean is 10: f(μ) = N(50, 10)
• Derive the posterior estimate of the mean airspeed (a check of the standard result follows below)
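As a check on the derivation, here is a sketch of the standard normal-normal conjugate update under the slide’s assumptions (known observation variance 1, prior N(50, 10) on the mean); the airspeed measurements themselves are made up:

```python
import numpy as np

mu0, var0 = 50.0, 10.0      # prior over the mean airspeed: N(mu0, var0)
obs_var = 1.0               # observation model: x ~ N(mu, 1)

# Made-up airspeed measurements (kph), for illustration only.
x = np.array([48.2, 51.0, 49.5, 50.3, 47.8])
n = len(x)

# Standard normal-normal conjugate update for the unknown mean:
#   posterior precision = prior precision + n * observation precision
post_var = 1.0 / (1.0 / var0 + n / obs_var)
post_mean = post_var * (mu0 / var0 + x.sum() / obs_var)

print(f"posterior mean = {post_mean:.2f} kph, posterior variance = {post_var:.3f}")
```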