Bayesian learning finalized (with high probability)
Presentation Transcript
Everything’s random...
  • Basic Bayesian viewpoint:
  • Treat (almost) everything as a random variable
    • Data/independent var: X vector
    • Class/dependent var: Y
    • Parameters: Θ
      • E.g., mean, variance, correlations, multinomial params, etc.
  • Use Bayes’ Rule to assess probabilities of classes
  • Allows us to say: “It is very unlikely that the mean height is 2 light years”
Uncertainty over params
  • Maximum likelihood treats parameters as (unknown) constants
    • Job is just to pick the constants so as to maximize data likelihood
  • Full-blown Bayesian modeling treats params as random variables
    • PDF over parameter variables tells us how certain/uncertain we are about the location of that parameter
    • Also allows us to express prior beliefs (probabilities) about params
Example: Coin flipping
  • Have a “weighted” coin -- want to figure out θ=Pr[heads]
  • Maximum likelihood:
    • Flip coin a bunch of times, measure #heads; #tails
    • Use estimator to return a single value for θ
    • This is called a point estimate
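
As a minimal sketch (not from the deck; the “true” bias and flip count are made-up values for simulation), the maximum-likelihood point estimate is simply #heads / #flips:

```python
import random

random.seed(0)
true_theta = 0.7          # hypothetical "true" bias, used only to simulate flips
flips = [random.random() < true_theta for _ in range(1000)]

heads = sum(flips)
tails = len(flips) - heads

# Maximum-likelihood point estimate: theta_hat = #heads / #flips
theta_hat = heads / (heads + tails)
print(f"heads={heads}, tails={tails}, theta_hat={theta_hat:.3f}")
```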
Example: Coin flipping
  • Have a “weighted” coin -- want to figure out θ=Pr[heads]
  • Bayesian posterior estimation:
    • Start w/ distribution over what θ might be
    • Flip coin a bunch of times, measure #heads; #tails
    • Update distribution, but never reduce to a single number
    • Always keep around Pr[θ | data]: posterior estimate
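
A small illustrative sketch of this idea (my own, not from the deck): maintain a discretized distribution over θ and update it with Bayes’ rule after each flip, so the full posterior Pr[θ | data] is kept around rather than a single number. The flip sequence is made up.

```python
import numpy as np

thetas = np.linspace(0, 1, 101)       # grid of candidate values for theta
posterior = np.ones_like(thetas)      # start from a uniform prior over the grid
posterior /= posterior.sum()

flips = [1, 0, 1, 1, 0, 1, 1, 1]      # made-up data: 1 = heads, 0 = tails

for flip in flips:
    likelihood = thetas if flip == 1 else (1 - thetas)
    posterior *= likelihood           # Bayes' rule (unnormalized update)
    posterior /= posterior.sum()      # renormalize

# The whole distribution is retained; summaries can be computed from it
print("posterior mean of theta:", np.sum(thetas * posterior))
```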
Example: Coin flipping

[Figure: distribution over θ; 0 flips total]

Example: Coin flipping

[Figure: posterior distribution over θ; 100 flips total]

How does it work?
  • Think of parameters as just another kind of random variable
  • Now your data distribution is Pr[X | Θ]
    • This is the generative distribution
    • A.k.a. observation distribution, sensor model, etc.
  • What we want is some model of the parameter as a function of the data: Pr[Θ | X]
  • Get there with Bayes’ rule: Pr[Θ | X] = Pr[X | Θ] Pr[Θ] / Pr[X]
What does that mean?
  • Let’s look at the parts:
    • Generative distribution: Pr[X | Θ]
    • Describes how data is generated by the underlying process
    • Usually easy to write down (well, easier than the other parts, anyway)
    • Same old PDF/PMF we’ve been working with
    • Can be used to “generate” new samples of data that “look like” your training data
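
To make that last point concrete, here is a hedged sketch (my own; the parameter value is arbitrary) of drawing synthetic coin flips from the generative distribution Pr[X | Θ] once a value of θ is fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.65                                # assumed parameter value (illustrative)
synthetic_flips = rng.random(20) < theta    # sample 20 flips from Pr[X | theta]

print(synthetic_flips.astype(int))          # new data that "looks like" training data
```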
What does that mean?
  • The parameter prior or a priori distribution:
    • Allows you to say “this value of Θ is more likely than that one...”
    • Allows you to express beliefs/assumptions/ preferences about the parameters of the system
    • Also takes over when the data is sparse (small N)
    • In the limit of large data, prior should “wash out”, letting the data dominate the estimate of the parameter
    • Can let Pr[Θ] be “uniform” (a.k.a. “uninformative”) to minimize its impact
What does that mean?
  • The data prior:
    • Expresses the probability of seeing data set X, independent of any particular model
    • Huh?
What does that mean?
  • The data prior:
    • Expresses the probability of seeing data set X, independent of any particular model
    • Can get it from the joint data/parameter model: Pr[X] = ∫ Pr[X | Θ] Pr[Θ] dΘ
    • In practice, often don’t need it explicitly (why?)
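
A small numerical sketch of that marginalization (my own illustration; the heads/tails counts are made up): for the coin example, Pr[data] can be approximated by summing Pr[data | θ] Pr[θ] over a grid of θ values.

```python
import numpy as np
from scipy.stats import binom

thetas = np.linspace(0, 1, 1001)
prior = np.ones_like(thetas) / len(thetas)   # uniform prior over the grid

h, t = 7, 3                                  # made-up counts of heads and tails
likelihood = binom.pmf(h, h + t, thetas)     # Pr[h heads in h+t flips | theta]

# Data prior (marginal likelihood), approximated by a sum over the grid
pr_data = np.sum(likelihood * prior)
print("Pr[data] ≈", pr_data)
```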
What does that mean?
  • Finally, the posterior (or a posteriori) distribution:
    • Lit., “from what comes after” (Latin)
    • Essentially, “What we believe about the parameter after we look at the data”
    • As compared to the “prior” or “a priori” (lit., “from what is before”) parameter distribution
Example: coin flipping
  • A (biased) coin lands heads-up w/ prob p and tails-up w/ prob 1-p
  • Parameter of the system is p
  • Goal is to find Pr[p | sequence of coin flips]
    • (Technically, we want a PDF, f(p | flips))
  • Q: what family of PDFs is appropriate?

Example: coin flipping
  • We need a PDF that generates possible values of p
    • p ∈ [0, 1]
  • Commonly used distribution is the beta distribution:

    f(p | α, β) = p^(α−1) (1 − p)^(β−1) / B(α, β)

    • B(α, β) is the normalization constant: the “Beta function”
    • The p^(α−1) factor corresponds to Pr[heads]; the (1 − p)^(β−1) factor to Pr[tails]
The Beta Distribution

[Figure: beta distribution PDFs for various (α, β); image courtesy of Wikimedia Commons]
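
Since the image itself isn’t reproduced in this transcript, the following sketch (my own, using scipy and matplotlib; the (α, β) pairs are arbitrary examples) plots a few beta densities to show how the hyperparameters govern the shape:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

ps = np.linspace(0.001, 0.999, 500)   # avoid endpoints where some densities blow up

# A few illustrative (alpha, beta) pairs
for a, b in [(0.5, 0.5), (1, 1), (2, 2), (2, 5), (5, 2)]:
    plt.plot(ps, beta.pdf(ps, a, b), label=f"α={a}, β={b}")

plt.xlabel("p")
plt.ylabel("f(p | α, β)")
plt.legend()
plt.show()
```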

Generative distribution
  • f(p|α,β) is the prior distribution for p
    • Parameters α and β are hyperparameters
    • Govern shape of f()
  • Still need the generative distribution: Pr[h,t|p]
    • h,t: number of heads, tails
  • Use a binomial distribution: Pr[h, t | p] = C(h + t, h) p^h (1 − p)^t
Posterior
  • Now, by Bayes’ rule:

    f(p | h, t) ∝ Pr[h, t | p] f(p | α, β) ∝ p^(h+α−1) (1 − p)^(t+β−1)

  • i.e., the posterior is again a beta distribution: Beta(α + h, β + t)
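
A minimal sketch of this conjugate update (my own illustration; the prior hyperparameters and the observed counts are made up):

```python
from scipy.stats import beta

# Prior hyperparameters (illustrative choice)
alpha_prior, beta_prior = 2.0, 2.0

# Observed data: made-up counts of heads and tails
h, t = 37, 13

# Conjugate update: posterior is Beta(alpha + h, beta + t)
alpha_post, beta_post = alpha_prior + h, beta_prior + t

posterior = beta(alpha_post, beta_post)
print("posterior mean of p:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```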
Exercise
  • Suppose you want to estimate the average air speed of an unladen (African) swallow
  • Let’s say that airspeeds of individual swallows, x, are Gaussian-distributed with mean μ and variance 1: x ~ N(μ, 1)
  • Let’s say, also, that we think the mean is “around” 50 kph, but we’re not sure exactly what it is; our prior uncertainty (variance) about the mean is 10, i.e., μ ~ N(50, 10)
  • Derive the posterior estimate of the mean airspeed.
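
For reference, a hedged sketch of the standard conjugate-normal update that this exercise leads to, assuming the prior μ ~ N(50, 10), unit observation variance, and a made-up sample of airspeeds:

```python
import numpy as np

# Prior on the mean airspeed: mu ~ N(mu0, tau0_sq)
mu0, tau0_sq = 50.0, 10.0

# Observation model: x ~ N(mu, sigma_sq) with known sigma_sq = 1
sigma_sq = 1.0

# Made-up sample of measured airspeeds (kph)
x = np.array([48.2, 51.0, 49.5, 50.7, 52.1])
n, xbar = len(x), x.mean()

# Standard conjugate-normal update for the posterior over mu
post_precision = 1.0 / tau0_sq + n / sigma_sq
post_var = 1.0 / post_precision
post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)

print(f"posterior: mu | data ~ N({post_mean:.2f}, {post_var:.3f})")
```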