Presentation Transcript


  1. Dr. David A. Clifton, W.K. Kellogg Junior Research Fellow, Institute of Biomedical Engineering, University of Oxford. Probabilistic Data Analysis, Part I of II. Information-Driven Healthcare, Centre for Doctoral Training in Healthcare Innovation. davidc@robots.ox.ac.uk

  2. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  3. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  4. Online Machine Learning

  5. Online Machine Learning

  6. Online Machine Learning

  7. Online Machine Learning

  8. Personalised Medicine The scientific and medical communities agree that healthcare needs to be personalised for maximum benefit (for health and economic reasons) Rather than use a generic system that has somehow “learned” the behaviour of an entire population... ...can’t we learn the behaviour of individuals? Can’t we make patient-specific treatments? Genomics (“point-of-care” tests) through to patient management through to lifestyle/home monitoring

  9. Probabilistic Data Analysis Wouldn’t it be nice if we could use a unified framework for data analysis that allowed us to learn automatically the behaviour of individuals in a principled manner? Wouldn’t it be nice if we could automatically quantify our level of confidence in the machine’s decisions? Wouldn’t it be nice if we could cope with incomplete data (an ECG lead falls off), or noisy data (poor sensor contact), or measurement artefact (moving patients / clinicians)? Wouldn’t it be nice if we could learn something deeper about the underlying process that we believe generated the data? Probabilistic data analysis is exactly this nice.

  10. Editorial Opinion • Machine learning is a big field, and has many disciplines / religions within it. My personal opinion: The Bad: Probabilistic methods that do not incorporate our uncertainty – how do we cope with real data? Frequentist techniques. The Ugly: “I use what works” – download a black box, run it on data, publish results. Black-box techniques. The Good: Principled, probabilistic methods that allow us to cope with real data (noise, uncertainty). Bayesian techniques.

  11. Opinions from the gallery... • “A Frequentist is a person whose long-run ambition is to be wrong 5% of the time.” • “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.” (Dr. Charles Annis)

  12. Opinions from the gallery... • “My first answer is: Bayes.” (Prof. Christopher Bishop)

  13. Opinions from the gallery... • “The supervised feedforward network has been superseded.” (Prof. David MacKay)

  14. Opinions from the gallery... • “You can do it the wrong way, or you can use Bayes.” (Prof. Steve Roberts)

  15. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  16. The Marquis de Laplace Bonjour again, mes amis. Laplace was originally approached by a heavily-indebted friend in order to assess the statistical nature of a game of chance (at which his friend had been losing). Laplace recognised the work of a (now famous) deceased parson, and singlehandedly created the field of Bayesian analysis.

  17. The Mass of Saturn Frequentist probability states that the probability of an event (i.e., of a random variable) can be estimated as the limit of its relative frequency over an infinite number of trials. Laplacian (i.e., Bayesian) probability recognises that the frequentist interpretation of probability is insufficient. How did Laplace estimate the mass of Saturn in a principled way?

  18. The Mass of Saturn It’s a trap! The mass of Saturn is, unsurprisingly, a constant – not a random variable. The frequentist approach would be to imagine an ensemble of universes in which everything except the mass of Saturn remained constant. Then we could form a distribution from that ensemble. This is daft.

  19. The Mass of Saturn [Figure: posterior pdf p(M | x) over the mass of Saturn M, with bounds m1 and m2 marked on the axis.] Laplace computed the posterior distribution of the mass of Saturn, M, given available astronomical observations, x. Laplace stated that the area between m1 and m2 was his belief that m1 ≤ M ≤ m2. “It is a bet of 11,000 to 1 that the error of this result is not 1/100th of its value.” (Actual error in Laplace’s estimate of M: 0.63%)

  20. Laplace on probability • “It is remarkable that a science, which commenced with a consideration of games of chance, should be elevated to the rank of the most important subjects of human knowledge.” (Marquis de Laplace)

  21. Laplace on probability • “Probability is nothing but common sense reduced to computation.” (Marquis de Laplace)

  22. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  23. Elementary probability theory • Every (?) schoolchild is taught the fundamental rule of conditional probability, given events A and B: P(A | B) = P(A ∩ B) / P(B) • Rearranging, we can find the joint probability: P(A ∩ B) = P(A | B) P(B)

  24. Elementary probability theory • Which we can then rearrange to give us the ever-familiar Bayes’ rule: P(A | B) = P(B | A) P(A) / P(B) • We could also say P(A) and P(B) are the marginal distributions, or prior distributions.
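(A quick numerical sanity check of Bayes’ rule in Python; the probabilities below are invented purely for illustration and are not from the lecture.)

```python
# Numerical check of Bayes' rule: P(A | B) = P(B | A) P(A) / P(B).
# All numbers here are invented for illustration.
p_a = 0.01              # prior P(A)
p_b_given_a = 0.95      # likelihood P(B | A)
p_b_given_not_a = 0.05  # P(B | not A)

# Marginal P(B) via the sum rule: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' rule
print(f"P(A | B) = {p_a_given_b:.3f}")  # about 0.161 with these numbers
```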

  25. Bayes • Let’s use Bayes’ rule for learning the parameters of some model. • Suppose the model has parameters θ, and we have a training set of observations x. • We can define our belief in what those parameters are to be a pdf, p(θ | x)

  26. Bayes • If our belief in the parameters is the pdf p(θ | x)... • ...then Bayes’ rule tells us that we can determine this pdf by multiplying the data likelihood p(x | θ) by a prior, P(θ) • The data likelihood is just the probability of the data given some parameters; e.g., the Gaussian equation x ~ N(μ, σ2) • The prior is our belief in what values the parameters θ could be – this could be the uniform distribution over θ

  27. Bayes • This is a lecture on probability, and so we are duty-bound to toss some coins and consider the probability of obtaining heads • Let’s make it interesting, and try to determine if the coin we just received from a croupier in Monte Carlo is “bent” or not • Let’s suppose that we don’t have any strong prior knowledge, so P(θ) is the uniform distribution • The data likelihood is the good ol’ Bernoulli: H^n (1 − H)^(N − n), for n heads in N tosses

  28. Bayes • The denominator is just a normalising constant to ensure that the output posterior distribution integrates to unity (i.e., makes it a pdf) • Formally, it’s p(x) = ∫ p(x | θ) P(θ) dθ, which isn’t friendly • So let’s say for now: p(θ | x) ∝ p(x | θ) P(θ)
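(A minimal sketch, not from the slides, of how that proportionality is used in practice: evaluate the unnormalised posterior on a grid of θ values and normalise numerically. The toss counts are invented.)

```python
import numpy as np

# Grid over theta, the coin's probability of heads (H on the slide)
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]

N, n = 10, 7                                     # invented: n heads out of N tosses
prior = np.ones_like(theta)                      # uniform prior P(theta)
likelihood = theta**n * (1.0 - theta)**(N - n)   # Bernoulli likelihood H^n (1-H)^(N-n)

posterior = likelihood * prior                   # p(theta | x) is proportional to this
posterior /= posterior.sum() * dtheta            # numerical stand-in for the unfriendly integral

print("Posterior mean of theta:", (theta * posterior).sum() * dtheta)
```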

  29. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  30. Recursive Bayes • We start off with no observations, x. We only have our lonely prior, P(θ). • We then observe one coin-toss. We can work out its likelihood p(x | θ) using good ol’ Bernoulli, as we saw before. • Multiplying the two together, we have our new output – the posterior probability – in this case, our new belief in the probability of obtaining heads from our coin
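(A sketch of that recursive update on a grid, processing one invented toss at a time; after each observation the posterior becomes the prior for the next toss.)

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)   # grid over the coin's heads-probability
dtheta = theta[1] - theta[0]

belief = np.ones_like(theta)          # start from the uniform prior P(theta)
tosses = [1, 0, 1, 1, 0, 1]           # invented sequence: 1 = heads, 0 = tails

for x in tosses:
    likelihood = theta if x == 1 else (1.0 - theta)  # Bernoulli likelihood of one toss
    belief = belief * likelihood                     # posterior = likelihood x prior
    belief /= belief.sum() * dtheta                  # renormalise: this becomes the new prior

print("Posterior mean of P(heads) after", len(tosses), "tosses:",
      (theta * belief).sum() * dtheta)
```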

  31. [Figure slide: posterior over the heads-probability, updated after observing Heads and Tails.]

  32. Quantifying uncertainty • This, as I’m sure you will agree, is very useful. We now have a model that explicitly quantifies our uncertainty in our output, P(H | x). • Traditional frequentist methods would just estimate a single value: P(H | x) = number of heads observed / number of tosses. How much use is a single output? How reliable is it?
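(To make the contrast concrete: a sketch, with invented counts, comparing the single frequentist estimate n/N against a posterior mean and a 95% central credible interval read off the same grid posterior.)

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
N, n = 10, 7                                      # invented toss counts

posterior = theta**n * (1.0 - theta)**(N - n)     # uniform prior, so posterior is prop. to likelihood
posterior /= posterior.sum() * dtheta

cdf = np.cumsum(posterior) * dtheta               # cumulative belief over theta
lower = theta[np.searchsorted(cdf, 0.025)]
upper = theta[np.searchsorted(cdf, 0.975)]

print("Frequentist point estimate:", n / N)
print("Posterior mean:", round((theta * posterior).sum() * dtheta, 3))
print(f"95% credible interval: [{lower:.2f}, {upper:.2f}]")
```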

  33. Another example...

  34. Probabilistic Data Analysis: Probability – why bother? Background. Introducing Bayes. Bayesian learning. GMMs.

  35. Real data • Let’s model some real data. Wouldn’t it be nice if, rather than caring very much about the particular data that we observed, we began to understand the underlying process that generated those data? • Let’s assume for now that our data (which have Q dimensions) are all drawn from an unchanging, underlying distribution, D. • (In future, we’ll see how we can relax this assumption, and cope with underlying distributions that change through time – such as with patient age, or hospital stay.) The assumptions are strong with this one.

  36. The Gaussian Mixture Model • The GMM estimates D from a training set of observations x using a mixture of Gaussian components: p(x) = Σk P(k) p(x | θk) • This is a multivariate pdf, because x has lots of dimensions (Q of them, in fact). In our ICU dataset, this could be Q = 12. • Each p(x | θk) is a (multivariate) Gaussian distribution, N(x; μk, Σk), with parameters θk = (μk, Σk) • P(k) is the relative importance (prior) of the kth Gaussian
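(A minimal sketch of that mixture density in Python; the two-component, two-dimensional parameters and toy samples are invented for illustration, not the Q = 12 ICU data mentioned on the slide. Fitting is delegated to scikit-learn's GaussianMixture, which runs EM under the hood.)

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Invented 2-D, 2-component mixture: p(x) = sum_k P(k) N(x; mu_k, Sigma_k)
weights = [0.6, 0.4]                              # P(k), sums to 1
means = [np.zeros(2), np.array([3.0, 3.0])]       # mu_k
covs = [np.eye(2), np.diag([0.5, 2.0])]           # Sigma_k

def gmm_density(x):
    """Evaluate the mixture pdf p(x) at a point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print("p(x) at [1, 1]:", gmm_density(np.array([1.0, 1.0])))

# Fit a GMM to samples drawn from that mixture, and check the recovered weights
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal(m, c, size=int(500 * w))
               for w, m, c in zip(weights, means, covs)])
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
print("Recovered mixing weights P(k):", gmm.weights_)
```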
