Statistics: How to Get Data and Model to Fit Together?


The field of statistics
• Not dry numbers, but the essence in them.
• Model vs. data: estimation of parameters of interest.
• Examples of parameters: mean yearly precipitation, the explanatory value one discharge time series has for another, the relationship between stage and discharge.
• Parameters are typically unknown, but the data gives us information that can be used for estimation.
• Model choice.
• Gives answers to questions like: “Is the mean precipitation the same in two neighbouring regions?”, “Can we use one discharge time series to say something about another?”
• These answers are not absolute, but come with a stated precision (confidence or probability).

Data uncertainty
• Perfect measurements + perfect models = zero uncertainty (whether it’s about model parameters or the models themselves).
Sources of uncertainty:
• Real measurements come with built-in uncertainty.
• Models can’t take everything into account. There are unmeasured confounders (local variations in topography and soil in a hydrological model, for instance).
Both problems can be handled by looking at how the measurements spread out, i.e. at the probability distribution of the data, given model and parameters. Our models thus need to contain probability distributions.
This uncertainty has consequences for how sure we can be about our models and parameters.

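Data summary: statistics (the numbers, not the field)
Ways of summarizing the data (x1,…,xn):
• Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Empirical quantiles: q(p) is the value that a fraction p of the data lies below; for instance, q(0.5) is the median.
• Empirical variance (based on the sum of squared deviations): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
• Empirical correlation between two quantities, x and y: $r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$
• Histograms: count the instances (or rates) inside intervals. They give an indication of how the data is distributed.
• Empirical cumulative histograms: give the rate of instances lower than each given value. Empirical quantiles are easy to read from these.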
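
As a rough illustration of the summaries listed above, the following Python sketch computes them for a small made-up sample. The data values and the helper names `empirical_correlation` and `empirical_cdf` are assumptions introduced for illustration only.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 9.0]


def empirical_correlation(a, b):
    """Pearson correlation computed directly from its definition."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    sxy = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    sxx = sum((ai - mean_a) ** 2 for ai in a)
    syy = sum((bi - mean_b) ** 2 for bi in b)
    return sxy / (sxx * syy) ** 0.5


def empirical_cdf(data, value):
    """Rate of observations less than or equal to `value`."""
    return sum(1 for d in data if d <= value) / len(data)


mean_x = statistics.fmean(x)
median_x = statistics.median(x)      # the empirical quantile q(0.5)
var_x = statistics.variance(x)       # divides by n - 1
corr_xy = empirical_correlation(x, y)

print(mean_x, median_x, var_x, corr_xy, empirical_cdf(x, 4.0))
```

Note that `statistics.variance` uses the n - 1 divisor; `statistics.pvariance` would divide by n instead.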

Data summary vs. the underlying reality (distributions)
While the data is known, what has produced the data is not. This means summary statistics don’t represent the reality of the system; they only indicate what this reality might be.
For instance, the histogram from the previous slide was produced by the normal distribution, but this distribution doesn’t look exactly the same as the histogram. Similarly, the mean isn’t the distribution mean (the expectancy), the empirical correlation isn’t the real correlation, the empirical median isn’t the distribution median, the empirical variance isn’t the distribution variance, etc.
One single distribution can produce a wide variety of outcomes! But the opposite is then also the case: a wide variety of distributions could produce the same data!
How to go from data to distribution?

Examples of distributional problems
• Problem: Regression. What is the functional relationship between x and y? Expressed distributionally: What’s the distribution of y given x?
• Problem: Forecasting. What to expect from floods using what’s been happening before as input? Expressed distributionally: What is the distribution of future flood events, given the past?
• Problem: Model testing. Is hydrological model A better than B? Expressed distributionally: Does model A summarize/predict the distribution of the data better than model B?
• Problem: Filling in missing data. Expressed distributionally: What’s the joint distribution of the missing and actual data?

Probability
• Views on probability:
  I. The long-term rate of outcomes that fall into a given category. For instance, one can expect that 1/6 of all dice throws give the outcome “one”.
  II. The relationship between a payoff and what you are willing to risk for it. For instance, you might be willing to risk 10kr in order to get back 60kr on a bet that the next outcome of the die is “one”.
  III. A formal way of dealing with plausibility (reasoning under uncertainty). A probability of 1/6 for getting “one” on the die means that you have neither greater nor less belief in that outcome than in the other 5 possible outcomes.
• Notation: We will use Pr(”something”) to mean “the probability of something”.
I is a frequentist definition, while II and III are Bayesian.

Laws of probability
1. 0≤Pr(A)≤1
2. Pr(A)+Pr(not A)=1
3. Pr(A or B)=Pr(A)+Pr(B) when A and B are mutually exclusive.
Examples:
• Pr(flood on the west coast)=1.1 means you have calculated incorrectly!
• Pr(”two or more on the die”) = 1-Pr(”one”) = 1-1/6 = 5/6
• Pr(”one or two on a single dice throw”) = Pr(”one”)+Pr(”two”) = 1/6+1/6 = 1/3

Laws of probability 2: conditional probability
• Pr(A | B) gives the probability of A in cases where B is true.
• Pr(A|B)=Pr(A) means that A is independent of B: B gives no information about A. A dependency between parameters and data is what makes parameter estimation possible.
• Pr(A and B)=Pr(A|B)Pr(B)
• Since Pr(A and B)=Pr(B|A)Pr(A) also, we get Bayes’ formula: Pr(A|B)=Pr(B|A)Pr(A)/Pr(B)
Examples:
• Pr(rain | overcast)
• One throw of the die does not affect the next => Pr(”one on the second throw” | ”one on the first throw”) = Pr(”one on the second throw”).
• Pr(”one on the first and second throw”) = Pr(”one on the first throw”)*Pr(”one on the second throw” | ”one on the first throw”) = Pr(”one on the first throw”)*Pr(”one on the second throw”) = 1/6*1/6 = 1/36.
• From Bayes’ formula: if A is independent of B, Pr(A|B)=Pr(A), then B is also independent of A: Pr(B|A)=Pr(B).

Ex. of conditional probabilities
Assume that Pr(rain two consecutive days)=10%, and that Pr(rain a given day)=20%. What’s the probability of rain tomorrow if it rains today?
Pr(rain tomorrow | rain today) = Pr(rain today and tomorrow)/Pr(rain today) = 10%/20% = 50%.
If it’s overcast 50% of the time and it’s always overcast when it’s raining, what is the probability of rain given overcast?
Pr(rain | overcast) = Pr(overcast and rain)/Pr(overcast) = Pr(overcast | rain)Pr(rain)/Pr(overcast) = 100%*20%/50% = 40%. (PS: I rederive Bayes’ formula here!)
We say that overcast is evidence for rain: Pr(rain | overcast)>Pr(rain).
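
The two calculations in this example can be checked with a few lines of Python. This is only a sketch; the function names `conditional` and `bayes` are assumptions made for illustration, while the numbers are the ones given in the example.

```python
# Probabilities taken from the rain/overcast example.
pr_rain = 0.20                 # Pr(rain on a given day)
pr_rain_two_days = 0.10        # Pr(rain today and tomorrow)
pr_overcast = 0.50             # Pr(overcast)
pr_overcast_given_rain = 1.00  # always overcast when it rains


def conditional(pr_a_and_b, pr_b):
    """Pr(A | B) = Pr(A and B) / Pr(B)."""
    return pr_a_and_b / pr_b


def bayes(pr_b_given_a, pr_a, pr_b):
    """Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)."""
    return pr_b_given_a * pr_a / pr_b


pr_rain_tomorrow_given_today = conditional(pr_rain_two_days, pr_rain)
pr_rain_given_overcast = bayes(pr_overcast_given_rain, pr_rain, pr_overcast)

# Overcast counts as evidence for rain when it raises the probability of rain.
overcast_is_evidence = pr_rain_given_overcast > pr_rain
print(pr_rain_tomorrow_given_today, pr_rain_given_overcast, overcast_is_evidence)
```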

Conditional probability as inferential logic
• From the previous example, it was seen that the probability of rain increases when we know it’s overcast. In “probability as inferential logic” terminology, overcast is evidence for rain.
• Evidence is information that increases the probability of something we are uncertain about. It’s possible to make rules for evidence, even when exact probabilities are not available.
• Examples:
  • When A->B, then B is evidence for A. (If rain -> overcast, then overcast is evidence for rain.)
  • When A is evidence for B, then B is evidence for A. (If a flood at location A increases the risk of there being a flood at location B, then ...) Note that the strength of the evidence does not have to be the same both ways.
  • If A is evidence for B and B is evidence for C (and there is no direct dependency between A and C), then A is evidence for C. (If Oddgeir mostly speaks the truth and he says it’s overcast, then that is evidence for rain.)
  • If A is evidence for B, then ”not A” is evidence for ”not B”. (Not overcast, i.e. clear skies, is evidence against rain. If you have been searching for the boss inside the building without finding him/her, then that is evidence for him/her not being in the building.)
• See “Reasoning under uncertainty” on YouTube for more.

The law of total probability
If one has the conditional probabilities for one thing and the unconditional (marginal) probabilities of another, one can retrieve the unconditional (marginal) distribution of the first thing. This process is called marginalization.
Let’s say we have three possibilities spanning the realm of all possible outcomes: B1, B2 or B3. So, one and only one of B1, B2 and B3 can be true. (For instance ”rain”, ”overcast without rain” and ”sunny”; A could be the event that a person uses his car to get to work.)
Pr(A) = Pr(A and B1) + Pr(A and B2) + Pr(A and B3) = Pr(A|B1)Pr(B1)+Pr(A|B2)Pr(B2)+Pr(A|B3)Pr(B3)
It’s the same if there are more (or fewer) possibilities for B.
Example: Assume that the probability of hail on a day in the summer half-year is 2% and in the winter half-year 20% (these are thus conditional probabilities). What’s the probability of hail on an arbitrary day of the year?
Pr(hail) = Pr(hail | summer)Pr(summer)+Pr(hail | winter)Pr(winter) = 2%*50%+20%*50% = 1%+10% = 11%
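
The hail example can be reproduced with a short Python sketch of marginalization. The function name `total_probability` is an assumption made for illustration; the input check simply enforces that the B possibilities cover all outcomes.

```python
def total_probability(conditionals, marginals):
    """Pr(A) = sum_i Pr(A | B_i) Pr(B_i) for mutually exclusive, exhaustive B_i."""
    if abs(sum(marginals) - 1.0) > 1e-9:
        raise ValueError("the probabilities of the B_i must sum to one")
    return sum(c * m for c, m in zip(conditionals, marginals))


# Hail example: Pr(hail | summer) = 2%, Pr(hail | winter) = 20%,
# with each half-year covering 50% of the days.
pr_hail = total_probability([0.02, 0.20], [0.50, 0.50])
print(pr_hail)
```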

Properties of stochastic variables (stuff that has a distribution)
• The expectation value is the mean of a distribution, weighted by the probabilities: $E(X) = \sum_{i=1}^{N} x_i \Pr(X=x_i)$ when there are N different possible outcomes.
  • For a die, the expectation value is 3.5.
  • For a uniformly distributed variable between 0 and 1, the expectation is ½.
  • For a normally distributed variable, the expectation is a parameter, μ.
• The standard deviation (sd) is a measure of how much spread you can expect. Technically it’s the square root of the variance (Var), defined as: $\mathrm{Var}(X) = E\left[(X-E(X))^2\right]$
  • For a uniformly distributed variable between 0 and 1, the variance is 1/12.
  • For a normally distributed variable, the standard deviation is a parameter, σ (or variance σ²).
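
For a discrete variable such as a die, expectation and variance follow directly from the probability-weighted sums. The sketch below uses exact fractions; the function names `expectation` and `variance` are assumptions made for illustration.

```python
from fractions import Fraction


def expectation(outcomes, probs):
    """Probability-weighted mean of the possible outcomes."""
    return sum(x * p for x, p in zip(outcomes, probs))


def variance(outcomes, probs):
    """Expected squared deviation from the expectation."""
    mu = expectation(outcomes, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(outcomes, probs))


die_outcomes = [1, 2, 3, 4, 5, 6]
die_probs = [Fraction(1, 6)] * 6

die_mean = expectation(die_outcomes, die_probs)
die_var = variance(die_outcomes, die_probs)
die_sd = float(die_var) ** 0.5
print(die_mean, die_var, die_sd)
```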

Covariance and correlation
If two variables X and Y are dependent (so Pr(Y|X)≠Pr(Y)), there is a measure for that also. First off, one can define a covariance, which tells how X and Y vary linearly together:
$\mathrm{Cov}(X,Y) = E\left[(X-E(X))(Y-E(Y))\right] = \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} (x_i-E(X))(y_j-E(Y))\Pr(X=x_i \text{ and } Y=y_j)$
where Nx and Ny are the numbers of different possible outcomes for X and Y respectively.
Covariance will however depend both on the (linear) dependency between X and Y and on the scale of both of them. To remove the latter, we form the correlation:
$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)}$
Note that $-1 \le \rho_{XY} \le 1$ always. $|\rho_{XY}| = 1$ means perfect linear dependency.
Also note that the correlation summarizes only linear dependency, not non-linear dependency! It is even possible to have perfect dependency but no correlation!
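
The last point can be illustrated with a small constructed case: let X be -1, 0 or 1 with equal probability and let Y = X². Y is then completely determined by X, yet the covariance (and hence the correlation) is exactly zero. The example distribution and the function name `covariance` are assumptions chosen for illustration.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with Y = X**2 and X uniform on {-1, 0, 1}.
joint = {
    (-1, 1): Fraction(1, 3),
    (0, 0): Fraction(1, 3),
    (1, 1): Fraction(1, 3),
}


def covariance(joint_probs):
    """Cov(X, Y) as a probability-weighted sum over the joint outcomes."""
    ex = sum(x * p for (x, _), p in joint_probs.items())
    ey = sum(y * p for (_, y), p in joint_probs.items())
    return sum((x - ex) * (y - ey) * p for (x, y), p in joint_probs.items())


cov_xy = covariance(joint)
print(cov_xy)
```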

Samples from stochastic variables: the law of large numbers
The data we see is regarded as a sample set from some (unknown) distribution, f(x).
If we can sample from a statistical distribution enough times, we will eventually see that:
• Rates approach the probabilities.
• The mean approaches the expectancy value.
• The empirical variance approaches the distribution variance.
• The rate of samples falling inside an interval approaches the interval probability. Thus the histogram approaches the probability density.
• Empirical quantiles approach the distributional quantiles.
• Empirical correlations approach the distributional correlation.
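
A quick simulation gives a feel for the first of these statements. The sketch below throws a simulated die and reports the rate of ”one” for increasing numbers of throws; the seed, the sample sizes and the function name `rate_of_ones` are arbitrary assumptions.

```python
import random


def rate_of_ones(n_throws, seed=1):
    """Rate of the outcome "one" in n_throws simulated throws of a fair die."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == 1)
    return hits / n_throws


# The rate should settle closer and closer to the probability 1/6.
rates = {n: rate_of_ones(n) for n in (10, 1_000, 100_000)}
for n, rate in rates.items():
    print(n, rate, abs(rate - 1 / 6))
```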

Diagnostic plots concerning probability distributions
• One can compare the histogram with the distribution (probability density).
• Cumulative rates can be compared to the cumulative distribution function.
• One can also plot theoretical quantiles vs. sample quantiles. These are called QQ plots. If the right distribution has been chosen, the points should lie approximately on a straight line.
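
The points of a normal QQ plot can be computed without any plotting library. The sketch below fits a normal distribution by the sample mean and standard deviation and pairs its quantiles with the sorted sample. The simulated sample, the plotting positions (i + 0.5)/n and the function name `qq_points` are assumptions; other plotting positions are also in common use.

```python
import random
from statistics import NormalDist, fmean, stdev


def qq_points(sample):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    n = len(sample)
    fitted = NormalDist(fmean(sample), stdev(sample))
    ordered = sorted(sample)
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    return [(fitted.inv_cdf((i + 0.5) / n), ordered[i]) for i in range(n)]


rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]  # simulated, normally distributed
points = qq_points(sample)

# For a well-chosen distribution the pairs lie close to the line y = x.
print(points[0], points[len(points) // 2], points[-1])
```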

The Bernoulli process and the binomial distribution
A process of independent events having only two outcomes of interest, success and failure, is called a Bernoulli process. Examples:
• Coin tosses: p = probability of heads (p=50%).
• Years where the discharge went over a given threshold in Glomma: p = probability of discharge > threshold.
Incorrect use: rainy days last month (rain on consecutive days is not independent).
If you count the number of successes in n trials, you get the binomial distribution. It is characterized by the success rate, p. This is often an unknown parameter that we’d like to estimate. E(X)=np. Var(X)=np(1-p).
[Figure: binomial distribution; in this case, n=30, p=0.3.]
Related: the negative binomial distribution, which counts the number of ’failures’ before the k’th success.
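
The stated expectation and variance can be checked numerically from the binomial probabilities themselves. The sketch below uses the standard binomial probability formula with n=30 and p=0.3 as in the figure; the function name `binomial_pmf` is an assumption.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 30, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

# Compare with the closed-form expressions n*p and n*p*(1-p).
print(mean, n * p)
print(var, n * p * (1 - p))
```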

Distributional families: Poisson
The Poisson distribution is what you get when you count the number of events that happen independently in time (a Poisson process) inside a time interval. Examples:
• Number of car accidents with deadly outcome per year (λ = deadly traffic danger).
• Number of times the discharge goes above a given threshold in a given time interval (λ = threshold rate). (PS: Strictly speaking not independent!)
The Poisson distribution is characterized by a rate parameter, λ. E(X)=Var(X)=λ.
If the rate is uncertain in a particular way (gamma distributed), the outcome will be negative binomially distributed.
[Figure: events at times t1, t2, t3, t4 along a time axis t, and a Poisson distribution; in this case, λ=10.]
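
One way to see where the Poisson distribution comes from is to simulate a Poisson process, using the standard fact that the waiting times between its events are independent and exponentially distributed, and then count the events per unit time interval. In the sketch below, the rate of 10, the number of intervals, the seed and the function name `poisson_process_counts` are assumptions for illustration.

```python
import random
from statistics import fmean, variance


def poisson_process_counts(rate, n_intervals, seed=7):
    """Count events per unit time interval for a simulated Poisson process."""
    rng = random.Random(seed)
    counts = [0] * n_intervals
    t = rng.expovariate(rate)          # time of the first event
    while t < n_intervals:
        counts[int(t)] += 1            # interval [k, k+1) containing the event
        t += rng.expovariate(rate)     # independent exponential waiting time
    return counts


counts = poisson_process_counts(rate=10.0, n_intervals=2000)
count_mean = fmean(counts)
count_var = variance(counts)

# Both should be close to the rate, since E(X) = Var(X) = lambda.
print(count_mean, count_var)
```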

Probability density
For stuff that has a continuous nature (discharge/humidity/temperature) we can’t assign probabilities to single outcomes directly, since there are uncountably many outcomes. What we can do instead is to assign a probability density.
We use the notation f(x) for probability densities, with the x indicating which stochastic variable, X, the density belongs to.
A probability density gives the probability of falling inside a small interval divided by the size of this interval: Pr(x<X<x+dx)=f(x)*dx
• For larger intervals, integration is needed (law 3): $\Pr(a<X<b) = \int_a^b f(x)\,dx$
• Probabilities still have to sum to one (law 2): $\int_{-\infty}^{\infty} f(x)\,dx = 1$
• Conditional probabilities: $f(x|y) = \frac{f(x,y)}{f(y)}$
• Law of total probability: $f(x) = \int f(x|y)\,f(y)\,dy$
• Bayes’ formula: $f(y|x) = \frac{f(x|y)\,f(y)}{f(x)}$
• Expectancy: $E(X) = \int x\,f(x)\,dx$

Cumulative distributions and quantiles
If one has a probability density, one can calculate (by integration) the probability of an outcome less than or equal to a given value x. Seen as a function of x, this defines the cumulative distribution function, F(x)=Pr(X≤x).
If we turn the cumulative distribution around, we can ask for which value it attains a given probability, i.e. for which value is there a given probability that X is lower than that? One then gets a quantile, q(p). This is the value for which there is probability p that X is lower: $q(p)=F^{-1}(p)$.
• Special quantile: the median. There is 50% probability of being above that value and 50% probability of being below.
• Quantiles can be used for indicating the spread of possible outcomes (uncertainty intervals). 95% of the probability mass lies between the 2.5% and 97.5% quantiles, for instance.
• Ex: The 85% quantile of the standard normal distribution is approx. 1.
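
Python’s standard library exposes both F and its inverse for the normal distribution through `statistics.NormalDist`, which makes it easy to try the quantile idea out. The sketch below is illustrative only and uses the standard normal distribution from the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

median = std_normal.inv_cdf(0.5)      # q(0.5)
lower = std_normal.inv_cdf(0.025)     # q(0.025)
upper = std_normal.inv_cdf(0.975)     # q(0.975)
q85 = std_normal.inv_cdf(0.85)        # q(0.85), roughly 1

# Applying F to a quantile returns the probability we started from.
round_trip = std_normal.cdf(q85)

print(median, lower, upper, q85, round_trip)
```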

The normal distribution
The probability density, f(x), is smooth and has a single peak around a parameter value, μ. Mathematically it looks like this:
$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
where μ is the expectancy and σ is the standard deviation. If a stochastic variable, X, is normally distributed we often write this as X~N(μ,σ).
Probability mass within a given number of standard deviations of μ:
• (μ-σ, μ+σ): 68.3% probability
• (μ-2σ, μ+2σ): 95.4% probability
• (μ-3σ, μ+3σ): 99.73% probability
• (μ-5σ, μ+5σ): 99.99994% probability
(μ-1.96σ, μ+1.96σ) contains 95% of the probability mass for the normal distribution.
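
The probability percentages quoted for the normal distribution can be reproduced from the error function, since the mass within k standard deviations of the expectancy equals erf(k/√2). The function name `mass_within` and the choice of k values are assumptions made for illustration.

```python
from math import erf, sqrt


def mass_within(k):
    """Pr(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2.0))


within = {k: mass_within(k) for k in (1, 2, 3, 5, 1.96)}
for k, mass in within.items():
    print(f"within {k} sd: {100 * mass:.5f}%")
```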

Why the normal distribution?
While the normal distribution may look complicated, it has a host of nice properties:
• It’s smooth and allows all real-valued outcomes.
• It’s characterized by a single peak.
• If you condition on a distribution being positive, smooth and having a single peak, a Taylor expansion (of the logarithm of the density) will indicate that the normal distribution is an approximation around this peak.
• It’s symmetric.
• Information theory suggests choosing the distribution that maximizes entropy (minimizes information) conditioned on what you know. The max entropy distribution when you know the centre (expectation) and spread (standard deviation) is the normal distribution.
• The sum of two independent normally distributed variables is normally distributed.
• A sum of a large number of identically distributed independent variables will be approximately normally distributed (the central limit theorem).
• Believe it or not, the normal distribution is pleasant to work with, mathematically!
• It’s the distributional basis for a lot of the standard methodology in statistics.
Should work well for temperatures, not so well for discharge!

The lognormal distribution (scale variables)
When something needs to be strictly positive (the mass of a person, the volume in a reservoir, the discharge in a river), the normal distribution does not work. It assigns positive probabilities to negative outcomes.
A simple way to fix this is by first taking the logarithm of your original quantity, and assigning the normal distribution to that. If X>0, then log(X) can take values all over the real line.
The assumption log(X)~N(μ,σ) also gives a distribution for X, called the lognormal distribution, X~logN(μ,σ).
If μ is increased, the uncertainty (standard deviation) also increases, but the relative uncertainty remains the same.
From the central limit theorem, one can argue that the product of independent identically distributed positive variables will go towards the lognormal distribution.
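
The statement about relative uncertainty can be seen from the standard moment formulas for the lognormal distribution, written here with μ and σ as the expectancy and standard deviation of log(X). The derivation is a sketch added for illustration:

```latex
\begin{align*}
E(X) &= \exp\left(\mu + \tfrac{1}{2}\sigma^2\right), \\
\mathrm{sd}(X) &= E(X)\,\sqrt{\exp(\sigma^2) - 1}, \\
\frac{\mathrm{sd}(X)}{E(X)} &= \sqrt{\exp(\sigma^2) - 1}.
\end{align*}
```

Increasing μ scales both E(X) and sd(X) by the same factor, so their ratio depends on σ only.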

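The (inverse) gamma distribution
The gamma distribution is another distribution that only takes positive values. With shape α and rate β (one common parameterization):
$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-\beta x}, \quad x>0$
It has a form that makes it mathematically convenient when studying variance parameters (sums of independent squared contributions) and when studying rate parameters (Poisson).
In Bayesian statistics, the distribution of its inverse is however often more convenient. If X is gamma distributed, then Y=1/X is inverse gamma distributed:
$f(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,y^{-\alpha-1} e^{-\beta/y}, \quad y>0$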

Statistical inference
• If we have a task like extreme value analysis, regression or forecasting, there will be unknown numbers, parameters, that we want to estimate.
• A model summarizes how the data has been produced through the likelihood, f(D|θ), where θ is the unknown parameters and D is the data. The data might not tell a clear-cut story about θ.
• Statistical inference deals with:
  • Estimation of parameter values θ in a model.
  • Uncertainty of these parameter values.
  • Model choice and uncertainty concerning model choice.
  • Estimates and uncertainty of derived quantities. (For instance risk analysis.)
  • Estimates and uncertainties of latent variables (unmeasured stuff that has a distribution).

Statistical (parameter) inference
We then want to go the other way: say something about the parameters θ given the data, D. Two fundamentally different ways of dealing with this exist:
• Frequentist (classic):
  • Parameters are unknown but seen as fixed. Distributions (and uncertainty) are only assigned to data, f(D|θ), and to stuff derived from that via various methods.
  • We can assign probabilities to methodology having something to do with the parameters (estimators, confidence intervals etc.) before the data.
  • Plugging actual data into this, we don’t have anything with probability left, but talk about confidence instead.
• Bayesian:
  • Uncertainty is handled by probability distributions, whether it’s parameters or data. Since we have f(D|θ) we can turn it around and ask for f(θ|D) (the posterior distribution) using Bayes’ formula: $f(\theta|D) = \frac{f(D|\theta)\,f(\theta)}{f(D)}$
  • We need to start out with a distribution f(θ) summarizing what we know prior to the data.
  • We also have a troublesome integral for the prior predictive distribution (marginal data distribution): $f(D) = \int f(D|\theta)\,f(\theta)\,d\theta$
  • We’ve seen this type of inference already: Pr(rain | overcast) = Pr(overcast | rain)Pr(rain)/Pr(overcast)
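
A minimal sketch of the Bayesian route, under assumptions chosen purely for illustration: the parameter is a binomial success probability, the prior is flat, and the parameter is restricted to a fine grid so that the troublesome integral becomes a plain sum. The data (3 successes in 10 trials) and the function name `grid_posterior` are hypothetical.

```python
from math import comb


def grid_posterior(k, n, grid_size=1001):
    """Posterior of a binomial success probability on a grid, with a flat prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size                       # f(theta)
    likelihood = [comb(n, k) * p**k * (1 - p) ** (n - k)        # f(D | theta)
                  for p in grid]
    marginal = sum(l * pr for l, pr in zip(likelihood, prior))  # f(D)
    posterior = [l * pr / marginal                              # f(theta | D)
                 for l, pr in zip(likelihood, prior)]
    return grid, posterior


grid, posterior = grid_posterior(k=3, n=10)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(posterior_mean)
```

With a flat prior the exact posterior is a beta distribution with mean (k+1)/(n+2), which the grid result should approximate closely.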

When the model clashes with reality
We wish to find an uncertainty interval for the average mass of mammoths. Data set: x=(5000kg, 6000kg, 11000kg).
Model 1: xi~N(μ,σ) i.i.d.
• The normal distribution allows negative values for both the average mass and the measurements!
• Results in a 95% confidence interval, C(μ)=(-650kg, 15300kg), which contains values that just can’t be right!
Model 2: log(xi)~N(μ,σ) i.i.d. (xi~logN(μ,σ))
• Only positive expectancies and measurements are possible.
• 95% confidence interval transformed back to the original scale: (2500kg, 19000kg).
• Even better if we could put in prior knowledge.
Message: Use models that deal with reality. Learn to use various models/distributions/methods for various problems. GIGO (garbage in, garbage out).
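
Both intervals in the mammoth example can be reproduced with the usual t-based interval for an expectancy. The sketch below assumes that this is how the slide’s numbers were obtained, and it relies on the closed-form quantile of the t distribution with 2 degrees of freedom, so it only works for three observations. The function names are assumptions made for illustration.

```python
from math import exp, log, sqrt
from statistics import fmean, stdev

masses_kg = [5000.0, 6000.0, 11000.0]


def t_quantile_2df(p):
    """Quantile of the t distribution with 2 degrees of freedom (closed form)."""
    u = 2.0 * p - 1.0
    return u * sqrt(2.0 / (1.0 - u * u))


def mean_confidence_interval(data, level=0.95):
    """t-based interval for the expectancy; this sketch only handles n = 3."""
    n = len(data)
    if n != 3:
        raise ValueError("this sketch hard-codes the t quantile for n = 3")
    half_width = t_quantile_2df(0.5 + level / 2.0) * stdev(data) / sqrt(n)
    centre = fmean(data)
    return centre - half_width, centre + half_width


# Model 1: normal distribution on the original scale.
normal_ci = mean_confidence_interval(masses_kg)

# Model 2: normal distribution on the log scale, transformed back afterwards.
log_ci = mean_confidence_interval([log(m) for m in masses_kg])
lognormal_ci = (exp(log_ci[0]), exp(log_ci[1]))

print(normal_ci)
print(lognormal_ci)
```

For other sample sizes, a general t quantile function (for instance from a statistics library) would be needed in place of `t_quantile_2df`.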

Frequentist methodology
• Only data is given a distribution.
• Focused on estimation, with uncertainty and model choice being secondary.
• Model choice and uncertainty come from the probability of producing new data that looks like the data you have.
• Parameters do not have a distribution, but estimators do. An estimator is a method for producing a parameter estimate, defined before the data. So before the data, an estimator has a probability distribution.

Frequentist statistics: Estimation
Estimation is done through an estimator, a method for producing a number from a dataset generated by the model.
An estimator should be consistent: the probability of the difference between estimator and actual parameter being larger than a given value should go to zero as the number of data points increases. One would also like estimators to be unbiased, that is, having an expectancy equal to the parameter value.
Often used methods for making estimators:
• The method of moments. Estimate the parameters so that the expectancy matches the data mean, the distributional variance matches the empirical variance etc. Advantage: Easy to make. Disadvantage: Little theory about the estimator distribution (meaning bad for assessing uncertainty), can be pathological, limited areas of usage.
• The L moment method. Variant of the method of moments using so-called L moments. Advantage: Good experience from flood frequency analysis. Disadvantage: See above + not so easy to make.
• The ML method. Estimate the parameters to maximize the probability of the data, i.e. the likelihood f(D|θ). Advantage: More or less unlimited areas of usage. Asymptotic theory for the uncertainty exists, pathological estimates impossible. Disadvantage: May require heavier numerical methods, can be biased.

Frequentist statistics: Numeric methods
Not all models give you a likelihood that has readily available analytical expressions for the (ML) estimators. For such cases one needs numerical optimization. The methods come in two categories:
• Hill-climbing/local climbing: Start from one (or a small set of) point(s) in parameter space and use the local ”topography” of the likelihood to find the nearest peak. Examples: Newton’s method, Nelder-Mead.
• Global methods: More complicated; they require much coding, execution time and tuning. Examples: simulated annealing, genetic algorithms.
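
To make the hill-climbing idea concrete, here is a sketch of Newton’s method applied to a likelihood where the answer is known in advance: for independent Poisson counts the ML estimate of the rate is simply the sample mean, so the numerical result can be checked. The data, the starting value and the function name `newton_poisson_mle` are hypothetical.

```python
def newton_poisson_mle(counts, start, tol=1e-10, max_iter=50):
    """Maximize the Poisson log-likelihood in the rate by Newton's method."""
    n, total = len(counts), sum(counts)
    lam = start
    for _ in range(max_iter):
        score = total / lam - n          # first derivative of the log-likelihood
        curvature = -total / lam**2      # second derivative (always negative)
        step = score / curvature
        lam -= step                      # Newton update
        if abs(step) < tol:
            break
    return lam


counts = [8, 12, 9, 11, 10, 7, 13]       # hypothetical yearly event counts
lam_hat = newton_poisson_mle(counts, start=5.0)
print(lam_hat, sum(counts) / len(counts))
```

Like other local methods, this depends on the starting point: for this particular likelihood the iteration only converges when the start lies between zero and twice the sample mean.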

Frequentist statistics: Parameter uncertainty and confidence intervals
An estimate isn’t the truth. There can be many different parameter values that can, with reasonable probability, produce the data.
Frequentist statistics operates with confidence intervals. A 95% confidence interval is a method for making intervals from the data which, before the data, has a 95% probability of encompassing the correct parameter value. (A Bayesian credible interval has a 95% probability of encompassing the correct parameter value, given the data.)
Confidence intervals are made by looking at the distribution of so-called test statistics (often estimators).

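Confidence interval techniques
• Exact methods. These can be applied when you know exactly the distribution of the test statistic. Ex: A 95% confidence interval for the expectancy in the normal distribution: $\left(\bar{x} - t_{n-1}(0.975)\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1}(0.975)\frac{s}{\sqrt{n}}\right)$ where s is the empirical standard deviation and $t_{n-1}(0.975)$ is the 97.5% quantile of the t distribution with n-1 degrees of freedom.
• Asymptotic theory. When the amount of data goes towards infinity, the ML estimator has the following distribution: $\hat{\theta} \approx N(\theta, \sigma_{\hat{\theta}})$ with $\sigma_{\hat{\theta}} = I(\theta)^{-1/2}$, where I(θ) is the Fisher information. Thus $(\hat{\theta} - 1.96\,\sigma_{\hat{\theta}},\ \hat{\theta} + 1.96\,\sigma_{\hat{\theta}})$ will be an approximate 95% confidence interval.
• Bootstrap. Here one tries to recreate the distribution that produced the data, either by plugging parameter estimates into the model (parametric) or by re-drawing from the data with replacement (non-parametric). One then looks at the spread of the set of new parameter estimates for these new datasets.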
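
The non-parametric bootstrap is short enough to sketch in full. The version below re-draws from the data with replacement and takes percentiles of the resampled means, which is one common variant among several. The data values, the number of resamples, the seed and the function name `bootstrap_mean_interval` are assumptions made for illustration.

```python
import random
from statistics import fmean


def bootstrap_mean_interval(data, n_resamples=5000, level=0.95, seed=3):
    """Percentile bootstrap interval for the mean (non-parametric)."""
    rng = random.Random(seed)
    n = len(data)
    # Re-draw n values with replacement and record the mean of each new dataset.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    lo_index = int((1 - level) / 2 * n_resamples)
    hi_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_index], means[hi_index]


# Hypothetical measurements, for illustration only.
data = [12.1, 9.8, 15.3, 11.0, 8.7, 13.9, 10.4, 14.2, 9.1, 12.8]
interval = bootstrap_mean_interval(data)
print(fmean(data), interval)
```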

Frequentist statistics: Hypothesis testing
Sometimes we will be uncertain about which model to use. Is there a dependency between x and y? Does the expectancy hold a particular value?
Classic hypothesis testing is done as follows:
• Formulate a null hypothesis, H0, and an alternative hypothesis, Ha.
• Set a threshold probability, called the significance level, for the probability of rejecting a correct null hypothesis. Typical value: 5%.
• Focus on a test statistic (a function of the data; the likelihood or an estimator, for instance). Find an expression for the probability density of this.
• By looking at the alternative hypothesis, find which values of the test statistic are extreme under the null hypothesis. From the distribution of the test statistic, find an interval containing the 5% (significance level) most extreme outcomes.
• If the test statistic for the data is inside this interval, the null hypothesis is rejected with (100% - significance level) confidence.
P value: The probability of getting a test statistic at least as extreme as the one you got, given that the null hypothesis is correct. P value < significance level means rejection.
Power: Gives the probability of rejecting the null hypothesis for various versions of the alternative hypothesis (so a function of the parameter values). You want this as high as possible. This can affect experimental planning.

Frequentist statistics: Model testing (2)
The t test. Checks if one dataset has an expectancy equal to a given value or if two datasets have the same expectancy. In practice done by seeing if the 95% confidence interval (for the expectancy minus the given value, or for the difference) encompasses zero.
General methodology:
• The likelihood ratio test. Under a null hypothesis, $2(l_A - l_0) \sim \chi^2_k$, where k is the difference in the number of parameters and $l_A$ and $l_0$ are the maximized log-likelihoods for the alternative hypothesis and the null hypothesis, respectively. (Only valid asymptotically.)
• The score test. Uses the asymptotic distribution $\hat{\theta} \approx N(\theta, \sigma_{\hat{\theta}})$ to check if the parameter estimate is far enough from a specific value for that value to be rejected. (See if the confidence interval that ranges from $\hat{\theta} - 1.96\,\sigma_{\hat{\theta}}$ to $\hat{\theta} + 1.96\,\sigma_{\hat{\theta}}$ contains the value to test this.)
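
A likelihood ratio test with one extra parameter can be sketched with the standard library alone, because the chi-square tail probability with 1 degree of freedom can be written with the complementary error function. The example below tests a fixed Poisson rate (null hypothesis) against a freely estimated rate; the data, the tested rates and the function names are hypothetical.

```python
from math import erfc, log, sqrt


def poisson_loglik(counts, lam):
    """Poisson log-likelihood without the log(x!) terms, which cancel in the ratio."""
    return sum(x * log(lam) - lam for x in counts)


def likelihood_ratio_test(counts, lam0):
    """Test H0: rate = lam0 against Ha: rate free (k = 1 extra parameter)."""
    lam_hat = sum(counts) / len(counts)           # ML estimate under Ha
    stat = 2.0 * (poisson_loglik(counts, lam_hat) - poisson_loglik(counts, lam0))
    stat = max(stat, 0.0)                         # guard against rounding below zero
    p_value = erfc(sqrt(stat / 2.0))              # Pr(chi-square with 1 df > stat)
    return stat, p_value


counts = [8, 12, 9, 11, 10, 7, 13]                # hypothetical yearly event counts
stat_far, p_far = likelihood_ratio_test(counts, lam0=6.0)
stat_near, p_near = likelihood_ratio_test(counts, lam0=10.0)
print(stat_far, p_far)
print(stat_near, p_near)
```

At a 5% significance level, a rate of 6 would be rejected for these made-up counts, while a rate of 10 would not.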

Frequentist statistics: other model choice strategies
Hypothesis testing is primarily for when you only want to reject a null hypothesis with strong evidence. But often, what you are seeking is whichever model serves the purpose best, like minimizing the prediction inaccuracies. Sometimes hypothesis testing can even be impossible, because both your models have equal model complexity.
Note that prediction uncertainty comes from the stochasticity of the data itself, errors in the model and uncertainty concerning parameter estimates. Stochasticity in the data is something we can’t get rid of, but the other two contributions need to be balanced.
Methods:
• Adjusted R2 (only regression)
• AIC=-2*log(ML)+2*k, k=#parameters
• BIC=-2*log(ML)+log(n)*k
• FIC
• Divide the data into training and validation sets.
• Cross validation
• CV-ANOVA (ANOVA test on the results of cross validation.)
[Figure: prediction uncertainty, estimation uncertainty and model errors plotted against model complexity.]
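
The two information criteria are simple enough to compute by hand once the maximized log-likelihood is available. In the sketch below, log(ML) is taken to mean the natural logarithm of the maximized likelihood, and the log-likelihood values, parameter counts and sample size are hypothetical numbers chosen only to show the mechanics.

```python
from math import log


def aic(max_loglik, k):
    """AIC = -2*log(ML) + 2*k, with k the number of parameters."""
    return -2.0 * max_loglik + 2.0 * k


def bic(max_loglik, k, n):
    """BIC = -2*log(ML) + log(n)*k, with n the number of data points."""
    return -2.0 * max_loglik + log(n) * k


# Hypothetical fits of two models to the same n = 100 data points.
n = 100
simple = {"loglik": -104.0, "k": 2}
complex_ = {"loglik": -100.0, "k": 5}

aic_simple, aic_complex = aic(simple["loglik"], simple["k"]), aic(complex_["loglik"], complex_["k"])
bic_simple = bic(simple["loglik"], simple["k"], n)
bic_complex = bic(complex_["loglik"], complex_["k"], n)

# Lower values are preferred; BIC penalizes extra parameters harder when log(n) > 2.
print(aic_simple, aic_complex)
print(bic_simple, bic_complex)
```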
