Things gone bye.

How to Predict The Future • Either the world is driven completely by random chance events (and your best bet for predicting the future is using Tarot cards or a Magic 8 Ball™), or there are detectable patterns in the world. • If you talk to a preschool teacher or a PhD in math, they will tell you that math is all about pattern detection.

What We Want… • We want to do deterministic modeling where we’re able to fill in a table like this:

Weeks of gestation    Weight at birth (lbs)
…
29
38
39
40

…and express it with a simple formula like this: lbs = weeks * something, where “something” is a β (beta) value.

What (else) We Want…. • Once we have made guesses at those numbers, we want to say how confident we are that they are right.

The Process • The process of going from a single predictor or a set of predictors to a predicted outcome is called statistical modeling. • People get far too excited about figuring out which statistic (with accompanying p-value anxiety) to use for the factors that are used in models.

The Steps • Say what you are testing. • Note the scale (nominal, ordinal, interval) of all the predictors. • Describe the predictors numerically and graphically. • Measures of central tendency and variability • Look for association between the predictors and the outcome. • Look at the strength of the association. • Look for interactions.
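
A minimal sketch of the describe-and-associate steps in R. The data frame `births` and every number in it are made up for illustration; they are not from the slides.

```r
# Hypothetical data: gestation in weeks, birth weight in lbs, smoking status.
births <- data.frame(
  weeks  = c(29, 34, 36, 38, 38, 39, 40, 40, 41, 42),
  lbs    = c(2.9, 5.0, 5.8, 6.9, 6.4, 7.2, 7.6, 7.1, 8.0, 8.3),
  smoker = factor(c("yes", "no", "yes", "no", "yes",
                    "no", "no", "yes", "no", "no"))
)

# Central tendency and variability of a continuous (interval) predictor.
mean_weeks   <- mean(births$weeks)
median_weeks <- median(births$weeks)
sd_weeks     <- sd(births$weeks)

# Counts for a nominal predictor.
smoker_counts <- table(births$smoker)

# Strength of the linear association between a predictor and the outcome.
r_weeks_lbs <- cor(births$weeks, births$lbs)

# Graphical description (uncomment when working interactively).
# hist(births$weeks)
# plot(lbs ~ weeks, data = births)
```

A correlation is only one way to look at strength of association; it is used here because both variables are continuous.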

What is a model and why care? • The predictors and the outcomes can be on a continuous scale (time in days) or categorical factors (mom smoked, yes or no). • Generally we try to use all the information available when we make a prediction about the future. • The amount of blood ejected each time the heart beats (continuous scale) as opposed to whether or not the heart is beating • The number of cancer cells seen on a slide (or the presence or absence of malignant cells) • The models we build are remarkably similar regardless of whether we have categorical or continuous outcomes.

The Structure of a Model • All the models I learned in school were formulated at their core like this: Outcome = baseline + predictor + predictor • The math can get ugly very quickly depending on the properties of the outcome (continuous, count, categories) but the core idea is that these models are all using additive contributions from some predictors! • Baby’s weight = some number + weeks * a number (impact of time) + a number (impact of being a smoker)

What Makes a Bad Model • Predicts some outcomes poorly • Is strongly influenced by a small number of data points • Shows systematic patterns in how it fails to predict

Goals I see modeling as having two goals: • Estimate parameters. • How much weight gain occurs each week as a baby is developing? • Estimate how well it describes your data. (Is your guess precise?) • How far off will my guess be when I predict the next child? • Are there regions where my guesses are far off, like premature or late deliveries? • Is there a lot of variability at one point and not at others? • Can I see any problems when I fit the model to THIS data?

Looking for Errors • Statisticians use the word “error” differently than everyone else. • You know that you will not have perfect prediction. Instead, you will be off. That is error. It does not mean somebody made a mistake! It just means you can’t make a perfect prediction. • Specifying how far you will be off is the fun and interesting part of statistics. The rest is just math.

Looking at Errors • Outcome = baseline + predictor + predictor + error • Baby’s weight = some number + weeks * a number (impact of time) + a number (impact of being a smoker) + error (a number drawn from a bell-shaped distribution)
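
One way to see this structure is to build fake data from it and check that a fitted model recovers the numbers. This is only a sketch: the baseline, the per-week gain, the smoker effect, and the error spread below are all invented for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

n      <- 200
weeks  <- sample(28:42, n, replace = TRUE)
smoker <- rbinom(n, size = 1, prob = 0.3)  # 1 = mom smoked

# Made-up "true" numbers, chosen only for illustration.
baseline      <- -9.0   # some number
per_week      <-  0.42  # impact of time, in lbs per week
smoker_effect <- -0.5   # impact of being a smoker, in lbs

# Error: a number drawn from a bell-shaped (normal) distribution.
error <- rnorm(n, mean = 0, sd = 0.8)

# Outcome = baseline + predictor + predictor + error
lbs <- baseline + per_week * weeks + smoker_effect * smoker + error

fit <- lm(lbs ~ weeks + smoker)
coef(fit)  # estimates should land near the made-up numbers above
```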

Looking for Errors • Hopefully you will see that, given any specific predictor value, your guessed values for the outcome will be close to the values you actually observe in the outcome. Also, any observed outcome values that stray too far from your guess are unlikely. • That pattern of how far off your guesses are from your observed data can frequently be described by a bell-shaped (“normal”) histogram. So, if you measure errors between your prediction and the observed outcomes, the distribution should be “normal.”

Guesses and Errors • [Figure: histogram of actual weights at 40-week births, axis marked at 5.5 lbs and 9.5 lbs; my model guesses 7.5 lbs.] • [Figure: histogram of errors at 40 weeks. The error is 0 if the child was 7.5 lbs, most errors are off by just a bit, and I rarely guessed way too high or way too low.]

Variance vs. Standard Error • The variability of individual values around a continuous outcome is frequently described as a variance. The variability of an estimate (such as a mean) across samples, that is, in its sampling distribution, is frequently described as a standard error. • The standard error depends on the number of people in the sample: larger samples give smaller standard errors.
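
A small sketch of the difference, assuming a made-up population of 40-week birth weights with mean 7.5 lbs and standard deviation 1 lb. The helper `se()` is defined here for illustration; it is the usual formula for the standard error of a mean.

```r
set.seed(2)  # arbitrary seed so the sketch is repeatable

# Two samples from the same hypothetical population, differing only in size.
sample_small <- rnorm(25,  mean = 7.5, sd = 1)
sample_large <- rnorm(400, mean = 7.5, sd = 1)

# Variance: spread of individual weights. Roughly the same in both samples.
var_small <- var(sample_small)
var_large <- var(sample_large)

# Standard error of the mean: sd / sqrt(n). Shrinks as n grows.
se <- function(x) sd(x) / sqrt(length(x))
se_small <- se(sample_small)
se_large <- se(sample_large)
```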

Looking at Errors • There are some kinds of errors that you will be unwilling to accept. • If I want to predict the number of times an evil lackey proposes marriage to a mad scientist, I will not accept a negative number! • If I am predicting the chance of someone developing cancer, I will not accept a number less than 0% or greater than 100%. • Specifying the type of errors is a critical part of building a model.

More on Errors • In addition to specifying the range of legal values, another critical component is specifying the variability in the errors. • You have met several probability distributions which let you quantify what is an unusual score given a few parameters describing your data. • Continuous outcomes • Uniform, Normal, T, F • Categorical outcomes • The Binomial, Bernoulli, Chi-square

Ordinary Least Squares • Perhaps the easiest models to draw and understand are ones where you have a continuous outcome like weight and a continuous predictor like time. • The model is just a line… • Y = mX + b • Weight = estimated weight gain each week after conception * number of weeks + weight at 0 weeks
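
Fitting that line in R might look like the sketch below. The ten births are hypothetical numbers invented for this example.

```r
# Hypothetical data, not taken from the slides.
weeks <- c(29, 34, 36, 38, 38, 39, 40, 40, 41, 42)
lbs   <- c(2.9, 5.0, 5.8, 6.9, 6.4, 7.2, 7.6, 7.1, 8.0, 8.3)

fit <- lm(lbs ~ weeks)

b <- coef(fit)[["(Intercept)"]]  # weight at 0 weeks (a far extrapolation)
m <- coef(fit)[["weeks"]]        # estimated weight gain each week

# Guess for a 40-week birth, with a 95% prediction interval
# to say how far off the guess might be for the next child.
guess_40 <- predict(fit, newdata = data.frame(weeks = 40),
                    interval = "prediction")
```

With these made-up numbers the intercept comes out negative, a reminder that "weight at 0 weeks" is outside the range of the data.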

Maximum Likelihood Visual

Bad Models • All models are wrong. • Your data is sacred (after you remove the pregnant men) and you fit models to the data. You do not fit data to a model. That difference is not a minor semantic detail.

Poor Predictions • Sometimes you have data points that are not well fit by the model. Go to extreme measures to document those points. If a data point is not a true error, then run the analysis with it and without it. Include the point(s) in all your plots with a special symbol, and if one person changes your inferences, consider excluding them. • You may have different subgroups that you have not identified yet.
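
A sketch of running the analysis with and without a suspect point. The data are hypothetical; the eleventh baby is deliberately made unusually large for 37 weeks.

```r
# Hypothetical data; the last baby is unusually large for 37 weeks.
weeks <- c(29, 34, 36, 38, 38, 39, 40, 40, 41, 42, 37)
lbs   <- c(2.9, 5.0, 5.8, 6.9, 6.4, 7.2, 7.6, 7.1, 8.0, 8.3, 11.5)
suspect <- seq_along(lbs) == 11

fit_all     <- lm(lbs ~ weeks)
fit_without <- lm(lbs ~ weeks, subset = !suspect)

# Compare the estimates with and without the point.
comparison <- rbind(all = coef(fit_all), without = coef(fit_without))

# Compare how far off the guesses typically are (residual standard error).
sigma_all     <- summary(fit_all)$sigma
sigma_without <- summary(fit_without)$sigma

# Plot with a special symbol for the suspect point (uncomment interactively).
# plot(weeks, lbs, pch = ifelse(suspect, 17, 1))
```

If the two rows of `comparison` lead to different conclusions, that one baby is driving the inference.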

A True Outlier • [Figure annotation: induced because of HUGE size]

Looking at Residuals • A critical step in examining the quality of a model is graphically looking at the residuals. • Residuals are the differences between the estimated values and the observed values for each person/critter/observation. • Look for curves, changing variability across the range of values or changes over time.
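
A sketch of pulling residuals out of a fitted model, using the same kind of made-up birth data. Note that R's `residuals()` reports observed minus fitted.

```r
# Hypothetical data, not taken from the slides.
weeks <- c(29, 34, 36, 38, 38, 39, 40, 40, 41, 42)
lbs   <- c(2.9, 5.0, 5.8, 6.9, 6.4, 7.2, 7.6, 7.1, 8.0, 8.3)

fit <- lm(lbs ~ weeks)

fitted_lbs <- fitted(fit)     # the model's guesses
res        <- residuals(fit)  # observed minus fitted, one per baby

# For a least-squares fit with an intercept, residuals sum to (numerically) zero.
sum_res <- sum(res)

# Residuals against fitted values: look for curves or a funnel shape.
# plot(fitted_lbs, res); abline(h = 0, lty = 2)
```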

Patterns in Residuals • [Figure from Crawley, Statistical Computing]

Curve Fitting • Linear models can model curves • The math is not too bad…. • You can use explicit mathematical formulas. If you see curves in your residuals, you can use things like: • Polynomials or inverse polynomials • Exponentials • Power functions

Nonlinear Regression • Often the formulas to describe your data are extraordinarily complicated and you want to use non-linear or non-parametric modeling instead. • Key words you will see include: • Non-parametric smoothing • Lowess regression • Spline regression • GAM • Tree models

A Bad Fit • What happens when you fit a straight linear model to curvilinear data? • Is this better than a flat line at the mean? • [Figure label: residual]

Is it good? • A tiny p-value does not mean a good model! • Where on the output does it tell you that this is a good or a poor model?

Flatten the line, then look up and down to see if you are systematically off. Residuals?

Curve Fitting! • You can build a model that has a curve using a polynomial… the degree of the polynomial determines how many “bends” appear in a curve. So a 2nd degree polynomial would use x and x^2 while a 3rd degree polynomial would use x, x^2, and x^3. These squared or cubed values don’t do anything especially complicated. They are just like adding new variables.

Polynomials
size = intercept + X * something + X^2 * something else
size = intercept + X * something + X^2 * something else + X^3 * another thing
poly2 = lm(y~poly(x,2))
poly3 = lm(y~poly(x,3))
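
A runnable sketch of those two calls. The curved data are simulated from made-up numbers, and the straight-line fit `line` is added only for comparison.

```r
set.seed(3)  # arbitrary seed so the sketch is repeatable

# Made-up curved data: a quadratic plus bell-shaped error.
x <- seq(1, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(60, sd = 0.5)

line  <- lm(y ~ x)           # straight line, for comparison
poly2 <- lm(y ~ poly(x, 2))  # uses x and x^2
poly3 <- lm(y ~ poly(x, 3))  # uses x, x^2, and x^3

# poly() builds orthogonal polynomials by default. raw = TRUE gives
# coefficients on plain x and x^2; the fitted curve is the same.
poly2_raw <- lm(y ~ poly(x, 2, raw = TRUE))

# Does each extra term improve the fit?
comparison <- anova(line, poly2, poly3)
```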

Generalized Linear Models • You will eventually move out of the realm of predicting continuous outcomes with normal error. When you do, you will move into the realm of Generalized Linear Models (GLM). • You want to have a linear model predicting an outcome where you restrict the possible outcome values (e.g., only allow values between 0 and 1) and deal with errors not being consistently normal across the entire range. • You can change (transform) your outcome and model this with just another linear model similar to what I have shown.

GLM in English • If you are predicting the number of bacteria you see in a Petri dish, you cannot possibly see a negative number of bacteria. A GLM can be written so that your predicted values cannot be negative. • Contrast this with the baby weight example where, with a bit of bad data for your predictor value, you could have the formula spit out a negative weight or a baby weighing a ton.
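
A sketch of the bacteria example, assuming made-up counts at increasing doses of some treatment. The Poisson error and log link used here are one common choice for counts, not necessarily the model the slides had in mind.

```r
# Hypothetical bacteria counts at increasing doses.
dose  <- c(0, 1, 2, 3, 4, 5, 6, 7)
count <- c(52, 31, 22, 11, 8, 3, 2, 0)

linear_fit  <- lm(count ~ dose)
poisson_fit <- glm(count ~ dose, family = poisson(link = "log"))

# Predict beyond the observed doses.
new_doses <- data.frame(dose = c(8, 10))

linear_pred  <- predict(linear_fit, newdata = new_doses)
poisson_pred <- predict(poisson_fit, newdata = new_doses, type = "response")

linear_pred   # the straight line goes below zero
poisson_pred  # the GLM stays positive
```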

GLM • Instead of modeling like this: Outcome = baseline + predictor + predictor + error (normal/bell-shaped) • You can model with GLM like this: Tweaked outcome = baseline + predictor + predictor + not-normal error • For example: log(odds of event) = baseline + predictor * β1 + predictor * β2 + binomial error

Ordinary Regression • So, the ordinary least squares regression models are really just a case of GLM. In these cases I specify that the tweak to the outcome is to just make the outcome identical to what it was originally and the error is normal. • The tweak to the outcome is called the link, and in this case the link is called identity.
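
A quick check of that claim in R, using hypothetical birth data: `lm()` and `glm()` with normal (gaussian) errors and an identity link should give the same estimates.

```r
# Hypothetical data, not taken from the slides.
weeks <- c(29, 34, 36, 38, 38, 39, 40, 40, 41, 42)
lbs   <- c(2.9, 5.0, 5.8, 6.9, 6.4, 7.2, 7.6, 7.1, 8.0, 8.3)

fit_ols <- lm(lbs ~ weeks)
fit_glm <- glm(lbs ~ weeks, family = gaussian(link = "identity"))

# Same intercept and slope, up to numerical tolerance.
same_estimates <- all.equal(coef(fit_ols), coef(fit_glm))
```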

Mort = 389 - 5.98 * latitude

Link Functions • The tweaks to the outcome are called links: • Identity link = predicting a continuous outcome (baby weight) • Log link = if you can’t have negative values • Logit link = if you have to restrict the range to between 0 and 1 • There are other links.
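
A sketch of what each link does to the model's raw output. `make.link()` is part of base R's stats package; the values in `eta` are arbitrary numbers standing in for "baseline + predictor + predictor".

```r
# Arbitrary values on the linear-predictor scale.
eta <- c(-10, -1, 0, 1, 10)

identity_link <- make.link("identity")
log_link      <- make.link("log")
logit_link    <- make.link("logit")

# linkinv() maps the linear predictor back to the outcome scale.
on_identity <- identity_link$linkinv(eta)  # unchanged: any real number
on_log      <- log_link$linkinv(eta)       # exp(eta): always positive
on_logit    <- logit_link$linkinv(eta)     # 1 / (1 + exp(-eta)): between 0 and 1
```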

Error Structure • Why bother to specify an error structure other than normal? • Errors can be strongly skewed, show kurtosis, or be bounded, and counts cannot be negative. • The shape of the error distribution is not always a bell-shaped curve. Rather than worrying about the math to describe those curves, you simply need to know that different types of data have different error structures. • Normal errors – continuous outcomes • Poisson errors – counts • Binomial errors – proportions • Gamma errors – data with a constant coefficient of variation

Binary Response • If you are not dealing with a continuous outcome, or count data, you will likely have a binary (yes/no scored as 1 or 0) outcome. • Clearly you need to do some major tweaking to the outcome because linear models, as we have seen, can predict very large and small numbers. • Also, the variability of a binary outcome is very different from a continuous variable.

Logistic Regression • The solution is to specify a link that limits values to be between 0 and 1 (think of the changed outcome as being the probability of being scored 1) and use an error term that behaves well with binary outcomes. • This is a GLM with a logit link and binomial errors. • This kind of analysis is so popular that most people don’t know it is a GLM. Rather, they know it only as logistic regression.
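
A sketch of a logistic regression in R. The binary outcome `low` (1 = baby weighed under 5.5 lbs) and all the data are hypothetical, made up for this example.

```r
# Hypothetical data: 1 = baby weighed under 5.5 lbs, 0 = otherwise.
weeks <- c(30, 32, 33, 34, 35, 35, 36, 36, 37,
           37, 38, 38, 39, 39, 40, 40, 41, 42)
low   <- c( 1,  1,  1,  1,  1,  0,  1,  0,  1,
            0,  0,  1,  0,  0,  0,  0,  0,  0)

# GLM with a logit link and binomial errors.
fit <- glm(low ~ weeks, family = binomial(link = "logit"))

coef(fit)  # estimates are on the log(odds) scale

# Predicted probability of being scored 1; always between 0 and 1,
# even for predictor values well outside the observed range.
p <- predict(fit, newdata = data.frame(weeks = c(25, 36, 45)),
             type = "response")
```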

Logistic model • log(odds of high) = -17.81 + 0.4539 * latitude

So Long (and thanks for all the fish) • Drop by and say hi or send me an email if you have questions in the future. balise@stanford.edu
