
### BMS 617

Lecture 11: Models

Marshall University Genomics Core Facility

What is a model?
• In general, a model is a (simpler) representation of something else
• We use models to study complex phenomena
• Easier to manipulate than the real thing of interest
• Easier to focus on specific aspects
• E.g. we use mouse models to study human disease
• Easier to control behavior of the mouse
• Easier to control genetics…

Marshall University School of Medicine

What is a mathematical model?
• A mathematical model is an equation (or set of equations) that describes a physical state or process
• Describes how values in the state or process are related to each other
• Aim is not to provide a perfect model
• A good model is simple enough to be easy to understand
• Yet complex enough to be useful

Statistical Models
• Statistical models are mathematical models that model both the ideal predictions and the random “scatter” or “noise”
• Model both the population values and the “random” variation from the population values
• “Random” variation is really just variation not explained or accounted for by the model

Model terminology
• A model is an equation (or set of equations)
• The equation defines the outcome, or dependent variable, as a function of
• one or more independent variables, and
• one or more parameters
• Each data point has its own values for the independent and dependent variables
• The values of the parameters are properties of the population
• Do not vary from data point to data point

Fitting a model to data
• The parameters are properties of the population
• They are unknown
• Typically, we collect a sample of data points
• Assuming the model is correct, we can use the sample to estimate the parameters of the model
• This is called “fitting a model to the data”
• Results in estimates and confidence intervals for each of the parameters

Simplest possible model

The simplest possible model for a data set involves no independent variable!

Sample values from a population

Assume the population values follow a Normal distribution

Our model is Y = μ + ε

Average as a model
• In the simple model Y=μ+ε,
• Y is the dependent variable
• Different value for each data point
• μ is a parameter
• The mean of the population
• Single, unknown value we will estimate from our data
• ε is the “random error”
• Different for each data point, assumed normally distributed with mean zero
• Can make the roles of the variable types more explicit by writing Yᵢ = μ + εᵢ

Why the mean is important
• If we assume the model is correct:
• Our data are sampled from a population where the values are some fixed value, plus some scatter that is normally distributed with mean zero
• then we want to use our data to estimate μ
• It turns out that the value of μ that makes our observed data the most likely, out of all possible choices of μ, is the mean of our data
• The mean is the maximum likelihood estimate of μ
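
The claim that the sample mean is the maximum likelihood estimate of μ can be checked numerically. The sketch below is only an illustration: the `sample` values, the fixed `sigma`, and the grid of candidate values are all made up, and a grid search is used purely to make the idea visible (in practice the mean is computed directly).

```python
import math
import statistics

# Hypothetical sample values (made up for illustration; not from the lecture).
sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]


def log_likelihood(mu, values, sigma=1.0):
    """Log-likelihood of the values under Y_i = mu + eps_i,
    with eps_i normally distributed with mean zero and SD sigma."""
    n = len(values)
    sum_sq = sum((y - mu) ** 2 for y in values)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sum_sq / (2 * sigma ** 2)


# Try candidate values of mu on a fine grid (3.000 to 7.000 in steps of 0.001)
# and keep the one that makes the observed data most likely.
candidates = [i / 1000 for i in range(3000, 7001)]
best_mu = max(candidates, key=lambda mu: log_likelihood(mu, sample))

sample_mean = statistics.mean(sample)
print(f"grid-search estimate of mu: {best_mu:.3f}")
print(f"sample mean:                {sample_mean:.3f}")
```

The choice of `sigma` does not change which μ wins, because the log-likelihood depends on μ only through the sum of squared distances from μ, which the mean minimizes.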

A more sophisticated model: linear regression
• Remember the example from linear regression:
• Measured insulin sensitivity and %C20-22 content in 13 healthy men
• Hypothesized that an increase in %C20-22 content caused an increase in insulin sensitivity
• Used linear regression to fit the model Y = intercept + slope × X + scatter to the data
• Y is the insulin sensitivity, X the %C20-22 content
• In more conventional notation: Y = β₀ + β₁ × X + ε, or Yᵢ = β₀ + β₁ × Xᵢ + εᵢ

Linear regression as a statistical model
• The linear regression model has two parameters:
• β₀, the intercept
• β₁, the slope
• These are both properties of the population
• We use the data to estimate them
• Uses the method of “least squares”
• When the scatter is normally distributed, this gives the maximum likelihood estimates of the two parameters
• The values of the parameters that maximize the chances of our data being observed
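
As a rough illustration of the least squares method, the sketch below fits a straight line with the standard closed-form formulas and then checks that nudging either estimate makes the fit worse. The `xs` and `ys` values are hypothetical (the lecture's 13 data points are not reproduced here), and the function names are chosen only for this example.

```python
# Hypothetical (x, y) pairs, made up for illustration; not the lecture's data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(intercept, slope, xs, ys):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


b0, b1 = least_squares_line(xs, ys)
best = residual_sum_of_squares(b0, b1, xs, ys)

# Moving either estimate away from its least squares value increases the
# residual sum of squares.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged = [residual_sum_of_squares(b0 + d0, b1 + d1, xs, ys) for d0, d1 in nudges]

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, residual SS = {best:.4f}")
print("every nudged fit is worse:", all(r > best for r in nudged))
```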

Recap of models
• The linear regression in this example gave an estimate of the slope of 37.2, and an estimate of the intercept of -486.5
• Our estimated model is Insulin sensitivity = 37.2 × %C20-22 - 486.5 + ε
• The model is not assumed to be perfect!
• Simple, but powerful enough to draw some basic conclusions
• Within the range of the data, an increase of one unit in %C20-22 corresponds, on average, to an increase of 37.2 units in insulin sensitivity
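
To see what the estimated model says in practice, the sketch below evaluates the fitted line at two inputs one unit apart. The slope and intercept are the estimates quoted above; the inputs 20.0 and 21.0 are hypothetical and are only assumed to fall within the range of the observed data.

```python
# Estimates quoted in the lecture for the insulin sensitivity example.
INTERCEPT = -486.5
SLOPE = 37.2


def predicted_insulin_sensitivity(pct_c20_22):
    """Average insulin sensitivity predicted by the fitted line.

    The scatter term is omitted: this is the model's prediction for the
    average, not for any individual measurement.
    """
    return INTERCEPT + SLOPE * pct_c20_22


# Hypothetical inputs, assumed to lie within the range of the data.
low = predicted_insulin_sensitivity(20.0)
high = predicted_insulin_sensitivity(21.0)

print(f"prediction at 20.0: {low:.1f}")
print(f"prediction at 21.0: {high:.1f}")
print(f"difference:         {high - low:.1f}")
```

The difference between the two predictions equals the slope, which is all the last bullet is saying. Extrapolating far outside the observed range (for example to %C20-22 = 0) would give predictions the model was never meant to support.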

Other types of model
• We will look at other types of model in upcoming lectures:
• Multiple regression
• More than one independent variable
• Logistic regression
• Outcome variable is binary, one or more independent variables
• Proportional hazards regression
• Outcome variable is survival time, one or more independent variables

Comparing Models
• In the linear regression example, we also computed a p-value
• The null hypothesis was that the slope was zero
• I.e. we compared the model Y = β₀ + β₁ × X + ε to the model Y = β₀ + ε
• So we can think of this statistical test as the comparison between two models
• In fact, we can think of most (perhaps all) statistical tests as the comparison between two models

Why model comparison is not straightforward
• It is not enough just to compare the “residuals” between two models
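• Remember the residuals are the differences between the observed values and the values predicted by the model (the estimated error terms)
• A model with more parameters will always come at least as close to the data as a simpler model it contains as a special case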
• However, the confidence intervals will be wider
• So the model will be less useful for predicting future values

Comparing the models and R²
• The total sum of squares of the distances of the points from the mean (i.e. the total variance) is 155,642.3
• The sum of squares of the residuals from the regression line is 63,361.37
• The difference between these is 92,280.93, which is 59.3% of the total variance
• So fitting the linear model reduces the sum of squares by 59.3% of the total: this is the definition of R², so R² = 0.593
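
The arithmetic behind this value can be reproduced directly from the two sums of squares quoted above. This is only a check of the numbers on the slide; the variable names are chosen for this example.

```python
# Sums of squares quoted in the lecture for the insulin sensitivity example.
ss_total = 155_642.3      # squared distances of the points from the mean
ss_residual = 63_361.37   # squared distances of the points from the regression line

# Reduction in the sum of squares achieved by fitting the line.
ss_explained = ss_total - ss_residual

# R squared is that reduction as a fraction of the total.
r_squared = ss_explained / ss_total

print(f"explained sum of squares: {ss_explained:.2f}")
print(f"R squared:                {r_squared:.3f}")
```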

Interpreting the difference in variance

With a little algebra, you can show that the difference between the total variance and the sum of the squares of the residuals is the sum of the squares of the distance between the regression line and the mean

So the regression line “accounts for 59.3% of the variance”
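
The "little algebra" is not reproduced here, but the identity is easy to check numerically. The sketch below uses hypothetical data (the `xs` and `ys` values are made up) and a closed-form least squares fit; it is a sanity check on one data set, not a proof.

```python
# Hypothetical (x, y) pairs, made up for illustration; not the lecture's data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.8, 4.1, 5.2, 8.3, 9.1, 12.4]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares estimates of the slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]

# Points to the mean, points to the line, and line to the mean.
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_regression = sum((p - mean_y) ** 2 for p in predicted)

print(f"total - residual = {ss_total - ss_residual:.6f}")
print(f"regression       = {ss_regression:.6f}")
```

The identity relies on the line being the least squares line; for an arbitrary line through the data the two printed numbers would generally differ.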

Computing a p-value for model comparison
• To compute a p-value for the comparison of models, we look at both the sum of squares for each model and the degrees of freedom for each model
• The number of degrees of freedom is the number of data points, minus the number of parameters in the model
• We had 13 data points, so there are 12 degrees of freedom for the null hypothesis model, and 11 degrees of freedom for the linear model

Mean squares and F-ratio
• The same data presented in the format of an ANOVA (we will see this later)
• “Total” represents the total variation in the data
• “Random” is the variation in the data around the regression line
• “Regression” is the difference between them: the sum of squares of distances from the regression line to the mean
• The “mean square” is the sum of squares divided by the degrees of freedom
• The F-ratio is the ratio of the “Regression” mean square to the “Random” mean square
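
The table from the original slide is not available in this transcript, but its entries can be recomputed from the sums of squares and the number of data points quoted earlier. The sketch below does that; the row labels follow the bullets above, and the layout of the printed table is an assumption.

```python
# Quantities quoted in the lecture for the insulin sensitivity example.
n_points = 13
ss_total = 155_642.3
ss_random = 63_361.37

# Degrees of freedom: data points minus parameters in each model.
df_total = n_points - 1    # null model has one parameter (the mean)
df_random = n_points - 2   # linear model has two (intercept and slope)
df_regression = df_total - df_random

ss_regression = ss_total - ss_random

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_random = ss_random / df_random

# F-ratio = regression mean square / random mean square.
f_ratio = ms_regression / ms_random

print(f"{'Source':<12}{'SS':>12}{'df':>4}{'MS':>12}")
print(f"{'Regression':<12}{ss_regression:>12.2f}{df_regression:>4}{ms_regression:>12.2f}")
print(f"{'Random':<12}{ss_random:>12.2f}{df_random:>4}{ms_random:>12.2f}")
print(f"{'Total':<12}{ss_total:>12.2f}{df_total:>4}")
print(f"F-ratio = {f_ratio:.2f}")
```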

Computing a p-value
• The null hypothesis is that the “horizontal line model” is the correct model
• i.e. the slope in the regression model is zero
• If the null hypothesis were true, the F-ratio would be close to 1 (this is not obvious!)
• The distribution of values of the F-ratio, assuming the null hypothesis is true, is a known distribution
• Called the F-distribution
• depends on two different degrees of freedom
• so a p-value can be computed
• The p-value in this example is p=0.0021
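
A p-value of this size can be recovered from the sums of squares using only the standard library. The sketch below relies on a special case rather than a general F-distribution routine: when the numerator has 1 degree of freedom, the F-ratio is the square of a t statistic with the denominator's degrees of freedom, so the tail area can be found by integrating the t density numerically. The function names and the use of Simpson's rule are choices made for this example; a statistics package would normally be used instead.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def f_p_value_one_numerator_df(f_ratio, df_denominator, steps=2000):
    """P(F >= f_ratio) for an F-distribution with (1, df_denominator) df.

    Uses the fact that F with 1 numerator df is the square of a t variable,
    so the answer is the two-sided tail area of t beyond sqrt(f_ratio).
    `steps` must be even (Simpson's rule).
    """
    t = math.sqrt(f_ratio)
    h = t / steps
    total = t_pdf(0.0, df_denominator) + t_pdf(t, df_denominator)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df_denominator)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


# F-ratio recomputed from the sums of squares quoted in the lecture.
f_ratio = (155_642.3 - 63_361.37) / (63_361.37 / 11)
p_value = f_p_value_one_numerator_df(f_ratio, 11)

print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")
```

The result should come out at about 0.002, in line with the p = 0.0021 quoted above.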

Recap
• We re-examined the linear regression example and re-cast it as a comparison of statistical models
• Can compute a p-value for the null hypothesis that the simpler model is “correct”
• “As correct as the more complex model”
• This is the same p-value we computed before
• The R² value is the proportion of variance “explained by” the regression
• We can do the same for other statistical tests!

A t-test considered as a comparison of models
• Recall the GRHL2 expression in Basal-A and Basal-B cancer cells
• We can re-cast this as a linear regression…
• Let x=0 for Basal A cells and x=1 for Basal B cells
• Our linear model is: Expression = β₀ + β₁ × x + ε, with the null hypothesis Expression = β₀ + ε
• What is β₁?
• Slope = increase in expression for increase in one unit of x
• = difference in expression between Basal A and Basal B
• = difference in means…
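
The sketch below illustrates this with made-up numbers: when x is coded 0 for one group and 1 for the other, the least squares intercept comes out as the mean of the x=0 group and the slope as the difference in group means. The expression values are hypothetical (the GRHL2 data are not reproduced here), and the variable names are chosen only for this example.

```python
import statistics

# Hypothetical expression values, made up for illustration.
basal_a = [2.1, 1.8, 2.4, 1.9]
basal_b = [0.2, 0.1, -0.3, 0.4, 0.1]

# Code the group as x = 0 (Basal A) or x = 1 (Basal B).
xs = [0] * len(basal_a) + [1] * len(basal_b)
ys = basal_a + basal_b

# Closed-form least squares fit of Expression = b0 + b1 * x.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

mean_a = statistics.mean(basal_a)
mean_b = statistics.mean(basal_b)

print(f"intercept = {intercept:.4f}   mean of Basal A     = {mean_a:.4f}")
print(f"slope     = {slope:.4f}   mean B minus mean A = {mean_b - mean_a:.4f}")
```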

t-test as a comparison of models

Results of running the t-test as a comparison of models

Running the linear regression gives estimates of the intercept of 1.933 and slope of -1.861

The table of variances is

Interpreting the table of variances
• The total sum of squares (33.753) is the sum of squares of the differences between each value and the overall mean
• This, divided by the df (33.753/26 = 1.298), is the sample variance
• The residual sum of squares is the sum of the squares of each expression value minus its predicted value
• The predicted value is just the mean for its basal type
• This is the “within group” variation (a sum of squares, not yet divided by its degrees of freedom)
• The regression sum of squares is the sum of squares of the differences between predicted values and the overall mean
• This is the sum of squares of the differences between the group means and the overall mean
• One squared difference for each data point
• These interpretations will be really useful to consider when we study ANOVA
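
These three sums of squares can be computed for any two-group data set. The sketch below uses hypothetical values (the group data are made up) to show that the within-group and between-group pieces add up to the total, and that the total divided by its degrees of freedom is the ordinary sample variance. The last lines repeat the 33.753/26 calculation quoted above.

```python
import statistics

# Hypothetical expression values for two groups, made up for illustration.
group_a = [2.1, 1.8, 2.4, 1.9]
group_b = [0.2, 0.1, -0.3, 0.4, 0.1]
values = group_a + group_b

overall_mean = statistics.mean(values)
mean_a = statistics.mean(group_a)
mean_b = statistics.mean(group_b)

# Total: each value minus the overall mean.
ss_total = sum((v - overall_mean) ** 2 for v in values)

# Residual ("within group"): each value minus its own group mean,
# which is the value the model predicts for it.
ss_residual = sum((v - mean_a) ** 2 for v in group_a) + sum(
    (v - mean_b) ** 2 for v in group_b
)

# Regression ("between group"): group mean minus overall mean,
# counted once for each data point in the group.
ss_regression = (
    len(group_a) * (mean_a - overall_mean) ** 2
    + len(group_b) * (mean_b - overall_mean) ** 2
)

sample_variance = ss_total / (len(values) - 1)

print(f"total      = {ss_total:.4f}")
print(f"residual   = {ss_residual:.4f}")
print(f"regression = {ss_regression:.4f}")
print(f"total / df = {sample_variance:.4f}")

# The lecture's own numbers: total sum of squares 33.753 on 26 df.
lecture_variance = 33.753 / 26
print(f"lecture sample variance = {lecture_variance:.3f}")
```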
