Some statistical basics

Some statistical basics Marian Scott

Why bother with Statistics? We need statistical skills to: • Make sense of numerical information • Summarise data • Present results (graphically) • Test hypotheses • Construct models

Variables: number and type • Univariate: one variable of interest is measured on the individuals in the sample. We may ask: • What is the distribution of the results? This can be broken down into questions about the mean (average) value of the variable and the scatter (variability) in the results.

Bivariate • Bivariate: two variables of interest are measured on each member of the sample. We may ask: • How are the two variables related? • If one variable is time, how does the other variable change? • How can we model the dependence of one variable on the other?

Multivariate • Multivariate: many variables of interest are measured on the individuals in the sample. We might ask: • What relationships exist between the variables? • Is it possible to reduce the number of variables but still retain 'all' the information? • Can we identify any grouping of the individuals on the basis of the variables?

Data types • Numerical: a variable may be either continuous or discrete. • For a discrete variable, the values taken are whole numbers (e.g. number of chromosome abnormalities, numbers of eggs). • For a continuous variable, values taken are real numbers (positive or negative and including fractional parts) (e.g. blood lead level, alkalinity, weight, temperature).

Categorical • Categorical: a limited number of categories or classes exist, and each member of the sample belongs to one and only one of them, e.g. sex is categorical. • Sex is a nominal categorical variable since the categories are unordered. • Dose of a drug or level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable since the classes are ordered.

Inference and Statistical Significance • Inference: from the sample to the population • Is the sample representative? • Is the population homogeneous? • Since only a sample has been taken from the population, we cannot be 100% certain • Significance testing

Hypothesis Testing II • Null hypothesis: usually ‘no effect’ • Alternative hypothesis: ‘effect’ • Make a decision based on the evidence (the data) • There is a risk of getting it wrong! • Two types of error: • Type I: reject the null hypothesis when it is true • Type II: fail to reject the null hypothesis when it is false

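Significance Levels • We cannot reduce the probabilities of both Type I and Type II errors to zero. • So we control the probability of a Type I error. • This probability is referred to as the significance level. The p-value is calculated from the data and compared with it: we reject the null hypothesis when the p-value falls below the significance level. • Generally a significance level of 0.05 is considered a reasonable risk of a Type I error (beyond reasonable doubt).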
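A minimal simulation sketch of what controlling the Type I error probability means. It assumes a two-sided z-test on Normal data with a known standard deviation, which is a choice made here for illustration and is not taken from the slides; the function names, sample size and number of trials are hypothetical. When the null hypothesis is true, the proportion of rejections should be close to the chosen significance level.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: population mean == mu0 (sigma assumed known)."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type_i_error_rate(alpha=0.05, n=20, trials=5000, seed=1):
    """Fraction of simulated experiments that reject a null that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Data are generated with mean 0, so the null hypothesis holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print(f"Observed Type I error rate: {type_i_error_rate():.3f}")
```

Lowering alpha in this sketch lowers the rejection rate under the null, which is the sense in which the significance level is under the analyst's control.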

Statistical Significance vs. Practical Importance • Statistical significance is concerned with the ability to discriminate between treatments given the background variation. • Practical importance relates to the scientific domain and is concerned with scientific discovery and explanation.

Power • Power is related to the probability of a Type II error: power = 1 - probability of making a Type II error • Aim: to keep power as high as possible

Sample size calculations • What is the objective of the experiment? • How much of a difference is it important to be able to detect (the effect size)? • At what significance level do you want to conduct the test? (Decreasing the significance level reduces power.) • What is the power of the experiment (what is the probability that you will detect such a difference when it actually exists)? • How variable is the population? Greater variation needs a larger sample size to achieve the same power.
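The questions above can be tied together in a rough calculation. The sketch below assumes a two-sided comparison of two group means with a common, known standard deviation and uses the usual Normal approximation n = 2((z_alpha + z_power) * sd / effect)^2 per group; this specific design and the function names are assumptions for illustration, not something stated in the slides.

```python
import math
from statistics import NormalDist


def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group (Normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


def approx_power(n, effect, sd, alpha=0.05):
    """Approximate power with n per group (ignores the far rejection tail)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect / (sd * math.sqrt(2 / n)) - z_alpha)


if __name__ == "__main__":
    # Hypothetical inputs: detect a difference of 1 unit when sd is 2 units.
    n = n_per_group(effect=1.0, sd=2.0)
    print(n, round(approx_power(n, 1.0, 2.0), 3))
```

Evaluating approx_power over a range of n or effect sizes gives the points of a power curve; a t-based calculation would give slightly larger sample sizes.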

Power Curves

Modelling continuous variables-checking Normality • Normal density function and histogram • Check for symmetry • Other possibility-Normal probability plot

Modelling continuous variables-checking Normality • Normal probability plot • Should show a straight line • p-value of test is also reported (null: data are Normally distributed)
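A minimal sketch of how the points of a Normal probability plot can be computed: the ordered data are paired with the Normal quantiles expected at the same ranks. The plotting positions (i + 0.5) / n and the use of the correlation of the points as a straightness summary are assumptions made here; statistics packages differ in these details and in the formal test they report. The two samples are simulated, not real data.

```python
import random
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Pairs (theoretical Normal quantile, ordered observation)."""
    n = len(data)
    ordered = sorted(data)
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, ordered))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def straightness(data):
    """Correlation of the plot points; values near 1 suggest a straight line."""
    q, obs = zip(*normal_plot_points(data))
    return pearson(q, obs)


rng = random.Random(0)
normal_sample = [rng.gauss(10.0, 2.0) for _ in range(200)]   # simulated
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]   # simulated

if __name__ == "__main__":
    print(straightness(normal_sample), straightness(skewed_sample))
```

The skewed sample bends away from the line, so its correlation is lower; the plot itself remains the more informative check.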

Statistical inference • Hypothesis testing and the p-value • Statistical significance vs real-world importance • Confidence intervals

Confidence intervals: an alternative to hypothesis testing • A confidence interval is a range of credible values for the population parameter. The confidence coefficient is the percentage of times that the method will, in the long run, capture the true population parameter. • A common form is: sample estimator ± 2 × estimated standard error
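A short sketch of the "estimate plus or minus two standard errors" interval for a population mean, together with a simulation of the long-run capture rate. The population (Normal with mean 10 and standard deviation 2), the sample size and the function names are hypothetical choices made for illustration.

```python
import random
from statistics import mean, stdev


def approx_ci(sample):
    """Approximate 95% interval: sample mean +/- 2 * estimated standard error."""
    n = len(sample)
    estimate = mean(sample)
    std_error = stdev(sample) / n ** 0.5
    return estimate - 2 * std_error, estimate + 2 * std_error


def coverage(true_mean=10.0, sd=2.0, n=30, trials=2000, seed=1):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        low, high = approx_ci(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"Long-run capture rate: {coverage():.3f}")
```

The capture rate should sit close to 95%; the multiplier 2 is a rounded Normal quantile, so for small samples a t-based multiplier is the more careful choice.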

Statistical models • Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’. • Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.

Statistical models • In experiments many of the covariates have been determined by the experimenter but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses. • In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.

Specifying a statistical model • Models specify the way in which outcomes and causes link together, e.g. • Metabolite = Temperature • The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side, giving the formula: • Metabolite = Temperature + Error

statistical model interpretation • Metabolite = Temperature + Error • The outcome Metabolite is explained by Temperature and other things that we have not recorded which we call Error. • The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has so that we can say whether or not the Metabolite that we observe is explained by Temperature.

Correlations and linear relationships • Strength of linear relationship • Simple indicator lying between –1 and +1 • Check your plots for linearity

gene correlations

Interpreting correlations • The correlation coefficient is a measure of the strength of the linear association between two variables. • If the relationship is non-linear, the coefficient can still be evaluated and may appear sensible, so beware: plot the data first.
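A small sketch of why the plot matters. Both data sets below are constructed for illustration: one is exactly linear, the other is an exact quadratic relationship on values symmetric about zero. The helper name pearson is hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


x = list(range(-5, 6))            # -5, -4, ..., 5
y_linear = [2 * v + 1 for v in x]  # perfectly linear in x
y_curved = [v ** 2 for v in x]     # perfectly determined by x, but not linearly

if __name__ == "__main__":
    print(pearson(x, y_linear), pearson(x, y_curved))
```

The quadratic case has a correlation of essentially zero even though y is completely determined by x, so a coefficient near zero does not mean "no relationship".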

Simple regression model • The basic regression model assumes: • The average value of the response x is linearly related to the explanatory variable t. • The spread of the response x about the average is the SAME for all values of t. • The VARIABILITY of the response x about the average follows a NORMAL distribution for each value of t.

Simple regression model • The model is typically fitted using least squares • Goodness of fit of the model is assessed using the residual sum of squares and R² • Assumptions are checked using residual plots • Inference about model parameters is carried out using hypothesis tests or confidence intervals
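The fitting steps can be sketched as follows, using the notation of the previous slide (response x, explanatory variable t). The data values are made up for illustration and the function name fit_line is hypothetical; in practice a statistics package would do this and also report standard errors.

```python
from statistics import mean


def fit_line(t, x):
    """Least-squares fit of x = intercept + slope * t.

    Returns (intercept, slope, r_squared, residuals).
    """
    t_bar, x_bar = mean(t), mean(x)
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_tx = sum((ti - t_bar) * (xi - x_bar) for ti, xi in zip(t, x))
    slope = s_tx / s_tt
    intercept = x_bar - slope * t_bar
    residuals = [xi - (intercept + slope * ti) for ti, xi in zip(t, x)]
    rss = sum(r ** 2 for r in residuals)          # residual sum of squares
    tss = sum((xi - x_bar) ** 2 for xi in x)      # total sum of squares
    return intercept, slope, 1 - rss / tss, residuals


# Hypothetical measurements, not from the presentation.
t = [1, 2, 3, 4, 5]
x = [2.1, 3.9, 6.2, 7.8, 10.1]

if __name__ == "__main__":
    intercept, slope, r2, residuals = fit_line(t, x)
    print(intercept, slope, r2)
    print(residuals)  # plot these against t to check the assumptions
```

Plotting the residuals against t (or against the fitted values) is the check on constant spread; a histogram or Normal probability plot of the residuals addresses the Normality assumption.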

statistical model interpretation • The traditional ‘statistical tests’ such as t-tests, ANOVA, ANCOVA and regression are each special cases of a more general type of model, each making a number of assumptions: • t-tests work where there are two groups, • ANOVA works with categorical explanatory variables, • regression assumes that explanatory variables are continuous. • Our explanatory variables are often not like this: they are mixtures of continuous and categorical, so we need a more flexible approach, the G(eneral) L(inear) M(odel).
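One way to see the "special case" claim is that a pooled two-sample t-test gives the same answer as a regression of the response on a 0/1 group indicator. The sketch below checks this numerically; the two groups of measurements are made up for illustration and the function names are hypothetical.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Classical two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / (sp2 * (1 / na + 1 / nb)) ** 0.5


def dummy_regression(a, b):
    """Regress the response on a 0/1 group indicator; return (slope, t)."""
    g = [0] * len(a) + [1] * len(b)   # 0 = first group, 1 = second group
    y = list(a) + list(b)
    n = len(y)
    g_bar, y_bar = mean(g), mean(y)
    s_gg = sum((gi - g_bar) ** 2 for gi in g)
    s_gy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
    slope = s_gy / s_gg
    intercept = y_bar - slope * g_bar
    rss = sum((yi - intercept - slope * gi) ** 2 for gi, yi in zip(g, y))
    se_slope = (rss / (n - 2) / s_gg) ** 0.5
    return slope, slope / se_slope


# Hypothetical data for two groups.
group_a = [12.1, 9.8, 11.4, 10.6, 12.9]
group_b = [14.2, 13.1, 15.0, 12.7, 14.8, 13.9]

if __name__ == "__main__":
    print(pooled_t(group_a, group_b))
    print(dummy_regression(group_a, group_b))
```

The slope equals the difference between the group means and the two t statistics agree, which is why one model syntax can cover both analyses.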

General linear models • General Linear Models (GLMs) are a comprehensive set of techniques that cover a wide range of analyses. Problems that make use of a number of specific techniques may be specified as GLM problems using a unified specification called a Model Syntax. The form of the Model Syntax varies a little from statistics package to statistics package, but is essentially just a way of unambiguously specifying the relationship between variables (categorical or continuous).

Examples
• Example: comparing the effect of burning and clipping on bracken. Traditional test: two-sample t-test. GLM word equation: SHOOTS = MANAGEMENT
• Example: comparing the effect of two different drugs with a placebo. Traditional test: one-way analysis of variance. GLM word equation: EFFECT = DRUG
• Example: comparing the yield between fertilisers, conducting the experiment in several fields. Traditional test: one-way analysis of variance with blocking. GLM word equation: YIELD = FIELD + FERTILISER
• Example: investigating the relationship between height and weight in people. Traditional test: regression. GLM word equation: WEIGHT = HEIGHT
• Example: investigating the relationship between oxygen consumption and weight in scampi, taking level of activity into account. Traditional test: analysis of covariance, with emphasis on regression. GLM word equation: OXYGEN = WEIGHT + ACTIVITY or, under different assumptions (an interaction between the terms), OXYGEN = WEIGHT | ACTIVITY

summary • hypothesis tests and confidence intervals are used to make inferences • we build statistical models to explore relationships and explain variation • the modelling framework is a general one – general linear models, generalised additive models • assumptions should be checked.
