Course in Statistics and Data analysis

Course in Statistics and Data Analysis, Course B, September 2009, Stephan Frickenhaus

Outline theses
My experience is: many young researchers lack knowledge of analysis tools, so producing/sampling data is not the problem, but analysing it becomes a problem right before publication. Once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods and concepts may still be missing. This course tries to tackle both …

Schedule
Day 1: 8.9., 10:00 - 16:00, Room E4005
• The probability distribution, the p-value concept, statistical tests in R
Day 2: 9.9., 10:00 - 16:00, Room E4005
• Multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data
• Cluster analysis (maybe as start of Day 3)
Day 3: 10.9., 10:00 - 16:00, Glaskasten F
• User-driven, interactive: bring your project data and we work on it

Contents / Setup
• Tool-based course (program "R")
• Install "R" from www.r-project.org
• Exploring data analysis
  • Graphically
  • Numerically
• Exploring what significance really is
• Statistical tests no longer as black boxes

DAY 1 – Lecture part I
• For each type of data we have different methods of analysis. Give examples!
Type of data: examples
• Numerical (metric) data: linear, e.g., length in cm; circular, e.g., angle in degrees
• Nominal (class) data: sex, colour, species
• Ordinal (ranked) data: age group, school class, phase in cell division

First steps from data …
• Metric (linear: length in cm; circular: angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot
• Nominal (sex, colour, species): count in a table, barplot, pie chart
• Ordinal (age group, school class, phase in cell division): count in a table, with an axis, barplot

… to methods
• Metric: plot in a co-ordinate system (scatter plot), histogram, boxplot; then check for groups, trends, correlations
• Nominal: count in a table, barplot, pie chart; then check for differences, ratios
• Ordinal: count in a table, with an axis, barplot; then check for differences, ratios, relation to order
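A minimal sketch of these first steps in R. All values and the names len_cm, species and age_group are made up for illustration and are not course data.

```r
# Made-up example data (hypothetical values, for illustration only)
len_cm    <- c(12.1, 13.4, 11.8, 15.2, 14.9, 12.7)             # metric
species   <- factor(c("A", "B", "A", "C", "B", "A"))           # nominal
age_group <- factor(c("young", "old", "adult", "young", "adult", "young"),
                    levels = c("young", "adult", "old"),
                    ordered = TRUE)                              # ordinal

# Metric data: numerical summary, histogram, boxplot
summary(len_cm)
hist(len_cm)
boxplot(len_cm)

# Nominal data: count in a table, then barplot or pie chart
species_counts <- table(species)
barplot(species_counts)
pie(species_counts)

# Ordinal data: the counts keep the order of the levels along the axis
age_counts <- table(age_group)
barplot(age_counts)
```

Declaring the ordinal variable as an ordered factor is what keeps "young", "adult", "old" in that order in the table and on the barplot axis.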

…to combinations of data
• Metric + metric: X-Y plots
• Nominal + metric: class = color in scatter plot; check for groups/clusters
• Ordinal + metric + metric: X-Y plot with colors = class
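As an illustration of "class = color in scatter plot", here is a small sketch that uses the iris data set shipped with R. The choice of data set and of the two plotted columns is an assumption made for this example only.

```r
# iris ships with R (datasets package): four metric columns, one nominal column
class_colors <- as.integer(iris$Species)     # map each class to a colour number

# X-Y plot of two metric variables, coloured by the nominal class
plot(iris$Sepal.Length, iris$Petal.Length,
     col  = class_colors,
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 1)
```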

…towards models: multivariate data
• Organize data in tables
• Keep data of the same measurement in ONE row
• Distinguish groups in an extra column by nominal data
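Such a table could look as follows in R. This is only a sketch: the column names length_cm, weight_g and site and all values are hypothetical.

```r
# Hypothetical measurements: one row per measured individual
d <- data.frame(
  length_cm = c(12.1, 13.4, 11.8, 15.2),
  weight_g  = c(30.5, 35.1, 29.0, 41.3),
  site      = factor(c("north", "north", "south", "south"))  # nominal group column
)

str(d)   # check that each column has the intended type

# Mean length per group, using the group column
group_means <- aggregate(length_cm ~ site, data = d, FUN = mean)
group_means
```

Keeping the group as its own column (instead of one table per group) lets functions such as aggregate() split the data by group directly.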

Before discussing what we can do with such a table, let's take the first steps in the tool R!

Start Practice with R www.r-project.org http://ftp5.gwdg.de/pub/misc/cran/

Lecture part II
• What if the summary of the data is not enough? E.g., we want to say whether an observed mean value is probably greater than 0.5.
• It is not enough to conclude "We clearly find mean(x)<mean(y)", because this may be an outcome of small sample sizes; in reality the means may be equal, and there may be no effect at all.
• We must define some terms to learn how to be more quantitative about such statements, like "with 1% error we can exclude that x and y are from the same population".

Some terms…
• Population:
  • All individuals of the kind measured
  • If we measure them all, we know exactly the mean value etc., i.e., the true mean
  • Sometimes the whole population is not accessible
  • Sometimes we think of it as having infinitely many individuals
• Sample:
  • A subset of individuals from a population
  • It has, e.g., a sample mean, which is in general not equal to the true mean (the mean of the population)
• Sample size: number of individuals picked
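The difference between the true mean and a sample mean can be tried out in R. The sketch below assumes a made-up finite population of 10000 normally distributed values; the numbers are not from the course.

```r
set.seed(1)                                   # make the example reproducible

# Hypothetical finite population (assumption: normal, mean 5, sd 2)
population <- rnorm(10000, mean = 5, sd = 2)
true_mean  <- mean(population)                # known only because we "measured" all

# A sample: a subset of individuals picked from the population
x <- sample(population, size = 10)
sample_mean <- mean(x)

c(true_mean = true_mean, sample_mean = sample_mean)
```

Re-running the last lines without set.seed() gives a different sample mean each time, while the true mean stays fixed.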

…more terms, for real-valued variables X
• Probability density function p(x): its integral over the interval [a,b] is the probability to pick a sample xi from X in [a,b]
• Cumulative distribution function cdf(a): the probability to pick an x below a
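Both terms can be checked numerically in R. The sketch assumes a standard normal distribution for X and arbitrary interval limits a and b, chosen only for illustration.

```r
a <- -1
b <- 2

# Pr(a <= x <= b) as the integral of the density p(x) = dnorm(x)
p_integral <- integrate(dnorm, lower = a, upper = b)$value

# The same probability as a difference of cdf values
p_cdf <- pnorm(b) - pnorm(a)

c(p_integral = p_integral, p_cdf = p_cdf)

# The density integrates to 1 (100%) over the full range of X
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total
```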

Probability density function p(x)
[Figure: p(x) plotted against x, with an interval from a to b marked]
• The full range of X makes 100%
• p(x) >= 0
• Need not be symmetric!

Cumulative distribution function cdf(x)
[Figure: cdf(x) plotted against x from min(X) to max(X), rising to 1]
• cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X, where p drops to 0.
• cdf is monotonically increasing, because it integrates a p ≥ 0.

Mean E and standard deviation S
[Figure: p(x) plotted against x, with E(X) and S(X) marked]
• E(X) need not be at the maximum of p(x)
• S(X) measures the width of p(x), i.e., the scattering of x around E(X).

Long-tail distributions
[Figure: p(x) plotted against x, with a long right tail]
• Some rare samples will have very large values x!
• When we have few samples, we may pick none of these rare values!
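The effect of a long tail on small samples can be sketched in R. The log-normal distribution, its parameters and the threshold 20 are assumptions chosen for illustration only.

```r
set.seed(2)

# Log-normal as an assumed example of a long-tailed distribution
big <- rlnorm(100000, meanlog = 0, sdlog = 1.5)

mean(big)      # pulled upwards by rare, very large values
median(big)    # clearly smaller than the mean

# Fraction of values above the (arbitrary) threshold 20
frac_large <- mean(big > 20)

# Chance that an independent sample of size 5 contains no value above 20
p_none <- (1 - frac_large)^5
p_none
```

A small sample that misses the rare large values will tend to underestimate the mean of such a distribution.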

What is a statistical test?
• Example: We have a sample x of size 6.
• How probable is it that the mean of the sample x is between 2 and 2.5, although E(X)=0?
• To answer this, either:
  • 1) we repeat taking samples of size 6 many times and count how often this happens (may be too expensive), or
  • 2) we make an assumption about the probability density of X and then integrate the distribution of the statistic mean(x) to measure Pr(2<mean(x)<2.5)
• LATER: Can I check what the pdf of X is?
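Both approaches can be sketched in R. The example assumes that X is standard normal (mean 0, sd 1), which the slide does not state; the number of repetitions is also an arbitrary choice.

```r
set.seed(3)
N    <- 6        # sample size
reps <- 100000   # number of repeated samples

# Approach 1: repeat sampling and count how often 2 < mean(x) < 2.5
means <- replicate(reps, mean(rnorm(N, mean = 0, sd = 1)))
p_sim <- mean(means > 2 & means < 2.5)

# Approach 2: under the normal assumption, mean(x) is itself normal
# with standard deviation 1/sqrt(N), so we can integrate its distribution
se <- 1 / sqrt(N)
p_exact <- pnorm(2.5, mean = 0, sd = se) - pnorm(2, mean = 0, sd = se)

c(p_sim = p_sim, p_exact = p_exact)
```

Under this assumption the event is so rare that even 100000 repetitions will usually count no hit at all, which illustrates why approach 1 may be too expensive.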

…influence of sample size on the mean
• Repeat a sampling from X with sd(X)=1.0 at different sizes N
• Take the sample means
• How do the repeated means vary (standard deviation)?
• Result…
• For high N, sd(mean) goes as 1/sqrt(N) (central limit theorem)
• How for low N??? It is given by the t-statistic t = mean(x)/(sd(x)/sqrt(N)), which depends on the sample size N.
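The repeated-sampling experiment described above can be sketched in R as follows. It assumes a normal X with mean 0 and sd 1; the sample sizes and the number of repetitions are arbitrary choices.

```r
set.seed(4)
sizes <- c(2, 5, 10, 50, 200)   # sample sizes N to compare
reps  <- 5000                   # repeated samples per size

# For each N: draw many samples, take their means, measure the spread
sd_of_means <- sapply(sizes, function(N) {
  sd(replicate(reps, mean(rnorm(N, mean = 0, sd = 1))))
})

# Compare the simulated spread with sd(X)/sqrt(N)
data.frame(N = sizes, simulated = sd_of_means, theory = 1 / sqrt(sizes))
```

In practice sd(X) is unknown and has to be estimated from the same small sample, which is where the t-statistic comes in.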

A first test: Test the influence of sample size
• How do I know how many samples I need to make a correct statement about the mean, like E(X)≥0.89?
• "Correct" is to be quantified as the "type-I error": How probable is it that I see the same or a more extreme value by chance alone, i.e., although the population mean is 0?
Concept of the Null-Hypothesis
• How sure can I be in excluding that the population mean is zero when I find a sample mean of m=0.89?
• So we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g., the normal distribution, with E(X)=0.
• To evaluate this Pr, we need a test statistic t for it and a distribution pdf(t) to integrate for Pr.

T-statistics
• T has a complicated mathematical form; its graph is similar to a bell-shaped curve.
• For small sample size N it has longer tails (green)
[Figure: density of T; the blue area is Pr(T<3), the remaining area is Pr(T>=3)]
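The longer tails for small samples can be read off numerically with R's pt(). The chosen degrees of freedom are arbitrary examples; the comparison with the normal distribution is added here for illustration.

```r
df_values <- c(1, 2, 5, 30)            # degrees of freedom = sample size - 1

tail_t    <- 1 - pt(3, df = df_values) # Pr(T >= 3) for each df
tail_norm <- 1 - pnorm(3)              # same tail for the normal distribution

data.frame(df = df_values, Pr_T_ge_3 = tail_t)
tail_norm
```

The tail probability shrinks as df grows and approaches the normal value from above.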

T is known in R
• Test for the sample x=c(1,2), i.e., n=2; the degrees of freedom are df = sample size - 1 = 1
• Upper boundary 3? t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.414 = 3.0
• Pr(t<3) for n=2: pt(3,df=1) is about 0.90, so ~90 of 100 repeated samples will give a t below 3
• 1-pt(3,df=1) = 0.1024164 is the chance to have a t of 3 or greater, e.g., mean(x) greater than 1.5 at the same sd(x) (remember, N=2), under the assumption that x is drawn from a population with mean 0!
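The numbers on this slide can be reproduced step by step in R. The variable names are chosen for this sketch only.

```r
x <- c(1, 2)
N <- length(x)

# t-statistic for the null hypothesis "population mean is 0"
t_value <- mean(x) / (sd(x) / sqrt(N))   # 1.5 / (0.7071 / 1.4142)

p_below <- pt(t_value, df = N - 1)       # Pr(T < t_value)
p_above <- 1 - p_below                   # Pr(T >= t_value)

c(t = t_value, p_below = p_below, p_above = p_above)
```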

Now the test itself:
• We have a sample of size 2
• The Null-Hypothesis: our sample is from a population with mean 0.
• The test that checks this is in R…
[Figure: R screenshot with the annotation "Ignore this 0"]
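The transcript does not preserve which R call the slide showed. A plausible candidate, stated here as an assumption, is the one-sample t-test t.test() from R's stats package, applied to the sample x=c(1,2) used before.

```r
x <- c(1, 2)

# Two-sided test of the null hypothesis "population mean is 0" (the default)
res_two_sided <- t.test(x, mu = 0)

# One-sided test: alternative "population mean is greater than 0"
res_greater <- t.test(x, mu = 0, alternative = "greater")

res_two_sided$statistic   # the t value
res_two_sided$p.value     # counts both tails
res_greater$p.value       # counts the upper tail only, i.e. 1 - pt(t, df = 1)
```

The one-sided p-value matches the tail probability computed by hand with pt(); the default two-sided p-value is twice as large because it counts both tails.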
