Week 1 Review of basic concepts in statistics handout available at http://homepages.gold.ac.uk/aphome

Trevor Thompson 30-9-2007

Review of following topics: • Population vs. sample • Measurement scales • Plotting data • Mean & Standard deviation • Degrees of freedom • Transforming data • Normal distribution - Howell (2002) Chap 1-3. ‘Statistical Methods for Psychology’

Population vs. sample • Population - an entire collection of measurements • (e.g. reaction times, IQ scores, height or even height of male Goldsmiths students) • Sample – smaller subset of observations taken from population • sample should be drawn randomly to make inferences about population. Random assignment to groups improves validity

Population vs. sample • In general: • population parameters = Greek letters • sample statistics = English letters • worth learning a glossary of other symbols now to avoid later confusion (e.g. Σ = the sum of)

Measurement scales • Categorical or ‘Nominal’ • e.g. male/female, or catholic/protestant/other • Continuous • Ordinal - e.g. private/sergeant/admiral • Interval - e.g. temperature in Celsius • Ratio - e.g. weight, height etc.

Plotting data • Basic rule is to select plot which represents what you want to say in the clearest and simplest way • Avoid ‘chart junk’ (e.g. plotting in 3D where 2D would be clearer) • Popular options include bar charts, histograms, pie charts etc - see any text book. SPSS charts discussed in workshop

Summary statistics • Two essential components of data are: • (i) central tendency of the data (e.g. mean) & (ii) spread of the data (e.g. standard deviation) • Although mean (central tendency) and standard deviation (spread) are most commonly used, other measures can also be useful

Measures of central tendency • Mode • the most frequent observation: 1, 2, 2, 3, 4, 5 (mode = 2) • Median • the middle number of a dataset arranged in numerical order: 0, 1, 2, 5, 1000 (median = 2; average of the middle two numbers when there is an even number of scores) • relatively uninfluenced by outliers • Mean • the sum of the scores divided by the number of scores: M = ΣX / N
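The three measures can be checked on the slide's own example data. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

scores = [1, 2, 2, 3, 4, 5]      # mode example from the slide
skewed = [0, 1, 2, 5, 1000]      # median example from the slide (1000 is an outlier)

mode_value = statistics.mode(scores)       # most frequent observation
median_value = statistics.median(skewed)   # middle value of the sorted data
mean_value = statistics.mean(skewed)       # sum of scores divided by N

# With an even number of scores the median is the average of the middle two
median_even = statistics.median([0, 1, 2, 5])

print(mode_value, median_value, mean_value, median_even)
```

Note how the single outlier (1000) pulls the mean far above the median, which is the sense in which the median is relatively uninfluenced by outliers.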

Measures of dispersion • Several ways to measure spread of data: • Range (max-min), IQR or Inter-Quartile Range (middle 50%), Average Deviation, Mean Absolute Deviation • Variance – average of the squared deviations from the mean • Variance for a population of 3 scores (-10, 0, 10) is 66.67 (200/3) • Standard deviation is simply the square root of the variance
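The variance example for the three scores can be worked through step by step. The following is a small sketch in plain Python; the variable names are chosen here for illustration.

```python
import math

scores = [-10, 0, 10]            # the population of 3 scores from the slide
n = len(scores)
mu = sum(scores) / n             # population mean (0 here)

# Range: largest score minus smallest score
data_range = max(scores) - min(scores)

# Mean absolute deviation: average distance from the mean, ignoring sign
mean_abs_dev = sum(abs(x - mu) for x in scores) / n

# Population variance: average of the squared deviations (200 / 3)
pop_variance = sum((x - mu) ** 2 for x in scores) / n

# Standard deviation: square root of the variance
pop_sd = math.sqrt(pop_variance)

print(data_range, round(mean_abs_dev, 2), round(pop_variance, 2), round(pop_sd, 2))
```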

Calculating sample variance • Population variance (σ²) is the true variance of the population, calculated by σ² = Σ(X - μ)² / N • this equation is used when we have all values in a population (unusual) • However, the variance of a sample, if calculated around the sample mean with N as the divisor, tends to be smaller than that of the population from which it was drawn. So, for the sample variance (s²) we use this equation: s² = Σ(X - M)² / (N - 1) • The correction factor of ‘N-1’ increases the variance to be closer to the true population variance (in fact, the average of all possible sample variances exactly equals σ²)
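The claim that the average of all possible sample variances equals the population variance can be checked by brute force on a tiny population. The sketch below assumes samples of size 2 drawn with replacement from the scores (-10, 0, 10); the helper `variance` and its `ddof` argument are names introduced here for illustration.

```python
from itertools import product

population = [-10, 0, 10]

def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (N - ddof)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)

# True population variance: divide by N
pop_variance = variance(population, ddof=0)

# Every possible sample of size 2, drawn with replacement (9 samples in total)
samples = list(product(population, repeat=2))

# Average sample variance using the N-1 correction
mean_unbiased = sum(variance(s, ddof=1) for s in samples) / len(samples)

# Average sample variance dividing by N instead (no correction)
mean_biased = sum(variance(s, ddof=0) for s in samples) / len(samples)

print(round(pop_variance, 2), round(mean_unbiased, 2), round(mean_biased, 2))
```

Under these assumptions the N-1 version averages out to the population variance, while dividing by N gives a value that is too small.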

Degrees of freedom • Why is ‘N-1’ used to calculate sample variance? • When calculating sample variance, we first calculate the sample mean, which makes the last number in the dataset redundant, i.e. we lose a ‘degree of freedom’ (the last number is not free to vary) • e.g. M=10, sample data: 12, 9, 10, 11, 8. Calculating the sample mean (10) means that we have already (implicitly) included the last number in our calculations: given the mean and the first four numbers, the last number must be 8 • If we knew and used the population mean rather than the sample mean, this would not be the case, so we could use N rather than N-1 • Howell illustrates this with a worked example (and a mathematical proof can be found with an internet search) • Bottom line: whenever we have to estimate a parameter from the sample (e.g. the mean), we lose a degree of freedom
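The redundancy of the last number can be made concrete with the slide's sample data. This is a minimal sketch; the variable names are illustrative.

```python
data = [12, 9, 10, 11, 8]        # sample data from the slide
n = len(data)
m = sum(data) / n                # sample mean, M = 10

# Given the mean and the first N-1 scores, the last score is fully determined
recovered_last = n * m - sum(data[:-1])

# Deviations from the sample mean always sum to zero, so only N-1 are free to vary
deviations = [x - m for x in data]
sum_of_deviations = sum(deviations)

# Sample variance therefore divides by the N-1 degrees of freedom
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

print(recovered_last, sum_of_deviations, sample_variance)
```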

Transforming data • One reason we might ‘transform’ data is to convert from one scale to another • e.g. feet into inches, centigrade into Fahrenheit, raw IQ scores into standard IQ scores • Scale conversion can usually be achieved by a simple linear transformation (multiplying/dividing by a constant and adding/subtracting a constant): Xnew = b*Xold + c • So to convert centigrade data into Fahrenheit we would apply the following: Xnew = 1.8*Xold + 32 (i.e. b = 9/5, c = 32)
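A linear transformation of this kind takes only a line of code. The sketch below uses a hypothetical helper, `linear_transform`, with made-up temperature and height values to illustrate the centigrade-to-Fahrenheit and feet-to-inches conversions.

```python
def linear_transform(x_old, b, c):
    """Apply Xnew = b*Xold + c to a single score."""
    return b * x_old + c

# Centigrade to Fahrenheit: multiply by 1.8 (9/5) and add 32
celsius = [0, 37, 100]           # illustrative temperatures
fahrenheit = [linear_transform(x, b=1.8, c=32) for x in celsius]

# Feet to inches: multiply by 12, nothing to add
feet = [5, 6]                    # illustrative heights
inches = [linear_transform(x, b=12, c=0) for x in feet]

print([round(f, 1) for f in fahrenheit], inches)
```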

Transforming data • Z-transform (standardisation) is one common type of linear transform, which produces a new variable with M=0 & SD=1 • Z-scores: z = (X - M) / SD • Standardisation is useful when comparing the same dimension measured on different scales (e.g. anxiety scores measured on a VAS and a questionnaire) • After standardisation these scales could also be added together (adding two quantities on different scales is obviously problematic)
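Standardisation can be sketched as follows. The VAS and questionnaire scores are invented purely for illustration, `standardise` is a hypothetical helper, and the sample SD (N-1) is assumed as the divisor.

```python
import statistics

def standardise(values):
    """Convert raw scores to z-scores: z = (X - M) / SD."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)    # sample SD, i.e. N-1 in the denominator
    return [(x - m) / sd for x in values]

# Hypothetical anxiety scores for the same five people on two different scales
vas = [20, 55, 70, 35, 90]           # visual analogue scale, 0-100
questionnaire = [3, 5, 6, 4, 7]      # questionnaire total

z_vas = standardise(vas)
z_questionnaire = standardise(questionnaire)

# Once both are on the same (z) scale they can be added into a composite score
composite = [a + b for a, b in zip(z_vas, z_questionnaire)]

print(round(statistics.mean(z_vas), 6), round(statistics.stdev(z_vas), 6))
```

Each standardised variable has a mean of 0 and an SD of 1, so neither scale dominates the composite simply because of its units.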

Normal Distribution • Many real-life variables (height, weight, IQ etc.) are distributed like this [figure: bell-shaped normal curve] • A mathematical equation mimics this normal (or Gaussian) distribution

Normal Distribution • The mathematical normal distribution is useful as its known mathematical properties give us useful info about our real-life variable (assuming our real-life variable is normally distributed) • For example, scores more than about 2 standard deviations (more precisely, 1.96) above the mean make up the extreme 2.5% of scores (calculus is used to derive this) • Consequently, a person with an IQ score of 130 (M=100, SD=15), i.e. 2 SDs above the mean, would be in roughly the top 2.5% (assuming IQ is normally distributed)
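These proportions can be looked up numerically rather than derived by hand. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) and assumes IQ is exactly normal with M=100 and SD=15.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)    # assumed IQ distribution

# Proportion of people scoring above 130, i.e. more than 2 SDs above the mean
prop_above_130 = 1 - iq.cdf(130)

# z-score that cuts off exactly the top 2.5% of a standard normal distribution
z_cutoff = NormalDist().inv_cdf(0.975)

# The same cut-off expressed on the IQ scale
iq_cutoff = iq.inv_cdf(0.975)

print(round(prop_above_130, 4), round(z_cutoff, 2), round(iq_cutoff, 1))
```

Under these assumptions an IQ of 130 falls in about the top 2.3%, and the exact 2.5% cut-off sits just below 130, which is why "2 standard deviations" works as a convenient rule of thumb.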

Normal Distribution • Normality is an important assumption (though more about this next week). Violations of normality generally take two forms: • SKEWNESS (the distribution is asymmetrical) • KURTOSIS (the distribution is too peaked or too flat)
