1. Research in Clinical Psychology: Distributions & Transformations. Dr Rachel Msetfi
r.msetfi@lancaster.ac.uk Messy slide <> messy data
2. Learning outcomes You will be reminded of the importance of checking distributions before carrying out some statistical procedures
You will be able to examine distributions of data and test some of the assumptions of commonly used statistical tests
You will be able to apply appropriate transformations to the data where necessary
You will have a more in-depth understanding of the numbers that make up distributions and how they behave when transformations are applied
You will know how to choose what transformation to apply and when to try less standard procedures
You will have access to a step-by-step guide on how to do all of the above in SPSS!
You will know where and how to report transformations (where we want to end up)
3. Session aims Find out what you know already
Explore the nature of distributions
Consider assumptions of statistical tests
Testing the distribution
Group task
Transforming the distribution
Group tasks
Context and reporting
Review
4. Current state of affairs…
5. The nature of distributions What is a distribution? An arrangement of values of a variable showing their observed or theoretical frequency of occurrence
Observed: collect scores on some test or questionnaire. Scores are not identical, so count the frequency of occurrence of each score
Example: 40 people complete the digit span task; 4 people get a score of 5, 10 people get a score of 6, 16 people get a score of 7, etc.
This distribution of scores can be displayed graphically, most often a histogram:
6. Distribution of digit span scores in a typical student population
7. Prediction and probability Assumption: Sample of events is representative of all future events. Therefore the relative frequency of an event is equal to the probability of an event occurring in the future.
The observed distribution told us that, in this particular sample, 16 out of 40 people scored 7 on the digit span. The relative frequency of the score 7 (16/40 = .4) is like a prediction of what you might find if you carried out the test again with another 40 people. Thus you might say that the probability of getting a score of 7 is .4
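The relative-frequency logic above can be sketched in a few lines. This is a Python illustration (the session itself uses SPSS); the counts for scores 5, 6 and 7 come from the slide's example, and the remaining scores are made up here purely to bring the sample to 40:

```python
from collections import Counter

# Digit-span scores for 40 people. The counts for 5, 6 and 7 match the
# slide's example; the 8s and 9s are invented to complete the sample of 40.
scores = [5] * 4 + [6] * 10 + [7] * 16 + [8] * 7 + [9] * 3

# Relative frequency of each score = count / sample size, read as an
# estimate of the probability of observing that score in a future sample.
rel_freq = {s: n / len(scores) for s, n in Counter(scores).items()}
print(rel_freq[7])  # 16/40 = 0.4
```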
8. Notes about probability Probability (p) is calculated in exactly the same manner as relative frequency
Probability is always a decimal between 0 and 1 (cannot ever be less than zero or more than 1)
If the probability of a future event is 0 - then we never expect the event to happen, because it never has!
If the probability of a future event is 1 - then expect the future event to happen always, because it always does! Although it should be noted that this is never certain.....
9. From observed to theoretical distributions…
10. Theoretical distribution It’s not much of a leap from from the logic of prediction and probability to the notion of theoretical distributions which underpin statistical analyses
One example is the Standard Normal Distribution (SND), which is a model of what we expect to happen with normally distributed scores. It is based on the expected relative frequency of scores in the population. Therefore the SND is a theoretical probability distribution: it indicates the probability of all possible future events (or scores) in the population.
If observed data conform to the shape of the SND (bell-shaped curve, symmetrical, mean in the centre) we know, for example, that the probability of a score being more than about 2 SDs above the mean is roughly .025 (i.e. 2.5%)
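That tail probability can be checked directly against the theoretical distribution. A small Python sketch (scipy assumed available): note the exact tail probability at z = 2 is about .023, and the familiar .025 belongs to the conventional cutoff z = 1.96:

```python
from scipy.stats import norm

# P(z > 2) under the standard normal distribution.
p_above_2sd = norm.sf(2)          # survival function, i.e. 1 - cdf
print(round(p_above_2sd, 3))      # 0.023, i.e. roughly 2.5%

# The exact z cutting off a one-tailed probability of .025:
print(round(norm.isf(0.025), 2))  # 1.96
```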
If data don’t fulfil the assumptions, then the results of a test may not be accurate. What is an assumption?
It’s a belief that underlies a conclusion you make.
Assumption:
Behaviour:
Conclusion:
Outcome:
What if my assumption is not true?
My conclusion is completely wrong!!
11. Common assumptions Assumptions
Normality
Homogeneity of variances (across groups)
Multi-collinearity (covered in regression lecture)
Heteroscedasticity (covered in regression lecture)
Testing the assumptions
12. Normal distribution
13. How do we know when the distribution deviates from normality? Look at the visual representations of the data
Histograms, box plots & normal QQ plots
Use Rachel_DR_Data.sav for this
14. Normal QQ plot Plots observed value (score) against that expected if the distribution was normal. Normality is consistent with data points conforming to the diagonal line.
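The same check can be done numerically as well as visually. A hedged Python sketch (scipy assumed; in the session SPSS draws the plot itself): `scipy.stats.probplot` computes the paired quantiles behind a QQ plot, and the correlation r of the fitted line indicates how closely the points follow the diagonal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=200)  # simulated normal scores

# probplot pairs each ordered observed value with the quantile expected
# under normality; r near 1 means the points hug the diagonal line.
(theoretical_q, observed_q), (slope, intercept, r) = stats.probplot(sample)
print(round(r, 2))  # close to 1 for these clearly normal data
```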
15. How do we know when the distribution deviates from normality? Look at measures of central tendency
If the distribution is perfectly normal and symmetrical, then the mean, median and mode will all be equal
17. How do we know when the distribution deviates from normality? SPSS: Analyze > Descriptive Statistics > Explore. See Andy Field, p. 40.
Kurtosis: positive values = pointy (leptokurtic), negative values = flat (platykurtic)
18. How do we know when the distribution deviates from normality? SPSS: Analyze > Descriptive Statistics > Explore. See Andy Field, p. 40.
On whether to use Kolmogorov-Smirnov or Shapiro-Wilk: some people say use Shapiro-Wilk for N < 2000
19. Review: Checking for normality Use histograms, box plots or QQ plots to look at the shape of the distribution
Note measures of central tendency: mean, median & mode should be roughly equal
Examine skewness & kurtosis statistics; values near to 0 are desirable
Look at the results of the normality tests, significant p-values indicate non-normal data
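As a numerical illustration of these checks (Python with scipy, standing in for the SPSS output described above; the skewed sample is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=100)  # clearly positively skewed scores

print(stats.skew(skewed) > 0)            # True: positive skew statistic
print(round(stats.kurtosis(skewed), 2))  # excess kurtosis; 0 for a normal shape
w, p = stats.shapiro(skewed)             # Shapiro-Wilk normality test
print(p < .05)                           # True: significant, i.e. non-normal
```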
20. Homogeneity of Variance This applies to between-group analyses such as t-tests and ANOVA. Tests are based on the assumption that the variance in scores is similar across groups. Variance, i.e. the extent to which individual scores deviate from the mean.
21. How do we know whether our variances are homogenous? Look at bar charts with error bars & values given for SE and SD
22. How do we know whether our variances are homogenous? Don’t rely on guesswork, check Levene’s test for homogeneity of variance. SPSS gives Levene’s as standard in the t-test output. You can also find it in the explore option, the result will be identical.
23. Review: Checking Homogeneity of Variance Look at the data, particularly graphs displaying means and error bars. Error bars should be roughly equal height
Do not rely on this! Check with Levene’s test. A significant test shows that the variances are significantly different.
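A minimal sketch of the same check outside SPSS (Python/scipy; the two groups below are hypothetical):

```python
from scipy import stats

# Hypothetical scores for two groups with visibly different spreads.
group_a = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10]  # tight around 10
group_b = [4, 18, 7, 15, 2, 20, 9, 16, 5, 14]     # much wider spread

# Levene's test (scipy centres on the median by default, a robust variant).
stat, p = stats.levene(group_a, group_b)
print(p < .05)  # True here: the variances differ significantly
```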
24. Practice Using the SPSS data file provided check for normality and homogeneity of variance
25. What do we do next? Our data deviates from normal and/or our variances are different. At this point you need to make choices.
You should always:
Check for outliers and data entry errors. Does this distribution make sense given what is being measured?
You could:
Argue that parametric tests are robust and can cope with departures from normality and continue using parametric test.
Use adjusted versions of tests which account for heterogeneity of variance. Welch ANOVA, t-test equal variances not assumed with adjusted df.
Use tests which do not require such assumptions to be met (e.g. Mann-Whitney U).
26. What do we do next? You can choose to transform your data such that it more closely meets test assumptions & then use parametric tests with more confidence. Remember you need to be confident that these choices are justifiable and defensible.
What is transformation? It involves converting the scores by applying a mathematical operation to each score.
Linear transformations are not particularly helpful:
x₁ + 4 = 5; x₂ + 4 = 6; x₃ + 4 = 7, etc., will not alter the shape of the distribution
Non-linear transformations change the spacing of the different points on the scale.
E.g. x₁² = 1; x₂² = 4; x₃² = 9, etc., so you can see that by squaring every score we are changing the scale and spacing. This can change the shape of the distribution. (x₁ in the examples above refers to a score of 1, x₂ to a score of 2, etc.)
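The point about linear versus non-linear transformations can be verified directly (a Python sketch with a made-up skewed score set):

```python
from scipy.stats import skew

scores = [1, 1, 1, 2, 2, 3, 4, 8, 15]  # small, positively skewed set

linear = [x + 4 for x in scores]    # linear shift: shape unchanged
squared = [x ** 2 for x in scores]  # non-linear: spacing, hence shape, changes

print(abs(skew(scores) - skew(linear)) < 1e-9)  # True: shift leaves skew alone
print(skew(squared) > skew(scores))             # True: squaring stretches the right tail
```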
27. How transformation works on the positively skewed distribution
28. What kind of transformation should I use to achieve this effect? Add a constant to each score before doing the transform such that the lowest value of the variable will be 1
Then choose an operation depending on how strong a transformation you need (i.e. strength = how hard a shove to the right you need). In order of increasing strength:
Square root (weakest)
Log
Inverse, 1/x (strongest)
Always retest for normality after the transform
If your transformation is too strong, you could push the distribution too far to the right and convert it to a negative skew
If none of the above works, try a power transformation (variable^power). Values < 1 are appropriate for positive skew.
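The ladder of transformations can be tried out quickly in code (a Python sketch; the lognormal sample is a stand-in for positively skewed questionnaire scores):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)  # positively skewed scores
x = x - x.min() + 1                               # shift so the lowest value is 1

# The ladder, in increasing order of strength; re-test skewness after each step.
print("raw    ", round(skew(x), 2))           # clearly positive
print("sqrt   ", round(skew(np.sqrt(x)), 2))  # weaker positive skew
print("log    ", round(skew(np.log(x)), 2))   # weaker still
print("inverse", round(skew(1 / x), 2))       # strongest; can overshoot to negative
```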
29. A note on using the inverse transformation.. The inverse transformation is the strongest one (1/x). It also reverses the order of scores (reflection).
This might make interpretation of any subsequent regression analyses rather tricky. So the general advice is that when you have used 1/x, you should reflect back again to make interpretation more straightforward.
30. Transforming negatively skewed distributions All the transformations mentioned so far push the distribution to the right
So before starting to transform the negative distribution, we need to reflect it. This will convert the negative skew to a positive skew, which can then be pushed to the right
BUT as reflection reverses the order of scores, you must reflect back again after the transformation.
31. Review of transforming for normality If the skew is positive:
Check the value of the lowest score. If it is less than 1, add a constant to each score to make your lowest score 1 (so if the lowest score was -15, add 16 to all scores to take the lowest value up to +1).
Use either the i) square root, ii) log or iii) inverse transformation. If you use the inverse, you will have reversed the order of your scores, so you must reverse them back again after the transformation
Check for normality again using the Shapiro-Wilk test. If significant, go back to step 2 and try the next transformation on the list.
If the skew is negative:
Reflect the distribution using ((highest score + 1) – score)
Use either the i) square root, ii) log or iii) inverse transformation.
Reflect the distribution back again to preserve the original order of scores (unless you have used the inverse, which already reverses the order a second time)
Check for normality again using the Shapiro-Wilk test. If significant, go back to step 2 and try the next transformation on the list.
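Putting the negative-skew recipe together in code (a Python sketch; log is used in step 2 purely as an example, and the scores are simulated):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
x = 100 - rng.exponential(scale=10, size=200)  # negatively skewed scores

# Step 1: reflect, converting the negative skew into a positive one.
reflected = (x.max() + 1) - x

# Step 2: apply a positive-skew transformation (log, for illustration).
transformed = np.log(reflected)

# Step 3: reflect back so that high raw scores are high again (skip this
# step if you used the inverse, which reverses the order a second time itself).
final = (transformed.max() + 1) - transformed

print(skew(x) < 0)                      # True: original skew is negative
print(abs(skew(final)) < abs(skew(x)))  # True: skew reduced overall
```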
32. Transformation for Homogeneity of Variance Power transformations work well here. SPSS has a clever little option which tells you the best power transformation to get the job done!
33. Scan down to the end of the very long output… The information you need is quite well hidden next to something called the ‘Spread vs. Level Plot’
34. Proof that it works
35. Review of transforming for homogeneity of variance Check the value of the lowest score. If it is less than 1, add a constant to each score to make your lowest score 1 (so if the lowest score was -15, add 16 to all scores to take the lowest value up to +1).
Use the SPSS option under Explore, called “Spread vs. Level with Levene Test”, to identify a suitable power value.
Use the transform option to raise the score to the power suggested (score**power).
Rerun explore to check whether the power transformation has been successful using Levene’s test.
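Outside SPSS, the spread-vs-level idea can be reproduced by hand: Tukey's rule of thumb regresses log(spread) on log(level) across groups and suggests power = 1 - slope. A hedged Python sketch with three hypothetical groups whose spread grows in proportion to their level:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three groups whose spread is proportional to their level (mean).
groups = [rng.normal(loc=m, scale=0.3 * m, size=50) for m in (5, 20, 80)]

levels = [np.median(g) for g in groups]                                  # level
spreads = [np.percentile(g, 75) - np.percentile(g, 25) for g in groups]  # IQR

slope = np.polyfit(np.log(levels), np.log(spreads), 1)[0]
power = 1 - slope
print(round(power, 1))  # near 0 here, which by convention means a log transform
```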
36. Group task Take the data set
Based on earlier tests, try some transformations
Share your findings
37. Reporting: when, where, how.. Transformations are usually reported in the Method, sometimes in a subsection explicitly labeled ‘Data analysis’
Example: Peselow et al. (1994). British Journal of Psychiatry.
“Since the distributions for both the individual personality disorder scores and the three personality cluster scores were inherently skewed in a positive direction (subjects were not expected to display some maladaptive traits for all of the personality disorders or even for the personality clusters), the data were subjected to either square root or logarithmic transformations to obtain a normal distribution, an assumption for ANOVA. In some instances, transformations did not result in a fully normal distribution of the data. However, research has shown that ANOVA F-values are robustly insensitive to moderately skewed distributions (Lindquist, 1953).” (p.350)
38. Reporting: when, where, how.. This method section had a paragraph labeled, “Analytic Strategy”
Example: Angold et al. (2002). Journal of Child Psychology and Psychiatry.
“In the two studies reported here, they [scores] were not even approximately normally distributed. Most children had very low depression scores (as, indeed, we would expect) and there was a long right tail comprising a small group of children with high scores (see the cumulative frequency curves in Figures 1 and 5). Common transformations (such as taking logs or reciprocals) still did not produce distributions close to normality.” (p.1050)
NB: read on to find out what they did next….
39. Point to note Sometimes transformations do not work! This can be a particular difficulty when you measure a psychopathological variable (like depression) in the normal population. Most people score low here, and the skewness and kurtosis can be so extreme that none of the usual transformations work.
Alternatives to try are Box-Cox, which gives you an appropriate power transformation to approximate normality, and the arctan transform, another one to try for extreme kurtosis (see notes below). Box and Cox (1964)
This transformation is not available as standard in SPSS and requires some syntax. There are 2 steps: first use Box-Cox to identify the appropriate power, then use that power to transform and retest.
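As a sketch of the two-step idea (here in Python rather than SPSS syntax; scipy's `boxcox` does both steps in one call, estimating the power lambda by maximum likelihood and returning the transformed scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.exponential(scale=3, size=150) + 1  # positively skewed, all values >= 1

# boxcox requires strictly positive data; it returns the transformed scores
# and the estimated power (lambda) together.
transformed, lam = stats.boxcox(x)

print(stats.skew(x) > 1)                   # True: strongly skewed before
print(abs(stats.skew(transformed)) < 0.5)  # True: close to symmetric after
```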
40. Conclusion The accuracy of many of the most commonly used statistical tests rests on assumptions of normality and homogeneity of variance. Many would argue that such tests are robust in the face of skewness and heterogeneity of variance.
However, if you are in doubt about the necessity for transformation, do your analysis on both raw and transformed scores. If both analyses provide similar answers, fine; if not, err on the side of caution…
41. SPSS explore dialog for checking normality Make sure that in the plots option, you select ‘Normality plots with tests’
42. Transform in SPSS Go to: Transform > Compute Variable: