CRITICAL NUMBERS Bivariate Data: When two variables meet


# CRITICAL NUMBERS Bivariate Data: When two variables meet

##### Presentation Transcript

1. CRITICAL NUMBERS Bivariate Data: When two variables meet

2. Recap: types of data • Categorical (qualitative): nominal (no natural ordering), e.g. haemoglobin types, gender; ordered categorical, e.g. anaemic / borderline / not anaemic • Quantitative (numerical): count (can only take certain values), e.g. number of positive tests for anaemia; continuous (limited only by accuracy of instrument), e.g. haemoglobin concentration (g/dl)

3. Population and Sample

4. The Standard Error • The standard error (se) estimates the precision of a sample estimate of a population parameter, and it can be obtained from a single sample rather than from lots of repeated samples. • It is used to determine how far from the true value (the population parameter) the sample estimate is likely to be. • Thus, all other things being equal, we would expect estimates to become more precise, and the value of the se to decrease, as the sample size increases.
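As a rough illustration of the last point, the sketch below computes the standard error of a sample mean as s / √n. The mean is only one possible estimate, and the haemoglobin values are made up for illustration; they are not data from the presentation.

```python
import math
import statistics


def standard_error_of_mean(values):
    """Standard error of the sample mean: s / sqrt(n)."""
    n = len(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)


# Hypothetical haemoglobin concentrations (g/dl), invented for this example.
small_sample = [11.8, 12.9, 13.4, 14.1, 12.2, 13.0]
# Same values repeated: similar spread, four times as many observations.
large_sample = small_sample * 4

se_small = standard_error_of_mean(small_sample)
se_large = standard_error_of_mean(large_sample)
print(f"n = {len(small_sample)}: se = {se_small:.3f}")
print(f"n = {len(large_sample)}: se = {se_large:.3f}")
```

With a similar spread, the larger sample gives the smaller standard error.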

5. Confidence Intervals • A confidence interval describes the variability surrounding the sample estimate • It gives limits within which we are confident (in terms of probability) that the true population parameter lies • For example, a 95% CI means that if you could sample an infinite number of times, 95% of the time the CI would contain the true population parameter and 5% of the time the CI would fail to contain it • Alternatively: a 95% confidence interval gives a range of values that will include the true population value for 95% of all possible samples
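The repeated-sampling interpretation can be checked with a small simulation. This is a minimal sketch, assuming a normally distributed population with a known mean and using the large-sample interval "estimate ± 1.96 × se"; the population values, sample size and seed are arbitrary choices, not figures from the presentation.

```python
import math
import random
import statistics


def mean_ci_95(values):
    """Approximate 95% CI for a mean: estimate +/- 1.96 * se (large-sample)."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m - 1.96 * se, m + 1.96 * se


def coverage(true_mean=13.0, true_sd=1.5, n=50, repeats=2000, seed=1):
    """Fraction of simulated samples whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_ci_95(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / repeats


print(f"Proportion of intervals containing the true mean: {coverage():.3f}")
```

The proportion should come out close to 0.95. For small samples, a multiplier from the t distribution would be more appropriate than 1.96.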

6. Hypothesis testing: the main steps • Set null hypothesis • Set study (alternative) hypothesis • Carry out significance test • Obtain test statistic • Compare test statistic to hypothesized critical value • Obtain p-value • Make a decision

7. P-values • A p-value is the probability of obtaining your results, or results more extreme, if the null hypothesis is true • It is used to make a decision about whether or not to reject the null hypothesis • A small p-value is evidence against the null hypothesis, but how small is small? The significance level is usually set at 0.05. Thus, if the p-value is less than this value, we reject the null hypothesis

8. P-values • We say that our results are statistically significant if the p-value is less than the significance level (α) set at 5% • If the p-value is not less than α, we cannot say that the null hypothesis is true, only that there is not enough evidence to reject it
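The decision rule can be sketched in a few lines. This example assumes a test statistic that follows a standard normal distribution when the null hypothesis is true; the z values are hypothetical and the function names are chosen only for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for standard normal Z: results at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"


# Hypothetical test statistics, for illustration only.
for z in (0.5, 1.96, 2.5):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f} -> {decide(p)}")
```

Note that "do not reject H0" is deliberately not worded as "accept H0".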

9. At the end of the session, you should know about: • Approaches to analysis for simple continuous bivariate data At the end of the session, you should be able to: • Construct and interpret scatterplots for quantitative bivariate data • Identify when to use correlation • Interpret the results of correlation coefficients • Identify when to use linear regression • Interpret the results for linear regression

10. The Scenario “Our Dr has noticed that since she moved practices, from one in a wealthy suburb of the city to one in a more deprived area, she is seeing many more teenage pregnancies. She wants to know whether it is worth her setting up a contraceptive advice clinic especially for teenagers…”

11. What do we mean when we talk about bivariate data? • Data where there are two variables • The two variables can be either categorical, or numerical • This session we are dealing with continuous bivariate data i.e. both variables are continuous • During the risk lecture last year we looked at categorical bivariate data …

12. … categorical bivariate data example from Risk lecture

| | Baycol | Other statins |
|---|---|---|
| Number who die from rhabdomyolysis | 2 | 1 |
| Number alive or die of other causes | 999 998 | 9 999 999 |
| Total | 1 000 000 | 10 000 000 |

• There are two binary (categorical) variables • Type of statin (Baycol / other) • Whether died of rhabdomyolysis or not • From these data we examined the risk of death from rhabdomyolysis of Baycol compared to other statins
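One common way to compare the two groups is the relative risk, the ratio of the two risks. The sketch below assumes the counts read as 2 deaths per 1 000 000 on Baycol and 1 death per 10 000 000 on other statins; the choice of relative risk as the summary measure is an assumption, since the slide does not name the measure used.

```python
from fractions import Fraction

# Counts as read from the slide's table.
baycol_deaths, baycol_total = 2, 1_000_000
other_deaths, other_total = 1, 10_000_000

# Risk = number of deaths / number of people in the group (exact fractions).
risk_baycol = Fraction(baycol_deaths, baycol_total)
risk_other = Fraction(other_deaths, other_total)

# Relative risk = risk in one group divided by risk in the comparison group.
relative_risk = risk_baycol / risk_other

print(f"Risk on Baycol:        {float(risk_baycol):.7f}")
print(f"Risk on other statins: {float(risk_other):.7f}")
print(f"Relative risk:         {float(relative_risk):g}")
```

Both absolute risks are tiny, yet the relative risk is 20.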

13. Association between two variables: Correlation or regression? There are two basic situations: • There is no distinction between the two variables. No causation is implied, simply association: • use correlation • One variable Y is a response to another variable X. You could use the value of X to predict what Y would be: • use regression

14. Correlation: are two variables associated? When examining the relationship between two continuous variables ALWAYS look at the scatterplot, as you will be able to see visually the pattern of the relationship between them

15. Teenage pregnancy example

16. Teenage pregnancy example • There appears to be a linear relationship between adult smoking rates and teenage pregnancy • So, now what do you do? • You could calculate the correlation coefficient • This is a measure of the linear association between two variables • Used when you are not interested in predicting the value of one variable for a given value of the other variable • Any relationship is not assumed to be a causal one – it may be caused by other factors

17. Teenage pregnancy example

18. Properties of Pearson’s correlation coefficient (r) • r must be between -1 and +1 • +1 = perfect positive linear association • -1 = perfect negative linear association • 0 = no linear relation at all
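These properties can be checked with a small hand-written function for Pearson's r. The data below are invented so that the three cases are exact; they are not the teenage pregnancy data.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))     # points on a rising line
print(pearson_r(x, [10 - 3 * v for v in x]))    # points on a falling line
print(pearson_r(x, [(v - 3) ** 2 for v in x]))  # symmetric curve, no linear trend
```

The third case gives r = 0 even though y is completely determined by x, which is why r = 0 means "no linear relation" rather than "no relation", and why the scatterplot should always be examined.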

19. Consider the following graphs, what do you think their value for r could be?

20. A = 1.0, B = 0.8, C = 0.0, D = -0.8, E = -1.0


32. Confidence interval for the correlation coefficient • Complicated to calculate by hand, but useful • Hypothesis tests • Can be done: the null hypothesis is that the population correlation ρ = 0. However, the test result is not very useful as an estimate of the strength of an association, because it is influenced by the number of observations (see next slide)
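A minimal sketch of both ideas follows, assuming the Fisher z-transformation for the confidence interval and the usual t statistic for testing ρ = 0. These are standard textbook methods, not necessarily the ones used for the slides, and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate CI for a correlation using the Fisher z-transformation."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                # transform r to the z scale
    se = 1 / math.sqrt(n - 3)        # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Same weak correlation, two hypothetical sample sizes.
for n in (20, 2000):
    low, high = correlation_ci(0.16, n)
    print(f"r = 0.16, n = {n}: 95% CI ({low:.2f}, {high:.2f}), "
          f"t = {t_statistic(0.16, n):.2f}")
```

The same weak correlation yields a much larger test statistic when n is large, so a small p-value says little about the strength of the association. The confidence interval shows both the estimate and its precision.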

33. And so what do correlations of 0.63 and 0.16 look like?

34. Teenage pregnancy example: null & alternative hypothesis • State the null and alternative hypothesis: • H0: No relationship or correlation between adult smoking and teenage pregnancy rates, i.e. population correlation coefficient (ρ) = 0.0 • HA: There is a relationship or correlation between adult smoking and teenage pregnancy rates, i.e. population correlation coefficient (ρ) ≠ 0.0

35. Teenage pregnancy example

36. Example: Answers • The correlation coefficient is 0.94 (p < 0.001) • What does p < 0.001 mean? Your results are unlikely when the null hypothesis is true • Is this result statistically significant? The result is statistically significant at the 5% level because the p-value is less than the significance level (α) set at 5% or 0.05 • You decide? That there is sufficient evidence to reject the null hypothesis, and therefore you accept the alternative hypothesis that there is a correlation between adult smoking and the teenage pregnancy rates

37. Points to note • Do not assume causality - a different variable could have caused both to change together – in this case it is unlikely that smoking increases the risk of conception! • Be careful comparing r from different studies with different n • Do not assume the scatterplot looks the same outside the range of the axes • Avoid multiple testing • Always examine the scatterplot!

38. Teenage pregnancy example

39. Teenage pregnancy example

40. Association between two variables: Correlation or regression? There are two basic situations: • There is no distinction between the two variables. No causation is implied, simply association: • use correlation • One variable Y is a response to another variable X. You could use the value of X to predict what Y would be: • use regression

41. Regression: Quantifying the relationship between two continuous variables • Teenage pregnancy example: If you believe that the relationship is causal, i.e. that the level of smoking in an area affects the teenage pregnancy rate for that area, you may want to: • Quantify the relationship between smoking and the teenage pregnancy rate • Predict on average what the pregnancy rate would be, given a particular level of smoking

42. Regression: Quantifying the relationship between two continuous variables • Teenage pregnancy example: However, in this case it would not be sensible, as both smoking and teenage pregnancy are likely to be influenced by deprivation. So let’s look at the rates of teenage pregnancy by area deprivation. If we believe that deprivation is causally linked with teenage pregnancy we could: • Quantify the relationship between deprivation and the teenage pregnancy rate • Predict on average what the pregnancy rate would be, given a particular level of deprivation

43. Y: response variable (dependent variable) • X: predictor / explanatory variable (independent variable)

44. Always plot the graph this way round, with the explanatory (independent) variable on the horizontal axis and the dependent variable on the vertical axis • We try to fit the “best” straight line • If the relationship is linear, this should give the best prediction of Y for any value of X

45. Estimating the best fitting line • The standard way to do this is a method called least squares, carried out on a computer. • The method chooses the line that minimises the sum of the squared vertical distances between the line and the points (equivalently, the squared distance averaged over all points).
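For a single explanatory variable, the least squares line has a closed-form solution, sketched below. The deprivation scores and pregnancy rates are invented for illustration and are not the data behind the slides.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx  # the fitted line passes through (mean x, mean y)
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: area deprivation score (x) and teenage pregnancy rate (y).
deprivation = [10, 20, 30, 40, 50]
pregnancy_rate = [22, 31, 38, 52, 57]

a, b = least_squares_line(deprivation, pregnancy_rate)
print(f"Fitted line: rate = {a:.2f} + {b:.2f} * deprivation")
print(f"Predicted rate at deprivation 35: {a + b * 35:.2f}")
print("Sum of squared residuals:",
      round(sum_squared_residuals(deprivation, pregnancy_rate, a, b), 2))
```

Any other straight line through these points gives a larger sum of squared residuals, which is the sense in which the fitted line is "best".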