Lecture 3 – Data Summary Measures and Graphical Display of Results


  1. Lecture 3 – Data Summary Measures and Graphical Display of Results Univariate Data – Analysis of one variable at a time

  2. Why Think About/Explore Data? • Done to accomplish: • Checking for data entry errors • Describing demographic and study characteristics • Examining distributions of outcomes • Central tendency • Variability • Checking for outliers • Checking assumptions for subsequent analyses • Give a picture of your sample

  3. To choose which statistics are appropriate, first determine the measurement level of the outcome(s) and predictor(s). Dependent variable = outcome. Independent variable = predictor.

  4. Types of Data • Nominal – Qualitative Data Measured in unordered categories (summarize with %’s) • Ordinal – Qualitative Data Measured in ordered categories (summarize with %’s) • Continuous – Quantitative Data Measured on a continuum (summarize with many summary measures)

  5. Types of Data • Nominal – Qualitative Data Measured in unordered categories: Race; Blood Type; Dead/Alive; Gender; On Dialysis/Not on Dialysis • Ordinal – Qualitative Data Measured in ordered categories: Cancer Stages; Socio-economic Status (low, med, hi); Likert (unlikely, somewhat unlikely, neutral, likely, very likely) • Continuous – Quantitative Data Measured on a continuum: Serum Creatinine; Height/Weight/BMI; Systolic Blood Pressure; Diastolic Blood Pressure • Others???

  6. Continuous (Numerical) Measures of Location • Mean: arithmetic average (sum of values / number of values); nice mathematical/statistical properties • Median (a.k.a. 50th percentile): value where half the sample is above and half the sample is below; better measure for skewed data; robust to extreme values • Mode: most frequently occurring value in the sample
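
As a minimal sketch of the three location measures, the following uses Python's standard `statistics` module on a small made-up sample (the values are illustrative, not lecture data). The single large value pulls the mean well above the median, which is the sense in which the median is robust to extreme values.

```python
import statistics

# Hypothetical right-skewed sample; the 40 is a deliberately extreme value.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```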

  7. Continuous (Numerical) NORMAL DISTRIBUTION

  8. Continuous (Numerical) Measures of Variability • Range = maximum – minimum • Interquartile range = Q3 – Q1 (75th percentile – 25th percentile); always covers the middle half of the sample • Variance = average of the squares of the deviations of the observations from their mean • Standard deviation = square root of the variance
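
The variability measures can be sketched with the standard `statistics` module. The sample is made up for illustration. Two points are assumptions rather than anything the slide states: "average" is shown both as division by n (`pvariance`) and by n − 1 (`variance`), and the quartiles use the module's default "exclusive" method, so other software may report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical sample, chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)

# Quartile conventions differ between packages; this is Python's default.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

pop_var = statistics.pvariance(values)  # squared deviations divided by n
samp_var = statistics.variance(values)  # squared deviations divided by n - 1
samp_sd = statistics.stdev(values)      # square root of the sample variance

print(value_range, iqr, pop_var, round(samp_var, 3), round(samp_sd, 3))
```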

  9. Continuous (Numerical) NORMAL DISTRIBUTION http://www.stattucino.com/berrie/dsl/index.html

  10. Describing Data using Numerical Summaries • Descriptive statistics: Explore data in order to describe their main features Get an initial picture of data sample

  11. Let’s Talk Data…

  12. Categorical

  13. Categorical

  14. Categorical

  15. Continuous

  16. N = 115

  17. BMI Mean: 32.2 Std: 5.4 Median: 31.8

  18. Mean: 136.3 Std: 17.1 Median: 135

  19. Mean: 189.77 Std: 148.9 Median: 154.11

  20. Shape of a distribution • Symmetric • Skewed to the right (positively skewed): Mean greater than Median • Skewed to the left (negatively skewed): Mean less than Median
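
The link between skewness and the mean/median ordering can be checked on toy data. This sketch uses the simple moment coefficient of skewness, which is an assumption: statistical packages often apply a small-sample adjustment, so it may not match the skewness values reported on the later slides. All samples are hypothetical.

```python
import statistics

def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail to the right
left_skewed = [-x for x in right_skewed]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, xs in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(xs), 2), statistics.mean(xs), statistics.median(xs))
```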

  21. Mean: 136.3 Std: 17.1 Median: 135 Skewness: 0.38

  22. Mean: 189.77 Std: 148.9 Median: 154.11 Skewness: 5.63

  23. NORMAL DISTRIBUTION • Normal Distribution – has excellent statistical properties • Many statistical techniques require normal distributions • If data do not have a Normal Distribution, need to consider alternative techniques appropriate for the data

  24. Box (and Whisker) Plots • A graph of the 5 number summary with suspected outliers plotted individually • 5 number summary: Min, Q1, Median, Q3, Max • A line inside the box marks the Median • IQR = Q3 – Q1 • Cases more than 1.5*IQR below Q1 or above Q3 are plotted individually (possible outliers) • Lines from the box extend to the smallest and largest values that are not more than 1.5*IQR from the box
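
The box plot rules above can be sketched as a small function that returns the pieces a plotting routine would draw. The function name `boxplot_summary` and the sample are hypothetical, and the quartiles again use Python's default "exclusive" convention, which is an assumption; plotting software may place Q1 and Q3 slightly differently.

```python
import statistics

def boxplot_summary(xs):
    """Return quartiles, 1.5*IQR fences, whisker ends, and flagged outliers."""
    data = sorted(xs)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme values still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond a fence is plotted individually as a possible outlier.
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

summary = boxplot_summary([2, 4, 4, 4, 5, 5, 7, 9, 30])  # made-up values
print(summary)
```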

  25. [Figure: annotated box plot with labels for an outlier, the 1.5 x IQR whisker limit, the 75th percentile, the median, the 25th percentile, and the mean]

  26. [Figure: three box plots, one skewed to the left, one skewed to the right, and one symmetric, each marked with a "+"]

  27. Normal Probability Plot • Plot that can help assess normality. • Idea: plot the observed levels of the variable against the expected levels corresponding to a Normal distribution. • If data lie in a reasonably straight diagonal line, then assumption of Normality is reasonable.
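
A rough sketch of how the points of such a plot could be computed with the standard library. The plotting positions (i − 0.5)/n are one common convention and an assumption here; the helper name `normal_probability_points` and the data are hypothetical. Plotting the returned pairs and looking for a straight diagonal line gives the visual check described above.

```python
from statistics import NormalDist, mean, stdev

def normal_probability_points(xs):
    """Pair each sorted observation with the value expected under a fitted Normal."""
    data = sorted(xs)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Expected value for the i-th smallest observation, using (i - 0.5) / n.
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

sample = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 8.0]  # made-up measurements
points = normal_probability_points(sample)
for expected_value, observed_value in points:
    print(f"expected={expected_value:.2f}  observed={observed_value:.2f}")
```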

  28. Normal Probability Plots [Figure: normal probability plots for Triglycerides and BMI]

  29. Error Bar Plots Circle denotes the mean and the bars denote the standard deviation (in this case).

  30. Part II – Measures of Association (plus a little more)

  31. Measures of Association • Continuous Variables • Correlation • Agreement (reliability) • Categorical Variables • Two-way layout (2×2 tables) • “Risk” measures • Agreement • Others

  32. Two Continuous Variables Correlation • General sense: the relationship between two variables (quantitative or qualitative) • Narrow (statistical) sense: measure of interdependence between two continuous random variables • The degree to which increases or decreases in Y occur with increases or decreases in X • Values range between -1 (perfect discordance) and 1 (perfect concordance) • A value of 0 indicates no association

  33. Pearson Correlation Purpose - measures linear association between two continuous variables X and Y Data

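  34. Pearson Correlation The Pearson (product-moment) correlation coefficient can be calculated for 2 continuous variables in a sample (regardless of distribution) using the formula: $r = \dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}$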
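
A minimal sketch of the standard product-moment calculation: the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the data are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))  # exact increasing line
print(pearson_r(x, [-v for v in x]))         # exact decreasing line
```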

  35. Correlation Figures [Figure: five scatterplots of Y against X (panels A–E): no relationship (ρ = 0); perfect positive relationship (ρ = 1); perfect negative relationship (ρ = -1); strong negative relationship (ρ = -0.8); moderate positive relationship (ρ = 0.5)]

  36. Correlation Inference • Easy “large sample” test for H0: ρ=0 For n ≥ 25, compute […], which has a N(0, […]) distribution under H0 • This test assumes $(X, Y) \sim N_{\mathrm{Biv}}(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho)$ • Many times a tenuous assumption! • Beware positive skewness & outliers • Beware data not truly continuous
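
The test statistic on this slide did not survive in the transcript. One common large-sample choice, assumed here and not confirmed by the source, is Fisher's z transformation: z = atanh(r) is approximately Normal with mean 0 and variance 1/(n − 3) when ρ = 0 and the data are bivariate Normal. The function name and the inputs r = 0.5, n = 28 are hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_test(r, n):
    """Two-sided large-sample test of H0: rho = 0 via Fisher's z (assumed method)."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r)                    # 0.5 * ln((1 + r) / (1 - r))
    standard_error = 1 / sqrt(n - 3)
    statistic = z / standard_error  # approximately N(0, 1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value

statistic, p_value = fisher_z_test(r=0.5, n=28)
print(f"statistic={statistic:.3f}, p={p_value:.4f}")
```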

  37. Timeout: ASSUMPTIONS • As with any mathematical or physical model, model assumptions are critical to making the correct inference • Dealing with assumptions has led to development of: • Nonparametric statistics: techniques that reduce or eliminate dependence on the underlying distribution of the data • Robust statistics: techniques that are affected little by departures from assumptions

  38. Correlation (resumed) • A nonparametric version of the correlation coefficient: Spearman’s Rank Correlation • Like ρ, rs : • ranges from -1 to 1 • 0 no correlation, 1 perfect agreement • only requires ordinal data
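
Spearman's rank correlation can be sketched as the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names and data are hypothetical. The example uses a monotone but nonlinear relationship, where the rank correlation is 1 while the Pearson correlation falls short of 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotone but not linear
print(round(pearson_r(x, y), 3), round(spearman_rs(x, y), 3))
```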

  39. Correlation Example: SBP and DBP

  40. Correlation Example: SBP and DBP • All Data: ρ = 0.42; rs = 0.71 • Outlier deleted: ρ = 0.75; rs = 0.82

  41. Correlation Coefficient Questions – • Can we calculate a correlation coefficient between the incomes of a group of people and what city they live in? No, we cannot, since city is a categorical variable. Correlation requires that both variables be quantitative.

  42. Correlation Coefficient Questions – • Does it change the correlation between height and weight if we measure height in inches rather than centimeters and weight in pounds rather than kilograms? No. Because ρ (and r) uses the standardized values of the observations, ρ does not change when we change the units of measurements of x , y, or both. The correlation ρ itself has no unit of measure; it is just a number.
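
A quick numerical check of the unit-invariance claim, using made-up heights and weights. Converting centimeters to inches and kilograms to pounds rescales each variable by a positive constant, and the correlation is unchanged up to floating-point rounding.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements for six people.
heights_cm = [150, 160, 165, 172, 180, 188]
weights_kg = [52, 58, 63, 70, 79, 90]

heights_in = [h / 2.54 for h in heights_cm]     # centimeters to inches
weights_lb = [w * 2.20462 for w in weights_kg]  # kilograms to pounds

r_metric = pearson_r(heights_cm, weights_kg)
r_imperial = pearson_r(heights_in, weights_lb)
print(round(r_metric, 6), round(r_imperial, 6))
```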

  43. Correlation Coefficient Question – • Does ρ = 0 mean there is no relationship between X and Y? [Figure: scatterplot of y against x] Not necessarily. Correlation only measures the strength of the linear relationship between two variables. Correlation does not describe nonlinear relationships between two variables, no matter how strong they are.
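
A small constructed example of this point: y is determined exactly by x through y = x², yet the Pearson correlation is zero because the relationship has no linear component over a range of x that is symmetric about zero. The data are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # perfect, but nonlinear, dependence on x

r = pearson_r(x, y)
print(r)
```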

  44. Correlation and Regression [Figure: two scatterplots of Y against X with the fitted line Y = α+βX: moderate positive relationship (ρ = 0.5) and strong negative relationship (ρ = -0.8)]

  45. Correlation and Regression SBP and DBP example (continued) • σSBP = 4.9 mmHg • σDBP = 3.3 mmHg • ρ = 0.75 • SBP = 40.1 + 1.12×DBP • DBP = 16.3 + 0.51×SBP
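
The two slopes follow from the standard relation between a simple regression slope and the correlation: slope = ρ × (SD of response) / (SD of predictor). The sketch below plugs in the rounded values quoted on the slide; the result for SBP on DBP comes out near 1.11 rather than 1.12, presumably because the slide's inputs are rounded. The function name is hypothetical.

```python
def regression_slope(rho, sd_response, sd_predictor):
    """Slope of the least-squares line of response on predictor."""
    return rho * sd_response / sd_predictor

sd_sbp, sd_dbp, rho = 4.9, 3.3, 0.75  # values as quoted on the slide

slope_sbp_on_dbp = regression_slope(rho, sd_sbp, sd_dbp)
slope_dbp_on_sbp = regression_slope(rho, sd_dbp, sd_sbp)

print(round(slope_sbp_on_dbp, 2), round(slope_dbp_on_sbp, 2))
# The product of the two slopes equals rho squared.
print(round(slope_sbp_on_dbp * slope_dbp_on_sbp, 4), rho ** 2)
```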

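  46. Correlation and Covariance • Suppose two random variables, X and Y: $E(X) = \mu_X$, $V(X) = \sigma_X^2$; $E(Y) = \mu_Y$, $V(Y) = \sigma_Y^2$; and $\mathrm{Corr}(X,Y) = \rho$ • Define $\mathrm{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$ Note: $\mathrm{Cov}(X,X) = E[(X-\mu_X)(X-\mu_X)] = E[(X-\mu_X)^2] = \sigma_X^2$ • Population correlation ($\rho$) is defined as: $\rho = \mathrm{Cov}(X,Y) / (\sigma_X \sigma_Y)$ • Thus $\mathrm{Cov}(X,Y) = \rho\,\sigma_X \sigma_Y$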

  47. Correlation and Covariance What’s the big deal about covariance? Use it to find the variance of functions of random variables, e.g.: In general:
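
The slide's own formulas are missing from this transcript. One standard identity of this kind, offered here as an assumption about what was intended, is Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y). The sketch checks it numerically on made-up data, using moments that divide by n throughout; the helper name `pcov` is hypothetical.

```python
def pcov(xs, ys):
    """Covariance dividing by n; pcov(xs, xs) is the variance of xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired observations and arbitrary coefficients.
x = [1, 3, 4, 7, 10]
y = [2, 1, 5, 6, 11]
a, b = 2, -3

combo = [a * xi + b * yi for xi, yi in zip(x, y)]

lhs = pcov(combo, combo)  # variance of aX + bY computed directly
rhs = a * a * pcov(x, x) + b * b * pcov(y, y) + 2 * a * b * pcov(x, y)
print(lhs, rhs)
```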

  48. Correlation as Agreement Suppose two nurses are measuring SBP in the same patients and each nurse measures SBP 3 times in each patient.
