Section 3.1: Elementary Graphical Treatment of Data
Before doing ANYTHING with data:
• Understand the question.
• An approximate answer to the exact question is always better than an exact answer to an approximate question. (John Tukey)
• Know how the experiment was conducted.
The FIRST thing to do with the data is to PLOT THE DATA.
• Plot all individual points.
• If there are connections between points, e.g. points are from the same pairs (or sometimes separate blocks), show the connections between related points.
Plotting data is an extremely important step.
• More often than not, the data I get when consulting have problems, such as incorrect values or attributes nobody told me about.
• Plotting helps reveal relationships and answers.
• Plotting is a very effective way to present results.
• “A picture is worth a thousand words.”
Example: 8 lb. test fishing line
Question: Which type(s) of line are strongest?
Listing numerical data:
Trilene XL  11.5 11.3 11.7 11.6 11.7 11.4 11.5 11.5 11.6 11.4
Trilene XT  11.6 11.8 11.7 11.7 11.5 11.6 11.6 11.8 11.5 11.7
Stren       11.1 11.1 11.2 11.0 11.1 11.3 11.2 10.9 11.0 11.1
It’s hard to see what’s happening without organizing the data.
A “dot” diagram
        XL     XT     Stren
11.8           **
11.7    **     ***
11.6    **     ***
11.5    ***    *
11.4    **     *
11.3    *             *
11.2                  **
11.1                  ****
11.0                  **
10.9                  *
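A dot diagram like this can be drawn directly in R. The sketch below is one possible way, using `stripchart()` with stacked points. The values are copied from the numerical listing of the fishing line data, and the axis label is an assumption since the notes do not state the measurement units.

```r
# Values copied from the numerical listing of the fishing line data
strength <- list(
  XL    = c(11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4),
  XT    = c(11.6, 11.8, 11.7, 11.7, 11.5, 11.6, 11.6, 11.8, 11.5, 11.7),
  Stren = c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)
)

# method = "stack" piles repeated values side by side so every point shows
stripchart(strength, method = "stack", vertical = TRUE, pch = 8,
           ylab = "Strength")   # label is an assumption, units not given

# Number of points at each value, one table per line type
counts <- lapply(strength, table)
```

Printing `counts` gives the number of stars needed in each row of a hand-drawn diagram.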
Stem and leaf plot
It shows the shape of the distribution and at the same time preserves the original values.
In the gears’ runouts example, for the gears hung group, we have data points of 7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13, …
A stem and leaf plot is
0 | 7 8 8
1 | 0 0 0 0 1 1 1 2 3
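As a rough sketch, the same display can be produced in R. Only the twelve runout values listed above are used here, since the rest of the data set is not shown. `stem()` is R's built-in display (it may choose a different stem width than the hand-drawn version), and the last lines rebuild the stems and leaves by hand.

```r
# Only the values listed in the notes; the full data set is not reproduced
runout <- c(7, 8, 8, 10, 10, 10, 10, 11, 11, 11, 12, 13)

# Built-in stem and leaf display
stem(runout)

# By hand: the tens digit is the stem, the ones digit is the leaf
stems  <- runout %/% 10
leaves <- runout %% 10
by_stem <- split(leaves, stems)
```

`by_stem` holds one vector of leaves per stem, which is exactly what gets written to the right of each stem.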
Two groups can be compared with back-to-back stem and leaf diagrams.
E.g. stopping distances of bikes
Treaded tire |    | Smooth tire
             | 34 | 1 8 9
             | 35 | 5
           5 | 36 |
         6 4 | 37 | 5
             | 38 |
           1 | 39 | 1
         2 0 | 40 |
Or dot diagrams
Treaded  |    |    |*   |**  |    |*   |**
Smooth   |*** |*   |    |*   |    |*   |
        340  350  360  370  380  390  400
When there are associations between sets of data values, plot the data accordingly.
E.g., snowfall for Duluth and White Bear Lake, 1972-2000
A not very good way to plot the data:
[Figure: separate dot diagrams of snowfall for WB Lake and Duluth on a common vertical axis running from 20 to 130]
[Figure: snowfall plot with axes labeled Duluth and White Bear]
A study of trace metals in South Indian River
[Figure: the six sampling locations, numbered 1 to 6]
T = top water zinc concentration (mg/L)
B = bottom water zinc concentration (mg/L)
Location  1      2      3      4      5      6
Top       0.415  0.238  0.390  0.410  0.605  0.609
Bottom    0.430  0.266  0.567  0.531  0.707  0.716
One of the first things to do when analyzing data is to PLOT the data.
• This is not a useful way to plot the data. There is not a clear distinction between bottom water and top water zinc, even though Bottom > Top at all 6 locations.
[Figure: plot with the groups labeled Top and Bottom]
A better way: connect points in the same pair.
[Figure: Top and Bottom values with each pair connected]
Another way (scatter plot)
[Figure: scatter plot with the reference line Bottom = Top]
The following plot would imply a natural ordering of the sites from 1 to 6.
• This would not be the best way to plot the data unless sites 1-6 correspond to a natural ordering, such as distance downstream of a factory.
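The two recommended displays for the zinc data (connected pairs, and a scatter plot with the line Bottom = Top) can be sketched in R as follows. The data are the six Top and Bottom values from the table; the layout and axis labels are just one reasonable choice.

```r
# Zinc concentrations (mg/L) at the six locations, from the table
top    <- c(0.415, 0.238, 0.390, 0.410, 0.605, 0.609)
bottom <- c(0.430, 0.266, 0.567, 0.531, 0.707, 0.716)

op <- par(mfrow = c(1, 2))   # two panels side by side

# Panel 1: Top at x = 1, Bottom at x = 2, one segment per location
plot(rep(1:2, each = 6), c(top, bottom), xlim = c(0.5, 2.5), xaxt = "n",
     xlab = "", ylab = "Zinc (mg/L)")
axis(1, at = 1:2, labels = c("Top", "Bottom"))
segments(1, top, 2, bottom)

# Panel 2: scatter plot with the reference line Bottom = Top
plot(top, bottom, xlab = "Top zinc (mg/L)", ylab = "Bottom zinc (mg/L)")
abline(0, 1, lty = 2)

par(op)

# Every point lies above the reference line
above_line <- bottom > top
```

In the first panel every segment slopes upward, and in the second every point sits above the dashed line, which is the Bottom > Top pattern that the unpaired plot hides.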
Run charts (a version of the scatter plot)
• The variable on the x axis is a time variable.
• Table: 30 consecutive outer diameters turned on a lathe
Over time, the outer diameters tend to get smaller until part 16, where there is a large jump, followed again by a pattern of generally decreasing diameters.
Section 3.2: Quantiles and Related Graphical Tools
Quantile: Roughly speaking, for a number p between 0 and 1, the p quantile of a distribution is a number such that a fraction p of the distribution lies to the left and a fraction 1-p of the distribution lies to the right.
p quantile = (100p)th percentile
Q(0.10) = 0.10 quantile = 10th percentile
Q(0.50) = 0.50 quantile = 50th percentile = median
Q(0.25) = 0.25 quantile = 25th percentile = first quartile
Q(0.75) = 0.75 quantile = 75th percentile = third quartile
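The p quantile is the ordered data point with index $i = np + 0.5$. So the cumulative probability corresponding to the $i$th ordered point is $p = \frac{i - 0.5}{n}$, that is, $Q\left(\frac{i - 0.5}{n}\right) = x_{(i)}$, where $x_{(i)}$ is the $i$th smallest value.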
Consider the following n=10 points
Q(0.25) = 0.25 quantile = 8572
Q(0.50) = median
Q(0.75) = 0.75 quantile = 9614
IQR = Interquartile Range = Q(0.75) - Q(0.25) = 9614 - 8572 = 1042
To find the 93rd percentile: 0.93 is part way between 0.85 and 0.95, so Q(0.93) is 0.8 of the way from Q(0.85) to Q(0.95):
Q(0.93) = Q(0.85) + 0.8(Q(0.95) - Q(0.85)) = 0.2*Q(0.85) + 0.8*Q(0.95) = 0.2(9614) + 0.8(10,688) = 10,473.
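The quantile rule and the interpolation step can be written as a small R function. This is a sketch, not the textbook's code: the name `quantile_half` is made up for illustration, and because the n=10 points of the example are not reproduced in these notes, the function is tried on the Trilene XL fishing line values instead. To my understanding R's built-in `quantile(x, p, type = 5)` follows the same $(i - 0.5)/n$ convention.

```r
# p quantile under the convention Q((i - 0.5)/n) = i-th smallest value,
# with linear interpolation when n*p + 0.5 is not a whole number.
# (quantile_half is a hypothetical helper name, not from the textbook.)
quantile_half <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  i <- min(max(n * p + 0.5, 1), n)  # fractional index, clamped to 1..n
  lo <- floor(i)
  hi <- ceiling(i)
  g <- i - lo                       # how far i sits between lo and hi
  (1 - g) * x[lo] + g * x[hi]
}

# Trilene XL values from the fishing line example (n = 10)
xl <- c(11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4)
q25 <- quantile_half(xl, 0.25)   # 3rd smallest value
q75 <- quantile_half(xl, 0.75)   # 8th smallest value
iqr <- q75 - q25

# The 93rd percentile arithmetic from the notes, using the two stated values
q93 <- 0.2 * 9614 + 0.8 * 10688
```

With n = 10, p = 0.25 gives index 3 and p = 0.75 gives index 8, so no interpolation is needed for the quartiles; p = 0.93 gives index 9.8, which is where the 0.2 and 0.8 weights come from.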
Boxplots are useful summaries, particularly when there are too many points for a dot plot.
• To make a boxplot, we need essentially 5 numbers.
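As an illustration, here is how side-by-side boxplots of the fishing line data might be drawn in R. Which five numbers the course uses is not spelled out on this slide; a common choice is the smallest value, the two quartiles, the median, and the largest value. Note that R's `fivenum()` and `boxplot()` use "hinges" for the box edges, which can differ slightly from Q(0.25) and Q(0.75) as defined above.

```r
# Fishing line values from the earlier example
xl    <- c(11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4)
xt    <- c(11.6, 11.8, 11.7, 11.7, 11.5, 11.6, 11.6, 11.8, 11.5, 11.7)
stren <- c(11.1, 11.1, 11.2, 11.0, 11.1, 11.3, 11.2, 10.9, 11.0, 11.1)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five_xl <- fivenum(xl)

# One box per line type (axis label is an assumption, units not given)
boxplot(list(XL = xl, XT = xt, Stren = stren), ylab = "Strength")
```

For the XL values the hinges happen to coincide with the quartiles computed from the $(i - 0.5)/n$ rule.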
Section 3.2.3: Q-Q Plots and Comparing Distributional Shapes
• Most of the statistical tools we will use in this class assume normal distributions (a bell-shaped distribution for the population of possible values).
• In order to know if these are the right tools for a particular job, we need to be able to assess whether the data appear to have come from a normal population.
• With large amounts of data, one can draw a histogram of the measured values and see if it is bell-shaped.
• A normal plot is a method for assessing normality that works well with big or small data sets. It gives a good visual check for normality.
Simulation: 100 observations, normal with mean = 5, st dev = 1
  x <- rnorm(100, mean = 5, sd = 1)   # draw 100 values from N(5, 1)
  qqnorm(x)                           # normal plot of the sample
A normal plot is a plot of the data in a way such that data from normal populations will come out pretty much in a straight line.
• We plot the corresponding quantiles of a "standard normal" distribution versus the ordered y values.
In other words: in order to plot the data and check for normality, we compare
• our observed data to
• what we would expect from a sample of standard normal data.
So if we plot ordered values from a normal population against corresponding quantiles of a standard normal population, we expect to get a reasonably straight line, since any normal distribution is linearly related to the standard normal distribution.
The textbook plots
• the standard normal quantiles on the vertical axis and
• the ordered data points on the horizontal axis.
Many software packages and other books plot the standard normal quantiles on the horizontal axis and the ordered data points on the vertical axis. Either way, the plot should look "fairly" straight if the data are from a normal distribution.
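To make the construction concrete, the normal plot can be built by hand in R. This sketch assumes the plotting positions $(i - 0.5)/n$ from Section 3.2; as far as I know `qqnorm()` uses the same positions when n is larger than 10. The seed is arbitrary and only makes the simulated sample reproducible.

```r
set.seed(1)                         # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 5, sd = 1)   # simulated normal sample

n <- length(x)
p <- (seq_len(n) - 0.5) / n         # cumulative probability of the i-th ordered point
z <- qnorm(p)                       # matching standard normal quantiles
y <- sort(x)                        # ordered data values

# Software convention: normal quantiles horizontal, ordered data vertical
plot(z, y, xlab = "Standard normal quantile", ylab = "Ordered data")
# plot(y, z) would give the textbook orientation instead

# Correlation between the two sets of quantiles, a rough straightness measure
r <- cor(z, y)
```

For data from a normal population the points should fall close to a line whose slope is roughly the standard deviation and whose intercept is roughly the mean.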
Section 3.3: Numerical Summaries
Measures of Location: Around what value are the data spread?
• Median = Q(0.50) = 50th percentile.
• Sample mean = arithmetic mean = average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The mean is more affected by unusual values than the median.
Measures of Spread:
• R = Range = Biggest - Smallest
• The size of the range can be affected by how many values we have. Many numbers will tend to have a larger range than a few numbers.
• IQR = Interquartile Range = Q(0.75) - Q(0.25), the range that includes the middle half of the values.
• Sample variance = $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Essentially an average squared deviation from the mean.
• Sample standard deviation = $s = \sqrt{s^2}$
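These summaries are one-liners in R. The sketch below applies them to the Trilene XL values from the fishing line example and also computes the variance by hand from its definition. The value 115 in the last part is a made-up data-entry error, added only to illustrate how an unusual value moves the mean far more than the median.

```r
# Trilene XL values from the fishing line example
xl <- c(11.5, 11.3, 11.7, 11.6, 11.7, 11.4, 11.5, 11.5, 11.6, 11.4)
n  <- length(xl)

xbar  <- mean(xl)                 # sample mean
med   <- median(xl)               # Q(0.50)
rng   <- max(xl) - min(xl)        # range = biggest - smallest
s2    <- var(xl)                  # sample variance (divides by n - 1)
s     <- sd(xl)                   # sample standard deviation

# Variance straight from the definition
s2_by_hand <- sum((xl - xbar)^2) / (n - 1)

# Hypothetical typo: first value entered as 115 instead of 11.5
xl_bad <- replace(xl, 1, 115)
mean_bad   <- mean(xl_bad)        # pulled far away from the bulk of the data
median_bad <- median(xl_bad)      # barely moves
```

R's `IQR()` function uses a different default quantile rule than Q(p) in Section 3.2, so its answer can differ slightly from Q(0.75) - Q(0.25) computed by hand.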
Statistics and Parameters
A statistic is a numerical summary of the sample data.
$\bar{x}$ = sample mean
$s^2$ = sample variance
A parameter is a summary of an entire population or a theoretical distribution, for example a normal distribution.
$\mu$ = population mean
$\sigma^2$ = population variance, the average squared deviation from the mean
$\sigma$ = population standard deviation
For a sample of size n, the sample variance is
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• Why divide by n - 1? This makes $s^2$ an unbiased estimator of $\sigma^2$. Unbiased means correct on average.
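Suppose we have a large population of ball bearings with diameters $\mu = 1$ cm and $\sigma^2 = 0.0004$ cm$^2$.
Sample   $\bar{x}$   $s^2$
1        0.98        0.00032
2        1.03        0.00031
3        1.01        0.00045
4        1.02        0.00052
...      ...         ...
∞        ------      --------
Mean     1.00        0.0004
If we knew $\mu$ we would find $\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$.
Fact: $\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - \mu)^2$.
So $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ would be too small for $\sigma^2$. Dividing by n - 1 makes $s^2$ come out right ($\sigma^2$) on average.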
Notice that $s^2$ is undefined if n = 1; we can't divide by zero. This makes sense. If we have only one number, that number tells us nothing about potential spread in the population.
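The "right on average" claim can be checked with a small simulation. The sketch below makes several assumptions that are not in the notes: diameters are taken to be normally distributed with mean 1 cm and variance 0.0004 (the value in the Mean row of the table), each sample has n = 5 bearings, and 20,000 samples stand in for the infinite list. The seed is arbitrary.

```r
set.seed(2)        # arbitrary seed, for reproducibility only
mu    <- 1         # population mean diameter (cm)
sigma <- 0.02      # so sigma^2 = 0.0004, matching the Mean row of the table
n     <- 5         # assumed sample size
reps  <- 20000     # number of simulated samples

# One sample per row
samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = reps)

# Sum of squared deviations from each sample's own mean
ss <- rowSums((samples - rowMeans(samples))^2)

mean_div_n   <- mean(ss / n)         # averages noticeably below sigma^2
mean_div_nm1 <- mean(ss / (n - 1))   # averages close to sigma^2 = 0.0004
```

Under these assumptions dividing by n is expected to average about (n - 1)/n times $\sigma^2$, that is 0.8 x 0.0004 = 0.00032 for n = 5, while dividing by n - 1 averages about 0.0004.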
Plotting summary statistics over time is useful for issues such as quality control. Read section 3.3.4 for general information.