161.120 Introductory Statistics Week 2 Lecture slides • Graphical Displays of Univariate Data: Dot Plots & Stem-and-leaf Plots and Histograms • Text sections 2.4 and 2.5 • CAST sections 2.2 and 2.4 • Describing Centre and Spread • Text sections 2.6 and 2.7 • CAST sections 2.5 and 2.6 • Transformation & Discrete Data • CAST sections 2.7 and 2.8
Dot Plots • A graphical display of a batch of numbers • Each value is shown as a dot against a numerical axis • Problem of overlapping • Jittering dots (used in CAST) • randomly move the dots perpendicularly to the axis in order to separate them somewhat • Stacking dots • group values into classes, then vertically stack the dots in each class • the heights of the stacks show the density for each class • The loss of detailed information in a stacked dot plot is rarely important
Stem and leaf Plots Basically a stacked dot plot using digits instead of dots and slightly different layout • The 'axis' is drawn vertically • A value is printed on the axis for each stack, giving the most significant digits that are common for all values on that stack. This is called the stem for the stack. • The digits representing the values are called the leaf digits and are drawn in a row to the right of the stems
Decimal points are not shown in the stems or the leaves • The stem '12' and leaf '3' could represent 12300 or 1230 or 123 or 12.3 or 1.23 or 0.123, etc. so need to provide a key or state the units of the stem • Distribution of values is shown by ‘canopy’ of leaves • Sometimes not shown well • Can change the value of the leaves • Or split the stems
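The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf`, the parameter `leaf_unit`, and the CD counts are all invented here (they are not the class data), values are assumed non-negative, and digits below the leaf unit are truncated rather than rounded.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Return stem-and-leaf rows as strings (hypothetical helper).

    Digits below leaf_unit are truncated; values are assumed non-negative.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)       # drop digits below the leaf unit
        stem, leaf = divmod(scaled, 10)    # last remaining digit is the leaf
        rows[stem].append(leaf)
    lines = []
    # Print every stem, including empty ones, so gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical CD counts, invented for illustration only.
cd_counts = [0, 12, 25, 30, 30, 48, 55, 60, 75, 99, 120, 150, 220, 450]
for line in stem_and_leaf(cd_counts, leaf_unit=10):
    print(line)
print("Key: stem unit = 100s, leaf unit = 10s")
```

Keeping the empty stem for the 300s makes the gap before the value near 450 visible, which is how an outlier shows up in this kind of display.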
Example 2.8 Big Music Collection About how many CDs do you own? Stem is ‘100s’ and leaf unit is ‘10s’. Final digit is truncated. Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99. The shape is skewed right.
Outliers and How to Handle Them Outlier: a data point that is not consistent with the bulk of the data. • Look for them via graphs. • Can have big influence on conclusions. • Can cause complications in some statistical analyses. • Cannot discard without justification.
Possible Reasons for Outliers and Reasonable Actions • Mistake made while taking measurement or entering it into computer. If verified, should be discarded/corrected. • Individual in question belongs to a different group than bulk of individuals measured. Values may be discarded if summary is desired and reported for the majority group only. • Outlier is legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded: they provide important information about location and spread.
Example 2.7 Tiny Boatsmen Weights (in pounds) of 18 men on crew team: Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0 Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5 Note: last weight in each list is unusually small. They are the coxswains for their teams, while others are rowers.
Clusters • If a dot plot or stem and leaf plot separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups. • Clusters may correspond to males and females, different varieties of plants,… • Detecting the cause of differences between the groups may lead to valuable insights into the data.
Histograms • Directly displays the 'canopy' shape, without separately displaying the individual values. • Are particularly useful displays for large data sets • Area equals relative frequency • Each value must contribute the same area to the histogram • Equal width classes • height of the rectangles equals the frequency of the class • vertical axis labeled ‘frequency’ • Mixed class widths • vertical axis labeled ‘density’
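The "area equals relative frequency" rule for mixed class widths can be checked with a small calculation. The sketch below assumes classes of the form [lower, upper); the function name `histogram_densities` and the data are hypothetical, chosen only to show how density is obtained.

```python
def histogram_densities(values, boundaries):
    """Return (relative_frequency, density) for each class [lo, hi)."""
    n = len(values)
    result = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        rel_freq = count / n
        density = rel_freq / (hi - lo)   # height chosen so height * width = rel_freq
        result.append((rel_freq, density))
    return result

data = [1, 2, 2, 3, 4, 4, 5, 7, 12, 18]   # hypothetical values
classes = [0, 5, 10, 20]                  # mixed class widths: 5, 5, 10
res = histogram_densities(data, classes)
for (rel_freq, density), lo, hi in zip(res, classes, classes[1:]):
    print(f"[{lo}, {hi}): relative frequency {rel_freq:.2f}, density {density:.3f}")
```

The last two classes hold the same share of the data, but the wider one is drawn half as tall, so each value still contributes the same area.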
Choice of histogram classes • Histogram classes should be chosen to give an outline that is as smooth as possible • Too narrow leads to jagged histogram • Too wide leads to 'blocky' histogram and detail is lost • Adjusting the class width and the starting position for the first class can give a surprising amount of variability in histogram shape for small data sets. As a result, you must be wary of over-interpreting features such as clusters or skewness in such histograms.
Interpreting Histograms, Stemplots, and Dotplots • Values are centered around 20 cm. • Two possible low outliers. • Apart from outliers, spans range from about 16 to 23 cm.
Five-Number Summaries • Find extremes (high, low), the median, and the quartiles (medians of lower and upper halves of the values). • Quick overview of the data values. • Information about the center, spread, and shape of data.
Notation and Finding the Quartiles Split the ordered values into the half that is below the median and the half that is above the median. Q1 = lower quartile = median of data values that are below the median Q3 = upper quartile = median of data values that are above the median
Example 2.10 Fastest Speeds (cont) Ordered data (in rows of 10 values) for the 87 males:
55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150
• Median = (87+1)/2 = 44th value in the list = 110 mph • Q1 = median of the 43 values below the median = (43+1)/2 = 22nd value from the start of the list = 95 mph • Q3 = median of the 43 values above the median = (43+1)/2 = 22nd value from the end of the list = 120 mph
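The quartile rule used in this example (medians of the lower and upper halves, leaving out the middle value when n is odd) can be sketched as follows. The function names are illustrative, and statistical software often uses slightly different quartile conventions, so results from a package may differ a little for other data sets.

```python
def median_of_sorted(xs):
    """Median of an already sorted list."""
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(values):
    """(low, Q1, median, Q3, high) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + (n % 2):]   # values above the median position
    return xs[0], median_of_sorted(lower), median_of_sorted(xs), median_of_sorted(upper), xs[-1]

# The 87 fastest speeds (mph) from the slide, written as value * count.
speeds = ([55, 60] + [80] * 4 + [85] * 4 + [90] * 5 + [92, 94] + [95] * 6
          + [100] * 9 + [101, 102] + [105] * 8 + [109] + [110] * 12 + [112]
          + [115] * 6 + [120] * 10 + [124] + [125] * 6 + [130] * 2
          + [140] * 4 + [145, 150])

print(len(speeds))                  # 87
print(five_number_summary(speeds))  # (55, 95, 110, 120, 150)
```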
Percentiles The kth percentile is a number that has k% of the data values at or below it and (100 – k)% of the data values at or above it. • Lower quartile = 25th percentile • Median = 50th percentile • Upper quartile = 75th percentile
Median, quartiles and area • The data set is split into quarters by the median and quartiles. • Histogram area is proportional to relative frequency therefore the median and quartiles split the histogram into four equal areas.
What does a box plot tell you about the distribution? • Centre • The vertical line inside the box (the median) gives an indication of the centre of the distribution. • Spread • The width of the box (the interquartile range) gives an indication of the spread of values in the distribution. • IQR = UQ - LQ • Shape • High density corresponds to adjacent box plot values being close together. In particular, if the extreme and quartile on one side are closer to the median than the extreme and quartile on the other side, this shows that the distribution is skew.
Box plot: Clusters & Outliers • Clusters • Box plots cannot show clusters in a data set • Before using a box plot, check that clusters do not exist by using a dot plot, stem and leaf plot or histogram • Outliers • The basic box plot does not clearly show an outlier • Any values more than 1.5 times the IQR from the box are considered to be outliers and are displayed with a separate cross • The 'whiskers' that are drawn to the sides of the central box extend only as far as the most extreme values that are not classified as outliers.
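As a rough sketch of the 1.5 × IQR rule, the code below classifies the Cambridge crew weights from Example 2.7. The function name `classify_outliers` is made up for this illustration, and the quartiles passed in (180.75 and 199.0) were worked out by hand with the median-of-halves rule; they are not given on the slides.

```python
def classify_outliers(values, q1, q3, k=1.5):
    """Return (outliers, whisker_ends) using fences k * IQR beyond the box."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    whiskers = (min(inside), max(inside))
    return outliers, whiskers

cambridge = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]
# Quartiles computed by hand (assumption): Q1 = 180.75, Q3 = 199.0, so IQR = 18.25.
outliers, whiskers = classify_outliers(cambridge, q1=180.75, q3=199.0)
print(outliers)   # [109.0]
print(whiskers)   # (178.5, 214.0)
```

The coxswain's weight falls below the lower fence of 153.375 pounds, so it would be drawn as a separate cross, with the lower whisker stopping at 178.5.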
Example 2.10 Fastest Speeds Ever Driven Five-Number Summary for 87 males • Median = 110 mph measures the center of the data • Two extremes describe spread over 100% of data: Range = 150 – 55 = 95 mph • Two quartiles describe spread over middle 50% of data: Interquartile Range = 120 – 95 = 25 mph
Comparing two or more groups • Box plots are particularly useful for comparing different groups of values • Rice yields in 1996
Picturing Location and Spread with Boxplots Boxplots for right handspans of males and females. • Box covers the middle 50% of the data • Line within box marks the median value • Possible outliers are marked with asterisk • Apart from outliers, lines extending from box reach to min and max values.
2.5 Pictures for Quantitative Data • Histograms: similar to bar graphs, used for any number of data values. • Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets. • Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.
2.6 Numerical Summaries of Quantitative Data Notation for Raw Data: n = number of individuals in a data set; x1, x2, x3, …, xn represent individual raw data values. Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then, n = 6, x1 = 21, x2 = 19, x3 = 20, x4 = 20, x5 = 22, and x6 = 19
Describing the Location of a Data Set • Mean: the numerical average • Median: the middle value (if n odd) or the average of the middle two values (n even) Symmetric: mean = median Skewed Left: mean < median Skewed Right: mean > median
Determining the Mean and Median The Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum$ means “add together all the values”. The Median If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list. If n is even: M = average of middle two ordered values. Average values that are (n/2) and (n/2) + 1 down from top of ordered list.
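The counting rules above translate directly into code. The sketch below applies them to the six female handspans from the notation example; the function names are illustrative, and the only subtlety is converting the 1-based positions on the slide into Python's 0-based list indices.

```python
def mean(values):
    """Add together all the values and divide by n."""
    return sum(values) / len(values)

def median(values):
    """Middle ordered value (n odd) or average of the middle two (n even)."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]             # the (n+1)/2-th value, 0-based index
    return (xs[n // 2 - 1] + xs[n // 2]) / 2    # the (n/2)-th and (n/2 + 1)-th values

handspans = [21, 19, 20, 20, 22, 19]   # cm, from the notation example
print(round(mean(handspans), 2))       # 20.17
print(median(handspans))               # 20.0
```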
Example 2.9 Will “Normal” Rainfall Get Rid of Those Odors? Data: Average rainfall (inches) for Davis, California for 47 years Mean = 18.69 inches Median = 16.72 inches In 1997-98, a company with odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.
The Influence of Outliers on the Mean and Median Larger influence on mean than median. High outliers will increase the mean. Low outliers will decrease the mean. If ages at death are: 70, 72, 74, 76, and 78 then mean = median = 74 years. If ages at death are: 35, 72, 74, 76, and 78 then median = 74 but mean = 67 years.
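The ages-at-death comparison can be reproduced with Python's standard `statistics` module. The variable names are chosen for this illustration; the data are the two sets given on the slide.

```python
from statistics import mean, median

ages_typical = [70, 72, 74, 76, 78]
ages_with_low_outlier = [35, 72, 74, 76, 78]   # 70 replaced by a low outlier

print(mean(ages_typical), median(ages_typical))                    # 74 74
print(mean(ages_with_low_outlier), median(ages_with_low_outlier))  # 67 74
```

Changing one value moves the mean by 7 years while the median, which depends only on the middle of the ordered list, stays at 74.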
2.7 Bell-Shaped Distributions of Numbers • Many measurements follow a predictable pattern: • Most individuals are clumped around the center • The greater the distance a value is from the center, the fewer individuals have that value. Variables that follow such a pattern are said to be “bell-shaped”. A special case is called a normal distribution or normal curve.
Example 2.11 Bell-Shaped British Women’s Heights Data: representative sample of 199 married British couples. The histogram below shows the wives’ heights with a normal curve superimposed. The mean height = 1602 millimeters.
Describing Spread with Standard Deviation Standard deviation measures variability by summarizing how far individual data values are from the mean. Think of the standard deviation as roughly the average distance values fall from the mean.
Describing Spread with Standard Deviation Both sets have same mean of 100. Set 1: all values are equal to the mean so there is no variability at all. Set 2: one value equals the mean and other four values are 10 points away from the mean, so the average distance away from the mean is about 10.
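Calculating the Standard Deviation Formula for the (sample) standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ The value of $s^2$ is called the (sample) variance. An equivalent formula, easier to compute, is: $s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$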
Calculating the Standard Deviation Step 1: Calculate $\bar{x}$, the sample mean. Step 2: For each observation, calculate the difference between the data value and the mean. Step 3: Square each difference in step 2. Step 4: Sum the squared differences in step 3, and then divide this sum by n – 1. Step 5: Take the square root of the value in step 4.
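The five steps can be followed line by line in code. The sketch below uses the six female handspans from the earlier notation example; the function name `sample_sd` is illustrative, and the second data set is an assumed example of a batch in which every value equals the mean.

```python
import math

def sample_sd(values):
    """Sample standard deviation, following Steps 1 to 5."""
    n = len(values)
    mean = sum(values) / n                  # Step 1: sample mean
    diffs = [x - mean for x in values]      # Step 2: differences from the mean
    squared = [d ** 2 for d in diffs]       # Step 3: squared differences
    variance = sum(squared) / (n - 1)       # Step 4: sum, then divide by n - 1
    return math.sqrt(variance)              # Step 5: square root

handspans = [21, 19, 20, 20, 22, 19]        # cm
s = sample_sd(handspans)
print(round(s, 3))                          # 1.169

all_equal = [100, 100, 100, 100, 100]       # assumed data: no variability at all
print(sample_sd(all_equal))                 # 0.0
```

For the handspans the squared differences sum to 41/6, so the variance is 41/30 (about 1.37) and the standard deviation is about 1.17 cm.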
Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule For any bell-shaped curve, approximately • 68% of the values fall within 1 standard deviation of the mean in either direction • 95% of the values fall within 2 standard deviations of the mean in either direction • 99.7% of the values fall within 3 standard deviations of the mean in either direction
The Empirical Rule, the Standard Deviation, and the Range • Empirical Rule => the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape. • You can get a rough idea of the value of the standard deviation by dividing the range by 6.
Example 2.11 Women’s Heights (cont) Mean height for the 199 British women is 1602 mm and standard deviation is 62.4 mm. • 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm • 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm • 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm
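The three intervals are simply the mean plus or minus 1, 2 and 3 standard deviations, which a few lines of code can confirm. The function name `empirical_rule_intervals` is hypothetical, and the percentages are the approximate Empirical Rule figures for bell-shaped data, not exact guarantees.

```python
def empirical_rule_intervals(mean, sd):
    """Map k (number of SDs) to (lower, upper, approximate share of values)."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}   # approximate shares for bell-shaped data
    return {k: (mean - k * sd, mean + k * sd, share) for k, share in coverage.items()}

intervals = empirical_rule_intervals(mean=1602, sd=62.4)   # heights in mm
for k, (low, high, share) in intervals.items():
    print(f"about {share:.1%} within {k} SD: {low:.1f} to {high:.1f} mm")
```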
Example 2.11 Women’s Heights (cont) Summary of the actual results: Note: The minimum height = 1410 mm and the maximum height = 1760 mm, for a range of 1760 – 1410 = 350 mm. So an estimate of the standard deviation is: Range/6 = 350/6 ≈ 58.3 mm
Standardized z-Scores Standardized score or z-score: $z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$ Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80: $z = \frac{80 - 70}{8} = 1.25$ A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.
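A z-score is a one-line calculation. The helper name `z_score` below is illustrative; the numbers are those of the pulse-rate example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z = z_score(80, mean=70, sd=8)   # resting pulse rate of 80 bpm
print(z)                         # 1.25
```

A negative result would indicate a value below the mean; for example, a pulse rate of 62 gives a z-score of -1.0.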
The Empirical Rule Restated For bell-shaped data, • About 68% of the values have z-scores between –1 and +1. • About 95% of the values have z-scores between –2 and +2. • About 99.7% of the values have z-scores between –3 and +3.
Transformations • Sometimes it is convenient to express numbers on a different scale Americans easily recognise that 90° Fahrenheit is a hot day. We understand temperatures better on the Celsius scale. • No gain or loss of information (usually) • Graphical and numerical summaries are affected. Transformations can help us understand a data set
Linear transformations new value = a + b × old value • imperial to metric measurements grams = 28.3495 × ounces • temperature Fahrenheit = 32 + 1.8 × Celsius • Relative positions of the points do not change so we neither gain nor lose information.
Linear transformations • Affect the centre and spread of the data • Shape remains unchanged • Graphical displays: only the numbers labeling the axis change • Do not help you to understand the distribution of values in the data
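A small numerical check makes these points concrete: under new value = a + b × old value, the mean becomes a + b × (old mean), the standard deviation is multiplied by |b|, and the z-scores (the relative positions) are unchanged when b is positive. The Celsius readings and helper names below are invented for illustration.

```python
from statistics import mean, stdev

def linear_transform(values, a, b):
    """Apply new value = a + b * old value to every value."""
    return [a + b * x for x in values]

def z_scores(values):
    """Standardized scores, used here to compare relative positions."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

celsius = [12.0, 15.5, 18.0, 21.0, 25.5]   # hypothetical temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)

print(mean(celsius), mean(fahrenheit))     # centre changes: a + b * mean
print(stdev(celsius), stdev(fahrenheit))   # spread changes: multiplied by b
print([round(z, 3) for z in z_scores(celsius)])
print([round(z, 3) for z in z_scores(fahrenheit)])   # same relative positions
```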
Nonlinear transformations Examples: • The wavelength of radiation (in metres) may alternatively be recorded as a frequency (in cycles per second) -- a reciprocal relationship. • A medical researcher might record the mean time between seizures for acute epileptic patients, or the rate of seizures per year -- another reciprocal relationship. • The Richter scale transforms the measured intensity of earthquakes to a logarithmic scale. Nonlinear transformations… • Change the relative distances between data values • Change the shape of a distribution
Logarithmic transformations • The most commonly used nonlinear transformation replaces each value by its logarithm new value = log10(old value) • base-10 logarithms are easier to interpret and are used in CAST, but natural logarithms (base e) have a similar effect. • logarithms can be found only for positive numbers. • log10(1) = 0, log10(10) = 1, log10(100) = 2, log10(1000) = 3 • log10(0.1) = –1, log10(0.01) = –2, etc. • Spreads out low values in a distribution and compresses high values. • Useful for skew data with a long tail towards the high values. • It will spread out a dense cluster of low values and may detect clustering or outliers that would not be visible in graphical displays of the original data.
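The way a log transformation compresses high values and spreads out low ones can be seen by comparing gaps between neighbouring values before and after transforming. This is a minimal sketch: the function name `log10_transform` is hypothetical and the data are simply the powers of ten listed above.

```python
import math

def log10_transform(values):
    """Replace each value by its base-10 logarithm (positive values only)."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithms can be found only for positive numbers")
    return [math.log10(v) for v in values]

raw = [1, 10, 100, 1000]
logged = log10_transform(raw)

raw_gaps = [b - a for a, b in zip(raw, raw[1:])]         # gaps grow: 9, 90, 900
log_gaps = [b - a for a, b in zip(logged, logged[1:])]   # gaps are all about 1
print(raw_gaps)
print([round(g, 6) for g in log_gaps])
```

On the raw scale the top gap is 100 times the bottom one; on the log scale they are equal, which is why a long upper tail is pulled in.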
A family of nonlinear transformations Power transformations raise each value in the data set to a power p • p < 1 increases the spread in the lower tail of data values and decreases the spread in the upper tail. • p > 1 expands the upper tail of the data values and compresses the lower tail. (Rarely helpful)
Discrete Data Displays • Large counts • The distribution of values can be summarised with the same methods as continuous data. • Moderate counts • Most of the earlier displays can still be used, but • Stacked dot plots are better than jittered dot plots • No information is lost by stacking since there can be a column of crosses for each distinct value. • Histogram class boundaries should end in '.5' to ensure that data values do not occur on the boundary of two classes. • Since the median, quartiles and extremes are always whole numbers (or occasionally half-way between two whole numbers), box plots do not give a very effective comparison of groups. • Small counts • A bar chart is a better representation of the data than a histogram