Jan. 20-23 • Shapes of distributions… • “Statistics” for one quantitative variable… • Mean and median • Percentiles • Standard deviations • Transforming data… • Rescale: Y = c times X • Recenter: Y = X plus a • adding variables to each other • other transformations
Shape of a distribution… • Outliers • Unimodal --- Bimodal --- Multimodal • Symmetrical • Skew - right or left?
Population • vs. • Sample
A statistic is • anything that can be computed from data.
STATISTICS of a single quantitative variable • MEAN • MEDIAN • QUARTILES ( Q1, Q3 ) • Five-number summary • Boxplots • Interquartile range • PERCENTILES / QUANTILES / FRACTILES • (“quantiles” and “fractiles” are synonyms for “percentiles” for people who don’t like the implied multiplication by 100) • STANDARD DEVIATION • VARIANCE
Statistics of one variable… • MEAN — Sum of values, divided by n • MEDIAN — Middle value • (when values are ranked, smallest to largest) • (or, average of two middle values)
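The two definitions above can be sketched directly in Python. This is a minimal illustration, not part of the original slides; the function names and the sample values are hypothetical.

```python
def mean(values):
    """Sum of the values divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle value after ranking; average of the two middle values if n is even."""
    ranked = sorted(values)          # rank smallest to largest
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd n: a single middle value exists
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2   # even n: average the two middle values


data_odd = [3, 1, 4, 1, 5]           # hypothetical values, n = 5
data_even = [3, 1, 4, 1, 5, 9]       # hypothetical values, n = 6

print(mean(data_odd), median(data_odd))
print(mean(data_even), median(data_even))
```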
Colleges (Datadesk histogram) • median: 5 • mean: 5.36
Salaries • median: 60,000 • mean: 106,875
So, which measure of “center” is best? • All the measures agree (roughly) when the distribution is symmetrical • Mean has attractive mathematical properties • Also, the mean is related to the total, if that’s what you care about • Median may be more “typical” when the distribution is non-symmetrical • A measure is “robust” if it works reasonably well under a wide variety of circumstances • Medians are robust
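A small experiment can make the robustness point concrete. The sketch below uses Python's standard `statistics` module on made-up salary figures (they are not the data behind the slides) and adds one extreme value to see how each measure of center reacts.

```python
import statistics

# Hypothetical salaries, roughly symmetrical
salaries = [40_000, 50_000, 60_000, 70_000, 80_000]

# The same list with one very large value added
with_outlier = salaries + [1_000_000]

print("mean before:  ", statistics.mean(salaries))
print("median before:", statistics.median(salaries))
print("mean after:   ", statistics.mean(with_outlier))
print("median after: ", statistics.median(with_outlier))
```

In this made-up example the mean and median agree on the symmetrical list, and the single large value moves the mean far more than the median.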
Jan. 23 • RMS, Geometric mean • Percentiles, Quartiles (Q1, Q3), BOX PLOTS • Measures of spread: • IQR (range containing middle half) • Standard deviation ( $\sigma$, s ) • Variance • Transforming data… • Rescale: Y = c times X • Recenter: Y = X plus a • adding variables to each other • other transformations • “STANDARDIZING” a variable • NORMAL DISTRIBUTIONS
Computing percentiles • To calculate 20-th percentile: • Rank the values from smallest to largest • Compute 20% of n… 20% of 72 = 14.4 • Count off that many values (from lowest)… • The value at which you stop is the 20-th percentile. • What if you stop between values ?
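The counting-off procedure can be sketched as follows. The slide leaves open what to do when the count falls between values; this sketch assumes one common convention (round a fractional count up to the next value, and average two neighbours when the count is a whole number). Other conventions exist, and the function name and data are hypothetical.

```python
import math


def percentile(values, p):
    """p-th percentile by counting off p% of n values from the lowest."""
    ranked = sorted(values)                  # rank smallest to largest
    n = len(ranked)
    position = p * n / 100                   # e.g. 20% of 72 = 14.4
    if position != int(position):
        # Assumed convention: a fractional count rounds up to the next value
        return ranked[math.ceil(position) - 1]
    k = int(position)
    if k == 0:
        return ranked[0]
    if k == n:
        return ranked[-1]
    # Assumed convention: a whole-number count lands between two values,
    # so average the k-th and (k+1)-th ranked values
    return (ranked[k - 1] + ranked[k]) / 2


values = list(range(1, 73))                  # 72 hypothetical values: 1, 2, ..., 72
print(percentile(values, 20))                # count 14.4, so take the 15th ranked value
```

Under this convention the 50th percentile coincides with the median as defined earlier.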
QUARTILES • Lower quartile (Q1) = 25-th percentile • Upper quartile (Q3) = 75-th percentile • ( What’s Q2 ? ) • INTERQUARTILE RANGE ( IQR ) = Q3 minus Q1
Five-number summary • — maximum • — Q3 • — median • — Q1 • — minimum
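A five-number summary and the IQR can be computed as in the sketch below. Quartile conventions differ between textbooks and software; this version assumes Q1 and Q3 are the medians of the lower and upper halves of the ranked data (leaving out the middle value when n is odd). The function name and data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]                    # values below the median position
    upper = ranked[half + (n % 2):]          # values above it (skip the middle value if n is odd)
    return (
        ranked[0],
        statistics.median(lower),            # Q1: median of the lower half (assumed convention)
        statistics.median(ranked),
        statistics.median(upper),            # Q3: median of the upper half (assumed convention)
        ranked[-1],
    )


data = [2, 4, 4, 5, 7, 8, 9, 11, 12]         # hypothetical values
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1                                # interquartile range = Q3 minus Q1
print(minimum, q1, med, q3, maximum, iqr)
```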
VARIANCE and STANDARD DEVIATION • VARIANCE ($s^2$): $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ • STANDARD DEVIATION ($s$): $s = \sqrt{s^2}$
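As a rough illustration, the sample variance and standard deviation can be computed by hand. The sketch assumes the usual sample convention of dividing by n − 1 (the convention that normally goes with the symbol s); the function names and data are hypothetical.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed convention)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical values, mean 5
print(sample_variance(data))                 # squared deviations sum to 32, divided by 7
print(sample_std_dev(data))
```

Python's `statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they can serve as a cross-check.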
Linear Transformations • If you MULTIPLY or DIVIDE a variable by a constant… • Y = c times X, or Y = X / c • then… • measures of center are multiplied or divided by c • measures of spread are multiplied or divided by |c| • If you ADD or SUBTRACT a constant from a variable… • Y = X + a, or Y = X – a • then… • measures of center are increased (decreased) by a • measures of spread are UNCHANGED.
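These rules are easy to check numerically. The sketch below uses hypothetical data and a negative c, so that the role of |c| in the spread rule is visible.

```python
import math
import statistics

x = [1.0, 2.0, 4.0, 7.0]                     # hypothetical values
c, a = -3.0, 10.0                            # hypothetical constants

rescaled = [c * v for v in x]                # Y = c times X
recentered = [v + a for v in x]              # Y = X + a

# Center is multiplied by c; spread is multiplied by |c|
print(math.isclose(statistics.mean(rescaled), c * statistics.mean(x)))
print(math.isclose(statistics.stdev(rescaled), abs(c) * statistics.stdev(x)))

# Center shifts by a; spread is unchanged
print(math.isclose(statistics.mean(recentered), statistics.mean(x) + a))
print(math.isclose(statistics.stdev(recentered), statistics.stdev(x)))
```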
More transformations • ADDING VARIABLES: • W = X + Y • Mean (W) = Mean (X) + Mean (Y) • Standard Deviation of (W) — anything can happen • OTHER TRANSFORMATIONS: • Y = X squared ? • Y = log (X) ? • …NO RELIABLE RULES for mean • or std. dev.
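A small made-up example shows both points: means always add, while the standard deviation of a sum depends on how the two variables move together. The last lines illustrate why squaring has no simple rule for the mean. All data here are hypothetical.

```python
import statistics

x = [1, 2, 3, 4, 5]
y_same = [1, 2, 3, 4, 5]                     # rises together with x
y_opposite = [5, 4, 3, 2, 1]                 # falls as x rises; same mean and spread as y_same

w_same = [a + b for a, b in zip(x, y_same)]
w_opposite = [a + b for a, b in zip(x, y_opposite)]

# Mean(W) = Mean(X) + Mean(Y) in both cases
print(statistics.mean(w_same), statistics.mean(w_opposite))

# Standard deviation of W: very different in the two cases
print(statistics.stdev(w_same), statistics.stdev(w_opposite))

# Y = X squared: the mean of the squares is not the square of the mean
squares = [v ** 2 for v in x]
print(statistics.mean(squares), statistics.mean(x) ** 2)
```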
Standardized Variables • Write $\bar{X}$ and S for the mean and standard deviation of X • Then form the transformed variable: • $Z = (X - \bar{X}) / S$ • Then… • mean (Z) = 0 • std dev (Z) = 1 • Z answers the question: How many standard deviations is this value above (or below) the mean?
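Standardizing can be sketched in a few lines. The helper name and data are hypothetical; the check at the end confirms, up to rounding error, that the standardized values have mean 0 and standard deviation 1.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)             # sample standard deviation
    return [(x - xbar) / s for x in values]


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
z = standardize(x)

print(z[-1])                                 # how many std devs the value 9.0 is above the mean
print(statistics.mean(z), statistics.stdev(z))
```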
Jan. 25 • More on transforming and standardizing variables • More on normal distributions • Jan. 27++ • Relations among variables: scatterplots • “independent” variables • correlations • linear regressions (best fit lines)
Normal Density Function • $X \sim N(\mu, \sigma)$ • $\mu$ = mean, $\sigma$ = std. dev. • (Why Greek? Why not x-bar, s?)
Trying the integral • Standard normal: mean = 0, std. dev. = 1 • Density curve: $\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ • …so the area between a and b is: $\int_a^b \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz$
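The integral of the standard normal density has no elementary closed form, but it is easy to approximate numerically. The sketch below assumes the standard normal density $\frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ and uses the trapezoid rule; the function names and step count are hypothetical choices.

```python
import math


def standard_normal_density(z):
    """Density of the standard normal (mean 0, std. dev. 1) at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


def area_between(a, b, steps=10_000):
    """Trapezoid-rule approximation of the area under the density from a to b."""
    width = (b - a) / steps
    total = 0.5 * (standard_normal_density(a) + standard_normal_density(b))
    for i in range(1, steps):
        total += standard_normal_density(a + i * width)
    return total * width


# Areas within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(area_between(-k, k), 4))
```

The three areas come out close to 0.68, 0.95 and 0.997, which is where the 68 – 95 – 99.7 rule comes from.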
The core computation • If $X \sim N(\mu, \sigma)$, what fraction of values are between a and b? • Rule of 68 – 95 – 99.7 • Standardizing • Tables and computers • Reversing the calculation
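"Reversing the calculation" means starting from a fraction and asking for the value below which that fraction falls. As one possible illustration of the "computers" route, Python's standard library (3.8 and later) provides `statistics.NormalDist`; the mean of 100 and standard deviation of 15 are hypothetical numbers chosen for the example.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)          # hypothetical mean and std. dev.

# Forward: what fraction of values lies between a = 85 and b = 115?
fraction = dist.cdf(115) - dist.cdf(85)
print(round(fraction, 4))

# Reverse: below which value do 97.5% of the values fall?
cutoff = dist.inv_cdf(0.975)
print(round(cutoff, 1))
```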
Standardizing • Same Question: • Is X between a and b ? • Is $(X-\mu)/\sigma$ between $(a-\mu)/\sigma$ and $(b-\mu)/\sigma$ ? • But $Z = (X-\mu)/\sigma$ is a variable with a standard normal distribution (mean 0, standard deviation 1). • So, if we can answer this question for standard normals, we can answer it for all normals.
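The argument above translates into a short computation: standardize a and b, then answer the question for the standard normal. The sketch assumes the standard normal cumulative fraction is obtained from the error function `math.erf`; the function names and the example numbers are hypothetical.

```python
import math


def standard_normal_cdf(z):
    """Fraction of a standard normal distribution lying below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def fraction_between(a, b, mu, sigma):
    """Fraction of an N(mu, sigma) variable lying between a and b."""
    z_a = (a - mu) / sigma                   # standardize the lower limit
    z_b = (b - mu) / sigma                   # standardize the upper limit
    return standard_normal_cdf(z_b) - standard_normal_cdf(z_a)


# Hypothetical example: mean 100, std. dev. 15
print(round(fraction_between(85, 115, 100, 15), 4))   # within 1 std. dev. of the mean
print(round(fraction_between(70, 130, 100, 15), 4))   # within 2 std. devs. of the mean
```

Only the standardized limits depend on the mean and standard deviation; the standard normal function is the same for every normal distribution.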