Statistics

Introduction When studying a process, we are interested in understanding sources of variation in our inputs and outputs. Descriptive statistical methods help: • Understand patterns of variation in data, and • Describe characteristics of a population.

Objectives After completing this section, participants should be able to: • Define and identify the three types of data: nominal, ordinal, and continuous; • Construct and interpret histograms; • Calculate and interpret the sample mean, standard deviation, variance, median, and range; • Characterize distributions as skewed, symmetric, bimodal, multimodal, normal, and mound-shaped.

Introduction In this section, we address the following for continuous data: • Graphical displays of data, and • Numerical measures of location and spread.

Introduction The methods used to display and analyze data depend on the type of data of interest. Measurements can be classified into three types: • Nominal – Measurements are unordered categories. • Ordinal – Measurements are ordered categories. • Continuous – The measurement of interest is in units of measure that, at least conceptually, follow a continuous scale.

Introduction Another classification of data, used in distinguishing control charts, has two main categories: • Attribute data: Counts and proportions resulting from nominal data. • Variables data: Again, conceptually continuous measurements with a unit of measure.

Histograms A Six Sigma team is charged with reducing delivery time to customers. Delivery time is defined as the number of days between shipment of an order by the company and receipt of the order at the customer’s site. Delivery time is obtained for 100 randomly chosen orders. Note that we treat these data as continuous.

Histograms The list of data values is not very informative. The 100 data values can be grouped in order to give more information on overall behavior. This Stem-and-Leaf plot groups the data while retaining the original values. Note that 36 deliveries occurred within 4 days, 27 took 5 to 9 days, etc. What can you conclude about delivery times?

Histograms The chart below, called a histogram, is a special kind of bar chart. The histogram respects the order that is implicit in the continuous data, and gives a cleaner picture of the data than does the stem-and-leaf plot. The height of each bar, given on the vertical axis, indicates the number of delivery times that fall within each interval, given on the horizontal axis.

Histograms We say that the histogram gives a picture of the distribution of delivery times. The continuous curve superimposed on the histogram gives a picture of the shape of the distribution. This distribution has a long right tail. We say that it is skewed to the right.

Histograms We can use histograms to assess three characteristics of the distribution: Centering, spread, and shape. Where are the delivery times centered? How much do they spread or vary? What is the shape of the distribution?

Histograms We can generate a histogram and discuss this distribution in terms of centering, spread, and shape. Does the process appear to be meeting the specification limits (the blue vertical lines)?

Histograms Common Histogram Shapes • Left Skewed: Data trails off to the left. • Symmetric: Data has approximately the same distribution on either side of the center. • Right Skewed: Data trails off to the right.

Histograms More Histogram Shapes • Bi-modal or multi-modal: Data has more than one peak. • Uni-modal: Data has one peak. • Uniform: Data is evenly distributed over its range.

Histograms Compare the centering and spread (variability) of these three distributions.

Histograms Histograms provide many benefits. Histograms: • Summarize the data. • Allow one to assess centering, spread, and shape. • Help to identify unusual patterns in data. Histograms also have some limitations: • Conclusions about the shape of the underlying distribution should not be drawn without a large enough data set (at least 75 randomly chosen data values - 100 data values are recommended). • Individual data values are not shown. • Improper bin sizes, as we will see on the following slide, can mask important data features.

Histograms [Figure: three histograms of the same data, drawn with bins of width 20, 10, and 40.] • 9 bins of width 20. • 18 bins of width 10: Too many bins? Do we see too much noise? • 5 bins of width 40: Too few? Do we lose too much information?
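The effect of bin width can be tried out with a short sketch. The data below are hypothetical (a seeded random mixture of two groups, not the data behind the slide), and the helper name bin_counts is an assumption made for illustration. Only the three bin layouts (9 bins of width 20, 18 of width 10, 5 of width 40, all starting at 70) come from the slide.

```python
import random


def bin_counts(values, start, width, n_bins):
    """Count the values falling in each of n_bins equal-width intervals."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the plotted range are ignored
            counts[idx] += 1
    return counts


random.seed(1)
# Hypothetical sample: a mixture of two groups, kept within 70 to 250.
raw = [random.gauss(120, 15) for _ in range(60)]
raw += [random.gauss(200, 15) for _ in range(60)]
data = [v for v in raw if 70 <= v < 250]

# Bin width -> number of bins, as on the slide.
layouts = {20: 9, 10: 18, 40: 5}
histograms = {w: bin_counts(data, 70, w, n) for w, n in layouts.items()}

# Print a crude text histogram for each layout.
for width, counts in histograms.items():
    print(f"bin width {width}:")
    for i, c in enumerate(counts):
        print(f"  {70 + i * width:>3} | " + "#" * c)
```

Comparing the three printouts shows the trade-off the slide asks about: narrow bins tend to produce a ragged outline, while wide bins can merge separate peaks.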

Measures of Location and Spread • Graphical displays are often supplemented with numerical measures that summarize the information in the data. • Measures of centering or location include: • Mean • Median • Mode • Measures of spread or variability include: • Variance • Standard deviation • Range

Measures of Location and Spread In a study of pull-off force for bonded wires, given in foot-pounds, what can we conclude about the distribution of values? What is a typical pull-off force? How do the measurements vary about the center?

Measures of Location and Spread The sample mean or average is the most important measure of centering. The sample mean, referred to as ‘X-bar’, is the average of all $n$ observations from a sample: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The mean is the center of gravity or balancing point of a data set. The sample mean is an estimate of the population mean, which is the average of all observations from a population.

Measures of Location and Spread The sample mean for the pull-off force data. Notice that the mean is denoted by a fulcrum to emphasize that it is the balancing point of the distribution of values.

Measures of Location and Spread The sample median is the 50th percentile of the sample data. Half of the data values lie below the median and half lie above the median. The median is the middle value when the data are ordered. The sample median is denoted $X_{0.50}$. The sample median and sample mean are approximately the same if the distribution is symmetric.
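As a small illustration of the mean and median, the sketch below uses Python's standard statistics module on two hypothetical data sets (they are not the pull-off force data). It suggests why the two measures agree for a symmetric sample and separate when one large value skews the sample to the right.

```python
import statistics

# Hypothetical symmetric sample.
symmetric = [3, 4, 5, 5, 6, 7]
# The same sample with one large value added, which skews it to the right.
right_skewed = symmetric + [40]

mean_sym = statistics.mean(symmetric)
median_sym = statistics.median(symmetric)
mean_skew = statistics.mean(right_skewed)
median_skew = statistics.median(right_skewed)

print("symmetric:    mean =", mean_sym, " median =", median_sym)
print("right skewed: mean =", mean_skew, " median =", median_skew)
```

In the skewed sample the mean is pulled toward the long tail while the median stays with the bulk of the values.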

Measures of Location and Spread The sample mode is the most frequently occurring value in a dataset. The mode is of little interest in itself. Terms such as unimodal, bimodal and multimodal are of interest: • A unimodal distribution has one peak. • A bimodal distribution has two peaks. • A multimodal distribution has two or more peaks. Multimodal distributions are often indications that more than one underlying population or process is represented in the data.

Measures of Location and Spread Example: Histogram of the Asphalt Content [Figure: histogram of asphalt content, horizontal axis from 3.0 to 9.0.] Possibly a mixture of data from four different batches of material?

Measures of Location and Spread [Figure: separate histograms of asphalt content for Batch 1, Batch 2, Batch 3, and Batch 4, each on a horizontal axis from 3 to 9.]

Measures of Location and Spread The two most important measures of the variability or spread of sample data are the sample variance and sample standard deviation. Sample Variance, denoted by $S^2$: • $S^2$ is an estimate of the population variance, $\sigma^2$. • $S^2$ = “average” squared distance between the data points and the sample mean. Sample Standard Deviation, denoted by $S$: • $S$ is an estimate of the population standard deviation, $\sigma$. • $S$ = square root of “average” squared distance between the data points and the sample mean.

Measures of Location and Spread The sample variance is the “average” squared distance between the data points and the sample mean: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$. The sample standard deviation is the square root of the sample variance: $S = \sqrt{S^2}$. Note that the sample standard deviation is a value whose units are the original measurement units.

Measures of Location and Spread Calculation of variance and standard deviation for the pull-off force data.

Measures of Location and Spread The sample range (R) is the difference between the largest observation and the smallest observation. The sample range is the simplest measure of spread: R = High value - Low value. This formula is often written as: $R = X_{\max} - X_{\min}$. Example: For the pull-off force data, $R = X_{\max} - X_{\min}$ = 13.6 - 12.3 = 1.3 ft-lbs.
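The three measures of spread can be computed by hand and checked against Python's standard statistics module. This is a minimal sketch on hypothetical values (not the pull-off force data). It assumes the usual n - 1 divisor for the sample variance, which is the convention statistics.variance follows.

```python
import math
import statistics

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)
s2 = sum_sq_dev / (n - 1)

# Sample standard deviation: square root of the variance (original units).
s = math.sqrt(s2)

# Sample range: largest observation minus smallest observation.
r = max(data) - min(data)

print("variance =", round(s2, 3))
print("std dev  =", round(s, 3))
print("range    =", r)

# Cross-check against the library implementations.
lib_s2 = statistics.variance(data)
lib_s = statistics.stdev(data)
```

The hand calculation and the library functions should agree up to floating-point rounding.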

Parameters and Statistics We will start by defining some basic statistical terms used throughout this course. A population is a set of all possible observations or units of interest. • Some populations are finite (all parts in inventory today). • Others are conceptual (all parts that can be produced by a machine at given settings). A sample is a set of observations drawn from a population. A random sample is a representative sample drawn from the population. Such a sample must be selected in a random manner so that each member of the population has an equal probability of being selected.

Parameters and Statistics The population mean is the theoretical (unknown) average of all population measurements. For continuous data, the population mean is denoted by the Greek letter $\mu$ (mu). The population standard deviation is the theoretical (unknown) standard deviation of a population. For continuous data, it is denoted by the Greek letter $\sigma$ (sigma). The population variance is the theoretical variance of a population. For continuous data, it is denoted by $\sigma^2$ (sigma squared). We virtually never know the true values of the population mean, standard deviation, or variance.

Parameters and Statistics A parameter is a numerical value calculated from population data. • The population mean, standard deviation, and variance are examples of parameters. • Since we are virtually never able to compute parameters, they are theoretical quantities. • Parameters are often represented by Greek letters: $\mu$, $\sigma$, and $\sigma^2$ are examples of this convention. A statistic is a numerical value calculated from sample data. • Examples are the sample mean ($\bar{X}$), the sample standard deviation ($S$), and the sample variance ($S^2$). • These statistics are used to estimate the corresponding population parameters.

Parameters and Statistics • Population (Unknown!): $\mu$; $\sigma$, $\sigma^2$; $p$. • Sample (Can calculate!): $\bar{X}$; $S$, $S^2$; $\hat{p}$.

Parameters and Statistics Other examples of parameters are: • the median, • the range, and • any percentile of a population. These are estimated by taking a random sample from a population, and calculating its median, range, and percentiles. Of critical importance in estimating population parameters is the ability to draw a sample (often, but not always, a random sample) from the population. In order to do this, the population of interest must be well-defined.

Distributions The word distribution is used to describe the pattern formed by measurements. For example, we discuss the distribution of cycle times or of errors. In the case of a population of measurements, we talk about the theoretical distribution. For example, cycle times from a sample might have the distribution given by the histogram on the next slide. Keep in mind that the theoretical distribution is, in general, unknown. It is something that we try to estimate.

Distributions [Figure: histogram of cycle times, horizontal axis from 100 to 250, with a continuous curve overlaid.] The histogram might give cycle times for a sample of 100 orders. The continuous curve that is overlaid on the histogram might represent the distribution of cycle times in the underlying population.

Distributions Theoretical distributions are useful for two main reasons: • Modeling data, and • Providing a “yardstick” for sample statistics. With this in mind, we will introduce several useful theoretical distributions: • The normal (or Gaussian) distribution, which is based on continuous data, • The binomial distribution, which applies to two-category nominal data, and • The Poisson distribution, which applies to counts of occurrences.

The Normal Distribution The normal distribution is the basis for many of the statistical techniques that we cover throughout the course. There are many other distributions that are used for modeling continuous data. However, the normal distribution is useful in terms of sample statistics as well as for modeling data. The normal distribution has the classic bell shape. There are infinitely many normal distributions, each defined by a value for the mean, $\mu$, and one for the variance, $\sigma^2$. If a quantity, call it X, has a normal distribution with mean $\mu$ and variance $\sigma^2$, we denote this by writing $X \sim N(\mu, \sigma^2)$.

The Normal Distribution There are infinitely many possible normal distributions defined by values of the population parameters $\mu$ and $\sigma^2$. Below, we see examples of normal curves with different means and variances. [Figure: normal curves for N(15, 9) and N(25, 1).]

The Normal Distribution Example: The distributions of characteristics of manufactured product are often normal. Suppose that certain bonded wires have pull-off force measurements, X, that are normally distributed with mean 10 and variance 4. The distribution of X is denoted by $X \sim N(10, 4)$. Note that $\sigma = 2$. The shaded area shows P(X > 13), where X represents the pull-off force.
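The shaded probability can be evaluated numerically. The sketch below uses NormalDist from Python's standard statistics module. Note that NormalDist takes the standard deviation (2), not the variance (4). The variable names are chosen here for illustration.

```python
from statistics import NormalDist

# X ~ N(10, 4): mean 10, variance 4, so the standard deviation is 2.
pull_off = NormalDist(mu=10, sigma=2)

# P(X > 13) is the area to the right of 13, i.e. 1 minus the cumulative
# probability up to 13.
p_above_13 = 1 - pull_off.cdf(13)

print(round(p_above_13, 4))  # about 0.067
```

Since 13 lies 1.5 standard deviations above the mean, the same value could be read from a standard normal table at z = 1.5.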

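The Normal Distribution [Figure: normal curve showing that 68.27% of values lie within one standard deviation of the mean, 95.45% within two, and 99.73% within three.]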

The Binomial Distribution Another distribution that is extremely prevalent is called the binomial distribution. Binomial data are data that result from a series of trials, where each trial results in only one of two possible values: pass or fail, success or failure, yes or no, etc. To have a binomial distribution, three conditions must be met: • The number of trials, denoted by n, is fixed in advance; • The probability of obtaining a success (which is denoted by “p”) must be constant from trial to trial; • The trials are independent (obtaining a success on one trial must not affect the likelihood of obtaining a success on another trial).

The Binomial Distribution A binomial variable is the total number of successes in n trials where the previous conditions are satisfied. The binomial distribution provides a model for many industrial situations. The following are quantities that might well have binomial distributions: • The number of parts produced with a particular type of defect; • The number of orders for a given part that take in excess of 20 days to fill; • The number of late deliveries of a certain type of shipment.
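The slides do not state the binomial probability formula, so the following is only a sketch using the standard textbook form, P(X = k) = C(n, k) p^k (1 - p)^(n - k). The values n = 20 and p = 0.05 are hypothetical, chosen to resemble a small inspection lot with a 5% defect rate.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical setting: 20 parts inspected, each defective with probability 0.05.
n, p = 20, 0.05
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Probability of finding at most one defective part.
p_at_most_one = probs[0] + probs[1]

print("P(no defects)          =", round(probs[0], 4))
print("P(at most one defect)  =", round(p_at_most_one, 4))
```

The probabilities over k = 0, ..., n sum to one, which is a quick sanity check on any implementation.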

The Binomial Distribution In the case of binomial data, we are usually interested in the proportion of successes in a series of trials (rather than the total number of successes). Equivalently, we are interested in the probability p of a success. For example, we may be interested in the proportion of parts with a defect, or the proportion of late deliveries. The proportion of successes in a random sample drawn from a binomial distribution is denoted by $\hat{p}$.

The Binomial Distribution Example: Suppose that 100 records are randomly chosen from the data warehouse, and that 12 of these have a particular type of error. Then an estimate of p, the proportion of records in the entire data warehouse that have this type of error, is given by the sample proportion: 12/100 = 0.12.

The Poisson Distribution Another distribution of interest is called the Poisson distribution. The Poisson distribution is used to model the number of occurrences of an event that is relatively rare in some unit of time or space. The following might be modeled by Poisson distributions: • The number of stacking marks per month’s production of cups; • The number of customer returns of a given type of product, reported weekly; • The number of OSHA recordable injuries per 100,000 man hours; • The number of defects in a large casting.

The Poisson Distribution The parameter of interest for a Poisson distribution is the average number of occurrences in the unit of time or space. Examples are the population mean number of errors of a specific type entering the data warehouse daily, or the population mean number of defects in large castings. This theoretical mean is denoted by “c”, for “count”. A sample can be used to estimate c. The estimate is simply the average of the counts in the sample. For example, if the numbers of errors for five randomly chosen days are 8, 5, 6, 4, and 7, then c is estimated by (8 + 5 + 6 + 4 + 7)/5 = 30/5 = 6 errors per day.
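The estimate of c from the five daily counts can be reproduced in a few lines. The second half of the sketch goes one step beyond the slide: it plugs the estimate into the standard Poisson probability formula, P(X = k) = e^(-c) c^k / k!, to approximate the chance of more than 10 errors in a day. The names c_hat and poisson_pmf, and the threshold of 10, are illustrative assumptions.

```python
from math import exp, factorial

# Daily error counts from the example above.
counts = [8, 5, 6, 4, 7]

# The estimate of c is the average of the sample counts.
c_hat = sum(counts) / len(counts)


def poisson_pmf(k, c):
    """Probability of exactly k occurrences when the mean count is c."""
    return exp(-c) * c**k / factorial(k)


# Illustrative use: chance of more than 10 errors in a day, treating the
# estimate as if it were the true mean.
p_more_than_10 = 1 - sum(poisson_pmf(k, c_hat) for k in range(11))

print("estimated c =", c_hat)
print("P(X > 10)   =", round(p_more_than_10, 4))
```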

The Empirical Rule • The Empirical Rule provides an estimate of the proportion of data values falling within a certain distance of the mean. • The Empirical Rule states that, if a frequency distribution is approximately symmetric and mounded in shape, then: • Approximately 68% of all values will fall within one standard deviation of the mean. • Approximately 95% will fall within two standard deviations of the mean. • Nearly 100% will fall within 3 standard deviations of the mean. • The Empirical Rule is derived from the probabilities associated with a normal random variable.
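The link between the Empirical Rule and the normal distribution can be checked directly. This sketch uses NormalDist from Python's standard statistics module to compute the probability that a normal variable falls within 1, 2, and 3 standard deviations of its mean. The function name within_k_sd is chosen here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def within_k_sd(k):
    """Probability that a normal variable lies within k standard deviations
    of its mean (the result is the same for any mean and variance)."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: within_k_sd(k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.2%}")
```

The printed percentages correspond to the rounded figures of roughly 68%, 95%, and nearly 100% quoted in the rule.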

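The Empirical Rule We repeat a graph shown earlier. [Figure: normal curve showing that 68.27% of values lie within one standard deviation of the mean, 95.45% within two, and 99.73% within three.]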
