
### Statistics

Introduction

When studying a process, we are interested in understanding sources of variation in our inputs and outputs.

Descriptive statistical methods help:

• Understand patterns of variation in data, and
• Describe characteristics of a population.
Objectives

After completing this section, participants should be able to:

• Define and identify the three types of data: nominal, ordinal, and continuous;
• Construct and interpret histograms;
• Calculate and interpret the sample mean, standard deviation, variance, median, and range;
• Characterize distributions as skewed, symmetric, bimodal, multimodal, normal, or mound-shaped.
Introduction


In this section, we address the following for continuous data:

• Graphical displays of data, and
• Numerical measures of location and spread.
Introduction

The methods used to display and analyze data depend on the type of data of interest.

Measurements can be classified into three types:

• Nominal – Measurements are unordered categories.
• Ordinal – Measurements are ordered categories.
• Continuous – The measurement of interest is in units of measure that, at least conceptually, follow a continuous scale.
Introduction

Another classification of data, used in distinguishing control charts, has two main categories:

• Attribute data: Counts and proportions resulting from nominal data.
• Variables data: Conceptually continuous measurements with a unit of measure.
Histograms

A Six Sigma team is charged with reducing delivery time to customers.

Delivery time is defined as the number of days between shipment of an order by the company and receipt of the order at the customer’s site.

Delivery time is obtained for 100 randomly chosen orders.

Note that we treat this data as continuous.

Histograms

The list of data values is not very informative. The 100 data values can be grouped in order to give more information on overall behavior.

This Stem-and-Leaf plot groups the data while retaining the original values.

Note that 36 deliveries occurred within 4 days, 27 took 5 to 9 days, etc.

What can you conclude about delivery times?
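The grouping a stem-and-leaf plot performs can be sketched in a few lines of Python; the delivery times below are hypothetical stand-ins, since the original 100 values are not reproduced in this transcript:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integer values by tens digit (the stem); ones digits are the leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(sorted(stems.items()))

# Hypothetical delivery times in days (not the original data set):
times = [2, 3, 3, 5, 7, 8, 12, 12, 15, 21]
plot = stem_and_leaf(times)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
# Stem 0 holds values 0-9, stem 1 holds 10-19, and so on.
```

Each row plays the role of one group, yet every individual value remains recoverable from the display, which is exactly the property the slide highlights.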

Histograms

The chart below, called a histogram, is a special kind of bar chart.

The histogram respects the order that is implicit in the continuous data, and gives a cleaner picture of the data than does the stem-and-leaf plot.

The height of each bar, given on the vertical axis, indicates the number of delivery times that fall within each interval, given on the horizontal axis.
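The counting that produces the bar heights can be sketched as follows; the data and bin edges are hypothetical stand-ins for the delivery-time example:

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall in each half-open interval [edge[i], edge[i+1])."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical delivery times, binned into 5-day intervals:
times = [1, 2, 3, 4, 4, 6, 7, 9, 11, 14]
print(histogram_counts(times, [0, 5, 10, 15]))  # -> [5, 3, 2]
```

The returned counts are the bar heights on the vertical axis; the bin edges are the intervals on the horizontal axis.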

Histograms

We say that the histogram gives a picture of the distribution of delivery times.

The continuous curve superimposed on the histogram gives a picture of the shape of the distribution.

This distribution has a long right tail.

We say that it is skewed to the right.

Histograms

We can use histograms to assess three characteristics of the distribution: Centering, spread, and shape.

Where are the delivery times centered?

How much do they spread or vary?

What is the shape of the distribution?

Histograms

We can generate a histogram and discuss this distribution in terms of centering, spread and shape.

Does the process appear to be meeting the specifications limits (the blue vertical lines)?

Left Skewed: Data trails off to the left.

Symmetric: Data has approximately the same distribution on either side of the center.

Right Skewed: Data trails off to the right.

Histograms

Common Histogram Shapes

Bi-modal or multi-modal: Data has more than one peak.

Uni-modal: Data has one peak.

Uniform: Data is evenly distributed over its range.

Histograms

More Histogram Shapes

Histograms

Compare the centering and spread (variability) of these three distributions.

Histograms

Histograms provide many benefits. Histograms:

• Summarize the data.
• Allow one to assess centering, spread, and shape.
• Help to identify unusual patterns in data.

Histograms also have some limitations:

• Conclusions about the shape of the underlying distribution should not be drawn without a large enough data set (at least 75 randomly chosen data values; 100 are recommended).
• Individual data values are not shown.
• Improper bin sizes, as we will see on the following slide, can mask important data features.

[Figure: the same data plotted three times with different bin widths; vertical axes show relative frequency, horizontal axes span roughly 70 to 310.]

Histograms

9 bins of width 20:

18 bins of width 10: Too many bins? Do we see too much noise?

5 bins of width 40: Too few? Do we lose too much information?

Numerical Measures

• Graphical displays are often supplemented with numerical measures that summarize the information in the data.
• Measures of centering or location include:
• Mean
• Median
• Mode
• Measures of spread or variability include:
• Variance
• Standard deviation
• Range

In a study of pull-off force for bonded wires, given in foot-pounds, what can we conclude about the distribution of values?

What is a typical pull-off force?

How do the measurements vary about the center?

The sample mean or average is the most important measure of centering.

The sample mean, referred to as “X-bar” and written X̄, is the average of all n observations from a sample:

X̄ = (X₁ + X₂ + … + Xₙ) / n

The mean is the center of gravity or balancing point of a data set.

The sample mean is an estimate of the population mean, which is the average of all observations from a population.

The sample mean for the pull-off force data.

Notice that the mean is denoted by a fulcrum to emphasize that it is the balancing point of the distribution of values.

The sample median is the 50th percentile of the sample data. Half of the data values lie below the median and half lie above the median.

The median is the middle value when the data are ordered.

The sample median is denoted X₀.₅₀.

The sample median and sample mean are approximately the same if the distribution is symmetric.
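Python's standard statistics module can illustrate how skewness separates the mean and the median; the sample below is a made-up right-skewed data set:

```python
import statistics

# Hypothetical right-skewed sample: a few large values form a long right tail.
skewed = [2, 3, 3, 4, 4, 5, 6, 9, 15, 29]
print("mean:", statistics.mean(skewed))      # pulled toward the long right tail
print("median:", statistics.median(skewed))  # resistant to the extreme values
```

Here the mean (8) sits well above the median (4.5), the signature of a right-skewed distribution; for a symmetric sample the two would roughly coincide, as the slide states.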

The sample mode is the most frequently occurring value in a dataset.

The mode is of little interest in itself.

Terms such as unimodal, bimodal and multimodal are of interest:

• A unimodal distribution has one peak.
• A bimodal distribution has two peaks.
• A multimodal distribution has two or more peaks.

Multimodal distributions are often indications that more than one underlying population or process is represented in the data.

Example: Histogram of the Asphalt Content

[Figure: histograms of asphalt content for Batch 1, Batch 2, Batch 3, and Batch 4; vertical axes show relative frequency, horizontal axes span roughly 3 to 9.]

The two most important measures of the variability or spread of sample data are the sample variance and sample standard deviation.

Sample Variance, denoted by S²:

• S² is an estimate of the population variance, σ².
• S² is the “average” squared distance between the data points and the sample mean.

Sample Standard Deviation, denoted by S:

• S is an estimate of the population standard deviation, σ.
• S is the square root of the “average” squared distance between the data points and the sample mean.

The sample variance is the “average” squared distance between the data points and the sample mean:

S² = Σ(Xᵢ − X̄)² / (n − 1)

The sample standard deviation is the square root of the sample variance:

S = √S²

The divisor n − 1 (rather than n) is why “average” appears in quotes. Note that the sample standard deviation is expressed in the original measurement units.

Calculation of variance and standard deviation for the pull-off force data.
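A sketch of the n − 1 calculation, using hypothetical values in the same 12–14 ft-lb range (the original pull-off force data set is not reproduced in this transcript):

```python
import statistics

# Hypothetical pull-off force sample, in ft-lbs:
sample = [12.3, 12.8, 13.0, 13.1, 13.6]
n = len(sample)
xbar = statistics.mean(sample)

s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # sample variance, divisor n - 1
s = s2 ** 0.5                                        # sample standard deviation

print(round(s2, 3), round(s, 3))  # -> 0.223 0.472
# The library function, which also divides by n - 1, agrees with the hand calculation:
assert abs(s2 - statistics.variance(sample)) < 1e-12
```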

The sample range (R) is the difference between the largest observation and the smallest observation.

The sample range is the simplest measure of spread:

R = High value − Low value

This formula is often written as:

R = Xmax − Xmin

Example: For the pull-off force data,

R = Xmax − Xmin = 13.6 − 12.3 = 1.3 ft-lbs
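A quick check of the arithmetic, using the maximum and minimum quoted above (the middle value is a hypothetical filler):

```python
# Pull-off force values, in ft-lbs; 13.6 and 12.3 are the extremes from the example:
forces = [13.6, 12.9, 12.3]
r = max(forces) - min(forces)  # sample range R = Xmax - Xmin
print(round(r, 1))  # -> 1.3
```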

Parameters and Statistics

We will start by defining some basic statistical terms used throughout this course.

A population is a set of all possible observations or units of interest.

• Some populations are finite (all parts in inventory today).
• Others are conceptual (all parts that can be produced by a machine at given settings).

A sample is a set of observations drawn from a population.

A random sample is a representative sample drawn from the population. Such a sample must be selected in a random manner so that each member of the population has an equal probability of being selected.

Parameters and Statistics

The population mean is the theoretical (unknown) average of all population measurements. For continuous data, the population mean is denoted by the Greek letter μ (mu).

The population standard deviation is the theoretical (unknown) standard deviation of a population. For continuous data, it is denoted by the Greek letter σ (sigma).

The population variance is the theoretical variance of a population. For continuous data, it is denoted by σ² (sigma squared).

We virtually never know the true values of the population mean, standard deviation, or variance.

Parameters and Statistics

A parameter is a numerical value calculated from population data.

• The population mean, standard deviation, and variance are examples of parameters.
• Since we are virtually never able to compute parameters, they are theoretical quantities.
• Parameters are often represented by Greek letters: μ, σ, and σ² are examples of this convention.

A statistic is a numerical value calculated from sample data.

• Examples are the sample mean (X̄), the sample standard deviation (S), and the sample variance (S²).
• These statistics are used to estimate the corresponding population parameters.

Parameters and Statistics

| | Population (parameters) | Sample (statistics) |
| --- | --- | --- |
| Mean | μ | X̄ |
| Standard deviation, variance | σ, σ² | S, S² |
| Proportion | p | p̂ |
| | Unknown! | Can calculate! |

Parameters and Statistics

Other examples of parameters are:

• the median,
• the range, and
• any percentile of a population.

These are estimated by taking a random sample from a population, and calculating its median, range, and percentiles.

Of critical importance in estimating population parameters is the ability to draw a sample (often, but not always, a random sample) from the population.

In order to do this, the population of interest must be well-defined.

Distributions

The word distribution is used to describe the pattern formed by measurements.

For example, we discuss the distribution of cycle times or of errors.

In the case of a population of measurements, we talk about the theoretical distribution.

For example, cycle times from a sample might have the distribution given by the histogram on the next slide.

Keep in mind that the theoretical distribution is, in general, unknown.

It is something that we try to estimate.

[Figure: histogram of cycle times with an overlaid distribution curve; vertical axis shows relative frequency, horizontal axis spans roughly 100 to 250.]

Distributions

The histogram might give cycle times for a sample of 100 orders.

The continuous curve that is overlayed on the histogram might represent the distribution of cycle times in the underlying population.

Distributions

Theoretical distributions are useful for two main reasons:

• Modeling data, and
• Providing a “yardstick” for sample statistics.

With this in mind, we will introduce several useful theoretical distributions:

• The normal (or Gaussian) distribution, which is based on continuous data,
• The binomial distribution, which applies to two-category nominal data, and
• The Poisson distribution, which applies to counts of occurrences.
The Normal Distribution

The normal distribution is the basis for many of the statistical techniques that we cover throughout the course.

There are many other distributions that are used for modeling continuous data.

However, the normal distribution is useful in terms of sample statistics as well as for modeling data.

The normal distribution has the classic bell-shape.

There are infinitely many normal distributions, each defined by a value for the mean, μ, and one for the variance, σ².

If a quantity, call it X, has a normal distribution with mean μ and variance σ², we denote this by writing X ~ N(μ, σ²).

[Figure: two normal curves, N(15, 9) and N(25, 1).]

The Normal Distribution

There are infinitely many possible normal distributions defined by values of the population parameters μ and σ².

Below, we see examples of normal curves with different means and variances.

The Normal Distribution

Example: The distributions of characteristics of manufactured product are often normal.

Suppose that certain bonded wires have pull-off force measurements, X, that are normally distributed with mean 10 and variance 4. The distribution of X is denoted by X ~ N(10, 4). Note that σ = 2.

The shaded area shows P(X>13), where X represents the pull-off force.
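The tail probability P(X > 13) can be computed with the NormalDist class from Python's standard library:

```python
from statistics import NormalDist

# X ~ N(10, 4): mean 10, variance 4, hence standard deviation 2.
X = NormalDist(mu=10, sigma=2)

p = 1 - X.cdf(13)  # P(X > 13), the shaded right-tail area
print(round(p, 4))  # -> 0.0668
```

So roughly 6.7% of wires would have a pull-off force above 13 under this model.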

The Binomial Distribution

Another distribution that is extremely prevalent is called the binomial distribution.

Binomial data are data that result from a series of trials, where each trial results in only one of two possible values, pass or fail, success or failure, yes or no, etc.

To have a binomial distribution, three conditions must be met:

• The number of trials, denoted by n, is fixed in advance;
• The probability of obtaining a success (which is denoted by “p”) must be constant from trial to trial;
• The trials are independent (obtaining a success on one trial must not affect the likelihood of obtaining a success on another trial).
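The three conditions can be illustrated with a short simulation; n and p below are arbitrary choices, not values from the text:

```python
import random

random.seed(1)  # reproducible illustration

n, p = 100, 0.12  # fixed number of trials, constant success probability

# Each trial is independent; the binomial variable is the total number of successes:
successes = sum(random.random() < p for _ in range(n))
print(successes)  # some count between 0 and 100, typically near n * p = 12
```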
The Binomial Distribution

A binomial variable is the total number of successes in n trials where the previous conditions are satisfied.

The binomial distribution provides a model for many industrial situations.

The following are quantities that might well have binomial distributions:

• The number of parts produced with a particular type of defect;
• The number of orders for a given part that take in excess of 20 days to fill;
• The number of late deliveries of a certain type of shipment.
The Binomial Distribution

In the case of binomial data, we are usually interested in the proportion of successes in a series of trials (rather than the total number of successes).

Equivalently, we are interested in the probability p of a success.

For example, we may be interested in the proportion of parts with a defect, or the proportion of late deliveries.

The proportion of successes in a random sample drawn from a binomial distribution is denoted by p̂ (read “p-hat”).

The Binomial Distribution

Example:

Suppose that 100 records are randomly chosen from the data warehouse, and that 12 of these have a particular type of error.

Then an estimate of p, the proportion of records in the entire data warehouse that have this type of error, is given by p̂ = 12/100 = 0.12.

The Poisson Distribution

Another distribution of interest is called the Poisson distribution.

The Poisson distribution is used to model the number of occurrences of an event that is relatively rare in some unit of time or space.

The following might be modeled by Poisson distributions:

• The number of stacking marks per month’s production of cups;
• The number of customer returns of a given type of product, reported weekly;
• The number of OSHA recordable injuries per 100,000 man hours;
• The number of defects in a large casting.
The Poisson Distribution

The parameter of interest for a Poisson distribution is the average number of occurrences in the unit of time or space.

Examples include the population mean number of errors of a specific type entering the data warehouse daily, or the population mean number of defects in large castings.

This theoretical mean is denoted by “c”, for “count”.

A sample can be used to estimate c. The estimate is simply the average of the counts in the sample.

For example, if the numbers of errors for five randomly chosen days are 8, 5, 6, 4, and 7, then c is estimated by (8 + 5 + 6 + 4 + 7) / 5 = 6.
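The arithmetic of the estimate is just the sample average of the counts:

```python
# Error counts on the five randomly chosen days from the example:
counts = [8, 5, 6, 4, 7]
c_hat = sum(counts) / len(counts)  # estimate of the Poisson mean c
print(c_hat)  # -> 6.0
```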

The Empirical Rule
• The Empirical Rule provides an estimate of the proportion of data values falling within a certain distance of the mean.
• The Empirical Rule states that, if a frequency distribution is approximately symmetric and mounded in shape, then:
• Approximately 68% of all values will fall within one standard deviation of the mean.
• Approximately 95% will fall within two standard deviations of the mean.
• Nearly 100% will fall within 3 standard deviations of the mean.
• The Empirical Rule is derived from the probabilities associated with a normal random variable.
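The exact coverages behind the Empirical Rule can be checked against a standard normal distribution using Python's statistics module:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1
for k in (1, 2, 3):
    coverage = Z.cdf(k) - Z.cdf(-k)  # P(mean - k*sd < X < mean + k*sd)
    print(f"within {k} sd: {coverage:.2%}")
# within 1 sd: 68.27%, within 2 sd: 95.45%, within 3 sd: 99.73%
```

These exact values round to the approximate 68% / 95% / nearly-100% figures stated by the rule.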

[Figure: normal curve showing the exact coverages 68.27%, 95.45%, and 99.73% within 1, 2, and 3 standard deviations of the mean.]

The Empirical Rule

We repeat a graph shown earlier.