Descriptive Statistics

Descriptive Statistics • Summarization of a collection of data in a clear and understandable way • the most basic form of statistics • lays the foundation for all statistical knowledge

Inferential Statistics • Two main methods: • estimation • the sample statistic is used to estimate a population parameter • a confidence interval about the estimate is constructed. • hypothesis testing • a null hypothesis is put forward • Analysis of the data is then used to determine whether to reject it. • Inferential statistics generally require that sampling be random

TYPES OF DATA • Nominal: gender, type of customer (loyalty), flavor/color liked, etc. • Ordinal/Ranking: type of user, preferred brand, brand awareness, etc. • Interval: attitudinal or satisfaction scales. • Are you satisfied with your education at U of L? • Dissatisfied 1 2 3 4 5 Satisfied • Ratio: income, price willing to pay, age, etc.

Type of Measurement: Type of descriptive analysis
• Nominal, two categories: frequency table; proportion (percentage)
• Nominal, more than two categories: frequency table; category proportions (percentages); mode

Type of Measurement: Type of descriptive analysis
• Ordinal: rank order; median
• Interval: arithmetic mean
• Ratio: means

Frequency Tables • The arrangement of statistical data in a row-and-column format that exhibits the count of responses or observations for each category assigned to a variable • How many users of a certain brand can be called loyal? • What percentage of the market are heavy users and light users? • How many consumers are aware of a new product? • Which brand is “Top of Mind” in the market?
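A frequency table can be sketched in a few lines of Python. The responses below are hypothetical (they are not survey data from these slides), and `frequency_table` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# Hypothetical survey responses (assumption for illustration only):
# one usage category per respondent.
responses = ["heavy", "light", "light", "heavy", "light",
             "non-user", "light", "heavy", "non-user", "light"]

def frequency_table(values):
    """Return rows of (category, count, percentage), most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

table = frequency_table(responses)
for category, count, percent in table:
    print(f"{category:<10}{count:>5}{percent:>8.1f}%")
```

The first row of the table is also the mode, which is why the same table answers both the "what percentage" and the "top of mind" questions.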

WebSurveyor Bar Chart

Bar Graph

Measures of Central Location or Tendency • Mean: average value • Mode: the most frequent category • Median: the middle observation of the data

The Mean (average value) • sum of all the scores divided by the number of scores. • a good measure of central tendency for roughly symmetric distributions • can be misleading in skewed distributions since it can be greatly influenced by extreme scores, in which case other statistics such as the median may be more informative • formula: $\mu = \sum X / N$ (population) • $\bar{X} = \sum x_i / n$ (sample) • where $\mu$ / $\bar{X}$ is the population/sample mean • and $N$ / $n$ is the number of scores.

Mode • the most frequent category • e.g. users 25%, non-users 75%: the mode is "non-users" • Advantages: • meaning is obvious • the only measure of central tendency that can be used with nominal data. • Disadvantages: • many distributions have more than one mode, i.e. are "multimodal" • greatly subject to sample fluctuations • therefore not recommended as the only measure of central tendency.

Median • the middle observation of the data • Example: number of times per week consumers use mouthwash • 1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7 • [Figure: frequency distribution of mouthwash use per week, from light users to heavy users, with the mode, median and mean marked]
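As a quick check, the three measures of central tendency can be computed for the mouthwash data with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Times per week that each of the 27 consumers uses mouthwash (from the slide).
uses_per_week = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
                 4, 4, 4, 4, 4, 4, 4,
                 5, 5, 5, 5, 5, 6, 6, 6, 7, 7]

data_mean = mean(uses_per_week)      # sum of scores / number of scores
data_median = median(uses_per_week)  # middle (14th) observation
data_mode = mode(uses_per_week)      # most frequent value

print(data_mean, data_median, data_mode)
```

Because this distribution is symmetric, all three measures land on the same value.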

Normal Distributions • Curve is basically bell shaped, extending from -∞ to ∞ • symmetric, with scores more concentrated in the middle (i.e. around the mean) than in the tails. • Mean, median and mode coincide • Normal distributions differ in how spread out they are. • The area under each curve is 1. • The height of a normal distribution can be specified mathematically in terms of two parameters: the mean ($\mu$) and the standard deviation ($\sigma$).

Normal Distribution • [Figure: normal curve on an axis from -∞ to ∞, with points a and b marked] • Area between a and b = $P(a \le X \le b)$
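The area between a and b can be computed as the difference of two cumulative probabilities. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_between` and the example values (a standard normal, a = -1, b = 1) are assumptions chosen for illustration.

```python
from statistics import NormalDist

def area_between(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b) for X ~ Normal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) - dist.cdf(a)

# Illustrative values: about 68% of the area lies within one SD of the mean.
within_one_sd = area_between(-1.0, 1.0)
print(round(within_one_sd, 4))
```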

Normal Distributions with Different Means • [Figure: normal curves with different means on an axis from -∞ to ∞]

Skewed Distributions • Occur when one tail of the distribution is longer than the other. • Positive Skew Distributions • have a long tail in the positive direction. • sometimes called "skewed to the right" • more common than distributions with negative skews • E.g. distribution of income: most people make under $40,000 a year, but some make quite a bit more, with a small number making many millions of dollars per year • The positive tail therefore extends out quite a long way, whereas the negative tail stops at zero • Negative Skew Distributions • have a long tail in the negative direction. • called "skewed to the left."

Measures of Dispersion or Variability • Minimum, Maximum, and Range • Variance • Standard Deviation

2 = (x- xi)2/n ¯ Variance • The difference between an observed value and the mean is called the deviation from the mean • The variance is the mean squared deviation from the mean • i.e. you subtract each value from the mean, square each result and then take the average. • Because it is squared it can never be negative

Standard Deviation • $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$ • The standard deviation is the square root of the variance • Thus the standard deviation is expressed in the same units as the variable • Helps us to understand how clustered or spread out the distribution is around the mean value.

Measures of Dispersion • Suppose we are testing the new flavor of a fruit punch • Scale: Dislike 1 2 3 4 5 Like • Data: 3, 5, 3, 5, 3, 5 • $\bar{X} = 4$, $S^2 = 1$, $S = 1$ • where $S^2 = \sum (\bar{x} - x_i)^2 / n$ and $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$

Measures of Dispersion • Scale: Dislike 1 2 3 4 5 Like • Data: 5, 4, 5, 5, 5, 4 • $\bar{X} \approx 4.67$, $S^2 \approx 0.22$, $S \approx 0.47$ • where $S^2 = \sum (\bar{x} - x_i)^2 / n$ and $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$

Measures of Dispersion • Scale: Dislike 1 2 3 4 5 Like • Data: 1, 5, 1, 5, 1, 5 • $\bar{X} = 3$, $S^2 = 4$, $S = 2$ • where $S^2 = \sum (\bar{x} - x_i)^2 / n$ and $S = \sqrt{\sum (\bar{x} - x_i)^2 / n}$
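The three fruit-punch data sets can be compared with a short script. This sketch uses the population formulas (dividing by n, as in the slide formula) via `pvariance` and `pstdev` from the standard library; dividing by n-1 instead would give slightly larger values. The labels "A", "B" and "C" are illustrative names for the three slides' data.

```python
from statistics import fmean, pvariance, pstdev

# Ratings on the 1 (Dislike) to 5 (Like) scale, one list per slide.
samples = {
    "A": [3, 5, 3, 5, 3, 5],
    "B": [5, 4, 5, 5, 5, 4],
    "C": [1, 5, 1, 5, 1, 5],
}

results = {}
for name, ratings in samples.items():
    # (mean, variance with divisor n, standard deviation with divisor n)
    results[name] = (fmean(ratings), pvariance(ratings), pstdev(ratings))
    m, var, sd = results[name]
    print(f"{name}: mean={m:.2f} variance={var:.2f} sd={sd:.2f}")
```

Data sets A and C have nearly the same kind of split but very different spreads, while B is tightly clustered near the "Like" end of the scale.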

3 1 2 -    Normal Distributions with different SD

How does the Normal Distribution help to make decisions? • Suppose you are about to introduce new “Guacamole Doritos” to the market. • Need to determine: • Desired flavor intensity (How hot it should be) • Package size offered • Introduction price

What do you do in order to answer your questions? ASK THE CONSUMER • How? TAKE A SAMPLE • How can you be sure that what you conclude from the sample would be true for the whole population?

Suppose you conducted a research study • Took a random sample of n=100 subjects • They tasted the new “Guacamole Doritos” • They rated the flavor of the chip on the following 7-point scale: 1 = Too Mild, 4 = Perfect Flavor, 7 = Too Hot

Results show: $\bar{x}_1 = 2.3$ and $S_1 = 1.5$ • Can you conclude that on average the target population thought the flavor was mild? • Suppose you take a series of random samples of n=100 subjects: • $\bar{x}_2 = 3.7$ and $S_2 = 2$ • $\bar{x}_3 = 4.3$ and $S_3 = 0.5$ • $\bar{x}_4 = 2.8$ and $S_4 = .97$ • ... • $\bar{x}_{50} = 3.7$ and $S_{50} = 2$

The Sampling Distribution • $\bar{X} = (\sum X_i)/n$ • The means of all the samples will have their own distribution, called the sampling distribution of the mean • It is approximately a normal distribution when the samples are large • The sampling distribution of a proportion is binomial, which approximates a normal distribution in large samples (30+) • The mean of the sampling distribution of the mean equals the population parameter: $\mu_{\bar{X}} = \mu$

Sampling Distribution • The standard deviation of the sampling distribution is called the standard error of the mean (or proportion): $\sigma_{\bar{X}} = \sigma / \sqrt{n}$ • Often the population standard deviation is unknown and has to be estimated from the sample: $S = \sqrt{\sum (X_i - \bar{X})^2 / (n-1)}$ • The formula for the proportion is $\sigma_p = \sqrt{\pi(1-\pi)/n}$
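A minimal sketch of both standard-error formulas follows. The mean example plugs in the first Doritos sample (S = 1.5, n = 100) as the estimate of the population standard deviation; the proportion example (π = 0.25) is a hypothetical value chosen only for illustration, and the function names are not from any library.

```python
from math import sqrt

def standard_error_mean(sd, n):
    """Standard deviation of the sampling distribution of the mean: sd / sqrt(n)."""
    return sd / sqrt(n)

def standard_error_proportion(pi, n):
    """Standard deviation of the sampling distribution of a proportion."""
    return sqrt(pi * (1.0 - pi) / n)

# First Doritos sample: S = 1.5 estimated from n = 100 subjects.
se_mean = standard_error_mean(1.5, 100)

# Hypothetical proportion (assumption): 25% of a sample of 100.
se_prop = standard_error_proportion(0.25, 100)

print(se_mean, round(se_prop, 4))
```

Quadrupling the sample size halves the standard error, which is why larger samples give tighter estimates.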

[Figure: population distribution of the Doritos flavor ratings (X), with mean $\mu$, compared with the sampling distribution of the sample mean ($\bar{x}$), both on the 1 to 7 rating scale]

What relationship does the Population Distribution have to the Sample Distribution? • The Central Limit Theorem: Let $x_1, x_2, \ldots, x_n$ denote a random sample selected from a population having mean $\mu$ and variance $\sigma^2$. Let $\bar{X}$ denote the sample mean. If n is large, then $\bar{X}$ has approximately a Normal Distribution with mean $\mu$ and variance $\sigma^2/n$. • The Central Limit Theorem does not mean that the sample mean = population mean. • It means that you can attach a probability to that value and decide.
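The theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is taken to be exponential with mean 1 and variance 1 (a deliberately skewed, non-normal choice that does not come from the slides), and `sample_means` is an illustrative helper.

```python
import random
from statistics import fmean, pstdev

def sample_means(draw, n, repeats, seed=0):
    """Draw `repeats` random samples of size n and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [fmean([draw(rng) for _ in range(n)]) for _ in range(repeats)]

# Assumed population: exponential with mean 1 and variance 1 (skewed).
means = sample_means(lambda rng: rng.expovariate(1.0), n=100, repeats=2000)

center = fmean(means)   # should be close to the population mean, 1
spread = pstdev(means)  # should be close to sigma / sqrt(n) = 1 / 10

print(round(center, 3), round(spread, 3))
```

No individual sample mean equals the population mean exactly, but the sample means cluster around it with a spread close to σ/√n, which is what lets a probability be attached to any one observed mean.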

Interpretation • The process of making pertinent inferences and drawing conclusions • concerning the meaning and implications of a research investigation • You do not need to know the population distribution in order to make decisions. • In order to draw conclusions n must be “big enough.” • How big? It DEPENDS.

Univariate Statistics • Test of statistical significance • Hypothesis testing one variable at a time • Hypothesis • Unproven proposition • Supposition that tentatively explains certain facts or phenomena • Assumption about nature of the world

What is a Hypothesis Test? • It is used when we want to make inferences about a population. • Generally we have a particular theory, or hypothesis, about certain events like: • The average age of our regular customers • The average money spent per week on fast food restaurants • The percentage of unsatisfied customers of our store.

Basic Concepts • The hypothesis the researcher wants to test is called the alternative hypothesis H1. • The opposite of the alternative hypothesis is the null hypothesis H0 (the status quo: no difference between the sample and the population, or between samples). • The objective is to DISPROVE the null hypothesis. • The Significance Level is the critical probability used in choosing between the null hypothesis and the alternative hypothesis

General Procedure for Hypothesis Test • Formulate H1 and H0 • Select appropriate test • Choose level of significance • Calculate the test statistic • Determine the probability associated with the statistic. • Determine the critical value of the test statistic.

General Procedure for Hypothesis Test • a) Compare the probability with the level of significance, $\alpha$, or b) determine whether the test statistic falls in the rejection region beyond the critical value. • Reject or do not reject H0 • Draw a conclusion

1. Formulate H1 and H0 • Null hypothesis represents status quo. • Alternative hypothesis represents the desired result. • Example: One-Sample t-test • The manager of Pepperoni Pizza has developed a new baking method with lower costs and wishes to test it with some customers. He asked customers to rate the difference between both pizzas on a scale from -10 (old style) to +10 (new style)

1. Formulate H1 and H0 • As a manager you would like to observe a difference between both pizzas • Since the new baking method is cheaper, you would like the preference to be for it. • Null Hypothesis H0: $\mu = 0$ • Alternative H1: $\mu \ne 0$ (two-tail test) or H1: $\mu > 0$ (one-tail test)

2. Select Appropriate Test • The selection of a proper test depends on: • Scale of the data • categorical • interval • the statistic you seek to compare • proportions • means • the sampling distribution of such statistic • Normal Distribution • T Distribution • $\chi^2$ Distribution • Number of variables • Univariate • Bivariate • Multivariate • Type of question to be answered

3. Choose Level of Significance • Whenever we draw inferences about a population, there is a risk that an incorrect conclusion will be reached • The significance level states the probability of incorrectly rejecting H0. This error is commonly known as Type I error, and we denote the significance level as $\alpha$. • Significance Level selected is typically .05 or .01 • In our example the Type I error would be rejecting the null hypothesis that the pizzas are equal, when they really are perceived equal by the customers of the entire population.

3. Choose Level of Significance • We commit a Type II error when we incorrectly accept (fail to reject) a null hypothesis that is false. The probability of committing a Type II error is denoted by $\beta$. • In our example, the Type II error would be not rejecting the null hypothesis that the pizzas are equal, when they are perceived to be different by the customers of the entire population.

Type I and Type II Errors
• Null is true, accept null: correct (no error)
• Null is true, reject null: Type I error
• Null is false, accept null: Type II error
• Null is false, reject null: correct (no error)

Which is worse? • Both are serious, but traditionally Type I error has been considered more serious; that is why the objective of hypothesis testing is to reject H0 only when there is enough evidence against it. • Therefore, we choose $\alpha$ to be as small as possible without compromising $\beta$. • Increasing the sample size for a given $\alpha$ will decrease $\beta$
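The trade-off between the two error rates can be sketched numerically. The example below is a simplification, not the pizza t-test itself: it assumes a one-sided z-test of H0: μ = 0 against H1: μ > 0 with a known population standard deviation, and the effect size (0.5) and standard deviation (2.0) are hypothetical values.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, effect, sigma, n):
    """Beta for a one-sided z-test of H0: mu = 0 when the true mean is `effect`."""
    z = NormalDist()                  # standard normal
    critical = z.inv_cdf(1 - alpha)   # rejection threshold for the z statistic
    # Probability that the z statistic stays below the threshold under H1.
    return z.cdf(critical - effect * sqrt(n) / sigma)

# Hypothetical scenario (assumption): true mean 0.5, sigma 2.0, alpha fixed at .05.
betas = {n: type_ii_error(alpha=0.05, effect=0.5, sigma=2.0, n=n)
         for n in (25, 50, 100, 200)}

for n, beta in betas.items():
    print(f"n={n:>3}  beta={beta:.3f}")
```

With α held fixed, β shrinks as n grows, so a larger sample is the usual way to reduce Type II error without accepting more Type I error.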

4. Calculate the Test Statistic • Example: if we are testing whether the consumer perceives a difference between the pizzas, we would need a statistic for the mean • We know that $\bar{X} \sim N(\mu, \sigma^2/n)$ • where X is the perceived difference between the pizzas for a given population of size N, with mean $\mu$, and the variance $\sigma^2/n$ is estimated from the sample

If we suppose H0 true, then $\mu = 0$ and $\bar{X} \sim N(0, \sigma^2/n)$ • If we standardized $\bar{X}$, we would get $Z = (\bar{X} - 0)/(\sigma/\sqrt{n}) \sim N(0, 1)$ • Since we do not know the population value of $\sigma$, we would have to estimate it with the SD of the sample.

But the standardized statistic then no longer has a Normal distribution; it has a T distribution with n-1 degrees of freedom: $t = (\bar{X} - 0)/(s/\sqrt{n}) \sim T(n-1)$

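X = perceived difference between the pizzas • $\mu$ = real population mean, which equals zero if H0 is true • $\bar{x}$ = 3.5, observed sample mean • SD (s) = 2.1, observed sample standard deviation • n = 40 • $\alpha$ = .01 • $t = (3.5 - 0)/(2.1/\sqrt{40}) = 10.54$, to be compared with $T(39)$ • Critical value: $T_{.005}(39) \approx 2.708$ • Since 10.54 > 2.708, H0 is rejected at the .01 level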
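The pizza calculation can be reproduced with a few lines of Python. This is a sketch: `one_sample_t` is an illustrative helper, and the critical value is an assumption taken from a standard t table (two-tailed, α = .01, 39 degrees of freedom) because the Python standard library has no t distribution.

```python
from math import sqrt

def one_sample_t(sample_mean, sample_sd, n, mu0=0.0):
    """t statistic for H0: mu = mu0, with the SD estimated from the sample."""
    return (sample_mean - mu0) / (sample_sd / sqrt(n))

# Values from the pizza example.
t_stat = one_sample_t(sample_mean=3.5, sample_sd=2.1, n=40)

# Assumption: two-tailed critical value for alpha = .01 and 39 degrees of
# freedom, looked up in a t table rather than computed here.
T_CRITICAL = 2.708

reject_h0 = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), reject_h0)
```

The observed statistic lies far inside the rejection region, so the data are not consistent with customers perceiving no difference between the two pizzas.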
