Understanding Central Tendency and Variance in Univariate Data Analysis

Lecture 3, Summer Semester 2009, BEA 140 • By Leon Jiang, University of Tasmania

Some more points on univariate data

Central tendency • Mean • Median • Mode

Variance • Population: $\sigma^2 = \left( \sum X_i^2 - (\sum X_i)^2 / N \right) / N$ • Sample: $s^2 = \left( \sum X_i^2 - (\sum X_i)^2 / n \right) / (n-1)$

Standard deviation • The standard deviation is the square root of the variance: $s = \sqrt{s^2} = \sqrt{\left( \sum X_i^2 - (\sum X_i)^2 / n \right) / (n-1)}$
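
The computational forms for the variance and standard deviation can be checked with a minimal Python sketch. The function names and the small data batch are assumptions made for illustration; they are not part of the lecture.

```python
import math

def population_variance(xs):
    """sigma^2 = (sum(X^2) - (sum X)^2 / N) / N"""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / n

def sample_variance(xs):
    """s^2 = (sum(X^2) - (sum X)^2 / n) / (n - 1)"""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Made-up data batch, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)   # divides by N
samp_var = sample_variance(data)      # divides by n - 1
samp_sd = math.sqrt(samp_var)         # standard deviation = sqrt(variance)
print(pop_var, samp_var, samp_sd)
```

The only difference between the two functions is the final divisor, which is why the sample variance is always slightly larger than the population variance for the same batch.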

The meaning of the standard deviation • “For most data batches, around two thirds (or 68%) of the data will fall within one standard deviation of the mean, and around 95% within two standard deviations of the mean.” • This is known as the empirical rule, or a rule of thumb.

MEASURING FROM GROUPED DATA

Measuring for grouped data • When no raw data are available and we have only a secondary source, we have to analyze that secondary data set, which has already been grouped for reporting purposes. • Grouped data differ from raw data in that the observations have already been placed into classes chosen by whoever reported them. • Because the individual values within each class are lost, measures computed from grouped data are approximations, so small errors relative to the raw-data values are to be expected.

Generally we use a frequency distribution table to show the grouping of data.

Class mark for frequency distribution of grouped data • The class mark, Xj, is a representative value for all observations located in class j. • A class mark is determined by the largest value and the smallest value in the class. • Xj = (largest value + smallest value) / 2 • Xj = (RUCL + RLCL) / 2 • where RUCL is the largest value and RLCL is the smallest value in the class.

Central tendency for grouped data • The mean of grouped data (g.d.) is defined as the weighted mean of the class marks, with class frequencies as weights, i.e. • $\bar{X} = \left( \sum f_j X_j \right) / n$ • For the example: $\bar{X} = 294 / 54 = 5.44$
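
The weighted-mean calculation can be sketched in Python as follows. The frequency table from the slide is not reproduced in this transcript, so the table below is HYPOTHETICAL: it was invented so that its totals match the quoted n = 54 and Σfj Xj = 294, and its individual classes should not be read as the lecture's data. The function names are also assumptions.

```python
def class_mark(lower, upper):
    """Class mark = (largest value + smallest value) / 2."""
    return (lower + upper) / 2

def grouped_mean(table):
    """Weighted mean of class marks, with class frequencies as weights."""
    n = sum(freq for _, _, freq in table)
    weighted_sum = sum(freq * class_mark(lo, hi) for lo, hi, freq in table)
    return weighted_sum / n

# HYPOTHETICAL frequency table: (lower limit, upper limit, frequency).
# Invented for illustration; only the totals (n = 54, sum f*X = 294)
# are taken from the lecture.
table = [
    (1, 3, 11),
    (3, 5, 19),
    (5, 7, 7),
    (7, 9, 10),
    (9, 11, 5),
    (11, 13, 2),
]

mean = grouped_mean(table)
print(round(mean, 2))  # 5.44
```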

Median for g.d. • Locating the median class: the class containing the median. • But how and where? • The total number of calls in the frequency distribution is 54 (an even number). • Therefore, according to the formula for the position of the median, (n + 1) / 2, the median is the 27.5th value, i.e. halfway between the 27th and 28th values. • The class containing the 27.5th value is the median class.

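Formula for the median (MD): • MD = LCL + class width * (how far into class) / (how many in class) • Here LCL is the lower limit of the median class, "how far into class" is the median position minus the cumulative frequency below the median class, and "how many in class" is the frequency of the median class. • For the call data: MD = 3.0 + 2 * (27.5 - 11) / 19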

Small errors are likely to exist most of the time • Median from raw data = 4.4 • Median from grouped data = 4.47

An example: MD = LCL + class width * (how far into class) / (how many in class)

MD = LCL + class width * (how far into class) / (how many in class) • MD = 100 + 10 * (8.5 - 3) / (9 - 3) • Median = 109.17
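
The interpolation in this example can be written as a small Python function. This is a sketch under stated assumptions: the parameter names are invented, n = 16 is inferred from the median position of 8.5, and the 9 and 3 in the slide's formula are read as cumulative frequencies up to and below the median class.

```python
def grouped_median(lcl, width, position, cum_before, class_freq):
    """MD = LCL + class width * (how far into class) / (how many in class)."""
    how_far_into_class = position - cum_before
    return lcl + width * how_far_into_class / class_freq

n = 16                   # assumed: gives the median position 8.5 used above
position = (n + 1) / 2   # 8.5th value

median = grouped_median(
    lcl=100,             # lower limit of the median class
    width=10,            # class width
    position=position,
    cum_before=3,        # observations below the median class
    class_freq=9 - 3,    # observations inside the median class
)
print(round(median, 2))  # 109.17
```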

Mode for g.d. • With grouped data, we tend to talk of a modal class, the class (or classes) with the highest frequency, rather than of the mode. • If asked for a mode with grouped data, the best we can do is to give the class mark of the modal class, as follows: • Modal class: 3 and under 5 (19 observations) • Mode: 4 (class mark of the modal class)

Dispersion (variance) for grouped data • The sample variance formula is: $s^2 = \left( \sum f_j X_j^2 - (\sum f_j X_j)^2 / n \right) / (n-1)$ • The population variance formula is: $\sigma^2 = \left( \sum f_j X_j^2 - (\sum f_j X_j)^2 / N \right) / N$ • Standard deviation: $s = \sqrt{s^2}$ or $\sigma = \sqrt{\sigma^2}$

Preparing a table to help work out S.d.

Working out the standard deviation for the example • $s^2 = \left( \sum f_j X_j^2 - (\sum f_j X_j)^2 / n \right) / (n-1)$ • Standard deviation: $s = \sqrt{s^2} = 14.14$ • Mean = 1770 / 16 = 110.625
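
The working table for the example is not reproduced in this transcript, so the Python sketch below applies the grouped-data formula to a deliberately small, made-up table instead. The class marks, frequencies, and function name are all assumptions for illustration, and the result is unrelated to the 14.14 quoted above.

```python
import math

def grouped_variance(marks, freqs, sample=True):
    """(sum f*X^2 - (sum f*X)^2 / n) / (n - 1), or / N for a population."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, marks))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, marks))
    divisor = n - 1 if sample else n
    return (sum_fx2 - sum_fx ** 2 / n) / divisor

# Made-up table: class marks and their frequencies.
marks = [10, 20, 30]
freqs = [2, 5, 3]

s2 = grouped_variance(marks, freqs)                     # sample variance
sigma2 = grouped_variance(marks, freqs, sample=False)   # population variance
s = math.sqrt(s2)
sigma = math.sqrt(sigma2)
print(s2, s, sigma2, sigma)
```

The two sums in the function correspond to the two extra columns (f·X and f·X²) that a working table would normally carry.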

Shape • Skewness relates to the symmetry of a distribution. • Positively skewed or right skewed: tail extends to the right, mean > median > mode • Negatively skewed or left skewed: tail extends to the left, mean < median < mode

Standard scores • The standard score expresses any observation in terms of the number of standard deviations it is from the mean. • t score (for a sample): $t = (X - \bar{X}) / s$ • z score (for a population): $z = (X - \mu) / \sigma$

Interpretation of standard score • Mean 5, standard deviation 2, for a sample • t score for 8 = (8 - 5) / 2 = 1.5 • Interpretation: the observation is 1.5 standard deviations above the sample mean.
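
The same calculation as a short Python sketch; the function name is an assumption made for illustration.

```python
def standard_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

score = standard_score(8, mean=5, sd=2)
print(score)  # 1.5

# A positive score lies above the mean, a negative score below it.
direction = "above" if score > 0 else "below"
print(f"{abs(score)} standard deviations {direction} the mean")
```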

Bivariate Variables: Summary measures

Bivariate variables • In the previous parts, we dealt throughout with a single numerical variable, such as the rate of return of mutual funds. • From this lecture on, we study two variables and the correlation between them.

Two numerical variables • A case: • In a call center, operators are trained to receive phone calls. However, the durations of calls differ significantly from one another. The shorter the duration of a call, the more efficient the operator is taken to be. • Suppose the call center manager wants to know whether the training hours the operators received are correlated with the duration of the phone calls they handled. • The data collected are as follows: • X: training hours • Y: call duration in minutes

The data look like this:
X (training hours): 6.5  7.5  6    8.5  5.5  3.5   8.5  8    8    7    8.5  9.5
Y (duration mins):  6.2  2.9  9.2  3.2  8.9  13.6  2.5  4.2  4.3  3.1  3.4  2.7
X (training hours): …………
Y (duration mins):  …………
In total there are 54 phone calls in the data set being studied. • What we want to find out is whether these two variables (X, training hours of operators; Y, duration of calls in minutes) show any real correlation. • Put simply, the call center manager wants to know whether operators who received more training hours handle calls in less time.

Setting up a scatter diagram for the data • A scatter diagram (scattergram) of two variables indicates the form, type and strength of the relationship. • Form: whether linear or non-linear • Type: direct (positive) or inverse (negative) • Strength: how closely the data follow the relationship, e.g. if linear, how close the ordered pairs are to a line describing it. This is indicated by a correlation measure.

(Pearson’s) Coefficient of Correlation • This is a summary measure, r, that describes the type and strength of the linear relationship shown in a scattergram. • r ranges from -1 to 1. • -1: perfect negative relationship, with all points exactly on a negatively sloping line • 0: no linear correlation • 1: perfect positive relationship, with all points exactly on a positively sloping line

Back to the case study • r (Pearson’s coefficient of correlation) = -0.9209 • This means X and Y have a very strong negative linear relationship. • In other words, the training hours the operators received show a strong negative relationship with the duration of the calls they handled.
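
The lecture does not show how r is computed, so the Python sketch below uses the standard computational formula for Pearson's r, which is an assumption about the method. It is applied only to the twelve (X, Y) pairs listed earlier; because the remaining observations are elided, the result should not be expected to reproduce -0.9209 exactly.

```python
import math

def pearson_r(xs, ys):
    """r = Sxy / sqrt(Sxx * Syy), using sums of squares and cross-products."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)

# The twelve pairs shown in the lecture (the rest of the data set is elided).
training_hours = [6.5, 7.5, 6, 8.5, 5.5, 3.5, 8.5, 8, 8, 7, 8.5, 9.5]
duration_mins = [6.2, 2.9, 9.2, 3.2, 8.9, 13.6, 2.5, 4.2, 4.3, 3.1, 3.4, 2.7]

r = pearson_r(training_hours, duration_mins)
print(round(r, 4))  # strongly negative for these twelve pairs
```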

In-depth analysis of this linear relationship: linear regression • The coefficient of correlation summarizes the form, type and strength of the relationship between two variables. • The motivation for regression is the desire to quantify the relationship, often in order to use knowledge of one variable to predict the other, • say, using one variable (X) to predict the other variable (Y).

The regression line is mathematically expressed by this equation • Yc = a + bX • Yc is the computed value of Y. • a is the sample regression constant, or Y-intercept. • b is the sample regression coefficient, or slope of the line.

Least squares method • This is a mathematical technique that determines the values of a and b that minimize the sum of squared differences between the actual values of Y and the predicted values of Y. Any other values for a and b result in a greater sum of squared differences. • Simply put, the least-squares method is used to find a line of best fit for two correlated variables.

Working out the linear regression • A residual is defined as the vertical distance between the actual value and the predicted value (the point on the line of best fit). • In least-squares regression, we find the values of a and b such that the sum of squares of the residuals is a minimum. • Actual pairs: (X1, Y1), (X2, Y2), … • Predicted (calculated) pairs: (X1, Yc1), (X2, Yc2), …

Back to the case study • We now know that training hours correlate with the duration of calls. That is to say, if we know the training hours an operator received, we can in some sense predict how many minutes, on average, he or she will take to handle a phone call. • In linear regression terms, we know X and, using the line fitted by the least squares method, we can calculate Yc.

Solutions for a & b • Two formulae, one each for b and a: • $b = \left( \sum XY - (\sum X)(\sum Y) / n \right) / \left( \sum X^2 - (\sum X)^2 / n \right)$ • $a = \bar{Y} - b\bar{X}$

Establishing a table to work out linear regression

Outcomes • b = -1.79595 • a = 18.40399 • Then Yc = 18.404 - 1.796X • This is the linear regression equation. • Interpretation: for each extra hour of training, there is an associated decrease of 1.796 minutes in call duration.
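
A minimal Python sketch of the least-squares calculation and of prediction with the fitted line. The standard least-squares solutions for a and b are assumed here, since the lecture's own formulae and working table are not reproduced in this transcript; the function names and the small demonstration data are likewise invented. The prediction uses the rounded coefficients quoted above.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Yc = a + bX that minimizes squared residuals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = sxy / sxx                     # slope
    a = sum_y / n - b * sum_x / n     # intercept = mean(Y) - b * mean(X)
    return a, b

def predict_duration(hours, a=18.404, b=-1.796):
    """Yc = a + bX with the rounded coefficients quoted in the lecture."""
    return a + b * hours

# Made-up points lying exactly on Y = 2 + 3X, to check the fitting function.
a_demo, b_demo = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
print(a_demo, b_demo)

# Predicted average call duration for an operator with 6 training hours.
predicted = predict_duration(6)
print(round(predicted, 3))  # 7.628
```

Predictions are most trustworthy for training hours inside the range actually observed; extrapolating far beyond it can give meaningless values such as negative durations.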

One consideration • Note: regression says nothing about causation, only about association. • This means X does not necessarily cause a change in Y. • That is, the training hours do not necessarily change the duration of calls; the two are merely correlated. • Think about it: does smoking cigarettes cause a shorter life expectancy? • A regression on its own cannot tell us.

The standard error of the estimate • The standard error of the estimate, Se, measures how well actual Y and computed Y are matched: the smaller Se, the better the match and the predictive accuracy. • $S_e = \sqrt{\sum (Y - Y_c)^2 / (n - 2)}$

Note! • The standard error of the estimate is very similar to the standard deviation. • The standard error measures scatter around the regression line for bivariate data, whilst the standard deviation measures scatter around the mean for univariate data.

Computational form for Se • You can use this computational form to find Se: • $S_e = \sqrt{\left( \sum Y^2 - a \sum Y - b \sum XY \right) / (n - 2)}$
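
The Python sketch below computes the standard error of the estimate in two ways, from the residuals and from a computational shortcut, and shows that they agree. Both formulas are the usual textbook versions with n - 2 in the denominator, which is an assumption because the lecture's own expressions are not reproduced in this transcript. The data and function names are invented for illustration.

```python
import math

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Yc = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = sxy / sxx
    return sum_y / n - b * sum_x / n, b

def standard_error(xs, ys):
    """Definitional form: sqrt(sum of squared residuals / (n - 2))."""
    a, b = least_squares(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

def standard_error_computational(xs, ys):
    """Shortcut: sqrt((sum Y^2 - a*sum Y - b*sum XY) / (n - 2))."""
    a, b = least_squares(xs, ys)
    sum_y = sum(ys)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (len(xs) - 2))

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

se_def = standard_error(xs, ys)
se_comp = standard_error_computational(xs, ys)
print(se_def, se_comp)
```

The shortcut is only valid when a and b are the least-squares values for the same data; with any other line, the two results differ.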

Coefficient of determination • Total variation = SST = $\sum (Y - \bar{Y})^2$ • Explained variation = SSR = $\sum (Y_c - \bar{Y})^2$ • Unexplained variation = SSE = $\sum (Y - Y_c)^2$ • Coefficient of determination = SSR / SST = $r^2$

Coefficient of determination • By calculation, the coefficient of determination turned out to be 0.848. • This means that about 85% of the total variation in call duration (around the average duration) is explained by the linear relation between duration and training hours.
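
A short Python sketch of the decomposition of variation. The usual definitions of SST, SSR, and SSE are assumed, and the data and function names are invented for illustration. The last lines cross-check the lecture's figures: squaring the quoted r = -0.9209 gives the quoted coefficient of determination.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Yc = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = sxy / sxx
    return sum_y / n - b * sum_x / n, b

def variation_breakdown(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through the data."""
    a, b = least_squares(xs, ys)
    mean_y = sum(ys) / len(ys)
    predicted = [a + b * x for x in xs]
    sst = sum((y - mean_y) ** 2 for y in ys)                 # total
    ssr = sum((yc - mean_y) ** 2 for yc in predicted)        # explained
    sse = sum((y - yc) ** 2 for y, yc in zip(ys, predicted)) # unexplained
    return sst, ssr, sse

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sst, ssr, sse = variation_breakdown(xs, ys)
r_squared = ssr / sst
print(sst, ssr, sse, r_squared)

# Cross-check with the lecture's case study.
r_lecture = -0.9209
print(round(r_lecture ** 2, 3))  # 0.848
```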

We just saw summary measures for dealing with two numerical variables. What about ordinal data?

Two ordinal variables • A scattergram can also be used to illustrate a possible relationship between two ordinal variables. • We often have ordinal variables in fields such as Marketing and Management, where people have been asked to rank some attribute. • An example could be a series of taste trials carried out during product development, such as the example below, where a panel was asked to rank soft drinks by “Refreshingness” and “Sweetness”.

Understanding this example • This example illustrates which one of the drinks is the most refreshing and which is the second most refreshing … • Likewise, which is the sweetest and which is the second sweetest …
