
Business Statistics Autumn 2008

Chicago GSB

C. Alan Bester

About this Course
  • Below is a link to the course website. Please visit and bookmark this site NOW.

  • Please review the syllabus and Course FAQ.
  • Links to the data and many in class examples are embedded in these notes, and are also available by browsing the course website.

“Statistical Method”

(We’ll start here)

Formulate problem

Get some data

Visualize the data

Do some statistical calculations

Interpret results


Notes 1: Data: Plots and Summaries

1. Data

2. Looking at a Single Variable

2.1 Tables

2.2 Histograms

2.3 Dotplots

2.4 Time Series Plots

3. Summarizing a Single Numeric Variable

3.1 The Mean and Median

3.2 The Variance and Standard Deviation

3.3 The Empirical Rule

3.4 Percentiles, quartiles, and the IQR

4. Looking at Two Variables

4.1 Categorical variables: the Two-way table

4.2 Numeric variables: Scatter Plots

4.3 Relating Numeric and Categorical variables


5. Summarizing Bivariate Relations

5.1 In Tables

5.2 Covariance and Correlation

6. Linearly related variables

6.1 Linear functions

6.2 Mean and variance of a linear function

6.3 Linear combinations

6.4 Mean and variance of a linear combination

7. Linear Regression

8. Pivot Tables (Optional)

Note: As you’ve probably noticed, there are a lot of slides. That is partly because I like to restate ideas and limit the number of concepts on any single slide. You will find there are really only a handful of “big ideas” that we will develop throughout the quarter…



Here is some data (our sample):




(many more rows !!)

The data is from a large survey carried out by a marketing

research company in Britain. (Marketing data)

Each row corresponds to a household.

Each column corresponds to a different feature of the household.

The features are called variables.

The rows are called observations.


Most data sets come in this form.

A rectangular array.

Rows are observations.

Columns are variables.

Variables are the fundamental object in statistics.

They come in several types.


The variable labeled "age" is simply the age (in years)

of the responder.

This is a numeric variable.

This variable has units, and averages are interpretable.

In contrast, the variable "Reg" is the geographical region

of the household. Each "number" is really just a code

for a region:

A variable like Reg

is called categorical.

Think of:

numeric vs. categorical

quantitative vs. qualitative


Instead of using numbers we could have used

text strings in the data file.

[Two versions of the data are not reproduced here: one with numeric codes, one with text strings.]

But it is extremely common to use numeric codes.

Another example: Which Democratic candidate do you support?

1= Hillary Clinton, 2= John Edwards,

3= Barack Obama, 4= Bill Richardson


The variable soc is categorical.

It takes on codes 1-6, with meanings:

This is an ordered categorical variable.

You can't think of it as a numerical measure

but A < B < ... < E. (“A” is actually the lowest social grade)

Soc is ordered like age, but does not have units.

It does not really make sense to compute the difference or to average two soc measurements.

It does make sense to difference two ages.


That pretty much covers it.

Variables are either numeric, categorical, or

ordered categorical.

Of course a numeric variable is always ordered.

For numeric variables we also have:

A variable is discrete if you can list its possible values.

Otherwise it is called continuous.


For example, the amount of rainfall in the City of Chicago

this month is usually thought of as being continuous.

As a practical matter, any variable is discrete since

we put it in the computer. What it comes down to

is, if there are a lot of possible values, we think of it

as continuous. (This is not really that important now;

it will be later when we get to probability.)

For example, you might think of age as continuous

even though we measure it in years and can easily

list its possible values.

Number of children is more likely to be thought of as discrete.


Again, a good rule when working with a numeric variable is to keep in mind the units in which it is measured.

For example age has units years.

Percentages, which are numeric, don't have units.

But there are always units somewhere. For example, if we look at the percentage of income a household spends on entertainment, we are looking at one quantity measured in units of currency divided by another.


Here are the definitions of all the variables in the survey

data set:

age: age in years

sex: 1 means male, 2 means female

soc: we saw this

edu: education, terminal age of education

Reg: we saw this.


inc: income


Both edu and inc could have been numeric, but are broken down into ranges. They are thus ordered categorical.

This is extremely common; with income there are actually good reasons for doing this!


cola, restE, juice, cigs indicate use of a product


1 if you use it, 0 if you don't.

This is called a dummy variable.

1 indicates something "happened", 0 if not.

So, cigs=1 means you purchase cigarettes.

restE means "restaurants in the evening".

This is extremely common. Often in statistics we are interested in “does something happen?”.

Another example is approval ratings ( 1=approve ).

We will work with a lot of dummy variables this quarter.


A dummy variable can take on two values, 0 or 1.

We use dummy variables to indicate something,

1 if that something “happened”, 0 if it did not.

The rest of the variables in the marketing data

represent tv shows.

They are dummies: 1 if you watch, 0 if you don't.

antiq: antiques roadshow

news: bbc news

enders: east enders

friend: friends

simp: simpsons

foot: "football" (soccer)


Now we can see that there are three types of variables

in the data set.

(i) Demographics: age through income

(ii) Product category usage,

(iii) Media exposure (tv shows).

What is the point? Why collect this data?

We want to see how product usage relates

to demographics. What kind of people drink colas?

We want to see how the media relates to product usage

so that we can select the appropriate media to

advertise in. If friends viewers tend to drink colas,

that might be a good place to advertise your cola.


Important Note:

You can always take a numeric variable and

make it an ordered categorical variable by

using bins.

For example, instead of treating age as a numeric

variable it is common to break it into ranges.

for example:

0-20: a1

...

>70: a7


The simplest case is a dummy variable:

$d = 1$ if $x > a$, $d = 0$ otherwise,

where x is numeric and a is a chosen cutoff.

For example, you could define someone to be "old"

if older than 40 and "young" otherwise.

d=1 then means "old" and d=0 means "young".
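The binning idea above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the notes show just the first bin (0-20: a1) and the last bin (>70: a7), so the intermediate cut points, the function names `age_bin` and `old_dummy`, and the sample ages are all assumptions.

```python
# Hypothetical cut points: only the first and last bins appear in the notes;
# the edges in between are assumed for illustration.
UPPER_EDGES = [20, 30, 40, 50, 60, 70]

def age_bin(age):
    """Return the ordered category label a1..a7 for a numeric age."""
    for k, upper in enumerate(UPPER_EDGES, start=1):
        if age <= upper:
            return f"a{k}"
    return "a7"  # anything above 70

def old_dummy(age, cutoff=40):
    """Dummy variable: 1 if older than the cutoff ("old"), 0 otherwise ("young")."""
    return 1 if age > cutoff else 0

ages = [18, 25, 40, 41, 67, 83]          # made-up ages
bins = [age_bin(a) for a in ages]
dummies = [old_dummy(a) for a in ages]
print(bins)
print(dummies)
```

Note that someone aged exactly 40 is "young" here, because the rule is "older than 40".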


2. Looking at a Single Variable

The most interesting thing in statistics is understanding

how variables relate to each other.

"Friends watchers tend to drink colas".

"Smokers tend to get cancer".

But it is still very important to get a sense of what variables

are like on their own.

Note: We’ll use the term “distribution” informally to talk about what a variable looks like (what does a typical value look like, how spread out are its values, etc.) We will use the term more formally when we study probability.


2.1 Tables

To look at a categorical variable we use a table:

How to make this table

We simply count how many of each category we have.

Note: We have 1000 observations total, so the numbers in this table must add to 1000.
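Counting each category is easy to do by hand for a few rows and easy to automate for many. Here is a minimal Python sketch; the ten social grade codes are made up for illustration (the real survey has 1000 rows), and the course itself uses Excel rather than Python.

```python
from collections import Counter

# Made-up social grade codes (1-6) for ten households.
soc = [3, 4, 2, 3, 5, 6, 3, 4, 1, 4]

counts = Counter(soc)                 # category -> number of observations
for category in sorted(counts):
    print(category, counts[category])

# The counts must add up to the number of observations.
total = sum(counts.values())
print("total:", total)
```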


I like to graph the table. This table makes it easy to see how different social grades are represented.

Numbers at the bottom are categories. The height of each bar equals the number of observations in that category.


2.2 Histograms

We take a numeric variable, break it down into categories and then plot the table as on the previous slide.

Remember, the height of each bar = # of observations or “frequency” in that category.
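To make the "break it down into categories, then count" step concrete, here is a small Python sketch. The ages are made up, and the choice of 5-year bins that include their right endpoint is an assumption made for this example.

```python
import math
from collections import Counter

# Made-up ages; bins are 5 years wide and closed on the right.
ages = [23, 36, 40, 35, 38, 41, 52, 36, 29, 40]
WIDTH = 5

def bin_of(x, width=WIDTH):
    """Return the bin (lower, upper) with lower < x <= upper."""
    upper = math.ceil(x / width) * width
    return (upper - width, upper)

freq = Counter(bin_of(a) for a in ages)
for lower, upper in sorted(freq):
    # One text "bar" per bin; its length is the frequency in that bin.
    print(f"{lower:>3} < x <= {upper:<3} {'#' * freq[(lower, upper)]}")
```

Changing `WIDTH` changes how the same data looks, which is worth trying before trusting any single histogram.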




that is,

35 < x <= 40.


Time between arrivals at a bank, in minutes. (Bank data)

A histogram with a "heavy right tail" is called skewed right.

You can guess what skewed left is.


Source: Nicolas P. B. Bollen and Veronika K. Pool, “Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distributions”; original data from Center for International Securities and Derivatives Markets, University of Massachusetts


Here’s a histogram of monthly hedge fund returns from 1994 to 2005. Notice anything interesting?


Aside: Histograms can be displayed in different ways…

The observations here are starting players in the NFL (on offense). The numbers on the vertical axis correspond to rounds of the NFL draft, while the length of each blue bar is the percentage of starting players drafted at that position (forget the red bars). The plots on the right show only quarterbacks and fullbacks. (Source)

Don’t worry, all of our histograms will be like the previous two slides.

“Aside” or “Optional” on a slide means you are not responsible for the material on that slide on an exam!


2.3 Dotplots

It can be a hassle choosing the bins for a numeric variable.


For discrete variables and/or small data sets, we can

just put a dot on the number line for each value.

(Beer data)

nbeerm: the number of beers male MBA students claim

they can drink without getting drunk

nbeerf: same for females



[Figure: character dotplots of nbeerm and nbeerf on a common axis from 0.0 to 20.0]

We call a point like this an outlier.


Generally the males claim they can drink more,

their numbers are centered or located at larger values.

Note: The dot plot is giving you the same kind of information as the histogram.


2.4 Time Series Plots

The survey data is what we call cross-sectional.

The households in our survey are a (hopefully representative) cross section of all British households at a particular point in time.

In cross-sectional data, order doesn’t matter. We can sort our households by age, social, etc. and none of our results change as long as we keep each row intact.

Other examples would be samples where every

row corresponded to a firm, a plant, a machine...

With a time series, each observation corresponds to

a point in time.


Daily data on the Dow Jones index: (Dow data)




For time series data, the order of observations matters.

(1-May-00 comes before 2-May-00, etc.)

The easiest way to visualize time series data is often

simply to plot the series in time order.


We could have data at various frequencies:





The kinds of patterns you will uncover can be very different depending on the frequency of the data.

A current hot topic of research at the GSB is

"high frequency data".



US beer


Do you see

a pattern?

Would we see this pattern if we looked at annual data?


Time series plot of monthly returns on a portfolio

of Canadian assets: (Country Portfolio returns)

On the vertical axis we have returns.

On the horizontal axis we have “time”.

Do you see a pattern?


Here is the histogram of the Canadian returns.



(i) The histogram

does not depend

on the time order.

(ii) The appearance of the histogram depends on the number of bins. Too many bins makes the histogram appear “spiky”.


Taken from David Greenlaw, Jan Hatzius, Anil Kashyap, and Hyun Shin, US Monetary Policy Forum Report No. 2, 2008

Be careful. What pattern do you see in this series?

How about now?


From same paper as the previous slide.

Time series plots are also used to compare patterns across different variables over time, and sometimes to see the impact of past events (be very careful there, too).


3. Summarizing a Single Numeric Variable

We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations.

Two important features of any numeric variable are:

1) What is a typical or average value?

2) How spread out or ‘variable’ are the values?


The mean and median capture a typical value.

The variance/standard deviation capture the spread.

For example we saw that the men tend to claim

they can drink more.

How can we summarize this?


[Figure: character dotplots of nbeerm and nbeerf on a common axis from 0.0 to 20.0]


Monthly returns on Canadian and Japanese portfolios.

They seem to be centered roughly at the same place, but Japan has more spread.


How can we summarize this?


3.1 The Mean and Median

We will need some notation.

Suppose we have n observations on a numeric

variable which we call "x". We write them as x1, x2, ..., xn.

x1 is the first number.

xn is the last number. n is the number of numbers, or the “number of observations.” You may also hear it referred to as the “sample size.”

xi is the value of x associated with the ith observation (row).


Here, x is just a name for the set of numbers, we could

just as easily use y.

In a real data set we would use a meaningful name like "age".








Sometimes the order of the observations means something.

In our return data the first observation corresponds to the

first time period.

In the survey data, the order did not matter.


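The sample mean is just the average of the numbers “x”:

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

We often use the symbol $\bar{x}$ to denote the mean of the

numbers x.

We call it “x bar”.


Here is a more compact way to write the same thing…

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

We use a shorthand for the sum (it is just notation):

$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$

This is summation notation.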


Graphical interpretation of the sample mean

Here are the dot plots of the beer data for women and men.

Which group claims to be able to drink more?

[Figure: character dotplots of nbeerf and nbeerm on a common axis from 0.0 to 12.5]

In some sense, the men claim to drink more.

To summarize this we can compute the average value

for each group (men and women).

Note: I deleted the outlier; I do not believe him!


“On average women claim

they can drink 4.2 beers. Men

claim they can drink 7.9 beers”

Mean of nbeerf = 4.2222

Mean of nbeerm = 7.8625

How to calculate these means

In the picture, I think of the mean as the “center” of the data.

[Figure: character dotplots of nbeerf and nbeerm on a common axis from 0.0 to 12.5]




Let us compare the means of the Canadian and Japanese returns.


Mean of canada = 0.0090654

Mean of japan = 0.0023364

This is a big difference as a practical matter!

(Average monthly return of .90% versus .23%)

It was hard to see this difference in the histograms because

the difference is small compared to the variation.


More on summation notation (take this as an aside)

Let us look at summation in more detail.

$\sum_{i=1}^{n} x_i$ means that for each value of i, from 1 to n,

we add to the sum the value indicated,

in this case xi (add in this value for each i).


To understand how it works let us consider some examples.


Think of each row as an

observation on both x and y.

To make things concrete, think

of each row as corresponding to

a year and let x and y be annual

returns on two different assets.

x y year

0.07 0.11 1

0.06 0.05 2

0.04 0.09 3

0.03 0.03 4

In year 1 asset “x” had return 7%.

In year 4 asset “y” had return 3%.


$\bar{x} = \frac{1}{4}\sum_{i=1}^{4} x_i = .05$: compute x bar.

$\bar{y} = \frac{1}{4}\sum_{i=1}^{4} y_i = .07$: compute y bar.

(here, we do not sum

over all observations: we sum only over the second and the third observation).


For each value of i, we can add in anything we want:

$\sum_{i=1}^{4}(x_i-\bar{x})(y_i-\bar{y})$

= (.02)*(.04) + (.01)*(-.02) + (-.01)*(.02) + (-.02)*(-.04) = .0012

How to do these calculations using Excel
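The summation examples can also be checked in Python; this is just a sketch alongside the Excel route the course uses. The data is the four-row table of annual returns; the variable names and the choice to apply the "second and third observation" sum to x are assumptions for illustration.

```python
# Annual returns on the two assets from the four-row table.
x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]
n = len(x)

x_bar = sum(x) / n                      # sum of x_i for i = 1..n, divided by n
y_bar = sum(y) / n

# Sum only over the second and third observations (Python counts from 0),
# applied to x here purely as an illustration.
partial = sum(x[i] for i in (1, 2))

# "Add in anything we want": the product of the two deviations for each i.
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

print(round(x_bar, 4), round(y_bar, 4), round(partial, 4), round(cross, 4))
```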


The median

After ordering the data, the median is the

middle value of the data. If there is an even

number of data points, the median is the

average of the two middle values.


1,2,3,4,5 Median = 3

1,1,2,3,4,5 Median = (2+3)/2 =2.5


Mean versus median

Although both the mean and the median are good

measures of the center of a distribution of measurements,

the median is less sensitive to extreme values.

The median is not affected by extreme values since

only the middle value(s) of the ordered data are

used in its computation.


1,2,3,4,5 Mean: 3 Median: 3

1,2,3,4,100 Mean: 22 Median: 3
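These small examples are easy to reproduce with Python's standard `statistics` module; a quick sketch (the variable names are mine):

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]      # same data with one extreme value
c = [1, 1, 2, 3, 4, 5]     # even number of points

print(mean(a), median(a))
print(mean(b), median(b))  # the mean moves a lot, the median does not
print(median(c))           # average of the two middle values
```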


For the bank interarrival data:

If data is right skewed the mean will be bigger

than the median. You can think of this as the extreme

“right tail” observations pulling the mean upward.


Median or Mean?

At the GSB professors are rated by students from 1-5 in

several categories. In the past only the mean rating was reported.


Some faculty members believe the median should

be reported instead. This was actually a major debate at

a faculty meeting a few years ago.

What difference would this make?

In fact, the GSB now reports the mean and median, along with a histogram of all the ratings!


The Mean of a Dummy Variable

Consider the "simpson" variable in the survey data set.

Does it make sense to take the mean?

The sum of the 1's and

0's will equal the number

of respondents who watch

the simpsons.

So the mean is the fraction

of respondents who watch.


So, in general, the average of a dummy

gives the fraction (percentage) of times that whatever dummy=1

signals happens.

Another example, if a poll is conducted about a particular candidate where

1=approval, 0=disapproval

then the sample mean is the candidate’s approval rating.

This may seem obvious, but we will get a lot of use out

of this idea throughout the quarter.
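A tiny Python sketch of the same idea, using ten made-up responses rather than the survey data:

```python
# Made-up dummy variable: 1 if the respondent watches the show, 0 if not.
watches = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

n_watchers = sum(watches)               # the sum counts the 1's
fraction = n_watchers / len(watches)    # the mean of the dummy

print(n_watchers, fraction)
```

The mean of the dummy is the fraction of respondents who watch; multiply by 100 to read it as a percentage.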


3.2 The Variance and Standard Deviation

The mean and the median give us information

about the central tendency of a set of observations, but they shed no light on the dispersion, or spread of the data.

Example: Which data set is more variable ?

5,5,5,5,5 Mean: 5

1,3,5,8,8 Mean: 5

If these were portfolio returns (in percent), means are average returns. What else might we want to measure?


The Sample Variance

[Figure: dotplots of x and y on a common axis from 0.030 to 0.105]

The y numbers are more spread out than the x numbers.

We want a numerical measure of variation or spread.

The basic idea is to view variability in terms of distance

between each measurement and the mean.


[Figure: the same dotplots of x and y, axis from 0.030 to 0.105]

Overall, the distances from the mean for x are smaller than those for y.


We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are

(i.e., just one number like in the case of the mean).

Also, we cannot just sum the individual distances because the negative distances cancel out with the positive ones giving zero always (Why?).

The average squared distance would be

$\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$


So, the sample variance of the x data is defined to be:

Sample variance: $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

We use n-1 instead of n for technical reasons that will

be discussed later (and because Excel does it this way).

Think of it as the average squared distance of

the observations from the mean.



1) What is the smallest value a variance can be?

2) What are the units of the variance?

It is helpful to have a measure of spread which

is in the original units. The sample variance is not in the

original units. We now introduce a measure of dispersion

that solves this problem: the sample standard deviation


The sample standard deviation

It is defined as the square root of the sample variance (easy).

The sample standard deviation: $s_x = \sqrt{s_x^2}$

The units of the standard deviation are the same

as those of the original data.


Example 1 (numerical)

Assume as before: $y_i - \bar{y}$ = .04, -.02, .02, -.04

$x_i - \bar{x}$ = .02, .01, -.01, -.02


The sample

standard deviation

for the y data

is bigger than

that for the x data.

This numerically

captures the

fact that y has

“more variation”

about its mean

than x.
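Here is a short Python sketch of the same calculation, written out by hand so the n-1 divisor is visible. The x and y values are the four annual returns used earlier; the function names are mine.

```python
import math

x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

print(sample_variance(x), sample_sd(x))
print(sample_variance(y), sample_sd(y))
```

In this example every y deviation is twice as large (in absolute value) as the matching x deviation, so the standard deviation of y comes out at twice that of x.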


Example 2 (graphical)

The standard deviations

measure the fact that there

is more spread in the Japanese returns.

[Figure: character dotplots of the canada and japan returns on a common axis from -0.160 to 0.240]

Variable N Mean StDev

canada 107 0.00907 0.03833

japan 107 0.00234 0.07368


3.3 The Empirical Rule

We now have two numerical summaries for the data:

$\bar{x}$: where the data is

$s_x$: how spread out, how variable the data is

The mean is pretty easy to interpret (some sort of “center” of the data).

We know that the bigger sx is, the more variable the data is, but how do we really interpret this number?

What is a big sx, what is a small one ?


The empirical rule will help us understand sx and

relate the numerical summaries back to our plots.

Empirical Rule

For “mound shaped data”:

Approximately 68% of the data is in the interval $(\bar{x} - s_x,\ \bar{x} + s_x)$

Approximately 95% of the data is in the interval $(\bar{x} - 2s_x,\ \bar{x} + 2s_x)$
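One way to get a feel for the rule is to try it on simulated mound-shaped data. The sketch below draws normal random numbers in Python; the simulated values, the seed, and the mean and spread chosen are assumptions for illustration and have nothing to do with the course data sets.

```python
import random
import statistics

random.seed(0)
# Simulated "mound shaped" data (normal draws); not data from the course.
data = [random.gauss(0.01, 0.04) for _ in range(10_000)]

x_bar = statistics.mean(data)
s_x = statistics.stdev(data)            # sample standard deviation (n - 1)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    inside = [v for v in data if x_bar - k * s_x <= v <= x_bar + k * s_x]
    return len(inside) / len(data)

within_1 = share_within(1)
within_2 = share_within(2)
print(round(within_1, 3), round(within_2, 3))
```

The two shares should land close to 68% and 95%. Trying the same function on clearly skewed data is a good way to see where the rule stops working.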


We can see this on a histogram of the Canadian returns

The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.

Looks reasonable.





Same thing

viewed from

the perspective

of the time

series plot.

n=108, so

5% outside

would be about

5 points.

There are 4 points

outside, which is

pretty close.


A little finance: comparing mutual funds

Let us use the means and standard deviations to compare mutual funds.

For 9 different assets we compute the means and standard deviations.

Later, we plot the means versus the standard deviations.

The assets are:


Variable N Mean StDev

drefus 180 0.00677 0.04724

fidel 180 0.00470 0.05659

keystne 180 0.00654 0.08424

Putnminc 180 0.00552 0.03008

scudinc 180 0.00443 0.03597

windsor 180 0.01002 0.04864

eqmrkt 180 0.01082 0.06856

valmrkt 180 0.00681 0.04800

tbill 180 0.00598 0.00252

The speculative fund (keystne) has a higher mean and

standard deviation than the income fund (Putnminc).

Later we’ll see how to look at this information graphically.


3.4 Percentiles, quartiles, and the IQR

Again, this just applies to numeric variables.

The 10% percentile is the number such that 10% of

the values are less than it and 90% are bigger.

The median is the 50% percentile.

Percentiles are also known as quantiles.


For the age variable in the survey data:

5% of the 1000 values

are less than 25.


The first, second,

and third quartiles are the

25%, 50%, and 75% percentiles.

The interquartile range

is the difference between

the third and first quartile.

The interquartile range

is used as a measure

of spread.
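Quartiles and the interquartile range can be computed with Python's standard `statistics.quantiles`. The ages below are made up, not the survey data, and different software packages interpolate between data points in slightly different ways, so other tools may give slightly different quartiles for the same numbers.

```python
import statistics

# Made-up ages, not the survey data.
ages = [22, 25, 29, 33, 35, 41, 47, 52, 58, 63, 70]

# With n=4, quantiles() returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1                       # interquartile range

print(q1, q2, q3, iqr)
```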


first quartile = 35 years

We can interpret quantiles graphically on the histogram.

25% of the area of the colored bars is to the left of the first quartile.


The empirical rule is actually a statement about quantiles.

What does it say? For a variable with a “mound shaped” histogram…

What quantile is two standard deviations below the mean?

What quantile is one standard deviation above?



To see this yourself, draw the picture! We’ll learn later that the empirical rule is based on a very important probability model.


Source: Murphy, Kevin and Finis Welch, “Wage Differentials in the 1990s: Is the Glass Half-full or Half-empty?”

Aside: We won’t use percentiles much in this class, but above is an interesting time series plot of the 90th (top line), median (middle line), and 10th percentiles of real wages in the U.S. from the late 1960s to late 1990s. This widening income gap is a major concern for economists… or is it?


4. Looking at Two Variables

While it is important to look at variables one

at a time, many interesting business problems

concern how two (or more) variables are related

to each other.


4.1 Categorical variables: the Two-way Table

Let’s look at the relationship between two categorical variables, x and y.

If x has two categories and y has two as well,

then there are four categories using both x and y.

We can then just count the number of observations in each category.

If x has r1 categories and y has r2, then we have r1*r2

possibilities. We can arrange these possibilities in

a two-way table.


This is the two way table relating viewership of the simpsons

with cola use.

146 of the 1000 view simpsons and consume colas.

Raw counts:

Percent of total:

Percent of column:

Percent of row:

How to make these tables
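The counting behind a two-way table can be sketched in Python as follows. The eight responses are made up to keep the example small (they are not the survey counts), and the course itself builds these tables in Excel.

```python
from collections import Counter

# Made-up dummies for eight respondents (1 = yes, 0 = no); not the survey data.
simpsons = [1, 1, 0, 0, 1, 0, 0, 0]
cola     = [1, 1, 1, 0, 0, 0, 1, 0]

counts = Counter(zip(simpsons, cola))   # (simpsons, cola) -> raw count
n = len(simpsons)

for s in (0, 1):
    row_total = counts[(s, 0)] + counts[(s, 1)]
    for c in (0, 1):
        cell = counts[(s, c)]
        print(f"simpsons={s} cola={c}: count={cell}, "
              f"pct of total={cell / n:.1%}, pct of row={cell / row_total:.1%}")
```

Percent of column works the same way, dividing each cell by its column total instead of its row total.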


A picture of the table:



A much higher fraction of the simpsons viewers

consumes colas.


How does social grade relate to cigarette use?

Now one variable has 2 categories and the other has 6

so combined there are 12.

Row percentages:

The highest cigarette use is in the two lower

social grades.


4.2 Numeric variables: Scatter Plots

For two numeric variables we have the scatter plot.


nbeer  weight
12.0   192
12.0   160
5.0    155
5.0    120
7.0    150
13.0   175
4.0    100
12.0   165
12.0   165
12.0   150
...    ...

Each row is an observation

corresponding to a person.

Each person has two numbers

associated with him/her,

# beers and weight.

How are they related?


Is the number of beers you can drink

related to your weight?



You can think of a scatterplot as a ‘2D dot plot’. Each point corresponds to an observation: weight determines the position on the horizontal axis, number of beers on the vertical.

Notice our outlier is back (circled)... and is he really an outlier?!



Are returns on a mutual fund related to market returns?

Each point corresponds to a month.

Like the histogram, scatterplots can also be used with time series data, and the resulting plot does not depend on the time ordering.



Here’s another example of an “outlier”. This data is from a poker website that went through a major cheating scandal.

(This was also question #1 on last year’s midterm!)



A similar scandal surfaced recently. Is the evidence as compelling?


In finance we often use a different type of 2-D plot to compare asset returns. Here each point is a mutual fund. The horizontal and vertical location of each point reflects the sample standard deviation and sample mean of its returns within the same sample period.





If you’re a fund manager, where do you want to be on this plot?
Let us compare some countries, with returns from ‘88 to ‘96. (Country returns data)


4.3 Relating a Numeric to a Categorical variable

How do you plot a numeric variable vs a

categorical variable?

This is not so obvious.

An easy thing to do is make the numeric variable

categorical by binning it, like we did when making a histogram.


Cigarette usage and age:

Quick — what is the relationship between

age and cigarette usage?

Plots are a great way to identify patterns, but careful…

How strong is the evidence?


5. Summarizing Bivariate Relations

Can we numerically summarize the

strength of a bivariate relationship?

For categorical variables, I don't think there

is a generally accepted approach.

For two numeric variables, we introduce two summary statistics called covariance and correlation.


5.1 In Tables

There does not seem to be a standard way to

summarize the strength of the relationship in a table.

Sometimes I use the difference between a marginal

proportion and a "conditional" proportion.

In this case it would be: |.578 - .8066| =.2286

The difference between the percent of cola drinkers

and percent of simpsons viewers that are cola drinkers.


5.2 Covariance and Correlation

In the beer data (beers vs weight) and mutual fund data (windsor vs valmrkt), it looks like there is a relationship.

Even more, the relationship looks linear in that it looks like

we could draw a line through the plot to capture the pattern.

Covariance and correlation summarize how strong a

linear relationship there is between two variables.

In our first example weight and # beers were two variables.

In our second example our two variables were two kinds of returns.


In general, we think of the two variables as x and y.


The sample covariance between x and y:

$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

The sample correlation between x and y:

$r = \frac{s_{xy}}{s_x s_y}$

So, the correlation is just the covariance divided by

the two standard deviations. What are the units?


We will get some intuition about these formulae, but first

let us see them in action. How do they summarize data

for us? Let us start with the correlation.

Correlation, the facts of life:

The closer r is to 1 the stronger the linear

relationship is with a positive slope.

When one goes up, the other tends to go up.

The closer r is to -1 the stronger the linear

relationship is with a negative slope.

When one goes up, the other tends to go down.


The correlations corresponding to the two scatter plots

we looked at are:

Correlation of valmrkt and windsor = 0.923

Correlation of nbeer and weight = 0.692

The larger correlation between valmrkt and windsor

indicates that the linear relationship is stronger.

Let us look at some more examples.


Correlation of

y1 and x1 = 0.019

Correlation of

y2 and x2 = 0.995


Correlation of

y3 and x3 = 0.586

Correlation of

y4 and x4 = -0.982


Correlation of y5 and x5 = 0.210

IMPORTANT: Correlation only measures linear relationships (here the value is small but there is a strong nonlinear relationship between y5 and x5.)
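A small made-up example makes this point sharply. In the Python sketch below, y is determined exactly by x (y is x squared), yet the sample correlation is zero because the relationship is not linear. The data and the helper function `correlation` are assumptions for illustration, not the y5 and x5 series from the slide.

```python
import math

def correlation(x, y):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]        # y is determined exactly by x, but not linearly

r = correlation(x, y)
print(round(r, 3))
```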


Example: The country data

Which countries go up and down together?

I have data on 23 countries.

That would be a lot of plots!


To summarize, we can compute all pairwise correlations.

(StatPro makes this table for you automatically!)

          australi  belgium   canada  finalnd   france  germany  honkong    italy
belgium      0.189
canada       0.507    0.357
finalnd      0.387    0.183    0.386
france       0.275    0.734    0.342    0.176
germany      0.226    0.691    0.302    0.304    0.709
honkong      0.334    0.301    0.558    0.355    0.359    0.339
italy        0.159    0.367    0.334    0.389    0.352    0.465    0.261
japan        0.251    0.418    0.271    0.307    0.421    0.318    0.219    0.426
usa          0.360    0.429    0.651    0.264    0.501    0.372    0.429    0.240
singapor     0.409    0.355    0.478    0.391    0.408    0.467    0.647    0.416

             japan      usa
usa          0.246
singapor     0.407    0.473

Why is this blank?

Make this table in StatPro

StatPro will also make a table of covariances with

variances on the diagonal.


Understanding the covariance and correlation formulae

How do these weird looking formulae for covariance and

correlation capture the relationship?

To get a feeling for this, let us go back to the simple example

and compute covariance and correlation

x y

0.07 0.11

0.06 0.05

0.04 0.09

0.03 0.03


First, let us compute the covariance

(which is a necessary ingredient to compute the correlation):

$s_{xy} = \frac{(.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04)}{4-1} = \frac{.0012}{3} = .0004$

Each of the 4 points makes a contribution to the sum.

Let us see which point does what.






Points in (I) have both x and y bigger than their means so we get a positive

contribution to the covariance.

Points in (III) have both x and y less than their means so we get a positive

contribution to the covariance.

In (II) and (IV) one of x and y is less than its mean and the other is greater

so we get a negative contribution.

The further out the point is, the bigger the contribution.


[Figure: scatter plot with lots of positive contributions in (I) and (III), and just a few relatively small negative contributions in (II) and (IV).]



A positive covariance means that when a variable

is above its average the other one tends to be above as well.

They move up and down together.

A negative covariance means that when one is up

the other tends to be down.

They move in opposite directions.

A small covariance means that their movements are

almost (linearly) unrelated.

Now let’s compute the correlation…


We just finish the example.




The division by the standard deviations standardizes

the covariance so that the correlation is always between

-1 and +1.
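To finish the simple example numerically, here is a small sketch that divides the covariance by the two standard deviations. The helper name `sample_cov` is hypothetical, and the n-1 divisor is assumed throughout.

```python
import math

x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]

def sample_cov(a, b):
    """Sample covariance with the n - 1 divisor (sample_cov(a, a) is the variance)."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)

s_x = math.sqrt(sample_cov(x, x))
s_y = math.sqrt(sample_cov(y, y))

# Correlation = covariance divided by the product of the standard deviations.
r = sample_cov(x, y) / (s_x * s_y)
print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r = {r:.2f}")
```

For these four points the correlation works out to 0.6.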


The sign of the correlation contains the same information

as the sign of the covariance (in fact, they have the same

sign because the standard deviations are always positive).

Positive sign: positive relationship

Negative sign: negative relationship

The correlation is more informative, though, because it is

unit-less (always between –1 and 1), by construction.

Hence, it is a more easily interpretable measure of the

strength of the relationship.

Close to 1: strong positive relationship

Close to -1: strong negative relationship


6 Linearly Related Variables

We have studied data sets that display some kind of relation between variables (the mutual fund returns and the market returns, for instance).

Sometimes there is an exact linear relation between variables:

y = c0 + c1 x

In this linear relationship, c0 is called the intercept.

c1 is called the slope.

Suppose we had started with x and we already knew its sample mean and variance.

Can we figure out the sample mean and variance of the “new” variable, y?


6.1 Linear functions


Suppose we have a sample of temperatures in Celsius

and we convert them to Fahrenheit.

cel fahr

10 50

15 59

20 68

25 77

40 104

30 86

50 122

70 158

How are the cel values related

to the fahr values?

fahr = 32 + (9/5) * cel

Note that \overline{cel} = 32.5 and s_{cel} = 20.

We could find \overline{fahr} and s_{fahr}

using a spreadsheet.


Note: if we make a scatter plot of

fahr versus cel, what do we see ?

Correlation of cel and fahr = 1.000
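As a stand-in for the spreadsheet, the same summaries can be sketched in Python. The standard-library `statistics` module is used for the mean and sample standard deviation; the correlation is computed by hand with the n-1 divisor.

```python
import statistics

cel = [10, 15, 20, 25, 40, 30, 50, 70]
fahr = [32 + (9 / 5) * c for c in cel]

mean_c, sd_c = statistics.mean(cel), statistics.stdev(cel)
mean_f, sd_f = statistics.mean(fahr), statistics.stdev(fahr)

# Sample covariance, then correlation = covariance / (s_cel * s_fahr).
n = len(cel)
cov_cf = sum((c - mean_c) * (f - mean_f) for c, f in zip(cel, fahr)) / (n - 1)
r = cov_cf / (sd_c * sd_f)

print(f"cel : mean={mean_c:.2f} stdev={sd_c:.2f}")
print(f"fahr: mean={mean_f:.2f} stdev={sd_f:.2f}")
print(f"correlation = {r:.3f}")
```

Because fahr is an exact linear function of cel with a positive slope, the points fall on a straight line and the correlation is 1.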


In general, we like to use the symbols y and x

for the two variables

The variable y is a linear function of the variable x if:

y = c0 + c1 x

We think of the c’s as constants

(fixed numbers) while x and y vary.



Suppose your client is a movie star. She has a deal which pays her a $10 million fee per movie +

10% of the gross ticket revenues.

How is our star’s income related to the gross?

Let I denote income.

Let G denote Gross.

I = 10 + .1 G

Note: Don’t forget units! When we write it this way we need to make sure all our numbers are in millions of dollars.


6.2 Mean and variance of a linear function

Suppose y (i.e., each value of the variable y) is a linear function of x.

How are the mean and variance (standard deviation)

of y related to those of x?

Let us look at our temperature data.

Suppose we first multiply by (9/5) and then add 32.

mul = 9/5 * cel

fahr = 32 + mul

= 32 + (9/5)*cel


Variable Mean StDev

cel 32.50 20.00

mul 58.5 36.0

fahr 90.5 36.0

[Figure: dotplots of cel, mul, and fahr on a common axis running from 0 to 150]



When we multiply cel by 9/5 we affect (increase) both

the mean and the standard deviation proportionally.

If we add a constant (32 in our case) we simply

increase the mean (by the value of the constant) but leave the overall dispersion unaffected.



[Figure: three distributions, labeled (Mean: 1, Stdev: 1), (Mean: 3, Stdev: 1), and (Mean: 2, Stdev: 2)]



So, instead of using a spreadsheet, we could have used our linear formulas.

We knew that

fahr = 32 + (9/5) * cel



c0 = 32

c1 = 9/5

Our handy linear formulas tell us:

\bar{y} = c_0 + c_1 \bar{x}

s_y = |c_1| s_x

For the temperatures:

\overline{fahr} = c_0 + c_1 \overline{cel}

= 32 + (9/5)*32.5

= 90.5

s_{fahr} = |c_1| s_{cel} = |9/5| * 20 = 36

Of course, these are the same answers we got before!!


Multiply by c1: the mean changes by a factor of c1 and the std. dev. by a factor of |c1|.

Add c0: the mean increases by c0; the std. dev. is unchanged.

Aside: Why does this work? Look back 3 slides…


Aside: Why? (The hard way)

NOTE: This is way more math than we will typically need in this course.

BUT you should know these formulas are properties of our summary statistics, not just some coincidence. AND they come up again when we do probability!
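The slide's own derivation did not survive extraction. One standard way to see it, assuming y_i = c_0 + c_1 x_i for every observation and the n-1 divisor for sample variance, is:

```latex
\begin{align*}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} (c_0 + c_1 x_i) = c_0 + c_1 \bar{x}, \\
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2
       = \frac{1}{n-1}\sum_{i=1}^{n} \bigl(c_1 (x_i - \bar{x})\bigr)^2
       = c_1^2 s_x^2, \\
s_y &= \sqrt{c_1^2 s_x^2} = |c_1|\, s_x .
\end{align*}
```

The constant c_0 cancels in every deviation y_i - \bar{y}, which is why adding a constant shifts the mean but leaves the dispersion alone.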



Each Income number is 10 + .1 * the corresponding Gross number.

Suppose our movie star made 10 pictures last year and the sample mean and sample variance of the gross on the films are 100 and 900, respectively.

What are the sample mean and variance of the star’s income?
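One way to check an answer to this question is to apply the linear formulas directly. The sketch below assumes the relationship I = 10 + .1 G stated earlier, with all amounts in millions of dollars; the variable names are illustrative.

```python
import math

c0, c1 = 10.0, 0.1                   # income = 10 + .1 * gross (millions of $)
mean_gross, var_gross = 100.0, 900.0

mean_income = c0 + c1 * mean_gross   # adding c0 shifts the mean only
var_income = c1 ** 2 * var_gross     # the variance scales by c1 squared
sd_income = math.sqrt(var_income)

print(f"mean = {mean_income:.2f}, variance = {var_income:.2f}, stdev = {sd_income:.2f}")
```

Under these assumptions the mean income is 20, the variance is 9, and the standard deviation is 3 (all in millions).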



With only one x, we can get the sample standard deviation of y either by using

s_y^2 = c_1^2 s_x^2

and then taking the square root, or using

s_y = |c_1| s_x

directly. We get the same answer either way, because sample standard deviation is always the square root of sample variance.


Why are these formulas useful?

We could always just type everything into a spreadsheet and use spreadsheet functions to get the answers.

Really, though, the reason for these formulas will become apparent when we study probability, statistical inference, and regression. You cannot understand statistics or regression without a solid understanding of linear relationships.

In other words, yes, I recognize these formulas are probably the least fun part of the course (and considering this is basic stats, that’s saying something). But you absolutely must know them.



Suppose x has mean 100 and standard deviation 10.

What are the mean, standard deviation and variance of:

(i) y = 2x? (c0=0, c1=2)

(ii) y = 5+x? (c0=5, c1=1)

(iii) y = 5-2x? (c0=5, c1=-2)
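A short sketch for checking answers to the three cases. The function name `linear_summary` is hypothetical; it just applies the two linear formulas.

```python
def linear_summary(c0, c1, mean_x, sd_x):
    """Return (mean, stdev, variance) of y = c0 + c1 * x."""
    mean_y = c0 + c1 * mean_x
    sd_y = abs(c1) * sd_x        # note the absolute value: stdev is never negative
    return mean_y, sd_y, sd_y ** 2

cases = [("y = 2x", 0, 2), ("y = 5 + x", 5, 1), ("y = 5 - 2x", 5, -2)]
for label, c0, c1 in cases:
    mean_y, sd_y, var_y = linear_summary(c0, c1, mean_x=100, sd_x=10)
    print(f"{label:<11} mean={mean_y}  stdev={sd_y}  variance={var_y}")
```

Case (iii) is the one to watch: the mean becomes negative, but the standard deviation is the same as in case (i).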


6.3 Linear combinations

We may want a variable to be related to several others instead of just one. We will assume that Y is a function of X, Z, … rather than just a function of X.

When a variable y is linearly related to several others,

we call it a linear combination:

y = c_0 + c_1 x_1 + c_2 x_2 + … + c_k x_k

We say, “y is a linear combination of the x’s”.

c0 is called the intercept or just “the constant”.

ci is called the coefficient of xi.



Suppose in addition to the flat $10 million fee and 10 percent of ticket revenues, our movie star also gets 5 percent of all sales of the soundtrack (on CD) released with the movie.

How is the star’s income related to the film’s gross and

CD sales (in millions of dollars)?




Let I, G, C denote income, Gross, and CD sales:

I = 10 + .1 G + .05 C





Important example: Portfolios

Suppose you have $100 to invest.

Let x1 be the return on asset 1.

If x1 = .1, and you put all your money into asset 1, then

you will have $100*(1+.1) = $110 at the end of the period.

Let x2 be the return on asset 2.

If x2 = .15, and you put all your money into asset 2, then

you will have $100*(1+.15) = $115 at the end of the period.

Suppose you put ½ of your money into asset 1 and the other ½ of your money into asset 2.

What will happen?


At the end of the period you will have

.5*(100)*(1+.1) + .5*(100)*(1+.15) = 100*[ 1 + (.5*.1) + (.5*.15) ]

(the first term is the investment in asset 1, the second the investment in asset 2)

55 + 57.50 = $112.50

So the return on the portfolio is (.5*.1) + (.5*.15) = .125

In other words, when we put ½ of our money into asset 1 and the other ½ into asset 2, the return on the resulting portfolio is

Rp = ( ½ )*x1 + ( ½ )*x2

The return on a portfolio is a linear combination of the returns on the individual assets.
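The arithmetic above can be sketched as a small function. The name `portfolio_return` and the weight check are illustrative choices, not part of the course materials.

```python
def portfolio_return(weights, returns):
    """Portfolio return as a weighted sum of the individual asset returns."""
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("portfolio weights must sum to one")
    return sum(w * r for w, r in zip(weights, returns))

rp = portfolio_return([0.5, 0.5], [0.10, 0.15])
ending_wealth = 100 * (1 + rp)
print(f"portfolio return = {rp:.3f}, ending wealth = ${ending_wealth:.2f}")
```

With half the money in each asset this gives a return of .125 and ending wealth of $112.50, matching the calculation by hand.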


It turns out this is true in general. Suppose you have $M to invest in two assets with returns x1 and x2. Let w1 be the fraction of your wealth you choose to invest in asset 1, and w2 the fraction you invest in asset 2.

Note: For this to work, we need w1 + w2 = 1

The portfolio return is:

R_p = w_1 x_1 + w_2 x_2

The portfolio return is a linear combination of the individual asset returns. The coefficients are the “portfolio weights” (fraction of wealth invested in each asset).


Notice that the portfolio weights always sum up to one.

(If I invest 30% of my wealth in asset 1, then I have to

invest 70% of my wealth in asset 2).

When we’re talking about portfolios, we use “w1, w2, …” instead of “c1, c2, …” to remind us that weights have to sum to one. Our linear formulas work the same way in either case. Most of the time when we do portfolios, we don’t worry about the constant (c0=0).

Question for those with some finance experience:

Can portfolio weights be negative?


Suppose we have m assets.

The return on the ith asset is xi.

Put wi fraction of your wealth into asset i.

Your portfolio is determined by the portfolio weights wi.

Then, the return on the portfolio is:

R_p = w_1 x_1 + w_2 x_2 + … + w_m x_m

Your portfolio return is always a linear combination of individual asset returns, with coefficients equal to the fraction of wealth invested.


For linear combinations of 2 or more variables, variance also depends on the covariance between the x’s!!

More on this later…

6.4 Mean and variance of a linear combination

First, we consider the case where we have only two x’s.

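2 inputs:

y = c_0 + c_1 x_1 + c_2 x_2

\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2

s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2}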





For each film she does, our movie star makes $10 million plus 10% of gross ticket revenues and 5% of CD sales. Here is the data for ten movies she made last year:

Here is her income for each film. Remember,

I = 10 + .1 G + .05 C

So each number in the Income column equals 10 plus .1 times the Gross value plus .05 times the Cd value.

Note: All numbers are in millions of $.


Like before, we could type everything in and get the sample mean and variance of income using a spreadsheet.

But let’s suppose, as her agent, we already knew that:

\bar{G} = 100, s_G = 30, \bar{C} = 5, s_C = 1, r_{GC} = .8




Like before, we know that:

\bar{I} = c_0 + c_1 \bar{G} + c_2 \bar{C}

= 10 + .1*(100) + .05*(5)

= 20.25

s_I^2 = c_1^2 s_G^2 + c_2^2 s_C^2 + 2 c_1 c_2 s_{CG}

= (.1)^2(30)^2 + (.05)^2(1)^2 + 2(.1)(.05)(30)(1)(.8) = 9.24

(Here s_{CG} = (30)(1)(.8); see next slide…)



Remember, we defined sample correlation as the covariance divided by the standard deviations:

r_{xy} = s_{xy} / (s_x s_y)

So, if we know the correlation and both standard deviations, we can get back the sample covariance:

s_{xy} = r_{xy} s_x s_y

So, if we know the sample standard deviations and either of correlation or covariance, we can figure out the other. We used this trick to calculate s_{CG} on the previous slide.


In Excel ("moviestar2.xls") I have:

How would the answer

change if the correlation

between G and C

were zero?
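A sketch for exploring this question, using the numbers that appear in the calculation above (standard deviations 30 and 1, correlation .8). The function name `var_two_inputs` is hypothetical.

```python
def var_two_inputs(c1, c2, s1, s2, r):
    """Variance of c0 + c1*x1 + c2*x2 given stdevs s1, s2 and correlation r."""
    cov = r * s1 * s2                      # covariance recovered from correlation
    return c1 ** 2 * s1 ** 2 + c2 ** 2 * s2 ** 2 + 2 * c1 * c2 * cov

var_corr = var_two_inputs(0.1, 0.05, 30, 1, r=0.8)
var_zero = var_two_inputs(0.1, 0.05, 30, 1, r=0.0)
print(f"variance with r = .8: {var_corr:.4f}")
print(f"variance with r = 0 : {var_zero:.4f}")
```

With zero correlation the covariance term drops out, so the variance falls from about 9.24 to about 9.00. The change is small here because the CD coefficient and standard deviation are both small.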


Example (the country data again)

Let us use our country data and suppose that we had put

.5 into USA and .5 into Hong Kong.

What would our returns have been?

port = .5*honkong + .5*usa

honkong usa port

0.02 0.04 0.030

0.06 -0.03 0.015

0.02 0.01 0.015

-0.03 0.01 -0.010

0.08 0.05 0.065


For each month, we get the portfolio return as ½*honkong + ½*usa.


Here w1 (= c1) = .5 is the weight on honkong and w2 (= c2) = .5 is the weight on usa.


The sample means are: \overline{honkong} = 0.02103 and \overline{usa} = 0.01346

The sample mean of our portfolio returns is:

\overline{port} = w_1 \overline{honkong} + w_2 \overline{usa}

= .5*.02103 + .5*.01346 = .01724


Let us do the same exercise for the variance:

Diagonals are variances,

off-diagonals are covariances;

StatPro will make this table for

you automatically!


honkong usa port

honkong 0.00521497

usa 0.00103037 0.00110774

port 0.00312267 0.00106906 0.00209586

port = .5*honkong + .5*usa

As before, we apply the formula:

s_{port}^2 = w_1^2 s_{honkong}^2 + w_2^2 s_{usa}^2 + 2 w_1 w_2 s_{honkong,usa}

= (.5)^2(.00521) + (.5)^2(.00111) + 2*(.5)*(.5)*(.00103) = .0021

(Note that s_{port} = (.0021)^{1/2} ≈ .046)
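The same calculation can be sketched in Python with the unrounded table entries. The function name `port_var` is illustrative; the variances and covariance are copied from the table above.

```python
import math

# Variances and covariance from the table above.
var_hk = 0.00521497
var_usa = 0.00110774
cov_hk_usa = 0.00103037

def port_var(w_hk, w_usa):
    """Variance of w_hk * honkong + w_usa * usa."""
    return w_hk ** 2 * var_hk + w_usa ** 2 * var_usa + 2 * w_hk * w_usa * cov_hk_usa

v_half = port_var(0.5, 0.5)
print(f"50/50 portfolio: variance = {v_half:.8f}, stdev = {math.sqrt(v_half):.4f}")

# The same function handles any other pair of weights that sum to one.
v_75_25 = port_var(0.75, 0.25)
print(f"75/25 portfolio: variance = {v_75_25:.8f}")
```

The 50/50 result agrees with the port entry on the diagonal of the table (0.00209586), which is a useful check that the formula and the spreadsheet are doing the same thing.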


What if we had put 25% into USA and 75% into Hong Kong?


honkong usa port2

honkong 0.00521497

usa 0.00103037 0.00110774

port2 0.00416882 0.00104972 0.00338905

port2 =.75*honkong +.25*usa

To get s_{port2}^2 just use the SAME formula from the previous

slide, except now with w1=.75 and w2=.25

(.75)^2(.00521) + (.25)^2(.00111) + 2*(.75)*(.25)*(.00103)

= .00339


How do the returns on the w1=w2=.5 portfolio compare with

those of Hong Kong and USA?

It looks like the mean for my portfolio is right in between the means of USA and Hong Kong.

What about the standard deviation?

\overline{port} = .0172

s_{port} = .046

The sample standard deviation is less than halfway between s_{usa} and s_{honkong} … what happened?


Why is covariance important?

We just used the formula from Slide 141. It is often useful to rewrite the variance formula in terms of the correlation r between x1 and x2:

s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 r s_{x_1} s_{x_2}

Remember, correlations are between -1 and 1!

IF x1 and x2 are perfectly correlated (r=1), then

s_y^2 = (c_1 s_{x_1} + c_2 s_{x_2})^2

So in this case (with c1 and c2 positive),

s_y = c_1 s_{x_1} + c_2 s_{x_2}

BUT in general, when c1 and c2 are positive,

s_y \le c_1 s_{x_1} + c_2 s_{x_2}

The basic idea here is: when we take averages, variance gets smaller.

The smaller the correlation, the faster this happens.

This is actually one of the most important ideas in statistics, and we’ll see it again!!

It is also one of the most important ideas in finance, because it leads to diversification.
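To see how the correlation drives this effect, here is a sketch that holds the weights and standard deviations fixed and varies only r. The standard deviations .07 and .03 are hypothetical values chosen for illustration, not taken from the country data.

```python
import math

def port_sd(w1, w2, s1, s2, r):
    """Stdev of w1*x1 + w2*x2 when x1 and x2 have correlation r."""
    variance = w1 ** 2 * s1 ** 2 + w2 ** 2 * s2 ** 2 + 2 * w1 * w2 * r * s1 * s2
    return math.sqrt(variance)

s1, s2 = 0.07, 0.03          # hypothetical standard deviations
w1 = w2 = 0.5                # equal weights

sds = [port_sd(w1, w2, s1, s2, r) for r in (1.0, 0.5, 0.0, -0.5, -1.0)]
for r, sd in zip((1.0, 0.5, 0.0, -0.5, -1.0), sds):
    print(f"r = {r:+.1f}  portfolio stdev = {sd:.4f}")
```

Only at r = 1 does the portfolio standard deviation equal the weighted average of the two standard deviations (.05 here); for every smaller correlation it is lower.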


Example (Optional)

y = .5x1 + .5 x2

At each point we

plot the value of y.

The variances and

covariance are:

x1 x2

x1 1.334636

x2 -1.208679 1.106238

The dashed lines are drawn at

the mean of x1 and x2.

Then, the variance of y is

0.0058105 = .5*.5*1.3346 + .5*.5*1.106 +2*.5*.5*(-1.208679)

Why is the variance of y so much smaller than those of the x’s ?


Example (Optional)

y = .5x1 + .5 x2

At each point we

plot the value of y.

The variances and

covariance are:

x1 x2

x1 1.158167

x2 1.046490 0.9609463

The dashed lines are drawn at

the mean of x1 and x2.

Then, the variance of y is

1.053 = .5*.5*1.158 + .5*.5*.961 + 2*.5*.5*1.0465

Why is the variance of y similar to those of the x’s ?


Example (Optional)

y = .5x1 + .5 x2

At each point we

plot the value of y.

The variances and

covariance are:

x1 x2

x1 1.3870537

x2 0.1976187 0.8247886

The dashed lines are drawn at

the mean of x1 and x2.

Then, the variance of y is

0.65175=.5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976

Why is the variance of y less than those of x1 and x2 ?
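One way to approach the three "why" questions is to convert each covariance to a correlation. The sketch below uses the unrounded table entries from the three optional examples; the dictionary keys are just labels. Because the slides round their inputs, the first variance comes out slightly different from the 0.0058105 shown there.

```python
import math

# (variance of x1, variance of x2, covariance) from the three examples above.
examples = {
    "first": (1.334636, 1.106238, -1.208679),
    "second": (1.158167, 0.9609463, 1.046490),
    "third": (1.3870537, 0.8247886, 0.1976187),
}

results = {}
for name, (v1, v2, cov) in examples.items():
    r = cov / math.sqrt(v1 * v2)                           # correlation
    var_y = 0.5 * 0.5 * v1 + 0.5 * 0.5 * v2 + 2 * 0.5 * 0.5 * cov
    results[name] = (r, var_y)
    print(f"{name:<6} correlation = {r:+.3f}  variance of y = {var_y:.4f}")
```

The first example has a correlation close to -1 (the x's nearly cancel), the second close to +1 (averaging does almost nothing), and the third is only weakly correlated (averaging helps, but less dramatically).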


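3 inputs:

y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3

The formula for the sample mean is basically the same, just one more term because there’s one more x:

\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3

Note that there are now THREE “covariance terms”, one for each PAIR of x’s:

s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2 c_1 c_2 s_{x_1 x_2} + 2 c_1 c_3 s_{x_1 x_3} + 2 c_2 c_3 s_{x_2 x_3}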


Example: Portfolio with 3 inputs

port = .1*fidel+.4*eqmrkt+.5*windsor


port fidel eqmrkt windsor

port 0.00306760

fidel 0.00280224 0.00320210

eqmrkt 0.00369384 0.00319150 0.00470021

windsor 0.00261967 0.00241087 0.00298922 0.00236580

s_{port}^2 = w_1^2 s_{fidel}^2 + w_2^2 s_{eqmrkt}^2 + w_3^2 s_{windsor}^2

+ 2 w_1 w_2 s_{fidel,eqmrkt} + 2 w_1 w_3 s_{fidel,windsor} + 2 w_2 w_3 s_{eqmrkt,windsor}

.0030676 = (.1)*(.1)*.00320 + (.4)*(.4)*.00470 + (.5)*(.5)*.00236

+ 2*[ (.1)*(.4)*.00319 + (.1)*(.5)*.00241 + (.4)*(.5)*.00299 ]
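With three or more inputs it is easier to let a loop generate all the terms. The sketch below sums w_i * w_j * s_ij over every ordered pair, which counts each covariance twice and each variance once, exactly as in the formula. The data structures and the helper `cov_of` are illustrative; the numbers come from the table above.

```python
weights = {"fidel": 0.1, "eqmrkt": 0.4, "windsor": 0.5}

# Lower triangle of the covariance table (variances on the diagonal).
cov = {
    ("fidel", "fidel"): 0.00320210,
    ("eqmrkt", "fidel"): 0.00319150,
    ("eqmrkt", "eqmrkt"): 0.00470021,
    ("windsor", "fidel"): 0.00241087,
    ("windsor", "eqmrkt"): 0.00298922,
    ("windsor", "windsor"): 0.00236580,
}

def cov_of(a, b):
    """Covariance is symmetric, so look the pair up in either order."""
    return cov[(a, b)] if (a, b) in cov else cov[(b, a)]

port_variance = sum(
    weights[a] * weights[b] * cov_of(a, b) for a in weights for b in weights
)
print(f"portfolio variance = {port_variance:.8f}")
```

The result agrees with the port entry on the diagonal of the table (0.00306760).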


Let us try a portfolio with three stocks.

Let us go short on Canada (i.e., we borrow Canada to invest

in the other stocks)

port = -.5*canada+usa+.5*honkong




is an interesting

thing to do!


Aside: Why would we form portfolios?

Maybe the portfolio has a nice mean and variance (i.e.

nice “average return” and nice “risk”)

Because portfolio returns are linear combinations of returns on individual assets, we can apply our linear formulas to find the average return and risk of any possible portfolio as long as we know the means, variances, and covariances of the individual asset returns. These formulae are fundamental tools for those who really understand finance.

And remember our “when we take averages, variance gets smaller” idea? In finance, that’s known as diversification…


Example (Optional)

Cut from a Finance Textbook:


K inputs (Optional): Suppose


I won’t ask you to do calculations by hand for more than 3 inputs,

this is just to give you an idea of what the formulas look like.
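The slide's formulas were lost in extraction. The standard general form, written to match the 2-input and 3-input cases above, is sketched here under the assumption that y = c_0 + c_1 x_1 + … + c_K x_K:

```latex
\begin{align*}
\bar{y} &= c_0 + \sum_{i=1}^{K} c_i \bar{x}_i, \\
s_y^2 &= \sum_{i=1}^{K} c_i^2 s_{x_i}^2
       + 2 \sum_{i=1}^{K} \sum_{j=i+1}^{K} c_i c_j s_{x_i x_j}.
\end{align*}
```

There is one variance term per input and one covariance term per pair of inputs, so K inputs give K(K-1)/2 covariance terms (1 for K=2, 3 for K=3).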


7. Linear Regression

This is data on 128 homes. (Housing data)

x=size (square feet) y = price (dollars)


Clearly, the data are correlated:

But what is the equation of the line you would draw

through the data?

Linear regression fits a line to the plot.


When I "run a regression" I get values for

the intercept and the slope.

y = (intercept) + (slope) * x




Here is the

scatter plot

with the line

drawn through it.

Looks reasonable!


It turns out the formulas for the slope and the intercept are

slope = s_{xy} / s_x^2

intercept = \bar{y} - (slope) * \bar{x}

We’ll see these later (you don’t need to know these now).

But it isn’t that hard to see what they do!

The slope formula takes covariance and “standardizes” it

so that its units are (units of y)/(units of x)

The intercept formula makes our line pass through

the point (\bar{x}, \bar{y})
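To make the two roles concrete, here is a sketch that fits a line using the usual least-squares formulas (slope = covariance / variance of x, intercept chosen so the line passes through the means). The five data points are hypothetical and are NOT the housing data; they are only there to exercise the formulas.

```python
# Hypothetical data: size in square feet, price in dollars.
size = [1000, 2000, 3000, 4000, 5000]
price = [200000, 400000, 500000, 400000, 500000]

n = len(size)
x_bar = sum(size) / n
y_bar = sum(price) / n

# Sample covariance of (size, price) and sample variance of size.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price)) / (n - 1)
s_xx = sum((x - x_bar) ** 2 for x in size) / (n - 1)

slope = s_xy / s_xx                  # units: dollars per square foot
intercept = y_bar - slope * x_bar    # forces the line through (x_bar, y_bar)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"line at x_bar: {intercept + slope * x_bar:.2f}  (y_bar = {y_bar:.2f})")
```

Dividing the covariance (dollars times square feet) by the variance of x (square feet squared) is what leaves the slope in dollars per square foot.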


Regression and Prediction

Suppose you had a house and you knew the size = 2000

but you do not know the price.

How could you use regression to guess or "predict"

the price?

Just plug the size into the equation of the line:

estimated price = -10091.1299 +70.2263*2000

= 130361.5
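The plug-in step can be written as a one-line function. The coefficients are the ones quoted above; the function name `predict_price` is hypothetical.

```python
INTERCEPT = -10091.1299   # fitted intercept quoted above
SLOPE = 70.2263           # fitted slope (dollars per square foot)

def predict_price(size_sqft):
    """Estimated price from the fitted line: intercept + slope * size."""
    return INTERCEPT + SLOPE * size_sqft

estimate = predict_price(2000)
print(f"estimated price for 2000 sq ft: {estimate:.1f}")
```

The same function gives a guess for any other size, though a prediction far outside the range of sizes in the data should be treated with caution.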


Correlation and covariance are "symmetric".

The covariance between y and x is the same

thing as the covariance between x and y.

Regression is not symmetric.

We regress y on x.

y: dependent variable

x: independent variable.

"y depends on x".

In our example y=price depends on x = size.

How do we know “what is x and what is y”?


8. Pivot Tables (Optional)

Up till now, we have tried to look at pairs of variables.

Of course, it would be interesting to look at more

than two at a time.

The Pivot Table utility in Excel uses tables to do this.

But the tables can be "more than two way" and you

can put a summary for another variable in each cell.

The simple two way tables we looked at earlier

were also created using pivot tables.


In each cell is printed the average of the cigs dummy.

This gives the percentage of smokers.

The cells are determined by a binned version of age

and sex.

In the age group 16-25, 53% of female respondents

are smokers.

This table attempts to look at 3 variables at the same time!!


The Hockey Data

We have data on every penalty called in the NHL

from 95-96 to 2001-2002. Data below is a

subsample of size 5000.

oppcall = 1 if the penalty switches: that is, if A is playing B and the last penalty was on B, then oppcall = 1 if this penalty is on A.

Each row corresponds to a penalty.

(The first penalty in a game can't be included.)

timespan = time between penalties (mins)

laghome = 1 if the last penalty was on the home team

goaldiff = lead of the last penalized team

inrow2 = 1 if the last two penalties were on the same team

laghomeT: h if laghome = 1

inrowT: two if inrow2 = 1



[Pivot table annotations:]

the home team is behind and you just called two in a row on them

if the last penalty was on the home team, more likely to switch

if you just called two in a row on the same team, more likely to switch

if the last penalized team was ahead, less likely to switch