Business Statistics

Business Statistics - QBM117 Scatter diagrams and measures of association

Objectives
• To briefly introduce the topic of regression and correlation.
• To explore relationships between two variables using the graphical technique of scatter diagrams.
• To introduce two measures of association that can be used to measure the strength of the association between two variables.

Regression and correlation: measuring and predicting relationships
In earlier modules we learnt to look at data, compute and interpret probabilities, draw random samples and perform statistical inference. Now we apply these concepts to explore relationships between variables.
• Regression and correlation show us how to summarise the relationship between two factors, based on a bivariate (two-variable) set of data.
• Correlation is a measure of the strength of the relationship between the two variables.
• Regression helps us to predict one variable from the other.

In our earlier studies we learnt to summarise univariate (single-variable) data using statistical summaries such as the mean to describe the centre and the standard deviation to describe the variability. With bivariate data we could use these same statistics to summarise each variable separately. However, the payoff comes from studying them both together to explore the relationship between them.

Exploring relationships using scatterplots
Economists and business operators are often interested in relationships between two quantitative variables. For example:
• How does advertising affect sales in my business?
• If I increase the price of this product, what effect will this have on demand?
• What effect are inflation rates having on unemployment rates, on the price of petrol, on the price of new homes, etc.?

Exploring relationships using scatterplots and correlations
Scatterplots provide useful insights into the structure of the data, such as:
• Is the relationship between the two variables linear or nonlinear?
• Are there any outliers in the data?
• What is the strength of the relationship between the two variables?

Correlation is a summary measure of the strength of the relationship. It is both helpful and limited.
• If the scatterplot shows either a well-behaved linear relationship or no relationship at all, then the correlation provides an excellent summary of the relationship.
• If, however, there are problems with the data, such as a nonlinear relationship or outliers, the correlation can be misleading.
Therefore correlation on its own has limited use, as its interpretation depends on the type of relationship in the data.

The Scatterplot
• A scatterplot is simply a plot of all the data, with one point for each pair of values.
• If one variable is seen as causing, affecting or influencing the other, then it is plotted on the x (horizontal) axis. This variable is referred to as the independent variable.
• The variable that is affected or influenced by the other is plotted on the y (vertical) axis. This variable is referred to as the dependent variable.
• If neither causes, affects or influences the other, it does not matter which one is plotted on which axis.

Correlation measures the strength of the relationship between the two variables
• Correlation, denoted ρ (rho) for a population and r for a sample, varies from –1 to +1, summarising the strength of the relationship in the data.
• A correlation of 1 indicates a perfect straight-line relationship, with higher values of one variable associated with perfectly predictable higher values of the other variable.
• A correlation of –1 indicates a perfect inverse straight-line relationship, with one variable decreasing as the other increases.
• For correlations between –1 and 1, the size of the correlation indicates the strength of the relationship while the sign (+ or –) indicates the direction (increasing or decreasing).

• A correlation of 0 generally indicates no linear relationship, just randomness.
• Correlations must be interpreted with caution, as nonlinear structures and outliers can distort the usual interpretation.
• Correlation measures how close the data points are to lying exactly on a tilted straight line. It has nothing to do with the steepness (slope) of the line.
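
The point that r reflects closeness to a line rather than its slope can be checked numerically. The following is a minimal sketch, not part of the lecture material: the helper name pearson_r and the small data sets are illustrative assumptions, and the function uses the standard formula for the sample correlation coefficient (the sum of cross-deviations divided by the square root of the product of the sums of squared deviations).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either variable is constant.
    return sxy / sqrt(sxx * syy)


# Hypothetical data: three exact straight lines with very different slopes.
x = [1, 2, 3, 4, 5]
steep = [10 * v for v in x]        # slope 10
shallow = [0.1 * v for v in x]     # slope 0.1
falling = [-2 * v + 7 for v in x]  # slope -2

print(pearson_r(x, steep))    # perfect increasing line
print(pearson_r(x, shallow))  # perfect increasing line, much flatter
print(pearson_r(x, falling))  # perfect decreasing line
```

Up to floating-point rounding, the first two results equal 1 even though the slopes differ by a factor of 100, and the third equals –1.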

Interpreting Correlation
[Figure: scatterplots of Y against X illustrating each case]
• r = 1: a perfect straight line tilting up to the right
• r = 0: no overall tilt; no relationship?
• r = –1: a perfect straight line tilting down to the right

Various types of relationships
A linear relationship is observed when
• the scatterplot shows points bunched randomly around a straight line.
• The points could be tightly bunched, falling almost exactly on a line, or, more likely, they will be well scattered, forming a ‘cloud’ of points.

Example: Exploring TV Ratings
[Figure: scatterplot of People Meters (vertical axis) against Nielsen Index (horizontal axis)]
• People Meters vs. Nielsen Index
• Two measures of the market share of 10 TV shows
• Correlation is r = 0.974
• Very strong positive association (since r is close to 1)
• Linear relationship: a straight line with scatter
• Increasing relationship: tilts up and to the right

Example: Merger Deals
[Figure: scatterplot of Dollars (Billions) (vertical axis) against Deals (horizontal axis)]
• Dollars vs. Deals
• For mergers and acquisitions by investment bankers
• 134 deals worth $63 billion by Goldman Sachs
• Correlation is r = 0.790
• Strong positive association
• Linear relationship: a straight line with scatter
• Increasing relationship: tilts up and to the right

Example: Mortgage Rates & Fees
[Figure: scatterplot of Interest rate (vertical axis) against Loan fee (horizontal axis)]
• Interest Rate vs. Loan Fee
• For mortgages
• If the interest rate is lower, does the bank make it up with a higher loan fee?
• Correlation is r = –0.654
• Negative association
• Linear relationship: a straight line with scatter
• Decreasing relationship: tilts down and to the right

Various types of relationships
No relationship is observed when
• the scatterplot shows a random scatter of points with no tilt either upward or downward.
• The points could look like a ‘cloud’ that is either circular or oval shaped.
• The oval could be oriented either up and down or left and right, but it is not tilted (as you move from left to right).

Example: The Stock Market
• Today’s vs. Yesterday’s Percent Change
• Is there momentum?
• If the market was up yesterday, is it more likely to be up today? Or is each day’s performance independent?
• Correlation is r = 0.11
• A weak relationship? No relationship?
• Tilt is neither up nor down
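
A comparison of today's change with yesterday's can be set up by pairing each value in a series with the one before it. The sketch below assumes a made-up list of daily percent changes (not the lecture's market data) and hypothetical helper names pearson_r and lag_pairs, so the printed value is not the r = 0.11 quoted above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def lag_pairs(changes):
    """Pair each day's change with the previous day's change.

    Returns (yesterday, today): two lists, each one element shorter
    than the input series.
    """
    return changes[:-1], changes[1:]


# Hypothetical daily percent changes, for illustration only.
daily_changes = [0.4, -0.2, 0.1, 0.7, -0.5, 0.3, -0.1, 0.2, -0.6, 0.5]

yesterday, today = lag_pairs(daily_changes)
print(round(pearson_r(yesterday, today), 3))
```

A value near 0 would be consistent with each day's performance being unrelated to the previous day's, while a clearly positive value would suggest momentum.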

Various types of relationships
A nonlinear relationship is observed when
• the scatterplot shows points bunched around a curve, rather than a straight line.
• Correlation and regression analysis must be used with care on nonlinear data sets.
• For most problems we first transform one or both of the variables to obtain a linear relationship, then we fit a regression.
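
The effect of transforming a variable can be sketched numerically. The example below assumes made-up data that follow an exponential curve and assumes a logarithmic transformation of y; the lecture does not specify which transformation to use, and the helper name pearson_r is illustrative.

```python
from math import exp, log, sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical curved data: y grows exponentially with x.
x = list(range(1, 9))
y = [2 * exp(0.5 * v) for v in x]

# Assumed transformation: take logs of y, which turns the curve into a line
# because log(2 * exp(0.5 * x)) = log(2) + 0.5 * x.
log_y = [log(v) for v in y]

r_raw = pearson_r(x, y)      # correlation on the curved data
r_log = pearson_r(x, log_y)  # correlation after the transformation

print(round(r_raw, 3), round(r_log, 3))
```

On the raw data r falls short of 1 even though y is perfectly predictable from x; after the transformation the points lie on a straight line and r equals 1 up to rounding.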

Example: Stock Options
• Call Price vs. Strike Price
• For stock options
• “Call Price” is the price of the option contract to buy stock at the “Strike Price”
• The right to buy at a lower strike price has more value
• A nonlinear relationship: not a straight line but a curved relationship
• Correlation r = –0.895
• A negative relationship: higher strike price goes with lower call price

Example: Maximizing Yield
[Figure: scatterplot of Yield of process (vertical axis) against Temperature (horizontal axis)]
• Output Yield vs. Temperature
• For an industrial process
• With a “best” (optimal) temperature setting
• A nonlinear relationship: not a straight line but a curved relationship
• Correlation r = –0.0155
• r suggests no relationship, but the relationship is strong
• It tilts neither up nor down
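
How a strong curved relationship can produce a correlation near zero can be illustrated with a small made-up data set. The yields below are hypothetical values placed on a symmetric curve peaking at a temperature of 700; they are not the lecture's data, and the helper name pearson_r is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical process: yield peaks at 700 and falls off symmetrically.
temperature = [500, 600, 700, 800, 900]
process_yield = [160 - 10 * ((t - 700) / 100) ** 2 for t in temperature]

r = pearson_r(temperature, process_yield)
print(process_yield)
print(r)
```

Here the yield is completely determined by the temperature, yet the rising half of the curve cancels the falling half in the cross-deviation sum, so r comes out at zero. This is why the scatterplot should be examined before the correlation is interpreted.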

Outliers
• A data point is an outlier if it does not fit the relationship of the rest of the data.
• An outlier can distort statistical summaries and make them very misleading.
• Watch out for outliers by looking at the scatterplot. If you can justify removing an outlier (by finding that it should not have been there), then do so.
• If you have to leave it in, be aware of the problems it can cause and consider reporting statistical summaries (e.g. the correlation coefficient) both with and without it.

Example: Cost and Quantity
[Figure: two scatterplots of Cost (vertical axis) against Number produced (horizontal axis). With the outlier included, r = –0.623. Outlier removed (more details visible), r = 0.869]
• Cost vs. Number Produced
• For a production facility
• It usually costs more to produce more
• An outlier is visible: a disaster (a fire at the factory), with high cost but few produced
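
Reporting the correlation both with and without an outlier can be sketched as follows. The production figures are made up for illustration and will not reproduce the r values quoted in the example; the helper name pearson_r and the choice of outlier are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical normal weeks: cost rises with the number produced.
number_produced = [20, 25, 30, 35, 40, 45, 50]
cost = [2000, 2400, 2900, 3300, 3800, 4200, 4700]

# Hypothetical disaster week: very few units produced at a very high cost.
outlier_number, outlier_cost = 5, 10000

r_without = pearson_r(number_produced, cost)
r_with = pearson_r(number_produced + [outlier_number], cost + [outlier_cost])

print("without outlier:", round(r_without, 3))
print("with outlier:   ", round(r_with, 3))
```

With these numbers a single unusual point is enough to turn a strong positive correlation into a negative one, which is the kind of distortion the scatterplot makes visible.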

Reading for next lecture
Read Chapter 18, Sections 18.1 – 18.3 (Chapter 11, Sections 11.1 – 11.3 abridged)
