Bivariate data analysis

Bivariate Data analysis 4


If the relationship is linear, the residuals plotted against the original x-values will be scattered randomly above and below the horizontal line at zero.


A scatter plot of residuals versus the x-values should be boring and have no interesting features, like direction or shape. It should stretch horizontally with about the same amount of scatter throughout. It should show no curves or outliers.
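As a minimal sketch of how such residuals are obtained, the example below fits a least-squares line with NumPy and computes residual = observed y minus fitted y. The data are synthetic and chosen only for illustration; they do not come from the slides.

```python
import numpy as np

# Synthetic, illustrative data (not from the slides): a linear trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

# Least-squares line; np.polyfit returns the highest-power coefficient first.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# For a least-squares line with an intercept, the residuals average to zero
# and are uncorrelated with x. Curves, fans, or outliers still have to be
# judged by plotting residuals against x.
print("mean residual:", residuals.mean())
print("corr(residuals, x):", np.corrcoef(residuals, x)[0, 1])
```

Plotting `residuals` against `x` (for example with a scatter plot in any plotting library) gives the display described above.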


r = 0.87 indicates a strong linear relationship between x and y


The scatter plot below, however, shows that the relationship is clearly non-linear


When examining residuals to check whether a linear model is appropriate, it is usually best to plot them. The variation in the residuals is the key to assessing how well the model fits.


The pattern of residuals looks like a parabola. This indicates that the relationship is not really linear and is more likely to be quadratic.
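The sketch below illustrates this situation with synthetic data (an assumption for illustration, not the data behind the slide): y is exactly quadratic in x, the correlation is still high, and the residuals from a straight-line fit are positive at both ends and negative in the middle.

```python
import numpy as np

# Synthetic quadratic data, for illustration only.
x = np.linspace(0, 10, 41)
y = x ** 2

# The correlation is high even though the relationship is curved.
r = np.corrcoef(x, y)[0, 1]

# Residuals from a straight-line fit trace out a parabola.
b1, b0 = np.polyfit(x, y, 1)
res_line = y - (b0 + b1 * x)

# Residuals from a quadratic fit show no pattern left (here they vanish).
a, b, c = np.polyfit(x, y, 2)
res_quad = y - (a * x ** 2 + b * x + c)

mid = len(x) // 2
print("r =", round(r, 3))
print("line residuals (start, middle, end):", res_line[0], res_line[mid], res_line[-1])
print("largest quadratic residual:", np.max(np.abs(res_quad)))
```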


Discuss this data.


Outlier?

Discuss this situation.


Discuss the plot of the residuals


Discuss this scatter plot


Linear?


Residuals


Useful website

  • http://stat-www.berkeley.edu/~stark/Java/Correlation.htm (plots residuals, regression lines, etc.)


Many of our tools for displaying and summarizing data work only when the data meet certain conditions.

We cannot use a linear model unless the relationship between two variables is linear.

Often re-expression can save the day, straightening bent relationships so that we can fit and use a simple linear model.


Displays of the residuals can often help you find subsets in the data.


When a scatterplot shows a curved form that consistently increases or decreases, we can often straighten the form of the plot by re-expressing one or both of the variables.


The correlation is 0.979. That sounds pretty high, but the scatter plot shows something is not quite right.


Re-expressing f/stop speed by squaring straightens the plot.


This plot looks ‘straight’. The correlation is now 0.998, but the increase in correlation is not important. (The original value of 0.979 is already large.) What is important is that the form of the plot is now straight, so the correlation is now an appropriate measure of association.
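The effect of squaring can be sketched with synthetic data (not the camera data from the slides, which are not reproduced here). In this assumed example y grows like the square root of x, so the raw plot bends and squaring y straightens it. The helper name `line_residuals` is hypothetical.

```python
import numpy as np

# Synthetic, illustrative data: y grows like the square root of x.
x = np.arange(1.0, 21.0)
y = 2.0 * np.sqrt(x)

def line_residuals(u, v):
    """Residuals of the least-squares line predicting v from u."""
    slope, intercept = np.polyfit(u, v, 1)
    return v - (intercept + slope * u)

# Correlation before and after re-expressing y by squaring.
r_raw = np.corrcoef(x, y)[0, 1]
r_sq = np.corrcoef(x, y ** 2)[0, 1]

# Raw fit: residuals are negative at both ends and positive in the middle,
# the signature of a bent plot. After squaring, the residuals vanish.
res_raw = line_residuals(x, y)
res_sq = line_residuals(x, y ** 2)

print("r before:", round(r_raw, 3), " r after:", round(r_sq, 3))
print("raw residuals (start, middle, end):", res_raw[0], res_raw[10], res_raw[-1])
print("largest residual after squaring:", np.max(np.abs(res_sq)))
```

The correlation is already high before re-expression, so the useful evidence is the change in the residual pattern rather than the change in r.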


Goals of re-expression

  • Make the distribution (as seen in its histogram, for example) more symmetric.

  • Make the form of the scatter plot more nearly linear.

  • Make the scatter in a scatter plot spread out evenly rather than following a fan shape.


Some hints

  • Try y² for unimodal distributions skewed to the left.

  • Try square root of y for counted data.

  • Try logs for measurements that can’t be negative and especially when they grow by percentage increases.

  • Try -1/y or -1/(square root of y).

  • Logs straighten exponential trends and pull in a long right tail.

  • Logs straighten power curves.
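The last two hints can be checked with a short sketch. For an exponential trend y = a·e^(bx), log y is linear in x; for a power curve y = a·x^b, log y is linear in log x. The data and the constants a = 2.0, b = 0.3 and b = 1.5 below are assumptions chosen for illustration.

```python
import numpy as np

x = np.linspace(1, 10, 30)
y_exp = 2.0 * np.exp(0.3 * x)   # exponential trend, y = a * e^(b x)
y_pow = 2.0 * x ** 1.5          # power curve, y = a * x^b

# Exponential: log y against x is a line with slope b and intercept log a.
b_exp, log_a_exp = np.polyfit(x, np.log(y_exp), 1)

# Power: log y against log x is a line with slope b and intercept log a.
b_pow, log_a_pow = np.polyfit(np.log(x), np.log(y_pow), 1)

print("exponential: b =", b_exp, " a =", np.exp(log_a_exp))
print("power:       b =", b_pow, " a =", np.exp(log_a_pow))
```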


Try y versus x²


Try log or 1/x


Don’t stray too far from the powers suggested. Taking a high power may artificially inflate R², but it won’t give a useful or meaningful model. It is better to stick with powers between 2 and -2. Even in that range you should prefer the simpler powers in the ladder to those in the cracks. A square root is easier to understand than the 0.413 power.
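A rough sketch of working through the simple powers between 2 and -2 is given below. The names `reexpress`, `LADDER` and `straightest` are hypothetical, power 0 is taken to mean the logarithm, and negative powers are negated (as in -1/y) so the direction of the relationship is preserved. Picking the power with the largest |r| is only a crude screen, assumed here for illustration; the residual plot remains the real check.

```python
import numpy as np

# Simple powers only, between 2 and -2 (0 stands for the logarithm).
LADDER = [2, 1, 0.5, 0, -0.5, -1, -2]

def reexpress(y, power):
    """Re-express positive values y by a power on the ladder."""
    y = np.asarray(y, dtype=float)
    if power == 0:
        return np.log(y)
    # Negating negative powers (e.g. -1/y) keeps the ordering of the data.
    return np.sign(power) * y ** power

def straightest(x, y, ladder=LADDER):
    """Ladder power whose re-expressed y has the largest |r| with x."""
    return max(ladder, key=lambda p: abs(np.corrcoef(x, reexpress(y, p))[0, 1]))

# Synthetic checks: a squared trend and an exponential trend.
x = np.linspace(1, 10, 25)
print(straightest(x, x ** 2))      # square root undoes the squaring
print(straightest(x, np.exp(x)))   # log undoes the exponential
```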


Comparing histograms and scatter graphs


The data in the scatter plot below shows the progression of the fastest times for the men’s marathon since the Second World War. We may want to use this data to predict the fastest time at 1 January 2010 (i.e. 64 years after 1 January 1946).

Page 53


Possible solutions

  • a quadratic (y = ax² + bx + c)

  • an exponential function (y = ae^(bx))

  • a power function (y = ax^b)

  • two separate straight lines: one for, say, 0 to 23 years and one for, say, 23 to 60 years

  • a line for only the later years, say 23 – 60 years
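The first three candidates can be compared with a sketch like the one below. The marathon data are not reproduced in this transcript, so the example uses synthetic, decreasing data generated from an assumed power law with noise. The exponential and power fits are obtained by fitting a line to logged data, which is one common approach and not necessarily the method used for the slides. The helper `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (NOT the marathon records): times that fall quickly, then level off.
rng = np.random.default_rng(1)
x = np.arange(1.0, 61.0)                                # years since the start
y = 150.0 * x ** -0.04 + rng.normal(0.0, 0.3, x.size)   # times in minutes

fits = {}

# Quadratic: y = a x^2 + b x + c
a, b, c = np.polyfit(x, y, 2)
fits["quadratic"] = a * x ** 2 + b * x + c

# Exponential: y = A e^(k x), fitted as a line to log y against x
k, log_A = np.polyfit(x, np.log(y), 1)
fits["exponential"] = np.exp(log_A) * np.exp(k * x)

# Power: y = A x^p, fitted as a line to log y against log x
p, log_A = np.polyfit(np.log(x), np.log(y), 1)
fits["power"] = np.exp(log_A) * x ** p

for name, fitted in fits.items():
    print(name, round(r_squared(y, fitted), 4))
```

A high R² alone does not settle the choice; each fitted curve also has to behave sensibly when extended to the year being predicted.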


Quadratic

  • Curve seems to fit

  • R² = 0.9592 is very high

  • Inappropriate to quote r as the relationship is not linear

  • Predicted time eventually starts increasing (not sensible)
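The last point follows from the shape of a parabola: with a > 0, the quadratic y = ax² + bx + c reaches its minimum at x = -b/(2a) and rises after that. The sketch below uses hypothetical coefficients, chosen only to show the effect; they are not the fitted values from the slide.

```python
def quadratic(x, a, b, c):
    """Value of a x^2 + b x + c."""
    return a * x ** 2 + b * x + c

# Hypothetical coefficients (not the fitted marathon model).
a, b, c = 0.005, -0.6, 145.0

# The curve stops decreasing at the turning point x = -b / (2a).
turning_x = -b / (2 * a)

before = quadratic(turning_x - 10, a, b, c)
at_min = quadratic(turning_x, a, b, c)
after = quadratic(turning_x + 10, a, b, c)

print("turning point at x =", turning_x)
print("predicted values before, at, and after:", before, at_min, after)
```

Any prediction made beyond the turning point would claim that the fastest time gets slower, which is why the quadratic is unsuitable for extrapolation here.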

Page 54


Exponential

  • Doesn’t fit the data points particularly well


Power Function

  • Reasonable fit

  • R² = 0.9401 is high


Line for only the later years (1969-2003)

  • Reasonable fit

  • R² is high

  • Note: We use only the later-years line for the prediction and ignore the earlier years


The data in the scatter plot below comes from a random sample of 60 models of new cars taken from all models on the market in New Zealand in May 2000. We want to use the engine size to predict the weight of a car.

  • Seems to be linear for engine sizes less than 2500cc.

  • Very weak or no linear relationship for engine sizes over 2500cc.

  • Solution: Fit a line for engine sizes less than 2500cc.
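Fitting a line to a subset can be sketched with a boolean mask. The values below are synthetic (the New Zealand sample is not reproduced here): weight is assumed to rise linearly with engine size below 2500cc and to be unrelated to it above. The names `engine_cc`, `weight_kg` and `predict_weight` are hypothetical.

```python
import numpy as np

# Synthetic data, for illustration only: 60 cars.
rng = np.random.default_rng(2)
engine_cc = rng.uniform(1000, 4500, 60)
trend = np.where(engine_cc < 2500, 500 + 0.4 * engine_cc, 1500.0)
weight_kg = trend + rng.normal(0.0, 50.0, 60)

# Keep only the cars where the relationship looks linear.
small = engine_cc < 2500
slope, intercept = np.polyfit(engine_cc[small], weight_kg[small], 1)
r_small = np.corrcoef(engine_cc[small], weight_kg[small])[0, 1]

def predict_weight(cc):
    """Predict weight from engine size, only inside the fitted range."""
    if cc >= 2500:
        raise ValueError("the line was fitted only for engine sizes under 2500cc")
    return intercept + slope * cc

print("cars used in the fit:", int(small.sum()))
print("slope:", slope, " r for the subset:", r_small)
print("predicted weight at 2000cc:", predict_weight(2000))
```

Refusing to predict outside the fitted range mirrors the advice above: the line says nothing about engines of 2500cc and larger.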

Page 55

