## PowerPoint Slideshow about ' correlation and percentages' - francis-mcneil


correlation and percentages

- association between variables can be explored using counts
- are high counts of bone needles associated with high counts of end scrapers?
- similar questions can be asked using percent-standardized data
- are high proportions of decorated pottery associated with high proportions of copper bells?

but…

- these are different questions with different implications for formal regression
- percents will show some correlation even if underlying counts do not…
- ‘spurious’ correlation (negative)
- “closed-sum” effect
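
The closed-sum effect can be seen in a small simulation. This is a sketch on made-up data: three artifact classes with independent Poisson counts at 200 hypothetical sites (the class names and the mean of 30 are assumptions for illustration only). The counts are uncorrelated by construction, yet the percentages come out negatively correlated because each row must sum to 100.

```r
# Independent counts for three artifact classes (simulated, not real data).
set.seed(1)
n <- 200
counts <- cbind(
  needles  = rpois(n, lambda = 30),
  scrapers = rpois(n, lambda = 30),
  other    = rpois(n, lambda = 30)
)

# Percent-standardize: every row now sums to 100.
pct <- 100 * counts / rowSums(counts)

r_counts <- cor(counts[, "needles"], counts[, "scrapers"])
r_pct    <- cor(pct[, "needles"], pct[, "scrapers"])

cat("correlation of counts:     ", round(r_counts, 2), "\n")
cat("correlation of percentages:", round(r_pct, 2), "\n")
```

With fewer categories the induced negative correlation gets stronger; with only two categories the percentages are perfectly negatively correlated.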

including outliers in regression analyses is usually a bad idea…

- Tukey-line / least squares discrepancies are good red-flag signals
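
A minimal sketch of the red-flag check, assuming simulated data with one deliberately planted outlier. `line()` in base R's stats package fits Tukey's resistant line; `lm()` fits least squares. A large gap between the two slopes suggests that a few points are steering the least-squares fit.

```r
set.seed(2)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)   # true slope is 0.5
y[20] <- 40                              # one gross outlier (planted for illustration)

ls_fit    <- lm(y ~ x)     # least squares
tukey_fit <- line(x, y)    # Tukey resistant line

ls_slope    <- unname(coef(ls_fit)[2])
tukey_slope <- unname(coef(tukey_fit)[2])

cat("least-squares slope:", round(ls_slope, 2), "\n")
cat("Tukey-line slope:   ", round(tukey_slope, 2), "\n")

# Draw both lines when running interactively.
if (interactive()) {
  plot(x, y)
  abline(ls_fit, lty = 2)
  abline(coef = coef(tukey_fit))
}
```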

“convex hull trimming”

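> hull1 <- chull(x, y)   # indices of the points on the convex hull

> plot(x, y)

> polygon(x[hull1], y[hull1])   # outline the hull

> abline(lm(y[-hull1] ~ x[-hull1]))   # regression refit without the hull points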

transformation

- at least two major motivations in regression analysis:
- create/improve a linear relationship
- correct skewed distribution(s)

transformation summary

- correcting left skew:
  - x⁴: stronger
  - x³: strong
  - x²: mild
- correcting right skew:
  - √x: weak
  - log(x): mild
  - -1/x: strong
  - -1/x²: stronger
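
The right-skew half of the ladder can be tried on simulated data. This is a sketch only: the lognormal sample and the helper `skew()` are assumptions for illustration (base R has no built-in skewness function). Stronger transformations pull the skewness further down, and one that is too strong overshoots into left skew.

```r
# Sample skewness: third central moment divided by sd cubed.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(3)
x <- rlnorm(500, meanlog = 2, sdlog = 0.8)   # simulated right-skewed values

# Skewness after each step of the ladder, weakest to strongest.
s <- sapply(
  list(raw = x, sqrt = sqrt(x), log = log(x), neg_recip = -1 / x),
  skew
)
print(round(s, 2))
```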

regression/correlation

- the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other

if you ignore x, the best predictor of y will be the mean of all y values (y-bar)

- if the y measurements are widely scattered, prediction errors will be greater than if they are close together
- we can assess the dispersion of y values around their mean by:

- “coefficient of determination” (r²)
- describes the proportion of variation that is “explained” or accounted for by the regression line…
- r² = 0.5

half of the variation is explained by the regression…

half of the variation in y is explained by variation in x…
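
As a sketch of what r² measures, it can be computed by hand on simulated data (the numbers here are made up): compare the squared prediction errors around y-bar with the squared errors around the regression line.

```r
set.seed(4)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 4)

fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)       # errors when every y is predicted by y-bar
ss_resid <- sum((y - fitted(fit))^2)   # errors when y is predicted from the line

r2_by_hand <- 1 - ss_resid / ss_total
r2_lm      <- summary(fit)$r.squared   # the value reported by lm
r2_cor     <- cor(x, y)^2              # with one predictor, also the squared correlation

print(c(by_hand = r2_by_hand, lm = r2_lm, cor_squared = r2_cor))
```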

residuals

- vertical deviations of points around the regression
- for case i, residual = y_i - ŷ_i = y_i - (a + b*x_i)
- residuals in y should not show patterned variation either with x or y-hat
- should be normally distributed around the regression line
- residual error should not be autocorrelated (errors/residuals in y are independent…)
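
These checks can be sketched on simulated data (all values below are made up). The residuals are computed from the definition and compared with `resid()`. The lag-1 autocorrelation is only meaningful when the cases have a natural order, which is assumed here to be the order of x.

```r
set.seed(5)
x <- 1:30
y <- 5 + 1.5 * x + rnorm(30, sd = 3)

fit <- lm(y ~ x)
a <- coef(fit)[1]   # intercept
b <- coef(fit)[2]   # slope

# Residuals from the definition: y_i - (a + b * x_i).
res_by_hand <- unname(y - (a + b * x))
res_lm      <- unname(resid(fit))

# Lag-1 autocorrelation of the residuals (cases taken in x order).
lag1 <- acf(res_lm, plot = FALSE)$acf[2]
cat("lag-1 autocorrelation of residuals:", round(lag1, 2), "\n")

# Visual checks when running interactively.
if (interactive()) {
  plot(fitted(fit), res_lm)            # look for curvature or funnel shapes
  abline(h = 0, lty = 2)
  qqnorm(res_lm); qqline(res_lm)       # rough check of normality
}
```

The linear correlation between residuals and fitted values is zero by construction in least squares, so patterning has to be judged from the plot rather than from that coefficient.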

residuals may show patterning with respect to other variables…

- explore this with a residual scatterplot
- ŷ vs. other variables…
- are there suggestions of linear or other kinds of relationships?
- if r² < 1, some of the remaining variation may be explainable with reference to other variables
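
A sketch of the idea on simulated data, where a second variable `x2` (hypothetical) is deliberately left out of the first regression. Its influence shows up in the residuals, and r² rises once it is included.

```r
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)                                  # omitted from the first model
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n, sd = 0.5)

fit1 <- lm(y ~ x1)
res1 <- resid(fit1)

r_res_x2 <- cor(res1, x2)                       # residuals track the omitted variable
r2_one   <- summary(fit1)$r.squared
r2_two   <- summary(lm(y ~ x1 + x2))$r.squared

cat("cor(residuals, x2):", round(r_res_x2, 2), "\n")
cat("r-squared, x1 only:", round(r2_one, 2), "\n")
cat("r-squared, x1 + x2:", round(r2_two, 2), "\n")

if (interactive()) plot(x2, res1)               # the residual scatterplot itself
```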

paying close attention to outliers in a residual plot may lead to important insights

- e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries
- sites with special access through transport routes, political alliances…
- residuals from regressions are often the main payoff

Basin of Mexico

Formative Basin of Mexico

- settlement survey
- 3 variables recorded from sites:
- site size (proxy for population)
- amount of arable land in standard “catchment”
- productivity index for soils

- AGLAND (km²)
- PROD (index)

How are these variables related?

Do any make sense as dependent or independent variables?

residual SIZE = SIZE – SIZE-hat

> resSize <- frmdat$size - (35.4 + 0.66 * frmdat$agland)


(total variance in x0 explained by x1, that is not explained by x2…)

partial correlation coefficient:

proportion of variance in x0 explained by x1, that is not explained by x2…
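
One way to compute a partial correlation, sketched on simulated variables named x0, x1, x2 to match the notation here: remove x2 from both x0 and x1 by regression, then correlate the two sets of residuals. The standard formula from the three pairwise correlations gives the same number, and squaring it gives a proportion of variance.

```r
set.seed(7)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)
x0 <- 0.5 * x1 + 0.7 * x2 + rnorm(n)

# Route 1: correlate what is left of x0 and x1 after removing x2.
partial_resid <- cor(resid(lm(x0 ~ x2)), resid(lm(x1 ~ x2)))

# Route 2: formula from the pairwise correlations.
r01 <- cor(x0, x1)
r02 <- cor(x0, x2)
r12 <- cor(x1, x2)
partial_formula <- (r01 - r02 * r12) / sqrt((1 - r02^2) * (1 - r12^2))

print(c(residual_route = partial_resid, formula_route = partial_formula))
cat("squared partial correlation:", round(partial_resid^2, 3), "\n")
```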

multiple coefficient of determination:

variance in x0 explained by x1 and x2, both separately, and together…

SIZE = -1.8 + .42*AGLAND + 50*PROD

size = -1.8 + .42*agland + 50*prod

- various scales are involved:
  - size: hectares
  - agland: km²
  - prod: productivity index

- increasing available agricultural land by 1 km² increases site-size by about 0.4 hectares
- a 1-unit increase of soil productivity increases site-size by about 50 hectares
- which of these two factors is more important??

calculate “beta” coefficients to eliminate the effect of differing scales…

- convert the variables to Z-scores
- mean of 0
- standard deviation of 1
- repeat multiple correlation analysis…

Bsize <- (size - mean(size)) / sd(size)   # Z-scores: mean 0, sd 1

Bagland <- (agland - mean(agland)) / sd(agland)

Bprod <- (prod - mean(prod)) / sd(prod)

lmBeta <- lm(Bsize ~ Bagland + Bprod)   # coefficients are now "beta" weights
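
A self-contained sketch of the same steps. The data frame `frm` is a simulated stand-in, not the survey data: its columns are generated around the regression equation given earlier, with arbitrary ranges and noise, purely so the code runs. It also shows that each beta coefficient equals the raw coefficient multiplied by sd(predictor) / sd(response).

```r
set.seed(8)
n <- 60
# Hypothetical stand-in data (NOT the Basin of Mexico survey values).
frm <- data.frame(agland = runif(n, 5, 80), prod = runif(n, 0.2, 1))
frm$size <- -1.8 + 0.42 * frm$agland + 50 * frm$prod + rnorm(n, sd = 8)

raw_fit <- lm(size ~ agland + prod, data = frm)

# Convert every column to Z-scores and refit.
z <- as.data.frame(scale(frm))
beta_fit <- lm(size ~ agland + prod, data = z)
beta_lm  <- coef(beta_fit)[c("agland", "prod")]

# Same values from the raw coefficients and the standard deviations.
b <- coef(raw_fit)[c("agland", "prod")]
beta_by_hand <- b * sapply(frm[c("agland", "prod")], sd) / sd(frm$size)

print(round(rbind(raw = b, beta = beta_lm), 3))
```

The beta coefficients are unit-free, so their sizes can be compared directly, which the raw coefficients (hectares per km² versus hectares per index unit) do not allow.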
