
correlation and percentages

## PowerPoint Slideshow about ' correlation and percentages' - francis-mcneil



correlation and percentages

- association between variables can be explored using counts
- are high counts of bone needles associated with high counts of end scrapers?

- similar questions can be asked using percent-standardized data
- are high proportions of decorated pottery associated with high proportions of copper bells?

but…

- these are different questions with different implications for formal regression
- percents will show some correlation even if underlying counts do not…
- ‘spurious’ correlation (negative)
- “closed-sum” effect
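
The closed-sum effect can be seen in a small simulation. This is only a sketch with made-up data: the three artifact categories, the Poisson means, and the number of assemblages are assumptions chosen for illustration, not values from any real collection.

```r
# Simulated (hypothetical) counts for three artifact categories,
# generated independently of one another
set.seed(1)
n <- 200
counts <- cbind(needles  = rpois(n, 30),
                scrapers = rpois(n, 30),
                bells    = rpois(n, 30))

# Percent-standardize: each row (assemblage) now sums to 100
pct <- 100 * counts / rowSums(counts)

r_counts <- cor(counts[, "needles"], counts[, "scrapers"])
r_pct    <- cor(pct[, "needles"],    pct[, "scrapers"])

# Counts are independent, so r_counts should sit near 0;
# percentages must trade off against each other, so r_pct is pushed negative
round(c(counts = r_counts, percents = r_pct), 2)
```

With only a few categories the induced negative correlation is strong; it weakens as the number of categories grows.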

- including outliers in regression analyses is usually a bad idea…
- Tukey-line / least squares discrepancies are good red-flag signals
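
One way to look for such a discrepancy is to fit both lines and compare slopes. The sketch below uses simulated data with a single planted outlier (both are assumptions); `line()` is the resistant-line function in base R's stats package.

```r
set.seed(2)
x <- 1:30
y <- 5 + 2 * x + rnorm(30, sd = 2)   # hypothetical data, true slope = 2
y[30] <- y[30] - 150                 # one gross outlier at the high-x end

ls_fit    <- lm(y ~ x)               # least squares: pulled toward the outlier
tukey_fit <- line(x, y)              # Tukey's resistant line

slopes <- c(least_squares = unname(coef(ls_fit)[2]),
            tukey         = unname(coef(tukey_fit)[2]))
round(slopes, 2)
```

A large gap between the two slopes is the red flag; `abline(ls_fit, lty = 2)` and `abline(coef = coef(tukey_fit))` can be added to a scatterplot to see it.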

“convex hull trimming”

> hull1 <- chull(x, y)

> plot(x, y)

> polygon(x[hull1], y[hull1])

> abline(lm(y[-hull1] ~ x[-hull1]))
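
A self-contained version of the same idea, comparing the fit with and without the hull points. The data and the stray point are simulated assumptions; only `chull()` and `lm()` come from the commands above.

```r
set.seed(3)
x <- runif(40, 0, 10)
y <- 1 + 0.8 * x + rnorm(40, sd = 1)   # hypothetical data, true slope = 0.8
x <- c(x, 9.5)
y <- c(y, -15)                         # add one stray point (case 41)

hull1 <- chull(x, y)                   # indices of the points on the convex hull

fit_all     <- lm(y ~ x)
fit_trimmed <- lm(y[-hull1] ~ x[-hull1])

slope_all     <- unname(coef(fit_all)[2])
slope_trimmed <- unname(coef(fit_trimmed)[2])
round(c(all = slope_all, trimmed = slope_trimmed), 2)
```

Trimming removes every hull point, not only the stray one, so it also discards some ordinary cases at the edge of the cloud.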

transformation

- at least two major motivations in regression analysis:
- create/improve a linear relationship
- correct skewed distribution(s)

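> LG_DENS <- log(DENSITY)

> old.par <- par(no.readonly = TRUE)   # save current graphics settings

> plot(DIST, DENSITY, log = "y")   # same effect: log-scaled y axis

> par(old.par)   # restore the saved settings

> VAR1T <- sqrt(VAR1)

> plot(VAR1T, VAR2)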
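
To see how a log transformation can improve linearity, the sketch below simulates a density that falls off exponentially with distance. The `DIST` and `DENSITY` values here are made up for illustration and are not the data behind the plots.

```r
set.seed(4)
DIST    <- seq(1, 50, length.out = 40)                        # hypothetical distances
DENSITY <- 200 * exp(-0.08 * DIST) * exp(rnorm(40, sd = 0.2)) # made-up exponential decay

LG_DENS <- log(DENSITY)

# Compare how well a straight line describes each version
r2_raw <- summary(lm(DENSITY ~ DIST))$r.squared
r2_log <- summary(lm(LG_DENS ~ DIST))$r.squared
round(c(raw = r2_raw, logged = r2_log), 2)
```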

transformation summary

- correcting left skew:
  - x⁴: stronger
  - x³: strong
  - x²: mild

- correcting right skew:
  - √x: weak
  - log(x): mild
  - -1/x: strong
  - -1/x²: stronger
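
The ordering for right skew can be checked numerically. This is a sketch on a simulated right-skewed variable; the `skew()` helper is defined here for illustration (a plain moment-based skewness) and is not a base R function.

```r
# Moment-based sample skewness: > 0 means right skew, < 0 means left skew
skew <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

set.seed(5)
x <- rlnorm(500, meanlog = 2, sdlog = 0.8)   # hypothetical right-skewed variable

ladder <- c(raw       = skew(x),
            sqrt      = skew(sqrt(x)),
            log       = skew(log(x)),
            neg_recip = skew(-1 / x))
round(ladder, 2)
```

Each step down the ladder lowers the skewness further; a transformation that is too strong for the data overshoots into left skew.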

regression/correlation

- the strength of a relationship can be assessed by seeing how knowledge of one variable improves the ability to predict the other

- if you ignore x, the best predictor of y will be the mean of all y values (y-bar)
- if the y measurements are widely scattered, prediction errors will be greater than if they are close together

- we can assess the dispersion of y values around their mean by:

r² = (variation in y explained by the regression) / (total variation in y)

- “coefficient of determination” (r²)
- describes the proportion of variation that is “explained” or accounted for by the regression line…
- r² = .5:
  - half of the variation is explained by the regression…
  - half of the variation in y is explained by variation in x…
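
The "proportion explained" reading can be made concrete by comparing prediction error around y-bar with prediction error around the regression line. The sketch below uses simulated data and the usual sums-of-squares definition, which is assumed here rather than taken from the slides.

```r
set.seed(6)
x <- runif(50, 0, 10)                   # hypothetical data
y <- 3 + 1.5 * x + rnorm(50, sd = 4)

fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)        # squared error when x is ignored (y-bar only)
ss_resid <- sum((y - fitted(fit))^2)    # squared error left after using x

r2 <- 1 - ss_resid / ss_total           # share of the total that the line removes

c(by_hand = r2,
  from_lm = summary(fit)$r.squared,
  r_squared = cor(x, y)^2)
```

For a simple regression with one predictor, all three values agree.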


residuals

- vertical deviations of points around the regression
- for case i, residual = yᵢ - ŷᵢ = yᵢ - (a + bxᵢ)

- residuals in y should not show patterned variation either with x or y-hat
- should be normally distributed around the regression line
- residual error should not be autocorrelated (errors/residuals in y are independent…)
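
These checks are usually made by eye (for example `plot(fitted(fit), resid(fit))` and `qqnorm(resid(fit))`), but rough numerical counterparts exist. The sketch below uses simulated data; the three summaries are one possible choice, not a standard recipe.

```r
set.seed(7)
x <- 1:60
y <- 10 + 0.5 * x + rnorm(60, sd = 3)   # hypothetical data

fit  <- lm(y ~ x)
res  <- resid(fit)
yhat <- fitted(fit)

# Patterning with y-hat: a curved trend in the residuals would give a
# noticeable r-squared when residuals are regressed on a quadratic in y-hat
curve_r2 <- summary(lm(res ~ poly(yhat, 2)))$r.squared

# Normality: Shapiro-Wilk test (small p-values suggest non-normal residuals)
normal_p <- shapiro.test(res)$p.value

# Independence: lag-1 correlation of residuals taken in x order
ord  <- order(x)
lag1 <- cor(res[ord][-1], res[ord][-length(res)])

round(c(curve_r2 = curve_r2, normal_p = normal_p, lag1 = lag1), 3)
```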

- residuals may show patterning with respect to other variables…
- explore this with a residual scatterplot
- residuals (y - ŷ) vs. other variables…

- are there suggestions of linear or other kinds of relationships?
- if r² < 1, some of the remaining variation may be explainable with reference to other variables
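
A minimal sketch of that step, assuming a simulated response that depends on a second variable `z` left out of the first regression (all names and values here are hypothetical):

```r
set.seed(8)
n <- 80
x <- rnorm(n)
z <- rnorm(n)                                # the "other variable"
y <- 2 + 1.0 * x + 1.5 * z + rnorm(n, sd = 0.5)

fit1 <- lm(y ~ x)
res1 <- resid(fit1)

r_res_z <- cor(res1, z)                      # residuals still carry z's signal

r2_x  <- summary(fit1)$r.squared
r2_xz <- summary(lm(y ~ x + z))$r.squared    # variation explained once z is added
round(c(r_res_z = r_res_z, r2_x = r2_x, r2_xz = r2_xz), 2)
```

A scatterplot of `res1` against `z` would show the same relationship visually.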

- paying close attention to outliers in a residual plot may lead to important insights
- e.g.: outlying residuals from quantities of exotic flint ~ distance from quarries
- sites with special access through transport routes, political alliances…

- residuals from regressions are often the main payoff

Middle Formative, Basin of Mexico

Formative Basin of Mexico

- settlement survey
- 3 variables recorded from sites:
- site size (proxy for population)
- amount of arable land in standard “catchment”
- productivity index for soils

- SIZE (ha)
- AGLAND (km²)
- PROD (index)

How are these variables related?

Do any make sense as dependent or independent variables?

SIZE ~ AGLAND

residuals??

residual SIZE = SIZE – SIZE-hat

> resSize <- frmdat$size - (35.4 + .66 * frmdat$agland)
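
The same residuals can be taken straight from a fitted model, which avoids retyping rounded coefficients. In the sketch below `frmdat` is a simulated stand-in with the same column names, not the survey data, and `lmSize` and `resManual` are names introduced only for this example.

```r
set.seed(9)
# Stand-in for frmdat: simulated values, NOT the survey data
frmdat <- data.frame(agland = runif(30, 5, 80))
frmdat$size <- 35.4 + 0.66 * frmdat$agland + rnorm(30, sd = 12)

lmSize  <- lm(size ~ agland, data = frmdat)
resSize <- resid(lmSize)                 # size minus size-hat, one value per site

# The same calculation written out with the fitted intercept and slope
b <- coef(lmSize)
resManual <- frmdat$size - (b[1] + b[2] * frmdat$agland)
all.equal(unname(resSize), unname(resManual))

# Sites with the most negative and most positive residuals
frmdat[order(resSize), ][c(1, nrow(frmdat)), ]
```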

r² = .55

[figure: variables X0 and X1]

(total variance in x0 explained by x1, that is not explained by x2…)

partial correlation coefficient:

proportion of variance in x0 explained by x1, that is not explained by x2…

multiple coefficient of determination:

variance in x0 explained by x1 and x2, both separately, and together…
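
Both quantities can be computed with base R. The sketch below uses simulated variables named `x0`, `x1`, and `x2` to match the notation above; the data and coefficients are assumptions. The partial correlation is obtained in two equivalent ways, and its square is tied back to the multiple coefficient of determination.

```r
set.seed(10)
n  <- 100
x2 <- rnorm(n)
x1 <- 0.6 * x2 + rnorm(n)
x0 <- 1 + 0.8 * x1 + 0.5 * x2 + rnorm(n)   # hypothetical data

# Partial correlation of x0 and x1 controlling for x2:
# correlate what is left of each after regressing it on x2
partial_resid <- cor(resid(lm(x0 ~ x2)), resid(lm(x1 ~ x2)))

# The same value from the three pairwise correlations
r01 <- cor(x0, x1); r02 <- cor(x0, x2); r12 <- cor(x1, x2)
partial_formula <- (r01 - r02 * r12) / sqrt((1 - r02^2) * (1 - r12^2))

# Multiple coefficient of determination: x0 on x1 and x2 together
R2 <- summary(lm(x0 ~ x1 + x2))$r.squared

# Unexplained share after both = unexplained after x2, reduced by the
# squared partial correlation of x1
c(lhs = 1 - R2, rhs = (1 - r02^2) * (1 - partial_resid^2))
```

The squared partial correlation is the share of the variance in x0 left over by x2 that x1 goes on to explain.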

y = -1.8 + .42x1 + 50x2

SIZE = -1.8 + .42*AGLAND + 50*PROD

size = -1.8 + .42*agland + 50*prod

- various scales are involved:
  - size: hectares
  - agland: km²
  - prod: productivity index

- increasing available agricultural land by 1 km² increases site-size by about .4 hectares
- a 1-unit increase of soil productivity increases site-size by about 50 hectares
- which of these two factors is more important??

- calculate “beta” coefficients to eliminate the effect of differing scales…
- convert the variables to Z-scores
- mean of 0
- standard deviation of 1

- repeat multiple correlation analysis…

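# within() keeps the new columns in frmdat; with() would discard them

frmdat <- within(frmdat, {

Bsize <- (size - mean(size)) / sd(size)

Bagland <- (agland - mean(agland)) / sd(agland)

Bprod <- (prod - mean(prod)) / sd(prod) })

lmBeta <- lm(Bsize ~ Bagland + Bprod, data = frmdat)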
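
An equivalent route is to standardize every column with `scale()`, or to rescale the raw slopes directly. In this sketch `frmdat` is again a simulated stand-in (not the survey data), and `lmRaw`, `zdat`, and `betaByHand` are names introduced only for the example.

```r
set.seed(11)
# Stand-in for frmdat: simulated values, NOT the survey data
frmdat <- data.frame(agland = runif(40, 5, 80), prod = runif(40, 0.2, 1))
frmdat$size <- -1.8 + 0.42 * frmdat$agland + 50 * frmdat$prod + rnorm(40, sd = 8)

lmRaw  <- lm(size ~ agland + prod, data = frmdat)

zdat   <- as.data.frame(scale(frmdat))         # every column: mean 0, sd 1
lmBeta <- lm(size ~ agland + prod, data = zdat)

# Beta = raw slope * sd(predictor) / sd(response)
b <- coef(lmRaw)[c("agland", "prod")]
betaByHand <- b * sapply(frmdat[c("agland", "prod")], sd) / sd(frmdat$size)

rbind(from_z_scores = coef(lmBeta)[c("agland", "prod")],
      by_hand       = betaByHand)
```

Because both predictors are now in standard-deviation units, the two beta coefficients can be compared directly.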
