
Transformations

Transformations: getting normal or using the linear model. There are two reasons to transform: variables do not fit a normal distribution and parametric tests are desired, or a relationship between two variables is non-linear but a transformation would allow the use of linear regression.

molimo

Presentation Transcript


  1. Transformations: getting normal or using the linear model

  2. Two Reasons to Transform • Variables do not fit a normal distribution and parametric tests are desired • A relationship between two variables is non-linear but transformation would allow the use of linear regression

  3. Non-Normal Data • Reasons real data can fail to follow a normal distribution: • Errors in measurement are multiplicative rather than additive, e.g. ± 2% rather than ± 2mm • Constraints on the dimensions of an artifact feature are not symmetrical, e.g. point length must exceed haft length but can be as long as the material allows

  4. Non-Normal Data 2 • Measurements are products rather than sums of other measurements, e.g. area, volume • Counts follow binomial, Poisson, or negative binomial distributions, which are often asymmetrical unless sample sizes are large
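
A minimal sketch of the point about products and multiplicative errors, using simulated values rather than any real data set. The variable names (`len`, `wid`, `area`) and the `skew()` helper are assumptions made for illustration only.

```r
set.seed(42)
n <- 5000

# Hypothetical artifact dimensions (arbitrary units), each with a
# multiplicative error (a percentage of the value, not a fixed amount)
len <- 50 * exp(rnorm(n, sd = 0.2))
wid <- 20 * exp(rnorm(n, sd = 0.2))
area <- len * wid   # a product of other measurements

# Simple moment-based skewness (helper defined here, not from a package)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(area)        # clearly positive: long tail to the right
skew_log <- skew(log(area))   # close to zero: roughly symmetrical
c(raw = skew_raw, logged = skew_log)
```

Taking logs turns the product into a sum of roughly normal terms, which is why the logged values look symmetrical here.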

  5. Solutions • Use non-parametric methods that do not depend on the normality of the data (increasingly easy to do) • Use data transformations that shift the distribution to one that is normal
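
The two solutions can be compared on simulated data. This is only a sketch: the two groups (`g1`, `g2`) are hypothetical, and `wilcox.test()` stands in for "non-parametric methods" in general.

```r
set.seed(21)

# Simulated right-skewed measurements for two hypothetical groups
g1 <- rlnorm(30, meanlog = 3.0, sdlog = 0.5)
g2 <- rlnorm(30, meanlog = 3.4, sdlog = 0.5)

# Rank-based test: uses only the order of the values,
# so logging the data does not change the result
p_rank_raw <- wilcox.test(g1, g2)$p.value
p_rank_log <- wilcox.test(log(g1), log(g2))$p.value

# Parametric test: the result depends on the scale of measurement
p_t_raw <- t.test(g1, g2)$p.value
p_t_log <- t.test(log(g1), log(g2))$p.value

c(rank_raw = p_rank_raw, rank_log = p_rank_log,
  t_raw = p_t_raw, t_log = p_t_log)
```

The rank-based p-values agree on both scales, while the t-test p-values differ, which is the practical reason to transform before a parametric test.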

  6. Transformation • The goal is to change the spacing of the data to compress a long tail and draw out a flat tail • The transformation must preserve the order of the original data – we only change the spacing between data points
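
A small illustration with made-up numbers: the log transformation keeps the order of the values but changes the spacing between them.

```r
# Made-up values with a long right tail
x <- c(2, 3, 5, 8, 13, 40, 120)
lx <- log(x)

# Order is preserved: the ranks are the same before and after
same_order <- identical(rank(x), rank(lx))

# Spacing changes: gaps in the long tail are compressed
# relative to gaps among the small values
gap_ratio_raw <- max(diff(x)) / min(diff(x))
gap_ratio_log <- max(diff(lx)) / min(diff(lx))

same_order
c(raw = gap_ratio_raw, logged = gap_ratio_log)
```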

  7. Transformation • Right-skewed data with many zeros cannot be transformed effectively, since nothing can stretch out observations that have the same value – e.g. artifact counts by site or grid square are often Poisson distributed with many zeros
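
A sketch of the problem with simulated counts (the Poisson mean of 0.5 is an arbitrary assumption): after a square root transformation the zeros are still piled up on a single value.

```r
set.seed(1)

# Simulated artifact counts for 200 hypothetical grid squares
counts <- rpois(200, lambda = 0.5)
prop_zero <- mean(counts == 0)

# Square root is allowed with zeros, but tied values stay tied
rooted <- sqrt(counts)
prop_min_rooted <- mean(rooted == min(rooted))

c(zeros_before = prop_zero, at_minimum_after = prop_min_rooted)
table(counts)
```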

  8. An Example • Using the DartPoints data set, we saw that Length was asymmetrical • Plot the kernel density of Length with and without a log scale to see the difference • To transform Length we would use • logLength <- log(DartPoints$Length)

  9. plot(density(DartPoints$Length), main="Dart Point Length", xlab="Normal scale")
     plot(density(DartPoints$Length), main="Dart Point Length", xlab="Log scale", log="x")

  10. Common Transformations • Tail to the right: natural or common (base 10) logarithm – no zero values; square root, cube root, etc. – zeros ok; inverse, -1/x, -1/x^2, etc. – no zero values • Tail to the left: exponential, e^x or 10^x (low values); square, cube, etc.
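
A sketch of how the right-tail transformations differ in strength, using simulated lognormal data (an assumption, chosen so that the log is the "right" answer). The `skew()` helper is defined here and is not from a package.

```r
set.seed(7)

# Simulated right-skewed, strictly positive data
x <- rlnorm(2000, meanlog = 3, sdlog = 0.6)

# Simple moment-based skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Increasingly strong transformations for a tail to the right
s <- sapply(list(raw = x,
                 sqrt = sqrt(x),
                 log = log(x),
                 neg_inverse = -1 / x),
            skew)
round(s, 2)
```

For these data the square root does not go far enough, the log is about right, and -1/x overshoots and leaves a tail to the left. With real data the best choice has to be checked in the same way.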

  11. Other Transformations • arctangent (inverse tangent) to handle values between 0 and 1 used for population studies of non-metric traits

  12. Transforming to Linear • By transforming variables before using linear regression we can fit nonlinear equations • In some cases we can express the fitted equation in terms of the original untransformed variables

  13. Polynomial • y = a + b1 * x + b2 * x^2 + b3 * x^3 + b4 * x^4 + ... • Create polynomial values or use the function poly() within lm() • Begin with linear and then work up to quadratic, cubic, and so on until the new terms are not significant • e.g. lm(y~x+I(x^2)+I(x^3))
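
The "work up until the new terms are not significant" procedure might look like the following sketch. The data are simulated from a quadratic curve (an assumption for illustration), and `poly(x, 2, raw = TRUE)` is shown as an equivalent way to write the quadratic model.

```r
set.seed(3)

# Simulated data: the true curve is quadratic
x <- seq(0, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(length(x), sd = 0.5)

# Work up from linear to quadratic to cubic
fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + I(x^2))
fit3 <- lm(y ~ x + I(x^2) + I(x^3))

# Sequential F tests: does each added term improve the fit?
anova(fit1, fit2, fit3)

# Same quadratic model written with poly(); raw = TRUE gives
# ordinary powers of x, so the coefficients match fit2
fit2_poly <- lm(y ~ poly(x, 2, raw = TRUE))
same_coefs <- all.equal(unname(coef(fit2)), unname(coef(fit2_poly)))
p_quadratic <- anova(fit1, fit2)[["Pr(>F)"]][2]
```

Without `raw = TRUE`, poly() uses orthogonal polynomials: the fitted values are the same but the coefficients are not directly comparable with the I(x^2) form.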

  14. Power Function • Log-log transformation • Use log() to transform dependent and independent variables • Compute linear regression • log(y) = a + b * log(x) • y = A * x^b (where A = exp(a)) • If b = 1, same as the linear model • x, y > 0
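
A sketch of the log-log procedure on simulated data. The "true" values A = 3 and b = 0.7 and the multiplicative noise are assumptions chosen for the example.

```r
set.seed(11)

# Simulated power relationship y = 3 * x^0.7 with multiplicative noise
x <- runif(100, 1, 500)
y <- 3 * x^0.7 * exp(rnorm(100, sd = 0.1))

# Linear regression on the logged variables
fit <- lm(log(y) ~ log(x))

# Back to the original scale: y = A * x^b with A = exp(a)
A <- exp(coef(fit)[[1]])
b <- coef(fit)[[2]]
c(A = A, b = b)

# Overlay the fitted curve on the untransformed data
plot(x, y)
curve(A * x^b, from = 1, to = 500, add = TRUE, col = "red")
```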

  15. Exponential function • Semi-log transformation • Use log() to transform dependent variable, y > 0 • Compute linear regression • log(y) = a + b * x • y = A * exp(b * x) (where A = exp(a)) • Fits data with asymptotes
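
A sketch of the semi-log procedure on simulated exponential decay. The values A = 10 and b = -0.15 are assumptions for illustration; with a negative b the curve approaches an asymptote at zero.

```r
set.seed(5)

# Simulated exponential decay y = 10 * exp(-0.15 * x) with multiplicative noise
x <- seq(0, 20, by = 0.5)
y <- 10 * exp(-0.15 * x) * exp(rnorm(length(x), sd = 0.1))

# Only the dependent variable is logged
fit <- lm(log(y) ~ x)

# Back to the original scale: y = A * exp(b * x) with A = exp(a)
A <- exp(coef(fit)[[1]])
b <- coef(fit)[[2]]
c(A = A, b = b)

plot(x, y)
curve(A * exp(b * x), from = 0, to = 20, add = TRUE, col = "red")
```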

  16. Inverse Function • Reciprocal transformation – 1/x where x ≠ 0 • Used for distance models – marriage, trade, social interaction declines with distance • Fits data with asymptotes
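
A sketch of a reciprocal fit for a distance-decay pattern. The variables `dist_km` and `rate` and the values used to simulate them are hypothetical; the fitted curve levels off toward the intercept as distance grows.

```r
set.seed(9)

# Hypothetical interaction rate that declines with distance
dist_km <- seq(1, 50, by = 1)
rate <- 2 + 30 / dist_km + rnorm(length(dist_km), sd = 0.5)

# Linear regression on the reciprocal of distance: rate = a + b * (1/distance)
fit <- lm(rate ~ I(1 / dist_km))
a <- coef(fit)[[1]]   # asymptote approached at large distances
b <- coef(fit)[[2]]
c(a = a, b = b)

plot(dist_km, rate)
curve(a + b / x, from = 1, to = 50, add = TRUE, col = "red")
```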

  17. Other Functions • Logarithmic – no zeros in x • y = a + b * log(x) • Square Root – no negative values in x • y = a + b * sqrt(x)
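
A sketch comparing the logarithmic and square root forms on the same simulated data. The data are generated from a logarithmic curve (an assumption), so the logarithmic model is expected to fit better here.

```r
set.seed(13)

# Simulated data from y = 5 + 4 * log(x); x has no zeros or negatives
x <- 1:100
y <- 5 + 4 * log(x) + rnorm(length(x), sd = 0.5)

# Only the independent variable is transformed in both models
fit_log  <- lm(y ~ log(x))
fit_sqrt <- lm(y ~ sqrt(x))

r2_log  <- summary(fit_log)$r.squared
r2_sqrt <- summary(fit_sqrt)$r.squared
c(logarithmic = r2_log, square_root = r2_sqrt)

plot(x, y)
lines(x, fitted(fit_log), col = "blue")
lines(x, fitted(fit_sqrt), col = "red")
```

Because y is untransformed in both models, the R-squared values can be compared directly.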

  18. Examples • Human cranial capacity over the last 1.8 million years • Number of Identified Specimens (NISP) and Minimum Number of Individuals (MNI) at Chucalissa (Middle Mississippian site)

  19. # BrainsCC.RData
      # Explore logs with scatterplot
      RegModel.1 <- lm(BrainCC~AgeKa, data=BrainsCC) # Rcmdr
      summary(RegModel.1) # Rcmdr
      BrainsCC$logAge <- with(BrainsCC, log(AgeKa)) # Rcmdr
      BrainsCC$logBrain <- with(BrainsCC, log(BrainCC)) # Rcmdr
      RegModel.2 <- lm(logBrain~logAge, data=BrainsCC) # Rcmdr
      summary(RegModel.2) # Rcmdr
      RegModel.3 <- lm(BrainCC~logAge, data=BrainsCC) # Rcmdr
      summary(RegModel.3) # Rcmdr
      plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
      abline(RegModel.1, lty=1, lwd=2, col="black")
      x <- seq(0, 1800, 10)
      logx <- log(x)
      lines(x, exp(predict(RegModel.2, data.frame(logAge=logx))), lty=1, lwd=2, col="red")
      lines(x, predict(RegModel.3, data.frame(logAge=logx)), lty=1, lwd=2, col="blue")
      legend("topright", c("Linear", "Power", "Logarithmic"), lty=1, lwd=2,
         col=c("black", "red", "blue"))

  20. LinearModel.4 <- lm(BrainCC ~ AgeKa + I(AgeKa^2), data=BrainsCC)
      summary(LinearModel.4)
      LinearModel.5 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3), data=BrainsCC)
      summary(LinearModel.5)
      LinearModel.6 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3) + I(AgeKa^4), data=BrainsCC)
      summary(LinearModel.6)
      plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
      abline(RegModel.1, lty=1, lwd=2, col="black")
      x <- seq(0, 1800, 10)
      lines(x, predict(LinearModel.4, data.frame(AgeKa=x)), lty=1, lwd=2, col="red")
      lines(x, predict(LinearModel.5, data.frame(AgeKa=x)), lty=1, lwd=2, col="blue")
      lines(x, predict(LinearModel.6, data.frame(AgeKa=x)), lty=1, lwd=2, col="green")
      legend("topright", c("Linear", "Quadratic", "Cubic", "Quartic"), lty=1, lwd=2,
         col=c("black", "red", "blue", "green"))

  21. load("C:/Users/DCarlson/Documents/anth642/R/Data/Chucalissa.rda") #Rcmdr
      plot(mni~nisp, data=Chucalissa)
      RegModel.1 <- lm(mni~nisp, data=Chucalissa) #Rcmdr
      summary(RegModel.1) #Rcmdr
      abline(RegModel.1)
      plot(mni~nisp, data=Chucalissa, log="xy") # Plot log-log transform
      plot(mni~nisp, data=Chucalissa, log="y") # Plot semi-log transform
      Chucalissa$logMNI <- log(Chucalissa$mni) # Create logged variables
      Chucalissa$logNISP <- log(Chucalissa$nisp)
      plot(logMNI~logNISP, data=Chucalissa)
      RegModel.2 <- lm(logMNI~logNISP, data=Chucalissa) #Rcmdr
      summary(RegModel.2) #Rcmdr
      abline(RegModel.2)
      plot(mni~nisp, data=Chucalissa) # plot log-log equation on original data
      a2 <- exp(RegModel.2$coefficients[[1]]) # Convert a to exp(a)
      b2 <- RegModel.2$coefficients[[2]]
      a1 <- RegModel.1$coefficients[[1]]
      b1 <- RegModel.1$coefficients[[2]]
      curve(a2*x^b2, 0, 3250, add=TRUE)
      abline(RegModel.1, lty=3)
      text(locator(), as.expression(substitute(MNI == a*NISP^b,
         list(a=round(a2, 4), b=round(b2, 4)))), pos=2)
      text(locator(), as.expression(substitute(MNI == a+b*NISP,
         list(a=round(a1, 4), b=round(b1, 4)))), pos=4)
