TRANSFORMATION



  1. TRANSFORMATION

  2. DEFINITION • A data transformation of the observations x1,x2,…,xn is a function T that replaces each xi by a new value T(xi) so that the transformed values of the batch are T(x1), T(x2),…, T(xn).

  3. WHY DO WE NEED TRANSFORMATION? • Transformations of the response and the predictors can improve the fit of a model and correct violations of its assumptions, such as non-constant error variance, non-normal errors, or a nonlinear relationship between the dependent and independent variables. • We may also consider adding predictors that are functions of the existing predictors, such as quadratic or cross-product terms.

  4. WHY DO WE NEED TRANSFORMATION?

  5. WHICH TRANSFORMATION TO APPLY • Changes of origin and scale are linear transformations; they leave shape alone. • Centering → transformation of the origin • Scaling → transformation of both the origin and the scale • Stronger transformations, such as the logarithm or square root, change shape. • A simple and commonly used transformation is a power transformation.

  6. POWER TRANSFORMATION
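  As a minimal sketch, assuming the standard Tukey and Box-Cox definitions, the power transformation of a positive variable x with parameter λ is:

  T_\lambda(x) = x^\lambda \ (\lambda \neq 0), \qquad T_0(x) = \log x \qquad \text{(Tukey's ladder)}

  T^{\mathrm{BC}}_\lambda(x) = \frac{x^\lambda - 1}{\lambda} \ (\lambda \neq 0), \qquad T^{\mathrm{BC}}_0(x) = \log x \qquad \text{(Box-Cox)}

  The Box-Cox form is continuous in λ at 0 and, unlike the raw power x^λ with λ < 0, preserves the ordering of the data for every λ; λ = 1 leaves the data unchanged up to a shift.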

  7. POWER TRANSFORMATION

  8. POWER TRANSFORMATION

  9. EXAMPLE • Careful data analysis begins with inspection of the data, and techniques for examining and transforming data find direct application to the analysis of data using linear models. • The data for the four plots in the figure, given in the accompanying table, were cleverly contrived by Anscombe (1973) so that the least-squares regression line and all other common regression 'outputs' are identical in the four datasets.

  10. It is clear, however, that each graph tells a different story about the data:
  • In (a), the linear regression line is a reasonable descriptive summary of the tendency of Y to increase with X.
  • In (b), the linear regression fails to capture the clearly curvilinear relationship between the two variables; we would do much better to fit a quadratic function here.
  • In (c), there is a perfect linear relationship between Y and X for all but one outlying data point. The least-squares line is pulled strongly towards the outlier, distorting the relationship between the two variables for the rest of the data. When we encounter an outlier in real data, we should look for an explanation.
  • In (d), the values of X are invariant (all are equal to 8), with the exception of one point (which has an X-value of 19); the least-squares line would be undefined but for this point. We are usually uncomfortable having the result of a data analysis depend so centrally on a single influential observation. Only in this fourth dataset is the problem immediately apparent from inspecting the numbers. (A quick check in R follows.)
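  Anscombe's quartet is built into base R as the anscombe data frame, so the identical regression outputs are easy to verify; a minimal check (our code, not from the original deck):

  # Anscombe's quartet ships with R (datasets::anscombe).
  # Fit the four regressions and compare their outputs.
  fits <- lapply(1:4, function(i) {
    lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe)
  })
  sapply(fits, coef)                              # ~3.00 intercept, ~0.50 slope in all four
  sapply(fits, function(m) summary(m)$r.squared)  # ~0.67 in all four

  # Plot the four datasets with their (identical) fitted lines.
  op <- par(mfrow = c(2, 2), mar = c(4, 4, 1, 1))
  for (i in 1:4) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i))
    abline(fits[[i]])
  }
  par(op)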

  11. TRANSFORMATIONS FOR NONLINEAR RELATIONS • Assume that the error terms are reasonably close to a normal distribution and have approximately constant variance, but there is a nonlinear relationship between X and Y. • In this case, apply the transformation to X only, not to Y, because transforming Y may change its distribution and create problems with normality and constant error variance. See the sketch below. [Figure: three scatter shapes of Y against X with suggested transformations T(X) = X^2 or e^X; T(X) = log10(X) or X^(1/2); T(X) = 1/X or e^(-X)]
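  A minimal simulated sketch of transforming only the predictor (the data and variable names are illustrative, not from the slides):

  # Simulated example: Y is linear in log10(X), so transform X only.
  set.seed(1)
  x <- runif(100, 1, 50)
  y <- 2 + 3 * log10(x) + rnorm(100, sd = 0.3)

  fit.raw <- lm(y ~ x)         # a straight line in x misses the curvature
  fit.log <- lm(y ~ log10(x))  # linear after transforming X; Y is untouched

  plot(x, y)
  xx <- seq(min(x), max(x), length.out = 200)
  lines(xx, predict(fit.log, newdata = data.frame(x = xx)))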

  12. WARNING • Regression coefficients will need to be interpreted with respect to the transformed scale. There is no straightforward way of back-transforming them to values that can be interpreted in the original scale. • You cannot directly compare regression coefficients across models where the response transformation is different. • Difficulties of this type may dissuade one from transforming the response, even if this requires the use of another type of model, such as a generalized linear model.

  13. TRANSFORMATIONS FOR STABILIZING VARIANCE • When a variable has very different degrees of variation in different groups, it becomes difficult to examine the data and to compare differences in level across the groups. • In this case, we need to apply the transformation to the response. • Usually a power transformation helps us to stabilize the variance (see the sketch below). • If we have a heavy-tailed symmetric distribution, neither a variance-stabilizing transformation nor a power transformation is helpful.
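  Why a power transformation of the response can stabilize variance: a one-line delta-method sketch (a standard result, not from the slides). If Var(Y) ≈ v(μ) with μ = E(Y), then for a smooth T:

  \operatorname{Var}\{T(Y)\} \approx T'(\mu)^2\, v(\mu)
  \quad\Longrightarrow\quad
  \text{choose } T \text{ with } T'(\mu) \propto v(\mu)^{-1/2}.

  Examples: v(μ) = μ gives T(y) = √y (Poisson-type counts); v(μ) = μ² gives T(y) = log y (constant coefficient of variation).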

  14. TRANSFORMATION FOR NONNORMALITY • Many statistical methods require that the numeric variables we are working with have an approximate normal distribution. • For example, t-tests, F-tests, and regression analyses all require in some sense that the numeric variables are approximately normally distributed.

  15. TOOLS FOR ASSESSING NORMALITY • Descriptives: skewness = 0 and kurtosis = 3 for a normal distribution • Histogram, boxplot, density plots • Normal quantile-quantile plot (Q-Q plot) • Goodness-of-fit tests: Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Jarque-Bera test (illustrated below)
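  A short illustration of these tools in R; the Anderson-Darling and Jarque-Bera tests live in the add-on packages nortest and tseries, which is an assumption about your setup:

  set.seed(42)
  x <- rnorm(200)

  # Descriptives: sample skewness and (raw) kurtosis, computed by hand
  z <- (x - mean(x)) / sd(x)
  mean(z^3)  # skewness, ~0 for normal data
  mean(z^4)  # kurtosis, ~3 for normal data

  # Graphics
  hist(x); boxplot(x); plot(density(x))
  qqnorm(x); qqline(x)

  # Goodness-of-fit tests (base R)
  shapiro.test(x)
  ks.test(x, "pnorm", mean(x), sd(x))  # estimated parameters make this approximate

  # From add-on packages (assumed installed):
  # nortest::ad.test(x)           # Anderson-Darling
  # tseries::jarque.bera.test(x)  # Jarque-Bera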

  16. NORMAL QUANTILE PLOT • THE IDEAL PLOT: Here is an example where the data are perfectly normal. The plot on the right is a normal quantile plot, with the data on the vertical axis and the z-scores we would expect if the data were normal on the horizontal axis. When the data are approximately normal, the two sets of quantiles agree, and the observations lie along the reference line in the normal quantile plot. The points should lie within the dashed confidence bands.

  17. Normal Quantile Plot (leptokurtosis) • The distribution of sodium levels of patients in this right heart catheterization study has heavier tails than a normal distribution (i.e., leptokurtosis). When the data are plotted vs. the expected z-scores in the normal quantile plot, there is an "S-shape", which indicates kurtosis.

  18. Normal Quantile Plot (discrete data) • Although the distribution of the gestational age data of infants in the very low birthweight study is approximately normal, there is a "staircase" appearance in the normal quantile plot. This is due to the discrete coding of gestational age, which was recorded to the nearest week or half week.

  19. Normal Quantile Plots • IMPORTANT NOTE: • If you plot DATA vs. NORMAL, as on the previous slides, then: downward bend = left skew; upward bend = right skew. • If you plot NORMAL vs. DATA, then: downward bend = right skew; upward bend = left skew.

  20. https://www.youtube.com/watch?v=-KXy4i8awOg

  21. Tukey's Ladder of Powers • Here V represents our variable of interest; we consider this variable raised to a power λ, i.e., V^λ. • Middle rung: no transformation (λ = 1). • We go UP the ladder (λ > 1) to remove left skewness and DOWN the ladder (λ < 1) to remove right skewness. • The farther a rung is from λ = 1 in either direction, the bigger the impact of the transformation. [Figure: the ladder of powers, with left-skewed data sent up the ladder and right-skewed data sent down]

  22. Tukey's Ladder of Powers • To remove right skewness, we typically take the square root, cube root, logarithm, or reciprocal of the variable, etc., i.e., V^.5, V^.333, log10(V) (think of it as the V^0 rung), V^-1, etc. • To remove left skewness, we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e., V^2, V^3, etc. A small simulated illustration follows.
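  A minimal simulated sketch of moving down the ladder to remove right skewness (the skewness helper is written by hand to avoid add-on packages; the data are illustrative):

  set.seed(123)
  v <- rlnorm(500, meanlog = 0, sdlog = 0.8)  # right-skewed data

  # Hand-rolled sample skewness (~0 for symmetric data)
  skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

  skewness(v)         # strongly positive
  skewness(sqrt(v))   # closer to 0
  skewness(v^(1/3))   # closer still
  skewness(log10(v))  # ~0: the log rung fits log-normal data exactly

  op <- par(mfrow = c(1, 2))
  hist(v, main = "raw"); hist(log10(v), main = "log10")
  par(op)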

  23. Removing Right Skewness, Example 1: PDP-LI Levels for Cancer Patients • In the log base-10 scale, the PDP-LI values are approximately normally distributed.

  24. Removing Right Skewness, Example 2: Systolic Volume for Male Heart Patients • [Figure: distributions of sysvol, sysvol^.5, sysvol^.333, log10(sysvol), and 1/sysvol]

  25. Removing Right Skewness, Example 2 (continued): 1/sysvol • The reciprocal of systolic volume is approximately normally distributed, and the Shapiro-Wilk test provides no evidence against normality (p = .5340). • CAUTION: The reciprocal transformation reverses the order of the data, in the sense that the largest value becomes the smallest and the smallest becomes the largest after transformation. The units after transformation may or may not make sense; e.g., if the original units are mg/ml, then after transformation they would be ml/mg.

  26. The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey's h Transformation • Lambert W x F distributions are a generalized framework to analyze skewed, heavy-tailed data. The framework is based on an input/output system, where the output random variable (RV) Y is a non-linearly transformed version of an input RV X ~ F, with properties similar to those of X but slightly skewed (heavy-tailed). The transformed RV Y has a Lambert W x F distribution. • Package 'LambertW', written by Georg M. Goerg, contains functions to model and analyze skewed, heavy-tailed data the Lambert way: simulate random samples, estimate parameters, compute quantiles, and plot/print results nicely. Probably the most important function is 'Gaussianize', which works similarly to 'scale' but actually makes the data Gaussian.

  27.
  library(LambertW)
  set.seed(10)

  ### Set parameters ####
  # skew Lambert W x t distribution with (location, scale, df) = (0, 1, 3)
  # and positive skew parameter gamma = 0.1
  theta.st <- list(beta = c(0, 1, 3), gamma = 0.1)
  # double heavy-tail Lambert W x Gaussian with (mu, sigma) = (0, 1),
  # left delta = 0.2 and right delta = 0.4 (-> heavier on the right)
  theta.hh <- list(beta = c(0, 1), delta = c(0.2, 0.4))

  ### Draw random samples ####
  # skewed Lambert W x t
  yy <- rLambertW(n = 1000, distname = "t", theta = theta.st)
  # double heavy-tail Lambert W x Gaussian (= Tukey's hh)
  zz <- rLambertW(n = 1000, distname = "normal", theta = theta.hh)

  ### Plot ecdf and qq-plot ####
  op <- par(no.readonly = TRUE)
  par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
  plot(ecdf(yy))
  qqnorm(yy); qqline(yy)
  plot(ecdf(zz))
  qqnorm(zz); qqline(zz)
  par(op)

  28.
  ### Parameter estimation ####
  mod.Lst <- MLE_LambertW(yy, distname = "t", type = "s")
  mod.Lhh <- MLE_LambertW(zz, distname = "normal", type = "hh")

  layout(matrix(1:2, ncol = 2))
  plot(mod.Lst)
  plot(mod.Lhh)

  29. Since this heavy-tail generation is based on bijective transformations of RVs/data, you can remove the heavy tails from the data and check whether the result is Gaussian (and test it using normality tests).

  ### Test goodness of fit ####
  ## test if 'symmetrized' data follows a Gaussian
  xx <- get_input(mod.Lhh)
  normfit(xx)

  $shapiro.wilk

          Shapiro-Wilk normality test

  data:  data.test
  W = 0.99942, p-value = 0.9934

  30.
  install.packages("quantmod")
  library(quantmod)

  getSymbols("GS")
  y <- OpCl(GS)  # daily percent change, open to close
  qqnorm(y); qqline(y)

  z <- Gaussianize(y, type = "h", return.tau.mat = TRUE)
  x1 <- get_input(y, c(z$tau.mat[, 1]))  # same as z$input
  test_normality(z$input)
  plot(z$input)
  qqnorm(z$input); qqline(z$input)

  31. Transforming Proportions • Power transformations are often not helpful for proportions, since these quantities are bounded below by 0 and above by 1. • If the data values do not approach these two boundaries, then proportions can be handled much like other sorts of data. • Percents and many sorts of rates are simply rescaled proportions. • It is common to encounter ‘disguised’ proportions, such as the number of questions correct on an exam of fixed length. • An example, drawn from the Canadian occupational prestige data, is shown in the stem-and-leaf display. The distribution is for the percentage of women among the incumbents of each of 102 occupations.

  32. Several transformations are commonly employed for proportions; the most important is the logit transformation: logit(P) = ln[P / (1 − P)]. • The logit transformation is the log of the 'odds', P/(1 − P). • The 'trick' of the logit transformation is to remove the upper and lower boundaries of the scale, spreading out the tails of the distribution and making the resulting quantities symmetric about 0; for example, logit(0.5) = 0, logit(0.9) = ln(9) ≈ 2.20, and logit(0.1) = −ln(9) ≈ −2.20.

  33. The logit transformation cannot be applied to proportions of exactly 0 or 1. • If we have access to the original counts, we can define adjusted proportions P′ = (F + 1/2) / (N + 1) in place of P, as sketched below. • Here, F is the frequency count in the focal category (e.g., the number of women) and N is the total count (the total number of occupational incumbents, women plus men).
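  A small sketch of the logit and the adjusted ("empirical") logit in R; the function names are ours, not from the slides:

  # logit and its inverse
  logit <- function(p) log(p / (1 - p))
  inv.logit <- function(x) 1 / (1 + exp(-x))

  logit(c(0.1, 0.5, 0.9))  # -2.197  0.000  2.197 (symmetric about 0)

  # Adjusted proportions handle counts of 0 or N:
  # F = count in the focal category, N = total count
  empirical.logit <- function(F, N) logit((F + 0.5) / (N + 1))
  empirical.logit(F = 0, N = 20)  # finite, unlike logit(0/20)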

  34. Interpreting Coefficients in Regression with Log-Transformed Variables • Log transformations are among the most commonly used transformations, but interpreting the results of an analysis with log-transformed data can be challenging. • A log transformation is often useful for data that exhibit right skewness (positive skew), and for data where the variability of the residuals increases with larger values of the dependent variable. When a variable is log-transformed, note that simply taking the anti-log of the parameter estimates will not properly back-transform them into the original metric.

  35. Interpreting Coefficients in Regression with Log-Transformed Variables • To properly back-transform into the original scale, we need to understand some details about the log-normal distribution. In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. • More specifically, if a variable Y follows a log-normal distribution, then ln(Y) follows a normal distribution with mean μ and variance σ².
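  A short worked summary of the relevant log-normal facts (standard results, not transcribed from the slides):

  \ln Y \sim N(\mu, \sigma^2)
  \;\Longrightarrow\;
  \operatorname{median}(Y) = e^{\mu},
  \qquad
  E(Y) = e^{\mu + \sigma^2/2}.

  Hence exponentiating a fitted mean on the log scale recovers the median of Y on the original scale; recovering the mean requires multiplying by e^{\hat{\sigma}^2/2}.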

  36. Interpreting Coefficients in Regression with Log-Transformed Variables

  37. Interpreting Coefficients in Regression with Log-Transformed Variables

  38. Interpreting Coefficients in Regression with Log-Transformed Variables • Interpreting parameter estimates in a linear regression when variables have been log-transformed is not always straightforward either. • The standard interpretation of a regression parameter β is that a one-unit change in the predictor results in a β-unit change in the expected value of the response variable, holding all the other predictors constant. • Interpreting a log-transformed variable can be done in a similar manner; however, such coefficients are routinely interpreted in terms of percent change. Below we explore the interpretation in a simple linear regression setting when the dependent variable, the independent variable, or both variables are log-transformed.
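  A compact summary of the usual interpretations for a simple regression with slope β₁ (standard results; exact and approximate percent-change forms):

  \begin{aligned}
  \ln Y = \beta_0 + \beta_1 X :&\quad \text{a one-unit increase in } X \text{ changes } Y \text{ by } 100\,(e^{\beta_1}-1)\% \approx 100\,\beta_1\% \text{ for small } \beta_1.\\
  Y = \beta_0 + \beta_1 \ln X :&\quad \text{a } 1\% \text{ increase in } X \text{ changes } Y \text{ by about } \beta_1/100 \text{ units}.\\
  \ln Y = \beta_0 + \beta_1 \ln X :&\quad \text{a } 1\% \text{ increase in } X \text{ changes } Y \text{ by about } \beta_1\% \ (\text{an elasticity}).
  \end{aligned}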

  39. Interpreting Coefficients in Regression with Log-Transformed Variables

  40. Interpreting Coefficients in Regression with Log-Transformed Variables
