Descriptive Statistics

Descriptive Statistics Summer Program Brian Healy

What have we learned so far • What is biostatistics • Role of a biostatistician • How to input data into R • Simple R functions

What are we doing today? • Types of data • Summary statistics • Measures of central tendency • Tables • Graphs • How to do all of these things in R

Big picture • When we want to initially describe a data set or summarize a large data set using graphs or tables, we have several tools we can use. • Summary statistics- a single number or set of numbers that describes the entire data set • Frequency table- a table showing the number of members in each of a set of specific groups • Graphs- pictures showing characteristics of the data, usually focusing on one or more aspects of the data set • The best way to use these different methods depends on the type of data you have

Tables and graphs • The most important parts of any scientific paper or presentation are the graphs and tables, because these are what people are most likely to pay attention to and remember. They also allow a large amount of data to be summarized in a small space. • Statistical papers are somewhat different

Types of data • The first thing to notice about a variable is what kind of variable it is. • Nominal: Blond hair=1, Brown hair=2, Red hair=3 • Definition: Values fall into unordered classes • Dichotomous: Only 2 outcomes (male and female) • Ordinal: Mild=1, Moderate=2, Severe=3 • Definition: Values fall into ordered classes, but magnitude has no meaning • Discrete: Number of deaths in states in USA • Definition: Takes on specific values, and the magnitude and order are important • Often considered continuous in analyses, but conclusions can be misleading • Continuous: Height and weight • Definition: Any value is possible

Summary statistics • Definition: a single number or group of numbers that describe an entire data set • Example: Ages of class: • class<-read.table("class.dat", header=T) • age<-class[,3] • Maximum: • Minimum: • Range: • Each of these provides information about the entire group in one number

Measures of location • Measures of the location of a distribution (measures of central tendency) • Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Median: the middle value • Mode: the most common value • Example: Ages of class • Mean: • Median: • Mode:
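A minimal sketch of the three measures in R. The `ages` vector is hypothetical (the class data file is not reproduced here). R has no built-in function for the statistical mode, so this sketch tabulates the values and picks the most frequent one; that is one possible approach, not the only one.

```r
# Hypothetical ages, used only for illustration
ages <- c(22, 24, 24, 25, 27, 24, 30, 22)

mean_age   <- mean(ages)    # arithmetic mean
median_age <- median(ages)  # middle value of the sorted data

# Mode: count how often each value occurs, then keep the most frequent value(s)
counts   <- table(ages)
mode_age <- as.numeric(names(counts)[counts == max(counts)])

mean_age
median_age
mode_age
```

If several values tie for the highest count, `mode_age` contains all of them.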

What happens if we have outliers? • Each measure of central tendency is appropriate in certain circumstances • Outlier: an extreme observation • May be important to understand the full picture: rare toxicity • May be an error in data entry or have another explanation, and be better ignored • Mean: very sensitive • Median: less sensitive, more robust

Computing summary stats in R • Question: What is the average high temperature in Boston in August? • data<-c(89, 77, 54, 80, 87, 92, 93, 83, 86) • mean(data) • median(data) • Which better describes the data? • What are explanations for the outlier? Should we include this data point?
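To see how much the single low reading matters, the sketch below computes the mean and median of the temperatures above with and without the value 54. Dropping the point is done here only for comparison; whether it should be excluded is the question the slide asks. The name `temps` is used instead of `data` to avoid masking R's `data()` function.

```r
temps <- c(89, 77, 54, 80, 87, 92, 93, 83, 86)

mean_all   <- mean(temps)    # 741 / 9, about 82.3
median_all <- median(temps)  # 86

# Same summaries with the suspicious reading removed
temps_no_outlier <- temps[temps != 54]
mean_trim   <- mean(temps_no_outlier)    # 85.875
median_trim <- median(temps_no_outlier)  # 86.5

c(mean_all, mean_trim)
c(median_all, median_trim)
```

The mean moves by about 3.5 degrees when one point is removed, while the median moves by only 0.5.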

What about the mean and median of ordinal and nominal data? • For our nominal data example, we used Blond hair=1, Brown hair=2, Red hair=3 • Data set: 1, 2, 2, 2, 3, 1, 2, 1, 2, 2 • Mean: 1.8 • Median: 2 • Do these summary statistics have any meaning in this case? • For our ordinal data example, Mild=1, Moderate=2, Severe=3 • Data set: 1, 1, 3, 1, 1, 1, 1, 2, 2, 2, 1 • Mean: 1.455 • Median: 1 • Do these have more meaning than the previous ones? What must we be careful of? • How could we describe each of these types of data better?
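One possible answer to the last question is to report counts and proportions per category rather than a mean. The sketch below does this for the two data sets above; using `factor()` to attach the category labels is an assumption of this example, not something the slides prescribe.

```r
# Nominal example: hair color codes from the slide
hair   <- c(1, 2, 2, 2, 3, 1, 2, 1, 2, 2)
hair_f <- factor(hair, levels = 1:3, labels = c("Blond", "Brown", "Red"))
hair_counts <- table(hair_f)

# Ordinal example: severity codes from the slide, stored as an ordered factor
severity <- c(1, 1, 3, 1, 1, 1, 1, 2, 2, 2, 1)
sev_f <- factor(severity, levels = 1:3,
                labels = c("Mild", "Moderate", "Severe"), ordered = TRUE)
sev_counts <- table(sev_f)
sev_props  <- prop.table(sev_counts)  # proportions instead of counts

hair_counts
sev_counts
round(sev_props, 2)
```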

Measures of spread • Beyond the location of the data, we may be interested in how varied the data are • Ex. You are planning to spend a year in London and Los Angeles. You find out that the average temperatures in the two places are 65°F and 75°F. You could use this information to decide what clothes to bring. Is this all you would want to know? • The spread of the distribution, i.e. the range of possible temperatures

Measures of spread • Measures of distance from the mean: • Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean • Standard deviation: $s = \sqrt{s^2}$ • Note that the units of the standard deviation match the units of the mean • Interquartile range: 25th percentile and 75th percentile • Range: Minimum and maximum • Which of these are sensitive to outliers?

Computing measures of spread in R • Let’s look at the spread in the heights of the class • height<-c(63,64,66,64,64,67,68,67,63,71) • var(height) • sd(height) • IQR(height) • range(height) • What is the difference in the output for IQR and range? • To find any quantile, use quantile(height, 0.75)
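The commands above can be run together as follows. The sketch highlights the difference the slide asks about: `IQR()` returns a single number (the distance between the quartiles), while `range()` returns the minimum and maximum as two numbers. The quartiles use R's default quantile method.

```r
height <- c(63, 64, 66, 64, 64, 67, 68, 67, 63, 71)

height_var <- var(height)   # sample variance (divides by n - 1)
height_sd  <- sd(height)    # square root of the variance
height_iqr <- IQR(height)   # one number: 75th minus 25th percentile
height_rng <- range(height) # two numbers: minimum and maximum

# The quartiles behind the IQR, and the width of the range
quartiles   <- quantile(height, c(0.25, 0.75))
range_width <- diff(height_rng)

height_iqr
height_rng
quartiles
```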

Tables • Simple display for group of numbers • Very common in publications • Two main types • Display tables- Shows several characteristics of groups in one display • Frequency tables- Shows number of people in each group.

Frequency tables

Creating a table in R • A couple of different functions make tables in R • Data: • a<-c(1,1,1,1,2,2,2,2,2,3) • b<-c(2,1,2,2,2,2,2,1,1,1) • table(a) • table(a,b) • tabulate(a) • How do these work?
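A short sketch of how the three calls differ, using the vectors above. `table()` returns labeled counts (one-way or cross-tabulated), while `tabulate()` returns a plain vector of counts for the integers 1, 2, 3, ... with no labels.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3)
b <- c(2, 1, 2, 2, 2, 2, 2, 1, 1, 1)

one_way <- table(a)     # counts of each distinct value of a, with labels
two_way <- table(a, b)  # cross-tabulation: rows are a, columns are b
plain   <- tabulate(a)  # unlabeled counts for the values 1, 2, 3

one_way
two_way
plain

# Cells of a two-way table can be looked up by their labels
two_way["1", "2"]  # number of observations with a = 1 and b = 2
```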

Practice • Using the class data, answer the following questions: • How many students in the class have a Master’s degree? • How many students went to college west of the Mississippi and have a Master’s? • How many students like baseball (4 or 5)? • What is the longest time anyone was on a plane? • What is the largest family size in the class? • How many people have more than 4 people in their family?

Grouped data • Frequency tables are also useful for sensitive data, such as income, where people may not be willing to give you exact values but will provide a range. • With data such as this, how could we find the mean?

Grouped mean • Since we do not have the specific data points, we cannot calculate the exact mean • We can use the groups to estimate the mean using the grouped mean: $\bar{x} = \frac{\sum_{j=1}^{k} n_j m_j}{\sum_{j=1}^{k} n_j}$ • where $n_j$ is the number of people in group $j$, $m_j$ is the midpoint of group $j$, and $k$ is the number of groups
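A minimal sketch of the grouped mean in R. The income brackets, midpoints, and counts below are hypothetical values made up for illustration; only the formula (counts times midpoints, divided by the total count) comes from the slide.

```r
# Hypothetical income brackets in thousands of dollars: 0-20, 20-40, 40-60, 60-80
midpoints <- c(10, 30, 50, 70)  # m_j: midpoint of each bracket
counts    <- c(5, 10, 4, 1)     # n_j: number of people in each bracket

# Grouped mean: weight each midpoint by the number of people in its group
grouped_mean <- sum(counts * midpoints) / sum(counts)

# The same calculation using R's built-in weighted mean
grouped_mean_alt <- weighted.mean(midpoints, counts)

grouped_mean
```

This is only an estimate of the true mean, since it treats everyone in a bracket as if they sat at the midpoint.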

Graphs and Plots • One of the biggest advantages of R is the quality of the plots • Let’s plot the ages of the class • To make plots in R, use the following commands for the appropriate plots • histogram- hist(age) • box plot- boxplot(age)

Plot Command • plot() is the basic command for producing a scatter plot or line graph. Common arguments: • col= set colors • lty= set line types • lwd= set line widths • pch= set the plotting character • type= pick points (type="p") or lines (type="l") • cex= set the "character expansion" • xlab= and ylab= set the axis labels • xlim= and ylim= set the limits of the axes • main= put a title on the plot • sub= put a sub-title under the plot (the separate function mtext() adds text in the margins) • See help(par) for details
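A small sketch that combines several of these arguments in one call. The data (`x` and `y`) are made up for illustration. In R the sub-title is set here with the `sub=` argument of `plot()`, and `type="b"` draws both points and lines.

```r
# Hypothetical data for illustration
x <- 1:10
y <- x^2

plot(x, y,
     type = "b",            # both points and lines
     col  = "blue",         # color
     lty  = 2,              # dashed line
     lwd  = 2,              # line width
     pch  = 19,             # solid circle as the plotting character
     cex  = 1.2,            # enlarge the plotting characters by 20%
     xlab = "x", ylab = "x squared",
     xlim = c(0, 11), ylim = c(0, 110),
     main = "Main title",
     sub  = "Sub-title")
```

Changing one argument at a time is a quick way to learn what each does.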

One-Dimensional Plots • barplot(height) #simple form • barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE) • boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE) • hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)

Two-Dimensional Plots • lines(x, y, type="l") • points(x, y, type="p") • matplot(x, y, type="p", lty=1:5, pch=, col=1:4) • matpoints(x, y, type="p", lty=1:5, pch=, col=1:4) • matlines(x, y, type="l", lty=1:5, pch=, col=1:4) • plot(x, y, type="p", log="") • abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=) • qqplot(x, y, plot=TRUE) • qqnorm(x, datax=FALSE, plot=TRUE)

Three-Dimensional Plots • contour(x, y, z, v, nint=5, add=FALSE, labex) • interp(x, y, z, xo, yo, ncp=0, extrap=FALSE) • persp(z, eye=c(-6,-8,5), ar=1)

Multiple Plots Per Page • par(mfrow=c(nrow, ncol), oma=c(0, 0, 4, 0)) • mfrow=c(m,n) : subsequent figures will be drawn row-by-row in an m by n matrix on the page. • oma=c(xbot,xlef,xtop,xrig):outer margin lines of text. • mtext(side=3, line=0, cex=2, outer=T, "This is an Overall Title For the Page") • Try this code on your own • par(mfrow=c(2,1)) • hist(age) • plot(class[,3],class[,4])
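The class data file is not reproduced here, so the sketch below uses hypothetical `age` and `height` vectors to show the layout commands working together: two plots stacked on one page, with extra outer margin at the top for an overall title.

```r
# Hypothetical data standing in for two columns of the class data
age    <- c(22, 24, 24, 25, 27, 24, 30, 22, 26, 23)
height <- c(63, 64, 66, 64, 64, 67, 68, 67, 63, 71)

# 2 rows, 1 column; reserve 4 lines of outer margin at the top
par(mfrow = c(2, 1), oma = c(0, 0, 4, 0))
layout_used <- par("mfrow")

hist(age)          # first panel
plot(age, height)  # second panel

# Overall title written in the outer margin
mtext("Overall title for the page", side = 3, line = 1, cex = 1.5, outer = TRUE)

# Return to one plot per page for later plots
par(mfrow = c(1, 1))
```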

Output to a postscript file • Often we want to output an R graph to a postscript file to place it into a LaTeX file or other document • To do this, we use the following code • postscript("graph1.ps") – This opens a postscript file in the current working directory • hist(age) – This plots a graph into the file • dev.off() – This closes the postscript file
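A runnable version of these three steps. To avoid cluttering the working directory, this sketch writes to R's temporary directory; that choice and the `age` values are assumptions of the example. The final check confirms that the file was written.

```r
# Hypothetical ages standing in for the class data
age <- c(22, 24, 24, 25, 27, 24, 30, 22, 26, 23)

# Build a path inside the session's temporary directory
out_file <- file.path(tempdir(), "graph1.ps")

postscript(out_file)  # open the postscript device
hist(age)             # draw into the file, not onto the screen
dev.off()             # close the device so the file is complete

file.exists(out_file)
```

Nothing appears on screen while the postscript device is open; the plot is only written once `dev.off()` is called.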

Making plots of your own • Make the following plots • Histogram of height in the class with the appropriate labels • Scatterplot of height and age in the class using a different point • Make a postscript file with four plots of your choice • Write a function to make a histogram and boxplot on one graph

Using a for loop • Sometimes, we would like to do the same thing several times. One way to do this is to use a for loop • Ex. We have a data set with data on several statistics from Red Sox players. We would like to find the mean and median of each of these statistics. • base<-read.table("baseball.dat", header=T) • The columns of this are player id, at bats, hits, home runs, walks, L/R • How could we find the mean of the first 5 columns?

basemean<-basemed<-matrix(0,1,5)
for (i in 1:5){
  basemean[i]<-mean(base[,i])
  basemed[i]<-median(base[,i])
}
basemean
basemed

Apply function • A great way to do a similar action in R is to use the apply function • apply(base,2,mean) • The arguments are, in order: the name of the data set; 1=by row or 2=by column; and the function to be applied (built-in or user defined) • Note that for the first five columns you get the same result as the for loop. • For this example there is limited benefit to the apply function, but in more complex situations it saves a lot of time
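The file baseball.dat is not reproduced here, so the sketch below builds a small hypothetical data frame with the same six columns and checks that the loop and `apply()` agree. The column names and all values are made up for illustration. `colMeans()` is shown as a further built-in shortcut.

```r
# Hypothetical stand-in for baseball.dat (id, at bats, hits, home runs, walks, L/R)
base <- data.frame(id   = 1:4,
                   ab   = c(500, 420, 310, 150),
                   hits = c(150, 120, 80, 40),
                   hr   = c(30, 10, 5, 2),
                   bb   = c(60, 45, 20, 10),
                   lh   = c(1, 0, 1, 0))

# For loop: mean of each of the first 5 columns
loop_means <- numeric(5)
for (i in 1:5) {
  loop_means[i] <- mean(base[, i])
}

# apply: same calculation, 2 = apply the function to each column
apply_means <- apply(base[, 1:5], 2, mean)

# Built-in shortcut for column means
col_means <- colMeans(base[, 1:5])

loop_means
apply_means
```

Unlike the loop result, the `apply()` result keeps the column names, which makes the output easier to read.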

Using conditionals • Now, we would like to find the total number of at bats and walks by left-handed batters. Remember for left-handed batters LH is 1. • We could do this using a for loop and if statements. Try this yourself.

numplayers<-nrow(base)
totab<-0
totwalks<-0
for (j in 1:numplayers){
  if (base[j,6]==1){
    totab<-totab+base[j,2]
    totwalks<-totwalks+base[j,5]
  }
}
• The body of the if statement is only evaluated when its condition is true. You can also add else if and else statements, which are evaluated if the initial condition is false. We will see this later in the summer. • Although this is one way to get the total number of walks and at bats, it involves a lot of code.

Subsetting a data set • Another great thing about R is that you can embed conditions directly in the index of a data set • Ex. As we know, to determine the total number of walks we can use • sum(base[,5]) • If we want to find the total number of walks among left-handed players, we can sum over the correct subset of players • sum(base[(base[,6]==1),5]) • This command evaluates where (base[,6]==1) is true and sums over that subset only • What happens when you type base[(base[,6]==1),]
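A sketch comparing the loop-and-if approach with logical subsetting. Because baseball.dat is not reproduced here, the data frame is a hypothetical stand-in with made-up values; column 6 is 1 for left-handed batters.

```r
# Hypothetical stand-in for baseball.dat (id, at bats, hits, home runs, walks, L/R)
base <- data.frame(id   = 1:4,
                   ab   = c(500, 420, 310, 150),
                   hits = c(150, 120, 80, 40),
                   hr   = c(30, 10, 5, 2),
                   bb   = c(60, 45, 20, 10),
                   lh   = c(1, 0, 1, 0))

# Loop version: add up walks for left-handed batters one row at a time
totwalks <- 0
for (j in 1:nrow(base)) {
  if (base[j, 6] == 1) {
    totwalks <- totwalks + base[j, 5]
  }
}

# Subsetting version: the logical vector picks the rows, 5 picks the column
is_left        <- base[, 6] == 1
totwalks_sub   <- sum(base[is_left, 5])
totab_sub      <- sum(base[is_left, 2])

# All columns for left-handed batters only
lefties <- base[is_left, ]

totwalks
totwalks_sub
```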

Practice • Make a histogram of the hits by batters with more than 400 at bats. • Find the minimum number of at bats by a right-handed batter

More on R functions • Yesterday, we briefly mentioned that you could write your own functions in R. This is one of the most valuable aspects of R. • Let’s look at this function. What does it do?
fun<-function(x, y){
  mx<-mean(x); maxx<-max(x)
  my<-mean(y); maxy<-max(y)
  if (maxx>maxy){list(group=1, mean=mx)}
  else {list(group=2, mean=my)}
}

pp<-c(2,3,3,3,2,10)
ppp<-c(8,7,6,8,6,5,6,7,8,7,7,9)
fun(pp,ppp)
$group
[1] 1

$mean
[1] 3.833333
• Now, try to write functions to do the following things. • Take a vector input and find the mean of all of the values except the minimum and maximum • Take a vector input and output a graph with a histogram and boxplot • Take a matrix input. Find the mean and median of each column. Output the mean, median and column number as a list for the column with the highest median

Possible answers
fun2<-function(x){
  s<-sum(x)-min(x)-max(x)
  n<-length(x)-2
  list(mean=s/n)
}
fun3<-function(x){
  par(mfrow=c(2,1))
  hist(x); boxplot(x)
}
fun4<-function(x){
  meds<-apply(x,2,median)
  mns<-apply(x,2,mean)
  n<-c(1:ncol(x))
  maxmed<-max(meds)
  nn<-n[(meds==maxmed)]
  list(column=nn, mean=mns[nn], median=meds[nn])
}
