- 95 Views
- Uploaded on
- Presentation posted in: General

Using R in Astronomy

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Using R in Astronomy

Getting started quickly with R

Juan Pablo Torres-Papaqui

Departamento de Astronomía

Universidad de Guanajuato

DA-UG, México

División de Ciencias Naturales y Exactas,

Campus Guanajuato, Sede Valenciana

Why R?

R is a free language and environment for statistical computing and graphics. It is superior to other graphics/analysis packages commonly used in Astronomy, e.g. supermongo, pgplot, gnuplot, and features a very wide range of statistical analysis and data plotting functions.

Furthermore, R is already the most popular amongst the leading software for statistical analysis, as measured by a variety of indicators, and is rapidly growing in influence.

“R is really important to the point that it's hard to overvalue it. It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.” Google research scientist, quoted in the New York Times.

Key features

- It's a mature, widely used (around 2-3 million users) and well-supported free and open source software project; it's committed to a twice-yearly update schedule (it runs on GNU/linux, unix, MacOS and Windows)
- Excellent graphics capabilitie
- Vector arithmetic, plus many built-in basic & advanced statistical and numerical analysis tools
- Intelligent handling of missing data values (denoted by “NA”)
- Highly extensible, with around 3000 user-contributed packages available
- It's easy to use and has excellent online help and associated documentation

Download

The R Project for Statistical Computing

http://www.r-project.org/

Getting started

The online help pages can be accessed as follows:

help.start() # Load HTML help pages into browser

help(package) # List help page for "package"

?package # Shortform for "help(package)"

help.search("keyword") # Search help pages for "keyword"

?help # For more options

help(package=base) # List tasks in package "base"

q() # Quit R

Getting started

Some extra packages are not loaded by default, but can be loaded as follows:

library("package")

library() # List available packages to load

library(help="package") # list package contents

Getting started

Commands can be separated by a semicolon (";'') or a newline

All characters after "#'' are treated as comments (even on the same line as commands)

Variables are assigned using "<-'' (although "='' also works):

x <- 4.5

y <- 2*x^2 # Use "<<-'' for global assignment

Vectors

x <- c(1, 4, 5, 8) # Concatenate values & assign to x

y <- 2*x^2 # Evaluate expression for each element in x

> y # Typing the name of a variable evaluates it

[1] 2 32 50 128

If x is a vector, y will be a vector of the same length. The vector arithmetic in R means that an expression involving vectors of the same length (e.g. x+y, in the above example) will be evaluated piecewise for corresponding elements, without the need to loop explicitly over the values:

> x+y

[1] 3 36 55 136

> 1:10 # Same as seq(1, 10)

[1] 1 2 3 4 5 6 7 8 9 10

> a <- 1:10

a[-1] # Print whole vector *except* the first element

a[-c(1, 5)] # Print whole vector except elements 1 & 5

a[1:5] # Print elements 1 to 5

a[length(a)] # Print last element, for any length of vector

head(a, n=3) # Print the first n elements of a

tail(a, n=3) # Print the last n elements of a

Demonstrations of certain topics are available:

demo(graphics) # Run grapics demonstration

demo() # List topics for which demos are available

?demo # List help for "demo" function

Typing the name of a function lists its contents; typing a variable evaluates it, or you can use print(x).

A great feature of R functions is that they offer a lot of control to the user, but provide sensible hidden defaults. A good example is the histogram function, which works very well without any explicit control:

First, generate a set of 10,000 Gaussian distributed random numbers:

data <- rnorm(1e4) # Gaussian distributed numbers with mean=0 & sigma=1

hist(data) # Plots a histogram, with sensible bin choices by default

See ?hist for the full range of control parameters available, e.g. to specifiy 7 bins:

hist(data, breaks=7)

Data Input/Output

Assuming you have a file (“file.dat”) containing the following data, with either spaces or tabs between fields:

r x y

1 4.2 14.2

2 2.4 64.8

3 8.76 63.4

4 5.9 32.2

5 3.4 89.6

setwd("C:/Users/juanchito/Documents/Astronomy School/")

#only for windows users

Now read this file into R:

> inp <- read.table("file.dat", header=T) # Read data into data frame

> inp # Print contents of inp

> inp[0] # Print the type of object that inp is:

NULL data frame with 5 rows

or data frame with 0 columns and 5 rows

> colnames(inp) # Print the column names for inp:

[1] "r" "x" "y" # This is simply a vector that can easily be changed:

> colnames(inp) <- c("radius", "temperature", "time")

> colnames(inp)

[1] "radius" "temperature" "time"

Note that you can refer to R objects, names etc. using the first few letters of the full name, provided that is unambiguous, e.g.:

> inp$r # Print the Radius column

but note what happens if the information is ambiguous or if the column doesn't exist:

> inp$t # Could match "inp$temperature" or "inp$time"

NULL

> inp$wibble # "wibble" column doesn't exist

NULL

Alternatively, you can refer to the columns by number:

> inp[[1]] # Print data as a vector (use "[[" & "]]")

[1] 1 2 3 4 5

> inp[1] # Print data as a data.frame (only use "[" and "]")

Writing data out to a file (if no filename is specified (the default), the output is written to the console)

> write.table(inp, quote=F, row.names=F, col.names=T, sep="\t:\t")

> write.table(inp, quote=F, row.names=F, col.names=T)

radius temperature time

1 4.2 14.2

2 2.4 64.8

3 8.76 63.4

4 5.9 32.2

5 3.4 89.6

Saving data

?save

save(inp, t, file="data.RData")

load(file="data.RData") # read back in the saved data:

# - this will load the saved R objects into the current session, with

# the same properties, i.e. all the variable will have the same names

# and contents

Note that when you quit R (by typing q()), it asks if you want to save the workspace image, if you specify yes (y), it writes out two files to the current directory, called .RData and .Rhistory. The former contains the contents of the saved session (i.e. all the R objects in memory) and the latter is a list of all the command history (it's just a simple text file you can look at).

At any time you can save the history of commands using:

savehistory(file="my.Rhistory")

loadhistory(file="my.Rhistory") # load history file into current R session

Plotting data

Useful functions:

?plot # Help page for plot command

?par # Help page for graphics parameter control

?Devices # or "?device"

# (R can output in postscript, PDF, bitmap, PNG, JPEG and more formats)

dev.list() # list graphics devices????

colours() # or "colors()" List all available colours

?plotmath # Help page for writing maths expressions in R

Tip: to generate png, jpeg etc. files, I find it's best to create pdf versions in R, then convert them in gimp (having specified strong antialiasing, if asked, when loading the pdf file into gimp), otherwise the figures are "blocky" (i.e. suffer from aliasing).

To create an output file copy of a plot for printing or including in a document etc.

dev.copy2pdf(file="myplot.pdf") # } Copy contents of current graphics device

dev.copy2eps(file="myplot.eps") # } to a PDF or Postscript file

Adding more datasets to a plot:

x <- 1:10; y <- x^2

z <- 0.9*x^2

plot(x, y) # Plot original data

points(x, z, pch="+") # Add new data, with different symbols

lines(x,y) # Add a solid line for original data

lines(x, z, col="red", lty=2) # Add a red dashed line for new data

curve(1.1*x^2, add=T, lty=3, col="blue") # Plot a function as a curve

text(2, 60, "An annotation") # Write text on plot

abline(h=50) # Add a horizontal line

abline(v=3, col="orange") # Add a vertical line

Error bars

source("errorbar.R") # source code

# Create some data to plot:

x <- seq(1,10); y <- x^2

xlo <- 0.9*x; xup <- 1.08*x; ylo <- 0.85*y; yup <- 1.2*y

Plot graph, but don't show the points (type="n") & plot errors:

plot(x, y, log="xy", type="n")

errorbar(x, xlo, xup, y, ylo, yup)

?plot gives more info on plotting options (as does ?par). To use different line colours, styles, thicknesses & errorbar types for different points:

errorbar(x, xlo, xup, y, ylo, yup, col=rainbow(length(x)), lty=1:5,

lwd=1:3)

Plotting a function or equation

curve(x^2, from=1, to=100) # Plot a function over a given range

curve(x^2, from=1, to=100, log="xy") # With log-log axes

Plot chi-squared probability vs. reduced chi-squared for 10 and then 100 degrees of freedom

dof <- 10

curve(pchisq(x*dof,df=dof), from=0.1, to=3, xlab="Reduced chi-squared",

ylab="Chi-sq distribution function")

abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1

dof <- 100

curve(pchisq(x*dof, df=dof), from=0.1, to=3, xlab="Reduced chi-squared",

ylab="Chi-sq distribution function")

abline(v=1, lty=2, col="blue") # Plot dashed line for reduced chisq=1

Define & plot custom function:

myfun <- function(x) x^3 + log(x)*sin(x)

#--Plot dotted (lty=3) red line, with 3x normal line width (lwd=3):

curve(myfun, from=1, to=10, lty=3, col="red", lwd=3)

#--Add a legend, inset slightly from top left of plot:

legend("topleft", inset=0.05, "My function", lty=3, lwd=3, col="red")

Plotting maths symbols:

demo(plotmath) # Shows a demonstration of many examples

?plotmath # Manual page with more information

Analysis

Basic stuff

x <- rnorm(100) # Create 100 Gaussian-distributed random numbers

hist(x) # Plot histogram of x

summary(x) # Some basic statistical info

d <- density(x) # Compute kernel density estimate

d # Print info on density analysis ("?density" for details)

plot(d) # Plot density curve for x

plot(density(x)) # Plot density curve directly

Plot histogram in terms of probability & also overlay density plot:

hist(x, probability=T)

lines(density(x), col="red") # overlay smoothed density curve

#--Standard stuff:

mean(x); median(x); max(x); min(x); sum(x); weighted.mean(x)

sd(x) # Standard deviation

var(x) # Variance

mad(x) # median absolute deviation (robust sd)

Testing the robustness of statisical estimators

?replicate # Repeatedly evaluate an expression

Calculate standard deviation of 100 Gaussian-distributed random numbers (NB default sd is 1; can specify with rnorm(100, sd=3) or whatever)

sd(rnorm(100))

Now, let's run N=5 Monte Carlo simulations of the above test:

replicate(5, sd(rnorm(100)))

and then compute the mean of the N=5 values:

mean(replicate(5, sd(rnorm(100))))

Obviously 5 is rather small for a number of MC sims (repeat the above command & see how the answer fluctuates). Let's try 10,000 instead:

mean(replicate(1e4, sd(rnorm(100))))

With a sample size of 100, the measured sd closely matches the actual value used to create the random data (sd=1). However, see what happens when the sample size decreases:

mean(replicate(1e4, sd(rnorm(10)))) # almost a 3% bias low

mean(replicate(1e4, sd(rnorm(3)))) # >10% bias low

You can see the bias more clearly with a histogram or density plot:

a <- replicate(1e4, sd(rnorm(3)))

hist(a, breaks=100, probability=T) # Specify lots of bins

lines(density(a), col="red") # Overlay kernel density estimate

abline(v=1, col="blue", lwd=3) # Mark true value with thick blue line

This bias is important to consider when esimating the velocity dispersion of a stellar or galaxy system with a small number of tracers. The bias is even worse for the robust sd estimator mad():

mean(replicate(1e4, mad(rnorm(10)))) # ~9% bias low

mean(replicate(1e4, mad(rnorm(3)))) # >30% bias low

However, in the presence of outliers, mad() is robust compared to sd() as seen with the following demonstration. First, create a function to add a 5 sigma outlier to a Gaussian random distribution:

outlier <- function(N,mean=0,sd=1) {

a <- rnorm(N, mean=mean, sd=sd)

a[1] <- mean+5*sd # Replace 1st value with 5 sigma outlier

return(a)

}

Now compare the performance of mad() vs. sd():

mean(replicate(1e4, mad(outlier(10))))

mean(replicate(1e4, sd(outlier(10))))

You can also compare the median vs. Mean:

mean(replicate(1e4, median(outlier(10))))

mean(replicate(1e4, mean(outlier(10))))

...and without the outlier:

mean(replicate(1e4, median(rnorm(10))))

mean(replicate(1e4, mean(rnorm(10))))

Kolmogorov-Smirnov test, taken from manual page (?ks.test)

x <- rnorm(50) # 50 Gaussian random numbers

y <- runif(30) # 30 uniform random numbers

# Do x and y come from the same distribution?

ks.test(x, y)

Correlation tests

Load in some data:

dfile <- "table1.txt"

A <- read.table(dfile, header=T, sep="|")

cor.test(A$z, A$Tx) # Specify X & Y data separately

cor.test(~ z + Tx, data=A) # Alternative format when using a data frame ("A")

cor.test(A$z, A$Tx, method="spearman") # } Alternative methods

cor.test(A$z, A$Tx, method="kendall") # }

plot(A$z,A$Tx)

Spline interpolation of data

x <- 1:20

y <- jitter(x^2, factor=20) # Add some noise to vector

sf <- splinefun(x, y) # Perform cubic spline interpolation

# - note that "sf" is a function:

sf(1.5) # Evaluate spline function at x=1.5

plot(x,y); curve(sf(x), add=T) # Plot data points & spline curve

Scatter plot smoothing

#--Using above data frame ("A"):

plot(Tx ~ z, data=A)

#--Return (X & Y values of locally-weighted polynomial regression

lowess(A$z, A$Tx)

#--Plot smoothed data as line on graph:

lines(lowess(A$z, A$Tx), col="red")

Regression

dfile <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt"

A <- read.table(dfile, header=T, sep="|")

names(A) # Print column names in data frame

m <- lm(Tx ~ z, data=A) # Fit linear model

summary(m) # Show best-fit parameters & errors

Plot data & best-fit model:

plot(Tx ~ z, A)

abline(m, col="blue") # add best-fit straight line model

Plot residuals:

plot(A$z, resid(m))

abline(h=0, col="red")

Functions

Basics

cat # Type function name without brackets to list contents

args(cat) # Return arguments of any function

body(cat) # Return main body of function

Create your own function

> fun <- function(x, a, b, c) (a*x^2) + (b*x^2) + c

> fun(3, 1, 2, 3)

[1] 30

> fun(5, 1, 2, 3)

[1] 78

A more complicated example of a function. First, create some data:

set.seed(123) # allow reproducible random numbers

a <- rnorm(1000, mean=10, sd=1) # 1000 Gaussian random numbers

b <- rnorm(100, mean=50, sd=15) # smaller population of higher numbers

x <- c(a, b) # Combine datasets

hist(x) # Shows outlier population clearly

sd(x) # Strongly biased by outliers

mad(x) # Robustly estimates sd of main sample

mean(x) # biased

median(x) # robust

Now create a simple sigma clipping function (you can paste the following straight into R):

sigma.clip <- function(vec, nclip=3, N.max=5) {

mean <- mean(vec); sigma <- sd(vec)

clip.lo <- mean - (nclip*sigma)

clip.up <- mean + (nclip*sigma)

vec <- vec[vec < clip.up & vec > clip.lo] # Remove outliers

if ( N.max > 0 ) {

N.max <- N.max - 1

# Note the use of recursion here (i.e. the function calls itself):

vec <- Recall(vec, nclip=nclip, N.max=N.max)

}

return(vec)

}

Now apply the function to the test dataset:

new <- sigma.clip(x)

hist(new) # Only the main population remains!

length(new) # } Compare numbers of

length(x) # } values before & after clipping

Factors

Factors are a vector type which treats the values as character strings which form part of a base set of values (called "levels"). The result of plotting a factor is to produce a frequency histogram of the number of entries in each level (see ?factor for more info).

Basics

a <- rpois(100, 3) # Create 100 Poisson distributed random numbers (lambda=3)

hist(a) # Plot frequency histogram of data

a[0] # List type of vector ("numeric")

b <- as.factor(a) # Change type to factor

List type of vector b ("factor"): also lists the "levels", i.e. all the different types of value:

b[0]

Now, plot b, to get a barchart showing the frequency of entries in each level:

plot(b)

table(b) # a tabular summary of the same information

A more practical example. Read in table of data (from Sanderson et al. 2006), which refers to a sample of 20 clusters of galaxies with Chandra X-ray data:

file <- "http://www.astro.ugto.mx/~papaqui/download/winter/table1.txt"

A <- read.table(file, header=T, sep="|")

By default, “read.table()” will treat non-numeric values as factors. Sometimes this is annoying, in which case use “as.is=T”; you can specify the type of all the input columns explicitly -- see “?read.table”.

A[1,] # Print 1st row of data frame (will also show column names):

plot(A$det) # Plot the numbers of each detector type

plot(A$cctype) # Plot the numbers of cool-core and non-cool core clusters

The “det” column is the detector used for the observation of each cluster: S for ACIS-S and I for ACIS-I. The “cctype” denotes the cool core type: “CC” for cool core, “NCC” for non-cool core.

Now, plot one factor against another, to get a plot showing the number of overlaps between the two categories (the cool-core clusters are almost all observed with ACIS-S):

plot(cctype ~ det, data=A, xlab="ACIS Detector", ylab="Cool-core type")

You can easily explore the properties of the different categories, by plotting a factor against a numeric variable. The result is to show a boxplot (see “?boxplot”) of the numeric variable for each catagory. For example, compare the mean temperature of cool-core & non-cool core clusters:

plot(Tx ~ cctype, data=A)

and compare the redshifts of clusters observed with ACIS-S & ACIS-I:

plot(z ~ det, data=A)

You can define your own factor categories by partitioning a numeric vector of data into discrete bins using “cut()”, as shown here:

Converting a numeric vector into factors

Split cluster redshift range into 2 bins of equal width (nearer & further clusters), and assign each z point accordingly; create new column of data frame with the bin assignments (A$z.bin):

A$z.bin <- cut(A$z, 2)

A$z.bin # Show levels & bin assignments

plot(A$z.bin) # Show bar chart

Plot mean temperatures of nearer & further clusters. Since this is essentially a flux-limited sample of galaxy clusters, the more distant clusters (higher redshift) are hotter (and hence more X-ray luminous - an example of the common selection effect known as Malmquist bias):

A$Tx.bin <- cut(A$Tx, 2)

plot(z ~ Tx.bin, data=A)

plot(Tx ~ z, data=A) # More clearly seen!

Check if redshifts of (non) cool-core clusters differ:

plot(cctype ~ z.bin, data=A) # Equally mixed

Now, let's say you wanted to colour-code the different CC types. First, create a vector of colours by copying the cctype column (you can add this as a column to the data frame, or keep it as a separate vector):

colour <- A$cctype

Now all you need to do is change the names of the levels:

levels(colour) # Show existing levels

levels(colour) <- c("blue", "red") # specify blue=CC, red=NCC

colour[0] # Show data type (& levels, since it's is a factor)

Now change the type of the vector to character (i.e. not a factor):

col <- as.character(colour)

col[0] # Confirm change of data type

col # } Print values and

colour # } note the difference

Now plot mean temperature vs. redshift, colour coded by CC type:

plot(Tx ~ z, A, col=col)

Why not add a plot legend:

legend(x="topleft", legend=c(levels(A$cctype)), text.col=levels(colour))

Now for something a little bit fancier:

plot(Tx ~ z, A, col=col, xlab="Redshift", ylab="Temperature (keV)")

legend(x="topleft", c(levels(A$cctype)), col=levels(colour),

text.col=levels(colour), inset=0.02, pch=1)