
Programming with R and Bioconductor for microarray expression analysis and interpretation

Programming with R and Bioconductor for microarray expression analysis and interpretation. Ramsi Haddad Rhaddad@genetics.wayne.edu 577-2569. Outline. Lecture 1: Overview and background for R Lecture 2: BioC & microarray diagnostics Lecture 3: BioC analysis and interpretation


Presentation Transcript


  1. Programming with R and Bioconductor for microarray expression analysis and interpretation Ramsi Haddad Rhaddad@genetics.wayne.edu 577-2569

  2. Outline • Lecture 1: Overview and background for R • Lecture 2: BioC & microarray diagnostics • Lecture 3: BioC analysis and interpretation • Lecture 4: Cluster computing on BlueGene

  3. Lecture 1 Overview and background for R

  4. What is R? • Open source software that implements the S programming language • can be used on the command line, or scripted • Available from www.r-project.org • Good documentation • Not so user friendly • “R is user-friendly, it just chooses its friends carefully”, Kyle Furge

  5. Starting R • use PuTTY to connect to genetics.wayne.edu • then use ssh to connect to bluegene.biosci.wayne.edu • password is mgb8680 • type R to start • q() # this is how we quit R

  6. Help! • If you have a workstation, use help.start() and a web browser will pop up • For us: help(command) shows the manual page • can also use ?command • Excellent documentation at the web site • "Introductory Statistics with R", P. Dalgaard • great little book

  7. Basics • R is command driven • It waits at the ">" prompt for you to give it instructions • R operates on data structures and objects • Data structures include vectors, lists, arrays (matrices) and data frames • No scalars per se, only vectors with 1 element

  8. Command or Assignment • Everything is either a function call (i.e. a statement) or an assignment • The assignment operator is an arrow pointing to the variable name (no space allowed within the arrow!): "<-" • x <- 10 # i.e. x = 10 • y <- c(1,2,3) # c() concatenates • z <- c(1:20) # see also ?rep and ?seq • print(X) # won't work since R is case sensitive • print(x) • [1] 10 # the [1] is the index of the first element on that line

  9. Functions • 1 + 1 # the answer is a one-element vector: [1] 2 • z * 2 • jack <- z * 2 • Jack <- 1:40 • length(Jack) • all function calls include () even if there are no arguments, e.g. list the workspace with ls() • if you type ls without (), R prints the source of the function • some functions behave differently on different objects (generic, or polymorphic, functions) • lots of built-in functions • mean(Jack); median(Jack); sd(Jack)

  10. Don't use single character names • avoid these single-letter object names, because R already uses them: • c, q, t, C, D, F, I, T • try to use other variable names of at least 3 characters • I like to use names for quick stuff.

  11. Vectors, intro to data structures • All data structures are objects that can be assigned to variable names • All objects have a mode and a length • Vector: an ordered collection of items (all of which are of the same mode) • jack <- c(1:10) • mode(jack) # mode is numeric • typeof(jack) # integer type • length(jack) # 10 • mode can be character, numeric, complex, logical • typeof can be integer, double, complex etc.

  12. Vector generating functions • seq(from, to, by=) # test this out and look at the help page, ?seq • rep(x, times, length.out, each) # test this out too
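
As a rough illustration of what "testing this out" might look like, the sketch below calls seq() and rep() with a few argument combinations. The variable names are made up for this example.

```r
# Sequences: give a step size (by) or a total length (length.out)
evens <- seq(from = 2, to = 20, by = 2)          # 2 4 6 ... 20
grid  <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# Repetition: whole vector (times) versus element by element (each)
abab <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
aabb <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
trim <- rep(1:3, length.out = 7)      # 1 2 3 1 2 3 1
```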

  13. Vectorized Arithmetic • two vectors can be multiplied, one element at a time, if they are the same length • if they are of different lengths, the shorter vector is "recycled" • if the length of the longer vector is not a multiple of the length of the shorter one, R still recycles but gives a warning • make two vectors of different length and test this out.
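
One possible way to try the recycling rule is sketched below; the vector names are invented for the example, and suppressWarnings() is used only so the second result can be stored quietly.

```r
long_vec  <- 1:6
short_vec <- c(10, 100)

# Lengths 6 and 2: short_vec is recycled three times, no warning
long_vec * short_vec                    # 10 200 30 400 50 600

# Lengths 6 and 4: 6 is not a multiple of 4, so R recycles anyway
# (odd_vec is used as 1 2 3 4 1 2) and issues a warning
odd_vec <- 1:4
result  <- suppressWarnings(long_vec * odd_vec)   # 1 4 9 16 5 12
```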

  14. Logical Vectors • can be written out (rare) • a <- c(TRUE, TRUE, FALSE, TRUE) • can be generated • a <- z > 10 # with z <- 1:20, expect 10 FALSE then 10 TRUE • print(a); length(a) • comparison operators that return logical vectors: <, <=, ==, >, >=, !=

  15. Arrays & Matrices • arrays are vectors with a dimension attribute, matrices are two-dimensional arrays • can assign dimensions to vector z to create a matrix • z <- 1:20; dim(z) <- c(5,4) # z matrix has 5 rows • table.z <- matrix(z, nrow=4, ncol=5) • table.z # make sure it worked the way we wanted. • let's look at ?matrix to see how else to do that • maybe add "byrow=TRUE"
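
A small sketch of the difference byrow makes when filling a matrix is given below; by_col and by_row are names chosen only for this illustration.

```r
values <- 1:20

by_col <- matrix(values, nrow = 4, ncol = 5)                # default: fill down the columns
by_row <- matrix(values, nrow = 4, ncol = 5, byrow = TRUE)  # fill across the rows instead

by_col[1, ]   # 1 5 9 13 17
by_row[1, ]   # 1 2 3 4 5
dim(by_row)   # 4 5
```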

  16. Matrix Functions • dim(z) <- c(row, column) # previous slide • matrices have rownames & colnames # attributes • vectors use names • rownames(z) <- LETTERS[1:5] # one name per row of the 5 x 4 matrix z • t(z) # transposition so that columns become rows and rows become columns • very important when generating a matrix from a matrix • rownames(z) <- c(paste("row", 1:5)) • what happens if you forget the terminal ) ?

  17. Matrix Functions con't • cbind, rbind # column bind and row bind • cart <- matrix(1:100, ncol=20, byrow=TRUE) • horse <- 1:20 # one value per column of cart • cart.horse <- rbind(cart, horse) • this wouldn't work properly with cbind, why? • rownames(cart.horse) <- NULL • # remove the evidence
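
The sketch below shows one way the lengths have to line up for rbind and cbind. The vectors new_row and new_col are hypothetical and exist only for this example: a new row needs one value per column, a new column one value per row.

```r
cart    <- matrix(1:100, ncol = 20, byrow = TRUE)  # 5 rows, 20 columns
new_row <- 101:120                                  # 20 values, one per column
new_col <- 1:5                                      # 5 values, one per row

with_row <- rbind(cart, new_row)   # 6 x 20, last row is labelled "new_row"
with_col <- cbind(cart, new_col)   # 5 x 21, last column is labelled "new_col"

dim(with_row)
dim(with_col)
rownames(with_row) <- NULL         # drop the row labels added by rbind
```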

  18. Lists • Lists are generic vectors that can hold any data type (the elements don't all need to be character or logical or numeric) • ramsi <- list(name="ramsi", age=36, married=TRUE, children=c("jack", "tommy")) • str(ramsi) # gives you the structure of the ramsi list • ramsi$children • # the $ operator gets to variables within a list

  19. Data Frame • table containing numbers and text • i.e. the matrix equivalent of a list • some columns can be numbers, others text, still others logical. • sounds like microarray data • probe set ID: character • expression values: numeric • treatment: logical...
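
To make the idea concrete, here is a minimal sketch of a data frame with one character, one numeric and one logical column. The probe IDs and values are invented for illustration and are not real array data.

```r
# Hypothetical probe set IDs and values, for illustration only
arrays <- data.frame(
  probe_id   = c("probe_1", "probe_2", "probe_3"),
  expression = c(7.2, 5.9, 8.4),
  treated    = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep probe_id as character, not factor
)

str(arrays)                  # one line per column, with its type
arrays$expression            # a single column via $
arrays[arrays$treated, ]     # only the rows where treated is TRUE
```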

  20. Numeric Indexing • vectors, matrices, lists • need to have some form of indexing. • use square brackets [] • z[1] # first element of z (there is no z[0]) • z[1:4] # sub-sequence of z • z[1,4] # no go for a vector, need to use z[c(1,4)] • I usually write the closing bracket with the opener, less confusing.

  21. Logical Indexing of vectors, i.e. Conditional Selection • z <- 1:50 • a <- z > 13 # make a logical vector • length(a) # still 50, just some FALSE • z[a] # apply the logical vector • length(z[a]) # this is 37 • length(z[z>13]) • a negative index drops that item.

  22. Logical Indexing of matrices • a must be the same length as z for this to work. • matrices have index [row, column] • table.z <- matrix(z[1:20], nrow=4, ncol=5, byrow=TRUE) • table.z[3,2] # should be a 12, verify it... • table.z[2,] # all of row 2 • table.z[,4] # all of column 4 • indexes are dropped with a minus sign • Try this out
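
A short sketch of these indexing forms on a 4 x 5 matrix follows, including dropping rows or columns with a minus sign. It rebuilds table.z from 1:20 so that it runs on its own.

```r
table.z <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)

table.z[3, 2]          # 12
table.z[2, ]           # 6 7 8 9 10
table.z[, 4]           # 4 9 14 19
table.z[-1, ]          # every row except the first (3 x 5)
table.z[, -c(1, 2)]    # drops columns 1 and 2 (4 x 3)
table.z[table.z > 15]  # logical index on a matrix returns a vector
```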

  23. Looping through matrices • R has explicit loops like other languages • they are often slower and wordier than vectorized code, so they are used less than in other languages • apply(X, MARGIN, FUN, ...) # FUN must work on vectors • let's look at its help page, make a big table with rnorm() and take the mean of the rows and columns.
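
The exercise above might be approached roughly as follows. The table size and the seed are arbitrary choices made for this sketch.

```r
set.seed(1)  # arbitrary seed so the example is repeatable
big_table <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

row_means <- apply(big_table, 1, mean)  # MARGIN = 1: one value per row
col_means <- apply(big_table, 2, mean)  # MARGIN = 2: one value per column

length(row_means)  # 1000
length(col_means)  # 10

# rowMeans() and colMeans() are built-in shortcuts for this particular case
all.equal(row_means, rowMeans(big_table))
```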

  24. Functions • look at the help page for t.test • this is a typical statistical function • it requires an input and has an output • the inputs (arguments) usually have default values • arguments can be matched by order of appearance (positional matching) or given explicitly by name • always better to go explicit, especially when scripting. • positional and named arguments can be mixed in the same function call.

  25. More on the t.test() function • t.test(x, y, alternative="two.sided", mu=0, paired=FALSE, var.equal=FALSE, conf.level=0.95, ...) • t.test(x, y, "two.sided", , , TRUE, 0.99) # this is not readable!
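
For comparison, a sketch of the same kind of call with named arguments is shown below. The data are simulated with rnorm() purely for illustration, and the seed is arbitrary.

```r
set.seed(2)                       # arbitrary seed
x <- rnorm(10, mean = 5, sd = 2)  # simulated group 1
y <- rnorm(10, mean = 3, sd = 3)  # simulated group 2

# x and y are matched by position, everything else by name
res <- t.test(x, y,
              alternative = "two.sided",
              var.equal   = TRUE,
              conf.level  = 0.99)
res
```

Arguments that are not named here (mu, paired) keep their defaults, so nothing has to be skipped with empty commas.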

  26. t.test() results list • the result is a list with multiple values. • since it's a list, use the $ operator to extract values • also possible to use list[["name"]] (not worth it) • catherine <- c(rnorm(10, 5, 2), rnorm(10, 3, 3)) # make a vector • ramsi <- t.test(catherine[1:10], catherine[11:20]) • ramsi # this shows the standard t.test() output • str(ramsi) # let's see the variables
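
A brief sketch of pulling individual values out of the result follows. The object name result and the seed are choices made for this example; the component names are those returned by t.test().

```r
set.seed(3)  # arbitrary seed
catherine <- c(rnorm(10, 5, 2), rnorm(10, 3, 3))
result <- t.test(catherine[1:10], catherine[11:20])

names(result)         # components available in the result list
result$statistic      # the t-statistic
result$p.value        # the p-value
result$conf.int       # confidence interval for the difference in means
result[["p.value"]]   # same value as result$p.value
```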

  27. library(multtest) • look at your handout, we'll fiddle with it. • this is a library of routines or functions that someone has written. • must first be loaded with library(multtest) • one of the functions is mt.sample.teststat()

  28. writing functions • my.function <- function(a, b, c) { this; that; return(whatever) } • cubism <- function(a = 3) { b <- (a*a*a); return(b) } • ending the body with a bare b (or c(b)) instead of return(b) also works, since the value of the last expression is returned • cubism() # default gives 27 • cubism(4) # better be 64!
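
Written out over several lines, the cube function could look like the sketch below, here relying on the last expression as the return value.

```r
cubism <- function(a = 3) {
  a^3   # the value of the last expression is returned
}

cubism()      # 27, using the default a = 3
cubism(4)     # 64
cubism(1:3)   # 1 8 27, since arithmetic is vectorized
```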

  29. another not-so-silly function • counts how many values in a series of numbers (variable name series) are bigger than a given number (variable name number) • freq <- function(series=1:20, number=11) { instances <- series[series > number]; answer <- length(instances); return(answer) }
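
Laid out as a script, the counting function might read as follows. The one-line alternative with sum() is an addition for comparison and is not from the slides.

```r
freq <- function(series = 1:20, number = 11) {
  instances <- series[series > number]  # keep only the values above the threshold
  answer <- length(instances)           # count them
  return(answer)
}

freq()                        # 9, the values 12 to 20
freq(c(3, 8, 15), number = 5) # 2

# Equivalent shortcut: TRUE counts as 1 when summed
sum(1:20 > 11)                # 9
```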

  30. Permuted t.test() • take a vector of gene expression data. • calculate a t-statistic for it using t.test() • jumble the group labels and run t.test() again • rinse and repeat for all possible relabelings (or a large random sample of them) • in what fraction of the permutations is the permuted t-statistic at least as extreme as the real one? • this is your permutation p-value • this permuted t-test is distribution free since you are using resampling to generate the null distribution of the statistic from the data instead of relying on some theoretical normal dist.
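
The steps above can be sketched for a single gene as follows. This is only one possible outline and not the multtest implementation: the function name perm_t_pvalue, the 0/1 label coding, the number of permutations and the simulated data are all assumptions made for illustration. It samples random relabelings instead of enumerating every one, and adds 1 to the numerator and denominator so the estimated p-value is never exactly zero.

```r
perm_t_pvalue <- function(values, labels, n_perm = 2000) {
  # t-statistic comparing the group labelled 1 with the group labelled 0
  stat <- function(lab) {
    t.test(values[lab == 1], values[lab == 0])$statistic
  }
  observed <- stat(labels)
  # jumble the labels n_perm times and recompute the statistic each time
  permuted <- replicate(n_perm, stat(sample(labels)))
  # fraction of relabelings at least as extreme as the observed statistic
  (sum(abs(permuted) >= abs(observed)) + 1) / (n_perm + 1)
}

set.seed(4)                                          # arbitrary seed
expr   <- c(rnorm(8, mean = 5), rnorm(8, mean = 7))  # simulated, not real data
groups <- rep(c(0, 1), each = 8)
p_perm <- perm_t_pvalue(expr, groups)
p_perm
```

Taking absolute values makes this a two-sided test; a one-sided version would compare the signed statistics instead.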

  31. making your own function • homework assignment will be to write a permuted.t.test() for microarray data • once this works, you'll have it forever. • you'll forget how to use GeneSpring, but you'll always have a record of your function or your entire data analysis. • I have only seen farther because I stood on the shoulders of giants. • use multtest... can it be done without multtest?

  32. What is Bioconductor? • set of routines and functions to perform microarray analysis • all are used within R and can therefore be manipulated and altered. • tough to use without R • my analysis is about ½ and ½ R and Bioconductor libraries… • open group with mailing list.

  33. Lecture 2 • you all have 24 arrays of muscle tissue from normal, obese and morbidly obese men • we want to read these all in, perform some QC/QA and see which genes are differentially expressed using a permuted t test.
