Data analysis of microarrays bioconductor org
Download
1 / 58

Data Analysis of Microarrays Bioconductor - PowerPoint PPT Presentation


  • 185 Views
  • Uploaded on

Data Analysis of Microarrays Bioconductor.org. “affy” package “limma” package. Open R and Load libraries. > load(affy). Affymetrix Data files. CEL files contain information about the expression levels of the individual probes.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Data Analysis of Microarrays Bioconductor' - rio


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Data analysis of microarrays bioconductor org

Data Analysis of Microarrays Bioconductor.org

“affy” package

“limma” package



Affymetrix data files
Affymetrix Data files

  • CEL files contain information about the expression levels of the individual probes.

  • CDF file contains information about which probes belong to which probe set. The probe set information in the CEL file by itself is not particularly useful as there is no indication in the file as to which probe set a probe belongs. This information is stored in the CDF library file associated with a chip type. All the arrays belonging to a given type will share this same information.


Analysis
Analysis

  • Sample Generation, experimental design, hybridization =) RAW DATA.

  • Data analysis.

    • (a) Scanning and image analysis =) PROBE INTENSITIES. [low level analysis]

    • (b) Background correction, normalization and quantifying expression measures =) NORMALIZED EXPRESSION MEASURES FOR EACH GENE (PROBE SET). [low level analysis]

    • (c) Ranking of genes, clustering/classification of genes =) INTERESTING GENE SET. [high level analysis]

  • Validation of the interesting genes: using additional biological data and performing targeted experiments.


Estrogen dataset
Estrogen dataset

  • "The investigators in this experiment were interested in the effect of estrogen on the genes in ER+ breast cancer cells over time.

  • After serum starvation of all eight samples, they exposed four samples to estrogen, and then measured mRNA transcript abundance after 10 hours for two samples and 48 hours for the other two.

  • They left the remaining four samples untreated, and measured mRNA transcript abundance at 10 hours for two samples, and 48 hours for the other two. Since there are two factors in this experiment (estrogen and time), each at two levels (present or absent,10 hours or 48 hours), this experiment is said to have a 2x2 factorial design."



Put cel files in a directory
Put .cel files in a directory

  • Estrogen.zip file should be extracted into a file named estrogen.

  • Working directory should be set to this file using the change dir command under the file menu.


Information on chips samples
Information on chips/samples

  • A text file that looks like the following should be prepared using a text editor and saved as filename.txt (e.g., estrogen.txt):


R read the sample information
R: read the sample information

You have to load the file into a phenoData object

> pd <- read.phenoData("estrogen.txt", header = TRUE, row.names = 1)

> pData(pd)


Phenodata objects
phenoData objects

  • phenoData objects are where the Bioconductor Package stores information about samples

    • treatment conditions in a cell line experiment or clinical or histopathological characteristics of tissue biopsies.

    • The header option lets the read.phenoData function know that the first line in the file contains column headings, and the row.names option indicates that the first column of the file contains the row names.


Phenodata objects1
phenoData objects

> pd <- read.phenoData("estrogen.txt", header = TRUE, row.names = 1)

> pData(pd)

estrogen time.h

low10-1.cel absent 10

low10-2.cel absent 10

high10-1.cel present 10

high10-2.cel present 10

low48-1.cel absent 48

low48-2.cel absent 48

high48-1.cel present 48

high48-2.cel present 48


Read cel files and name the object as a
Read .cel files and name the object as a

> a <- ReadAffy(filenames = rownames(pData(pd)), phenoData = pd,

+ verbose = TRUE)

Copy and paste the above statements onto the command line in R


Read cel files and name the object as a1
Read .cel files and name the object as a

> a <- ReadAffy(filenames = rownames(pData(pd)), phenoData = pd,

+ verbose = TRUE)

1 reading low10-1.cel ...instanciating an AffyBatch (intensity a 409600x8 matrix)...done.

Reading in : low10-1.cel

Reading in : low10-2.cel

Reading in : high10-1.cel

Reading in : high10-2.cel

Reading in : low48-1.cel

Reading in : low48-2.cel

Reading in : high48-1.cel

Reading in : high48-2.cel


Type a
Type a

> a

AffyBatch object

size of arrays=640x640 features (25606 kb)

cdf=HG_U95Av2 (12625 affyids)

number of samples=8

number of genes=12625

annotation=hgu95av2


Visualize chips
Visualize chips

>image(a[,1])


Visualize log2 transformed raw values using hist histogram of distribution
Visualize log2 transformed raw values using hist (histogram of distribution)

> hist(log2(intensity(a[, 4])), breaks = 100, col = "blue")


Boxplots
Boxplots of distribution)

>boxplot(a, col = "red")


Normalize data using rma
Normalize data using rma of distribution)

> eset = rma(a)

>boxplot(data.frame(exprs(eset)), col = "blue")


Ma plots
MA Plots of distribution)

  • Measurement of relative expression versus Average log intensity plot. Also known as the RI (Ratio versus Intensity).

  • MA plots can show the intensity-dependent ratio of raw microarray data. Let R represent raw intensity for one chip and G represent raw intensity for another chip. Define

  • M = log2(R/G)

  • A = log2(RG)1/2.


Ma plots after normalization
MA plots after normalization of distribution)

>mva.pairs(exprs(eset)[,c(1,3,5,7)],log.it=FALSE)


Ma plots before normalization
MA plots before normalization of distribution)

> mva.pairs(pm(a)[,c(1,3,5)])


Memory problems
Memory Problems of distribution)

If you experience memory errors, please restart your R session before trying a less memory-expensive alternative.

  • The following function justRMA() should read in all 8 arrays, normalize them, and create an object of class exprSet with as little as 128 Megabytes of RAM:

    > library(affy)

    > pd <- read.phenoData("EstrogenTargets.txt",header=TRUE,row.names=1,as.is=TRUE)

    > eset <- justRMA(filenames=pData(pd)$FileName,phenoData=pd)

    > eset

  • Or to read in only the first four arrays, you can use:

    >eset <- justRMA(filenames=pData(pd)$FileName[1:4],phenoData=pd)

    >eset


Glossary
Glossary of distribution)


Glossary1
Glossary of distribution)


Glossary2
Glossary of distribution)


Gene filtering genefilter package
Gene Filtering (Genefilter package) of distribution)

> library(genefilter)

Loading required package: survival

Loading required package: splines

> f1 <- pOverA(0.25, log2(100))

> f2 <- function(x) (IQR(x) > 0.5)

> ff <- filterfun(f1, f2)

> selected <- genefilter(eset, ff)

> sum(selected)

[1] 1664

> esetSub <- eset[selected, ]


See some data
See some data of distribution)

> esetSub[1:3, ]

Expression Set (exprSet) with

3 genes

8 samples

phenoData object with 2 variables and 8 cases

varLabels

estrogen: read from file

time.h: read from file

> exprs(esetSub[1:3, ])

low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel

1005_at 9.206734 8.993805 8.237886 8.338003 9.173192

1008_f_at 10.119335 10.986661 10.830301 10.025106 11.044671

1009_at 10.553034 10.500375 11.159770 11.048265 10.211211

low48-2.cel high48-1.cel high48-2.cel

1005_at 9.040490 7.926101 8.06968

1008_f_at 11.138321 10.705763 11.36952

1009_at 9.565864 11.363234 10.76293


Clustering algorithms in r
Clustering Algorithms in R of distribution)

  • Two main categories

    • Agglomerative (e.g., hierarchical)

    • Partitioning (e.g., kmeans)


Clustering algorithms in r1
Clustering Algorithms in R of distribution)

  • First, dist is used to compute distances. This function takes a matrix as its first argument and computes the distances between the rows of the matrix.

  • We will use the expression data matrix from estrogen for this purpose

  • The commands below are used to carry out hierarchical clustering using the Manhattan distance metric and to plot the corresponding dendrogram with nodes labeled according to file names.


Cluster data hierarchical
Cluster data (hierarchical) of distribution)

> dgTr <- dist(t(exprs(esetSub)), method = "manhattan")

> hcgTr <- hclust(dgTr, method = "average")


Cluster data hierarchical1
Cluster data (hierarchical) of distribution)

> kmgTr <- kmeans(t(exprs(esetSub)), centers = 2)

> kmgTr$cluster

low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel low48-2.cel

1 1 2 2 1 1

high48-1.cel high48-2.cel

2 2


T test on a subset of estrogen data
T-test on a subset of estrogen data of distribution)

>subEST=exprs(eset[,1:4]);

> group1=c(1,1,2,2)

> group1

[1] 1 1 2 2

> subEST

Expression Set (exprSet) with

12625 genes

4 samples

phenoData object with 2 variables and 4 cases

varLabels

estrogen: read from file

time.h: read from file

> tf1 <- ttest(group1, 0.1)

> ff2 <- filterfun(tf1)

> wh2 <- genefilter(exprs(subEST), ff2)

> sum(wh2)

[1] 1711

>


Apoai data
ApoAI data of distribution)

  • The data is from a study of lipid metabolism by Callow et al (2000). The apolipoprotein AI (ApoAI) gene is known to play a pivotal role in high density lipoprotein (HDL) metabolism. Mice which have the ApoAI gene knocked out have very low HDL cholesterol levels. The purpose of this experiment is to determine how ApoAI deficiency affects the action of other genes in the liver, with the idea that this will help determine the molecular pathways through which ApoAI operates.


Hybridizations
Hybridizations of distribution)

  • The experiment compared 8 ApoAI knockout mice with 8 wild type (normal) C57BL/6 ("black six") mice, the control mice. For each of these 16 mice, target mRNA was obtained from liver tissue and labelled using a Cy5 dye. The RNA from each mouse was hybridized to a separate microarray. Common reference RNA was labelled with Cy3 dye and used for all the arrays. The reference RNA was obtained by pooling RNA extracted from the 8 control mice.


Experimental design1
Experimental Design of distribution)


Load data
Load data of distribution)

> library(limma)

> load("ApoAI.RData")

> objects()

[1] "last.warning" "mart" "RG"

> names(RG)

[1] "R" "G" "Rb" "Gb" "printer" "genes" "targets"


What did you download
What did you download? of distribution)

> RG$targets

FileName Cy3 Cy5

c1 a1koc1.spot Pool C57BL/6

c2 a1koc2.spot Pool C57BL/6

c3 a1koc3.spot Pool C57BL/6

c4 a1koc4.spot Pool C57BL/6

c5 a1koc5.spot Pool C57BL/6

c6 a1koc6.spot Pool C57BL/6

c7 a1koc7.spot Pool C57BL/6

c8 a1koc8.spot Pool C57BL/6

k1 a1kok1.spot Pool ApoAI-/-

k2 a1kok2.spot Pool ApoAI-/-

k3 a1kok3.spot Pool ApoAI-/-

k4 a1kok4.spot Pool ApoAI-/-

k5 a1kok5.spot Pool ApoAI-/-

k6 a1kok6.spot Pool ApoAI-/-

k7 a1kok7.spot Pool ApoAI-/-

k8 a1kok8.spot Pool ApoAI-/-


What does rg contain
What does RG contain of distribution)

  • Type in RG at the command line:

    >RG

    $R

    a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1

    [1,] 4184.08 3300.22 2432.54 2069.81 2809.83 2029.05 1720.56 1795.58 1845.97

    [2,] 4148.48 3774.18 2579.92 2566.33 2286.43 2244.33 1987.83 1908.33 2924.68

    [3,] 2452.32 3028.47 3574.45 2348.09 4462.67 2703.20 1543.10 2123.91 1702.91

    [4,] 1577.31 1999.92 1296.81 1926.69 1010.74 1764.58 1161.71 1495.13 1778.82

    [5,] 1525.48 1967.44 1210.10 1929.64 950.80 1638.89 936.42 1540.65 1554.96

    a1kok2 a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

    [1,] 2330.53 2630.76 2234.17 2811.72 2721.10 2361.67 2585.89

    [2,] 3027.00 2411.81 2023.39 4148.58 3384.13 2763.00 2775.44

    [3,] 2235.84 2343.61 4196.89 2079.53 4344.00 3051.24 3277.82

    [4,] 1313.93 1551.66 1359.58 1305.88 1512.14 1617.78 1584.25

    [5,] 1288.93 1378.91 1216.87 1218.35 1205.67 1412.08 1666.32

    6379 more rows ...


What does rg contain1
What does RG contain of distribution)

$G

a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1

[1,] 6256.08 5269.89 4080.85 3229.38 3763.22 2773.58 2800.44 1923.58 804.80

[2,] 5389.81 3608.12 2614.42 2107.56 1878.30 1926.00 1711.17 1205.20 1385.96

[3,] 2653.29 4168.16 8091.95 3134.39 3046.00 3612.40 2002.62 1651.41 915.24

[4,] 1071.26 1162.71 819.13 1295.82 428.95 900.78 838.19 689.00 763.64

[5,] 1321.75 1148.64 642.94 1289.74 500.49 684.00 571.92 736.41 685.88

a1kok2 a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

[1,] 2558.68 1724.41 2309.08 2758.44 2632.43 1955.33 3191.78

[2,] 1943.40 1100.10 1233.33 2337.75 1630.91 1422.45 1961.50

[3,] 2404.77 3011.42 3708.85 2830.40 3030.90 2809.82 2482.39

[4,] 583.08 873.06 737.57 834.58 887.97 827.38 890.58

[5,] 536.43 656.76 593.17 679.73 756.44 572.64 1233.45

6379 more rows ...


What does rg contain2
What does RG contain of distribution)

$Rb

a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1 a1kok2

[1,] 1418.50 1532.00 992.00 1306.75 781.89 1165 761.88 1151.00 1098.86 941.74

[2,] 1280.05 1497.00 980.00 1328.00 773.00 1165 759.17 1151.00 994.43 934.00

[3,] 1216.00 1481.63 935.00 1348.61 773.00 1198 758.00 1129.05 949.39 935.84

[4,] 1193.69 1467.42 973.26 1341.55 760.00 1198 752.53 1077.34 949.00 911.09

[5,] 1148.12 1442.00 929.43 1376.21 760.00 1177 710.79 1075.00 920.16 898.00

a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

[1,] 1042.00 954.00 930.00 987.57 1190.83 1073.44

[2,] 1042.00 952.22 930.00 933.09 1158.00 1074.62

[3,] 1042.00 904.63 930.30 919.70 1150.18 1077.00

[4,] 1037.75 899.89 914.79 911.14 1179.75 1077.00

[5,] 1023.15 898.00 892.76 882.02 1069.02 1064.04

6379 more rows ...


What does rg contain3
What does RG contain of distribution)

$Gb

a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1 a1kok2

[1,] 663.50 520.00 371.54 563.25 228.00 274.05 354.50 292.00 224.00 239.95

[2,] 643.43 520.00 355.00 589.44 231.00 268.00 324.61 286.93 206.86 239.00

[3,] 544.81 498.63 317.55 619.58 231.00 267.00 281.66 264.00 181.00 239.00

[4,] 522.80 454.55 356.46 615.76 218.37 267.00 289.30 264.00 181.00 248.18

[5,] 465.27 433.00 295.67 567.81 203.00 267.00 238.42 257.82 174.57 229.10

a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

[1,] 282.53 254.00 321.28 363.00 285.00 309.00

[2,] 264.24 254.00 309.00 336.00 285.00 309.00

[3,] 250.06 254.00 305.00 336.00 285.00 309.86

[4,] 233.66 244.56 297.75 335.46 286.78 351.78

[5,] 231.00 231.05 298.74 334.22 274.44 358.92

6379 more rows ...


What does rg contain4
What does RG contain of distribution)

$genes

GridROW GridCOL ROW COL NAME TYPE CLID ACC

1 1 1 1 1 Cy3RT Control BLANK BLANK

2 1 1 1 2 Cy5RT Control BLANK BLANK

3 1 1 1 3 mSRB1 cDNA mSRB1 mSRB1

4 1 1 1 4 BLANK BLANK BLANK BLANK

5 1 1 1 5 BLANK BLANK BLANK BLANK

6379 more rows ...

$targets

FileName Cy3 Cy5

c1 a1koc1.spot Pool C57BL/6

c2 a1koc2.spot Pool C57BL/6

c3 a1koc3.spot Pool C57BL/6

c4 a1koc4.spot Pool C57BL/6

c5 a1koc5.spot Pool C57BL/6

11 more rows ...


Exercise
Exercise of distribution)

> dim(RG)

[1] 6384 16

> ncol(RG)

[1] 16

> colnames(RG)

[1] "a1koc1" "a1koc2" "a1koc3" "a1koc4""a1koc5" "a1koc6" "a1koc7" "a1koc8"

[9] "a1kok1" "a1kok2" "a1kok3" "a1kok4""a1kok5" "a1kok6" "a1kok7" "a1kok8"


Subsets
subsets of distribution)

> RG[1:2,]

An object of class "RGList"

$R

a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1

[1,] 4184.08 3300.22 2432.54 2069.81 2809.83 2029.05 1720.56 1795.58 1845.97

[2,] 4148.48 3774.18 2579.92 2566.33 2286.43 2244.33 1987.83 1908.33 2924.68

a1kok2 a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

[1,] 2330.53 2630.76 2234.17 2811.72 2721.10 2361.67 2585.89

[2,] 3027.00 2411.81 2023.39 4148.58 3384.13 2763.00 2775.44

$G

a1koc1 a1koc2 a1koc3 a1koc4 a1koc5 a1koc6 a1koc7 a1koc8 a1kok1

[1,] 6256.08 5269.89 4080.85 3229.38 3763.22 2773.58 2800.44 1923.58 804.80

[2,] 5389.81 3608.12 2614.42 2107.56 1878.30 1926.00 1711.17 1205.20 1385.96

a1kok2 a1kok3 a1kok4 a1kok5 a1kok6 a1kok7 a1kok8

[1,] 2558.68 1724.41 2309.08 2758.44 2632.43 1955.33 3191.78

[2,] 1943.40 1100.10 1233.33 2337.75 1630.91 1422.45 1961.50


Background correction
Background correction of distribution)

  • RG.b <- backgroundCorrect(RG,method="minimum")


Loess normalization
Loess normalization of distribution)

>plotDensities(MA.p)


Quantile normalization
Quantile Normalization of distribution)

> MA.pAq <- normalizeBetweenArrays(MA.p, method="Aquantile")

> plotDensities(MA.pAq)


Print tip loess normalization
Print-tip loess normalization of distribution)

> MA <- normalizeWithinArrays(RG)


Generate factors
Generate factors of distribution)

> design <- cbind("WT-Ref"=1,"KO-WT"=rep(0:1,c(8,8)))

> design

WT-Ref KO-WT

[1,] 1 0

[2,] 1 0

[3,] 1 0

[4,] 1 0

[5,] 1 0

[6,] 1 0

[7,] 1 0

[8,] 1 0

[9,] 1 1

[10,] 1 1

[11,] 1 1

[12,] 1 1

[13,] 1 1

[14,] 1 1

[15,] 1 1

[16,] 1 1


Fit a linear model
Fit a linear model of distribution)

> fit <- lmFit(MA,design=design)

> fit

An object of class "MArrayLM"

$coefficients

WT-Ref KO-WT

[1,] -0.65953808 0.63931974

[2,] 0.22938847 0.65516034

[3,] -0.25176365 0.33421048

[4,] -0.05169672 0.04049065

[5,] -0.25006850 0.22302129

6379 more rows ...


Ebayes fit
Ebayes fit of distribution)

  • fit <- eBayes(fit)


Significance
Significance of distribution)

  • $p.value

  • WT-Ref KO-WT

  • [1,] 0.0005273336 0.009821223

  • [2,] 0.0840905898 0.001674820

  • [3,] 0.2514756923 0.280677844

  • [4,] 0.5306727045 0.727393168

  • [5,] 0.0140994442 0.103570036

  • 6379 more rows ...

  • $lods

  • WT-Ref KO-WT

  • [1,] -0.1466897 -2.581616

  • [2,] -4.8450256 -1.065213

  • [3,] -5.7239418 -5.260559

  • [4,] -6.2106950 -5.777607

  • [5,] -3.2510578 -4.528469

  • 6379 more rows ...

  • $F

  • [1] 8.8907738 26.5502551 0.7775771 0.2140163 3.7423046

  • 6379 more elements ...

  • $F.p.value

  • [1] 2.085929e-03 4.429307e-06 4.744446e-01 8.093730e-01 4.388436e-02

  • 6379 more elements ...


Top table
Top table of distribution)

> topTable(fit,coef="KO-WT",adjust="fdr")

GridROW GridCOL ROW COL NAME TYPE CLID ACC

2149 2 2 8 7 ApoAI,lipid-Img cDNA 1077520

540 1 2 7 15 EST,HighlysimilartoA cDNA 439353

5356 4 2 9 1 CATECHOLO-METHYLTRAN cDNA 1350232

4139 3 3 8 2 EST,WeaklysimilartoC cDNA 374370

1739 2 1 7 17 ApoCIII,lipid-Img cDNA 483614

2537 2 3 7 17 ESTs,Highlysimilarto cDNA 483614

1496 1 4 15 5 est cDNA 484183 genome.wustl

4941 4 1 8 6 similartoyeaststerol cDNA 737183

947 1 3 8 2 EST,WeaklysimilartoF cDNA 353292

5604 4 3 1 18 cDNA 317638

M A t P.Value B

2149 -3.1661645 12.46803 -23.980007 3.045649e-11 14.9266311

540 -3.0485504 12.28064 -12.962085 5.020967e-07 10.8133421

5356 -1.8481659 12.92660 -12.440765 6.505619e-07 10.4479375

4139 -1.0269537 12.60572 -11.756162 1.209061e-06 9.9285280

1739 -0.9325824 13.73744 -9.835355 1.563760e-05 8.1924099

2537 -1.0098117 13.63012 -9.015243 4.222433e-05 7.3048994

1496 -0.9774236 12.22891 -9.002324 4.222433e-05 7.2901336

4941 -0.9549693 13.28750 -7.441333 5.617334e-04 5.3106653

947 -0.5705900 10.54291 -4.554709 1.768067e-01 0.5629149

5604 -0.3663573 12.71267 -3.962815 5.290409e-01 -0.5532768


Ma plots1
MA plots of distribution)

> plotMA(fit, 2)


Venn diagram
Venn diagram of distribution)

> results <- decideTests(fit)

> vennDiagram(results)


Switch function
Switch function of distribution)

center = function(x, type) {

switch(type,

mean = mean(x),

median = median(x),

trimmed = mean(x, trim = .1))

}


Switch function1
Switch function of distribution)

The switch function uses the value of its first argument to determine which of its remaining arguments to evaluate and return. The first argument can be either an integer index, or a character string to be used in matching one of the following arguments.


Apply function
Apply function of distribution)

  • For example,

    > apply(x, 2, mean)

  • computes the column means of a matrix x, while

    > apply(x, 1, median)

  • computes the row medians.


ad