
Microarray data analysis

Microarray data analysis. 25 January 2006. David A. McClellan, Ph.D. Introduction to Bioinformatics david_mcclellan@byu.edu Brigham Young University Dept. Integrative Biology. Inferential statistics. Inferential statistics are used to make inferences about a population from a sample.


Presentation Transcript


  1. Microarray data analysis 25 January 2006 David A. McClellan, Ph.D. Introduction to Bioinformatics david_mcclellan@byu.edu Brigham Young University Dept. Integrative Biology

  2. Inferential statistics Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject or fail to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05. Page 199

  3. Inferential statistics A t-test is a commonly used test statistic to assess the difference in mean values between two groups. $t = \frac{\bar{x}_1 - \bar{x}_2}{s} = \frac{\text{difference between mean values}}{\text{variability (noise)}}$ Questions: Is the sample size (n) adequate? Are the data normally distributed? Is the variance of the data known? Is the variance the same in the two groups? Is it appropriate to set the significance level to p < 0.05? Page 199
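
As a rough illustration of the ratio on this slide, the sketch below computes a two-sample t statistic. It assumes the two groups share one variance (a pooled estimate), which is only one way to define the noise term s; the function name `pooled_t` and the intensity values are made up for the example.

```python
import math
import statistics

def pooled_t(group1, group2):
    """Two-sample t statistic, assuming both groups share one variance."""
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Pooled variance: the two sample variances weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    # Noise term s: standard error of the difference between the two means.
    s = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return mean_diff / s

normal = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical signal intensities
diseased = [2.0, 4.0, 6.0, 8.0, 10.0]   # hypothetical signal intensities
t = pooled_t(normal, diseased)
print(f"t = {t:.3f} with {len(normal) + len(diseased) - 2} degrees of freedom")
```

If the variances clearly differ between the groups, a statistic that does not pool them (Welch's form) would be the safer choice.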

  4. Inferential statistics
     Paradigm                      Parametric test    Nonparametric test
     Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
     Compare two paired groups     Paired t-test      Wilcoxon test
     Compare 3 or more groups      ANOVA
     Page 198-200

  5. ANOVA ANalysis Of VAriance. ANOVA calculates the probability that the observed differences among several conditions would arise if all conditions came from the same distribution.
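
A minimal sketch of the quantity behind a one-way ANOVA: the F statistic, which compares variation between condition means with variation within conditions. The function name `one_way_f` and the data are hypothetical. Turning F into a probability requires the F distribution, which this stdlib-only sketch does not include.

```python
import statistics

def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its two degrees of freedom."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of values around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical expression values for one gene under three conditions.
conditions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(conditions)
print(f"F = {f_stat:.2f} on {df_b} and {df_w} degrees of freedom")
```

A large F means the condition means differ by more than the within-condition noise would explain.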

  6. Parametric vs. Nonparametric Parametric tests (t-tests & ANOVAs) are applied to data sets that are sampled from a normal distribution. Nonparametric tests do not make assumptions about the population distribution; they rank the outcome variable from low to high and analyze the ranks.

  7. Mann-Whitney test (a two-sample rank test) Actual measurements are not employed; the ranks of the measurements are used instead. n1 and n2 are the numbers of observations in samples 1 and 2, and R1 is the sum of the ranks of the observations in sample 1.
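
The formula on this slide did not survive in the transcript. The sketch below uses one common textbook form built from the same symbols, U = n1·n2 + n1(n1 + 1)/2 − R1, with ranks assigned from low to high and tied values sharing their mean rank. Treat the choice of form as an assumption; the helper names and data are hypothetical.

```python
def average_ranks(values):
    """Rank values from low to high (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1   # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """Return (U, U') computed from the rank sum of sample 1."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])                      # R1: sum of the ranks in sample 1
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u_prime = n1 * n2 - u                     # the complementary statistic
    return u, u_prime

sample1 = [1.1, 2.2, 3.3]          # hypothetical measurements
sample2 = [4.4, 5.5, 6.6, 7.7]     # hypothetical measurements
u, u_prime = mann_whitney_u(sample1, sample2)
print(u, u_prime)
```

The statistic is then compared with a critical value for the given n1 and n2, such as those in a Mann-Whitney table.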

  8. Mann-Whitney example

  9. Mann-Whitney table

  10. Wilcoxon paired-sample test A nonparametric analogue to the paired-sample t-test, just as the Mann-Whitney test is a nonparametric procedure analogous to the unpaired-sample t-test
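
A hedged sketch of how the paired-sample statistic is usually built: take the difference within each pair, drop zero differences, rank the absolute differences, and sum the ranks separately for positive and negative differences. The function name and the paired values are hypothetical, and dropping zero differences is one common convention rather than the only one.

```python
def signed_rank_sums(before, after):
    """Return (T_plus, T_minus): rank sums of positive and negative differences."""
    diffs = [a - b for b, a in zip(before, after) if a != b]   # drop zero differences
    by_size = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(by_size):
        end = start
        while (end + 1 < len(by_size)
               and abs(diffs[by_size[end + 1]]) == abs(diffs[by_size[start]])):
            end += 1
        for k in range(start, end + 1):
            ranks[by_size[k]] = (start + end) / 2 + 1   # tied sizes share a mean rank
        start = end + 1
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus

before = [10, 12, 9, 15, 11]    # hypothetical measurements, first condition
after = [12, 11, 13, 18, 11]    # the same five samples, second condition
t_plus, t_minus = signed_rank_sums(before, after)
t_stat = min(t_plus, t_minus)   # the smaller sum is compared with a table value
print(t_plus, t_minus, t_stat)
```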

  11. Wilcoxon example

  12. Wilcoxon table

  13. Inferential statistics Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But even if no gene is truly regulated, you can expect about 5% of them (500 genes) to appear regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes: p < (0.05)/10,000 or p < 5 x 10^-6 Page 199
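
The arithmetic on this slide can be checked in a few lines. The gene names and p-values below are invented purely to show how the corrected threshold would be applied.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

alpha = 0.05
n_genes = 10_000

# Genes expected to pass p < alpha by chance if no gene is truly regulated.
expected_by_chance = alpha * n_genes
threshold = bonferroni_threshold(alpha, n_genes)

# Hypothetical p-values for three genes.
p_values = {"geneA": 3e-7, "geneB": 0.004, "geneC": 0.03}
significant = [gene for gene, p in p_values.items() if p < threshold]

print(f"expected by chance: {expected_by_chance:.0f} genes")
print(f"corrected threshold: {threshold:.1e}")
print("significant after correction:", significant)
```

All three hypothetical genes pass the uncorrected 0.05 cutoff, but only one survives the corrected threshold, which shows how conservative the correction is.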

  14. Significance analysis of microarrays (SAM) SAM -- an Excel plug-in -- URL: www-stat.stanford.edu/~tibs/SAM -- modified t-test -- adjustable false discovery rate Page 200

  15. Page 202

  16. [Figure: observed versus expected values, with up-regulated and down-regulated genes labeled] Page 202

  17. Descriptive statistics Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are: -- Euclidean distance -- Pearson coefficient of correlation Page 203

  18. Euclidean Distance

  19. Pearson Correlation Coefficient
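
The formulas for slides 18 and 19 were not preserved in the transcript, so the sketch below uses the standard definitions of both metrics. The two expression profiles are hypothetical and were chosen so that one is exactly twice the other.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

gene1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical profile across four samples
gene2 = [2.0, 4.0, 6.0, 8.0]   # same shape, twice the magnitude
d = euclidean_distance(gene1, gene2)
r = pearson_r(gene1, gene2)
print(f"Euclidean distance = {d:.3f}, Pearson r = {r:.3f}")
```

The two profiles are far apart by Euclidean distance yet perfectly correlated, so the choice of metric decides whether genes are grouped by absolute level or by the shape of their expression pattern.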

  20. Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204

  21. Agglomerative clustering [Figure, steps 0 to 4: objects a, b, c, d, e; a and b merge into (a,b)] Page 206

  22. Agglomerative clustering [Figure, steps 0 to 4: d and e merge into (d,e)] Page 206

  23. Agglomerative clustering [Figure, steps 0 to 4: c joins (d,e) to form (c,d,e)] Page 206

  24. Agglomerative clustering [Figure, steps 0 to 4: (a,b) and (c,d,e) merge into (a,b,c,d,e)] …tree is constructed Page 206
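
The merge order on slides 21 to 24 can be reproduced with a small agglomerative routine. The slides do not say which linkage rule or coordinates were used, so this sketch assumes single linkage (distance between the closest members of two clusters) and places the five objects at invented positions on a line.

```python
from itertools import combinations

def single_linkage_merges(points):
    """Repeatedly merge the two closest clusters; return the merged clusters in order."""
    clusters = [(name,) for name in sorted(points)]
    merges = []

    def gap(pair):
        # Single linkage: distance between the closest members of the two clusters.
        left, right = pair
        return min(abs(points[p] - points[q]) for p in left for q in right)

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        merged = tuple(sorted(left + right))
        clusters = [c for c in clusters if c not in (left, right)] + [merged]
        merges.append(merged)
    return merges

# Hypothetical one-dimensional positions for the five objects.
positions = {"a": 1.0, "b": 1.5, "c": 3.8, "d": 5.0, "e": 5.8}
merges = single_linkage_merges(positions)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

With other linkage rules (complete or average) or other positions, the same objects can merge in a different order.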

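  25. Divisive clustering [Figure, steps 4 to 0: all objects start in one cluster (a,b,c,d,e)] Page 206

  26. Divisive clustering [Figure, steps 4 to 0: (c,d,e) is split off from (a,b,c,d,e)] Page 206

  27. Divisive clustering [Figure, steps 4 to 0: (d,e) is split off from (c,d,e)] Page 206

  28. Divisive clustering [Figure, steps 4 to 0: (a,b) is shown as the other branch of (a,b,c,d,e)] Page 206

  29. Divisive clustering [Figure, steps 4 to 0: the remaining clusters are split into the single objects a, b, c, d, e] …tree is constructed Page 206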

  30. [Figure: the same tree of a, b, c, d, e labeled in both directions: agglomerative steps numbered 0 to 4, divisive steps numbered 4 to 0] Page 206

  31. [Figure] Page 207

  32. Cluster and TreeView Page 208

  33. Cluster and TreeView: clustering, K-means, SOM, PCA Page 208

  34. Cluster and TreeView Page 208

  35. Cluster and TreeView Page 208

  36. Page 208

  37. Page 208

  38. Page 208

  39. Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000) Page 209

  40. Self-Organizing Maps (SOM) To download GeneCluster: http://www.genome.wi.mit.edu/MPR/software.html

  41. SOMs are unsupervised neural net algorithms that identify coregulated genes Page 211

  42. Two pre-processing steps essential to apply SOMs: (1) Variation filtering: data are passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes. (2) Normalization: the expression level of each gene is normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
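
A minimal sketch of the two steps, under stated assumptions: the filter here drops genes whose max minus min falls below a cutoff, and normalization rescales each gene to mean 0 and standard deviation 1 across experiments. The slide does not specify either rule, and the cutoff, gene names, and values are hypothetical.

```python
import statistics

def variation_filter(expression, min_range):
    """Keep only genes whose expression range across samples reaches min_range."""
    return {gene: values for gene, values in expression.items()
            if max(values) - min(values) >= min_range}

def normalize_gene(values):
    """Rescale one gene to mean 0 and standard deviation 1 across experiments."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # nonzero for any gene that passed the filter
    return [(v - mean) / sd for v in values]

expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [10.0, 20.0, 30.0],
    "flat": [5.0, 5.1, 5.0],     # nearly invariant, so it should be filtered out
}
kept = variation_filter(expression, min_range=0.5)
normalized = {gene: normalize_gene(values) for gene, values in kept.items()}
for gene, values in normalized.items():
    print(gene, [round(v, 3) for v in values])
```

After normalization geneA and geneB have identical profiles even though their absolute levels differ tenfold, which is the focus on 'shape' that the slide describes.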

  43. Principal components analysis (PCA) An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D. For a matrix of m genes x n samples, create a new covariance matrix of size n x n. This transforms a large number of variables into a smaller number of uncorrelated variables called principal components (PCs). Page 211
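
To make the covariance step concrete, the sketch below builds the n x n covariance matrix for a tiny m x n matrix and finds the first principal component by power iteration. Power iteration is a stand-in chosen to keep the example dependency-free; real analyses normally use a full eigendecomposition or SVD from a numerical library. The data and function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """rows: m genes x n samples. Returns the n x n sample covariance matrix."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (m - 1)
             for j in range(n)] for i in range(n)]

def leading_component(cov, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n   # assumes this start is not orthogonal to PC1
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v

# Three genes measured in two samples; sample 2 is exactly twice sample 1.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance_matrix(expression)
eigenvalue, pc1 = leading_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
print(f"PC1 direction: {[round(x, 3) for x in pc1]}")
print(f"variance explained by PC1: {explained:.0%}")
```

Because the two samples are perfectly correlated here, a single component carries all of the variance. The percentages printed on PCA plot axes are this same ratio computed for each component.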

  44. Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines. [Figure: samples P1-P4 (lead), N2-N4 (sodium), and C1-C4 (control) plotted on principal component axis #1 (87%) versus principal component axis #2 (10%); PC #3: 1%]

  45. Principal components analysis (PCA): objectives • to reduce dimensionality • to determine the linear combination of variables • to choose the most useful variables (features) • to visualize multidimensional data • to identify groups of objects (e.g. genes/samples) • to identify outliers Page 211

  46. http://www.okstate.edu/artsci/botany/ordinate/PCA.htm Page 212

  47. http://www.okstate.edu/artsci/botany/ordinate/PCA.htm Page 212

  48. Page 212

  49. Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain [Figure label: Chr 21]
