##### More Microarray Analysis: Unsupervised Approaches


**More Microarray Analysis: Unsupervised Approaches**
Matt Hibbs, Troyanskaya Lab

**Outline**
• Gene Expression vs. DNA applications
• A little more normalization (missing values)
• Unsupervised Analysis
• Basic Clustering
• Statistical Enrichment
• PCA/SVD
• Advanced Clustering
• Search-based Approaches

**Expression / DNA**
• Some analysis concepts are shared, but the goals are often very different
• Expression – clustering, guilt by association, functional enrichment
• DNA – signal processing, spatial relationships, motif finding
• Visualized differently (heat maps vs. karyoscope)

**The missing value problem**
• Microarrays can have systematic or random missing values
• Some algorithms can’t deal with missing values (PCA/SVD in particular)
• Instead of hoping missing values won’t bias the analysis, it is better to estimate them accurately

**KNN Impute**
• Idea: use genes with similar expression profiles to estimate missing values

With the missing value:
Gene X: 2 | ? | 5 | 7 | 3 | 1
Gene A: 2 | 4 | 5 | 7 | 3 | 2
Gene B: 8 | 9 | 2 | 1 | 4 | 9
Gene C: 3 | 5 | 6 | 7 | 3 | 2

After imputation:
Gene X: 2 | 4.3 | 5 | 7 | 3 | 1
Gene A: 2 | 4 | 5 | 7 | 3 | 2
Gene B: 8 | 9 | 2 | 1 | 4 | 9
Gene C: 3 | 5 | 6 | 7 | 3 | 2

**Imputation affects downstream analysis**
[Figure panels:]
• Complete data set
• Data set with 30% of entries missing and filled with zeros (zero values appear black)
• Data set with missing values estimated by the KNNimpute algorithm

**Unsupervised Analysis**
• Supervised techniques are great if you have starting information (e.g. labels)
• But we often don’t know enough beforehand to apply these methods
• Unsupervised techniques are exploratory
• Let the data organize itself, then try to find biological meaning
• Approaches to understand the whole data set
• Visualization is often helpful

**Clustering**
• Let the data organize itself
• Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups)
• Identify subsets of genes (or experiments) that are related by some measure

**Quick Example**
[Figure; axes: Conditions, Genes]

**Why cluster?**
• “Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway
• Visualization: datasets are too large to extract information from without reorganizing the data

**Clustering Techniques**
• Algorithm (Method)
  • Hierarchical
  • K-means
  • Self Organizing Maps
  • QT-Clustering
  • NNN
  • …
• Distance Metric
  • Euclidean (L2)
  • Pearson Correlation
  • Spearman Correlation
  • Manhattan (L1)
  • Kendall’s τ
  • …

**Distance Metrics**
• Choice of distance measure is important for most clustering techniques
• Pair-wise metrics – compare vectors of numbers
• e.g. genes x & y, each with n measurements
• Euclidean Distance
• Pearson Correlation
• Spearman Correlation

**Distance Metrics**
[Figure panels: Euclidean Distance, Pearson Correlation, Spearman Correlation]

**Hierarchical clustering**
• Imposes (pair-wise) hierarchical structure on all of the data
• Often good for visualization
• Basic Method (agglomerative):
  1. Calculate all pair-wise distances
  2. Join the closest pair
  3. Calculate the pair’s distance to all others
  4. Repeat from 2 until all joined

**HC – Interior Distances**
• Three typical variants to calculate interior distances within the tree
  • Average linkage: mean/median over all possible pair-wise values
  • Single linkage: minimum pair-wise distance
  • Complete linkage: maximum pair-wise distance

**Hierarchical clustering: problems**
• Hard to define distinct clusters
• Genes assigned to clusters on the basis of all experiments
• Optimizing node ordering is hard (finding the optimal solution is NP-hard)
• Can be driven by one strong cluster – a problem for gene expression because data in row space is often highly correlated

**HC: Real Example**
• Demo in JavaTreeView & HIDRA
• Spellman et al., 1998: yeast alpha-factor sync cell cycle timecourse

**HC: Another Example**
• Expression of tumors hierarchically clustered
• Expression groups by clinical class
Garber et al.

**K-means Clustering**
• Groups genes into a pre-defined number of independent clusters
• Basic algorithm:
  1. Define k = number of clusters
  2. Randomly initialize each cluster with a seed (often with a random gene)
  3. Assign each gene to the cluster with the most similar seed
  4. Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster
  5. Repeat 3 & 4 until convergence (e.g. no genes move, means don’t change much, etc.)

**K-means: problems**
• Have to set k ahead of time
  • Ways to choose “optimal” k: minimize within-cluster variation compared to random data or held-out data
• Each gene belongs to exactly 1 cluster
• One cluster has no influence on the others (one dimensional clustering)
• Genes assigned to clusters on the basis of all experiments

**K-means: Real Example**
• Demo in TIGR MeV
• Spellman et al. alpha-factor cell cycle

**Clustering “Tweaks”**
• Fuzzy clustering – allows genes to be “partially” in different clusters
• Dependent clusters – consider between-cluster distances as well as within-cluster
• Bi-clustering – look for patterns across subsets of conditions
  • Very hard problem (NP-complete)
  • Practical solutions use heuristics/simplifications that may affect biological interpretation

**Cluster Evaluation**
• Mathematical consistency
  • Compare coherency of clusters to background
• Look for functional consistency in clusters
  • Requires a gold standard, often based on GO, MIPS, etc.
• Evaluate likelihood of enrichment in clusters
  • Hypergeometric distribution, etc.
  • Several tools available

**Gene Ontology**
• Organization of curated biological knowledge
• 3 branches: biological process, molecular function, cellular component

**Hypergeometric Distribution**
• Probability of observing x or more genes in a cluster of n genes with a common annotation
  • N = total number of genes in genome
  • M = number of genes with annotation
  • n = number of genes in cluster
  • x = number of genes in cluster with annotation
• Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.)
• Additional genes in clusters with strong enrichment may be related

**GO term Enrichment Tools**
• SGD’s & Princeton’s GoTermFinder
  • http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM)
• HIDRA
Sealfon et al., 2006

**More Unsupervised Methods**
• Search-based approaches
  • Starting with a query gene/condition, find the most related group
• Singular Value Decomposition (SVD) & Principal Component Analysis (PCA)
  • Decomposition of data matrix into “patterns”, “weights” and “contributions”
  • Real names are “principal components”, “singular values” and “left/right eigenvectors”
  • Used to remove noise, reduce dimensionality, identify common/dominant signals

**SVD (& PCA)**
• SVD is the method; PCA is performing SVD on centered data
• Projects data into another orthonormal basis
• New basis ordered by variance explained
• X = U Σ V^t
[Figure labels: original data matrix, singular values, “eigen-genes”, “eigen-conditions”]

**SVD**
[Figure]

**SVD: Real Example**
• Demo in TIGR MeV
• Spellman et al., 1998 cell cycle time courses
  • alpha-factor sync
  • cdc15 sync

**DNA arrays / Sequence-based Analysis**
• Methods so far focused on expression data
• Other uses of microarrays are often sequence based: CGH, ChIP-chip, SNP scanner
• Data has important, inherent order
• Most analysis methods developed from signal processing techniques (e.g. sound)
• View data in chromosomal order (karyoscope)
• Tools: JavaTreeView, IGB, ChIPpy

**CGH Example**
• Demo in JavaTreeView

**Aneuploidy affects expression too**
rpl20aΔ/rpl20aΔ, Chromosome XV (data from Hughes et al. (2000))

**Software Tools**
• JavaTreeView – viz, karyoscope
• HIDRA – viz, mult. datasets, search
• Cluster (Eisen lab) – clustering
• TIGR MeV – clustering, viz
• IGB – Affy’s CGH browser
• ChIPpy – ChIP-chip analysis

**Summary**
• Unsupervised Analysis
  • Let the data organize itself, find patterns
• Clustering: Distance Metric + Algorithm
• SVD/PCA – auto find dominant patterns
• Impute missing values (KNN)
• CGH – Karyoscope view
• Questions?
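
**Sketch: KNN Impute**
The idea on the KNN Impute slide (estimate a missing value from genes with similar profiles) can be sketched in a few lines of Python. This is a minimal illustration and not the KNNimpute implementation itself: the function name `knn_impute`, the use of Euclidean distance over the observed columns, and the unweighted mean of the neighbours are all assumptions made here.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries from the k most similar complete rows.

    Assumptions (not from the slides): Euclidean distance over the
    columns observed in the target row, unweighted neighbour mean,
    and at least one complete row being available.
    """
    complete = [row for row in matrix if None not in row]
    result = []
    for row in matrix:
        if None not in row:
            result.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Distance uses only the columns the incomplete row has.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        filled = [
            v if v is not None else sum(nb[j] for nb in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ]
        result.append(filled)
    return result

genes = [
    [2, None, 5, 7, 3, 1],  # Gene X
    [2, 4, 5, 7, 3, 2],     # Gene A
    [8, 9, 2, 1, 4, 9],     # Gene B
    [3, 5, 6, 7, 3, 2],     # Gene C
]
imputed = knn_impute(genes, k=2)
print(imputed[0])  # [2, 4.5, 5, 7, 3, 1]
```

With k = 2 the neighbours are Gene A and Gene C, so the unweighted mean is 4.5 rather than the 4.3 shown on the slide; the slide presumably uses a different weighting, which this sketch does not try to reproduce.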
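
**Sketch: Distance Metrics**
The three pair-wise measures named on the Distance Metrics slides can be written out for two vectors x and y with n measurements each. The function names below are illustrative, and the rank helper (average ranks for ties) is one common convention assumed here, not something the slides specify.

```python
import math

def euclidean(x, y):
    """Straight-line (L2) distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotone but non-linear in x
print(euclidean(x, y), pearson(x, y), spearman(x, y))
```

Because y increases monotonically with x, the Spearman correlation is 1 (up to rounding) while the Pearson correlation is slightly lower. To use a correlation as a clustering distance, a common choice is 1 − r.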
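
**Sketch: Hierarchical clustering**
The agglomerative method from the Hierarchical clustering slide (compute pair-wise distances, join the closest pair, repeat until all are joined), together with the three linkage variants, might look like the following. This is a deliberately naive sketch: the function name `hierarchical`, the Euclidean metric, and the choice to recompute distances from the raw points at every step are assumptions for readability, not how tools such as Cluster implement it.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hierarchical(points, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance).

    Clusters are tuples of row indices into `points`.
    """
    combine = {
        "single": min,                        # minimum pair-wise distance
        "complete": max,                      # maximum pair-wise distance
        "average": lambda d: sum(d) / len(d)  # mean pair-wise distance
    }[linkage]

    def cluster_distance(a, b):
        pairwise = [euclidean(points[i], points[j]) for i in a for j in b]
        return combine(pairwise)

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        d, a, b = min(
            ((cluster_distance(a, b), a, b)
             for idx, a in enumerate(clusters)
             for b in clusters[idx + 1:]),
            key=lambda t: t[0],
        )
        # Join them and continue with one fewer cluster.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

points = [[0, 0], [0, 1], [5, 5], [5, 7]]
merges = hierarchical(points, linkage="average")
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The merge history is what a dendrogram draws: rows 0 and 1 join first, then rows 2 and 3, and the two pairs join last.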
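
**Sketch: K-means**
The basic algorithm on the K-means Clustering slide (seed, assign, recalculate, repeat until nothing moves) can be sketched as below. The function name `kmeans`, the squared Euclidean distance, the `max_iter` cap, and the rule of keeping the old seed when a cluster empties are assumptions added for this illustration.

```python
import random

def kmeans(points, k, rng=None, max_iter=100):
    """Return (labels, centers) for a basic k-means run."""
    rng = rng or random.Random()
    # Seed each cluster with a randomly chosen data point.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign each point to the cluster with the nearest seed.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # convergence: no point moved
            break
        labels = new_labels
        # Recalculate each seed as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: keep the old seed if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2, rng=random.Random(0))
print(labels)
```

Which cluster gets index 0 depends on the random seeds, so only the grouping is meaningful, not the label values. Passing a seeded `random.Random` makes a run repeatable.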
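
**Sketch: Hypergeometric enrichment**
The quantity described on the Hypergeometric Distribution slide, the probability of seeing x or more annotated genes in a cluster of n, is the upper tail of the hypergeometric distribution. The sketch below uses the slide's symbols N, M, n and x; the function names and the simple Bonferroni helper are illustrative assumptions.

```python
from math import comb

def enrichment_pvalue(N, M, n, x):
    """P(at least x annotated genes in a cluster of n).

    N = genes in genome, M = genes with the annotation,
    n = genes in cluster, x = annotated genes in cluster.
    """
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)

def bonferroni(p, num_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p * num_tests)

# Small worked case: 10 genes, 4 annotated, cluster of 3, 2 or more annotated.
p = enrichment_pvalue(N=10, M=4, n=3, x=2)
print(p)  # (36 + 4) / 120 = 1/3
print(bonferroni(p, num_tests=5))
```

In a real analysis one p-value is computed per annotation term tested, which is why the correction step matters.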
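
**Sketch: dominant pattern via PCA**
To make "PCA is performing SVD on centered data" concrete, the sketch below centers the columns of a small matrix and finds the dominant right singular vector and its singular value. Note the swapped-in technique: it uses power iteration in plain Python rather than a full SVD routine, so it recovers only the first component. The function name and the fixed starting vector are assumptions; in practice a library routine such as `numpy.linalg.svd` would normally be used.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, singular_value) of the dominant component.

    Power iteration on X^T X for the column-centered matrix X.
    Assumes the all-ones start vector is not orthogonal to the
    dominant direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    def project(v):
        # X v: coordinate of every row along direction v.
        return [sum(a * b for a, b in zip(row, v)) for row in centered]

    v = [1.0] * d
    for _ in range(iterations):
        xv = project(v)
        # w = X^T (X v)
        w = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    sigma = math.sqrt(sum(s * s for s in project(v)))
    return v, sigma

# Three measurements lying exactly on the line y = 2x.
rows = [[1, 2], [2, 4], [3, 6]]
direction, sigma = first_principal_component(rows)
print(direction, sigma)
```

For this toy data all variance lies along one direction, so the recovered vector is proportional to (1, 2) and the remaining singular value would be zero.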