
TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data

Mohammed J. Zaki & Lizhuang Zhao, Department of Computer Science, Rensselaer Polytechnic Institute (RPI), Troy, NY. {zhaol2, zaki}@cs.rpi.edu


Presentation Transcript


  1. TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data • Mohammed J. Zaki & Lizhuang Zhao • Department of Computer Science, Rensselaer Polytechnic Institute (RPI), Troy, NY • {zhaol2, zaki}@cs.rpi.edu

  2. Microarray Data • Essential source of information about the Gene Expression within a cell • Typically 2D: Genes x Samples (Genes x Time) • Measure the expression level of genes in different samples • Labeled samples: Classification (cancer vs. non-cancer) • Non-labeled samples: Clustering (Bi-clusters) • Goal: Identify the “expression” patterns, providing clues to the gene regulatory networks within a cell

  3. Why Biclustering? • Some genes are similarly expressed in only some samples • [Figure: two gene × sample grids (g1 to g5 by s1 to s5) contrasting a bicluster, (g2, g4, g5) × (s2, s3, s5), with a full-space cluster, (g2, g4, g5)]

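  4. Different “Homogeneity” or Similarity Criteria • Constant: column, row, all • Scaling/Shifting (example: scale = 1.4, shift = 0.4) • Order preserving (example order: 2 1 3) • [Figure: example matrices for each criterion, arranged toward the more general ones] • Note: small noise ε is allowed in all expression values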

  5. Why TriCluster? • Typical microarray data is 2D (gene x sample) • Temporal expression very important tool • How does gene expression evolve in time? • Find clusters over genes x samples x time • Spatial expression also of interest • How does gene expression differ in space (e.g., different regions of mouse brain)? • Find clusters over gene x samples x space • Combine temporal and spatial expression • Find clusters over gene x time x space, etc. • There is an emerging need to mine 3D data

  6. TriCluster: Our Contributions • First algorithm to mine tri-clusters in 3D microarray data • Complete and deterministic • Mine maximal clusters satisfying given homogeneity criteria • Constant: column, row, all • Scaling & Shifting • Clusters can be overlapping; optionally delete/merge clusters having large overlap • Propose a set of metrics for cluster evaluation • Use Gene Ontology (GO) to assess biological significance

  7. Definitions • G is a set of genes {g_0, g_1, …, g_(n−1)} • S is a set of samples {s_0, s_1, …, s_(m−1)} • T is a set of time courses {t_0, t_1, …, t_(l−1)} • 3D real-valued dataset D = {d_ijk}, indexed by G × S × T • d_ijk is the expression value of gene g_i in sample s_j at time t_k • A triCluster is a maximal submatrix of D that satisfies some homogeneity conditions • C = X × Y × Z = {c_ijk} • X ⊆ G, Y ⊆ S, Z ⊆ T • The homogeneity conditions are given as input

  8. Scaling triCluster Example • [Figure: 3D example of a scaling tricluster over the Genes, Samples, and Time axes, annotated with the scaling ratios] • Note: small noise ε is allowed

  9. TriCluster Concepts • C = X × Y × Z = {c_ijk} is a triCluster iff • C is maximal (no valid C′ ⊃ C) • C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt • Noise/error threshold ε is satisfied for any C22, where C22 is an arbitrary 2×2 submatrix of C • Let r_i = |c_ia / c_ib| and r_j = |c_ja / c_jb| • max(r_i, r_j) / min(r_i, r_j) − 1 ≤ ε • Range threshold δ^a is satisfied for each dimension a • δ = |c_ijk − c_xyz| • If j = y and k = z, then δ ≤ δ^g (similarly define δ^s, δ^t)
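The noise condition on 2×2 submatrices can be illustrated on a single 2D slice. The sketch below reads the condition as max(r_i, r_j) / min(r_i, r_j) − 1 ≤ ε and assumes nonzero expression values; the function name `ratio_coherent` is hypothetical and is not taken from the slides. For a fixed column pair, checking the largest ratio against the smallest covers every pair of rows at once.

```python
from itertools import combinations

def ratio_coherent(matrix, eps):
    """Check the 2x2 ratio condition on a 2D slice (rows = genes, cols = samples).

    For each column pair (a, b), every row i has a ratio r_i = |c_ia / c_ib|.
    The condition max(r_i, r_j) / min(r_i, r_j) - 1 <= eps holds for all row
    pairs exactly when it holds for the largest and smallest ratio.
    Assumes all values are nonzero.
    """
    n_cols = len(matrix[0])
    for a, b in combinations(range(n_cols), 2):
        ratios = [abs(row[a] / row[b]) for row in matrix]
        if max(ratios) / min(ratios) - 1 > eps:
            return False
    return True

# Rows that are near-multiples of each other pass with a loose eps.
slice_2d = [[1.0, 2.0, 4.0],
            [3.0, 6.0, 12.1]]
print(ratio_coherent(slice_2d, eps=0.05))
```

A full tricluster check would apply the same test to 2×2 submatrices taken along every pair of dimensions, not only genes × samples.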

  10. TriCluster Flexibility • Cluster definition is symmetric • Any ordering of dimensions allowed • A/C ≈ B/D ↔ A/B ≈ C/D ↔ AD ≈ BC • Can mine several types of clusters • Typically ε ≈ 0 to allow small noise/error • Approx. constant cluster: δ^g ≈ 0 and δ^s ≈ 0 and δ^t ≈ 0 • Approx. single-dim constant: δ^g ≈ 0 or δ^s ≈ 0 or δ^t ≈ 0 • Approx. two-dim constant: (δ^g ≈ 0 and δ^s ≈ 0) or (δ^g ≈ 0 and δ^t ≈ 0) or (δ^s ≈ 0 and δ^t ≈ 0) • Scaling cluster: δ^g, δ^s, and δ^t are unconstrained • Shifting cluster: if e^C is a scaling cluster, then C is a shifting cluster
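The last bullet, relating shifting clusters to scaling clusters, follows from taking logarithms. A short derivation for two columns a and b and any row i, where α and β are illustrative symbols that do not appear on the slide:

```latex
e^{c_{ib}} = \alpha \, e^{c_{ia}} \quad (\alpha > 0)
\;\Longrightarrow\;
c_{ib} = c_{ia} + \ln \alpha = c_{ia} + \beta,
\qquad \beta = \ln \alpha .
```

So a constant multiplicative factor between columns of e^C corresponds to a constant additive offset between the same columns of C, which is why a scaling-cluster miner run on exponentiated values can also find shifting clusters.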

  11. TriCluster Algorithm • Compute maximal biclusters on G x S for each time slice t  T • Construct range multigraph • Find maximal cliques • Compute triclusters from biclusters • Construct new multigraph (T x biclusters) • Find maximal cliques • Merge/Prune overlapping clusters

  12. Maximal Biclusters • Mine each G × S time slice for maximal biclusters • For each pair of samples, find the ratio ranges that are valid within ε, together with their gene-sets • Construct a range multigraph • Mine maximal cliques • Each clique/cluster can contribute to some valid tricluster

  13. Valid Ratio Ranges: Each Column Pair • [Figure: range example showing the original data and the data after row/column permutation] • Take the ratio of s0 and s6 and construct valid ranges: • A range contains at least mg values within ε (noise threshold) • ε = 0.05, mg = 3; then 3.0 × (1 + ε) = 3.15 ⇒ range = [3, 3.15] • Other ranges = [3.3, 3.465], and so on • Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}
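The range construction for one column pair can be sketched as a sliding window over the sorted ratios. This is a simplified illustration under stated assumptions, not the authors' procedure: it assumes positive expression values, the name `valid_ratio_ranges` is hypothetical, and it keeps a window [r, r × (1 + ε)] only if it holds at least mg genes and its gene-set is not already contained in a kept range.

```python
def valid_ratio_ranges(col_a, col_b, eps, mg):
    """Return (low, high, gene_set) ranges for the column pair col_a / col_b.

    Each range [low, low * (1 + eps)] must contain the ratios of at least
    mg genes. Genes are identified by their row index.
    Assumes all values are positive.
    """
    # (ratio, gene index) pairs sorted by ratio
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges = []
    for low, _ in ratios:
        high = low * (1 + eps)
        genes = {g for r, g in ratios if low <= r <= high}
        if len(genes) < mg:
            continue  # too few genes share this ratio range
        if any(genes <= kept for _, _, kept in ranges):
            continue  # gene-set already covered by an earlier range
        ranges.append((low, high, genes))
    return ranges

# Hypothetical values for two sample columns over five genes.
s_a = [3.0, 6.0, 9.3, 1.0, 5.0]
s_b = [1.0, 2.0, 3.0, 1.0, 1.0]
for low, high, genes in valid_ratio_ranges(s_a, s_b, eps=0.05, mg=3):
    print(low, high, sorted(genes))
```

Each returned range would become one edge between the two samples in the range multigraph, labeled with its gene-set.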

  14. Range Multigraph: Pair of Samples • Construct valid ratios & gene-sets for s1/s4 • Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7} • Ratio = 5/4, gene-set = {g4, g8, g1} • Construct ratios/gene-sets for other pairs • [Figure: multigraph]

  15. Range Multigraph: complete • Construct ratios/gene-sets for all sample pairs

  16. Maximal Clique Mining • [Figure: range multigraph over samples s0 to s6] • Perform recursive depth-first search • Maintain valid gene-sets for each node • Intersect gene-sets with each outgoing edge • {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9} • Prune if various criteria are not met (size, dim range)
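The depth-first search with gene-set intersection can be sketched on a simplified graph. The assumptions here are significant: the sketch uses at most one gene-set per sample pair (the slides describe a multigraph with one edge per ratio range), it ignores ratio consistency and range thresholds, and the names `mine_biclusters` and `edge_genes` are hypothetical. Only the s1/s4 gene-set below comes from the slides; the edges involving s6 are made up for illustration.

```python
def mine_biclusters(samples, edge_genes, mg, ms):
    """Depth-first clique search that intersects gene-sets along edges.

    samples    : ordered list of sample names
    edge_genes : dict mapping (sa, sb), with sa before sb in `samples`,
                 to the gene-set shared by that sample pair
    mg, ms     : minimum number of genes / samples in a reported cluster
    Returns maximal (sample_set, gene_set) pairs as frozensets.
    """
    found = []

    def extend(clique, genes, candidates):
        if len(clique) >= ms:
            found.append((frozenset(clique), frozenset(genes)))
        for idx, s in enumerate(candidates):
            new_genes = set(genes)
            for member in clique:
                # a missing edge contributes an empty gene-set
                new_genes &= edge_genes.get((member, s), set())
            if len(new_genes) >= mg:  # prune branches with too few genes
                extend(clique + [s], new_genes, candidates[idx + 1:])

    all_genes = set().union(*edge_genes.values())
    for idx, s in enumerate(samples):
        extend([s], all_genes, samples[idx + 1:])

    # keep only clusters not contained in another cluster on both dimensions
    return [c for c in found
            if not any(c != d and c[0] <= d[0] and c[1] <= d[1] for d in found)]

edges = {
    ("s1", "s4"): {"g2", "g6", "g0", "g9", "g7"},   # from the slides
    ("s1", "s6"): {"g2", "g6", "g0", "g9"},         # hypothetical
    ("s4", "s6"): {"g2", "g6", "g0", "g9", "g1"},   # hypothetical
}
for sample_set, gene_set in mine_biclusters(["s1", "s4", "s6"], edges, mg=3, ms=2):
    print(sorted(sample_set), sorted(gene_set))
```

The final filter is a quadratic pass over all reported clusters, which is acceptable for a sketch but would need a smarter maximality check on real data.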

  17. Mine triClusters • Let Bt be the set of maximal biclusters for time slice t • Construct new multigraph • Each time point is a vertex • Each pair of highly overlapping biclusters (gene-set, samples) forms an edge between time ti and tj • Call maximal clique mining to obtain maximal triclusters

  18. Constructing triClusters

  19. Constructing triClusters • [Figure: time slices ti, tj, tk]

  20. Constructing triClusters • [Figure: time slices ti, tj, tk]

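  21. Prune and Merge • [Figure: overlapping clusters A and B, and a cluster B covered by clusters Ai and Aj] • Merge A & B if |L_((A+B)−A−B)| / |L_(A+B)| is below the merge threshold • Prune B if |L_(B−A)| / |L_B| is below the prune threshold • Prune B if |L_B − ∪_i L_(Ai)| / |L_B| is below the prune threshold • Cluster span: • L_C = {(i, j, k) | g_i, s_j, t_k ∈ C} • L_(A∪B) = L_A ∪ L_B • L_(A−B) = L_A − L_B • L_(A+B) = (L_A − L_B) ∪ (L_B − L_A) ∪ (L_A ∩ L_B)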
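The pruning test can be sketched by materializing cluster spans as sets of (gene, sample, time) index triples. This covers only the two prune cases; the function names and the cluster representation (a tuple of three index sets) are assumptions made for illustration, and the threshold itself is left to the caller.

```python
from itertools import product

def span(cluster):
    """Span L_C: all (gene, sample, time) index triples in the cluster."""
    genes, samples, times = cluster
    return set(product(genes, samples, times))

def prune_ratio(b, covers):
    """Fraction of B's span not covered by the clusters in `covers`.

    With a single cover A this is |L_B - L_A| / |L_B|; with several
    covers A_i it is |L_B - union of L_Ai| / |L_B|.
    """
    span_b = span(b)
    covered = set().union(*(span(a) for a in covers))
    return len(span_b - covered) / len(span_b)

# Hypothetical clusters given as (gene indices, sample indices, time indices).
a = ({0, 1, 2}, {0, 1}, {0, 1})
b = ({1, 2, 3}, {0, 1}, {0, 1})
print(prune_ratio(b, [a]))   # share of B lying outside A
```

B would be pruned when this ratio falls below the chosen prune threshold, meaning B adds little beyond what the covering clusters already contain.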

  22. Metrics for Measuring Clustering Quality • NumClusters: number of clusters • Span: Span(X × Y × Z) = |X| × |Y| × |Z| • ElementSum: sum of all cluster spans (shared cells counted multiple times) • Coverage: size of the union of all cluster spans (each cell counted once) • Overlap: (ElementSum − Coverage) / Coverage • We want high coverage with small overlap
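These metrics are straightforward to compute once each cluster's span is written out as a set of cells. A minimal sketch, assuming each cluster is a tuple of three index sets (genes, samples, times); the function name `clustering_metrics` is hypothetical.

```python
from itertools import product

def clustering_metrics(clusters):
    """Compute NumClusters, per-cluster Span, ElementSum, Coverage, Overlap."""
    spans = [set(product(x, y, z)) for x, y, z in clusters]
    element_sum = sum(len(s) for s in spans)      # shared cells counted repeatedly
    coverage = len(set().union(*spans))           # each cell counted once
    overlap = (element_sum - coverage) / coverage if coverage else 0.0
    return {
        "NumClusters": len(clusters),
        "Span": [len(s) for s in spans],
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }

# Two hypothetical 3 x 2 x 2 clusters sharing two genes.
clusters = [({0, 1, 2}, {0, 1}, {0, 1}),
            ({1, 2, 3}, {0, 1}, {0, 1})]
print(clustering_metrics(clusters))
```

Materializing spans as sets is fine for small examples; for large clusters, Span can be computed directly as |X| × |Y| × |Z| and only Coverage needs the union.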

  23. Synthetic Data Generation • Experiments: 1.4 GHz, 448 MB, Linux/VMware • Synthetic data for parameter evaluation • Input parameters: • |G| = 4000, |S| = 30, |T| = 20 • Number of clusters to embed = 10 • Overlap % among clusters = 20% • Noise for expression values = 3% • Cluster size range = 150 × 6 × 4 (some variation) • Generate clusters with values within some range • Fill rest of cells with random noise • Do random permutations along each dimension • We vary one parameter and keep the others fixed

  24. Results on Synthetic Datasets • [Figure: six plots of running time (sec) versus number of genes, number of samples, number of time points, number of clusters, variation (%), and overlap (%)]

  25. Results on Yeast Cell Cycle Dataset • http://genome-www.stanford.edu/cellcycle • Elutriation experiment • 7679 genes • 14 time points (0 to 390 min at 30-min intervals) • No real samples: use raw expression values of 13 attributes as samples (Cyc3, Cyc5, ratios, etc.) • G × S × T = 7679 × 13 × 14 • Note: actual 3D data will become publicly available soon (e.g. Mouse Brain Atlas: genes × space × time) • Run TriCluster: mg = 50, ms = 4, mt = 5, ε = 0.03 • Found 5 clusters in 28 s, overlap = 0, coverage = 6250 • 2D view of cluster C0 (51 × 4 × 5) shown next

  26. 2D Views of cluster C0 on yeast data • [Figure: three panels (Sample Curves, Time Curves, Gene Curves) plotting expression values for cluster C0 against genes and time points; samples CH2I, CH2D, CH2IN, CH2DN; time points t = 120, 210, 270, 330, 390]

  27. Results on Yeast Cell Cycle Dataset: Gene Ontology • [Table: significant (p-value < 0.01) shared Gene Ontology (GO) terms (process, function, location) for genes in different clusters]

  28. Results on Yeast Cell Cycle: Specific Cluster • Different clusters show different shared terms • Results could be biologically significant

  29. Summary • Contributions • First algorithm to mine triclusters from 3D microarrays • Complete, deterministic • Allows small noise • Flexible: constant, single/two dim, scaling, shifting • Allows arbitrary overlap (merge/prune) • Potentially biologically significant clusters (GO)! • Future Work • Extend from 3-D to k-D datasets • Allow different pattern types along different axes (scaling along GxS, shifting along T, etc.) • Enhance clique mining step from multigraphs
