
Gene Clustering: An Enhanced Cluster Affinity Search Technique

Abdelghani Bellaachia and David Portnoy, The George Washington University, Department of Computer Science, 801 22nd St NW, Washington, DC 20052, and Y. Chen and A. G. Elkahloun, NIH/NHGRI/CGB, National Institutes of Health

Presentation Transcript


  1. Gene Clustering: An Enhanced Cluster Affinity Search Technique • Abdelghani Bellaachia and David Portnoy, The George Washington University, Department of Computer Science, 801 22nd St NW, Washington, DC 20052 • Y. Chen and A. G. Elkahloun, NIH/NHGRI/CGB, National Institutes of Health, Bethesda, MD 20892-4470 • BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 Conference)

  2. Outline • Introduction • Gene Expression Data (GED) • Clustering GED • CAST Algorithm (Amir Ben-Dor, Ron Shamir and Zohar Yakhini, JCB 2000, vol 7, 559-584) • Enhanced CAST Algorithm • Experiment Data Sets • Experiment • Conclusion

  3. Introduction • Cluster analysis is a statistical technique used to group similar objects based on some feature vector. • It has been used in many domains for three decades. • It has been successfully applied to gene expression data.

  4. Introduction (Cont.) Clustering of expression data can be used in two ways: • Group genes that exhibit similar behavior, in order to identify genes that drive certain biological processes and to aid in genetic pathway discovery • Group experiments that have similar genetic profiles, in order to aid in the classification of tumors and to discover new tumor subtypes

  5. Gene Expression Data (GED) • Three main array technologies are currently used to generate large-scale gene expression data: • Copy or complementary DNA (cDNA) arrays: consisting of cDNA or amplified products derived from cDNA, with no more than 2,000 bases. • Oligonucleotide arrays: consisting of oligonucleotides with between 20 and 30 bases. • Genomic arrays: consisting of genomic DNA with 50,000 or more bases. • Each technique provides a way of detecting gene expression levels across 5,000-14,000 gene sequences simultaneously.

  6. GED (Cont.) • Each array produces a "snapshot" of the quantity of individual mRNA (messenger RNA) species in a cellular sample at a given time. • Sequences of arrays (experiments) are used to create an expression matrix.

  7. Clustering GED • Once an expression matrix is generated, a similarity (or distance) matrix can be calculated. This is an n-by-n matrix, where n is the number of feature vectors (experiments or genes). • Feature vectors are then grouped into clusters, with the goal of maximizing intra-cluster homogeneity and inter-cluster heterogeneity.
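As a rough illustration of the step from expression matrix to similarity matrix, the sketch below builds an n-by-n matrix from a list of feature vectors. The slides do not name a similarity measure, so Pearson correlation is an assumption here, and the names `pearson`, `similarity_matrix` and `expression` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumed similarity measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)

def similarity_matrix(vectors):
    """Return the symmetric n-by-n matrix S with S[i][j] = similarity of vectors i and j."""
    n = len(vectors)
    S = [[1.0] * n for _ in range(n)]  # each vector is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = pearson(vectors[i], vectors[j])
    return S

# Toy expression matrix: three feature vectors measured over three arrays.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
]
S = similarity_matrix(expression)
```

Rows 0 and 1 rise together and row 2 falls, so the first pair scores near the top of the range and the pairs involving row 2 near the bottom.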

  8. Cluster Affinity Search Technique (CAST) • Amir Ben-Dor, Ron Shamir and Zohar Yakhini, JCB 2000, vol 7, 559-584. • Uses a graph-theoretic approach. • The graph is modeled by the similarity matrix, S. • The objective of the algorithm is to find cliques within the graph that maximize the intra-cluster connectivity and minimize the inter-cluster connectivity of the vertices.

  9. CAST: Notation & Definitions • Definition 1: The affinity of a node x to a cluster C is the sum of its similarities to the members of C: a(x) = Σ_{k ∈ C} S(x, k). • Definition 2: The connectivity threshold of a cluster C is T|C|, where T is the threshold parameter and |C| is the cardinality of C. • Definition 3: A high connectivity node is a node that will be included in a cluster. Its affinity satisfies a(i) ≥ T|C|, where a(i) is the affinity of i. • Definition 4: A low connectivity node is a node that will be removed from a cluster. Its affinity satisfies a(i) < T|C|, where a(i) is the affinity of i.

  10. CAST: Notation & Definitions (Cont.) • High Connectivity: A node i is included in a cluster C if its affinity satisfies a(i) ≥ T|C|. • Low Connectivity: A node i is removed from a cluster C if its affinity satisfies a(i) < T|C|.
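The definitions can be checked on a small example. The sketch below assumes the affinity is the plain sum of similarities to the cluster members (consistent with the update rule a(x) = a(x) + S(x,u) on the later slides) and that S is a list of lists with 1.0 on the diagonal. The function names are hypothetical.

```python
def affinity(S, x, cluster):
    """a(x): sum of similarities between node x and every member of the cluster."""
    return sum(S[x][k] for k in cluster)

def is_high_connectivity(S, x, cluster, T):
    """True when a(x) >= T * |C|, i.e. x qualifies for inclusion."""
    return affinity(S, x, cluster) >= T * len(cluster)

def is_low_connectivity(S, x, cluster, T):
    """True when a(x) < T * |C|, i.e. x qualifies for removal."""
    return affinity(S, x, cluster) < T * len(cluster)

# Toy similarity matrix: nodes 0 and 1 are close, node 2 is an outsider.
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
cluster = {0, 1}
T = 0.8  # connectivity threshold T * |C| is then 1.6

a_member = affinity(S, 0, cluster)    # 1.0 + 0.9
a_outsider = affinity(S, 2, cluster)  # 0.2 + 0.3
```

With these numbers node 0 clears the threshold of 1.6 and node 2 falls well below it.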

  11. CAST: Cluster Generation • To form a cluster, the algorithm alternates between adding nodes to and removing nodes from the current (open) cluster until no more changes occur or a maximum number of iterations has been executed. • Node Addition: • Add nodes with high connectivity to the nodes in the open cluster. • Node Removal: • Remove any nodes in the open cluster with low connectivity to the other nodes in the cluster. • Cluster Cleaning: • Make sure every node is in the cluster to which it has the highest affinity.

  12. Addition Step
      while max{a(w) | w ∈ U} ≥ T|Copen| {
        Pick an element u ∈ U such that a(u) = max{a(w) | w ∈ U}
        Copen ← Copen ∪ {u}
        U ← U \ {u}
        // Update affinity of all nodes
        For all x ∈ U ∪ Copen set a(x) = a(x) + S(x,u)
      }
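A minimal Python sketch of the addition step, assuming S is a list of lists with 1.0 on the diagonal, the node sets are Python sets of indices, and `a` is a dict of running affinities. The function name `addition_step` and the toy data are hypothetical.

```python
def addition_step(S, T, c_open, unassigned, a):
    """Move high-affinity nodes from `unassigned` into `c_open`.

    Mutates c_open, unassigned and the affinity dict `a` in place.
    """
    while unassigned:
        # Node outside the cluster with the largest affinity (ties: lowest index).
        u = max(sorted(unassigned), key=lambda w: a[w])
        if a[u] < T * len(c_open):
            break  # no remaining node meets the connectivity threshold
        c_open.add(u)
        unassigned.remove(u)
        # Update affinity of all nodes, including u itself.
        for x in unassigned | c_open:
            a[x] += S[x][u]

# Toy data: nodes 0 and 1 are similar, nodes 2 and 3 are similar.
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
T = 0.8
c_open = {0}                            # cluster already seeded with node 0
unassigned = {1, 2, 3}
a = {x: S[x][0] for x in range(4)}      # affinities to the seeded cluster
addition_step(S, T, c_open, unassigned, a)
```

Node 1 joins because 0.9 ≥ 0.8 · 1; after that the best remaining affinity is 0.2, below 0.8 · 2, so the loop stops.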

  13. Removal Step
      while min{a(w) | w ∈ Copen} < T|Copen| {
        Pick an element u ∈ Copen such that a(u) = min{a(w) | w ∈ Copen}
        Copen ← Copen \ {u}
        U ← U ∪ {u}
        // Update affinity of all nodes
        For all x ∈ U ∪ Copen set a(x) = a(x) - S(x,u)
      }
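The removal step mirrors the addition step. The sketch below uses the same assumed data layout (list-of-lists S with 1.0 on the diagonal, sets of node indices, a dict of affinities). The name `removal_step` is hypothetical.

```python
def removal_step(S, T, c_open, unassigned, a):
    """Move low-affinity nodes from `c_open` back into `unassigned`.

    Mutates c_open, unassigned and the affinity dict `a` in place.
    """
    while c_open:
        # Cluster member with the smallest affinity (ties: lowest index).
        u = min(sorted(c_open), key=lambda w: a[w])
        if a[u] >= T * len(c_open):
            break  # every member meets the connectivity threshold
        c_open.remove(u)
        unassigned.add(u)
        # Update affinity of all nodes, including u itself.
        for x in unassigned | c_open:
            a[x] -= S[x][u]

# Toy data: node 2 sits in the open cluster but is weakly connected to it.
S = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
T = 0.8
c_open = {0, 1, 2}
unassigned = set()
a = {x: sum(S[x][k] for k in c_open) for x in range(3)}  # 2.1, 2.2, 1.5
removal_step(S, T, c_open, unassigned, a)
```

Node 2 has the lowest affinity (1.5 < 0.8 · 3) and is removed; the two remaining members then both exceed 0.8 · 2.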

  14. Cleaning Step
      // The cleaning step may not converge, so the number of iterations is capped
      while (changes in any Ci occur) and (iterations < max iterations) {
        for each c ∈ Ci, with Ci ∈ C {
          For each Cj ∈ C, compute the normalized affinity of c to Cj: aj(c) = (Σ_{k ∈ Cj} S(c,k)) / |Cj|
          if max{aj(c)} > ai(c), over all Cj ∈ C with i ≠ j {
            Ci = Ci \ {c}
            Cj = Cj ∪ {c}
          }
        }
      }
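One way to sketch the cleaning step in Python is shown below. It assumes the clusters are a list of sets of node indices, moves a node as soon as a better cluster is found, and treats an empty cluster as having the lowest possible score. The names `normalized_affinity` and `cleaning_step` and the default cap of 10 iterations are hypothetical.

```python
def normalized_affinity(S, c, cluster):
    """a_j(c): average similarity between node c and the members of the cluster."""
    return sum(S[c][k] for k in cluster) / len(cluster)

def cleaning_step(S, clusters, max_iterations=10):
    """Reassign each node to the cluster with the highest normalized affinity.

    Stops when a full pass makes no change or the iteration cap is reached.
    """
    for _ in range(max_iterations):
        changed = False
        for i, source in enumerate(clusters):
            for c in list(source):  # copy: source is modified during the pass
                scores = [
                    normalized_affinity(S, c, cj) if cj else float("-inf")
                    for cj in clusters
                ]
                best = max(range(len(clusters)), key=lambda j: scores[j])
                if best != i and scores[best] > scores[i]:
                    source.remove(c)
                    clusters[best].add(c)
                    changed = True
        if not changed:
            break
    return clusters

# Toy data: node 2 belongs with nodes 0 and 1 but starts in the wrong cluster.
S = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
clusters = cleaning_step(S, [{0, 1}, {2, 3}])
```

Node 2 scores 0.55 against its starting cluster and 0.9 against the other, so it moves; no further moves occur on the second pass.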

  15. E-CAST: Enhanced CAST • Drawbacks of CAST: • No guidance on how to choose T • The cleaning step may be very expensive • Computation of T: • T is calculated by averaging all similarity values above 0.5 • Static vs. Dynamic T • Static: T is calculated just once, before any clusters are created, based on all node similarity values. • Dynamic: T is recalculated before each new cluster is formed, based only on the nodes not yet clustered.
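The threshold calculation can be sketched as follows. Two details are assumptions not stated on the slide: only off-diagonal pairs are averaged (each pair counted once), and the cutoff itself is returned when no pair exceeds it. The name `compute_threshold` is hypothetical.

```python
def compute_threshold(S, nodes, cutoff=0.5):
    """Average of the pairwise similarities above `cutoff` among `nodes`."""
    ordered = sorted(nodes)
    values = [
        S[i][j]
        for pos, i in enumerate(ordered)
        for j in ordered[pos + 1:]      # each unordered pair once, no diagonal
        if S[i][j] > cutoff
    ]
    if not values:
        return cutoff  # assumed fallback when nothing exceeds the cutoff
    return sum(values) / len(values)

S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
# Static: computed once over all nodes, here the mean of 0.9 and 0.8.
static_T = compute_threshold(S, range(4))
# Dynamic: recomputed over the nodes still unclustered, e.g. after {0, 1} is closed.
dynamic_T = compute_threshold(S, {2, 3})
```

The dynamic value drops once the tightest pair has been clustered, which matches the way t decreases from slide to slide in the example.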

  16. E-CAST: An Example • Step: open a new cluster • t = 0.848 • Figure regions: Copen, U

  17. E-CAST: An Example • Step: pick an element u ∈ U (one of the nodes of the edge with high similarity) • t = 0.848 • Figure regions: Copen, U

  18. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.848 • Affinity labels in the figure: a = 0.85, 0.2, 0.3, 0.4, 0.25, 0.36 • Figure regions: Copen, U

  19. E-CAST: An Example • Step: pick an element u ∈ U with maximum affinity • t = 0.848 • Affinity labels in the figure: a = 0.85, 0.2, 0.3, 0.4, 0.25, 0.36 • Figure regions: Copen, U

  20. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.848 • Affinity labels in the figure: a = 0.5, 0.5, 0.65, 0.65, 0.72, 0.85, 0.85 • Figure regions: Copen, U

  21. E-CAST: An Example • Check: max{a(u) | u ∈ U} ≥ t|Copen| is false • t = 0.848 • Affinity labels in the figure: a = 0.5, 0.5, 0.65, 0.65, 0.72, 0.85, 0.85 • Figure regions: Copen, U

  22. E-CAST: An Example • Check: min{a(v) | v ∈ Copen} < t|Copen| is false • t = 0.848 • Affinity labels in the figure: a = 0.5, 0.5, 0.65, 0.65, 0.72, 0.85, 0.85 • Figure regions: Copen, U

  23. E-CAST: An Example • Step: close the cluster • t = 0.848 • Figure regions: Copen, C1, U

  24. E-CAST: An Example • Step: open a new cluster • t = 0.8475 • Figure regions: C1, Copen, U

  25. E-CAST: An Example • Step: pick an element u ∈ U • t = 0.8475 • Figure regions: C1, Copen, U

  26. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.8475 • Affinity labels in the figure: a = 0.2, 0.1, 0.86, 0.39 • Figure regions: C1, Copen, U

  27. E-CAST: An Example • Step: pick an element u ∈ U with maximum affinity • t = 0.8475 • Affinity labels in the figure: a = 0.2, 0.1, 0.86, 0.39 • Figure regions: C1, Copen, U

  28. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.8475 • Affinity labels in the figure: a = 0.5, 0.47, 0.59, 0.86, 0.86 • Figure regions: C1, Copen, U

  29. E-CAST: An Example • Check: max{a(u) | u ∈ U} ≥ t|Copen| is false • t = 0.8475 • Affinity labels in the figure: a = 0.5, 0.47, 0.59, 0.86, 0.86 • Figure regions: C1, Copen, U

  30. E-CAST: An Example • Check: min{a(v) | v ∈ Copen} < t|Copen| is false • t = 0.8475 • Affinity labels in the figure: a = 0.5, 0.47, 0.59, 0.86, 0.86 • Figure regions: C1, Copen, U

  31. E-CAST: An Example • Step: close the cluster • t = 0.8475 • Figure regions: Copen, C1, C2, U

  32. E-CAST: An Example • Step: open a new cluster • t = 0.84333 • Figure regions: C1, C2, Copen, U

  33. E-CAST: An Example • Step: pick an element u ∈ U • t = 0.84333 • Figure regions: C1, C2, Copen, U

  34. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.84333 • Affinity labels in the figure: a = 0.84333, 0.84333 • Figure regions: C1, C2, Copen, U

  35. E-CAST: An Example • Step: pick an element u ∈ U with maximum affinity • t = 0.84333 • Affinity labels in the figure: a = 0.84333, 0.84333 • Figure regions: C1, C2, Copen, U

  36. E-CAST: An Example • Step: for all x ∈ U ∪ Copen, set a(x) = a(x) + S(x,u) • t = 0.84333 • Affinity labels in the figure: a = 1.6867, 0.84333, 0.84333 • Figure regions: C1, C2, Copen, U

  37. E-CAST: An Example • Step: pick an element u ∈ U with maximum affinity • t = 0.84333 • Affinity labels in the figure: a = 1.6867, 0.84333, 0.84333 • Figure regions: C1, C2, Copen, U

  38. E-CAST: An Example • Check: min{a(v) | v ∈ Copen} < t|Copen| is false • t = 0.84333 • Affinity labels in the figure: a = 1.6867, 1.6867, 1.6867 • Figure regions: C1, C2, Copen, U

  39. E-CAST: An Example • U = ∅: clustering complete • Figure regions: Copen, C1, C3, C2, U
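The walk-through above (recompute t, open a cluster, seed it, alternate addition and removal, close the cluster, repeat until U is empty) can be sketched end to end as follows. This is not the authors' implementation. It assumes a list-of-lists S with 1.0 on the diagonal, a dynamic threshold averaged over off-diagonal pairs above 0.5, a seed taken from the most similar unassigned pair, and a removal loop that never empties the open cluster. All function names are hypothetical.

```python
def dynamic_threshold(S, nodes, cutoff=0.5):
    """Average similarity above `cutoff` among unclustered nodes (cutoff if none)."""
    vals = [S[i][j] for i in nodes for j in nodes if i < j and S[i][j] > cutoff]
    return sum(vals) / len(vals) if vals else cutoff

def pick_seed(S, nodes):
    """One endpoint of the most similar pair among `nodes` (assumed seeding rule)."""
    ordered = sorted(nodes)
    if len(ordered) == 1:
        return ordered[0]
    pairs = ((i, j) for i in ordered for j in ordered if i < j)
    return max(pairs, key=lambda p: S[p[0]][p[1]])[0]

def ecast(S, max_iterations=100):
    """Partition node indices 0..n-1 into clusters, recomputing t per cluster."""
    unassigned = set(range(len(S)))
    clusters = []
    while unassigned:
        t = dynamic_threshold(S, unassigned)
        seed = pick_seed(S, unassigned)
        c_open = {seed}
        unassigned.remove(seed)
        a = {x: S[x][seed] for x in unassigned | c_open}
        for _ in range(max_iterations):
            changed = False
            while unassigned:  # addition step
                u = max(sorted(unassigned), key=lambda w: a[w])
                if a[u] < t * len(c_open):
                    break
                c_open.add(u)
                unassigned.remove(u)
                for x in unassigned | c_open:
                    a[x] += S[x][u]
                changed = True
            while len(c_open) > 1:  # removal step, keeps at least one node
                v = min(sorted(c_open), key=lambda w: a[w])
                if a[v] >= t * len(c_open):
                    break
                c_open.remove(v)
                unassigned.add(v)
                for x in unassigned | c_open:
                    a[x] -= S[x][v]
                changed = True
            if not changed:
                break
        clusters.append(c_open)  # close the cluster
    return clusters

S = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
result = ecast(S)
```

On this toy matrix the first cluster is formed with t = 0.875 and the second with t = 0.8, mirroring how t is recomputed before each new cluster in the slides.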

  40. Experiment Data Sets • Brain: generated by cDNA array; 22 samples (experiments), 20 predicted to be placed in 6 clusters; 12,024 genes • Melanoma: generated by cDNA array; 38 samples (experiments), 25 predicted to be placed in 3 clusters; 3,614 genes • Thanhall (Graeme Eisenhofer and A. Elkahloun): generated by cDNA array; 22 samples (experiments), 17 predicted to be placed in 3 clusters; 12,024 genes

  41. Experiment Metrics • Misplaced Samples: The number of samples that are not placed in their predicted clusters. • Misjoined Clusters: The number of predicted clusters that are merged by the algorithm. • Clustering Time

  42. Experiment Results

  43. Experiment Results (Cont.) • In addition, E-CAST compares very well against Cluster, a hierarchical clustering program developed by Michael Eisen et al.

  44. Experiment Results (Cont.)

  45. Conclusion • E-CAST, an enhanced CAST algorithm, performs better than CAST. • Three data sets were used to evaluate both algorithms. • Dynamic assignment of T in E-CAST may obviate the need for the cleaning step. • E-CAST has a shorter total execution time than CAST. • E-CAST compares very well against Michael Eisen's hierarchical clustering program. • Future work: theoretical analysis of E-CAST.
