Robust methodologies for partition clustering

Paulo Lisboa, Terence Etchells, Ian Jarman and Simon Chambers

Overview
• Partition clustering - critique
• Decomposition of the covariance matrix
• Landscape mapping of cluster solutions
• Validation for two synthetic data sets and metabolic sub-typing

Bioinformatics: Nottingham Tenovus Primary Breast Carcinoma Series
• Consecutive series of 1,944 cases of primary operable invasive breast cancer (n=1,076 with all markers present)
• Patients presenting during 1986-98
• Protein expression comprising 25 immunohistochemical markers related to tumour malignancy, derived through high-throughput protein expression using TMA
Abd El-Rehim et al., Int J Cancer, 116, 340-350, 2005.

Partition clustering – relevance to bioinformatics
[Figure: marker panels p53, CK 5/6, C-erbB-2, BRCA1, ER, PgR]

Partition clustering – open issues
K-means:
i. Assume #K
ii. Initialise #N ?
iii. Sort by optimality ?
iv. Select best for #K ?
v. Select #K(s) ?
vi. Single cluster or ensemble ?
• Identify a suitable algorithm:
  • Model-based or model-free ?
  • Hierarchical, K-means, PAM ?
• Return {Sa,...,Sz} solutions
• Validate & interpret each solution
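Steps i to iii of the K-means list can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain Lloyd's K-means, random initial centres drawn from the data, and the within-cluster sum of squares (SSE) as the "optimality" score. The names `kmeans_once` and `ranked_runs` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's K-means from a random start; returns (sse, labels)."""
    centres = rng.sample(points, k)              # ii. one random initialisation
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break                                # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                          # an emptied cluster keeps its old centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centres[l]) for p, l in zip(points, labels))
    return sse, labels

def ranked_runs(points, k, n_init=20, seed=0):
    """i. fix K, ii. run N initialisations, iii. sort runs by optimality (assumed: lowest SSE first)."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return sorted(runs, key=lambda r: r[0])
```

Calling `ranked_runs(points, k)` on a list of coordinate tuples returns every run, best first, so the later questions on the slide (select the best run, or keep an ensemble) can be asked of the same list.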

Separation index: Decomposition of the scatter matrix
• Scatter matrices
[Figure labels: SW1, SW2, SB]
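The equations on this slide did not survive extraction. For orientation, the standard decomposition of the total scatter into within-cluster and between-cluster parts is given here; the notation (points $x_i$, clusters $C_k$ of size $n_k$, cluster means $m_k$, overall mean $m$) is assumed and may differ from the authors'. The exact form of the separation index is not recoverable from the transcript, so it is not reproduced.

```latex
\begin{aligned}
S_T     &= \sum_{i=1}^{N} (x_i - m)(x_i - m)^{\top}, \\
S_{W_k} &= \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^{\top}, \\
S_B     &= \sum_{k=1}^{K} n_k\,(m_k - m)(m_k - m)^{\top}, \\
S_T     &= \sum_{k=1}^{K} S_{W_k} + S_B .
\end{aligned}
```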

Separation index: Decomposition of the scatter matrix
• Invariant separation matrix and index
[Figure labels: SW1, SW2, SB]

N.B. If |ST| = 0 → project onto the subspace of cohort means
[Figure labels: a1, a2, a3]

Theorem: […] is invariant to dimensionality reduction under Mahalanobis rotations
[Figure labels: ã1, ã2, ã3]

K-means clustering

Adaptive Resonance Theory (ART) clustering

Concordance measure
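The transcript does not spell out the concordance measure. A minimal sketch, assuming that the CV used on later slides stands for Cramér's V computed on the cross-tabulation of two cluster assignments (an assumption based on the abbreviation, not confirmed by the slides). The function name `cramers_v` is illustrative.

```python
from collections import Counter
from math import sqrt

def cramers_v(labels_a, labels_b):
    """Cramér's V between two partitions of the same items.

    0 means the two labelings are statistically independent;
    1 means they are identical up to a renaming of the clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))     # contingency table cells
    rows = Counter(labels_a)                     # row totals
    cols = Counter(labels_b)                     # column totals
    chi2 = 0.0
    for a, n_a in rows.items():
        for b, n_b in cols.items():
            expected = n_a * n_b / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        return 0.0                               # a single-cluster labeling carries no information
    return sqrt(chi2 / (n * k))
```

Because the statistic depends only on the contingency table, it is unaffected by how each run happens to number its clusters, which is what a comparison across random initialisations needs.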

Optimality principle
i. N initialisations
ii. Sort by J
iii. Select top p%
iv. Calculate pairwise CV
v. Retain med(CV)
vi. Plot (J, med_CV)
• Reproducibility under repeated initialisations, with
  • Best Separation - max(J)
  • Best Concordance - max(CV)
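Steps ii to v can be sketched as below, under stated assumptions: the runs are supplied as (J, labels) pairs computed elsewhere, "med(CV)" is read as each retained solution's median concordance with the other retained solutions, and a Rand index is used as a simple stand-in for the concordance measure CV. The names `seco_points` and `rand_index` are hypothetical, and any concordance function that ignores cluster numbering can be passed instead.

```python
from itertools import combinations
from statistics import median

def rand_index(a, b):
    """Stand-in concordance: share of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def seco_points(runs, concordance=rand_index, top_fraction=0.2):
    """runs: list of (J, labels) from repeated initialisations.

    Returns one (J, median concordance) pair per retained solution,
    i.e. the coordinates to plot in step vi.
    """
    ranked = sorted(runs, key=lambda r: r[0], reverse=True)    # ii. sort by J
    n_top = max(2, int(len(ranked) * top_fraction))            # iii. top p% (at least two)
    top = ranked[:n_top]
    points = []
    for i, (j, labels) in enumerate(top):
        scores = [concordance(labels, other)                   # iv. pairwise concordance
                  for m, (_, other) in enumerate(top) if m != i]
        points.append((j, median(scores)))                     # v. retain the median
    return points
```

Plotting the returned pairs with J on one axis and the median concordance on the other gives a map on which solutions that are both well separated and reproducible stand out toward the high end of both axes.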

Synthetic data (10 cohorts)

Synthetic data – mixing structure (Sammon Map)

Synthetic data – Visualisation in data space

Synthetic data (10 cohorts)
[Figure: numeric node labels only; layout not recoverable from the extracted text]

Synthetic data (10 cohorts)
[Figure panels: Max J, SeCo, Max Cv]

Bioinformatics: Nottingham Tenovus Primary Breast Carcinoma Series

Marginal distributions

Landscape map (SeCo)

Stability index (Cv)

Landscape map (SeCo)

Cluster hierarchy (1)
[Figure: hierarchy of cluster solutions; nodes labelled C1 to C8 with cluster sizes, tree layout not recoverable from the extracted text]

Cluster hierarchy (2)
[Figure: hierarchy of cluster solutions; nodes labelled C1 to C8 with cluster sizes, tree layout not recoverable from the extracted text]

Solution A

Solution B

Solution A

Sub-type profiling Clusters A Clusters B Luminal New 2 Luminal N

Sub-type profiling Clusters A Clusters B Luminal A HER2

Sub-type profiling Clusters A Clusters B Basal p53 - Basal muc1 + Basal p53 + Basal muc1 -

Consistency with consensus clustering

Molecular sub-typing

Summary
• Partition clustering - critique
• Decomposition of the covariance matrix
• Landscape mapping of cluster solutions
• Validation for two synthetic data sets and metabolic sub-typing

Ferrara data (n=633)

Ferrara data (n=633)
[Figure panels: JMU Cluster 1/5, JMU Cluster 2/5, JMU Cluster 3/5, JMU Cluster 4/5, JMU Cluster 5/5]

Ferrara data (n=633)
