Fuzzy Clustering of Metagenome Reads: Comparative and Composition-based Approach

Shruthi Prabhakara, Raj Acharya
Department of Computer Science and Engineering, Pennsylvania State University

INTRODUCTION

• We propose a two-pass, semi-supervised fuzzy clustering method for metagenome reads that is a hybrid of comparative and composition-based methods. It combines:
  • Comparative methods, which align metagenomic sequences to close phylogenetic neighbors in existing databases. Such methods fail to find any homologs for reads from new families.
  • Composition-based methods, which distinguish between clades using intrinsic features of reads such as oligomer frequency, GC composition, or codon frequency.
• Fuzzy clusters handle the conflicts caused by over-representation of conserved regions in metagenome reads, without clipping potentially useful sequences.

FUZZY LEADER CLUSTERING OF METAGENOME READS

Our algorithm proceeds in two passes.

In the first pass, a comparative analysis of the metagenome reads against an existing database, using BLASTx, extracts reference sequences from within the dataset to form an initial set of seeded clusters. Reads with significant BLASTx hits are clustered by their taxonomy into taxon-based clusters.

In the second pass, global clade-specific characteristics (e.g. oligomer frequency) are used to cluster the remaining reads with a fuzzy possibilistic leader clustering algorithm [1]. These composition-based clusters might represent reads from novel species or from non-protein-coding genes.

Output: Our algorithm groups the reads into overlapping clusters. Each cluster is defined by a core, consisting of reads that definitely belong to the cluster, and a fringe, consisting of reads that may also belong to other clusters (representing homologous sequences).
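The two passes and the core/fringe output can be illustrated with a small sketch. It is not the authors' implementation and it does not reproduce the possibilistic memberships of [1]: it assumes the BLASTx step has already been run and summarized as a dictionary of read-to-taxon assignments, it uses normalized k-mer frequencies with Euclidean distance as the composition measure, and it takes the mean profile of the seeded reads as each cluster's leader. The function names (kmer_profile, two_pass_cluster) and the two radius thresholds are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of seq over ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other codes
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[w] / total for w in kmers]


def distance(a, b):
    """Euclidean distance between two profiles (an assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_pass_cluster(reads, taxon_hits, k=2,
                     core_radius=0.15, fringe_radius=0.30):
    """Cluster reads in two passes.

    reads:      dict of read id -> nucleotide sequence
    taxon_hits: dict of read id -> taxon, for reads with a significant
                hit (a stand-in for the filtered BLASTx results)
    Returns a dict of cluster name -> {"leader", "core", "fringe"}.
    """
    profiles = {rid: kmer_profile(seq, k) for rid, seq in reads.items()}
    clusters = {}

    # Pass 1: seed one cluster per taxon from the reads that have hits.
    for rid, taxon in taxon_hits.items():
        cluster = clusters.setdefault(
            taxon, {"leader": None, "core": [], "fringe": []})
        cluster["core"].append(rid)
    for cluster in clusters.values():
        members = [profiles[rid] for rid in cluster["core"]]
        # Assumed leader: the mean composition profile of the seeded reads.
        cluster["leader"] = [sum(col) / len(members) for col in zip(*members)]

    # Pass 2: place the remaining reads by composition.
    novel = 0
    for rid in reads:
        if rid in taxon_hits:
            continue
        profile = profiles[rid]
        near = [(distance(profile, c["leader"]), name)
                for name, c in clusters.items()]
        near = [(d, name) for d, name in near if d <= fringe_radius]
        if not near:
            # No leader is close enough: the read starts a new cluster.
            novel += 1
            clusters[f"novel_{novel}"] = {
                "leader": profile, "core": [rid], "fringe": []}
        elif len(near) == 1 and near[0][0] <= core_radius:
            # Unambiguous and close to one leader: a core member.
            clusters[near[0][1]]["core"].append(rid)
        else:
            # Ambiguous or only loosely attached: a fringe member of
            # every cluster within the fringe radius.
            for _, name in near:
                clusters[name]["fringe"].append(rid)
    return clusters
```

A read lands in the fringe of several clusters when its composition lies within the fringe radius of more than one leader, which is the situation the poster associates with conserved or homologous regions. In practice the radii would need tuning to the read length and the value of k, since short reads give noisy k-mer profiles.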
Input: FASTA file of metagenome reads
• BLASTx against the NR database
• Filter only significant hits
• Extract the taxonomy of every hit
• CLUSTER by TAXON: cluster reads with significant hits by taxonomy
• CLUSTER by COMPOSITION: group remaining reads into existing clusters based on sequence composition
Output: Clusters of reads

RESULTS ON ACID MINE DRAINAGE DATASET AND SIMULATED DATASET

[Figure panels not reproduced: Acid Mine Drainage; Simulated Dataset]

CONCLUSION

We have developed a two-pass, semi-supervised fuzzy clustering method for metagenome reads that is a hybrid of comparative and composition-based approaches. Our primary goal is to partition the dataset into a small number of clusters based on taxonomy. The secondary goal is to identify polymorphic and conserved regions and capture them within the soft boundaries of the clusters.

References:
[1] Hong Yu, Hu Luo: A Novel Possibilistic Fuzzy Leader Clustering Algorithm, Proc. of the 12th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (2009)
[2] Acid Mine Drainage Biofilm at NCBI: www.ncbi.nlm.nih.gov/
[3] D. Dalevi, N. N. Ivanova, K. Mavromatis, S. D. Hooper, E. Szeto, P. Hugenholtz, N. C. Kyrpides, and V. M. Markowitz: Annotation of metagenome short reads using proxygenes, Bioinformatics (2008)

