Efficient Clustering of Large EST Data Sets on Parallel Computers

Efficient Clustering of Large EST Data Sets on Parallel Computers CECS 694-04 Bioinformatics Journal Club September 17, 2003 Nucleic Acids Research, 2003, 31(11), 2963-2974 Presented by Elizabeth Cha

Problem Statement We are given an EST database from a single species, where multiple EST sequences may belong to the same gene. We want to find an efficient algorithm to cluster EST sequences, so that all EST sequences in a cluster belong to a single gene. (It’s possible to have more than one cluster for a gene.)

Efficient Algorithm Considerations • Memory efficiency: reduce the memory requirement to linear in the size of the input • Computational efficiency without sacrificing the quality of clustering • Reduction of the run-time of clustering large EST data sets by parallel processing (e.g. MPI)

EST database (dbEST) • Expressed Sequence Tag (EST) representations provide a dynamic view of genome content and expression • > 5 million human ESTs • > 3.5 million mouse ESTs Reference information: dbEST (ncbi.nlm.nih.gov/dbEST/dbEST_summary.html)

What is an EST? • A unique DNA sequence derived from a cDNA library. • An EST is around 200 to 500 nucleotides long. • ESTs are generated by sequencing either one or both ends of an expressed gene. • The EST can be mapped, by a combination of genetic mapping procedures, to a unique locus in the genome and serves to identify that gene locus.

An overview of the process of protein synthesis. Image adopted from http://ncbi.nlm.nih.gov/About/primer/est.html

An overview of how ESTs are generated. Image adopted from ncbi.nlm.nih.gov/About/primer/est.html

Current Problems in dbEST • Imposing size of EST database • Low sequence quality • Highly similar (but distinct) gene family members • Chimeric cDNA clones • Retained introns and alternatively spliced transcripts • Incomplete gene coverage • Other limitations

Types of alternative splicing • Skipped exons • Retained introns • Alternative donor or acceptor site Image adopted from Trends in Genetics, 2002,18(1), 53-57

How to solve the problems • Remove the redundancy by clustering ESTs representing the same native transcripts • Current software for clustering ESTs • UniGene • STACK (Sequence Tag Alignment and Consensus Knowledgebase) • HGI (Human Gene Index) • TIGR Assembler • CAP3 • Phrap

Goals of clustering ESTs • Each cluster represents a distinct gene, including all alternative transcript isoforms derived from the same gene (e.g. UniGene). • Each cluster is deemed to represent a distinct mRNA transcript (e.g. CAP3, TIGR Assembler, Phrap). • ESTs are first categorized by their RNA source and are subsequently clustered separately for each source sample (e.g. STACK).

Ideas to get evidential gene or transcript • Pairwise sequence alignment with dynamic programming algorithm • Fast identification of promising pairs with good quality overlap • Report pairs based on maximal common substrings

PaCE (Parallel Clustering of ESTs) • A software program for EST clustering on parallel computers • Two reasons PaCE enables clustering and assembly of large-scale EST data sets: • Memory requirement grows linearly in the size of the input • The input size is reduced from the complete set of ESTs to the size of the biggest cluster

EST Clustering • Given: ESTs drawn from multiple mRNAs • Partition: The ESTs into clusters such that ESTs from the same gene are put together in a distinct cluster

EST Clustering (Cont’d) Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

EST Clustering Algorithm • Initially, treat each EST as a cluster by itself • If two ESTs from two different clusters show significant overlap, merge the clusters • Output the clusters once finished
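The three steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PaCE itself: it tests every pair in turn, and `toy_overlap` is a hypothetical stand-in for a real alignment-based overlap test (it only checks for an exact suffix-prefix match of at least `min_len` characters).

```python
def cluster_ests(ests, overlaps):
    """Start with singleton clusters and merge whenever two ESTs overlap.

    `overlaps(a, b)` is a caller-supplied predicate (an assumption here).
    Returns the clusters as sorted lists of EST indices.
    """
    cluster_of = [{i} for i in range(len(ests))]  # each EST starts alone
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if cluster_of[i] is cluster_of[j]:
                continue  # already in the same cluster, nothing to test
            if overlaps(ests[i], ests[j]):
                merged = cluster_of[i] | cluster_of[j]
                for k in merged:
                    cluster_of[k] = merged
    unique = {id(c): c for c in cluster_of}
    return sorted(sorted(c) for c in unique.values())


def toy_overlap(a, b, min_len=4):
    """Toy overlap test: a suffix of one string equals a prefix of the other."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), min_len - 1, -1):
            if x[-k:] == y[:k]:
                return True
    return False
```

For example, `cluster_ests(["ACGTACGG", "ACGGTTCA", "TTCAGGAT", "CCCCCCCC"], toy_overlap)` puts the first three sequences in one cluster, even though the first and third do not overlap directly, because both overlap the second.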

EST Clustering (Cont’d) Merging Clusters Successful overlap results in: Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

Determining Overlaps • Compute only lower and upper rectangles • Do banded dynamic programming Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.
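To make the idea of banded dynamic programming concrete, here is a small sketch. It computes a plain unit-cost edit distance, not the overlap score used by PaCE, and fills only the cells within `band` of the main diagonal. The function name and the choice of edit distance are assumptions made for illustration.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b, filling only cells with |i - j| <= band.

    Returns None when the length difference exceeds the band, because the
    final cell then lies outside the band. The result is exact whenever the
    true edit distance is at most `band`; otherwise it is an upper bound.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    INF = float("inf")
    # Row 0: transforming the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # delete the first i characters of a
                continue
            sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            dele = prev.get(j, INF) + 1   # cell above, may be outside the band
            ins = cur.get(j - 1, INF) + 1  # cell to the left
            cur[j] = min(sub, dele, ins)
        prev = cur
    return prev[m]
```

With a band of width w, the table costs on the order of w times the sequence length instead of the product of the two lengths, which is why restricting the computation pays off when a good overlap is expected to stay near one diagonal.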

Maximum Common Substring • Given: a set of strings • Find: Pairs of strings that have a maximal common substring ≥ a threshold φ Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.
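The problem statement can be illustrated with a brute-force sketch. This is deliberately not how PaCE finds pairs (the deck attributes that to the GST); it simply computes the longest common substring of every pair by dynamic programming and keeps the pairs that reach the threshold. The function names and the parameter `phi` (for φ) are chosen here for illustration.

```python
from itertools import combinations


def longest_common_substring_len(a, b):
    """Length of the longest substring shared by a and b (O(len(a) * len(b)))."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best


def promising_pairs(seqs, phi):
    """Return (length, i, j) for pairs sharing a substring of length >= phi,
    ordered by decreasing length."""
    pairs = []
    for i, j in combinations(range(len(seqs)), 2):
        length = longest_common_substring_len(seqs[i], seqs[j])
        if length >= phi:
            pairs.append((length, i, j))
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return pairs
```

A pair has some maximal common substring of length at least φ exactly when its longest common substring does, so checking the longest one is enough for this filter.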

Organization of PaCE • Build a distributed representation of the GST data structure in parallel • Use a single processor to handle maintaining and updating the EST clusters

Generalized Suffix Tree (GST) A GST for a set of n sequences is a suffix tree constructed using all suffixes of the n sequences.
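One way to picture a distributed GST is to group all suffixes of all sequences by their first few characters: suffixes that share a prefix fall in the same subtree, so each group can be handled independently. The sketch below does only that grouping and a simple assignment of groups to processors. The prefix length `w`, the function names, and the greedy assignment are assumptions for illustration; the slides do not say how PaCE distributes the tree.

```python
from collections import defaultdict


def bucket_suffixes(seqs, w):
    """Group every suffix of length >= w by its first w characters.

    Each suffix is recorded as (sequence index, start offset), both 0-based.
    """
    buckets = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for start in range(len(seq) - w + 1):
            buckets[seq[start:start + w]].append((sid, start))
    return dict(buckets)


def assign_buckets(buckets, num_procs):
    """Greedy balancing: largest bucket first, to the least loaded processor."""
    loads = [0] * num_procs
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        p = loads.index(min(loads))
        owner[key] = p
        loads[p] += len(buckets[key])
    return owner, loads
```

Each bucket corresponds to the part of the GST below the path spelled by its key, so a processor that owns a bucket could build that subtree without seeing the others.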

Basic Concept of Suffix Tree A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing.

Definition of Suffix Tree • A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. • Each internal node has at least 2 children and each edge is labeled with a nonempty substring of S. • No 2 edges out of a node can have edge-labels beginning with the same character.

Definition of Suffix Tree (Cont’d) • Key feature: for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.

Ukkonen’s Algorithm to Construct a Suffix Tree
Construct tree I1 (it is just the single edge labeled by character S(1))
for i = 1 to m-1 do begin {phase i+1}
    for j = 1 to i+1 begin {extension j}
        Find the end of the path from the root labeled S[j..i] in the current tree.
        If needed, extend that path by adding character S(i+1).
    end;
end;
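The phase and extension loops above can be run almost literally on an uncompressed trie. The sketch below does that with nested dictionaries. It is the naive cubic-time form of the high-level description, without the edge compression or the speed-ups that make Ukkonen's algorithm linear, and the function names are chosen here for illustration.

```python
def implicit_suffix_trie(s):
    """Run the phase/extension loops on an uncompressed trie of nested dicts.

    Indices are 0-based here: the phase that adds character s[i] corresponds
    to phase i+1 in the 1-based description.
    """
    root = {}
    if not s:
        return root
    root[s[0]] = {}  # tree I1: a single edge labeled by the first character
    for i in range(1, len(s)):        # phase: add character s[i]
        for j in range(0, i + 1):     # extension: handle the path s[j:i]
            node = root
            for ch in s[j:i]:         # find the end of the path s[j:i]
                node = node[ch]
            node.setdefault(s[i], {})  # extend with s[i] only if needed
    return root


def contains(trie, t):
    """True if t can be spelled from the root, i.e. t is a substring of s."""
    node = trie
    for ch in t:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

After the last phase every suffix of the string is a path from the root, so every substring can be found by walking from the root, which is what `contains` checks.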

Suffix Tree Construct a suffix tree of sequence gaac
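As a rough check of the definition on this example, the following sketch builds the suffix tree of gaac by inserting one suffix at a time and splitting edges where paths diverge. This is a simple quadratic-time construction, not Ukkonen's algorithm, and the class and function names are illustrative. It assumes no suffix is a prefix of another suffix, which holds for gaac; other strings may need a unique terminator character appended.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child Node)
        self.leaf = None    # 1-based start position of the suffix, for leaves


def build_suffix_tree(s):
    """Insert each suffix in turn, splitting an edge at the first mismatch."""
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            if not rest:
                raise ValueError("a suffix is a prefix of another suffix; "
                                 "append a unique terminator")
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched, descend
                continue
            mid = Node()                      # split the edge at offset k
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaf_paths(node, prefix=""):
    """Map each leaf number to the string spelled from the root to that leaf."""
    if not node.children:
        return {node.leaf: prefix}
    paths = {}
    for label, child in node.children.values():
        paths.update(leaf_paths(child, prefix + label))
    return paths
```

For gaac the root has three outgoing edges labeled gaac, a, and c. The edge a leads to an internal node with two children (ac and c), and the four leaves spell the suffixes starting at positions 1 to 4.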

Suffix Tree (Cont’d) Image adopted from article (1999) Nucleic Acids Research, 27, 2369-2376

Main idea to use GST data structure • Maximal Common Substring Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

Parallel Clustering • A master-slave paradigm is used. • Master processor: maintains and updates the clusters • Slave processors: • Generate pairs as demanded by the master processor • Perform pairwise alignments of the pairs dispatched by the master processor • Data structure for maintaining the clusters: union-find algorithm
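The master's bookkeeping can be sketched sequentially, without any message passing. In the sketch below the list `pairs` stands in for the pairs generated by the slave processors, and `align(i, j)` is a hypothetical callback standing in for a pairwise alignment carried out on a slave. The union-find structure lets the master skip pairs whose ESTs are already in the same cluster.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]
        return True


def master_loop(num_ests, pairs, align):
    """Process promising pairs longest-first, aligning only across clusters.

    `pairs` holds (common_substring_length, i, j) tuples. Returns the
    union-find state and the number of alignments actually performed.
    """
    clusters = UnionFind(num_ests)
    alignments = 0
    for _, i, j in sorted(pairs, key=lambda p: -p[0]):
        if clusters.find(i) == clusters.find(j):
            continue  # already clustered together: no alignment needed
        alignments += 1
        if align(i, j):
            clusters.union(i, j)
    return clusters, alignments
```

With pairs `[(50, 0, 1), (40, 1, 2), (30, 0, 2), (20, 2, 3)]` and an `align` that accepts everything except the pair (2, 3), only three alignments run: the pair (0, 2) is skipped because ESTs 0 and 2 were already joined through EST 1.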

Software availability • PaCE is freely available for non-profit, academic use. • To request source code and executables, contact: ananthk@cs.iastate.edu

Quality Assessment • Benchmark data set: Arabidopsis thaliana, 168,200 ESTs • Small genome (114.5 Mb of 125 Mb total), sequenced in 2000 • Reference information: http://www.arabidopsis.org/info/aboutarabidopsis.html

Achievements of PaCE • Reduce the worst-case memory requirement from quadratic to linear • Generate promising pairs in decreasing order of maximal common substring length and cluster the ESTs such that the number of pairwise alignments is reduced by an order of magnitude without affecting the quality of clustering • Reduce the number of duplicates generated for each promising pair

Future Research • Extend PaCE to do assembly and build consensus sequences in parallel • Incorporate quality values available to ESTs as part of input • Ensure quality clustering and assembly

System used for the implementation: IBM xSeries cluster • 30 dual-processor nodes • 1.26 GHz Intel Pentium III processors • connected by Myrinet • 2.25 GB memory at each node • 512 MB of RAM

Quality Assessment of PaCE and CAP3
