Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari

Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari Nargess Memarsadeghi CMSC 838 Presentation

Talk Overview • Overview of talk • Motivation • Background • Techniques • Evaluation • Related work • Observations CMSC 838T – Presentation

Motivation: EST Clustering • Problem: EST Clustering • Cluster fragments of cDNA • Related to ‘fragment assembly’ problem • Detecting overlapping fragments • Overlaps can be computed: • Pairwise alignment algorithm • Dynamic programming • Alternative: • Approximate overlap detection algorithms • Dynamic programming

Motivation • Common tools: • Take too long • Days for 100,000 ESTs • Run out of memory • This paper: • PaCE: • Parallel Clustering of ESTs • Efficient parallel EST clustering • Space-efficient algorithm • Reduce total work • Reduce run-time

Background: EST Clustering Tools • Three traditional software packages: • Originally designed for fragment assembly: • TIGR Assembler • Phrap • CAP3 • One parallel software package: • UICLUSTER: assumes ESTs from 3’ end

EST Clustering Tools • Basic approach • Find pairs of similar sequences • Align similar pairs • Dynamic programming • Quality of EST clustering • Phrap: Fastest • Avoids dynamic programming • Relies on approximation, lower quality • CAP3: Least # of erroneous clusters

EST Clustering Tools’ Performance • With 50,000 maize ESTs • Using a PC with dual Pentium 450 MHz processors and 512 MB RAM: • TIGR Assembler: ran out of memory • Phrap: 40 min • CAP3: > 24 hours • With 100,000 maize ESTs • All ran out of memory • CAP3 would require 4 days

Goal • Space efficient algorithm • Space requirement linear in the size of the input data set • Reduce total work • Without sacrificing quality of clustering • Reduce run-time and facilitate the clustering of large data sets • Through parallel processing • Scale memory with # of processors

Approach • Expense: • Pairwise alignment (time + memory) • Promising pairs ≈ pairs sharing a common string s with |s| = w • Cost: if the common string has length |s| = l > w, then the same pair repeats l - w + 1 times
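The repeat count on this slide can be checked with a small sketch. The function name `shared_windows` and the two example reads are hypothetical and are not taken from the paper; the reads are chosen so that their longest common substring is "ACGTAC" (l = 6).

```python
def shared_windows(a: str, b: str, w: int) -> int:
    """Count positions in `a` whose length-w window also occurs somewhere in `b`."""
    windows_b = {b[i:i + w] for i in range(len(b) - w + 1)}
    return sum(a[i:i + w] in windows_b for i in range(len(a) - w + 1))


# Hypothetical reads sharing the substring "ACGTAC" (l = 6) and nothing longer.
read_a = "TTTACGTACGG"
read_b = "CCACGTACAA"

# With w = 4 the same pair is reported l - w + 1 = 3 times.
hits = shared_windows(read_a, read_b, 4)
```

A method that reports one candidate per shared window of length w would therefore test this single pair three times, which is the redundancy the slide refers to.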

Approach (Cont ..) • Approach: • Use trie structure • Identify promising pairs • Merge clusters with strong overlaps • Avoid storing/testing all similar pairs • Parallel EST Clustering Software: • Generalized Suffix Tree (GST) • Multiple processors: • One maintains and updates EST clusters • Others generate batches of promising pairs, perform alignment

Approach (Cont …)

Tries • Index for each char • N leaves • Height N

Suffix Tries (Cont ..) • TRIM suffix trie

Suffix Tries (Cont ..) • Indices • Storage O(n), constant is high though • Common string • Longest common substring

Suffix Tries (Cont ..) • [Figure: suffix tree with leaves numbered 1 to 5 and edges labeled a, b, and $] • Given a pattern P = ab, we traverse the tree according to the pattern.
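The pattern traversal described on this slide can be sketched with an uncompacted suffix trie. The text "abab$" is an assumed example (the string behind the slide's figure is not recoverable), and the helper names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a nested-dict trie (quadratic size)."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix number stored at the leaf
    return root


def occurrences(trie: dict, pattern: str) -> list:
    """Walk down the trie along `pattern`, then collect all leaves below."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab$")  # assumed text; "$" is the end marker
positions = occurrences(trie, "ab")
```

For the assumed text, the pattern ab leads to a node whose subtree holds leaves 1 and 3, the starting positions of ab.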

Parallel Generation of GST • GST: Generalized Suffix Tree • Compacted trie • Longest common prefix found in constant time • Used for on-demand pair generation • Sequential: O(nl) • Parallel: O(nl/p)

Parallel Generation of GST (Cont …) • Previous implementations: • CRCW/CREW PRAM model • Work-optimal • Involves alphabetical ordering of characters • Unrealistic assumptions • synchronous operation of processors • infinite network bandwidth • no memory contention • Not practically efficient

Parallel Generation of GST (Cont …) • Paper’s approach: • ESTs equally distributed among processors • Each processor • Partitions suffixes of ESTs into buckets • Distribute buckets to the processors: • All suffixes in a bucket allocated to the same processor • Total # of suffixes allocated to a processor ≈ O(nl/p)
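A minimal single-process sketch of the bucketing step, under stated assumptions: suffixes are bucketed by their first w characters, and buckets are handed to the least-loaded processor in decreasing size order. Both choices, and the names `bucket_suffixes` and `assign_buckets`, are illustrative and not confirmed by the slides.

```python
from collections import defaultdict


def bucket_suffixes(ests: list, w: int) -> dict:
    """Group suffixes (est_id, start) by their first w characters.

    Assumption: suffixes shorter than w are ignored in this sketch.
    """
    buckets = defaultdict(list)
    for est_id, est in enumerate(ests):
        for pos in range(len(est) - w + 1):
            buckets[est[pos:pos + w]].append((est_id, pos))
    return buckets


def assign_buckets(buckets: dict, p: int):
    """Give each whole bucket to the currently least-loaded of p processors."""
    loads = [0] * p
    owner = {}
    for prefix, members in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        target = loads.index(min(loads))
        owner[prefix] = target  # every suffix in the bucket goes to one processor
        loads[target] += len(members)
    return owner, loads


ests = ["ACGTACGT", "CGTACGTT", "TTACGTAA"]  # hypothetical input
buckets = bucket_suffixes(ests, 3)
owner, loads = assign_buckets(buckets, 2)
```

Because a bucket is never split, each processor can build the subtree for its buckets independently; the greedy assignment only tries to keep the per-processor suffix counts close to one another.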

Parallel Generation of GST (Cont …) • Each bucket’s processor: • Compute compacted trie of all its suffixes • Cannot use sequential construction • Suffixes of a string • not in the same bucket • Each bucket: • Subtree in the GST • Nodes: • Depth first search traversal of the trie • Pointer to the right most child

On-demand Pair Generation • A pair should be generated if • Share substring of length ≥ threshold • Maximal • Leaves in a common node • Share a substring of length = depth of node • Parallel algorithm • Each processor works with its trie if • Depth of its root in GST < threshold

On-demand Pair Generation • To process • Sort internal nodes • Decreasing order of depth • Lists of a node • Generated after the node is processed • Removed after its parent is processed • Limits space to O(nl) • Run time ≈ # pairs generated + cost of sorting • Rejected pairs increase run-time by a factor of 2 • Eliminating duplicates reduces run-time
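The effect of visiting nodes in decreasing order of depth can be illustrated with a deliberately naive sketch. Here every distinct substring of length at least `threshold` stands in for a trie node, and a `seen` set removes duplicates; this is a toy stand-in, not the paper's per-node list mechanism, and it does not have the space bound stated above. The name `promising_pairs` and the example ESTs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def promising_pairs(ests: list, threshold: int):
    """Yield ((i, j), depth) once per pair, deepest shared substring first."""
    # Each distinct substring plays the role of one trie node; its "depth"
    # is the substring length, and it records which ESTs pass through it.
    nodes = defaultdict(set)
    for est_id, est in enumerate(ests):
        for start in range(len(est)):
            for end in range(start + threshold, len(est) + 1):
                nodes[est[start:end]].add(est_id)

    seen = set()
    # Decreasing depth, ties broken alphabetically for a stable order.
    for label in sorted(nodes, key=lambda s: (-len(s), s)):
        for pair in combinations(sorted(nodes[label]), 2):
            if pair not in seen:  # duplicate elimination
                seen.add(pair)
                yield pair, len(label)


ests = ["AAACGTACGG", "CCACGTACTT", "GGGGTTTTCC"]  # hypothetical input
pairs = list(promising_pairs(ests, 4))
```

With this input only the first two ESTs share a substring of length 4 or more, and the pair is emitted once, at depth 6, before any shallower node could produce it again.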

Parallel Clustering • Master-Slave paradigm: • Master processor: • Maintains and updates clusters • Using union-find data structure • Receives messages from slave processors • A batch of next promising pairs generated by slave • Results of the pairwise alignment • Determines which ones to explore • Determines if merging should occur • Slave processors: • Generate pairs on demand • Perform pairwise alignments of pairs dispatched by the master processor
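A minimal union-find sketch of the master's bookkeeping. The class name `ClusterSet` and the method `worth_aligning` are hypothetical; skipping pairs that already sit in the same cluster is one plausible reading of "determines which ones to explore", not a detail confirmed by the slides.

```python
class ClusterSet:
    """Union-find over EST ids 0..n-1 with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

    def worth_aligning(self, a: int, b: int) -> bool:
        """Assumed filter: only pairs from different clusters need alignment."""
        return self.find(a) != self.find(b)


clusters = ClusterSet(5)
clusters.union(0, 1)  # alignment of (0, 1) passed the quality test
clusters.union(1, 2)  # alignment of (1, 2) passed the quality test
```

After these two merges the pair (0, 2) no longer needs an alignment, which is how cluster state on the master can reduce the total alignment work.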

Parallel Clustering (Cont…) Organization of Parallel Clustering Software • Batch of promising pairs generated + results of pairwise alignment • Batchsize or fewer # of pairs + results of pairwise alignment on each pair • [Figure: one Master P exchanging messages with several Slave Ps]

Parallel Clustering (Cont..) • To start: • Slave P starts with 3× batchsize pairs • Sends the 3rd batch to Master P • Starts alignment on 1st batch • Sends results on 1st + a newly generated batch • While waiting to receive results from Master P, aligns 2nd batch • Slave P always has a next batch to work on between: • Submitting the results of previous batch • Receiving another set of pairs

Parallel Clustering (Cont..) • Improve and control quality • Parameters: • Match and mismatch scores • Gap penalties • Post processing: • Detection of alternative splicing • Consulting protein databases • Organism specific
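To show how the scoring parameters listed above enter a dynamic-programming alignment, here is a generic local-alignment score with a linear gap penalty. It is a textbook sketch, not the alignment variant used in PaCE, and the default values for `match`, `mismatch`, and `gap` are arbitrary assumptions.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b, keeping two DP rows in memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either restart (0), extend diagonally, or open a gap in a or b.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


score = local_alignment_score("TTACGTAA", "CCACGTCC")  # shared core "ACGT"
```

Raising the mismatch or gap penalties makes the score more conservative, which is the kind of quality control the parameters on this slide provide.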

Experimental environment • Used C and MPI • Tested • Quality of software: • Arabidopsis thaliana (due to availability of its genome) • Run-time behavior: • 50,000 Maize ESTs with 32-processor IBM SP • # of processors • Data size • (# of Promising pairs) vs data size • Batchsize vs (# processors) • # of Clusters • Master processor’s time

Quality Assessment • To assess quality • A data set and its correct clustering • ESTs from plant Arabidopsis thaliana • Splice program • Align ESTs to the genome • Discard ESTs that • Don’t align • Aligned in multiple spots

Quality Assessment (Cont …) • False negative: • A pair in correct clustering is not paired in the output • 5% • False positive: • A pair not in correct clustering appears in results • Negligible (< 0.04%) • Due to conservative nature of algorithm
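The two pair-based error definitions above can be computed as follows. The function names and the toy clusterings are hypothetical, and the sketch returns raw counts only, because the slides do not say which denominator the percentages use.

```python
from itertools import combinations


def co_clustered_pairs(clusters: list) -> set:
    """All unordered pairs of items that share a cluster."""
    return {pair for c in clusters for pair in combinations(sorted(c), 2)}


def pair_errors(reference: list, predicted: list) -> tuple:
    """Return (false_negatives, false_positives) as pair counts."""
    ref = co_clustered_pairs(reference)
    out = co_clustered_pairs(predicted)
    false_negatives = len(ref - out)  # paired in the reference, not in the output
    false_positives = len(out - ref)  # paired in the output, not in the reference
    return false_negatives, false_positives


reference = [[0, 1, 2], [3, 4]]       # hypothetical correct clustering
predicted = [[0, 1], [2], [3, 4, 5]]  # hypothetical program output
errors = pair_errors(reference, predicted)
```

In this toy case the output misses the pairs (0, 2) and (1, 2) and wrongly adds (3, 5) and (4, 5).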

Quality Assessment • Distribution of the number of singleton and non-singleton clusters for benchmark set of 168,200 Arabidopsis ESTs.

Quality Assessment (Cont..)

Run-time Assessment • Experiment with 50,000 maize ESTs: • 32-processor IBM SP-2 • 16 minutes

Run-time Assessment (Cont …) Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.

Run-time Assessment (Cont ..) • Run-time as a function of batchsize • Small batchsize • Increase in communication overhead • Large batchsize • Slaves less responsive to the need of generating pairs • Slave does not use latest clustering results • Optimal batchsize • Determined by experiment • Master processor’s time • Fixed batchsize, increase in # of processors • Gradual increase in Master P’s time • With 32 processors, increase < 1% • Using 1 Master Processor is not a bottleneck

Results • Space linear in size of the input data set • Reduced total work without sacrificing quality • Reduced run-time • Parallel processors • Eliminating pairs • Facilitate clustering • Scale memory with # of processors

Observations • PaCE: Approaches EST clustering problem directly • Better than • CAP3 • Phrap • TIGR Assembler • Compare time/quality • TIGICL (TIGR Indices Clustering Tool) • Support for PVM • MegaBlast • STACK • Large data sets • Lots of Processors • Can improve clustering time? • Clustering algorithm

References • http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf • A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.
