Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari

Talk Overview

- Overview of talk
- Motivation
- Background
- Techniques
- Evaluation
- Related work
- Observations

CMSC 838T – Presentation

Motivation: EST Clustering

- Problem: EST Clustering
- Cluster fragments of cDNA
- Related to ‘fragment assembly’ problem
- Detecting overlapping fragments
- Overlaps can be computed:
- Pairwise alignment algorithm
- Dynamic programming
- Alternative:
- Approximate overlap detection algorithms
- Dynamic programming
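
As a rough illustration of the dynamic programming step, the sketch below scores the best local alignment between two fragments in the Smith-Waterman style. The slides do not say which alignment variant or scoring scheme the tools use, so the function name and the match, mismatch, and gap values are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Smith-Waterman style recurrence with a linear gap penalty.
    The scoring values are illustrative assumptions, not taken from the talk.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two fragments that overlap in "ACGT" only.
score = local_alignment_score("TTACGTAA", "GGACGTCC")
```

The table has (len(a)+1) × (len(b)+1) cells, which is why running this on every pair of ESTs becomes expensive in both time and memory.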

Motivation

- Common Tools:
- Takes too long
- Days for 100,000 ESTs
- Runs out of memory
- This paper:
- PaCE:
- Parallel Clustering of ESTs
- Efficient parallel EST Clustering
- Space efficient algorithm
- Reduce total work
- Reduce run-time

Background: EST Clustering Tools

- Three traditional software tools:
- Originally designed for fragment assembly:
- TIGR Assembler
- Phrap
- CAP3
- One parallel software tool:
- UICLUSTER: assumes ESTs from 3’ end

EST Clustering Tools

- Basic approach
- Find pairs of similar sequences
- Align similar pairs
- Dynamic programming
- Quality of EST clustering
- Phrap: Fastest
- avoids dynamic programming
- Relies on approximation, lower quality
- CAP: Least # of erroneous clusters

EST Clustering Tools’ Performance

- With 50,000 maize ESTs
- Using a PC with dual 450 MHz Pentium processors and 512 MB RAM:
- TIGR: ran out of memory
- Phrap: 40 min
- CAP: > 24 hours
- With 100,000 maize ESTs
- all ran out of memory
- CAP would require 4 days

Goal

- Space efficient algorithm
- Space requirement linear in the size of the input data set
- Reduce total work
- Without sacrificing quality of clustering
- Reduce run-time and facilitate the clustering of large data sets
- Through parallel processing
- Scale memory with # of processors

Approach

- Expense:
- Pairwise alignment (time + memory)
- Promising pairs ≈ pairs sharing a common string s
- Common string: |s| = w
- Cost: if the common string has length l > w, the same pair is reported l − w + 1 times
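
The repetition cost can be checked with a small sketch. It assumes promising pairs are found by looking up shared length-w substrings in a hash index; the function name `shared_wmers` and the example sequences are hypothetical.

```python
def shared_wmers(a, b, w):
    """Return all (i, j) with a[i:i+w] == b[j:j+w]."""
    index = {}
    for j in range(len(b) - w + 1):
        index.setdefault(b[j:j + w], []).append(j)
    hits = []
    for i in range(len(a) - w + 1):
        for j in index.get(a[i:i + w], []):
            hits.append((i, j))
    return hits


# The two sequences share one common string of length l = 8 ("ACGTTGCA").
a = "GGG" + "ACGTTGCA" + "GGG"
b = "CCC" + "ACGTTGCA" + "CCC"
hits = shared_wmers(a, b, w=4)
# Every length-4 window inside the shared string reports the same pair again,
# so the pair (a, b) shows up l - w + 1 = 5 times.
```

This redundancy is what the approach on the following slides tries to avoid.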

Approach (Cont ..)

- Approach:
- Use trie structure
- Identify promising pairs
- Merge clusters with strong overlaps
- Avoid storing/testing all similar pairs
- Parallel EST Clustering Software:
- Generalized Suffix Tree (GST)
- Multiple processors:
- One maintains and updates the EST clusters
- Others generate batches of promising pairs, perform alignment

Suffix Tries (Cont ..)

- Indices
- Storage O(n), constant is high though
- Common string
- Longest common substring

Suffix Tries (Cont ..)

[Figure: suffix tree with edges labeled a, b, and $, and leaves numbered 1 to 5.]

Given a pattern P = ab we traverse the tree according to the pattern.
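
The traversal can be sketched with a naive, uncompacted suffix trie built from nested dictionaries. The input string `"abab"` is an assumption (the string behind the figure is not given in the transcript), and the helper names are hypothetical. This version takes quadratic space, unlike the O(n) compacted structure the slides refer to.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts.

    The key "$" marks the end of a suffix and stores its 1-based start
    position. Assumes text itself does not contain "$".
    """
    root = {}
    for start in range(len(text) + 1):  # +1 also inserts the empty suffix
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1
    return root


def find(trie, pattern):
    """Follow pattern from the root; return start positions of all matches."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below the reached node is a suffix that starts with pattern.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("abab")
matches = find(trie, "ab")  # suffixes 1 ("abab") and 3 ("ab") start with "ab"
```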

Parallel Generation of GST

- GST: Generalized Suffix Tree
- Compacted trie
- Longest common prefix found in constant time
- Used for on-demand pair generation
- Sequential: O(nl)
- Parallel: O(nl/p)

Parallel Generation of GST (Cont …)

- Previous implementations:
- CRCW/CREW PRAM model
- Work-optimal
- Involves alphabetical ordering of characters
- Unrealistic assumptions
- synchronous operation of processors
- infinite network bandwidth
- no memory contention
- Not practically efficient

Parallel Generation of GST (Cont …)

- Paper’s approach:
- ESTs equally distributed among processors
- Each processor
- Partitions suffixes of ESTs into buckets
- Distribute buckets to the processors:
- All suffixes in a bucket allocated to the same processor
- Total # of suffixes allocated to a processor ≈ O(nl/p)
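
A sequential sketch of the bucketing idea follows. It assumes suffixes are bucketed by their first w characters and that buckets are handed out greedily, largest first, to the least loaded processor. The slides do not describe the balancing rule, so that part is a guess, and the names `bucket_suffixes` and `assign_buckets` are hypothetical.

```python
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes by their first w characters.

    Each suffix is recorded as (est_id, start). Suffixes shorter than w are
    skipped here, which is a simplifying assumption.
    """
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return buckets


def assign_buckets(buckets, p):
    """Give each whole bucket to one of p processors, balancing suffix counts."""
    loads = [0] * p
    owner = {}
    # Largest buckets first; ties broken by prefix for determinism.
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for prefix, members in ordered:
        target = loads.index(min(loads))  # least loaded processor so far
        owner[prefix] = target
        loads[target] += len(members)
    return owner, loads


ests = ["ACGTAC", "CGTACG"]
buckets = bucket_suffixes(ests, w=2)
owner, loads = assign_buckets(buckets, p=2)
```

Because a bucket is never split, every suffix that shares a given w-character prefix ends up on the same processor.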

Parallel Generation of GST (Cont …)

- Each bucket’s processor:
- Compute compacted trie of all its suffixes
- Cannot use sequential construction
- Suffixes of a string
- not in the same bucket
- Each bucket:
- Subtree in the GST
- Nodes:
- Depth first search traversal of the trie
- Pointer to the right most child

On-demand Pair Generation

- A pair should be generated if
- Share substring of length ≥ threshold
- Maximal
- Leaves in a common node
- Share a substring of length = depth of node
- Parallel algorithm
- Each processor works with its trie if
- Depth of its root in GST < threshold

On-demand Pair Generation

- To process
- Sort internal nodes
- Decreasing order of depth
- Lists of a node
- Generated after the node is processed
- Removed after parent is processed
- Limits space O(nl)
- Run time ≈ # pairs generated + cost of sorting
- Rejected pairs increase run-time by a factor of 2
- Eliminating duplicates reduces run-time
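
The effect of visiting nodes in decreasing order of depth is that pairs with the longest shared substring come out first. The sketch below reproduces only that output order with a brute-force stand-in: it computes the longest common substring of every pair by dynamic programming instead of walking a GST, so it is quadratic in the number of ESTs and is not the paper's mechanism. All names are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Length of the longest substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at (i-1, j-1)
                best = max(best, curr[j])
        prev = curr
    return best


def promising_pairs(ests, threshold):
    """Yield (i, j, length) for pairs sharing a substring of length >= threshold,
    longest shared substring first. Each pair is yielded once."""
    scored = []
    for i, j in combinations(range(len(ests)), 2):
        length = longest_common_substring(ests[i], ests[j])
        if length >= threshold:
            scored.append((length, i, j))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    for length, i, j in scored:
        yield i, j, length


ests = ["AAACGTACGT", "CGTACGTTTT", "GGGGACGTGG"]
pairs = list(promising_pairs(ests, threshold=4))
```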

Parallel Clustering

- Master-Slave paradigm:
- Master processor:
- Maintains and updates clusters
- Using union-find data structure
- Receives messages from slave processors
- A batch of next promising pairs generated by slave
- Results of the pairwise alignment
- Determines which ones to explore
- Determines if merging should occur
- Slave processors:
- Generate pairs on demand
- Perform pairwise alignments of pairs dispatched by the master processor
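
The master's bookkeeping can be sketched with a standard union-find structure. The sketch is sequential: `master_step` and the `align_ok` callback (standing in for a slave's alignment result) are hypothetical names, and the rule of skipping pairs that are already in the same cluster is an assumption about how the master decides which pairs to explore.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return True


def master_step(clusters, batch, align_ok):
    """Process one batch of promising pairs; return how many were aligned.

    Pairs already in the same cluster are skipped without alignment.
    """
    aligned = 0
    for i, j in batch:
        if clusters.find(i) == clusters.find(j):
            continue  # nothing to gain from aligning this pair
        aligned += 1
        if align_ok(i, j):
            clusters.union(i, j)
    return aligned


clusters = UnionFind(4)
batch = [(0, 1), (1, 2), (0, 2), (2, 3)]
# Hypothetical alignment outcome: every pair passes except (2, 3).
aligned = master_step(clusters, batch, lambda i, j: (i, j) != (2, 3))
```

In this run the pair (0, 2) is never aligned, because ESTs 0 and 2 were already merged through EST 1.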

Parallel Clustering (Cont…)

Organization of Parallel Clustering Software

[Figure: one master processor connected to three slave processors. Arrow labels:]
- Batch of promising pairs generated + results of pairwise alignment
- Batchsize or fewer pairs + results of pairwise alignment on each pair

Parallel Clustering (Cont..)

- To start:
- Slave P starts with 3× batchsize pairs
- Sends the 3rd batch to Master P
- Starts alignment on 1st batch
- Sends results on 1st + a newly generated batch
- While waiting to receive results from Master P, aligns 2nd batch
- Processor always has the next batch to work between:
- Submitting the results of previous batch
- Receiving another set of pairs

Parallel Clustering (Cont..)

- Improve and control quality
- Parameters:
- Match and mismatch scores
- Gap penalties
- Post processing:
- Detection of alternative splicing
- Consulting protein databases
- Organism specific

Experimental environment

- Used C and MPI
- Tested
- Quality of software:
- Arabidopsis thaliana (due to availability of its genome)
- Run-time behavior:
- 50,000 Maize ESTs with 32-processor IBM SP
- # of processors
- Data size
- (# of Promising pairs) vs data size
- Batchsize vs (# processors)
- # of Clusters
- Master processor’s time

Quality Assessment

- To assess quality
- A data set and its correct clustering
- ESTs from plant Arabidopsis thaliana
- Splice program
- Align ESTs to the genome
- Discard ESTs that
- Don’t align
- Aligned in multiple spots

Quality Assessment (Cont …)

- False negative:
- A pair in correct clustering is not paired in the output
- 5%
- False positive:
- A pair not in correct clustering appears in results
- Negligible (< 0.04%)
- Due to conservative nature of algorithm
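
The two pair-based error definitions can be written down directly. The sketch below only counts the pairs. The slides do not say what the reported percentages are measured against, so no rate is computed. The function names and the toy clusterings are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, whose items carry the same cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_errors(correct, predicted):
    """Return (false_negatives, false_positives) counted over pairs.

    False negative: paired in the correct clustering but not in the output.
    False positive: paired in the output but not in the correct clustering.
    """
    truth = co_clustered_pairs(correct)
    output = co_clustered_pairs(predicted)
    return len(truth - output), len(output - truth)


# Toy example: EST 2 was put in the wrong cluster.
correct = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]
false_neg, false_pos = pair_errors(correct, predicted)
```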

Quality Assessment

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

Run-time Assessment

- Experiment with 50,000 maize ESTs:
- 32-processor IBM SP-2
- 16 minutes

Run-time Assessment (Cont …)

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.

Run-time Assessment (Cont ..)

- Run-time as a function of batchsize
- Small batchsize
- Increase in communication overhead
- Large batchsize
- Slaves less responsive to the need of generating pairs
- Slave does not use latest clustering results
- Optimal batchsize
- Determined by experiment
- Master processor’s time
- Fixed batchsize, increase in # of processors
- Gradual increase in Master P’s time
- With 32 processors, increase < 1%
- Using 1 master processor is not a bottleneck

Results

- Space Linear in size of the input data set
- Reduced total work without sacrificing quality
- Reduced run-time
- Parallel processors
- Eliminating pairs
- Facilitate clustering
- Scale memory with # Processors

Observations

- PaCE: Approaches EST clustering problem directly
- Better than
- CAP3
- Phrap
- TIGR Assembler
- Compare time/quality
- TIGICL (TIGR Indices Clustering Tool)
- Support for PVM
- MegaBlast
- STACK
- Large data sets
- Lots of Processors
- Can improve clustering time?
- Clustering algorithm
