Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari

Nargess Memarsadeghi

CMSC 838 Presentation

Talk Overview
  • Overview of talk
    • Motivation
    • Background
    • Techniques
    • Evaluation
    • Related work
    • Observations

CMSC 838T – Presentation

Motivation: EST Clustering
  • Problem: EST Clustering
    • Cluster fragments of cDNA
  • Related to ‘fragment assembly’ problem
    • Detecting overlapping fragments
  • Overlaps can be computed:
    • Pairwise alignment algorithm
    • Dynamic programming
  • Alternative:
    • Approximate overlap detection algorithms
    • Dynamic programming

Motivation
  • Common tools:
    • Take too long
      • Days for 100,000 ESTs
    • Run out of memory
  • This paper:
    • PaCE:
      • Parallel Clustering of ESTs
    • Efficient parallel EST Clustering
      • Space efficient algorithm
      • Reduce total work
      • Reduce run-time

Background: EST Clustering Tools
  • Three traditional software tools:
    • Originally designed for fragment assembly:
      • TIGR Assembler
      • Phrap
      • CAP3
  • One parallel software tool:
    • UICLUSTER: assumes ESTs come from the 3’ end

EST Clustering Tools
  • Basic approach
    • Find pairs of similar sequences
    • Align similar pairs
      • Dynamic programming
  • Quality of EST clustering
      • Phrap: Fastest
        • avoids dynamic programming
        • Relies on approximation, lower quality
      • CAP: Least # of erroneous clusters
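
To make the cost of the "align similar pairs" step concrete, here is a minimal sketch of a dynamic programming alignment score (Smith-Waterman style local alignment with linear gap penalties). The function name and the match, mismatch, and gap values are illustrative assumptions, not parameters taken from Phrap, CAP3, or TIGR Assembler.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Scores are illustrative assumptions. Time is O(len(a) * len(b)),
    memory is O(len(b)) because only two rows of the table are kept.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTAA", "GGACGTCC"))
```

Because the table has one cell per pair of positions, running this on every pair of ESTs quickly becomes the dominant cost, which is why the tools try to align only pairs that already look similar.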

EST Clustering Tools’ Performance
  • With 50,000 maize ESTs
    • Using a PC with dual 450 MHz Pentium processors and 512 MB RAM:
      • TIGR: ran out of memory
      • Phrap: 40 min
      • CAP: > 24 hours
  • With 100,000 maize ESTs
      • all ran out of memory
      • CAP would require 4 days

Goal
  • Space efficient algorithm
    • Space requirement linear in the size of the input data set
  • Reduce total work
    • Without sacrificing quality of clustering
  • Reduce run-time and facilitate the clustering of large data sets
    • Through parallel processing
    • Scale memory with # of processors

Approach
  • Expense:
    • Pairwise alignment (time + memory)
    • Promising pairs ≈ pairs of ESTs that share a common substring
      • Common substring s of length |s| = w
      • Cost: a common substring of length l > w yields the same pair l − w + 1 times
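
The repetition cost can be checked with a small sketch that indexes every length-w substring and counts how often each pair of sequences is reported. The function name `wmer_hits` and the toy sequences are assumptions made for illustration; this is the naive fixed-length lookup, not the PaCE algorithm.

```python
from collections import defaultdict
from itertools import combinations


def wmer_hits(seqs, w):
    """Count how many times each pair of sequences is reported when
    every shared length-w substring produces one report."""
    index = defaultdict(list)  # w-mer -> list of (sequence id, position)
    for sid, s in enumerate(seqs):
        for pos in range(len(s) - w + 1):
            index[s[pos:pos + w]].append((sid, pos))
    hits = defaultdict(int)
    for occurrences in index.values():
        for (i, _), (j, _) in combinations(occurrences, 2):
            if i != j:  # ignore repeats inside a single sequence
                hits[(min(i, j), max(i, j))] += 1
    return dict(hits)


if __name__ == "__main__":
    # The two sequences share the substring ACGTGCA (l = 7) and nothing else.
    print(wmer_hits(["TTACGTGCAGG", "CCACGTGCATT"], w=4))
```

With l = 7 and w = 4 the pair is reported 7 − 4 + 1 = 4 times, once per length-4 window inside the shared region.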

Approach (Cont ..)
  • Approach:
    • Use trie structure
    • Identify promising pairs
      • Merge clusters with strong overlaps
      • Avoid storing/testing all similar pairs
    • Parallel EST Clustering Software:
      • Generalized Suffix Tree (GST)
      • Multiple processors:
        • One maintains and updates the EST clusters
        • The others generate batches of promising pairs and perform alignments

Tries
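  • Each node indexes its children by character
  • For the suffixes of a string of length N: N leaves, one per suffix
  • Height N, the length of the longest suffix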

Suffix Tries (Cont ..)
  • TRIM suffix trie

Suffix Tries (Cont ..)
  • Indices
  • Storage is O(n), though the constant factor is high
  • Common string
  • Longest common substring

Suffix Tries (Cont ..)

[Figure: suffix tree with edges labeled a, b, and $ and leaves numbered 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.
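
The traversal can be sketched with an uncompacted suffix trie built from nested dictionaries. The string abab is an illustrative choice and is not necessarily the string drawn in the slide's figure; the function names and the use of a `None` key to mark leaves are assumptions of this sketch.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie of text + '$' as nested dicts.
    Each leaf stores the 1-based start of its suffix under the key None."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def find_occurrences(root, pattern):
    """Walk down the trie along pattern, then collect every leaf below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # the pattern does not occur in the text
        node = node[ch]
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                starts.append(child)  # child is the stored suffix start
            else:
                stack.append(child)
    return sorted(starts)


if __name__ == "__main__":
    trie = build_suffix_trie("abab")
    print(find_occurrences(trie, "ab"))
```

Every leaf below the node reached by the pattern is a suffix that starts with the pattern, so the leaf labels are exactly the positions where the pattern occurs.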

Parallel Generation of GST
  • GST: Generalized Suffix Tree
    • Compacted trie
    • Longest common prefix found in constant time
    • Used for on-demand pair generation
    • Sequential construction: O(nl) for n ESTs of average length l
    • Parallel construction: O(nl/p) on p processors

Parallel Generation of GST (Cont …)
  • Previous implementations:
      • CRCW/CREW PRAM model
      • Work-optimal
        • Involves alphabetical ordering of characters
      • Unrealistic assumptions
        • synchronous operation of processors
        • infinite network bandwidth
        • no memory contention
        • Not practically efficient

Parallel Generation of GST (Cont …)
  • Paper’s approach:
    • ESTs distributed equally among the processors
    • Each processor
      • Partitions suffixes of ESTs into buckets
    • Distribute buckets to the processors:
      • All suffixes in a bucket allocated to the same processor
      • Total # of suffixes allocated to a processor ≈ O(nl/p)
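
A single-process sketch of the bucketing step follows. Two things are assumptions of this sketch rather than statements from the slides: suffixes are bucketed by their first w characters, and buckets are handed out greedily (largest first, to the least loaded processor) so that each processor receives a roughly equal share of suffixes.

```python
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group every suffix, identified as (EST id, start), by its first w characters.
    Bucketing on a fixed-length prefix is an assumption of this sketch."""
    buckets = defaultdict(list)
    for eid, est in enumerate(ests):
        for pos in range(len(est)):
            buckets[est[pos:pos + w]].append((eid, pos))
    return buckets


def assign_buckets(buckets, p):
    """Give each whole bucket to one of p processors, largest bucket first,
    always choosing the currently least loaded processor (assumed heuristic)."""
    loads = [0] * p
    owner = {}
    ordered = sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for key, members in ordered:
        proc = loads.index(min(loads))
        owner[key] = proc
        loads[proc] += len(members)
    return owner, loads


if __name__ == "__main__":
    buckets = bucket_suffixes(["ACGTAC", "CGTACG"], w=2)
    owner, loads = assign_buckets(buckets, p=3)
    print(owner, loads)
```

Keeping a bucket whole matters because all suffixes that share a prefix end up in the same subtree, so one processor can build that subtree without talking to the others.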

Parallel Generation of GST (Cont …)
  • Each bucket’s processor:
    • Compute compacted trie of all its suffixes
    • Cannot use sequential construction
      • Suffixes of a string
        • not in the same bucket
  • Each bucket:
    • Subtree in the GST
  • Nodes:
    • Depth first search traversal of the trie
    • Pointer to the right most child

On-demand Pair Generation
  • A pair should be generated if
    • The ESTs share a substring of length ≥ threshold
    • That substring is maximal
    • Leaves under a common node
      • Share a substring of length = depth of the node
  • Parallel algorithm
    • Each processor works with its trie if
      • Depth of its root in the GST < threshold

On-demand Pair Generation
  • To process
    • Sort internal nodes
      • Decreasing order of depth
    • Lists of a node
      • Generated after the node is processed
      • Removed after its parent is processed
      • Limits space to O(nl)
      • Run time ≈ # pairs generated + cost of sorting
      • Rejected pairs increase run-time by a factor of 2
      • Eliminating duplicates reduces run-time
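
The order in which pairs come out can be illustrated with a brute-force stand-in. It is not the GST-based procedure from the slides: it computes the longest common substring of every pair directly and then yields pairs lazily, longest shared substring first, which mirrors visiting tree nodes in decreasing order of depth. All names are hypothetical.

```python
from itertools import combinations


def longest_common_substring_len(a, b):
    """Length of the longest substring common to a and b (O(len(a)*len(b)))."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def promising_pairs(ests, threshold):
    """Yield (i, j, shared length) for pairs sharing a substring of at least
    threshold characters, longest shared substring first, each pair once."""
    scored = []
    for i, j in combinations(range(len(ests)), 2):
        length = longest_common_substring_len(ests[i], ests[j])
        if length >= threshold:
            scored.append((length, i, j))
    for length, i, j in sorted(scored, key=lambda t: (-t[0], t[1], t[2])):
        yield i, j, length


if __name__ == "__main__":
    ests = ["AAACGTACGG", "TTACGTACCC", "GGGACGTTTT"]
    for pair in promising_pairs(ests, threshold=4):
        print(pair)
```

Emitting the strongest candidates first means that later, weaker pairs are more likely to fall inside an already merged cluster and can be skipped without alignment.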

Parallel Clustering
  • Master-Slave paradigm:
    • Master processor:
      • Maintains and updates clusters
        • Using union-find data structure
        • Receives messages from slave processors
          • A batch of next promising pairs generated by slave
          • Results of the pairwise alignment
        • Determines which ones to explore
        • Determines if merging should occur
    • Slave processors:
      • Generate pairs on demand
      • Perform pairwise alignments of pairs dispatched by the master processor
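
The master's bookkeeping can be sketched in a single process with a union-find structure. Message passing is left out entirely, and `aligns_well` is a hypothetical callback standing in for the alignment result a slave would report; only the skip-if-already-clustered logic is shown.

```python
class UnionFind:
    """Disjoint sets with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def cluster(n, promising_pairs, aligns_well):
    """Process pairs in order; align only pairs not yet in the same cluster.
    Returns the union-find structure and the number of alignments performed."""
    uf = UnionFind(n)
    alignments = 0
    for i, j in promising_pairs:
        if uf.find(i) == uf.find(j):
            continue  # already clustered together, no alignment needed
        alignments += 1
        if aligns_well(i, j):
            uf.union(i, j)
    return uf, alignments


if __name__ == "__main__":
    pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]
    uf, alignments = cluster(5, pairs, lambda i, j: True)
    print(alignments, uf.find(0) == uf.find(2), uf.find(0) == uf.find(3))
```

In this toy run the pair (0, 2) is never aligned because ESTs 0 and 2 were already joined through EST 1, which is the saving that comes from letting one process see all clustering decisions.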

Parallel Clustering (Cont…)

Organization of Parallel Clustering Software

  • Batch of promising pairs generated + results of pairwise alignment
  • Batchsize or fewer # of pairs + results of pairwise alignment on each pair

[Figure: a master processor exchanging the messages above with several slave processors]

Parallel Clustering (Cont..)
  • To start:
    • Slave P starts with 3× batchsize pairs
      • Sends the 3rd batch to Master P
      • Starts alignment on 1st batch
      • Sends results on 1st + a newly generated batch
      • While waiting to receive results from Master P, aligns 2nd batch
        • The processor always has a next batch to work on between:
          • Submitting the results of previous batch
          • Receiving another set of pairs

Parallel Clustering (Cont..)
  • Improve and control quality
      • Parameters:
        • Match and mismatch scores
        • Gap penalties
      • Post processing:
        • Detection of alternative splicing
        • Consulting protein databases
        • Organism specific

Experimental environment
  • Used C and MPI
  • Tested
    • Quality of software:
      • Arabidopsis thaliana (due to availability of its genome)
    • Run-time behavior:
      • 50,000 Maize ESTs with 32-processor IBM SP
      • # of processors
      • Data size
      • (# of Promising pairs) vs data size
      • Batchsize vs (# processors)
      • # of Clusters
      • Master processor’s time

Quality Assessment
  • To assess quality
    • A data set and its correct clustering
    • ESTs from the plant Arabidopsis thaliana
    • Splice program
      • Align ESTs to the genome
      • Discard ESTs that
        • Don’t align
        • Align in multiple spots

Quality Assessment (Cont …)
  • False negative:
    • A pair in correct clustering is not paired in the output
    • 5%
  • False positive:
    • A pair not in correct clustering appears in results
    • Negligible (< 0.04%)
    • Due to conservative nature of algorithm
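
The two pair-based error counts defined above can be computed as in the sketch below. Clusterings are given as one label per EST; the function names and the toy labels are assumptions. How the counts are turned into percentages (which denominator is used) is not stated on the slide, so only raw counts are returned.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_errors(reference, predicted):
    """Return (false negatives, false positives) counted over pairs.

    False negative: paired in the reference clustering but not in the output.
    False positive: paired in the output but not in the reference clustering.
    """
    ref = co_clustered_pairs(reference)
    out = co_clustered_pairs(predicted)
    return len(ref - out), len(out - ref)


if __name__ == "__main__":
    reference = [0, 0, 0, 1, 1]
    predicted = [0, 0, 1, 1, 1]
    print(pair_errors(reference, predicted))
```

In the toy example EST 2 is placed in the wrong cluster, which loses two correct pairs and creates two incorrect ones.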

Quality Assessment

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

Run-time Assessment
  • Experiment with 50,000 maize ESTs:
    • 32-processor IBM SP-2
    • 16 minutes

Run-time Assessment (Cont …)

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs; p is the number of processors.

Run-time Assessment (Cont ..)
  • Run-time as a function of batchsize
    • Small batchsize
      • Increase in communication overhead
    • Large batchsize
      • Slaves less responsive to the need of generating pairs
      • Slave does not use latest clustering results
    • Optimal batchsize
      • Determined by experiment
  • Master processor’s time
    • Fixed batchsize, increase in # of processors
      • Gradual increase in Master P’s time
    • With 32 processors, increase < 1%
    • Using one master processor is not a bottleneck

Results
  • Space Linear in size of the input data set
  • Reduced total work without sacrificing quality
  • Reduced run-time
    • Parallel processors
    • Eliminating pairs
  • Facilitated clustering of large data sets
    • Scale memory with # Processors

Observations
  • PaCE: Approaches EST clustering problem directly
    • Better than
      • CAP3
      • Phrap
      • TIGR Assembler
    • Compare time/quality
      • TIGICL (TIGR Indices Clustering Tool)
        • Support for PVM
      • MegaBlast
      • STACK
    • Large data sets
      • Lots of Processors
    • Can improve clustering time?
        • Clustering algorithm

References
  • http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf
  • A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.
