
De novo Parallel Assemblers Algorithm discussion

De novo Parallel Assemblers Algorithm discussion. Jintao Meng Shenzhen Institutes of Advanced Technology. Key notes. General Workflow Special in ABySS Special in YAGA. General workflow. General Workflow. 0. error removal or correction (Pre-assembler) 1. graph construction


Presentation Transcript


  1. De novo Parallel Assemblers: Algorithm Discussion. Jintao Meng, Shenzhen Institutes of Advanced Technology

  2. Key notes: General Workflow; Special in ABySS; Special in YAGA

  3. General workflow

  4. General Workflow
     0. Error removal or correction (pre-assembler)
     1. Graph construction
     2. Error removal: tips, bubbles, spurious links
     3. Graph simplification: unambiguous node merging in ABySS; graph compaction in YAGA; node concatenation in Velvet and EULER-SR
     4. Contig extension with paired-end information: search in ABySS; complex in YAGA
     5. Output contigs

  5. Special in ABySS

  6. Input data: given m error-free sequences of total length n, sampled from a genome of total length g, distribute them among p processors so that each processor holds m/p sequences, or n/p bases. This step has compute complexity O(n/p) and uses one all-to-all communication.

  7. Input data of our example: the genome length is 14, the read length is 6, the number of reads is 6, the number of processors is 3, and k = 3 is used for the k-mers.

  8. Distribute reads to processors: m = 6 error-free reads of total length n = 36, sampled from a genome of length g = 14; m/p = 2 sequences are distributed to each processor.

  9. Calculate the location of k-mers: the values (0, 1, 2, 3) are assigned to the bases (A, C, G, T). The assignment is invariant under reverse complementation, so a k-mer and its reverse complement map to the same location.

  10. Calculate the location of k-mers: each read is cut into k-mers with a sliding window, and a node tuple <cluster_processor_index, value_of_k-mer, edge_vector> is constructed for each k-mer:
      2, 110010 (TAG), (0000,0000)
      0, 001011 (AGT), (0000,0000)
      0, 101101 (GTC), (0000,0000)
      2, 110110 (TCG), (0000,0000)
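
The encoding and sliding-window steps of slides 9 and 10 can be sketched in Python. This is a minimal sketch, not ABySS code: the helper names and the `owner` rule (smaller of the two strand values, modulo p) are assumptions, chosen only so that a k-mer and its reverse complement land on the same processor. The processor indices it produces therefore need not match the ones printed on the slide.

```python
# 2-bit base values from the slide: A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def kmers(read, k):
    """Cut a read into k-mers with a sliding window."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def owner(kmer, p):
    """Hypothetical placement rule (an assumption, not the slide's hash):
    use the smaller of the two strand values so that a k-mer and its
    reverse complement are assigned to the same processor."""
    canonical = min(encode(kmer), encode(reverse_complement(kmer)))
    return canonical % p


def node_tuples(read, k, p):
    """Build <processor_index, value_of_k-mer, edge_vector> tuples
    with empty edge bitmaps, as on slide 10."""
    return [(owner(km, p), encode(km), (0b0000, 0b0000)) for km in kmers(read, k)]
```

For instance, `kmers("TAGTCG", 3)` yields TAG, AGT, GTC, TCG, the four nodes listed on slide 10, and `encode("TAG")` gives the bit pattern 110010.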

  11. Processing sequences: each processor sends its node tuples to the processor that owns them.
      Send to Processor 0: 0, AGT, (0000,0000), 2; 0, GAG, (0000,0000), 2; 0, GGC, (0000,0000); 0, GGC, (0000,0000); 0, TTA, (0000,0000); 0, GTC, (0000,0000), 2; 0, GAG, (0000,0000), 2; 0, GGC, (0000,0000); 0, GGC, (0000,0000); 0, TTA, (0000,0000)
      Send to Processor 1: 1, CCT, (0000,0000), 2; 1, CCT, (0000,0000), 2; 1, CTT, (0000,0000), 2; 1, CCT, (0000,0000), 2; 1, CTT, (0000,0000), 2; 2, TCG, (0000,0000), 2; 2, GCT, (0000,0000), 2
      Send to Processor 2: 2, TAG, (0000,0000); 2, TAG, (0000,0000); 2, GCT, (0000,0000); 2, TCG, (0000,0000), 3; 2, TAG, (0000,0000); 2, TCG, (0000,0000), 3; 2, TCG, (0000,0000), 2; 2, GCT, (0000,0000)

  12. Final node distribution: after the all-to-all communication, each processor holds all of its collected nodes in a local hash bucket (figure panel labeled "Input reads").
      Nodes: 0, AGT, (0000,0000), 2; 2, GCT, (0000,0000), 3; 1, CCT, (0000,0000), 4; 2, TAG, (0000,0000), 2; 0, GTC, (0000,0000), 2; 1, CTT, (0000,0000), 2; 2, TCG, (0000,0000), 5; 0, GAG, (0000,0000), 2; 0, GGC, (0000,0000), 2; 0, TTA, (0000,0000)

  13. Construct edges, sending messages: each node u sends a message (src, dst, direction) to each of its eight possible neighbors (u->v and w->u), and each node u receives messages from each of its eight possible neighbors. This step has compute complexity O(1) and uses two all-to-all communications.

  14. Construct edges, sending messages: node u = 0, AGT, (0000,0000) sends messages (src, dst, direction) to each of its eight possible neighbors (u->v, w->u).
      All k-mers overlapping the suffix of AGT (GT_ extended with A, C, G, or T), u->v:
        A: message(AGT, TAC, positive)
        C: message(AGT, GTC, positive)
        G: message(AGT, GTG, positive)
        T: message(AGT, GTT, positive)
      All k-mers overlapping the suffix of the negative side of AGT, which is ACT (CT_ extended with A, C, G, or T), w->u:
        A: message(AGT, TAG, negative)
        C: message(AGT, GAG, negative)
        G: message(AGT, CTG, negative)
        T: message(AGT, CTT, negative)
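
The eight candidate neighbors of a k-mer can be enumerated with a few lines of Python. This is a minimal sketch with assumed helper names, not the ABySS implementation. It always writes each destination on the strand that directly overlaps the sender, whereas the slide writes some destinations as their reverse complement (TAC for GTA, TAG for CTA, GAG for CTC); both spellings name the same node.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def successors(kmer):
    """The four k-mers whose prefix overlaps the (k-1)-suffix of kmer."""
    return [kmer[1:] + base for base in "ACGT"]


def predecessors(kmer):
    """The four k-mers whose suffix overlaps the (k-1)-prefix of kmer."""
    return [base + kmer[:-1] for base in "ACGT"]


def neighbor_messages(kmer):
    """Messages (src, dst, direction) to the eight possible neighbors:
    four extensions of the k-mer itself (positive) and four extensions
    of its reverse complement (negative)."""
    messages = [(kmer, v, "positive") for v in successors(kmer)]
    messages += [(kmer, w, "negative")
                 for w in successors(reverse_complement(kmer))]
    return messages
```

Extending the reverse complement of a k-mer is the same as finding its predecessors: for AGT, the reverse complements of CTA, CTC, CTG, CTT are TAG, GAG, CAG, AAG, which are exactly the k-mers ending in AG.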

  15. Construct edges, receiving messages: node u receives messages from each of its eight possible neighbors and sets its edge bitmaps. The node shown is 0, AGT, (0000,1000), with the positive-side and negative-side bitmaps each labeled ACGT.
      Set positive-side edge bitmap (w->u' => u->w'):
        A: message(TAC, AGT, negative)
        C: message(GTC, AGT, negative)
        G: message(GTG, AGT, positive)
        T: message(GTT, AGT, positive)
      Set negative-side edge bitmap (w->u => u'->w'):
        A: message(TAG, AGT, positive)
        C: message(GAG, AGT, positive)
        G: message(CTG, AGT, negative)
        T: message(CTT, AGT, negative)

  16. Final distributed de Bruijn graph (figure panel labeled "Input reads").
      Nodes: 1, CCT, (0100,0100), 4; 2, GCT, (0001,0100), 3; 0, AGT, (0100,1000), 2; 1, CTT, (1000,0100), 2; 2, TAG, (0001,1000), 2; 0, GTC, (0010,0001), 2; 2, TCG, (1000,0110), 5; 0, GAG, (0010,0010), 2; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)

  17. Read errors: "dead-end" branches; bubble removal (spurious links)

  18. Dead-end branches (tips), definition: a dead-end branch is a branch where one end terminates with no extension and whose length is less than k.

  19. Dead-end branch (tip) removal:
      Identify all dead-end nodes.
      Trace backward from each until a point of ambiguity is reached.
      If the branch is shorter than a threshold (k), delete it.
      This step has compute complexity O(2k) and uses 2k point-to-point communications.
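
The three steps above can be sketched serially in Python. This is a minimal single-process sketch under stated assumptions: the graph is a plain directed edge list, only branches that end without a successor are handled (the opposite direction is symmetric), and the function name `find_tips` is hypothetical. The distributed version would replace each step of the backward trace with a point-to-point message.

```python
from collections import defaultdict


def find_tips(edges, k):
    """Return the nodes on dead-end branches shorter than k.

    edges: iterable of (u, v) pairs in a simple directed graph.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    removed = set()
    # Step 1: dead-end nodes are nodes with no successor.
    for end in sorted(n for n in nodes if not succ[n]):
        branch, cur = [end], end
        # Step 2: trace backward while the path is unambiguous.
        while len(pred[cur]) == 1:
            parent = next(iter(pred[cur]))
            if len(succ[parent]) > 1:
                # Point of ambiguity: the parent forks.
                # Step 3: delete the branch only if it is short.
                if len(branch) < k:
                    removed.update(branch)
                break
            branch.append(parent)
            cur = parent
    return removed
```

With edges a->b, b->c, c->d, d->e and a one-node side branch b->x, `find_tips(edges, 3)` returns only `{"x"}`, because the branch c, d, e has length 3 and is kept. If both branches leaving a fork are shorter than the threshold, this sketch removes both.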

  20. Bubble definition: repeated read errors and single-nucleotide allelic differences cause "bubbles" of length 2k-1. Bubbles are popped by removing one of the two branches. Complex bubbles can form when multiple bubbles intersect; in that case a bubble-popping step either creates dead-end branches or reduces the bubble order by one. Popped bubbles are recorded in a log file so that potential allelic differences can be studied.

  21. Bubble removal:
      Find each point of divergence in the graph.
      Trace each path from the point of divergence forward, looking for paths that join after n nodes (k <= n <= 2k).
      If the paths join, remove the path with the lower coverage.
      Store all removed paths in a log file.
      This step has compute complexity O(4k) and uses 4k point-to-point communications.
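
A serial sketch of these steps in Python follows. It is only an illustration under assumptions: the graph is a dictionary of successor lists, coverage is a per-node number, the branch with the lower mean coverage is the one removed, and the name `pop_bubbles` is hypothetical. The returned list plays the role of the log of removed paths.

```python
from collections import defaultdict


def pop_bubbles(succ, coverage, k):
    """Return the lower-coverage branches of simple bubbles.

    succ: dict mapping node -> list of successor nodes.
    coverage: dict mapping node -> coverage value.
    """
    indegree = defaultdict(int)
    for targets in succ.values():
        for v in targets:
            indegree[v] += 1

    removed = []
    for fork, outs in succ.items():
        if len(outs) < 2:
            continue  # not a point of divergence
        by_join = defaultdict(list)
        for start in outs:
            path, cur = [], start
            # Trace forward along unambiguous nodes, at most 2k of them.
            while (indegree[cur] == 1
                   and len(succ.get(cur, [])) == 1
                   and len(path) < 2 * k):
                path.append(cur)
                cur = succ[cur][0]
            if path and indegree[cur] > 1:
                by_join[cur].append(path)  # cur is where paths rejoin
        for paths in by_join.values():
            paths = [p for p in paths if k <= len(p) <= 2 * k]
            if len(paths) < 2:
                continue
            # Keep the best-covered branch, remove the others.
            paths.sort(key=lambda p: sum(coverage[n] for n in p) / len(p),
                       reverse=True)
            removed.extend(paths[1:])
    return removed
```

For a fork F with branches a1->a2 and b1->b2 that rejoin at J, and coverage 10 on the a-branch against 1 on the b-branch, the call with k = 2 reports the b-branch as removed.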

  22. Vertex merging: ambiguous edges are removed from the graph, and vertices connected by unambiguous edges are merged, creating the initial contigs. The worst-case communication complexity is O(g); however, a good communication plan reduces the expected communication time to O(log(g/p)).
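
The merging rule can be illustrated with a small serial Python sketch. It makes simplifying assumptions that the slides do not: k-mers are treated as single-stranded (no reverse complements), everything runs in one process with no Request/Union/Update messages, and the function name `merge_unambiguous` is hypothetical.

```python
def merge_unambiguous(kmers):
    """Merge k-mers along unambiguous edges into initial contigs.

    An edge u->v is unambiguous when v is the only successor of u
    and u is the only predecessor of v.
    """
    kmers = set(kmers)
    succ = {u: [u[1:] + b for b in "ACGT" if u[1:] + b in kmers]
            for u in kmers}
    pred = {u: [b + u[:-1] for b in "ACGT" if b + u[:-1] in kmers]
            for u in kmers}

    def unambiguous(u, v):
        return succ[u] == [v] and pred[v] == [u]

    contigs = []
    for u in sorted(kmers):
        # Skip k-mers that sit in the middle or at the end of a chain.
        if len(pred[u]) == 1 and unambiguous(pred[u][0], u):
            continue
        contig, cur = u, u
        # Extend while the next edge is unambiguous.
        while len(succ[cur]) == 1 and unambiguous(cur, succ[cur][0]):
            cur = succ[cur][0]
            contig += cur[-1]
        contigs.append(contig)
    # Note: a cycle made only of unambiguous edges has no start node
    # and is not reported by this sketch.
    return contigs
```

Given the k-mers ATG, TGG, GGC, GCA, GCT, the chain ATG->TGG->GGC merges into ATGGC, while GCA and GCT stay separate because GGC has two successors.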

  23. Distributed vertex merging
      Nodes: 0, AGT, (0100,1000), 2; 0, AGTC, (0001,1000), 2; 1, CCT, (0100,0100), 4; 2, GCT, (0001,0100), 3; 1, CTT, (1000,0100), 2; 2, TAG, (0001,1000), 2; 0, GTC, (0010,0001), 2; 0, GAG, (0010,0010), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)

  24. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 2; 1, CCT, (0100,0100), 4; 2, GCT, (0001,0100), 3; 1, CTT, (1000,0100), 2; 2, TAG, (0001,1000), 2; 0, GAG, (0010,0010), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)
      Messages: (AGT, TCG, Request); (CCT, GAG, Request); (GCT, CTT, Request)

  25. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 2; 1, CCT, (0100,0100), 4; 2, GCT, (0001,0100), 3; 1, CTT, (1000,0100), 2; 2, TAG, (0001,1000), 2; 0, GAG, (0010,0010), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)
      Messages: (CTT, GCT, Union); (CTT, GCT, Update); (GAG, CCT, Union); (TCG, AGT, End); (GAG, CCT, Update)

  26. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 1, CCT, (0100,0100), 4; 2, GCT, (0001,0100), 3; 2, GCTT, (1000,0100), 5; 1, CTT, (1000,0100), 2; 2, TAG, (0001,1000), 2; 0, GAG, (0010,0010), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)

  27. Distributed vertex merging
      Nodes: 2, GCTT, (1000,0100), 5; 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, TAG, (0001,1000), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)
      Messages: (GCT, TTA, Request); (CCT, TCG, Request); (GGC, CCT, Request)

  28. Distributed vertex merging
      Nodes: 2, GCTT, (1000,0100), 5; 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, TAG, (0001,1000), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)
      Messages: (CCT, GGC, Lock); (TCG, CCT, End); (TTA, GCT, Union); (TTA, GCT, Update)

  29. Distributed vertex merging
      Nodes: 2, GCTT, (1000,0100), 5; 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, GCTTA, (0010,0100), 6; 2, TAG, (0001,1000), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, TTA, (0010,0000)

  30. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, GCTTAG, (0001,0100), 6; 2, GCTTA, (0010,0100), 6; 2, TAG, (0001,1000), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2

  31. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, GCTTAG, (0001,0100), 6; 2, GCTTA, (0010,0100), 6; 2, TAG, (0001,1000), 2; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2

  32. Distributed vertex merging
      Nodes: 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, GCTTAG, (0001,0100), 6; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2
      Messages: (GCT, AGT, Request); (AGT, GCT, Union); (AGT, GCT, Update)

  33. Distributed vertex merging
      Nodes: 2, GCTTAG, (0001,0100), 6; 0, AGTC, (0001,1000), 4; 1, CCTC, (0010,0100), 6; 2, GCTTAGTC, (0010,0100), 9; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2

  34. Distributed vertex merging
      Nodes: 1, CCTC, (0010,0100), 6; 2, GCTTAGTC, (0010,0100), 9; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2
      Messages: (GCT, GGC, Union); (GCT, TCG, Update); (GGC, GCT, Lock)

  35. Distributed vertex merging
      Nodes: 1, CCTC, (0010,0100), 6; 2, GCTTAGTC, (0010,0100), 9; 2, TCG, (1000,0110), 5; 0, GGC, (0001,0001), 2; 0, GGCTTAGTC, (0010,0001), 11

  36. Distributed vertex merging
      Nodes: 1, CCTC, (0010,0100), 6; 2, TCG, (1000,0110), 5; 0, GGCTTAGTC, (0010,0001), 11
      Messages: (CCT, GGC, Union); (GGC, CCT, Union); (GGC, CCT, Update)

  37. Distributed vertex merging
      Nodes: 1, CCTC, (0010,0100), 6; 1, GACTAAGCCTC, (0010,0100), 17; 2, TCG, (1000,0110), 5; 0, GGCTTAGTC, (0010,0001), 11

  38. The final contigs: TCG; GACTAAGCCTC. Other sequences shown in the figure: GACTAAGCCTCGA; CTAAGCCTCGACTA. The figure also carries the labels "Reference Sequence" and "Reverse Reference".

  39. Contig merging using paired-end information: resolve ambiguities and merge contigs with paired-end information. Two contigs are merged if they are linked by at least p pairs (default p = 5). For each contig Ci, the set Pi of contigs paired with Ci is generated; a graph search is then performed to look for a unique path from Ci to Pi.

  40. Related materials: Jared T. Simpson et al., "ABySS: A parallel assembler for short read sequence data", Genome Research, 2009; "ABySS: A parallel assembler for short read sequence data", Supplemental Material; www.bcgsc.ca/platform/bioinfo/software/abyss
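
The linking threshold can be sketched in Python as follows. This is a minimal sketch that assumes each read pair has already been mapped to the contigs containing its two ends; the input format and the name `paired_contigs` are hypothetical, and the subsequent graph search for a unique path is not shown.

```python
from collections import Counter, defaultdict


def paired_contigs(pair_hits, min_pairs=5):
    """Build the sets Pi of contigs linked to each contig Ci.

    pair_hits: iterable of (contig_a, contig_b), one entry per read
    pair, naming the contigs that hold the two ends of the pair.
    min_pairs: minimum number of supporting pairs (the slide's p).
    """
    support = Counter()
    for a, b in pair_hits:
        if a != b:  # pairs inside one contig link nothing
            support[frozenset((a, b))] += 1

    linked = defaultdict(set)
    for pair, count in support.items():
        if count >= min_pairs:
            a, b = sorted(pair)
            linked[a].add(b)
            linked[b].add(a)
    return dict(linked)
```

With five pairs joining C1 and C2 and only two joining C1 and C3, the default threshold keeps the C1-C2 link and discards the C1-C3 link.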

  41. Complexity analysis

  42. Special in YAGA

  43. Special in YAGA: bidirected de Bruijn graph; parallel graph construction (ICPP'08); parallel graph compaction (BMC Bioinfo'09); graph reduction (BMC Bioinfo'09); identification and removal of errors (IPDPS'10); summarizing clusters of paired reads (IPDPS'10)

  44. Bidirected de Bruijn graph: Type I: (A, B, +, +, g, c); Type II: (A, B, -, +, a, c); Type III: (A, B, +, -, g, t)

  45. Bidirected de Bruijn graph: the three possible overlaps between two k-molecules and the corresponding edges in the bidirected string graph.
      In the first case, the positive strand of A has a suffix-prefix overlap with the positive strand of B. It is recorded as (A, B, +, +, g, c).
      In the second case, the negative strand of A has a suffix-prefix overlap with the positive strand of B. It is recorded as (A, B, -, +, a, c).
      In the last case, the positive strand of A has a suffix-prefix overlap with the negative strand of B. It is recorded as (A, B, +, -, g, t).

  46. Question: how do you walk from E to A? Hint: you must maintain consistency of arrowheads when entering and leaving a node.
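
The three overlap cases can be checked mechanically. The Python sketch below is an illustration with an assumed function name `edge_types`; it reports only the first four fields of the tuple (A, B, strand of A, strand of B) and leaves out the last two fields, whose meaning is not spelled out in this transcript.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def edge_types(a, b):
    """List the bidirected edges between k-molecules a and b.

    An edge (a, b, da, db) exists when the chosen strand of a has a
    (k-1)-base suffix equal to the prefix of the chosen strand of b.
    """
    k = len(a)
    strands_a = {"+": a, "-": reverse_complement(a)}
    strands_b = {"+": b, "-": reverse_complement(b)}
    found = []
    # Type I, Type II, Type III in the order used on the slide.
    for da, db in (("+", "+"), ("-", "+"), ("+", "-")):
        if strands_a[da][1:] == strands_b[db][:k - 1]:
            found.append((a, b, da, db))
    return found
```

For example, ACG and CGT overlap on their positive strands (Type I), AAC and TTG overlap through the negative strand of AAC, which is GTT (Type II), and ACG and TCG overlap through the negative strand of TCG, which is CGA (Type III).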

  47. Parallel graph representation: a distributed tuple list. Each edge (u, v) is stored as two tuples, <u, v, du, dv, Cu, Cv> and <v, u, dv, du, Cv, Cu>, with 2k bits to represent node IDs. Sorting the tuples by node label places all edges incident to a node in local memory; sorting by (smaller node label, larger node label) places both tuples corresponding to an edge in local memory.

  48. Parallel graph construction: reads are partitioned across processors based on total size; each processor generates (k+1)-molecules (edges) from its short reads; the edges are sorted to eliminate duplicates while keeping a frequency count.

  49. Parallel graph construction, details: given m error-free sequences of total length n, sampled from a genome of total length g, distribute them among p processors so that each processor holds n/p bases. The bidirected string graph holds O(g) edges and nodes, so each processor knows all edges adjacent to O(g/p) nodes. The authors outline an algorithm that runs in O(n/p) time and uses a constant number of all-to-all communications.
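
The generate, sort, and count steps can be sketched on a single processor in Python. Two points here are assumptions rather than statements from the slides: a (k+1)-molecule is represented by the lexicographically smaller of its two strands, and the function name `count_edges` is hypothetical. A parallel version would use a distributed sort in place of `list.sort`.

```python
from itertools import groupby

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_edges(reads, k):
    """Generate (k+1)-molecules from reads, then sort them to
    eliminate duplicates while keeping a frequency count."""
    molecules = []
    for read in reads:
        for i in range(len(read) - k):
            molecule = read[i:i + k + 1]
            # Assumed canonical form: the smaller of the two strands.
            molecules.append(min(molecule, reverse_complement(molecule)))
    molecules.sort()
    return [(molecule, len(list(group)))
            for molecule, group in groupby(molecules)]
```

With reads ACGT and ACGA and k = 2, the 3-mers ACG, CGT, ACG, CGA collapse to ACG with count 3 (CGT is the reverse complement of ACG) and CGA with count 1.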
