
HPC technologies applied to the Burrows-Wheeler Transform to enhance short read assembly

Presentation Transcript


  1. HPC technologies applied to the Burrows-Wheeler Transform to enhance short read assembly. Ignacio Blanquer

  2. Objectives • To justify the suitability of the Burrows-Wheeler Transform for problems related to NGS, especially alignment and assembly. • To present how GPUs can provide the computing resources needed for the large-scale problems in assembly and NGS. • To discuss limitations and other approaches. • All the work presented here is part of a collaboration between the I3M and the CIPF. • Thanks to Ignacio Medina, José Salavert, Joaquin Tárraga and Joaquin Dopazo. • First results published in Salavert J, Blanquer I, Tomas A, et al. "Using GPUs for the Exact Alignment of Short-read Genetic Sequences by Means of the Burrows-Wheeler Transform". IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012. SeqAhead Workshop on HPC4NGS

  3. I am a computer scientist (nobody is perfect) • Member of the High Performance and Grid Computing Research Group. • Working in medical and life science applications. • Responsible for application communities in the Spanish e-Science Network and the VENUS-C cloud computing project.

  4. Content • The problem of assembly • The overlap detection: a main bottleneck for NGS • Techniques for efficiently mapping short reads • Suffix tries and Suffix arrays. • Burrows-Wheeler Transform. • FM-Index. • Porting of an FM-Index-based searching tool to GPUs • Zero-error techniques. • One-error techniques. • Bottlenecks and improvements. • Extension to the problem of assembly. • Conclusions.

  5. The problem of Assembly in NGS • A puzzle with tens of thousands of millions of pieces. • Many of them are repeated. • Some of them are missing. • There is no exact reference, and sometimes there is no reference at all. • Finding 10^10 needles in 10^10 haystacks!

  6. Stages in the Assembly • Experiments -> Preprocessing -> Overlap Detection -> Layout -> Consensus Sequence -> Analysis • Idury, R.M. & Waterman, M.S. "A new algorithm for DNA sequence assembly". J. Comput. Biol. 2, 291-306 (1995). • T. Chen, S. Skiena, "Trie-Based Data Structures for Sequence Assembly", Combinatorial Pattern Matching 1997.

  7. Computational Analysis • For the hard stages: • Overlap detection (1) • In practical terms, limited in the best case by |X|·log(|X|)+|X|·avg(|Xi|). • Layout and Consensus Sequence (2) • Described as a bidirectional weighted graph. • Nodes are the different sequences and arrows describe overlaps. • Arrows' weights typically define the unoverlapped fragment. • Typically NP-Hard, but there exist solutions in O(|E|2log2(|V|)) • |E| is the number of cycles and |V| is the maximum number of nodes. (1) Jared T. Simpson and Richard Durbin, "Efficient de novo assembly of large genomes using compressed data structures", Genome Res., December 2011. (2) Medvedev P, Georgiou K, Myers G, Brudno M: "Computability of Models for Sequence Assembly". Lecture Notes in Computer Science 2007, 4645:289-301.

  8. Overlap detection

  9. Nomenclature • X = {X1…Xu} is the set of all the u sequences in an NGS experiment. • Each sequence Xi has a length of ni elements over an alphabet Σ of 4 symbols. • For simplicity, quality indicators are not considered in the algorithms presented here; they are implemented in the final versions. • W denotes a sequence to be searched over X or Xi.

  10. The Overlap Detection • Problem: to find all pairs Xi, Xj such that Xi -> Xj in at least k elements. • Xi -> Xj if Xi[ni-k+1..ni] == Xj[1..k], i.e. the last k elements of Xi equal the first k elements of Xj. • A brute-force approach requires checking each Xi against every Xj, i != j. • An NGS experiment may involve 20 Gigabases. • Unfeasible for any traditional searching process (FASTA, BLAST, SW, etc.). • Need for advanced searching structures.
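
The brute-force check described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `overlaps_brute_force` is hypothetical, indices are 0-based, and the three short reads are borrowed from the multi-sequence example later in the deck.

```python
def overlaps_brute_force(reads, k):
    """Return (i, j, length) for every ordered pair i != j whose longest
    suffix-of-reads[i] / prefix-of-reads[j] overlap has at least k symbols."""
    if k < 1:
        raise ValueError("k must be at least 1")
    found = []
    for i, xi in enumerate(reads):
        for j, xj in enumerate(reads):
            if i == j:
                continue
            # Try the longest candidate first, so the first hit is the maximum.
            for length in range(min(len(xi), len(xj)), k - 1, -1):
                if xi[-length:] == xj[:length]:
                    found.append((i, j, length))
                    break
    return found


reads = ["AGGAGC", "GAGCTAG", "GCTAGA"]
print(overlaps_brute_force(reads, 3))  # [(0, 1, 4), (1, 2, 5)]
```

Every read is compared with every other read, so the cost grows quadratically with the number of reads; this is the complete cross comparison that the indexed structures of the following slides try to avoid.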

  11. The Overlap Detection • Issues • Computational time • We should avoid complete cross comparison. • Linear or quasi-linear methods are needed. • Memory storage • Indexed searching requires 9-10 bytes per base. • This would mean around 200 GB RAM. • Need for efficient structures

  12. Searching structures • Different structures speed up searching while also reducing memory requirements • Suffix arrays • Suffix tries • BWT-based Suffix tries • FM-Index. • These techniques are also valuable when searching for short seeds that are then extended using dynamic programming • E.g. Smith-Waterman.

  13. Suffix Arrays & Suffix Trees

  14. Suffix Arrays • A sorted list of the indexes of the different suffixes of a sequence. • Can be built in O(n·log(n)) time, where n is the size of the text. • Needs 6n bytes, and searching for a string of length p requires O(p·log(n)). • Example: X = AGGAGC (positions 1..6). Suffixes by start position: 1 AGGAGC, 2 GGAGC, 3 GAGC, 4 AGC, 5 GC, 6 C; their ranks in sorted order are 2, 6, 4, 1, 5, 3, so SA = 4 1 6 3 5 2. • Binary search for W = "GAG": Li=1, Ls=6 -> k=(1+6)/2 -> 3, SA(3)=6, X(6:$)="C" < W; Li=4, Ls=6 -> k=(4+6)/2 -> 5, SA(5)=5, X(5:$)="GC" > W; Li=4, Ls=4 -> k=(4+4)/2 -> 4, SA(4)=3, X(3:$)="GAGC" matches W.
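
A minimal sketch of the suffix array of this slide and of the binary search over it. The function names are illustrative, indices are 0-based (the slide counts from 1), and the construction simply sorts the suffixes, which is slower than the O(n·log(n)) bound quoted above but keeps the example short.

```python
def build_suffix_array(text):
    # Naive construction: sort the start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for all suffixes that start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


X = "AGGAGC"
SA = build_suffix_array(X)
print([i + 1 for i in SA])              # 1-based, as on the slide: [4, 1, 6, 3, 5, 2]
print(find_occurrences(X, SA, "GAG"))   # [2], i.e. position 3 when counting from 1
```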

  15. Suffix Trees and Suffix Tries • A Trie (from reTRIEval) is a special tree used to code the suffixes of a string or group of strings. • Equivalent to the Suffix Array. • By condensing the different leaves, a Suffix Tree is obtained. [Figure: suffix trie and suffix tree of X = AGGAGC, positions 1..6]

  16. Burrows-Wheeler Transform

  17. Burrows-Wheeler Transform • Typically used in bzip compression and text searching. • It consists of the characters that precede each suffix, taken in Suffix Array order. • It can be seen as the last character of all the sorted rotations of the reference sequence. • The BWT groups all possible suffixes, speeding up the searching. [Figure: sorted rotations, rows 0..6, with suffix array 6 3 0 5 2 4 1]

  18. Recovering the original text from BWT • In order to recover the original text, only the first (F) and last (B, the BWT) columns are needed. • X = AGGAGC$, B = CG$GGAA (positions 0..6), suffix array = 6 3 0 5 2 4 1. • Start from the row whose last symbol is the terminator: B(2) -> '$'. • The first symbol of the original string is F(2) -> 'A'. • The following symbol is the first one in the row that ends with this 'A'. • Since it is the second 'A' in F, it is also the second 'A' in the BWT -> B(6). • So the next symbol is F(6) -> 'G'. • Repeating the process gives the original string: F(2), F(6), F(4), F(1), F(5), F(3), F(0) = AGGAGC$.
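
The transform and the recovery procedure described above can be sketched as follows. The function names are illustrative, and the construction sorts the suffixes naively; the sketch assumes that the text ends with a unique '$' that sorts before every other symbol.

```python
def bwt(text):
    """Return (B, SA) for a text that ends with the terminator '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # B holds the symbol that precedes each sorted suffix (wrapping around for i == 0).
    return "".join(text[i - 1] for i in sa), sa


def inverse_bwt(b):
    """Rebuild the text from B using only the first (F) and last (B) columns."""
    f = sorted(b)
    positions = {}
    for idx, ch in enumerate(b):
        positions.setdefault(ch, []).append(idx)
    seen, next_row = {}, []
    for ch in f:
        rank = seen.get(ch, 0)          # this is the rank-th occurrence of ch in F
        next_row.append(positions[ch][rank])   # ... and the same occurrence in B
        seen[ch] = rank + 1
    row, out = b.index("$"), []
    for _ in range(len(b)):
        out.append(f[row])
        row = next_row[row]
    return "".join(out)


B, SA = bwt("AGGAGC$")
print(B, SA)            # CG$GGAA [6, 3, 0, 5, 2, 4, 1]
print(inverse_bwt(B))   # AGGAGC$
```

The rows visited by `inverse_bwt` for this example are 2, 6, 4, 1, 5, 3, 0, the same order listed on the slide.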

  19. Burrows-Wheeler Transform (BWT) • The leaves of the suffix tree are annotated with the range of suffixes that match the input sequence. • The advantage is that each node groups several matches and provides additional information to deal with errors. [Figure: suffix tree for R = "^AGGAGC$", each node annotated with its suffix-array interval, starting from the root interval 0,6]

  20. Burrows-Wheeler Transform (BWT) • Search for AGC (max. 1 error) in R = "AGGAGC$" • $ [0,6] • agA (1 error) [1,2] • A, C excluded. • aGA (1 error) [4,4] • A, C excluded. • GGA (2 errors) [6,6] -> rejected • agC (0 errors) [3,3] • A, C excluded. • aGC (0 errors) [5,5] • C, G excluded. • AGC (0 errors) [1,1] -> accepted • agG (1 error) [4,6] • C excluded. • aAG (2 errors) [1,2] -> rejected • aGG (1 error) [6,6] • C, G excluded. • AGG (1 error) [2,2] -> accepted [Figure: the annotated suffix tree of the previous slide, before and after the search]

  21. FM-Index

  22. FM-Index • Presented by Ferragina and Manzini (*). • Provides an efficient way to construct and traverse a BWT suffix tree. • Construction in O(n) time once the BWT is constructed. • Searching in time linear in the length of the input sequence. (*) Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In 41st IEEE Symposium on Foundations of Computer Science, FOCS, 390-398.

  23. FM-Index • Using the BWT, two data structures are created that enable searching in linear time. • Vector C contains, for each symbol b of the alphabet, the cumulative number of occurrences in the BWT of the symbols alphabetically lower than b. • Matrix O contains, for each symbol b and each position i, the number of occurrences of b in the BWT before position i. • BWT = "CG$GGAA" (positions 0..6); C has one entry per symbol A, C, G, T; O has one row per symbol A, C, G, T and columns 0..7. [Figure: C vector and O matrix for the example]
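
A small sketch of how C and O could be built for the example BWT. It assumes the convention that reproduces the numbers of the worked search example on slide 25: the terminator '$' is not counted in C, and O(b, i) counts the occurrences of b among the first i symbols of the BWT. The function name is illustrative.

```python
def build_fm_index(bwt, alphabet="ACGT"):
    """Return (C, O) for a BWT string; O[b][i] = occurrences of b in bwt[:i]."""
    C, total = {}, 0
    for b in alphabet:
        C[b] = total                # symbols strictly lower than b ('$' not counted)
        total += bwt.count(b)
    O = {b: [0] for b in alphabet}
    for ch in bwt:
        for b in alphabet:
            O[b].append(O[b][-1] + (ch == b))
    return C, O


C, O = build_fm_index("CG$GGAA")
print(C)        # {'A': 0, 'C': 2, 'G': 3, 'T': 6}
for b in "ACGT":
    print(b, O[b])
```

Each row of O has one more entry than the BWT has symbols (columns 0..7 here), which matches the column numbering that survives from the slide's figure.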

  24. Searching with the FM-Index • Searching along the tree uses the formulas k = C(b) + O(b, k) + 1 and l = C(b) + O(b, l + 1), where b is the character being processed. • The string is searched in reverse, from its last character to its first. • C represents the number of suffixes whose start is alphabetically lower, i.e. the offset in the M matrix of the BWT. • O represents the offset within the block of sequences where the complete current sequence could appear. [Figure: suffix array 6 3 0 5 2 4 1 for rows 0..6]

  25. Searching with the FM-Index: Example • X = "AGGAGC", W = "GAG" • "G": k=0, l=6 • k=C(G)+O(G,k)+1=3+0+1=4 • l=C(G)+O(G,l+1)=3+3=6 • [4,6] are the 3 suffixes starting with "G". • "A": k=4, l=6 • k=C(A)+O(A,k)+1=0+0+1=1 • l=C(A)+O(A,l+1)=0+2=2 • [1,2] are the 2 suffixes starting with "AG". • "G": k=1, l=2 • k=C(G)+O(G,k)+1=3+0+1=4 • l=C(G)+O(G,l+1)=3+1=4 • [4,4] is the suffix starting with "GAG". [Figure: suffix array 6 3 0 5 2 4 1 (rows 0..6), C vector and O matrix (columns 0..7) for A, C, G, T]
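
The worked example can be replayed with a short sketch. For brevity it recomputes C and O directly from the BWT string on every call instead of using precomputed tables, so it illustrates the formulas rather than their performance; `backward_search_steps` is an illustrative name.

```python
def backward_search_steps(bwt, pattern, alphabet="ACGT"):
    """Apply k = C(b) + O(b, k) + 1 and l = C(b) + O(b, l + 1), last symbol first."""
    def C(b):
        return sum(bwt.count(c) for c in alphabet if c < b)

    def O(b, i):
        return bwt[:i].count(b)     # occurrences of b in the first i symbols

    k, l = 0, len(bwt) - 1
    steps = []
    for b in reversed(pattern):
        k, l = C(b) + O(b, k) + 1, C(b) + O(b, l + 1)
        steps.append((b, k, l))
        if k > l:                   # empty interval: the pattern does not occur
            break
    return steps


print(backward_search_steps("CG$GGAA", "GAG"))
# [('G', 4, 6), ('A', 1, 2), ('G', 4, 4)]
```

The final interval [4,4] points to row 4 of the suffix array 6 3 0 5 2 4 1, i.e. text position 2 counting from 0, where "GAG" starts in AGGAGC.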

  26. Searching with the FM-Index (*) • Searching starts from the end of the string. • At element "j", up to 9 possible branches should be explored: • Exact matching: the error counter is not increased, and new values for k and l are calculated according to the formula. Processing continues with element j-1. • Mismatch: the search tree indicates that there are additional matches in the reference that differ in the current symbol. If the error counter has not reached the maximum allowed, all the possible branches are explored (up to 3). New values for k and l are calculated, the error counter is incremented and processing continues with element j-1. • Deletion: the algorithm considers that the current symbol may have been inserted and searches for matches in the reference skipping the current symbol. The error counter is increased (if possible) and processing continues with element j-1, but the values of k and l are kept unmodified. • Insertion: the algorithm considers that a symbol is missing and checks in the tree for the possible branches that include a new symbol at the present position (up to 4 branches). The error counter is increased, new values for k and l are calculated and processing continues with the same symbol. (*) Heng Li and Richard Durbin, "Fast and accurate short read alignment with Burrows-Wheeler Transform", Bioinformatics Vol. 25 no. 14, 2009, doi:10.1093/bioinformatics/btp324

  27. Dealing with errors • Early termination • It is possible to predict whether a branch will lead to an unfeasible solution by computing the number of errors that have to be assumed. • It requires computing vector D for each searched sample, using the reversed reference string. • It reduces the branching explosion.
      ComputeD(W)
        z ← 0
        j ← 0
        for i = 0 to |W|−1 do
          if W[j,i] is not a substring of X then
            z ← z+1
            j ← i+1
          fi
          D(i) ← z
        end
      end

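  28. The general case: recursive approach • Here O(b, i) counts the occurrences of b up to and including position i, so O(b, k−1) and O(b, l) play the role of O(b, k) and O(b, l+1) in the formulas of slide 24.
      CalculateD(W)
        k ← 1
        l ← |X|−1
        z ← 0
        for i = 0 to |W|−1 do
          k ← C(W[i]) + O(W[i], k−1) + 1
          l ← C(W[i]) + O(W[i], l)
          if k > l then
            k ← 1
            l ← |X|−1
            z ← z+1
          end
          D(i) ← z
        done

      InexRecur(W, i, z, k, l)
        if z < D(i) then return ∅
        if i < 0 then return {[k, l]}
        I ← ∅
        I ← I ∪ InexRecur(W, i−1, z−1, k, l)
        for each b ∈ {A, C, G, T} do
          k' ← C(b) + O(b, k−1) + 1
          l' ← C(b) + O(b, l)
          if k' ≤ l' then
            I ← I ∪ InexRecur(W, i, z−1, k', l')
            if b = W[i] then
              I ← I ∪ InexRecur(W, i−1, z, k', l')
            else
              I ← I ∪ InexRecur(W, i−1, z−1, k', l')
            end
          end
        done
        return I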
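
The recursion can be illustrated with a deliberately simplified Python sketch that explores only the match and mismatch branches (no insertions or deletions) and prunes with the D vector of slide 27. This is a swapped-in simplification, not the full algorithm above. Its assumptions: the interval starts at [0, |B|−1] and uses the O convention of slide 24, D is computed with a plain substring test on the reference instead of an index of the reversed reference, and all function names are illustrative.

```python
def compute_d(w, x):
    """D[i]: lower bound on the errors needed to align w[0..i] somewhere in x."""
    d, z, j = [], 0, 0
    for i in range(len(w)):
        if w[j:i + 1] not in x:     # this piece cannot match without an error
            z += 1
            j = i + 1
        d.append(z)
    return d


def search_with_mismatches(bwt, x, w, max_errors, alphabet="ACGT"):
    """Return the set of (k, l) intervals matching w with at most max_errors substitutions."""
    C = {b: sum(bwt.count(c) for c in alphabet if c < b) for b in alphabet}

    def O(b, i):
        return bwt[:i].count(b)

    d = compute_d(w, x)

    def recur(i, z, k, l):
        if z < (d[i] if i >= 0 else 0):     # early termination
            return set()
        if i < 0:
            return {(k, l)}
        found = set()
        for b in alphabet:
            k2 = C[b] + O(b, k) + 1
            l2 = C[b] + O(b, l + 1)
            if k2 <= l2:
                cost = 0 if b == w[i] else 1
                found |= recur(i - 1, z - cost, k2, l2)
        return found

    return recur(len(w) - 1, max_errors, 0, len(bwt) - 1)


print(sorted(search_with_mismatches("CG$GGAA", "AGGAGC", "AGC", 1)))
# [(1, 1), (2, 2)]
```

For AGC with one error allowed, the two intervals correspond to the exact hit AGC at [1,1] and the one-mismatch hit AGG at [2,2], the same two results that the trace on slide 20 ends with.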

  29. Difficulties in using GPUs • The recursive model, although supported in the latest versions, is not effective. • Multiple branches reduce the degree of parallelism. • GPU memory is limited (insufficient for the human genome). • Memory access coherence has a critical impact on the final performance. • Simplifications may be needed. • CPU-GPU cooperation is the key to success.

  30. Exact Search • By removing (or limiting) the branching for multiple errors, the code for processing multiple sequences simultaneously can be homogeneous. • Different sequences have different values of k and l. • However, different sequences can stop at different steps, due to different lengths or the presence of mismatches. • Pre-sorting by size could speed up the algorithm.

  31. Exact Search with GPUs
      void BWSearchGPU(W[][], nW[], k[], l[], k_ini, l_ini, C, O)
      {
        // One CUDA thread per search string.
        id_thread = blockIdx.x * blockDim.x + threadIdx.x;
        // The first four threads of each block copy C into shared memory.
        if (threadIdx.x<4) CopyToSharedMemory(C);
        __syncthreads();
        k2 = k_ini; l2 = l_ini;
        // Backward search; stops early once the interval becomes empty.
        for (i=nW[id_thread]-1; (k2<=l2) && (i>=0); i--)
          BWiteration(k2, l2, k2, l2, W[id_thread][i], C, O);
        k[id_thread] = k2;
        l[id_thread] = l2;
      }
     • GPU parallelization is achieved by running simultaneous searches, one on each CUDA thread. • The FM-index (C and O vectors) of the reference is copied to the GPU before searching. • The search strings (W) and the transform intervals (k, l) must be transferred between CPU and GPU.

  32. Extension to 1 error • Each node can lead to up to 9 branches. • E.g. AGC. [Figure: search tree for AGC with up to one error, showing the branches explored for each symbol (A, C, G, T) over the suffix-array intervals and the resulting matches]

  33. Support for a variable number of errors • Use of exact searching for pre-filtering sequences leading to an exact match • Short computing time • It may reduce the problem by 39%. • For the sequences not found, define a threshold and use the matching fragments as seeds • The rest of the sequence can be aligned using Smith-Waterman or similar approaches. • 1-error searching slightly increases alignment time (overlapped), but increases accuracy (42%).

  34. Towards a useful tool: combination of CPU and GPU

  35. Other Optimizations • The O matrix is huge • The number of elements stored can be reduced by storing only one element out of every 32 and encoding the changes in between as the bits of a 32-bit word. • This enables storing the whole O for the human genome in state-of-the-art GPU boards. • Performance is not compromised, thanks to the use of machine instructions such as _popcnt. • Partial sorting of the reference • Considering genetic variability, the ordering of the S array can take into account only the first n<|X| elements. • However, this is incompatible with the compression of the S vector. • Overlapping I/O and processing • Input and output of the different blocks of sequences take place during the processing in the GPU.
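
The sampling of O can be sketched as follows, for a single symbol. It is a CPU-side Python illustration of the idea only: one stored count per block of 32 positions plus a 32-bit mask, with the population count emulated by `bin(...).count("1")` where a GPU or CPU would use a popcount instruction. The names and the exact layout are assumptions, not the layout of the presented tool.

```python
BLOCK = 32


def build_sampled_occ(bwt, symbol):
    """Store one count per block of 32 positions plus a 32-bit occurrence mask."""
    checkpoints, masks, count = [], [], 0
    for start in range(0, len(bwt), BLOCK):
        checkpoints.append(count)           # occurrences before this block
        mask = 0
        for offset, ch in enumerate(bwt[start:start + BLOCK]):
            if ch == symbol:
                mask |= 1 << offset
                count += 1
        masks.append(mask)
    checkpoints.append(count)               # sentinel block for i == len(bwt)
    masks.append(0)
    return checkpoints, masks


def occ(checkpoints, masks, i):
    """Occurrences of the symbol among the first i positions of the BWT."""
    block, offset = divmod(i, BLOCK)
    below = masks[block] & ((1 << offset) - 1)      # keep only bits before offset
    return checkpoints[block] + bin(below).count("1")   # popcount


bwt = "CG$GGAA" * 20
cp, m = build_sampled_occ(bwt, "G")
print(occ(cp, m, 100), bwt[:100].count("G"))    # both values agree
```

With this layout each symbol needs one counter and one 32-bit word per 32 positions instead of 32 counters, at the price of one mask-and-popcount per lookup.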

  36. Other Optimizations • Compression of the Suffix Array • The suffix array, again, is huge (one integer per element of the BWT). • Compression is feasible by storing a fraction of the elements (with a fixed stride) and iterating with the formulas • S(k) = S((Ψ⁻¹)^(j)(k)) + j, where (Ψ⁻¹)^(j) denotes Ψ⁻¹ applied j times • Ψ⁻¹(i) = C(B[i]) + O(B[i], i) • Combining searching in both strands.
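
A minimal sketch of the suffix-array compression described above: only every `stride`-th row is kept, and any other entry is recovered by stepping backwards through the BWT until a stored row is reached. It assumes that O(B[i], i) counts occurrences up to and including row i and that '$' is not counted in C; the names are illustrative, and C and O are recomputed from the BWT string for brevity.

```python
def sampled_suffix_array(text, stride):
    """Return (full SA, BWT, sampled SA) for a text ending with '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt, sa[::stride]            # keep rows 0, stride, 2*stride, ...


def locate(row, bwt, sampled, stride, alphabet="ACGT"):
    """Recover S(row) from the sampled entries by iterating the backward step."""
    C = {b: sum(bwt.count(c) for c in alphabet if c < b) for b in alphabet}
    j = 0
    while row % stride != 0:
        ch = bwt[row]
        if ch == "$":                       # this row is the whole text: position 0
            return j
        row = C[ch] + bwt[:row + 1].count(ch)   # row of the suffix one position earlier
        j += 1
    return sampled[row // stride] + j


sa, bwt, sampled = sampled_suffix_array("AGGAGC$", 4)
print(sampled)                                              # [6, 2]
print([locate(r, bwt, sampled, 4) for r in range(len(sa))])
# [6, 3, 0, 5, 2, 4, 1]
```

A larger stride saves more memory but lengthens the walk needed to resolve each position, which is the trade-off behind the fixed stride mentioned on the slide.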

  37. Response time: find and show 1 match

  38. Response time: show all matches

  39. Response time for different block times

  40. Speed-up for different input sizes

  41. Distribution of processing time: with disk caching and without disk caching

  42. Use of BWT in Assembly

  43. Can we apply the FM-index directly? • Construct a BWT for each of the Xi sequences • Computational time: sum(|Xi|·log(|Xi|)+|Xi|) • Search each Xi sequence over all the BWTs • Computational time: |X|·sum(|Xi|) • Memory requirements (FM-index) • O: 4·8·sum(|Xi|) • S: 8·sum(|Xi|) • Instantiation: 200 million sequences of 100 bases • Computational time > Pf. • Memory requirements: > 600 GB • Unfeasible!
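
The instantiation can be checked with a few lines of arithmetic. The reading of "4·8" as four symbols times 8 bytes per counter, and of "8" as 8 bytes per suffix-array entry, is an assumption; the variable names are illustrative.

```python
N_READS = 200_000_000       # 200 million sequences
READ_LEN = 100              # bases per sequence

total_bases = N_READS * READ_LEN        # sum(|Xi|)
o_bytes = 4 * 8 * total_bases           # O matrix: 4·8·sum(|Xi|)
s_bytes = 8 * total_bases               # S vector: 8·sum(|Xi|)
search_steps = N_READS * total_bases    # |X|·sum(|Xi|)

print(total_bases)          # 20000000000 bases (20 Gigabases)
print(o_bytes / 1e9)        # 640.0 GB for O
print(s_bytes / 1e9)        # 160.0 GB for S
print(search_steps)         # 4000000000000000000 search steps
```

The O matrix alone already exceeds the 600 GB quoted on the slide, before the suffix array is added.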

  44. The SGA* algorithm • An approach could be to create a single BWT that can be used to search all the hits for each sequence simultaneously. • Ideally, once the multiple BWT is created, the computing time will be linear in the size of the sequencing data. • Moreover, the multiple BWT already gives information about similar sequences. (*) Jared T. Simpson and Richard Durbin, "Efficient de novo assembly of large genomes using compressed data structures", Genome Res., December 2011.

  45. Suffix arrays for multiple sequences • A Suffix Array can be extended to cover a set of sequences. • SA(i) = (j,k): in the j-th sequence, the suffix Sj[k..|Sj|] occupies the i-th position in alphabetical order. • All sequences are terminated by a $j symbol, where $j is alphabetically lower than any symbol of the alphabet and $p < $q if p<q. • If two sequences are equal, their order is given by the order of the sequences.

  46. A picture is worth a million words • R1 = AGGAGC$1, R2 = GAGCTAG$2, R3 = GCTAGA$3. Sorted suffixes, with SA(i) = (sequence, position) and positions counted from 1:
       i   SA(i)  suffix
       1   (1,7)  $1
       2   (2,8)  $2
       3   (3,7)  $3
       4   (3,6)  A$3
       5   (2,6)  AG$2
       6   (3,4)  AGA$3
       7   (1,4)  AGC$1
       8   (2,2)  AGCTAG$2
       9   (1,1)  AGGAGC$1
      10   (1,6)  C$1
      11   (2,4)  CTAG$2
      12   (3,2)  CTAGA$3
      13   (2,7)  G$2
      14   (3,5)  GA$3
      15   (1,3)  GAGC$1
      16   (2,1)  GAGCTAG$2
      17   (1,5)  GC$1
      18   (2,3)  GCTAG$2
      19   (3,1)  GCTAGA$3
      20   (1,2)  GGAGC$1
      21   (2,5)  TAG$2
      22   (3,3)  TAGA$3

  47. Definition for the BWT • R1 = AGGAGC$1, R2 = GAGCTAG$2, R3 = GCTAGA$3 • SA(i) = (j, k) • B(i) = Rj(k-1)
       i   SA(i)  B(i)  F(i)
       1   (1,7)  C     $1
       2   (2,8)  G     $2
       3   (3,7)  A     $3
       4   (3,6)  G     A
       5   (2,6)  T     A
       6   (3,4)  T     A
       7   (1,4)  G     A
       8   (2,2)  G     A
       9   (1,1)  $1    A
      10   (1,6)  G     C
      11   (2,4)  G     C
      12   (3,2)  G     C
      13   (2,7)  A     G
      14   (3,5)  A     G
      15   (1,3)  G     G
      16   (2,1)  $2    G
      17   (1,5)  A     G
      18   (2,3)  A     G
      19   (3,1)  $3    G
      20   (1,2)  A     G
      21   (2,5)  C     T
      22   (3,3)  C     T

  48. FM-Index for multiple sequences • C = [3 9 12 20] (for A, C, G, T) • BWT = C G A G T T G G $1 G G G A A G $2 A A $3 A C C • O, for positions i = 1..22:
      O(A) = 0 0 1 1 1 1 1 1 1 1 1 1 2 3 3 3 4 5 5 6 6 6
      O(C) = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3
      O(G) = 0 1 1 2 2 2 3 4 4 5 6 7 7 7 8 8 8 8 8 8 8 8
      O(T) = 0 0 0 0 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
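
The values on this slide can be reproduced with a short sketch that builds the generalized suffix array, the BWT, C and O for the three example reads. It is an illustration under stated assumptions: suffixes are sorted naively, the terminator of read j is modelled as a token that sorts before every base and is ordered by j, and the function name is hypothetical.

```python
def multi_string_index(reads, alphabet="ACGT"):
    """Return (SA, BWT, C, O) for a set of reads, with 1-based (read, position) pairs."""
    def suffix_key(entry):
        j, k = entry
        # Bases sort as (1, base); the terminator $j sorts first, ordered by j.
        return [(1, ch) for ch in reads[j][k:]] + [(0, j)]

    entries = sorted(
        ((j, k) for j, r in enumerate(reads) for k in range(len(r) + 1)),
        key=suffix_key,
    )
    sa = [(j + 1, k + 1) for j, k in entries]
    # B(i) is the symbol before the suffix; a whole read is preceded by its own $j.
    bwt = [reads[j][k - 1] if k > 0 else "$%d" % (j + 1) for j, k in entries]
    C, total = {}, len(reads)           # the terminators sort before every base
    for b in alphabet:
        C[b] = total
        total += bwt.count(b)
    # O[b][i-1] = occurrences of b in the first i symbols of the BWT (i = 1..n).
    O = {b: [bwt[:i].count(b) for i in range(1, len(bwt) + 1)] for b in alphabet}
    return sa, bwt, C, O


sa, bwt, C, O = multi_string_index(["AGGAGC", "GAGCTAG", "GCTAGA"])
print("".join(bwt))     # CGAGTTGG$1GGGAAG$2AA$3ACC
print(C)                # {'A': 3, 'C': 9, 'G': 12, 'T': 20}
print("".join(map(str, O["G"])))    # 0112223445677788888888
```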

  49. Searching with the FM-index • W = GAGCTAG$2, searched over the SA, B and F table of slide 47. • (k, l) = (2, 22) • k' = C(x)+O(x,k-1)+1 • l' = C(x)+O(x,l) • G -> (C(G)+O(G,1)+1, C(G)+O(G,22)) -> (12+0+1, 12+8) = (13, 20) • A -> (C(A)+O(A,12)+1, C(A)+O(A,20)) -> (3+1+1, 3+6) = (5, 9) • T -> (C(T)+O(T,4)+1, C(T)+O(T,9)) -> (20+0+1, 20+2) = (21, 22) • C -> (C(C)+O(C,20)+1, C(C)+O(C,22)) -> (9+1+1, 9+3) = (11, 12) • G -> (C(G)+O(G,10)+1, C(G)+O(G,12)) -> (12+5+1, 12+7) = (18, 19)
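
The five steps above can be replayed with a small sketch. The BWT and the C vector are copied from slide 48, rows are 1-based, O(x, i) counts the occurrences of x in the first i symbols of the BWT, and the starting interval (2, 22) is taken as given on the slide; the helper names are illustrative.

```python
BWT = ["C", "G", "A", "G", "T", "T", "G", "G", "$1", "G", "G", "G",
       "A", "A", "G", "$2", "A", "A", "$3", "A", "C", "C"]
C = {"A": 3, "C": 9, "G": 12, "T": 20}


def occ(x, i):
    """O(x, i): occurrences of x in rows 1..i of the BWT."""
    return BWT[:i].count(x)


def backward_steps(pattern, k, l):
    """Apply k' = C(x)+O(x,k-1)+1 and l' = C(x)+O(x,l), last symbol first."""
    steps = []
    for x in reversed(pattern):
        k, l = C[x] + occ(x, k - 1) + 1, C[x] + occ(x, l)
        steps.append((x, k, l))
        if k > l:
            break
    return steps


for step in backward_steps("GCTAG", 2, 22):
    print(step)
# ('G', 13, 20), ('A', 5, 9), ('T', 21, 22), ('C', 11, 12), ('G', 18, 19)
```

The final interval (18, 19) covers SA entries (2,3) and (3,1): GCTAG is a suffix of R2 starting at position 3 and also the beginning of R3, which is exactly an overlap R2 -> R3 of 5 symbols.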

  50. A picture is worth a million words • The sorted-suffix figure of slide 46 for R1 = AGGAGC$1, R2 = GAGCTAG$2 and R3 = GCTAGA$3, repeated together with the search string W = GAGCTAG$2.
