
Read Corrector

D. Lavenier & N. Maillet, IRISA, Rennes


Presentation Transcript


  1. Read Corrector. D. Lavenier & N. Maillet, IRISA, Rennes

  2. Assembly from NGS data
     Next Generation Sequencer -> billions of bad reads -> Correction -> billions of good reads -> Assembly -> Contigs
     D. LAVENIER & N. Maillet - IRISA

  3. Benchmarks
     Data: 1 000 257 reads of 40 bp generated from CP_000025 (1.8 Mbp) with MetaSim
     Substitutions: 216 520
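
The figures on this slide imply an average coverage of about 22x and a substitution rate of about 0.54% per base. The arithmetic is sketched below; the genome length uses the rounded "1.8 Mbp" from the slide, so the coverage value is approximate.

```python
# Figures quoted on the Benchmarks slide.
n_reads = 1_000_257
read_len = 40            # bp
genome_len = 1_800_000   # bp, rounded "1.8 Mbp" from the slide (approximation)
n_subst = 216_520

total_bases = n_reads * read_len          # bases sequenced in total
coverage = total_bases / genome_len       # average depth per genome position
error_rate = n_subst / total_bases        # substitutions per sequenced base
errors_per_read = n_subst / n_reads       # average substitutions per read

print(f"total bases     = {total_bases}")
print(f"coverage        ~ {coverage:.1f}x")
print(f"error rate      ~ {error_rate:.2%} per base")
print(f"errors per read ~ {errors_per_read:.2f}")
```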

  4. Read Error Correction
     • Principle: use the coverage redundancy to correct erroneous reads

       AGGATGACCAGGATTAGGACCAGT
         GATGACCAGGATTAGGACCAGTTC
         GATGACCAGGATTAGGACCAGTTC
          ATGACCAGGATTAGGACCAGTTCA
             ACCAGGATTCGGACCAGTTCATTC
             ACCAGGATTAGGACCAGTTCATTC
             ACCAGGATTAGGACCAGTTCATTC
              CCAGGATTAGGACCAGTTCATTCA

       The C in ACCAGGATTCGG... is probably due to a sequencing error.

  5. Correction principle
     • Index the reads
     • Perform correction directly on the index structure
     • Update the index as soon as a read is corrected
     • Stop the process when no corrections occur
     • Reject reads which cannot be corrected

  6. Index structure (seed size = 4, INDEX has 4^4 entries)
     Read #k:      AGGATGACCAGGATTAGGATCAGT
     Double seed:  A GGAT G ACCA G GATTAGGATCAGT   (seedX = GGAT, seedY = ACCA)
     INDEX entry:  ACCA, k, 5, A, G, G

  7. Index structure (seed size = 4, INDEX has 4^4 entries)
     Read #k:      AGGATGACCAGGATTAGGATCAGT
     Double seed:  A G GATG A CCAG G ATTAGGATCAGT   (seedX = GATG, seedY = CCAG)
     INDEX entry:  CCAG, k, 6, G, A, G
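
A minimal sketch of how such entries could be produced is given below. It rests on assumptions read off the two examples rather than stated on the slides: the index is keyed by seedX, each entry stores (seedY, read id, position, base before seedX, base between the seeds, base after seedY), and the position is the 0-based offset of the base between the two seeds. A Python dict stands in for the 4^4-entry table, and the function names are hypothetical.

```python
from collections import defaultdict

SEED = 4  # seed size used on the slides


def double_seed_entries(read_id, read, seed=SEED):
    """Yield (seedX, entry) for every double seed of a read.

    entry = (seedY, read_id, pos, before, middle, after), where pos is the
    0-based position of the base lying between the two seeds (assumption).
    """
    span = 2 * seed + 3  # before + seedX + middle + seedY + after
    for start in range(len(read) - span + 1):
        before = read[start]
        seed_x = read[start + 1:start + 1 + seed]
        pos = start + 1 + seed
        middle = read[pos]
        seed_y = read[pos + 1:pos + 1 + seed]
        after = read[pos + 1 + seed]
        yield seed_x, (seed_y, read_id, pos, before, middle, after)


def build_index(reads, seed=SEED):
    """Index every read under each of its seedX values."""
    index = defaultdict(list)
    for read_id, read in reads.items():
        for seed_x, entry in double_seed_entries(read_id, read, seed):
            index[seed_x].append(entry)
    return index


index = build_index({"k": "AGGATGACCAGGATTAGGATCAGT"})
print(index["GGAT"])  # contains the entry shown on slide 6
print(index["GATG"])  # contains the entry shown on slide 7
```

With this layout a 24-bp read contributes 14 entries, one per window of 11 bases.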

  8. Read Error Correction
     Read                          INDEX entry
     k1 AGGATGACCAGGATTAGGACCAGT   GGAC, k1, 15, G, A, C
     k2 GATGACCAGGATTAGGACAAGTTC   GGAC, k2, 13, G, A, A
     k3 GATGACCAGGATTAGGACCAGTTC   GGAC, k3, 13, G, A, C
     k4 ATGACCAGGATTAGGACCAGTTCA   GGAC, k4, 12, G, A, C
     k5 ACCAGGATTCGGACCAGTTCATTC   GGAC, k5, 09, G, C, C
     k6 ACCAGGATTAGGACCAGTTCATTC   GGAC, k6, 09, G, A, C
     k7 ACCAGGATTAGGACCAGTTCATTC   GGAC, k7, 09, G, A, C
     k8 CCAGGATTAGGACCAGTTCATTCA   GGAC, k8, 08, G, A, C

  9. Voting algorithm
     k1 AGGATGACCAGGATTAGGACCAGT
     k2 GATGACCAGGATTAGGACAAGTTC -> GATGACCAGGATTAGGACCAGTTC
     k3 GATGACCAGGATTAGGACCAGTTC
     k4 ATGACCAGGATTAGGACCAGTTCA
     k5 ACCAGGATTCGGACCAGTTCATTC -> ACCAGGATTAGGACCAGTTCATTC
     k6 ACCAGGATTAGGACCAGTTCATTC
     k7 ACCAGGATTAGGACCAGTTCATTC
     k8 CCAGGATTAGGACCAGTTCATTCA
     Majority of A in the column: the C of k5 is replaced by A.
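
The vote can be sketched on the eight index entries listed on the "Read Error Correction" slide. This is an illustration, not the authors' code: the function name, the strict-majority threshold `min_ratio`, and the read offsets of the flanking bases (position - 5 and position + 5 for a seed size of 4) are assumptions.

```python
from collections import Counter

# Entries sharing seedY = GGAC, copied from the slide:
# (read id, position of the middle base, before, middle, after)
entries = [
    ("k1", 15, "G", "A", "C"),
    ("k2", 13, "G", "A", "A"),
    ("k3", 13, "G", "A", "C"),
    ("k4", 12, "G", "A", "C"),
    ("k5", 9, "G", "C", "C"),
    ("k6", 9, "G", "A", "C"),
    ("k7", 9, "G", "A", "C"),
    ("k8", 8, "G", "A", "C"),
]


def vote(entries, seed=4, min_ratio=0.5):
    """Return {read_id: [(read position, observed base, majority base), ...]}."""
    fixes = {}
    # tuple column -> offset of that base relative to the stored position
    columns = ((2, -(seed + 1)), (3, 0), (4, seed + 1))
    for col, offset in columns:
        counts = Counter(e[col] for e in entries)
        base, n = counts.most_common(1)[0]
        if n <= min_ratio * len(entries):
            continue  # no strict majority: leave this column untouched
        for e in entries:
            if e[col] != base:
                fixes.setdefault(e[0], []).append((e[1] + offset, e[col], base))
    return fixes


print(vote(entries))
```

On these entries the vote flags k5 at position 9 (C against a majority of A) and k2 at position 18 (A against a majority of C), matching the two corrected reads on the slide.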

  10. General algorithm, 3 steps:
      • Read indexing
        • based on double seeds
      • Read correction
        • iterate until no more correction is possible
      • Read rejection
        • remove reads which cannot be corrected

  11. Step 2: Correction
      do
        nb_err = 0
        for each entry of INDEX
          LR = list of bad reads
          nb_err += len(LR)
          for each element of LR
            un-index read from INDEX
            correct read
            index read into INDEX
      until nb_err == 0
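
The loop above can be sketched in runnable form as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the index is keyed by the pair (seedX, seedY), only the base between the two seeds is voted on (so errors within 4 bases of a read end are never reached), and `min_votes` is a hypothetical threshold below which no vote is taken.

```python
from collections import Counter, defaultdict

SEED = 4


def windows(read):
    """Yield ((seedX, seedY), pos) for each double seed; pos is the middle base."""
    for pos in range(SEED, len(read) - SEED):
        yield (read[pos - SEED:pos], read[pos + 1:pos + 1 + SEED]), pos


def correct_reads(reads, min_votes=3):
    reads = dict(reads)
    index = defaultdict(set)  # (seedX, seedY) -> {(read id, pos)}

    def add(rid):      # "index read into INDEX"
        for key, pos in windows(reads[rid]):
            index[key].add((rid, pos))

    def remove(rid):   # "un-index read from INDEX"
        for key, pos in windows(reads[rid]):
            index[key].discard((rid, pos))

    for rid in reads:
        add(rid)

    while True:
        nb_err = 0
        for key in list(index):          # snapshot: corrections may add keys
            hits = list(index[key])
            if len(hits) < min_votes:
                continue                 # too few reads to vote
            counts = Counter(reads[r][p] for r, p in hits)
            base, n = counts.most_common(1)[0]
            if 2 * n <= len(hits):
                continue                 # no strict majority
            bad = [(r, p) for r, p in hits if reads[r][p] != base]
            nb_err += len(bad)
            for r, p in bad:
                remove(r)
                reads[r] = reads[r][:p] + base + reads[r][p + 1:]
                add(r)
        if nb_err == 0:                  # stop when a full pass corrects nothing
            return reads


reads = {
    "k1": "AGGATGACCAGGATTAGGACCAGT",
    "k2": "GATGACCAGGATTAGGACAAGTTC",
    "k3": "GATGACCAGGATTAGGACCAGTTC",
    "k4": "ATGACCAGGATTAGGACCAGTTCA",
    "k5": "ACCAGGATTCGGACCAGTTCATTC",
    "k6": "ACCAGGATTAGGACCAGTTCATTC",
    "k7": "ACCAGGATTAGGACCAGTTCATTC",
    "k8": "CCAGGATTAGGACCAGTTCATTCA",
}
fixed = correct_reads(reads)
print(fixed["k2"], fixed["k5"])
```

Run on the eight example reads, the sketch changes only k2 and k5; the loop exits on the first pass in which `nb_err` stays at 0.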

  12. Correcting 2 errors
      Read with 2 errors (expected bases: T and C):
        TTGGACCTGTGAGACTTGAGCACAGATGGACCCA
      Iteration 1 will correct C:
        TTGGACCTGTGA G ACTT G AGCA C AGATGGACCCA
      Iteration 2 will correct A:
        TTGGACCTGTGAGACTTGAG C ATAG A TGGA C CCA
      The A becomes correctable only in iteration 2, once the neighbouring seed ATAG no longer contains the first error.

  13. Step 3: Read rejection
      • Principle:
        • Each double seed of the index is counted
        • A read is rejected if only one of its double seeds is rare
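
One possible reading of this step is sketched below. Two points are assumptions, since the slide does not define them: "rare" is taken to mean "seen fewer than `min_count` times across all reads", and a read is rejected as soon as a single one of its double seeds is rare. The function names and the threshold value are hypothetical.

```python
from collections import Counter

SEED = 4


def double_seeds(read):
    """List the (seedX, seedY) pairs of a read, one per middle position."""
    return [(read[p - SEED:p], read[p + 1:p + 1 + SEED])
            for p in range(SEED, len(read) - SEED)]


def reject_reads(reads, min_count=2):
    """Split reads into (kept, rejected) by double-seed frequency."""
    counts = Counter(ds for read in reads.values() for ds in double_seeds(read))
    kept, rejected = {}, {}
    for rid, read in reads.items():
        # Assumption: one rare double seed is enough to reject the read.
        if any(counts[ds] < min_count for ds in double_seeds(read)):
            rejected[rid] = read
        else:
            kept[rid] = read
    return kept, rejected


reads = {
    "a": "GATGACCAGGATTAGGACCAGTTC",
    "b": "GATGACCAGGATTAGGACCAGTTC",
    "c": "GATGACCAGGATTAGGACCAGTTC",
    "d": "GATGACCAGGATTAGGACAAGTTC",  # one uncorrected substitution
}
kept, rejected = reject_reads(reads)
print(sorted(kept), sorted(rejected))
```

In this toy input the substitution in read d puts it in double seeds that no other read shares, so d is rejected while a, b and c are kept.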

  14. Extrapolation
      Data: 1 000 257 reads of 40 bp generated from CP_000025 (1.8 Mbp) with MetaSim
      Substitutions: 216 520

  15. Extrapolation
      Data: 1 000 257 reads of 40 bp generated from CP_000025 (1.8 Mbp) with MetaSim
      Substitutions: 216 520

  16. Step 2: Parallelization
      Sequential version:
        do
          nb_err = 0
          for i = 0 to size(INDEX)
            LR = list of bad reads in INDEX[i]
            nb_err += len(LR)
            for k = 0 to len(LR)
              un-index LR[k]
              correct LR[k]
              index LR[k]
        until nb_err == 0
      Blocked version (S index entries per step):
        do
          nb_err = 0
          for i = 0 to size(INDEX) step S
            for j = 0 to S
              LR[j] = list of bad reads in INDEX[i+j]
              nb_err += len(LR[j])
            for j = 0 to S
              for k = 0 to len(LR[j])
                un-index LR[j][k]
                correct LR[j][k]
                index LR[j][k]
        until nb_err == 0

  17. Step 2: Parallelization
      Blocked version (S index entries per step):
        do
          nb_err = 0
          for i = 0 to size(INDEX) step S
            for j = 0 to S
              LR[j] = list of bad reads in INDEX[i+j]
              nb_err += len(LR[j])
            for j = 0 to S
              for k = 0 to len(LR[j])
                un-index LR[j][k]
                correct LR[j][k]
                index LR[j][k]
        until nb_err == 0
      Parallel version (the detection loop becomes a forall):
        do
          nb_err = 0
          for i = 0 to size(INDEX) step S
            forall j = 0 to S
              LR[j] = list of bad reads in INDEX[i+j]
              nb_err += len(LR[j])
            for j = 0 to S
              for k = 0 to len(LR[j])
                un-index LR[j][k]
                correct LR[j][k]
                index LR[j][k]
        until nb_err == 0
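
The structure of one pass of the forall version can be sketched with Python's `concurrent.futures`. Everything here is illustrative: the index is reduced to a list of entries holding (read id, position, observed base), `bad_reads` is a hypothetical majority-vote detector, and `apply_fix` is a hypothetical callback standing in for the un-index / correct / index sequence. Threads in CPython share the interpreter lock, so the sketch shows the shape of the computation, not a speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def bad_reads(entry):
    """Read-only detection on one index entry: hits that lose the vote."""
    if not entry:
        return []
    consensus, _ = Counter(base for _, _, base in entry).most_common(1)[0]
    return [(rid, pos, consensus) for rid, pos, base in entry if base != consensus]


def blocked_pass(index, apply_fix, S=4, workers=4):
    """One pass over the index, S entries at a time; returns nb_err."""
    nb_err = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(0, len(index), S):
            block = index[i:i + S]
            LR = list(pool.map(bad_reads, block))  # forall j: detection in parallel
            nb_err += sum(len(lr) for lr in LR)
            for lr in LR:                          # updates stay sequential
                for rid, pos, base in lr:
                    apply_fix(rid, pos, base)
    return nb_err


index = [
    [("k1", 15, "A"), ("k2", 13, "A"), ("k5", 9, "C")],
    [("k2", 18, "A"), ("k3", 18, "C"), ("k4", 17, "C")],
    [],
]
fixes = []
nb_err = blocked_pass(index, lambda rid, pos, base: fixes.append((rid, pos, base)), S=2)
print(nb_err, fixes)
```

Because detection for a whole block finishes before any update is applied, a list computed for one entry may refer to a read that an earlier fix in the same block has already changed; the sketch does not handle that case.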

  18. Read Correction: a very time-consuming process
      • Data
        • Salmonella enterica subsp. enterica serovar Typhimurium str. (5 Mbp)
        • 12 726 271 reads of 80 bp

  19. Challenges
      • Correction of billions of reads
        • Development of scalable algorithms
      • Parallelism needs to be extended
        • Multicore is not enough
        • Are GPUs good candidates?
          • Not sure, but this needs to be tested on new GPU architectures
      • Find parallel data structures
        • To decrease the memory footprint per processor
        • To break the computation into hundreds of tasks
