This overview discusses two innovative algorithms, SeqMap and GNUMAP, designed for mapping oligonucleotides from next-generation sequencing to the genome. SeqMap effectively handles numerous substitutions and insertions/deletions by employing a memory-intensive approach based on hashing and spaced-seed alignment. In contrast, GNUMAP offers a probabilistic method that considers base quality and addresses challenges posed by repeat regions, maximizing data utility and performance. Both algorithms aim to improve mapping accuracy while revealing their respective strengths and limitations in computational resources.
SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang et al. Bioinformatics (2008) 24: 2395-2396

The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
Nathan Clement et al. Bioinformatics (2010) 26: 38-45

Presented by: Xia Li
SeqMap
• Motivation
  • Hashing the genome usually needs a large amount of memory (e.g. SOAP needs 14GB of memory when mapping to the human genome)
  • Allow more substitutions and insertions/deletions
SeqMap
• Pigeonhole principle
• Spaced seed alignment
  • ELAND, SOAP, RMAP
• Hash reads
• Insertion/deletion: 2/4 combinations with 1/2 shifted one nucleotide to its left or right
[Figure labels: Short read; Split into 4 parts; All combinations of 2/4 parts; Short read look-up table (indexed by 2 parts); Reference genome. Image credit: J. Ruan]
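The pigeonhole idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not SeqMap's actual implementation: it assumes all reads have the same length, handles substitutions only (the shifted-part trick for insertions/deletions is left out), and the names `split_bounds`, `build_read_index` and `map_reads` are hypothetical. If a read has at most 2 mismatches and is split into 4 parts, at least 2 parts must match exactly, so the reads are hashed under every choice of 2 out of 4 parts and the genome is scanned against that table.

```python
from collections import defaultdict
from itertools import combinations


def split_bounds(read_len, parts=4):
    """Return (start, end) bounds that split a read into near-equal parts."""
    base, extra = divmod(read_len, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def build_read_index(reads, parts=4, keep=2):
    """Hash the reads (not the genome) under every choice of `keep` parts."""
    read_len = len(reads[0])
    # Assumption of this sketch: every read has the same length.
    assert all(len(r) == read_len for r in reads)
    bounds = split_bounds(read_len, parts)
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for combo in combinations(range(parts), keep):
            key = (combo, tuple(read[bounds[i][0]:bounds[i][1]] for i in combo))
            index[key].append(rid)
    return index, bounds


def map_reads(genome, reads, max_mismatches=2, parts=4, keep=2):
    """Scan the genome once and report (read_id, position, mismatches)."""
    index, bounds = build_read_index(reads, parts, keep)
    read_len = len(reads[0])
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo in combinations(range(parts), keep):
            key = (combo, tuple(window[bounds[i][0]:bounds[i][1]] for i in combo))
            for rid in index.get(key, ()):
                # Verify the candidate by counting substitutions directly.
                mismatches = sum(a != b for a, b in zip(reads[rid], window))
                if mismatches <= max_mismatches:
                    hits.add((rid, pos, mismatches))
    return sorted(hits)
```

For example, `map_reads("TTGACCAGTACGATCGATTAGCCATGGA", ["GCAGTACTATCG"])` includes the hit `(0, 4, 2)`: the read differs from the genome window at position 4 by two substitutions, but its second and fourth parts still match exactly.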
Experiment & Result
• Deals with more substitutions and insertions/deletions
Randomly generate a DNA sequence 1Mb long, then add 100Kb of random substitutions, N's and insertions/deletions
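A test sequence of this kind could be generated roughly as follows. This is only a sketch under assumptions: the slide does not say how the edits were split between substitutions, N's, insertions and deletions, so the equal mix used here is a guess, and `random_dna` and `mutate` are hypothetical helper names.

```python
import random

BASES = "ACGT"


def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, n_events, rng):
    """Apply n_events random edits to seq.

    Assumption: each edit is a substitution, an N, an insertion or a
    deletion with equal probability; the slide does not give the mix.
    """
    out = list(seq)
    for _ in range(n_events):
        pos = rng.randrange(len(out))
        kind = rng.choice(("sub", "n", "ins", "del"))
        if kind == "sub":
            # Pick a base that differs from the current one.
            out[pos] = rng.choice([b for b in BASES if b != out[pos]])
        elif kind == "n":
            out[pos] = "N"
        elif kind == "ins":
            out.insert(pos, rng.choice(BASES))
        elif len(out) > 1:
            del out[pos]
    return "".join(out)
```

A scaled-down run would be `rng = random.Random(0); ref = random_dna(1000, rng); mutated = mutate(ref, 100, rng)`, which keeps the 10:1 ratio of sequence length to edits from the slide. Reads drawn from `mutated` can then be mapped back to `ref`.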
GNUMAP
• Motivation
  • Base uncertainty
    • Such as nearly equal or low probabilities for A, C, G or T
    • Filter low-quality reads [RMAP] -> discards up to half of the reads (Harismendy et al., 2009)
  • Repeated regions in the genome
    • Discard them -> loss of up to half of the data (Harismendy et al., 2009)
    • Record one -> unequal mapping to some of the repeat regions
    • Record all -> each location having 3 times the correct score
GNUMAP • Flow-chart
Alignment Score
Read from sequencer: GGGTACAACCATTAC
Read is added to both repeat regions proportionally to their match quality, weighted by its # of occurrences in the genome
[Figure: AACCAT, GGGTAC and AACCAT shown against the longer sequence ACTGAACCATACGGGTACTGAACCATGAA]
Slide credit: N. Clement
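The proportional assignment described on this slide can be illustrated with a small sketch. It is a simplification, not GNUMAP's scoring: the per-location match scores are taken as given inputs rather than computed from base qualities, and `find_occurrences` and `share_read` are hypothetical names.

```python
def find_occurrences(genome, fragment):
    """Return every start position where fragment occurs exactly in genome."""
    last_start = len(genome) - len(fragment)
    return [i for i in range(last_start + 1) if genome.startswith(fragment, i)]


def share_read(location_scores):
    """Split one read across its candidate locations.

    Each location receives a fraction of the read proportional to its
    match score, so the fractions sum to 1 instead of every repeat copy
    being credited with a full read.
    """
    total = sum(location_scores.values())
    if total == 0:
        return {}
    return {loc: score / total for loc, score in location_scores.items()}
```

With the sequences from the slide, `find_occurrences("ACTGAACCATACGGGTACTGAACCATGAA", "AACCAT")` returns `[4, 20]`. Two equally good matches give `share_read({4: 1.0, 20: 1.0}) == {4: 0.5, 20: 0.5}`, while a better match at the first copy, such as `share_read({4: 3.0, 20: 1.0})`, gives `{4: 0.75, 20: 0.25}`.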
Comments
• SeqMap
  • Pros: deals with more substitutions/insertions/deletions
  • Cons: memory consuming, not fast
• GNUMAP
  • Pros: considers base quality and repeated regions -> generates more useful information and achieves the best performance (~15% increase)
  • Cons: memory consuming, slow, more noise