Short Read Mapper

Short Read Mapper Brian S. Lam CS124

Outline • Biological Motivation • Computer Science Problem • Trivial Solution • Hash Index Solution • Future Direction

Biological Motivation • Goal: read the DNA sequence of an individual • 2 types of methods • Full Genome Sequencing (FGS): reads the entire DNA sequence at once • Shotgun sequencing: divides DNA into many short reads, and then a computer program reassembles them

Biological Motivation • Shotgun sequencing sounds more complicated, why use it? • Faster • Cheaper • However, there are downsides: • We have to reassemble the short reads • We must have a reference genome which is similar to the one we’re sequencing

Biological Motivation • Q: How do we reassemble the short reads? • They are randomly ordered • They will not exactly match the reference genome • Basically like doing a puzzle, but sometimes the pieces don’t fit

Biological Motivation • Q: How do we reassemble the short reads? • A: Re-sequencing • Assume that the difference between the reference genome and our reads is very small • Find the “best fit” position for each short read • Complications: • Mutations (e.g. SNPs) • Read errors • Insertions, deletions, repeated regions

Computer Science Problem • We can ignore the biology, and this becomes a substring mapping problem • Allow a certain number of mismatches to account for SNPs • Ignore other complications such as read errors, insertions, deletions, repeated regions, etc. • This is for simplicity

Computer Science Problem • Problem Layout • Reference genome: … T C A G A A G A … • Short read of length L • Allow up to D mismatches per short read

Computer Science Problem • Assumptions • There are at most D mutations in any substring of length L • Any 2 substrings of length L in our sequence differ by at least 2D positions • What this means: • All short reads will map to exactly ONE position

Trivial Solution Algorithm • For each short read, slide across the reference genome until we find a position with < D mismatches • Easy to explain, easy to code

Trivial Solution Example: Let L = 4, D = 2 • Reference: T C A G A A G A • Short Read: A T A A • 3 mismatches, which is not < D, so keep sliding

Trivial Solution Example: Let L = 4, D = 2 • Reference: T C A G A A G A • Short Read: A T A A • Aligned against A G A A: 1 mismatch (SNP) • 1 < D, so this is the correct position, and the second base in the short read is a SNP
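The sliding comparison can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names `count_mismatches` and `map_read_trivial` are made up here, and the strict `< D` test follows the wording of the Trivial Solution slide. It uses the slide's reference T C A G A A G A and short read A T A A.

```python
def count_mismatches(read, window):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(read, window) if a != b)


def map_read_trivial(reference, read, d):
    """Return the first 0-based position with fewer than d mismatches, or None."""
    length = len(read)
    # Slide the read across every possible start position in the reference.
    for pos in range(len(reference) - length + 1):
        if count_mismatches(read, reference[pos:pos + length]) < d:
            return pos
    return None


reference = "TCAGAAGA"  # reference from the slide
read = "ATAA"           # short read from the slide, L = 4
print(map_read_trivial(reference, read, d=2))  # 2, i.e. aligned against AGAA
```

Positions here are 0-based, so the result 2 is the third window of the reference.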

Trivial Solution However, simplicity has its cost… • Every short read is compared against every position in the reference genome • This is way too slow!

Hash Index Solution • Idea: If we allow D mismatches, and we break the short read into D+1 pieces, then there is at least one piece that will match perfectly
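The pigeonhole idea can be checked on a small case. The sketch below is an illustration under assumptions: the helper names and the two 12-base strings are invented for the demonstration, and the last piece simply absorbs any leftover bases when L is not divisible by D+1.

```python
def split_into_pieces(sequence, d):
    """Split a sequence into d + 1 consecutive pieces; the last takes any remainder."""
    n = d + 1
    size = len(sequence) // n
    pieces = [sequence[i * size:(i + 1) * size] for i in range(n - 1)]
    pieces.append(sequence[(n - 1) * size:])
    return pieces


def exact_piece_matches(read, window, d):
    """Indices of the pieces that are identical in the read and the reference window."""
    read_pieces = split_into_pieces(read, d)
    window_pieces = split_into_pieces(window, d)
    return [i for i, (a, b) in enumerate(zip(read_pieces, window_pieces)) if a == b]


# Hypothetical data: the read differs from the window at 2 positions (D = 2),
# one in the first piece and one in the last piece.
window = "TCGAAACTGAGT"
read = "TCGTAACTGAGA"
print(exact_piece_matches(read, window, d=2))  # [1]
```

With D = 2 mismatches spread over D+1 = 3 pieces, at most two pieces can be damaged, so the middle piece AACT survives intact.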

Hash Index Solution Algorithm • Store the index of each substring of length L/(D+1) in a hash index • Break short reads into pieces of length L/(D+1), and look up possible matching indices in hash index • Use trivial algorithm to check whether the short read actually matches this position • Harder to explain, harder to code

Hash Index Solution • Hashing Function • We want every substring to map to a unique key • There are four bases: A, C, G, and T • If we interpret the string as a base-4 number, we get a unique mapping • Let A = 0, C = 1, G = 2, T = 3

Hash Index Solution Hashing Function Example • TGCA → $3210_4$ → $3 \times 4^3 + 2 \times 4^2 + 1 \times 4^1 + 0 \times 4^0 = 228$ • This is our key into the hash index
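The base-4 interpretation translates directly into code. This is a small sketch using the slide's mapping A = 0, C = 1, G = 2, T = 3; the names `BASE_VALUE` and `kmer_key` are chosen here for illustration.

```python
# Digit assigned to each base, as given on the Hashing Function slide.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_key(kmer):
    """Interpret a DNA string as a base-4 number and return its integer value."""
    key = 0
    for base in kmer:
        # Shift the digits seen so far one place left, then add the new digit.
        key = key * 4 + BASE_VALUE[base]
    return key


print(kmer_key("TGCA"))  # 228
```

Because every string of a fixed length maps to a distinct number, two different substrings of the key length never share a key.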

Hash Index Solution Step 1) Populating the Hash Index • Calculate the key length based on the short read length (L) and number of allowed mismatches (D) • Add index of every substring of key length in the reference genome to the hash index

Hash Index Solution Hash Index Example • Assume key length is 4, and reference genome starts with TGCA • From the example, key(TGCA) = 228 • [Figure: hash index with slots 0, 1, …, 228, 229, …; slot 228 points to a list node (index, next) holding index 0]
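Populating the index might look like the following sketch. It is an assumption-laden illustration: a Python dictionary of lists stands in for the linked (index, next) nodes in the figure, the function names are invented, and the reference string `TGCATGCA` is hypothetical (chosen only because it starts with TGCA).

```python
from collections import defaultdict

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_key(kmer):
    """Base-4 key from the slides (A=0, C=1, G=2, T=3)."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_VALUE[base]
    return key


def build_hash_index(reference, key_length):
    """Map each key to the list of reference positions where that substring starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_length + 1):
        index[kmer_key(reference[pos:pos + key_length])].append(pos)
    return index


# Hypothetical reference that starts with TGCA, as in the slide example.
index = build_hash_index("TGCATGCA", key_length=4)
print(index[228])  # [0, 4]
```

Slot 228 holds position 0, matching the figure, plus position 4 because this made-up reference repeats TGCA.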

Hash Index Solution Step 2) Break short reads into pieces and look up possible matching indices in hash index • Example Short Read: TCGAAACTGAGT • Pieces: TCGA, AACT, GAGT • Look these key values up in the hash index
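A possible shape for the lookup step is sketched below. The slides do not say how a hit on a piece is turned into a position for the whole read; subtracting the piece's offset within the read is an assumption made here. The reference `GGTCGAAACTGAGTCC` is hypothetical, while the read TCGAAACTGAGT and D = 2 (three pieces of length 4) come from the slide.

```python
from collections import defaultdict

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_key(kmer):
    """Base-4 key from the slides (A=0, C=1, G=2, T=3)."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_VALUE[base]
    return key


def build_hash_index(reference, key_length):
    """Map each key to the reference positions where that substring starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_length + 1):
        index[kmer_key(reference[pos:pos + key_length])].append(pos)
    return index


def candidate_positions(index, reference_length, read, d):
    """Read start positions suggested by an exact hit of any of the d + 1 pieces."""
    key_length = len(read) // (d + 1)
    candidates = set()
    for piece_no in range(d + 1):
        offset = piece_no * key_length
        piece = read[offset:offset + key_length]
        for hit in index.get(kmer_key(piece), []):
            start = hit - offset  # where the whole read would have to begin
            if 0 <= start <= reference_length - len(read):
                candidates.add(start)
    return candidates


reference = "GGTCGAAACTGAGTCC"  # hypothetical reference containing the slide's read
index = build_hash_index(reference, key_length=4)
print(sorted(candidate_positions(index, len(reference), "TCGAAACTGAGT", d=2)))  # [2]
```

All three pieces hit the index here, and each hit points back to the same start position, so the candidate set has a single entry.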

Hash Index Solution Step 3) Use the trivial algorithm to check these possible matching positions against the short read
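Putting the three steps together gives a sketch like the one below. It is one possible arrangement, not the presenter's implementation: all function names and the reference string are hypothetical, and the verification keeps the strict `< D` test from the Trivial Solution slide.

```python
from collections import defaultdict

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_key(kmer):
    """Base-4 key from the slides (A=0, C=1, G=2, T=3)."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_VALUE[base]
    return key


def build_hash_index(reference, key_length):
    """Step 1: record where every substring of key_length starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_length + 1):
        index[kmer_key(reference[pos:pos + key_length])].append(pos)
    return index


def count_mismatches(read, window):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(read, window) if a != b)


def map_read(reference, index, read, d):
    """Steps 2 and 3: look up each piece, then verify candidates base by base."""
    key_length = len(read) // (d + 1)
    candidates = set()
    for piece_no in range(d + 1):
        offset = piece_no * key_length
        piece = read[offset:offset + key_length]
        for hit in index.get(kmer_key(piece), []):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Only the few candidate positions are checked, not the whole reference.
    for start in sorted(candidates):
        if count_mismatches(read, reference[start:start + len(read)]) < d:
            return start
    return None


reference = "GGTCGAAACTGAGTCC"          # hypothetical reference
index = build_hash_index(reference, 4)  # key length L / (D + 1) = 12 / 3
print(map_read(reference, index, "TCGTAACTGAGT", d=2))  # 2
```

The read in this example carries one mismatch in its first piece, so that piece finds nothing in the index, while the other two pieces both lead to start position 2, which then passes verification.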

Hash Index Solution Much better performance!

Future Direction • Efficiency • Although the hash index algorithm is faster, it uses a lot of memory • Robustness • I ignored insertions, deletions, and repeated regions • These are all real complications that must be dealt with to get accurate results

Questions?
