Short Read Mapper

Brian S. Lam

CS124


Outline

  • Biological Motivation

  • Computer Science Problem

  • Trivial Solution

  • Hash Index Solution

  • Future Direction


Biological Motivation

  • Goal: read the DNA sequence of an individual

  • 2 types of methods

    • Full Genome Sequencing (FGS): reads the entire DNA sequence at once

    • Shotgun sequencing: divides DNA into many short reads, and then a computer program reassembles them


Biological Motivation

  • Shotgun sequencing sounds more complicated, so why use it?

    • Faster

    • Cheaper

  • However, there are downsides:

    • We have to reassemble the short reads

    • We must have a reference genome which is similar to the one we’re sequencing


Biological Motivation

  • Q: How do we reassemble the short reads?

    • They are randomly ordered

    • They will not exactly match the reference genome

  • Basically like doing a puzzle, but sometimes the pieces don’t fit


Biological Motivation

  • Q: How do we reassemble the short reads?

  • A: Re-sequencing

    • Assume that the difference between the reference genome and our reads is very small

    • Find the “best fit” position for each short read

  • Complications:

    • Mutations (e.g. SNPs)

    • Read errors

    • Insertions, deletions, repeated regions


Computer Science Problem

  • If we ignore the biology, this becomes a substring mapping problem

  • Allow a certain number of mismatches to account for SNPs

    • Ignore other complications such as read errors, insertions, deletions, repeated regions, etc.

    • This is for simplicity


Computer Science Problem

  • Problem Layout

[Figure: the bases T C A G A A G A, labeled "Short read length L"]

  • Allow up to D mismatches per short read


Computer Science Problem

  • Assumptions

    • There are at most D mutations in any substring of length L

    • Any 2 substrings of length L in our sequence differ in at least 2D + 1 positions

  • What this means:

    • All short reads will map to exactly ONE position


Trivial Solution

Algorithm

For each short read, slide across the reference genome until we find a position with at most D mismatches

  • Easy to explain, easy to code
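
The sliding comparison can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names are hypothetical, and it assumes "up to D mismatches" means at most D. The reference TCAGAAGA and short read ATAA are the letters that appear on the example slides, with L = 4 and D = 2.

```python
def count_mismatches(reference, read, start):
    """Count positions where read differs from reference[start:start+len(read)]."""
    window = reference[start:start + len(read)]
    return sum(1 for ref_base, read_base in zip(window, read) if ref_base != read_base)


def map_read_trivial(reference, read, max_mismatches):
    """Slide the read across the reference; return the first position with
    at most max_mismatches mismatches, or None if no position qualifies."""
    for start in range(len(reference) - len(read) + 1):
        if count_mismatches(reference, read, start) <= max_mismatches:
            return start
    return None


# Assumed example values: L = 4, D = 2
print(map_read_trivial("TCAGAAGA", "ATAA", 2))  # prints 2
```

Positions 0 and 1 each give 3 mismatches, so the read is placed at position 2, where only its second base differs.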


Trivial Solution

Example: Let L = 4, D = 2

Reference:  TCAGAAGA
Short Read: ATAA

3 mismatches


Trivial Solution

Example: Let L = 4, D = 2

Reference:  TCAGAAGA
Short Read:  ATAA

3 mismatches


Trivial Solution

Example: Let L = 4, D = 2

Reference:  TCAGAAGA
Short Read:   ATAA

1 mismatch

1 ≤ D, so this is the correct position, and the second base in the short read is a SNP


Trivial Solution

However, simplicity has its cost…

This is way too slow! Every short read is compared against every position in the reference genome.


Hash Index Solution

  • Idea: If we allow D mismatches, and we break the short read into D+1 pieces, then there is at least one piece that will match perfectly
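
A small Python sketch can make this pigeonhole argument concrete. It is illustrative only: the function names are hypothetical, the 12-base window is borrowed from the short read used later in the slides, and the mutated read (two substituted bases) is made up for the demonstration.

```python
def split_into_pieces(read, num_pieces):
    """Split read into num_pieces equal pieces, returned as (offset, piece) pairs.
    Any leftover bases at the end are ignored in this sketch."""
    piece_len = len(read) // num_pieces
    return [(i * piece_len, read[i * piece_len:(i + 1) * piece_len])
            for i in range(num_pieces)]


def exactly_matching_offsets(window, read, max_mismatches):
    """Offsets of the D+1 pieces of read that match window with no mismatch."""
    pieces = split_into_pieces(read, max_mismatches + 1)
    return [offset for offset, piece in pieces
            if window[offset:offset + len(piece)] == piece]


window = "TCGAAACTGAGT"  # stretch of reference the read came from
read = "TCGTAACTGAGA"    # hypothetical read with D = 2 substitutions
print(exactly_matching_offsets(window, read, 2))  # prints [4]
```

With D = 2 mismatches spread over D + 1 = 3 pieces, at least one piece must be untouched; here it is the middle piece, AACT.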


Hash Index Solution

Algorithm

  • Store the index of each substring of length L/(D+1) in a hash index

  • Break short reads into pieces of length L/(D+1), and look up possible matching indices in hash index

  • Use trivial algorithm to check whether the short read actually matches this position

  • Harder to explain, harder to code


Hash Index Solution

  • Hashing Function

    • We want every substring to map to a unique key

    • There are four bases: A, C, G, and T

    • If we interpret the string as a base-4 number, we get a unique mapping

    • Let A = 0, C = 1, G = 2, T = 3
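
As a rough sketch, the base-4 interpretation could be written like this in Python. The names `BASE_VALUE` and `key` are hypothetical; only the digit assignment A = 0, C = 1, G = 2, T = 3 comes from the slide.

```python
# Digit assignment from the slide: A = 0, C = 1, G = 2, T = 3
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key(substring):
    """Interpret a DNA substring as a base-4 number, most significant base first."""
    value = 0
    for base in substring:
        value = value * 4 + BASE_VALUE[base]
    return value


print(key("TGCA"))  # prints 228
```

For substrings of one fixed length, distinct strings always give distinct numbers, which is the uniqueness property the slide asks for.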


Hash Index Solution

Hashing Function Example

TGCA → 3210₄ → 3 × 4³ + 2 × 4² + 1 × 4¹ + 0 × 4⁰ = 228


Hash Index Solution

Hashing Function Example

TGCA → 3210₄ → 3 × 4³ + 2 × 4² + 1 × 4¹ + 0 × 4⁰ = 228

This is our key into the hash index


Hash Index Solution

Step 1) Populating the Hash Index

  • Calculate the key length, L/(D+1), from the short read length (L) and number of allowed mismatches (D)

  • Add index of every substring of key length in the reference genome to the hash index
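
One possible way to populate such an index is sketched below. It is an assumption-laden illustration: it uses a Python dictionary of lists in place of whatever table and chaining scheme the presenter used, the function names are hypothetical, and the reference string TGCATGCA is made up so that the key 228 occurs twice.

```python
from collections import defaultdict

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key(substring):
    """Base-4 value of a DNA substring (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in substring:
        value = value * 4 + BASE_VALUE[base]
    return value


def build_index(reference, key_length):
    """Map each key to the list of reference positions where that substring starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_length + 1):
        index[key(reference[pos:pos + key_length])].append(pos)
    return index


# Assumed values: L = 12, D = 2, so key length = 12 // (2 + 1) = 4
index = build_index("TGCATGCA", 4)
print(index[228])  # prints [0, 4]
```

Each list holds every position sharing a key, so repeated substrings in the reference simply produce longer lists.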


Hash Index Solution

Hash Index Example

  • Assume key length is 4, and reference genome starts with TGCA

  • From the example, key(TGCA) = 228

[Figure: hash index drawn as a table with slots labeled 0, 1, 228, 229]


Hash Index Solution

Hash Index Example

  • Assume key length is 4, and reference genome starts with TGCA

  • From the example, key(TGCA) = 228

[Figure: hash index slots labeled 0, 1, 228, 229; slot 228 holds a list node with fields "index" (value 0) and "next"]


Hash Index Solution

Step 2) Break short reads into pieces and look up possible matching indices in hash index


Hash Index Solution

Step 2) Break short reads into pieces and look up possible matching indices in hash index

Example

Short Read: TCGAAACTGAGT

TCGA

AACT

GAGT


Hash Index Solution

Step 2) Break short reads into pieces and look up possible matching indices in hash index

Example

Short Read: TCGAAACTGAGT

TCGA

AACT

GAGT

Look these key values up in the hash index
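
For this example read, the piece keys can be computed with a short Python sketch. The helper names are hypothetical, and the digit assignment A = 0, C = 1, G = 2, T = 3 is the one given earlier in the slides.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key(substring):
    """Base-4 value of a DNA substring (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in substring:
        value = value * 4 + BASE_VALUE[base]
    return value


def piece_keys(read, key_length):
    """Return (offset, key) for each non-overlapping piece of the read."""
    return [(offset, key(read[offset:offset + key_length]))
            for offset in range(0, len(read) - key_length + 1, key_length)]


print(piece_keys("TCGAAACTGAGT", 4))  # prints [(0, 216), (4, 7), (8, 139)]
```

The offset is kept with each key because a hit for a piece at offset k implies the whole read would start k bases earlier in the reference.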


Hash Index Solution

Step 3) Use the trivial algorithm to check these possible matching positions against the short read
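
Putting the three steps together, a minimal end-to-end sketch might look like the following. Everything here is illustrative rather than the presenter's implementation: the function names are hypothetical, a dictionary of lists stands in for the hash index, "up to D mismatches" is taken to mean at most D, and the reference and mutated read are invented for the demonstration.

```python
from collections import defaultdict

BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def key(substring):
    """Base-4 value of a DNA substring (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in substring:
        value = value * 4 + BASE_VALUE[base]
    return value


def build_index(reference, key_length):
    """Step 1: record the start position of every substring of key_length."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_length + 1):
        index[key(reference[pos:pos + key_length])].append(pos)
    return index


def map_read(reference, index, read, max_mismatches):
    """Steps 2 and 3: look up each piece, then verify each candidate position."""
    key_length = len(read) // (max_mismatches + 1)
    for offset in range(0, key_length * (max_mismatches + 1), key_length):
        piece = read[offset:offset + key_length]
        for hit in index.get(key(piece), []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:
                return start
    return None


reference = "GGTCGAAACTGAGTCC"  # hypothetical reference
read = "TCGTAACTGAGA"           # hypothetical read, 2 substitutions
index = build_index(reference, len(read) // 3)
print(map_read(reference, index, read, 2))  # prints 2
```

Only positions suggested by an exactly matching piece are verified, which is where the saving over sliding across the whole reference comes from.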


Hash Index Solution

Much better performance!


Future Direction

  • Efficiency

    • Although the hash index algorithm is faster, it uses a lot of memory

  • Robustness

    • I ignored insertions, deletions, and repeated regions

    • These are all real complications that must be dealt with to get accurate results


Questions?

