Information Theory of DNA Sequencing

Information Theory of DNA Sequencing. David Tse, Dept. of EECS, U.C. Berkeley. ITA 2012, Feb. 10. Research supported by NSF Center for Science of Information. Abolfazl Motahari. Guy Bresler.

Presentation Transcript

Information Theory of DNA Sequencing

David Tse

Dept. of EECS

U.C. Berkeley

ITA 2012

Feb. 10

Research supported by NSF Center for Science of Information.

Abolfazl Motahari

Guy Bresler


DNA sequencing

DNA: the blueprint of life

Problem: to obtain the sequence of nucleotides.

…ACGTGACTGAGGACCGTG

CGACTGAGACTGACTGGGT

CTAGCTAGACTACGTTTTA

TATATATATACGTCGTCGT

ACTGATGACTAGATTACAG

ACTGATTTAGATACCTGAC

TGATTTTAAAAAAATATT…

Impetus: Human Genome Project

1990: Start

2001: Draft

2003: Finished

3 billion basepairs

$3 billion

Sequencing Gets Cheaper and Faster

Cost of one human genome

  • HGP: $3 billion
  • 2004: $30,000,000
  • 2008: $100,000
  • 2010: $10,000
  • 2011: $4,000
  • 2012-13: $1,000
  • ???: $300

Time to sequence one genome: years/months → hours/days

Massive parallelization.

But many genomes to sequence

100 million species

(e.g. phylogeny)

7 billion individuals

(SNP, personal genomics)

10^13 cells in a human

(e.g. somatic mutations such as HIV, cancer)

Whole Genome Shotgun Sequencing

Reads are assembled to reconstruct the original DNA sequence.
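
As a rough illustration of assembly, the sketch below repeatedly merges the two fragments with the largest suffix-prefix overlap. The names `overlap` and `greedy_assemble` and the four example reads are illustrative assumptions, not taken from the slides. The sketch assumes error-free reads from a single strand and does not handle reads contained inside other reads.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> list[str]:
    """Merge fragments greedily by largest overlap; return the remaining contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlapping pair is left, so stop with several contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads of length 8 taken from AGCTTATAGGTCCGCATTACC.
reads = ["AGCTTATA", "TATAGGTC", "GGTCCGCA", "GCATTACC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads every true overlap is longer than any spurious one, so the merge order is unambiguous. Repeats in the sequence are what break this property in practice.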

Computation versus Information View
  • Many proposed assembly algorithms for many sequencing technologies.
  • But what is the minimum number of reads required for reliable reconstruction?
  • How much intrinsic information does each read provide about the DNA sequence?
  • This depends on the sequencing technology but not on the assembly algorithm.
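
One necessary condition for reconstruction, independent of any assembly algorithm, is that the reads cover the whole sequence. The sketch below estimates how many reads that takes. It assumes reads start uniformly at random on a circular sequence, so each position is missed by a given read with probability 1 - L/G. The function names and the tolerance `epsilon` are assumptions made for illustration, and coverage alone is not sufficient for reconstruction.

```python
import math


def expected_uncovered_positions(genome_length: int, read_length: int,
                                 num_reads: int) -> float:
    """Expected number of positions that no read covers (circular model)."""
    miss_probability = (1.0 - read_length / genome_length) ** num_reads
    return genome_length * miss_probability


def reads_for_coverage(genome_length: int, read_length: int,
                       epsilon: float = 0.01) -> int:
    """Smallest N whose expected number of uncovered positions is at most epsilon."""
    # Solve G * (1 - L/G)**N <= epsilon for N; log1p keeps precision when L/G is tiny.
    log_miss = math.log1p(-read_length / genome_length)
    return math.ceil(math.log(epsilon / genome_length) / log_miss)


n_small = reads_for_coverage(genome_length=10, read_length=5, epsilon=0.01)
n_human = reads_for_coverage(genome_length=3_000_000_000, read_length=100)
print(n_small, n_human)
```

For large G this is close to (G/L) ln(G/epsilon), so the number of reads needed for coverage grows like G log G rather than G.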
Communication and Sequencing: An Analogy

[Figure: block diagrams for communication and for sequencing; the only surviving label is "source sequence".]

Question: what is the maximum sequencing rate such that reliable reconstruction is possible?

The read channel
  • Capacity depends on
    • read length: L
    • DNA length: G
  • Normalized read length:
  • E.g. L = 100, G = 3 × 10^9:

[Figure: the read channel takes the sequence AGCTTATAGGTCCGCATTACC as input and outputs a read such as AGGTCC.]
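
The formula for the normalized read length did not survive in this transcript. The sketch below assumes the normalization L / ln G, which is the one used in the authors' published paper on shotgun sequencing. Treat it as an assumption, not as a reading of the slide. The function name is hypothetical.

```python
import math


def normalized_read_length(read_length: int, genome_length: int) -> float:
    """Read length divided by the natural log of the DNA length (assumed form)."""
    return read_length / math.log(genome_length)


# The slide's example values: L = 100, G = 3 x 10^9.
l_bar = normalized_read_length(100, 3_000_000_000)
print(round(l_bar, 2))
```

Under this assumption the example gives a normalized read length of roughly 4.6.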

Result: Sequencing Capacity

[Figure: sequencing capacity plot. Surviving labels: "no coverage" (Lander-Waterman 88), "duplication" (Arratia et al 96), "greedy algorithm", and the axis label L.]

H2(p) is the (Rényi) entropy rate of the DNA sequence.

The higher the entropy, the easier the problem!
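
To make the entropy statement concrete, the sketch below computes the order-2 Rényi entropy, H2(p) = -ln(sum of p_i^2), for two base distributions. It assumes the DNA sequence is modeled as i.i.d. with base distribution p. The function `critical_normalized_read_length` uses the threshold 2 / H2(p) from the authors' published paper, not from the slide text, so it is an assumption here. Both function names are hypothetical.

```python
import math


def renyi_entropy_2(p: list[float]) -> float:
    """Order-2 Renyi entropy in nats: -ln(sum_i p_i^2)."""
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -math.log(sum(x * x for x in p))


def critical_normalized_read_length(p: list[float]) -> float:
    """Assumed threshold 2 / H2(p); shorter normalized reads hit duplication."""
    return 2.0 / renyi_entropy_2(p)


uniform = [0.25, 0.25, 0.25, 0.25]   # highest entropy over A, C, G, T
skewed = [0.7, 0.1, 0.1, 0.1]        # lower entropy, more repetitive sequences

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    print(name, renyi_entropy_2(p), critical_normalized_read_length(p))
```

The uniform distribution has the larger H2 and therefore the smaller assumed threshold, which matches the slide's point that higher entropy makes the problem easier.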

Complexity is in the eyes of the beholder

Low entropy: easier to communicate, harder jigsaw puzzle

High entropy: harder to communicate, easier jigsaw puzzle
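
One way to see why low entropy makes the jigsaw harder: repetitive sequences contain long repeated substrings, and reads that do not extend past a repeat cannot tell its copies apart. The sketch below finds the length of the longest substring that occurs at least twice. The function name `longest_repeat` and the example strings are illustrative assumptions, and the quadratic search is only suitable for short sequences.

```python
def longest_repeat(seq: str) -> int:
    """Length of the longest substring occurring at least twice (overlaps allowed)."""
    longest = 0
    # If a substring of length k repeats, so does one of length k - 1,
    # so the search can stop at the first length with no repeat.
    for k in range(1, len(seq)):
        seen = set()
        found = False
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                found = True
                break
            seen.add(kmer)
        if not found:
            break
        longest = k
    return longest


print(longest_repeat("ATATATAT"))               # highly repetitive
print(longest_repeat("AGCTTATAGGTCCGCATTACC"))  # more varied
```

In the repetitive string the longest repeat spans most of the sequence, while in the varied one it is only a few bases long, so much shorter reads are enough to place every piece.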

Conclusion
  • DNA sequencing is an important problem.
  • Many new technologies and new applications.
  • An analogy between sequencing and communication is drawn.
  • A notion of sequencing capacity is formulated.
  • A principled design framework?