
# Sequencing and Assembly

Sequencing and Assembly, continued. Steps to assemble a genome. Some terminology: a read is a 500-900 long word that comes out of the sequencer; a mate pair is a pair of reads from two ends of the same insert fragment; a contig is a contiguous sequence formed by several overlapping reads.


## PowerPoint Slideshow about 'Sequencing and Assembly' - fausto


### Sequencing and Assembly

Cont’d

Steps to Assemble a Genome

Some Terminology

• read: a word 500-900 letters long that comes out of the sequencer
• mate pair: a pair of reads from two ends of the same insert fragment
• contig: a contiguous sequence formed by several overlapping reads, with no gaps
• supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
• consensus sequence: the sequence derived from the multiple alignment of reads in a contig

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence

..ACGATTACAATAGGTT..

• Overlap graph:
• Nodes: reads r1 … rn
• Edges: overlaps (ri, rj, shift, orientation, score)
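
To make the edge tuple concrete, here is a minimal Python sketch that finds exact suffix-prefix overlaps between reads. It rests on simplifying assumptions that are not in the slides: forward orientation only, exact matches (a real overlapper must tolerate sequencing errors), and a score equal to the overlap length. The names `suffix_prefix_overlap`, `overlap_edges`, and `min_len` are hypothetical, and the three reads are cut by hand from the example sequence ACGATTACAATAGGTT.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` letters exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_edges(reads, min_len=3):
    """Build edges (i, j, shift, orientation, score) over all ordered read pairs."""
    edges = []
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            shift = len(reads[i]) - length  # offset of read j relative to read i
            # Assumption: only the forward orientation ("+") is considered here.
            edges.append((i, j, shift, "+", length))
    return edges


reads = ["ACGATTACA", "TTACAATAG", "AATAGGTT"]
edges = overlap_edges(reads)
print(edges)  # [(0, 1, 4, '+', 5), (1, 2, 4, '+', 5)]
```

Comparing all pairs is quadratic in the number of reads, so this only illustrates the data structure, not a practical overlap stage.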

Reads that come from two regions of the genome (blue and red) that contain the same repeat

Note: of course, we don’t know the “color” of these nodes

repeat region

We want to merge reads up to potential repeat boundaries

Unique Contig

Overcollapsed Contig

• Remove transitively inferable overlaps
• If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
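
The rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure of any particular assembler: the overlap graph is assumed to be given as a dictionary from each read to the set of reads it overlaps on its right, and the function name `remove_transitive_overlaps` is hypothetical.

```python
def remove_transitive_overlaps(right_overlaps):
    """Drop every edge (r, r2) that is implied by (r, r1) and (r1, r2).

    `right_overlaps` maps each read to the set of reads it overlaps on its right.
    Inference is checked against the original graph, so the result does not
    depend on the order in which edges are visited.
    """
    reduced = {r: set(targets) for r, targets in right_overlaps.items()}
    for r, targets in right_overlaps.items():
        for r1 in targets:
            for r2 in right_overlaps.get(r1, ()):
                if r2 in targets:
                    reduced[r].discard(r2)
    return reduced


overlaps = {"r": {"r1", "r2"}, "r1": {"r2"}, "r2": set()}
reduced = remove_transitive_overlaps(overlaps)
print(reduced["r"])  # {'r1'}
```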

[Figure: overlapping reads r, r1, r2, r3, with a branch labeled “repeat boundary???” and “sequencing error”]

• Ignore “hanging” reads, when detecting repeat boundaries


Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
• Read that spans across a repeat disambiguates order of flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK
• We throw out overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
• Decrease sequencing error rate

Role of error correction:

Discards up to 98% of single-letter sequencing errors

decreases error rate ⇒ decreases effective repeat content ⇒ increases contig length

Normal density

Too dense ⇒ Overcollapsed

Overcollapsed?

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 forward-reverse links

supercontig

(aka scaffold)
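
A rough Python sketch of the linking rule follows. It assumes each mate pair has already been reduced to the pair of contigs its two reads landed in, and it only groups contigs; ordering, orientation, and the forward-reverse check on each link are left out. The function name `link_contigs` and the input format are hypothetical.

```python
from collections import Counter


def link_contigs(mate_pair_contigs, min_links=2):
    """Group contigs that are joined by at least `min_links` mate-pair links.

    `mate_pair_contigs` holds one (contig_a, contig_b) tuple per mate pair,
    naming the contig in which each read of the pair was placed.
    """
    # Count links between distinct contigs, ignoring the direction of the pair.
    links = Counter(
        tuple(sorted(pair)) for pair in mate_pair_contigs if pair[0] != pair[1]
    )
    parent = {}

    def find(c):
        parent.setdefault(c, c)
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in mate_pair_contigs:
        find(a)
        find(b)  # register every contig, even if it is never linked
    for (a, b), count in links.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(group) for group in groups.values())


pairs = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c4", "c4")]
supercontigs = link_contigs(pairs)
print(supercontigs)  # [['c1', 'c2'], ['c3'], ['c4']]
```

Requiring two links rather than one guards against a single chimeric or misplaced mate pair joining unrelated contigs.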

• Fill gaps in supercontigs with paths of repeat contigs
• Complex algorithmic step
• Exponential number of paths
4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
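
Weighted voting over the columns of the alignment can be sketched as below. The rows are copied from the alignment shown on this slide, with a space standing for a gap. The weights are an assumption: the slides do not say how they are chosen, so equal weights are used here, and per-read quality scores could be passed instead. The function name `consensus` is hypothetical.

```python
from collections import defaultdict


def consensus(rows, weights=None):
    """Pick each consensus letter by weighted vote over one alignment column.

    A space marks a gap; a column whose winning symbol is a gap is dropped.
    """
    weights = weights or [1.0] * len(rows)
    result = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for base, weight in zip(column, weights):
            votes[base] += weight
        best = max(votes, key=votes.get)
        if best != " ":
            result.append(best)
    return "".join(result)


rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
]
result = consensus(rows)
print(result)  # TAGATTACACAGATTACTGACTTGATGGCGTAACTA
```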

Some Assemblers
• PHRAP
• Early assembler, widely used, good model of read errors
• Overlap O(n²) ⇒ layout (no mate pairs) ⇒ consensus
• Celera
• First assembler to handle large genomes (fly, human, mouse)
• Overlap ⇒ layout ⇒ consensus
• Arachne
• Public assembler (mouse, several fungi)
• Overlap ⇒ layout ⇒ consensus
• Phusion
• Overlap ⇒ clustering ⇒ PHRAP ⇒ assemblage ⇒ consensus
• Euler
• Indexing ⇒ Euler graph ⇒ layout by picking paths ⇒ consensus
Quality of assemblies—mouse

Terminology: N50 contig length

If we sort contigs from largest to smallest, and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
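
The definition translates directly into a short Python function. One assumption to note: the sketch measures the 50th percentile against the total length of the contigs themselves, which is a common stand-in when the true genome size is unknown. The function name `n50` and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total first reaches half of
    the summed contig length, visiting contigs from largest to smallest."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

Here the total is 290, so the 50th percentile is reached at 145: the largest contig covers 80, and adding the second (70) brings the running total to 150.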

7.7X sequence coverage

Quality of assemblies—dog

7.5X sequence coverage

Quality of assemblies—chimp

3.6X sequence coverage

Assisted assembly

History of WGA

• 1982: λ-virus, 48,502 bp
• 1995: H. influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present
• human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

1997:

“Let’s sequence the human genome with the shotgun strategy” (Gene Myers)

“That is impossible, and a bad idea anyway” (Phil Green)

• \$985: deCODEme (November 2007)
• \$399: Personal Genome Service (November 2007)
• \$2,500: Health Compass service (April 2008)
• Genetic Information Nondiscrimination Act (May 2008)
• \$350,000: Whole-genome sequencing (November 2007)

Applications

Whole-genome sequencing

Comparative genomics

Genome resequencing

Structural variation analysis

Polymorphism discovery

Metagenomics

Environmental sequencing

Gene expression profiling

Genotyping

Population genetics

Migration studies

Ancestry inference

Relationship inference

Genetic screening

Drug targeting

Forensics

New sequencing applications

Sequencing applications

Increase in sequencing data output

Demand for more sequencing

Sequencing technology improvement

Sequencing technology

Sanger sequencing (Fred Sanger)

[Chart: cost per finished bp, on a scale from \$10.00 down to \$0.01, over the years 1975, 1980, 1990, 2000, 2008]

Read length: 15 – 200 bp; 500 – 1,000 bp

Throughput: 2 ∙ 10^6 bp/day

Sequencing technology

Sanger sequencing

1x coverage: 3 ∙ 10^9 bp

Time: (10x coverage × 3 ∙ 10^9 bp) / (2 ∙ 10^6 bp/day) = 15,000 days ≈ 40 years

Cost: 10x coverage × 3 ∙ 10^9 bp × \$0.001/bp = \$30 million
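
The arithmetic on this slide can be checked with a few lines of Python. All four inputs are the slide's own figures; the variable names are made up, and the single-instrument throughput is taken at face value.

```python
GENOME_BP = 3e9              # human genome size, from the slide
COVERAGE = 10                # 10x coverage
THROUGHPUT_BP_PER_DAY = 2e6  # Sanger throughput, from the slide
COST_PER_BP = 0.001          # dollars per bp, from the slide

total_bp = COVERAGE * GENOME_BP
days = total_bp / THROUGHPUT_BP_PER_DAY
years = days / 365
cost = total_bp * COST_PER_BP

print(f"{days:.0f} days = {years:.1f} years, ${cost / 1e6:.0f} million")
```

The result is 15,000 days, or about 41 years, which the slide rounds to 40.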

Pyrosequencing on a chip
• Mostafa Ronaghi, Stanford Genome Technologies Center
• 454 Life Sciences

Sequencing technology

Next-generation sequencing

Genome Sequencer FLX

• Read length: 250 bp
• Throughput: 300 Mb/day
• Cost: ~ 10,000 bp/\$
• De novo: yes

Sequencing technology

Next-generation sequencing

Genome Analyzer, SOLiD Analyzer

• Read length: ~ 35 bp
• Throughput: 300 – 500 Mb/day
• Cost: ~ 100,000 bp/\$
• De novo: yes

Sequencing technology

Next-generation sequencing

“SNP chips”: Infinium Assay, GeneChip Array

• Read length: 1 bp (genotypes)
• Throughput: 1 – 2 Mb/day
• Cost: 5,000 bp/\$
• De novo: no

Sequencing technology

Next-generation sequencing

Nanopore Sequencing

http://www.mcb.harvard.edu/branton/index.htm