Sequencing and assembly

Sequencing and Assembly

Cont’d


Steps to Assemble a Genome

Some Terminology

  • read: a 500-900 bp long word that comes out of the sequencer

  • mate pair: a pair of reads from two ends of the same insert fragment

  • contig: a contiguous sequence formed by several overlapping reads with no gaps

  • supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs

  • consensus sequence: sequence derived from the multiple alignment of reads in a contig

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence

..ACGATTACAATAGGTT..



2. Merge Reads into Contigs

  • Overlap graph:

    • Nodes: reads r1, …, rn

    • Edges: overlaps (ri, rj, shift, orientation, score)

Reads that come from two regions of the genome (blue and red) that contain the same repeat

Note: of course, we don’t know the “color” of these nodes
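The overlap graph described above can be sketched in a few lines. This is a minimal illustration, not any assembler's actual method: it assumes exact suffix-prefix matches only, ignores read orientation (reverse complements), and uses the overlap length as the score. The function names and the `min_len` parameter are hypothetical; the three reads are cut from the example sequence ACGATTACAATAGGTT.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (at least `min_len` long), or 0 if there is none."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Return edges {(i, j): (shift, score)} over ordered pairs of reads.

    shift = offset of read j relative to the start of read i,
    score = overlap length (an assumed, simplistic score).
    Orientation is not modeled in this sketch.
    """
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        overlap = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if overlap:
            edges[(i, j)] = (len(reads[i]) - overlap, overlap)
    return edges

# Three reads taken from ACGATTACAATAGGTT
reads = ["ACGATTACA", "TTACAATAG", "AATAGGTT"]
graph = build_overlap_graph(reads)
```

Comparing all pairs of reads is quadratic in the number of reads; real assemblers index the reads first, which this sketch leaves out.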


2. Merge Reads into Contigs

We want to merge reads up to potential repeat boundaries

[Figure: a repeat region, with contigs labeled “Unique Contig” and “Overcollapsed Contig”]



2. Merge Reads into Contigs

  • Remove transitively inferable overlaps

    • If read r overlaps reads r1 and r2 to its right, and r1 overlaps r2, then (r, r2) can be inferred from (r, r1) and (r1, r2)

[Figure: overlapping reads r, r1, r2, r3]
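The transitive-overlap rule above can be sketched as follows. This is a simplified illustration that assumes the overlaps are already given as directed pairs (u, v), meaning "u overlaps v to its right"; the function name and the sample edges are hypothetical.

```python
def remove_transitive_edges(edges):
    """Drop every edge (u, w) that can be inferred from (u, v) and (v, w).

    edges: set of (u, v) pairs, read u overlapping read v to its right.
    Inference is always checked against the original edge set, so the
    result does not depend on the order in which edges are visited.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    reduced = set(edges)
    for u, v in edges:
        for w in successors.get(v, ()):
            if (u, w) in edges:
                reduced.discard((u, w))
    return reduced

# r overlaps r1 and r2, r1 overlaps r2, r2 overlaps r3
overlaps = {("r", "r1"), ("r", "r2"), ("r1", "r2"), ("r2", "r3")}
kept = remove_transitive_edges(overlaps)
```

In this example only (r, r2) is dropped, leaving the chain r, r1, r2, r3.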



2. Merge Reads into Contigs

  • Ignore “hanging” reads when detecting repeat boundaries

[Figure: reads a and b; a hanging read that looks like a repeat boundary (“repeat boundary???”) is due to a sequencing error]



Overlap graph after forming contigs

Unitigs (Gene Myers, 95)
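One way to picture contig formation on the reduced overlap graph is to merge reads along chains that never branch, stopping wherever a read has several possible neighbors (a potential repeat boundary). The sketch below is a simplified reading of that idea, not Myers's actual algorithm; it assumes an acyclic graph with directed edges, and the function name and example graph are hypothetical.

```python
def find_unitigs(nodes, edges):
    """Group reads into maximal chains connected by unambiguous edges.

    An edge (u, v) is unambiguous when v is the only successor of u and
    u is the only predecessor of v. Assumes the graph has no cycles.
    """
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def unambiguous(u, v):
        return succ[u] == [v] and pred[v] == [u]

    unitigs = []
    for n in nodes:
        # Skip reads that an unambiguous edge attaches to a read on their left
        if len(pred[n]) == 1 and unambiguous(pred[n][0], n):
            continue
        chain = [n]
        # Extend to the right while the next edge is unambiguous
        while len(succ[chain[-1]]) == 1 and unambiguous(chain[-1], succ[chain[-1]][0]):
            chain.append(succ[chain[-1]][0])
        unitigs.append(chain)
    return unitigs

# Read c branches to d and e, so the chain a, b, c ends there
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]
chains = find_unitigs(nodes, edges)
```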



Repeats, errors, and contig lengths

  • Repeats shorter than read length are easily resolved

    • Read that spans across a repeat disambiguates order of flanking regions

  • Repeats whose copies differ by more base pairs than the sequencing error rate are OK

    • We throw out overlaps between two reads in different copies of the repeat

  • To make the genome appear less repetitive, try to:

    • Increase read length

    • Decrease sequencing error rate

      Role of error correction:

      Discards up to 98% of single-letter sequencing errors

      → decreases error rate

      → decreases effective repeat content

      → increases contig length
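The rule about repeats with more differences than the sequencing error rate can be illustrated with a simple filter: if two reads disagree in their overlap more often than sequencing errors would explain, the overlap is thrown out. This is a sketch under stated assumptions: the overlap regions are already aligned without gaps, and the `max_error_rate` threshold of 0.1 is an arbitrary illustrative value, not one given in the slides.

```python
def mismatch_rate(region_a, region_b):
    """Fraction of positions at which two equal-length aligned regions differ."""
    if len(region_a) != len(region_b):
        raise ValueError("aligned regions must have equal length")
    if not region_a:
        raise ValueError("aligned regions must not be empty")
    mismatches = sum(x != y for x, y in zip(region_a, region_b))
    return mismatches / len(region_a)

def keep_overlap(region_a, region_b, max_error_rate=0.1):
    """Keep an overlap only if its mismatches are explainable by sequencing error."""
    return mismatch_rate(region_a, region_b) <= max_error_rate

same_copy = keep_overlap("TAGATTACACAGATTACTGA", "TAGATTACACAGATTATTGA")   # 1 of 20 differ
other_copy = keep_overlap("TAGATTACACAGATTACTGA", "TAGCTTACGCAGATTATTGC")  # 4 of 20 differ
```

Lowering the sequencing error rate lets the threshold be tightened, so fewer overlaps between different repeat copies survive.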



3. Link Contigs into Supercontigs

Normal density

Too dense → Overcollapsed

Inconsistent links → Overcollapsed?



3. Link Contigs into Supercontigs

Find all links between unique contigs

Connect contigs incrementally, if ≥ 2 forward-reverse links

supercontig (aka scaffold)
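The linking rule can be sketched by counting mate-pair links between unique contigs and keeping only the pairs with enough support. This is a minimal illustration: it assumes the threshold is at least 2 links, treats each mate pair as an unordered pair of contig names, and does not check the forward-reverse orientation or the insert length. All names below are hypothetical.

```python
from collections import Counter

def supported_links(mate_links, unique_contigs, min_links=2):
    """Return the contig pairs joined by at least `min_links` mate pairs.

    mate_links: iterable of (contig_a, contig_b), one entry per mate pair
                whose two reads were placed in those contigs.
    unique_contigs: contigs believed to be unique (not repeats).
    """
    counts = Counter()
    for a, b in mate_links:
        if a != b and a in unique_contigs and b in unique_contigs:
            counts[frozenset((a, b))] += 1
    return {pair for pair, n in counts.items() if n >= min_links}

links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c1", "rep"), ("c1", "rep")]
joined = supported_links(links, unique_contigs={"c1", "c2", "c3"})
```

Here c1 and c2 are joined by two links, c2 and c3 by only one, and the links into the repeat contig are ignored at this stage.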



3. Link Contigs into Supercontigs

  • Fill gaps in supercontigs with paths of repeat contigs

  • Complex algorithmic step

    • Exponential number of paths

    • Forward-reverse links



4. Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)
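Weighted voting over the columns of the multiple alignment can be sketched as below. This is a simplified illustration: it assumes one weight per read (a hypothetical stand-in for per-base quality values), marks gaps with "-", and drops columns where the gap wins the vote. The aligned rows are the reads from the slide, with gaps written as "-".

```python
from collections import defaultdict

def consensus(rows, weights=None):
    """Derive a consensus string from equal-length aligned reads.

    rows: aligned reads, '-' marks a gap.
    weights: one vote weight per read (defaults to 1.0 each).
    """
    if weights is None:
        weights = [1.0] * len(rows)
    result = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for base, weight in zip(column, weights):
            votes[base] += weight
        winner = max(votes, key=votes.get)
        if winner != "-":  # a winning gap contributes no consensus base
            result.append(winner)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
]
consensus_sequence = consensus(rows)
```

With equal weights every isolated difference is outvoted, so the consensus equals the fourth read with its gap removed.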



Some Assemblers

  • PHRAP

    • Early assembler, widely used, good model of read errors

    • Overlap O(n^2) → layout (no mate pairs) → consensus

  • Celera

    • First assembler to handle large genomes (fly, human, mouse)

    • Overlap → layout → consensus

  • Arachne

    • Public assembler (mouse, several fungi)

    • Overlap → layout → consensus

  • Phusion

    • Overlap → clustering → PHRAP → assemblage → consensus

  • Euler

    • Indexing → Euler graph → layout by picking paths → consensus
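The "Euler graph" step in the last pipeline refers to a graph built from short words (k-mers) of the reads rather than from whole-read overlaps. The sketch below builds such a de Bruijn-style graph; it is only an illustration of the indexing idea, not the Euler assembler's code, and it leaves out error correction and path picking. The function name and the choice of k are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read.

    Every k-mer of every read contributes one edge from its first
    k-1 letters to its last k-1 letters.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

small_graph = de_bruijn_graph(["ACGT"], k=3)
```

An assembly then corresponds to a path through this graph that uses the edges, which is where the "layout by picking paths" step comes in.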



    Quality of assemblies—mouse

    Terminology: N50 contig length

    If we sort contigs from largest to smallest and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
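The N50 definition above translates directly into code. One assumption in this sketch: the 50% mark is taken relative to the total length of the contigs passed in, since the true genome size is not always known. The function name and sample lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig that brings cumulative coverage to at least 50%.

    Contigs are taken from largest to smallest; the total is the sum of
    all contig lengths (an assumption, in place of the genome size).
    """
    total = sum(contig_lengths)
    if total <= 0:
        raise ValueError("need at least one contig with positive length")
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length

example_n50 = n50([80, 70, 50, 40, 30, 20, 10])
```

In this example the total is 300, and the two largest contigs (80 + 70) reach 150, so the N50 is 70.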

    7.7X sequence coverage



    Quality of assemblies—dog

    7.5X sequence coverage



    Quality of assemblies—chimp

    3.6X sequence coverage, assisted assembly



    History of WGA

    • 1982: λ-virus, 48,502 bp

    • 1995: H. influenzae, 1 Mbp

    • 2000: fly, 100 Mbp

    • 2001 – present

      • human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

    1997:

    Gene Myers: Let’s sequence the human genome with the shotgun strategy

    Phil Green: That is impossible, and a bad idea anyway



    $985 deCODEme (November 2007)

    $399 Personal Genome Service (November 2007)

    $2,500 Health Compass service (April 2008)

    Genetic Information Nondiscrimination Act (May 2008)

    $350,000 Whole-genome sequencing (November 2007)



    Applications

    Whole-genome sequencing

    Comparative genomics

    Genome resequencing

    Structural variation analysis

    Polymorphism discovery

    Metagenomics

    Environmental sequencing

    Gene expression profiling

    Genotyping

    Population genetics

    Migration studies

    Ancestry inference

    Relationship inference

    Genetic screening

    Drug targeting

    Forensics



    New sequencing applications

    Sequencing applications

    Increase in sequencing data output

    Demand for more sequencing

    Sequencing technology improvement



    Sequencing technology

    Sanger sequencing

    [Chart: cost per finished bp, on a scale from $10.00 down to $0.01, over 1975, 1980, 1990, 2000, 2008]

    Fred Sanger

    Read length: 15 – 200 bp → 500 – 1,000 bp

    Throughput: “grad-student years” → 2 ∙ 10^6 bp/day



    Sequencing technology

    Sanger sequencing

    1x coverage: 3 ∙ 10^9 bp

    Time: (10x coverage × 3 ∙ 10^9 bp) / (2 ∙ 10^6 bp/day) ≈ 40 years

    Cost: 10x coverage × 3 ∙ 10^9 bp × $0.001/bp = $30 million
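The arithmetic on this slide can be checked with a few lines. The sketch assumes a single sequencing pipeline running every day of the year at the stated throughput; the constant names are hypothetical, and the numbers are the ones on the slide.

```python
GENOME_BP = 3e9                # human genome size, bp
COVERAGE = 10                  # 10x coverage
THROUGHPUT_BP_PER_DAY = 2e6    # Sanger throughput from the slide
COST_PER_BP = 0.001            # dollars per bp, from the slide

total_bp = COVERAGE * GENOME_BP
days = total_bp / THROUGHPUT_BP_PER_DAY
years = days / 365
cost_dollars = total_bp * COST_PER_BP

print(f"{days:.0f} days, about {years:.0f} years")
print(f"${cost_dollars / 1e6:.0f} million")
```

The run time works out to 15,000 days, about 41 years, which the slide rounds to 40, and the cost to $30 million.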



    Pyrosequencing on a chip

    • Mostafa Ronaghi, Stanford Genome Technologies Center

    • 454 Life Sciences



    Sequencing technology

    Next-generation sequencing

    “short reads”: Genome Sequencer / FLX

    Read length: 250 bp

    Throughput: 300 Mb/day

    Cost: ~ 10,000 bp/$

    De novo: yes



    Single Molecule Array for Genotyping—Solexa



    Polony Sequencing



    Sequencing technology

    Next-generation sequencing

    “microreads”: Genome Analyzer, SOLiD Analyzer

    Read length: ~ 35 bp

    Throughput: 300 – 500 Mb/day

    Cost: ~ 100,000 bp/$

    De novo: yes



    Molecular Inversion Probes



    Illumina Genotype Arrays



    Sequencing technology

    Next-generation sequencing

    “SNP chips”: Infinium Assay, GeneChip Array (genotypes)

    Read length: 1 bp

    Throughput: 1 – 2 Mb/day

    Cost: 5,000 bp/$

    De novo: no



    Nanopore Sequencing

    http://www.mcb.harvard.edu/branton/index.htm



    Sequencing technology

    ?

