Fragment assembly
Download
1 / 48

Fragment Assembly - PowerPoint PPT Presentation


  • 133 Views
  • Uploaded on

Fragment Assembly. Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm. Steps to Assemble a Genome. Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Fragment Assembly' - borka


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Fragment assembly
Fragment Assembly

Given N reads…

Where N ~ 30 million…

We need to use a linear-time algorithm


Steps to assemble a genome
Steps to Assemble a Genome

Some Terminology

read a 500-900 long word that comes

out of sequencer

mate pair a pair of reads from two ends

of the same insert fragment

contig a contiguous sequence formed

by several overlapping reads

with no gaps

supercontig an ordered and oriented set

(scaffold) of contigs, usually by mate

pairs

consensus sequence derived from the

sequene multiple alignment of reads

in a contig

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence

..ACGATTACAATAGGTT..


1 find overlapping reads

T GA

TACA

| ||

||

TAGA

TAGT

1. Find Overlapping Reads

  • Find pairs of reads sharing a k-mer, k ~ 24

  • Extend to full alignment – throw away if not >98% similar

TAGATTACACAGATTAC

|||||||||||||||||

TAGATTACACAGATTAC

  • Caveat: repeats

    • A k-mer that occurs N times, causes O(N2) read/read comparisons

    • ALU k-mers could cause up to 1,000,0002 comparisons

  • Solution:

    • Discard all k-mers that occur “too often”

      • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available


1 find overlapping reads1
1. Find Overlapping Reads

  • Correcterrors using multiple alignment

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTATTGA

TAG-TTACACAGATTATTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAG-TTACACAGATTACTGA

TAG-TTACACAGATTATTGA

insert A

correlated errors—

probably caused by repeats

 disentangle overlaps

replace T with C

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

TAGATTACACAGATTACTGA

In practice, error correction removes

up to 98% of the errors

TAG-TTACACAGATTATTGA

TAG-TTACACAGATTATTGA


2 merge reads into contigs
2. Merge Reads into Contigs

  • Overlap graph:

    • Nodes: reads r1…..rn

    • Edges: overlaps (ri, rj, shift, orientation, score)

Reads that come

from two regions of

the genome (blue

and red) that contain

the same repeat

Note:

of course, we don’t

know the “color” of

these nodes



Overlap graph after forming contigs
Overlap graph after forming contigs

Unitigs:

Gene Myers, 95


Repeats errors and contig lengths
Repeats, errors, and contig lengths

  • Repeats shorter than read length are easily resolved

    • Read that spans across a repeat disambiguates order of flanking regions

  • Repeats with more base pair diffs than sequencing error rate are OK

    • We throw overlaps between two reads in different copies of the repeat

  • To make the genome appear less repetitive, try to:

    • Increase read length

    • Decrease sequencing error rate

      Role of error correction:

      Discards up to 98% of single-letter sequencing errors

      decreases error rate

       decreases effective repeat content

       increases contig length


3. Link Contigs into Supercontigs

Normal density

Too dense

 Overcollapsed

Inconsistent links

Overcollapsed?


3. Link Contigs into Supercontigs

Find all links between unique contigs

Connect contigs incrementally, if  2 links

supercontig

(aka scaffold)


3. Link Contigs into Supercontigs

Fill gaps in supercontigs with paths of repeat contigs


4 derive consensus sequence
4. Derive Consensus Sequence

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA

Derive multiple alignment from pairwise read alignments

TAGATTACACAGATTACTGACTTGATGGCGTAAACTA

TAG TTACACAGATTATTGACTTCATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting

(Alternative: take maximum-quality letter)


Some assemblers
Some Assemblers

  • PHRAP

    • Early assembler, widely used, good model of read errors

    • Overlap O(n2)  layout (no mate pairs)  consensus

  • Celera

    • First assembler to handle large genomes (fly, human, mouse)

    • Overlap  layout  consensus

  • Arachne

    • Public assembler (mouse, several fungi)

    • Overlap  layout  consensus

  • Phusion

    • Overlap  clustering  PHRAP  assemblage  consensus

  • Euler

    • Indexing  Euler graph  layout by picking paths  consensus


  • Quality of assemblies
    Quality of assemblies

    Celera’s assemblies of human and mouse



    Quality of assemblies mouse1
    Quality of assemblies—mouse

    Terminology:N50 contig length

    If we sort contigs from largest to smallest, and start

    Covering the genome in that order, N50 is the length

    Of the contig that just covers the 50th percentile.



    History of wga
    History of WGA

    1997

    • 1982: -virus, 48,502 bp

    • 1995: h-influenzae, 1 Mbp

    • 2000: fly, 100 Mbp

    • 2001 – present

      • human (3Gbp), mouse (2.5Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes

    Let’s sequence the human genome with the shotgun strategy

    That is impossible, and a bad idea anyway

    Phil Green

    Gene Myers


    Genomes sequenced
    Genomes Sequenced

    • http://www.genome.gov/10002154



    Evolution at the dna level
    Evolution at the DNA level

    Deletion

    Mutation

    …ACGGTGCAGTTACCA…

    SEQUENCE EDITS

    …AC----CAGTCCACCA…

    REARRANGEMENTS

    Inversion

    Translocation

    Duplication


    Protein phylogenies
    Protein Phylogenies

    • Proteins evolve by both duplication and species divergence


    Orthology and paralogy
    Orthology and Paralogy

    Yeast

    Orthologs:Derived by speciation

    Paralogs:

    Everything else

    HA1 Human

    HA2 Human

    WA Worm

    HB Human

    WB Worm



    Definition
    Definition

    • Given N sequences x1, x2,…, xN:

      • Insert gaps (-) in each sequence xi, such that

        • All sequences have the same length L

        • Score of the global map is maximum

    • A faint similarity between two sequences becomes significant if present in many

    • Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology

    • The patterns of conservation can help us tell function of the element


    Scoring function sum of pairs
    Scoring Function: Sum Of Pairs

    Definition:Induced pairwise alignment

    A pairwise alignment induced by the multiple alignment

    Example:

    x: AC-GCGG-C

    y: AC-GC-GAG

    z: GCCGC-GAG

    Induces:

    x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG

    y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG


    Sum of pairs cont d
    Sum Of Pairs (cont’d)

    • Heuristic way to incorporate evolution tree:

    Human

    Mouse

    Duck

    Chicken

    • Weighted SOP:

      • S(m) = k<l wkl s(mk, ml)


    A profile representation
    A Profile Representation

    • Given a multiple alignment M = m1…mn

      • Replace each column mi with profile entry pi

        • Frequency of each letter in 

        • # gaps

        • Optional: # gap openings, extensions, closings

      • Can think of this as a “likelihood” of each letter in each position

    - A G G C T A T C A C C T G

    T A G – C T A C C A - - - G

    C A G – C T A C C A - - - G

    C A G – C T A T C A C – G G

    C A G – C T A T C G C – G G

    A 1 1 .8

    C .6 1 .4 1 .6 .2

    G 1 .2 .2 .4 1

    T .2 1 .6 .2

    - .2 .8 .4 .8 .4



    Multidimensional dp
    Multidimensional DP

    Generalization of Needleman-Wunsh:

    S(m) = i S(mi)

    (sum of column scores)

    F(i1,i2,…,iN): Optimal alignment up to (i1, …, iN)

    F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))


    Multidimensional dp1
    Multidimensional DP

    • Example: in 3D (three sequences):

    • 7 neighbors/cell

    F(i,j,k) = max{ F(i – 1, j – 1, k – 1) + S(xi, xj, xk),

    F(i – 1, j – 1, k ) + S(xi, xj, - ),

    F(i – 1, j , k – 1) + S(xi, -, xk),

    F(i – 1, j , k ) + S(xi, -, - ),

    F(i , j – 1, k – 1) + S( -, xj, xk),

    F(i , j – 1, k ) + S( -, xj, - ),

    F(i , j , k – 1) + S( -, -, xk) }


    Multidimensional dp2
    Multidimensional DP

    Running Time:

    • Size of matrix: LN;

      Where L = length of each sequence

      N = number of sequences

    • Neighbors/cell: 2N – 1

      Therefore………………………… O(2N LN)


    Multidimensional dp3
    Multidimensional DP

    • How do gap states generalize?

    • VERY badly!

      • Require 2N – 1 states, one per combination of gapped/ungapped sequences

      • Running time: O(2N 2N  LN) = O(4N LN)

    Running Time:

    • Size of matrix: LN;

      Where L = length of each sequence

      N = number of sequences

    • Neighbors/cell: 2N – 1

      Therefore………………………… O(2N LN)

    Y

    YZ

    XY

    XYZ

    Z

    X

    XZ


    Progressive alignment
    Progressive Alignment

    x

    • When evolutionary tree is known:

      • Align closest first, in the order of the tree

      • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult

        Weighted version:

      • Tree edges have weights, proportional to the divergence in that edge

      • New profile is a weighted average of two old profiles

    pxy

    y

    z

    pxyzw

    pzw

    w


    Progressive alignment1
    Progressive Alignment

    x

    • When evolutionary tree is known:

      • Align closest first, in the order of the tree

      • In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult

        Weighted version:

      • Tree edges have weights, proportional to the divergence in that edge

      • New profile is a weighted average of two old profiles

    y

    Example

    Profile: (A, C, G, T, -)

    px = (0.8, 0.2, 0, 0, 0)

    py = (0.6, 0, 0, 0, 0.4)

    s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A)

    + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)

    Result:pxy= (0.7, 0.1, 0, 0, 0.2)

    s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)

    Result:px-= (0.4, 0.1, 0, 0, 0.5)

    z

    w


    Progressive alignment2
    Progressive Alignment

    x

    • When evolutionary tree is unknown:

      • Perform all pairwise alignments

      • Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment

      • Construct a tree (UPGMA / Neighbor Joining / Other methods)

      • Align on the tree

    y

    ?

    z

    w


    Heuristics to improve alignments
    Heuristics to improve alignments

    • Iterative refinement schemes

    • A*-based search

    • Consistency

    • Simulated Annealing


    Iterative refinement
    Iterative Refinement

    One problem of progressive alignment:

    • Initial alignments are “frozen” even when new evidence comes

      Example:

      x: GAAGTT

      y: GAC-TT

      z: GAACTG

      w: GTACTG

    Frozen!

    Now clear correct y = GA-CTT


    Iterative refinement1

    allow y to vary

    x,z fixed projection

    Iterative Refinement

    Algorithm (Barton-Stenberg):

    • For j = 1 to N,

      Remove xj, and realign to x1…xj-1xj+1…xN

    • Repeat 4 until convergence

    z

    x

    y


    Iterative refinement2
    Iterative Refinement

    Example: align (x,y), (z,w), (xy, zw):

    x: GAAGTTA

    y: GAC-TTA

    z: GAACTGA

    w: GTACTGA

    After realigning y:

    x: GAAGTTA

    y: G-ACTTA + 3 matches

    z: GAACTGA

    w: GTACTGA


    Iterative refinement3
    Iterative Refinement

    Example not handled well:

    x: GAAGTTA

    y1: GAC-TTA

    y2: GAC-TTA

    y3: GAC-TTA

    z: GAACTGA

    w: GTACTGA

    • Realigning any single yi changes nothing


    Consistency
    Consistency

    zk

    z

    xi

    x

    y

    yj

    yj’


    Consistency1
    Consistency

    zk

    z

    Basic method for applying consistency

    • Compute all pairs of alignments xy, xz, yz, …

    • When aligning x, y during progressive alignment,

      • For each (xi, yj), let s(xi, yj) = function_of(xi, yj, axz, ayz)

      • Align x and y with DP using the modified s(.,.) function

    xi

    x

    y

    yj

    yj’


    Real world protein aligners
    Real-world protein aligners

    • MUSCLE

      • High throughput

      • One of the best in accuracy

    • ProbCons

      • High accuracy

      • Reasonable speed


    Muscle at a glance
    MUSCLE at a glance

    • Fast measurement of all pairwise distances between sequences

      • DDRAFT(x, y) defined in terms of # common k-mers (k~3) – O(N2 L logL) time

    • Build tree TDRAFT based on those distances, with UPGMA

    • Progressive alignment over TDRAFT, resulting in multiple alignment MDRAFT

      • Only perform alignment steps for the parts of the tree that have changed

    • Measure new Kimura-based distances D(x, y) based on MDRAFT

    • Build tree T based on D

    • Progressive alignment over T, to build M

    • Iterative refinement; for many rounds, do:

      • Tree Partitioning: Split M on one branch and realign the two resulting profiles

      • If new alignment M’ has better sum-of-pairs score than previous one, accept


    Probcons at a glance
    PROBCONS at a glance

    • Computation of all posterior matrices Mxy : Mxy(i, j) = Prob(xi ~ yj), using a HMM

    • Re-estimation of posterior matrices M’xy with probabilistic consistency

      • M’xy(i, j) = 1/N sequence zk Mxz(i, k)  Myz (j, k); M’xy = Avgz(MxzMzy)

    • Compute for every pair x, y, the maximum expected accuracy alignment

      • Axy: alignment that maximizes aligned (i, j) in AM’xy(i, j)

      • Define E(x, y) = aligned (i, j) in AxyM’xy(i, j)

    • Build tree T with hierarchical clustering using similarity measure E(x, y)

    • Progressive alignment on T to maximize E(.,.)

    • Iterative refinement; for many rounds, do:

      • Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles


    Some resources
    Some Resources

    Genome Resources

    Annotation and alignment genome browser at UCSC

    http://genome.ucsc.edu/cgi-bin/hgGateway

    Specialized VISTA alignment browser at LBNL

    http://pipeline.lbl.gov/cgi-bin/gateway2

    ABC—Nice Stanford tool for browsing alignments

    http://encode.stanford.edu/~asimenos/ABC/

    Protein Multiple Aligners

    http://www.ebi.ac.uk/clustalw/

    CLUSTALW – most widely used

    http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

    MUSCLE – most scalable

    http://probcons.stanford.edu/

    PROBCONS – most accurate


    ad