Alignment Problem

  • (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.

  • Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST

Key Issues

  • Types of alignments (local vs. global)

  • The scoring system

  • The alignment algorithm

  • Measuring alignment significance

Types of Alignment

  • Global: sequences are aligned end to end.

  • Local: alignments may start and end in the middle of either sequence.

  • Ungapped: no insertions or deletions are allowed.

  • Other types: overlap alignments, repeated match alignments.

Local vs. Global Pairwise Alignments

  • A global alignment includes all elements of the sequences and includes gaps.

    • A global alignment may or may not include "end gap" penalties.

    • Global alignments are better indicators of homology over the full length of the sequences, and take longer to compute.

  • A local alignment includes only subsequences, and is sometimes computed without gaps.

    • Local alignments can find shared domains in divergent proteins and are fast to compute, especially with heuristics such as BLAST.
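
The difference between the two alignment types can be sketched with one dynamic-programming routine. This is a minimal illustration, not the method of any particular tool: the function name `alignment_score`, the linear gap penalty, and the default scores (+1, -1, -2) are all assumptions chosen for the example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best global (or, with local=True, local) alignment score.

    Scores are illustrative assumptions: a linear gap penalty and a simple
    match/mismatch scheme, not a substitution matrix.
    """
    # prev[j] is the best score for the prefixes a[:i-1] and b[:j].
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                        prev[j] + gap,       # a[i-1] against a gap
                        curr[j - 1] + gap)   # b[j-1] against a gap
            if local:
                score = max(score, 0)        # a local alignment may restart here
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning everything; local: best cell anywhere.
    return best if local else prev[-1]
```

With these assumed scores, the global mode charges for every unaligned end, while the local mode finds the shared core `GATTACA` inside two otherwise unrelated strings.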

How do you compare alignments?

  • Scoring scheme

    • What events do we score?

      • Matches

      • Mismatches

      • Gaps

    • What scores will you give these events?

    • What assumptions are you making?

  • Score your alignment
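
As a small sketch of the last step, the function below scores one given alignment column by column. The name `score_alignment`, the `-` gap symbol, and the score values are assumptions for illustration; each choice encodes the kind of assumption the questions above ask about.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap.

    The three event types (match, mismatch, gap) and their scores are
    illustrative assumptions, with every gap column charged the same amount.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total
```

For example, `score_alignment("GATTACA", "GA-TACA")` counts six matches and one gap.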

Scoring Matrices

  • How do you determine scores?

  • What is out there already for your use?

  • DNA versus Amino Acids?



Multiple Sequence Alignment

Global versus Local Alignments

  • Progressive alignment

    • Estimate guide tree

    • Do pairwise alignment on subtrees



  • Consistency-based Algorithms

    • T-Coffee: a consistency-based objective function to minimize potential errors

      • Generates pairwise global alignments (Clustal)

      • Generates pairwise local alignments (Lalign)

      • Then combines and reweights them before progressive alignment

Iterative Algorithms

  • Estimate draft progressive alignment (uncorrected distances)

  • Improved progressive (reestimate guide tree using Kimura 2-parameter)

  • Refinement - divide into 2 subtrees, estimate two profiles, then re-align 2 profiles

  • Continue refinement until convergence
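
The distance correction named in the second step can be sketched for DNA as follows. This is only an illustration of the Kimura 2-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions. The function name `k2p_distance` is hypothetical, and the sketch assumes two already-aligned, uppercase DNA sequences with at least one ungapped column.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Columns containing a gap ('-') are skipped. Raises ValueError from
    math.log if the sequences are too divergent for the correction.
    """
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    n = len(pairs)
    # Transitions: purine <-> purine or pyrimidine <-> pyrimidine changes.
    transitions = sum(
        1 for x, y in pairs
        if x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)
    )
    # Every other difference is a transversion.
    transversions = sum(1 for x, y in pairs if x != y) - transitions
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

The corrected distance is slightly larger than the uncorrected proportion of differences, because it allows for multiple substitutions at the same site.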


  • Clustal

  • T-Coffee

  • MUSCLE (limited models)

  • MAFFT (wide variety of models)


  • Speed


  • Accuracy


  • Lots more work to do here!

Why Genome Sequencing?

Modern Sequencing Methods

  • Sanger (1977) introduced a sequencing method amenable to automation.

  • Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly

  • Drosophila melanogaster sequenced (Myers et al. 2000)

  • Homo sapiens sequenced (Venter et al. 2001)

Sanger (1977) introduced chain-termination sequencing.

Main idea: Obtain fragments of all possible lengths, ending in A, C, T, G.

Using gel electrophoresis, we can separate the fragments by length, and then read off the sequence from the order of their terminating bases.
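
The main idea can be mimicked in a few lines. This is a toy model under strong assumptions: it works directly on the synthesized strand, ignores complementarity and read errors, and the function names `chain_termination_fragments` and `read_gel` are made up for the example.

```python
def chain_termination_fragments(strand):
    """Group every prefix length of the strand by its terminating base.

    Each of the four 'lanes' holds the lengths of fragments ending in
    that base, as in four separate chain-termination reactions.
    """
    lanes = {base: [] for base in "ACGT"}
    for length in range(1, len(strand) + 1):
        lanes[strand[length - 1]].append(length)
    return lanes

def read_gel(lanes):
    """Recover the strand by sorting fragments by length across all lanes."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))
```

Sorting by length plays the role of the gel: the shortest fragment gives the first base, the next shortest the second, and so on.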

Automated Sequencing

Perkin-Elmer 3700:

Can sequence ~500bp with 98.5% accuracy

Reads and Contigs

Sequencing machines are limited to reads of about 500-750 bp, so we must break up the DNA into short and long fragments and read both ends of each fragment.

Reads are then assembled into contigs, then scaffolds.

Clone-by-Clone vs. Shotgun

  • Traditionally, long fragments are mapped and then assembled by finding a minimum tiling path; shotgun assembly is then used to sequence each long fragment.

  • Shotgun assembly is cheaper, but requires more computational resources.

  • Drosophila was successfully sequenced using shotgun assembly.

In a Perfect World


  • Good coverage does not guarantee that we can “see” repeats.

  • Read coverage is generally not “truly” random, due to complications in fragmentation and cloning.

  • Any automated approach requires extensive post-processing.


The Fruit Fly

  • Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly.

  • Genome size is ~120 Mbp for the euchromatic (coding) portion, with roughly 13,600 genes.

  • The genome is still being refined.

NIH used a Clone-By-Clone strategy; Celera used shotgun assembly.

Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day.

Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.
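
The relation between read count, read length, and coverage is simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 500 bp read length used in the test values is an assumption taken from the read lengths quoted earlier, not a figure reported for this project.

```python
import math

def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)
```

For example, `reads_needed(8, 500, 2_910_000_000)` gives the number of 500 bp reads that would correspond to 8x average coverage of a 2.91 billion bp sequence, under the assumption that all reads have the same length.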


  • The basic question is: given a set of fragments from a long string, can we reconstruct the string?

  • What is the shortest common superstring of the given fragments?


  • Construct a (directed) overlap graph, where nodes represent reads and edges represent overlap. Paths are contigs in this graph.

  • Problem: Find the consensus sequence by finding a path that visits every node in the layout graph exactly once (a Hamiltonian path).

  • Note: This is an idealization, since we must handle errors!
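
A minimal sketch of the overlap graph, under the same idealization: exact suffix-prefix overlaps only, no read errors, and no reverse complements. The function names and the `min_len` threshold are assumptions made for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph as a dict: (read_a, read_b) -> overlap length."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph
```

On the reads `GATTAC`, `TTACAG`, `ACAGGT` this yields two edges forming a single path, which corresponds to one contig. The all-pairs comparison is quadratic in the number of reads, so it is only meant to show the structure.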

Approximation Algorithms

  • The shortest common superstring problem is NP-complete.

  • Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation.

  • Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).
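
The greedy strategy can be sketched as repeatedly merging the pair of fragments with the largest overlap. This is an illustrative toy that assumes error-free fragments and ignores reverse complements; the function names are hypothetical, and ties are broken by sorted order purely to make the result deterministic.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(fragments):
    """Greedy approximation of a shortest common superstring."""
    unique = set(fragments)
    # Fragments contained in another fragment add nothing to the superstring.
    frags = sorted(f for f in unique
                   if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""
```

The result always contains every input fragment, but it is only guaranteed to be within a constant factor of the shortest possible length, as the bullets above state.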

Handling Repeats

  • We can estimate how many reads should overlap a given region, based on the overall coverage.

  • Repeats will “seem” to have unusually good coverage.

  • Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.
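
The coverage heuristic in the first two bullets can be sketched as a simple threshold test. This is not Celera's method, which the text notes is proprietary; the function name, the per-position depth input, and the factor of 2 are all assumptions for illustration.

```python
def flag_repeat_candidates(depths, expected, factor=2.0):
    """Return positions whose read depth is at least factor * expected.

    depths:   observed number of reads covering each position
    expected: average coverage expected for unique sequence
    factor:   assumed threshold; a two-copy repeat collapsed into one
              region would show roughly twice the expected depth
    """
    return [i for i, d in enumerate(depths) if d >= factor * expected]
```

For example, with an expected depth of 8, positions covered by 16 or more reads would be flagged as possible collapsed repeats.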

The Big Picture


Suppose a hybridization assay let us probe which fragments of length k are present in our sequence.

Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.


  • Then instead of reads, we have regularly sized fragments, k-mers.

  • Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph.

  • Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
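
The construction above can be sketched as follows. The code assumes an idealized input: an error-free list of k-mers with multiplicities, for which an Eulerian path exists. The function names are hypothetical, and the path is found with Hierholzer's algorithm, one standard way to do it.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Multigraph: each k-mer is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in sorted(nodes)
                  if len(graph.get(u, [])) - in_deg[u] == 1),
                 sorted(graph)[0])
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit node
    return path[::-1]

def assemble(kmers):
    """Spell out the sequence along an Eulerian path of the k-mer graph."""
    path = eulerian_path(de_bruijn(kmers))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])
```

When a (k-1)-mer repeats, several Eulerian paths may exist, so the assembled string can differ from the original while containing exactly the same k-mers.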

Bridges of Königsberg

Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has 2 or fewer vertices of odd degree.
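
The condition in the theorem is easy to check directly. The sketch below treats the graph as an undirected multigraph given as a list of edges; the function name and the labels A to D for the four land masses of Königsberg are assumptions for the example.

```python
from collections import Counter

def has_eulerian_path(edges):
    """True if the undirected multigraph is connected with <= 2 odd vertices."""
    if not edges:
        return True
    degree = Counter()
    adj = {}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Depth-first search to test connectivity.
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    odd = sum(1 for d in degree.values() if d % 2)
    return len(seen) == len(adj) and odd <= 2

# The seven bridges: A is the central island, B and C the river banks,
# D the eastern land mass.
KOENIGSBERG = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]
```

All four land masses have odd degree (5, 3, 3, 3), so `has_eulerian_path(KOENIGSBERG)` is false, which is Euler's original conclusion.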

Pros and Cons

  • An Eulerian path in a graph can be found in linear time, if one exists.

  • Errors in the hybridization experiments may prevent us from finding a solution.

  • Can we just use reads as “virtual” hybridization data?
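
One way to read the last question is to break every read into its k-mers and treat the result as if it came from a hybridization experiment. A minimal sketch, with a hypothetical function name; note that a set records presence only, so k-mer multiplicities are lost, just as in an idealized presence/absence assay.

```python
def kmer_spectrum(reads, k):
    """Set of all length-k substrings occurring in any read."""
    return {read[i:i + k]
            for read in reads
            for i in range(len(read) - k + 1)}
```

For example, the reads `GATTAC` and `TTACAG` share the 4-mer `TTAC`, so their combined spectrum for k = 4 has five elements rather than six.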

Graph Preprocessing

  • A single read error means up to k missing/erroneous edges. But we cannot correct this until we are done assembling!

  • Greedily mutate reads to minimize the size of the set of k-mers.

  • We also need to deal with repeats, which requires contracting certain paths to single edges…

Sizes of genomes and numbers of genes

Sequencing parameters

  • Difficulty and cost of large-scale sequencing projects depend on the following parameters:

    • Accuracy

      • How many errors are tolerated

    • Coverage

      • How many times the same region is sequenced

  • The two parameters are related

    • More coverage usually means higher accuracy

    • Accuracy is also dependent on the finishing effort

Sequence accuracy

  • Highly accurate sequences are needed for the following:

    • Diagnostics

      • e.g., Forensics, identifying disease alleles in a patient

    • Protein coding prediction

      • One insertion or deletion changes the reading frame

  • Lower accuracy sufficient for homology searches

    • Differences in sequence are tolerated by search programs
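
The effect of a single insertion on the reading frame can be seen by splitting a sequence into codons. The helper name `codons` and the example sequence are assumptions for illustration.

```python
def codons(seq, frame=0):
    """Split seq into consecutive complete triplets starting at frame."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

original = "ATGGCCAAATGA"
# One erroneous extra base (T) inserted after the first codon.
with_insertion = original[:3] + "T" + original[3:]
```

Here `codons(original)` gives `ATG GCC AAA TGA`, while `codons(with_insertion)` gives `ATG TGC CAA ATG`: every codon after the error changes, and the stop codon `TGA` is no longer read in frame.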

Sequence accuracy and sequencing cost

  • Level of accuracy determines cost of project

    • Increasing accuracy from one error in 100 to one error in 10,000 increases costs three to fivefold

  • Need to determine appropriate level of accuracy for each project

    • If reference sequence already exists, then a lower level of accuracy should suffice

      • Can find genes in genome, but not their position

Sequencing coverage

  • Coverage is the number of times the same region is sequenced

    • Ideally, one wants an equal number of sequences in each direction

  • To obtain accuracy of one error in 10,000 bases, one needs the following:

    • 10x coverage

      • Stringent finishing

    • Complete sequence

      • Base-perfect sequencing

NCBI Genome Summary

    • NCBI