
# Alignment Problem

Alignment Problem. (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST. Key Issues. Types of alignments (local vs. global)


## PowerPoint Slideshow about 'Alignment Problem' - charis


• (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.

• Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLAST

• Types of alignments (local vs. global)

• The scoring system

• The alignment algorithm

• Measuring alignment significance

• Global—sequences aligned from end-to-end.

• Local—alignments may start in the middle of either sequence

• Ungapped—no insertions or deletions are allowed

• Other types: overlap alignments, repeated match alignments

Local vs. Global Pairwise Alignments

• A global alignment includes all elements of the sequences and includes gaps.

• A global alignment may or may not include "end gap" penalties.

• Global alignments are better indicators of homology and take longer to compute.

• A local alignment includes only subsequences, and sometimes is computed without gaps.

• Local alignments can find shared domains in divergent proteins and are fast to compute

• Scoring scheme

• What events do we score?

• Matches

• Mismatches

• Gaps

• What scores will you give these events?

• What assumptions are you making?

• How do you determine scores?

• DNA versus Amino Acids?

• TTACGGAGCTTC

• CTGAGATCC
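A minimal sketch of how a scoring scheme turns into global and local alignment scores for the two example sequences above. The match, mismatch, and gap values are illustrative assumptions, not values from the slides, and the recurrences are the standard Needleman-Wunsch (global) and Smith-Waterman (local) dynamic programs with a linear gap penalty.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per symbol.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere in either sequence.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


x, y = "TTACGGAGCTTC", "CTGAGATCC"
print("global:", global_score(x, y), "local:", local_score(x, y))
```

Both functions fill the same quadratic table; the local version differs only in clamping each cell at zero and reporting the best cell rather than the final one, so its score is never lower than the global score.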

Global versus Local Alignments

• Progressive alignment

• Estimate guide tree

• Do pairwise alignment on subtrees

ClustalX

• Consistency-based Algorithms

• T-Coffee - consistency-based objective function to minimize potential errors

• Generates pair-wise global (Clustal)

• Local (Lalign)

• Then combine, reweight, progressive alignment

• Estimate draft progressive alignment (uncorrected distances)

• Improved progressive (reestimate guide tree using Kimura 2-parameter)

• Refinement - divide into 2 subtrees, estimate two profiles, then re-align 2 profiles

• Continue refinement until convergence

• Clustal

• T-Coffee

• MUSCLE (limited models)

• MAFFT (wide variety of models)

• Speed

• MUSCLE > MAFFT > ClustalW > T-Coffee

• Accuracy

• MAFFT > MUSCLE > T-Coffee > ClustalW

• Lots more work to do here!

• Sanger (1982) introduced a sequencing method amenable to automation.

• Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly

• Drosophila melanogaster sequenced (Myers et al. 2000)

• Homo sapiens sequenced (Venter et al. 2001)

Sanger (1982) introduced chain-termination sequencing.

Main idea: Obtain fragments of all possible lengths, ending in A, C, T, G.

Using gel electrophoresis, we can separate fragments of differing lengths, and then assemble them.
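The main idea above can be illustrated with a toy model. This is an idealization under stated assumptions: the hypothetical helpers below treat each fragment as a prefix of the template itself (the real reaction synthesizes the complementary strand), and they assume every length is observed exactly once with no errors.

```python
def chain_termination_fragments(template):
    """Return every prefix of the template, grouped by its terminating base."""
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for end in range(1, len(template) + 1):
        fragment = template[:end]
        lanes[fragment[-1]].append(fragment)
    return lanes


def read_gel(lanes):
    """Order fragments by length, as a gel would, and read off each terminal base."""
    ordered = sorted(
        (len(fragment), base)
        for base, fragments in lanes.items()
        for fragment in fragments
    )
    return "".join(base for _, base in ordered)


lanes = chain_termination_fragments("GATTACA")
print(read_gel(lanes))
```

Because each length occurs in exactly one lane, sorting by length and reading the lane label recovers the sequence one base at a time.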

Perkin-Elmer 3700:

Can sequence ~500bp with 98.5% accuracy

Sequencing machines are limited to reads of about 500-750 bp, so we must break up DNA into short and long fragments, with reads on either end.

Reads are then assembled into contigs, then scaffolds.

• Traditionally, long fragments are mapped, and then assembled by finding a minimum tiling path. Then, shotgun assembly is used to sequence long fragments.

• Shotgun assembly is cheaper, but requires more computational resources.

• Drosophila was successfully sequenced using shotgun assembly.

• Good coverage does not guarantee that we can “see” repeats.

• Read coverage is generally not “truly” random, due to complications in fragmentation and cloning.

• Any automated approach requires extensive post-processing.

• Phrap (www.phrap.org)

• Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly.

• Genome size is ~120Mbp for euchromatic (coding) portion, with roughly 13,600 genes.

• The genome is still being refined.

Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day.

Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.

Abstraction

• The basic question is: given a set of fragments from a long string, can we reconstruct the string?

• What is the shortest common superstring of the given fragments?

Overlap-Layout-Consensus

• Construct a (directed) overlap graph, where nodes represent reads and edges represent overlap. Paths are contigs in this graph.

• Problem: Find the consensus sequence by finding a path that visits all nodes in the layout graph.

• Note: This is an idealization, since we must handle errors!
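A minimal sketch of the overlap step, assuming error-free reads and exact suffix-prefix matches. The function names and the `min_len` threshold are illustrative assumptions; a real assembler would use approximate matching and an index rather than comparing all pairs.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when the longest match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Directed overlap graph as a dict {(i, j): overlap length} over read indices."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges


reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(overlap_graph(reads))
```

For these three reads the graph is a single chain from read 0 to read 1 to read 2, which is the layout a contig corresponds to.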

Approximation Algorithms

• The shortest common superstring problem is NP-complete.

• Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation.

• Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).
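The greedy heuristic mentioned above can be sketched as follows: repeatedly merge the pair of fragments with the largest overlap. This is a small illustration assuming error-free fragments on a single strand; the helper names are hypothetical, and the quadratic pair search is chosen for clarity rather than speed.

```python
def suffix_prefix(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(fragments):
    """Greedy approximation to the shortest common superstring."""
    # Drop duplicates and fragments contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = suffix_prefix(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0] if frags else ""


print(greedy_superstring(["ACGTAC", "TACGGA", "GGATTT"]))
```

The result always contains every input fragment, but it is only guaranteed to be within a constant factor of the optimum length, which is the sense of the approximation bound quoted above.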

Handling Repeats

• We can estimate how many reads should overlap a given region, based on the overall coverage.

• Repeats will “seem” to have unusually good coverage.

• Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.

The Big Picture

Hybridization

Suppose we had a way to probe fragments of length k that were present in our sequence, from a hybridization assay.

Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.

Sequencing-By-Hybridization

• Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph.

• Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
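A short sketch of the graph construction described above. It assumes the k-mer list carries multiplicities (a repeated k-mer appears more than once), which an actual hybridization assay would not report directly; `kmers_of` is a hypothetical helper used only to generate example input.

```python
from collections import defaultdict


def kmers_of(sequence, k):
    """All k-mers of a sequence, in order, with repeats kept."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn(kmers):
    """Multigraph with (k-1)-mers as nodes and one edge per k-mer.

    Each k-mer contributes an edge from its (k-1)-prefix to its (k-1)-suffix.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn(kmers_of("ATGGCGT", 3)))
print(de_bruijn(kmers_of("ATATAT", 3)))  # repeats become parallel edges
```

In the second example the repeat shows up as parallel edges between the same two nodes, which is how the multigraph keeps repeat information that an overlap graph over reads would blur.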

Bridges of Königsberg

Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has 2 or fewer vertices of odd degree.

Pros and Cons

• An Eulerian path in a graph can be found in linear time, if one exists.

• Errors in the hybridization experiments may prevent us from finding a solution.

• Can we just use reads as “virtual” hybridization data?
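A sketch of the linear-time search, using Hierholzer's algorithm on a directed multigraph. Since de Bruijn graphs are directed, the check below uses the directed analogue of the degree condition (in-degree versus out-degree) rather than odd-degree counting. The edge-list input format and the `spell` helper are assumptions made for this example.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return an Eulerian path as a list of nodes, or None if none exists.

    edges is a list of (u, v) pairs; parallel edges are allowed.
    """
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1:
        return None
    if not edges:
        return []
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    if len(path) != len(edges) + 1:
        return None  # some edges were unreachable, so the graph is not connected
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers into the sequence it spells."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT"]
print(spell(eulerian_path([(k[:-1], k[1:]) for k in kmers])))
```

Each edge is pushed and popped once, which is where the linear running time comes from. When several Eulerian paths exist, this returns just one of them, so the reconstruction is not necessarily unique.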

Graph Preprocessing

• Read errors mean up to k missing/erroneous edges. But we cannot correct this until we are done assembling!

• Greedily mutate reads to minimize size of set of k-mers.

• We also need to deal with repeats, which requires contracting certain paths to single edges…

Sequencing parameters

• Difficulty and cost of large-scale sequencing projects depend on the following parameters:

• Accuracy

• How many errors are tolerated

• Coverage

• How many times the same region is sequenced

• The two parameters are related

• More coverage usually means higher accuracy

• Accuracy is also dependent on the finishing effort

Sequence accuracy

• Highly accurate sequences are needed for the following:

• Diagnostics

• e.g., Forensics, identifying disease alleles in a patient

• Protein coding prediction

• One insertion or deletion changes the reading frame

• Lower accuracy sufficient for homology searches

• Differences in sequence are tolerated by search programs

• Level of accuracy determines cost of project

• Increasing accuracy from one error in 100 to one error in 10,000 increases costs three to fivefold

• Need to determine appropriate level of accuracy for each project

• If reference sequence already exists, then a lower level of accuracy should suffice

• Can find genes in genome, but not their position

Sequencing coverage

• Coverage is the number of times the same region is sequenced

• Ideally, one wants an equal number of sequences in each direction

• To obtain accuracy of one error in 10,000 bases, one needs the following:

• 10x coverage

• Stringent finishing

• Complete sequence

• Base-perfect sequencing
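As a rough worked example of the coverage definition above, average coverage can be computed as total sequenced bases divided by genome size. The numbers plugged in below reuse figures quoted elsewhere in these slides (a ~120 Mbp genome, ~500 bp reads, 175,000 reads per day) purely for illustration; combining them this way is an assumption, not a claim about any actual project.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


n = reads_needed(10, 120_000_000, 500)
print(n, "reads for 10x coverage")
print(round(n / 175_000, 1), "days at 175,000 reads per day")
```

This is an average only: as noted earlier, read placement is not truly random, so some regions will be covered far less than the mean suggests.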

NCBI Genome Summary

• NCBI