# Algorithms for Alignment of Genomic Sequences

Algorithms for Alignment of Genomic Sequences. Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004. Conservation Implies Function. Gene. Exon. CNS: Other Conserved. Edit Distance Model (1).

### Algorithms for Alignment of Genomic Sequences

Michael Brudno

Department of Computer Science

Stanford University

PGA Workshop 07/16/2004

Conservation Implies Function

(Figure labels: Gene, Exon, CNS: Other Conserved.)

Edit Distance Model: the weighted sum of insertions, deletions, and mutations needed to transform one string into another

AGGCACA--CA

|  ||||  ||

A--CACATTCA

or

AGGCACACA

|  ||  ||

ACACATTCA
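The edit distance model can be sketched as a small dynamic program. This is a minimal illustration, not code from the talk: the function name `weighted_edit_distance` and the default unit costs are assumptions chosen for the example.

```python
def weighted_edit_distance(x, y, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of edits that turn string x into string y.

    The cost parameters are illustrative defaults, not values from the slides.
    """
    n, m = len(x), len(y)
    # D[i][j] = cheapest way to turn x[:i] into y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * del_cost          # delete every character of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j * ins_cost          # insert every character of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,     # delete x[i-1]
                          D[i][j - 1] + ins_cost,     # insert y[j-1]
                          D[i - 1][j - 1] + change)   # match or mutate
    return D[n][m]
```

Calling `weighted_edit_distance("AGGCACACA", "ACACATTCA")` scores the pair shown above; raising `sub_cost` relative to the gap costs makes the gapped alignment preferable to the ungapped one.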

Given: sequences x, y

Define: F(i,j) = score of the best alignment of x_1…x_i to y_1…y_j

Recurrence: F(i,j) = max( F(i-1,j) - GAP_PENALTY, F(i,j-1) - GAP_PENALTY, F(i-1,j-1) + SCORE(x_i, y_j) )
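The recurrence translates almost line for line into code. The sketch below assumes a simple scoring scheme (`match`, `mismatch`, and a linear `gap_penalty`), none of which are specified on the slide, and returns only the score F(n,m) rather than the alignment itself.

```python
def global_alignment_score(x, y, gap_penalty=1, match=1, mismatch=-1):
    """Score of the best global alignment of x and y (F(n, m) in the recurrence).

    The scoring values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a prefix against nothing costs one gap per character.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - gap_penalty,    # gap in y
                          F[i][j - 1] - gap_penalty,    # gap in x
                          F[i - 1][j - 1] + score)      # align x_i with y_j
    return F[n][m]
```

Filling the full table takes time and memory proportional to the product of the two sequence lengths, which is what motivates the heuristics discussed later in the talk.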

• F(i,j) = Score of best alignment ending at i,j

• Time O(n^2) for two seqs, O(n^k) for k seqs

(Figure: dynamic programming matrix for the sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC; cell F(i,j) depends on F(i-1,j-1), F(i,j-1), and F(i-1,j).)

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)

- Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

### Local Alignment

• F(i,j) = max (F(i,j), 0)

• Return all paths with a position i,j where F(i,j) > C

• Time O(n^2) for two seqs, O(n^k) for k seqs
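A minimal sketch of the local variant follows. It clamps each cell at zero and reports the cells whose score exceeds a cutoff `C`; the scoring values and the function name `local_alignment_hits` are assumptions for illustration, and the traceback that would recover the actual paths is omitted.

```python
def local_alignment_hits(x, y, C, gap_penalty=1, match=1, mismatch=-1):
    """Return (best score, list of cells (i, j) with F(i, j) > C).

    Indices are 1-based positions in x and y, as in the recurrence.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, hits = 0, []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                            # restart the alignment here
                          F[i - 1][j] - gap_penalty,
                          F[i][j - 1] - gap_penalty,
                          F[i - 1][j - 1] + score)
            best = max(best, F[i][j])
            if F[i][j] > C:
                hits.append((i, j))
    return best, hits
```

For example, `local_alignment_hits("TTTACGTTT", "GGACGGG", C=2)` finds the shared `ACG` and reports the cell where that match ends.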

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

BLAST

FASTA

• Find short matching words (seeds)

• Chain them

• Rescore chain
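The first step, finding seeds, can be sketched with a hash index of exact k-mers. The slides do not say which data structure or seed type is used, so the exact-match index and the name `find_seeds` are assumptions made for this example.

```python
from collections import defaultdict

def find_seeds(seq1, seq2, k):
    """Return all (i, j) with seq1[i:i+k] == seq2[j:j+k], ordered by i then j."""
    # Index every k-mer of seq2 by its start positions.
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    # Scan seq1 and look each k-mer up in the index.
    seeds = []
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds
```

Each seed is a pair of start positions; chaining and rescoring then operate on this list rather than on the full dynamic programming matrix.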

• Find seeds at current location in seq1

• Find the previous seeds that fall into the search box

• Do a range query: seeds are indexed by their diagonal

• Pick a previous seed that maximizes the score of chain

(Figure: seeds in the seq1 × seq2 plane. Labels: seed, location in seq1, distance cutoff, gap cutoff, search box, range of search.)

Time O(n log n), where n is number of seeds.
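The chaining step can be sketched as follows. This version scans backwards over recent seeds instead of doing the diagonal-indexed range query, so it is simpler but does not achieve the O(n log n) bound. The scoring (seed length minus diagonal shift) and all names are assumptions for illustration.

```python
def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Chain k-mer seeds given as (pos_in_seq1, pos_in_seq2) start positions.

    Returns (best chain score, seeds on that chain in order).
    """
    seeds = sorted(set(seeds))
    if not seeds:
        return 0, []
    score = [k] * len(seeds)        # a chain of one seed scores k matching bp
    back = [None] * len(seeds)
    for i, (x, y) in enumerate(seeds):
        for j in range(i - 1, -1, -1):
            px, py = seeds[j]
            dx, dy = x - px, y - py
            if dx > distance_cutoff:
                break               # sorted by seq1 position: the rest are farther
            if dx < k or dy < k:
                continue            # overlapping or out of order
            gap = abs(dx - dy)      # shift between the two seeds' diagonals
            if gap > gap_cutoff:
                continue            # outside the search box
            candidate = score[j] + k - gap
            if candidate > score[i]:
                score[i], back[i] = candidate, j
    end = max(range(len(seeds)), key=score.__getitem__)
    best = score[end]
    chain = []
    while end is not None:
        chain.append(seeds[end])
        end = back[end]
    return best, chain[::-1]
```

Keeping the active seeds in a structure ordered by diagonal would replace the inner scan with a range query, which is where the logarithmic factor in the stated running time comes from.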

• Initial score = # matching bp - gaps

• Rapid rescoring: extend all seeds to find optimal location for gaps

### Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

• Find Local Alignments

• Chain Local Alignments

• Restricted DP
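The third step can be illustrated with a deliberately simplified form of restricted dynamic programming. It assumes the chained local alignments have been reduced to anchor points `(i, j)` that the global alignment must pass through, and it runs the full recurrence only inside the rectangles between consecutive anchors. The slides do not specify the exact shape of the restricted region, so this rectangle-only version and the names below are assumptions.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=1):
    """Global alignment score of x and y using two rolling rows."""
    prev = [-gap * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-gap * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j] - gap, cur[j - 1] - gap, prev[j - 1] + s))
        prev = cur
    return prev[-1]

def anchored_score(x, y, anchors, **scoring):
    """Sum of global alignment scores over the segments between anchors.

    Each anchor (i, j) means: x[:i] is aligned to y[:j] at this point.
    """
    points = [(0, 0)] + sorted(anchors) + [(len(x), len(y))]
    total = 0
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        if i1 < i0 or j1 < j0:
            raise ValueError("anchors must be increasing in both sequences")
        total += nw_score(x[i0:i1], y[j0:j1], **scoring)
    return total
```

With well-placed anchors the restricted score equals the unrestricted one while touching far fewer cells; a badly placed anchor can only lower the score, which is why anchor quality matters.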

Given N sequences and a phylogenetic tree, align pairwise, in order of the tree (LAGAN).

(Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat.)

To anchor the (X/Y) and (Z) alignments:

(Figure panels: X vs. Z, Y vs. Z, X/Y vs. Z.)
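The order of pairwise alignments implied by a tree can be sketched with a post-order traversal. The nested-tuple tree encoding, the function name `alignment_order`, and the particular tree topology used in the usage note are assumptions for illustration; the alignment step itself is not shown.

```python
def alignment_order(tree):
    """List the pairwise alignment steps for a binary tree of sequence names.

    A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    Each step is a pair of labels; an aligned group is labeled "A/B".
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node
        left, right = node
        a = visit(left)       # align everything under the left child first
        b = visit(right)      # then everything under the right child
        steps.append((a, b))  # then align the two resulting groups
        return f"{a}/{b}"

    visit(tree)
    return steps
```

For a hypothetical tree `(("Human", "Baboon"), ("Mouse", "Rat"))`, the steps are Human with Baboon, Mouse with Rat, and finally the two resulting alignments with each other.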

• Human sequence length: 1.8 Mb

• Total genomic sequence: 13 Mb

• Species: Chicken, Zebrafish, Cow, Pig, Chimp, Human, Dog, Rat, Fugufish, Cat, Baboon, Mouse

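| Method | Data set | % Exons Aligned | Time (sec) | Max Memory (Mb) |
|---|---|---|---|---|
| LAGAN | Mammals | 99.7% | 550 | 90 |
| LAGAN | Chicken & Fishes | 96% | 862 | 90 |
| MLAGAN | Mammals | 99.8% | 4547 | 670 |
| MLAGAN | Chicken & Fishes | 98% | | |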

Automatic computational system for comparative analysis of pairs of genomes

http://pipeline.lbl.gov

Alignments (all pair combinations):

• Human Genome (Golden Path Assembly)

• Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)

• Rat assemblies: January 2003, February 2003

D. melanogaster vs. D. pseudoobscura, February 2003

Tandem Local/Global Approach

• Finding a likely mapping for a contig (BLAT)

Progressive Alignment Scheme

(Flowchart with yes/no decision branches. Box labels: Human, Mouse and Rat genomes; Pairwise M/R mapping; Aligned M&R fragments; Unaligned M&R sequences; Mapping aligned fragments by union of M&R local BLAT hits on the human genome; Map to Human Genome; M/H and R/H pairwise alignment; Unassigned M&R DNA fragments; H/M/R MLAGAN alignment; M/R pairwise alignment.)

Computational Time

PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.

• Pair-wise rat/mouse: 4 hours

• Pair-wise rat/human and mouse/human: 2 hours

• Multiple human/mouse/rat: 9 hours

Total wall time: ~ 15 hours

Evolution Over a Chromosome

Evolution at the DNA level

Sequence edits: deletion, mutation

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

Rearrangements: inversion, translocation, duplication

Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Global

Local

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Glocal Alignment Problem

Find the least-cost transformation of one sequence into another using new operations:

• Sequence edits

• Inversions

• Translocations

• Duplications

• Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Shuffle-LAGAN

A glocal aligner for long DNA sequences

S-LAGAN: Find Local Alignments

• Find Local Alignments

• Build Rough Homology Map

• Globally Align Consistent Parts

S-LAGAN: Build Homology Map

Building the Homology Map

Chain (using Eppstein-Galil); each alignment gets a score which is the maximum over 4 possible chains. Penalties are affine (event and distance components).

• Penalties:

a) regular

b) translocation

c) inversion

d) inverted translocation

(Figure: the four chaining cases, labeled a, b, c, d.)
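The chaining idea can be sketched with a quadratic dynamic program that lets a chain continue across strand changes and out-of-order hits at a price. This is not the Eppstein-Galil algorithm named above, and the rules used here to classify two consecutive alignments into the four cases, the penalty values, and every name in the code are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    s1: int      # start in sequence 1
    e1: int      # end in sequence 1 (exclusive)
    s2: int      # start in sequence 2
    e2: int      # end in sequence 2 (exclusive)
    strand: int  # +1 or -1
    score: int

# Hypothetical event penalties; the slides give no values.
EVENT_PENALTY = {"regular": 0, "translocation": 20,
                 "inversion": 15, "inverted translocation": 30}
DISTANCE_WEIGHT = 1  # hypothetical cost per base of separation

def relation(prev, cur):
    """Assumed classification of how cur follows prev."""
    same_strand = prev.strand == cur.strand
    in_order = prev.e2 <= cur.s2          # cur lies after prev in sequence 2
    if same_strand:
        return "regular" if in_order else "translocation"
    return "inversion" if in_order else "inverted translocation"

def join_penalty(prev, cur):
    """Affine penalty: an event component plus a distance component."""
    distance = (cur.s1 - prev.e1) + abs(cur.s2 - prev.e2)
    return EVENT_PENALTY[relation(prev, cur)] + DISTANCE_WEIGHT * distance

def glocal_chain(hits):
    """Best chain that is ordered in sequence 1 but free to jump in sequence 2."""
    hits = sorted(hits, key=lambda h: h.s1)
    if not hits:
        return 0, []
    best = [h.score for h in hits]
    back = [None] * len(hits)
    for i, cur in enumerate(hits):
        for j in range(i):
            prev = hits[j]
            if prev.e1 > cur.s1:
                continue                  # hits may not overlap in sequence 1
            candidate = best[j] + cur.score - join_penalty(prev, cur)
            if candidate > best[i]:
                best[i], back[i] = candidate, j
    end = max(range(len(hits)), key=best.__getitem__)
    top, chain = best[end], []
    while end is not None:
        chain.append(hits[end])
        end = back[end]
    return top, chain[::-1]
```

Because only the order in sequence 1 is enforced, the resulting chain can contain inverted or translocated pieces, which is what distinguishes a glocal map from a purely global chain.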

S-LAGAN: Global Alignment

S-LAGAN Results (CFTR)

(Figure panels: Local, Glocal; Hum/Mus, Hum/Rat.)

S-LAGAN results (HOX)

• 12 paralogous genes

• Conserved order in mammals

S-LAGAN Results (Chr 20)

• Human Chr 20 v. homologous Mouse Chr 2.

• 270 Segments of conserved synteny

• 70 Inversions

S-LAGAN Results (Whole Genome)

• Used Berkeley Genome Pipeline

• % Human genome aligned with mouse sequence

• Evaluation criteria from Waterston et al. (Nature 2002)

Rearrangements in Human v. Mouse

• Preliminary conclusions:

• Rearrangements come in all sizes

• Duplications are less well conserved than other rearranged regions

• Simple inversions tend to be most common and most conserved

What is next? (Shuffle)

• Better algorithm and scoring

• Whole genome synteny mapping

• Multiple Glocal Alignment(!?)

Biological Story

• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development

Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001

mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001

rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938

fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001

mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001

rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938

fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174

Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC

Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC

Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Acknowledgments

Berkeley:

Inna Dubchak

Alexander Poliakov

Göttingen:

Burkhard Morgenstern

Rat Genome Sequencing Consortium

Stanford:

Serafim Batzoglou

Arend Sidow

Matt Scott

Gregory Cooper

Chuong (Tom) Do

Sanket Malde

Kerrin Small

Mukund Sundararajan

http://lagan.stanford.edu/