Algorithms for alignment of genomic sequences

Michael Brudno, Department of Computer Science, Stanford University. PGA Workshop, 07/16/2004.


Algorithms for Alignment of Genomic Sequences

Michael Brudno

Department of Computer Science

Stanford University

PGA Workshop 07/16/2004


Conservation Implies Function

[Figure labels: Gene; Exon; CNS: Other Conserved]


Edit Distance Model (1)

Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA
|  ||||  ||
A--CACATTCA

or

AGGCACACA
|  ||  ||
ACACATTCA


Edit Distance Model (2)

Given: x, y

Define: F(i,j) = Score of best alignment of x_1…x_i to y_1…y_j

Recurrence: F(i,j) = max ( F(i-1,j) - GAP_PENALTY,
                           F(i,j-1) - GAP_PENALTY,
                           F(i-1,j-1) + SCORE(x_i, y_j) )
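The recurrence can be tried out with a minimal Python sketch. The scoring values (match = 1, mismatch = -1, gap penalty = 2) and the function name are assumptions chosen for illustration; the slides do not give concrete numbers.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Fill F(i, j) with the recurrence and return F(len(x), len(y)).

    The match, mismatch and gap_penalty values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against nothing costs one gap per character.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j] - gap_penalty,    # gap in y
                          F[i][j - 1] - gap_penalty,    # gap in x
                          F[i - 1][j - 1] + score)      # x_i aligned to y_j
    return F[n][m]


if __name__ == "__main__":
    print(global_alignment_score("AGGCACACA", "ACACATTCA"))
```

The table has (n+1)(m+1) cells and each cell takes constant work, which is where the quadratic running time for two sequences comes from.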


Edit Distance Model (3)

  • F(i,j) = Score of best alignment ending at i,j

  • Time O(n^2) for two seqs, O(n^k) for k seqs

[Figure: dynamic programming matrix over the two sequences below; cell F(i,j) is shown next to its neighbors F(i-1,j-1), F(i,j-1) and F(i-1,j)]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC


Overview

  • Local Alignment (CHAOS)

  • Multiple Global Alignment (LAGAN)

    - Whole Genome Alignment

  • Glocal Alignment (Shuffle-LAGAN)

  • Biological Story


Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

  • F(i,j) = max (F(i,j), 0)

    • Return all paths with a position i,j where F(i,j) > C

  • Time O(n^2) for two seqs, O(n^k) for k seqs
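A minimal Python sketch of the local variant follows. It clamps every cell at zero and reports the cells whose score exceeds the cutoff C; tracing the actual paths back from those cells is left out. The scoring values and the function name are assumptions for illustration.

```python
def local_alignment_cells(x, y, cutoff, match=1, mismatch=-1, gap_penalty=2):
    """Return (i, j, score) for every cell with F(i, j) > cutoff.

    Same recurrence as the global case, except that each cell is clamped at 0
    so an alignment may start anywhere. Scoring values are assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    cells = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j] - gap_penalty,
                          F[i][j - 1] - gap_penalty,
                          F[i - 1][j - 1] + score)
            if F[i][j] > cutoff:
                cells.append((i, j, F[i][j]))
    return cells


if __name__ == "__main__":
    # The shared word ACG ends at position 5 in both toy sequences.
    print(local_alignment_cells("TTACGTT", "GGACGGG", cutoff=2))
```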


Heuristic Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

[Figure: the sequence pair is shown twice, with the labels BLAST and FASTA]


CHAOS: CHAins Of Seeds

  • Find short matching words (seeds)

  • Chain them

  • Rescore chain
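The first step, finding short matching words, can be sketched in Python as below. This sketch assumes exact k-letter matches and a plain hash index; the word length k = 4 and the function name are illustrative choices, not details taken from the slides.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k=4):
    """Return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k].

    Assumption: seeds are exact k-letter matches found through a hash index
    of seq2; k is an illustrative value.
    """
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    seeds = []
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TTACGT", k=4))
```

Seeds come out ordered by their position in seq1, which is the order in which the chaining step consumes them.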


CHAOS: Chaining the Seeds

  • Find seeds at current location in seq1

  • Find the previous seeds that fall into the search box

  • Do a range query: seeds are indexed by their diagonal.

  • Pick a previous seed that maximizes the score of chain

[Figure: seq1 plotted against seq2 with a seed at the current location in seq1; the search box is bounded by the distance cutoff and the gap cutoff, and the range of search is marked]

Time O(n log n), where n is number of seeds.
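The chaining steps can be sketched in Python as follows. This is a sketch under stated assumptions, not the CHAOS implementation: seeds are exact k-letter matches given as (i, j) start positions, the distance cutoff is measured between seed starts in seq1, the gap cutoff bounds the difference between diagonals (i - j), and a chain scores k per seed minus the diagonal shift at each join. Recent seeds are kept in a sorted Python list, so the range query uses bisect but insertion and removal are linear; the O(n log n) bound on the slide needs a structure with logarithmic updates.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Return the best chain score over all seeds.

    seeds: (i, j) start positions of k-long matches (hypothetical input format).
    """
    active = []   # recent seeds as (diagonal, i, j, chain_score), sorted by diagonal
    queue = []    # the same entries in arrival order, used to expire old seeds
    head = 0
    best = 0
    for i, j in sorted(seeds):
        # Drop seeds that lie further back in seq1 than the distance cutoff.
        while head < len(queue) and queue[head][1] < i - distance_cutoff:
            active.remove(queue[head])
            head += 1
        diag = i - j
        # Range query: previous seeds whose diagonal is within the gap cutoff.
        lo = bisect.bisect_left(active, (diag - gap_cutoff,))
        hi = bisect.bisect_right(active, (diag + gap_cutoff, float("inf")))
        score = k  # a chain consisting of this seed alone
        for d, pi, pj, ps in active[lo:hi]:
            # The predecessor must end before this seed starts in both sequences.
            if pi + k <= i and pj + k <= j:
                score = max(score, ps + k - abs(diag - d))
        entry = (diag, i, j, score)
        bisect.insort(active, entry)
        queue.append(entry)
        best = max(best, score)
    return best


if __name__ == "__main__":
    seeds = [(0, 0), (5, 5), (10, 12), (100, 3)]
    print(chain_seeds(seeds, k=4, distance_cutoff=20, gap_cutoff=3))
```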


CHAOS Scoring

  • Initial score = # matching bp - gaps

  • Rapid rescoring: extend all seeds to find optimal location for gaps


Overview

  • Local Alignment (CHAOS)

  • Multiple Global Alignment (LAGAN)

    - Whole Genome Alignment

  • Glocal Alignment (Shuffle-LAGAN)

  • Biological Story


Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure labels: x, y, z]


LAGAN: 1. FIND Local Alignments

  • Find Local Alignments

  • Chain Local Alignments

  • Restricted DP


LAGAN: 2. CHAIN Local Alignments

  • Find Local Alignments

  • Chain Local Alignments

  • Restricted DP


LAGAN: 3. Restricted DP

  • Find Local Alignments

  • Chain Local Alignments

  • Restricted DP
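The middle step, chaining local alignments, can be illustrated with a small Python sketch. It picks the highest-scoring set of local alignments that appear in the same order in both sequences, using a simple quadratic dynamic program. The tuple layout (start1, end1, start2, end2, score) with exclusive ends is a hypothetical input format, and the quadratic loop is chosen for clarity rather than taken from the slides.

```python
def chain_anchors(anchors):
    """Return the highest-scoring chain of non-overlapping, co-linear anchors.

    anchors: (start1, end1, start2, end2, score) tuples, ends exclusive
    (hypothetical format). Runs in O(n^2) for n anchors.
    """
    if not anchors:
        return []
    anchors = sorted(anchors)
    best = [a[4] for a in anchors]   # best chain score ending at each anchor
    prev = [None] * len(anchors)     # back-pointers for recovering the chain
    for b, (s1, _e1, s2, _e2, score) in enumerate(anchors):
        for a in range(b):
            ends_before = anchors[a][1] <= s1 and anchors[a][3] <= s2
            if ends_before and best[a] + score > best[b]:
                best[b] = best[a] + score
                prev[b] = a
    b = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while b is not None:
        chain.append(anchors[b])
        b = prev[b]
    return chain[::-1]


if __name__ == "__main__":
    anchors = [(0, 10, 0, 10, 8), (12, 20, 50, 60, 5),
               (15, 30, 12, 28, 9), (35, 40, 30, 36, 4)]
    print(chain_anchors(anchors))
```

The chained anchors then bound the region of the dynamic programming matrix that the restricted DP step needs to fill.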


MLAGAN: 1. Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
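The order of the pairwise steps can be sketched in Python with a post-order walk of the tree. The nested-tuple tree encoding and the "A/B" naming of intermediate alignments are assumptions for illustration; the pairwise alignment itself is not performed here.

```python
def progressive_order(tree):
    """List the pairwise alignment steps implied by a binary guide tree.

    tree: a leaf name (str) or a 2-tuple of subtrees (hypothetical encoding).
    Each step is (left, right), where an inner node is named "left/right".
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return node
        left, right = (visit(child) for child in node)
        steps.append((left, right))   # children are aligned before their parent
        return f"{left}/{right}"

    visit(tree)
    return steps


if __name__ == "__main__":
    tree = (("Human", "Baboon"), ("Mouse", "Rat"))
    for left, right in progressive_order(tree):
        print("align", left, "with", right)
```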


MLAGAN: 2. Multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: three panels with axes X and Z, Y and Z, and X/Y and Z]


Cystic Fibrosis (CFTR), 12 species

  • Human sequence length: 1.8 Mb

  • Total genomic sequence: 13 Mb

Species: Human, Chimp, Baboon, Mouse, Rat, Cow, Pig, Dog, Cat, Chicken, Fugufish, Zebrafish


CFTR (cont’d)

                            % Exons Aligned   Time (sec)   Max memory (MB)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%


Automatic computational system for comparative analysis of pairs of genomes

http://pipeline.lbl.gov

Alignments (all pair combinations):

Human Genome (Golden Path Assembly)

Mouse assemblies: Arachne, Phusion (2001) MGSC v3 (2002)

Rat assemblies: January 2003, February 2003

----------------------------------------------------------

D. melanogaster vs D. pseudoobscura, February 2003


Tandem Local/Global Approach

  • Finding a likely mapping for a contig (BLAT)


Progressive Alignment Scheme

[Flowchart for the Human, Mouse and Rat genomes. Box labels: Pairwise M/R mapping; Aligned M&R fragments; Unaligned M&R sequences; Mapping aligned fragments by union of M&R local BLAT hits on the human genome; Map to Human Genome; M/H and R/H pairwise alignment; Unassigned M&R DNA fragments; H/M/R MLAGAN alignment; M/R pairwise alignment. Branches are labeled yes/no.]


Computational Time

PC cluster of 23 dual-processor 2.2 GHz Intel Xeon nodes.

  • Pair-wise rat/mouse: 4 hours

  • Pair-wise rat/human and mouse/human: 2 hours

  • Multiple human/mouse/rat: 9 hours

Total wall time: ~15 hours



Evolution Over a Chromosome


Overview

  • Local Alignment (CHAOS)

  • Multiple Global Alignment (LAGAN)

    - Whole Genome Alignment

  • Glocal Alignment (Shuffle-LAGAN)

  • Biological Story


Evolution at the DNA level

SEQUENCE EDITS

  • Deletion

  • Mutation

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

REARRANGEMENTS

  • Inversion

  • Translocation

  • Duplication


Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Global

Local

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


Glocal Alignment Problem

Find least cost transformation of one sequence into another using new operations

  • Sequence edits

  • Inversions

  • Translocations

  • Duplications

  • Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC


Shuffle-LAGAN

A glocal aligner for long DNA sequences


S-LAGAN: Find Local Alignments

  • Find Local Alignments

  • Build Rough Homology Map

  • Globally Align Consistent Parts


S-LAGAN: Build Homology Map

  • Find Local Alignments

  • Build Rough Homology Map

  • Globally Align Consistent Parts


Building the Homology Map

Chain (using Eppstein-Galil); each alignment gets a score which is MAX over 4 possible chains.

Penalties are affine (event and distance components)

  • Penalties:

    a) regular

    b) translocation

    c) inversion

    d) inverted translocation

[Figure: chained local alignments, with the joins labeled a, b, c, d]
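The affine penalty and the maximum over candidate chains can be written out in a small Python sketch. All numeric costs here are hypothetical placeholders, as are the function names; the slides name the four event types and the two penalty components but give no values.

```python
# Hypothetical event costs: the slides give no numeric values.
EVENT_COST = {
    "regular": 0.0,
    "translocation": 50.0,
    "inversion": 50.0,
    "inverted translocation": 75.0,
}


def join_penalty(kind, distance, per_base=0.01):
    """Affine penalty: a fixed cost for the event plus a cost that grows with distance."""
    return EVENT_COST[kind] + per_base * distance


def best_chain_score(own_score, predecessors):
    """Score an alignment as the maximum over its candidate chains.

    predecessors: (chain_score, kind, distance) for each candidate join.
    Starting a new chain with this alignment alone is always allowed.
    """
    candidates = [own_score]
    for chain_score, kind, distance in predecessors:
        candidates.append(chain_score + own_score - join_penalty(kind, distance))
    return max(candidates)


if __name__ == "__main__":
    print(best_chain_score(100, [(200, "regular", 1000), (400, "inversion", 2000)]))
```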


S-LAGAN: Global Alignment

  • Find Local Alignments

  • Build Rough Homology Map

  • Globally Align Consistent Parts


S-LAGAN Results (CFTR)

Local

Glocal


S-LAGAN Results (CFTR)

Hum/Mus

Hum/Rat



S-LAGAN Results (HOX)

  • 12 paralogous genes

  • Conserved order in mammals


S-LAGAN Results (Chr 20)

  • Human Chr 20 v. homologous Mouse Chr 2.

  • 270 Segments of conserved synteny

  • 70 Inversions


S-LAGAN Results (Whole Genome)

  • Used Berkeley Genome Pipeline

  • % Human genome aligned with mouse sequence

  • Evaluation criteria from Waterston, et al (Nature 2002)


Rearrangements in Human v. Mouse

  • Preliminary conclusions:

    • Rearrangements come in all sizes

    • Duplications are less well conserved than other rearranged regions

    • Simple inversions tend to be most common and most conserved


What is next? (Shuffle)

  • Better algorithm and scoring

  • Whole genome synteny mapping

  • Multiple Glocal Alignment(!?)


Overview

  • Local Alignment (CHAOS)

  • Multiple Global Alignment (LAGAN)

    - Whole Genome Alignment

  • Glocal Alignment (Shuffle-LAGAN)

  • Biological Story


Biological Story

  • Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development



Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001

mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001

rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938

fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001

mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001

rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938

fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174
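One way to read such a block is to mark the columns where all four species carry the same base. The Python sketch below does this for the second block of rows above; the function name and the rule that gap columns never count as conserved are assumptions for illustration.

```python
def conserved_columns(rows):
    """Return 0-based indices of columns where every row has the same base.

    Assumption: a column containing a gap character '-' is never conserved.
    """
    return [c for c, column in enumerate(zip(*rows))
            if column[0] != "-" and len(set(column)) == 1]


if __name__ == "__main__":
    rows = [
        "CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC",  # hum_a
        "CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC",  # mus_a
        "CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC",  # rat_a
        "CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC",  # fug_a
    ]
    columns = conserved_columns(rows)
    print(len(columns), "of", len(rows[0]), "columns are identical in all four rows")
```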


Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC

Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA


Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG

GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC

Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA


Overview

  • Local Alignment (CHAOS)

  • Multiple Global Alignment (LAGAN)

    - Whole Genome Alignment

  • Glocal Alignment (Shuffle-LAGAN)

  • Biological Story


Acknowledgments

Berkeley:

Inna Dubchak

Alexander Poliakov

Göttingen:

Burkhard Morgenstern

Rat Genome Sequencing Consortium

Stanford:

Serafim Batzoglou

Arend Sidow

Matt Scott

Gregory Cooper

Chuong (Tom) Do

Sanket Malde

Kerrin Small

Mukund Sundararajan

http://lagan.stanford.edu/

