
Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders

Weiwei Zhong

Topics
  • Background
  • Algorithm Design
  • Test Results
Background

Definitions

What is a Sequence Alignment?
  • Given
    • 2 or more sequences
    • a scoring scheme
      • match score
      • mismatch score
      • gap penalty
  • Insert gaps in each sequence, so that
    • all sequences have the same length
    • the total pairing score is maximized

Scoring Matrix

Simplified Scoring

  • match = 2
  • mismatch = -1
  • gap penalty = -2

In Practice

Scoring matrix


Global vs. Local Alignments

Global: entire lengths of sequences

F G K - G K G

F G K F G K G

Local: regions of sequences

- - - F G K G K G

F G K F G K G - -
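
As a small illustration, the simplified scoring scheme (match = 2, mismatch = -1, gap penalty = -2) can be applied to the global alignment shown above. The function name `score_alignment` is hypothetical, and treating every column that contains a gap as one gap penalty is an assumption of this sketch.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Score two gapped rows of equal length, column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # a column containing a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Global alignment from the slide: six matches and one gap.
print(score_alignment("FGK-GKG", "FGKFGKG"))
```

Six matching columns and one gap column give 6 * 2 - 2 = 10 under this scheme.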

Pairwise Alignment vs. Multiple Sequence Alignment (MSA)

Pairwise: 2 sequences

F G K - G K G

F G K F G K G

MSA: more than 2 sequences

F G K - G K G

F G K F G K G

- G K Q G K G

- - K F G K G

Background

Basic Dynamic Programming

Dynamic Programming Algorithm for Pairwise Alignments

  • Two sequences
    • GAATTC
    • GGATC
  • Scoring scheme
    • match = 2
    • mismatch = -1
    • gap penalty = -2

1. Initialization

[Figure: empty score matrix with GAATTC across the top and GGATC down the side]
2. Table fill

$M_{i,j} = \max \{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\}$

where $c_i$ and $c_j$ are the two residues compared at cell $(i, j)$.

        G    A    A    T    T    C
   G    2    0   -1   -1   -1   -1
   G    2    1   -1   -2   -2   -2
   A    0    4    3    1   -1   -3
   T   -1    2    3    5    3    1
   C   -1    0    1    3    4    5

  • Scoring scheme
    • match = 2
    • mismatch = -1
    • gap g = -2
3. Trace back

             G    A    A    T    T    C
        0    0    0    0    0    0    0
   G    0    2    0   -1   -1   -1   -1
   G    0    2    1   -1   -2   -2   -2
   A    0    0    4    3    1   -1   -3
   T    0   -1    2    3    5    3    1
   C    0   -1    0    1    3    4    5

Resulting alignment:

G A A T T C
|   |   | |
G G A - T C
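
The three steps (initialization, table fill, trace back) can be sketched in a few lines of Python. This is a minimal sketch, not the presenter's code: the name `align` is hypothetical, the first row and column are set to 0 because that is what the slide's matrix shows, and ties in the trace back are broken in favour of the diagonal move, which is an assumption.

```python
def align(a, b, match=2, mismatch=-1, gap=-2):
    """Fill the score matrix for a (columns) and b (rows), then trace back."""
    rows, cols = len(b) + 1, len(a) + 1
    # 1. Initialization: first row and first column are 0, as on the slide.
    M = [[0] * cols for _ in range(rows)]
    # 2. Table fill with the max-of-three recurrence.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if b[i - 1] == a[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # align the two residues
                          M[i][j - 1] + gap,    # gap in b
                          M[i - 1][j] + gap)    # gap in a
    # 3. Trace back from the bottom-right cell to the top-left cell.
    i, j = len(b), len(a)
    top, bottom = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and b[i - 1] == a[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[j - 1]); bottom.append(b[i - 1]); i -= 1; j -= 1
        elif j > 0 and (i == 0 or M[i][j] == M[i][j - 1] + gap):
            top.append(a[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(b[i - 1]); i -= 1
    return M, "".join(reversed(top)), "".join(reversed(bottom))


M, top, bottom = align("GAATTC", "GGATC")
print(M[-1][-1])  # score in the bottom-right cell
print(top)
print(bottom)
```

For the slide's sequences this reproduces the filled matrix, the final score of 5, and the alignment GAATTC over GGA-TC. A second alignment with the same score exists (GGAT-C), which is why the tie-breaking rule matters.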


Multidimensional Dynamic Programming for MSA

  • n strings of length L each: running time is O(L^n).
  • Impractical: 5-7 proteins of 200-300 residues each.
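
To make the growth concrete, the number of table cells can be counted directly. This is only a back-of-the-envelope sketch: `dp_cells` is a hypothetical helper, and it counts cells (one extra row or column per sequence for the initial border) rather than measuring actual running time.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in a DP table with one axis per sequence."""
    return prod(n + 1 for n in lengths)


print(dp_cells([6, 5]))        # the pairwise example: GAATTC vs GGATC
for n in (2, 3, 5, 7):
    print(n, dp_cells([250] * n))  # n proteins of 250 residues each
```

Each added sequence multiplies the table size by roughly L, which is the O(L^n) behaviour stated above.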
Algorithm Design

An MSA Heuristic

Feng-Doolittle Progressive Alignment

  • 1. Align 2 of the sequences, Si and Sj
  • 2. Align a 3rd sequence Sk to the alignment of Si and Sj
  • 3. Repeat step 2 until all sequences are aligned

[Figure: column score example, with residues T and A in one column (cj) scored against residue S (ci)]

S(ci, cj) = (S(T, S) + S(A, S)) / 2

Running Time

O(n L^2)
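
The column score above is an average of residue-pair scores. A minimal sketch follows; `simple_score` and `column_score` are hypothetical names, the simplified match/mismatch values stand in for a real scoring matrix, and extending the average to two multi-residue columns is an assumption that goes beyond the slide's single-residue example. Gap characters inside a column are not handled.

```python
from itertools import product


def simple_score(x, y, match=2, mismatch=-1):
    """Simplified residue score; a real program would use a scoring matrix."""
    return match if x == y else mismatch


def column_score(col_a, col_b, score=simple_score):
    """Average pair score between the residues of two alignment columns."""
    pairs = list(product(col_a, col_b))
    return sum(score(x, y) for x, y in pairs) / len(pairs)


# Slide example: column (T, A) against residue S.
print(column_score("TA", "S"))   # (S(T,S) + S(A,S)) / 2
print(column_score("TA", "A"))   # one mismatch and one match
```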


Features of Feng-Doolittle Algorithm

  • Once a gap, always a gap
  • Early mistakes cannot be corrected

Alignment order is important

x: G A A G T T

y: G A - C T T

z: G A A C T G

x: G A A G T T

y: G A C - T T

z: G A A C T G

Algorithm Design

TspMsa: First Version


Traveling Salesman Problem (TSP)

  • Given
    • n nodes
    • distances for each pair of nodes
  • Find a round trip that
    • visits each node exactly once
    • has minimal total length

NP-hard (the decision version is NP-complete)

Well studied
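
For a handful of nodes the problem can be solved exactly by trying every round trip. The sketch below does that; it is only an illustration of the problem definition, since the slides do not say which TSP solver TspMsa uses. The names `tour_length` and `brute_force_tsp` and the 4 x 4 distance matrix are made up for the example.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Total length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_tsp(dist):
    """Exact TSP by enumeration; only feasible for very small n."""
    n = len(dist)
    # Fix node 0 as the start so rotations of the same tour are not repeated.
    candidates = ((0,) + rest for rest in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, dist))
    return list(best), tour_length(best, dist)


# Hypothetical symmetric distance matrix for four nodes.
dist = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(brute_force_tsp(dist))
```

Enumeration grows factorially with n, so any practical use on dozens of sequences would rely on a heuristic or a dedicated solver instead.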

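TspMsa: Algorithm Design

1. Calculate pairwise distances

2. Determine a TSP tour

3. Take the alignment order from the tour

4. Feng-Doolittle alignment

[Figure: a distance matrix for five sequences (0-4), a TSP tour through the five nodes, and the alignment order derived from it]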

Starting Point and Direction of TSP Tour

[Figure: TSP tour over the 23 sequences (0-22) of data set kinase_ref3. The tour is annotated with 46 alignment scores, apparently one per starting node and direction, ranging from 0.603 to 0.772.]
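
A closed tour does not fix a unique alignment order: it can be entered at any node and followed in either direction. The sketch below enumerates those orders; `alignment_orders` is a hypothetical helper, and reading an order as "start node, then the remaining nodes in tour sequence" is an assumption about how the first version of TspMsa uses the tour.

```python
def alignment_orders(tour):
    """Yield every order obtained by choosing a start node and a direction."""
    n = len(tour)
    for start in range(n):
        yield [tour[(start + k) % n] for k in range(n)]  # forward
        yield [tour[(start - k) % n] for k in range(n)]  # backward


orders = list(alignment_orders([0, 1, 2, 3, 4]))
print(len(orders))
print(orders[4])  # start at node 2, forward
print(orders[5])  # start at node 2, backward
```

A tour over n sequences therefore gives 2n candidate orders, which for 23 sequences is 46.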

Algorithm Design

TspMsa: Modified Design

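TspMsa: Modified Algorithm Design

1. Calculate pairwise distances

2. Determine a TSP tour

3. Align closest nodes

4. One node left? If no, continue merging; if yes, end

[Figure: worked example on a five-node tour (0-4) with edge lengths on the tour. Nodes 1 and 0 are merged first, then 2 and 4, then 3 joins the group (1, 0), and finally the two groups are combined into 3, 1, 0, 2, 4.]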
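
One way to read the flowchart is as a greedy merge along the tour: repeatedly take the shortest remaining tour edge and align the two groups at its ends. The sketch below implements that reading only. The function name, the edge lengths in the example, the rule that remaining edges keep their original lengths, and the order in which merged groups are concatenated are all assumptions, not details confirmed by the slides.

```python
def greedy_tour_merge(tour, edge_len):
    """Merge neighbouring groups on a closed tour, shortest edge first.

    edge_len[i] is the length of the tour edge between tour[i] and
    tour[(i + 1) % len(tour)].
    """
    groups = [[node] for node in tour]
    edges = list(edge_len)  # edges[i] joins groups[i] and groups[i + 1] (cyclic)
    merges = []
    while len(groups) > 1:
        i = min(range(len(edges)), key=edges.__getitem__)
        j = (i + 1) % len(groups)
        merges.append((groups[i], groups[j]))
        merged = groups[i] + groups[j]
        if j == 0:            # the edge wraps around from the last group to the first
            groups[0] = merged
            del groups[i]
        else:
            groups[i] = merged
            del groups[j]
        del edges[i]          # the merged edge disappears; the others are kept
    return groups[0], merges


# Hypothetical edge lengths for the tour 0-1-2-3-4-0.
order, merges = greedy_tour_merge([0, 1, 2, 3, 4], [5, 67, 24, 38, 15])
print(order)
for left, right in merges:
    print(left, "+", right)
```

Each entry in `merges` would correspond to one group-to-group alignment step, so the list plays the role of a guide tree.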

Modified Algorithm is Better

[Figure: alignment order for Kinase_ref3 over sequences 0-22]

Original TspMsa: 0.603 (worst) - 0.772 (best)

Modified TspMsa: 0.836

Test Results

What to Compare With?

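Existing MSA Programs

[Figure: existing MSA programs (clustalw, saga, multal, prrp, multalign, pileup, poa, hmmt) placed on two axes, "less computation time" and "better quality". Iterative methods are labeled "best quality"; progressive methods are labeled "fast".]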

CLUSTALW

1. Calculate pairwise distances

2. Derive a guide tree by the Neighbor Joining method

  • choose the 2 closest nodes i and j, derive an internal node x
  • with n nodes remaining and m any other node:

$r_i = \frac{\sum_k d_{ik}}{n - 2}$

$d_{ix} = \frac{d_{ij} + r_i - r_j}{2}$

$d_{jx} = d_{ij} - d_{ix}$

$d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}$

  • repeat until one node is left at the center

[Figure: nine nodes (1-9) joined step by step into a guide tree]
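
The branch-length formulas for one joining step can be checked numerically. The sketch below applies them to a single pair (i, j); `join_pair` is a hypothetical helper, the choice of which pair to join is left to the caller, and the 4 x 4 matrix is a made-up example built from a known tree (leaves 0 and 1 hang off one internal node with branch lengths 2 and 3; leaves 2 and 3 hang off another with lengths 4 and 1; the internal nodes are 1 apart).

```python
def join_pair(D, i, j):
    """Branch lengths created when nodes i and j are joined at a new node x."""
    n = len(D)                       # requires n > 2
    r_i = sum(D[i]) / (n - 2)
    r_j = sum(D[j]) / (n - 2)
    d_ix = (D[i][j] + r_i - r_j) / 2
    d_jx = D[i][j] - d_ix
    # Distance from the new node x to every remaining node m.
    d_xm = {m: (D[i][m] + D[j][m] - D[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


D = [
    [0, 5, 7, 4],
    [5, 0, 8, 5],
    [7, 8, 0, 5],
    [4, 5, 5, 0],
]
print(join_pair(D, 0, 1))
```

On this additive example the formulas recover the branch lengths 2 and 3 that the matrix was built from, and the distances from x to leaves 2 and 3 come out as 5 and 2.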

CLUSTALW

3. Progressively align all sequences following the guide tree

  • Weighted sequences

1  p e e k s a v t a l

2  g e e k a a v l a l

3  e g e w q l v l h v

Without weights

Score = [S(t,v) + S(l,v)] / 2

With weights

Score = [S(t,v)*w1*w3 + S(l,v)*w2*w3] / 2

  • 2 gap penalty values: opening, extension
  • Dynamically changes the gap penalty and the scoring matrix
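
The weighted score differs from the unweighted one only in the per-sequence factors. A minimal sketch, assuming the simplified match/mismatch scores in place of a real scoring matrix and made-up weights; `weighted_score` and `simple_score` are hypothetical names, and how CLUSTALW derives the weights is not covered here.

```python
def simple_score(x, y, match=2, mismatch=-1):
    """Simplified residue score standing in for a scoring matrix."""
    return match if x == y else mismatch


def weighted_score(group_residues, group_weights, residue, residue_weight,
                   score=simple_score):
    """Average of pair scores, each scaled by the two sequence weights."""
    total = sum(score(x, residue) * w * residue_weight
                for x, w in zip(group_residues, group_weights))
    return total / len(group_residues)


# Column (t, l) from sequences 1 and 2 against residue v from sequence 3.
print(weighted_score("tl", [1.0, 1.0], "v", 1.0))   # all weights 1: unweighted case
print(weighted_score("tl", [0.5, 1.0], "v", 1.0))   # hypothetical w1 = 0.5
```

With all weights equal to 1 the function reduces to the unweighted formula.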
slide29

T

N

K

E

POA

1. Convert sequences to partial order graphs

E T N K

E T - - P K M I V R

E T T H – K M L V R

P

I

M

V

R

T

K

E

T

H

L
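
One simple way to picture the conversion is to turn each aligned residue into a node and link consecutive residues of the same sequence with a directed edge; residues that are identical in the same column share a node. The sketch below does this for the two aligned sequences above. It is an illustration of the idea only, not POA's actual data structure, and `alignment_to_graph` is a hypothetical name.

```python
def alignment_to_graph(rows):
    """Build a directed graph from gapped alignment rows.

    Nodes are (column, residue) pairs; each row contributes a path through
    its non-gap residues.
    """
    nodes, edges = set(), set()
    for row in rows:
        prev = None
        for col, residue in enumerate(row):
            if residue == "-":
                continue                  # gaps create no node
            node = (col, residue)
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))   # consecutive residues in this row
            prev = node
    return nodes, edges


nodes, edges = alignment_to_graph(["ET--PKMIVR", "ETTH-KMLVR"])
print(len(nodes), len(edges))
for edge in sorted(edges):
    print(edge)
```

The two paths share the nodes for E, T, K, M, V and R and split around the T-H / P region and at the I / L column.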

POA

2. Align 2 sequences

3. Align one sequence to the current group

4. Repeat 3 until all sequences are aligned

[Figure: the sequence E T N K being aligned to the partial order graph of the current group]

Test Results

Quality Evaluation


BAliBASE Benchmark

  • Reference 1: equidistant sequences with various levels of similarity.
      • < 25% sequence identity
      • 20-40% sequence identity
      • > 35% sequence identity
  • Reference 2: closely related sequences with a highly divergent “orphan” sequence.
  • Reference 3: subgroups with <25% identity between groups.
  • Reference 4: sequences with N/C-terminal extensions.
  • Reference 5: sequences with internal insertions.
Reference 1 Sequences with < 25% Identity

[Charts: all test scores and average score, for short, medium, and long sequences]

Reference 1 Sequences with 20-40% Identity

[Charts: all test scores and average score, for short, medium, and long sequences]

Reference 1 Sequences with > 35% Identity

[Charts: all test scores and average score, for short, medium, and long sequences]

Reference 2

[Charts: all test scores and average score, for short, medium, and long sequences]

Reference 3

[Charts: all test scores and average score, for short, medium, and long sequences]

Reference 4 and Reference 5

[Charts: all test scores and average score, for Reference 4 and Reference 5]


Alignment Quality Comparison

TspMsa and POA: TspMsa better

TspMsa and CLUSTALW: comparable

Reference 1:

<25% identity: Similar *

20-40% identity: Similar *

> 35% identity: Similar

Reference 2: Similar *

Reference 3: TspMsa better

Reference 4: CLUSTALW better

Reference 5: Similar

* CLUSTALW slightly better for short sequences.

Test Results

Execution Time Evaluation


Fast Mode TspMsa

Most time consuming step:

Pairwise distance calculations

  • Slow mode:
    • full dynamic programming (accurate)
  • Fast mode:
    • a fast approximate method (heuristic)
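
The slides do not name the approximate method. As one common example of the kind of heuristic that avoids full dynamic programming, the sketch below estimates a distance from shared k-mers (short substrings of length k). This is a stand-in chosen for illustration, not the method TspMsa is stated to use; the function names and the normalization by the shorter sequence are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=2):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())          # multiset intersection
    shortest = min(sum(ca.values()), sum(cb.values()))
    if shortest == 0:
        raise ValueError("both sequences must be at least k residues long")
    return 1 - shared / shortest


print(kmer_distance("GAATTC", "GAATTC"))
print(kmer_distance("GAATTC", "GGATC"))
print(kmer_distance("AAAA", "CCCC"))
```

Counting k-mers takes time linear in the sequence lengths, compared with the quadratic table fill of a full pairwise alignment, which is the trade-off between the two modes.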

Execution Time Evaluation

CLUSTALW and TspMsa in fast mode


Conclusions

QUALITY

  • Slow mode
    • close to CLUSTALW (slow mode)
    • better than POA
  • Fast mode (not as good as slow mode)
    • comparable to CLUSTALW (fast mode)
    • better than POA

SPEED

  • Fast mode
    • faster than CLUSTALW (fast mode)
    • comparable to POA

Acknowledgement

Dr. Robert Robinson

Dr. Russell Malmberg

Dr. Eileen Kraemer

Computer Science Department