Chapter 2: Data Searches and Pairwise Alignments

Department of Computer Science and Information Engineering, National Chi Nan University

黃光璿

2004/03/08


Introduction

  • What is the difference between acctga and agcta?

a c c t g a

a g c t g a

a g c t - a


Nomenclature


2.1 Dot Plots
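
A dot plot marks every position where a residue (or window of residues) in one sequence equals one in the other, so shared regions show up as diagonals. The following is a minimal text-based sketch using the two sequences from the Introduction; the function name `dot_plot` and the `window` parameter are illustrative choices, not taken from the slides.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1) -> list[str]:
    """Return one text row per position in seq_a.

    A '*' in row i, column j means seq_a[i:i+window] == seq_b[j:j+window].
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    a, b = "acctga", "agcta"
    print("  " + b)
    for residue, row in zip(a, dot_plot(a, b)):
        print(residue + " " + row)
```

Raising `window` filters out isolated single-residue matches, which is the usual way to reduce noise in a dot plot.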


2.2 Simple Alignments

  • No gap


  • mutation (substitution): common

  • insertion and deletion: gap, indel (rare)

  • scoring scheme

    • match score

    • mismatch score
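
A gap-free alignment can be scored by summing a match or mismatch score over its columns. The sketch below assumes a match score of +1 and a mismatch score of -1; these values and the function name are illustrative, since the slides do not fix a particular scheme.

```python
def ungapped_score(s: str, t: str, match: int = 1, mismatch: int = -1) -> int:
    """Score two equal-length sequences column by column, with no gaps."""
    if len(s) != len(t):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(match if a == b else mismatch for a, b in zip(s, t))


if __name__ == "__main__":
    # Five matching columns and one substitution (c/g in column 2).
    print(ungapped_score("acctga", "agctga"))
```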


2.3 Gaps


2.3.1 Gap Penalty

  • uniform gap

  • affine gap

    • origination penalty

    • length penalty
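
The two penalty schemes can be compared on a concrete alignment. In this sketch a uniform penalty charges the same amount for every gap position, while an affine penalty charges an origination penalty once per gap plus a length penalty per position. All numeric values and function names are assumptions chosen for illustration.

```python
from itertools import groupby


def uniform_gap_penalty(length: int, per_position: int = 2) -> int:
    """Every gap position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length: int, origination: int = 3, per_position: int = 1) -> int:
    """Opening a gap costs `origination`; each position adds `per_position`."""
    return origination + per_position * length if length > 0 else 0


def score_alignment(top: str, bottom: str, gap_penalty, match: int = 1, mismatch: int = 0) -> int:
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    labels = [
        "gap_top" if a == "-" else "gap_bottom" if b == "-" else "pair"
        for a, b in zip(top, bottom)
    ]
    score, pos = 0, 0
    for label, run in groupby(labels):
        n = len(list(run))  # length of this run of identical column types
        if label == "pair":
            score += sum(match if top[k] == bottom[k] else mismatch for k in range(pos, pos + n))
        else:
            score -= gap_penalty(n)  # one penalty per contiguous gap
        pos += n
    return score


if __name__ == "__main__":
    for top in ("AC--TCG", "A-C-TCG"):
        print(top, score_alignment(top, "ACAGTAG", uniform_gap_penalty),
              score_alignment(top, "ACAGTAG", affine_gap_penalty))
```

With these assumed values the affine scheme separates the one-gap and two-gap alignments more strongly than the uniform scheme does, because each new gap pays the origination penalty again.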


2.4 Scoring Matrices


  • Problems with modeling

    • Does nature really operate according to these rules?


Modeling


Define the odds ratio as
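
A standard form of this ratio, as used in Durbin et al., compares the probability of an ungapped alignment of sequences $x$ and $y$ under a match model $M$ with its probability under a random model $R$. Treat this as an assumption about the intended formula; the symbols $p_{ab}$ (joint probability of residues $a$ and $b$ being aligned) and $q_a$ (background frequency of residue $a$) are notation introduced for this sketch.

```latex
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \prod_i \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}},
\qquad
S = \sum_i s(x_i, y_i),
\quad
s(a, b) = \log \frac{p_{ab}}{q_a\, q_b}
```

Taking the logarithm turns the product into a sum, which is why scoring matrices store log-odds values that can simply be added along an alignment.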


2.4.1 PAM Matrices

  • Dayhoff, Schwartz, Orcutt (1978)

  • Point Accepted Mutation

    • Based on observed substitution rates

      • (Box. 2.1)

    • Input

      • A set of observed substitution rates

    • Output

      • PAM-1 matrix (log-odds matrix)


Multiple Alignment

(1) Group the sequences with high similarity (> 85% identity).
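
The 85% threshold in step (1) needs a definition of percent identity. The sketch below uses one common convention, counting identical residues over the columns where neither sequence has a gap; the slides do not say which convention is intended, so this is an assumption.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns in which both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    identity = percent_identity("acctga", "agctga")
    print(f"{identity:.1f}% identical; same group: {identity > 85}")
```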


Phylogenetic Tree

(2) For each group, build the corresponding phylogenetic tree.


Mutation Frequency

A->G, I->L, A->G, A->L, C->S, G->A

(3) Count the exchanges between each pair of amino acids, in either direction. For example, $F_{G,A} = 3$.
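
The count for the G/A pair can be reproduced by tallying the listed mutations without regard to direction. This is a small sketch; the function name and the use of `frozenset` keys for unordered pairs are illustrative choices.

```python
from collections import Counter


def mutation_frequencies(mutations: list[str]) -> Counter:
    """Count mutations per unordered residue pair, so A->G and G->A share a key."""
    counts = Counter()
    for mutation in mutations:
        source, destination = (part.strip() for part in mutation.split("->"))
        counts[frozenset((source, destination))] += 1
    return counts


observed = ["A->G", "I->L", "A->G", "A->L", "C->S", "G->A"]
F = mutation_frequencies(observed)

if __name__ == "__main__":
    print(F[frozenset("GA")])  # two A->G events plus one G->A event
```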


Relative Mutability

  • (4)


Mutation Probability

  • (5)


Odds Ratio

  • (6)


Log-Odds Ratio

  • (7)
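
Steps (6) and (7), and the extension from PAM-1 to PAM-k, can be sketched on a toy example. Everything here is an assumption made for illustration: the two-letter alphabet, the probabilities, the convention that `m[i][j]` is the probability of residue j being replaced by residue i, and the 10 * log10 scaling (the scaling Dayhoff's group used; other matrices use different units).

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(m, k):
    """PAM-k mutation probabilities: the PAM-1 matrix multiplied by itself k times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def log_odds(m, freqs, scale=10.0):
    """Odds ratio m[i][j] / freqs[i], then scaled base-10 logarithm."""
    n = len(m)
    return [[scale * math.log10(m[i][j] / freqs[i]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    pam1 = [[0.99, 0.01], [0.01, 0.99]]  # hypothetical mutation probabilities
    freqs = [0.5, 0.5]                   # hypothetical background frequencies
    for k in (1, 50, 250):
        scores = log_odds(mat_pow(pam1, k), freqs)
        print(k, [[round(x, 1) for x in row] for row in scores])
```

As k grows, the diagonal scores shrink and the off-diagonal scores rise toward zero, which matches the idea that higher-numbered PAM matrices model more distantly related sequences.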


  • Which PAM matrix is the most appropriate?

    • the length of the sequences

    • How closely the sequences are believed to be related.

  •  PAM 120 for database search

  •  PAM 200 for comparing two specific proteins


2.4.2 BLOSUM Matrices

  • Henikoff & Henikoff (1992)

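  • PAM-k: the larger k is, the more divergent the sequences being modeled.

  • BLOSUM-k: the larger k is, the more similar the sequences being modeled.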

  •  BLOSUM62: for ungapped matching

  •  BLOSUM50: for gapped matching


2.5 Dynamic Programming

  • The Needleman and Wunsch Algorithm (Global Alignment)


Alignment Graph


A C - - T C G

A C A G T A G
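
The global alignment above can be reproduced with a short dynamic-programming sketch in the style of Needleman and Wunsch. The scores (match +1, mismatch -1, gap -1) are assumptions for illustration, and ties in the traceback are broken arbitrarily, so the gap placement it prints may differ from the slide.

```python
def needleman_wunsch(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_s, aligned_t) for an optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] is the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    best, row_s, row_t = needleman_wunsch("ACTCG", "ACAGTAG")
    print(best)
    print(row_s)
    print(row_t)
```

Filling the table takes time and space proportional to the product of the two sequence lengths.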


Complexity


2.6 Global and Local Alignments

  • Semi-global alignment

  • Local alignment


2.6.1 Semi-global Alignments

  • A A C A C G T G T C T

  • - - - A C G T - - - -
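
A semi-global alignment does not penalize gaps at either end of the alignment. One way to sketch this, assuming match +1, mismatch -1 and gap -1, is to start the first row and column of the dynamic-programming table at zero and take the best value found in the last row or last column.

```python
def semi_global_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score when leading and trailing gaps are free."""
    m = len(t)
    prev = [0] * (m + 1)      # row 0 is all zeros: leading gaps cost nothing
    best = prev[m]
    for i in range(1, len(s) + 1):
        cur = [0] * (m + 1)   # column 0 stays zero for the same reason
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        best = max(best, cur[m])   # all of t used; the rest of s is a free trailing gap
        prev = cur
    return max(best, max(prev))    # all of s used; the rest of t is a free trailing gap


if __name__ == "__main__":
    print(semi_global_score("AACACGTGTCT", "ACGT"))
```

For the example above the four residues of ACGT all match, and the seven unaligned end positions of the longer sequence add no penalty, whereas a global alignment with the same scores would charge for each of them.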


2.6.2 Local Alignment

  • The Smith-Waterman Alignment
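
The Smith-Waterman approach differs from global alignment in two ways: no cell of the table may fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes match +1, mismatch -1 and gap -1, and uses the target and query sequences from the FASTA example in Section 2.7.2.

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    n, m = len(s), len(t)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_cell = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    top, bottom = [], []
    i, j = best_cell
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_s, row_t = smith_waterman("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
    print(score)
    print(row_s)
    print(row_t)
```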


2.7 Database Searches

  • BLAST and its relatives

  • FASTA and related algorithms


2.7.1 BLAST and Its Relatives


BLASTP

  • Using PAM or BLOSUM matrices


2.7.2 FASTA and Related Algorithms

Improves on dot plots and band search

  • Preprocess the target sequence.

    • Identify the position for each word.

      (for amino acid & word length=1, a 20-entry array)

  • Scan the query sequence.

    • Compute the shifts of query to align each word with the target.

  • Find the mode (the most frequent value) of the shifts.

  • Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.


Target: FAMLGFIKYLPGCM

Query:TGFIKYLPGACT
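
The preprocessing, scanning and mode-finding steps can be sketched for this target and query with a word length of 1. A dictionary stands in for the 20-entry array mentioned above, and a shift is defined here as target position minus query position; both are implementation choices made for this sketch.

```python
from collections import Counter, defaultdict


def index_words(target: str, k: int = 1) -> dict:
    """Preprocess the target: map each word of length k to its start positions."""
    positions = defaultdict(list)
    for i in range(len(target) - k + 1):
        positions[target[i:i + k]].append(i)
    return positions


def shift_counts(target: str, query: str, k: int = 1) -> Counter:
    """Scan the query and count the shift that aligns each word with the target."""
    positions = index_words(target, k)
    shifts = Counter()
    for q in range(len(query) - k + 1):
        for t in positions.get(query[q:q + k], []):
            shifts[t - q] += 1
    return shifts


target = "FAMLGFIKYLPGCM"
query = "TGFIKYLPGACT"
shifts = shift_counts(target, query)
best_shift, hits = shifts.most_common(1)[0]

if __name__ == "__main__":
    print(best_shift, hits)
    print(" " * best_shift + query)
    print(target)
```

The most frequent shift places the shared segment GFIKYLPG of the query directly over its copy in the target, which is the region a full local alignment would then refine.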


2.7.3 Alignment Scores and Statistical Significance of Database Searches

  • related model vs. random model

    • S-score: the alignment score

    • E-score: expected number of sequences with score >= S by random chance

    • P-score: probability that one or more sequences with score >= S would be found randomly

  • Lower E and P values are better: the match is less likely to have arisen by chance.
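
The three quantities are related. Under the Karlin-Altschul statistics used by BLAST, the expected number of chance hits is E = K m n e^(-lambda S), and if chance hits follow a Poisson distribution then P = 1 - e^(-E). The sketch below uses placeholder values for K, lambda and the sequence lengths; they are assumptions for illustration, not values from the slides.

```python
import math


def expected_hits(score: float, m: int, n: int, lam: float, k: float) -> float:
    """E: expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def p_value(e: float) -> float:
    """P: probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e)  # equals 1 - exp(-e), computed accurately for small e


if __name__ == "__main__":
    lam, k = 0.3, 0.1            # hypothetical statistical parameters
    m, n = 250, 50_000_000       # hypothetical query length and database size
    for s in (60, 80, 100):
        e = expected_hits(s, m, n, lam, k)
        print(f"S={s}  E={e:.3g}  P={p_value(e):.3g}")
```

For small E the two values are nearly equal, which is why they are often used interchangeably when judging strong hits.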


  • length correction

  • Scores


PAM120 (scores in units of (ln 2)/2 nats, i.e. half-bits)

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A    3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
R   -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
N   -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
D    0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
C   -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
Q   -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
E    0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
G    1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
H   -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
I   -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
L   -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
K   -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
M   -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
F   -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
P    1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
S    1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
T    1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
W   -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8
Y   -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8  -3  -3  -5  -3  -8
V    0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5  -3  -3  -1  -8
B    0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4   2  -1  -8
Z   -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4  -1  -8
X   -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2  -8
*   -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8
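
Because the scores are log-odds values scaled by (ln 2)/2, an entry can be converted back to an approximate odds ratio: odds = exp(score * (ln 2)/2) = 2^(score/2). The sketch below applies this to a few entries of the matrix; the results are approximate because the published scores are rounded to integers, and the function name is illustrative.

```python
import math

SCALE = math.log(2) / 2  # nats per score unit, as stated for this matrix


def odds_from_score(score: int) -> float:
    """Approximate odds ratio (related model vs. random model) for one matrix entry."""
    return math.exp(score * SCALE)


if __name__ == "__main__":
    # Entries taken from the PAM120 table: W/W, A/A, A/R and any residue vs. '*'.
    for pair, score in (("W/W", 12), ("A/A", 3), ("A/R", -3), ("A/*", -8)):
        print(f"{pair}: score {score:>3}, {score / 2:+.1f} bits, odds about {odds_from_score(score):.3g}")
```

A W/W match is about 64 times more likely under the related model than by chance, which reflects how rarely tryptophan is substituted.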


Applications

  • Reconstructing long sequences of DNA from overlapping sequence fragments

  • Determining physical and genetic maps from probe data under various experiment protocols

  • Database searching

  • Comparing two or more sequences for similarities


  • Protein structure prediction (building profiles)

  • Comparing the same gene sequenced by two different labs


2.8 Multiple Sequence Alignments

  • CLUSTAL

    • D. G. Higgins & P. M. Sharp, 1988

  • CLUSTALW

    • Sequences are weighted according to how divergent they are from the most closely related pair of sequences.

    • Gaps are weighted for different sequences.


Summary

  • notion of similarity

  • the scoring system used to rank alignments

  • the algorithms used to find an optimal-scoring alignment

  • the statistical method used to evaluate the significance of an alignment score


References and image sources

  • Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.

  • BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (Distributed by Tenlong, 天瓏.)

  • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.

  • Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.

