
# Chapter 2 Data Searches and Pairwise Alignments

Chapter 2 Data Searches and Pairwise Alignments. Department of Computer Science and Information Engineering, National Chi Nan University (暨南大學資訊工程學系), 黃光璿, 2004/03/08. Topics: Introduction; Nomenclature; 2.1 Dot Plots; 2.2 Simple Alignments.


### Chapter 2 Data Searches and Pairwise Alignments

2004/03/08

• What is the difference between acctga and agcta?

a c c t g a

a g c t g a

a g c t - a
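One way to make "the difference" concrete is to tally the columns of each alignment shown above. The following is a minimal sketch; the function name `compare_columns` is a hypothetical helper, not something from the slides.

```python
def compare_columns(top: str, bottom: str, gap: str = "-") -> dict:
    """Count match, mismatch and gap columns of an existing alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            counts["gap"] += 1       # one row has no residue in this column
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two alignments from the slide: without and with a gap.
ungapped = compare_columns("acctga", "agctga")
gapped = compare_columns("acctga", "agct-a")
print(ungapped)
print(gapped)
```

The ungapped pair differs in a single column, while aligning against agcta needs one gap column in addition to that mismatch.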

• No gap

• Mutation (substitution): common

• Insertion or deletion: gap, indel (rare)

• uniform gap

• affine gap

• origination penalty

• length penalty
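The two gap models can be compared with a small sketch. The numeric penalties below are arbitrary illustrations, and the convention that the length penalty is charged for every gap position (including the first) is an assumption; other texts charge it only for extensions.

```python
def uniform_gap_penalty(gap_length: int, per_position: float = 1.0) -> float:
    """Uniform model: every gap position costs the same."""
    return per_position * gap_length


def affine_gap_penalty(gap_length: int,
                       origination: float = 2.0,
                       length: float = 0.5) -> float:
    """Affine model: a one-time origination penalty plus a per-position length penalty."""
    if gap_length <= 0:
        return 0.0
    return origination + length * gap_length


# Under the affine model, one gap of length 4 costs less than four gaps of length 1.
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))
```

With these assumed values the affine model favors a few long gaps over many short ones, which is the usual motivation for separating the origination and length penalties.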

• Problems with modeling

• Does nature really operate according to these rules?

• Dayhoff, Schwartz, Orcutt (1978)

• Point Accepted Mutation

• Based on observed substitution rates

• (Box 2.1)

• Input

• A set of observed substitution rates

• Output

• PAM-1 matrix (log-odds matrix)

(1) Group the sequences with high similarity (> 85% identity).

(2) For each group, build the corresponding phylogenetic tree.

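Substitutions inferred from the tree: A->G, I->L, A->G, A->L, C->S, G->A

(3) Count the substitutions between each pair of amino acids, in either direction. For the list above, F_{G,A} = 3.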
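The count of 3 for the pair G, A follows from tallying the listed substitutions without regard to direction (two A->G plus one G->A). A minimal sketch of that tally, with `count_substitutions` as a hypothetical helper name:

```python
from collections import Counter

# Substitutions read off the phylogenetic tree, as listed on the slide.
observed = ["A->G", "I->L", "A->G", "A->L", "C->S", "G->A"]


def count_substitutions(events):
    """Count substitutions per unordered residue pair."""
    counts = Counter()
    for event in events:
        old, new = event.split("->")
        counts[frozenset((old, new))] += 1   # A->G and G->A share one key
    return counts


F = count_substitutions(observed)
print(F[frozenset("GA")])
```

Only the counting step is shown here; the later steps that turn counts into the PAM-1 log-odds matrix are not reconstructed.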

• (6)

• Henikoff & Henikoff (1992)

• PAM-k: the larger k is, the more divergent the sequences it models

• BLOSUM-k: the larger k is, the more similar the sequences it models

• BLOSUM62: for ungapped matching

• BLOSUM50: for gapped matching

• The Needleman and Wunsch Algorithm (Global Alignment)

A C A G T A G
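A minimal sketch of the global alignment recurrence follows. The sequence ACAGTAG is from the slide; the second sequence ACTCG and the scoring values (match +1, mismatch 0, gap -1) are assumptions chosen for illustration, and a uniform gap penalty is used for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACAGTAG", "ACTCG")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score; the traceback above simply returns the first one it finds.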

• Semi-global alignment

• Local alignment

• A A C A C G T G T C T

• - - - A C G T - - - -
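In the example above, ACGT sits inside a longer sequence with gaps on both ends. A semi-global alignment leaves such terminal gaps unpenalized. The sketch below assumes match +1, mismatch 0 and gap -1, and returns only the score.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Global-style alignment score in which leading and trailing gaps are free."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0, so leading gaps cost nothing.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free: take the best cell in the last row or last column.
    return max(max(score[n]), max(row[m] for row in score))


print(semiglobal_score("AACACGTGTCT", "ACGT"))
```

The only changes from the global recurrence are the zero-initialized borders and where the final score is read.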

• The Smith-Waterman Alignment
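A minimal sketch of local alignment in the Smith-Waterman style: cell scores are clamped at zero, and the traceback starts at the highest-scoring cell. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart instead of carrying a negative score.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AACACGTGTCT", "ACGT"))
```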

• BLAST and its relatives

• FASTA and related algorithms

• Using PAM or BLOSUM matrices

• Preprocess the target sequence.

• Identify the position for each word.

(for amino acids with word length 1, a 20-entry array)

• Scan the query sequence.

• Compute the shifts of query to align each word with the target.

• Find the mode (the most frequent value) of the shifts.

• Join the possible shifts into one new target sequence. Perform the full local alignment algorithm.

Query:TGFIKYLPGACT
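The word-lookup and shift steps can be sketched as follows. The query is the one on the slide; the target sequence FAMLGFIKYLPGCM is a hypothetical stand-in, since the transcript does not preserve the target. The sign convention (target position minus query position) is also an assumption.

```python
from collections import Counter, defaultdict


def index_words(target, k=1):
    """Preprocess the target: map each word of length k to its start positions."""
    positions = defaultdict(list)
    for i in range(len(target) - k + 1):
        positions[target[i:i + k]].append(i)
    return positions


def best_shift(query, target, k=1):
    """Return (shift, count) for the most frequent target-minus-query offset."""
    positions = index_words(target, k)
    shifts = Counter()
    for j in range(len(query) - k + 1):
        for i in positions.get(query[j:j + k], []):
            shifts[i - j] += 1          # offset that lines this word up with the target
    if not shifts:
        return None
    return shifts.most_common(1)[0]     # the mode of the shifts


query = "TGFIKYLPGACT"        # from the slide
target = "FAMLGFIKYLPGCM"     # hypothetical target, assumed for illustration
print(best_shift(query, target))
```

Only the shift-finding heuristic is shown; the final step, running a full local alignment around the chosen shift, is left out.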

2.7.3 Alignment Scores and Statistical Significance of Database Searches

• Related model vs. random model

• S-score: the alignment score

• E-score: expected number of sequences with score >= S by random chance

• P-score: probability that one or more sequences with score >= S would be found randomly

• Lower E and P values indicate a more significant match.
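E and P are closely linked. If the number of chance hits with score >= S is assumed to follow a Poisson distribution with mean E (the usual assumption in BLAST-style statistics, not something stated on the slide), then P = 1 - e^(-E). A minimal sketch:

```python
import math


def p_from_e(e_value: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)   # expm1 keeps precision when E is very small


for e in (1e-6, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small values the two are nearly equal, which is why either one can be used to rank hits.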

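PAM 120, in units of (ln 2)/2 nats

| | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V | B | Z | X | \* |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 3 | -3 | -1 | 0 | -3 | -1 | 0 | 1 | -3 | -1 | -3 | -2 | -2 | -4 | 1 | 1 | 1 | -7 | -4 | 0 | 0 | -1 | -1 | -8 |
| R | -3 | 6 | -1 | -3 | -4 | 1 | -3 | -4 | 1 | -2 | -4 | 2 | -1 | -5 | -1 | -1 | -2 | 1 | -5 | -3 | -2 | -1 | -2 | -8 |
| N | -1 | -1 | 4 | 2 | -5 | 0 | 1 | 0 | 2 | -2 | -4 | 1 | -3 | -4 | -2 | 1 | 0 | -4 | -2 | -3 | 3 | 0 | -1 | -8 |
| D | 0 | -3 | 2 | 5 | -7 | 1 | 3 | 0 | 0 | -3 | -5 | -1 | -4 | -7 | -3 | 0 | -1 | -8 | -5 | -3 | 4 | 3 | -2 | -8 |
| C | -3 | -4 | -5 | -7 | 9 | -7 | -7 | -4 | -4 | -3 | -7 | -7 | -6 | -6 | -4 | 0 | -3 | -8 | -1 | -3 | -6 | -7 | -4 | -8 |
| Q | -1 | 1 | 0 | 1 | -7 | 6 | 2 | -3 | 3 | -3 | -2 | 0 | -1 | -6 | 0 | -2 | -2 | -6 | -5 | -3 | 0 | 4 | -1 | -8 |
| E | 0 | -3 | 1 | 3 | -7 | 2 | 5 | -1 | -1 | -3 | -4 | -1 | -3 | -7 | -2 | -1 | -2 | -8 | -5 | -3 | 3 | 4 | -1 | -8 |
| G | 1 | -4 | 0 | 0 | -4 | -3 | -1 | 5 | -4 | -4 | -5 | -3 | -4 | -5 | -2 | 1 | -1 | -8 | -6 | -2 | 0 | -2 | -2 | -8 |
| H | -3 | 1 | 2 | 0 | -4 | 3 | -1 | -4 | 7 | -4 | -3 | -2 | -4 | -3 | -1 | -2 | -3 | -3 | -1 | -3 | 1 | 1 | -2 | -8 |
| I | -1 | -2 | -2 | -3 | -3 | -3 | -3 | -4 | -4 | 6 | 1 | -3 | 1 | 0 | -3 | -2 | 0 | -6 | -2 | 3 | -3 | -3 | -1 | -8 |
| L | -3 | -4 | -4 | -5 | -7 | -2 | -4 | -5 | -3 | 1 | 5 | -4 | 3 | 0 | -3 | -4 | -3 | -3 | -2 | 1 | -4 | -3 | -2 | -8 |
| K | -2 | 2 | 1 | -1 | -7 | 0 | -1 | -3 | -2 | -3 | -4 | 5 | 0 | -7 | -2 | -1 | -1 | -5 | -5 | -4 | 0 | -1 | -2 | -8 |
| M | -2 | -1 | -3 | -4 | -6 | -1 | -3 | -4 | -4 | 1 | 3 | 0 | 8 | -1 | -3 | -2 | -1 | -6 | -4 | 1 | -4 | -2 | -2 | -8 |
| F | -4 | -5 | -4 | -7 | -6 | -6 | -7 | -5 | -3 | 0 | 0 | -7 | -1 | 8 | -5 | -3 | -4 | -1 | 4 | -3 | -5 | -6 | -3 | -8 |
| P | 1 | -1 | -2 | -3 | -4 | 0 | -2 | -2 | -1 | -3 | -3 | -2 | -3 | -5 | 6 | 1 | -1 | -7 | -6 | -2 | -2 | -1 | -2 | -8 |
| S | 1 | -1 | 1 | 0 | 0 | -2 | -1 | 1 | -2 | -2 | -4 | -1 | -2 | -3 | 1 | 3 | 2 | -2 | -3 | -2 | 0 | -1 | -1 | -8 |
| T | 1 | -2 | 0 | -1 | -3 | -2 | -2 | -1 | -3 | 0 | -3 | -1 | -1 | -4 | -1 | 2 | 4 | -6 | -3 | 0 | 0 | -2 | -1 | -8 |
| W | -7 | 1 | -4 | -8 | -8 | -6 | -8 | -8 | -3 | -6 | -3 | -5 | -6 | -1 | -7 | -2 | -6 | 12 | -2 | -8 | -6 | -7 | -5 | -8 |
| Y | -4 | -5 | -2 | -5 | -1 | -5 | -5 | -6 | -1 | -2 | -2 | -5 | -4 | 4 | -6 | -3 | -3 | -2 | 8 | -3 | -3 | -5 | -3 | -8 |
| V | 0 | -3 | -3 | -3 | -3 | -3 | -3 | -2 | -3 | 3 | 1 | -4 | 1 | -3 | -2 | -2 | 0 | -8 | -3 | 5 | -3 | -3 | -1 | -8 |
| B | 0 | -2 | 3 | 4 | -6 | 0 | 3 | 0 | 1 | -3 | -4 | 0 | -4 | -5 | -2 | 0 | 0 | -6 | -3 | -3 | 4 | 2 | -1 | -8 |
| Z | -1 | -1 | 0 | 3 | -7 | 4 | 4 | -2 | 1 | -3 | -3 | -1 | -2 | -6 | -1 | -1 | -2 | -7 | -5 | -3 | 2 | 4 | -1 | -8 |
| X | -1 | -2 | -1 | -2 | -4 | -1 | -1 | -2 | -2 | -1 | -2 | -2 | -2 | -3 | -2 | -1 | -1 | -5 | -3 | -1 | -1 | -1 | -2 | -8 |
| \* | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | -8 | |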
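Scoring an ungapped alignment with such a matrix amounts to summing one table entry per column. The sketch below copies a handful of entries from the PAM 120 table above; the two peptides are hypothetical, and the helper names are illustrative.

```python
# A few PAM 120 entries copied from the table above (not the full matrix).
PAM120_SUBSET = {
    ("A", "A"): 3,
    ("C", "C"): 9,
    ("D", "E"): 3,
    ("I", "L"): 1,
    ("K", "R"): 2,
    ("W", "W"): 12,
}


def pair_score(x, y, matrix):
    """Look up a residue pair in either order."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def ungapped_score(s, t, matrix):
    """Sum the substitution scores column by column (no gaps allowed)."""
    if len(s) != len(t):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(x, y, matrix) for x, y in zip(s, t))


# Hypothetical peptides: 3 + 9 + 3 + 1 + 2
print(ungapped_score("ACDIK", "ACELR", PAM120_SUBSET))
```

A pair missing from the subset raises `KeyError`; a full implementation would load the entire matrix instead.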

Applications

• Reconstructing long sequences of DNA from overlapping sequence fragments

• Determining physical and genetic maps from probe data under various experiment protocols

• Database searching

• Comparing two or more sequences for similarities

2.8 Multiple Sequence Alignments

• CLUSTAL

• D. G. Higgins & P. M. Sharp, 1988

• CLUSTALW

• Sequences are weighted according to how divergent they are from the most closely related pair of sequences.

• Gaps are weighted for different sequences.

Summary

• notion of similarity

• the scoring system used to rank alignments

• the algorithms used to find optimal scoring alignment

• the statistical method used to evaluate the significance of an alignment score

References

• Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003.

• BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (Distributed by Tenlong, 天瓏.)

• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998.

• Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.