bnfo 602 lecture 2 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
BNFO 602 Lecture 2 PowerPoint Presentation
Download Presentation
BNFO 602 Lecture 2

Loading in 2 Seconds...

play fullscreen
1 / 29

BNFO 602 Lecture 2 - PowerPoint PPT Presentation


  • 73 Views
  • Uploaded on

BNFO 602 Lecture 2. Usman Roshan. -3 mil yrs. AAGACTT. AAGACTT. AAGACTT. AAGACTT. AAGACTT. -2 mil yrs. AAGGCTT. AAG G CTT. AAGGCTT. AAGGCTT. T_GACTT. T_GACTT. T _ GACTT. T_GACTT. -1 mil yrs. _GGGCTT. _ G GGCTT. _GGGCTT. TAGACCTT. T AG A C CTT. TAGACCTT. A _ C ACTT. A_CACTT.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'BNFO 602 Lecture 2' - albany


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
bnfo 602 lecture 2

BNFO 602Lecture 2

Usman Roshan

dna sequence evolution

-3 mil yrs

AAGACTT

AAGACTT

AAGACTT

AAGACTT

AAGACTT

-2 mil yrs

AAGGCTT

AAGGCTT

AAGGCTT

AAGGCTT

T_GACTT

T_GACTT

T_GACTT

T_GACTT

-1 mil yrs

_GGGCTT

_GGGCTT

_GGGCTT

TAGACCTT

TAGACCTT

TAGACCTT

A_CACTT

A_CACTT

A_CACTT

today

_G_GCTT

(Mouse)

TAGGCCTT

(Human)

TAGCCCTTA

(Monkey)

A_CACTTC

(Lion)

A_C_CTT

(Cat)

GGCTT

(Mouse)

TAGGCCTT

(Human)

TAGCCCTTA

(Monkey)

ACACTTC

(Lion)

ACCTT

(Cat)

DNA Sequence Evolution
sequence alignments
Sequence alignments

They tell us about

  • Function or activity of a new gene/protein
  • Structure or shape of a new protein
  • Location or preferred location of a protein
  • Stability of a gene or protein
  • Origin of a gene or protein
  • Origin or phylogeny of an organelle
  • Origin or phylogeny of an organism
  • And more…
pairwise alignment
Pairwise alignment
  • X: ACA, Y: GACAT
  • Match=8, mismatch=2, gap-5

ACA-- -ACA- --ACA ACA----

GACAT GACAT GACAT G--ACAT

8+2+2-5-5 -5+8+8+8-5 -5-5+2+2+2 2-5-5-5-5-5-5

Score = 2 14 -4 -28

optimal alignment
Optimal alignment
  • An alignment can be specified by the traceback matrix.
  • How do we determine the traceback for the highest scoring alignment?
  • Needleman-Wunsch algorithm for global alignment
    • First proposed in 1970
    • Widely used in genomics/bioinformatics
    • Dynamic programming algorithm
needleman wunsch
Needleman-Wunsch
  • Input:
    • X = x1x2…xn, Y=y1y2…ym
    • (X is seq2 and Y is seq1)
  • Define V to be a two dimensional matrix with len(X)+1 rows and len(Y)+1 columns
  • Let V[i][j] be the score of the optimal alignment of X1…i and Y1…j.
  • Let m be the match cost, mm be mismatch, and g be the gap cost.
dynamic programming
Dynamic programming

Initialization:

for i = 1 to len(seq2) { V[i][0] = i*g; }

For i = 1 to len(seq1) { V[0][i] = i*g; }

Recurrence:

for i = 1 to len(seq2){

for j = 1 to len(seq1){

V[i-1][j-1] + m(or mm)

V[i][j] = max { V[i-1][j] + g

V[i][j-1] + g

if(maximum is V[i-1][j-1] + m(or mm)) then T[i][j] = ‘D’

else if (maximum is V[i-1][j] + g) then T[i][j] = ‘U’

else then T[i][j] = ‘L’

}

}

example
Example

V

Input:

seq2: ACA

seq1: GACAT

m = 5

mm = -4

gap = -20

seq2 is lined along the rows

and seq2 is along the columns

G A C A T

A

C

A

T

affine gap penalties
Affine gap penalties
  • Affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps)
  • Alignment:
    • ACACCCT ACACCCC
    • ACCT TAC CTT
    • Score = 0 Score = 0.9
  • Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
affine penalty recurrence
Affine penalty recurrence

M(i,j) denotes alignments of x1..i and y1..j ending with

a match/mismatch. E(i,j) denotes alignments of x1..i

and y1..j such that yj is paired with a gap. F(i,j) defined

similarly. Recursion takes O(mn) time where m and n

are lengths of x and y respectively.

structural alignments
Structural alignments
  • Recall that proteins have 3-D structure.
structural alignment example 1
Structural alignment - example 1

Alignment of thioredoxins from

human and fly taken from the

Wikipedia website. This protein

is found in nearly all organisms

and is essential for mammals.

PDB ids are 3TRX and 1XWC.

structural alignment example 2

Unaligned proteins.

2bbm and 1top are

proteins from fly and

chicken respectively.

Computer generated

aligned proteins

Structural alignment - example 2

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

structural alignments1
Structural alignments
  • We can produce high quality manual alignments by hand if the structure is available.
  • These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.
benchmark alignments
Benchmark alignments
  • Protein alignment benchmarks
    • BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment.
    • Proteins benchmarks are generally large and have been in the research community for sometime now.
    • BAliBASE 3.0
biologically realistic scoring matrices
Biologically realistic scoring matrices
  • PAM and BLOSUM are most popular
  • PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins
  • BLOSUM is more recent and computed from blocks of sequences with sufficient similarity
slide18
PAM
  • We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j
  • Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families
  • Compute probabilities of change and background probabilities by simple counting
expected accuracy alignment
Expected accuracy alignment
  • The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative.
  • We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.
posterior probability of x i aligned to y j
Posterior probability of xi aligned to yj
  • Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*.
  • We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as

Do et. al., Genome Research, 2005

expected accuracy of alignment
Expected accuracy of alignment
  • We can define the expected accuracy of an alignment a as
  • The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm

Do et. al., Genome Research, 2005

example for expected accuracy
Example for expected accuracy
  • True alignment
  • AC_CG
  • ACCCA
  • Expected accuracy=(1+1+0+1+1)/4=1
  • Estimated alignment
  • ACC_G
  • ACCCA
  • Expected accuracy=(1+1+0.1+0+1)/4 ~ 0.75
estimating posterior probabilities
Estimating posterior probabilities
  • If correct posterior probabilities can be computed then we can compute the correct alignment. Now it remains to estimate these probabilities from the data
  • PROBCONS (Do et. al., Genome Research 2006): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et. al. 1998)
  • Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices
local alignment
Local alignment
  • Global alignment recurrence:
  • Local alignment recurrence
local alignment traceback
Local alignment traceback
  • Let T(i,j) be the traceback matrices and m and n be length of input sequences.
  • Global alignment traceback:
    • Begin from T(m,n) and stop at T(0,0).
  • Local alignment traceback:
    • Find i*,j* such that T(i*,j*) is the maximum over all T(i,j).
    • Begin traceback from T(i*,j*) and stop when

T(i,j) <= 0.

blast
BLAST
  • Local pairwise alignment heuristic
  • Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.
  • Online server: http://www.ncbi.nlm.nih.gov/blast
blast1
BLAST
  • Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t --- also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.
  • Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold
  • Report maximal segments above score S.
finding k mers quickly
Finding k-mers quickly
  • Preprocess the database of sequences:
    • For each sequence in the database store all k-mers in hash-table.
    • This takes linear time
  • Query sequence:
    • For each k-mer in the query sequence look up the hash table of the target to see if it exists
    • Also takes linear time