Calculating substitution matrices

Presentation Transcript


  1. Calculating substitution matrices • http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/nodeD.html#wm5 • Two models for sequence alignment: one random (R) and one match (M) • The random model assumes that letter $a$ occurs independently with some frequency $q_a$, so the probability of the two sequences is just the product of the probabilities of each amino acid: • $P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$

  2. Odds ratio • The match model aligns residues with a joint probability $p_{ab}$ • $P(x,y \mid M) = \prod_i p_{x_i y_i}$ • The ratio of match to random is known as the odds ratio: • $P(x,y \mid M) / P(x,y \mid R) = \prod_i \left( p_{x_i y_i} / (q_{x_i} q_{y_i}) \right)$

  3. Log odds ratio • $s(a,b) = \log\left( p_{ab} / (q_a q_b) \right)$ • $S = \sum_i s(x_i, y_i)$ • The second equation is the sum of the individual scores for each aligned pair of residues. The first equation gives the entries of a matrix: for proteins, the scores form a 20 × 20 matrix known as a score or substitution matrix (BLOSUM, PAM).
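
The relationship between the odds ratio and the summed log-odds score can be checked numerically. The sketch below is only an illustration: the two-letter alphabet, the probabilities in `p` and `q`, and all function names are made up here and are not from the slides.

```python
import math

# Hypothetical two-letter alphabet; every number here is invented for illustration.
q = {"A": 0.55, "B": 0.45}                    # background frequencies q_a
p = {("A", "A"): 0.50, ("A", "B"): 0.05,
     ("B", "A"): 0.05, ("B", "B"): 0.40}      # joint match probabilities p_ab

def log_odds_matrix(p, q, base=2.0):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every letter pair."""
    return {(a, b): math.log(p_ab / (q[a] * q[b]), base)
            for (a, b), p_ab in p.items()}

def alignment_score(x, y, s):
    """Sum s(x_i, y_i) over an ungapped alignment of equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))

def odds_ratio(x, y, p, q):
    """P(x, y | M) / P(x, y | R) as a product over aligned positions."""
    ratio = 1.0
    for a, b in zip(x, y):
        ratio *= p[(a, b)] / (q[a] * q[b])
    return ratio

s = log_odds_matrix(p, q)
x, y = "AABB", "ABBB"
S = alignment_score(x, y, s)

# The summed score is the logarithm of the odds ratio.
print(S, math.log2(odds_ratio(x, y, p, q)))
```

Pairs that occur more often in true alignments than by chance get positive scores, and pairs that occur less often get negative scores.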

  4. Significance of scores using alignment algorithms • Calculate a raw score: the sum of the scores for each letter-to-letter and letter-to-null (gap) position • Calculate a bit score: normalizes for the scoring system used • Calculate an E-value: derived from the bit score to account for the chance that the hit arose at random

  5. Raw score • Calculated from substitution matrices (PAM, BLOSUM), and gap costs • There are substitution matrices for nucleotides also: • States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.
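
As a rough sketch of how a raw score combines substitution scores and gap costs, the example below scores an already-aligned pair of nucleotide strings. The +5/-4 match/mismatch matrix, the gap penalties, the sequences, and the function name `raw_score` are all hypothetical choices for illustration. They are not taken from PAM, BLOSUM, or the cited paper.

```python
def raw_score(x_aln, y_aln, sub, gap_open, gap_extend):
    """Sum substitution scores over aligned letters; charge gap_open for the
    first position of each gap run and gap_extend for each further position."""
    if len(x_aln) != len(y_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(x_aln, y_aln):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            row = "x" if a == "-" else "y"
            score -= gap_extend if prev_gap_row == row else gap_open
            prev_gap_row = row
        else:
            score += sub[(a, b)]
            prev_gap_row = None
    return score

# Hypothetical nucleotide matrix: +5 for a match, -4 for a mismatch.
sub = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}

# Columns: A/A, C/C, G/-, T/-, T/T, -/G, A/A
total = raw_score("ACGTT-A", "AC--TGA", sub, gap_open=10, gap_extend=1)
print(total)
```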

  6. Bit score • $S' = (\lambda S - \ln K) / \ln 2$ • $\lambda$ and $K$ are parameters dependent upon the scoring system (substitution matrix and gap costs) employed • Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. • http://www.ncbi.nlm.nih.gov/BLAST/matrix_info.html#lambda • Gap costs: the standard cost associated with a gap of length $g$
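
A minimal sketch of the bit-score conversion follows. The values of `LAMBDA` and `K` are illustrative assumptions, not values given in the slides. In practice they come from the statistics of the specific substitution matrix and gap costs in use.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Assumed parameters for illustration only.
LAMBDA, K = 0.267, 0.041

for S in (50, 100, 150):
    print(S, round(bit_score(S, LAMBDA, K), 1))
```

Because the conversion is linear in the raw score, bit scores from different scoring systems become comparable once each system's own parameters have been applied.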

  7. Gap costs • Can be linear, like we did in our matrix: $\gamma(g) = -gd$ • Can be an "affine" score, the most prevalent form now: $\gamma(g) = -d - (g-1)e$ • Here $d$ is called the gap-open penalty and $e$ is called the gap-extension penalty • The gap-extension penalty $e$ is usually smaller than $d$, so long insertions and deletions are penalized less than they would be under a linear cost
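
The two gap-cost forms can be compared directly. In this sketch the penalty values (`d = 8` for the linear cost, `d = 12` and `e = 2` for the affine cost) and the function names are assumptions chosen only to show the shape of each function.

```python
def linear_gap(g, d):
    """Linear gap cost: gamma(g) = -g * d for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d

def affine_gap(g, d, e):
    """Affine gap cost: gamma(g) = -d - (g - 1) * e for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e

# Hypothetical penalties for illustration.
for g in (1, 2, 5, 10):
    print(g, linear_gap(g, d=8), affine_gap(g, d=12, e=2))
```

With these numbers a single-residue gap costs more under the affine scheme, but from length 2 onward the affine cost falls increasingly below the linear one. One long gap is therefore cheaper than several short ones.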

  8. E-value • $E = N / 2^{S'}$ • This is an approximation for the number ($E$) of distinct HSPs (high-scoring segment pairs) with normalized score at least $S'$ expected to occur by chance when two random protein sequences of sufficient lengths $m$ and $n$ are compared • $N = mn$ (search space size)
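
The E-value formula is short enough to sketch directly. The function name `expected_hits` and the example lengths are hypothetical.

```python
def expected_hits(bit_score, m, n):
    """E = N / 2**S' with search space N = m * n."""
    return (m * n) / 2.0 ** bit_score

# Each additional bit of score halves the expected number of chance hits.
for bits in (20, 30, 40):
    print(bits, expected_hits(bits, m=250, n=1_000_000))
```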

  9. Database searching • If a protein is compared to a whole database, $n$ is the database length in residues • The equation can be converted to: • $S' = \log_2(N/E)$ • If a protein of length 250 is compared to a protein database of $5 \times 10^7$ residues, then $N = 1.25 \times 10^{10}$, and a normalized score of about 38 bits is necessary to achieve a marginally significant E-value of 0.05
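
The inverted formula can be used to check how many bits a given search space demands. This is a sketch assuming $N = mn$ exactly, and the function name `required_bit_score` is made up here. With a 250-residue query, a database of 5 × 10^7 residues calls for about 38 bits at E = 0.05. A database ten times smaller lowers the requirement by log2(10), roughly 3.3 bits.

```python
import math

def required_bit_score(m, n, e_value):
    """S' = log2(N / E) with search space N = m * n."""
    return math.log2(m * n / e_value)

print(round(required_bit_score(250, 5e7, 0.05), 1))  # database of 5 x 10^7 residues
print(round(required_bit_score(250, 5e6, 0.05), 1))  # database of 5 x 10^6 residues
```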

  10. Significance of E-value • The E-value is an expected number of chance hits, not a probability, so it starts at 0 and can exceed 1 • The lower the E-value, the more significant the match • Note that the E-value depends on the length of the query sequence: the same bit score gives a lower, more significant E-value for a query of 100 amino acids than for one of 200 amino acids
