Iosif Vaisman

Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu

NIH working definition of bioinformatics and computational biology (July 2000) The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations. Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Liebman MN, Molecular modeling of protein structure and function: a bioinformatic approach. J Comput Aided Mol Des 1988, 1(4):323-41 Bioinformatics bibliography(papers with the word “bioinformatics” in title or abstract)

Dynamics of Database Growth

Comparative Sequence Sizes • Yeast chromosome 3 350,000 • Escherichia coli (bacterium) genome 4,600,000 • Largest yeast chromosome now mapped 5,800,000 • Entire yeast genome 15,000,000 • Smallest human chromosome (Y) 50,000,000 • Largest human chromosome (1) 250,000,000 • Entire human genome 3,000,000,000

The String Alignment Problem string - a sequence of characters from some alphabet given: two strings acbcdb and cadbd one of possible alignments: a c - - b c d b - c a d b - d - score: 3 . (2) + 5 . (-1) = 1 scoring function: exact match +2 mismatch -1 insertion -1

The String Alignment Problem given: two strings CTCATG and TACTTG C T C A T G | | | T A C T T G score: 3 . (2) + 3 . (-1) = 3 C T C A - T - G | | | | . T - A C T T G score: 4 . (2) + 4 . (-1) = 4

Entropy and Redundancy of Language CUR F W D DIS AND P A SED IEND ROUGHT EATH EASE AIN BLES FR B BR AND AG

Entropy and Redundancy of Language ** CUR**** F*****W******* D***** DIS*****AND P*** || |||| ||||| ||||||| ||||| ||||| ||| **BLES****FR*****B*******BR*****AND ***** AG*** The sequences are 65% identical A CURSED FIEND WROUGHT DEATH DISEASE AND PAIN || |||| ||||| ||||||| ||||| ||||| ||| A BLESSED FRIEND BROUGHT BREATH AND EASE AGAIN

PAM100 PAM100 PAM100 PAM100 PAM200 PAM150 Substitution Matrices • Dayhoff (or MDM, or PAM) - Derived from global alignments of closely related sequencesPAM100 - number referes to evolutionary distance (Percentage of Acceptable point Mutations per 108 years) 300 million years 200 million years 100 million years

Substitution Matrices • BLOSUM (BLOcks SUbstitution Matrix) -Derived from local, ungapped alignments of distantly related sequencesBLOSUM62 - number refers to the minimum percent identity Reference: Henikoff & Henikoff Proteins17:49, 1993

Selecting a Matrix Low PAM: short segments, high similarity High PAM: long segments, low similarity • Compared sequences are related:200 PAM or 250 PAM • Database scanning:120 PAM • Local alignment search: 40 PAM, 120 PAM, 250 PAM • Detection of related sequences using BLAST: BLOSUM 62 THERE IS NO “ONE SIZE FITS ALL” MATRIX !

Matrix Example A B C D E F G H I K .. 1.5 0.2 0.3 0.3 0.3 -0.5 0.7 -0.1 0.0 0.0 .. A 1.1 -0.4 1.1 0.7 -0.7 0.6 0.4 -0.2 0.4 .. B 1.5 -0.5 -0.6 -0.1 0.2 -0.1 0.2 -0.6 .. C 1.5 1.0 -1.0 0.7 0.4 -0.2 0.3 .. D 1.5 -0.7 0.5 0.4 -0.2 0.3 .. E 1.5 -0.6 -0.1 0.7 -0.7 .. F 1.5 -0.2 -0.3 -0.1 .. G 1.5 -0.3 0.1 .. H 1.5 -0.2 .. I 1.5 .. K

Dayhoff’s Acceptable Point Mutations Ala A Arg R 30 Asn N 109 17 Asp D 154 0 532 Cys C 33 10 0 0 Gln Q 93 120 50 76 0 Glu E 266 0 94 831 0 422 Gly G 579 10 156 162 10 30 112 His H 21 103 226 43 10 243 23 10 Ile I 66 30 36 13 17 8 35 0 3 Leu L 95 17 37 0 0 75 15 17 40 253 Lys K 57 477 322 85 0 147 104 60 23 43 39 Met M 29 17 0 0 0 20 7 7 0 57 207 90 Phe F 20 7 7 0 0 0 0 17 20 90 167 0 17 Pro P 345 67 27 10 10 93 40 49 50 7 43 43 4 7 Ser S 772 137 432 98 117 47 86 450 26 20 32 168 20 40 269 Thr T 590 20 169 57 10 37 31 50 14 129 52 200 28 10 73 696 Trp W 0 27 3 0 0 0 0 0 3 0 13 0 0 10 0 17 0 Tyr Y 20 3 36 0 30 0 10 0 40 13 23 10 0 260 0 22 23 6 Val V 365 20 13 17 33 27 37 97 30 661 303 17 77 10 50 43 186 0 17 A R N D C Q E G H I L K M F P S T W Y Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr

Search and alignment entropy • Information content per position: pam10 - 3.43 bits pam120 - 0.98 bits pam160 - 0.70 bits pam250 - 0.38 bits blosum62 - 0.70 bits • Information requirements: for search - 30 bits for alignment - 16 bit

Search and alignment entropy Recommended matrices for different query length Query length Substitution matrix Gap costs <35 PAM-30 ( 9,1) 35-50 PAM-70 (10,1) 50-85 BLOSUM-80 (10,1) >85 BLOSUM-62 (11,1)

Sequence B Sequence A FASTA Algorithm 1 First run (identities)

2 Sequence B Rescoring using PAM matrix high score low score Sequence A FASTA Algorithm The score of the highest scoring initial region is saved as the init1 score.

Sequence B Sequence A FASTA Algorithm 3 Joining threshold - eliminates disjointed segments Non-overlapping regions are joined. The score equals sum of the scores of the regions minus a gap penalty. The score of the highest scoring region, at the end of this step, is saved as the initn score.

Sequence B Sequence A FASTA Algorithm 4 Alignment optimization using dynamic programming The score for this alignment is the opt score.

FASTA Algorithm FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.

FASTA Results • When init1=init0=opt: 100 % homology over the matched stretch. • When initn > init1: more than 1 matching region in the database with poorly matching separating regions. • When opt > initn: the matching regions are greatly improved by adding gaps in one or both of the sequences.

BLAST - Basic Local Alignment Search Tool • Blast programs use a heuristic search algorithm. The programs use the statistical methods of Karlin and Altschul (1990,1993). • Blast programs were designed for fast database searching, with minimal sacrifice of sensitivity to distant related sequences.

BLAST Algorithm 1 Query sequence of length L Maximium of L-w+1 words (typically w = 3 for proteins) For each word from the query sequence find the list of words with high score using a substitution matrix (PAM or BLOSUM) Word list

BLAST Algorithm 2 Database sequences Word list Exact matches of words from the word list to the database sequences

BLAST Algorithm 3 Maximal Segment Pairs (MSPs) For each exact word match, alignment is extended in both directions to find high score segments

Gapped BLAST • The Gapped Blast algorithm allows gaps to be introduces into the alignments. That means that similar regions are not broken into several segments. • This method reflects biological relationships much better.

BLAST family of programs • blastp - amino acid query sequence against a protein sequence database • blastn - nucleotide query sequence against a nucleotide sequence database • blastx - nucleotide query sequence translated in all reading frames against a protein database • tblastn - protein query sequence against a nucleotide sequence database dynamically translated in all reading frames • tblastx - six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Database Searches • Run Blast first, then depending on your results run a finer tool (Fasta, Smith-Waterman, etc.) • Where possible use translated sequence. • E() < 0.05 is statistically significant, usually biologically interesting. Check also 0.05 < E() <10 because you might find interesting hits. • Pay attention to abnormal composition of the query sequence, it usually causes biased scoring. • Split large query sequence ( if >1000 for DNA, >200 for protein). • If the query has repeated segments, remove them and repeat the search.

Documenting the Search • Algorithm(s) • Substitution matrix • Gap penalty (FASTA) • Name of database • Version of database • Computer used

MULTIPLE SEQUENCE ALIGNMENT

Computational complexity Alignment of protein sequences with 200 amino acid residues:

Multiple alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG-- Column cost: the sum of costs for all possible pairs

Multiple alignment A correct multiple alignment corresponds to an evolutionary history: no correct way to determine practical way - to find an alignment with the maximum score

Multiple sequence alignment Given k (k > 2) sequences, s1,…, sk, each sequence consisting of characters from an alphabet A multiple alignment is a a rectangular array, consisting of characters from the alphabet A’ (A + "-"), that satisfies the following 3 conditions: 1. There are exactly k rows. 2. Ignoring the gap character, row number i is exactly the sequence si. 3. Each column contains at least one character different from "-".

Consensus Plurality - minimum number of votes for a consensus Threshold - scoring matrix value below which a symbol may not vote for a coalition. Sensitivity - minimum score to select consensus Profiles - blocks of prealigned sequences

Multiple alignment algorithm 1. Pairwise alignments (progressive pairwise alignments) 2. Distance matrix calculation 3. Guide tree creation (hierarchical clustering) 4. New sequence addition

Sreal(ij) - Srand(ij) D(ij)= -ln x 100 Siden(ij) - Srand(ij) Scoring system (distances) Sreal(ij) - observed similarity score for two aligned sequences i and j Siden(ij) - average of the two scores for each sequence aligned with itself Srand(ij) - average score determined from 100 global randomizations of the two sequences The distances D(ij) are used to generate the distance matrix from which the approximate guide tree is generated.

Multiple alignment

(1,1,1) C (1,0) (1,1) B B (0,0) (0,1) A (0,0,0) A Multiple alignment Segment - line joining two vertices Each unit m-dimensional cube in the lattice contains 2m -1 segments

Multiple alignment Alignment Path for 3 Sequences (0,0,0), (1,0,0), (2,1,0), (3,2,0), (3,3,1), (4,3,2)

Multiple alignment V S N - S - S N A - - - - A S Pairwise Projections of the Alignment

Alignment statistics Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi 1 2 3 4 5 6 7 8 478 67% 65% 19% 19% 18% 42% 43% 1 0 82% 80% 39% 39% 36% 64% 65% 0 1% 0% 5% 5% 12% 2% 2% 327 483 58% 16% 16% 16% 39% 41% 2 400 0 75% 38% 38% 35% 62% 63% 5 0 0% 5% 5% 12% 1% 1% 318 284 482 18% 18% 17% 40% 43% 3 390 367 0 38% 38% 35% 64% 64% 4 1 0 5% 5% 12% 1% 1% 96 84 95 494 95% 74% 20% 21% 4 198 192 194 0 98% 84% 40% 41% 30 29 28 0 0% 7% 6% 5%

Alignment score Rablpb Humcetp Rabcetp Bovbpi Humlbpa Ratlbp Maccetp Humbpi 1 2 3 4 5 6 7 8 1 4077 2 5358 4129 3 5323 5650 4096 4 8103 8229 8112 4210 5 8109 8243 8118 4332 4219 6 8535 8672 8575 5511 5519 4261 7 6474 6531 6500 8103 8119 8572 4103 8 6392 6434 6378 8033 8035 8520 5508 4083 1 2 3 4 5 6 7 8

Alignment visualization Identity Summary view

Alignment visualization Physico-chemical properties Differences mode

Alignment visualization (tree)

Sequence Logos: a quantitative graphical display for binding sites and proteins Reference: Schneider, T.D. Meth. Enzym 274:445, 1996

Sequence Logos

Iosif Vaisman

Iosif Vaisman

Presentation Transcript

Iosif Legrand California Institute of Technology

Iosif Legrand Caltech - CERN

Nikos Iosif International Business Development, MANTIS

Todd J.Taylor, Iosif I.Vaisman todd.taylor@nist, ivaisman@gmu

Iosif Vissarionovich Dzhugashvili (Aka Joseph Stalin)

Angelica Iosif, Project assistant Mare Nostrum NGO

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

MonALISA Team Iosif Legrand , Harvey Newman, Ramiro Voicu ,

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

Iosif Legrand California Institute of Technology

MonALISA Team Iosif Legrand , Harvey Newman, Ramiro Voicu ,