Prologue

Blastology and Open Source: Needs and Deeds Iddo Friedberg, Ph.D. The Burnham Institute February, 2003. Prologue. BLAST – Basic Local Alignment Search Tool: fast sequence similarity searching, query vs. database (1990) Gapped BLAST – now we can use gaps in the alignment (1996)

Presentation Transcript


  1. Blastology and Open Source: Needs and Deeds • Iddo Friedberg, Ph.D. • The Burnham Institute • February, 2003

  2. Prologue • BLAST – Basic Local Alignment Search Tool: fast sequence similarity searching, query vs. database (1990) • Gapped BLAST – alignments can now contain gaps (1996) • PSI-BLAST – Position-Specific Iterated BLAST: iterating the BLAST search increases sensitivity (1997) • 7800 citations over 6 years

  3. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing possibilities • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  4. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  5. PSI-BLAST 101 • Take a sequence (e.g. MGLLTREIF--ILQQ) • Search for similar sequences in a full sequence database • Sequences are multiply aligned (e.g. FGLGRT-I-T-YMTN, -GLVRT-I---LGLE, FGLLRT-I---YMTQ) • Construct a profile, and represent conservation in each position numerically • A profile holds more information than a single sequence: use the profile to retrieve additional sequences • New sequences enter the multiple alignment • Construct a new profile • [Figure: example alignments and per-residue profile rows (A, C, …, Y) at successive iterations] • After several iterations of this procedure we have: • Sequence information, inc. links to annotation • Several sets of multiple alignments • Profiles, derived by us or by PSI-BLAST • Thresholding information (alignment statistics)
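
The "construct a profile" step can be sketched in a few lines of Python. This is only an illustration built on the three aligned sequences shown on the slide: it counts residues per column and converts the counts to frequencies. It deliberately leaves out the sequence weighting and pseudocounts that PSI-BLAST itself applies, and the function names `column_counts` and `column_frequencies` are made up for this sketch.

```python
from collections import Counter


def column_counts(alignment):
    """Count residues in each column of an equal-length alignment; gaps ('-') are skipped."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        Counter(row[i] for row in alignment if row[i] != "-")
        for i in range(length)
    ]


def column_frequencies(counts):
    """Turn per-column counts into per-column relative frequencies."""
    freqs = []
    for col in counts:
        total = sum(col.values())
        # A column made only of gaps yields an empty dictionary.
        freqs.append({res: n / total for res, n in col.items()} if total else {})
    return freqs


# The three aligned sequences from the slide (unweighted, illustration only).
alignment = [
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
counts = column_counts(alignment)
profile = column_frequencies(counts)
```

Each entry of `profile` is one alignment column; a dictionary with a single key at frequency 1.0 marks a column that is fully conserved among the non-gap residues.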

  6. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  7. Post-BLAST Information Flow: Wet Lab (typical) • PSI-BLAST output: sequence alignments, statistics, annotations • Used for: locating homologs, function prediction (if function unknown)

  8. Enter Bioinformatics, Stage Left… • Process many queries • More sophisticated post-processing, e.g. • Structure prediction • Phylogenetics • Function prediction: using annotation / structural data / phylogenetic data • “Unusual” searching: • Need to change parameter default values

  9. Post-BLAST Information Flow: Bioinformatics • PSI-BLAST output: annotations, sequence alignments, statistics, profiles • Used for: locating homologs, function prediction (if function unknown), homology modeling, fold prediction, tree building

  10. PDB-BLAST: Sensitive Fold Recognition (Li & Godzik) • PSI-BLAST against a large sequence database (nr85) • PSI-BLAST against a structure database (PDB) • Output: statistics, sequence alignments, profiles • Fold recognition

  11. PSI-PRED Secondary Structure Prediction (David Jones) • PSI-BLAST against a filtered database: no transmembrane segments, no coiled coils • Profiles • Windows of length 15 • 1st neural network • 2nd neural network • 3-state prediction

  12. PSI-BLAST is used for: • Distant homology detection • Fold assignment • Profile-profile comparison • Domain identification • Evolutionary analysis (e.g. tree building) • Sequence annotation / function assignment • Profile export to other programs • Sequence clustering • Structural genomics target selection • PSI-BLAST’s ability to do all of the above has been evaluated. So have competing programs, which used PSI-BLAST as the standard for comparison

  13. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: conserved positions in profiles • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  14. Why Profiles? • More informative than sequences • More accurate than regexps (“motifs”) • PSI-BLAST’s consecutive profiles enable us to obtain an “evolutionary vista” • PeCoP: illustrating the use of iterated profiles to detect Persistently Conserved Positions

  15. PeCoP: Locating Important Residues (Friedberg & Margalit) • PSI-BLAST against a large sequence database (nr) • Output: sequence alignments, statistics, profiles • Find conserved positions • Locate important residues

  16. What is a Conserved Position? • A conserved position has a high frequency of any single amino-acid type in the MSA column. • Conservation is usually measured by determining the information content or the relative entropy of a position

  17. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works • Post PSI-BLAST processing • PeCoP: getting profiles from PSI-BLAST • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  18. Information Content I: Uncertainty • Uncertainty: the number of “yes / no” questions needed to determine a state: • Coin toss: 1 question (“Is it heads?”) • Nucleotide in a DNA sequence: 2 questions (“Is it a purine?” -> “Is it an adenine?”) • Uncertainty is measured in bits • Maximum uncertainty: $\log_2(\text{number of possible states})$ • Coin toss: $\log_2 2 = 1$ bit • DNA: $\log_2 4 = 2$ bits • Proteins: $\log_2 20 = 4.32$ bits
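
The three maximum-uncertainty figures on this slide can be checked with a one-line function. The name `max_uncertainty_bits` is invented for this sketch; it assumes all states are equally likely, which is the case that gives maximum uncertainty.

```python
import math


def max_uncertainty_bits(n_states):
    """Maximum uncertainty, in bits, for n_states equally likely states."""
    return math.log2(n_states)


coin = max_uncertainty_bits(2)      # coin toss
dna = max_uncertainty_bits(4)       # nucleotide
protein = max_uncertainty_bits(20)  # amino acid
```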

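  19. Information Content II: Measuring Positional Conservation • Information content is the reduction in uncertainty • Uncertainty “before”: $\log_2 20 = 4.32$ bits • Uncertainty “after” (i.e. when we know the MSA position makeup): $H = -\sum_{a=1}^{20} p_a \log_2 p_a$, where $p_a$ is the frequency of amino-acid type $a$ in the column • Uncertainty difference is therefore: $IC = \log_2 20 - H = \log_2 20 + \sum_{a=1}^{20} p_a \log_2 p_a$ • Fully conserved position: every term $p_a \log_2 p_a$ is 0, so $IC = 4.32 - 20 \cdot 0 = 4.32$ • Not conserved at all (all $p_a = 1/20$): $IC = 4.32 - 4.32 = 0$ • “The more conserved a position, the higher its information content”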

  20. Information Content II: Measuring Positional Conservation (worked example) • MSA column: D, D, D, E, G • $P_D = 3/5 = 0.6$, $P_E = 1/5 = 0.2$, $P_G = 1/5 = 0.2$ • Uncertainty “after”: $H = -(0.6 \log_2 0.6 + 0.2 \log_2 0.2 + 0.2 \log_2 0.2) = 1.37$ bits • Information content: $IC = \log_2 20 - H = 4.32 - 1.37 = 2.95$ bits • “The more conserved a position, the higher its information content”
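
The D, D, D, E, G column from this slide makes a convenient test case. The sketch below assumes that the uncertainty "after" is the Shannon entropy of the observed column frequencies and that the uncertainty "before" is log2 of the alphabet size; the function names are illustrative and not taken from PeCoP or any other tool.

```python
import math
from collections import Counter


def shannon_entropy_bits(column):
    """Uncertainty 'after': Shannon entropy (bits) of the residue frequencies in a column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_content_bits(column, alphabet_size=20):
    """Reduction in uncertainty: log2(alphabet size) minus the column entropy."""
    return math.log2(alphabet_size) - shannon_entropy_bits(column)


column = "DDDEG"  # P_D = 0.6, P_E = 0.2, P_G = 0.2
entropy = shannon_entropy_bits(column)
ic = information_content_bits(column)
```

Rounded to two decimals, `entropy` comes to 1.37 bits and `ic` to 2.95 bits. A fully conserved column such as "DDDDD" gives the maximum of about 4.32 bits.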

  22. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  23. Division by Prior Frequencies: “Conserved” vs. “Distinct” • A conserved position has a high frequency of any given amino-acid type in the MSA column. • “High frequency” meaning: 1) a high frequency in the column? or 2) a higher-than-expected frequency in the column? • Higher-than-expected: based on the frequencies of residue types in the “sequence universe” (SwissProt) • Question: “How conserved is a position?” Do not divide by priors. Use $IC = \log_2 20 + \sum_a p_a \log_2 p_a$, where $p_a$ is the frequency of amino-acid type $a$ in the column • Question: “How distinct is a position?” Divide by priors. Use $RE = \sum_a p_a \log_2 (p_a / q_a)$, where $q_a$ is the prior frequency of amino-acid type $a$ • Surprise! When dividing by priors: relative entropy
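
The difference between "conserved" and "distinct" can be sketched as follows. The relative entropy is taken here to be the usual sum of p * log2(p / q) over the residues observed in the column. The `skewed` prior table is a hypothetical stand-in used only to show the effect of a rare residue; it is not the SwissProt frequency table the slide refers to.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy_bits(column, priors):
    """Relative entropy (bits) of the column frequencies with respect to prior frequencies."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) / priors[res])
        for res, n in counts.items()
    )


# Uniform priors: every amino acid is expected with frequency 1/20.
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}

# Hypothetical skewed priors (NOT real SwissProt values): D is made rare.
skewed = {aa: 0.99 / 19 for aa in AMINO_ACIDS}
skewed["D"] = 0.01

re_uniform = relative_entropy_bits("DDDEG", uniform)
re_rare_d = relative_entropy_bits("DDDDD", skewed)
```

With uniform priors the relative entropy reduces to the plain information content (about 2.95 bits for this column). With the skewed table, a column full of the rare residue scores well above the 4.32-bit ceiling of the no-priors measure, which is what "distinct" is meant to capture.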

  24. 20 Amino Acids… or Less? • A conserved position has a high frequency of any given amino-acid type in the MSA column. • “Amino acid type” meaning: • 1) There are 20 amino acid types • 2) There are fewer, because they can be grouped into similar physico-chemical types

  25. 20 Amino Acids… or Less?
      Representative letter   Physico-chemical property   Included residue types
      F                       Hydrophobic                 A, V, L, I, M, C
      R                       Aromatic                    F, W, Y, H
      O                       Polar                       S, T, N, Q
      T                       Positive                    R, K
      N                       Negative                    E, D
      P                       Proline                     P
      G                       Glycine                     G
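
The grouping in this table translates directly into a lookup. The sketch below maps each residue to its representative letter and then measures information content over the reduced alphabet. Using log2(7) as the maximum uncertainty for seven groups is an assumption of this example, and all names in it are illustrative.

```python
import math
from collections import Counter

# Representative letter -> included residue types, copied from the table.
REDUCED_ALPHABET = {
    "F": "AVLIMC",  # hydrophobic
    "R": "FWYH",    # aromatic
    "O": "STNQ",    # polar
    "T": "RK",      # positive
    "N": "ED",      # negative
    "P": "P",       # proline
    "G": "G",       # glycine
}
TO_GROUP = {res: letter for letter, members in REDUCED_ALPHABET.items() for res in members}


def reduce_column(column):
    """Rewrite a column of residues using the representative letters."""
    return "".join(TO_GROUP[res] for res in column)


def information_content_bits(column, alphabet_size):
    """log2(alphabet size) minus the Shannon entropy of the column."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


reduced = reduce_column("DDDEG")  # D and E fall into the same (negative) group
ic_reduced = information_content_bits(reduced, alphabet_size=len(REDUCED_ALPHABET))
```

A column that mixes D and E is only partly conserved in the 20-letter alphabet but fully conserved in the reduced one, since both residues map to the same group.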

  26. IC: Remember This • Information content == reduction in uncertainty. Used for measuring positional conservation • “The more conserved a position, the higher its information content” • We can divide (or not) by expected prior frequencies • We can group (or not) the 20 amino acids into a smaller alphabet

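  27. Possible Schemes for Calculating Positional Conservation (PSI-BLAST Nucleation Center Detection) • Alphabet: 20-letter or reduced • Priors: with or without • Four combinations: 20-letter with priors, 20-letter without priors, reduced with priors, reduced without priors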

  28. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  29. PeCoP: Locating Important Residues (Friedberg & Margalit) • PSI-BLAST against a large sequence database (nr) • Output: sequence alignments, statistics, profiles • Find conserved positions • Locate important residues

  30. Find Conserved Positions: Set a Threshold • The IC distribution over a sequence is first normalized to mean == 0, SD == 1 • A threshold is then set on the normalized values
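
The normalization described here is a z-score transform. The slide does not say which cutoff to use or whether the population or the sample standard deviation is meant, so in this sketch the cutoff is a free parameter, the population SD is an arbitrary choice, and the IC values are made-up numbers for illustration.

```python
import statistics


def zscores(values):
    """Normalize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def conserved_positions(ic_values, z_threshold):
    """Indices of positions whose normalized IC is at or above the threshold."""
    return [i for i, z in enumerate(zscores(ic_values)) if z >= z_threshold]


# Made-up per-position IC values (bits) for a five-residue sequence.
ic_values = [4.32, 0.5, 0.7, 2.95, 0.6]
strict = conserved_positions(ic_values, z_threshold=1.0)
lenient = conserved_positions(ic_values, z_threshold=0.5)
```

Because the threshold is applied after normalization, the same cutoff can be reused across sequences whose raw IC distributions differ.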

  31. Find Conserved Positions: Conservation over Profiles • Determine conservation in a profile according to one of the four schemes discussed • But PSI-BLAST gives us several profiles (nIterations -1) • Therefore, a position is conserved if it retains conservation through successive iterations. • But retention does not have to be 100%

  32. Retention Schemes • Majority vote: if a position is conserved in x out of n iterations, it is considered conserved. • Persistent conservation: conservation in the first & last iteration
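
Both retention schemes reduce to simple set operations once each iteration has its own set of conserved positions. The sketch below assumes that input shape (one collection of position indices per PSI-BLAST profile, in iteration order); the function names and the example positions are invented for illustration.

```python
from collections import Counter


def majority_vote(conserved_per_iteration, min_iterations):
    """Positions conserved in at least min_iterations of the n profiles."""
    tally = Counter(
        pos for conserved in conserved_per_iteration for pos in set(conserved)
    )
    return sorted(pos for pos, n in tally.items() if n >= min_iterations)


def persistent(conserved_per_iteration):
    """Positions conserved in both the first and the last profile."""
    first = set(conserved_per_iteration[0])
    last = set(conserved_per_iteration[-1])
    return sorted(first & last)


# Made-up conserved positions for four successive profiles.
iterations = [{1, 4, 7, 9}, {1, 4, 9}, {1, 4, 12}, {1, 7, 12}]
by_vote = majority_vote(iterations, min_iterations=3)
by_persistence = persistent(iterations)
```

The two schemes can disagree. In this example position 4 passes the 3-of-4 vote but is missing from the last profile, while position 7 is persistent but appears in only two of the four profiles.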

  33. Persistent Conservation • Positions conserved in close family members may be conserved due to evolutionary non-divergence, and not solely due to a structural / functional role. Hence, a supply of false positives. • Positions conserved in distant family members may be marked as such due to an observed drift from the original sequence. False positives again, but for a different reason. The intersection of the above two findings minimizes both types of errors

  34. PeCoP • Determine conservation according to the following parameters: • Any one of the four IC schemes AND • Set a threshold AND • Choose a retention scheme • [Screenshots: PeCoP submission form; PeCoP results]

  35. Getting PSI-BLAST Profiles According to Different Conservation Schemes • In ncbitools: ncbi/tools/posit.c, lines 1826 – 2689:
      #ifdef POSIT_DEBUG
      // the code here is concerned with matrix output,
      // and normally commented out
      // play around with it…
      #endif
      • Can NCBI provide this output by use of a command-line argument?

  36. Why Not Parse PSI-BLAST Alignments? • Speed: parsing is slow, esp. when using a scripting language • Not all alignments appear in the output (default 250) • Sequence weighting and profile construction are already provided for • NCBI keeps changing the output format: the programmer has to keep changing the parser

  37. Why Parse PSI-BLAST Alignments? Gain more information: • Assign sequence weight and filtering parameters according to specific needs • Use annotation: inline or linked. • Realign sequences, and construct own profile • PSI-BLAST source code keeps changing • As of v. 2.1.2: XML and (2.2.1) tabulated (no alignment) output

  38. Post-BLAST Information Flow: Bioinformatics • PSI-BLAST output: annotations, sequence alignments, statistics, profiles • Used for: locating homologues, function prediction (if function unknown), homology modeling, fold prediction, tree building

  39. Post Blast Processing Many modules, but: • Most are application-specific. • Some are web-resources only. • Bad licenses, machine-specific, not written for distribution purposes, etc. Result: need to rewrite the same stuff over (and over.. and over..).

  40. Blastology & Open Source: Needs & Deeds • How PSI-BLAST works (basically…) • Post PSI-BLAST processing • PeCoP: • Information content • Different measures and their purposes • Implementation • Bio* tools • NCBI tools • What do we still need?

  41. Bio*.org Projects • Collaborative projects aimed at providing programming tools for bioinformatics under an open-source license • Bio{Perl | Java | Python} : procedural • Bio{CORBA | MOBY}: interface, web access standardization The Open Bioinformatics Foundation

  42. Bio*.org and Post PSI-BLAST Processing

  43. NCBI and Post Blast Processing • Language: C/C++ • ASN.1 was around long before XML • seqalign.asn • Now (v. 2.1.1) there is also XML output format, DTDs are there. • Web APIs, for WWW-based PSI-BLAST runs • Public domain, no license

  44. What is Needed? • Annotation handling. PB output has rudimentary annotation only. The rest is served by links. Transfer into MySQL? • Translate parsed output into multiple sequence alignment objects, and then into PSSMs • Direct PB residue frequency output • CORBA: do we need a format-aware object? • Anything else you can think of………

  45. Summary • PSI-BLAST profiles have become the method-of-choice for “doing things” when a high detection sensitivity is required BUT… • Profiles can and should be interpreted carefully • Results should be interpreted carefully • Do NOT write your own PSI-BLAST parser. Please write something we need!

  46. Further Reading • http://www.ncbi.nlm.nih.gov • http://open-bio.org • Books: • Durbin R. et al. Biological Sequence Analysis. Cambridge University Press (Chapter 9) • Papers: • http://www.ncbi.nlm.nih.gov/BLAST/blast_references.html • Blastology: • W. Li, F. Pio, K. Pawlowski and A. Godzik: Saturated Blast: detecting distant homology using automated multiple intermediate sequence Blast search. Bioinformatics (2000) 16:1105-1110 • W. Li, L. Jaroszewski and A. Godzik: Clustering of highly homologous sequences from large sequence protein databases. Bioinformatics (2001) 17:282-283 • W. Li, L. Jaroszewski and A. Godzik: Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics (2002) 18:77-82 • W. Li and A. Godzik: Discovering new genes with advanced homology detection. Trends in Biotech (2002) 20:315-6 • I. Friedberg, T. Kaplan and H. Margalit: Evaluation of PSI-BLAST Alignment Accuracy in Comparison to Structural Alignments. Protein Science (2000) 9(11):2278-84 • I. Friedberg and H. Margalit: Persistently Conserved Positions in Structurally-Similar, Sequence Dissimilar Proteins: Roles in Preserving Protein Fold and Function. Protein Science (2002) 11(2):350-360 • I. Friedberg and H. Margalit: PeCoP: automatic determination of persistently conserved positions in protein families. Bioinformatics (2002) 18(9):1276-77 • Conserved positions: • Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999 Aug 6;291(1):177-96 • Reddy BV, Li WW, Shindyalov IN, Bourne PE: Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins. 2001 Feb 1;42(2):148-63 • Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol. 2001 Apr 13;307(5):1487-502

  47. Thanks to.. • Hanah Margalit • Adam Godzik • Bio{java | perl | python}.org folks • Jeff Bizzaro http://bioinformatics.org/pecop

  48. The End

  49. Check the Following when Running PSI-BLAST for PBP: • Number of sequences printed (if making own profile from printed sequences). • E-value inclusion threshold for next iteration (rec: 0.001). • Low complexity masking? • Substitution matrix used?

  50. PSI-BLAST 101 (contd.) Exports: • Multiple sequence alignments • Annotation links • Statistical data
