
Bioinformatics



Presentation Transcript


  1. Bioinformatics Step 1: BLAST *with thanks to KT Scott

  2. Updating the Curriculum • In contrast to biological research, biology education has changed relatively little in the past two decades • Involving students in research is the best way to inspire students to the goal of a career in science • 2020: All students

  3. Teaching with Bioinformatics • Algorithms and databases are the experimental tools • “Mathematical” articulations of biological principles

  4. Teaching BLAST • BLAST as entry into multiple databases • Considering BLAST as an “experiment” • Using the tool as an opportunity to illustrate concepts in biology

  5. Starts with a sequence in FASTA format >637785566 YP_391116 microcompartments protein MSTEYGIALGMIETRGLVPAIEAADAMTKAAEVRLVSREFVGGGYVTVLV RGETGAVNAAVRAGADACERVGDGLVAAHIIARPHKEVEPVLALGNSSPD RS • “>” denotes the “description line” (not read as sequence data)
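A minimal sketch of how the description line can be separated from the sequence data, assuming the record is laid out with the ">" line first and the residues wrapped over the following lines. The function name `parse_fasta` is hypothetical and the line breaks in the example are an assumption about the original slide layout.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Everything else is sequence data; wrapped lines are joined.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>637785566 YP_391116 microcompartments protein
MSTEYGIALGMIETRGLVPAIEAADAMTKAAEVRLVSREFVGGGYVTVLV
RGETGAVNAAVRAGADACERVGDGLVAAHIIARPHKEVEPVLALGNSSPD
RS
"""

records = parse_fasta(example)
description, sequence = records[0]
print(description)
print(len(sequence), "residues")
```

The description line is kept as free text; only the lines after it are treated as sequence.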

  6. BLAST Results Page

  7. Example of a BLAST Alignment

  8. What is an alignment? • Comparison of n.a. or a.a. sequences by lining them up • Pairwise (2 sequences) • Multiple (>2) • A good alignment juxtaposes • Active sites • Evolutionarily and functionally related portions of the sequences

  9. Global vs local alignment • Global • Whole sequence of both proteins is used • E.g. the Needleman-Wunsch algorithm, for overall similarity • Local • Finds areas of greatest similarity between two proteins; it may not align the full length • E.g. BLAST
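The difference between the two modes can be sketched with a small dynamic-programming scorer. This is an illustration under simple assumptions (match +1, mismatch -1, linear gap cost -2, all hypothetical values), using textbook global (Needleman-Wunsch style) and local (Smith-Waterman style) recurrences; it is not how BLAST itself computes alignments.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b, global by default, local if requested."""
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0] * (len(b) + 1) if local else [j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


a = "TTTGATTACATTT"
b = "CCCGATTACACCC"
print("global:", alignment_score(a, b))
print("local: ", alignment_score(a, b, local=True))
```

The two sequences share only a central region, so the local score reflects that region alone while the global score is pulled down by the unrelated flanks.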

  10. Homologous or not…. • Identity and similarity

  11. A Substitution Matrix (partial) Identity and Similarity… Used to quantify the similarity between amino acids or nucleotides in an alignment

  12. The Scoring Matrix • Dictates the scores for each amino acid identity or substitution in an alignment of query and subject sequences • Different matrices (PAM and BLOSUM) assign different penalties and scores for amino acid substitutions; more in a moment.

  13. Biochemistry Digression • Biological basis for scores? • How do proteins evolve? • How do we know? • Correlated changes • (Structure sometimes reveals common ancestry that is no longer apparent in the primary structure)

  14. BLAST as an Experiment

  15. BLAST as an Experiment • Expect • Word size • Matrix • Gap costs • Filter • Mask

  16. The BLAST Algorithm • Segments query sequence into “words” and scores potential word matches • Scans this list for alignments that meet a threshold score T • uses a scoring matrix to calculate this • Uses this list of ‘synonyms’ to scan the database • Extends the alignments to see if they meet a cutoff score S • uses a scoring matrix to calculate this • Reports the alignments that exceed S

  17. Phase 1: Segmenting the query sequence and scoring potential word matches (compile) • BLAST: • Segments the query sequence into pieces (“words”) • Default word length: 3 amino acids or 11 nucleotides • Creates a list of scores for comparing query words to target words • Uses scoring matrix to calculate scores for words that might be found in the database • Saves the scores (and words) exceeding a given threshold T From Pevsner, via KT Scott
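Phase 1 can be sketched as follows. The scoring function here is a toy stand-in (identity +4, substitution within a hypothetical "similar residue" group +1, anything else -2), not BLOSUM or PAM, and the names `query_words` and `neighborhood` are illustrative. Words scoring at or above the threshold are kept.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy groups of "similar" residues; an assumption for illustration only.
GROUPS = ["ILVM", "KR", "DE", "ST", "FYW", "NQ"]


def toy_score(x, y):
    """Toy substitution score standing in for a real scoring matrix."""
    if x == y:
        return 4
    if any(x in g and y in g for g in GROUPS):
        return 1
    return -2


def query_words(query, w=3):
    """Segment the query into overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """All candidate words whose score against `word` reaches the threshold."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = query_words("MSTEYG")
print(words)
print(sorted(neighborhood("MST", threshold=9).items()))
```

Raising the threshold shrinks the list of 'synonyms' toward exact matches only; lowering it admits more distant words and makes the later database scan slower but more sensitive.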

  18. Phase 2: Scanning the database • BLAST • scans the database for matches to the word list with acceptable T values • Requires two matches (“hits”) within the target sequence • Sets aside sequences with matches above T for further analysis

  19. Phase 3: Extending the hits • BLAST • Searches 5’ and 3’ of the word hit on both the query and target sequence • Adds up the score for sequence identity or similarity until value exceeds S • Alignment is dropped from subsequent analyses if value never exceeds S
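A rough sketch of the extension phase, assuming a simple ungapped extension with match +2 and mismatch -1 (hypothetical values). The stopping rule used here, giving up once the running score falls more than `x_drop` below the best score seen, is an assumption added to make the loop terminate sensibly; the slide only states the comparison against the cutoff S.

```python
def extend_hit(query, target, q_start, t_start, word_len, cutoff_s,
               match=2, mismatch=-1, x_drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, aligned query segment, whether score reaches cutoff_s).
    """
    def score(x, y):
        return match if x == y else mismatch

    seed = sum(score(query[q_start + k], target[t_start + k])
               for k in range(word_len))

    def extend(q, t, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            running += score(query[q], target[t])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has dropped too far; stop extending
            q += step
            t += step
        return best, best_len

    right, right_len = extend(q_start + word_len, t_start + word_len, +1)
    left, left_len = extend(q_start - 1, t_start - 1, -1)
    total = seed + left + right
    segment = query[q_start - left_len:q_start + word_len + right_len]
    return total, segment, total >= cutoff_s


# Seed word "TTA" found at position 4 in both sequences.
result = extend_hit("AAGATTACAGG", "CCGATTACATT", 4, 4, 3, cutoff_s=10)
print(result)
```

Only the best-scoring stretch on each side is kept, so mismatching flanks do not drag the reported score down.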

  20. The BLAST Algorithm Step 1 and the Importance of Scoring Matrices • Segments query sequence into “words” and scores potential word matches • Scans this list for alignments that meet a threshold score T • uses a scoring matrix to calculate this • Uses this list of ‘synonyms’ to scan the database • Extends the alignments to see if they meet a cutoff score S • uses a scoring matrix to calculate this • Reports the alignments that exceed S

  21. A Scoring Matrix (partial)

  22. Building a Scoring Matrix • Scores in the matrix are based primarily on the frequency with which a given residue in the query sequence aligns with another residue in a homologous sequence in the database. • Because these frequencies generally cannot be known a priori, they must be based on empirical evidence. • Choice of which related sequences to use as empirical data for determination of frequencies differentiates each scoring matrix and its benefits.

  23. PAM and BLOSUM Matrices • Both are empirically based: • Rely on similarity scores derived by aligning amino acid sequences from proteins known to be homologous • PAM (1978): • Similarity scores were based on global alignments of closely related proteins and extrapolated to more distantly related ones • BLOSUM (1992): • Similarity scores were based on local alignments of more distantly related proteins

  24. Selecting Scoring Matrices • Choose a matrix appropriate to the suspected degree of sequence identity between the query and its target sequences • PAM: empirically derived for close relatives • BLOSUM: empirically derived for distant relatives Figure: Pevsner 2003, Bioinformatics and Functional Genomics

  25. Step 2: Raw scores from each potential alignment

  26. Raw Scores (S) • Calculated by counting the number of identities, mismatches, gaps and “-” characters in the alignment • $S = aI + bX - cO - dG$, where • I is the number of identities in the alignment • X is the number of mismatched letters • O is the number of gaps • G is the total number of “-” characters

  27. $S = aI + bX - cO - dG$ There’s nothing inexplicable here…. • I is the number of identities in the alignment and a is the reward for each identity • X is the number of mismatched letters and b is the score for each mismatch (normally negative) • O is the number of gaps and c is the penalty for opening the gap • G is the total number of “-” characters and d is the penalty for each “-” in the gap

  28. $S = aI + bX - cO - dG$ There’s nothing inexplicable here…. • I is the number of identities in the alignment and a is the reward for each identity • X is the number of mismatched letters and b is the score for each mismatch (normally negative) • O is the number of gaps and c is the penalty for opening the gap; c = 11 • G is the total number of “-” characters and d is the penalty for each “-” in the gap; d = 1
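The formula can be worked through on a small gapped alignment. In this sketch the gap penalties c = 11 and d = 1 come from the slide, while a = 5 and b = -4 are hypothetical values chosen for the example, and the aligned strings are invented.

```python
def raw_score(aligned_query, aligned_subject, a, b, c, d):
    """Compute S = a*I + b*X - c*O - d*G for two aligned, equal-length strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    identities = mismatches = gap_openings = gap_chars = 0
    gap_in = None  # which sequence the current run of "-" is in, if any
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            gap_chars += 1
            if side != gap_in:
                gap_openings += 1  # a new gap starts here
            gap_in = side
        else:
            gap_in = None
            if q == s:
                identities += 1
            else:
                mismatches += 1
    return a * identities + b * mismatches - c * gap_openings - d * gap_chars


# I = 6 identities, X = 1 mismatch, O = 1 gap, G = 1 "-" character.
score = raw_score("GATT-ACA", "GACTTACA", a=5, b=-4, c=11, d=1)
print(score)
```

With these numbers the single gap costs 12 in total (11 to open plus 1 for its one "-"), more than two identities are worth, which is why gapped alignments need substantial similarity on either side of the gap to score well.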

  29. Going from Raw Scores to Bit Scores • Raw scores (S) are obtained from the BLOSUM or PAM matrices and gap penalties • Bit scores (S’) are calculated by correcting unitless raw scores (S) for the statistical parameters (λ and K) of the specific matrices and search spaces (normalizing parameters). • $S' = (\lambda S - \ln K)/\ln 2$ • Larger raw scores result in larger bit scores • Allows user to compare scores obtained by using different matrices and search spaces
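The conversion is a one-line calculation. In the sketch below, the parameter values λ (lambda) = 0.267 and K = 0.041 are the figures commonly quoted for BLOSUM62 with gap costs 11 and 1; treat them as an assumption and read the actual values from the statistics reported with your own BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed parameters (see the note above); raw scores are arbitrary examples.
LAMBDA, K = 0.267, 0.041
for s in (50, 100, 200):
    print(s, round(bit_score(s, LAMBDA, K), 1))
```

Because the relationship is linear in S, ranking hits by raw score or by bit score gives the same order within one search; the bit score only matters when comparing across matrices or searches.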

  30. E-value • Number of distinct alignments with scores greater than or equal to a given value expected to occur in a search against a database of known size • For small E-values, it is approximately the probability of this alignment (score) occurring by chance • E.g. an alignment with $E = 1 \times 10^{-4}$ is expected to occur by chance about 1 in 10,000 times

  31. E-value • Calculated from: • Bit scores (S’) – a measure of similarity • Length of the query sequence • m = effective length of the query sequence = length of query sequence – average length of alignments • (Controls for fewer alignments occurring at the ends of the query sequence) • Size of the database • n = effective length of the database sequence (total number of bases or residues) • $E = mn \times 2^{-S'}$ (m and n define the search space; the larger the S’, the smaller the E) • The value of E decreases exponentially with increasing S
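A short numeric sketch of the relationship E = mn × 2^(-S'), with S' the bit score. The effective lengths m and n used below are hypothetical round numbers, not values from any real database.

```python
def e_value(bit_score, m, n):
    """Expected number of chance alignments: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bit_score)


m = 100          # hypothetical effective query length
n = 1_000_000    # hypothetical effective database length
for s_prime in (20, 30, 40, 50):
    print(s_prime, e_value(s_prime, m, n))
```

Each additional bit of score halves E, and doubling the database size doubles E for the same alignment, which is why the same hit can look less significant when searched against a larger database.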

  32. (Changing) Search Summary

  33. Expect You have control …

  34. Expect • Alignments will be reported with E-values less than or equal to the expect values threshold • Setting a larger E threshold will result in more reported hits • Setting a smaller E threshold will result in fewer reported hits
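The effect of the Expect threshold can be sketched as a simple filter. The hit names and E-values below are invented purely for illustration, and `report` is a hypothetical helper, not part of any BLAST interface.

```python
# Hypothetical hits: (identifier, E-value). Invented for illustration only.
hits = [("hit_A", 3e-50), ("hit_B", 2e-5), ("hit_C", 0.8), ("hit_D", 7.5)]


def report(hits, expect_threshold):
    """Keep hits whose E-value is less than or equal to the threshold."""
    return [name for name, e in hits if e <= expect_threshold]


print(report(hits, expect_threshold=10))     # permissive: more hits reported
print(report(hits, expect_threshold=1e-3))   # strict: fewer hits reported
```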

  35. Filter Mask

  36. Filters and Masking • Low complexity • Replaces the following with N (nucleotides) or X (amino acids) • Dinucleotide repeats • Amino acid repeats • Leader sequences • Stretches of hydrophobic residues • Mask lowercase • Replaces lowercase letters in sequence with N or X • Lowercase letters typically indicate base or amino acid not known with certainty
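The "mask lowercase" option amounts to a character substitution, sketched below. The function name and the example sequences are hypothetical; low-complexity filtering is a separate, more involved procedure and is not shown.

```python
def mask_lowercase(sequence, is_protein=True):
    """Replace lowercase letters with X (protein) or N (nucleotide)."""
    mask = "X" if is_protein else "N"
    return "".join(mask if ch.islower() else ch for ch in sequence)


print(mask_lowercase("MSTeygIALG"))                    # protein example
print(mask_lowercase("ACGTacgtAC", is_protein=False))  # nucleotide example
```

Masked positions cannot seed word hits, so the masked stretch no longer contributes matches to the search.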

  37. BLAST Results Page

  38. Mindless BLAST • Believing that E tells the whole story. • Ignoring length of match. • BLAST is a local alignment tool • Disregarding biological function

  39. Evaluating a High E-value: Distant Homolog or Garbage? • Take another look at the target (subject) sequence(s) that have high E-values • Similar length? • Recurring motifs? • Similar biological functions? • Use target sequences as query sequences for another BLAST search • Does the original query sequence come up in report?

  40. Different Forms of BLAST • blastp • blastn • blastx • tblastn • tblastx • BLAST2 • Biology Workbench (http://workbench.sdsc.edu/) is a good source of tools

  41. Annotations are Hypotheses

  42. Bioinformatic Microbiology

  43. Solution to a relatively low CO2 concentration • To deal with this, bacteria have evolved carbon concentrating mechanisms • [Figure: carboxysome diagram; labels: HCO3- transporter, carboxysome, CA, HCO3-, CO2, RuBP, Rubisco, PGA] • This is thought to increase CO2 concentration 1000-fold.

  44. Goal: Examine counterpart in Thiomicrospira crunogena

  45. Mini Assignment • NCBI exploration • BLAST IMG • BLAST PDB • GOAL: 10 sequences of CsoS3 for use during the rest of the workshop

  46. NCBI BLAST • Navigate to NCBI BLAST at http://blast.ncbi.nlm.nih.gov. Click on “protein blast.” • Run defaults • Run Swissprot • Change database back to Non-redundant protein sequences, Matrix to PAM30, and Max target sequences to 20000 and run again. Scroll to top of “Alignments,” click “Select All,” and click “Get Selected Sequences.” Note number of sequences retrieved. • Change Matrix to BLOSUM45 and run again. Scroll to top of “Alignments,” click “Select All,” and click “Get Selected Sequences.” Note number of sequences retrieved. • Check “Low complexity regions” under “Filter” and run again. Scroll to top of “Alignments,” click “Select All,” and click “Get Selected Sequences.” Note number of sequences retrieved.
