A Field Guide part 2

Presentation Transcript

  1. A Field Guide, part 2 • University of Colorado Health Sciences Center • August 30, 2005

  2. Part 2 • Entrez: text searching (a GenBank record; Preview/Index) • BLAST: sequence searching (pre-computed searches; algorithms; what’s new?) • VAST: structure searching • Example: mapping oligos to a genome

  3. GenBank Records: The Flatfile Format • Header • Feature Table • Sequence

  4. A Typical GenBank Record
     LOCUS       NM_019570  4279 bp  mRNA  linear  INV  28-OCT-2004
     DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA   (= Title)
     ACCESSION   NM_019570
     VERSION     NM_019570.3  GI:50811869
     KEYWORDS    .

  5. GenBank Record: Feature Table

  6. GenBank Record: Feature Table, cont. • GenPept identifier

  7. GenBank Record: Sequence

  8. Indexing for Nucleotide UID 59958365
     Field abbreviations shown: [accn] [orgn] [mdat] [prop]
     Field                  Indexed Terms
     [primary accession]    NM_001012399
     [title]                Bos taurus hemochromatosis (hfe), mRNA.
     [organism]             Bos taurus
     [sequence length]      1168
     [modification date]    2005/02/19
     [properties]           biomol mrna; gbdiv mam; srcdb refseq

  9. Global Entrez Search: HFE

  10. [Title] Entrez Nucleotide: HFE 137 records Not HFE

  11. Smarter Query: hfe[title] AND human[orgn] • 42 records • Curated HFE splice variants (11 total)

  12. hfe[title] AND human[orgn] (cont.) • Primary data

  13. Preview/Index

  14. Preview/Index

  15. Preview/Index: Properties, srcdb

  16. Preview/Index: Properties, srcdb …AND srcdb refseq[Properties]

  17. Preview/Index: Properties, srcdb …AND srcdb ddbj/embl/genbank[Properties]

  18. Database Queries
     Primate division: gbdiv pri[prop] • EST division: gbdiv est[prop]
     #1  hfe                                       137
     #2  hfe[title] AND human[orgn]                 42
     #3  #2 AND srcdb refseq[prop]                  11
     #4  #2 AND srcdb ddbj/embl/genbank[prop]       31
     #5  #4 AND gbdiv pri[prop]                     29
     #6  #4 AND gbdiv est[prop]                      2
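The query history above narrows one search by AND-ing field-tagged terms onto an earlier result. A minimal Python sketch of that composition follows. The helper names (`term`, `all_of`, `esearch_url`) are hypothetical, and the use of the NCBI E-utilities `esearch.fcgi` endpoint with the `db` and `term` parameters and the database name `nuccore` is an assumption that does not come from the slides. The sketch only builds strings and performs no network request.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def term(value, field):
    """Return an Entrez term restricted to one field, e.g. hfe[title]."""
    return f"{value}[{field}]"


def all_of(*clauses):
    """Combine clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def esearch_url(db, query):
    """Build (but do not fetch) a search URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Query #2 from the slide, then query #3 built on top of it.
q2 = all_of(term("hfe", "title"), term("human", "orgn"))
q3 = all_of(f"({q2})", term("srcdb refseq", "prop"))

print(q2)
print(q3)
print(esearch_url("nuccore", q3))  # "nuccore" is an assumed database name
```

Wrapping the earlier query in parentheses keeps its meaning intact when more clauses are appended, which mirrors how `#2 AND ...` reuses a numbered result in the history.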

  19. Molecule Queries
     Genomic DNA: biomol genomic[prop] • cDNA: biomol mrna[prop]
     #1  hfe                                       116
     #2  hfe[title] AND human[orgn]                 42
     #3  #2 AND biomol mrna[prop]                   29
     #4  #2 AND biomol genomic[prop]                13

  20. More Queries… (fields are database-specific)
     Entrez Nucleotide
     • Reviewed RefSeqs with transcript variants: srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
     Entrez Gene
     • Topoisomerase genes from Archaea: topoisomerase[gene name] AND archaea[organism]
     • Genes on human chromosome 2 with OMIM links: 2[chromosome] AND human[organism] AND “gene omim”[filter]
     • Membrane proteins linked to cancer: “integral to plasma membrane”[gene ontology] AND cancer[dis]

  21. Other Entrez Databases
     • UniGene: rat clusters that have at least one mRNA: rat[organism] NOT 0[mrna count]
     • SNP: uniquely mapped microsatellites on human chr2: microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
     • UniSTS: markers on the Genethon map of human chromosome 12: Genethon[Map Name] AND human[organism] AND 12[chromosome]
     • Structure: structures of bacterial kinases with resolutions below 2 Å: bacteria[organism] AND kinase AND 000.00:002.00[resolution]

  22. Basic Local Alignment Search Tool

  23. BLAST Web Searches, 2005 200,000

  24. Precomputed BLAST Services • Nucleotide or protein: Related Sequences • BLAST link: BLink • Transcript clusters: UniGene • Protein homologs: HomoloGene

  25. Link to Related Sequences

  26. Related Sequences Most similar Least similar

  27. BLink (BLAST Link)

  28. BLink Output • Best hits • 3D structures • CDD-Search

  29. Global vs Local Alignment (figure: Seq 1 and Seq 2 shown in a global alignment and in a local alignment)

  30. Global vs Local Alignment
     Seq1: WHEREISWALTERNOW (16aa)
     Seq2: HEWASHEREBUTNOWISHERE (21aa)
     Global
       Seq1:  1   W--HEREISWALTERNOW  16
                  W  HERE
       Seq2:  1 HEWASHEREBUTNOWISHERE 21
     Local
       Seq1:  1 W--HERE  5
                W  HERE
       Seq2:  3 WASHERE  9

       Seq1:  1 W--HERE  5
                W  HERE
       Seq2: 15 WISHERE 21
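The difference between the two alignment modes can be sketched with a small dynamic-programming scorer. The sketch below is illustrative only: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap cost -1) are assumptions chosen for simplicity, not the parameters BLAST uses. The global mode scores both sequences end to end. The local mode lets an alignment start and stop anywhere and never lets a cell drop below zero.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b with a linear gap cost.

    local=False: end-to-end (global) score.
    local=True:  best-scoring pair of substrings (local) score.
    """
    # First row: all zeros for local, accumulated gap costs for global.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart here
                best = max(best, cell)   # and may end at any cell
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global:", alignment_score(seq1, seq2, local=False))
print("local: ", alignment_score(seq1, seq2, local=True))
```

Because the local score ignores everything outside the best-matching region, it can never be lower than the global score for the same pair and scoring values.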

  31. The Flavors of BLAST
     • Standard BLAST: nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx); traditional “contiguous” word hit
     • Megablast: optimized for large batch searches; can use discontiguous words
     • PSI-BLAST: constructs PSSMs automatically and uses them as the query; very sensitive protein search
     • RPS-BLAST: searches a database of PSSMs; tool for conserved domain searches
     (figure labels: “contiguous” word, discontiguous word)

  32. Why Is BLAST So Popular? • Fast: heuristic approach based on Smith-Waterman • Local alignments • Statistical significance: Expect value • Versatile: blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast; www, standalone, and network clients

  33. How BLAST Works • Make lookup table of “words” for query • Scan database for hits • Ungapped extensions of hits (initial HSPs) • Gapped extensions (no traceback) • Gapped extensions (traceback; alignment details)

  34. Nucleotide Words • Make a lookup table of words • Query: GTACTGGACATGGACCCTACAGGAA • 11-mers: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, . . .
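The word list on this slide is every overlapping 11-letter window of the query. A minimal sketch of building such a lookup table follows. The function names `words` and `lookup_table` are hypothetical, and a plain dictionary stands in for whatever data structure BLAST actually uses.

```python
def words(query, w):
    """All overlapping words of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def lookup_table(query, w):
    """Map each word to the list of 0-based query positions where it starts."""
    table = {}
    for pos, word in enumerate(words(query, w)):
        table.setdefault(word, []).append(pos)
    return table


query = "GTACTGGACATGGACCCTACAGGAA"
for word in words(query, 11)[:8]:   # the eight words listed on the slide
    print(word)
print(len(words(query, 11)), "words in total")
```

Scanning a database sequence then reduces to sliding the same window along it and checking each window against the table.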

  35. Protein Words • Make a lookup table of words • Query: GTQITVEDLFYNIATRRKALKN • Word size = 3 (default); word size can only be 2 or 3 • Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ... • Neighborhood words: LTV, MTV, ISV, LSV, etc. • [ -f 11 = blastp default ]

  36. Minimum Requirements for a Hit
     • Nucleotide BLAST requires one exact match (sequence ATCGCCATGCTTAATTGGGCTT, word CATGCTTAATT)
     • Protein BLAST requires two neighboring matches within 40 aa (sequence GTQITVEDLFYNI; neighborhood words SEI, YYN) [ -A 40 = blastp default ]

  37. BLASTP Summary
     Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
     Example query words: YLS, HFL
     Neighborhood words (neighborhood score threshold T (-f) = 11):
       HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc …
       YLS 15, YLT 12, YVS 12, YIT 10, etc …
     High-scoring pair (HSP):
       Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
                 +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
       Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
     Final HSP (gapped extension with traceback):
       Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
                 +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
       Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
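The scores next to the YLS neighborhood words are sums of per-position substitution scores, and only words scoring at least T are kept. A minimal sketch of that calculation follows. The names `PAIR_SCORES`, `word_score`, and `neighborhood` are hypothetical, and the table holds only the handful of BLOSUM62 entries (copied from the matrix on slide 40) that these four words need, so it is not a general scorer.

```python
# Subset of BLOSUM62 taken from the matrix on slide 40 (symmetric pairs).
PAIR_SCORES = {
    ("Y", "Y"): 7, ("L", "L"): 4, ("S", "S"): 4,
    ("L", "V"): 1, ("L", "I"): 2, ("S", "T"): 1,
}


def pair_score(a, b):
    """Look up a substitution score in either orientation."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def word_score(query_word, candidate):
    """Sum of substitution scores, position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=11):
    """Keep the candidates whose score reaches the threshold T."""
    scored = {c: word_score(query_word, c) for c in candidates}
    return {c: s for c, s in scored.items() if s >= threshold}


print(neighborhood("YLS", ["YLS", "YLT", "YVS", "YIT"]))
```

With T = 11, YIT (score 10) falls below the threshold and is not kept as a neighborhood word.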

  38. Scoring Systems - Nucleotides
     Identity matrix [ -r 1 -q -3 ]
            A   G   C   T
        A  +1  -3  -3  -3
        G  -3  +1  -3  -3
        C  -3  -3  +1  -3
        T  -3  -3  -3  +1
     CAGGTAGCAAGCTTGCATGTCA
     || ||||||||||||  |||||
     CACGTAGCAAGCTTG-GTGTCA
     raw score = 19 - 9 = 10
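The raw score on this slide can be reproduced by walking the two aligned rows column by column. In the sketch below, the function name `raw_score` is hypothetical, and the gap column is charged the same -3 as a mismatch purely so the result matches the slide's arithmetic (19 matches, 3 non-matching columns). That treatment is an assumption, since gapped searches normally use separate gap costs.

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Score two aligned rows of equal length with a match/mismatch scheme.

    Assumption: any non-identical column, including a gap ('-'),
    is charged the mismatch penalty.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(reward if a == b else penalty for a, b in zip(top, bottom))


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 19 matches * (+1) + 3 other columns * (-3) = 10
```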

  39. Scoring Systems - Proteins
     • Position Independent Matrices
       • PAM Matrices (Percent Accepted Mutation)
         • Derived from observation; small dataset of alignments
         • Implicit model of evolution
         • All calculated from PAM1
         • PAM250 widely used
       • BLOSUM Matrices (BLOck SUbstitution Matrices)
         • Derived from observation; large dataset of highly conserved blocks
         • Each matrix derived separately from blocks with a defined percent identity cutoff
         • BLOSUM62 - default matrix for BLAST
     • Position Specific Score Matrices (PSSMs)
       • PSI- and RPS-BLAST

  40. BLOSUM62 • Negative for less likely substitutions • Positive for more likely substitutions (residues highlighted on the slide: F, Y, D)
     A  4
     R -1  5
     N -2  0  6
     D -2 -2  1  6
     C  0 -3 -3 -3  9
     Q -1  1  0  0 -3  5
     E -1  0  0  2 -4  2  5
     G  0 -2  0 -1 -3 -2 -2  6
     H -2  0  1 -1 -3  0  0 -2  8
     I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
     L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
     K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
     M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
     F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
     P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
     S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
     T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
     W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
     Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
     V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
     X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

  41. Position-Specific Score Matrix • Serine/Threonine protein kinases • catalytic loop • DAF-1 • PSSM scores: 1 5 7 4 4

  42. Position-Specific Score Matrix (catalytic loop)
                 A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
     435 K      -1  0  0 -1 -2  3  0  3  0 -2 -2  1 -1 -1 -1 -1 -1 -1 -1 -2
     436 E       0  1  0  2 -1  0  2 -1  0 -1 -1  0  0  0 -1  0  0 -1 -1 -1
     437 S       0  0 -1  0  1  1  0  1  1  0 -1  0  0  0  2  0 -1 -1  0 -1
     438 N      -1  0 -1 -1  1  0 -1  3  3 -1 -1  1 -1  0  0 -1 -1  1  1 -1
     439 K      -2  1  1 -1 -2  0 -1 -2 -2 -1 -2  5  1 -2 -2 -1 -1 -2 -2 -1
     440 P      -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1  0 -3  7 -1 -2 -3 -1 -1
     441 A       3 -2  1 -2  0 -1  0  1 -2 -2 -2  0 -1 -2  3  1  0 -3 -3  0
     442 M      -3 -4 -4 -4 -3 -4 -4 -5 -4  7  0 -4  1  0 -4 -4 -2 -4 -1  2
     443 A       4 -4 -4 -4  0 -4 -4 -3 -4  4 -1 -4 -2 -3 -4 -1 -2 -4 -3  4
     444 H      -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5  0 -5
     445 R      -4  8 -3 -4  0 -1 -2 -3 -2 -5 -4  0 -3 -2 -4 -3 -3  0 -4 -5
     446 D      -4 -4 -1  8 -6 -2  0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
     447 I      -4 -5 -6 -6 -3 -4 -5 -6 -5  3  5 -5  1  1 -5 -5 -3 -4 -3  1
     448 K       0  0  1 -3 -5 -1 -1 -3 -3 -5 -5  7 -4 -5 -3 -1 -2 -5 -4 -4
     449 S       0 -3 -2 -3  0 -2 -2 -3 -3 -4 -4 -2 -4 -5  2  6  2 -5 -4 -4
     450 K       0  3  0  1 -5  0  0 -4 -1 -4 -3  4 -3 -2  2  1 -1 -5 -4 -4
     451 N      -4 -3  8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
     452 I      -3 -5 -5 -6  0 -5 -5 -6 -5  6  2 -5  2 -2 -5 -4 -3 -5 -3  3
     453 M      -4 -4 -6 -6 -3 -4 -5 -6 -5  0  6 -5  1  0 -5 -4 -3 -4 -3  0
     454 V      -3 -3 -5 -6 -3 -4 -5 -6 -5  3  3 -4  2 -2 -5 -4 -3 -5 -3  5
     455 K      -2  1  1  4 -5  0 -1 -2  1 -4 -2  4 -3 -2 -3  0 -1 -5 -2 -3
     456 N       1  1  3  0 -4 -1  1  0 -3 -4 -4  3 -2 -5 -2  2 -2 -5 -4 -4
     457 D      -3 -2  5  5 -1 -1  1 -1  0 -5 -4  0 -2 -5 -1  0 -2 -6 -4 -5
     458 L      -3 -1  0 -3  0 -3 -2  3 -4 -2  3  0  1  1 -2 -2 -3  5 -1 -3
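Unlike a position-independent matrix, a PSSM gives each residue a different score at each position. The sketch below scores a short segment against rows 444 to 446 of the matrix above (the H, R, D positions). The names `COLUMNS`, `PSSM_ROWS`, and `pssm_score` are hypothetical, and only three rows are copied in to keep the example short.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order used on the slide

# Rows 444-446 copied from the slide's matrix.
PSSM_ROWS = {
    444: [-4, -2, -1, -3, -5, -2, -2, -4, 10, -6, -5, -3, -4, -3, -2, -3, -4, -5, 0, -5],
    445: [-4, 8, -3, -4, 0, -1, -2, -3, -2, -5, -4, 0, -3, -2, -4, -3, -3, 0, -4, -5],
    446: [-4, -4, -1, 8, -6, -2, 0, -3, -3, -5, -6, -3, -5, -6, -4, -2, -3, -7, -5, -5],
}


def pssm_score(segment, start=444):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(
        PSSM_ROWS[start + offset][COLUMNS.index(residue)]
        for offset, residue in enumerate(segment)
    )


print(pssm_score("HRD"))  # conserved residues score highly: 10 + 8 + 8
print(pssm_score("HKD"))  # K at position 445 scores 0 instead of 8
```

The same substitution (R to K) that a general matrix would treat as conservative costs 8 points here, because position 445 is strongly conserved in this family.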

  43. E = Kmne-S or E = mn2-S’ K = scale for search space  = scale for scoring system S’ = bitscore = (S - lnK)/ln2 (applies to ungapped alignments) Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution Expect Value E = number of database hits you expect to find by chance, ≥ S your score Alignments expected number of random hits Score (S) More info:www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

  44. Gapped Alignments • Gapping provides more biologically realistic alignments • Gapped BLAST parameters are simulated for each scoring matrix • Affine gap costs = -(a+bk) • a = gap open penalty, b = gap extend penalty, k = gap length • A gap of length 1 receives the score -(a+b)
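The affine cost formula is simple enough to check directly. In the sketch below, the function name `gap_score` is hypothetical, and the penalties 11 (open) and 1 (extend) are assumed example values used only for illustration.

```python
def gap_score(length, gap_open, gap_extend):
    """Affine gap score -(a + b*k) for a gap of k = length positions."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return -(gap_open + gap_extend * length)


# Assumed example penalties: a = 11, b = 1.
for k in (1, 2, 5):
    print(k, gap_score(k, gap_open=11, gap_extend=1))
```

Opening a gap is expensive and lengthening it is cheap, so one long gap costs less than several short ones that cover the same number of positions.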

  45. An Alignment BLAST Cannot Make
       1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
         || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
       1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

      61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
         | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
      61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

     121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
         |||| || ||||| ||  ||    |  | ||||  || |||
     121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
     Reason: no contiguous exact match of 7 bp.
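The stated reason can be checked by measuring the longest run of identical bases along this alignment. The sketch below is a simplification: the function names are hypothetical, and it inspects only the ungapped diagonal shown on the slide (position i against position i), not every possible word pair between the two sequences as a real database scan would.

```python
QUERY = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
SUBJECT = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def longest_exact_run(a, b):
    """Length of the longest stretch of identical letters at the same positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def can_seed(a, b, word_size):
    """True if the diagonal contains an exact match of at least word_size."""
    return longest_exact_run(a, b) >= word_size


print(longest_exact_run(QUERY, SUBJECT))       # longest exact stretch on this diagonal
print(can_seed(QUERY, SUBJECT, word_size=7))   # no 7 bp word, so no seed here
```

Without a seed word, the extension steps never start, however many scattered identities the alignment contains.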

  46. An Alignment BLAST Can Make • Solution: compare protein sequences; BLASTX • BLAST 2 Sequences (blastx) output: Score = 290 bits (741), Expect = 7e-77 • Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%) • Frame = +3

  47. Other BLAST Algorithms • Megablast • Discontiguous Megablast • PSI-BLAST • PHI-BLAST

  48. Megablast: NCBI’s Genome Annotator • Long alignments of similar DNA sequences • Greedy algorithm • Concatenation of query sequences • Faster than blastn; less sensitive

  49. MegaBLAST & Word Size • Too fast for you? • Trade-off: sensitivity vs speed

  50. MegaBLAST & Word Size • Trade-off: sensitivity vs speed
     WORD SIZE    default   minimum
     blastp           3         2
     blastn          11         7
     megablast       28         8