1 / 98

A Field Guide part 2

A Field Guide part 2. University of Colorado Health Sciences Center. August 30, 2005. Part 2. Entrez: text searching a GenBank record preview/index. BLAST: sequence searching pre-computed searches algorithms what’s new?. VAST: structure searching.

huy
Download Presentation

A Field Guide part 2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Field Guidepart 2 University of Colorado Health Sciences Center August 30, 2005

  2. Part 2 Entrez: text searching • a GenBank record • preview/index BLAST: sequence searching • pre-computed searches • algorithms • what’s new? VAST: structure searching • Example: mapping oligos to a genome

  3. Header Feature Table Sequence GenBank Records The Flatfile Format

  4. LOCUS NM_019570 4279 bp mRNA linear INV 28-OCT-2004 DEFINITION Mus musculus REV1-like(S. cerevisiae)(Rev1l),mRNA ACCESSION NM_019570 VERSION NM_019570.3 GI:50811869 KEYWORDS . = Title A Typical GenBank Record

  5. GenBank Record: Feature Table

  6. GenBank Record: Feature Table, con’t. GenPept identifier

  7. skip GenBank Record: sequence

  8. [accn] [orgn] [mdat] [prop] Indexing for Nucleotide UID 59958365 FieldIndexed Terms [primary accession] NM_001012399 [title] Bos taurus hemochromatosis (hfe), mRNA. [organism] Bos taurus [sequence length] 1168 [modification date] 2005/02/19 [properties] biomol mrna gbdiv mam srcdb refseq

  9. Global Entrez Search: HFE HFE

  10. [Title] Entrez Nucleotide: HFE 137 records Not HFE

  11. 42 records Curated HFE splice variants (11 total) Smarter Query hfe[title] AND human[orgn]

  12. hfe[title]ANDhuman[orgn] (con’t) Primary data

  13. Preview/Index

  14. Preview/Index

  15. srcdb Preview/Index: Properties, srcdb Properties

  16. Preview/Index: Properties, srcdb …AND srcdb refseq[Properties]

  17. Preview/Index: Properties, srcdb …AND srcdb ddbj/embl/genbank[Properties]

  18. Primate division gbdiv pri[prop] EST division gbdiv est[prop] Database Queries #1hfe 137 #2 hfe[title]AND human[orgn] 42 #3 #2 AND srcdb refseq[prop] 11 #4 #2 AND srcdb ddbj/embl/genbank[prop] 31 #5 #4 AND gbdiv pri[prop] 29 #4 #4 AND gbdiv est[prop] 2

  19. Genomic DNA biomol genomic[prop] cDNA biomol mrna[prop] Molecule Queries #1hfe 116 #2 hfe[title]AND human[orgn] 42 #3 #2 AND biomol mrna[prop] 29 #4 #2 AND biomol genomic[prop] 13

  20. Entrez Nucleotide Reviewed RefSeqs with transcript variants: srcdb refseq reviewed[prop]ANDtranscript[title] AND variant[title] Entrez Gene Topoisomerase genes from Archaea: topoisomerase[gene name]ANDarchaea[organism] Genes on human chromosome 2 with OMIM links 2[chromosome] ANDhuman[organism]AND“gene omim”[filter] Membrane proteins linked to cancer: “integral to plasma membrane”[gene ontology]ANDcancer[dis] More Queries… Fields are database-specific

  21. Other Entrez Databases UniGene:rat clusters that have at least one mRNA rat[organism] NOT0[mrna count] SNP:uniquely mapped microsatellites on human chr2 microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn] UniSTS:markers on the Genethon map of human chromosome 12 Genethon[Map Name] ANDhuman[organism] AND12[chromosome] Structure:structures of bacterial kinases with resolutions below 2 Å bacteria[organism]ANDkinaseAND000.00:002.00[resolution]

  22. Basic Local Alignment Search Tool

  23. BLAST Web Searches, 2005 200,000

  24. Nucleotide or protein:Related Sequences • BLAST link:BLink Precomputed BLAST Services • Transcript clusters:UniGene • Protein homologs: HomoloGene

  25. Link to Related Sequences

  26. Related Sequences Most similar Least similar

  27. BLink (BLAST Link)

  28. Best hits 3D structures CDD-Search BLink Output

  29. Seq 1 Seq 1 Seq 2 Seq 2 Global alignment Local alignment Global vs Local Alignment

  30. Global Seq1: 1 W--HEREISWALTERNOW 16 W HERE Seq2: 1 HEWASHEREBUTNOWISHERE 21 Local Seq1: 1 W--HERE 5 Seq1: 1 W--HERE 5 W HERE W HERE Seq2: 3 WASHERE 9 Seq2: 15 WISHERE 21 Global vs Local Alignment Seq1: WHEREISWALTERNOW (16aa) Seq2: HEWASHEREBUTNOWISHERE (21aa)

  31. The Flavors of BLAST • Standard BLAST • nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx) • traditional “contiguous” word hit • Megablast • optimized for large batch searches • can use discontiguous words • PSI-BLAST • constructs PSSMs automatically; uses as query • very sensitive protein search • RPS BLAST • searches a database of PSSMs • tool for conserved domain searches “contiguous” discontiguous

  32. Why Is BLAST So Popular? • Fast - heuristic approach based on Smith Waterman • Localalignments • Statisticalsignificance -Expect value • Versatile - blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast -www, standalone, and network clients

  33. How BLAST Works • Make lookup table of “words” for query • Scan database for hits • Ungapped extensions of hits (initial HSPs) • Gapped extensions (no traceback) • Gapped extensions (traceback; alignment details)

  34. GTACTGGACATGGACCCTACAGGAA Query: 11-mer Nucleotide Words GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT Make a lookup table of words . . .

  35. GTQITVEDLFYNIATRRKALKN Query: Word size = 3 (default) Word size can only be 2 or 3 Neighborhood Words LTV, MTV, ISV, LSV, etc. Protein Words GTQ TQI QIT ITV TVE VED EDL DLF ... Make a lookup table of words [ -f 11 = blastp default ]

  36. Minimum Requirements for a Hit ATCGCCATGCTTAATTGGGCTT CATGCTTAATT one exact match • Nucleotide BLAST requires one exact match • Protein BLAST requires two neighboring matches within 40 aa GTQITVEDLFYNI SEI YYN neighborhood words [ -A 40 = blastp default ] two matches

  37. example query words Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV… HFL 18 HFV 15 HFS 14 HWL 13 NFL 13 DFL 12 HWV 10 etc … YLS 15 YLT 12 YVS 12 YIT 10 etc … Neighborhood words Neighborhood score threshold T (-f) =11 Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47 YLSHFL Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 +E YA YL K F+L +SP+ +DVNVHP+K V+++ I Gapped extension with trace back Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + + Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337 Final HSP BLASTP Summary High-scoring pair (HSP)

  38. Scoring Systems - Nucleotides Identity matrix A G C T A +1 –3 –3 -3 G –3 +1 –3 -3 C –3 –3 +1 -3 T –3 –3 –3 +1 [ -r 1 -q -3 ] CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| raw score = 19-9 = 10 CACGTAGCAAGCTTG-GTGTCA

  39. Scoring Systems - Proteins • Position Independent Matrices • PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used • BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved blocks • Each matrix derived separately from blocks with a defined percent identity cutoff • BLOSUM62 - default matrix for BLAST • Position Specific Score Matrices (PSSMs) • PSI- and RPS-BLAST

  40. F Negative for less likely substitutions Y Positive for more likely substitutions D D F BLOSUM62 A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X

  41. PSSM scores 1 5 7 4 4 Position-Specific Score Matrix Serine/Threonine protein kinases catalytic loop DAF-1

  42. Position-Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3 catalytic loop

  43. E = Kmne-S or E = mn2-S’ K = scale for search space  = scale for scoring system S’ = bitscore = (S - lnK)/ln2 (applies to ungapped alignments) Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution Expect Value E = number of database hits you expect to find by chance, ≥ S your score Alignments expected number of random hits Score (S) More info:www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

  44. Gapped Alignments • Gapping provides more biologically realistic alignments • Gapped BLAST parameters are simulated for each scoring matrix • Affine gap costs = -(a+bk) • a = gap open penalty b = gap extend penalty • A gap of length 1 receives the score -(a+b)

  45. An Alignment BLAST Cannot Make 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC Reason: no contiguous exact match of 7 bp.

  46. Score = 290 bits (741), Expect = 7e-77Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)Frame = +3 BLAST 2 Sequences (blastx) output: An Alignment BLAST Can Make Solution: compare protein sequences; BLASTX

  47. Other BLAST Algorithms • Megablast • Discontiguous Megablast • PSI-BLAST • PHI-BLAST

  48. Megablast: NCBI’s Genome Annotator • Long alignments of similar DNA sequences • Greedy algorithm • Concatenation of query sequences • Faster than blastn; less sensitive

  49. Too fast for you? MegaBLAST & Word Size Trade-off: sensitivity vs speed

  50. blastp 3 2 WORD SIZE default minimum blastn 11 7 megablast 28 8 MegaBLAST & Word Size Trade-off: sensitivity vs speed

More Related