
BINF350, Tutorial 4 Karen Marshall




  1. BLAST in practice • BINF350, Tutorial 4 • Karen Marshall

  2. Aim • Examine how BLAST parameters (e.g. scoring scheme, word length) affect the alignment outcome • Optimise BLAST parameters for alignments with different levels of sequence homology

  3. Practical: Part 1 • Start with an ~200 bp original DNA sequence • Simulate mutation events over time and collect sequences • BLAST the original sequence against the mutated sequences • Repeat the BLASTs using different parameters • (Diagram labels: Mutated sequences, Blast, Original sequence)

  4. Simulation of mutated sequences • Point accepted mutation (PAM) model of molecular evolution • 1 PAM = 1 mutation per 100 bases on average • 1 PAM ≈ 99.0% sequence homology • 10 PAM ≈ 90.6% sequence homology • 50 PAM ≈ 63.5% sequence homology • Concept of forward and backward mutation
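
The homology figures above can be approximated in closed form. The sketch below assumes that each site mutates with probability 0.01 per PAM step and that a mutated site moves to one of the three other bases with equal probability; the function name `expected_identity` is hypothetical. Because back mutation can restore the original base, identity decays towards 25% rather than towards 0%.

```python
def expected_identity(pam_steps, rate=0.01):
    """Expected fraction of sites matching the original sequence.

    Assumes each site mutates with probability `rate` per step and
    that a mutation picks one of the three other bases uniformly.
    """
    # Recurrence: p[n+1] = p[n] * (1 - rate) + (1 - p[n]) * rate / 3
    # Solving it gives a decay towards the random-match level of 1/4.
    return 0.25 + 0.75 * (1.0 - 4.0 * rate / 3.0) ** pam_steps


for n in (1, 10, 50):
    print(n, "PAM:", round(100 * expected_identity(n), 1), "% identity")
```

Under these assumptions the values come out close to the figures quoted on the slide; small differences are possible if the slide used a slightly different model.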

  5. for each ‘successive PAM’
       for each ‘nucleotide’
         draw rand once
         if (rand > 0.01) do not mutate
         else (rand <= 0.01) mutate by random selection from the non-identical bases
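
A minimal Python sketch of the loop above, assuming upper-case A/C/G/T input. The names `mutate_once` and `simulate` and the `seed` argument are hypothetical and were added only so that runs can be repeated; the tutorial's own simulation program may differ.

```python
import random

BASES = "ACGT"


def mutate_once(seq, rate=0.01, rng=random):
    """Apply one PAM step: each base mutates with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() > rate:
            out.append(base)  # do not mutate
        else:
            # mutate by random selection from the non-identical bases
            out.append(rng.choice([b for b in BASES if b != base]))
    return "".join(out)


def simulate(original, pam_steps, rate=0.01, seed=None):
    """Return {step: sequence} for steps 0..pam_steps."""
    rng = random.Random(seed)
    seq = original
    snapshots = {0: original}
    for step in range(1, pam_steps + 1):
        seq = mutate_once(seq, rate, rng)
        snapshots[step] = seq
    return snapshots
```

Collecting the sequence at, say, steps 10 and 20 gives a set of mutated sequences at known PAM distances from the original, which can then be written to a FASTA file for BLAST.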

  6. BLAST - Heuristic • Steps 1, 2, 3 • Suffix tree • Lookup table • Words/seeds • Location • Threshold T • Larger seq file
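
The lookup-table idea can be sketched as follows. This is a deliberately simplified illustration, not BLAST's implementation: it uses a plain dictionary, finds exact word matches only, and leaves out the threshold T and the extension of seeds into alignments. The function names, and the choice of which sequence to index, are assumptions made for the example.

```python
from collections import defaultdict


def build_lookup(db_seq, word_len=11):
    """Map every word of length `word_len` to its start positions."""
    table = defaultdict(list)
    for i in range(len(db_seq) - word_len + 1):
        table[db_seq[i:i + word_len]].append(i)
    return table


def seed_hits(query, table, word_len=11):
    """Return (query_pos, db_pos) pairs where a query word occurs exactly."""
    hits = []
    for q in range(len(query) - word_len + 1):
        # .get avoids adding empty entries to the defaultdict
        for d in table.get(query[q:q + word_len], ()):
            hits.append((q, d))
    return hits
```

A shorter word length produces more seeds, so more diverged sequences can still be found, at the cost of more chance hits to examine.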

  7. BLAST http://www.ncbi.nih.gov/BLAST/blast_whatsnew.shtml • BLAST 2.2.8 release notes • Correction to tblastx alignment computation • ia32-linux now requires glibc 2.2.5 • Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040204/ncbi.tar.gz . • Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/ . • BLAST 2.2.7 release notes • Standalone BLAST is now available for amd64-linux. • formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons. • The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines. • tblastn query concatenation now works correctly on 64-bit platforms. • The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries. • Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040202/ncbi.tar.gz . • Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/ .

  8. BLAST on your own machine • Allows you to BLAST multiple sequences • most web versions are single sequence only • Steps • Sequence files in FASTA format: can have multiple sequences in each file but no duplicates • Format the larger sequence file into a database: formatdb -i dbfile.txt -p F -o T • Perform BLAST using appropriate switches: blastall -p blastn -d dbfile.txt -i comp.txt -o out.txt

  9. BLAST 2.2.8 • Arguments: see appendix of handout • -W for seed word length (default = 11) • -r reward for a match (default = 1) • -q penalty for a mismatch (default = -3) • -G cost to open a gap • -E cost to extend a gap • -F filter query sequence • -e to set threshold expectation (threshold for HSP before gaps are included) • -m to specify different output options
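
Repeating the BLAST with different parameters is easier if the command lines are generated rather than typed. The sketch below only builds the argument lists, using the switches listed above and the file names from the previous slide; it does not run anything. The helper names, the output file naming, and the particular word lengths and scoring pairs are illustrative assumptions, and the mismatch penalty is passed as a negative number on the assumption that blastall expects it that way.

```python
from itertools import product


def blastall_command(db, query, out, word=11, reward=1, penalty=-3, fmt=3):
    """Build a blastall (blastn) argument list; nothing is executed."""
    return [
        "blastall", "-p", "blastn",
        "-d", db, "-i", query, "-o", out,
        "-W", str(word),     # seed word length
        "-r", str(reward),   # reward for a match
        "-q", str(penalty),  # penalty for a mismatch (assumed negative)
        "-m", str(fmt),      # output format
    ]


def sweep(db, query, words=(7, 11), scorings=((1, -3), (5, -4))):
    """One command per combination of word length and scoring pair."""
    commands = []
    for word, (reward, penalty) in product(words, scorings):
        out = f"out_W{word}_r{reward}_q{abs(penalty)}.txt"
        commands.append(
            blastall_command(db, query, out, word, reward, penalty)
        )
    return commands


for cmd in sweep("dbfile.txt", "comp.txt"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` on a machine where blastall is installed; the outputs can then be concatenated and annotated.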

  10. Example of BLAST output: -m 3

                                                  Score     E
      Sequences producing significant alignments: (bits)  Value
      1_10                                          170   3e-046
      0_0                                           170   3e-046
      4_10                                          115   2e-029
      2_10                                          107   4e-027
      5_10                                           96   2e-023
      3_10                                           96   2e-023
      4_20                                           68   3e-015
      2_20                                           68   3e-015
      5_20                                           56   1e-011

      QUERY 1  agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc 60
      1_10  1  ............................................................ 60
      0_0   1  ............................................................ 60
      4_10  3    ....t.....c......ag..................a.................... 60
      2_10  1  ............a..c....a...........a................g.......... 60
      5_10  2   ........c......a.........g............................c.... 60
      3_10  1  .................g........t.....................c.....a..... 60
      4_20  3    ....t.....c......ag....a.....g.......a.................... 60
      2_20  1  ............a..c...ta...........aa......c..a.....g.....      55
      5_20  4     ......c..c...a....g....g..............a......c......c.... 60

  11. Substitution scores • Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991) • Of importance is the match to mismatch score ratio
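
To see why the match to mismatch ratio matters, the scores can be written as log-odds values for a target level of identity. The sketch below is a simplified reading of that approach, not a reproduction of the published tables: it assumes equal base frequencies (1/4 each) and that all mismatches are equally likely. The function name `log_odds_scores` is hypothetical.

```python
import math


def log_odds_scores(identity):
    """Match and mismatch log-odds scores for a target identity.

    Assumes uniform base frequencies and equally likely mismatches,
    and requires 0.25 < identity < 1.
    """
    # Aligned identical pair: frequency identity/4 vs. 1/16 by chance.
    match = math.log(4.0 * identity)
    # Aligned differing pair: frequency (1 - identity)/12 vs. 1/16.
    mismatch = math.log(4.0 * (1.0 - identity) / 3.0)
    return match, mismatch


for identity in (0.99, 0.906, 0.635):
    match, mismatch = log_odds_scores(identity)
    print(f"identity {identity:.3f}: mismatch/match = {mismatch / match:.2f}")
```

At 99% identity the mismatch score is roughly three times the size of the match score, which is consistent with the default reward and penalty; as identity falls, the mismatch penalty should become smaller relative to the match reward.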

  12. Substitution scores • ‘Better’ substitution matrices exist, but are not yet implemented in most BLAST software

  13. Practical: Part 2 • Apply concepts from Part 1 to ‘real sequences’ • BLAST the human and cattle IFNG mRNA sequences against an ~1/2 Mb sequence of human DNA • Use optimal BLAST parameters for the expected homology • (Diagram labels: Human DNA, Blast, Human IFNG mRNA, Cattle IFNG mRNA)

  14. Expected levels of sequence homology • Varies for sequences being considered and genomic region Human to mouse comparison, from …

  15. Efficiency of BLAST • Human to cattle coding sequence ~85% homology (~PAM 15)

  16. IFNG mRNA sequences • Extracted from NCBI website using Batch Entrez • File: INFG_refseq.txt

      >gi|10835170|ref|NM_000619.1| Homo sapiens interferon, gamma (IFNG), mRNA
      TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTA
      CTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCA
      GCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTT
      AAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGA
      ATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAA
      AAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTT
      TTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATG
      TCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAA
      GCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATAT
      TTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTC
      ATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATT
      ATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTAT
      GTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGT
      TTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAA
      ATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTG
      ATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTAC
      CCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACA
      ACT
      >gi|31982948|ref|NM_174086.1| Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA
      ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACT
      CCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTT
      GGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAAT
      GCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA

  17. Human Chr12 sub-sequence • Extracted from UCSC ‘Golden Path’ website • chr12:66,589,493-67,085,092 ~ ½ Mb • contains the IFNG gene • Repeats masked to lower case • File: hs_chr12_subseq.txt

      >hg16_dna range=chr12:66589493-67085092 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower
      CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC
      ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA
      TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT
      GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT
      ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc
      ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata
      agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA
      AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT
      GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT
      CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA
      GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA
      TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT
      TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA
      ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG
      AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT
      GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt
      ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA
      ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG
      TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA
      CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT
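
Since repeats are marked only by letter case, it can be useful to check how much of a sequence is masked before deciding whether to filter. The following is a small sketch for that purpose; `read_fasta` and `masked_fraction` are hypothetical helper names, and the parser handles only simple FASTA text with `>` header lines.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, preserving letter case."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def masked_fraction(seq):
    """Fraction of bases written in lower case (soft-masked repeats)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)
```

For example, `masked_fraction(read_fasta(open("hs_chr12_subseq.txt").read())[header])` would report the masked proportion for one record, where `header` is that record's header line without the leading `>`.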

  18. Human IFNG gene

  19. Human IFNG gene • From UCSC ‘Golden Path’ website genome browser

  20. IFNG against ~1/2 Mb region of Chr 12

  21. Assessment • Submit: • for either Part 1 or Part 2, the BLAST output, concatenated into one file and annotated • a short summary / discussion of the concepts covered in this practical (< 500 words)

  22. References • Strongly recommend the BLAST tutorial on the NCBI site • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html • Further reading: “Bioinformatics for quantitative geneticists course notes”, J. McEwan • http://www-personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC
