
Lecture 4: Practical use of sequence alignment methods and introduction of projects



Presentation Transcript


  1. CZ5225 Methods in Computational Biology Lecture 4: Practical use of sequence alignment methods and introduction of projects

  2. Sequence Alignment Methods • Pairwise alignment: best-matching • Global alignment • Local alignment • Multiple alignment • Software • FASTA • Clustal • BLAST (Basic Local Alignment Search Tool) • PSI-BLAST (detecting remote homologues) • HMM-based methods (detecting remote homologues)

  3. Pairwise Alignment Algorithms • Needleman-Wunsch • Global alignment only. • Smith-Waterman • Local alignment. • Both rely on a substitution matrix (BLOSUM, PAM, etc.) and a gap-scoring scheme (affine gaps with separate opening and extension penalties, etc.) • Dynamic programming is fairly demanding of time and memory resources, hence heuristic tools such as FASTA and BLAST
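
The difference between the two dynamic-programming algorithms on this slide can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it uses a flat match/mismatch score and a linear gap penalty (assumed values) in place of a BLOSUM/PAM matrix and affine gaps, and it returns only the best score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning all of a against all of b."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                           prev[j] + gap,       # gap in b
                           cur[j - 1] + gap))   # gap in a
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere (local behaviour).
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Both functions fill a table of size len(a) × len(b), which is the time cost the slide refers to; keeping only two rows saves memory for the score but would not be enough to recover the alignment itself.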

  4. Multiple Sequence Alignment • FASTA: superseded by BLAST • BLAST: emphasizes the balance between speed and sensitivity • PSI-BLAST: profile alignments, remote homology identification • HMM: profile alignments, remote homology identification • Clustal: profile alignments

  5. BLAST Programs • There are five different BLAST programs, which are distinguished by the type of the query sequence (DNA or protein) and the type of the subject database: • BLASTP compares an amino acid query sequence against a protein sequence database. • BLASTN compares a nucleotide query sequence against a nucleotide sequence database. • BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. • TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). • TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
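
The "six reading frames (both strands)" used by BLASTX, TBLASTN and TBLASTX can be illustrated as follows. This is a sketch only: the frame labels (+1..+3, -1..-3) are an assumed naming convention, and the codon-to-amino-acid translation step is left out.

```python
# Complement table for unambiguous DNA letters (assumption: no IUPAC codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return the six reading frames, each trimmed to a whole number of codons."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            usable = len(s) - offset
            end = offset + usable - usable % 3  # drop a trailing partial codon
            frames[f"{strand}{offset + 1}"] = s[offset:end]
    return frames
```

For example, six_frames("ATGGCCTAA") gives "ATGGCCTAA" for frame +1 and "TTAGGCCAT" for frame -1; each frame would then be translated codon by codon before being compared with protein sequences.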

  6. Practical Use of BLAST • Curate the information database (data collection). • Transform the sequence data. • Run formatdb to index the sequence database for BLAST. • Run BLAST against the designed database. • Identify the homologues from the BLAST results. • Score the BLAST hits according to their E-value and their drug susceptibility.

  7. Preparation: Get the BLAST package • Why do we need a local version? • Where to get the software package? • http://www.ncbi.nlm.nih.gov/blast/ • Tree Structure after unpacking:

  8. 2. The sequence data transformation • Convert any sequence format to FASTA format • Each record starts with a greater-than symbol (>) followed by the description line; the sequence follows on the next lines:
      >Example1 envelope protein
      ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
      QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
      HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
      MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
      >Example2 envelope protein
      ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
      QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
      HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFN
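
A minimal sketch of reading and writing the FASTA layout described on this slide, using only the Python standard library. The function names and the 60-residue line width are illustrative assumptions (the width matches the example records above); a library such as Biopython could be used instead.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def to_fasta(description, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [">" + description]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

Converting "any sequence format to FASTA" then reduces to extracting an identifier and a sequence from the source format and passing them to to_fasta.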

  9. 3. formatdb: indexing the sequence database for BLAST • formatdb -i ecoli.nt -p F -o T • -i Input file(s) for formatting [File In] Optional • -p Type of file [T/F] Optional, default = T (T = protein, F = nucleotide) • -s Create indexes limited only to accessions (sparse) [T/F] Optional, default = F • -V Verbose: check for non-unique string ids in the database [T/F] Optional, default = F • -o Parse options [T/F] Optional, default = F (T = parse SeqId and create indexes; F = do not parse SeqId, do not create indexes) • -F Gifile (file containing list of gi's) [File In] Optional • … … • formatdb.exe -i ourOwnDatabase -p T -o T

  10. 4. Do BLAST against the designed database • blastall arguments: • -p Program name [String] • -d Database [String], default = nr • -i Query file [File In], default = stdin • -e Expectation value (E) [Real], default = 10.0 • -v Number of database sequences to show one-line descriptions for, default = 500 • -b Number of database sequences to show alignments for, default = 250

  11. 4. Do BLAST against the designed database • EXAMPLE:
      blastall -p blastp -d db/swissprot -i Q9Y5N1.txt -o Q9Y5N1.out
      blastall -p blastp -d db/swissprot -e 1 -i Q9Y5N1.txt -o Q9Y5N1.out
      Contents of Q9Y5N1.txt:
      >newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
      MERAPPDGPLNASGALAGEAAAAGGARGFSAAWTAVLAALMALLIVATVLGNALVMLAFV
      ADSSLRTQNNFFLLNLAISDFLVGAFCIPLYVPYVLTGRWTFGRGLCKLWLVVDYLLCTS
      SAFNIVLISYDRFLSVTRAVSYRAQQGDTRRAVRKMLLVWVLAFLLYGPAILSWEYLSGG
      SSIPEGHCYAEFFYNWYFLITASTLEFFTPFLSVTFFNLSIYLNIQRRTRLRLDGAREAA
      GPEPPPEAQPSPPPPPGCWGCWQKGHGEAMPLHRYGVGEAAVGAEAGEATLGGGGGGGSV
      ASPTSSSGSSSRGTERPRSLKRGSKPSASSASLEKRMKMVSQSFTQRFRLSRDRKVAKSL
      AVIVSIFGLCWAPYTLLMIIRAACHGHCVPDYWYETSFWLLWANSAVNPVLYPLCHHSFR
      RAFTKLLCPQKLKIQPHSSLEHCWK
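
When many queries have to be run, the blastall command line from this slide can be assembled programmatically. The helper below is a hypothetical sketch: it only builds the argument list from the flags shown on the slides (-p, -d, -i, -o, -e) and does not run anything; executing it would require the legacy BLAST package to be installed.

```python
# The five program names listed earlier in the lecture.
BLAST_PROGRAMS = {"blastp", "blastn", "blastx", "tblastn", "tblastx"}


def blastall_command(program, database, query_file, output_file, evalue=None):
    """Build a blastall argument list (hypothetical helper; nothing is executed)."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    cmd = ["blastall", "-p", program, "-d", database,
           "-i", query_file, "-o", output_file]
    if evalue is not None:
        cmd += ["-e", str(evalue)]  # otherwise blastall uses its default of 10.0
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True); passing a list rather than a single string avoids shell-quoting problems with unusual file names.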

  12. 4. Do BLAST against the designed database • Q9Y5N1.out:
      Query= newSP|Q9Y5N1|HH3R_HUMAN Histamine H3 receptor (HH3R) (G protein-coupled receptor 97)
               (445 letters)
      Database: swissprot
                 172,892 sequences; 63,586,428 total letters
                                                                        Score    E
      Sequences producing significant alignments:                      (bits) Value
      sp|Q9Y5N1|HRH3_HUMAN Histamine H3 receptor (HH3R) (G-protein cou...   668   0.0
      …………………..
      sp|P18871|ADA2A_PIG Alpha-2A adrenergic receptor (Alpha-2A adren...   105   2e-022
      sp|Q9N2B2|HRH1_PANTR Histamine H1 receptor                            105   2e-022
      …………………………………..
      Database: swissprot
        Posted date: Jul 8, 2005 9:35 PM
        Number of letters in database: 63,586,428
        Number of sequences in database: 172,892
      …………………………………
      Matrix: BLOSUM62
      Gap Penalties: Existence: 11, Extension: 1
      Number of Hits to DB: 40,317,827
      Number of Sequences: 172892
      …………………………………………….
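
The relation between the bit score and the E-value in this report can be checked with the standard formula E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below plugs in the raw lengths printed in the report; BLAST itself uses shorter "effective" lengths, so this is an approximation and is expected to come out somewhat larger than the reported value.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw lengths (an assumption); BLAST applies an edge correction
    and works with effective lengths instead.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Numbers taken from the report above: 445-letter query,
# 63,586,428 database letters, bit score 105.
e = expect_value(105, 445, 63_586_428)
print(f"{e:.1e}")
```

The result is roughly 7e-22, the same order of magnitude as the 2e-022 reported for the 105-bit hits. Each additional bit of score halves the expected number of chance hits.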

  13. Parsing and interpreting the results • BioJava • BioPerl • BioRuby • Biopython • Or your own code. Why?
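
As an illustration of the "your own code" option, the one-line hit summaries of a plain-text BLAST report can be pulled out with a regular expression. This is a sketch under assumptions: it expects each summary line to end with an integer bit score followed by an E-value, as in the Q9Y5N1.out excerpt, and it is not a full report parser (the Bio* toolkits handle many more cases).

```python
import re

# identifier, description, bit score, E-value; assumed layout of a summary line.
HIT_LINE = re.compile(r"^(\S+)\s+(.*?)\s+(\d+)\s+(\d\S*|e-\d+)\s*$")


def parse_hits(report_lines):
    """Return (identifier, description, bit_score, evalue) for each summary line."""
    hits = []
    for line in report_lines:
        m = HIT_LINE.match(line.strip())
        if not m:
            continue
        ident, desc, score, evalue = m.groups()
        if evalue.startswith("e"):
            evalue = "1" + evalue  # legacy reports may print "e-105" for 1e-105
        try:
            hits.append((ident, desc, int(score), float(evalue)))
        except ValueError:
            continue  # last column was not a number after all
    return hits


def significant(hits, cutoff=1e-5):
    """Keep hits whose E-value is at or below the cutoff (cutoff is an assumed value)."""
    return [h for h in hits if h[3] <= cutoff]
```

Writing the parser by hand makes it easy to attach project-specific information to each hit, at the cost of having to track changes in the report layout.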

  14. Workflow for handling batched BLAST queries (shell programming) • Prepare the jobs and put them into the queue • Handle each individual request • Analyze and output the result after each job request • Finally, collect the remaining results and finalize the report • Possible languages: Basic/Bash/C/C++/C#/Java/Python/Perl/R/Ruby/Tcl
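
The four workflow steps can be sketched as a small driver loop. Everything here is illustrative: run_one and summarize are caller-supplied placeholders (for example, a function that invokes blastall and one that parses its report), not part of any BLAST package.

```python
from collections import deque


def run_batch(queries, run_one, summarize):
    """Process (name, sequence) jobs one by one and collect a final report.

    run_one(name, sequence) and summarize(raw_result) are hypothetical
    callables provided by the caller.
    """
    queue = deque(queries)                 # step 1: prepare and queue the jobs
    report, failures = {}, {}
    while queue:
        name, sequence = queue.popleft()
        try:
            raw = run_one(name, sequence)  # step 2: handle an individual request
            report[name] = summarize(raw)  # step 3: analyze the result per job
        except Exception as exc:
            failures[name] = str(exc)      # keep going if one job fails
    return report, failures                # step 4: collect and finalize
```

Recording failures separately lets a long batch finish even when individual queries are malformed, and the failed names can be re-queued later.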

  15. Another way to BLAST like a robot • BLAST URL API (from NCBI): http://www.ncbi.nlm.nih.gov/blast/Blast.cgi

  16. A Sample URL • http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put&PROGRAM=blastn&DATABASE=nr&FILTER=L&QUERY=AF123456 • CMD=Put: submit a query • PROGRAM=blastn: run BLASTN • DATABASE=nr: search against nr • FILTER=L: turn low-complexity filtering on • QUERY=AF123456: accession, GI, or FASTA • An interim update to the BLAST URL API, still being reviewed, is at: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/node_0.html • Quote from: NCBI, Programming with BLAST
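
The sample URL can be assembled from its parameters with the standard library, which also takes care of escaping when the query is a FASTA record and not a plain accession. The function name and its defaults are assumptions for illustration; only the parameter names and values come from the slide, and the endpoint is the one quoted there (it may have changed since).

```python
from urllib.parse import urlencode

# Endpoint as quoted on the slide.
BLAST_CGI = "http://www.ncbi.nlm.nih.gov/blast/Blast.cgi"


def put_url(query, program="blastn", database="nr", low_complexity_filter=True):
    """Build a CMD=Put submission URL (hypothetical helper; nothing is sent)."""
    params = [("CMD", "Put"), ("PROGRAM", program), ("DATABASE", database)]
    if low_complexity_filter:
        params.append(("FILTER", "L"))
    params.append(("QUERY", query))
    # urlencode percent-escapes characters such as ">" and spaces in the query.
    return BLAST_CGI + "?" + urlencode(params)
```

Calling put_url("AF123456") reproduces the URL shown on the slide.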

  17. NCBI BLAST Server (architecture diagram) • End users send a search request to Blast.cgi and receive a request ID (RID); results are later retrieved by RID through the formatter • splitd splits the query into chunks for distributed computing on multiple available CPUs (Intel Pentium Linux 2-way farms) • The database server loads databases if needed • Finished chunks are merged by the merger daemon to generate the final blastalign object • Alignments are stored in mssql, with a replicated backup mssql • Quote from: NCBI, Programming with BLAST

  18. Posting a URL • Create the user agent: $ua = LWP::UserAgent->new • Build the HTTP request: $req = new HTTP::Request POST • Send the request to NCBI and receive the HTTP response: $response = $ua->request($req) • Quote from: NCBI, Programming with BLAST
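
For comparison, the same request/response pattern in Python's standard library looks roughly like this. It is a sketch, not a translation of NCBI's sample: build_post is a hypothetical helper that only constructs the request object, and actually sending it is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, params):
    """Build an HTTP POST request carrying form-encoded parameters (not sent)."""
    data = urlencode(params).encode("ascii")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Here Request plays the role of HTTP::Request, and urllib.request.urlopen(req) would play the role of $ua->request($req), returning the HTTP response object. POST is preferable to a long GET URL when the query is a full FASTA sequence.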

  19. Introduction of projects • Drug Resistant Mutation Data Collection and Database development • The scoring matrix development by sequence variations and their drug susceptibility data • Prediction of drug resistant mutations
