1 / 44

BioPerl

BioPerl . Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu. Introduction. Objective of BioPerl: Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. Background: Started in 1995

kirene
Download Presentation

BioPerl

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BioPerl Based on a presentation by Manish Anand/Jonathan Nowacki/ Ravi Bhatt/Arvind Gopu

  2. Introduction • Objective of BioPerl: • Develop reusable, extensible core Perl modules for use as a standard for manipulating molecular biological data. • Background: • Started in 1995 • One of the oldest open source Bioinformatics Toolkit Project

  3. So what is BioPerl? • Higher level of abstraction • Re-usable collection of Perl modules that facilitate bioinformatics application development: • Accessing databases with different formats • Sequence manipulation • Execution and Parsing of the results of molecular biology programs • Catch? BioPerl does not include programs like Blast, ClustalW, etc • Uses system calls to execute external programs

  4. So what is BioPerl? (continued…) • 551 modules (incl. 82 interface modules) • 37 module groups • 79,582 lines of code (223,310 lines total) • 144 lines of code per module • For More info: BioPerl Module Listing

  5. Major Areas covered in Bioperl • Sequences, features, annotations, • Pairwise alignment reports • Multiple sequence alignments • Bibliographic data • Graphical rendering of sequence tracks • Database for features and sequences

  6. Additional things • Gene prediction parsers • Trees, parsing phylogenetic and molecular evolution software output • Population genetic data and summary statistics • Taxonomy • Protein Structure

  7. Downloading modules • Modules can be obtained from: • www.CPAN.org (Perl Modules) • www.BioPerl.org (BioPerl Modules) • Downloading modules from CPAN • Interactive mode • perl -MCPAN -e shell • Batch mode • use CPAN; • clean, install, make, recompile, test

  8. Directory Structure • BioPerl directory structure organization: • Bio/ BioPerl modules • models/ UML for BioPerl classes • t/ Perl built-in tests • t/data/ Data files used for the tests • scripts/ Reusable scripts that use BioPerl • scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl. • doc/ "How To" files and the FAQ as XML

  9. Parsing Sequences • Bio::SeqIO • multiple drivers: genbank, embl, fasta,... • Sequence objects • Bio::PrimarySeq • Bio::Seq • Bio::Seq::RichSeq

  10. Sequence Object Creation Sequence Creation : $sequence = Bio::Seq->new( -seq => ‘AATGCAA’ -display_id => ‘my_sequence’); Flat File Format Support : Raw, FASTA, GCG, GenBank, EMBL, PIR Via ReadSeq: IG, NBRF, DnaStrider, Fitch, Phylip, MSF, PAUP

  11. Sequence object • Common (Bio::PrimarySeq) methods • seq() - get the sequence as a string • length() - get the sequence length • subseq($s,$e) - get a subseqeunce • translate(...) - translate to protein [DNA] • revcom() - reverse complement [DNA] • display_id() - identifier string • description() - description string

  12. Sequence Types Different Sequence Objects: • Seq – Some annotations • RichSeq – Additional annotations • PrimarySeq – Bare minimum annotation ( id, accession number, alphabet) • LocatableSeq – Start, stop and gap information also • LargeSeq – Very long sequences • LiveSeq – Newly sequenced genomes

  13. Using a sequence use Bio::PrimarySeq; my $str = “ATGAATGATGAA”; my $seq = Bio::PrimarySeq->new(-seq => $str, -display_id=>”example”); print “id is “, $seq->display_id,”\n”; print $seq->seq, “\n”; my $revcom = $seq->revcom; print $revcom->seq, “\n”; print “frame1=”,$seq->translate->seq,“\n”; id is example ATGAATGATGAA TTCATCATTCAT trans frame1=MNDE

  14. Accessing remote databases $gb = new Bio::DB::GenBank(); $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); $seq2 = $gb->get_Seq_by_acc('AF303112'); $seqio = $gb-> get_Stream_by_id(["J00522","AF303112","2981014"]);

  15. Sequence – Accession numbers # Get a sequence from RefSeq by accession number use Bio::DB::RefSeq; $gb = new Bio::DB::RefSeq; $seq = $gb->get_Seq_by_acc(“NM_007304”); print $seq->seq();

  16. Reading and Writing Sequences • Bio::SeqIO • fasta, genbank, embl, swissprot,... • Takes care of writing out associated features and annotations • Two functions • next_seq (reading sequences) • write_seq (writing sequences)

  17. Writing a Sequence use Bio::SeqIO; # Let’s convert swissprot to fasta format my $in = Bio::SeqIO->new( -format => ‘swiss’, -file => ‘file.sp’); my $out = Bio::SeqIO->new( -format => ‘fasta’, -file => ‘>file.fa’);` while( my $seq = $in->next_seq ) { $out->write_seq($seq); }

  18. Manipulating sequence data with Seq methods • Allows the easy manipulation of bioinformatics data • Specific parts of various annotated formats can be selected and rearranged. • Unwanted information can be voided out of reports • Important information can be highlighted, processed, stored in arrays for graphs/charts/etc with relative ease • Information can be added and subtracted in a flash

  19. The Code #!/usr/local/bin/perl use Bio::Seq; use Bio::SeqIO; my $seqin = Bio::SeqIO->new('-file' => "genes.fasta" , '-format' =>'Fasta'); my $seqobj = $seqin->next_seq(); my $seq = $seqobj->seq(),"\n"; #plain sequence print ">",$seqobj->display_id()," Description: ",$seqobj->desc(), " Alphabet: ",$seqobj->alphabet(),"\n"; $seq =~ s/(.{60})/$1\n/g; # convert to 60 char lines print $seq,"\n";

  20. Before

  21. After

  22. Obtaining basic sequence statistics- molecular weights, residue & codon frequencies (SeqStats, SeqWord) • Molecular Weight • Monomer Counter • Codon Counter • DNA weights • RNA weights • Amino Weights • More

  23. The Code #!/usr/local/bin/perl use Bio::PrimarySeq; use Bio::Tools::SeqStats; my $seqobj = new Bio::PrimarySeq(-seq => 'ATCGTAGCTAGCTGA', -display_id => 'example1'); $seq_stats = Bio::Tools::SeqStats->new(-seq=>$seqobj); $hash_ref = $seq_stats->count_monomers(); foreach $base (sort keys %$hash_ref) { print "Number of bases of type ", $base, "= ",%$hash_ref->{$base},"\n"; }

  24. The Results

  25. More Code use SeqStats; $seq_stats = Bio::Tools::SeqStats->new($seqobj); $weight = $seq_stats->get_mol_wt(); -returns the molecular weight $monomer_ref = $seq_stats->count_monomers(); -counts the number of monomers $codon_ref = $seq_stats->count_codons(); # for nucleic acid sequence -counts the number of codons

  26. Monomer

  27. Why the Large and The Small MW? Note that since sequences may contain ambiguous monomers (eg "M"meaning "A" or "C" in a nucleic acid sequence), the method get_mol_wtreturns a two-element array containing the greatest lower bound andleast upper bound of the molecule. (For a sequence with no ambiguous monomers, the two elements of the returned array will be equal.)

  28. Identifying restriction enzyme sites (Restriction Enzyme) • Bioperl's standard RestrictionEnzyme object comes with data for more than 150 different restriction enzymes. • To select all available enzymes with cutting patterns that are six bases long: • $re = new Bio::Tools::RestrictionEnzyme('-name'=>'EcoRI'); @sixcutters = $re->available_list(6); • sites for that enzyme on a given nucleic acid sequence can be obtained using • $re1 = new Bio::Tools::RestrictionEnzyme(-name=>'EcoRI'); # $seqobj is the Seq object for the dna to be cut @fragments = $re1->cut_seq($seqobj);

  29. Manipulating sequence alignments • Bioperl offers several perl objects to facilitate sequence alignment: • pSW (Smith-Waterman) • Clustalw.pm • TCoffee.pm • bl2seq option of StandAloneBlast.

  30. Manipulating Alignments • Some of the manipulations possible with SimpleAlign include: • slice(): Obtaining an alignment ``slice'', that is, a subalignment inclusive of specified start and end columns. • column_from_residue_number(): Finding column in an alignment where a specified residue of a specified sequence is located. • consensus_string(): Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ``?'''s in the consensus • percentage_identity(): A fast method for calculating the average percentage identity of the alignment • consensus_iupac(): Making a consensus using IUPAC ambiguity codes from DNA and RNA.

  31. The Code • use Bio::SimpleAlign; • $aln = Bio::SimpleAlign->new('t/data/testaln.fasta'); • $threshold_percent = 60; • $consensus_with_threshold = $aln->consensus_string($threshold_percent); • $iupac_consensus = $aln->consensus_iupac(); • # dna/rna alignments only • $percent_ident = $aln->percentage_identity; • $seqname = 'AKH_HAEIN'; • $pos = $aln->column_from_residue_number($seqname, 14);

  32. Searching for Sequence Similarity • BLAST with BioPerl • Parsing Blast and FASTA Reports • Search and SearchIO • BPLite, BPpsilite, BPbl2seq • Parsing HMM Reports • Standalone BioPerl BLAST

  33. Remote Execution of BLAST • BioPerl has built in capability of running BLAST jobs remotely using RemoteBlast.pm • Runs these jobs at NCBI automatically • NCBI has dynamic configurations (server side) to “always” be up and ready • Automatically updated for new BioPerl Releases • Convenient for independent researchers who do not have access to huge computing resources • Quick submission of Blast jobs without tying up local resources (especially if working from standalone workstation) • Legal Restrictions!!!

  34. Example of Remote Blast $remote_blast = Bio::Tools::Run::RemoteBlast->new( '-prog' => 'blastp','-data' => 'ecoli','-expect' => '1e-10' ); $r = $remote_blast->submit_blast("t/data/ecolitst.fa"); while (@rids = $remote_blast->each_rid ) { foreach $rid ( @rids ) { $rc = $remote_blast->retrieve_blast($rid); } }

  35. Sample Script to Read and Parse BLAST Report # Get the report $searchio = new Bio::SearchIO (-format => 'blast', -file => $blast_report); $result = $searchio->next_result; # Get info about the entire report $algorithm_type = $result->algorithm; # get info about the first hit $hit = $result->next_hit; $hit_name = $hit->name ; # get info about the first hsp of the first hit $hsp = $hit->next_hsp; $hsp_start = $hsp->query->start;

  36. Running BLAST Locally • StandAloneBlast • Bio::Tools::Run::StandAloneBlast • Factory Objects • @params = ('program' => 'blastn', 'database' => 'ecoli.nt'); $factory = Bio::Tools::Run::StandAloneBlast->new(@params); • Advantages: • Private • Use Customized Local Resources • Avoid Network Problems

  37. Examples • # Setting parameters similar to RemoteBlast$input = Bio::Seq->new(-id =>"test query", -seq =>"ACTAAGTGGGGG"); • $blast_report = $factory->blastall($input); • # Blast Report Object that directly accesses parserwhile (my $sbjct = $blast_report->next_hit){ while (my $hsp = $sbjct->next_hsp){ print $hsp->score . " " . $hsp->subject- >seqname . "\n"; } }

  38. Format Conversion – Sequences Example • Use: Bio::SeqIO • Core Code: $in = Bio::SeqIO->new('-file' => "COG0001", '-format' => 'Fasta'); $out = Bio::SeqIO->new('-file' => ">COG0001.gen", '-format' => 'genbank'); while ( my $seq = $in->next_seq() ) { $out->write_seq($seq); }

  39. Format Conversion – Alignments • Alignment formats supported: • INPUT: fasta, selex (HMMER), bl2seq, clustalw (.aln), msf (GCG), psi (PSI-BLAST), mase (Seaview), stockholm, prodom, water, phylip (interleaved), nexus, mega, meme • OUTPUT: fasta, clustalw, mase, selex, msf/gcg, and phylip (interleaved). • Next_aln( ) and write_aln( ) methods of the ‘Bio::AlignIO’ object are used

  40. ClustalW and Profile Align • ClustalW using BioPerl • Clustalw program should be installed and environment variable ‘CLUSTALDIR’ set • Setting Parameters – Build a factory • Some parameters: 'ktuple', 'matrix', 'outfile', 'quiet‘ • Need reference to sequence array object (See example) • Align( ) and Profile_align( ) methods used

  41. ClustalW – Example • Use Bio::SeqIO, Bio::Tools::Run::Alignment::Clustalw • Core code (Simple Align): @params = ('ktuple' => 2, 'matrix' => 'BLOSUM', 'outfile' => 'clustalw_out', 'quiet' => 1); $factory = Bio::Tools::Run::Alignment:: Clustalw->new(@params); $seq_array_ref = \@seq_array; $aln= $factory->align($seq_array_ref);

  42. Smith Waterman Search • Smith Waterman pairwise alignment • Standard method for producing an optimal local alignment of two sequences • Auxilliary Bioperl-ext library required • SW algorithm implemented in C and incorporated into bioperl • Align_and_show() & Pairwise_alignment() in Bio::Tools::pSW module are methods used

  43. Smith Waterman Search – Example • Use Bio::Tools::pSW, Bio::SeqIO, Bio::AlignIO • Core code: $factory = new Bio::Tools::pSW( '-matrix' => 'BLOSUM62', '-gap' => 12, '-ext' => 2); $aln = $factory->pairwise_alignment($seq_array[0], $seq_array[1]); my $alnout = new Bio::AlignIO(-format => 'msf', -fh => \*STDOUT); $alnout->write_aln($aln);

  44. Smith Waterman Search • AlignIO object in previous slide – could also be used to print into a file • Use double loop to do all pairwise comparisons • More Info: Bio::Tools::pSW mapage

More Related