BioPerl – An Overview

Gloria Rendon, April 2006, University of Illinois at Urbana-Champaign.


  1. BioPerl – An Overview. Gloria Rendon, April 2006, University of Illinois at Urbana-Champaign

  2. BioPerl is a…
      - Biology toolkit of modules for bioinformatics, genetics, and the life sciences
      - Framework for doing computational biology
      - Object-oriented flavor of Perl plus an extensive bioinformatics library
      - Collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications

  3. BioPerl is NOT
      - A set of ready-to-use programs, like many commercial packages and free web-based interfaces
      - A suitable language for all aspects of computational biology; it is not suited to high-precision, fast, intensive numeric data analysis (e.g., simulations, modeling, probabilities)
      - A strongly typed language, which means minimal time is spent on tasks such as error-checking and verifying the consistency of the data
      - A visually oriented language; it has poor GUI capabilities for code development

  4. A very brief History
      - BioPerl, the open source group of volunteers dedicated to developing this toolkit, is 10 years old.
      - BioPerl, the “stable” core toolkit, is four years old (release date 2002). It contained modules for sequence manipulation, for accessing databases using a range of data formats, and for executing and parsing the results of various molecular biology programs, including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER.
      - The latest version is 1.4 (release date 2003). It contains core and extensions, additional libraries, repositories, and compiled programs.
      - The future release is 1.5 (release date: ?). It contains GUI capabilities, persistence capabilities, client-server CORBA-compliant capabilities, and process pipelining capabilities.

  5. Software Requirements
      - Core BioPerl requires Perl (version 5.0 or above) to be already installed.
      - Minimal and complete versions of BioPerl exist; they are found online, packaged as “bundles”, and both subsume the Core package.
      - Custom installation: you can pick and choose which modules to install. Among the most commonly downloaded ones are:
          - For accessing remote databases, you will also need: File-Temp-0.09 and IO-String-1.01
          - For accessing Ace databases, you will also need: AcePerl-1.68
          - For remote Blast searches: libwww-perl-5.48, Digest-MD5-2.12, HTML-Parser-3.13, libnet-1.073, MIME-Base64-2.11, URI-1.09, IO-String-1.216
          - For XML parsing: libxml-perl-2.30, XML-Twig-2.02, Soap-Lite-0.52, XML-DOM-1.37, expat-1.95.1
      - Even though developers strive to produce independent modules, interdependencies are sometimes unavoidable, so make sure you have installed all the necessary modules on the host system.
      - For more current and additional information on external modules required by bioperl, check http://bioperl.org/Core/external.shtml

  6. Software Requirements…
      Bioperl also uses several C programs for sequence alignment and local blast searching. To use these features of bioperl you will need an ANSI C or GNU C compiler, as well as the actual programs, available from sources such as:
      - for Smith-Waterman alignments: bioperl-ext-0.6 from http://bioperl.org/Core/external.shtml
      - for clustalw alignments: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
      - for tcoffee alignments: http://igs-server.cnrsmrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html
      - for local blast searching: ftp://ftp.ncbi.nlm.nih.gov/blast/server/current_release/
      - for EMBOSS applications: http://www.hgmp.mrc.ac.uk/Software/EMBOSS/download.html

  7. Installation
      1. Locate the package(s) on the network at http://search.cpan.org/
      2. Download
      3. Decompress and remove the file archive
      4. Create a makefile
      5. Run “make”, “make test”, and “make install” for every module
      The CPAN module can also be used to install all of the modules in a single step as a “bundle” of modules, Bundle::BioPerl, e.g.:
          $> perl -MCPAN -e shell
          cpan> install Bundle::BioPerl
          <installation details....>
          cpan> install B/BI/BIRNEY/bioperl-1.0.tar.gz
          <installation details....>
          cpan> quit
      The process described above is for the UNIX OS. The minimal package should also work under NT, Windows, and Mac OS X; however, it has not been widely tested.

  8. BioPerl, the Core toolkit
      Much of bioperl is focused on sequence manipulation:
      - Accessing sequence data from local and remote databases
      - Transforming formats of database/file records
      - Manipulating individual sequences
      - Searching for "similar" sequences
      - Creating and manipulating sequence alignments
      - Searching for genes and other structures on genomic DNA
      - Developing machine-readable sequence annotations

  9. A one-minute crash course on Object-Oriented Languages
      - A bundle is a collection of modules that may be related.
      - Each module is composed of one or more classes.
      - A class is the blueprint of an object.
      - An object contains a data part and a methods part. Data is visible to the object only (encapsulation). The methods act on the data and can be private or public.
      - Objects interact with each other through method invocation only.
      - Inheritance is one of the many relationships that can exist between classes, i.e., the ISA relationship.
      - Other common relationships used in Bioperl:
          - xxxxx and xxxxxIO (e.g., Seq and SeqIO), where the latter class is the IO wrapper of the former class xxxxx
          - xxxxx and xxxxxI (e.g., Seq and SeqI), where the latter class is the Interface of the former class xxxxx
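
The ideas in this crash course can be sketched in a few lines of plain Perl. The classes My::Sequence and My::DNA below are hypothetical names made up for illustration and are not part of BioPerl. Note that Perl does not enforce data hiding; going through methods rather than touching the hash directly is a convention.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical base class (not part of BioPerl): wraps a sequence string.
package My::Sequence;

sub new {
    my ($class, %args) = @_;
    my $self = { seq => $args{seq} };    # data part, reached only via methods
    return bless $self, $class;
}

# Public methods acting on the data.
sub seq        { my $self = shift; return $self->{seq} }
sub seq_length { my $self = shift; return length $self->{seq} }

# Hypothetical subclass: My::DNA ISA My::Sequence.
package My::DNA;
our @ISA = ('My::Sequence');

# A method that only makes sense for DNA: count the G and C bases.
sub gc_count {
    my $self = shift;
    my $seq  = $self->seq;               # goes through the inherited accessor
    return ($seq =~ tr/GCgc//);
}

package main;

my $dna = My::DNA->new(seq => 'ATGCGC');
print 'length:   ', $dna->seq_length, "\n";   # inherited from My::Sequence
print 'GC count: ', $dna->gc_count,   "\n";   # defined in My::DNA
```

The `@ISA` array is how Perl expresses the ISA relationship: a method not found in My::DNA is looked up in My::Sequence.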

  10. BioPerl Class Diagram. Source: http://bioperl.org/wiki/Class_Diagram

  11. A closer look at the class diagram…
      Bio::DB, Bio::Seq, Bio::Index, Bio::Align, Bio::Search, Bio::Graphics, Bio::Biblio, Bio::Structure, Bio::Variation, Bio::LiveSeq
      All class diagrams shown here follow UML conventions.

  12. Documentation concerns
      - Few class diagrams have been published
      - Even fewer dataflow diagrams exist
      - But every single class in Bioperl has a POD
      - Last resort: look at the code itself (remember, Bioperl is open source)

  13. The Class POD
      - Source: actual code
      - Download: link to repos.
      - Name: class name
      - Synopsis: usage
      - Description: textual
      - Appendix: list of methods
      - Author
      - Contributors
      - Feedback
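
As a rough illustration of how such sections sit inside a module file, here is a minimal sketch of a class documented with POD. The class My::Example and its methods are hypothetical, and the section texts are placeholders; a real BioPerl class will carry much more detail.

```perl
package My::Example;
use strict;
use warnings;

=head1 NAME

My::Example - hypothetical class used only to illustrate POD layout

=head1 SYNOPSIS

    my $obj = My::Example->new();
    print $obj->greet(), "\n";

=head1 DESCRIPTION

Textual description of what the class does.

=head1 FEEDBACK

Where to send comments and bug reports.

=head1 AUTHOR

Name and contact of the author.

=head1 CONTRIBUTORS

Other people who worked on the class.

=head1 APPENDIX

One entry per method follows.

=head2 new

Constructor; takes no arguments.

=cut

sub new { my $class = shift; return bless {}, $class }

=head2 greet

Returns a fixed greeting string.

=cut

sub greet { return 'hello from My::Example' }

1;
```

Running `perldoc` on a file laid out this way renders the documentation and skips the code; the Perl interpreter does the opposite.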

  14. Sequence classes in BioPerl

  15. Sequence in BioPerl
      - Seq is the central sequence object in bioperl. Most common sequence manipulations can be performed with Seq.
      - RichSeq objects store additional annotations beyond those used by standard Seq objects.
      - SeqWithQuality objects are used to manipulate sequences with quality data, like those produced by phred.
      - PrimarySeq is basically a “stripped down” version of Seq.
      - LocatableSeq might be more appropriately called an “AlignedSeq” object. It is a Seq object that is part of a multiple sequence alignment, with “start” and “end” positions indicating where in a larger sequence it may have been extracted from.
      - LargeSeq is a special type of Seq object used for handling very long (e.g., > 100 MB) sequences.
      - LiveSeq addresses the problem of features whose location on a sequence changes over time. This can happen, for example, when sequence feature objects are used to store gene locations on newly sequenced genomes: the locations can change as higher quality sequencing data becomes available.
      - SeqI objects are Seq “interface objects” (see section II.4 and Bio). They are used to ensure bioperl’s compatibility with other software packages.

  16. Sequence Creation, Retrieval and Access
      1. de novo:
          use Bio::Seq;
          my $seq1 = Bio::Seq->new(-seq  => 'ATGAGTAGTAGTAAAGGTA',
                                   -id   => 'my seq',
                                   -desc => 'this is a new Seq');
      2. from a file:
          use Bio::SeqIO;
          # through file IO functions
          my $seqin = Bio::SeqIO->new(-file => 'seq.fasta', -format => 'fasta');
          my $seq3  = $seqin->next_seq();
          # through file handles
          my $inseq  = Bio::SeqIO->newFh(-file => '<seqs.sp',    -format => 'swiss');
          my $outseq = Bio::SeqIO->newFh(-file => '>seqs.fasta', -format => 'fasta');
          print $outseq $_ while <$inseq>;

  17. Sequence Creation, Retrieval and Access
      3. from a remote database:
          use Bio::DB::GenBank;
          $gb = Bio::DB::GenBank->new();
          # each of these two calls returns a Seq object
          $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
          $seq2 = $gb->get_Seq_by_acc('AF303112');
          # this call returns a SeqIO object
          $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014) ]);
      Bioperl supports sequence data retrieval from the Genbank, Genpept, RefSeq, Swissprot, and EMBL databases.

  18. Sequence Creation, Retrieval and Access
      4. from a local database:
      Before sequences can be accessed from local sequence data files, the files have to be made Bioperl-readable by indexing them with Bio::Index or Bio::DB::Fasta. The following sequence data formats are supported by Bio::Index: Genbank, Swissprot, Pfam, EMBL and Fasta. Once the set of sequences has been indexed using Bio::Index, individual sequences can be accessed using syntax very similar to that described above for accessing remote databases.
          # script 1: build the index (arguments: index file name, then fasta files)
          use Bio::Index::Fasta;   # using fasta file format
          $Index_File_Name = shift;
          $inx = Bio::Index::Fasta->new(-filename => $Index_File_Name, -write_flag => 1);
          $inx->make_index(@ARGV);
          # script 2: retrieve sequences (arguments: index file name, then sequence ids)
          use Bio::Index::Fasta;
          $Index_File_Name = shift;
          $inx = Bio::Index::Fasta->new(-filename => $Index_File_Name);
          foreach $id (@ARGV) {
              $seq = $inx->fetch($id);   # returns a Bio::Seq object
          }
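
To make the idea of indexing concrete, here is a small plain-Perl sketch that records the byte offset of each FASTA header and then jumps straight to a record by id. It illustrates the general technique only and is not a description of how Bio::Index is implemented; the functions build_index and fetch_seq are hypothetical names, and the FASTA data is held in memory so the sketch runs without any files.

```perl
use strict;
use warnings;

# Build an index: sequence id => byte offset of its '>' header line.
sub build_index {
    my ($fh) = @_;
    my %offset_of;
    seek $fh, 0, 0;
    while (1) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        $offset_of{$1} = $pos if $line =~ /^>(\S+)/;
    }
    return \%offset_of;
}

# Jump straight to one record and return its sequence as a string,
# or undef if the id is not in the index.
sub fetch_seq {
    my ($fh, $index, $id) = @_;
    return unless exists $index->{$id};
    seek $fh, $index->{$id}, 0;
    my $header = <$fh>;                  # skip the header line
    my $seq = '';
    while (my $line = <$fh>) {
        last if $line =~ /^>/;           # the next record starts here
        chomp $line;
        $seq .= $line;
    }
    return $seq;
}

# Toy FASTA data (made up for this example).
my $fasta = ">seq1 first\nATGC\nGGTA\n>seq2 second\nTTTT\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";

my $index = build_index($fh);
print "seq2: ", fetch_seq($fh, $index, 'seq2'), "\n";
print "seq1: ", fetch_seq($fh, $index, 'seq1'), "\n";
close $fh;
```

The point of the index is that fetching one record does not require rereading the whole file, which matters for large local databases.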

  19. Sequence Creation, Retrieval and Access
      5. format conversion with SeqIO:
      SeqIO can read a stream of sequences, located in a single file or in multiple files, in a number of formats: Fasta, EMBL, GenBank, Swissprot, PIR, GCG, SCF, phd/phred, Ace, or raw (plain sequence). Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq objects. The Seq objects can then be written to another file (again using SeqIO) in any of the supported data formats, which makes data converters simple to implement. For example:
          use Bio::SeqIO;
          $in  = Bio::SeqIO->new('-file' => "inputfilename",   '-format' => 'Fasta');
          $out = Bio::SeqIO->new('-file' => ">outputfilename", '-format' => 'EMBL');
          while ( my $seq = $in->next_seq() ) {
              $out->write_seq($seq);
          }

  20. Yet another view of the Seq class

  21. Bioperl Features [ex: XML tags  Bioperl features]

  22. Sequences and Annotations

  23. Annotations, de novo
          use Bio::SeqFeature::Generic;
          use Bio::SeqIO;
          $in  = Bio::SeqIO->newFh(-file => $ARGV[0]);
          $out = Bio::SeqIO->newFh();
          $seq = <$in>;
          $feat = Bio::SeqFeature::Generic->new(
              -start   => 10,
              -end     => 100,
              -strand  => -1,
              -primary => 'repeat',
              -source  => 'repeatmasker',
              -score   => 1000,
              -tag     => { new      => 1,
                            author   => 'someone',
                            sillytag => 'this is silly!' } );
          $seq->add_SeqFeature($feat);
          print $out $seq;

  24. Sequences and Locations

  25. my $fuzzylocation = Bio::Location::Fuzzy->new(-start => '<30', -end => 90, -loc_type => '.');
      A Location object is like an index within a range. It is designed to be associated with a SeqFeature object to indicate where on a larger structure (e.g., a chromosome or contig) the feature can be found. It was implemented as a separate object, rather than as a simple index on a range, because:
      - Some objects have multiple locations or sub-locations (e.g., a gene’s exons may have multiple start and stop locations)
      - In unfinished genomes, the precise locations of features are not known with certainty

  26. Seq other commonly used methods
      - The following methods return string values:
          $seqobj->desc();             # a description of the sequence
          $seqobj->display_id();       # the human-readable id of the sequence
          $seqobj->seq();              # string of sequence
          $seqobj->subseq(5,10);       # part of the sequence as a string
          $seqobj->accession_number(); # when there, the accession number
          $seqobj->alphabet();         # one of 'dna', 'rna', 'protein'
          $seqobj->primary_id();       # a unique id for this sequence
      - The following methods return an array of Bio::SeqFeature objects:
          $seqobj->top_SeqFeatures     # the 'top level' sequence features
          $seqobj->all_SeqFeatures     # all sequence features
      - The following methods return new sequence objects, but do not transfer features across:
          $seqobj->trunc(5,10)         # truncation from 5 to 10 as new object
          $seqobj->revcom              # reverse complements sequence
          $seqobj->translate           # translation of a DNA sequence from start/end
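
For readers unsure what the coordinates in subseq(5,10) mean or what a reverse complement looks like, the following plain-Perl sketch mimics both operations on ordinary strings. The two functions are stand-ins written for this example, not BioPerl code; the sketch assumes the 1-based, inclusive coordinate convention that Seq uses.

```perl
use strict;
use warnings;

# Substring with 1-based, inclusive coordinates, as in $seqobj->subseq(5,10).
sub subseq {
    my ($seq, $start, $end) = @_;
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse complement of a DNA string: reverse it, then swap A<->T and C<->G.
sub revcom {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

my $dna = 'ATGAGTAGTAGTAAAGGTA';
print 'bases 5 to 10:      ', subseq($dna, 5, 10), "\n";
print 'reverse complement: ', revcom($dna), "\n";
```

The real Seq methods differ in that trunc and revcom return new sequence objects rather than bare strings.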

  27. Basic sequence statistics
      The SeqStats object provides methods for obtaining the molecular weight of the sequence, as well as the number of occurrences of each of the component residues (bases for a nucleic acid or amino acids for a protein). For nucleic acids, SeqStats also returns counts of the number of codons used. For example:
          use Bio::Tools::SeqStats;
          $seq_stats   = Bio::Tools::SeqStats->new($seqobj);
          $weight      = $seq_stats->get_mol_wt();
          $monomer_ref = $seq_stats->count_monomers();
          $codon_ref   = $seq_stats->count_codons();   # for DNA sequence
      The SeqWords object is similar to SeqStats and provides methods for calculating frequencies of “words” (e.g., tetramers or hexamers).
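
The kind of counting described here can be sketched in plain Perl as follows. This is a simplified stand-in, not the SeqStats implementation: the function names are reused only for readability, and the sketch assumes codons are read as non-overlapping triplets starting at the first base, with any trailing partial codon ignored.

```perl
use strict;
use warnings;

# Count each residue in a sequence string; returns a hash reference.
sub count_monomers {
    my ($seq) = @_;
    my %count;
    $count{$_}++ for split //, uc $seq;
    return \%count;
}

# Count non-overlapping triplets starting at the first base (frame 1).
sub count_codons {
    my ($seq) = @_;
    my $dna = uc $seq;                   # match against a variable so /g can advance
    my %count;
    $count{$1}++ while $dna =~ /(.{3})/g;
    return \%count;
}

my $seq      = 'ATGAGTAGTAGTAAAGGTA';
my $monomers = count_monomers($seq);
my $codons   = count_codons($seq);

print "$_: $monomers->{$_}\n" for sort keys %$monomers;
print "$_: $codons->{$_}\n"   for sort keys %$codons;
```

Both functions return hash references, which mirrors the `$monomer_ref` and `$codon_ref` variable names used on the slide.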

  28. More on Format conversion with AlignIO

  29. AlignIO is the bioperl object for data conversion of alignment files. AlignIO is patterned on the SeqIO object and shares most of SeqIO’s features.
      - AlignIO currently supports INPUT in the following formats: fasta, mase, stockholm, prodom, selex, bl2seq, clustalw, msf/gcg, water (from EMBOSS, see III.3.6), needle (from EMBOSS, see III.3.6)
      - AlignIO supports OUTPUT in these formats: fasta, mase, selex, clustalw, msf/gcg
      One significant difference between AlignIO and SeqIO is that AlignIO handles IO for only a single alignment at a time, whereas SeqIO.pm handles IO for multiple sequences in a single stream. Syntax for AlignIO is almost identical to that of SeqIO:
          use Bio::AlignIO;
          $in  = Bio::AlignIO->new('-file' => "inputfilename",   '-format' => 'fasta');
          $out = Bio::AlignIO->new('-file' => ">outputfilename", '-format' => 'pfam');
          while ( my $aln = $in->next_aln() ) {
              $out->write_aln($aln);
          }
      The only difference is that here the returned object reference, $aln, is to a SimpleAlign object rather than a Seq object.

  30. SimpleAlign

  31. The SimpleAlign class contains methods to select sequences or columns, but it cannot filter alignments by functions (i.e., properties). To filter columns by properties, you have to extract the columns yourself, filter them, and reconstruct the new sequences. The following example filters out gap columns.

  32. use strict;
      use Bio::AlignIO;
      my $in  = Bio::AlignIO->new(-file => $ARGV[0], -format => 'clustalw');
      my $out = Bio::AlignIO->newFh(-fh => \*STDOUT, -format => 'clustalw');
      my $aln = $in->next_aln();
      # create a list containing all columns
      my @aln_cols = ();
      foreach my $seq ( $aln->each_alphabetically() ) {
          my $colnr = 0;
          foreach my $chr ( split("", $seq->seq()) ) {
              $aln_cols[$colnr] .= $chr;
              $colnr++;
          }
      }
      # then do the work: we want to eliminate all the columns containing gaps
      # 1/ we create a list containing all the columns without any gap
      my $gapchar = $aln->gap_char();
      my @no_gap_cols = ();
      foreach my $col ( @aln_cols ) {
          next if $col =~ /\Q$gapchar\E/;
          push @no_gap_cols, $col;
      }
      # 2/ now we replace the old gapped list with the new ungapped one
      my @seq_strs = ();
      foreach my $col ( @no_gap_cols ) {
          my $colnr = 0;
          foreach my $chr ( split("", $col) ) {
              $seq_strs[$colnr] .= $chr;
              $colnr++;
          }
      }
      foreach my $seq ( $aln->each_alphabetically() ) {
          $seq->seq(shift @seq_strs);
      }
      print $out $aln;
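
The column-filtering logic can be tried out without BioPerl or an alignment file. The sketch below applies the same idea to a plain list of aligned strings; strip_gap_columns is a hypothetical helper written for this example, and the three-row alignment is made-up data.

```perl
use strict;
use warnings;

# Remove every column that contains the gap character in at least one row.
# Takes a gap character and a list of equal-length aligned strings;
# returns the filtered strings in the same order.
sub strip_gap_columns {
    my ($gapchar, @rows) = @_;
    return () unless @rows;
    my @kept = ('') x @rows;             # one output string per input row
    for my $col (0 .. length($rows[0]) - 1) {
        my @chars = map { substr($_, $col, 1) } @rows;
        next if grep { $_ eq $gapchar } @chars;   # skip columns with a gap
        $kept[$_] .= $chars[$_] for 0 .. $#rows;
    }
    return @kept;
}

# Toy alignment: columns 2 and 4 each contain a gap in one row.
my @aln = ('ATG-CA', 'A-GTCA', 'ATGTCA');
print "$_\n" for strip_gap_columns('-', @aln);
```

Unlike the slide's version, this walks the alignment column by column in one pass instead of building an intermediate list of column strings; the result is the same.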

  33. Search and Analysis of Similar Sequences
      Bioperl offers a number of modules to facilitate running Blast, both locally and remotely, as well as to parse the often voluminous reports produced by Blast. Note that Bioperl itself does not have an internal library for running Blast; instead, it calls the necessary program and then manipulates its results internally.

  34. StandAloneBlast
      The module Bio::Tools::Run::StandAloneBlast offers the ability to wrap local calls to blast from within perl. All of the currently available options of NCBI Blast (e.g., PSIBLAST, PHIBLAST, bl2seq) are available from within the bioperl StandAloneBlast interface. Of course, to use StandAloneBlast, one needs to have NCBI Blast installed locally, as well as one or more blast-readable databases. Basic usage of the StandAloneBlast.pm module is simple: first a local blast “factory object” is created, then the supported blast executables can be run.
          # local BLAST
          use Bio::Seq;
          use Bio::SeqIO;
          use Bio::Tools::Run::StandAloneBlast;
          # step one: creating the factory
          @params  = ('program' => 'blastn', 'database' => 'ecoli.nt');
          $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
          # step two: the input seq is entered
          $input = Bio::Seq->new('-id' => "test query", '-seq' => "ACTAAGTGGGGG");
          $blast_report = $factory->blastall($input);
          # step three: accessing parts of the blast report
          my $result = $blast_report->next_result;
          while ( my $hit = $result->next_hit() ) {
              print "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
          }

  35. RemoteBlast
      Bioperl supports remote execution of blasts at NCBI by means of the RemoteBlast object. A skeleton script to run a remote blast might look as follows:
          # remote BLAST
          use Bio::Tools::Run::RemoteBlast;
          # step 1: query submission
          $remote_blast = Bio::Tools::Run::RemoteBlast->new('-prog'   => 'blastp',
                                                            '-data'   => 'ecoli',
                                                            '-expect' => '1e-10');
          $r = $remote_blast->submit_blast("t/data/ecolitst.fa");
          # step 2: results retrieval and storage
          while ( @rids = $remote_blast->each_rid ) {
              foreach $rid ( @rids ) {
                  $rc = $remote_blast->retrieve_blast($rid);
                  if ( !ref($rc) ) {
                      # no report yet; a negative code signals an error
                      $remote_blast->remove_rid($rid) if $rc < 0;
                      sleep 5;
                  } else {
                      # report is ready: store it, then stop polling this id
                      $remote_blast->save_output("ecoliblastseqs.txt");
                      $remote_blast->remove_rid($rid);
                  }
              }
          }

  36. Parsing Similarity Search Reports
      Bioperl can parse the reports of more search programs than it can run. Bioperl objects that parse and/or search BLAST, PSIBLAST and FASTA reports include Search.pm, SearchIO.pm, BPlite.pm and Blast.pm (for parsing Blast reports). A future release will incorporate support for HMMer and GenScan, among others.

  37. Parsing a Blast Report
          use Bio::SearchIO;
          my $blast_report = Bio::SearchIO->new('-format' => 'blast', '-file' => $ARGV[0]);
          my $result = $blast_report->next_result;
          while ( my $hit = $result->next_hit() ) {
              print "\thit name: ", $hit->name(), "\n";
              while ( my $hsp = $hit->next_hsp() ) {
                  print "E: ", $hsp->evalue(), " frac_identical: ", $hsp->frac_identical(), "\n";
              }
          }

  38. Other Parsers
      Bioperl has a family of parsers that work in a slightly different way from the previous one:
      - The report belongs to a different class, Bio::Tools::BPlite, which has a different set of methods for getting the information.
      - A factory has to be created first, and Bioperl then applies the parameters of the search to it.
      - This family of parsers includes BPlite, BPpsilite, and BPbl2seq.

  39. use Bio::SeqIO;
      use Bio::Tools::Run::StandAloneBlast;
      my $Seq_in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');
      my $query  = $Seq_in->next_seq();
      my $factory = Bio::Tools::Run::StandAloneBlast->new('program'  => 'blastp',
                                                          'database' => 'swissprot');
      my $blast_report = $factory->blastall($query);
      while ( my $subject = $blast_report->nextSbjct() ) {
          print $subject->name(), "\n";
          while ( my $hsp = $subject->nextHSP() ) {
              print join("\t", $hsp->P, $hsp->percent, $hsp->score), "\n";
          }
      }

  40. BioPerl Overview, Part 2
      - recap
      - creating databases
      - accessing databases with OBDA
      - relational databases, SQL and others
      - closing

  41. Creating own databases: 1. By storing results of searches as flat files
      Bioperl offers a number of modules to facilitate running Blast, both locally and remotely, as well as to parse the often voluminous reports produced by Blast. Note that Bioperl itself does not have an internal library for running Blast; instead, it calls the necessary program and then manipulates its results internally.

  44. Creating databases: 2. Mirroring databases on your local system
      Basically, follow each site’s instructions on downloading and setting up the database locally.
