Perl (4) Hongkang Mei, Ph.D. March 9, 2002

Perl (4) Hongkang Mei, Ph.D. March 9, 2002

Bioperl introduction • Using bioperl • DBI introduction • DBI architecture • Handles, data source • Connection & disconnection • Utilities & functions • Database interactions

bioperl • a collection of Perl modules that facilitates the development of perl scripts for bioinformatics application • http://www.bioperl.org/ • It includes: • sequence manipulation • accessing databases of various formats • execution of molecular biology programs (Blast, clustalw,HMMer) • parsing of results • Still being developed

Goals of learning Bioperl • Knowing: • what bio-informatics tasks can be handled with bioperl • (2)where to find the methods to accomplish these tasks within the bioperl package • (3) where to go for additional information.

Bioperl requirements • Software requirements: • Minimal installation • perl v5.005 • interactive debugger (recommended) • Complete Installation • Additionally: • perl modules from CPAN; • bioperl perl extensions; • bioperl xs-extension; • several standard compiled bioinformatics programs.

Installation Locate the package on the network Download Decompress (with gunzip or a similiar utility) Create a ``makefile'' (with ``perl Makefile.PL'' for perl modules or a supplied ``install'' or ``configure'' program for non-perl program Run ``make'', ``make test'' and ``make install'' This procedure must be repeated for every CPAN module, bioperl-extension and external module to be installed. A helper module CPAN.pm is available from CPAN which automates the process for installing the perl modules. For the external programs there is an extra step: Set the relevant environmental variable (CLUSTALDIR, TCOFFEEDIR or BLASTDIR) to the directory holding the executable in your startup file - eg in .bashrc.

sequence objects • Seq central obj for DNA, protein, RNA, for SeqIO • stores multiple annotations and associated • features • PrimarySeq “stripped down” version of Seq • LocatableSeq has ‘start’ ‘end’ positions for alignment • AlignIO, pSW • LiveSeq handles location changes of features • LargeSeq version of Seq handling long sequences • SeqI Seq interface object used with other packages

alignment objects (SimpleAlign, UnivAln) • store an array of sequences as an alignment • two system developed independently • Interface objects and implementation objects • Interface definition of what methods one can call on an object, without any knowledge of how it is implemented: Bio::MyObjectI • implementation the actual implementation

Using bioperl • Accessing sequence data from local and remote databases • Transforming formats of database/ file records • Manipulating individual sequences • Searching for "similar" sequences • Creating and manipulating sequence alignments • Searching for genes and other structures on genomic DNA • Developing machine readable sequence annotations

accessing sequences data from local and remote databases • enter seq directly: • $seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact', • '-desc'=>'Sample Bio::Seq object', • '-display_id' => 'something', • '-accession_number' => 'accnum', • '-moltype' => 'dna' ); • access remote databases: • $gb = new Bio::DB::GenBank(); • $seq1 = $gb->get_Seq_by_id('MUSIGHBA1'); • $seq2 = $gb->get_Seq_by_acc('AF303112')) • $seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112 2981014)]));

manipulating sequences • Seq methods • $seqobj->display_id(); # the human read-able id of the sequence • $seqobj->seq(); # string of sequence • $seqobj->subseq(5,10); # part of the sequence as a string • $seqobj->accession_number(); # when there, the accession number • $seqobj->moltype(); # one of 'dna','rna','protein' • $seqobj->primary_id(); # a unique id for this sequence irregardless • # of its display_id or accession number • $seqobj->top_SeqFeatures # The 'top level' sequence features • $seqobj->all_SeqFeatures # All sequence features, including sub • # seq features • $seqobj->trunc(5,10) # truncation from 5 to 10 as new object • $seqobj->revcom # reverse complements sequence • $seqobj->translate # translation of the sequence

Perl DBI for relational databases • DBI: database interface • Defines Perl’s database programming interface • for Unix system, resembles Win32::ODBC for ActiveWare to work on Windows system • sybperl • oraperl

Content • DBI architecture • handles to interact with databases • connecting to and disconnecting from databases • DBI's utility methods and functions

DBI Architecture • __Two main groups of software: the DBI itself, and the drivers. • __The DBI defines the actual DBI interface • routes method calls to the appropriate drivers provides various support services to them • __Specific drivers are implemented for each different type of database and actually perform the operations on the databases

Data flow • Perl script • DBI(specifications) • specific DBD (driver) • database

databases supported • Oracle • Informix • mSQL • MySQL • Ingres • Sybase • DB2 • Empress • SearchServer • PostgreSQL

Handles • driver handles (1:1 with DBD) • database handles (m:m with db) • statement handles

driver handles $drh • data_sources(), to enumerate what can be connected to • each driver can have many datasource names stored • and can be retrieved by DBI-> data_sources() • connect(), to actually make a connection • DBI->connect()

database handles $dbh • first step towards actually doing work with the database • encapsulate a single connection to a particular database • $dbh = DBI->connect( $data_source, ... ); • are children of their corresponding driver handle • can make multipleconnections to multiple databases • completely encapsulated objects • transactions from one database handle cannot • “cross-over'' or “leak'' into another.

Statement Handles $sth • for database interaction and manipulation • encapsulate individual SQL statements to be • executed within the database. • children of their corresponding database handle • no limit of number of sth can be made

Data Source Names • Tell the DBI where to find the database to connect to • dbi:Oracle:archaeo • difficult to standardize • hostname, database name, TCP/IP, db alias • each DBD contains a set of DSN

Retrieve all DSN in all installed drivers • list drivers: DBI->available_drivers(); • list datasource names: DBI->data_sources(); • use DBI; • my @drivers = DBI->available_drivers(); • die "No drivers found!\n" unless @drivers; # should never happen • foreach my $driver ( @drivers ) { • print "Driver: $driver\n"; • my @dataSources = DBI->data_sources( $driver ); • foreach my $dataSource ( @dataSources ) { • print "\tData Source is $dataSource\n"; • } • print "\n"; • } • The output from this script on my machine looks like: • Driver: ADO • Driver: CSV • Data source is DBI:CSV:f_dir=megaliths • Data source is DBI:CSV:f_dir=pictish_stones • …...

DSN syntax: dbi:mSQL:hostname:database:port_number dbi:Oracle:connection_descriptor dbi:Oracle:archaeo dbi:CSV:f_dir=/datafiles CSV: comma separated value files

Connection • requirements (parameters): • data source name • driver to use • database to connect • (possibly) where is the db • username • multi-user accounts, security from DBI • empty if not required • password

Perl script for connection • #!/usr/bin/perl -w • # ch04/connect/ex1: Connects to an Oracle database. • use DBI; # Load the DBI module • ### Perform the connection using the Oracle driver • my $dbh = DBI->connect( "dbi:Oracle:archaeo", "username", "password" ) • or die "Can't connect to Oracle database: $DBI::errstr\n"; • exit;

Multi-connections • & connect simultaneously to two different databases • #!/usr/bin/perl -w • # ch04/connect/ex3: Connects to two Oracle databases simultaneously. • use DBI; # Load the DBI module • ### Perform the connection using the Oracle driver • my $dbh1 = DBI->connect( "dbi:Oracle:archaeo", "username", "password" ) • or die "Can't connect to 1st Oracle database: $DBI::errstr\n"; • my $dbh2 = DBI->connect( "dbi:Oracle:archaeo", "username", "password" ) • or die "Can't make 2nd database connect: $DBI::errstr\n"; • ### Perform the connection using the mSQL driver • my $dbh3 = DBI->connect( "dbi:mSQL:seconddb", "username", "password" , { • PrintError => 0 • } ) • or die "Can't connect to mSQL database: $DBI::errstr\n"; • exit;

Disconnection • explicit disconnection is recommended when finishes • my $dbh = DBI->connect( "dbi:Oracle:archaeo", "username", "password" , { • PrintError => 0 • } ) • or die "Can't connect to Oracle database: $DBI::errstr\n"; • #now ready to disconnect • $dbh->disconnect() • or warn "Disconnection failed: $DBI::errstr\n"; • returns true or false • false usually means that the connection is lost

Disconnection (cont.) • upon exit of perl program, DESTROY will clean up by calling • the handle’s disconnect() method, bu with a warning: • Database handle destroyed without explicit disconnect. • Have you done commit or rollback if any change made to db? • DESTROY method has to call rollback() before disconnect() • if AutoCommit is not enabled

Utilities & functions • quote escaping method • DBI execution tracing • functions to tidy up your data • neat, numeric testing

quote escaping method quote() • use DBI; • ### The string to quote • my $string = "Don't view in monochrome (it looks 'fuzzy')!"; • ### Connect to the database • my $dbh = DBI->connect( "dbi:Oracle:archaeo", "username", "password" , { • RaiseError => 1 • } ); • ### Escape the string quotes ... • my $quotedString = $dbh->quote( $string ); • ### Use quoted string as a string literal in a SQL statement • my $sth = $dbh->prepare( " • SELECT * FROM media WHERE description = $quotedString • " ); • $sth->execute(); • exit;

quote() (cont.) • For example, if you quoted the Perl string of Do it! via an Oracle database handle, you would be returned the value of 'Do it!'. However, the quote() method also takes care of cases such as Don't do it! which needs to be translated to 'Don''t do it!' for most databases. The simplistic addition of surrounding quotes would have produced 'Don't do it!' which is not a valid SQL string literal.

Tracing DBI Execution • generate runtime tracing information of what it's doing • track down strange problems • valid tracing levels: • 0 Disables tracing. • 1 Traces DBI method execution showing returned values and errors. • 2 As for 1, but also includes method entry with parameters. • 3 As for 2, but also includes more internal driver trace information. • 4 Levels 4, and above can include more detail than is helpful. • trace at various handle levels: • DBI->trace() • $dbh->trace() • $sth->trace()

use DBI; ### Remove any old trace files unlink 'dbitrace.log' if -e 'dbitrace.log’; ### Connect to a database my $dbh = DBI->connect( "dbi:Oracle:archaeo", "username", "password" ); ### Set the tracing level to 1 and prepare() DBI->trace( 1 ); doPrepare(); ### Set trace output to a file at level 2 and prepare() DBI->trace( 2, 'dbitrace.log' ); doPrepare(); ### Set the trace output back to STDERR at level 2 and prepare() DBI->trace( 2, undef ); doPrepare(); exit; ### prepare a statement (invalid to demonstrate tracing) sub doPrepare { print "Preparing and executing statement\n"; my $sth = $dbh->prepare( " SELECT * FROM megalith " ); $sth->execute(); return; } exit;

-> prepare for DBD::Oracle::db (DBI::db=HASH(0xcd45c)~0xcd4a4 ' SELECT * FROM megalith ') thr0 <- prepare= DBI::st=HASH(0xcd648) at trace1 line 30. -> execute for DBD::Oracle::st (DBI::st=HASH(0xcd648)~0x16afec) thr0 dbd_st_execute SELECT (out0, lob0)... !! ERROR: 942 'ORA-00942: table or view does not exist (DBD ERROR: OCIStmtExecute)' <- execute= undef at trace1 line 33. DBD::Oracle::st execute failed: ORA-00942: table or view does not exist (DBD ERROR: OCIStmtExecute) at trace1 line 33. -> method entered <- method returning

Neat and Tidy Formatting • #!/usr/bin/perl -w • use DBI; • ### Declare some strings to neatify • my $str1 = "Alligator's an extremely neat() and tidy person"; • my $str2 = "Oh no\nhe's not!"; • ### Neatify this first string to a maxlen of 40 • print "String: " . DBI::neat( $str1, 40 ) . "\n"; • ### Neatify the second string to a default maxlen of 400 • print "String: " . DBI::neat( $str2 ) . "\n"; • ### Neatify a number • print "Number: " . DBI::neat( 42 * 9 ) . "\n"; • ### Neatify an undef • print "Undef: " . DBI::neat( undef ) . "\n"; • exit; • which generates the output of: • String: 'Alligator's an extremely neat() and...' • String: 'Oh no • he's not!' • Number: 378 • Undef: undef

SQL statements can be executed like this (the equivalent of execute immediate in embedded SQL): $dbh->do("delete from emptable where status != 'active'"); print "Error: $DBI::err .... $DBI::errstr" if $DBI::err; If you make the same query or a similar-looking query a number of times, you force the system to parse it over and over again. To avoid this overhead, you can compile a parameterized query using prepare, and execute it many times. The prepare method is given a query with parameter placeholders indicated by "?": $sth = $dbh->prepare ('insert into emptable (name, age) values (?, ?)');

Insert rows into database: while (defined($line = <>)) { chomp($line); # id, name, age separated by tab ($id, $name, $age) = split (/\t/, $line); $sth->execute($id, $name, $age); die "Error: $DBI::err .... $DBI::errstr" if $DBI::err; }

Retrieve information in bulk using the SQL select statement: $cur = $dbh->prepare('select name, age from emptable where age < 40'); $cur->execute(); die "Prepare error: $DBI::err .... $DBI::errstr" if $DBI::err; while (($name, $age) = $cur->fetchrow) { print "Name:$name, Age: $age \n"; } $cur->finish();

Example : DBI

Perl (4) Hongkang Mei, Ph.D. March 9, 2002

Perl (4) Hongkang Mei, Ph.D. March 9, 2002

Presentation Transcript

Perl

Perl

Perl

PERL

Perl

Perl Chapter 4

Perl for Bioinformatics

PERL TK

Perl

PERL

Perl

Perl

Perl

Perl

Perl

Perl Chapter 4

Perl

Perl

Chapter 4 Perl programming