Perl Object Layer & Pipelines for Efficient Data Processing

Perl Object Layer & Pipelines • Pipelines: Steve Fischer, John Iodice, Deborah Pinney, Mark Heiges, Ed Robinson • Perl Object Layer: Brian Brunk, Mark Gibson, Dave Barkan

Pipeline Introduction • Sequential steps of • Plugin calls • Script calls • Cluster jobs • Purpose • Codifies the process of creating the data set • Reduces the human effort required • Reduces human error and omissions

Two Pipeline Types • Resources pipeline • Downloads resources from external sources • Loads resources into the database • Example: NRDB files • Analysis pipeline • Extracts data from the database • Runs analysis programs on the data • On the main or cluster server • Puts value-added data back into the database

Resource Pipeline • Invoked by: • loadresources xmlfile propfile • Take a tour of a resources XML file

Resources Repository • Destination of downloads • Houses files in a file system • Serves as a cache for files • Has API to access files by name and version • If you request an existing file by name and version, repository returns it without downloading • But the wget arguments must match (these are remembered by the repository) • Particularly useful if multiple projects want to synchronize their data input
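The caching rule on this slide can be sketched as follows. This is a minimal in-memory illustration, not the repository's actual API: the class `ToyRepository`, its `fetch` and `downloadCount` methods, and the `$downloader` callback (standing in for wget) are all hypothetical. The slide does not say what happens when the wget arguments differ, so treating a mismatch as an error is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the resources repository (not the GUS API).
{
    package ToyRepository;

    sub new {
        my ($class) = @_;
        return bless { files => {}, downloads => 0 }, $class;
    }

    # Return the file stored under (name, version) without downloading,
    # provided the wget arguments match the ones remembered for it.
    sub fetch {
        my ($self, $name, $version, $wgetArgs, $downloader) = @_;
        my $key   = "$name/$version";
        my $entry = $self->{files}{$key};
        if ($entry) {
            # Assumption: a mismatch is an error rather than a re-download.
            die "wget arguments for $key differ from the remembered ones\n"
                unless $entry->{args} eq $wgetArgs;
            return $entry->{content};
        }
        my $content = $downloader->($wgetArgs);    # cache miss: download once
        $self->{downloads}++;
        $self->{files}{$key} = { args => $wgetArgs, content => $content };
        return $content;
    }

    sub downloadCount {
        my ($self) = @_;
        return $self->{downloads};
    }
}
```

Two projects that request the same name and version with the same arguments would then share one download, which is the synchronization benefit the slide mentions.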

Analysis Pipeline • Take a tour of the analysis pipeline file • Take a tour of the Steps.pm file • Take a tour of the property file

Pipeline Directory Structure • The directory which houses all the information for the pipeline including: • Input data • Logs • Result data • Pipeline control information: • Which steps have been completed • Property files to control cluster • Structured for easy comprehension • Take a tour of the directory structure

Analysis Pipeline API • GUS::Pipeline::Manager.pm • Declares properties • Prevents completed steps from being rerun • Calls plugins • Executes commands • Eases communication with the cluster • GUS::Pipeline::MakeTaskDirs.pm • Helps make the directories expected by distribjob on the cluster • GUS::Pipeline::TaskRunAndValidate.pm • Helps run a series of tasks on the cluster
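One way a manager can prevent completed steps from rerunning is to leave a marker file per finished step in the pipeline directory. The sketch below illustrates that idea only. `ToyStepRunner`, `runStep`, and the marker-file layout are hypothetical and are not taken from GUS::Pipeline::Manager.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical step runner; the marker-file scheme is an assumption.
{
    package ToyStepRunner;

    sub new {
        my ($class, $signalDir) = @_;
        return bless { signalDir => $signalDir }, $class;
    }

    # Run $action unless a marker for $stepName already exists.
    # Returns 1 if the step ran, 0 if it was skipped.
    sub runStep {
        my ($self, $stepName, $action) = @_;
        my $marker = File::Spec->catfile($self->{signalDir}, $stepName);
        return 0 if -e $marker;
        $action->();    # if this dies, no marker is written and the step can be retried
        open(my $fh, '>', $marker) or die "cannot write $marker: $!";
        close($fh);
        return 1;
    }
}

# Usage: the second call is skipped because the marker already exists.
my $signalDir = tempdir(CLEANUP => 1);
my $runner    = ToyStepRunner->new($signalDir);
my $timesRun  = 0;
$runner->runStep('extractSeqs', sub { $timesRun++ });
$runner->runStep('extractSeqs', sub { $timesRun++ });
print "extractSeqs ran $timesRun time(s)\n";
```

Because the markers live on disk, a pipeline restarted after a failure would resume at the first step without a marker.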

DJob • Manages the distribution of tasks across a compute cluster • Handles the case of a very large number of inputs which are processed independently and uniformly • For example, BLASTing a set of ESTs against a genome • Now available for clusters using the PBS cluster scheduler • http://core.pcbi.upenn.edu/tools/liniactools.html
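Because the inputs are independent and processed uniformly, they can be cut into fixed-size subtasks and handed to whichever node is free. The slides do not describe how DJob partitions its input, so the helper below is only a hypothetical illustration of that partitioning idea; `makeSubtasks` is not a DJob function.

```perl
use strict;
use warnings;

# Split a list of independent inputs into subtasks of at most $subtaskSize.
# Hypothetical helper for illustration; not part of DJob.
sub makeSubtasks {
    my ($inputs, $subtaskSize) = @_;
    die "subtask size must be positive\n" unless $subtaskSize > 0;
    my @remaining = @$inputs;    # copy, so the caller's list is untouched
    my @subtasks;
    push @subtasks, [ splice(@remaining, 0, $subtaskSize) ] while @remaining;
    return \@subtasks;
}

# Usage: seven EST identifiers in subtasks of three.
my $subtasks = makeSubtasks([ map { "est$_" } 1 .. 7 ], 3);
print scalar(@$subtasks), " subtasks\n";
```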

Perl Object Layer http://www.cbil.upenn.edu/~brunkb/PERL_Objects.html

Perl Object Layer • Simplifies database interactivity • Manages parent-child relationships • Manages submits (inserts, updates and deletes) • Submits children recursively • Automatic versioning • Sets default attributes (Ex. row_user_id) • Enforces read/write permissions • Code generator keeps objects consistent with the db • Extracts meta data from the db • Prints to XML and parses XML into objects

DbiDatabase Module • Creates login to the database • Allows use of all database objects • Has methods to get meta information • Ex: getTable(tableName) returns a DbiTable for access of FK and PK attributes • DbiDatabase object automatically instantiated by plugins • DbiDatabase objects must be explicitly instantiated in scripts

Object Constructor • TableName->new($hashRef)

Retrieving objects from DB • retrieveFromDB(\@attributesToNotRetrieve) • Attribute values already set on the object constrain the query • Returns 1 if successful (exactly one row matches) • Returns 0 if not successful • No rows or multiple rows
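The return-value rule can be illustrated without a database. In this sketch a plain hash plays the object, an array of hashes plays the table, and the attributes already set act as constraints. `rowMatches` and `retrieveFromRows` are hypothetical helpers written for this illustration; the real retrieveFromDB issues a database query and has a different signature.

```perl
use strict;
use warnings;

# True if every constraint attribute has the same value in the row.
sub rowMatches {
    my ($row, $constraints) = @_;
    for my $name (keys %$constraints) {
        return 0
            unless defined $row->{$name} && $row->{$name} eq $constraints->{$name};
    }
    return 1;
}

# Fill $object from the single matching row and return 1;
# return 0 when no row or more than one row matches.
sub retrieveFromRows {
    my ($object, $rows) = @_;
    my @matches = grep { rowMatches($_, $object) } @$rows;
    return 0 unless @matches == 1;
    %$object = %{ $matches[0] };
    return 1;
}

# Usage with made-up rows.
my @rows = (
    { na_sequence_id => 1, source_id => 'A' },
    { na_sequence_id => 2, source_id => 'B' },
    { na_sequence_id => 3, source_id => 'B' },
);
my $object = { source_id => 'A' };
print "retrieved id $object->{na_sequence_id}\n" if retrieveFromRows($object, \@rows);
```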

Getting and Setting Attributes • Attributes can be set using the individual object • Preferred, for additional functionality • Ex: setRowUserId($userId); • Attributes can be set using the superclass • set('row_user_id',$userId); • Get methods use similar syntax • getRowUserId() • get('row_user_id')
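The two access styles can be sketched with a small class pair. `ToyRow` and `ToyExternalSequence` are hypothetical stand-ins for the superclass and a generated table class. Only the method names `set`, `get`, `setRowUserId` and `getRowUserId` come from the slide, and the hash-reference constructor follows the `TableName->new($hashRef)` form shown earlier.

```perl
use strict;
use warnings;

# Hypothetical superclass holding attributes in a hash.
{
    package ToyRow;

    sub new {
        my ($class, $attributes) = @_;
        return bless { attributes => { %{ $attributes || {} } } }, $class;
    }

    sub set {
        my ($self, $name, $value) = @_;
        $self->{attributes}{$name} = $value;
        return $self;
    }

    sub get {
        my ($self, $name) = @_;
        return $self->{attributes}{$name};
    }
}

# Hypothetical table class with named accessors layered on set/get.
# A generated class could add checks or bookkeeping in these methods,
# which is one reason to prefer them over the generic calls.
{
    package ToyExternalSequence;
    use parent -norequire, 'ToyRow';

    sub setRowUserId {
        my ($self, $userId) = @_;
        return $self->set('row_user_id', $userId);
    }

    sub getRowUserId {
        my ($self) = @_;
        return $self->get('row_user_id');
    }
}

# Usage: both styles read and write the same attribute.
my $sequence = ToyExternalSequence->new({ source_id => 'X1' });
$sequence->setRowUserId(7);
print $sequence->get('row_user_id'), "\n";
```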

Managing submits to database • submit($notDeep, $noTran) • $notDeep = 1 only submits self but not children • $noTran = 1 does not begin or commit a transaction • addToSubmitList($object) • Additional $object gets submitted after main object and its children are submitted
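The order in which objects are submitted can be sketched with a toy class that records names instead of writing to a database. `ToySubmitRow` is hypothetical. Its `submit` takes a log array where the real method takes `$noTran`, since there is no transaction to manage here, and submitting the extra objects even when `$notDeep` is set is an assumption.

```perl
use strict;
use warnings;

# Hypothetical row that logs the order of submits.
{
    package ToySubmitRow;

    sub new {
        my ($class, $name) = @_;
        return bless { name => $name, children => [], submitList => [] }, $class;
    }

    sub addChild {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;
        return $self;
    }

    sub addToSubmitList {
        my ($self, $object) = @_;
        push @{ $self->{submitList} }, $object;
        return $self;
    }

    # Submit self, then children recursively unless $notDeep is set,
    # then any objects added with addToSubmitList.
    sub submit {
        my ($self, $notDeep, $log) = @_;
        $log ||= [];
        push @$log, $self->{name};
        unless ($notDeep) {
            $_->submit(0, $log) for @{ $self->{children} };
        }
        $_->submit($notDeep, $log) for @{ $self->{submitList} };
        return $log;
    }
}

# Usage: a gene with one RNA child and an extra object on its submit list.
my $gene = ToySubmitRow->new('gene');
$gene->addChild(ToySubmitRow->new('rna'));
$gene->addToSubmitList(ToySubmitRow->new('evidence'));
print join(' ', @{ $gene->submit }), "\n";
```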

Managing Parents • setParent($p) • getParent($className, $retrieveIfNoParent, \@doNotRetrieveAttributes) • retrieveParentFromDB($className, \@doNotRetrieveAttributes)

Managing Memory • undefPointerCache() • MUST be called in each loop iteration to allow garbage collection. • Removes all child and parent pointers, so they cannot be retrieved afterwards. • All other methods are automatic • addToPointerCache($ob) • getFromPointerCache($object_reference) • removeFromPointerCache($ob)
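The reason pointers must be cleared is a property of Perl itself: memory is reclaimed by reference counting, so a parent and child that point at each other are not freed until the cycle is broken. The sketch below demonstrates this with a hypothetical `ToyNode` class that counts destroyed objects. It clears pointers per object, whereas undefPointerCache() works on a cache maintained by the object layer, so this shows the underlying problem and not the real implementation.

```perl
use strict;
use warnings;

# Hypothetical node with parent and child pointers.
{
    package ToyNode;

    our $destroyed = 0;    # number of objects Perl has freed

    sub new {
        my ($class) = @_;
        return bless { parent => undef, children => [] }, $class;
    }

    sub addChild {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;
        $child->{parent} = $self;    # child -> parent closes a reference cycle
        return $self;
    }

    # Break the cycle so reference counts can drop to zero.
    sub undefPointers {
        my ($self) = @_;
        $_->{parent} = undef for @{ $self->{children} };
        $self->{children} = [];
        return $self;
    }

    sub DESTROY { $destroyed++ }
}

# Build $count parent/child pairs, optionally breaking the cycle each time.
sub buildFamilies {
    my ($count, $breakCycles) = @_;
    for (1 .. $count) {
        my $parent = ToyNode->new;
        $parent->addChild(ToyNode->new);
        $parent->undefPointers if $breakCycles;
    }
    return;
}
```

Calling `buildFamilies` without breaking the cycles leaves every pair allocated until the program exits, which is how a long loop over database rows can exhaust memory.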

Managing deletes • Deletes occur in two steps • markDeleted($doChildren) • Mark self deleted • If $doChildren = 1 then does this recursively • Deletes occur with submit
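The two-step delete can be sketched with a toy class: marking changes only in-memory state, and the delete is carried out at submit. `ToyDeletableRow` is hypothetical, its `submit` returns a list of actions instead of touching a database, and handling children before their parent is an assumption made so that a child never outlives the row it refers to.

```perl
use strict;
use warnings;

# Hypothetical row supporting mark-then-submit deletion.
{
    package ToyDeletableRow;

    sub new {
        my ($class, $name) = @_;
        return bless { name => $name, children => [], deleted => 0 }, $class;
    }

    sub addChild {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;
        return $self;
    }

    # Step 1: mark self, and children recursively if $doChildren is set.
    sub markDeleted {
        my ($self, $doChildren) = @_;
        $self->{deleted} = 1;
        if ($doChildren) {
            $_->markDeleted(1) for @{ $self->{children} };
        }
        return $self;
    }

    # Step 2: the delete takes effect at submit (children first: an assumption).
    sub submit {
        my ($self, $actions) = @_;
        $actions ||= [];
        $_->submit($actions) for @{ $self->{children} };
        push @$actions, ($self->{deleted} ? 'delete ' : 'keep ') . $self->{name};
        return $actions;
    }
}

# Usage: mark a gene and its exon, then submit.
my $gene = ToyDeletableRow->new('gene');
$gene->addChild(ToyDeletableRow->new('exon'));
$gene->markDeleted(1);
print join(', ', @{ $gene->submit }), "\n";
```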

Managing Children • getChildren($className, $retrieveIfNoChildren, $getDeletedToo, $where, \@doNotRetrieveAttributes) • getAllChildren($retrieve, $getDeletedToo, $where) • retrieveChildrenFromDB($className, $resetIfHave, $where, \@doNotRetrieveAttributes) • retrieveAllChildrenFromDB($recursive, $resetIfHave)

Methods for dealing with sequence • getSequence() • setSequence($sequence) • Removes line returns and non-sequence characters and then sets. • getFeatureSequence() • Retrieves the substring of the sequence to which that feature points • toFasta($type) • If $type = 1 the id used is the aa(or na)_sequence_id, otherwise it is the source_id
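The cleaning done by setSequence and the output of toFasta can be sketched as two plain functions. `cleanSequence` and `formatFasta` are hypothetical helpers, not the object layer's methods. Treating anything other than a letter as a non-sequence character, and wrapping FASTA lines at 60 columns, are both assumptions.

```perl
use strict;
use warnings;

# Strip line returns and anything that is not a letter.
# Assumption: only A-Z and a-z count as sequence characters.
sub cleanSequence {
    my ($raw) = @_;
    (my $clean = $raw) =~ s/[^A-Za-z]//g;
    return $clean;
}

# Format a FASTA record with a definition line and wrapped sequence lines.
# The 60-column default width is an assumption.
sub formatFasta {
    my ($id, $sequence, $width) = @_;
    $width ||= 60;
    my $body = join("\n", unpack("(A$width)*", $sequence));
    return ">$id\n$body\n";
}

# Usage: clean pasted sequence text, then print it under a made-up id.
my $sequence = cleanSequence("ACGT ACGT\r\nacgt 12\n");
print formatFasta('example_source_id', $sequence, 8);
```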

Printing • toString() • toXML($indent, $suppressDef, $doXmlIds, $family) • $suppressDef = 1 suppresses the default attributes below modification_date • $doXmlIds = 1 will print XML ids in the object tags • $family = 1 will print parent/child relationships in object tags rather than nesting children

Checking read and write permissions • checkReadPermission() • checkWritePermission()
