Learn about codifying the process of creating and analyzing data sets using Perl pipelines. Reduce errors and save resources.
Perl Object Layer & Pipelines • Pipelines: Steve Fischer, John Iodice, Deborah Pinney, Mark Heiges, Ed Robinson • Perl Object Layer: Brian Brunk, Mark Gibson, Dave Barkan
Pipeline Introduction • Sequential steps of • Plugin calls • Script calls • Cluster jobs • Purpose • Codifies the process of creating the data set • Reduces human resources • Reduces human error and omissions
Two Pipeline Types • Resources pipeline • Downloads resources from external sources • Loads resources into the database • Example: NRDB files • Analysis pipeline • Extracts data from the database • Runs analysis programs on the data • On the main or cluster server • Puts value-added data back into the database
Resource Pipeline • Invoked by: • loadresources xmlfile propfile • Take a tour of a resources XML file
Resources Repository • Destination of downloads • Houses files in a file system • Serves as a cache for files • Has an API to access files by name and version • If you request an existing file by name and version, the repository returns it without downloading it again • But the wget arguments must match those of the original download (the repository remembers them) • Particularly useful if multiple projects want to synchronize their data input
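The caching rule on this slide can be sketched as a small model. This is an illustration in Python, not the GUS repository code: the class name `ResourceRepository`, the `fetch` method, the version string, and the choice to raise an error when the arguments differ are all assumptions made for the example. The slide only says that the wget arguments must match.

```python
import os
import tempfile


class ResourceRepository:
    """Toy file cache keyed by (name, version); not the GUS implementation."""

    def __init__(self, root):
        self.root = root
        self._args = {}  # (name, version) -> arguments remembered at download time

    def _path(self, name, version):
        return os.path.join(self.root, name, version, "data")

    def fetch(self, name, version, args, download):
        """Return the cached file path, calling download(args) only on a cache miss."""
        key = (name, version)
        path = self._path(name, version)
        if key in self._args:
            # Assumption: mismatching arguments are treated as an error.
            if self._args[key] != tuple(args):
                raise ValueError(f"arguments for {name} {version} differ from the cached ones")
            return path
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(download(args))
        self._args[key] = tuple(args)
        return path


calls = []


def fake_download(args):
    # Stand-in for a real wget call; records each invocation.
    calls.append(tuple(args))
    return b">seq1\nACGT\n"


repo = ResourceRepository(tempfile.mkdtemp())
first = repo.fetch("nrdb", "2004-01-01", ["--passive-ftp"], fake_download)
second = repo.fetch("nrdb", "2004-01-01", ["--passive-ftp"], fake_download)
```

The second request returns the same path without invoking the download again, which is the behavior that lets several projects share one copy of an input file.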
Analysis Pipeline • Take a tour of the analysis pipeline file • Take a tour of the Steps.pm file • Take a tour of the property file
Pipeline Directory Structure • The directory which houses all the information for the pipeline including: • Input data • Logs • Result data • Pipeline control information: • Which steps have been completed • Property files to control cluster • Structured for easy comprehension • Take a tour of the directory structure
Analysis Pipeline API • GUS::Pipeline::Manager.pm • Declares properties • Prevents completed steps from being rerun • Calls plugins • Executes commands • Eases communication with the cluster • GUS::Pipeline::MakeTaskDirs.pm • Helps make the directories expected by distribjob on the cluster • GUS::Pipeline::TaskRunAndValidate.pm • Helps run a series of tasks on the cluster
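One common way to keep completed steps from rerunning is to record each finished step in the pipeline directory and check for that record before running. The sketch below models that idea in Python. It is not GUS::Pipeline::Manager: the `StepManager` class, the `signals` subdirectory, and the step name are hypothetical.

```python
import os
import tempfile


class StepManager:
    """Toy step runner that records finished steps as marker files."""

    def __init__(self, pipeline_dir):
        self.signal_dir = os.path.join(pipeline_dir, "signals")
        os.makedirs(self.signal_dir, exist_ok=True)

    def run_step(self, name, action):
        """Run action() unless a marker for this step already exists."""
        marker = os.path.join(self.signal_dir, name)
        if os.path.exists(marker):
            return False  # step was completed in an earlier run
        action()
        # Written only after the action succeeds, so a failed step is retried.
        open(marker, "w").close()
        return True


log = []
pipeline_dir = tempfile.mkdtemp()
manager = StepManager(pipeline_dir)
ran_first = manager.run_step("extractEsts", lambda: log.append("extract"))
# A new manager over the same directory simulates restarting the pipeline.
ran_again = StepManager(pipeline_dir).run_step("extractEsts", lambda: log.append("extract"))
```

Because the record lives on disk rather than in memory, restarting the pipeline after a failure resumes at the first step that has no marker.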
DJob • Manages the distribution of tasks across a compute cluster • Handles the case of a very large number of inputs which are processed independently and uniformly • For example, blasting a set of ESTs against a genome • Now available for clusters using the PBS cluster scheduler • http://core.pcbi.upenn.edu/tools/liniactools.html
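The idea of processing many independent inputs uniformly can be sketched by cutting the input into fixed-size subtasks and dealing them out to nodes. This Python model is only an illustration: the function names, the subtask size, and the round-robin assignment are assumptions, and a real distributor would likely hand out work as nodes become free.

```python
def split_into_subtasks(inputs, subtask_size):
    """Cut the input list into consecutive chunks of at most subtask_size items."""
    if subtask_size < 1:
        raise ValueError("subtask_size must be positive")
    return [inputs[i:i + subtask_size] for i in range(0, len(inputs), subtask_size)]


def assign_round_robin(subtasks, node_count):
    """Deal subtasks out to nodes in turn; returns one list of subtasks per node."""
    nodes = [[] for _ in range(node_count)]
    for index, subtask in enumerate(subtasks):
        nodes[index % node_count].append(subtask)
    return nodes


# Hypothetical input: seven EST identifiers, three per subtask, two nodes.
ests = [f"EST{i}" for i in range(1, 8)]
subtasks = split_into_subtasks(ests, 3)
plan = assign_round_robin(subtasks, 2)
```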
Perl Object Layer http://www.cbil.upenn.edu/~brunkb/PERL_Objects.html
Perl Object Layer • Simplifies database interactivity • Manages parent-child relationships • Manages submits (inserts, updates and deletes) • Submits children recursively • Automatic versioning • Sets default attributes (Ex. row_user_id) • Enforces read/write permissions • Code generator - objects consistent with db • Extracts meta data from db • Prints to XML and parses XML into objects
DbiDatabase Module • Creates login to the database • Allows use of all database objects • Has methods to get meta information • Ex: getTable(tableName) returns a DbiTable for access of FK and PK attributes • DbiDatabase object automatically instantiated by plugins • DbiDatabase objects must be explicitly instantiated in scripts
Object Constructor • TableName->new($hashRef)
Retrieving objects from DB • retrieveFromDB(\@attributesToNotRetrieve) • Returns 1 if successful • Constrains attribute values • Returns 0 if not successful • No rows or multiple rows match
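The return convention can be modeled with an in-memory SQLite table. This Python sketch assumes that the attribute values already set act as the query constraints, which is one reading of the slide. The function `retrieve_from_db` and the `taxon` table are hypothetical, and the list of attributes to skip is left out.

```python
import sqlite3


def retrieve_from_db(conn, table, attributes):
    """Fill `attributes` from the single matching row; return 1 on success, else 0."""
    columns = sorted(attributes)
    # Table and column names are interpolated directly; acceptable only in a sketch.
    where = " AND ".join(f"{col} = ?" for col in columns) or "1 = 1"
    cursor = conn.execute(f"SELECT * FROM {table} WHERE {where}",
                          [attributes[col] for col in columns])
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    if len(rows) != 1:
        return 0  # no rows or multiple rows
    attributes.update(zip(names, rows[0]))
    return 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxon (taxon_id INTEGER, name TEXT, category TEXT)")
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)",
                 [(1, "Homo sapiens", "species"), (2, "Mus musculus", "species")])

human = {"name": "Homo sapiens"}
found = retrieve_from_db(conn, "taxon", human)
ambiguous = retrieve_from_db(conn, "taxon", {"category": "species"})
missing = retrieve_from_db(conn, "taxon", {"name": "Danio rerio"})
```

Treating "multiple rows" as a failure forces the caller to supply enough attributes to identify exactly one row.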
Getting and Setting Attributes • Attributes can be set using the individual object • Preferred, for additional functionality • Ex: setRowUserId($userId); • Attributes can be set using the superclass • set('row_user_id',$userId); • Get methods use similar syntax • getRowUserId() • get('row_user_id')
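The correspondence between a column name such as row_user_id and accessor names such as setRowUserId can be sketched as follows. This Python model only illustrates the naming scheme. The real object layer uses generated Perl classes, and the `Row` class and helper functions here are hypothetical.

```python
import functools
import re


def accessor_name(prefix, attribute):
    """Build a camel-case accessor name such as setRowUserId from row_user_id."""
    return prefix + "".join(part.capitalize() for part in attribute.split("_"))


def attribute_name(camel):
    """Turn RowUserId back into row_user_id."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", camel).lower()


class Row:
    """Toy row with generic get/set plus name-derived accessors."""

    def __init__(self):
        self._values = {}

    def set(self, attribute, value):
        self._values[attribute] = value

    def get(self, attribute):
        return self._values.get(attribute)

    def __getattr__(self, name):
        # Called only for unknown names: map setXxx/getXxx onto set/get.
        for prefix in ("set", "get"):
            rest = name[len(prefix):]
            if name.startswith(prefix) and rest[:1].isupper():
                return functools.partial(getattr(self, prefix), attribute_name(rest))
        raise AttributeError(name)


row = Row()
row.setRowUserId(7)
```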
Managing submits to database • submit($notDeep, $noTran) • $notDeep = 1 submits only the object itself, not its children • $noTran = 1 does not begin or commit a transaction • addToSubmitList($object) • The additional $object gets submitted after the main object and its children are submitted
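The submit order described on this slide can be modeled with a small tree. This Python sketch is not the object layer itself: the `Node` class is hypothetical, transactions are not modeled, and processing the submit list even when `not_deep` is set is an assumption.

```python
class Node:
    """Toy object with children and an extra submit list."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.submit_list = []

    def add_to_submit_list(self, obj):
        self.submit_list.append(obj)

    def submit(self, written, not_deep=False):
        """Append names to `written` in the order rows would be written."""
        written.append(self.name)
        if not not_deep:
            for child in self.children:
                child.submit(written)
        # Assumption: extra objects are submitted even for a shallow submit.
        for extra in self.submit_list:
            extra.submit(written)


gene = Node("gene")
gene.children = [Node("exon1"), Node("exon2")]
gene.add_to_submit_list(Node("note"))

deep_order = []
gene.submit(deep_order)

shallow_order = []
gene.submit(shallow_order, not_deep=True)
```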
Managing Parents • setParent($p) • getParent($className, $retrieveIfNoParent, \@doNotRetrieveAttributes) • retrieveParentFromDB($className, \@doNotRetrieveAttributes)
Managing Memory • undefPointerCache() • MUST be called in each loop iteration to allow garbage collection. • Removes all child and parent pointers, so they cannot be retrieved afterwards. • All other methods are called automatically • addToPointerCache($ob) • getFromPointerCache($object_reference) • removeFromPointerCache($ob)
Managing deletes • Deletes occur in two steps • markDeleted($doChildren) • Marks the object itself as deleted • If $doChildren = 1, marks its children recursively as well • The actual delete occurs on submit
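The two-step delete can be modeled by separating the mark from the write. In this Python sketch, the `Row` class and the use of a set as the "table" are hypothetical, and deleting children before their parent is an assumption made to mirror foreign-key ordering.

```python
class Row:
    """Toy row whose deletion takes effect only on submit."""

    def __init__(self, key):
        self.key = key
        self.children = []
        self.deleted = False

    def mark_deleted(self, do_children=False):
        """Step 1: flag the row, and optionally its children, without touching storage."""
        self.deleted = True
        if do_children:
            for child in self.children:
                child.mark_deleted(True)

    def submit(self, table):
        """Step 2: apply the pending state to `table` (a set of stored keys)."""
        if self.deleted:
            # Assumption: children go first so no child outlives its parent.
            for child in self.children:
                child.submit(table)
            table.discard(self.key)
        else:
            table.add(self.key)
            for child in self.children:
                child.submit(table)


gene = Row("gene")
gene.children.append(Row("exon1"))
table = set()
gene.submit(table)
stored = set(table)

gene.mark_deleted(do_children=True)
after_mark = set(table)  # nothing is removed until submit is called
gene.submit(table)
```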
Managing Children • getChildren($className, $retrieveIfNoChildren, $getDeletedToo, $where, \@doNotRetrieveAttributes) • getAllChildren($retrieve, $getDeletedToo, $where) • retrieveChildrenFromDB($className, $resetIfHave, $where, \@doNotRetrieveAttributes) • retrieveAllChildrenFromDB($recursive, $resetIfHave)
Methods for dealing with sequence • getSequence() • setSequence($sequence) • Removes line breaks and non-sequence characters, then sets the sequence • getFeatureSequence() • Retrieves the substring of the sequence to which the feature points • toFasta($type) • If $type = 1, the id used is the aa_sequence_id (or na_sequence_id); otherwise it is the source_id
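The cleaning and FASTA formatting steps can be sketched in a few lines. This Python model is not the object layer's code. Which characters count as "non-sequence" is an assumption (here: anything other than letters, `*` and `-`), as are the function names and the 60-column line width.

```python
import re


def clean_sequence(raw):
    """Drop line breaks and anything that is not a letter, '*' or '-'."""
    # Assumption: letters plus '*' (stop) and '-' (gap) are valid sequence characters.
    return re.sub(r"[^A-Za-z*-]", "", raw)


def to_fasta(identifier, sequence, width=60):
    """Format one FASTA record with the sequence wrapped at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + str(identifier) + "\n" + "\n".join(lines) + "\n"


cleaned = clean_sequence("ACGT\r\nAC 12GT\n")
record = to_fasta("src1", cleaned, width=4)
```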
Printing • toString() • toXML($indent, $suppressDef, $doXmlIds, $family) • $suppressDef = 1: default attributes below modification_date are suppressed • $doXmlIds = 1: XML ids are printed in the object tags • $family = 1: parent/child relationships are printed in the object tags rather than by nesting children
Checking read and write permissions • checkReadPermission() • checkWritePermission()