Dataset Classes

A dataset class tells us how to handle a particular type of dataset: exactly how to put it into manual delivery (it specifies the API for manual delivery), how to put it in the database (resource XML), and how to process it in the workflow (graph XML).

wolfe

Presentation Transcript


  1. Dataset Classes
     • A dataset class tells us:
        • How to handle a particular type of dataset
        • Exactly how to put it into manual delivery (it specifies the API for manual delivery)
        • How to put it in the database (resource XML)
        • How to process it in the workflow (graph XML)

  2. Human Roles
     • Dataset Integrator
        • Puts datasets into manual delivery (conforming to the dataset class API)
        • Provides a specification of each dataset for the workflow
     • Workflow Pilot
        • Configures the workflow
        • Runs the workflow
     • Workflow Developer
        • Writes dataset classes
        • Writes graph files
        • Writes step classes
        • Writes plugins
     • ReFlow Developer
        • Develops the underlying workflow system

  3. Organism Abbrev
     • Throughout the workflow system, we use a unique, stable "identifier" for an organism: its organism abbrev
     • We do not use things like taxon IDs, scientific names, etc.
     • Examples:
        • tgonME49
        • pfal3D7
        • ncanLIV
     • It always includes:
        • One letter for the genus
        • Three letters for the species
        • The strain
     • Once it is set, it does not change, even if we adjust the name of the organism
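The naming rule on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not say which three species letters are used or how case is handled, so taking the first three letters in lowercase is an assumption read off the three examples. The function names `make_org_abbrev` and `looks_like_org_abbrev` are hypothetical and not part of the workflow system.

```python
import re

# Assumed shape, inferred from the examples on the slide:
# one lowercase genus letter, three lowercase species letters, then the strain.
ABBREV_PATTERN = re.compile(r"[a-z][a-z]{3}[A-Za-z0-9._-]+")


def make_org_abbrev(genus: str, species: str, strain: str) -> str:
    """Build an organism abbrev from genus, species and strain (hypothetical helper)."""
    strain = strain.strip()
    if not genus or len(species) < 3 or not strain:
        raise ValueError("need a genus, a species of at least 3 letters, and a strain")
    # Assumption: first genus letter and first three species letters, lowercased.
    return genus[0].lower() + species[:3].lower() + strain


def looks_like_org_abbrev(abbrev: str) -> bool:
    """Check that a string has the assumed genus/species/strain shape."""
    return ABBREV_PATTERN.fullmatch(abbrev) is not None


print(make_org_abbrev("Toxoplasma", "gondii", "ME49"))  # tgonME49
```

Because the abbrev must stay fixed once it is set, a helper like this would only make sense when an organism is first registered, not as a way to recompute the identifier later.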

  4. Manual Delivery
     • Manual delivery has a very specific structure:
          manualDelivery/
            project/
              organismAbbrev/
                category/
                  datasetName/
                    datasetVersion/
                      final/
                      fromProvider/
                      workspace/
                      README
     • final/ contains standard file names that conform to the dataset class API
        • E.g.: SNPs.gff
        • They never include the name of the provider or any other dataset-specific info
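As a minimal sketch, the location of a dataset's final/ directory can be assembled from its parts. This assumes the levels listed on the slide are nested in the order given, with final/ sitting directly under the dataset version. The function name `final_dir` and the values `myProject`, `myOrg`, `myCategory`, `myDataset` and `1.0` are placeholders for illustration only.

```python
from pathlib import PurePosixPath


def final_dir(root, project, org_abbrev, category, dataset_name, dataset_version):
    """Return the assumed path of a dataset's final/ directory (hypothetical helper)."""
    # Assumption: each level from the slide nests inside the previous one.
    return PurePosixPath(
        root, project, org_abbrev, category, dataset_name, dataset_version, "final"
    )


# Placeholder values; only the standard file name SNPs.gff comes from the slide.
snps_file = final_dir(
    "manualDelivery", "myProject", "myOrg", "myCategory", "myDataset", "1.0"
) / "SNPs.gff"
print(snps_file)
```

Note that nothing provider-specific appears in the file name itself; the provider and version are carried only by the directory levels above final/.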

  5. Datasets
     Inputs to the code generator:
     • Datasets (myOrg.xml)
          <dataset class="dbxrefs">
            <prop name="orgAbbrev">myOrg</prop>
            <prop name="name">uniprot</prop>
            <prop name="version">2.0</prop>
          </dataset>
     • Dataset Classes (classes.xml)
          <datasetClass name="dbxrefs">
            <prop name="orgAbbrev"/>
            <prop name="name"/>
            <prop name="version"/>
            <graphPlanFile name="dbXRefs.xml"/>
            <resource name="${orgAbbrev}_${name}_dbxrefs">
              <manualGet/>
              …
            </resource>
          </datasetClass>
     • Workflow Plan (dbXRefs.xml)
          <workflow>
            <datasetTemplate class="dbxrefs">
              <prop name="orgAbbrev"/>
              <prop name="name"/>
              <subgraph name="${orgAbbrev}_${name}_dbxrefs" xmlFile="loadResources.xml">
                <paramValue name="what">for</paramValue>
              </subgraph>
            </datasetTemplate>
            ..
          </workflow>
     Generated files:
     • Resources (myOrg.xml)
          <resources>
            <resource name="myOrg_uniprot_dbxrefs">
              …
            </resource>
            …
          </resources>
     • Workflow Graph (myOrg/dbXRefs.xml)
          <workflow>
            <step/>
            <subgraph name="myOrg_uniprot_dbxrefs"/>
            <step/>
          </workflow>
     • The diagram also shows a Top Level Graph and further graphs (each labeled "Another Graph") alongside the Workflow Graph
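The substitution shown on this slide, where the props of a `<dataset>` fill the `${...}` placeholders of a dataset class, can be sketched as follows. This is not the actual code generator, only a small Python illustration of the idea using the standard library. The helper names `read_props` and `expand` are hypothetical.

```python
import xml.etree.ElementTree as ET
from string import Template

# The <dataset> element from the slide, with plain ASCII quotes.
DATASET_XML = """
<dataset class="dbxrefs">
  <prop name="orgAbbrev">myOrg</prop>
  <prop name="name">uniprot</prop>
  <prop name="version">2.0</prop>
</dataset>
"""


def read_props(dataset_xml: str) -> dict:
    """Collect the <prop> name/value pairs of one <dataset> element."""
    root = ET.fromstring(dataset_xml)
    return {prop.get("name"): (prop.text or "") for prop in root.findall("prop")}


def expand(template: str, props: dict) -> str:
    """Fill ${...} placeholders; raises KeyError if a prop is missing."""
    return Template(template).substitute(props)


props = read_props(DATASET_XML)
resource_name = expand("${orgAbbrev}_${name}_dbxrefs", props)
print(resource_name)  # myOrg_uniprot_dbxrefs
```

The expanded name matches the `<resource name="myOrg_uniprot_dbxrefs">` and `<subgraph name="myOrg_uniprot_dbxrefs">` entries in the generated files on the slide, which is how the resource and the graph stay linked.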

  6. Dataset Files
     • Dataset files:
        • ToxoDB.xml
        • ToxoDB/tgonME49.xml
        • ToxoDB/tgonME49/Einstein.xml
     • These generate resource files:
        • ToxoDB.xml
        • ToxoDB/tgonME49.xml
        • ToxoDB/tgonME49/Einstein.xml
     • And graph files:
        • ToxoDB/project.xml
        • ToxoDB/tgonME49/ESTs.xml
        • ToxoDB/tgonME49/dbXRefs.xml
        • ToxoDB/tgonME49/arrayStudies.xml
        • ToxoDB/tgonME49/SNPs.xml
        • ToxoDB/tgonME49/Einstein/chipChipSamples.xml
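The file names above suggest a simple naming pattern: a generated graph file appears to live in a directory named after the dataset file, and to take the name of the graph plan file. That rule is an inference from the examples (for instance myOrg.xml plus dbXRefs.xml giving myOrg/dbXRefs.xml), not something the slides state. The following Python sketch, with the hypothetical helper `graph_file_for`, shows that reading.

```python
from pathlib import PurePosixPath


def graph_file_for(dataset_file: str, graph_plan_file: str) -> str:
    """Guess the generated graph file path (assumed rule, hypothetical helper)."""
    # Assumption: drop the dataset file's .xml suffix to get a directory,
    # then place the graph plan file name inside it.
    dataset_dir = PurePosixPath(dataset_file).with_suffix("")
    return str(dataset_dir / graph_plan_file)


print(graph_file_for("ToxoDB/tgonME49.xml", "dbXRefs.xml"))  # ToxoDB/tgonME49/dbXRefs.xml
```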

  7. DataSource
     • We store simple meta information in the database about each dataset:
        • Provider contact info
        • Descriptions
        • Display names
        • References to WDK searches, tables and attributes that use the data
     • The information is stored in two tables:
        • DataSource: pulled right from the <resource>
        • DataSourceInfo: provided by a specific file after loading of the data is completed
     • It is available in the WDK as a DataSource record
        • The search and record pages (e.g. Gene) can access this info for display purposes
        • Soon we will support searches for these, e.g. find all searches that involve a certain dataset

  8. DataResource?
     • It makes no sense to have two names:
        • <resource>
        • DataSource table, perl objects, and WDK record
     • So, either:
        • Rename <resource> to <datasource>
           • This is a pain to transition to in our code
        • Or, rename DataSource to DataResource and keep <resource> as is

  9. DataResourceInfo
     • DatasetClasses do not include meta info about the dataset:
        • Contact info
        • Description
        • Mapping to WDK searches and records
     • DatasetClasses describe how to load the data
     • But, we can have DatasetClass
