1 / 26

Provenance and the Price of Identity

Provenance and the Price of Identity. Adriane Chapman H. V. Jagadish. Obj. 2. Obj. 1. Obj. 3. Obj. 4. Obj. 5. Obj. 6. Obj. 7. Identity and Provenance. ALIGNMENT MOUSE_ABC1 HUMAN_ABC1. Create Q-gram In. Object3. FASTA8. Mod BLAST. Hash-Join Filter. Object9. Object7.

maalik
Download Presentation

Provenance and the Price of Identity

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Provenance and the Price of Identity Adriane Chapman H. V. Jagadish

  2. Obj. 2 Obj. 1 Obj. 3 Obj. 4 Obj. 5 Obj. 6 Obj. 7 Identity and Provenance ALIGNMENT MOUSE_ABC1 HUMAN_ABC1

  3. Create Q-gram In. Object3 FASTA8 Mod BLAST Hash-Join Filter Object9 Object7 Object5 Create Q-gram In. Transform Affy1 Object2 Provenance Model • Fine-grained provenance • Attached to any data item • A data item is an attribute, tuple, table, dataset, element, etc • A record of exactly what happened to that data item

  4. Identity vs. Storage • Storage: A data item has been saved and stored for possible reuse • Identity: A data item has been characterized and is immutable • If a data item changes, a new identity must be assigned • A stored data item must have an identity • An unstored data item may have an identity for provenance purposes

  5. The Problem • Workflow systems, Taverna, VisTrails, etc will automatically take care of identity • What happens when there is no workflow system? • What guidelines should users have for identity in “implicit” workflows

  6. Outline • Introduction • Types of Identification • Pros and Cons • Evaluation • Conclusions

  7. Create Q-gram In. Object3 FASTA8 Mod BLAST Hash-Join Filter Object9 Object7 Object5 Create Q-gram In. Transform Affy1 Object2 Strong Identification • Every data item produced or used receives a unique identifier • Managed by an authority • All data items immutable • E.g. LSID in Taverna • URI:urn:lsid:mygrid.ac.uk:data:49:1

  8. Create Q-gram In. Object3 FASTA8 Obj7 (3,5) Mod BLAST Hash-Join Filter Object9 Object5 Create Q-gram In. Transform Affy1 Object2 Strong Identification + IDSet • In the case of aggregations, the ID of the aggregation contains the component ids • Facilitates understanding of aggregation contents • Outlined in Zhao, Goble, Stevens 2006

  9. Create Q-gram In. Object3 FASTA8 Mod BLAST Hash-Join Filter Object9 Object5 Create Q-gram In. Transform Affy1 Intermittent Identification • If data items produced are only used by 1 manipulation • When do you store an intermittent id?

  10. Intermittent Identification - Interval • Based on intervals between Identification • Break up long chains of manipulations • Allows easy ‘re-do’ from the last ‘checkpointed’ object

  11. Intermittent Identification – Manip. • Based on the Manipulation • Some manipulations may seriously alter the data item • Must be indicated by the user

  12. Intermittent Identifiation – Aggreg. • Aggregations • To go back through some ASPJ queries, intermediates need to be stored

  13. Initial Identification • Only initial input data items are identified • To find any intermediate result, or check an output, values must be recreated • Reverse Transformations of manipulations • Re-running of manipulations Create Q-gram In. FASTA8 Mod BLAST Hash-Join Filter Object9 Create Q-gram In. Transform Affy1

  14. Outline • Introduction • Types of Identification • Pros and Cons • Evaluation • Conclusions

  15. Abilities of each ID Strategy The relative strengths are shown here. What are the costs?

  16. Outline • Introduction • Types of Identification • Pros and Cons • Evaluation • Conclusions

  17. Evaluation Setup • Created a set of manipulations: • Aggregate • Reuse • 1 use • Manipulations input and output: 10, 100, 1000 data items • Manipulations have constant time

  18. Time to create and store identity These results are expected. But intermittent is highly attractive for “implicit” workflows with no reuse.

  19. Space to store identity

  20. Identity Strategy and intermediate re-creation By changing the run-time of the manipulations, the identification strategies fare differently. Because of the need for re-running, if the run-time is too long, Strong ID actually is faster.

  21. Outline • Introduction • Types of Identification • Pros and Cons • Evaluation • Conclusions

  22. Conclusions • There are multiple ways to manage identity • Especially in “implicit” workflow systems • We explore the relative costs and trade-offs for various identification strategies • We highlight areas of consideration for users of “implicit” workflow systems • When savings can be leveraged by using weaker identification without loss of provenance information

  23. Conclusions cont. • System creators can use what they know of their system needs to find the best identification strategy • If you never re-create intermediates, no need to identify them • Use input only identification. • If you want to share intermediates across experiments, you have to identify them • Use Strong or Strong IDSet identification

  24. Systems that use Strong Strategies • Strong • Vistrails, myGRID, Taverna • Strong + IDSet • Taverna extension of Zhao, Goble, Stevens 2006 Unsurprisingly, these are workflow systems

  25. abstracts Script1 Script2 Script4 Script4 Execute code trained on dataset Sentences Named Entity Tag Script3 Script5 Parse Trees SVM Classifier Chk1 interacts with p53 Example of processes found in BioSearch-2D Systems that use “Weak” Strategies • Intermittent • NCIBI NLP-backend for BioSearch-2D • Input • MiMI, MicroarrayDB

  26. Questions?

More Related