Provenance and the price of identity
This presentation is the property of its rightful owner.
Sponsored Links
1 / 26

Provenance and the Price of Identity PowerPoint PPT Presentation


  • 53 Views
  • Uploaded on
  • Presentation posted in: General

Provenance and the Price of Identity. Adriane Chapman H. V. Jagadish. Obj. 2. Obj. 1. Obj. 3. Obj. 4. Obj. 5. Obj. 6. Obj. 7. Identity and Provenance. ALIGNMENT MOUSE_ABC1 HUMAN_ABC1. Create Q-gram In. Object3. FASTA8. Mod BLAST. Hash-Join Filter. Object9. Object7.

Download Presentation

Provenance and the Price of Identity

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Provenance and the price of identity

Provenance and the Price of Identity

Adriane Chapman

H. V. Jagadish


Identity and provenance

Obj.

2

Obj.

1

Obj.

3

Obj.

4

Obj.

5

Obj.

6

Obj.

7

Identity and Provenance

ALIGNMENT

MOUSE_ABC1

HUMAN_ABC1


Provenance model

Create

Q-gram In.

Object3

FASTA8

Mod

BLAST

Hash-Join

Filter

Object9

Object7

Object5

Create

Q-gram In.

Transform

Affy1

Object2

Provenance Model

  • Fine-grained provenance

  • Attached to any data item

    • A data item is an attribute, tuple, table, dataset, element, etc

  • A record of exactly what happened to that data item


Identity vs storage

Identity vs. Storage

  • Storage: A data item has been saved and stored for possible reuse

  • Identity: A data item has been characterized and is immutable

    • If a data item changes, a new identity must be assigned

  • A stored data item must have an identity

  • An unstored data item may have an identity for provenance purposes


The problem

The Problem

  • Workflow systems, Taverna, VisTrails, etc will automatically take care of identity

  • What happens when there is no workflow system?

    • What guidelines should users have for identity in “implicit” workflows


Outline

Outline

  • Introduction

  • Types of Identification

  • Pros and Cons

  • Evaluation

  • Conclusions


Strong identification

Create

Q-gram In.

Object3

FASTA8

Mod

BLAST

Hash-Join

Filter

Object9

Object7

Object5

Create

Q-gram In.

Transform

Affy1

Object2

Strong Identification

  • Every data item produced or used receives a unique identifier

  • Managed by an authority

    • All data items immutable

  • E.g. LSID in Taverna

    • URI:urn:lsid:mygrid.ac.uk:data:49:1


Strong identification idset

Create

Q-gram In.

Object3

FASTA8

Obj7

(3,5)

Mod

BLAST

Hash-Join

Filter

Object9

Object5

Create

Q-gram In.

Transform

Affy1

Object2

Strong Identification + IDSet

  • In the case of aggregations, the ID of the aggregation contains the component ids

    • Facilitates understanding of aggregation contents

  • Outlined in Zhao, Goble, Stevens 2006


Intermittent identification

Create

Q-gram In.

Object3

FASTA8

Mod

BLAST

Hash-Join

Filter

Object9

Object5

Create

Q-gram In.

Transform

Affy1

Intermittent Identification

  • If data items produced are only used by 1 manipulation

  • When do you store an intermittent id?


Intermittent identification interval

Intermittent Identification - Interval

  • Based on intervals between Identification

    • Break up long chains of manipulations

  • Allows easy ‘re-do’ from the last ‘checkpointed’ object


Intermittent identification manip

Intermittent Identification – Manip.

  • Based on the Manipulation

    • Some manipulations may seriously alter the data item

  • Must be indicated by the user


Intermittent identifiation aggreg

Intermittent Identifiation – Aggreg.

  • Aggregations

    • To go back through some ASPJ queries, intermediates need to be stored


Initial identification

Initial Identification

  • Only initial input data items are identified

  • To find any intermediate result, or check an output, values must be recreated

    • Reverse Transformations of manipulations

    • Re-running of manipulations

Create

Q-gram In.

FASTA8

Mod

BLAST

Hash-Join

Filter

Object9

Create

Q-gram In.

Transform

Affy1


Outline1

Outline

  • Introduction

  • Types of Identification

  • Pros and Cons

  • Evaluation

  • Conclusions


Abilities of each id strategy

Abilities of each ID Strategy

The relative strengths

are shown here.

What are the costs?


Outline2

Outline

  • Introduction

  • Types of Identification

  • Pros and Cons

  • Evaluation

  • Conclusions


Evaluation setup

Evaluation Setup

  • Created a set of manipulations:

    • Aggregate

    • Reuse

    • 1 use

  • Manipulations input and output: 10, 100, 1000 data items

  • Manipulations have constant time


Time to create and store identity

Time to create and store identity

These results are expected.

But intermittent is highly attractive for “implicit” workflows with no reuse.


Space to store identity

Space to store identity


Identity strategy and intermediate re creation

Identity Strategy and intermediate re-creation

By changing the run-time of the manipulations, the identification strategies fare differently.

Because of the need for re-running, if the run-time is too long,

Strong ID actually is faster.


Outline3

Outline

  • Introduction

  • Types of Identification

  • Pros and Cons

  • Evaluation

  • Conclusions


Conclusions

Conclusions

  • There are multiple ways to manage identity

    • Especially in “implicit” workflow systems

  • We explore the relative costs and trade-offs for various identification strategies

  • We highlight areas of consideration for users of “implicit” workflow systems

    • When savings can be leveraged by using weaker identification without loss of provenance information


Conclusions cont

Conclusions cont.

  • System creators can use what they know of their system needs to find the best identification strategy

    • If you never re-create intermediates, no need to identify them

      • Use input only identification.

    • If you want to share intermediates across experiments, you have to identify them

      • Use Strong or Strong IDSet identification


Systems that use strong strategies

Systems that use Strong Strategies

  • Strong

    • Vistrails, myGRID, Taverna

  • Strong + IDSet

    • Taverna extension of Zhao, Goble, Stevens 2006

Unsurprisingly, these are workflow systems


Systems that use weak strategies

abstracts

Script1

Script2

Script4

Script4

Execute code trained on dataset

Sentences

Named Entity Tag

Script3

Script5

Parse Trees

SVM Classifier

Chk1 interacts with p53

Example of processes found in BioSearch-2D

Systems that use “Weak” Strategies

  • Intermittent

    • NCIBI NLP-backend for BioSearch-2D

  • Input

    • MiMI, MicroarrayDB


Questions

Questions?


  • Login