Enabling Privacy in Provenance-Aware Workflow Systems. Susan B. Davidson (1). Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyanovich, Val Tannen (1); Yi Chen (2); Tova Milo (3). (1) University of Pennsylvania, (2) Arizona State University, (3) Tel Aviv University.


Enabling Privacy in Provenance-Aware Workflow Systems

Susan B. Davidson¹

Joint work with

Sanjeev Khanna, Sudeepa Roy, Julia Stoyanovich, Val Tannen¹

Yi Chen²

Tova Milo³

¹University of Pennsylvania ²Arizona State University ³Tel Aviv University


Science has been revolutionized

  • High-throughput technologies generate floods of data, which must be analyzed to create knowledge.

  • Scientists are turning to scientific workflow systems to help manage the analysis.


Scientific Workflow

  • Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment)

  • Vertex ≡ Module (program)

  • Edge ≡ Dataflow

  • Run: An execution of the workflow

    • Actual data appears on the edges

[Figure: example workflow whose boxes are labeled Split Entries, Align Sequences, Format-1, Format-2, Format-3, Functional Data, Curate Annotations, and Construct Trees; DNA sequences such as TGCCGTGTGGCTAAATG… appear on the edges.]
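
The graph model on this slide can be made concrete with a small sketch. It assumes a workflow is a directed acyclic graph whose vertices are Python callables and whose edges carry each module's output to its successors. The module names and their toy behavior are hypothetical stand-ins, not the modules shown in the figure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(modules, edges, inputs):
    """Execute every module once, in dataflow order.

    modules: name -> callable
    edges:   (producer, consumer) pairs
    inputs:  name -> external input, for modules with no incoming edge
    Returns (output of each module, data observed on each edge).
    """
    preds = {name: [] for name in modules}
    for src, dst in edges:
        preds[dst].append(src)

    outputs, edge_data = {}, {}
    # static_order() yields a module only after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        if preds[name]:
            args = [outputs[p] for p in preds[name]]
        else:
            args = [inputs[name]]
        outputs[name] = modules[name](*args)
        for p in preds[name]:
            edge_data[(p, name)] = outputs[p]
    return outputs, edge_data

# Hypothetical toy modules (illustration only).
modules = {
    "split": lambda text: text.split(","),
    "align": lambda seqs: sorted(seqs),
    "count": lambda seqs: len(seqs),
    "report": lambda aligned, n: f"{n} sequences: {aligned}",
}
edges = [("split", "align"), ("split", "count"),
         ("align", "report"), ("count", "report")]

outputs, edge_data = run_workflow(modules, edges, {"split": "GTC,ACG"})
print(outputs["report"])
```

Here `edge_data` plays the role of the run: it records which actual data item traveled along each edge, which is the raw material for provenance.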


Need for provenance

[Figure: a biologist’s workspace showing a workflow run from s to t, with boxes labeled Split Entries, Align Sequences, Format, Functional Data, Curate Annotations, and Construct Trees, and DNA sequences such as TGCCGTGTGGCTAAATGTCTGTGC… on the edges.]

  • How has this tree been generated?

  • Which sequences have been used to produce this tree?

Need for privacy


Outline

  • The need: Workflow provenance repositories

  • The challenge of privacy

  • Privacy-aware search and query


Workflow Provenance Repositories


Current situation: Workflow repositories

  • To enable sharing and reuse, repositories of workflow specifications are being created

    • e.g. myExperiment.org

    • Keyword search is used to find specifications of interest

  • Several workflow systems are storing provenance information

    • Module executions, input parameters, input/output data

    • “Input-only”


The Vision

  • “Workflow Provenance” repositories will store specifications as well as executions (i.e. provenance information)

    • Searchable

    • Queryable

    • Privacy protecting

  • Searching/querying these repositories can be used to

    • Find/reuse workflows

    • Understand meaning of a workflow

    • Correct/debug erroneous specifications

    • Find “bad” data and what its downstream effect might be
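
The last use case, finding the downstream effect of “bad” data, can be sketched as a traversal over stored provenance. The sketch assumes each module execution is recorded as a triple of (execution id, input data ids, output data ids); that record format and all identifiers below are hypothetical.

```python
from collections import defaultdict, deque

def downstream(derivations, bad_item):
    """Return every data item derived, directly or indirectly, from bad_item.

    derivations: iterable of (execution_id, input_ids, output_ids).
    """
    derived_from = defaultdict(set)
    for _execution, ins, outs in derivations:
        for item in ins:
            derived_from[item].update(outs)

    affected, queue = set(), deque([bad_item])
    while queue:
        item = queue.popleft()
        for nxt in derived_from[item]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

# Hypothetical provenance of one run.
derivations = [
    ("split#1", ["d1"], ["d2", "d3"]),
    ("align#1", ["d2"], ["d4"]),
    ("annotate#1", ["d3", "d5"], ["d6"]),
    ("tree#1", ["d4", "d6"], ["d7"]),
]
print(sorted(downstream(derivations, "d2")))
```

Reversing the direction of `derived_from` would give the opposite question: which items were used to produce a given output.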


“You are better off designing in security and privacy… from the start, rather than trying to add them later.”

(Shapiro, CACM 53:6, June 2010)


Privacy Concerns in Scientific Workflows

Data, module, structural


Example: Module Privacy

Patient record: gender, smoking habits, familial environment, blood pressure, blood test report, …

Module functionality should be kept secret

  • (From patient’s standpoint): output should not be guessed given input data values

  • (From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere.

[Figure: Split entries takes the patient record P: (X1, X2, X3, X4, X5) and passes (X1, X2, X3) and (X1, X2, X4, X5) on to the modules Check for Cancer and Check for Infectious disease. Their outputs, “P has cancer?” and “P has an infectious disease?”, go to Create Report, which produces the report. Example of a module’s logic: If X1 > 60 OR (X2 < 800 AND X5=1) AND ….]


Example: Structural Privacy

[Figure: modules M1 and M2 take a Protein and produce Protein + Functional annotation.]

  • M1 compares the entire protein against already annotated genomes

  • M2 compares domains of proteins (more precise but more time consuming)

Relationships between certain data/module pairs should be kept secret


Privacy concerns at a glance

  • Data Privacy

    • Data items are private

  • Module Privacy

    • Module functionality is private (x, f(x))

  • Structural Privacy

    • Execution paths between certain data are private

[Figure: the patient-record workflow. Split entries takes P: (X1, X2, X3, X4), covering smoking habits, blood pressure, blood test report, ……, and passes (X1, X2, X3) and (X1, X2, X4) to Check for infectious disease and Check for cancer; a DB also appears. The outputs “P has infectious disease?” and “P has cancer?” go to Create Report, which produces the report.]


Module Privacy (a hint)

  • A module f = a function

  • For every input x to f, f(x) value should not be revealed

    • Enough equivalent possible f(x) values w.r.t. visible information, according to the required privacy guarantee

  • There is a knife, a fork and a spoon in this figure

arxiv.org/abs/1005.5543
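
The hint above can be illustrated with a toy counting exercise. The sketch assumes a module is a function over small finite domains and that an observer sees only a chosen subset of input and output attributes. The counting rule used here is an assumption made for illustration: an output y stays possible for input x if some row agrees with x on the visible inputs and with y on the visible outputs. The precise definition and its guarantees are in the cited paper, and the function names below are hypothetical.

```python
from itertools import product

def candidate_outputs(f, inputs, in_names, out_names, out_domain, visible, x):
    """Outputs the observer cannot rule out for input x (assumed rule)."""
    vis_in = [i for i, n in enumerate(in_names) if n in visible]
    vis_out = [i for i, n in enumerate(out_names) if n in visible]
    candidates = set()
    for other in inputs:
        # Rows the observer cannot tell apart from x on visible inputs.
        if all(other[i] == x[i] for i in vis_in):
            shown = f(other)
            # Any output agreeing with that row on visible outputs is possible.
            for y in product(out_domain, repeat=len(out_names)):
                if all(y[i] == shown[i] for i in vis_out):
                    candidates.add(y)
    return candidates

def privacy_level(f, inputs, in_names, out_names, out_domain, visible):
    """Smallest number of candidate outputs over all inputs."""
    return min(
        len(candidate_outputs(f, inputs, in_names, out_names,
                              out_domain, visible, x))
        for x in inputs
    )

# Toy module: f(x1, x2) = (x1 AND x2, x1 OR x2) over Booleans.
in_names, out_names = ("x1", "x2"), ("y1", "y2")
inputs = list(product((0, 1), repeat=2))
f = lambda x: (x[0] & x[1], x[0] | x[1])

for visible in ({"x1", "x2", "y1", "y2"}, {"x1", "x2", "y1"}, {"x1", "y1"}):
    print(sorted(visible),
          privacy_level(f, inputs, in_names, out_names, (0, 1), visible))
```

Hiding more attributes can only keep or raise the level, so the practical question is which attributes to hide to reach the required level while still showing as much as possible.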


Search and Query


Workflow model, revisited

  • Vertex ≡ Module

    • Atomic or composite (subworkflow)

  • Edge ≡ Dataflow

[Figure: a workflow hierarchy over W1, W2, W3, and W4, next to the expanded workflow running from input I to output O. Modules: M1 Determine Genetic Susceptibility, M2 Evaluate Disorder Risk, M3 Expand SNP Set, M4 Consult External Databases, M5 Generate Database Queries, M6 Query OMIM, M7 Query PubMed, M8 Combine Disorder Sets; the contents of W3 are elided (….). Edge labels: lifestyle, family history, physical symptoms; SNPs, ethnicity; SNPs; query; disorders; prognosis.]


Access Views

  • Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see”

[Figure: the workflow hierarchy (W1, W2, W3, W4) with several prefixes marked, and the coarsest view W1: I, M1, M2, O.]
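
A minimal sketch of how a prefix of the hierarchy determines what a user sees. The encoding is an assumption read off the figures: M1 expands to W2 (M3, M4) and M4 expands to W4 (M5 to M8). W3 is left out because its contents are elided in the slides, so M2 is treated as atomic here. All dictionary and function names are hypothetical.

```python
# Hypothetical encoding of the hierarchy shown in the slides.
PARENT = {"W2": "W1", "W4": "W2"}       # subworkflow -> enclosing workflow
EXPANDS_TO = {"M1": "W2", "M4": "W4"}   # composite module -> its subworkflow
MODULES = {
    "W1": ["M1", "M2"],
    "W2": ["M3", "M4"],
    "W4": ["M5", "M6", "M7", "M8"],
}

def is_prefix(view):
    """A prefix holds the root W1 and the parent of each workflow in it."""
    return "W1" in view and all(
        PARENT.get(w) in view for w in view if w != "W1"
    )

def visible_modules(view, workflow="W1"):
    """Modules the user sees: expand a composite only if the view allows it."""
    if not is_prefix(view):
        raise ValueError("access view must be a prefix of the hierarchy")
    result = []
    for module in MODULES[workflow]:
        sub = EXPANDS_TO.get(module)
        if sub in view:
            result.extend(visible_modules(view, sub))
        else:
            result.append(module)
    return result

print(visible_modules({"W1"}))
print(visible_modules({"W1", "W2"}))
print(visible_modules({"W1", "W2", "W4"}))
```

A set such as {W1, W4} is rejected because W4 cannot be shown without first expanding W2, which is what makes prefixes a natural unit for access control.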


Search

  • Keyword search: “Database, disorder risk”

  • Result should be concise and informative

[Figure: the workflow hierarchy (W1, W2, W4) and the expanded workflow from I to O with modules M1 Determine Genetic Susceptibility, M2 Evaluate Disorder Risk, M3 Expand SNP Set, M4 Consult External Databases, M5 Generate Database Queries, M6 Query OMIM, M7 Query PubMed, and M8 Combine Disorder Sets.]

“WISE: a Workflow Information Search Engine” (ICDE 2009)
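
To make the search example concrete, here is a deliberately naive sketch: it splits the query on commas and reports which visible module names contain each phrase. This is plain substring matching, not the algorithm of the WISE system cited above. The module names come from the slides, while the function and its `visible` parameter are hypothetical.

```python
MODULE_NAMES = {
    "M1": "Determine Genetic Susceptibility",
    "M2": "Evaluate Disorder Risk",
    "M3": "Expand SNP Set",
    "M4": "Consult External Databases",
    "M5": "Generate Database Queries",
    "M6": "Query OMIM",
    "M7": "Query PubMed",
    "M8": "Combine Disorder Sets",
}

def keyword_search(query, visible=None):
    """Map each comma-separated phrase to the visible modules matching it."""
    phrases = [p.strip().lower() for p in query.split(",") if p.strip()]
    allowed = set(MODULE_NAMES) if visible is None else set(visible)
    return {
        phrase: sorted(
            m for m in allowed if phrase in MODULE_NAMES[m].lower()
        )
        for phrase in phrases
    }

print(keyword_search("Database, disorder risk"))
print(keyword_search("Database, disorder risk", visible={"M1", "M2"}))
```

With only M1 and M2 visible, “database” finds nothing, which shows why a privacy-aware result may have to answer at a coarser level than the full workflow.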


Result of “database, disorder risk”

[Figure: result view over W1, W2, W4, running from I to O through M3 Expand SNP Set, M5 Generate Database Queries, M6 Query OMIM, M7 Query PubMed, M8 Combine Disorder Sets, and M2 Evaluate Disorder Risk. Edge labels: lifestyle, family history, physical symptoms; SNPs, ethnicity; SNPs; query; disorders; prognosis.]


  • Result of “database, disorder risk”

[Figure: two coarser result views for the same query, labeled with W1 and W2. Between I and O they show M3 Expand SNP Set, M1 Determine Genetic Susceptibility, M4 Consult External Databases, and M2 Evaluate Disorder Risk. Edge labels: lifestyle, family history, physical symptoms; SNPs, ethnicity; SNPs; disorders; prognosis.]


Queries

  • “Workflows in which Expand SNP is performed before Query PubMed.”

  • “Data items that were used to produce d19.”

  • Like search, querying must match certain terms, give a view of the result, and preserve privacy.
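
The first query can be sketched over a repository of specifications. The sketch interprets “performed before” as reachability along dataflow edges, which is an assumption, as are the edge lists (one reading of the figure) and both workflow names.

```python
def performed_before(edges, first, second):
    """True if `second` is reachable from `first` along dataflow edges."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    stack, seen = [first], set()
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, []):
            if nxt == second:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Hypothetical repository: workflow name -> dataflow edges between modules.
repository = {
    "disease_susceptibility": [
        ("Expand SNP Set", "Generate Database Queries"),
        ("Generate Database Queries", "Query OMIM"),
        ("Generate Database Queries", "Query PubMed"),
        ("Query OMIM", "Combine Disorder Sets"),
        ("Query PubMed", "Combine Disorder Sets"),
        ("Combine Disorder Sets", "Evaluate Disorder Risk"),
    ],
    "literature_first": [
        ("Query PubMed", "Expand SNP Set"),
    ],
}

matches = [
    name for name, edges in repository.items()
    if performed_before(edges, "Expand SNP Set", "Query PubMed")
]
print(matches)
```

A privacy-preserving version would run the same check against the user's access view rather than the full specification, so that hidden modules cannot be discovered through query answers.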


Research Challenges

  • More ways of enabling scientists to construct workflows

  • Formalizing privacy notions

  • Provenance query language?

  • Efficiently identifying data that users can access

    • users may have different privileges, yielding many different access views.


Acknowledgments

Members of the PennDB group

Susan Davidson

Sanjeev Khanna

Sudeepa Roy

Julia Stoyanovich

Val Tannen

Friend of PennDB and Tel Aviv faculty

Tova Milo

PennDB alum and ASU faculty

Yi Chen

Bioinformatics collaborators

Sarah Cohen-Boulakia

This work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060

