Advances and Challenges for Scalable Provenance in Stream Processing Systems

Presentation Transcript


  1. Advances and Challenges for Scalable Provenance in Stream Processing Systems
     Archan Misra, Marion Blount, Anastasios Kementsietsidis, Daby Sow, Min Wang

  2. The Healthcare Crisis
     [Figure: three panels labeled 1979, 1991, and Today.]
     IPAW 2008

  3. The Century Project: Moving From Reactive to Proactive Healthcare…
     We are not just addressing a social problem… There are some interesting research challenges here!
     We are not just proposing a set of open problems… The first version of Century has already been deployed!
     [Architecture diagram: Stream-based Distributed Interoperable Health care Infrastructure (CENTURY). Recoverable labels: Weight, BP, ECG, and Glucose sensors; USN Hub; USN Gateway; Internet; IHE Adapter; Interoperability Container; GUI; CENTURY Analysis Framework; Event Preprocessor; Solution Delivery Services; Application Server; Subscription Service; Event Store; Query Service; Event Management Service; Provenance Query Service; Provenance Service; Group Management Service; Subscribe and Notify links; EHR System; Patient Medical Record; FA, AR, EP, PT, QRS, RR, SPE, GLA, GL, CHF, BPA, BP, WTA, WT; Service Data Management; Platform Service; Base Solutions (System S, DB2, WAS); Ubiquitous Sensor Network; Solutions (Applications).]

  4. An Example of Data Provenance Use
     The Setup:
     -- Dr. Lee prescribes medication to patient John Doe for his heart condition.
     -- Dr. Lee also prescribes a program to monitor the effect of the drug on Mr. Doe.
     -- A few days later, Dr. Lee receives an alert from the Century prescription program.
     Urgent Alert
       Patient: Doe, John
       Condition: Abnormal Reaction
       Recommendation: Century has detected an issue in your patient's medical condition, which is deteriorating. A known side effect has emerged. Century recommends that you decrease the patient's dosage of the prescribed medication to 10 mg twice a day.
     Dr. Lee: "That's unusual!!! Before agreeing to this change, I need to understand on what basis the system has made this recommendation."
     Ultimately, the physician is responsible for medical decisions/actions. In order for medical professionals to accept the upcoming technology, we must provide them with the information they need to make these decisions responsibly. Data provenance provides the foundation for this understanding.

  5. The Underlying Technology
     Example provenance rules, by dependency type:
     • Sequence dependency: O23(t) ← I8(i, i-256)
     • Hybrid time-value dependency: O45(t) ← I96(t){(spo2<89)}, I97(t,t-1){(systolic>130)}, I67(t,t-1){(weightDelta>5%)}
     • Time dependency: O33(t) ← I15(t, t-1)
     [Diagram: processing graph. Labels: FA, AR, EP, PT, QRS, RR, SPE, SPA, SP, AP, BPA, BP, WTA, WT, WB, connected through numbered input (I) and output (O) ports and ending in Arrhythmia, Angina Pectoris, and Well-Being alerts.]
     Stream Persistence Challenge: Provenance (and regulatory requirements) require that data streams are persisted. Can we sustain such a high insertion throughput in today's DBs?
     Data Provenance Granularity Challenge: Unlike traditional data provenance settings, data provenance is no longer limited to a tuple-based granularity. Indeed, granularity is very much dynamic and depends on the particular analytics. How can we handle varying provenance granularities?
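The time and hybrid time-value dependencies can be illustrated with a small lookup. This is only a sketch under stated assumptions: the `Sample` type, the `backward_provenance` helper, the integer timestamps, and the sample readings are all hypothetical and are not taken from Century. It shows how a rule such as I97(t,t-1){(systolic>130)} might select the inputs behind one output.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Sample:
    t: int         # timestamp; the unit is an assumption (e.g. seconds)
    value: float   # hypothetical systolic pressure reading


def backward_provenance(
    stream: list[Sample],
    t: int,
    lookback: int,
    predicate: Optional[Callable[[Sample], bool]] = None,
) -> list[Sample]:
    """Return the input samples that an output at time t depends on."""
    # Time dependency: keep samples inside the window [t - lookback, t].
    window = [s for s in stream if t - lookback <= s.t <= t]
    # Hybrid time-value dependency: additionally filter by a value predicate.
    if predicate is not None:
        window = [s for s in window if predicate(s)]
    return window


# Made-up readings, for illustration only.
systolic = [Sample(8, 128.0), Sample(9, 125.0), Sample(10, 141.0), Sample(11, 120.0)]

time_dep = backward_provenance(systolic, t=10, lookback=1)
hybrid_dep = backward_provenance(
    systolic, t=10, lookback=1, predicate=lambda s: s.value > 130
)
print([s.t for s in time_dep], [s.t for s in hybrid_dep])
```

The same window, with and without the predicate, yields different provenance sets, which is the sense in which granularity depends on the analytics.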

  6. This is a general trend!
     [Figure: Persisting ECG data streams [DEBS07]. Axes: Time, Data Size.]
     This is a general problem:
     -- Not specific to provenance!
     -- Not specific to an app domain!

  7. The CMIR Framework
     The main idea of the framework is the introduction of virtual PEs to reduce the storage load imposed by provenance support.
     [Diagram: processing elements PE1 to PE6 with input ports (I11, I21, I31, I41, I51, I61) and output ports (O11, O21, O31, O41, O51, O61), plus two virtual PEs, PEV1 and PEV2.]
     Rules shown in the diagram:
     • O21 ← I11(t-15, t)
     • O41 ← I31{(i,i-17,order=1), (systolic>130,order=2)}
     Implementing such a framework has its own challenges!

  8. Composing Provenance Rules
     Since a provenance rule is associated with each individual PE, the provenance rule for a virtual PE must be automatically generated. This assumes that provenance rules are composable.
     [Diagram: PEs EP and QRS chained through ports I2, O13, I10, and O25.]
     Example 1 (time-based rules; O13 feeds I10):
       R1: O13(t) ← I2(t, t-10)
       R2: O25(t) ← I10(t, t-32)
       RC: O25(t) ← I2(t, t-42)
     Example 2 (a value-based rule downstream):
       R1: O13(t) ← I2(t, t-10)
       R2: O25 ← I10{(systolic>130,order=1), (i,i-5,order=2)}
       RC: ????
     Composing the two formulas requires that we can determine statically how far back in time, at runtime, we have to go to retrieve five values with a systolic pressure above 130.
     We need to develop a rule language whose primitives are composable.
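The first composition (10 + 32 = 42) can be reproduced with a few lines of code. The sketch below assumes purely time-based rules of the form O(t) ← I(t, t-k); the `TimeRule` type and the `compose` function are hypothetical names introduced for illustration and are not part of the rule language described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRule:
    output: str    # output port name, e.g. "O25"
    input: str     # input port name, e.g. "I10"
    lookback: int  # output at t depends on the input over [t - lookback, t]


def compose(upstream: TimeRule, downstream: TimeRule) -> TimeRule:
    """Compose two chained time-window rules (upstream output feeds downstream input).

    The downstream output at t needs upstream outputs over [t - d, t], and each
    of those needs upstream inputs over a further window of length u, so the
    composite window is [t - (u + d), t].
    """
    return TimeRule(
        output=downstream.output,
        input=upstream.input,
        lookback=upstream.lookback + downstream.lookback,
    )


r1 = TimeRule(output="O13", input="I2", lookback=10)   # R1: O13(t) <- I2(t, t-10)
r2 = TimeRule(output="O25", input="I10", lookback=32)  # R2: O25(t) <- I10(t, t-32)
rc = compose(r1, r2)                                   # RC: O25(t) <- I2(t, t-42)
print(rc)
```

No such closed form exists for the value-based rule in the second example, because the length of its window depends on the data seen at runtime.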

  9. Inverting Provenance Rules
     While backward provenance is the most common, in a number of settings, forward provenance turns out to be equally useful.
     [Diagram: QRS PE with input port I10 and output port O25.]
     Backward rules:
       R1: O25(t) ← I10(t, t-32)
       R2: O25(t) ← I10(systolic>130)
       R3: O25 ← I10{(t, t-10,order=1), (systolic>130,order=2)}
     Inverted (forward) rules:
       R1C: I10(t) → O25(t, t+32)
       R2C: I10(t) → ????
            I10(t) → Ø, if systolic<130 at t; O25(t, -), otherwise
       R3C: I10(t) → O25(t, t+10)
     Not supported by our current model.
     This is approximate inversion but it might suffice in practice…
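For the purely time-based rule R1, the inversion R1C can be checked by brute force. The sketch below assumes integer timestamps and closed windows; `backward_window` and `forward_window` are hypothetical helper names, not functions from Century.

```python
def backward_window(t_out: int, lookback: int) -> range:
    """Input timestamps the output at t_out depends on: [t_out - lookback, t_out]."""
    return range(t_out - lookback, t_out + 1)


def forward_window(t_in: int, lookback: int) -> range:
    """Output timestamps influenced by the input at t_in: [t_in, t_in + lookback]."""
    return range(t_in, t_in + lookback + 1)


LOOKBACK = 32  # from R1: O25(t) <- I10(t, t-32)
t_in = 100     # arbitrary input timestamp chosen for the check

# Brute force: every output time whose backward window contains t_in.
brute_force = [
    t_out for t_out in range(0, 300) if t_in in backward_window(t_out, LOOKBACK)
]
inverse_matches = list(forward_window(t_in, LOOKBACK)) == brute_force
print(inverse_matches)
```

A value-based rule such as R2 has no fixed forward window of this kind, since whether an input contributes depends on its value.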

  10. Persisting Processing Element (PE) State
     The processing performed by a PE, and hence its output, does not depend only on its input(s)!! PEs can be, and often are, stateful.
     Strategies for persisting state:
     • As a custom byte stream
       • Each PE developer decides what/how to persist.
       • No sharing of code/design across PE developers.
     • As a predefined data structure
       • Code/structure sharing across PE developers.
       • It must be generic enough to accommodate diverse needs across PEs.
     • As a (full-fledged) database
       • Well understood technology.
       • Each PE developer decides what to persist.
       • Generic enough to accommodate diverse needs across PEs.
       • Might be overkill when the state information is simple.
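The second strategy, a predefined data structure, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `PEState` type, its field names, the JSON encoding, and the sample values are hypothetical and do not describe how Century persists state.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PEState:
    """A generic, shared snapshot format that any PE could fill in."""
    pe_id: str       # which processing element the state belongs to
    timestamp: int   # stream time at which the snapshot was taken
    values: dict = field(default_factory=dict)  # PE-specific state, JSON-compatible

    def to_json(self) -> str:
        # sort_keys gives a stable encoding, which helps when comparing snapshots
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "PEState":
        return cls(**json.loads(payload))


# Hypothetical state for a QRS analytic; the keys and numbers are made up.
snapshot = PEState("QRS", 1000, {"last_rr_interval": 0.82, "beats_seen": 17})
restored = PEState.from_json(snapshot.to_json())
print(restored == snapshot)
```

The trade-off named on the slide is visible here: the serialization code is shared, but the `values` dictionary has to stay generic enough for every PE.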

  11. Facing Varying Granularities
     The granularity at which a consumer PE ingests streaming data might differ from the granularity at which a producer PE generates them.
     [Diagram labels: Raw Signal, Producer 1, Producer 2, Consumer 1, Consumer 2, QRS, SPE; ports I2, I3, O13, O14, I10, I11, O25, O26.]
       R1: O25(t) ← I10(i, i-10)
       R2: O26(t) ← I11(i, i-7)
     The QRS analytics produces an alert based on the last 10 seconds of ECG signal.
     This problem is NOT Century-specific! See the Xstream system [ICDE08].

  12. Dealing With Varying Granularities
     The same rule language can be used to smooth out the differences in granularities between consumer and producer PEs.
     [Diagram labels: Raw Signal, Producer 1, Producer 2, Consumer 1, Consumer 2, QRS, SPE; ports I2, I3, O13, O14, I10, I11, O25, O26.]
       R1: O25(t) ← I10(i, i-10)
       R3: I10(t) ← O13(t)
       R4: I11(t) ← O14(t, t-4)
       R2: O26(t) ← I11(i, i-3)

  13. Related Work
     [Chart axes: Type of Provenance Reconstruction (Data, Process) versus Data and Processing Rates.]
     • Medical Stream Provenance (Century)
       • Arbitrary time-varying lineage
       • Dynamic stream bindings
     • Database View Inversion (Trio, Widom)
       • Specific transforms and extensions to SQL
       • Data dependencies specified statically for relations
     • Scientific SOA Workflows (KARMA)
       • Workflow and application binding notifications
       • Stateless application-defined data provenance
     • EU Provenance (PASOA, PreServ)
       • Captures workflow bindings
     • Stream Provenance (Simhan)
       • Store temporal history of stream connectors as a per-stream stack
       • System-level automatic metadata collection at stream-level
     • File Systems: PASS, LFS
       • Captures system calls and modifications to file records
       • Annotation per file

  14. Conclusions
     • Data provenance is a key component of the Century project.
     • The first version of Century has already been deployed and fully supports data provenance queries in an environment with high processing and data rates.
     • As part of our experience with this first version, we have identified a set of research challenges we must address to move Century forward:
       • Stream Persistence Challenge: Not all provenance data can be persisted. Viable solutions must provide us with a way to replay some of the computation and effectively re-compute, at query time, the necessary data. To achieve this goal:
         • The state of processing elements must also be persisted.
         • The provenance model must support composition of provenance rules.
         • The provenance model must support inversion of provenance rules.
       • Data Provenance Granularity Challenge: The provenance model must account for the mismatch between the consumers and producers of data.
     • We are currently working towards addressing these (and other) issues.
