Quality Aware Sensor Database (QUASAR) Project**
Sharad Mehrotra
Department of Information and Computer Science
University of California, Irvine
**Supported in part by a collaborative NSF ITR grant entitled “real-time data capture, analysis, and querying of dynamic spatio-temporal events”, in collaboration with UCLA, U. Maryland, and U. Chicago

Talk Outline
• Quasar Project
  • motivation and background
  • data collection and archival components
  • query processing
  • tracking application using QUASAR framework
  • challenges and ongoing work
• Brief overview of other research projects
  • MARS Project - incorporating similarity retrieval and refinement over structured and semi-structured data to aid interactive data analysis/mining
  • Database as a Service (DAS) Project - supporting the application service provider model for data management

Emerging Computing Infrastructure…
• Generational advances to computing infrastructure
  • sensors will be everywhere
• Emerging applications with limitless possibilities
  • real-time monitoring and control, analysis
• New challenges
  • limited bandwidth & energy
  • highly dynamic systems
• System architectures are due for an overhaul
  • at all levels of the system: OS, middleware, databases, applications
Figure labels: in-body, in-cell, in-vitro spaces; instrumented wide-area spaces

Impact to Data Management…
• Traditional data management
  • client-server architecture
  • efficient approaches to data storage & querying
  • query shipping versus data shipping
  • data changes with explicit update
• Emerging challenge
  • data producers must be considered as “first class” entities
  • sensors generate continuously changing, highly dynamic data
  • sensors may store, process, and communicate data
Figure: clients send data/query requests to the server and receive data/query results; data producers feed the server

Data Management Architecture Issues
• Where to store data?
  • Do not store -- stream model
    • not suitable if we wish to archive data for future analysis or if data is too important to lose
  • at the producers
    • limited storage, network, compute resources
  • at the servers
    • server may not be able to cope with high data production rates, which may lead to data staleness and/or wasted resources
• Where to compute?
  • at the client, server, or data producers
Figure: client, server, and data producers (with producer cache), linked by data/query requests and data/query results

Quasar Architecture
• Hierarchical architecture
  • data flows from producers to server to clients periodically
  • queries flow the other way:
    • if the client cache does not suffice, the query is routed to the appropriate server
    • if the server cache does not suffice, the current data is accessed at the producer
• This is a logical architecture -- producers could also be clients.
Figure: client (client cache), server (server cache & archive), producer (producer cache); query flow runs from client to producer, data flow from producer to client

Quasar: Observations & Approach
• Applications can tolerate errors in sensor data
  • applications may not require exact answers:
    • small errors in location during tracking, or an error in the answer to a query, may be OK
  • data cannot be precise due to measurement errors, transmission delays, etc.
• Communication is the dominant cost
  • limited wireless bandwidth, source of major energy drain
• Quasar approach
  • exploit application error tolerance to reduce communication between producer and server
  • Two approaches
    • minimize resource usage given quality constraints
    • maximize quality given resource constraints

Quality-based Data Collection Problem
• Let P = <p[1], p[2], …, p[n]> be a sequence of environmental measurements (time series) generated by the producer, where n = now
• Let S = <s[1], s[2], …, s[n]> be the server-side representation of the sequence
• A within-ε quality data collection protocol guarantees that, for all i, error(p[i], s[i]) < ε
• ε is derived from the application quality tolerance
Figure: sensor time series …, p[n], p[n-1], …, p[1]

Simple Data Collection Protocol
• Sensor logic (at time step n)
    Let p’ = last value sent to server
    if error(p[n], p’) > ε
        send p[n] to server
• Server logic (at time step n)
    If new update p[n] received at step n
        s[n] = p[n]
    Else
        s[n] = last update sent by sensor
• Guarantees that the maximum error at the server is less than or equal to ε
Figure: sensor time series …, p[n], p[n-1], …, p[1]
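The sensor and server logic above can be simulated in a few lines. This is a minimal sketch, assuming scalar readings and error(a, b) = |a - b| (the slides leave the error function open); the function name `collect` and the sample data are hypothetical.

```python
def collect(readings, eps):
    """Simulate the simple protocol; return (server_series, messages_sent)."""
    last_sent = None   # p': last value the sensor sent to the server
    server = []        # s[1..n] as seen by the server
    sent = 0
    for p in readings:
        # Sensor side: transmit only when the last sent value is off by more than eps.
        if last_sent is None or abs(p - last_sent) > eps:
            last_sent = p
            sent += 1
        # Server side: use the new update if one arrived, else the last update.
        server.append(last_sent)
    return server, sent


readings = [10.0, 10.4, 11.2, 11.5, 13.0]
server, sent = collect(readings, eps=1.0)
print(server)  # [10.0, 10.0, 11.2, 11.2, 13.0]
print(sent)    # 3
```

Three of the five readings are transmitted, and every server value stays within ε = 1.0 of the true reading.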

Exploiting Prediction Models
• Producer and server agree upon a prediction model (M, θ)
• Let spred[i] be the predicted value at time i based on (M, θ)
• Sensor logic (at time step n)
    if error(p[n], spred[n]) > ε
        send p[n] to server
• Server logic (at time step n)
    If new update p[n] received at step n
        s[n] = p[n]
    Else
        s[n] = spred[n] based on model (M, θ)
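A sketch of the prediction-based variant, under assumptions the slides do not fix: the model M is linear, its single parameter θ is a constant slope per time step, and the prediction is re-anchored at each transmitted value. The name `collect_with_model` is hypothetical.

```python
def collect_with_model(readings, eps, slope):
    """Simulate prediction-based collection with an assumed linear model."""
    anchor_value = None   # last value sent to the server
    anchor_step = None    # time step at which it was sent
    server, sent = [], 0
    for n, p in enumerate(readings):
        if anchor_value is None:
            predicted = None
        else:
            # Both sides compute the same prediction from the shared model.
            predicted = anchor_value + slope * (n - anchor_step)
        if predicted is None or abs(p - predicted) > eps:
            # Sensor sends p[n]; the server stores it and re-anchors the model.
            anchor_value, anchor_step = p, n
            sent += 1
            server.append(p)
        else:
            # No message: the server falls back on the prediction.
            server.append(predicted)
    return server, sent


readings = [0.0, 1.1, 1.9, 3.2, 6.0]
server, sent = collect_with_model(readings, eps=0.5, slope=1.0)
print(server)  # [0.0, 1.0, 2.0, 3.0, 6.0]
print(sent)    # 2
```

On this steadily rising series, only the first reading and the final jump are transmitted; without a model, every reading would differ from the last sent value by more than ε = 0.5.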

Challenges in Prediction
• Simple versus complex models?
  • Complex and more accurate models require more parameters (that will need to be transmitted).
  • The goal is to minimize communication, not necessarily to get the best prediction
• How is a model M generated?
  • static -- one out of a fixed set of models
  • dynamic -- dynamically learn a model from data
• When should a model M or parameters θ be changed?
  • immediately on model violation:
    • too aggressive -- the violation may be a temporary phenomenon
  • never changed:
    • too conservative -- data rarely follows a single model

Challenges in Prediction (cont.)
• Who does the model update?
  • Server
    • long-haul prediction models are possible, since the server maintains history
    • might not predict recent behavior well, since the server does not know the exact P sequence; it has only samples
    • extra communication to inform the producer
  • Producer
    • better knowledge of recent history
    • long-haul models are not feasible, since the producer does not have history
    • producers share the computation load
  • Both
    • server looks for new models; sensor performs parameter fitting given existing models

Archiving Sensor Data
• Often sensor-based applications are built with only the real-time utility of time series data in mind.
  • Values at time instants << n are discarded.
• Archiving such data consists of maintaining the entire S sequence, or an approximation thereof.
• Importance of archiving:
  • discovering large-scale patterns
  • once-only phenomena, e.g., earthquakes
  • discovering “events” detected post facto by “rewinding” the time series
  • future usage of data that may not be known while it is being collected

Problem Formulation
• Let P = <p[1], p[2], …, p[n]> be the sensor time series
• Let S = <s[1], s[2], …, s[n]> be the server-side representation
• A within-ε_archive quality data archival protocol guarantees that error(p[i], s[i]) < ε_archive
• Trivial solution: modify the collection protocol to collect data at a quality guarantee of min(ε_archive, ε_collect)
  • then the prediction model by itself will provide an ε_archive-quality data stream that can be archived
• Better solutions are possible since
  • archived data is not needed for immediate access by real-time or forecasting applications (such as monitoring, tracking)
  • compression can be used to reduce data transfer

Data Archival Protocol
• The sensor compresses the observed time series p[1:n] and sends a lossy compression to the server
• At time n:
  • p[1:n-nlag] is at the server in compressed form s’[1:n-nlag], within-ε_archive
  • s[n-nlag+1:n] is estimated via a predictive model (M, θ)
    • the collection protocol guarantees that this remains within-ε_collect
  • s[n+1:] can be predicted, but its quality is not guaranteed (it is in the future, so the sensor has not observed these values)
• Processing at the sensor is exploited to reduce communication cost and hence battery drain
Figure: sensor memory buffer holding …, p[n], p[n-1], …; the sensor sends updates for data collection and a compressed representation for archiving

Piecewise Constant Approximation (PCA)
• Given a time series S_n = s[1:n], a piecewise constant approximation of it is a sequence PCA(S_n) = <(c_i, e_i)> that allows us to estimate s[j] as:
    ŝ[j] = c_i   if j ∈ [e_{i-1}+1, e_i]
    ŝ[j] = c_1   if j < e_1
Figure: Value versus Time, with constant levels c1 to c4 over segments ending at e1 to e4
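As an illustration of how an estimate is read back from a PCA, the sketch below looks up the segment covering time step j with a binary search over the segment end points. The function name `pca_value` and the sample segments are hypothetical; time steps are 1-based, as on the slide.

```python
from bisect import bisect_left


def pca_value(segments, j):
    """Estimate s[j] from a PCA given as [(c_1, e_1), (c_2, e_2), ...].

    Segment i covers time steps e_{i-1}+1 .. e_i (with e_0 = 0).
    """
    ends = [e for _, e in segments]
    i = bisect_left(ends, j)          # first segment whose end point is >= j
    if i == len(segments):
        raise IndexError("time step lies beyond the last segment")
    return segments[i][0]


segments = [(2.0, 3), (5.0, 5), (1.0, 9)]
print([pca_value(segments, j) for j in range(1, 10)])
# [2.0, 2.0, 2.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]
```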

Online Compression using PCA
• Goal: given a stream of sensor values, generate a within-ε_archive PCA representation of the time series
• Approach (PMC-midrange)
  • Maintain m, M as the minimum/maximum values of observed samples since the last segment
  • On processing p[n], update m and M if needed
  • If M - m > 2ε_archive, output a segment ((m+M)/2, n)
Figure: example with ε_archive = 1.5 (Value versus Time, time steps 1 to 5)
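A minimal sketch of the midrange approach, assuming scalar samples. One detail is an interpretation rather than something the slide spells out: when p[n] would push M - m above 2ε, this sketch closes the segment at step n-1 using the minimum and maximum recorded before p[n], then starts a new segment at p[n], so that every closed segment stays within ε of its samples. The function name `pmc_midrange` and the sample values are hypothetical.

```python
def pmc_midrange(samples, eps):
    """Compress samples into [(midrange, end_step), ...] with error <= eps."""
    segments = []
    lo = hi = None    # m and M: min/max since the last segment was closed
    n = 0
    for n, p in enumerate(samples, start=1):
        if lo is None:
            lo = hi = p
            continue
        new_lo, new_hi = min(lo, p), max(hi, p)
        if new_hi - new_lo > 2 * eps:
            # p would break the bound: close the segment before it.
            segments.append(((lo + hi) / 2, n - 1))
            lo = hi = p
        else:
            lo, hi = new_lo, new_hi
    if lo is not None:
        segments.append(((lo + hi) / 2, n))   # flush the open segment
    return segments


print(pmc_midrange([2.0, 3.0, 4.0, 6.0, 6.5, 2.5], eps=1.5))
# [(3.0, 3), (6.25, 5), (2.5, 6)]
```

Only the running minimum and maximum are kept between samples, which matches the constant-storage property claimed for PMC-MR.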

Online Compression using PCA
• PMC-MR …
  • guarantees that each segment compresses the corresponding time series segment to within-ε_archive
  • requires O(1) storage
  • is instance optimal
    • no other PCA representation with fewer segments can meet the within-ε_archive constraint
• Variant of PMC-MR
  • PMC-MEAN, which takes the mean of the samples seen thus far instead of the midrange

Improving PMC using Prediction
• Observation: prediction models guarantee a within-ε_collect version of the time series at the server even before the compressed time series arrives from the producer.
• Can the prediction model be exploited to reduce the overhead of compression?
• If ε_archive > ε_collect, no additional effort is required for archival --> simply archive the predicted model.
• Approach:
  • Define an error time series E[i] = p[i] - spred[i]
  • Compress E[1:n] to within-ε_archive instead of compressing p[1:n]
  • The archive contains the prediction parameters and the compressed error time series
  • The within-ε_archive version of E[i], together with (M, θ), can be used to reconstruct a within-ε_archive version of p
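The reconstruction claim in the last bullet follows in one line. The derivation below assumes the error function is the absolute difference and that the archive regenerates the same predictions spred[i] the sensor used; the symbols \hat{E}[i] (compressed error series) and \hat{p}[i] (reconstructed value) are introduced here for illustration only.

```latex
\hat{p}[i] = s_{\mathrm{pred}}[i] + \hat{E}[i]
\quad\Longrightarrow\quad
\bigl|p[i] - \hat{p}[i]\bigr|
  = \bigl|\bigl(s_{\mathrm{pred}}[i] + E[i]\bigr) - \bigl(s_{\mathrm{pred}}[i] + \hat{E}[i]\bigr)\bigr|
  = \bigl|E[i] - \hat{E}[i]\bigr|
  \le \epsilon_{\mathrm{archive}}
```

The prediction terms cancel, so the bound on the compressed error series carries over unchanged to the reconstructed series.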

Combining Compression and Prediction (Example)
Figure, top: actual time series and its compressed time series (7 segments)
Figure, bottom: actual and predicted time series; Error = Actual - Predicted, and the compressed error (2 segments)

Estimating Time Series Values
• Historical samples (before n-nlag) are maintained at the server within-ε_archive
• Recent samples (between n-nlag+1 and n) are maintained by the sensor and predicted at the server
• If an application requires ε_q precision, then:
  • if ε_q ≥ ε_collect, it must wait for τ time in case a parameter refresh is en route
  • if ε_q ≥ ε_archive but ε_q < ε_collect, it may probe the sensor or wait for a compressed segment
  • otherwise, only probing meets the precision
• For future samples (after n), immediate probing is not available as an option

Experiments
• Data sets:
  • Synthetic random walk
    • x[1] = 0 and x[i] = x[i-1] + s_n, where s_n is drawn uniformly from [-1, 1]
  • Oceanographic buoy data
    • environmental attributes (temperature, salinity, wind speed, etc.) sampled at 10-minute intervals from a buoy in the Pacific Ocean (Tropical Atmosphere Ocean Project, Pacific Marine Environmental Laboratory)
  • GPS data collected using iPAQs
• Experiments to test:
  • compression performance of PMC
  • benefits of model selection
  • query accuracy over compressed data
  • benefits of the prediction/compression combination

Compression Performance
Figure caption: K/n ratio = number of segments / number of samples

Query Performance Over Compressed Data
Figure caption: query “How many sensors have values > v?” (mean selectivity = 50)

Impact of Model Selection
• Objects moved at approximately constant speed (+ measurement noise)
• Three models used:
  • loc[n] = c
  • loc[n] = c + vt
  • loc[n] = c + vt + 0.5at²
• Parameters v, a were estimated at the sensor over a moving window of 5 samples
Figure caption: K/n ratio = number of segments / number of samples; ε_pred is the localization tolerance in meters

Combining Prediction with Compression
Figure caption: K/n ratio = number of segments / number of samples

GPS Mobility Data from Mobile Clients (iPAQs)
• QUASAR client time series: latitude time series, 1800 samples
• Compressed time series (PMC-MR, ICDE 2003): 130 segments, accuracy of ~100 m

Query Processing in Quasar
• Problem definition
  • Given
    • sensor time series with quality guarantees captured at the server
    • a query with a specified quality tolerance
  • Return
    • query results incurring least cost
• Techniques depend upon
  • nature of queries
  • cost measures
    • resource consumption -- energy, communication, I/O
    • query response time

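Aggregate Queries
Figure: sensor set S with values 9, 6, 3, 8, 2, 7; the query region Q covers the three sensors with values 2, 7, and 6
• minQ = 2
• maxQ = 7
• countQ = 3
• sumQ = 2+7+6 = 15
• avgQ = 15/3 = 5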

Processing Aggregate Queries (minimize producer probes)
• Let S = <s1, s2, …, sn> be the set of sensors that meet the query criteria
    s_j.high = s_j^pred[t] + ε_j^pred
    s_j.low  = s_j^pred[t] - ε_j^pred
• MIN query
  • c = min_j(s_j.high)
  • b = c - ε_query
  • probe all sensors where s_j.low < b
  • in the figure, only s1 and s3 will be probed
• SUM query
  • select a minimal subset S’ ⊆ S such that Σ_{s_i ∈ S’} ε_i^pred ≥ Σ_{s_i ∈ S} ε_i^pred - ε_query
  • if ε_query = 15, only s1 will be probed
Figures: (MIN) intervals for s1 to sn against the levels a, b, c; (SUM) tolerances 10 (s1), 2 (s2), 5 (s3), and 3 and 5 for s4 and s5
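Both probe-selection rules can be sketched directly from the definitions above. The function names are hypothetical, the MIN bounds are made-up numbers, and the SUM tolerances follow the figure (with the assignment of 3 and 5 to s4 and s5 assumed). For SUM, picking sensors in decreasing order of tolerance is one simple way to obtain a subset of minimum size; the slides do not say which selection method is used.

```python
def probes_for_min(bounds, eps_query):
    """bounds maps sensor name -> (low, high). Return the sensors to probe."""
    c = min(high for _, high in bounds.values())   # the minimum is at most c
    b = c - eps_query
    return sorted(name for name, (low, _) in bounds.items() if low < b)


def probes_for_sum(tolerances, eps_query):
    """tolerances maps sensor name -> eps_pred. Return a smallest probe set."""
    needed = sum(tolerances.values()) - eps_query  # uncertainty to remove
    chosen, covered = [], 0
    by_size = sorted(tolerances.items(), key=lambda kv: kv[1], reverse=True)
    for name, eps in by_size:
        if covered >= needed:
            break
        chosen.append(name)
        covered += eps
    return chosen


# Hypothetical bounds: s1 in [1, 5], s2 in [6, 9], s3 in [2, 8].
print(probes_for_min({"s1": (1, 5), "s2": (6, 9), "s3": (2, 8)}, eps_query=2))
# ['s1', 's3']

# Tolerances from the figure; total 25, so probes must remove at least 10.
print(probes_for_sum({"s1": 10, "s2": 2, "s3": 5, "s4": 3, "s5": 5}, eps_query=15))
# ['s1']
```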

Minimizing Cost at Server
• The error tolerance of queries can be exploited to reduce processing at the server.
• Key idea
  • Use a multi-resolution index structure (MRA-tree) for processing aggregate queries at the server.
  • An MRA-tree is a modified multi-dimensional index tree (R-tree, quadtree, hybrid tree, etc.)
    • a non-leaf node contains (for each of its subtrees) four aggregates {MIN, MAX, COUNT, SUM}
    • a leaf node contains the actual data points (sensor models)

MRA Tree Data Structure
Figure: spatial view and tree-structure view of sensors S1 to S8 grouped under tree nodes A to G

MRA-Tree Node Structure
• Non-leaf node: stores min, max, count, sum for each subtree, plus disk page pointers (each costs 1 I/O)
• Leaf node: stores the sensor models (M1, M2, M3 in the figure), plus probe “pointers” (each costs 2 messages)

Node Classification
Figure: the four relations between query Q and node N: Q is contained in N, Q contains N, Q partially overlaps N, Q and N are disjoint
• Two sets of nodes:
  • NP (partial contribution to the query)
  • NC (complete contribution)

Aggregate Queries using MRA Tree
• Initialize NP with the root
• At each iteration: remove one node N from NP and, for each child Nchild of N,
  • discard it, if Nchild is disjoint with Q
  • insert it into NP, if Q is contained in or partially overlaps with Nchild
  • “insert” it into NC, if Q contains Nchild (we only need to maintain aggNC)
• Compute the best estimate based on the contents of NP and NC

MIN (and MAX) Traversal
• Choose N ∈ NP such that min_N = min_NP
• Interval (figure example with NC minima 4, 5 and NP minima 3, 9):
  • min_NC = min{4, 5} = 4
  • min_NP = min{3, 9} = 3
  • L = min{min_NC, min_NP} = 3
  • H = min_NC = 4
  • hence I = [3, 4]
• Estimate: lower bound, E(minQ) = L = 3
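The traversal and the MIN interval can be sketched together. This is a simplified illustration, not the MRA-tree implementation: nodes are bounded by 1-D intervals instead of multi-dimensional boxes, only the MIN aggregate is stored, and data points are modelled as degenerate nodes with lo == hi. All names are hypothetical, and the example tree is built to reproduce the numbers in the slide's example.

```python
from dataclasses import dataclass, field
from math import inf


@dataclass
class Node:
    lo: float        # bounding interval [lo, hi]
    hi: float
    agg_min: float   # MIN aggregate over the subtree
    children: list = field(default_factory=list)


def classify(node, q_lo, q_hi):
    if node.hi < q_lo or node.lo > q_hi:
        return "disjoint"
    if q_lo <= node.lo and node.hi <= q_hi:
        return "complete"            # Q contains N
    return "partial"                 # Q inside N, or partial overlap


def min_interval(root, q_lo, q_hi, max_steps):
    """Return [L, H] bounding min over Q after at most max_steps expansions."""
    np_nodes, nc_min = [root], inf
    for _ in range(max_steps):
        if not np_nodes:
            break
        # Expand the partial node with the smallest MIN aggregate.
        k = min(range(len(np_nodes)), key=lambda i: np_nodes[i].agg_min)
        node = np_nodes.pop(k)
        for child in node.children:
            kind = classify(child, q_lo, q_hi)
            if kind == "complete":
                nc_min = min(nc_min, child.agg_min)
            elif kind == "partial":
                np_nodes.append(child)
    np_min = min((n.agg_min for n in np_nodes), default=inf)
    return min(nc_min, np_min), nc_min


def point(x, v):
    return Node(x, x, v)


a = Node(0, 3, 3, [point(1, 3), point(2.5, 7)])
b = Node(3.5, 5, 4)      # subtree omitted: it lies wholly inside Q
c = Node(5.5, 7, 5)      # subtree omitted: it lies wholly inside Q
d = Node(7.5, 10, 9, [point(7.8, 9), point(9.5, 12)])
root = Node(0, 10, 3, [a, b, c, d])

print(min_interval(root, 2, 8, max_steps=1))    # (3, 4)
print(min_interval(root, 2, 8, max_steps=10))   # (4, 4)
```

After one expansion the interval is [3, 4], as on the slide; once NP is exhausted, it collapses to the exact answer.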

MRA Tree Traversal
• Progressive answer refinement until NP is exhausted
• Greedy priority-based local decision for the next node to be explored, based on:
  • cost (1 I/O or 2 messages)
  • benefit (expected reduction in answer uncertainty)
Figure: sensors S1 to S8 grouped under tree nodes A to G

Adaptive Tracking of Mobile Objects in Sensor Networks
• Tracking architecture
  • a network of wireless acoustic sensors arranged as a grid, transmitting via a base station to the server
  • a track of the mobile object is generated at the base station or server
• Objective
  • track a mobile object at the server such that the track deviates from the real trajectory by no more than a user-defined error threshold ε_track, with minimum communication overhead
Figure: wireless sensor grid connected over wireless links to base stations 1 to 3 and the server; the track visualization answers the request “Show me the approximate track of the object with precision ε”

Sensor Model
• Wireless sensors: battery operated, energy constrained
• Operate on the received acoustic waveforms
• Signal attenuation of the target object is given by: I_i(t) = P / (4π r²)
  • P: source object power
  • r: distance of the object from the sensor
  • I_i(t): intensity reading at time t at the ith sensor
  • I_th: intensity threshold at the ith sensor

Sensor States
• S0: Monitor (processor on, sensor on, radio off); the initial state
  • shift to S1 if the intensity is above threshold (I_i > I_th)
• S1: Active (processor on, sensor on, radio on)
  • send intensity readings to the base station
  • on receiving a message from the BS containing the error tolerance, shift to S2
• S2: Quasi-active (processor on, sensor on, radio intermittent)
  • send an intensity reading to the BS if the error from the previous reading exceeds the error threshold
  • the Quasar collection approach is used in the quasi-active state
• From S1 or S2, return to S0 when I_i < I_th
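The three states and their transitions can be written as a small transition function. This is a sketch of the state diagram only; the names `State` and `next_state` are hypothetical, and what happens when the intensity exactly equals the threshold is not stated on the slide (here the sensor simply stays in its current state).

```python
from enum import Enum


class State(Enum):
    MONITOR = "S0"        # processor on, sensor on, radio off
    ACTIVE = "S1"         # processor on, sensor on, radio on
    QUASI_ACTIVE = "S2"   # processor on, sensor on, radio intermittent


def next_state(state, intensity, threshold, got_bs_message=False):
    """Return the sensor's next state for one intensity reading."""
    if state is State.MONITOR:
        return State.ACTIVE if intensity > threshold else State.MONITOR
    if intensity < threshold:
        return State.MONITOR              # object out of range: radio off
    if state is State.ACTIVE and got_bs_message:
        return State.QUASI_ACTIVE         # error tolerance received from BS
    return state


s = State.MONITOR
s = next_state(s, intensity=5.0, threshold=2.0)
print(s)  # State.ACTIVE
s = next_state(s, intensity=5.0, threshold=2.0, got_bs_message=True)
print(s)  # State.QUASI_ACTIVE
s = next_state(s, intensity=1.0, threshold=2.0)
print(s)  # State.MONITOR
```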

Server-side Protocol
• Server maintains:
  • list of sensors in the active/quasi-active state
  • history of their intensity readings over a period of time
• Server-side protocol
  • convert track quality to a relative intensity error at sensors
  • send the relative intensity error to a sensor when its state is S1 (active), which moves it to S2 (quasi-active)
  • triangulate using n sensor readings at discrete time intervals

Basic Triangulation Algorithm (using 3 sensor readings)
• Sensors at (x1, y1), (x2, y2), (x3, y3); P: source object power; I_i: intensity reading at the ith sensor
    (x - x1)² + (y - y1)² = P / (4π I1)
    (x - x2)² + (y - y2)² = P / (4π I2)
    (x - x3)² + (y - y3)² = P / (4π I3)
• Solving, we get (x, y) = f(x1, x2, x3, y1, y2, y3, P, I1, I2, I3)
• More complex approaches that amalgamate more than three sensor readings are possible
  • based on numerical methods -- they do not provide a closed-form equation between sensor reading and tracking location!
• The server can use simple triangulation to convert track quality to sensor intensity quality tolerances, and a more complex approach to track.
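One way to obtain the closed form f is to subtract the first circle equation from the other two, which cancels the quadratic terms and leaves a 2x2 linear system in (x, y). The sketch below solves that system with Cramer's rule; it assumes noise-free readings and non-collinear sensors, and the function names are hypothetical.

```python
import math


def intensity_at(sensor, source, power):
    """Intensity model from the slides: I = P / (4 * pi * r^2)."""
    r_sq = (sensor[0] - source[0]) ** 2 + (sensor[1] - source[1]) ** 2
    return power / (4 * math.pi * r_sq)


def triangulate(sensors, intensities, power):
    """Locate the source from three sensor positions and intensity readings."""
    (x1, y1), (x2, y2), (x3, y3) = sensors
    # Squared distances implied by the readings: r_i^2 = P / (4 * pi * I_i).
    r1, r2, r3 = (power / (4 * math.pi * i) for i in intensities)
    # Subtracting circle 1 from circles 2 and 3 gives a linear system A z = b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1 - r2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1 - r3 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("sensors are collinear; position is not unique")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


sensors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
readings = [intensity_at(s, (3.0, 4.0), power=100.0) for s in sensors]
x, y = triangulate(sensors, readings, power=100.0)
print(round(x, 6), round(y, 6))  # 3.0 4.0
```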

Adaptive Tracking: Mapping Track Quality to Sensor Reading
• Claim 1 (power constant)
  • Let I_i be the intensity value of sensor i
  • If [equation not preserved], then track quality is guaranteed to be within ε_track, where [equation not preserved] and C is a constant derived from the known locations of the sensors and the power of the object.
• Claim 2 (power varies between [P_min, P_max])
  • If [equation not preserved], then track quality is guaranteed to be within ε_track, where C’ = C / P² and is a constant.
• The above constraint is a conservative estimate; better bounds are possible.
Figure: intensities I1, I2, I3 versus time over [t_i, t_(i+1)] with tolerances ΔI1, ΔI2, ΔI3, mapped to the track tolerance ε_track in the X (m), Y (m) plane

Adaptive Tracking: Prediction to Improve Performance
• Communication overhead is further reduced by exploiting the predictability of the object being tracked
• Static prediction: sensor & server agree on a set of prediction models
  • only 2 models used: stationary & constant velocity
• Who predicts: sensor-based mobility prediction protocol
  • every sensor by default follows a stationary model
  • based on its history of readings, it may change to the constant velocity model (number of readings limited by sensor memory size)
  • informs the server of the model switch

Actual Track versus Track on Adaptive Tracking (error tolerance 20 m)
• A restricted random motion: the object starts at (0, d) and moves from one node to another randomly chosen node until it walks out of the grid.

Energy Savings due to Adaptive Tracking
• Total energy consumption over all sensor nodes for the random mobility model with varying ε_track (track error)
• Significant energy savings using the adaptive precision protocol over non-adaptive tracking (constant line in graph)
• For a random model, prediction does not work well!

Energy Consumption with Distance from BS
• Total energy consumption over all sensor nodes for the random mobility model with varying base station distance from the sensor grid
• As the base station moves away, one can expect energy consumption to increase, since transmission cost varies as d^n (n = 2)
• The adaptive precision algorithm gives better results with increasing base station distance

Challenges & Ongoing Work
• Ongoing work:
  • supporting a larger class of SQL queries
  • supporting continuous monitoring queries
  • larger class of sensors (e.g., video sensors)
  • better approaches to model fitting/switching in prediction
• In the future:
  • distributed Quasar architecture
  • optimizing quality given resource constraints
  • supporting applications with real-time constraints
  • dealing with failures

The DAS Project**
• Goals: support Database as a Service on the Internet
• Collaboration: IBM (Dr. Bala Iyer), UCI (Gene Tsudik)
**Supported in part by NSF ITR grant entitled “Privacy in Database as a Service” and by the IBM Corporation

Software as a Service
• Get …
  • what you need
  • when you need
• Pay …
  • what you use
• Don’t worry …
  • how to deploy, implement, maintain, upgrade
