Supporting Aggregate Queries Over Ad-Hoc Wireless Sensor Networks

Samuel Madden, UC Berkeley • With Robert Szewczyk, Michael Franklin, and David Culler • WMCSA, June 21, 2002

Motivation: Sensor Nets and In-Network Query Processing • Many Sensor Network Applications are Data Oriented • Queries Natural and Efficient Data Processing Mechanism • Easy (unlike embedded C code) • Enable optimizations through abstraction • Aggregates Common Case • E.g. Which rooms are in use? • In-network processing a must • Sensor networks power and bandwidth constrained • Communication dominates power cost • Not subject to Moore’s law!

Overview • Background • Sensor Networks • Our Approach: Tiny Aggregation (TAG) • Overview • Expressiveness • Illustration • Optimizations • Grouping • Current Status & Future Work

Background: Sensor Networks • A collection of small, radio-equipped, battery-powered, networked microprocessors • Typically ad-hoc & multihop networks • Single devices unreliable • Very low power; tiny batteries must supply power for months • Apps: Environment Monitoring, Personal Nets, Object Tracking • Data processing plays a key role!

Berkeley Mica Motes & TinyOS • TinyOS operating system (services) • 4 MHz processor • 4 KB RAM, 512 KB EEPROM, 128 KB code space • Single-channel CSMA half-duplex radio @ 40 kbit/s • Lossy: 20% loss @ 5 ft in Ganesan et al. • Communication very expensive: 800 instrs/bit

The Tiny Aggregation (TAG) Approach • Push declarative queries into network • Impose a hierarchical routing tree onto the network • Divide time into epochs • Every epoch, sensors evaluate query over local sensor data and data from children • Aggregate local and child data • Each node transmits just once per epoch • Pipelined approach increases throughput • Depending on aggregate function, various optimizations can be applied

SQL Primer • SQL is an established declarative language; not wedded to it • Some extensions clearly necessary, e.g. for sample rates • We adopt a basic subset: • ‘sensors’ relation (table) has • One column for each reading-type, or attribute • One row for each externalized value • May represent an aggregation of several individual readings • General form: SELECT {aggn(attrn), attrs} FROM sensors WHERE {selPreds} GROUP BY {attrs} HAVING {havingPreds} EPOCH DURATION s • Example: SELECT AVG(light) FROM sensors WHERE sound < 100 GROUP BY roomNo HAVING AVG(light) < 50

Aggregation Functions • Standard SQL supports “the basic 5”: • MIN, MAX, SUM, AVERAGE, and COUNT • We support any function conforming to: Agg_n = {f_merge, f_init, f_evaluate} • f_merge{<a1>, <a2>} → <a12> • f_init{a0} → <a0> • f_evaluate{<a1>} → aggregate value • (Merge associative, commutative!) • Partial Aggregate Example: Average • AVG_merge{<S1, C1>, <S2, C2>} → <S1 + S2, C1 + C2> • AVG_init{v} → <v, 1> • AVG_evaluate{<S1, C1>} → S1/C1
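The three-function decomposition above can be sketched in Python for the AVG case. This is a minimal illustration, not the TinyOS implementation: the function names and the sample readings are made up for the example, and the partial state is the (sum, count) pair given on the slide.

```python
from functools import reduce

# Partial state for AVG is a (sum, count) pair, as on the slide.
def avg_init(v):
    return (v, 1)

def avg_merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

def avg_evaluate(a):
    return a[0] / a[1]

readings = [10, 20, 30, 40]          # hypothetical sensor readings
states = [avg_init(v) for v in readings]

# Merge is associative and commutative, so the merge order does not matter:
# a balanced pairing and a reversed left fold produce the same state.
left = avg_merge(avg_merge(states[0], states[1]),
                 avg_merge(states[2], states[3]))
right = reduce(avg_merge, reversed(states))
result = avg_evaluate(left)
```

Order independence is what lets each node in the routing tree merge whatever child states happen to arrive, in any order, before forwarding a single partial state.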

Query Propagation • TAG propagation agnostic • Any algorithm that can: • Deliver the query to all sensors • Provide all sensors with one or more duplicate-free routes to some root • Paper describes simple flooding approach • Query introduced at a root; rebroadcast by all sensors until it reaches leaves • Sensors pick parent and level when they hear query • Reselect parent after k silent epochs • [Figure: query flooding from the root, each node labeled with parent (P) and level (L): node 1 (P:0, L:1); nodes 2 and 3 (P:1, L:2); node 4 (P:2, L:3); node 6 (P:3, L:3); node 5 (P:4, L:4)]
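The flooding step can be approximated as a breadth-first traversal in which each sensor adopts the first sender it hears as its parent. The sketch below assumes lossless links and a hypothetical connectivity map (`links`) chosen so that the result matches the parent and level labels in the slide's figure; parent reselection after k silent epochs is not modeled.

```python
from collections import deque

def flood(neighbors, root):
    """Return {node: (parent, level)}; the root gets parent 0 and level 1."""
    assignment = {root: (0, 1)}
    queue = deque([root])
    while queue:
        sender = queue.popleft()
        _, level = assignment[sender]
        for node in neighbors[sender]:
            if node not in assignment:      # first time this node hears the query
                assignment[node] = (sender, level + 1)
                queue.append(node)
    return assignment

# Hypothetical radio connectivity (an assumption, not taken from the slides).
links = {1: [2, 3], 2: [1, 4], 3: [1, 6], 4: [2, 5], 5: [4], 6: [3]}
tree = flood(links, root=1)
```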

Illustration: Pipelined Aggregation • SELECT COUNT(*) FROM sensors • [Figure: routing tree over sensors 1 to 5, depth = d]

Illustration: Pipelined Aggregation • SELECT COUNT(*) FROM sensors • [Figure: Epoch 1; table of partial counts by sensor # and epoch #]

Discussion • Result is a stream of values • Ideal for monitoring scenarios • One communication / node / epoch • Symmetric power consumption, even at root • New value on every epoch • After d-1 epochs, complete aggregation • Given a single loss, network will recover after at most d-1 epochs • With time synchronization, nodes can sleep between epochs, except during small communication window

Simulation Results • 2500 nodes on a 50x50 grid • Depth = ~10 • Neighbors = ~20 • Some aggregates require dramatically more state!

Optimization: Channel Sharing • Insight: Shared channel enables optimizations • Suppress messages that won’t affect aggregate • E.g., in a MAX query, sensor with value v hears a neighbor with value ≥ v, so it doesn’t report • Applies to all such exemplary aggregates • Sensors can learn about query advertisements they missed • If a sensor shows up in a new environment, it can learn about queries by looking at neighbors’ messages • Root doesn’t have to explicitly rebroadcast query!
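A rough sketch of the MAX suppression rule follows. It assumes a single lossless broadcast cell in which every sensor overhears every earlier transmission in the epoch; the function name and the sample readings are hypothetical.

```python
def reports_for_max(readings_in_order):
    """readings_in_order: (sensor_id, value) pairs in transmission order.

    Returns the pairs that are actually transmitted under snooping.
    """
    sent = []
    best_heard = None
    for sensor_id, value in readings_in_order:
        if best_heard is not None and best_heard >= value:
            continue                 # a neighbor already reported a value >= ours
        sent.append((sensor_id, value))
        best_heard = value
    return sent

cell = [(1, 30), (2, 25), (3, 42), (4, 42), (5, 10)]   # hypothetical readings
sent = reports_for_max(cell)
```

Here only two of the five sensors transmit, and the maximum of the transmitted values equals the true maximum.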

Optimization: Hypothesis Testing • Insight: Root can provide information that will suppress readings that cannot affect the final aggregate value. • E.g. Tell all the nodes that the MIN is definitely < 50; nodes with value ≥ 50 need not participate. • Works for any linear aggregate function • How is hypothesis computed? • Blind guess • Statistically informed guess • Observation over first few levels of tree / rounds of aggregate
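The MIN example above can be sketched as a filter applied at each node. The readings and the helper name are hypothetical, and the sketch assumes the hypothesis is correct; if no node falls below the bound, the root would need to retry with a weaker hypothesis, which is not shown.

```python
def participants(readings, hypothesis):
    """Keep only sensors whose value could still be the MIN (value < hypothesis)."""
    return {sensor: value for sensor, value in readings.items() if value < hypothesis}

readings = {1: 72, 2: 48, 3: 55, 4: 31, 5: 90}   # hypothetical readings
active = participants(readings, hypothesis=50)   # root announces "MIN < 50"
result = min(active.values())
```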

Optimization: Use Multiple Parents • For duplicate insensitive (e.g. MAX), or partitionable (e.g. COUNT) aggregates, • Send (part of) aggregate to all parents • Decreases variance • Dramatically, when there are lots of parents • No extra cost, since all messages broadcast
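For a partitionable aggregate such as COUNT, the idea can be illustrated by splitting a node's partial count evenly among its parents. This is a sketch under assumptions: the even split, the parent names, and the use of `Fraction` to keep shares exact are choices made for the example, not details from the slides.

```python
from fractions import Fraction

def split_count(count, parents):
    """Give each parent an equal share of this node's partial COUNT."""
    share = Fraction(count, len(parents))
    return {p: share for p in parents}

shares = split_count(6, ["A", "B"])              # hypothetical parents A and B
total_no_loss = sum(shares.values())
# If the message to parent A is lost, only A's share goes missing.
total_one_link_lost = sum(v for p, v in shares.items() if p != "A")
```

With a single parent, one lost message drops the whole subtree's count; with two parents, the same loss drops only half of it, which is the variance reduction the slide refers to.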

Grouping • Value-based, complete partitioning of records • If query is grouped, sensors apply predicate to local readings on each epoch • Aggregate records tagged with group • When a child record (with group) is received: • If it belongs to a stored group, merge with existing record for that group • If not, just store it • At the end of each epoch, transmit one record per group
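The merge-or-store rule for child records can be sketched as below. The dictionary layout (group number mapped to a partial state) and the AVG-style (sum, count) states are assumptions made for illustration.

```python
def avg_merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

def merge_group_records(stored, incoming, merge):
    """Merge a child's per-group partial states into the local table."""
    for group, state in incoming.items():
        if group in stored:
            stored[group] = merge(stored[group], state)   # existing group: merge
        else:
            stored[group] = state                         # new group: just store it
    return stored

local = {1: (40, 1)}                  # this node's own reading falls in group 1
child = {1: (60, 2), 2: (15, 1)}      # record received from a child
table = merge_group_records(dict(local), child, avg_merge)
```

At the end of the epoch the node would transmit one record per entry in `table`.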

Status & Future Work • Status • Simple simulator • Complete set of experiments, including behavior of algorithms in the face of loss • Generalization of algorithms beyond complete pipelining • Taxonomy of aggregates to allow optimizations on functional properties • Basic implementation (shown in demo) • Future work • Expressiveness issues • Aggregates over temporal data • Nested queries, e.g. MAX(AVG(1000 readings) @ each node) • Correctness issues in the face of loss • How does the user know which nodes are and are not included in an aggregate?

Summary • Declarative queries for aggregates • Straightforward, familiar interface • Enables optimizations • Snooping techniques for exemplary aggregates • Multiple parents for partitionable aggregates • Pipelined, epoch based algorithm • Streaming Results • Symmetric communication • Low-power friendly

Questions?

Grouping • GROUP BY expr • expr is an expression over one or more attributes • Evaluation of expr yields a group number • Each reading is a member of exactly one group Example: SELECT max(light) FROM sensors GROUP BY TRUNC(temp/10) Result:
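To make the GROUP BY TRUNC(temp/10) example concrete, the sketch below evaluates it centrally over a handful of made-up (temp, light) readings. The data are hypothetical and are not the result table from the original slide.

```python
import math

# Hypothetical (temp, light) readings.
readings = [(23, 410), (27, 380), (31, 520), (38, 495), (12, 100)]

groups = {}
for temp, light in readings:
    group = math.trunc(temp / 10)                 # group number from the expression
    groups[group] = max(groups.get(group, light), light)
```

Each reading lands in exactly one group, and `groups` maps each group number to max(light) for that group.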

Having • HAVING preds • preds filters out groups that do not satisfy predicate • versus WHERE, which filters out tuples that do not satisfy predicate • Example: SELECT max(temp) FROM sensors GROUP BY light HAVING max(temp) < 100 • Yields all groups whose maximum temperature is under 100

Group Eviction • Problem: Number of groups in any one iteration may exceed available storage on sensor • Solution: Evict! • Choose one or more groups to forward up tree • Rely on nodes further up tree, or root, to recombine groups properly • What policy to choose? • Intuitively: least popular group, since don’t want to evict a group that will receive more values this epoch. • Experiments suggest: • Policy matters very little • Evicting as many groups as will fit into a single message is good
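One possible eviction step is sketched below. It assumes (sum, count) partial states and uses the count as a stand-in for "popularity"; both the proxy and the one-victim-at-a-time policy are assumptions for illustration, and the slide itself reports that the policy matters very little.

```python
def avg_merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

def insert_with_eviction(table, group, state, merge, capacity):
    """Insert a record; return any (group, state) pairs evicted to the parent."""
    if group in table:
        table[group] = merge(table[group], state)
        return []
    evicted = []
    if len(table) >= capacity:
        # Assumed policy: evict the group with the fewest merged readings.
        victim = min(table, key=lambda g: table[g][1])
        evicted.append((victim, table.pop(victim)))
    table[group] = state
    return evicted

table = {}
forwarded = []
for group, value in [(1, 10), (1, 20), (2, 5), (3, 7)]:   # hypothetical records
    forwarded += insert_with_eviction(table, group, (value, 1), avg_merge, capacity=2)
```

Evicted records are forwarded up the tree, where a node with room (or the root) merges them back into the right group.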

Simulation Environment • Java-based simulation & visualization for validating algorithms, collecting data. • Coarse-grained, event-based simulation • Sensors arranged on a grid, radio connectivity by Euclidean distance • Communication model • Lossless: All neighbors hear all messages • Lossy: Messages lost with probability that increases with distance • Symmetric links • No collisions, hidden terminals, etc.
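The connectivity and loss models can be approximated in a few lines. The original simulator is Java-based; this Python sketch is only an illustration, and the linear loss curve, function names, and parameters are assumptions, since the slides say only that loss probability increases with distance.

```python
import math
import random

def neighbors(node, grid_size, radio_range):
    """Grid nodes within Euclidean distance radio_range of node (x, y)."""
    return [(x, y)
            for x in range(grid_size)
            for y in range(grid_size)
            if (x, y) != node and math.dist((x, y), node) <= radio_range]

def loss_probability(distance, radio_range):
    # Assumed linear model: no loss at distance 0, certain loss at full range.
    return min(1.0, distance / radio_range)

def delivered(sender, receiver, radio_range, rng):
    """Lossy model: draw once per message; symmetric because distance is."""
    d = math.dist(sender, receiver)
    return rng.random() >= loss_probability(d, radio_range)

rng = random.Random(0)                       # seeded for repeatable runs
near = neighbors((2, 2), grid_size=5, radio_range=1.5)
ok = delivered((2, 2), (2, 3), radio_range=1.5, rng=rng)
```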

Simulation Screenshot

Experimental Results • Experiments with simulator • Performance of basic TAG • Benefits of hypothesis testing • Effect of loss • Most experiments in terms of bytes or messages sent, since message transmission is the dominant cost • Depends on radio being turned off between epochs and aggregation functions being cheap

Experiment: Basic TAG Dense Packing, Ideal Communication

Experiment: Hypothesis Testing Uniform Value Distribution, Dense Packing, Ideal Communication

Experiment: Effects of Loss

Experiment: Benefit of Cache

Pipelined Aggregates • After query propagates, during each epoch: • Each sensor samples local sensors once • Combines them with partial state records (PSRs) from children • Outputs PSR representing aggregate state in the previous epoch • After (d-1) epochs, PSR for the whole tree output at root • d = depth of the routing tree • If desired, partial state from top k levels could be output in kth epoch • To avoid combining PSRs from different epochs, sensors must cache values from children • [Figure: routing tree over sensors 1 to 5] Value from 2 produced at time t arrives at 1 at time (t+1); value from 5 produced at time t arrives at 1 at time (t+3)
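The pipelining behavior can be illustrated with a simplified COUNT simulation. It is a sketch under stated assumptions: a hypothetical five-node tree of depth d = 4, lossless links, and no epoch tags on the records (for a static COUNT, mixing epochs is harmless, so the per-epoch bookkeeping described above is omitted). Each node sends 1 plus whatever its children sent in the previous epoch.

```python
def simulate_count(children, root, epochs):
    """Return the root's COUNT output for each epoch (epochs numbered from 0)."""
    last_sent = {node: 0 for node in children}   # child cache: nothing heard yet
    root_outputs = []
    for _ in range(epochs):
        sent = {node: 1 + sum(last_sent[c] for c in kids)
                for node, kids in children.items()}
        root_outputs.append(sent[root])
        last_sent = sent
    return root_outputs

# Hypothetical routing tree: 1 is the root, depth d = 4 (1 -> 2 -> 3 -> 5).
children = {1: [2], 2: [4, 3], 3: [5], 4: [], 5: []}
outputs = simulate_count(children, root=1, epochs=5)
```

The root's output grows as deeper levels feed in and reaches the full count of 5 in epoch 3, that is, after d-1 epochs; every later epoch produces a fresh complete value.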

Pipelining Example • [Figure: routing tree over sensors 1, 2, 4, 3, 5]

Pipelining Example • Epoch 0: <4,0,1> <5,0,1>

Pipelining Example • Epoch 1: <2,0,2> <4,1,1> <3,0,2> <5,1,1>

Pipelining Example • Epoch 2: <1,0,3> <2,0,4> <4,2,1> <3,1,2> <5,2,1>

Pipelining Example • Epoch 3: <1,0,5> <2,1,4> <4,3,1> <3,2,2> <5,3,1>

Pipelining Example • Epoch 4: <1,1,5> <2,2,4> <4,4,1> <3,3,2> <5,4,1>

Optimization: Delta Compression • If a sensor’s reading is unchanged from previous epoch, it need not transmit. • Parents assume value is unchanged • Leverage child value cache • Periodic heartbeats to handle disconnection • Extension: if a sensor’s reading has changed by less than some threshold, it need not transmit • Similar to hypothesis testing with AVERAGE • Really future work: See C. Olston, “Best-Effort Cache Synchronization”, SIGMOD 2002.
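The transmit-or-stay-silent decision can be sketched as follows. The function name and the threshold parameter are hypothetical, and the periodic heartbeat that lets a parent detect a disconnected child is deliberately left out.

```python
def maybe_transmit(last_sent, reading, threshold=0):
    """Return the value to transmit, or None to stay silent this epoch."""
    if last_sent is not None and abs(reading - last_sent) <= threshold:
        return None          # parent keeps using its cached value for this child
    return reading

first = maybe_transmit(None, 20)                  # nothing cached yet: must send
same = maybe_transmit(20, 20)                     # unchanged: stay silent
small = maybe_transmit(20, 22, threshold=5)       # within threshold: stay silent
large = maybe_transmit(20, 30, threshold=5)       # changed enough: send
```

With a nonzero threshold the parent's cached value can be off by up to the threshold, so the aggregate becomes approximate.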

Taxonomy of Aggregates • TAG insight: classifying aggregates according to various functional properties • Yields a general set of optimizations that can automatically be applied
