
TACO: Tunable Approximate Computation of Outliers in Wireless Sensor Networks







  1. TACO: Tunable Approximate Computation of Outliers in Wireless Sensor Networks HDMS 2010, Ayia Napa, Cyprus

  2. ΠΑΟ: Approximate Computation of Outliers in Wireless Sensor Network Environments (the Greek rendering of TACO) HDMS 2010, Ayia Napa, Cyprus

  3. Outline • Introduction • Why outlier detection is important • Definition of outlier • The TACO Framework • Compression of measurements at the sensor level (LSH) • Outlier detection within and amongst clusters • Optimizations: Boosting Accuracy & Load Balancing • Experimental Evaluation • Related Work • Conclusions

  4. Introduction • Wireless Sensor Network utility • Place inexpensive, tiny motes in areas of interest • Perform continuous querying operations • Periodically obtain reports of the quantities under study • Support sampling procedures, monitoring/surveillance applications, etc. • Constraints • Limited Power Supply • Low Processing Capabilities • Constrained Memory Capacity • Remark: data communication is the main factor of energy drain

  5. Why Outlier Detection is Useful • Outliers may denote malfunctioning sensors • sensor measurements are often unreliable • dirty readings affect computations/decisions [Deligiannakis et al, ICDE '09] • Outliers may also represent interesting events detected by few sensors • e.g., a fire detected by a single sensor • Take into consideration • the recent history of samples acquired by single motes • correlations with measurements of other motes! (figure: example mote readings 16, 19, 24, 30, 32, 40, 39)

  6. Outlier Definition • Let ui denote the latest W measurements obtained by mote Si • Given a similarity metric sim: R^W → [0, 1] and a similarity threshold Φ, sensors Si, Sj are considered similar if: sim(ui, uj) > Φ • Minimum Support Requirement • a mote is classified as an outlier if its latest W measurements are not found to be similar with at least minSup other motes
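The minimum-support rule on this slide can be sketched as a small centralized illustration (a hypothetical sketch, not the authors' distributed implementation; `cosine_sim` stands in for any similarity metric sim: R^W → [0, 1]):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two measurement windows of length W."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def find_outliers(windows, phi, min_sup):
    """Return ids of motes whose latest window is similar to fewer
    than min_sup other motes' windows (minimum-support rule)."""
    ids = list(windows)
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            if cosine_sim(windows[i], windows[j]) > phi:
                support[i] += 1
                support[j] += 1
    return {i for i in ids if support[i] < min_sup}
```

A mote with readings pointing in a very different direction than its peers (e.g. `[-5, 0, 4]` among near-collinear windows) fails to collect support and is flagged.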

  7. TACO Framework – General Idea • Network organization into clusters [(Younis et al, INFOCOM '04), (Qin et al, J.UCS '07)] • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W (figure: clusterhead and regular sensors; a W-dimensional vector of readings is encoded into a d-bit bitmap)

  8. TACO Framework – General Idea • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W • Step 2: Intra-cluster Processing • Encodings are transmitted to clusterheads • Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ • … and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; } (figure: clusterhead and regular sensors)

  9. TACO Framework – General Idea • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W • Step 2: Intra-cluster Processing • Encodings are transmitted to clusterheads • Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ • … and calculate support values • Step 3: Inter-cluster Processing • An approximate TSP problem is solved and lists of potential outliers are exchanged • Additional load-balancing mechanisms and accuracy improvements are devised (figure: clusterhead and regular sensors)

  10. TACO Framework • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W (figure: clusterhead and regular sensors)

  11. Data Encoding and Reduction • Desired Properties • Dimensionality Reduction: reduced bandwidth consumption • Similarity Preservation: allows us to later derive the initial sim(ui, uj) during vector comparisons • Locality Sensitive Hashing (LSH) • Pr_{h∈F}[h(ui) = h(uj)] = sim(ui, uj) • Practically, any similarity measure satisfying a set of criteria [Charikar, STOC '02] may be incorporated in TACO's framework

  12. LSH Example: Random Hyperplane Projection [(Goemans & Williamson, J.ACM '95), (Charikar, STOC '02)] • Family of n d-dimensional random vectors (rvi) • Generates for each data vector a bitmap of size n as follows: • Sets bit i = 1 if the dot product of the data vector with the i-th random vector is positive • Sets bit i = 0 otherwise (figure: 2-dimensional sensor data projected against random vectors rv1 to rv4; resulting TACO encoding: 1 0 0 1)
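The random hyperplane projection described above can be sketched directly (a minimal illustration; the Gaussian sampling of random vectors is a standard choice, not necessarily the authors' exact generator):

```python
import random

def make_random_vectors(n, dim, seed=0):
    """Draw n random dim-dimensional vectors with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

def rhp_encode(u, random_vectors):
    """Random Hyperplane Projection: one bit per random vector,
    set to 1 iff the dot product with the data vector is positive."""
    bits = []
    for rv in random_vectors:
        dot = sum(a * b for a, b in zip(rv, u))
        bits.append(1 if dot > 0 else 0)
    return bits
```

Note that scaling a data vector by a positive constant leaves its encoding unchanged, since only the sign of each dot product matters.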

  13. Computing Similarity • Cosine Similarity: cos(θ(ui, uj)) • RHP maps each vector ui to an n-bit encoding RHP(ui) • The hamming distance between encodings estimates the angle: in the figure's 6-bit example, θ(RHP(ui), RHP(uj)) = 2/6 · π = π/3 • Angle similarity is thus derived from the hamming distance
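Since each bit of an RHP encoding differs with probability θ/π, the angle (and hence the cosine similarity) can be estimated from the hamming distance alone, as a short sketch shows:

```python
import math

def hamming(x, y):
    """Hamming distance between two equal-length bit lists."""
    return sum(a != b for a, b in zip(x, y))

def estimated_angle(x, y):
    """Estimate θ(ui, uj) from n-bit RHP encodings:
    each bit differs with probability θ/π."""
    n = len(x)
    return hamming(x, y) / n * math.pi

def estimated_cosine(x, y):
    """Estimated cosine similarity of the original vectors."""
    return math.cos(estimated_angle(x, y))
```

With the slide's 6-bit example (hamming distance 2), this yields θ ≈ π/3 and an estimated cosine similarity of 0.5.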

  14. Supported Similarity Measures

  15. TACO Framework • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W • Step 2: Intra-cluster Processing • Encodings are transmitted to clusterheads • Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ • … and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; } (figure: clusterhead and regular sensors)

  16. Intra-cluster Processing • Goal: find potential outliers within each cluster's realm • Back to our running example, sensor vectors are considered similar when θ(ui, uj) < Φθ • Translate the user-defined similarity threshold: Φh = Φθ · d / π • For any received pair of bitmaps Xi, Xj, clusterheads can obtain an estimation of the initial similarity from their hamming distance Dh(Xi, Xj), testing: Dh(Xi, Xj) < Φh • At the end of the process, <Si, Xi, support> lists are extracted for motes that do not satisfy the minSup parameter
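The clusterhead-side test above (threshold translation plus support counting over bitmaps) can be sketched as follows (a simplified single-cluster illustration, not the authors' implementation):

```python
import math

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def intra_cluster_outliers(encodings, phi_theta, min_sup):
    """Translate the angular threshold Φθ into a hamming threshold
    Φh = Φθ·d/π, count support per mote, and return the
    <Si, Xi, support> list for motes below minSup."""
    ids = list(encodings)
    d = len(encodings[ids[0]])
    phi_h = phi_theta * d / math.pi
    support = {i: 0 for i in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            i, j = ids[a], ids[b]
            if hamming(encodings[i], encodings[j]) < phi_h:
                support[i] += 1
                support[j] += 1
    return [(i, encodings[i], support[i]) for i in ids
            if support[i] < min_sup]
```

For example, with Φθ = π/2 and d = 4, Φh = 2, so only bitmap pairs at hamming distance 0 or 1 count as similar.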

  17. Intra-cluster Processing • Probability of correctly classifying similar motes as such (plot; W = 16, θ = 5, Φθ = 10)

  18. TACO Framework • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W • Step 2: Intra-cluster Processing • Encodings are transmitted to clusterheads • Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ • … and calculate support values • Step 3: Inter-cluster Processing • An approximate TSP problem is solved and lists of potential outliers are exchanged (figure: clusterhead and regular sensors)

  19. Boosting TACO Encodings • Use d = n·μ bits: partition each encoding into μ sub-bitmaps of n bits and obtain the answer provided by the majority of the μ similarity tests (figure: Xi and Xj split into sub-bitmaps, each compared separately; SimBoosting(Xi, Xj) = 1) • Check the quality of the boosting estimation (θ(ui, uj) ≤ Φθ): • Unpartitioned bitmaps: Pwrong(d) = 1 - Psimilar(d) • Boosting: Pwrong(d, μ) ≤ … • Decide an appropriate μ: • Restriction on μ: Psimilar(d/μ) > 0.5 • Comparison of (Pwrong(d, μ), Pwrong(d))
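The majority-vote mechanism on this slide can be sketched as follows (a hypothetical illustration; `phi_h_sub` is the per-sub-bitmap hamming threshold, i.e. Φθ·n/π for sub-bitmaps of n bits):

```python
def boosted_similar(xi, xj, mu, phi_h_sub):
    """Split each d-bit encoding into mu sub-bitmaps of n = d/mu
    bits, run the hamming similarity test on each sub-bitmap, and
    return the answer given by the majority of the mu tests."""
    d = len(xi)
    n = d // mu
    votes = 0
    for k in range(mu):
        a = xi[k * n:(k + 1) * n]
        b = xj[k * n:(k + 1) * n]
        dh = sum(p != q for p, q in zip(a, b))  # sub-bitmap hamming distance
        if dh < phi_h_sub:
            votes += 1
    return votes > mu / 2
```

A single noisy sub-bitmap can no longer flip the decision, which is what boosts accuracy over one monolithic test on the full d bits.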

  20. Comparison Pruning • Modified cluster election process returns B bucket nodes • Introducing a 2nd level of hashing based on the hamming weight of the bitmaps • Comparison pruning is achieved by hashing highly dissimilar bitmaps to different buckets (figure: hamming-weight ranges [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d] mapped to clusterhead/bucket nodes and regular sensors)
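The second-level hash can be sketched as a simple range partition of the hamming-weight space (an illustrative simplification with equal-width ranges; the exact bucket boundaries are a detail of the deck's figure):

```python
def bucket_of(x, num_buckets):
    """Route a d-bit encoding to a bucket node by its hamming
    weight, splitting the weight space [0, d] into equal ranges."""
    d = len(x)
    wh = sum(x)  # hamming weight
    width = (d + 1) / num_buckets
    idx = min(int(wh / width), num_buckets - 1)
    return idx
```

The pruning argument: two bitmaps differ in at least |Wh(Xi) - Wh(Xj)| positions, so bitmaps with very different weights cannot pass the similarity test and need never be compared once they land in different buckets.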

  21. Load Balancing Among Buckets • Histogram Calculation Phase: • Buckets construct equi-width histograms based on the hamming weight frequencies of the received Xi's • Histogram Communication Phase: • Each bucket communicates to the clusterhead its estimated frequency counts and its width parameter ci • Hash Key Space Reassignment: • The clusterhead determines a new space partitioning and broadcasts the corresponding information (figure: buckets SB1 [0, 3d/8], SB2 (3d/8, 9d/16], SB3 (9d/16, 11d/16], SB4 (11d/16, d] with widths c1 = d/12, c2 = d/16, c3 = d/16, c4 = d/12 and their frequency counts)
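The reassignment step can be sketched in simplified form: instead of the histogram estimates exchanged in the protocol, this hypothetical sketch recomputes boundaries directly from the observed hamming weights so each bucket receives roughly the same number of encodings (an equi-depth stand-in for the clusterhead's histogram-based repartitioning):

```python
def rebalance(weights, num_buckets):
    """Given observed hamming weights, compute new range boundaries
    so each bucket covers roughly the same number of encodings.
    bounds[i] is the upper weight boundary of bucket i."""
    ws = sorted(weights)
    per = len(ws) / num_buckets  # target encodings per bucket
    bounds = []
    for k in range(1, num_buckets):
        bounds.append(ws[min(int(k * per), len(ws) - 1)])
    return bounds
```

If the observed weights were uniform over [0, 7] and B = 4, the boundaries land at 2, 4 and 6; a skewed weight distribution would instead shrink the crowded ranges, mirroring the figure's unequal bucket widths.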

  22. Outline • Introduction • Why is Outlier Detection Important and Difficult • Our Contributions • Outlier detection with limited bandwidth • Compute measurement similarity over compressed representations of measurements (LSH) • The TACO Framework • Compression of measurements at the sensor level • Outlier detection within and amongst clusters • Optimizations: Load Balancing & Comparison Pruning • Experimental Evaluation • Related Work • Conclusions

  23. Sensitivity Analysis • Intel Lab Data - Temperature (plots: Avg. Precision, Avg. Recall)

  24. Sensitivity Analysis • Boosting • Intel Lab Data - Humidity (plots: Avg. Precision, Avg. Recall)

  25. Performance Evaluation in TOSSIM • For a 1/8 reduction ratio, TACO consumes on average 1/12 of the bandwidth, with the savings reaching a ratio of 1/15

  26. Performance Evaluation in TOSSIM • Network Lifetime: the epoch at which the first mote in the network dies. • Average lifetime for motes initialized with 5000 mJ residual energy • Reduction in power consumption reaches a ratio of 1/2.7

  27. TACO vs Hierarchical Outlier Detection Techniques • Robust [Deligiannakis et al, ICDE '09] falls short by up to 10% in terms of the F-Measure metric • TACO ensures lower bandwidth consumption, with a ratio varying from 1/2.6 to 1/7.8

  28. Outline • Introduction • Why is Outlier Detection Important and Difficult • Our Contributions • Outlier detection with limited bandwidth • Compute measurement similarity over compressed representations of measurements (LSH) • The TACO Framework • Compression of measurements at the sensor level • Outlier detection within and amongst clusters • Optimizations: Load Balancing & Comparison Pruning • Experimental Evaluation • Related Work • Conclusions

  29. Related Work - Ours • Outlier reports on par with aggregate query answers [Kotidis et al, MobiDE '07] • hierarchical organization of motes • takes into account temporal & spatial correlations as well • reports aggregates, witnesses & outliers • Outlier-aware routing [Deligiannakis et al, ICDE '09] • routes outliers towards motes that can potentially witness them • validates the detection scheme for different similarity metrics (correlation coefficient and Jaccard index, also supported in TACO) • Snapshot Queries [Kotidis, ICDE '05] • motes maintain local regression models for their neighbors • models can be used for outlier detection • Random Hyperplane Projection using Derived Dimensions [Georgoulas et al, MobiDE '10] • extends the LSH scheme for skewed datasets • up to 70% improvements in accuracy

  30. Related Work • Kernel based approach [Subramaniam et al, VLDB ‘06] • Centralized Approaches [Jeffrey et al, Pervasive ‘06] • Localized Voting Protocols [(Chen et al, DIWANS ’06),(Xiao et al, MobiDE ‘07) ] • Report of top-K values with the highest deviation [Branch et al, ICDCS ‘06] • Weighted Moving Average techniques [Zhuang et al, ICDCS ’07]

  31. Conclusions (Συμπεράσματα) • Our Contributions • outlier detection with limited bandwidth • The TACO/ΠΑΟ Framework • LSH compression of measurements at the sensor level • outlier detection within and amongst clusters • optimizations: Boosting Accuracy & Load Balancing • Experimental Evaluation • accuracy exceeding 80% in most of the experiments • bandwidth consumption reduced by up to a factor of 1/12 for 1/8-reduced bitmaps • network lifetime prolonged by up to a factor of 3 for a 1/4 reduction ratio

  32. TACO: Tunable Approximate Computation of Outliers in Wireless Sensor Networks Thank you!

  33. Backup Slides

  34. TACO Framework • Step 1: Data Encoding and Reduction • Motes obtain samples and keep the latest W measurements in a tumbling window • Encode the W measurements in a bitmap of size d << W • Step 2: Intra-cluster Processing • Encodings are transmitted to clusterheads • Clusterheads perform similarity tests based on a given similarity measure and a similarity threshold Φ • … and calculate support values: if Sim(ui, uj) > Φ { supportSi++; supportSj++; } • Step 3: Inter-cluster Processing • An approximate TSP problem is solved and lists of potential outliers are exchanged (figure: clusterhead and regular sensors)

  35. Leveraging Additional Motes for Outlier Detection • Introducing a 2nd level of hashing: • Besides cluster election, the process continues in each cluster so as to select B bucket nodes • For 0 ≤ Wh(Xi) ≤ d, equally distribute the hash key space amongst them • Hash each bitmap to the bucket whose range covers its hamming weight Wh(Xi) • For bitmaps with Wh(Xi) at the edge of a bucket, transmit Xi to the surrounding range, which is guaranteed to contain at most 2 buckets • Comparison pruning is ensured by the fact that highly dissimilar bitmaps are hashed to different buckets, thus never being tested for similarity (figure: hamming-weight ranges [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d])

  36. Leveraging Additional Motes for Outlier Detection • Intra-cluster Processing: • Buckets perform bitmap comparisons as in common intra-cluster processing • Constraints: • If both encodings were hashed to a single common bucket, the similarity test is performed only in that bucket • For encodings that were hashed to the same 2 buckets, similarity is tested only in the bucket with the lowest SBi • PotOut formation: • Si is removed from PotOut if it is not reported by all buckets it was hashed to • Received support values are added and Si ∈ PotOut iff supportSi < minSup (figure: hamming-weight ranges [0, d/4], (d/4, d/2], (d/2, 3d/4], (3d/4, d])

  37. Experimental Setup • Datasets: • Intel Lab Data: • Temperature and Humidity measurements • Network consisting of 48 motes organized into 4 clusters • Measurements for a period of 633 and 487 epochs respectively • minSup = 4 • Weather Dataset: • Temperature, Humidity and Solar Irradiance measurements • Network consisting of 100 motes organized into 10 clusters • Measurements for a period of 2000 epochs • minSup = 6

  38. Experimental Setup • Outlier Injection • Intel Lab Data & Weather Temperature, Humidity data: • 0.4% probability that a mote obtains a spurious measurement at some epoch • 6% probability that a mote fails dirty at some epoch • Every mote that fails dirty increases its measurements by 1 degree per epoch until it reaches a MAX_VAL parameter, imposing 15% noise on the values • Intel Lab Data: MAX_VAL = 100 • Weather Data: MAX_VAL = 200 • Weather Solar Irradiance data: • Random injection of values obtained at various time periods into the sequence of epoch readings • Simulators • TOSSIM network simulator • Custom, lightweight Java simulator
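The spurious/dirty fault-injection scheme above can be sketched for a single mote's reading stream (a hypothetical illustration of the stated probabilities and drift rule, not the authors' generator; parameter names are assumptions):

```python
import random

def inject_failures(readings, p_spurious=0.004, p_dirty=0.06,
                    max_val=100, seed=1):
    """Per-epoch fault injection for one mote: with probability
    p_spurious the mote reports a random spurious value; with
    probability p_dirty it 'fails dirty', after which its readings
    drift upward by 1 degree per epoch, capped at max_val."""
    rng = random.Random(seed)
    out = []
    drift = 0.0
    dirty = False
    for r in readings:
        if not dirty and rng.random() < p_dirty:
            dirty = True
        if dirty:
            drift += 1.0
        v = min(r + drift, max_val)
        if rng.random() < p_spurious:
            v = rng.uniform(0, max_val)  # spurious reading
        out.append(v)
    return out
```

Setting `p_dirty=1.0` and `p_spurious=0.0` makes the drift behavior easy to see: readings climb by one degree per epoch until MAX_VAL.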

  39. Sensitivity Analysis • Intel Lab Data - Humidity (plots: Avg. Precision, Avg. Recall) • Weather Data - Humidity (plots: Avg. Precision, Avg. Recall)

  40. Sensitivity Analysis • Weather Data - Solar Irradiance (plots: Avg. Precision, Avg. Recall) • Boosting • Intel Lab Data - Humidity (plots: Avg. Precision, Avg. Recall)

  41. Performance Evaluation in TOSSIM • Transmitted bits categorization per approach

  42. Bucket Node Introduction
