
Multidimensional Analysis of Atypical Events in Cyber-Physical Data



Presentation Transcript


  1. Multidimensional Analysis of Atypical Events in Cyber-Physical Data Lu-An Tang, Xiao Yu, Sangkyum Kim, Jiawei Han, Wen-Chih Peng, Yizhou Sun

  2. Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

  3. Introduction • Cyber-Physical System (CPS): integrates physical devices (e.g., sensors, cameras) with cyber components to form a situation-aware analytical system • Many promising applications • traffic observation • intruder/motion detection • battlefield surveillance • remote healthcare • Key task: analyze the atypical data with multi-dimensional information

  4. Motivation Example I • Traffic Monitoring System: a typical CPS • Inductive loop sensors • Thousands, placed every few miles on highways • 24 hours * 7 days • monitoring traffic and reporting congestion

  5. Motivation Example II • Questions from Transportation Officers • When does congestion usually happen downtown? • Where does congestion happen on weekdays? • In the past three months, which road was the most seriously congested, and how did those congestions start? • Traditional SQL queries cannot answer them

  6. Our Contribution • Officers demand results that are summarized, self-organized and succinct, delivered in a short time • Our goal • Construct a data model for atypical data in CPS • Support efficient query processing with such a model

  7. Challenges • Massive Data • Thousands of sensors generate gigabytes, even terabytes of data • Complex Event • The atypical event is a dynamic process influencing multiple spatial regions • How to represent such an event? – new measure/model • Effectiveness & Efficiency • If the query range is large, many events are involved • Retrieve the significant ones in a short time – new algorithm

  8. Our Contribution • Introduce techniques to discover atypical events and summarize them as atypical micro-clusters • Integrate similar micro-clusters into macro-clusters to give the big picture • Construct the data model of the atypical cluster forest • Use a guided algorithm to retrieve the significant clusters efficiently

  9. Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

  10. CPS Systems in Traffic Application • PeMS: collects data on California highways • CarWeb: collects real-time GPS data from cars • Google Traffic: toolkit on Google Maps • CubeView by Shekhar et al.: implements traditional OLAP on the traffic data • AITVS: based on CubeView, using two more distinct views to support investigation • Most focus on SQL-based queries, lacking analysis power • Built on the whole dataset – huge I/O overhead, atypical data are dwarfed

  11. Other Spatial OLAP Techniques • Spatial Cube by Stefanovic et al.: dimension members are spatially referenced and can be represented on a map • Trajectory Cube by Giannotti et al.: includes temporal, spatial, demographic and technographic dimensions, with two kinds of measures: spatial measure and numerical measure • Flow Cube by Gonzalez et al.: analyzes item flows in RFID applications • They target different objects – cannot be used directly for this problem

  12. Preliminaries • Atypical record: (s, t, f(s,t)) • s: sensor • t: reported time • f(s,t): severity measure • Analytical query Q(W, T, etc) • W: spatial region • T: time period • There might be query conditions on other dimensions • Return total severity: • Too abstract
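The formula for the total severity did not survive in this transcript. A plausible reading, assuming the total severity simply sums the severity measure f(s,t) over every sensor in the region W and every reported time in the period T (the symbol F is introduced here only for illustration and is not from the slides):

```latex
F(W, T) = \sum_{s \in W} \sum_{t \in T} f(s, t)
```

A single number of this kind hides where and when the severity accumulated, which is consistent with the slide's remark that it is too abstract.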

  13. Problem Formulation • Let R be the CPS dataset. The task is to retrieve the atypical events from R, design a measure to represent each event, and integrate the information of multiple events • Process the analytical query Q online • We assume the atypical criterion is given and the atypical dataset can be acquired in advance

  14. System Overview

  15. Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

  16. Atypical Event • Let us examine an atypical event, congestion in a traffic monitoring system: • starts from a single street segment • expands along the road and influences nearby roads • may cover hundreds of road segments when reaching full size • The data records in a congestion are spatially close and temporally related

  17. Retrieve the Atypical Event • Scan the dataset, retrieve the atypical records and group them by a time threshold and a distance threshold • The atypical event is a set of atypical records • Its size is not bounded (or bounded only by the size of dataset R) • Difficult to represent and integrate • Too detailed -- not a good measure
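The grouping step on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the grouping rule, so the sketch links two atypical records when they fall within both thresholds and treats each connected group as one event. The `Record` fields, the planar coordinates, and the names `group_events`, `max_dist` and `max_gap` are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Record:
    sensor: str
    x: float         # hypothetical sensor position (e.g., miles)
    y: float
    t: float         # reported time (e.g., minutes)
    severity: float  # f(s, t)


def group_events(records, max_dist, max_gap):
    """Group atypical records into events.

    Assumption: two records are linked when they are within max_dist in
    space and max_gap in time; an event is a connected group of records.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(records):
        for j in range(i + 1, len(records)):
            b = records[j]
            close_in_time = abs(a.t - b.t) <= max_gap
            close_in_space = hypot(a.x - b.x, a.y - b.y) <= max_dist
            if close_in_time and close_in_space:
                parent[find(i)] = find(j)

    events = {}
    for i, r in enumerate(records):
        events.setdefault(find(i), []).append(r)
    return list(events.values())
```

Because records are linked transitively, an event can keep growing as congestion spreads, which matches the slide's point that the size of an event is not bounded.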

  18. Atypical Micro-Cluster • Aggregate the atypical records along one dimension at a time • Summarize the total severity by sensor (sensor/spatial feature) • Summarize the total severity by time window (temporal feature) • The size is bounded by the total number of sensors and time windows • Still keeps detailed information
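A minimal sketch of the two summaries described on this slide, assuming each atypical record is a `(sensor, time, severity)` tuple and that time windows have a fixed width. The function name `micro_cluster` and the `window` parameter are hypothetical.

```python
from collections import Counter


def micro_cluster(records, window):
    """Summarize one event as a spatial and a temporal feature.

    records: iterable of (sensor, time, severity) tuples.
    window:  width of a time window, in the same unit as `time`.
    """
    spatial = Counter()   # total severity per sensor
    temporal = Counter()  # total severity per time window index
    for sensor, t, severity in records:
        spatial[sensor] += severity
        temporal[int(t // window)] += severity
    return spatial, temporal
```

However many records the event contains, `spatial` has at most one entry per sensor and `temporal` at most one entry per window, which is the size bound the slide refers to.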

  19. Example in Congestion Event

  20. Integrate the Micro-clusters • Each micro-cluster represents an individual event • Atypical events may happen in similar places/times • For example, the 10E highway is congested in the evening rush hours on weekdays • For analytical purposes, it is helpful to group those similar congestions as a whole • Two sub-problems: • Which ones to merge? • How to merge?

  21. Similarity Measure for Atypical Clusters • Basic Principles • Consider the similarity on multiple dimensions – users may specify a preference weight • Weighted measure on the data themselves (e.g., if sensor s1 reports higher severities in the clusters than s2, then the weight of s1 is higher) – employ the severity as the weight
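The slides state the two principles but not the formula. As one possible instance, and not the authors' measure, the sketch below uses a severity-weighted overlap (a weighted Jaccard ratio) inside each dimension and combines the dimensions with user preference weights. The names `weighted_overlap`, `similarity` and `prefs` are hypothetical.

```python
def weighted_overlap(f1, f2):
    """Severity-weighted overlap of two features (item -> total severity).

    Items with higher severity contribute more, so severity acts as the weight.
    """
    keys = set(f1) | set(f2)
    shared = sum(min(f1.get(k, 0), f2.get(k, 0)) for k in keys)
    total = sum(max(f1.get(k, 0), f2.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


def similarity(c1, c2, prefs):
    """Combine per-dimension overlaps using preference weights.

    c1, c2: dict mapping a dimension name to its feature dict.
    prefs:  dict mapping a dimension name to a non-negative weight.
    """
    weight_sum = sum(prefs.values())
    if weight_sum <= 0:
        raise ValueError("preference weights must sum to a positive value")
    return sum(
        w * weighted_overlap(c1[d], c2[d]) for d, w in prefs.items()
    ) / weight_sum
```

With this choice, two clusters that share only a low-severity sensor score low, while clusters that overlap on their most severe sensors score high; setting a dimension's preference to zero ignores that dimension entirely.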

  22. Cluster Integration • For two clusters C1 and C2, the system • carries out aggregation on the feature of each dimension • for the common items, sums up their severity • keeps the non-overlapping items • Example • C1 {s1, 100 min; s2, 20 min} • C2 {s1, 30 min; s3, 40 min} • C3 {s1, 130 min; s2, 20 min; s3, 40 min} • The spatial and temporal features are algebraic – efficient to aggregate
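The merge rule on this slide can be sketched for a single feature as below, assuming a feature is stored as a dict from item (sensor or time window) to total severity. The function name `integrate` is hypothetical; the input values are the ones from the slide's example.

```python
def integrate(f1, f2):
    """Merge two features: sum severities of common items, keep the rest."""
    merged = dict(f1)
    for item, severity in f2.items():
        merged[item] = merged.get(item, 0) + severity
    return merged


c1 = {"s1": 100, "s2": 20}
c2 = {"s1": 30, "s3": 40}
print(integrate(c1, c2))  # {'s1': 130, 's2': 20, 's3': 40}
```

Since the merge is a per-item sum, the result does not depend on the order in which clusters are combined, so merged clusters can themselves be merged again without going back to the raw records.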

  23. Macro-Clusters • The macro-clusters are generated by merging the micro-clusters • The similarities are computed among those macro-clusters and even larger ones can be further generated
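One way to picture the repeated merging is a greedy loop that keeps combining the most similar pair until no pair is similar enough. This is a sketch under assumptions, not the authors' algorithm: the stopping `threshold`, the function name `build_macro_clusters`, and the two small helpers (a plain key-set Jaccard and a summing merge) are hypothetical stand-ins for whatever similarity and integration rules are actually used.

```python
def key_jaccard(c1, c2):
    """Stand-in similarity: Jaccard overlap of the item sets."""
    a, b = set(c1), set(c2)
    return len(a & b) / len(a | b) if a | b else 0.0


def sum_merge(c1, c2):
    """Stand-in merge: sum severities of common items, keep the rest."""
    merged = dict(c1)
    for item, severity in c2.items():
        merged[item] = merged.get(item, 0) + severity
    return merged


def build_macro_clusters(clusters, similarity, merge, threshold):
    """Greedily merge the most similar pair until none reaches threshold."""
    clusters = [dict(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = threshold, None
        # Compare every pair: this scan is quadratic in the cluster count.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The merged cluster re-enters the pool, so it can be compared with the others and absorbed into an even larger macro-cluster on a later pass.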

  24. Clustering Forest • The clusters form a tree hierarchy • Different aggregation paths (preferences on dimensions) form a cluster forest

  25. Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

  26. The Efficiency Problem on Online Query • Usually it is not realistic to materialize the entire data forest • Only some intermediate results (i.e., the micro-clusters in lower-level cells) are pre-computed (partial materialization) • The time complexity of the cluster integration algorithm is O(n²) • Query efficiency suffers if n is large – the analytical query Q(W, T) usually covers a large region over a long time, so n is indeed large

  27. The Effectiveness Problem • In the result, only a few significant macro-clusters are generated • The rest are trivial ones that cannot be aggregated with others

  28. Pruning-beforehand Strategy • Filter out the insignificant micro-clusters • However, insignificant micro-clusters may integrate together and generate significant macro-clusters • Can we foretell which micro-clusters will contribute to significant macro-clusters?

  29. Red-Zone Guided Clustering • It is fast to compute the total severity in a specified region • Select the regions with high severities (red zones) • Filter out the micro-clusters located outside those red zones • Only keep the ones inside or intersecting the red zones (where the significant macro-clusters may be located)
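The filtering idea can be sketched as below. The slides do not say how regions are defined or how a red zone is selected, so this sketch assumes a hypothetical mapping from each sensor to a named region and a fixed severity `threshold`; all function and parameter names are illustrative.

```python
from collections import defaultdict


def region_totals(micro_clusters, sensor_region):
    """Total severity per region, summed over all micro-clusters."""
    totals = defaultdict(float)
    for cluster in micro_clusters:
        for sensor, severity in cluster.items():
            totals[sensor_region[sensor]] += severity
    return dict(totals)


def red_zones(totals, threshold):
    """Regions whose total severity reaches the (assumed) threshold."""
    return {region for region, total in totals.items() if total >= threshold}


def filter_by_red_zones(micro_clusters, sensor_region, zones):
    """Keep micro-clusters with at least one sensor inside a red zone."""
    return [
        c for c in micro_clusters
        if any(sensor_region[s] in zones for s in c)
    ]
```

Unlike pruning each micro-cluster by its own severity, this keeps a small micro-cluster as long as it touches a region that is severe overall, so clusters that only become significant after merging are not discarded early.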

  30. Red-Zone Guided Clustering Example

  31. Outline • Introduction • Backgrounds • Model Construction • Query Processing • Experiments

  32. Experiment Setup • PeMS datasets from UC Berkeley • 1 year of traffic data • 4,076 loop detectors on 38 freeways in California • 54 GB in total • Hardware • Intel 2200 Dual CPU @ 2.20 GHz and 2.19 GHz • 1.98 GB RAM; Windows XP SP2 • All the algorithms are implemented in Java

  33. Model Construction • Comparing Atypical Cluster (AC) with Original CubeView (OC) and Modified CubeView (MC) • AC is an order of magnitude faster than OC

  34. Query Efficiency • All: do not prune; Pru: prune beforehand; Gui: guided clustering • Gui takes 20% of the time of All, and is close to Pru

  35. Query Effectiveness • Ground truth: generated by All • Pru may miss truly significant macro-clusters, but Gui can guarantee the recall

  36. Conclusions • We have investigated the problem of multi-dimensional analysis of atypical events in CPS • The atypical cluster is designed to represent the event and serve as the measure for the data model • The red-zone algorithm is proposed to retrieve the significant clusters for analytical queries • Performance evaluation on large real datasets • Thank You Very Much! Any Questions?
