
Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data

Rebecca Buchheit, AIS Lab


Presentation Transcript


  1. Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data • Rebecca Buchheit • AIS Lab

  2. Background • sporadic use of KDD techniques in civil infrastructure • relative youth of data mining research • difficult to systematically apply KDD process • KDD process tools (CRISP-DM) still under development • KDD process highly domain dependent • time consuming to teach data mining analysts domain knowledge

  3. Research Objectives • develop a framework for systematically applying KDD process to civil infrastructure data analysis needs • set of guidelines for inexperienced analysts • checklist for more experienced analysts • describe intersection of KDD process characteristics and civil infrastructure • what problems are well-suited to KDD? • what characteristics are unique to infrastructure?

  4. Summary • increased data collection => increased need to intelligently analyze data • KDD process as a “power tool” for analyzing data for high-level knowledge • civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results • proposed framework will help researchers to systematically apply KDD process to their data analysis problems

  5. Data Quality • What is it? • in this talk, “accuracy” • how close is the observed value to the true value? • “ground truth” is rare • look for anomalous patterns • Why is it important? • poor quality data may taint analyses • patterns of poor quality data may overwhelm data mining/machine learning algorithms

  6. Mn/ROAD Data • weigh-in-motion data • axle spacings and weights, speed, lane, error codes • derived quantities • equivalent standard axle loads (ESALs) • FHWA vehicle type • gross vehicle weight • total vehicle length • trucks only (type >= 4) • Jan 1 ’98 to Dec 31 ’00 • about 3 million vehicles • data courtesy of Mn/ROAD
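As a rough illustration of the derived quantities on this slide, the sketch below computes gross vehicle weight as the sum of the axle weights and applies the "trucks only" filter. The function names are hypothetical, and using the sum of axle spacings as a stand-in for vehicle length is an assumption: it gives only the first-to-last axle distance, and the slides do not say how total vehicle length was actually derived.

```python
def gross_vehicle_weight(axle_weights):
    """Gross vehicle weight is the sum of the individual axle weights."""
    return sum(axle_weights)


def axle_span(axle_spacings):
    """Distance from first to last axle (assumed proxy for vehicle length)."""
    return sum(axle_spacings)


def is_truck(fhwa_type):
    """The data set keeps trucks only, i.e. FHWA vehicle type >= 4."""
    return fhwa_type >= 4
```

For a five-axle vehicle there are four spacings, so `axle_span` takes one value fewer than `gross_vehicle_weight`.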

  7. Sample Data

  8. Overview of Approach • use statistical analysis and data mining algorithms to separate anomalies from normal data • clustering • regression • physical constraints • statistical properties • focus on differences between anomalies and normal data to help discover causation

  9. Clustering • group data into “natural classes” • anomalies separated from normal data • used Autoclass clustering algorithm

  10. Clustering Results

  11. Regression • confidence interval of 95% • R-square (fit) = 0.923 • if error > 15% then identify as anomaly • ∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
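The regression check can be sketched as follows, using the point estimates of the coefficients from the slide. Two things here are assumptions rather than statements from the talk: that the sums are taken per day, and that "error" means the relative difference between the observed and predicted ESAL totals. The function names are hypothetical.

```python
def predicted_esal(n_vehicles, n_axles, total_gvw):
    """Predicted ESAL total from the fitted model (point estimates only)."""
    return 3.531 * n_vehicles - 1.252 * n_axles + 0.066 * total_gvw - 139.000


def is_regression_anomaly(observed_esal, n_vehicles, n_axles, total_gvw,
                          tolerance=0.15):
    """Flag a period whose observed ESAL total is off by more than 15%."""
    predicted = predicted_esal(n_vehicles, n_axles, total_gvw)
    if predicted == 0:
        # Relative error is undefined; treat as anomalous for manual review.
        return True
    relative_error = abs(observed_esal - predicted) / abs(predicted)
    return relative_error > tolerance
```

This ignores the ± ranges on the coefficients; a fuller check might compare against a prediction interval instead of a fixed percentage.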

  12. Regression Results

  13. Binary Constraints (1)

  14. Binary Constraints (2)

  15. Constraint Interactions

  16. Distribution Constraints • use a goodness-of-fit test to compare distributions from the same day of week • length • gross weight • ESALs • lane
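The slide does not name the goodness-of-fit test that was used. As one possible illustration, the sketch below uses a two-sample Kolmogorov-Smirnov test to compare one day's values (for example, vehicle lengths) against a baseline built from the same day of week. The choice of test, the function names and the 5% significance level are all assumptions, and a categorical quantity such as lane would need a different test (for example chi-square).

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def differs_from_baseline(day_values, baseline_values, alpha=0.05):
    """True if the day is unlikely to come from the baseline distribution.

    Uses the large-sample critical value c(alpha) * sqrt((n + m) / (n * m)).
    """
    n, m = len(day_values), len(baseline_values)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(day_values, baseline_values) > critical
```

With roughly three million vehicles over three years, daily samples are large enough that the asymptotic critical value is a reasonable approximation.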

  17. Anomaly Identification • identify days with higher than normal concentrations of binary constraint violations • identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
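A minimal sketch of the first step, flagging days with an unusually high share of binary constraint violations. The slides do not say how "higher than normal" was defined; the median plus a multiple of the median absolute deviation used here is an assumed rule, and the function name and the default `k=3.0` are hypothetical.

```python
from statistics import median


def flag_high_violation_days(violations_by_day, vehicles_by_day, k=3.0):
    """Return days whose violation rate is far above the typical rate."""
    rates = {
        day: violations_by_day.get(day, 0) / count
        for day, count in vehicles_by_day.items()
        if count > 0
    }
    typical = median(rates.values())
    # Median absolute deviation: a spread estimate that outliers barely move.
    spread = median(abs(rate - typical) for rate in rates.values())
    threshold = typical + k * spread
    return sorted(day for day, rate in rates.items() if rate > threshold)
```

A robust threshold is used because the anomalous days themselves would inflate a mean and standard deviation computed over all days.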

  18. Binary Constraints Results

  19. Distribution Constraints Results

  20. A Quick Refresher • used four different procedures to detect anomalies • clustering • regression • binary (physical) constraints • distribution constraints • next up • what is causing the anomalies? • can we fix them?

  21. Gross Vehicle Weight

  22. Lane

  23. What Happened? • two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle • lightweight vehicles are tailgating cars • cars not supposed to be in database • mis-classified because of tailgating • this causes the “high” vehicle counts • very heavy vehicles are tailgating trucks • lane 1 (right-hand side) data is missing for all “low” vehicle count days

  24. Can It Be Fixed? (1) • removed all tailgating cars • lightweight • short • 2 or 3 axles • error code • “halved” all tailgating trucks • very long • very heavy • more than 9 axles • error code
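The two repair rules on this slide might look like the sketch below. The numeric thresholds are hypothetical, since the slides give only the qualitative criteria, and reading “halved” as splitting one merged record into two records with half the weight, half the length and half the axles each is an assumption about what was actually done. The `Record` type and field names are likewise illustrative.

```python
from dataclasses import dataclass, replace

# Hypothetical cut-offs; the slides state no numbers or units.
CAR_MAX_WEIGHT = 10.0      # "lightweight"
CAR_MAX_LENGTH = 25.0      # "short"
TRUCK_MIN_WEIGHT = 120.0   # "very heavy"
TRUCK_MIN_LENGTH = 100.0   # "very long"


@dataclass(frozen=True)
class Record:
    gross_weight: float
    length: float
    axles: int
    has_error_code: bool


def is_tailgating_car(r):
    return (r.has_error_code and r.axles in (2, 3)
            and r.gross_weight <= CAR_MAX_WEIGHT
            and r.length <= CAR_MAX_LENGTH)


def is_tailgating_truck(r):
    return (r.has_error_code and r.axles > 9
            and r.gross_weight >= TRUCK_MIN_WEIGHT
            and r.length >= TRUCK_MIN_LENGTH)


def clean(records):
    """Drop tailgating cars and split tailgating trucks into two records."""
    cleaned = []
    for r in records:
        if is_tailgating_car(r):
            continue  # cars are not supposed to be in the database
        if is_tailgating_truck(r):
            first_axles = r.axles // 2
            for axles in (first_axles, r.axles - first_axles):
                cleaned.append(replace(r,
                                       gross_weight=r.gross_weight / 2,
                                       length=r.length / 2,
                                       axles=axles))
        else:
            cleaned.append(r)
    return cleaned
```

Records that match neither rule pass through unchanged, so the cleaning step only touches vehicles that carry an error code.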

  25. Can It Be Fixed? (2) • inserted lane 1 vehicles from same time period in 2000 • “shifted” days to make sure day of week was constant • Tuesday Sept 8 1998 => Tuesday Sept 5 2000
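The day “shift” can be reproduced by moving a date forward by a whole number of weeks, which keeps the day of week constant. The sketch below is one way to do that; the function name is hypothetical, and choosing the number of weeks closest to the year difference is an assumption that happens to reproduce the example on the slide.

```python
from datetime import date, timedelta


def matching_donor_date(missing_day, donor_year):
    """Date about (donor_year - missing_day.year) years away, same weekday."""
    years_apart = donor_year - missing_day.year
    # Whole weeks closest to the year gap, so the weekday is preserved.
    weeks = round(years_apart * 365.2425 / 7)
    return missing_day + timedelta(weeks=weeks)


donor = matching_donor_date(date(1998, 9, 8), 2000)
print(donor, donor.strftime("%A"))
```

For dates within a few days of January 1 or December 31 the result can fall just outside the donor year, which a real implementation would need to handle.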

  26. Summary • statistical analysis and data mining algorithms can be used to detect systematic anomalies in data • focus on differences between anomalies and normal data to help discover causation • need domain knowledge to understand causation

  27. Current Progress/Future Work • integrate algorithms into a data quality assessment program (automation) • physical constraints • distribution constraints • other statistical characteristics of data • clustering • regression, neural networks • will support infrastructure-related data collection activities • use algorithms to identify and “clean” anomalies

  28. Acknowledgements • Minnesota Department of Transportation, especially Maggi Chalkline • based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380
