
Current State-of-the-Art in Data Mining Methods @ NASA Presentation to AP-15

Dawn McIntosh, NASA Aviation Safety Technical Integration Manager, with assistance from the NASA Data-mining Team. Mar 26, 2009.


Presentation Transcript


  1. Current State-of-the-Art in Data Mining Methods @ NASA. Presentation to AP-15. Dawn McIntosh, NASA Aviation Safety Technical Integration Manager, with assistance from NASA Data-mining Team. Mar 26, 2009

  2. Domains with Vast Data Mining Research • Science: Earth Science / Climate, Space, Medicine • Business: Credit Card Fraud, Market Trends, Intrusion Detection, Directed Marketing • Engineering: Aviation, Space, All Others! • Web Analysis: Google/Yahoo! Searches, Blogosphere. Similarities abound across these domains!

  3. Let’s get into the Nitty-Gritty State-of-the-Art Methods • Monitoring • Trending • Anomaly Detection • Procedural Compliance • Text Mining • Prediction

  4. Let’s get into the Nitty-Gritty State-of-the-Art Methods • Monitoring • Trending • Anomaly Detection • Procedural Compliance • Text Mining • Prediction

  5. Monitoring – Trending • Definition: Determine whether the measured values of one or more variables change during a time period. • Aviation Safety Need (both are ongoing ASIAS activities): • Evaluate performance after Safety Enhancements are put in place • Monitor known risks over time • Challenges (in our domain and most others): • Seasonality • Missing values • Inconsistent data collection • Biases (skewed datasets, spikes in voluntary reporting, differences in thresholds warranting a report, etc.) are very real problems with trending.

  6. Monitoring – Trends Single (or a few) variables followed over time. Appropriate approach for straightforward monitoring. System-level trending, such as is done within ASIAS, has extra challenges.
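
A minimal sketch of single-variable trending, assuming the Mann-Kendall test as the trend statistic (the slides do not name a specific test, and this plain version handles neither seasonality nor ties). The monthly rates below are hypothetical illustration data, not ASIAS data.

```python
import math

def mann_kendall(values):
    """Return (S, Z) for the Mann-Kendall trend test (no tie correction)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    # S = (number of increasing pairs) - (number of decreasing pairs),
    # taken over every earlier/later pair of observations.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    # Variance of S under the no-trend hypothesis, assuming no tied values.
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z

# Hypothetical monthly event rates (events per 10,000 flights).
monthly_rate = [4.1, 4.3, 4.2, 4.6, 4.8, 4.7, 5.0, 5.2, 5.1, 5.5]
s_stat, z_score = mann_kendall(monthly_rate)
trend_detected = abs(z_score) > 1.96  # two-sided test at roughly the 5% level
print(s_stat, round(z_score, 2), trend_detected)
```

A rank-based statistic is used here because it tolerates the occasional spike better than a least-squares slope; seasonal data would need a seasonal variant or de-seasonalizing first.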

  7. Monitoring – Anomaly Detection • Definition: Finding patterns in data that do not conform to nominal, or expected, behavior, i.e., identify the unknowns. • Technically, all monitoring, even trending, is an anomaly detection problem. • Anomaly Detection approaches • Detect exceedances • Need to identify meaningful definitions of nominal (e.g., from SMEs, training material, published procedures, regulations, etc.) • Clustering: • Useful for cases with little or no data to train the algorithm to a definition of ‘nominal’ • Aviation Safety: Apply techniques to identify new or emerging risks. • Switches flipped in the cockpit or ground (temporal sequences) • Auto-collected continuous (time-series) data • Auto-classification of text reports (maintenance reports and records, submitted reports)

  8. Monitoring – Exceedances One or more variables are checked against a set of exceedance thresholds. Minimizing your False Positive Rate is an obvious step, especially in the aviation community: nuisance alerts breed complacency and a lack of confidence. However, you have to find your balance between False Positive Rate and False Negative Rate, which requires some analysis (at least at some point).
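
The balance between False Positive Rate and False Negative Rate can be explored by sweeping the exceedance threshold over labeled data. The sketch below assumes a single variable, an "alert when value > threshold" rule, and SME-provided labels; the readings, labels, and the function name `alert_rates` are hypothetical.

```python
def alert_rates(values, is_event, threshold):
    """Return (FPR, FNR) when an alert is raised for every value > threshold."""
    false_pos = sum(1 for v, e in zip(values, is_event) if v > threshold and not e)
    false_neg = sum(1 for v, e in zip(values, is_event) if v <= threshold and e)
    negatives = sum(1 for e in is_event if not e)
    positives = sum(1 for e in is_event if e)
    return false_pos / negatives, false_neg / positives

# Hypothetical parameter readings with labels (True = a real event of concern).
readings = [10, 12, 13, 15, 16, 18, 19, 21, 23, 25]
labels = [False, False, False, False, True, False, True, True, True, True]

# A low threshold gives nuisance alerts; a high one misses real events.
for threshold in (14, 17, 20):
    fpr, fnr = alert_rates(readings, labels, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

With these made-up numbers, raising the threshold trades false positives for false negatives, which is the analysis the slide says is needed at some point.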

  9. Monitoring – Clustering Even with only the most basic tools, or human vision, a set of data can be separated into clusters. [Figure: axes Variable A and Variable B]

  10. Monitoring – Clustering Finer distinctions can be made by statistical analysis. Operational relevance can be improved when many aspects of performance are incorporated. [Figure: axes Variable A and Variable B]

  11. Monitoring – Clustering And with even greater analysis tools – able to work across many variables (hundreds even) – finer distinctions can be made and more subtle anomalies identified. Imagine what anomalies and clusters we could identify if we could look at the 4th, 5th, Nth dimension… Even with beautiful algorithms, SMEs are usually necessary to identify operational relevance. ASIAS works hard to apply both. [Figure: axes Variable A, Variable B, and Variable N]

  12. Monitoring – Clustering Based on a small amount of prior data (even the start of a flight), automatically defines the bounds of ‘normal’ cluster(s) of parameters.
      Archived Nominal Data Points (VarA VarB VarC VarD VarE VarF)
        (3395 0.77 0.50 8100 1 5)
        (3394 0.78 0.52 8105 1 5)
        (3393 0.78 0.55 8107 1 2)
        (3392 0.78 0.62 8109 1 2)
        (3390 0.78 0.64 8112 2 3)
        (3388 0.78 0.65 8115 2 3)
        (3386 0.78 0.66 8118 2 3)
        (3384 0.78 0.66 8120 2 2)
      Generated Clusters (VarA VarB VarC VarD VarE VarF)
        H: (3395 0.78 0.52 8105 1 5)  L: (3394 0.77 0.50 8100 1 5)
        H: (3393 0.78 0.62 8109 1 2)  L: (3392 0.78 0.55 8107 1 2)
        H: (3390 0.78 0.66 8120 2 3)  L: (3384 0.78 0.64 8112 2 2)
      Real-time data
        (3395 0.77 0.51 8102 1 5) fits nicely in a cluster, therefore nominal
        … … …
        (3386 0.78 0.62 8111 2 2) off-nominal! Two of the six variables (VarC and VarD) are outside of the cluster bounds
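
The bounds check on this slide can be sketched as follows, using the cluster H/L vectors and real-time points exactly as given above. This is a simplified illustration, not the actual monitoring algorithm: it picks the cluster with the fewest out-of-bounds variables, and the names `violations` and `best_match` are hypothetical.

```python
# Cluster bounds transcribed from the slide: (low, high) vectors for VarA..VarF.
CLUSTERS = [
    ((3394, 0.77, 0.50, 8100, 1, 5), (3395, 0.78, 0.52, 8105, 1, 5)),
    ((3392, 0.78, 0.55, 8107, 1, 2), (3393, 0.78, 0.62, 8109, 1, 2)),
    ((3384, 0.78, 0.64, 8112, 2, 2), (3390, 0.78, 0.66, 8120, 2, 3)),
]

def violations(point, low, high):
    """Indices of the variables that fall outside [low, high] for one cluster."""
    return [i for i, (x, lo, hi) in enumerate(zip(point, low, high))
            if not lo <= x <= hi]

def best_match(point, clusters=CLUSTERS):
    """Out-of-bounds variable indices for the cluster that fits the point best."""
    return min((violations(point, lo, hi) for lo, hi in clusters), key=len)

nominal = best_match((3395, 0.77, 0.51, 8102, 1, 5))      # fits the first cluster
off_nominal = best_match((3386, 0.78, 0.62, 8111, 2, 2))  # VarC and VarD are out

print("nominal point violations:", nominal)
print("off-nominal point violations:", off_nominal)
```

An empty list means the point sits inside a known nominal cluster; a non-empty list names the variables an SME would look at first.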

  13. Monitoring – Clustering • Examples: • Slowly evolving anomalies (e.g., deterioration, material fatigue) • Fast evolving anomalies (e.g., temperature or pressure fluctuations) • This can be done with little data (e.g., launch data from only five previous flights), although more data is often better. [Figure: Post-flight IMS analysis of wing temperature sensors for the Columbia STS Shuttle launch which ended in tragedy. Reference [I]]

  14. Monitoring – (Individual) Procedural Compliance • Takeoff/landing procedure compliance • Identify when a pilot skips a procedural step • … or adds a step • … or does steps out of order (combo of the previous two bullets) • Example: sequenceMiner can detect and characterize anomalies in sequences of switches flipped in a cockpit • E.g., has identified cases of pilot mode confusion and other pilot-automation issues • Temporal sequences • Identify when a pilot does a step later in the landing/takeoff process than is typical • Variance in times is very common, so this would only be able to identify big offenders
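
The skipped / added / out-of-order distinction can be illustrated with a sequence comparison. This sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in; it is not sequenceMiner, and the switch names and the function `compare_to_procedure` are hypothetical.

```python
from difflib import SequenceMatcher

def compare_to_procedure(expected, observed):
    """Classify deviations of an observed switch sequence from the expected one."""
    skipped, added = [], []
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            skipped.extend(expected[i1:i2])   # expected steps with no match
        if tag in ("insert", "replace"):
            added.extend(observed[j1:j2])     # observed steps with no match
    # A step that appears as both skipped and added was done out of order
    # (the "combo of the previous two bullets" on the slide).
    out_of_order = [step for step in skipped if step in added]
    return {
        "skipped": [s for s in skipped if s not in out_of_order],
        "added": [s for s in added if s not in out_of_order],
        "out_of_order": out_of_order,
    }

# Hypothetical landing checklist and one observed flight.
expected_steps = ["gear_down", "flaps_30", "spoilers_armed",
                  "autobrake_set", "landing_lights_on"]
observed_steps = ["gear_down", "spoilers_armed", "flaps_30",
                  "landing_lights_on"]

result = compare_to_procedure(expected_steps, observed_steps)
print(result)
```

In this made-up flight, one step is missing entirely and two adjacent steps are swapped; the alignment reports one of the swapped pair as out of order.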

  15. Let’s get into the nitty-gritty State-of-the-Art Techniques • Monitoring • Trending • Anomaly Detection • Procedural Compliance • Text Mining • Prediction

  16. Text Mining • Purpose: • To discover new and emerging problems (use clustering techniques) • Identify or confirm emerging risks and issues • Determine contributing or causal factors • To monitor known problems (use classification or language-driven techniques) • Measure the performance of new rules and regulations (e.g., safety enhancements) and their effect • Support decision-making • Aviation Safety Need • Currently the best approach for capturing the human behavior component of this highly integrated system • Ultimately, the ‘why’ • Challenges • Classifiers need lots of labeled data • A collection of narratives/reports/documents is very sparse, high-dimensional data • Multiple authors and multiple users • Standard issues with written narratives: • Misspellings • Unconventional (or overused) acronyms • Incomplete sentences • Domain-specific language

  17. Text Mining Analysis using SMEs is common in both ASRS and ASIAS Two automated analysis approaches (which can be combined): • Data-driven • Clustering: • The previous slides on Monitoring covered analyzing numeric data for anomalous behavior. • Amazingly, we can use the exact same techniques to analyze text reports • Classification: • Established methods: Classifies reports into pre-defined categories • Newer methods: Identifies prevalent features (and combinations of features) • Language-driven – in backup slides • Very precise representation of concepts • Rule-based

  18. Text Mining – Data-driven Established methods can be used to auto-classify text reports. [Figure labels: Sample ASRS-like Report; Sample of Pre-defined Categories] Sample ASRS-like report: AT XX:XX ON DD/MM/YYYY, JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE AIRLINE Y ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, ‘CLRED TO LAND, ACFT ON THE RNWY.’ SINCE THE AIRLINE Y ACFT IN FRONT OF US WAS CLR OF RWY AA AND WE BOTH MISUNDERSTOOD TWR’S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED…

  19. Text Mining – Data-driven Newer numeric methods first train a text auto-classifier… [Figure labels: Sample ASRS-like Reports; Extract features; Sample of Pre-defined Categories] Sample ASRS-like report (shown several times on the slide as a stack of training reports): AT XX:XX ON DD/MM/YYYY, JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE AIRLINE Y ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, ‘CLRED TO LAND, ACFT ON THE RNWY.’ SINCE THE AIRLINE Y ACFT IN FRONT OF US WAS CLR OF RWY AA AND WE BOTH MISUNDERSTOOD TWR’S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED…

  20. Text Mining – Data-driven Incoming reports can then be auto-classified using the pre-built data model. [Figure labels: Incoming sample ASRS-like report; Feature List; Sample of Pre-defined Categories] Incoming sample ASRS-like report: AT XX:XX ON DD/MM/YYYY, JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE AIRLINE Y ACFT IN FRONT OF US. BOTH THE COPLT AND I, HOWEVER, UNDERSTOOD TWR TO SAY, ‘CLRED TO LAND, ACFT ON THE RNWY.’ SINCE THE AIRLINE Y ACFT IN FRONT OF US WAS CLR OF RWY AA AND WE BOTH MISUNDERSTOOD TWR’S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED…
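
The train-then-classify flow of the last three slides can be sketched with a small bag-of-words classifier. Multinomial naive Bayes with add-one smoothing is assumed here as a stand-in, since the slides do not name an algorithm; the training snippets, the category names "runway" and "altitude", and the class name are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Extract features: lower-cased alphabetic tokens (bag of words)."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesTextClassifier:
    def fit(self, reports, categories):
        """Build the data model from labeled reports."""
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(categories)
        for text, category in zip(reports, categories):
            self.word_counts[category].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Assign an incoming report to the most probable category."""
        total_docs = sum(self.doc_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]
        scores = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[category] / total_docs)
            for w in tokens:
                # Add-one (Laplace) smoothing avoids log(0) for unseen words.
                score += math.log((counts[w] + 1) / (total_words + len(self.vocab)))
            scores[category] = score
        return max(scores, key=scores.get)

# Hypothetical labeled ASRS-like snippets.
train_reports = [
    "TWR CLRED US TO LAND WITH ACFT STILL ON THE RWY",
    "CROSSED THE HOLD SHORT LINE ONTO ACTIVE RWY WITHOUT CLRNC",
    "CLBED THROUGH ASSIGNED ALT BEFORE AUTOPLT CAPTURED",
    "DSNDED BELOW ASSIGNED ALT DURING APCH",
]
train_labels = ["runway", "runway", "altitude", "altitude"]

model = NaiveBayesTextClassifier().fit(train_reports, train_labels)
incoming = "TWR TOLD US TO GO AROUND BECAUSE OF ACFT ON THE RWY"
print(model.predict(incoming))
```

Four training reports are far too few in practice; the slides note that classifiers need lots of labeled data.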

  21. Let’s get into the nitty-gritty State-of-the-Art Techniques • Monitoring • Trending • Anomaly Detection • Procedural Compliance • Text Mining • Prediction

  22. Prediction • Purpose: • To detect precursors to failure as early as possible • To estimate the lifespan of a component • Aviation Safety Need: • Better estimations supporting maintenance-on-demand

  23. Prediction Estimation of Remaining Useful Life of a component, subsystem, etc. As expected, more data over time improves both the accuracy of the prediction and the precision (the time range of the prediction)
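
A minimal sketch of a Remaining Useful Life estimate, assuming a linear degradation trend that is extrapolated to a failure threshold. The slides do not prescribe this model; the capacity readings, the 0.80 threshold, and the function names are hypothetical, and a straight-line fit gives no measure of prediction certainty.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t

def remaining_useful_life(times, health, failure_threshold):
    """Time from the last observation until the fitted line hits the threshold."""
    slope, intercept = fit_line(times, health)
    if slope >= 0:
        return None  # no downward trend, so no failure time can be extrapolated
    time_of_failure = (failure_threshold - intercept) / slope
    return max(0.0, time_of_failure - times[-1])

# Hypothetical component health: capacity as a fraction of rated capacity.
cycles = [0, 100, 200, 300, 400]
capacity = [1.00, 0.98, 0.96, 0.94, 0.92]

rul = remaining_useful_life(cycles, capacity, failure_threshold=0.80)
print(f"estimated remaining useful life: {rul:.0f} cycles")
```

Re-running the fit as each new observation arrives is one simple way to see the slide's point that the estimate tightens as more data accumulates.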

  24. Prediction Prediction algorithms have improved in recent years and can now be successfully applied to more complicated time series, even given a small set of data. Newer algorithms (e.g., Gaussian processes) can make iterated forecasts and detect a precursor to a sudden drop in intensity. Just as importantly, they can also generate a meaningful measure of prediction certainty.

  25. References
      [A] https://dashlink.arc.nasa.gov/static/dashlink/media/topic/Ganguly_ICDM-SSTDM-slides_2008.pdf
      [B] http://www.cs.umn.edu/tech_reports_upload/tr2007/07-017.pdf
      [C] Applications of Data Mining in Computer Security, edited by Daniel Barbara and Sushil Jajodia, pp. 1-12, Springer, 2002.
      [D] Srivastava, A. 2006. Enabling the discovery of recurring anomalies in aerospace problem reports using high-dimensional clustering techniques. Aerospace Conference, 2006 IEEE, 17-34.
      [E] Matthews, B. and Srivastava, A.N., Comparative Analysis of Data-Driven Anomaly Detection Methods on Solid Rocket Motor Faults, Proceedings of the Joint Army Navy NASA Air Force Conference on Propulsion, Orlando, FL, 2008.
      [F] Cohen KB, Hunter L 2008 Getting Started in Text Mining. PLoS Computational Biology 4(1): e20 doi:10.1371/journal.pcbi.0040020
      [G] http://www.hydrol-earth-syst-sci.net/5/679/2001/hess-5-679-2001.pdf Vassilis Z. Antonopoulos, Dimitris M. Papamichail and Konstantina A. Mitsiou, Statistical and trend analysis of water quality and quantity data for the Strymon River in Greece, Hydrology and Earth System Sciences, 5(4), 679-691, 2001.
      [H] S. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica Journal 31 (2007) 249-268
      [I] David L. Iverson, Inductive System Health Monitoring, Proceedings of the 2004 International Conference on Artificial Intelligence (IC-AI’04), CSREA Press, Las Vegas, NV, June 2004.
      [J] Newell, C.J., Aziz, J.J. and Vanderford, Mindy, AFCEE Monitoring and Remediation Optimization System Software Appendix A.2: Statistical Trend Analysis Methods. http://www.gsi-net.com/Software/maros_dl/append/Appendix2_Trends.pdf, url confirmed on 02/02/2009.
      [K] Irving C. Statler and Ashok N. Srivastava, Proactively Managing Aviation-System Safety Risk: Recent Advances in mining aviation safety data, NASA ARMD Technical Seminar, Sept. 22, 2006, https://dashlink.arc.nasa.gov/static/dashlink/media/topic/DataMiningResults-ARMDSeminar_2.pdf
      [L] B. Saha, K. Goebel, and J. Christopherson, Comparison of Prognostic Algorithms for Estimating Remaining Useful Life of Batteries, submitted to the Transactions on the Royal UK Institute on Measurement and Control, special issue on Intelligent Fault Diagnosis and Prognosis for Engineering Systems, 2008, http://ti.arc.nasa.gov/m/pub/1454/1454%20(Saha).pdf.
      [M] Ashok. N. Srivastava and Santanu Das, Making Predictions at the Edge of Chaos, presentation at the NASA Ames Research Center Director’s Colloquium, June 2007.

  26. Thank You! Dawn.M.McIntosh@nasa.gov

  27. Backup Slides

  28. Monitoring - Trends Trend Distribution • Instead of a single data point in time, it is more likely that we will have a distribution of data. • We can trend assuming a particular distribution and examining one of its components: arithmetic mean, standard deviation, variance, coefficient of variation, etc. Reference: [J]

  29. Monitoring – Trends Is something in a small fleet observable? A big fleet with lots of data can dominate a system-level distribution. This can be a pitfall when you look across an entire system for isolated perturbations. [Figure: distributions for the big fleet, the small fleet, and the fleet combo in Month 1 and Month 2, each with its mean marked]
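
A small numeric illustration of this pitfall, using made-up fleet sizes and values: a large shift in the small fleet barely moves the system-level (pooled) mean. The function name `monthly_means` and all numbers are hypothetical.

```python
from statistics import mean

def monthly_means(fleets):
    """Return per-fleet means and the pooled (system-level) mean."""
    per_fleet = {name: mean(values) for name, values in fleets.items()}
    pooled = mean([v for values in fleets.values() for v in values])
    return per_fleet, pooled

# Hypothetical measurements: 1000 flights in the big fleet, 20 in the small one.
month_1 = {"big": [10.0] * 1000, "small": [10.0] * 20}
month_2 = {"big": [10.0] * 1000, "small": [13.0] * 20}  # small fleet shifts by 3

fleet_1, pooled_1 = monthly_means(month_1)
fleet_2, pooled_2 = monthly_means(month_2)

small_shift = fleet_2["small"] - fleet_1["small"]
pooled_shift = pooled_2 - pooled_1
print(f"small-fleet shift: {small_shift:.2f}, pooled shift: {pooled_shift:.3f}")
```

Trending each fleet separately, alongside the pooled view, is one way to keep an isolated perturbation from disappearing into the system-level distribution.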

  30. Monitoring – Procedural Compliance • Requires a quantifiable definition of compliance. • Benchmarks immediately come to mind, although benchmarks may not enable comparison between different operations (especially small, unique operations): • Different recording rates • Different airports or flight profiles • Different equipment • There is benefit in capturing data, even of small operations, if handled appropriately: • We can measure compliance relative to immediately prior flights in that operation • We can capture what they do record and identify fleet-level compliance once dataset is large enough

  31. Text Mining – Language-driven The most prevalent language-driven text mining approach is called Natural Language Processing. • Benefits: • The rules can be written or tuned very precisely for the specific domain • Process is very interpretable by humans – that’s good when the result is being used to support a decision • Challenges • Rule-writing can be a long, arduous process • Rules frequently don’t translate between domains, so a new technical ‘lexicon’ (even using GA, helicopter, etc. terms) would require modification to the set of rules.

  32. Text Mining – Language-driven Standard components of Natural Language Processing: • Part-of-speech tagging • To extract nouns, verbs, dates, etc. • Word-sense disambiguation • Overused terms: “The Excel table printout is on the table” • Implied relationships: “We are officemates” vs. “We are engineers” • Phrase Extraction • “Commercial Aviation Safety Team” • Identifying synonyms • Acronyms and abbreviations • “NASA” = “National Aeronautics and Space Administration” • “wx” = “weather” • Others • “President of the United States” = “Barack Obama”
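
The acronym and abbreviation step can be sketched as a rule-based lookup. Only the "wx" = "weather" pair comes from the slide; the other lexicon entries are assumed expansions of abbreviations seen in the sample report, and the function name is hypothetical.

```python
import re

# Hypothetical domain lexicon; a real one would be curated with SMEs.
LEXICON = {
    "wx": "weather",      # from the slide
    "twr": "tower",       # assumed expansion
    "acft": "aircraft",   # assumed expansion
    "rwy": "runway",      # assumed expansion
    "coplt": "copilot",   # assumed expansion
}

def expand_abbreviations(text, lexicon=LEXICON):
    """Replace whole-word abbreviations, preserving all-caps report style."""
    def replace(match):
        word = match.group(0)
        expansion = lexicon.get(word.lower())
        if expansion is None:
            return word
        return expansion.upper() if word.isupper() else expansion
    return re.sub(r"[A-Za-z]+", replace, text)

print(expand_abbreviations("LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT"))
```

A lookup table like this is easy for a reviewer to read and tune, but it has to be rewritten for each new domain lexicon, which matches the trade-off described on the previous slide.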
