
Monte F. Hancock, Jr. Chief Scientist Celestech, Inc.



  1. These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist Celestech, Inc.

  2. Data Mining is the detection, characterization, and exploitation of actionable patterns in data.

  3. Data Mining (DM) • Data Mining (DM) is the principled detection, characterization, and exploitation of actionable patterns in data. It is performed by applying modern mathematical techniques to collected data in accordance with the scientific method. • DM uses a combination of empirical and theoretical principles to Connect Structure to Meaning by: selecting and conditioning relevant data; identifying, characterizing, and classifying latent patterns; and presenting useful representations and interpretations to users. • DM attempts to answer these questions: What patterns are in the information? What are the characteristics of these patterns? Can “meaning” be ascribed to these patterns and/or their changes? Can these patterns be presented to users in a way that will facilitate their assessment, understanding, and exploitation? Can a machine learn these patterns and their relevant interpretations?

  4. DM for Decision Support • “Decision Support” is all about… • enabling users to group information in familiar ways • controlling complexity by layering results (e.g., drill-down) • supporting user’s changing priorities • allowing intuition to be triggered (“I’ve seen this before!”) • preserving and automating perishable institutional knowledge • providing objective, repeatable metrics (e.g., confidence factors) • fusing & simplifying results • automating alerts on important results (“It’s happening again!”) • detecting emerging behaviors before they consummate (“Look!”) • delivering value (timely-relevant-accurate results) • …helping users make the best choices.

  5. DM Provides “Intelligent” Analytic Functions • Automating pattern detection – to characterize complex, distributed signatures that are worth human attention… and recognize those that are not. • Associating events – that “go together” but are difficult for humans to correlate. • Characterizing interesting processes – not just facts or simple events. • Detecting actionable anomalies – and explaining what makes them “different AND interesting”. • Describing contexts – from multiple perspectives – with numbers, text, and graphics.

  6. DM Answers Questions Users are Asking • Fusion Level 1: Who/What is Where/When in my space? • Organize and present facts in domain context • Fusion Level 2: What does it mean? • Has this been seen before? What will happen next? • Fusion Level 3: Do I care? • Enterprise relevance? What action should be taken? • Fusion Level 4: What can I do better next time? • Adaptation by pattern updates and retraining • How certain am I? • Quantitative assessment of evidentiary pedigree

  7. Useful Data Applications • Accurate identification and classification – add value to raw data by tagging and annotation (e.g., fraud detection) • Anomaly / normalcy and fusion – characterize, quantify, and assess “normalcy” of patterns and trends (e.g., network intrusion detection) • Emerging patterns and evidence evaluation – capturing institutional knowledge of how “events” arise and alerting when they emerge • Behavior association – detection of actions that are distributed in time & space but “synchronized” by a common objective: “connecting the dots” • Signature detection and association – detection & characterization of multivariate signals, symbols, and emissions (e.g., voice recognition) • Concept tagging – reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots) • Software agents assisting analysts – small-footprint “fire-and-forget” apps that facilitate search, collaboration, etc.

  8. Some “Good” Data Mining Analytic Applications • Help the user focus via unobtrusive automation • Off-load burdensome labor (perform intelligent searches, smart winnowing) • Post “smart” triggers/tripwires to data streams (e.g., anomaly detection) • Help with mission triage (“Sort my in-basket!”) • Automate aspects of classification and detection • Determine which sets of data hold the most information for a task • Support construction of ad hoc “on-the-fly” classifiers • Provide automated constructs for merging decision engines (multi-level fusion) • Detect and characterize “domain drift” (the “rules of the game” are changing) • Provide functionality to make the best estimate of “missing data” • Extract/characterize/employ knowledge • Rule induction from data; develop “signatures” from data • Implement reasoning for decision support • High-dimensional visualization • Embed “decision explanation” capability into analytic applications • Capture/automate/institutionalize best practice • Make proven analytic processes available to all • Capture rare, perishable human knowledge… and put it everywhere • Generate “signature-ready” prose reports • Capture and characterize the analytic process to anticipate user needs

  9. Things that make “hard” problems VERY hard • Events of interest occur relatively infrequently in very large datasets (“population imbalance”) • Information is distributed in a complex way across many features (the “feature selection problem”) • Collection is hard to task, and data are difficult to prepare for analysis and are never “perfect” (“noise” in the data, data gaps, coverage gaps) • Target patterns are ambiguous/unknown; “squelch” settings are brittle (e.g., hard to balance detection vs. “false-alarm” rates) • Target patterns change/morph over time and across operational modes (“domain drift”; processing methods become “stale”)

  10. Some Key Principles of “Information Driven” Data Mining • Right People, Methods, Tools (in that order) • Make no prior assumptions about the problem (“agnostic”) • Begin with general techniques that let the data determine the direction of the analysis (“Funnel Method”) • Don’t jump to conclusions; perform process audits as needed • Don’t be a “one widget wonder”; integrate multiple paradigms so the strengths of one compensate for the weaknesses of another • Break the problem into the right pieces (“Divide and Conquer”) • Work the data, not the tools, but automate when possible • Be systematic, consistent, and thorough; don’t lose the forest for the trees • Document the work so that it is reproducible • Collaborate to avoid surprises: team members, experts, customer • Focus on the Goal: maximum value to the user within cost and schedule

  11. Select Appropriate Machine Reasoners 1.) Classifiers: Classifiers ingest a list of attributes, and determine into which of finitely many categories the entity exhibiting these attributes falls. Automatic object recognition and next-event prediction are examples of this type of reasoning. 2.) Estimators: Estimators ingest a list of attributes, and assign some numeric value to the entity exhibiting these attributes. The estimation of a probability or a "risk score" is an example of this type of reasoning. 3.) Semantic Mappers: Semantic mappers ingest text (structured, unstructured, or both), and generate a data structure that gives the "meaning" of the text. Automatic gisting of documents is an example of this type of reasoning. Semantic mapping generally requires some kind of domain model. 4.) Planners: Planners ingest a scenario description, and formulate an efficient sequence of feasible actions that will move the domain to the specified goal state. 5.) Associators: Associators sample the entire corpus of domain data, and identify relationships among entities. Automatic clustering of data to identify coherent subpopulations is a simple example. A more sophisticated example is the forensic analysis of phone, flight, and financial records to infer the structure of terrorist networks.
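
For illustration, a minimal Python sketch (not from the original deck) of the first two reasoner types: an estimator maps an attribute list to a number, and a classifier maps the same attributes into one of finitely many categories. The weights and threshold are hypothetical.

    # Estimator: maps an attribute list to a numeric value (e.g., a risk score).
    def estimate_risk(attrs):
        weights = [0.4, 0.35, 0.25]  # hypothetical regression coefficients
        return sum(w * a for w, a in zip(weights, attrs))

    # Classifier: maps the same attributes into one of finitely many categories.
    def classify(attrs, threshold=0.5):  # hypothetical decision threshold
        return "high-risk" if estimate_risk(attrs) >= threshold else "low-risk"

    print(estimate_risk([0.9, 0.2, 0.7]))  # 0.605
    print(classify([0.9, 0.2, 0.7]))       # high-risk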

  12. Embedded Knowledge… • Principled, domain-savvy synthesis of “circumstantial” evidence • Copes well with ambiguous, incomplete, or incorrect input • Enables justification of results in terms domain experts use • Facilitates good pedagogical aids • “Solves the problem like the man does”, and so is comprehensible to most domain experts. • Degrades linearly in combinatorial domains • Can grow in power with “experience” • Preserves perishable expertise • Allows efficient incremental upgrade/adjustment/repurposing

  13. Features • A feature is the value assumed by some attribute of an entity in the domain (e.g., size, quality, age, color, etc.) • Features can be numbers, symbols, or complex data objects • Features are usually reduced to some simple form before modeling is performed; in practice, features are usually single numeric values or contiguous strings.

  14. Feature Space • Once the features have been designated, a feature space can be defined for a domain by placing the features into an ordered array in a systematic way. • Each instance of an entity having the given features is then represented by a single point in n-dimensional Euclidean space: its feature vector. • This Euclidean space, or feature space for the domain, has dimension equal to the number of features. • Feature spaces can be one-dimensional, infinite-dimensional, or anywhere in between.
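
A small Python sketch of this construction, with hypothetical entities and attribute names: fix an ordered feature array, map each entity to a point in feature space, and measure Euclidean distance between points.

    # Fix an ordered array of features: this defines a 3-dimensional feature space.
    FEATURE_ORDER = ["size", "age", "quality"]

    def to_feature_vector(entity):
        """Represent an entity as a single point in feature space."""
        return tuple(entity[f] for f in FEATURE_ORDER)

    e1 = {"size": 4.2, "age": 11.0, "quality": 0.87}  # hypothetical entities
    e2 = {"size": 3.9, "age": 2.5, "quality": 0.91}
    v1, v2 = to_feature_vector(e1), to_feature_vector(e2)

    # Euclidean distance between the two feature vectors
    dist = sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5
    print(v1, v2, round(dist, 3))  # distance is about 8.505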

  15. How do classifiers work?

  16. Machines • Data mining paradigms are characterized by • a “concept of operation” (CONOP: component structure, I/O, training algorithm, operation) • an architecture (component type, number, arrangement, semantics) • a set of parameters (weights/coefficients/vigilance parameters; it is assumed here that parameters are real numbers) • A machine is an instantiation of a data mining paradigm. • Examples of parameter sets for various paradigms • Neural Networks: interconnect weights • Belief Networks: conditional probability tables • Kernel-Based classifiers (SVM, RBF): regression coefficients • Metric classifiers (K-means): cluster centroids

  17. A Spiral Methodology for the Data Mining Process

  18. The DM Discovery Phase: Descriptive Modeling • OLAP • Visualization • Unsupervised learning • Link Analysis/Collaborative Filtering • Rule Induction

  19. The DM Exploitation Phase: Predictive Modeling • Paradigm selection • Test design • Formulation of meta-schemes • Model construction • Model evaluation • Model deployment • Model maintenance

  20. A “de facto” standard DM Methodology CRISP-DM (“cross-industry standard process for data mining”) • 1.) Business Understanding • 2.) Data Understanding • 3.) Data Preparation • 4.) Modeling • 5.) Evaluation • 6.) Deployment

  21. Data Mining Paradigms: What does your solution look like? • Conventional Decision Models – statistical inference, logistic regression, score cards • Heuristic Models – human expert, knowledge-based expert systems, fuzzy logic, decision trees, belief nets • Regression Models – neural networks (all sorts), radial basis functions, adaptive logic networks, decision trees, SVM

  22. Real-World DM Business Challenges • Complex and conflicting goals • Defining “success” • Getting “buy in” • Enterprise data is distributed • Limited automation • Unrealistic expectations

  23. Real-World DM Technical Challenges • big data consume space and time • efficiency vs. comprehensibility • combinatorial explosion • diluted information • difficult to develop “intuition” • algorithm roulette

  24. Data Mining Problems: What does your domain look like? • How well is the problem understood? • How "big" is the problem? • What kind of data do we have? • What question are we answering? • How deeply buried in the data is the answer? • How must the answer be presented to the user?

  25. 1. Business Understanding How well is the problem understood?

  26. How well is the problem understood? Domain intuition: low/medium/high Experts available? Good documentation? DM team’s prior experience? Prior art? What is the enterprise definition of “success”? What is the target environment? How skillful are the users? Where are the pitchforks?

  27. 2. Data Understanding / 3. Preparing the Data How "big" is the problem? What kind of data do we have?

  28. DM Aspects of Data Preparation • Data Selection • Data Cleansing • Data Representation • Feature Extraction and Transformation • Feature Enhancement • Data Division • Configuration Management

  29. How "big" is the problem? Number of exemplars (“rows”) Number of features (“columns”) Number of classes (“ground truth”) Cost/schedule/talent (dollars, days, dudes) Tools (own/make/buy, familiarity, scope)

  30. What kind of data do we have? • Feature type: nominal/numeric/complex • Feature mix: homo/heterogeneous by type • Feature tempo: fresh/stale, periodic/sporadic, synchronous/asynchronous • Feature data quality: low/high SNR, few/many gaps, easy/hard to access, objective/subjective • Feature information quality: salience, correlation, localization, conditioning • Comprehensive? Representative?

  31. How much data do I need? • Many heuristics: Monte’s 6MN rule and other similar rules of thumb (see the sketch below) • Support vectors • Segmentation requirements • Comprehensive • Representative • Consider population imbalance
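
One common reading of the 6MN rule (this reading is an assumption here, not spelled out on the slide) is that the training set should hold at least 6 × M × N exemplars for M features and N classes. A quick Python sketch:

    # Hedged sketch: assume the "6MN rule" asks for at least 6 * M * N
    # training exemplars, where M = number of features, N = number of classes.
    def min_exemplars(n_features, n_classes, factor=6):
        return factor * n_features * n_classes

    # e.g., 10 features and 3 classes suggest at least 180 exemplars;
    # population imbalance may push the per-class requirement higher.
    print(min_exemplars(10, 3))  # 180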

  32. Feature Saliency Tests • Correlation/Independence • Visualization to determine saliency • Autoclustering to test for homogeneity • KL-Principal Component Analysis • Statistical Normalization (e.g., ZSCORE) • Outliers, Gaps
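
A minimal sketch of statistical normalization (ZSCORE) with a crude outlier flag; the feature column and the cutoff (loosened for this tiny sample) are hypothetical.

    # ZSCORE normalization: center each value and scale by the standard
    # deviation; large |z| values suggest outliers worth inspecting.
    from statistics import mean, stdev

    def zscore(values):
        m, s = mean(values), stdev(values)
        return [(v - m) / s for v in values]

    feature = [5.1, 4.9, 5.3, 5.0, 12.7]  # hypothetical raw feature column
    z = zscore(feature)
    # A cutoff of 1.5 is illustrative for this tiny sample; 3.0 is the
    # more common convention on realistic data volumes.
    outliers = [v for v, zv in zip(feature, z) if abs(zv) > 1.5]
    print([round(zv, 2) for zv in z], outliers)  # ... flags 12.7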

  33. Making Feature Sets for Data Mining • Converting Nominal Data to Numeric: Numeric Coding • Converting Numeric data to Nominal: Symbolic Coding • Creating Ground-Truth
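
A small Python sketch of numeric coding and ground-truth creation (symbolic coding of numeric data is sketched under slide 37 below); the color categories and fraud labels are hypothetical.

    # Numeric coding of a nominal feature, plus a ground-truth column.
    colors = ["red", "green", "blue", "green"]  # hypothetical nominal feature
    code = {c: i for i, c in enumerate(sorted(set(colors)))}
    numeric = [code[c] for c in colors]  # [2, 1, 0, 1]
    # Caveat: integer codes impose an artificial order on the categories;
    # one-hot coding avoids this at the cost of more columns.

    labels = ["fraud", "ok", "ok", "fraud"]             # symbolic ground truth
    truth = [1 if y == "fraud" else 0 for y in labels]  # [1, 0, 0, 1]
    print(numeric, truth)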

  34. Information can be Irretrievably Distributed (e.g., the parity-N problem: 0010100110… → 1, where the class is the parity of the whole bit string and no single bit is informative by itself) The best feature set is not necessarily the set of best features.
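
The point can be checked directly. A sketch assuming even parity over all N-bit strings: conditioned on any single bit, the two parity classes stay 50/50, yet all bits together determine the class exactly.

    # For the parity problem, each bit alone carries zero class information.
    from itertools import product

    N = 4
    rows = list(product([0, 1], repeat=N))
    parity = [sum(r) % 2 for r in rows]

    for j in range(N):
        with_bit_set = [p for r, p in zip(rows, parity) if r[j] == 1]
        print(f"bit {j}: P(parity=1 | bit=1) = "
              f"{sum(with_bit_set) / len(with_bit_set):.2f}")
    # Every line prints 0.50: no single feature helps, but the full
    # feature set solves the problem perfectly.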

  35. An example of a Feature Metric “Salience”: geometric mean of class precisions • an objective measure of the ability of a feature to distinguish classes • takes class proportion into account • specific to a particular classifier and problem • does not measure independence
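
A sketch of the metric as defined on the slide: the geometric mean of the per-class precisions of a classifier's output. The labels below are hypothetical.

    # Salience = geometric mean of class precisions. A zero precision on any
    # class drives the whole score to zero, so a dominant class cannot mask
    # failure on a rare one.
    def salience(y_true, y_pred, classes):
        precisions = []
        for c in classes:
            hits = [t for t, p in zip(y_true, y_pred) if p == c]
            if not hits:
                return 0.0  # class never predicted
            precisions.append(hits.count(c) / len(hits))
        prod = 1.0
        for p in precisions:
            prod *= p
        return prod ** (1.0 / len(precisions))

    y_true = ["a", "a", "b", "b", "b", "a"]  # hypothetical ground truth
    y_pred = ["a", "b", "b", "b", "a", "a"]  # hypothetical classifier output
    print(round(salience(y_true, y_pred, ["a", "b"]), 3))  # 0.667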

  36. Original Data: Nominal to Numeric Coding… one step at a time! (the original slide works the example in two steps: Step 1, Step 2)

  37. Numeric to Nominal Quantization
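
A minimal sketch of quantization into symbolic bins; the cut points and bin labels are hypothetical.

    # Numeric-to-nominal quantization: map continuous values onto symbols.
    import bisect

    cuts = [10.0, 20.0]                 # two hypothetical cut points
    labels = ["low", "medium", "high"]  # three resulting bins

    def quantize(x):
        return labels[bisect.bisect_right(cuts, x)]

    print([quantize(v) for v in [3.2, 10.0, 15.7, 24.1]])
    # ['low', 'medium', 'medium', 'high']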

  38. “Clusters” Usually Mean Something

  39. How many objects are shown here? One, seen from various perspectives! This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!

  40. Autoclustering • Automatically find spatial patterns in complex data • find patterns in data • measure the complexity of the data
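
A minimal k-means sketch of the idea on one-dimensional toy data; a real autoclusterer would also choose the number of clusters and score cluster quality.

    # Naive k-means: alternate between assigning points to the nearest
    # centroid and recomputing each centroid as its cluster mean.
    def kmeans(points, k, iters=20):
        centroids = points[:k]  # naive deterministic initialization
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: abs(p - centroids[j]))
                clusters[i].append(p)
            centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        return centroids

    data = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3]  # two latent groups
    print(sorted(round(c, 2) for c in kmeans(data, 2)))  # [1.0, 5.1]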

  41. Differential Analysis • Discover the Difference “Drivers” Between Groups • Which combination of features accounts for the observed differences between groups? • Focus research

  42. Sensitivity Analysis • Measure the Influence of Individual Features on Outcomes • Rank order features by salience and independence • Estimate problem difficulty
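
A sketch with a hypothetical linear scorer standing in for a trained reasoner: bump one feature at a time and rank features by how far the output moves.

    # Sensitivity analysis by one-at-a-time perturbation.
    def model(x):
        return 0.8 * x[0] + 0.1 * x[1] + 0.05 * x[2]  # hypothetical model

    baseline = [1.0, 1.0, 1.0]
    base_out = model(baseline)
    for j in range(3):
        bumped = list(baseline)
        bumped[j] += 1.0  # unit perturbation of one feature
        print(f"feature {j}: delta = {model(bumped) - base_out:+.2f}")
    # Rank features by |delta|: here feature 0 dominates (+0.80), which
    # also hints at how difficult the problem is with the remaining features.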

  43. Rule Induction • Automatically find semantic patterns in complex data • discover rules directly from data • organize “raw” data into actionable knowledge
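
A toy sketch of rule induction via data splits, in the spirit of the example on the next two slides (the deck's own example is not reproduced here): try a threshold split on each feature, keep the purest, and report it as a readable rule. The (amount, hour) → fraud records are hypothetical.

    # Induce a one-condition rule by scanning candidate data splits.
    data = [((120.0, 2), 1), ((15.0, 14), 0), ((310.0, 3), 1),
            ((42.0, 11), 0), ((250.0, 1), 1), ((9.0, 16), 0)]

    def purity(left, right):
        """Worst-side fraction of the majority class: 1.0 = clean split."""
        def side(rows):
            if not rows:
                return 1.0
            ones = sum(y for _, y in rows)
            return max(ones, len(rows) - ones) / len(rows)
        return min(side(left), side(right))

    best = None
    for j in (0, 1):            # each feature
        for (x, _) in data:     # each observed value as a candidate threshold
            t = x[j]
            left = [(x2, y) for x2, y in data if x2[j] <= t]
            right = [(x2, y) for x2, y in data if x2[j] > t]
            score = purity(left, right)
            if best is None or score > best[0]:
                best = (score, j, t)

    score, j, t = best
    left_classes = [y for x, y in data if x[j] <= t]
    cls = max(set(left_classes), key=left_classes.count)
    print(f"IF feature[{j}] <= {t} THEN class = {cls}  (purity {score:.2f})")
    # -> IF feature[0] <= 42.0 THEN class = 0  (purity 1.00)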

  44. A Rule Induction Example (using data splits)

  45. Rule Induction Example (Data Splits)
