Carlos Bossy Principal Consultant CBIP, Data Mining, MCTS , MCITP BI Quanta Intelligence PowerPoint Presentation
Presentation Transcript

  1. Predictive Analytics in the Enterprise Carlos Bossy Principal Consultant CBIP, Data Mining, MCTS, MCITP BI Quanta Intelligence

  2. Introduction Experience: • 10 years Business Intelligence Consultant • Data Warehouse, Integration, Data Mining, Predictive Models • 15 years Software Development as Programmer through CTO Speaker/Author: • Speaker at PASS Summit in October • Local and Nationwide Events and User Groups • Published Chapter on Big Data for Pro SQL Server 2012 Practices Contact Info: • Email: cbossy@quantaintelligence.com • Blog: www.carlosbossy.com • Twitter: @carlosbossy

  3. Today’s Session

  4. Introduction: Predictive Analytics

  5. Math! Just a little….

  6. Predictive Analytics Defined Predictive Analytics is the discovery of predictive information, sometimes hidden, from databases (often large) using related and unrelated data attributes by applying algorithmic analysis, and the creation of models that generate predictive results. • Predictive Modeling is the process of creating a model to best predict the probability of an outcome.

  7. Data Scientist A Data Scientist incorporates techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data. Data science: a novel term used interchangeably with competitive intelligence or business analytics.

  8. Predictive Analytics: Gartner Hype Cycle

  9. The End What do we want to Achieve?

  10. Predictive Analytics A vital component in the BI Data Architecture and available in real-time to Applications, Reports, ETL ROI of Business Analytics Projects when Incorporating Predictive Analytics (Source: IDC) 145% vs 89%

  11. Yesterday <-----> Today Legacy: • Manageable Volumes of Data • The ease and power of SQL • Domain Experts could grasp and analyze a complete Database • Limited CPU Horsepower • Finite Data Today: • Explosion of Data doubles every 3 years (Moore’s Law) • Data Volumes can’t be comprehended by humans • Uncover complex and difficult to find patterns for competitive edge • Improve professional judgment of Domain Expert (small but valuable) • Knowledge Discovery • Converting Data to Information • N --> Infinity

  12. Applications • Credit Risk Analysis • Churn Analysis • Customer Retention • Targeted Marketing • Market Basket Analysis • Sales Forecasting • Stock Predictions • Medical Diagnosis • Bioscience Research • Surveys • Insurance Rate Quotes • Credit Card Fraud • Web Site Events • Loan Applications • Hiring and Recruiting • Cross-Marketing • Attendance • Anything!

  13. Terms to Know Class Variable is the data we are predicting. Instance and Attribute are similar to a row and a column, used as input to a predictive model. Support refers to the number of instances of a particular pattern. Overfitting (Overtraining) is a flaw in a predictive model where the model reflects the training data too closely. Confusion Matrix (Classification Matrix) is a table showing predicted values and their accuracy. Cross Validation is a way to develop and test a model by using one percentage of the data to test and another to train.
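To make the Confusion Matrix term concrete, here is a minimal Python sketch. The function names and the six yes/no labels are made up for illustration and do not come from the presentation.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; keys are tuples of class labels."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Share of instances where the predicted class equals the actual class."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total

# Hypothetical class values for six instances.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
matrix = confusion_matrix(actual, predicted)
print(matrix[("yes", "yes")], matrix[("yes", "no")])
print(accuracy(matrix))
```

Each key pairs an actual class with a predicted class, so the off-diagonal keys such as ("yes", "no") are the misclassified instances.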

  14. Modeling Methodology / Approach

  15. Methodology / Approach: Problem Statement

  16. Methodology / Approach

  17. Event Management Model • How many people will register for an Event? Simultaneous events managed: > 100 Mean Error: 15% Target Error: 5% Pass 1: 60 days Pass 2: 30 days Pass 3: 7 days

  18. Medical Case Management Workplace Injuries • Can a case benefit from Nurse Case Management intervention?

  19. Gather Data

  20. Prepare Data • Transform the data to fit the Problem/Algorithm/Solutions • Get derivations from SMEs, Data Analysis, Groupings • Discrete vs. Continuous • Bins and Buckets (discretize) • Analyze Quantities: logarithmic, binary, linear • Moving Averages: simple, exponential, convergence, divergence • Input to Algorithm in Tabular Form
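Two of the preparation steps above, discretizing a continuous value into buckets and computing moving averages, can be sketched in plain Python. The bucket edges, labels, and smoothing factor `alpha` are illustrative assumptions, not values from the presentation.

```python
def discretize(value, edges, labels):
    """Return the label of the first bucket whose upper edge exceeds value.

    labels must have one more entry than edges; the last label is the
    open-ended top bucket.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

def simple_moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def exponential_moving_average(values, alpha):
    """Each point blends the new value (weight alpha) with the prior average."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical age buckets and a short daily series.
print(discretize(45, [30, 60], ["low", "mid", "high"]))
print(simple_moving_average([10, 20, 30, 40], window=2))
print(exponential_moving_average([10, 20, 30], alpha=0.5))
```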

  21. Select Training and Testing Sets

  22. Choose Algorithm(s) • Random Forest • Model Tree • Genetic • Support Vector Machine • Many more … • Time Series • Naïve Bayes • Association • Clustering • Regression Tree • Logistic Regression • Clustering • Sequence Clustering • Neural Networks

  23. Choose Algorithms

  24. Decision Tree Yearly Income = 64,387.383 + 5,984.161 * (Cars - 2.125) + 9,372.900 * (Age - 60.313) + 0.763 * (ID - 17,759.125) + 20,740.055 * (Children - 0.125)
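The formula on this slide reads like the linear regression attached to one node of a regression tree. A minimal sketch of evaluating it follows; the function and parameter names are hypothetical, and the coefficients are copied from the slide.

```python
def predict_yearly_income(cars, age, record_id, children):
    """Evaluate the slide's regression formula for one instance.

    Each input is centered on the constant subtracted from it in the
    formula, so the intercept is the prediction when every input equals
    that constant.
    """
    return (64387.383
            + 5984.161 * (cars - 2.125)
            + 9372.900 * (age - 60.313)
            + 0.763 * (record_id - 17759.125)
            + 20740.055 * (children - 0.125))

# With every input at its centering constant, only the intercept remains.
print(predict_yearly_income(2.125, 60.313, 17759.125, 0.125))
```

Under this formula, one additional car raises the predicted income by 5,984.161 when the other inputs are held fixed.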

  25. Model Training and Testing

  26. Model Training and Testing Predictive Model Applications

  27. SQL Server Data Mining • SQL Server 2005 and 2008 • Analysis Services • SQL Server 2012 • Analysis Services Multidimensional • Development • Business Intelligence Development Studio • Excel Data Mining Add-in

  28. Event Registrations Model Development

  29. Event Registrations • Daily Snapshot in Tabular Form

  30. Event Registrations Predictive Model Outcome 30 Day Prediction

  31. More Terms to Know (Advanced) Ensemble modeling combines the results of two or more independent models to improve the quality of predictions. Boosting is a form of ensemble modeling that trains subsequent models by heavily weighting poorly classified instances. Bagging (Bootstrap Aggregating) is also a form of ensemble modeling that trains each model in the ensemble using a random subset of the training data. The models then vote to obtain the best prediction. Stacking trains a model by using as input the output of predictions of other models. Feature Selection is a process of selecting a subset of relevant data to reduce the amount of data a model must process to train. It can generalize a model and help avoid overfitting.
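Bagging as described above (each model trained on a random subset, then a vote) can be sketched as follows. The base learner here is a deliberately simple one-feature threshold classifier chosen for illustration; the data, names, and subset size are assumptions, not taken from the presentation.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a threshold classifier on (x, label) rows with exactly two classes.

    The threshold is the midpoint between the two class means.
    """
    by_class = {}
    for x, label in rows:
        by_class.setdefault(label, []).append(x)
    (low_label, low_mean), (high_label, high_mean) = sorted(
        ((label, sum(xs) / len(xs)) for label, xs in by_class.items()),
        key=lambda pair: pair[1])
    threshold = (low_mean + high_mean) / 2
    return lambda x: low_label if x < threshold else high_label

def train_bagged(rows, n_models, subset_size, rng):
    """Train each model on its own random subset of the training rows."""
    return [train_stump(rng.sample(rows, subset_size)) for _ in range(n_models)]

def vote(models, x):
    """Return the class predicted by the most models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Hypothetical data: four rows per class, so any subset of six rows
# contains both classes.
rows = [(1, "a"), (2, "a"), (3, "a"), (4, "a"),
        (7, "b"), (8, "b"), (9, "b"), (10, "b")]
models = train_bagged(rows, n_models=5, subset_size=6, rng=random.Random(0))
print(vote(models, 2), vote(models, 9))
```

An odd number of models avoids tied votes when there are two classes.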

  32. Worker Injury Claims

  33. Workers’ Compensation • Model Development

  34. Worker Injury Claims • Tabular Data View

  35. More Terms and Things to Know Balancing of the training dataset is done to ensure class values occur in a similar proportion. Scoring is the output of a model (the predicted value). Model Disruption happens when an event occurs that causes a model to no longer make accurate predictions. Machine Learning is analogous to algorithmic predictive modeling. Expert Systems are developed using expertise and rules, not algorithms. Unstructured data such as text and images can be modeled to obtain information from the data.
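Balancing can be done in several ways. The sketch below uses random oversampling of the smaller classes, which is one common option and an assumption here; the rows and the `label_of` accessor are illustrative.

```python
import random
from collections import Counter

def oversample(rows, label_of, rng):
    """Duplicate randomly chosen rows until every class matches the largest."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the shortfall with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical training set: 8 "no" rows and 2 "yes" rows.
rows = [("no", i) for i in range(8)] + [("yes", i) for i in range(2)]
balanced = oversample(rows, label_of=lambda row: row[0], rng=random.Random(1))
print(Counter(row[0] for row in balanced))
```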

  36. Working Session Group Model • A Predictive Model • for Prospective Hires

  37. Group Model Developing our Predictive Model • Problem Statement • Gather Data • Explore Data • Prepare Data • Training / Test Sets • Choose Algorithms • Train • Test

  38. Group Model Problem Statement • Should we hire a job candidate? • What salary should we offer to a potential employee?

  39. Group Model • Where’s the Data? Personality Type Certifications Experience References Leadership Integrity Evaluations Internet Search GPA Confidence Exams Interview Problem Solving Salary Requirements Job History

  40. Group Model Building the Model

  41. Model Development • Exercise #1 • New Project • SSAS Multidimensional and Data Mining Project • Set Properties • Server/Instance • Add Data Source with appropriate Impersonation Mode • Add Data Source View • Mining Structure • Defines the data we will use • Data must have a key • Ensure data types and content attribute are correct for each column • Column usage must be declared (Key, Input, Predict, PredictOnly) • Mining Model • One created through wizard • Create more mining models for each mining structure • Process and Run • Builds Models

  42. Model Development • Exercise #1 • BI Studio (SQL Server Data Tools) • Visualization • Viewer (different for each algorithm) • Diagrams • Mining Legend • Mining Model Viewer • Mining Accuracy Chart • Lift Chart (Compare Models to random guesses and ideal model and to each other) • Classification Matrix (Confusion Matrix) • Cross Validation (90-10) • Mining Model Prediction
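The "Cross Validation (90-10)" item above can be read as 10 folds, each holding out 10% of the rows for testing while the other 90% train the model. That reading is an assumption, and the sketch below shows only the index bookkeeping, not how Analysis Services implements it.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Row i belongs to the test set of fold i % k, so every row is tested
    exactly once across the k folds.
    """
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test

# 20 hypothetical rows and 10 folds: each fold trains on 18 and tests on 2.
folds = list(k_fold_indices(20, 10))
for train, test in folds[:2]:
    print(len(train), len(test), test)
```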

  43. Model Development • Exercise #1 • Predict class of Iris plant from sepal length and width, petal length and width. • Dataset: iris table • Create: Mining Structure with 30% holdout • Add Mining Models: • Decision Tree • Neural Network • Logistic Regression • Cluster • Process / Test / Compare
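The 30% holdout in this exercise amounts to setting aside a random 30% of the rows for testing. Outside the tool, the same split can be sketched like this; the function name, the seed, and the 150 placeholder rows are assumptions for illustration.

```python
import random

def holdout_split(rows, holdout_fraction, rng):
    """Shuffle a copy of the rows and return (training_rows, test_rows)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * holdout_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder row ids standing in for the rows of the iris table.
rows = list(range(150))
train, test = holdout_split(rows, 0.30, random.Random(42))
print(len(train), len(test))
```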

  44. Model Development • Exercise #1 • Visualize and evaluate the quality of the models. • Dataset: iris table • Visuals • Mining Model Viewer • Mining Legend • Evaluate • Classification Matrix • Cross Validation • Mining Model Prediction

  45. Model Development • Exercise #1 • Fine-tune the mining models by changing settings and parameters. • Dataset: iris table • Decision Tree • Change Minimum Support and view Classification Matrix • Explain Complexity Penalty for pruning and to prevent overfitting • Cluster • Cluster Count • Minimum Support • Change cluster count to 3 • Add another cluster mining model with cluster count set to 0 • Add one more with cluster count = 0 and modeling cardinality = 50

  46. Tools What Analytics, Big Data, Data mining, Data Science software have you used in the past 12 months for a real project?

  47. Tools – Excel • Data Exploration and Preparation • Explore Data (data profiling) • Discretization of Continuous Data • Identify/Exclude Outliers • Re-label (rename/map columns) • Random Sampling to reduce size of Training Set • Over Sampling for Balancing

  48. Algorithm Differences Unsupervised: Cluster Algorithm Supervised: A class variable (prediction) is specified Classification: Predict a discrete variable Regression: Predict a continuous variable Time Series: Date driven predictions Association: Market Basket predictions

  49. Lunch 1 Hour

  50. Neural Network Algorithm New Weight = Weight + Error * Input * Learning Constant Multi-Layered…Feed Forward…Back Propagation
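The update rule on this slide can be tried on a single neuron. The sketch below is a simplification: it uses one layer with a step activation rather than the multi-layered back propagation network the slide names. It applies New Weight = Weight + Error * Input * Learning Constant to learn logical AND; the training data, epoch count, and function names are assumptions.

```python
def predict(weights, inputs):
    """Step activation: output 1 when the weighted sum is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def update_weights(weights, inputs, target, learning_constant):
    """Apply New Weight = Weight + Error * Input * Learning Constant."""
    error = target - predict(weights, inputs)
    return [w + error * x * learning_constant for w, x in zip(weights, inputs)]

# Each instance is (bias input, x1, x2) with the AND of x1 and x2 as target.
training = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = [0, 0, 0]
for _ in range(10):  # ten passes over the data is enough for this tiny set
    for inputs, target in training:
        weights = update_weights(weights, inputs, target, learning_constant=1)

print(weights)
print([predict(weights, inputs) for inputs, _ in training])
```

Weights change only when the prediction is wrong, because a correct prediction gives an Error of zero.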