Mining data with PolyAnalyst

Mining data with PolyAnalyst. Your Knowledge Partner TM. www.megaputer.com. Outline. Data Mining in BI chain PolyAnalyst overview Learning algorithms Additional features Future developments. Data Mining in BI chain. Your Knowledge Partner TM. Data. Knowledge. Decision. Action.


Presentation Transcript


  1. Mining data with PolyAnalyst. Your Knowledge Partner™. www.megaputer.com

  2. Outline • Data Mining in BI chain • PolyAnalyst overview • Learning algorithms • Additional features • Future developments

  3. Data Mining in BI chain

  4. DM in Decision Making • BI chain: Data → Knowledge → Decision → Action • Consider a fragment of the BI chain: • Data - is what we can capture and store • Knowledge - is what provides for informed decisions • Problem: How to get from Data to Knowledge? • Solution: Data Mining (Machine Learning)

  5. Data Mining • "Data Mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions." -- G. Piatetsky-Shapiro, KDNuggets editor www.kdnuggets.com • Valid • Novel • Actionable • Comprehensible

  6. Data Mining vs. OLAP • OLAP - Helps prove or reject your hypotheses by dissecting data along different dimensions - But you have to guess the answer first! • Data Mining - Automatically develops and tests numerous hypotheses by learning from historical data - Analyzes raw data

  7. Business Intelligence Chain • Consider direct marketing automation • Analyze data • Integrate applications

  8. Data Mining Tasks • Predicting • Classifying • Clustering • Segmenting • Explaining • Associating • Visualizing • Link Analysis • Text Mining

  9. Fields of application

  10. What makes DM hard? • Unfamiliar concept and lack of experience • Results require interpretation by an analyst • Poor integration in existing applications • Difficulty processing very large databases • Necessity to learn a new application • High cost

  11. Megaputer response • Challenge: Unfamiliar concept and lack of experience. Response: Collaborative Appliance Program, which combines Megaputer analysts' expertise in data mining with customer knowledge of the business project • Challenge: Results require interpretation by an analyst. Response: Simple reporting and batch processing capabilities • Challenge: Poor integration in existing applications. Response: Easy scoring of external data with a few mouse clicks • Challenge: Difficulty processing very large databases. Response: In-Place Data Mining • Challenge: Necessity to learn a new application. Response: An SDK of easy-to-integrate PolyAnalyst COM components • Challenge: High cost. Response: Flexible licensing mechanism

  12. PolyAnalyst overview

  13. What is PolyAnalyst? • Multi-strategy data mining suite • The largest selection of ML algorithms for diverse business tasks • Structured data and text processing tools • Ease-of-use: • friendly data manipulation and visualization • Deep integration • Applying models to external DB through the OLE DB protocol • Exporting models to XML • COM components • Best Price/Performance ratio

  14. Key differentiators of PolyAnalyst • Integrated analysis of structured (numeric and categorical) and unstructured (text) data • Easy to learn and operate visual analytical interface • The largest selection of powerful machine learning algorithms • Mouse-driven application of predictive models to data in any external system through a standard OLEDB link • Simple integration with external applications: SDK of COM components • In-Place Data Mining capabilities for processing huge databases • Step-by-step tutorials based on real-world case studies • Rich data manipulation and visualization tools • Reusable analytical scripts for batch process data mining • The best Price/Performance ratio

  15. PolyAnalyst customer base: 300+ installations • Sample customers

  16. PolyAnalyst workplace (screenshot labels) • Control buttons • Project navigation tree • Data and Results pane • Exploration engine report fragment • Objects and Collections represented by icons • PolyAnalyst log journal

  17. PolyAnalyst provides • Access to data held in a database or data warehouse • Numerical • Categorical • Yes/no • Date • Data manipulation and visualization • 14 machine learning algorithms • Convenient results reporting and output • Integration with external applications

  18. PolyAnalyst machine learning algorithms

  19. “Probably one the most impressive characteristic of PolyAnalyst is the sheer number of data mining tasks it can tackle.” Mario Apicella Technology Analyst InfoWorld Test Center July 3, 2000

  20. Learning algorithms • Find Laws (SKAT algorithm) • Cluster (Localization of anomalies) • Find Dependencies (n-dimensional distributions) • Classify (Fuzzy logic modeling) • Decision Tree (Information Gain criterion) • PolyNet Predictor (GMDH-Neural Net hybrid) • Market Basket Analysis (Association rules) • Memory Based Reasoning (k-NN + GA) • Linear Regression (Stepwise and rule-enriched) • Discriminate (Unsupervised classification) • Summary Statistics (Data summarization) • Link Analysis (Visual correlation analysis) • Text Mining (Semantic text analysis)

  21. Cluster (FC) • Identifies clusters of similar records • Selects best variables for clustering • Suggests the number of clusters • Separates clusters of records in new data sets for further investigation - preprocessing for other algorithms

  22. Cluster (continued) Groups of similar records

  23. Cluster (continued) • Based on analyzing distributions in hypercubes of all variables rather than on measuring distances between points • Hence, independent of rescaling of the variable axes • Finds only clusters actually present in the data, against a background of uniformly distributed cases

  24. Classify (CL) • Fuzzy-logic based classification • The membership function is modeled by either Find Laws, PolyNet Predictor, or LR • Provides record scoring, visualized with Lift and Gain charts • Assigns records to one of two classes and reports the classification rule used

  25. Classify (continued) • PolyAnalyst Lift chart illustrates an increase in the response to a campaign based on the discovered model, instead of random mailing (axis: % of maximal possible response; curves: Targeted mailing, Mass mailing) • PolyAnalyst Gain chart helps optimize the profit obtained in a direct marketing campaign (axis: Profit ($); curves: Targeted mailing, Mass mailing)
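
The Lift chart idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not PolyAnalyst code: the function name `cumulative_response`, the scores, and the response flags are all made up for the example. It ranks records by model score and reports what percentage of all responders is reached when only the top-scored fraction is mailed.

```python
def cumulative_response(scores, responded, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Percent of all responders reached when mailing the top-scored fraction."""
    # Rank the response flags by model score, best prospects first.
    ranked = [r for _, r in sorted(zip(scores, responded),
                                   key=lambda pair: pair[0], reverse=True)]
    total = sum(ranked)
    curve = []
    for fraction in fractions:
        k = round(fraction * len(ranked))
        curve.append((fraction, 100.0 * sum(ranked[:k]) / total))
    return curve

# Hypothetical model scores and observed responses (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
responded = [1, 1, 0, 1, 0, 0, 1, 0]
curve = cumulative_response(scores, responded)
```

With this toy data, mailing the top 25% of records reaches 50% of the responders, whereas a random (mass) mailing of the same size would be expected to reach about 25%.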

  26. Decision Tree (DT) • Intuitively classifies cases to selected categories • Based on Information Gain splitting criteria • The fastest algorithm in PolyAnalyst • Scales linearly with increasing number of records

  27. Decision Tree (continued) (figure labels: Node characteristics; Classification tree)
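
The Information Gain splitting criterion named on the Decision Tree slide can be illustrated as follows. This is a textbook sketch on categorical attributes, not PolyAnalyst's implementation; the attribute names and records are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the child nodes after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: does a customer buy, given region and promotion?
rows = [
    {"region": "north", "promo": "yes"},
    {"region": "north", "promo": "no"},
    {"region": "south", "promo": "yes"},
    {"region": "south", "promo": "no"},
]
labels = ["buy", "no", "buy", "no"]

# A tree builder would split on the attribute with the highest gain.
best = max(["region", "promo"], key=lambda a: information_gain(rows, labels, a))
```

Here `promo` separates the classes perfectly (gain of 1 bit) while `region` tells nothing (gain of 0), so `promo` is chosen for the first split.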

  28. Decision Forest (DF) • The most efficient classification algorithm for tasks with multiple target categories • Transforms the task of categorizing data records to N classes into the problem of solving N tasks of categorizing records to two classes • Develops the best collection of N classification trees, with leaves containing probabilities of classifying records in the corresponding classes • Scales linearly with increasing number of records
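
The decomposition described on the Decision Forest slide (one N-class task turned into N two-class tasks, with leaves holding class probabilities) can be sketched as below. This is only an illustration under strong simplifying assumptions: each "tree" is a depth-one split on a single categorical feature, and the names `fit_stump`, `fit_one_vs_rest`, and `predict` are hypothetical, not part of PolyAnalyst.

```python
from collections import defaultdict

def fit_stump(values, is_positive):
    """Depth-one 'tree' on one categorical feature; each leaf stores P(positive)."""
    counts = defaultdict(lambda: [0, 0])  # feature value -> [positives, total]
    for value, positive in zip(values, is_positive):
        counts[value][0] += int(positive)
        counts[value][1] += 1
    leaves = {value: pos / tot for value, (pos, tot) in counts.items()}
    return lambda value: leaves.get(value, 0.0)

def fit_one_vs_rest(values, labels):
    """Train one two-class model per category: 'this class' versus 'the rest'."""
    return {c: fit_stump(values, [y == c for y in labels])
            for c in sorted(set(labels))}

def predict(models, value):
    """Pick the class whose two-class model gives the highest probability."""
    return max(models, key=lambda c: models[c](value))

# Hypothetical data: customer tier versus spending category.
values = ["gold", "gold", "silver", "silver", "bronze", "bronze", "bronze"]
labels = ["high", "high", "medium", "medium", "low", "low", "medium"]
models = fit_one_vs_rest(values, labels)
```

Three categories yield three two-class models; for a "bronze" customer the "low" model reports a probability of 2/3 and wins over the other two.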

  29. Link Analysis (LK) • Reveals pairs of correlated objects • Used in Fraud Detection, Text Analysis and other correlation analysis tasks

  30. Text Analysis (TA) • Extracts key concepts from natural language notes • Tags individual records with the main encountered concepts • Recognizes synonyms and other semantic relations • Can perform user-focused or unsupervised analysis • Integrates the analysis of text with the power of other machine learning algorithms of PolyAnalyst • Facilitates categorization of textual documents

  31. Text Analysis (continued)

  32. Basket Analysis (BA) • Is used in Retailing, Fraud Detection and Medicine • Identifies groups of products that sell well together in transactional data • Finds directed association rules for each of these groups • Groups baskets containing similar sets of products • Rules are characterized by • Support • Confidence • Improvement • Based on new mathematics: • works 10 to 50 times faster than traditional algorithms

  33. Basket Analysis (continued) (figure labels: Groups of products sold together well; Directed Association Rules)

  34. Basket Analysis (continued) • Works with both transactional and flat data format • Easily finds many-to-one rules • Quote: “I would like to continue working together with Megaputer on other CTP customers’ projects (mainly Swedish and Danish Banks).” -- Olof Goransson, Senior Data Consultant, CTP Skandinavien AB
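
The three rule measures listed for Basket Analysis (Support, Confidence, Improvement) can be computed as in the sketch below. It uses the common textbook definitions and assumes that "Improvement" is the measure often called lift (confidence divided by the baseline frequency of the consequent); the slides do not define it, and the baskets are invented.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and improvement (lift) of antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_consequent = sum(1 for b in baskets if consequent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / n                      # share of baskets with both sides
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    # Assumed definition: how much the rule beats the consequent's base rate.
    improvement = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, improvement

# Hypothetical transactions, one set of products per basket.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
support, confidence, improvement = rule_metrics(baskets, {"bread"}, {"butter"})
```

For the rule bread -> butter this gives a support of 0.5, a confidence of 2/3, and an improvement of about 1.33, meaning butter is found about a third more often in baskets with bread than in baskets overall.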

  35. Find Laws (FL) • Models relationships hidden in data • Presents discovered knowledge explicitly • Searches the space of all possible hypotheses • Quote: “The unique Find Laws algorithm along with an easy to use interface made PolyAnalyst the only choice for our environment.” -- James Farkas, Senior Navigation Engineer, The Boeing Company

  36. Find Laws (continued) • FL is based on Megaputer’s unique Symbolic Knowledge Acquisition Technology (SKAT) • A good introduction to SKAT: PCAI magazine, January 99, p. 48-52

  37. Find Dependencies (FD) • Determines most influential variables • Detects multi-dimensional dependencies • Predicts target variable in a table format • Used as preprocessing for FL

  38. Find Dependencies (continued) Predicted Sales per Employee

  39. PolyNet Predictor (PN) • Predicts values of continuous attributes • Hybrid GMDH-Neural Network method • Works well with large amounts of data • The best network architecture is built automatically

  40. Memory Based Reasoning (MB) • Performs classification into multiple categories • Based on identifying similar cases in the previous history • Uses Genetic Algorithms to find the most suitable metric for the problem
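
The "similar cases in the previous history" idea corresponds to k-nearest-neighbor classification. The sketch below shows k-NN with a weighted Euclidean metric; the per-variable weights stand in for the metric that, according to the slide, a Genetic Algorithm would search for. The GA search itself is not shown, and the function name, data, and weights are hypothetical.

```python
from collections import Counter

def knn_classify(train, query, k=3, weights=None):
    """Majority vote among the k stored cases closest to the query."""
    weights = weights or [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance; the weights define the metric.
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5

    nearest = sorted(train, key=lambda case: distance(case[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical history of cases: (feature vector, category).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

plain = knn_classify(train, (1.1, 1.0))
# The same query can land in different categories under different metrics.
first_only = knn_classify(train, (1.0, 5.0), weights=(1.0, 0.0))
second_only = knn_classify(train, (1.0, 5.0), weights=(0.0, 1.0))
```

The last two calls show why the choice of metric matters: the ambiguous query (1.0, 5.0) is classified as "A" when only the first variable counts and as "B" when only the second does.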

  41. Discriminate (DS) • Determines what features of a selected data set distinguish it from the rest of the data • Requires no target variable • Can be powered by • Find Laws • PolyNet Predictor • Linear Regression

  42. Linear Regression (LR) • Incorporates categorical and yes/no variables in the analysis correctly • Stepwise Linear Regression: only influential variables included • Can be used as a preprocessing and benchmarking module

  43. PolyAnalyst features in more detail

  44. Data Analysis Project Workflow • Access data • Understand, clean and transform data • Run machine learning analysis • Visualize, report and share results • Integrate results in existing business process

  45. Data Access • ODBC-compliant databases: Oracle, DB2, Informix, Sybase, MS SQL Server, etc. • Dedicated access • IBM Visual Warehouse • Oracle Express • OLE DB (can do In-Place Data Mining) • CSV or DBF files • Data can be appended to the project when necessary

  46. Data cleansing and manipulation • SQL querying through OLE DB • Records selection according to multiple criteria • Union, intersection, or complement of data sets • Categorical values aggregation • Visual Drill-through • Exceptional records filtering • Split into n-tile percentage intervals • Random sampling
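
Two of the manipulation steps listed above, splitting into n-tile intervals and random sampling, can be sketched in plain Python. This is a generic illustration, not PolyAnalyst functionality; the function names and the sample values are invented.

```python
import random

def split_into_ntiles(values, n):
    """Sort the values and cut them into n groups of (nearly) equal size."""
    ranked = sorted(values)
    size = len(ranked)
    return [ranked[i * size // n:(i + 1) * size // n] for i in range(n)]

def random_sample(records, fraction, seed=0):
    """Draw a random subset without replacement; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, round(fraction * len(records)))

# Hypothetical column of twelve numeric values.
values = [7, 3, 12, 1, 9, 5, 11, 2, 8, 4, 10, 6]
quartiles = split_into_ntiles(values, 4)
sample = random_sample(values, 0.25)
```

With twelve values and n = 4, each quartile holds three values, from the lowest three to the highest three, and the 25% sample also contains three records.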

  47. Visualization • Histograms • Line and scatter plots with zoom and drill-through capabilities • Snake charts • Interactive 3D-charts • Interactive Rule-graphs with sliders for visualizing multi-variable relations • Frequency charts for categorical, integer, or yes/no variables • Lift and Gain charts for marketing applications

  48. Histograms and Frequencies • Histogram displays distribution of numerical variables • Frequencies chart displays distribution of categorical and yes/no variables

  49. 2D charts and Rule-graphs • Sliders help visualize effects of other variables in more than two-dimensional models • The Find Laws model (red line) for a product market share dependence on the price predicts a dramatic change in the formula when the product goes on promotion

  50. Snake-charts • Quickly compare qualitatively several datasets on all their attributes (figure labels: Compared data sets “High” and “Low”; All variables)
