
Amer Kanj

Amer Kanj. Data Mining For Business Professionals. Contents. Data Mining Overview Types of Data Mining Why use Data Mining How do we Mine Data Models of Data Mining. Data Mining Overview. Data Mining deals with large volumes of data stored in DBMS


Presentation Transcript


  1. Amer Kanj Data Mining For Business Professionals

  2. Contents • Data Mining Overview • Types of Data Mining • Why use Data Mining • How do we Mine Data • Models of Data Mining

  3. Data Mining Overview • Data Mining deals with large volumes of data stored in a DBMS • It is the process of analyzing large databases to find useful patterns • Data Mining is the process of automating information discovery • It automates the process of discovering useful trends and patterns

  4. Data Mining Overview (Cont) • The fundamental assumption of Data Mining is that large data may contain recurring hidden patterns • Beyond this, a Data Mining tool does not require any assumptions from the user • It tries to discover relationships and hidden patterns that may not always be obvious

  5. Types of Data Mining • Business professionals look for Data Mining approaches that meet their needs. • They require Data Mining to: • Be understandable • Have good performance • Be accurate • They define three fundamental approaches to Data Mining: • Classification Studies • Clustering Studies • Visualization Studies

  6. Classification Studies • Classification studies = Supervised learning • Very common in the business world. • A telecommunication company’s analyst wants to: • Understand why some customers remain loyal while others leave • Predict which customers the company is likely to lose to competitors

  7. Classification Studies (cont) • So he can: • Construct a model derived from historical data of loyal customers versus customers who have left • A good model enables him to better understand his customers and to predict which customers will stay and which will leave • A study will identify an overall goal and the data to be used

  8. Classification Rules • Classification rules help assign new objects to a set of classes • Given a new automobile insurance applicant, should he/she be classified as low risk, medium risk or high risk? • Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant • ∀ person p, p.degree = masters and p.income > 75,000 ⇒ p.credit = excellent • ∀ person p, p.degree = bachelors and (p.income >= 25,000 and p.income <= 75,000) ⇒ p.credit = good
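
The two rules on this slide can be sketched as an ordered list of (condition, class) pairs. This is only an illustration: the names `RULES` and `classify_credit` are hypothetical, and returning `None` when no rule applies is an assumption, since the slide does not say what happens to applicants that match neither rule.

```python
# Each rule pairs a condition on (degree, income) with the class it assigns.
RULES = [
    (lambda degree, income: degree == "masters" and income > 75_000, "excellent"),
    (lambda degree, income: degree == "bachelors" and 25_000 <= income <= 75_000, "good"),
]

def classify_credit(degree, income):
    """Return the class of the first matching rule, or None if no rule fires."""
    for condition, credit_class in RULES:
        if condition(degree, income):
            return credit_class
    return None  # assumption: the slide gives no default class

print(classify_credit("masters", 80_000))
print(classify_credit("bachelors", 50_000))
```

Keeping the rules as data rather than hard-coded branches makes it easy to add or reorder rules as a study is refined.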

  9. Classification rules can be compactly shown as a decision tree
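
As a rough sketch of what such a tree might look like for the credit rules on slide 8, the tree below stores one test per internal node and a class at each leaf. The layout of the tree, the names `CREDIT_TREE` and `walk_tree`, and the "unclassified" leaves are assumptions made for illustration; the original figure is not available in this transcript.

```python
# Internal nodes are dicts with a test and two branches; leaves are class labels.
CREDIT_TREE = {
    "test": lambda p: p["degree"] == "masters",
    "yes": {
        "test": lambda p: p["income"] > 75_000,
        "yes": "excellent",
        "no": "unclassified",  # assumption: no rule covers this case
    },
    "no": {
        "test": lambda p: p["degree"] == "bachelors",
        "yes": {
            "test": lambda p: 25_000 <= p["income"] <= 75_000,
            "yes": "good",
            "no": "unclassified",
        },
        "no": "unclassified",
    },
}

def walk_tree(node, person):
    """Follow yes/no branches from the root until a leaf label is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](person) else node["no"]
    return node

print(walk_tree(CREDIT_TREE, {"degree": "bachelors", "income": 50_000}))
```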

  10. Clustering Studies • Clustering Studies = Unsupervised Learning • A method of grouping rows of data that share similar trends and patterns • We have no dependent variable • Clustering can also be based on historical patterns, but the outcome (loyal or lost) is not supplied with the training data • Clustering techniques try to look for similarities within a data set and group similar rows together into clusters or segments
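
One common way to group similar rows is k-means, sketched below in plain Python. The slides do not name a specific clustering algorithm, so k-means is a swapped-in example; the customer rows (income in thousands, number of children) and the function name `k_means` are hypothetical.

```python
from math import dist

def k_means(rows, k, max_iterations=20):
    """Group rows into k clusters by repeatedly assigning each row to its nearest center."""
    centers = [tuple(row) for row in rows[:k]]  # simple start: first k rows
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda i: dist(row, centers[i]))
            clusters[nearest].append(row)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customers: (income in thousands, number of children). No outcome column is given.
customers = [(90, 0), (40, 2), (95, 1), (45, 3), (92, 0), (42, 2)]
centers, clusters = k_means(customers, k=2)
print(clusters)
```

Because no dependent variable is supplied, the algorithm only sees the input columns. In practice the columns would usually be rescaled first so that a large-valued column such as income does not dominate the distance.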

  11. Customers are clustered into four segments (Cluster 1, Cluster 2, Cluster 3, Cluster 4) • Income: High, Children: 1, Car: Luxury • Income: Medium, Children: 2, Car: Sedan • Car: Truck, Income: Medium, Children: 3 • Income: High, Children: 0, Car: Compact

  12. Visualization • It is simply the graphical presentation of data • Microsoft Excel, for example, has built-in graphing and mapping capabilities • Representing data graphically often brings out points that you would not normally see

  13. Why use Data Mining • Direct Marketing • Trend Analysis • Fraud Detection • Forecasting in Financial Markets

  14. Direct Marketing • The ability to predict which customers are most likely to buy a certain product, or are the most desirable buyers, can save companies immense amounts in marketing expenditures

  15. Trend Analysis • Understanding trends in the marketplace is a strategic advantage, because it is useful in reducing costs and time to market

  16. Fraud Detection • Data Mining techniques can model which insurance claims, cellular phone calls, or credit card purchases are likely to be fraudulent

  17. Forecasting in Financial Markets • Data Mining is used extensively to model financial markets

  18. How Do We Mine Data • There are five steps to Data Mining: • Data Preparation • Defining a study • Reading the data and building a model • Understanding the model • Prediction

  19. Data Preparation • Data preparation is considered the heart of the Data Mining process • Data usually accumulates in a transactional database, where actual records of transactions are stored • Data preparation requires that the data from distributed databases be pooled together and cleansed of redundant, inconsistent, incomplete, irrelevant, and otherwise inappropriate data

  20. Data Preparation (Cont) • Data Cleaning: • A column containing a list of soft drinks may have the values “Pepsi” , “Pepsi Cola”, and “Cola”. • The values refer to the same drink, but are not known to the computer as the same. • Missing Values: • Some Data Mining approaches require rows of data to be complete in order to mine the data • If too many values are missing in a data set, it becomes hard to gather any useful information from this data or to make predictions from it
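
A minimal sketch of both cleaning steps, assuming the data arrives as a list of dictionaries. The mapping `CANONICAL_DRINKS`, the helper names, and the sample rows are hypothetical; treating `None` and the empty string as "missing" is also an assumption.

```python
# Map every known spelling to one canonical value so the computer sees them as the same drink.
CANONICAL_DRINKS = {"pepsi": "Pepsi", "pepsi cola": "Pepsi", "cola": "Pepsi"}

def clean_drink(value):
    """Normalize case and whitespace, then look up the canonical name."""
    key = value.strip().lower()
    return CANONICAL_DRINKS.get(key, value.strip())

def complete_rows(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if all(v is not None and v != "" for v in row.values())]

rows = [
    {"customer": "A", "drink": "Pepsi Cola"},
    {"customer": "B", "drink": "Cola"},
    {"customer": "C", "drink": None},  # missing value: this row is dropped
]
cleaned = [{**row, "drink": clean_drink(row["drink"])} for row in complete_rows(rows)]
print(cleaned)
```

Dropping incomplete rows is the simplest policy; if too many rows are lost this way, the remaining data may no longer support useful predictions, as the slide warns.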

  21. Data Preparation (Cont) • Data Derivation: • Suppose there are columns called maximum$-2002 and maximum$-2003 that describe the dollars spent in 2002 and 2003 • Then an interesting derivation is $-difference, which is the change in the amount of money spent between 2002 and 2003 • Merging Data: • Data is usually stored in the form of tables • Merging data in a relational system can be achieved in a number of ways: 1. Merging tables through a view (Query Tools) 2. An SQL statement, or 3. An export of data into a flat file
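
The derivation and the three merging routes can be sketched with Python's built-in `sqlite3` and `csv` modules. The table names, the column names (`spent_2002`, `spent_2003`, `difference`, standing in for the slide's maximum$-2002, maximum$-2003 and $-difference), and the sample values are all hypothetical.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE spending (customer_id INTEGER, spent_2002 REAL, spent_2003 REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO spending VALUES (1, 100.0, 150.0), (2, 80.0, 60.0);

    -- Route 1: merge the two tables through a view, adding the derived column.
    CREATE VIEW customer_spending AS
        SELECT c.name, s.spent_2002, s.spent_2003,
               s.spent_2003 - s.spent_2002 AS difference
        FROM customers AS c
        JOIN spending AS s ON s.customer_id = c.id;
""")

# Route 2: a plain SQL statement (here it reads from the view defined above).
rows = conn.execute(
    "SELECT name, difference FROM customer_spending ORDER BY name"
).fetchall()

# Route 3: export the merged rows to a flat (CSV) file; an in-memory buffer stands in for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "difference"])
writer.writerows(rows)
flat_file = buffer.getvalue()
print(flat_file)
```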

  22. Defining a Study • Defining a study differs between Supervised (Classification) and Unsupervised (Clustering) learning • For Supervised learning: • Articulating a goal • Specifying the data fields that are used in the study • For Unsupervised learning: • The goal is to group similar types of data, which is useful in many activities, or • To identify exceptions in a data set, which is useful in discovering fraudulent or incorrect data

  23. Read the data and build a Model • A data mining product reads a data set and constructs a model • A model will summarize large amounts of data by accumulating indicators • Such indicators include: • Frequencies: Show how often a certain value occurs • Weights (or impacts): Indicate how well some inputs indicate the occurrence of an output • Conjunctions: Sometimes inputs have more weight together than apart • Differentiation: Indicates how much more important an input criterion is to one outcome than another
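
The first two indicators can be illustrated with simple counts. This is a sketch under stated assumptions: the slide does not define how a weight is computed, so the code below takes it to be the share of rows with a given input value that end in a given outcome. The sample rows and the `weight` helper are hypothetical.

```python
from collections import Counter

# Hypothetical historical rows: one input column (plan) and the known outcome.
rows = [
    {"plan": "prepaid", "outcome": "left"},
    {"plan": "prepaid", "outcome": "left"},
    {"plan": "prepaid", "outcome": "loyal"},
    {"plan": "contract", "outcome": "loyal"},
    {"plan": "contract", "outcome": "loyal"},
]

# Frequencies: how often each input value occurs.
frequencies = Counter(row["plan"] for row in rows)

# Counts of each (input value, outcome) pair, used for the weights below.
pair_counts = Counter((row["plan"], row["outcome"]) for row in rows)

def weight(value, outcome):
    """Assumed definition: fraction of rows with this input value that had this outcome."""
    return pair_counts[(value, outcome)] / frequencies[value]

print(frequencies["prepaid"], weight("prepaid", "left"))
```

Conjunctions could be handled the same way by counting tuples of several input columns instead of a single column.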

  24. Understanding the Model • Model understanding takes different forms based on the type of model used to represent the data • We will discuss Data Mining Models later…

  25. Prediction • Prediction is the process of choosing the best possible outcomes based on historical data • Predictive data mining methods fall into three broad categories: • Mathematical methods • Logic methods • Distance methods

  26. Prediction (Cont) • Mathematical methods: • Linear math solution • Non-linear math solution • Logic methods: • Produce solutions quite different from those of mathematical methods • Logical methods often produce tree-like solutions • The best-known logical solutions are decision trees and decision rules.
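
As one example of a linear math solution, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The slides do not specify which linear method is meant, so least squares is an assumption, and the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: x = year index, y = units sold.
years = [1, 2, 3, 4]
units = [3, 5, 7, 9]
slope, intercept = fit_line(years, units)

def predict(x):
    return slope * x + intercept

print(slope, intercept, predict(5))
```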

  27. Prediction (Cont) • Distance methods: • A representative sample of cases is kept on file • These cases will be used as a benchmark for classifying new cases • Features of the new case are measured against features of the benchmark cases for proximity
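
A distance method can be sketched as a nearest-neighbor lookup: the new case receives the label of the closest benchmark case. The benchmark cases, their features (monthly bill, years as a customer), and the name `classify_by_distance` are hypothetical, and Euclidean distance is an assumed choice of proximity measure.

```python
from math import dist

# Benchmark cases kept on file: (features, known outcome).
BENCHMARK_CASES = [
    ((25, 1), "left"),
    ((30, 2), "left"),
    ((60, 8), "loyal"),
    ((70, 10), "loyal"),
]

def classify_by_distance(features):
    """Return the outcome of the benchmark case closest to the new case."""
    _, label = min(BENCHMARK_CASES, key=lambda case: dist(features, case[0]))
    return label

print(classify_by_distance((65, 9)))
print(classify_by_distance((28, 1)))
```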

  28. Prediction (Cont) • Here are a few interesting predictive capabilities: • Understanding why a prediction is made: some models will provide the reasons why a prediction is made • Margin of victory: if the best case prediction has a score of 100 and the challenger prediction has a score of 50, then the margin of victory is 50%. If the prediction has a score of 100 and the challenger has 99, then the margin of victory would be 1%. Generally, the higher the margin of victory, the more likely the prediction is to be true
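
The two worked figures on this slide are consistent with expressing the gap between the scores as a fraction of the best score. That formula is inferred from the examples rather than stated in the slides, so treat the sketch below as an assumption.

```python
def margin_of_victory(best_score, challenger_score):
    """Gap between the best and the challenger prediction, relative to the best score."""
    return (best_score - challenger_score) / best_score

print(margin_of_victory(100, 50))  # the slide's first example (50%)
print(margin_of_victory(100, 99))  # the slide's second example (1%)
```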

  29. Prediction (Cont) • Scenario playing: Some prediction models have the ability to change parameters to see how predictions change • Understanding prediction affinities: Holding two variables constant to see what the other predictions would look like

  30. Data Mining Models • Decision Trees • Genetic Algorithms • Neural Nets • Agent Network Technology • Hybrid Models • Statistics

  31. Data Mining Models (Cont) • Decision Trees: • Creating a tree-like structure to describe a data set • The greatest benefit of decision tree approaches is their understandability • Genetic Algorithms: • Are a method of combinatorial optimization based on processes in biological evolution
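
To make the genetic-algorithm idea concrete, here is a toy sketch that evolves bit strings toward the one with the most 1s. The problem, the parameter values, and the names `fitness` and `evolve` are all illustrative assumptions; the slides do not describe a particular implementation.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1s in the bit string."""
    return sum(bits)

def evolve(length=12, population_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)]
    initial_best = max(fitness(individual) for individual in population)
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)        # crossover point
            child = mother[:cut] + father[cut:]   # new list, parents are not modified
            if rng.random() < 0.1:                # occasional mutation: flip one bit
                position = rng.randrange(length)
                child[position] = 1 - child[position]
            children.append(child)
        population = survivors + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the surviving half is carried over unchanged, the best fitness in the population never decreases from one generation to the next.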

  32. Data Mining Models (Cont) • Neural Nets: • Are used extensively in the business world as predictive models • Neural Nets are widely used in the financial market to model fraud in credit cards and monetary transactions • Agent Network Technology: • This modeling method treats all data elements as agents that are connected to each other in a significant way

  33. Data Mining Models (Cont) • Hybrid Models: • Vendor tools that make use of more than one approach are referred to as hybrid systems • Being a hybrid system does not always imply that the tool uses a hybrid algorithm • For example, Thinking Machines, with their Darwin product, makes use of several different mining algorithms. While the algorithms themselves are not hybrid, the product uses the algorithms in combination

  34. Data Mining Models (Cont) • Statistics: • Used to create a model of data sets • Uses probability, data analysis, and statistical inference

  35. Thank You For Listening • Questions
