
Chapter I: Introduction BIS 541 2018/2019 Spring

Explore the motivation behind data mining, the methodology of knowledge discovery from data, and the functionalities of data mining. Understand the importance of converting data into information and knowledge for decision-making. Discover various business applications of data mining.

pgober


Presentation Transcript


  1. Chapter I: Introduction BIS 541 2018/2019 Spring

  2. Chapter 1. Introduction • Motivation: Why data mining? • Methodology of Knowledge Discovery from Data • Data mining functionalities • Are all the patterns interesting? • Business applications of data mining

  3. Motivation: “Necessity is the Mother of Invention” • Data explosion problem • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • Need to convert such data into information and knowledge • and then take action, use in decision making • Applications • Business: marketing, finance, HR, operations management • Information Systems: security • Daily life: traffic, weather prediction • Education • Health care • Science/Engineering

  4. Evolution of Database Technology (1) • Data collection, database creation • Data management • data storage and retrieval • database transaction processing • Data analysis and understanding • Data mining and data warehousing

  5. Evolution of Database Technology (2) • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s: • Data mining, data warehousing, multimedia databases, and Web databases • 2000s • Stream data management and mining • Data mining and its applications • Web technology (XML, data integration) and global information systems

  6. The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets

  7. Developments in computer hardware • Powerful and affordable computers • Data collection equipment • clickstreams, sensors, cameras, IoT • Storage media • Communication and networking • cloud computing

  8. Data Warehouse / Business Intelligence (BI) • Data cleaning • Data integration • OLAP: On-Line Analytical Processing • summarization • consolidation • aggregation • view information from different angles • but additional data analysis tools are needed for • classification/prediction • clustering • discovering frequently occurring patterns • characterization of data changing over time

  9. Data rich information poor situation • Abundance of data • need for powerful data analysis tools • “data tombs” - data archives • seldom visited • Important decisions are made • not on the information rich data stored in databases • but on a decision maker’s intuition • no tool to extract knowledge embedded in vast amounts of data • Expert system technology • domain experts to input knowledge • time consuming and costly

  10. What Is Data Mining? • Data mining (knowledge discovery from data): • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases • Alternative names and their “inside stories”: • Data mining: a misnomer? • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, data analytics, etc. • What is not data mining? • Query processing • Expert systems or small ML/statistical programs

  11. Data Mining vs. Data Query • Data query, e.g. • A list of all customers who use a credit card to buy a smart phone • A list of all MIS students who have a GPA of 3.5 or higher and have studied 4 or fewer semesters • Data mining problems, e.g. • What is the likelihood of a customer purchasing a smart phone with a credit card? • Given the characteristics of an MIS student, predict her GPA in the coming term • What are the characteristics of MIS undergrad students?
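
The contrast on this slide can be sketched in a few lines. This is only an illustration: the purchase records, field names, and the helper names `smartphone_by_card` and `share_smartphone_by_card` are all made up here, and a relative frequency is the crudest possible likelihood estimate, not a method the slides prescribe.

```python
# Hypothetical purchase records; every field name and value is invented for illustration.
purchases = [
    {"customer": "A", "item": "smart phone", "payment": "credit card"},
    {"customer": "B", "item": "smart phone", "payment": "cash"},
    {"customer": "C", "item": "laptop", "payment": "credit card"},
    {"customer": "D", "item": "smart phone", "payment": "credit card"},
    {"customer": "E", "item": "printer", "payment": "cash"},
]


def is_smartphone_by_card(row):
    """Condition shared by the query and the estimate."""
    return row["item"] == "smart phone" and row["payment"] == "credit card"


def smartphone_by_card(rows):
    """Data query: list the customers whose record matches a stated condition."""
    return [r["customer"] for r in rows if is_smartphone_by_card(r)]


def share_smartphone_by_card(rows):
    """Mining-style question: estimate a likelihood from the data.

    Here it is simply the share of all purchases that are smart phones
    paid for by credit card (an assumed reading of the slide's question).
    """
    if not rows:
        return None
    return sum(1 for r in rows if is_smartphone_by_card(r)) / len(rows)


matching_customers = smartphone_by_card(purchases)
estimated_likelihood = share_smartphone_by_card(purchases)
```

The query returns rows that already exist; the estimate is a statement about purchases that have not happened yet, which is why it needs a model rather than a filter.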

  12. Chapter 1. Introduction • Motivation: Why data mining? • Methodology of Knowledge Discovery from Data • Data mining functionalities • Are all the patterns interesting? • Business applications of data mining

  13. Why Data Mining? • Four questions to be answered • Can the problem be clearly defined? • Does potentially meaningful data exist? • Does the data contain hidden knowledge, or is it useful only for reporting purposes? • Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying a data mining project?

  14. Steps of a KDD Process (1) • 1. Goal identification: • Define problem • relevant prior knowledge and goals of application • 2. Creating a target data set: data selection • 3. Data preprocessing: (may take 60%-80% of effort!) • removal of noise or outliers • strategies for handling missing data fields • accounting for time sequence information • 4. Data reduction and transformation: • Find useful features, dimensionality/variable reduction, invariant representation.

  15. Steps of a KDD Process (2) • 5. Data Mining: • Choosing functions of data mining: • summarization, classification, regression, association, clustering. • Choosing the mining algorithm(s): • which models or parameters • Search for patterns of interest • 6. Presentation and Evaluation: • visualization, transformation, removing redundant patterns, etc. • 7. Taking action: • incorporating into the performance system • documenting • reporting to interested parties

  16. An example: Customer Segmentation • 1. Marketing department wants to perform a segmentation study on the customers of AE Company • 2. Decide on relevant variables from a data warehouse on customers, sales, promotions • Customers: name, ID, income, age, education, ... • Sales: history of sales • Promotion: promotion types, durations, ... • 3. Handle missing income, addresses, ... • determine outliers if any • 4. Generate new index variables representing wealth of customers • Wealth = a*income + b*#houses + c*#cars ... • Make necessary transformations (e.g., z-scores) so that some data mining algorithms work more efficiently
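
Step 4 of this example (the wealth index and the z-score transformation) might look like the sketch below. The customer records and the weights `A`, `B`, `C` are hypothetical; the slide gives only the form of the formula, not its coefficients.

```python
from statistics import mean, pstdev

# Hypothetical customers; the slide does not give any actual data.
customers = [
    {"id": 1, "income": 40000, "houses": 1, "cars": 1},
    {"id": 2, "income": 85000, "houses": 2, "cars": 1},
    {"id": 3, "income": 30000, "houses": 0, "cars": 0},
    {"id": 4, "income": 120000, "houses": 3, "cars": 2},
]

# Assumed weights for Wealth = a*income + b*#houses + c*#cars.
A, B, C = 1.0, 50000.0, 10000.0


def wealth(customer):
    """Index variable combining income, number of houses, and number of cars."""
    return A * customer["income"] + B * customer["houses"] + C * customer["cars"]


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


wealth_values = [wealth(c) for c in customers]
wealth_z = z_scores(wealth_values)
```

Standardizing matters for distance-based algorithms such as k-means: without it, a variable measured in tens of thousands (income) would dominate one measured in single digits (number of cars).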

  17. Example: Customer Segmentation cont. • 5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study, so as to find groups of customers with similar characteristics • 5.b: Choose a clustering algorithm • K-means or k-medoids or any suitable one for that problem • 5.c: Apply the algorithm • Find clusters or segments • 6. Make reverse transformations, visualize the customer segments • 7. Present the results in the form of a report to the marketing department • Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive • Develop marketing strategies for each segment
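
Steps 5.b and 5.c can be illustrated with a bare-bones k-means (Lloyd's algorithm). This is a teaching sketch, not the course's implementation: the points are invented standardized (wealth, age) pairs, and taking the first k points as initial centroids is an assumption made only to keep the run deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means on tuples of numbers.

    Initial centroids are the first k points (an assumption for determinism;
    real tools usually pick them randomly or with a smarter seeding rule).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical standardized (wealth, age) values for six customers.
points = [(-1.0, -1.1), (1.2, 0.9), (-0.9, -0.8), (1.0, 1.1), (-1.1, -0.9), (0.9, 1.0)]
labels, centroids = kmeans(points, k=2)
```

Each label is a segment number; step 6 on the slide would then map the centroids back to original units before presenting them.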

  18. Data Mining: A KDD Process • Data mining: the core of knowledge discovery process. • [Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

  19. CRISP-DM • 1. Business Understanding • Focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan. • 2. Data Understanding • Starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

  20. CRISP-DM (cont.) • 3. Data Preparation • The data preparation phase covers all activities to construct the final dataset from the initial raw data. • 4. Modeling • Modeling techniques are selected and applied. Since some techniques like neural nets have specific requirements regarding the form of the data, there can be a loop back here to data prep.

  21. CRISP-DM (cont.) • 5. Evaluation • Once one or more models have been built that appear to have high quality based on whichever loss functions have been selected, these need to be tested to ensure they generalize against unseen data and that all key business issues have been sufficiently considered. The end result is the selection of the champion model(s). • 6. Deployment • Generally this will mean deploying a code representation of the model into an operating system to score or categorize new unseen data as it arises and to create a mechanism for the use of that new information in the solution of the original business problem. Importantly, the code representation must also include all the data prep steps leading up to modeling so that the model will treat new raw data in the same manner as during model development.
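
The last point, that the deployed code must carry the data prep steps along with the model, can be sketched as one object that fits both and scores raw data. The class name `ScoringPipeline`, the toy threshold model, and the data are all hypothetical; the sketch also assumes the positive class has the larger values.

```python
from statistics import mean, pstdev


class ScoringPipeline:
    """Bundles a preparation step with a model so both are deployed together."""

    def fit(self, raw_values, labels):
        # Preparation parameters are learned once, during model development.
        self.mu = mean(raw_values)
        self.sigma = pstdev(raw_values) or 1.0
        prepared = [self._prepare(v) for v in raw_values]
        # Toy model: a threshold halfway between the two class means.
        positives = [x for x, y in zip(prepared, labels) if y == 1]
        negatives = [x for x, y in zip(prepared, labels) if y == 0]
        self.threshold = (mean(positives) + mean(negatives)) / 2
        return self

    def _prepare(self, raw_value):
        # The same standardization is applied in development and in scoring.
        return (raw_value - self.mu) / self.sigma

    def score(self, raw_value):
        """Score new RAW data; the caller never has to repeat the prep by hand."""
        return 1 if self._prepare(raw_value) > self.threshold else 0


# Hypothetical development data: one raw input and a 0/1 outcome.
raw = [20, 30, 40, 80, 90, 100]
outcome = [0, 0, 0, 1, 1, 1]
pipeline = ScoringPipeline().fit(raw, outcome)
new_scores = [pipeline.score(95), pipeline.score(25)]
```

If only the threshold were deployed, new raw values would be compared against a cut-off defined on standardized values, and the scores would be wrong without any error being raised.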

  22. Business Understanding • focuses on understanding the project objectives and requirements from a business perspective, • then converting this knowledge into a data mining problem definition • and a preliminary plan designed to achieve the objectives

  23. Data Understanding • starts with initial data collection and proceeds with activities that enable you to become familiar with the data, identify data quality problems, discover first insights into the data, • and/or detect interesting subsets to form hypotheses regarding hidden information

  24. Data Preparation • covers all activities needed to construct the final dataset • [data that will be fed into the modeling tool(s)] • from the initial raw data. • Data preparation tasks are likely to be performed multiple times and not in any prescribed order. • Tasks include: • table, record, and attribute selection, • as well as transformation and cleaning of data for modeling tools.

  25. Modeling • Various modeling techniques are selected and applied, • their parameters are calibrated to optimal values. • Typically, there are several techniques for the same data mining problem type. • Some techniques have specific requirements on the form of data. • Therefore, going back to the data preparation phase is often necessary

  26. Evaluation • you have built a model (or models) that appears to have high quality from a data analysis perspective. • Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review • the steps executed to create it, to be certain the model properly achieves the business objectives. • A key objective is • to determine if there is some important business issue that has not been sufficiently considered. • At the end of this phase, • a decision on the use of the data mining results should be reached

  27. Deployment (1) • Creation of the model is generally not the end of the project. • Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. • It often involves applying “live” models within an organization’s decision making processes—for example, real-time • personalization of Web pages or repeated scoring of marketing databases.

  28. Deployment (2) • Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. • In many cases, it is the customer, not the data analyst, who carries out the deployment steps. • However, even if the analyst will carry out the deployment effort, it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models

  29. Data Mining in Business Intelligence • [Figure: pyramid of layers; potential to support business decisions increases toward the top] • Layers, top to bottom: • Decision Making • Data Presentation: Visualization Techniques • Data Mining: Information Discovery • Data Exploration: Statistical Summary, Querying, and Reporting • Data Preprocessing/Integration, Data Warehouses • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems • Roles shown alongside, top to bottom: End User, Business Analyst, Data Analyst, DBA • (Slide footer: January 4, 2020, Data Mining: Concepts and Techniques)

  30. Architecture of a Typical Data Mining System • [Figure: layered architecture, top to bottom] • Graphical user interface • Pattern evaluation • Data mining engine • Knowledge base • Database or data warehouse server • Filtering; data cleaning & data integration • Data Warehouse, Databases

  31. Architecture of a Typical Data Mining System • Database, data warehouse • Database or data warehouse server • Knowledge base • concept hierarchies • user beliefs • assess pattern’s interestingness • other thresholds • Data mining engine • functional modules • characterization, association, classification, cluster analysis, evolution and deviation analysis • Pattern evaluation module • Graphical user interface

  32. Data Mining: Confluence of Multiple Disciplines • [Figure: Data Mining at the center, drawing on] • Database Technology • Statistics • Machine Learning • Visualization • Information Science • Other Disciplines

  33. Why Confluence of Multiple Disciplines? • Tremendous amount of data • Algorithms must be highly scalable to handle data at the scale of terabytes • High dimensionality of data • Micro-array may have tens of thousands of dimensions • High complexity of data • Data streams and sensor data • Time-series data, temporal data, sequence data • Structured data, graphs, social networks and multi-linked data • Heterogeneous databases and legacy databases • Spatial, spatiotemporal, multimedia, text and Web data • Software programs, scientific simulations • New and sophisticated applications

  34. Efficient and Scalable Techniques • For an algorithm to be efficient and scalable • its running time should be predictable and acceptable • How • Parallel and distributed algorithms • Sampling from databases
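
One way to make "sampling from databases" concrete is reservoir sampling, which draws a fixed-size uniform sample in a single pass without knowing the table size in advance. The slides do not name a sampling method, so this choice is an assumption; the function name and the stand-in row source are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable, in one pass.

    Classic reservoir sampling (Algorithm R). Memory use is O(k),
    regardless of how many rows the source yields.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large table scan: row ids 0..9999.
sampled_rows = reservoir_sample(range(10_000), k=100)
```

A mining algorithm can then run on `sampled_rows`, which keeps running time predictable, at the cost of some accuracy on rare patterns.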

  35. Chapter 1. Introduction • Motivation: Why data mining? • Methodology of Knowledge Discovery in Databases • Data mining functionalities • Are all the patterns interesting? • Business applications of data mining

  36. Two Styles of Data Mining • Descriptive data mining • characterizes the general properties of the data in the database • finds patterns in data and • the user determines which ones are important • Predictive data mining • performs inference on the current data to make predictions • we know what to predict • Not mutually exclusive • used together • Descriptive → predictive • E.g., customer segmentation – descriptive by clustering • Followed by a risk assignment model – predictive by ANN

  37. Supervised vs. Unsupervised Learning • Supervised learning (classification, prediction) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (summarization, association, clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

  38. Descriptive Data Mining (1) • Discovering new patterns inside the data • Used during the data exploration steps • Typical questions answered by descriptive data mining • what is in the data • what does it look like • are there any unusual patterns • what does the data suggest for customer segmentation • users may have no idea • which kind of patterns may be interesting

  39. Descriptive Data Mining (2) • patterns at various granularities • geography • country - city - region - street • student • university - faculty - department - minor • Functionalities of descriptive data mining • Clustering • Ex: customer segmentation • summarization • visualization • Association • Ex: market basket analysis

  40. A model is a black box • X: vector of independent variables or inputs • Y = f(X): an unknown function • Y: dependent variables or output • a single variable or a vector • [Figure: inputs X1, X2 → Model → output Y] • The user does not care what the model is doing • it is a black box • the user is interested in the accuracy of its predictions

  41. Predictive Data Mining (1) • Using known examples the model is trained • the unknown function is learned from data • the more data with known outcomes is available • the better the predictive power of the model • Used to predict outcomes whose inputs are known but the output values are not realized yet • Never 100% accurate

  42. Predictive Data Mining (2) • The performance of a model on past data is not important • to predict the known outcomes • Its performance on unknown data is much more important
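
The point of this slide can be seen with a holdout split. In the sketch below, a 1-nearest-neighbour classifier memorizes its training data, so its accuracy on past data is perfect by construction and says nothing about unseen data. The data, the 20% label noise, and the 150/50 split are all assumptions for illustration.

```python
import random


def one_nn_predict(train, x):
    """Predict the label of the training example whose input is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Share of examples in `data` that the 1-NN model built on `train` gets right."""
    hits = sum(1 for x, y in data if one_nn_predict(train, x) == y)
    return hits / len(data)


rng = random.Random(42)

# Hypothetical data: the label is 1 when x > 500, with about 20% of labels flipped as noise.
data = []
for x in rng.sample(range(1000), 200):  # 200 distinct x values
    label = 1 if x > 500 else 0
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

# Holdout split: the model only ever sees the first 150 examples.
train, test = data[:150], data[150:]

train_accuracy = accuracy(train, train)  # performance on past (known) data
test_accuracy = accuracy(train, test)    # performance on unseen data
```

`train_accuracy` is 1.0 because every training point is its own nearest neighbour; only `test_accuracy` estimates how the model will behave on new customers or transactions.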

  43. Typical questions answered by predictive models • Who is likely to respond to our next offer • based on history of previous marketing campaigns • Which customers are likely to leave in the next six months • What transactions are likely to be fraudulent • based on known examples of fraud • What is the total amount a customer will spend in the next month

  44. Data Mining Functionalities (1) • Concept description: Characterization and discrimination • Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders • Pattern mining (correlation and causality) • Frequent patterns vs. rare patterns • Association rule mining: Multi-dimensional vs. single-dimensional association • age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “NB”) [support = 2%, confidence = 60%] • contains(T, “computer”) → contains(T, “software”) [1%, 75%] • Sequential pattern mining
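
The bracketed numbers after each rule are its support and confidence. A minimal sketch of how they are computed for a rule such as contains(T, “computer”) → contains(T, “software”) follows; the five transactions are invented, so the resulting figures differ from the slide's [1%, 75%].

```python
# Hypothetical market-basket transactions, each a set of items.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"software"},
]


def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)


def confidence(antecedent, consequent, txns):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    base = support(antecedent, txns)
    if base == 0:
        return None  # rule never applies in this data
    return support(antecedent | consequent, txns) / base


rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

Here two of five transactions contain both items (support 0.4), and two of the three transactions with a computer also contain software (confidence about 0.67).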

  45. Data Mining Functionalities (2) • Classification and Numerical Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify people as healthy or sick, or classify transactions as fraudulent or not • Methods: decision-tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values • Cluster analysis • Class label is unknown: Group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

  46. Data Mining Functionalities (3) • Outlier analysis • Outlier: a data object that does not comply with the general behavior of the data • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis • Trend and evolution analysis • Trend and deviation: regression analysis • Sequential pattern mining: click stream analysis • Similarity-based analysis • Other pattern-directed or statistical analyses
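
As a small illustration of outlier analysis, the sketch below flags values that lie far from the mean in standard-deviation units. The z-score rule, the threshold of 2, and the transaction amounts are assumptions; the slides do not prescribe a particular detection method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the general behaviour.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 500]
outliers = zscore_outliers(amounts)
```

Whether the flagged amount is noise to discard or a fraud case to investigate depends on the application, which is the distinction the slide draws.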

  47. How to apply to complex data • time series, sequential, spatial, spatiotemporal, network

  48. Concept Description • Characterization • Discrimination • Data • classes or • concepts • classes of items for sale • computers, printers • concepts of customers: • bigSpenders • budgetSpenders

  49. Data Characterization • Summarizing the data of the class under study (target class) • Methods • SQL queries • OLAP roll-up operation • user-controlled data summarization • along a specified dimension • attribute-oriented induction • without step-by-step user interaction • the output of characterization • pie charts, bar charts, curves, multidimensional data cubes, or cross tabs • in rule form as characteristic rules

  50. Characterization example • Description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics • age, employment, income • drill down on any dimension • e.g., on occupation, to view these customers according to their type of employment
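
This example can be sketched as a filter on the target class followed by a summary along one dimension. The customer records and the `summarize` helper are hypothetical; the sketch shows the kind of output characterization produces, not how an OLAP engine computes it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customer records.
customers = [
    {"age": 34, "employment": "salaried", "income": 60000, "yearly_spend": 1500},
    {"age": 45, "employment": "self-employed", "income": 90000, "yearly_spend": 2500},
    {"age": 29, "employment": "salaried", "income": 40000, "yearly_spend": 800},
    {"age": 50, "employment": "salaried", "income": 80000, "yearly_spend": 1200},
    {"age": 38, "employment": "self-employed", "income": 70000, "yearly_spend": 3000},
]

# Target class: customers who spend more than $1000 a year.
target_class = [c for c in customers if c["yearly_spend"] > 1000]


def summarize(rows, dimension):
    """Summarize the rows along one dimension: group size, average age, average income."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    return {
        key: {
            "count": len(members),
            "avg_age": mean(m["age"] for m in members),
            "avg_income": mean(m["income"] for m in members),
        }
        for key, members in groups.items()
    }


# Drill down on type of employment.
by_employment = summarize(target_class, "employment")
```

Calling `summarize` with a different dimension corresponds to drilling down on another attribute of the same target class.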
