Data Mining

Data Mining. CS157B Section 2 Larry Varela. What is Data Mining? Data Mining is "The science of extracting useful information from large data sets or databases". -- http://en.wikipedia.org/wiki/Data_mining

Presentation Transcript


  1. Data Mining CS157B Section 2 Larry Varela

  2. What is Data Mining? • Data Mining is "The science of extracting useful information from large data sets or databases". -- http://en.wikipedia.org/wiki/Data_mining • Data mining is the process of analyzing data from different perspectives and summarizing it into useful information within a particular context.

  3. History of Data Mining • Although data mining is a relatively new term, the technology has been around for more than 20 years. • For years, companies have used powerful computers to sift through volumes of supermarket scanner data and to analyze market research reports. • Recent innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

  4. Data Mining History cont… • Data mining was derived from three previously defined disciplines. • Classical statistics - embraces concepts such as regression analysis, standard distribution, standard deviation, variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. • Artificial intelligence - attempts to apply human-thought-like processing to statistical problems. AI concepts have been adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS). • Machine learning - attempts to let software learn about the data it studies, so that future decisions are based on the quality of the studied data.

  5. What is it used for? • Data Mining enables businesses to automatically explore and understand their data while identifying patterns, relationships, and dependencies that impact business outcomes. (Descriptive application) • Business Outcomes include: revenue growth, profit improvement, cost containment, and risk management. • Data Mining enables the uncovering and identification of relationships expressed as business rules, or predictive models. • These outputs can then be communicated in traditional reporting formats to guide business planning and strategies. • In addition, these outputs can also be expressed as programming code that can then be deployed into business software to generate predictions of future outcomes. (Predictive application)

  6. Common Types of Relationships • Classes: Stored data is used to locate information in predetermined groups. For example, a coffee chain could mine customer purchase data to determine when customers arrive and what they typically purchase. This information could be used to increase traffic by offering daily specials. • Clusters: Data items are grouped according to logical relationships. For example, data can be mined to identify technology market segments or recent consumer purchasing trends. • Associations: Data can be mined to identify associations between items purchased or queried. The beer-diaper example that Dr. Lee mentioned during the last class is a case of associative mining. • Sequential patterns: Data is mined to anticipate or predict behavior patterns and trends. For example, a Corvette dealer could predict the likelihood of power-folding convertible tops being purchased based on a recent increase in purchases of convertible-style vehicles.
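The association idea on this slide can be made concrete with the two measures usually reported for a rule such as "diapers → beer": support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The sketch below is only an illustration: the baskets are invented, and the helper names `support` and `confidence` are hypothetical, not taken from the presentation.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support(diapers, beer)   = {rule_support:.2f}")
print(f"confidence(diapers→beer) = {rule_confidence:.2f}")
```

In this toy data, two of the five baskets contain both items, and two of the three baskets with diapers also contain beer. A real association miner would search over many candidate itemsets rather than test one rule by hand.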

  7. How does data mining work? • Data mining consists of five major elements: • Extract, transform, and load transaction data onto the data warehouse system. • Store and manage the data in a multidimensional database system. • Provide data access to business analysts and/or information technology professionals. • Analyze the data using application software. • Present the data in a readable format. -- info quoted from http://www.anderson.ucla.edu

  8. Data mining Techniques • Classical Techniques • Statistics • Neighborhoods and Clustering • Next Generation Techniques • Trees • Networks and Rules

  9. Trees • Within a decision tree, each branch is a classification question, and the leaves of the tree are partitions of the dataset with their classification. • Decision trees can be viewed as segmentations of the original dataset, where each segment is one of the leaves of the tree. • Decision tree technology can be used to explore datasets and/or business problems. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or raise questions that need to be answered.

  10. Types of Decision Trees • Classification tree analysis is a term used when the predicted outcome is the class to which the data belongs. • Regression tree analysis is a term used when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital). • CART analysis is a term used to refer to both of the above procedures. The name CART is an acronym of Classification And Regression Trees and was first introduced by Breiman et al. [BFOS84]. -- info quoted from http://en.wikipedia.org/wiki/Decision_tree
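The difference between the two tree types shows up at the leaves. A minimal sketch, assuming the usual convention that a classification leaf predicts the most common class among its training rows and a regression leaf predicts their mean; the slide itself does not spell this out, and the function names and sample values below are hypothetical.

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Predict the most frequent class among the rows that reached this leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Predict the average target value of the rows that reached this leaf."""
    return mean(values)

# Invented example rows, for illustration only.
leaf_classes = ["visit", "visit", "no visit"]
leaf_prices = [200_000, 220_000, 240_000]

print(classification_leaf(leaf_classes))  # a class label
print(regression_leaf(leaf_prices))       # a real number
```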

  11. Decision Tree Example • Angelo is the manager of a children's zoo. Recently Angelo has been experiencing customer attendance problems. On some days lots of visitors arrive wanting to tour the park and the staff is overworked. On other days no visitors arrive and the zoo staff has too much unproductive free time. Angelo's objective is to optimize staff availability by trying to predict when people will visit the park. To accomplish this, Angelo needs to understand why people decide to visit on particular days. He assumes that weather must be an important underlying factor, so he decides to use the weather forecast for the upcoming week. Angelo records the following: • Weather Outlook (sunny, cloudy, or rainy) • Temperature • Percent Humidity • Whether it was windy or not • Zoo attendance on that particular day

  12. Decision Tree Example

  13. Decision Tree Example cont… • Angelo then applies a decision tree model to solve his problem. • [Figure: decision tree] Root (all days): Visits = 9, No Visits = 5; split on OUTLOOK? • sunny: Visits = 2, No Visits = 3; split on HUMIDITY? (>70: Visit = 0, No Visit = 3; <=70: Visit = 2, No Visit = 0) • overcast: Visit = 4, No Visit = 0 • rain: Visits = 3, No Visits = 2; split on WINDY? (TRUE: Visit = 0, No Visit = 2; FALSE: Visit = 3, No Visit = 0)

  14. Decision Tree Example cont… • The decision tree created is a model of the data that encodes the distribution of the class label in terms of the predictor attributes. The top node represents all the data. The classification tree algorithm finds that the best way to explain the dependent variable, VISIT, is by using the variable OUTLOOK. • Angelo's first conclusion: if the OUTLOOK is OVERCAST, people always visit the zoo, and there are some crazy people who visit the zoo even in the rain. • He then divided the sunny group into two groups and realized that people don't like to visit the zoo if the humidity is higher than seventy percent. • Finally, he divided the rain category into two and found that visitors also stay away from the zoo if it is windy.
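One common way for a tree algorithm to decide that OUTLOOK is a good first split is information gain, the reduction in entropy of the class label after splitting. The presentation does not say which criterion was used, so the sketch below is an assumption (an ID3-style measure). It uses only the visit counts shown in the tree on slide 13; the function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# (visits, no visits) counts read off the tree on slide 13.
root = (9, 5)
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

root_entropy = entropy(root)
gain_outlook = information_gain(root, list(outlook.values()))

# Second-level splits: both produce pure leaves, so they remove all
# remaining uncertainty in their branch.
gain_humidity = information_gain(outlook["sunny"], [(0, 3), (2, 0)])
gain_windy = information_gain(outlook["rain"], [(0, 2), (3, 0)])

print(f"entropy at root:      {root_entropy:.3f}")   # about 0.940 bits
print(f"gain from OUTLOOK:    {gain_outlook:.3f}")   # about 0.247 bits
print(f"gain from HUMIDITY:   {gain_humidity:.3f}")  # about 0.971 bits
print(f"gain from WINDY:      {gain_windy:.3f}")     # about 0.971 bits
```

The slides do not give the counts for the other candidate attributes at the root, so the sketch cannot show why OUTLOOK beat them; it only quantifies how much each split shown in the tree reduces uncertainty.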

  15. Decision Tree Example Conclusion • Angelo dismisses most of the staff on days that are sunny and humid or rainy and windy, because almost no one is going to visit the zoo on those days. On days when a lot of people will visit, he hires extra staff. • The conclusion is that the decision tree helped Angelo turn a complex data representation into a much simpler structure.
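The finished tree can be read as a handful of if/else rules, which is what makes it usable for staffing decisions. A minimal sketch of the tree from slide 13 as a function; the name `predict_visit` and its parameter names are hypothetical.

```python
def predict_visit(outlook: str, humidity: float, windy: bool) -> bool:
    """Return True if the tree from slide 13 predicts visitors for the day."""
    if outlook == "overcast":
        return True            # all 4 overcast days had visits
    if outlook == "sunny":
        return humidity <= 70  # visits only when humidity is at most 70 percent
    if outlook == "rain":
        return not windy       # visits only when it is not windy
    raise ValueError(f"unknown outlook: {outlook!r}")

# A hypothetical forecast: (outlook, percent humidity, windy)
forecast = [("sunny", 85, False), ("overcast", 90, True), ("rain", 80, True)]
for outlook, humidity, windy in forecast:
    print(outlook, humidity, windy, "->", predict_visit(outlook, humidity, windy))
```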

  16. Decision Tree Advantages • Decision trees are simple to understand and interpret. • Data preparation for a decision tree is basic or unnecessary. • They are able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. • It is possible to validate a model using statistical tests. • They are robust and perform well with large data in a short time.

  17. Data Mining Pitfalls • Sometimes data mining imposes patterns on data where none exist. This imposition of irrelevant correlations is termed data dredging or data fishing. • Large data sets invariably happen to have some exciting relationships peculiar to that data. Therefore any conclusions reached are likely to be highly suspect.
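The data dredging pitfall is easy to reproduce. The sketch below generates a target and many candidate features that are all pure random noise, then picks the feature most correlated with the target. Everything in it (sample sizes, seed, names) is an assumption chosen for illustration; the point is only that searching enough unrelated variables tends to turn up a correlation that looks meaningful.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    std_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (std_x * std_y)

rng = random.Random(0)          # fixed seed so the run is repeatable
n_rows, n_features = 20, 200    # few observations, many candidate predictors

# Target and features are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

correlations = [abs(pearson(feature, target)) for feature in features]
best = max(correlations)
print(f"strongest |correlation| among {n_features} noise features: {best:.2f}")
```

Validating a discovered pattern on data that was not used to find it is one way to catch this kind of spurious result.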

  18. References • Wikipedia.org (2006) Data mining. Retrieved on 3/20/2006 from www.wikipedia.com • Bill Palace (1996) What is Data Mining? Retrieved on 3/20/2006 from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm • Data-Mining-Software.com (2006) Data Mining History. Retrieved on 3/20/2006 from http://www.data-mining-software.com/data_mining_history.htm • Alex Berson, Stephen Smith, and Kurt Thearling (1999) An Overview of Data Mining Techniques. Retrieved on 3/20/2006 from http://www.thearling.com/text/dmtechniques/dmtechniques.htm • [BFOS84] L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.
