
Introduction to Data Mining (CS 422)




  1. Introduction to Data Mining (CS 422) Fall 2010

  2. Course Outline • Introduction • Data Pre-processing • Data Mining Algorithms • Recommenders • Classifiers • Naïve Bayes • Decision Trees • Support Vector Machines • Clustering • K-Means • LDA • Other • Association Rules

  3. Introduction • Mining useful facts from a large amount of data. Examples: • Product A sells a lot when we also sell Product B • People who take a loan are more likely to default if they have certain characteristics • A person committing credit card fraud is likely to do X • A person who likes these movies will probably like movies about X
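Patterns like "product A sells a lot when we sell product B" are what association rules capture. A minimal sketch of how a rule's strength is measured with support and confidence (the transaction data below is invented for illustration):

```python
# Hypothetical market-basket data: each transaction is the set of items in one sale.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 3/5 = 0.6
print(confidence({"diapers"}, {"beer"}))  # 3/4 = 0.75
```

A rule like {diapers} → {beer} is reported only when both numbers clear user-chosen thresholds; real algorithms (covered later under Association Rules) search the space of itemsets efficiently rather than enumerating them all.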

  4. Different Data Sources • Relational Databases • Data Warehouses • Flat Files • Web • Object-Oriented Databases • Multimedia

  5. Data Warehouse • Many enterprises consolidate data from their different homogeneous and heterogeneous data repositories into one common data source called a Data Warehouse (DW). • A Data Warehouse contains current and historical data to be used for planning and forecasting in Decision Support Systems (DSS). • Traditional databases are operational databases that store day-to-day data. • Star, Snowflake, and Galaxy schemas are modeling schemes in DW.

  6. Data Warehouse (Cont’d) • To improve performance in a DW, techniques such as summarization and denormalization are used. • Usually, but not always, a DW is accessed by On-Line Analytical Processing (OLAP). • SQL gives a precise answer to a user query. • OLAP gives a multi-dimensional view of data and is an extension of some aggregate functions in SQL. • OLAP operations are Slice, Dice, Roll-up, and Drill-down.
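Two of the OLAP operations above can be sketched on a toy fact table; all table names and figures here are hypothetical, and a real DW would do this with OLAP queries rather than Python loops:

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, quarter, amount) -- values invented.
sales = [
    ("East", "NYC",    "Q1", 100),
    ("East", "NYC",    "Q2", 120),
    ("East", "Boston", "Q1",  80),
    ("West", "SF",     "Q1",  90),
    ("West", "SF",     "Q2", 110),
]

def roll_up(rows, dim):
    """Roll-up: aggregate the measure up to a coarser dimension (e.g. city -> region)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim]] += row[-1]
    return dict(totals)

print(roll_up(sales, 0))                 # {'East': 300, 'West': 200}

# Slice: fix one dimension to a single value (here, quarter == "Q1").
q1 = [row for row in sales if row[2] == "Q1"]
print(len(q1))                           # 3 rows survive the slice
```

Drill-down is the inverse of roll-up (region back down to city), and dice fixes several dimensions at once instead of one.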

  7. OLAP vs. Data Mining • OLAP is a data summarization/aggregation tool that facilitates data analysis by giving the user a multi-dimensional view of the data. • A Data Mining tool provides automated discovery of knowledge and gives more in-depth insight into the data and its hidden information. • OLAM (OLAP Mining) is the integration of OLAP with Data Mining.

  8. OLAM Architecture • [Diagram: a graphical user interface sits on top of OLAP and Data Mining/OLAM components, which access a data warehouse built from multiple underlying databases.]

  9. DM vs Statistics and ML • Data Mining (DM), Statistics, and ML have lots in common: • Finding patterns • Building models to make predictions • Statistics largely relies on probabilistically rational mathematical models. • ML is extremely useful, but tends not to care about large data sets. If a process runs out of memory, it’s not an ML problem. • DM wants accuracy but is designed for very large datasets.

  10. Scalability • Statistical approaches deal with small data sets. • They assume all data must be cleaned and reduced. • Machine Learning deals with small data sets. • The goal is to make the machine learn. • Applications such as chess playing, rather than applications that deal with market analysis. • Real-life data to be mined can be huge, so we need scalable algorithms.

  11. DM Algorithms • Supervised (Classification / Categorization) • Bayesian • Neural Network • Decision Tree • Others: Genetic Algorithms, Fuzzy Set, K-Nearest Neighbor • Unsupervised • Association Rules • Clustering • Collaborative Filtering

  12. Supervised vs. Unsupervised • Supervised algorithms • Learning by example: • Use training data which has correct answers (class label attribute) • Create a model by running the algorithm on the training data • Identify a class label for the incoming new data • Unsupervised algorithms • Do not use training data. • Classes may not be known in advance.
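A minimal sketch of the supervised steps above, using a 1-nearest-neighbor classifier as a stand-in for the algorithms covered later (the training points and labels are invented):

```python
import math

# Hypothetical labeled training data: (feature vector, class label).
training_data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.2), "B"),
]

def train(data):
    # For 1-NN the "model" is simply the stored labeled examples.
    return list(data)

def classify(model, point):
    # Assign the label of the closest training example (1-nearest neighbor).
    nearest = min(model, key=lambda ex: math.dist(ex[0], point))
    return nearest[1]

model = train(training_data)            # build the model from training data
print(classify(model, (1.1, 0.9)))      # new point near the "A" cluster -> A
print(classify(model, (5.1, 4.9)))      # new point near the "B" cluster -> B
```

The same three-step shape (train on labeled data, keep a model, label new data) holds for Naïve Bayes, decision trees, and SVMs; only the model and the training step change.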

  13. Supervised Algorithms • [Diagram: (1) training data is fed to (2) the classifier algorithm, which builds the model; (3) test data is then run through the model, producing (4) the classification.]

  14. Collaborative Filtering • Use user recommendations such as • Ratings • Clicks • Purchases • Provide recommendations to other users • All users “collaborate” in order to “filter” the search for the best item.
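A minimal sketch of user-based collaborative filtering, assuming cosine similarity between users' rating vectors (the users, movies, and ratings below are all made up):

```python
import math

# Hypothetical user -> {item: rating} matrix.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 4, "inception": 5, "titanic": 2, "avatar": 4},
    "carol": {"matrix": 1, "titanic": 5, "notebook": 4},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors (missing items count as 0)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den

def recommend(user):
    """Suggest items the most similar other user rated that `user` has not seen."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    seen = set(ratings[user])
    return sorted(item for item in ratings[nearest] if item not in seen)

print(recommend("alice"))   # bob rates most like alice, so alice gets his extra pick
```

Real recommenders aggregate over many neighbors and weight by similarity instead of copying one user's list, but the "collaborate to filter" idea is the same.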

  15. Introduction to Evaluation (Evaluating Supervised Algorithms) • Goal: We want to measure the effectiveness of a classification (supervised) algorithm. • Take the training dataset • Build model • Test using the training dataset • This usually leads to a very optimistic result that has little ability to predict the real accuracy of the model.

  16. Cross-Validation (Cont’d) • Another approach is to take the training data set, cut it in half, and use one half for training and the other half for testing. • This leads to potential errors in estimating the real classification rate, because the half we hold out for testing may be very different from the half we used for training.

  17. 10-fold Cross Validation • Take the data set, use the first 90 percent of the data for training, and test on the final 10 percent. Then hold out the next 10 percent for testing, and so on. • [Diagram: ten runs; in each run a different 10-percent slice (Test) is held out while the remaining 90 percent (Train) builds the model.]

  18. 10-fold Cross Validation • Each run will result in a particular classification rate. • Ex: If we classified 50/100 of the test records correctly our classification rate for that run is 50%. • Choose the model that generated the highest classification rate. The final classification rate for the model is the average of the ten classification rates.
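The folding-and-averaging procedure above can be sketched in a few lines; the toy "classifier" here just predicts the majority training label, and the data set is invented:

```python
def k_fold_splits(data, k=10):
    """Yield (train, test) pairs; each fold is held out for testing exactly once."""
    fold = len(data) // k
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        yield train, test

def cross_validate(data, train_fn, accuracy_fn, k=10):
    """Average the per-run classification rates over the k runs."""
    rates = [accuracy_fn(train_fn(tr), te) for tr, te in k_fold_splits(data, k)]
    return sum(rates) / len(rates)

# Toy demo: (record, label) pairs; in practice, shuffle before splitting.
data = [(x, "pos") for x in range(60)] + [(x, "neg") for x in range(40)]

def train_majority(train):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)   # "model" = majority label

def accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)

print(cross_validate(data, train_majority, accuracy, k=10))
```

On this deliberately unshuffled data the majority model aces the six all-"pos" folds and misses the four all-"neg" folds, so the averaged classification rate is 0.6; the point is that every record gets used for testing exactly once.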

  19. DM Open Source • Weka • Pros • Weka has numerous algorithms, and plenty of data mining classes use it. • Has a great tool to run experiments: try lots of algorithms and see which one works. • Cons • Not typically used for large production systems; lacks a large-scale distributed implementation. • Mahout • Runs on top of Hadoop (that’s where it got its name). It’s a highly scalable system. Hadoop is meant for widespread distributed processing. More on it later. • It’s real: used by real, production systems.

  20. Open Source • A reasonable goal of this course is to enable you to meaningfully contribute to open source. • This will ensure you understand the concepts being taught, and it will also help you as you progress. • If you do that, the conversation about you will change from: • “oh, I did a cool data mining project in school and got an A” • To: • “I designed and implemented a new, highly scalable algorithm that is now downloaded and used by 200 major systems. People have gone through it and found a few bugs, but most were relatively small, and I work to improve it from time to time.”

  21. Summary • Data Mining algorithms are used to discover information that we did not already know. • There are various data sources, types, formats, and applications for Data Mining. • Scalability differentiates DM from statistics and Machine Learning. • Mahout is a robust, open-source framework for implementing data mining algorithms.
