Data Mining. Adrian Tuhtan 004757481 CS157A Section1. Overview. Introduction Explanation of Data Mining Techniques Advantages Applications Privacy. Data Mining. What is Data Mining?
Data Mining Adrian Tuhtan 004757481 CS157A Section1
Overview • Introduction • Explanation of Data Mining Techniques • Advantages • Applications • Privacy
Data Mining • What is Data Mining? • “The process of semiautomatically analyzing large databases to find useful patterns” (Silberschatz) • KDD – “Knowledge Discovery in Databases” (3) • “Attempts to discover rules and patterns from data” • Discover rules → Make predictions • Areas of Use • Internet – Discover the needs of customers • Economics – Predict stock prices • Science – Predict environmental change • Medicine – Match patients with similar problems → cure
Example of Data Mining • A credit card company wants to discover information about its clients from its databases. It wants to find: • Clients who respond to promotions in “Junk Mail” • Clients who are likely to switch to a competitor • Clients who are likely not to pay • Services that clients use, so that services affiliated with the credit card company can be promoted to them • Anything else that may help the company provide or promote services to its clients and ultimately make more money.
Data Mining & Data Warehousing • Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz) • Collect data → Store in a single repository • Allows for easier query development, as only a single repository needs to be queried. • Data Mining: • Analyzing databases or data warehouses to discover patterns in the data and so gain knowledge. • Knowledge is power.
Data Mining Techniques • Classification • Clustering • Regression • Association Rules
Classification • Classification: Given that each item belongs to one of several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item. • The goal is therefore to classify the new item, that is, to identify the class to which it belongs. • Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes” and “Responds Frequently”. • The bank will then attempt to find rules that describe the customers who respond frequently and sometimes. • The rules could be used to predict the needs of potential customers.
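Technique for Classification • Decision-Tree Classifiers • [Figure: decision tree predicting the credit risk of a person from their job. The root node splits on Job (Doctor, Engineer, Carpenter); each job branch then splits on Income, ending in leaves labeled Good or Bad. Income thresholds shown in the figure: >100K, <30K, >50K, <40K, >90K, <50K.]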
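Once a decision tree has been built, classifying a new item amounts to following nested conditions from the root to a leaf. The sketch below only illustrates that lookup step, not how a tree is learned from training instances. The function name `credit_risk` and the cut-off values are hypothetical: the slide does not make clear which income threshold belongs to which outcome, so the numbers here are illustrative assumptions.

```python
# Hypothetical cut-offs: income (in thousands) at or above which a person
# with the given job is predicted to be a good credit risk.
# These values are illustrative assumptions, not taken from the slide.
GOOD_INCOME_CUTOFF = {
    "doctor": 100,
    "engineer": 50,
    "carpenter": 40,
}


def credit_risk(job: str, income_k: float) -> str:
    """Walk a two-level decision tree: split on job, then on income."""
    cutoff = GOOD_INCOME_CUTOFF.get(job.lower())
    if cutoff is None:
        return "unknown"  # job not covered by this tree
    return "good" if income_k >= cutoff else "bad"


print(credit_risk("Engineer", 65))  # engineer branch, income above its cut-off
print(credit_risk("Doctor", 80))    # doctor branch, income below its cut-off
```

A real decision-tree classifier would choose the split attributes and thresholds automatically from the training instances rather than from a hand-written table.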
Clustering • “Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other. ” (2) • Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased. • The categories are not specified in advance; this is referred to as ‘unsupervised learning’.
Clustering • Group data into clusters • Similar data is grouped in the same cluster • Dissimilar data is grouped in different clusters • How is this achieved? • K-Nearest Neighbor • A classification method that classifies a point by calculating the distances between the point and points in the training data set. Then it assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2) • Hierarchical • Group data into t-trees
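The k-nearest-neighbor description above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_classify`, the training points and the segment labels are made up for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up training data: (age, number of policies held) -> customer segment
training = [
    ((25, 1), "basic"),
    ((27, 1), "basic"),
    ((30, 2), "basic"),
    ((52, 4), "premium"),
    ((55, 5), "premium"),
    ((60, 4), "premium"),
]

print(knn_classify((28, 2), training, k=3))
print(knn_classify((58, 5), training, k=3))
```

In practice the features would usually be scaled first so that no single attribute dominates the distance, and an odd k helps avoid tied votes when there are two classes.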
Regression • “Regression deals with the prediction of a value, rather than a class.” (1, p747) • Example: Find out whether there is a relationship between smoking and cancer-related illness in patients. • Given values: X1, X2, …, Xn • Objective: predict variable Y • One way is to estimate coefficients a0, a1, a2, …, an such that • Y = a0 + a1X1 + a2X2 + … + anXn • This is linear regression
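For the simplest case of a single predictor, Y = a0 + a1X1, the coefficients can be estimated with the standard least-squares formulas. The sketch below is a minimal illustration under that assumption; the function name `fit_line` and the data points are made up, and the data were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model Y = a0 + a1*X.

    Assumes the x values are not all identical (otherwise sxx is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx            # slope
    a0 = mean_y - a1 * mean_x  # intercept
    return a0, a1


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # made-up data lying exactly on y = 1 + 2x
a0, a1 = fit_line(xs, ys)
print(a0, a1)  # 1.0 2.0

# Predict Y for a new X using the fitted coefficients.
print(a0 + a1 * 6)  # 13.0
```

With several predictors X1 … Xn the same idea applies, but the coefficients are normally found by solving a system of linear equations rather than with these two closed-form expressions.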
Regression • Example graph: • Line of Best Fit • Curve Fitting
Association Rules • “An association algorithm creates rules that describe how often events have occurred together.” (2) • Example: When a customer buys a hammer, then 90% of the time they will buy nails.
Association Rules • Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule” (1, p748) • Example: • A large fraction of all purchases include both hotdog buns and hotdog sausages. = High support • Only 0.005% of all purchases include both hotdog buns and hangers. = Low support • Rules with high support are worth careful attention • E.g. Hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
Association Rules • Confidence: “is a measure of how often the consequent is true when the antecedent is true.” (1, p748) • Example: • 90% of hotdog bun purchases are accompanied by hotdog sausages. • High confidence is meaningful, as we can derive rules from it. • Hotdog bun ⇒ Hotdog sausage • Two rules may have different confidence levels and still have the same support. • E.g. Hotdog sausage ⇒ Hotdog bun may have a much lower confidence than Hotdog bun ⇒ Hotdog sausage, yet both can have the same support.
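Support and confidence, as defined in the quotations above, can be computed directly from a list of transactions. The sketch below is a minimal illustration; the function names `support` and `confidence` and the five shopping baskets are made up for the example. It also shows two rules with the same support but different confidence.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both item sets."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # rule never applies in this data
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


# Made-up shopping baskets.
transactions = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"sausages"},
    {"sausages", "mustard"},
    {"hangers"},
]
buns, sausages = {"buns"}, {"sausages"}

# Both rules cover the same 2 of 5 baskets, so their support is equal.
print(support(transactions, buns, sausages))     # 0.4
print(support(transactions, sausages, buns))     # 0.4

# Confidence depends on the direction of the rule.
print(confidence(transactions, buns, sausages))  # 1.0
print(confidence(transactions, sausages, buns))  # 0.5
```

Here every basket with buns also has sausages, but only half of the baskets with sausages have buns, which is why the two directions of the rule differ in confidence.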
Advantages of Data Mining • Provides new knowledge from existing data • Public databases • Government sources • Company Databases • Old data can be used to develop new knowledge • New knowledge can be used to improve services or products • Improvements lead to: • Bigger profits • More efficient service
Uses of Data Mining • Sales/Marketing • Diversify target market • Identify clients' needs to increase response rates • Risk Assessment • Identify customers who pose a high credit risk • Fraud Detection • Identify people misusing the system, e.g. people who have two Social Security Numbers • Customer Care • Identify customers likely to change providers • Identify customer needs
Applications of Data Mining (4) Source IDC 1998
Privacy Concerns • Effective Data Mining requires large sources of data • To achieve a wide spectrum of data, multiple data sources are linked • Linking sources can be problematic for privacy. Suppose the following histories of a customer were linked: • Shopping History • Credit History • Bank History • Employment History • The user's life story can then be painted from the collected data
References • Silberschatz, Korth, Sudarshan, “Database System Concepts”, 5th Edition, McGraw-Hill, 2005 • http://www.twocrows.com/glossary.htm, “Two Crows, Data Mining Glossary” • http://en.wikipedia.org/wiki/Data_mining, “Wikipedia” • http://phoenix.phys.clemson.edu/tutorials/excel/regression.html • http://wwwmaths.anu.edu.au/~steve/pdcn.pdf