Chapter 2



  1. Chapter 2 Data Mining Tasks

  2. Data Mining Tasks • Prediction methods • Use some variables to predict unknown or future values of the same or other variables • Inference on the current data in order to make predictions • Description methods • Find human-interpretable patterns that describe the data • Characterize the general properties of the data in the database • Descriptive mining is complementary to predictive mining, but it is closer to decision support than to decision making

  3. Cont’d • Association Rule Mining (descriptive) • Classification and Prediction (predictive) • Clustering (descriptive) • Sequential Pattern Discovery (descriptive) • Regression (predictive) • Deviation Detection (predictive)

  4. Association Rule Mining • Initially developed for market basket analysis • Goal is to discover relationships between attributes • Data is typically stored in very large databases, sometimes in flat files or images • Uses include decision support, classification and clustering • Application areas include business, medicine and engineering

  5. Association Rule Mining • Given a set of transactions, each of which is a set of items, find all rules (X → Y) that satisfy user-specified minimum support and confidence constraints • Support = (#T containing X and Y) / (#T) • Confidence = (#T containing X and Y) / (#T containing X) • Applications: cross-selling and up-selling, supermarket shelf management • Some rules discovered: Bread → Jam (sup = 60%, conf = 75%); Jelly → Bread (sup = 60%, conf = 100%); Jelly → Jam (sup = 20%, conf = 100%); Jelly → Milk (sup = 0%)
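
The two measures above are simple counts over the transaction set. A minimal sketch in Python, using a small hypothetical basket set (the slide's underlying transaction data is not shown, so these transactions are illustrative only):

```python
# Hypothetical transactions; each transaction is a set of items.
transactions = [
    {"Bread", "Jam"},
    {"Bread", "Jam", "Jelly"},
    {"Bread", "Milk"},
    {"Jelly", "Jam"},
    {"Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(X and Y) / support(X) for the rule X -> Y."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)
```

For instance, `confidence({"Bread"}, {"Jam"}, transactions)` is 2/3 on this toy data: two of the three baskets containing Bread also contain Jam.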

  6. Association Rule Mining: Definition • Given a set of records, each of which contains some number of items from a given collection: • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items • Example: • {Bread} → {Jam} • {Jelly} → {Jam}

  7. Association Rule Mining: Marketing and Sales Promotion • Say the rule discovered is {Bread, …} → {Jam} • Jam as a consequent: can be used to determine what products will boost its sales • Bread as an antecedent: can be used to see which products will be impacted if the store stops selling bread • Bread as an antecedent and Jam as a consequent: can be used to see what products should be stocked along with Bread to promote the sale of Jam

  8. Association Rule Mining: Supermarket Shelf Management • Goal: To identify items that are bought concomitantly by a reasonable fraction of customers, so that they can be shelved together • Data used: Point-of-sale data collected with barcode scanners to find dependencies among products • Example: • If a customer buys jelly, then he is very likely to buy jam • So don’t be surprised if you find jam next to jelly on an aisle in the supermarket. Also salsa next to tortilla chips.

  9. Association Rule Mining • Association rule mining will produce LOTS of rules • How can you tell which ones are important? • High Support • High Confidence • Rules involving certain attributes of interest • Rules with a specific structure • Rules with support / confidence higher than expected • Completeness – Generating all interesting rules • Efficiency – Generating only rules that are interesting

  10. Clustering • Determine object groupings such that objects within the same cluster are similar to each other, while objects in different groups are not • Typically objects are represented by data points in a multidimensional space, with each dimension corresponding to one or more attributes. The clustering problem in this case reduces to the following: • Given a set of data points, each having a set of attributes, and a similarity measure, find clusters such that • Data points in one cluster are more similar to one another • Data points in separate clusters are less similar to one another

  11. Cont’d • Similarity measures: • Euclidean distance (continuous attributes) • Other problem-specific measures • Types of clustering: • Group-Based Clustering • Hierarchical Clustering
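
As a concrete sketch of group-based clustering under Euclidean distance, here is a minimal k-means loop. The chapter does not commit to a particular algorithm, so k-means is used here only as one well-known representative:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points given as tuples of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=20, seed=0):
    """Assign points to k clusters by alternating assignment and mean-update steps."""
    random.seed(seed)
    centers = random.sample(points, k)  # start from k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            i = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[i].append(p)
        # move each center to the mean of its cluster (keep it if cluster is empty)
        centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
    return centers, clusters
```

On two well-separated clumps of points, the loop minimises intra-cluster distances exactly as the next slide's example describes.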

  12. Clustering Example • Euclidean-distance-based clustering in 3D space • Intra-cluster distances are minimised • Inter-cluster distances are maximised

  13. Clustering: Market Segmentation • Goal: To subdivide a market into distinct subsets of customers, where each subset can be targeted with a distinct marketing mix • Approach: • Collect different attributes of customers based on their geographical and lifestyle-related information • Find clusters of similar customers • Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters

  14. Clustering: Document Clustering • Goal: To find groups of documents that are similar to each other based on important terms appearing in them • Approach: To identify frequently occurring terms in each document. Form a similarity measure based on frequencies of different terms. Use it to generate clusters. • Gain: Information Retrieval can utilize the clusters to relate a new document or search to clustered documents

  15. Clustering: Document Clustering Example • Clustering points: 3204 articles of LA Times • Similarity measure: Number of common words in documents (after some word filtering)
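
The common-words similarity measure is easy to state in code. A minimal sketch, in which the stopword list is a stand-in for the slide's unspecified "word filtering":

```python
def common_word_similarity(doc_a, doc_b,
                           stopwords=frozenset({"the", "a", "of", "and", "to", "in"})):
    """Number of distinct non-stopword words shared by two documents.

    The stopword set here is illustrative; the LA Times experiment's
    actual filtering rules are not given in the slides."""
    def words(doc):
        return {w.lower().strip(".,;:") for w in doc.split()} - stopwords
    return len(words(doc_a) & words(doc_b))
```

A clustering algorithm such as the k-means sketch above can then use this count (or rather its negation, since larger means more similar) in place of Euclidean distance.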

  16. Classification: Definition • Given a set of records (called the training set) • Each record contains a set of attributes; one of the attributes is the class • Find a model for the class attribute as a function of the values of the other attributes • Goal: Previously unseen records should be assigned to a class as accurately as possible • Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.
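
The train/test protocol above can be sketched independently of any particular classifier. The following illustration uses a trivial majority-class "model" purely as a placeholder, since the slide does not fix a learning algorithm:

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle labeled records and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def majority_class(train):
    """Trivial stand-in model: always predict the most frequent training label."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def accuracy(predicted_label, test):
    """Fraction of test records whose true label matches the prediction."""
    return sum(1 for _, label in test if label == predicted_label) / len(test)
```

Any real classifier (decision tree, Bayes, etc.) slots into the same build-on-train, evaluate-on-test loop.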

  17. Classification: cont’d • Classifiers are created using labeled training samples • Classifiers are evaluated using independent labeled samples (test set) • Training samples created by ground truth / experts • Classifier later used to classify unknown samples • Measurements must be able to predict the phenomenon! • Examples • Direct marketing • Fraud detection • Customer churn • Sky survey cataloging • Classifying galaxies

  18. Classification Example

  19. Classification: Direct Marketing • Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product • Approach: • Use the data collected for a similar product introduced in the recent past • Use the profiles of consumers along with their {buy, didn’t buy} decision. The latter becomes the class attribute. • The profile information may consist of demographic, psychographic and company-interaction attributes: • Demographic – Age, Gender, Geography, Salary • Psychographic – Hobbies • Company Interaction – Recency, Frequency, Monetary • Use this information as input attributes to learn a classifier model

  20. Classification: Fraud Detection • Goal: Predict fraudulent cases in credit card transactions • Approach: • Use credit card transactions and the information on its account holders as attributes (important: when and where the card was used) • Label past transactions as {fraud, fair} transactions. This forms the class attribute • Learn a model for the class of transactions • Use this model to detect fraud by observing credit card transactions on an account.

  21. Regression • Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or non-linear model of dependency • Extensively studied in the fields of statistics and neural networks • Examples: • Predicting the sales figures of a new product based on advertising expenditure • Predicting wind velocities based on temperature, humidity, air pressure, etc. • Time-series prediction of stock market indices
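
For the linear case, the standard fit is ordinary least squares. A minimal one-variable sketch (e.g. sales as a function of advertising expenditure; the data here is made up):

```python
def fit_line(xs, ys):
    """Ordinary least squares for the linear model y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b
```

Given a fitted `(a, b)`, the prediction for a new `x` is simply `a + b * x`; the non-linear models the slide mentions replace this line with a more flexible function family.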

  22. Deviation/Anomaly Detection • Some data objects do not comply with the general behavior or model of the data. Data objects that are different from, or inconsistent with, the remaining set are called outliers • Outliers can be caused by measurement or execution errors, or they may represent some kind of fraudulent activity • The goal of deviation/anomaly detection is to detect significant deviations from normal behavior

  23. Deviation/Anomaly Detection: Definition • Given a set of n points or objects, and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional or inconsistent with respect to the remaining data • This can be viewed as two subproblems: • Define what data can be considered inconsistent in a given data set • Find an efficient method to mine the outliers
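
One simple way to instantiate both subproblems is distance-based: call a point inconsistent to the degree that it lies far from everything else, and return the k highest-scoring points. This is just one possible definition of "considerably dissimilar"; the slide leaves the choice open:

```python
import math

def top_k_outliers(points, k):
    """Rank points by mean Euclidean distance to all other points; return top k."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        scores.append((sum(dist(p, q) for q in others) / len(others), p))
    scores.sort(reverse=True)  # highest mean distance = most outlying
    return [p for _, p in scores[:k]]
```

Note the quadratic cost of the pairwise distances, which is exactly why the second subproblem (an *efficient* mining method) matters on large data sets.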

  24. Deviation:Credit Card Fraud Detection • Goal: to detect fraudulent credit card transactions • Approach: • Based on past usage patterns, develop model for authorized credit card transactions • Check for deviation from model, before authenticating new credit card transactions • Hold payment and verify authenticity of “doubtful” transaction by other means (phone call, etc.)

  25. Anomaly detection:Network Intrusion Detection • Goal: to detect intrusion of a computer network • Approach: • Define and develop a model for normal user behavior on the computer network • Continuously monitor behavior of users to check if it deviates from the defined normal behavior • Raise an alarm, if such deviation is found

  26. Sequential Pattern Discovery: Definition • Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events • Sequence discovery aims at extracting sets of events that commonly occur over a period of time: (A B) (C) → (D E)
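
In the notation above, each parenthesised group is a set of events that occur together, and the groups must appear in order. A minimal sketch of the matching test that underlies support counting for such patterns (the mining algorithm itself, e.g. GSP, is out of scope for this chapter):

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's event sets occur in order within the sequence,
    each as a subset of some later transaction.

    sequence: list of event sets in time order, e.g. [{"A","B"}, {"C"}, ...]
    pattern:  list of event sets that must be matched in order."""
    i = 0
    for events in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(events):
            i += 1  # this transaction matches the next pattern element
    return i == len(pattern)
```

The support of a pattern is then the fraction of objects whose event timeline satisfies `contains_pattern`.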

  27. Sequential Pattern Discovery: Telecommunication Alarm Logs • Telecommunication alarm logs • (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) → (Fire_Alarm)

  28. Sequential Pattern Discovery: Point-of-Sale Up-Sell / Cross-Sell • Point-of-sale transaction sequences • Computer bookstore: • (Intro_to_Visual_C) (C++_Primer) → (Perl_for_Dummies, Tcl_Tk) • 60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl Tk within a month • Athletic apparel store: • (Shoes) (Racket, Racketball) → (Sport_Jacket)

  29. Example: Data Mining (Weather Data) • By applying various data mining techniques, we can: • Find associations and regularities in our data • Extract knowledge in the form of rules, decision trees, etc. • Predict the value of the dependent variable in new situations • Some examples: • Mining association rules • Classification by decision trees and rules • Prediction methods

  30. Mining association rules • First, discretize the numeric attributes (a part of the data preprocessing stage) • Group the temperature values in three intervals (hot, mild, cool) and humidity values in two (high, normal) • Substitute the values in data with the corresponding names • Apply the Apriori algorithm and get the following rules

  31. Discretized weather data
  Day  Outlook    Temperature  Humidity  Windy  Play
  1    sunny      hot          high      false  no
  2    sunny      hot          high      true   no
  3    overcast   hot          high      false  yes
  4    rainy      mild         high      false  yes
  5    rainy      cool         normal    false  yes
  6    rainy      cool         normal    true   no
  7    overcast   cool         normal    true   yes
  8    sunny      mild         high      false  no
  9    sunny      cool         normal    false  yes
  10   rainy      mild         normal    false  yes
  11   sunny      mild         normal    true   yes
  12   overcast   mild         high      true   yes
  13   overcast   hot          normal    false  yes
  14   rainy      mild         high      true   no

  32. Cont’d • humidity=normal windy=false → play=yes (4, 1) • temperature=cool → humidity=normal (4, 1) • outlook=overcast → play=yes (4, 1) • temperature=cool play=yes → humidity=normal (3, 1) • outlook=rainy windy=false → play=yes (3, 1) • outlook=rainy play=yes → windy=false (3, 1) • outlook=sunny humidity=high → play=no (3, 1) • outlook=sunny play=no → humidity=high (3, 1) • temperature=cool windy=false → humidity=normal play=yes (2, 1) • temperature=cool humidity=normal windy=false → play=yes (2, 1)
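
Each pair "(support count, confidence)" can be checked directly against the weather table. A small sketch that re-enters the table and counts matching records:

```python
# The discretized weather table from the earlier slide, row for row.
cols = ("outlook", "temperature", "humidity", "windy", "play")
rows = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
data = [dict(zip(cols, r)) for r in rows]

def rule_stats(antecedent, consequent, records):
    """Return (support count, confidence) for the rule antecedent -> consequent,
    where each side is a dict of attribute=value conditions."""
    def matches(r, cond):
        return all(r[k] == v for k, v in cond.items())
    lhs = [r for r in records if matches(r, antecedent)]
    both = [r for r in lhs if matches(r, consequent)]
    return len(both), len(both) / len(lhs)
```

For example, rule 3 above comes out as `rule_stats({"outlook": "overcast"}, {"play": "yes"}, data)`, which is (4, 1.0): four overcast days, all played.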

  33. Cont’d • These rules show attribute–value sets (itemsets) that appear frequently in the data • Support: the number of occurrences of the itemset in the data • Confidence: the accuracy of the rule • Rule 3 is the same as the one produced by observing the data cube

  34. Classification by Decision Trees and Rules • Using the ID3 algorithm, the following decision tree is produced: • Outlook=sunny • Humidity=high: no • Humidity=normal: yes • Outlook=overcast: yes • Outlook=rainy • Windy=true: no • Windy=false: yes
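
Written out as code, the tree above is just nested conditionals. A direct transcription (attribute values as lowercase strings, `windy` as a boolean):

```python
def classify(outlook, humidity, windy):
    """The ID3 tree from the slide, expressed as nested conditionals."""
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "overcast":
        return "yes"
    # remaining case: outlook == "rainy"
    return "no" if windy else "yes"
```

Note that `temperature` never appears: ID3 found it carries no additional information once the other attributes are tested.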

  35. Cont’d • A decision tree consists of: • Decision nodes that test the values of their corresponding attributes • Each value of the attribute leads to a subtree, and so on, until the leaves of the tree are reached • The leaves determine the value of the dependent variable • Using a decision tree, we can classify new tuples

  36. Cont’d • A decision tree can be presented as a set of rules • Each rule represents a path through the tree from the root to a leaf • Other data mining techniques can produce rules directly, e.g. the Prism algorithm: • If outlook=overcast then yes • If humidity=normal and windy=false then yes • If temperature=mild and humidity=normal then yes • If outlook=rainy and windy=false then yes • If outlook=sunny and humidity=high then no • If outlook=rainy and windy=true then no

  37. Prediction methods • DM offers techniques to predict the value of the dependent variable directly, without first generating a model • One of the most popular approaches is based on statistical methods • It uses Bayes’ rule to predict the probability of each value of the dependent variable given the values of the independent variables

  38. Cont’d • E.g., applying Bayes’ rule to the new tuple (sunny, mild, normal, false, ?): • P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8 • P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2 • ⇒ The predicted value must be “yes”
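
A naive Bayes computation over the weather table reproduces these probabilities. Note the conditional-independence assumption between attributes is a modeling choice of naive Bayes; the slide does not spell out its estimator, but on this table the numbers come out as ≈0.8 vs ≈0.2 exactly as stated:

```python
# The discretized weather table from the earlier slide.
cols = ("outlook", "temperature", "humidity", "windy", "play")
rows = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
data = [dict(zip(cols, r)) for r in rows]

def naive_bayes(query, records, target="play"):
    """P(class | query) via Bayes' rule, assuming attributes are
    conditionally independent given the class (the naive assumption)."""
    scores = {}
    for c in {r[target] for r in records}:
        cls = [r for r in records if r[target] == c]
        p = len(cls) / len(records)  # prior P(c)
        for attr, val in query.items():
            p *= sum(1 for r in cls if r[attr] == val) / len(cls)  # P(val | c)
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}
```

Calling it with the new tuple `{"outlook": "sunny", "temperature": "mild", "humidity": "normal", "windy": "false"}` yields roughly 0.80 for "yes" and 0.20 for "no", so the predicted value is "yes".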

  39. Data Mining: Problems and Challenges • Noisy Data • Large Databases • Dynamic Databases • Difficult Training Sets • Incomplete Data

  40. Noisy Data • Many attribute values will be inexact or incorrect, due to: • Erroneous instruments measuring some property • Human errors occurring at data entry • Two forms of noise in the data: • Corrupted values – some of the values in the training set are altered from their original form • Missing values – one or more attribute values may be missing, both for examples in the training set and for objects which are to be classified

  41. Difficult Training Set • Non-representative data • When learning is based on only a few examples; with a large database, the rules are more likely to be representative • Absence of boundary cases • These are needed to find the real differences between two classes • Limited information • Two objects to be classified may have the same conditional attributes but belong to different classes • There is not enough information to distinguish the two types of objects

  42. Dynamic Databases • Databases change continually • Rules that reflect the content of the database at all times are preferred • If some changes are made, the whole learning process may have to be conducted again

  43. Large Databases • The size of databases is ever increasing • Machine learning algorithms were designed to handle small training sets (a few hundred examples) • Much care is needed when using similar techniques on larger databases • Large databases can provide more knowledge (e.g. the set of rules produced may be enormous)

  44. Data Mining – Issues in Data Mining • User Interaction / Visualization • Incorporation of Background Knowledge • Noisy or Incomplete Data • Determining Interestingness of Patterns • Efficiency and Scalability • Parallel and Distributed Mining • Incremental Learning / Mining Time-Changing Phenomena • Mining from Image / Video / Audio Data • Mining Unstructured Data