DATA MINING
Prof. Sin-Min Lee
Surya Bhagvat
CS 157B – Spring 2006
Making sense out of data
With hard drive prices falling, the amount of data that corporations store in databases has increased dramatically.
Raw data in a database is of little use until someone makes sense of it. For example, one could store a decade of customer data, but for that data to become useful one needs to find the patterns in it that reveal customer behavior.
Would SQL solve the above problem?
Traditional SQL is useful for performing very large queries, and one could argue that SQL is all but necessary in order to get
This argument holds for small data sets, but when a query runs against a huge database holding terabytes of data, the performance of SQL degrades.
Also, identifying patterns in the data is not always feasible with traditional SQL querying. This is where the field of analytics comes into play.
Analytics is basically identifying patterns of data in order to make
For example, if you maintain a commercial e-commerce web site, one thing you would want to know is your visitors' behavior patterns: which search engine they came from, how they search for items on your site, and so on.
In other words, we are trying to identify patterns of customer behavior that can later be used to target a particular customer with promotional offers.
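As a rough illustration of the visitor pattern described above, the sketch below counts which site each visitor arrived from. The visit log, its URLs, and the `referrer_counts` helper are hypothetical and are not part of any real analytics product.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical visit log: each entry is the referring URL of one visit.
visits = [
    "http://www.google.com/search?q=ipod",
    "http://search.yahoo.com/search?p=ipod+nano",
    "http://www.google.com/search?q=mp3+player",
    "http://www.example.org/blog/post",
]


def referrer_counts(referrers):
    """Count visits per referring host name."""
    return Counter(urlparse(url).netloc for url in referrers)


counts = referrer_counts(visits)
# List referring hosts from most to least frequent.
for host, n in counts.most_common():
    print(host, n)
```

Counting by host name is only one possible grouping; a real site would likely also look at search terms and navigation paths.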
Google recently released Google Analytics as a free service.
The URL for this site is http://www.google.com/analytics/feature_fast.html
Right now one needs to sign up for an invitation; once it is accepted, all one needs to do is include the Google Analytics tracking code in one's web site to start monitoring customer behavior.
In transactional systems the information about day-to-day transactions is stored.
For example, retail stores like Safeway record each transaction at the time the purchase is made.
Identifying patterns on transactional systems is relatively hard because the data stored in these systems usually runs to terabytes, and a SQL query performed across such a huge database may bring the whole system down.
So what’s the alternative?
For decision-making activities such as determining patterns or running complex SQL queries, a separate database or system is usually maintained; such systems are known as decision support systems.
High-level data is pulled out of the transactional systems and stored in these databases so that analytics or data mining techniques can be applied.
The downside is that the data may not be real time. However, a service could be written that runs in the background and updates the decision support system in real time.
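The idea of pulling high-level data out of a transactional system can be sketched as follows. This is a minimal sketch under stated assumptions: two in-memory SQLite databases stand in for the transactional system and the decision support system, and the table names, columns, and the `refresh_summary` function are made up for illustration. A background service could call such a function periodically.

```python
import sqlite3

# Assumption: in-memory SQLite databases stand in for the two systems.
oltp = sqlite3.connect(":memory:")  # transactional system
dss = sqlite3.connect(":memory:")   # decision support system

oltp.execute("CREATE TABLE sales (store TEXT, day TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("S1", "2006-03-01", 20.0),
        ("S1", "2006-03-01", 5.5),
        ("S2", "2006-03-01", 12.0),
        ("S1", "2006-03-02", 7.0),
    ],
)

dss.execute(
    "CREATE TABLE daily_sales ("
    "store TEXT, day TEXT, total REAL, PRIMARY KEY (store, day))"
)


def refresh_summary(source, target):
    """Copy per-store daily totals from the source into the target."""
    rows = source.execute(
        "SELECT store, day, SUM(amount) FROM sales GROUP BY store, day"
    ).fetchall()
    with target:  # commits the inserts as one transaction
        target.executemany(
            "INSERT OR REPLACE INTO daily_sales VALUES (?, ?, ?)", rows
        )


refresh_summary(oltp, dss)
summary = dss.execute(
    "SELECT store, day, total FROM daily_sales ORDER BY store, day"
).fetchall()
print(summary)
```

Only the aggregated rows reach the decision support side, so analytical queries never touch the individual transactions.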
Decision support systems can be classified into three kinds:
Statistical analysis, OLAP (On-line Analytical Processing), and data warehouses.
If detailed statistical analysis of data needs to be performed, SQL is very limited and one needs to turn to commercial packages like SAS. Further information can be found at http://www.sas.com/technologies/analytics/statistics/index.html?sgc=u
OLAP provides very fast access to data.
Data from the RDBMS is gathered and placed into multidimensional cubes, which are then made available to users.
Cognos PowerPlay is the best-selling OLAP product. The link to this product is http://www.cognos.com/products/business_intelligence/analysis/
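The idea of a multidimensional cube can be sketched in a few lines. This is only an illustration of why precomputed cubes answer queries quickly, not how any OLAP product is implemented; the fact rows, the dimension order (region, product, quarter), and the `build_cube` function are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows as they might come from an RDBMS:
# (region, product, quarter, sales amount).
rows = [
    ("West", "iPod", "Q1", 100),
    ("West", "iPod", "Q2", 150),
    ("West", "Dock", "Q1", 30),
    ("East", "iPod", "Q1", 80),
]


def build_cube(facts):
    """Precompute totals for every combination of dimensions.

    A dimension that has been rolled up is marked with "*".
    """
    cube = defaultdict(int)
    for *dims, amount in facts:
        # Each bit of mask decides whether a dimension is kept or rolled up.
        for mask in range(2 ** len(dims)):
            key = tuple(
                d if mask & (1 << i) else "*" for i, d in enumerate(dims)
            )
            cube[key] += amount
    return cube


cube = build_cube(rows)
print(cube[("West", "*", "*")])   # total for the West region
print(cube[("*", "iPod", "Q1")])  # iPod sales in Q1 across all regions
print(cube[("*", "*", "*")])      # grand total
```

Because every aggregate is computed once up front, each lookup is a single dictionary access rather than a scan over the fact rows.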
The third kind of decision support system is the data warehouse.
Data mining is usually performed on these data warehouses.
The data in an enterprise is usually stored in various transactional systems or databases. For example, some data might be stored in an Oracle database, other data in DB2 or Teradata, and in some systems it may just be stored in text files or Excel files.
Combining all this data to look for patterns is very difficult, so the disparate data from the various sources is pulled together to form a data warehouse.
Data mining is about finding patterns in data and is essentially used for predicting customer behavior.
For example, data mining could be used to predict, based on customer complaints, whether a customer is going to switch to a competitor.
Applications of data mining are varied, ranging from CRM to earthquake prediction.
Predictive analytics is based on a predictor, a single value. Predictive analytics is extensively used in CRM applications.
A predictor for a customer could be the 'recent purchase' made.
For example, if you are calling customers for promotions, then based on this predictor you would call the most recent customers first, followed by customers who purchased items, say, a month ago.
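The recency example above can be sketched as a simple sort. The customer records and the `call_order` helper are hypothetical; the only idea taken from the text is that a single value, the date of the last purchase, decides who is called first.

```python
from datetime import date

# Hypothetical customers with the date of their most recent purchase.
customers = [
    {"name": "Ann", "last_purchase": date(2006, 3, 28)},
    {"name": "Bob", "last_purchase": date(2006, 2, 20)},
    {"name": "Cho", "last_purchase": date(2006, 3, 30)},
]


def call_order(customer_list):
    """Return customer names ordered from most to least recent purchase."""
    ranked = sorted(
        customer_list, key=lambda c: c["last_purchase"], reverse=True
    )
    return [c["name"] for c in ranked]


order = call_order(customers)
print(order)
```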
Association rules have an associated population, which consists of a set of instances.
For example, if one buys an iPod from Amazon.com, the products associated with it are the iPod accessories that Amazon displays, including the Apple iPod Nano Armband Grey, Apple iPod Nano Dock, and Apple iPod Nano Lanyard Headphones.
Association rule measures are support and confidence.
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. For example, if the support for iPod => DVD player is 0.001 percent, the support is very low.
Confidence: a measure of how often the consequent is true when the antecedent is true. For example, the confidence of the rule iPod => Apple iPod Nano Armband Grey might be, say, 80 percent.
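The two measures can be computed directly from their definitions. The sketch below uses a small, made-up set of market baskets, chosen so that the rule iPod => Armband comes out at 80 percent confidence as in the example; the basket contents and function names are assumptions.

```python
# Hypothetical market baskets; each set is one customer's purchase.
baskets = [
    {"iPod", "Armband"},
    {"iPod", "Armband", "Dock"},
    {"iPod", "Armband"},
    {"iPod", "Armband", "Headphones"},
    {"iPod"},
    {"DVD player"},
    {"Dock"},
    {"Headphones", "Armband"},
]


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = antecedent | consequent
    hits = sum(1 for t in transactions if both <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain
    the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent <= t)
    return hits / len(with_antecedent)


print(support(baskets, {"iPod"}, {"Armband"}))      # 4 of 8 baskets
print(confidence(baskets, {"iPod"}, {"Armband"}))   # 4 of 5 iPod baskets
print(support(baskets, {"iPod"}, {"DVD player"}))   # no basket has both
```

Support is measured against the whole population, while confidence is measured only against the instances that satisfy the antecedent, which is why a rule can have high confidence and still low support.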
The most popular way to classify items is with decision tree classifiers. In the example, the person's degree is masters and the income is 40K. Starting from the root, we follow the edge labeled 25K to 75K to reach a leaf. The class at the leaf is "good", so we predict that the person's credit risk is good.
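Walking a decision tree from the root to a leaf can be sketched as below. Only the path degree = masters, income 25K to 75K, class "good" comes from the example; every other branch, every other class label, and the treatment of the range boundaries are assumptions made up to give the tree a shape.

```python
def between(low, high):
    """Edge test for a numeric attribute: low <= value < high."""
    return lambda value: low <= value < high


def equals(expected):
    """Edge test for a categorical attribute."""
    return lambda value: value == expected


# Hypothetical subtree reached when the degree is masters.
income_node = {
    "attribute": "income",
    "edges": [
        (between(0, 25_000), "bad"),             # assumed branch
        (between(25_000, 75_000), "good"),       # branch from the example
        (between(75_000, float("inf")), "excellent"),  # assumed branch
    ],
}

tree = {
    "attribute": "degree",
    "edges": [
        (equals("masters"), income_node),
        (equals("bachelors"), "average"),        # assumed branch
        (equals("none"), "bad"),                 # assumed branch
    ],
}


def classify(node, record):
    """Follow matching edges from the root until a leaf (a class label)."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        for test, child in node["edges"]:
            if test(value):
                node = child
                break
        else:
            raise ValueError(f"no edge matches {node['attribute']}={value!r}")
    return node


prediction = classify(tree, {"degree": "masters", "income": 40_000})
print(prediction)
```

Here the tree is written by hand; a decision tree classifier would learn the attributes and split points from training data.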
Clustering is about grouping similar data into clusters.
The degree of association is strong within a cluster and weak between different clusters.
Clustering is based on distance measures such as Euclidean or probabilistic distance. K-means is one of the best-known clustering algorithms.
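A minimal sketch of K-means with Euclidean distance follows, assuming points are given as tuples of numbers. The sample points, the fixed random seed, and the iteration cap are arbitrary choices for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]  # keep the old centroid for an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

With data this cleanly separated the result does not depend on the starting centroids; on real data K-means can converge to different clusterings for different initial choices.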
A. Silberschatz, H. F. Korth, S. Sudarshan
Database System Concepts, 5th Ed., McGraw-Hill, 2006