

DATA MINING

Prof. Sin-Min Lee

Surya Bhagvat

CS 157B – Spring 2006


Making sense out of data

As hard drive prices have fallen, the amount of data that corporations store in their databases has increased dramatically.

Raw data sitting in a database is of little use unless someone makes sense of it. For example, a company could store a decade of customer data, but that data becomes useful only once patterns that reveal customer behavior are found in it.

Would SQL solve the above problem?


Traditional SQL and Analytics

Traditional SQL is useful for running very large queries, and one could argue that SQL is all that is necessary to get the information.

This argument holds for small data sets, but when a query runs against a huge database that stores terabytes of data, SQL performance degrades.

Also, identifying patterns in the data is not always feasible with traditional SQL querying. This is where the field of analytics comes into play.


Analytics

Analytics is essentially the identification of patterns in data in order to make better decisions.

For example, if you maintain a commercial e-commerce web site, you would want to know your visitors' behavior patterns, such as which search engine they came from and how they go about searching for items on your site.

The goal is to identify patterns of customer behavior that can later be used to target a particular customer with promotional offers.


Analytics (continued)

Google recently made Google Analytics available for free.

The URL for this site is http://www.google.com/analytics/feature_fast.html

At present one needs to sign up for an invitation. Once it is accepted, all that is required is to include the Google Analytics tracking code in your web site, and you can then start monitoring customer behavior.


Transactional Systems

In transactional systems the information about day-to-day transactions is stored.

For example, retail stores like Safeway record each transaction at the time the purchase is made.

Identifying patterns on transactional systems is relatively hard because the data stored in these systems usually runs to terabytes, and a SQL query performed across such a huge database may bring the whole system down.

So what’s the alternative?


Decision Support Systems

For decision-making activities, such as determining patterns or running complex SQL queries, a separate database or system is usually maintained. These systems are known as decision support systems.

High-level data is pulled out of the transactional systems and stored in these databases so that analytics or data mining techniques can be applied.

The downside is that the data may not be real time. However, a background service could be written to update the decision support system in real time.


Decision Support Systems (continued)

Decision support systems can be classified into three kinds: statistical analysis, OLAP (On-line Analytical Processing) and data warehouses.

If detailed statistical analysis of the data needs to be performed, SQL is very limited and one needs to turn to commercial packages like SAS. Further information can be found at http://www.sas.com/technologies/analytics/statistics/index.html?sgc=u


Decision Support Systems (continued)

OLAP provides very fast access to data.

Data from the RDBMS is gathered and placed into multidimensional cubes, which are then made available to users.

Cognos PowerPlay is the best-selling OLAP product. The link to this product is http://www.cognos.com/products/business_intelligence/analysis/
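
As a rough illustration of the cube idea above, the following Python sketch pre-aggregates a few sales rows along every combination of three dimensions, so that later lookups are simple dictionary reads. The sample rows, the dimension names and the build_cube helper are assumptions made for this example; they do not describe how Cognos PowerPlay or any other OLAP product works internally.

```python
from collections import defaultdict

# Hypothetical fact rows as they might come out of an RDBMS:
# (region, product, quarter, sales_amount)
rows = [
    ("West", "iPod", "Q1", 100),
    ("West", "iPod", "Q2", 150),
    ("West", "Dock", "Q1", 40),
    ("East", "iPod", "Q1", 80),
    ("East", "Dock", "Q2", 30),
]

def build_cube(rows):
    """Pre-aggregate sales for every combination of dimension values.

    A dimension value of None stands for "all values" (rolled up).
    """
    cube = defaultdict(int)
    for region, product, quarter, amount in rows:
        for r in (region, None):
            for p in (product, None):
                for q in (quarter, None):
                    cube[(r, p, q)] += amount
    return cube

cube = build_cube(rows)
west_total = cube[("West", None, None)]    # all sales in the West region
ipod_q1 = cube[(None, "iPod", "Q1")]       # iPod sales in Q1, all regions
grand_total = cube[(None, None, None)]     # everything
print(west_total, ipod_q1, grand_total)
```

The work of aggregation is paid once when the cube is built, which is one plausible reason such lookups can be fast compared with rescanning the base tables for every question.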


Data warehousing

The third kind of decision support system is the data warehouse.

Data mining is usually performed on data warehouses.

The data in an enterprise is usually spread across various transactional systems or databases. For example, some data might be stored in an Oracle database, other data in DB2 or Teradata, and some may simply be kept in text files or Excel files.

Combining all this data to look for patterns is very difficult, so the disparate data from the various sources is pulled together to form a data warehouse.


Data warehousing (continued)

The steps involved in building a data warehouse include:

  • Getting the raw data from the different sources and storing it as is in a temporary staging area. ETL tools are typically used for this process.

  • Cleansing the data in the temporary staging area and applying various business rules to load it into the actual data warehouse tables.
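
The two steps above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two "sources", their record layouts, the validation rule and the sales table are all invented for illustration, and an in-memory SQLite database stands in for the warehouse. Real ETL tools handle far more than this.

```python
import sqlite3

# Hypothetical raw extracts from two source systems, in different shapes.
orders_from_oracle = [{"CUST_ID": " 101 ", "AMOUNT": "25.50"},
                      {"CUST_ID": "102", "AMOUNT": "n/a"}]
orders_from_text_file = ["103,12.00", "101,7.25"]

def extract_to_staging():
    """Step 1: copy raw records as is into a staging area, tagged by source."""
    staging = [("oracle", rec) for rec in orders_from_oracle]
    staging += [("text", line) for line in orders_from_text_file]
    return staging

def cleanse(staging):
    """Step 2a: normalize formats and apply a sample business rule
    (reject rows whose customer id or amount is not a number)."""
    clean = []
    for source, rec in staging:
        if source == "oracle":
            cust, amount = rec["CUST_ID"], rec["AMOUNT"]
        else:
            cust, amount = rec.split(",")
        try:
            clean.append((int(cust.strip()), float(amount)))
        except ValueError:
            continue  # skip rows that fail validation
    return clean

def load(clean):
    """Step 2b: load the cleansed rows into the warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    return conn

warehouse = load(cleanse(extract_to_staging()))
totals = warehouse.execute(
    "SELECT customer_id, SUM(amount) FROM sales "
    "GROUP BY customer_id ORDER BY customer_id").fetchall()
print(totals)
```

Keeping the staging copy untouched and doing all cleansing in a separate step mirrors the split described in the list: the raw extract can be re-examined if a business rule later turns out to be wrong.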


Predictive analytics and Data Mining

Data mining is about finding patterns in data and is essentially used for predicting customer behavior.

For example, data mining could be used to predict, based on customer complaints, whether a customer is going to switch to a competitor.

Applications of data mining are varied, ranging from CRM to earthquake prediction.


Predictive analytics and Data Mining (continued)

Predictive analytics is based on a predictor, which is a single value. It is used extensively in CRM applications.

A predictor for a customer could be the most recent purchase made.

For example, if you are calling customers about promotions, then based on this predictor you would call the most recent purchaser first, followed by customers who purchased items about a month ago.
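
A minimal Python sketch of using the "recent purchase" predictor to order a call list might look like this. The customer names and dates are made up, and call_order is a hypothetical helper, not part of any CRM product.

```python
from datetime import date

# Hypothetical customers with the date of their most recent purchase.
customers = [
    {"name": "Ana", "last_purchase": date(2006, 3, 1)},
    {"name": "Bo", "last_purchase": date(2006, 4, 20)},
    {"name": "Cy", "last_purchase": date(2006, 4, 2)},
]

def call_order(customers):
    """Rank customers by the 'recent purchase' predictor, most recent first."""
    return sorted(customers, key=lambda c: c["last_purchase"], reverse=True)

names = [c["name"] for c in call_order(customers)]
print(names)  # ['Bo', 'Cy', 'Ana']
```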


Procedures in Data Mining

The key procedures used in data mining include:

  • Association rules

  • Classification

  • Clustering


Association rules

Association rules have an associated population, which consists of a set of instances.

For example, if one buys an iPod from Amazon.com, the products associated with it are the iPod accessories that Amazon displays, such as the Apple iPod Nano Armband Grey, the Apple iPod Nano Dock and the Apple iPod Nano Lanyard Headphones.

The measures of an association rule are support and confidence.


Association rules (continued)

Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. For example, if the support for iPod => DVD player is 0.001 percent, the support is very low.

Confidence: a measure of how often the consequent is true when the antecedent is true. For example, the confidence of the rule iPod => Apple iPod Nano Armband Grey might be, say, 80 percent.
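
To make the two measures concrete, here is a small Python sketch that computes them over a handful of shopping baskets. The baskets are invented for illustration, so the resulting numbers differ from the percentages quoted above.

```python
# Hypothetical population: each instance is one customer's basket.
transactions = [
    {"iPod", "Armband"},
    {"iPod", "Armband", "Dock"},
    {"iPod", "Dock"},
    {"DVD player"},
    {"iPod", "Armband"},
]

def support(antecedent, consequent, transactions):
    """Fraction of all instances that contain both sides of the rule."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among instances containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent <= t)
    return hits / len(with_antecedent)

s = support({"iPod"}, {"Armband"}, transactions)
c = confidence({"iPod"}, {"Armband"}, transactions)
s_dvd = support({"iPod"}, {"DVD player"}, transactions)
print(s, c, s_dvd)
```

In this toy population three of the five baskets contain both an iPod and an armband, and three of the four iPod baskets contain an armband, while no basket contains both an iPod and a DVD player.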



Classification

A widely used way to classify items is with decision tree classifiers. In the example, a person's degree is masters and income is 40K. Starting from the root, we follow the edge labeled "masters" and from there the edge labeled 25K to 75K to reach a leaf. The class at the leaf is "good", so we predict that the credit risk of that person is good.
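
The walk from root to leaf can be sketched in Python as follows. Only the masters / 25K to 75K / "good" path comes from the example above; every other edge and leaf in this tree is a placeholder assumed for illustration, since the original figure is not reproduced here.

```python
# Each internal node tests one attribute, each edge carries a condition,
# and each leaf (a plain string) holds a class label.
TREE = {
    "attribute": "degree",
    "edges": [
        (lambda d: d == "masters", {
            "attribute": "income",
            "edges": [
                (lambda x: x < 25_000, "bad"),              # assumed leaf
                (lambda x: 25_000 <= x <= 75_000, "good"),  # path from the text
                (lambda x: x > 75_000, "excellent"),        # assumed leaf
            ],
        }),
        (lambda d: d != "masters", "unknown"),  # placeholder for other degrees
    ],
}

def classify(node, person):
    """Follow matching edges from the root until a leaf is reached."""
    while not isinstance(node, str):
        value = person[node["attribute"]]
        node = next(child for test, child in node["edges"] if test(value))
    return node

risk = classify(TREE, {"degree": "masters", "income": 40_000})
print(risk)  # good
```

A real classifier would learn the splits from training data rather than have them written by hand; the sketch only shows how a finished tree is used for prediction.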


Clustering

Clustering is the grouping of similar data into clusters.

The degree of association is strong between members of the same cluster and weak between members of different clusters.

Clustering is based on distance measures such as Euclidean or probabilistic distance. K-means is one of the best-known clustering algorithms.
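
A bare-bones version of k-means with Euclidean distance could look like the Python sketch below. The sample points and the starting centroids are assumptions chosen to keep the run deterministic; practical implementations usually pick the starting centroids randomly and add further safeguards.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means with Euclidean distance, starting from given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Points in the same cluster end up close to a shared centroid, which is the "strong association within a cluster" described above.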


Resources

A. Silberschatz, H.F. Korth, S. Sudarshan, Database System Concepts, 5th Ed., McGraw-Hill, 2006

http://www.google.com/analytics/feature_fast.html

http://www.sas.com/technologies/analytics/statistics/index.html?sgc=u

http://www.cognos.com/products/business_intelligence/analysis/

