
Data Mining - PowerPoint PPT Presentation





Data Mining

Rajagopal Sukumar

Cognizant Technology Solutions


Agenda

  • What is Data Mining?

  • Data Mining Techniques

  • Data Mining Process

  • Our work in Data Mining

  • Tools available in the market


What is Data Mining?

  • Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data.

  • These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.


What is Data Mining?

  • The analogy with the mining process is described as:

  • Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful"


Why do we need Data Mining?

  • We need it because everybody needs it!

  • To uncover strategic competitive insight to drive market share and profits


What can we do with our data?

  • Derive Quantitative Information

    • How many people bought our products last month?

  • Explain Past Results

    • Why did monthly sales of our products decline sharply?

  • Discover Hidden Patterns

    • Households with a male head of household (HOH) are more likely to have both cats and dogs than those with a female HOH. The ratio is 7:3.

  • Predict Future Results

    • Households in our customer base with a male head of household are therefore likely to have both cats and dogs. If we are a pet food supplier, think about the value of this prediction!


Transforming Data

Data → Facts/Information → Knowledge → Recommendations/Decisions



Data Mining Methods

  • Decision Trees

  • Case Based Reasoning

  • Neural Networks

  • Genetic Algorithms

  • Linear and Non Linear Regression Analysis



Case based Reasoning (CBR)

  • Finds the closest situation that occurred in the past and adopts the solution that worked then

  • The disadvantage is that CBR systems do not create rules or models summarizing past experience

  • Example: Help Desk Support Systems
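
A minimal sketch of the retrieval step, assuming each past case is stored as a numeric feature vector with the solution that worked. The features, the case data, and the `closest_solution` helper are hypothetical, and Euclidean distance is just one possible measure of closeness.

```python
from math import dist

# Hypothetical past help-desk cases: (feature vector, solution that worked).
# The features are invented for illustration: (error_code, minutes_since_reboot).
past_cases = [
    ((1.0, 5.0), "restart the print spooler"),
    ((2.0, 60.0), "reset the user password"),
    ((7.0, 30.0), "reinstall the network driver"),
]

def closest_solution(problem, cases):
    """Return the solution of the past case nearest to the new problem."""
    _, solution = min(cases, key=lambda case: dist(case[0], problem))
    return solution

suggested = closest_solution((2.0, 55.0), past_cases)
```

Nothing is learned or summarized here: every new problem is answered by scanning the stored cases, which is the disadvantage noted above.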


Neural Networks

  • Mimic the way learning occurs in the brain

  • They are used extensively in the business world as predictive models

  • Each neuron takes many inputs and generates an output that is a non-linear function of the weighted sum of inputs


Neural Networks

[Network diagram. Field labels: Toy Type, Buyer Sex, Quantity, Sale Month, Location. Nodes: n1, n2, n3, n4. Outputs: Good, Bad.]


Neural Networks

  • y = Good or Bad

  • y = w1*n1 + w2*n2 + w3*n3 + w4*n4

  • The weights w1..w4 can be calculated by backpropagation, training the net on known values of y and the inputs

  • Then the net can be used for predictions
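
A minimal sketch of training and prediction for a single neuron, assuming the inputs n1..n4 are already encoded as numbers and Good/Bad is encoded as 1/0. The sigmoid, the learning rate, and the training data are assumptions. With no hidden layer, backpropagation reduces to the gradient-descent update shown here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, inputs):
    """Non-linear function (sigmoid) of the weighted sum of the inputs."""
    return sigmoid(sum(w * n for w, n in zip(weights, inputs)))

def train(samples, epochs=2000, rate=0.5):
    """Fit w1..w4 by gradient descent on known (inputs, y) pairs."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for inputs, y in samples:
            error = predict(weights, inputs) - y
            weights = [w - rate * error * n for w, n in zip(weights, inputs)]
    return weights

# Made-up training data: (n1, n2, n3, n4) -> 1 for Good, 0 for Bad.
samples = [
    ((1, 0, 1, 0), 1),
    ((1, 1, 1, 0), 1),
    ((0, 1, 0, 1), 0),
    ((0, 0, 0, 1), 0),
]
weights = train(samples)

# Use the trained net on an input it has not seen.
label = "Good" if predict(weights, (1, 0, 0, 0)) >= 0.5 else "Bad"
```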


Genetic Algorithms

  • Mimic the evolutionary process of natural selection

  • A fitness function determines which candidate solutions are better fits

  • Genetic operations (mutation and mating) are then applied to generate more solutions

  • Currently in research mode rather than in practical applications
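
A minimal sketch of the loop described above, assuming candidate solutions are bit strings and using a toy fitness function (the number of 1 bits). The population size, mutation rate, and selection scheme are illustrative choices, not taken from the slides.

```python
import random

def fitness(bits):
    """Toy fitness function (an assumption): count of 1 bits."""
    return sum(bits)

def mate(a, b, rng):
    """Single-point crossover of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def evolve(pop_size=20, length=16, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Mating and mutation generate the other half.
        children = [
            mutate(mate(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
```

Because the fitter half survives each generation unchanged, the best fitness in the population never decreases.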


Linear and Non-Linear Regression

  • Searches for a dependence of the target variable on other variables, expressed as a function of some predetermined polynomial form

  • Quantity = A*Buyer Sex + B*Location + C*Month (this is linear!)

  • Solving this equation for A, B, and C using the available data yields a predictive model
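
A minimal sketch of solving for A, B, and C by least squares, assuming Buyer Sex, Location, and Month have already been encoded as numbers. The rows are made up, and the targets are generated from known coefficients so the fit can be checked. The `solve` and `fit` helpers are hypothetical names.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit(rows, targets):
    """Least-squares coefficients from the normal equations (X^T X) c = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

# Made-up rows: (buyer_sex, location, month), numerically encoded.
rows = [(0, 1, 1), (1, 2, 3), (0, 3, 6), (1, 1, 9), (1, 3, 12)]
# Targets generated from A=2, B=3, C=0.5 so the result can be verified.
quantities = [2 * s + 3 * loc + 0.5 * m for s, loc, m in rows]

A, B, C = fit(rows, quantities)
predicted = A * 1 + B * 2 + C * 7  # prediction for a new row (1, 2, 7)
```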


Usage

  • Clustering

    • Grouping data into disjoint sets that are similar in some respect. It also attempts to place dissimilar data in different clusters.

  • For example, with supermarket data, clustering sale items to organize shelf space effectively is a typical application

  • Clustering algorithms typically use a distance function to separate data
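
A minimal sketch of distance-based clustering, using k-means as one common example (the slides do not name a specific algorithm). The item features and the naive initialization are assumptions.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means: assign to nearest centroid, then recompute centroids."""
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Made-up sale items described by (price, weekly units sold).
items = [(1.0, 200.0), (9.0, 15.0), (1.2, 180.0), (8.5, 20.0), (1.1, 210.0)]
clusters = k_means(items, k=2)
```

Each item lands in exactly one cluster, so the resulting sets are disjoint. In practice the features would be scaled first so that no single field dominates the distance.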


Usage

  • Classification

    • Classifies data into distinctive groups

  • For example, people can be classified as babies, children, teenagers, adults, or elderly.

  • An age of two years or younger, for instance, maps to babies.

  • Once data is classified, traits of these groups can be summarized
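
A minimal sketch of the age example. Only the cutoff for babies (two years or younger) comes from the slide; every other cutoff is an assumption chosen for illustration.

```python
def age_group(age):
    """Map an age in years to a class.

    Only the babies cutoff comes from the slide; the rest are assumptions.
    """
    if age <= 2:
        return "baby"
    if age <= 12:
        return "child"
    if age <= 19:
        return "teenager"
    if age <= 64:
        return "adult"
    return "elderly"

ages = [1, 8, 15, 34, 70, 41]  # made-up data
groups = {}
for age in ages:
    groups.setdefault(age_group(age), []).append(age)

# Summarize one trait of each group, here the average age.
summary = {group: sum(members) / len(members) for group, members in groups.items()}
```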


Usage

  • Deviation Detection

    • Extracting anomalies or deviations in the data

    • An anomaly may show a new fact of great interest
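
A minimal sketch of one simple way to extract deviations: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the sales figures are assumptions, and other detection methods exist.

```python
from statistics import mean, stdev

def deviations(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

daily_sales = [100, 102, 98, 101, 99, 103, 97, 250]  # made-up data
anomalies = deviations(daily_sales)
```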


Usage

  • Association Rules

    • Extracting associations between data items. Can be used to predict the value of one object based on the value of another.

  • Question: which characteristics best predict who buys toy pickup trucks?

  • Answer: during summer vacation, single-parent families with certain income levels buy toy pickup trucks


Association Rules

  • 70% of customers who order pens and pencils also order writing tablets

  • If writing tablets are high-margin items, discover all associations that have writing tablets as a consequent

  • If pencils are low margin items, discover all associations that have pencils as an antecedent to determine the impact of discontinuing pencils
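
A minimal sketch of how a figure like the 70% above is computed: the confidence of a rule is the share of orders containing the antecedent that also contain the consequent. The orders are made up, and the `confidence` helper is a hypothetical name.

```python
def confidence(orders, antecedent, consequent):
    """Fraction of orders containing the antecedent that also contain the consequent."""
    with_antecedent = [o for o in orders if antecedent <= o]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= o for o in with_antecedent) / len(with_antecedent)

# Made-up orders, each a set of items.
orders = [
    {"pen", "pencil", "writing tablet"},
    {"pen", "pencil", "writing tablet"},
    {"pen", "pencil"},
    {"pencil", "eraser"},
    {"writing tablet"},
]

# Rule: {pen, pencil} -> {writing tablet}
rule_confidence = confidence(orders, {"pen", "pencil"}, {"writing tablet"})

# Rules with pencils as the antecedent hint at what discontinuing pencils affects.
pencil_to_pen = confidence(orders, {"pencil"}, {"pen"})
```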


Data Mining Process

  • Data Preparation

    • Most important phase: garbage in, garbage out (GIGO)!

  • Defining a Study

  • Reading the data and building a model

  • Understanding the model

  • Prediction


Data Preparation

  • Data Cleansing

    • Inconsistencies

      • Toy types "soft" and "plush" mean the same thing

    • Stale Data

      • Address changes are not reflected correctly

    • Typographical Errors

      • Words are misspelled or mistyped

    • Missing Values

      • Tough problem to address
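
A minimal sketch of fixing the inconsistency example above, assuming a hand-built synonym table that maps every variant to one canonical toy type. The `CANONICAL` table and `clean_toy_type` helper are hypothetical. Stale data and missing values need other treatment.

```python
# Hypothetical synonym table: every known variant maps to one canonical toy type.
CANONICAL = {"soft": "plush", "plush": "plush"}

def clean_toy_type(raw):
    """Normalize case and whitespace, then collapse known synonyms."""
    value = raw.strip().lower()
    return CANONICAL.get(value, value)

raw_types = ["Soft", " plush", "PLUSH ", "truck"]  # made-up data
cleaned = [clean_toy_type(t) for t in raw_types]
```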


Data Cleansing - Missing Values

  • Treatment of missing numeric values is more difficult

    • Artificially assigned values change the distribution and statistics of the field

    • Assign using average values

    • Segment data using another variable and assign segment averages

    • Build a model and impute the missing values (the best method)
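
A minimal sketch of the segment-average option, assuming records are dictionaries and missing values are stored as None. The field names and the fallback to the overall average are assumptions.

```python
from statistics import mean

def impute_by_segment(records, segment_key, field):
    """Fill missing (None) values of `field` with the average of the record's segment."""
    known = [r for r in records if r[field] is not None]
    overall = mean(r[field] for r in known)
    segments = {}
    for r in known:
        segments.setdefault(r[segment_key], []).append(r[field])
    averages = {seg: mean(vals) for seg, vals in segments.items()}
    # Fall back to the overall average when a segment has no known values.
    return [
        {**r, field: averages.get(r[segment_key], overall)} if r[field] is None else r
        for r in records
    ]

# Made-up records, segmented by toy type.
records = [
    {"toy_type": "plush", "quantity": 10},
    {"toy_type": "plush", "quantity": 14},
    {"toy_type": "plush", "quantity": None},
    {"toy_type": "truck", "quantity": 40},
    {"toy_type": "truck", "quantity": None},
]
filled = impute_by_segment(records, "toy_type", "quantity")
```

Using one overall average for every gap would pull all imputed values toward the same number. Segment averages preserve more of the field's distribution.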


Data Transformation

  • Ratio Variables

  • Time derivatives

  • Discretization using quantiles

  • Discretization using other mathematical transforms



Time Derivatives

  • Variation of data over time is very important to understand

  • For example, monthly change in toy sales = toy sales of current month - toy sales of previous month

  • Cyclic Association Rules can be identified

    • Monthly sales of goods may have different correlations depending on the season
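
A minimal sketch of the month-over-month difference described above. The sales figures are made up.

```python
def month_over_month(sales):
    """Change from the previous month, for each month after the first."""
    return [current - previous for previous, current in zip(sales, sales[1:])]

toy_sales = [120, 135, 128, 160, 210]  # made-up monthly totals
changes = month_over_month(toy_sales)
```

The result has one entry fewer than the input, because the first month has no previous month to compare against.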


Discretization using quantiles

  • Discretizing numeric data using quantiles is a very good way to normalize it and makes the data easier to interpret.

  • For example, the quantile break points for toy sales quantity could be the 10th, 25th, 50th, 75th, and 90th percentiles.
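
A minimal sketch of quantile discretization with break points at the 10th, 25th, 50th, 75th, and 90th percentiles, using the standard library. The quantities are made up, and the "inclusive" interpolation method is one of several reasonable choices.

```python
from bisect import bisect_right
from statistics import quantiles

def quantile_bins(values, percentiles=(10, 25, 50, 75, 90)):
    """Return the break points at the given percentiles of `values`."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    return [cuts[p - 1] for p in percentiles]

def discretize(value, break_points):
    """Bin index: 0 is below the first break point, len(break_points) is above the last."""
    return bisect_right(break_points, value)

quantities = list(range(5, 105, 5))  # made-up toy sales quantities: 5, 10, ..., 100
break_points = quantile_bins(quantities)
bins = [discretize(q, break_points) for q in quantities]
```

Each quantity is replaced by a bin index from 0 to 5, so values on very different scales become directly comparable.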


Discretization using other mathematical transforms

  • Range transformations

  • Logarithmic transforms

    • Used for highly skewed distributions

  • Polynomial transforms

    • Used to linearize a variable if the data is continuously distributed
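
A minimal sketch of the first two transforms, assuming "range transformation" means linear rescaling onto [0, 1] and using a base-10 logarithm. The data is made up.

```python
import math
from statistics import median

def range_transform(values):
    """Rescale values linearly onto [0, 1] (min-max scaling)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def log_transform(values):
    """Base-10 logarithm; requires strictly positive values."""
    return [math.log10(v) for v in values]

sales = [20, 25, 30, 35, 40, 1000]  # made-up, highly skewed by one large value
scaled = range_transform(sales)
logged = log_transform(sales)

# Ratio of the largest value to the median, before and after the log transform.
skew_before = max(sales) / median(sales)
skew_after = max(logged) / median(logged)
```

The log transform pulls the extreme value much closer to the rest of the data, while the range transform only changes the scale and leaves the skew in place.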


Data Mining Process

  • Choose the study

    • Classification/Clustering

    • Deviation Detection

    • Affinity Analysis

  • Run the algorithm on the prepared data

  • Analyze the outputs

  • Make decisions


Our Approach

  • Demystification of Data Mining

  • Built a Windows-based prototype to demonstrate decision trees

  • Working on adding a module to Extempore, our ad hoc query generator



Sample Study

  • I want to understand what makes certain types of customers buy more

  • Is it related to their salary levels?

  • Or is it related to their age?

  • Or is it related to their sex?

[Diagram labels: Subject Field; Associated Fields]



What is Extempore?

  • EXTract M204 and Process On REquest

  • Generates native M204 UL code

  • Reports generated on multiple M204 files without any M204 coding

  • Complex report formatting with the help of reporting tools like Infomaker

  • Provides user friendly GUI

  • Dynamically generates customized reports


What is Extempore?

  • Structured user interface

  • Point & click methodology

  • Limited M204 knowledge required to use

  • Quick access to M204 data

  • Reports can be copied/saved and reused

  • Retrieved data can be saved in formats like Excel, CSV, or HTML tables for use by other systems

  • Online & batch modes of execution


Extempore Architecture

[Architecture diagram]

  • Components: Extempore / Infomaker, Model 204, Sybase Database, JANUS, CT LIB

  • Sybase routes client RPC to M204

  • Hidden connection from M204 to Sybase to read report specification

  • RPC to Sybase & results from RPC to client


Tools in the market

  • IBM Intelligent Miner

  • Data Mind Corp’s Data Mind Professional Edition

  • Angoss Software’s Knowledge Seeker

  • Neuralware’s Neuralworks Predict

  • Pilot Software’s Discovery Server

  • Redbrick Systems’ Data Mine

  • Thinking Machines Corp’s Darwin


Web sites

  • Excellent reference sites

    • http://www.thearling.com

    • http://www.kdnuggets.com

  • Source code sites

    • C4.5 Decision Tree Algorithm

      • http://ftp.cs.su.oz.au/pub/ml/

    • OC1 Decision Tree Algorithm

      • http://www.cs.jhu.edu/


