Chapter I: Introduction, BIS 541, 2013/2014 Summer


Chapter 1. Introduction

  • Motivation: Why data mining?

  • Methodology of Knowledge Discovery in Databases

  • Data mining functionalities

  • Are all the patterns interesting?

  • Business applications of data mining


Motivation: “Necessity is the Mother of Invention”

  • Data explosion problem

    • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

  • Need to convert such data into knowledge and information

  • Applications

    • Business management

    • Production control

    • Market analysis

    • Engineering design

    • Science exploration


Evolution of Database Technology (1)

  • Data collection, database creation

  • Data management

    • data storage and retrieval

    • database transaction processing

  • Data analysis and understanding

    • Data mining and data warehousing


Evolution of Database Technology (2)

  • 1960s:

    • Data collection, database creation, IMS and network DBMS

  • 1970s:

    • Relational data model, relational DBMS implementation

  • 1980s:

    • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

    • Application-oriented DBMS (spatial, scientific, engineering, etc.)

  • 1990s:

    • Data mining, data warehousing, multimedia databases, and Web databases

  • 2000s

    • Stream data management and mining

    • Data mining and its applications

    • Web technology (XML, data integration) and global information systems


  • The Explosive Growth of Data: from terabytes to petabytes

    • Data collection and data availability

      • Automated data collection tools, database systems, Web, computerized society

    • Major sources of abundant data

      • Business: Web, e-commerce, transactions, stocks, …

      • Science: Remote sensing, bioinformatics, scientific simulation, …

      • Society and everyone: news, digital cameras, YouTube

  • We are drowning in data, but starving for knowledge!

  • “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets


Developments in computer hardware

  • Powerful and affordable computers

  • Data collection equipment

  • Storage media

  • Communication and networking


Data Warehouse

  • Data cleaning

  • Data integration

  • OLAP: On-Line Analytical Processing

    • summarization

    • consolidation

    • aggregation

    • view information from different angles

  • but additional data analysis tools are needed for

    • classification

    • clustering

    • characterization of data changing over time


Data-Rich, Information-Poor Situation

  • Abundance of data

  • Need for powerful data analysis tools

  • “Data tombs”: data archives

    • seldom visited

  • Important decisions are made

    • not on the information-rich data stored in databases

    • but on a decision maker’s intuition

  • because there is no tool to extract the knowledge embedded in vast amounts of data

  • Expert system technology

    • domain experts to input knowledge

    • time consuming and costly


What Is Data Mining?

  • Data mining (knowledge discovery in databases):

    • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

  • Alternative names and their “inside stories”:

    • Data mining: a misnomer?

    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

  • What is not data mining?

    • Query processing

    • Expert systems or small ML/statistical programs


Data Mining vs. Data Query

  • Data query, e.g.

    • A list of all customers who use a credit card to buy a PC

    • A list of all MIS students who have a GPA of 3.5 or higher and have studied for four or fewer semesters

  • Data mining problems, e.g.

    • What is the likelihood that a customer purchases a PC with a credit card?

    • Given the characteristics of an MIS student, predict her SPA in the coming term

    • What are the characteristics of MIS undergraduate students?


Why Data Mining?

  • Four questions to be answered:

  • Can the problem be clearly defined?

  • Does potentially meaningful data exist?

  • Does the data contain hidden knowledge, or is it useful only for reporting purposes?

  • Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying a data mining project?


Steps of a KDD Process (1)

  • 1. Goal identification:

    • Define problem

    • relevant prior knowledge and goals of application

  • 2. Creating a target data set: data selection

  • 3. Data preprocessing: (may take 60%-80% of effort!)

    • removal of noise or outliers

    • strategies for handling missing data fields

    • accounting for time sequence information

  • 4. Data reduction and transformation:

    • Find useful features, dimensionality/variable reduction, invariant representation.


Steps of a KDD Process (2)

  • 5. Data Mining:

    • Choosing functions of data mining:

      • summarization, classification, regression, association, clustering.

    • Choosing the mining algorithm(s):

      • which models or parameters

    • Search for patterns of interest

  • 6. Presentation and Evaluation:

    • visualization, transformation, removing redundant patterns, etc.

  • 7. Taking action:

    • incorporating into the performance system

    • documenting

    • reporting to interested parties


An example: Customer Segmentation

  • 1. The marketing department wants to perform a segmentation study on the customers of AE Company

  • 2. Decide on relevant variables from a data warehouse on customers, sales, promotions

    • Customers: name, ID, income, age, education, ...

    • Sales: history of sales

    • Promotions: promotion types, durations, ...

  • 3. Handle missing income, addresses, ...

    • Determine outliers, if any

  • 4. Generate new index variables representing the wealth of customers

    • Wealth = a*income + b*#houses + c*#cars + ...

    • Make necessary transformations (e.g., z-scores) so that some data mining algorithms work more efficiently
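
Step 4 can be sketched in Python as follows. The weights a, b, c and the three customer records are made-up values chosen only for illustration; the slides do not give them. The sketch builds the wealth index and then standardizes it to z-scores (mean 0, standard deviation 1).

```python
from statistics import mean, pstdev

# Hypothetical weights for Wealth = a*income + b*#houses + c*#cars
A, B, C = 1.0, 50.0, 10.0

def wealth_index(income, houses, cars):
    """Combine several raw variables into one index variable."""
    return A * income + B * houses + C * cars

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up customers: (income, number of houses, number of cars)
customers = [(100.0, 1, 1), (200.0, 2, 1), (300.0, 0, 2)]
wealth = [wealth_index(*c) for c in customers]
standardized = z_scores(wealth)
```

Standardizing matters for distance-based methods such as k-means, where a variable measured in large units would otherwise dominate the distance.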


Example: Customer Segmentation (cont.)

  • 5.a: Choose clustering as the data mining functionality; it is the natural one for a segmentation study, whose aim is to find groups of customers with similar characteristics

  • 5.b: Choose a clustering algorithm

    • K-means, k-medoids, or any other algorithm suitable for the problem

  • 5.c: Apply the algorithm

    • Find clusters or segments

  • 6. Make reverse transformations and visualize the customer segments

  • 7. Present the results in the form of a report to the marketing department

    • Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive

    • Develop marketing strategies for each segment


Data Mining: A KDD Process

  • Data mining: the core of the knowledge discovery process.

  • Figure (process flow): Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge


Data Mining in Business Intelligence

  • Figure: a pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical user in parentheses:

    • Decision Making (End User)

    • Data Presentation: Visualization Techniques (Business Analyst)

    • Data Mining: Information Discovery (Data Analyst)

    • Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)

    • Data Preprocessing/Integration, Data Warehouses (DBA)

    • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)

September 3, 2014

19

Data Mining: Concepts and Techniques


Architecture of a Typical Data Mining System

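  • Figure: layered architecture, from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Databases and Data Warehouse

  • The Knowledge base is drawn beside the upper layers; the data stores feed the server through data cleaning & data integration and filtering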


Architecture of a Typical Data Mining System

  • Database, data warehouse

  • Database or data warehouse server

  • Knowledge base

    • concept hierarchies

    • user beliefs

      • used to assess a pattern’s interestingness

    • other thresholds

  • Data mining engine

    • functional modules

      • characterization, association, classification, cluster analysis, evolution and deviation analysis

  • Pattern evaluation module

  • Graphical user interface


Data Mining: Confluence of Multiple Disciplines

  • Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines


Why Confluence of Multiple Disciplines?

  • Tremendous amount of data

    • Algorithms must be highly scalable to handle volumes such as terabytes of data

  • High dimensionality of data

    • A micro-array may have tens of thousands of dimensions

  • High complexity of data

    • Data streams and sensor data

    • Time-series data, temporal data, sequence data

    • Structured data, graphs, social networks and multi-linked data

    • Heterogeneous databases and legacy databases

    • Spatial, spatiotemporal, multimedia, text and Web data

    • Software programs, scientific simulations

  • New and sophisticated applications



Efficient and Scalable Techniques

  • For an algorithm to be efficient and scalable, its running time should be predictable and acceptable

  • How?

    • Parallel and distributed algorithms

    • Sampling from databases


Two Styles of Data Mining

  • Descriptive data mining

    • characterizes the general properties of the data in the database

    • finds patterns in the data; the user determines which ones are important

  • Predictive data mining

    • performs inference on the current data to make predictions

    • we know what to predict

  • Not mutually exclusive

    • used together

    • Descriptive → predictive

  • E.g., customer segmentation (descriptive, by clustering)

    • followed by a risk assignment model (predictive, by ANN)


Supervised vs. Unsupervised Learning

  • Supervised learning (classification, prediction)

    • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

    • New data is classified based on the training set

  • Unsupervised learning (summarization, association, clustering)

    • The class labels of the training data are unknown

    • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Descriptive Data Mining (1)

  • Discovering new patterns inside the data

  • Used during the data exploration steps

  • Typical questions answered by descriptive data mining

    • What is in the data?

    • What does it look like?

    • Are there any unusual patterns?

    • What does the data suggest for customer segmentation?

  • Users may have no idea

    • which kinds of patterns may be interesting


Descriptive Data Mining (2)

  • Patterns at various granularities

    • geography

      • country - region - city - street

    • student

      • university - faculty - department - minor

  • Functionalities of descriptive data mining

    • Clustering

      • Ex: customer segmentation

    • summarization

    • visualization

    • Association

      • Ex: market basket analysis


A model is a black box

  • X: vector of independent variables or inputs

  • Y = f(X): an unknown function

  • Y: dependent variables or output

    • a single variable or a vector

  • Figure: inputs X1, X2 → Model → output Y

  • The user does not care what the model is doing

    • it is a black box

    • the user is interested in the accuracy of its predictions


Predictive Data Mining (1)

  • The model is trained using known examples

    • the unknown function is learned from data

  • The more data with known outcomes is available,

    • the better the predictive power of the model

  • Used to predict outcomes whose inputs are known but whose output values are not yet realized

  • Never 100% accurate


Predictive Data Mining (2)

  • The performance of a model in predicting the known outcomes of past data is not important

  • Its performance on unknown data is much more important


Typical questions answered by predictive models

  • Who is likely to respond to our next offer

    • based on history of previous marketing campaigns

  • Which customers are likely to leave in the next six months

  • What transactions are likely to be fraudulent

    • based on known examples of fraud

  • What is the total amount a customer will spend in the next month


Data Mining Functionalities (1)

  • Concept description: Characterization and discrimination

    • Generalize, summarize, and contrast data characteristics, e.g., big spenders vs. budget spenders

  • Association (correlation and causality)

    • Multi-dimensional vs. single-dimensional association

    • age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]

    • contains(T, “computer”) → contains(T, “software”) [1%, 75%]


Data Mining Functionalities (2)

  • Classification and Prediction

    • Finding models (functions) that describe and distinguish classes or concepts for future prediction

    • E.g., classify people as healthy or sick, or classify transactions as fraudulent or not

    • Methods: decision-tree, classification rule, neural network

    • Prediction: Predict some unknown or missing numerical values

  • Cluster analysis

    • Class label is unknown: Group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments

    • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity


Data Mining Functionalities (3)

  • Outlier analysis

    • Outlier: a data object that does not comply with the general behavior of the data

    • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

  • Trend and evolution analysis

    • Trend and deviation: regression analysis

    • Sequential pattern mining: click stream analysis

    • Similarity-based analysis

  • Other pattern-directed or statistical analyses


Concept Description

  • Characterization

  • Discrimination

  • Data can be associated with

    • classes or

    • concepts

  • Classes of items for sale:

    • computers, printers

  • Concepts of customers:

    • bigSpenders

    • budgetSpenders


Data Characterization

  • Summarizing the data of the class under study (the target class)

  • Methods

    • SQL queries

    • OLAP roll-up operation

      • user-controlled data summarization

      • along a specified dimension

    • Attribute-oriented induction

      • without step-by-step user interaction

  • The output of characterization

    • pie charts, bar charts, curves, multidimensional data cubes, or cross tabs

    • in rule form, as characteristic rules


Characterization example

  • A description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics

    • age, employment, income

    • drill down on any dimension

      • e.g., on occupation, to view these customers according to their type of employment


Data Discrimination

  • Comparing the target class with one or a set of comparative classes (contrasting classes)

    • these classes can be specified by the user

    • and the corresponding data retrieved through database queries

  • methods and output

    • similar to those used for characterization

    • include comparative measures to distinguish between the target and contrasting classes


Discrimination examples

  • Example 1: Compare the general features of software products

    • whose sales increased by 10% in the last year (target class)

    • whose sales decreased by at least 30% during the same period (contrasting class)

  • Example 2: Compare two groups of AE customers

    • I) who shop for computer products regularly (target class)

      • more than two times a month

    • II) who rarely shop for such products (contrasting class)

      • fewer than three times a year

  • The resulting description:

  • 80% of group I customers

    • have a university education

    • are aged 20-40

  • 60% of group II customers

    • are seniors or young

    • have no university degree


Multidimensional Data

  • Sales according to region, month and product type

  • Dimensions: Product, Location, Time

  • Hierarchical summarization paths

    • Product: Industry > Category > Product

    • Location: Region > Country > City > Office

    • Time: Year > Quarter > Month or Week > Day

  • Figure: data cube with axes Region, Product and Month
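
Summarizing along a hierarchy (an OLAP roll-up) can be sketched in a few lines of Python. The city-to-country mapping and the sales records below are invented for illustration; a real OLAP server would do this inside the data warehouse.

```python
from collections import defaultdict

# Hypothetical step of the Location hierarchy: City -> Country
CITY_TO_COUNTRY = {"Istanbul": "Turkey", "Ankara": "Turkey", "Berlin": "Germany"}

# Made-up sales facts: (product, city, month, amount)
sales = [
    ("PC", "Istanbul", "Jan", 100),
    ("PC", "Ankara", "Jan", 50),
    ("Printer", "Istanbul", "Feb", 30),
    ("PC", "Berlin", "Jan", 70),
]

def roll_up_to_country(facts):
    """Aggregate sales from the City level up to the Country level."""
    totals = defaultdict(int)
    for product, city, month, amount in facts:
        totals[(product, CITY_TO_COUNTRY[city], month)] += amount
    return dict(totals)

by_country = roll_up_to_country(sales)
```

Drilling down is the reverse operation: going back from the country totals to the city-level records.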


Association Analysis

  • Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data

  • widely used

    • market basket

    • transaction data analysis

  • More formally:

  • X → Y, that is,

  • A1 ∧ A2 ∧ … ∧ Ak → B1 ∧ B2 ∧ … ∧ Bl

  • where the Ai and Bj are attribute-value pairs or predicates


Example: association analysis

  • From the AllElectronics (AE) database

    • age(X, “20..29”) ∧ income(X, “1,000...2,000”) → buy(X, “Notebook computer”)

    • (support = 2%,

    • confidence = 60%)

  • X is a variable representing a customer

  • 2% of the AE customers

    • are between 20 and 29 years of age,

    • have incomes ranging from 1 to 2 billion TL, and

    • buy a notebook

  • There is a 60% probability that a customer in those age and income groups will buy a notebook

  • A multidimensional association rule

    • contains more than one attribute or predicate


Market basket analysis

  • Customers’ buying behaviour is investigated

  • Based only on the transaction data

    • no information about customer properties such as age or income

  • Managers

    • are interested in which products or product groups are sold together


Transactional Database


Example: basket analysis rule

  • buy(notebook) → buy(printer)

  • (support = 1%, confidence = 60%)

  • 1% of all transactions contain

    • both a notebook and a printer

  • If a transaction contains a notebook,

    • there is a 60% chance that it contains a printer as well

  • A single-dimensional association rule

    • contains a single predicate

  • An association rule is interesting if

    • its support exceeds a minimum threshold and

    • its confidence exceeds a minimum threshold

  • These minimum values are set by specialists
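
Support and confidence can be computed directly from a transactional database. The sketch below uses five invented transactions and hypothetical function names; a rule is treated as interesting when both measures reach the chosen minimum thresholds (this sketch uses >=, an assumption where the slide says "exceeds").

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent); assumes the antecedent occurs."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def is_interesting(transactions, antecedent, consequent, min_sup, min_conf):
    """True if the rule reaches both minimum thresholds."""
    sup = support(transactions, set(antecedent) | set(consequent))
    conf = confidence(transactions, antecedent, consequent)
    return sup >= min_sup and conf >= min_conf

# Made-up transactions
transactions = [
    {"notebook", "printer"},
    {"notebook", "printer", "paper"},
    {"notebook"},
    {"paper"},
    {"printer", "paper"},
]
rule_support = support(transactions, {"notebook", "printer"})
rule_confidence = confidence(transactions, {"notebook"}, {"printer"})
```

Here two of five transactions contain both items, and two of the three notebook transactions also contain a printer.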


    Classification

    • Learning is supervised

    • Dependent variable is categorical

    • Build a model able to assign new instances to one of a set of well-defined classes


    Typical Classification Problems

    • Given the characteristics of individuals, differentiate those who have suffered a heart attack from those who have not

    • Determine if a credit card purchase is fraudulent

    • Classify a car loan applicant as a good or a poor credit risk


    Methods of Classification

    • Decision Trees

    • Artificial Neural Networks

    • Bayesian Classification

      • Naïve

      • Belief Networks

    • k-nearest neighbor

    • Regression

      • Logistic (logit), probit

        • Predicts the probability of each class

        • when the dependent variable is categorical

          • good customer / bad customer, or employed / unemployed


    Steps of classification process

    • (1) Train the model

      • using a training set

      • data objects whose class labels are known

    • (2) Test the model

      • on a test sample

      • whose class labels are known but not used for training the model

    • (3) Use the model for classification

      • on new data whose class labels are unknown
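
The three steps can be sketched with a 1-nearest-neighbor classifier, chosen here only because it is short; any of the methods listed earlier could take its place. All data values are made up.

```python
import math

def nearest_neighbor_predict(training_set, point):
    """Return the label of the training example closest to `point` (1-NN)."""
    closest = min(training_set, key=lambda ex: math.dist(ex[0], point))
    return closest[1]

def accuracy(training_set, test_set):
    """Share of test examples whose known label the model reproduces."""
    correct = sum(1 for x, label in test_set
                  if nearest_neighbor_predict(training_set, x) == label)
    return correct / len(test_set)

# Made-up labeled data: ((yearly income, age), class label)
training_set = [((1500, 45), "good"), ((1800, 50), "good"),
                ((400, 22), "risky"), ((600, 30), "risky")]
test_set = [((1600, 40), "good"), ((500, 25), "risky")]

# (1) "Training" a 1-NN model just means storing the training set.
# (2) Test on labeled data that was held out from training.
test_accuracy = accuracy(training_set, test_set)
# (3) Use the model on a new object whose label is unknown.
new_label = nearest_neighbor_predict(training_set, (1700, 48))
```

The test set plays the role of "unknown" data: its labels are known to us but were never shown to the model.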


    An example - classification

    • Historical data: each customer's type is known, so each customer has a label

    • Testing set: labels are also known but are not used in training the model

    • New customers: their type has to be estimated

    • Each new customer has to be classified as risky, normal or good


    Original data


    • Historical data: each customer's type is known, so each customer has a label

    • Testing set: labels are also known but are not used in training the model

    • New customers: their type has to be estimated

    • Each new customer has to be classified as buyer or non-buyer


    An example – classification cont.

    • Based on historical data develop a classification model

      • Decision tree, neural network, regression ...

    • Test the performance of the model on a portion of the historical data

    • If the accuracy of the model is satisfactory,

    • use the model on the new customers

      • 11 and 27, to assign a type to these new customers


    Example: AE customers

    • Figure: scatter plot of customers by yearly income and age; each point is labeled good or risky


    Example: AE customers

    • Figure: the scatter plot of yearly income and age (good and risky customers) with a new customer marked “?”

    • Assign the new customer, whose type is unknown, to either * or +


    Solution

    • Figure: plot with x1: yearly income and x2: age; the thresholds x1 = 1000 and x2 = 35 separate good from risky customers

    • Rule: IF yearly income > 1000 AND age > 35 THEN good ELSE risky
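
The rule translates directly into code. The function and constant names below are hypothetical; the two thresholds are the ones shown on the slide.

```python
INCOME_THRESHOLD = 1000  # threshold on x1, yearly income
AGE_THRESHOLD = 35       # threshold on x2, age

def classify_customer(yearly_income, age):
    """Apply the rule: IF income > 1000 AND age > 35 THEN good ELSE risky."""
    if yearly_income > INCOME_THRESHOLD and age > AGE_THRESHOLD:
        return "good"
    return "risky"
```

For example, `classify_customer(1200, 40)` returns `"good"`, while a customer who fails either condition is classified as `"risky"`.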


    Credit Card Promotion Policy

    • Credit card companies

      • Include promotional offerings with their monthly credit card billing

      • Offers provide the opportunity to purchase items such as magazines, …

    • A data mining study

      • Predict individual behaviour

      • What is the likelihood that an individual takes advantage of a promotion,

      • based on individual characteristics, credit history, ...

      • Expected benefit: a reduction in postage, paper and processing costs for the credit card company


    Credit Card Promotion Database


    Decision Trees for Credit Card Insurance Database

    • Dependent variable: Life Insurance Promotion

    • Figure: decision tree

      • Age > 43: N 3, Y 0; decision: No

      • Age <= 43: split on Gender

        • Female: N 0, Y 6; decision: Yes

        • Male: split on Credit Card Insurance

          • Yes: Yes 2, No 0; decision: Yes

          • No: N 4, Y 1; decision: No

    • The critical value of 43 is determined by the algorithm

    • A production rule from the tree:

      • IF (Age <= 43) & (Sex = Male) & (Credit Card Insurance = No) THEN Life Insurance Promotion = No
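
Once learned, a decision tree is just a set of nested tests. The sketch below hard-codes the tree as read from the slide; the function name and the string encodings of the attribute values are assumptions made for illustration.

```python
def life_insurance_promotion(age, sex, credit_card_insurance):
    """Walk the decision tree read from the slide; returns "Yes" or "No"."""
    if age > 43:
        return "No"
    if sex == "Female":
        return "Yes"
    # Remaining branch: age <= 43 and sex == "Male"
    return "Yes" if credit_card_insurance == "Yes" else "No"
```

Each path from the root to a leaf corresponds to one production rule; the rule on the slide is the path age <= 43, Male, Credit Card Insurance = No.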


    Artificial Neural Networks

    • Set of interconnected nodes designed to imitate the functioning of the human brain

    • Feed-forward network

      • Supervised learner model


    For the promotion example

    • Encode all variables

    • Assign a numerical value even to qualitative variables such as sex

    • Say X1 represents gender:

      • Male: X1 = 1

      • Female: X1 = 0
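
A minimal encoding sketch follows. The gender coding is the one given above; scaling age by a maximum of 100 is a hypothetical choice added only to show a numeric variable alongside it.

```python
GENDER_CODES = {"Male": 1, "Female": 0}  # coding given on the slide

def encode_customer(gender, age, max_age=100):
    """Turn one customer record into a numeric input vector [X1, X2]."""
    x1 = GENDER_CODES[gender]
    x2 = age / max_age  # hypothetical scaling of age into the range 0..1
    return [x1, x2]
```

For instance, `encode_customer("Male", 50)` yields `[1, 0.5]`.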


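    • Figure: feed-forward network with an input layer, a hidden layer and an output layer

      • Inputs: X1 = +1, X2 = 0, X3 = 0.5, X4 = -1

      • Example weights: W1,5 = 0.014 (node 1 to node 5), W5,9 = -0.17 (node 5 to node 9)

      • (1 - 0.78)^2 is the squared error: 1 is the actual value of O9 for a particular data object, 0.78 is the predicted value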


    Weights updating

    • Weights between nodes are adjusted so as to reduce error

    • Details of the training process for neural networks are not important for the time being
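
To make "adjusting weights so as to reduce error" concrete, here is a deliberately simplified sketch: one sigmoid node trained by a single gradient-descent step, not the full multi-layer training procedure. The inputs are the X1..X4 values from the network figure; the starting weights and the learning rate are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, inputs):
    """Output of one sigmoid node for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def squared_error(target, predicted):
    return (target - predicted) ** 2

def update_weights(weights, inputs, target, learning_rate):
    """One gradient-descent step on the squared error of a sigmoid node."""
    o = output(weights, inputs)
    # d(error)/d(w_i) = -2 * (target - o) * o * (1 - o) * x_i
    step = 2 * (target - o) * o * (1 - o)
    return [w + learning_rate * step * x for w, x in zip(weights, inputs)]

inputs = [1.0, 0.0, 0.5, -1.0]   # X1..X4 as in the network figure
weights = [0.0, 0.0, 0.0, 0.0]   # hypothetical starting weights
target = 1.0                     # actual value for this data object

error_before = squared_error(target, output(weights, inputs))
weights = update_weights(weights, inputs, target, learning_rate=0.5)
error_after = squared_error(target, output(weights, inputs))
```

After the update the node's output moves toward the target, so `error_after` is smaller than `error_before`; training a real network repeats such steps over many data objects and all layers.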


    Estimation-Prediction

    • Similar to classification

    • Output is a continuous variable

    • Estimation: current value

    • Prediction: future outcome rather than current behavior


    Typical Estimation-Prediction Problems

    • Estimate the salary of an individual who owns a sports car

    • Predict next week's closing price for the IMKB100 index

    • Forecast the next day's temperature


    Prediction methods

    • Artificial Neural networks

    • linear regression

      • $Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$

    • non-linear regression

      • $Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i};\ a_1, a_2, \dots, a_k,\ u_i)$

    • generalized linear regression

      • logistic

        • logit,probit

      • poisson regression

        • for count variables

    • Regression Trees
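
For the simplest case, linear regression with a single independent variable, the coefficients have a closed-form least-squares solution. The data below are made up so that the fit is exact (Y = 1 + 2X); the function name is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (a0, a1) for Y = a0 + a1*X + u."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_simple_regression(xs, ys)
prediction = a0 + a1 * 5.0  # predicted Y for a new X = 5
```

With several independent variables the same idea is usually solved in matrix form by a statistics library rather than by hand.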


    Example: Prediction and Classification

    • Classification is used to classify customers applying for credit cards

      • known class labels: risky, reliable

      • when a new customer applies, her characteristics are examined

        • income, age, education, wealth, region, ...

      • and the customer's class is predicted

    • Prediction: the monthly expense of a new customer (a continuous variable) is predicted based on personal information

      • independent variables

        • income, education, wealth, profession, ...

        • some are numeric, some categorical


    Cluster Analysis

    • Class label is unknown: group data to form new classes

    • Assign a class label to each data object

      • the labels are unknown beforehand and are generated by the clustering model

    • e.g., cluster customers to find customer segments

    • Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

      • Objects within a cluster have high similarity in comparison to one another

      • but are very dissimilar to objects in other clusters

    • There may be a hierarchy of classes
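
One common way to pursue this principle is k-means, sketched below in plain Python. The six data points are invented, and starting from the first k points is a simplification; practical implementations choose starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: returns (centroids, cluster index of each point)."""
    centroids = list(points[:k])  # simplistic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, labels

# Made-up customers: (income, distance to store), already on similar scales
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
```

The cluster indices in `labels` are the class labels "generated by the clustering model"; they carry no meaning until an analyst interprets each segment.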


    Example: Clustering

    • Can be performed on AE customer data

    • to identify homogeneous subpopulations of customers

    • which represent individual target groups for marketing


    • Figure: the data before clustering and after clustering


    • Figure: clustering according to income and distance to store; scatter plot of distance against income with groups labeled Type 1, Type 2 and Type 3

    • Three clusters of data points are evident


    Outlier Analysis

    • Outlier: a data object that does not comply with the general behavior of the data

    • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis

    • Detected using

      • statistical tests

      • distance measures

      • visually inspecting the data

    • Examples:


    Reasons for outliers

    • Measurement errors

    • coding errors

      • age is entered as 999

    • nature of data

      • the salary of the general manager is much higher than that of the other employees

      • in a crisis the interest rate was on the order of 1000s
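
A simple statistical check for outliers such as the age coded as 999 is a z-score filter. The ages below are invented, and the threshold of 2.5 standard deviations is an arbitrary choice for this small sample, not a rule from the slides.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

ages = [23, 31, 35, 42, 28, 39, 45, 33, 999, 37]
suspects = z_score_outliers(ages)
```

Note that an extreme value inflates the mean and standard deviation it is judged against, so in small samples a fixed threshold can miss outliers; whether a flagged value is an error or a genuine rare case still needs domain judgment.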


    Evolution Analysis

    • Describes and models regularities or trends for objects whose behavior changes over time

    • Distinct features include

      • Trend and deviation: time-series data analysis

      • Sequential pattern mining, periodicity analysis

      • Similarity-based analysis

    • Example

      • Stock market predictions: future stock prices

      • for overall stocks: indexes or individual company stocks


    Sequential Pattern Analysis

    • Determine sequential patterns in data

    • Based on time sequence of actions

    • Similar to associations

      • Relationship is based on time

    • Example 1: buy a CD player today → buy CDs within one week

    • Example 2: in what sequence are the web pages of an e-business company accessed?

    • 70% of visitors follow

      • A → B → C, or A → D → B → C, or A → E → B → C

      • The site designer can then decide to add a link directly from page A to page C


    Are All the “Discovered” Patterns Interesting?

    • A data mining system/query may generate thousands of patterns; not all of them are interesting.

    • Are all patterns interesting?

      • Typically not: only a small fraction of the patterns is interesting to any given user

    • Interestingness measures: A pattern is interesting if

      • it is easily understood by humans,

      • valid on new or test data with some degree of certainty,

      • potentially useful,

      • novel, or

      • validates some hypothesis that a user seeks to confirm


    Objective vs. subjective interestingness measures:

    • Objective: based on statistics and structures of patterns, e.g.,

      • support of a rule X ⇒ Y: P(X ∪ Y), the probability that a transaction contains both X and Y

      • confidence: the degree of certainty of the detected association

      • P(Y | X), the conditional probability that a transaction containing X also contains Y

      • thresholds: controlled by the user

      • e.g., rules that do not satisfy a confidence threshold of 50% are considered uninteresting

    • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
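The two objective measures can be computed directly from a list of transactions. The sketch below assumes each transaction is a Python set of item names; the `transactions` data and the function names `support` and `confidence` are hypothetical and serve only as illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Estimate P(Y | X) as support(X and Y together) / support(X).

    Assumes X occurs in at least one transaction; otherwise the ratio is undefined.
    """
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical market-basket data (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: bread => milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

# A user-controlled threshold decides whether the rule is kept.
min_confidence = 0.5
is_interesting = rule_confidence >= min_confidence

print(rule_support, rule_confidence, is_interesting)
```

Here two of the four transactions contain both items, and two of the three bread transactions also contain milk, so the rule passes a 50% confidence threshold.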


    Chapter 1. Introduction

    • Motivation: Why data mining?

    • Methodology of Knowledge Discovery in Databases

    • Data mining functionalities

    • Are all the patterns interesting?

    • Business Applications of data mining


    Potential Business Applications

    • Market analysis and management

      • target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation

    • Risk analysis and management

      • Banks assume a financial risk when they grant loans

        • risk models attempt to predict the probability of default, i.e., failure to pay back the borrowed amount

        • Credit cards

      • Insurance companies

    • Fraud detection and management

    • Other Applications

      • Text mining (newsgroups, email, documents) and Web analysis

      • Intelligent query answering


    Market Analysis and Management (1)

    • Where are the data sources for analysis?

      • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies, clickstreams

    • Customer profiling-segmentation

      • data mining can tell you what types of customers buy what products (clustering or classification)

    • Target marketing

      • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.


    Market Analysis and Management (2)

    • Effectiveness of sales campaigns

      • Advertisements, coupons, discounts, bonuses

      • promote products and attract customers

      • can help improve profits

      • Compare amount of sales and number of transactions

        • during the sales period versus before or after the sales campaign

      • Association analysis

        • which items are likely to be purchased together with the items on sale


    Market Analysis and Management (3)

    • Customer retention: analysis of customer loyalty

      • sequences of purchases of particular customers

      • goods purchased at different periods by the same customers can be grouped into sequences

      • changes in customer consumption or loyalty

      • suggests adjustments on the pricing and variety of goods

      • to retain old customers and attract new customers

    • Cross-selling and up-selling

      • associations from sales records

      • a customer who buys a PC is likely to buy a printer

      • purchase recommendations


    Fraud Detection and Management

    • Applications

      • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

    • Approach

      • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

    • Examples

      • Credit card transactions: the FALCON fraud assessment system by HNC Inc. signals possibly fraudulent credit card transactions

      • Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)

      • Detecting telephone fraud: ASPECT European Research Group

        • Unsupervised clustering to detect fraud in mobile phone networks

        • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
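The idea of flagging calls that deviate from an expected norm can be sketched with a simple z-score test on call duration. This is a deliberately simplified stand-in, not the unsupervised clustering mentioned above; the data, the `flag_deviations` helper, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_deviations(history, new_values, z_threshold=3.0):
    """Return the new values lying more than `z_threshold` standard deviations
    from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history, so no meaningful z-score can be computed.
        return []
    return [v for v in new_values if abs(v - mu) / sigma > z_threshold]


# Hypothetical call durations in minutes for one subscriber.
history_minutes = [3, 5, 4, 6, 5, 4, 5, 3, 6, 4]
new_calls = [5, 4, 95, 6]

suspicious = flag_deviations(history_minutes, new_calls)
print(suspicious)
```

A fuller call model would also use the destination and the time of day or week, as listed in the slide, rather than duration alone.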


    Health Care

    • Storing patients' records in electronic format, developments in medical information systems

      • Large amount of clinical data

    • Regularities, trends and surprising events extracted by data mining methods

      • ANN, temporal reasoning

      • assist clinicians in making informed decisions and improve health services

    • MERCK-MEDCO Managed Care, Pharmaceutical Insurance … company

      • Uncover less expensive but equally effective drug treatments


    Financial Data Analysis

    • Financial data

      • complete, reliable, high quality

    • Loan payment prediction and customer credit policy analysis


    Loan payment prediction and customer credit policy analysis

    • Factors influencing loan payment performance

      • loan-to-value ratio

      • term of the loan

      • debt ratio (total monthly debt/total monthly income)

      • payment-to-income ratio

      • income level

      • education level

      • residence region

      • credit history

    • analysis may find that

      • payment-to-income ratio is a dominant factor while

      • education level and debt ratio are not
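Two of the ratios listed above can be computed as follows. The debt ratio follows the definition given in the slide; the payment-to-income ratio is assumed here to be the monthly loan payment divided by total monthly income, which the slides do not state explicitly. The applicant figures are hypothetical.

```python
def debt_ratio(total_monthly_debt, total_monthly_income):
    """Debt ratio as defined in the slide: total monthly debt / total monthly income."""
    return total_monthly_debt / total_monthly_income


def payment_to_income_ratio(monthly_loan_payment, total_monthly_income):
    """Assumed definition: monthly loan payment / total monthly income."""
    return monthly_loan_payment / total_monthly_income


# Hypothetical applicant (illustrative figures only).
applicant = {
    "monthly_income": 4000.0,
    "monthly_debt": 1400.0,
    "monthly_loan_payment": 1000.0,
}

dr = debt_ratio(applicant["monthly_debt"], applicant["monthly_income"])
pti = payment_to_income_ratio(
    applicant["monthly_loan_payment"], applicant["monthly_income"]
)

print(f"debt ratio = {dr:.2f}, payment-to-income ratio = {pti:.2f}")
```

Ratios like these would be the input attributes of a loan payment prediction model; which of them turn out to be dominant depends on the data analyzed.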


    Risk Management and Insurance

    • determine insurance rates

    • manage investment portfolios

    • differentiate between companies and/or individuals who are good and poor credit risks

    • Farmer's Group discovered a scenario:

      • Someone who owns a sports car is not a higher accident risk

      • Conditions: the sports car is a second car and the family car is a station wagon or a sedan


    Data Mining for the Telecommunication Industry

    • Telecommunication data are multidimensional

      • calling time, duration

      • location of caller, location of callee

      • type of call

    • used to identify and compare

      • data traffic, system workload

      • resource usage, user group behavior

      • profit

    • fraudulent pattern analysis and identification of unusual patterns

    • to achieve customer loyalty

    • characteristics of customers affecting line usage


    Other Applications

    • Sports and Gaming

      • Predicting outcome of football games

    • Text Mining

      • Spam detection

    • Internet Web Mining

      • Web usage mining

        • Improve link structure

        • Recommender systems

      • Web structure mining: mining link structure of Web


    Other Applications

    • Educational Data Mining

      • Clustering students

      • Design entrance exams, selection policies

    • Human Resources

      • How to select applicants

    • Online Dating

      • Recommendations to visitors


    Summary

    • Data mining: discovering interesting patterns from large amounts of data

    • A natural evolution of database technology, in great demand, with wide applications

    • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

    • Mining can be performed in a variety of information repositories

    • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

    • Classification of data mining systems

    • Major issues in data mining

