


An Introduction to Data Mining

Hosein Rostani, Alireza Zohdi

Report 1 for “advanced database” course

Supervisor: Dr. Masoud Rahgozar

December 2007


Outline

  • Why data mining?

  • Data mining applications

  • Data mining functionalities

    • Concept description

    • Association analysis

    • Outlier Analysis

    • Evolution Analysis

    • Classification

    • Clustering


Why data mining?

  • Motivation:

    • Wide availability of huge amounts of data

    • Need for turning data into useful info & knowledge

  • Data mining:

    • Extracting or “mining” knowledge from large amounts of data

    • Knowledge : useful patterns

    • Semiautomatic process

      • Focus on automatic aspects


Data mining applications

  • Prediction. Examples:

    • Credit risk

    • Customer switching to competitors

    • Fraudulent phone calling card usage

  • Associations. Examples:

    • Related books to buy

    • Related accessories to suggest, e.g. for a camera

    • Causation discovery, e.g. in medicine

  • Clusters. Example:

    • Clusters of disease


Data mining functionalities

  • Concept description

    • Characterization & discrimination

  • Association analysis

  • Outlier Analysis

  • Evolution Analysis

  • Classification and Prediction

  • Clustering


Concept description

  • Description of concepts

    • summarized, concise & precise

  • Ways:

    • Data characterization

      • Summarizing the data of the target class in general terms

    • Data discrimination

      • Comparison of the target class with the contrasting class(es)

  • Examples of Output forms:

    • Pie charts, bar charts, curves & multidimensional tables
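Characterization and discrimination can be illustrated with a small sketch. The customer rows, the `age_group` attribute and both function names below are hypothetical and chosen only for illustration; the idea is simply to summarize one attribute of the target class and then set it side by side with a contrasting class.

```python
from collections import Counter

def characterize(rows, attribute):
    """Summarize a class: relative frequency of each value of `attribute`."""
    counts = Counter(row[attribute] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def discriminate(target_rows, contrast_rows, attribute):
    """Compare target and contrasting class: value -> (target share, contrast share)."""
    target = characterize(target_rows, attribute)
    contrast = characterize(contrast_rows, attribute)
    return {
        value: (target.get(value, 0.0), contrast.get(value, 0.0))
        for value in sorted(set(target) | set(contrast))
    }

# Hypothetical data: frequent buyers (target class) vs. occasional buyers (contrasting class).
frequent_buyers = [
    {"age_group": "30-39"}, {"age_group": "30-39"},
    {"age_group": "40-49"}, {"age_group": "30-39"},
]
occasional_buyers = [
    {"age_group": "20-29"}, {"age_group": "20-29"},
    {"age_group": "30-39"}, {"age_group": "20-29"},
]

print(characterize(frequent_buyers, "age_group"))
print(discriminate(frequent_buyers, occasional_buyers, "age_group"))
```

The resulting shares are the kind of numbers that would feed a pie chart or a bar chart.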


Association analysis

  • Mining frequent patterns

    • For discovery of interesting associations within data

  • Kinds of frequent patterns:

    • Frequent itemset

      • Set of items that frequently appear together, e.g. milk and bread

    • Frequent subsequence

      • E.g. a customer purchase pattern:

        • First a PC, then a digital camera & then a memory card

    • Frequent substructure

      • Structural forms such as graphs, trees, or lattices

  • Measures of rule interestingness: support and confidence
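Support and confidence can be made concrete with a minimal sketch. The transactions below are invented for illustration, and the brute-force pair enumeration is an assumption made for brevity; practical miners such as Apriori prune the search space instead. Support of an itemset is the fraction of transactions containing it, and confidence of a rule A ⇒ B is the fraction of transactions containing A that also contain B.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)

def frequent_pairs(transactions, min_support):
    """Brute-force search for item pairs whose support reaches `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result

print(support({"milk", "bread"}, transactions))       # 3 of 5 transactions
print(confidence({"milk"}, {"bread"}, transactions))  # 3 of the 4 milk transactions
print(frequent_pairs(transactions, min_support=0.6))
```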


Outlier Analysis

  • Outliers:

    • Data objects that do not comply with the general behavior of the data

  • Approaches to outliers

    • Discard as noise or exceptions

    • Keep for applications such as fraud detection

      • Example: detecting fraudulent usage of credit cards

  • Ways:

    • Using statistical tests

    • Using distance measures

    • Using deviation-based methods
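Two of these approaches can be sketched in a few lines. The card amounts, the z-score threshold of 2.0, and the radius and neighbor count are all illustrative assumptions, not values from the slides. The first function is a simple statistical test (distance from the mean in standard deviations); the second is a distance-based check that flags points with too few close neighbors.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Statistical test: flag values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance measure: flag values with fewer than `min_neighbors` other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

# Hypothetical credit card charges; the last one is unusually large.
amounts = [20, 22, 19, 21, 23, 20, 500]

print(zscore_outliers(amounts))
print(distance_outliers(amounts, radius=5, min_neighbors=2))
```

Whether a flagged value is discarded as noise or kept as a possible fraud case is an application decision, as the slide notes.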


Evolution Analysis

  • Description and modeling of trends

    • For objects with changing behavior over time

  • Ways:

    • Applying other data mining tasks to time-related data

      • Association analysis, classification, prediction, clustering & …

    • Distinct ways

      • time-series data analysis

      • sequence or periodicity pattern matching

      • similarity-based data analysis

  • Example: stock market: predict future trends in prices
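As a very small instance of time-series analysis, the sketch below smooths a price series with a moving average and reports the direction of the smoothed trend. The prices, the window length and the function names are assumptions made for illustration; this describes a past trend and is not a forecasting model.

```python
def moving_average(series, window):
    """Smooth a time series with a simple moving average of length `window`."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def trend_direction(series, window):
    """Compare the first and last smoothed values to label the overall trend."""
    smoothed = moving_average(series, window)
    if smoothed[-1] > smoothed[0]:
        return "up"
    if smoothed[-1] < smoothed[0]:
        return "down"
    return "flat"

# Hypothetical daily closing prices.
prices = [10, 11, 13, 12, 14, 15, 17]

print(moving_average(prices, window=3))
print(trend_direction(prices, window=3))
```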


Classification and Prediction

  • Classification:

    • Process of finding a model that distinguishes data classes

    • Purpose: using the model to predict the class of new objects

  • Deriving model:

    • Based on the analysis of a set of training data

      • data objects with known class labels

  • Example:

    • In a credit card company

      • Classification of customers based on their payment history

      • Prediction of a new customer’s creditworthiness


Classification

  • A two-step process for classification:

    • First: Learning or training step

      • Building the classifier by analyzing or learning from training data

    • Second: classifying step

      • Using the classifier to classify new tuples

  • Accuracy of a classifier (on a given test set)

    • Percentage of test set tuples correctly classified by classifier

  • Classification methods:

    • Decision tree, Naïve Bayesian classification, Neural network, k-nearest neighbor classification, …
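The two-step process and the accuracy measure can be sketched with k-nearest neighbor, one of the methods listed above. The training tuples, the test tuples and the choice of k = 3 are hypothetical. For k-nearest neighbor the learning step only stores the labeled tuples; other methods would build an explicit model at this point.

```python
from collections import Counter
from math import dist

def train_knn(training_data):
    """Learning step: k-NN keeps the class-labeled training tuples as its model."""
    return list(training_data)

def classify_knn(model, point, k=3):
    """Classifying step: majority vote among the k training tuples nearest to `point`."""
    neighbors = sorted(model, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(model, test_set, k=3):
    """Fraction of test tuples whose predicted class equals the known class."""
    correct = sum(1 for x, y in test_set if classify_knn(model, x, k) == y)
    return correct / len(test_set)

# Hypothetical training tuples: (features, class label).
training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high"),
]
# Hypothetical test set with known labels; the last tuple is deliberately hard.
test_set = [((2, 2), "low"), ((9, 9), "high"), ((4, 4), "high")]

model = train_knn(training)
print(classify_knn(model, (2, 2)))
print(accuracy(model, test_set))  # 2 of 3 test tuples are classified correctly
```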


Decision tree

  • Decision tree induction :

    • Learning of decision trees from class-labeled training tuples

  • Decision tree: A flowchart-like tree structure

    • Internal nodes: tests on attributes

    • Branches: outcomes of the test

    • Leaves: class labels

  • Usage in classification:

    • Prediction by tracing a path from the root to a leaf node

    • Testing attribute values of new tuple against decision tree

  • Decision trees can easily be converted to classification rules
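The structure described above can be sketched with nested dictionaries. The tree below is hand-written and hypothetical (attribute names and class labels are illustrative), so the sketch covers only the usage side: tracing a path from the root to a leaf, and turning each root-to-leaf path into an IF-THEN rule. Inducing the tree from training tuples is not shown.

```python
# Hypothetical tree: internal nodes test an attribute, branches are outcomes,
# leaves (plain strings) are class labels.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify_with_tree(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{attr} = {value}" for attr, value in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], value),)))
    return rules

new_tuple = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify_with_tree(tree, new_tuple))
for rule in to_rules(tree):
    print(rule)
```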



Bayesian Classification

  • Bayesian classification

    • Predicting the probability that a new tuple belongs to a particular class

  • High accuracy and speed in large databases

  • Based on Bayes’ theorem

    • Conditional probability

  • Naïve Bayesian classifier

    • Assumption: class conditional independence

    • Simplifies the computations
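A minimal sketch of a naïve Bayesian classifier for categorical attributes follows. Under class conditional independence, the score of a class C for a tuple (x1, ..., xn) is P(C) multiplied by the product of the P(xi | C); normalizing the scores gives class probabilities. The customer data and function names are hypothetical, and the add-one (Laplace) smoothing is an assumption added here so that unseen values do not produce zero probabilities.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def posterior(model, row):
    """Return P(class | row) for every class, assuming attribute independence."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(C)
        for i, value in enumerate(row):
            # Smoothed estimate of P(value | C); add-one smoothing is an assumption.
            score *= (value_counts[(i, label)][value] + 1) / (n + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical customers: (payment history, income) with a known credit class.
rows = [
    ("on_time", "high"), ("on_time", "medium"), ("late", "low"),
    ("late", "medium"), ("on_time", "low"), ("late", "low"),
]
labels = ["good", "good", "bad", "bad", "good", "bad"]

model = train_naive_bayes(rows, labels)
probabilities = posterior(model, ("on_time", "medium"))
print(probabilities)
print(max(probabilities, key=probabilities.get))
```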


Clustering

  • The process of grouping a set of physical or abstract objects into classes of similar objects

    • Generating class labels for objects currently without label

  • Clustering based on this principle:

    • Maximizing the intraclass similarity and

    • Minimizing the interclass similarity

  • Clustering also facilitates taxonomy formation

    • Hierarchical organization of observations


An example: clustering customers in a restaurant

  • Flow: Restaurant database → Preprocessing → Object view for clustering → Clustering → A set of similar object clusters → Summarization

  • Resulting cluster summaries:

    • White collar for dinner

    • Retired for lunch

    • Young at midnight


Steps of database clustering

  • Define object-view

  • Select relevant attributes

  • Generate suitable input format for the clustering tool

  • Define similarity measure

  • Select parameter settings for the chosen clustering algorithm

  • Run clustering algorithm

  • Characterize the computed clusters
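The steps above can be sketched end to end with k-means. Everything concrete here is an assumption made for illustration: the object view (one tuple of age and visit hour per customer), Euclidean distance on the raw values as the similarity measure, k = 3 with hand-picked starting centroids as the parameter settings, and the per-cluster means as the characterization. The slides do not prescribe a particular algorithm.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Object view with selected attributes: (age, visit hour) per hypothetical customer.
customers = [
    (45, 19), (50, 20), (48, 19),
    (70, 12), (68, 13), (72, 12),
    (22, 23), (25, 24), (21, 23),
]
# Parameter settings: k = 3, starting from three hand-picked customers.
initial_centroids = [(45, 19), (70, 12), (22, 23)]

centroids, clusters = kmeans(customers, initial_centroids)

# Characterize the computed clusters by their mean age and mean visit hour.
for centroid, cluster in zip(centroids, clusters):
    print(f"{len(cluster)} customers, mean age {centroid[0]:.1f}, mean hour {centroid[1]:.1f}")
```

Because age and hour are on different scales, a real run would normally rescale the attributes before measuring distance; that is part of defining the similarity measure.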


Challenge: database clustering

  • Data collections are in many different formats

    • Flat files

    • Relational databases

    • Object-oriented databases

  • Flat file format:

    • The simplest and most frequently used format in the traditional data analysis area

  • Databases are more complex than flat files


Challenge: database clustering (cont.)

  • Challenge: Changing clustering algorithms to become more directly applicable to real-world databases

  • Issues related to databases:

    • Different types of objects in DB

    • Relationships between objects: 1:1, 1:n & n:m

    • Complexity in definition of object similarity

      • Due to the presence of bags of values for an object

    • Difficulty in selection of an appropriate similarity measure

      • Due to the presence of different types for attributes of objects
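To make the last two issues concrete, here is one possible similarity measure for objects with mixed attribute types. It is a sketch under stated assumptions, not the measure used in the cited work: numeric attributes are compared by range-normalized difference, categorical attributes by equality, and multi-valued attributes by Jaccard overlap. Treating a bag as a set ignores duplicate values, which is a simplification. The schema format, attribute names and sample customers are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two value collections, treated as sets (duplicates are ignored)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mixed_similarity(x, y, schema):
    """Average per-attribute similarity in [0, 1].

    `schema` maps attribute name -> (kind, value_range), where kind is
    "numeric", "categorical" or "set"; value_range is used only for numeric attributes.
    """
    total = 0.0
    for attr, (kind, value_range) in schema.items():
        if kind == "numeric":
            total += 1.0 - abs(x[attr] - y[attr]) / value_range
        elif kind == "categorical":
            total += 1.0 if x[attr] == y[attr] else 0.0
        elif kind == "set":
            total += jaccard(x[attr], y[attr])
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
    return total / len(schema)

# Hypothetical schema: ages are assumed to span a range of 60 years.
schema = {
    "age": ("numeric", 60),
    "occupation": ("categorical", None),
    "dishes": ("set", None),  # values gathered through a 1:n relationship
}
alice = {"age": 30, "occupation": "engineer", "dishes": {"pizza", "salad"}}
bob = {"age": 45, "occupation": "engineer", "dishes": {"pizza", "soup"}}

print(mixed_similarity(alice, bob, schema))
```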


References

  • Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 978-1-55860-901-3.

  • Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.

  • Ryu, T., Eick, C., A Database Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).

