
An Introduction to Data Mining

Hosein Rostani, Alireza Zohdi



Report 1 for the “Advanced Database” course

Supervisor: Dr. Masoud Rahgozar

December 2007

- Why data mining?
- Data mining applications
- Data mining functionalities
- Concept description
- Association analysis
- Outlier Analysis
- Evolution Analysis
- Classification
- Clustering

- Motivation:
- Wide availability of huge amounts of data
- Need for turning data into useful info & knowledge

- Data mining:
- Extracting or “mining” knowledge from large amounts of data
- Knowledge : useful patterns
- Semiautomatic process
- Focus on automatic aspects

- Prediction. Examples:
- Credit risk
- Customer switching to competitors
- Fraudulent phone calling card usage

- Associations. Examples:
- Related books to buy
- Related accessories to suggest, e.g. for a camera
- Causation discovery: e.g. medicine

- Clusters. Example:
- Clusters of disease

- Concept description
- Characterization & discrimination

- Association analysis
- Outlier Analysis
- Evolution Analysis
- Classification and Prediction
- Clustering

- Description of concepts
- summarized, concise & precise

- Ways:
- Data characterization
- Summarizing the data of the target class in general terms

- Data discrimination
- Comparison of the target class with the contrasting class(es)

- Output forms of data characterization, for example:
- Pie charts, bar charts, curves & multidimensional tables

- Mining frequent patterns
- For discovery of interesting associations within data

- Kinds of frequent patterns:
- Frequent itemset
- Set of items that frequently appear together, e.g. milk and bread

- Frequent subsequence
- E.g. pattern of customers’ purchase:
- First a PC, then a digital camera & then a memory card

- Frequent substructure
- Structural forms such as graphs, trees, or lattices

- Interestingness measures for association rules:
- Support and confidence
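
Support and confidence can be made concrete with a small sketch. The transactions below are hypothetical toy data, and the function names `support` and `confidence` are illustrative choices rather than anything defined in the slides. The sketch assumes the usual definitions: support is the fraction of transactions containing an itemset, and the confidence of a rule A => B is support(A and B together) divided by support(A).

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions: each one is the set of items bought together.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
    frozenset({"milk", "eggs"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    antecedent: Iterable[str],
    consequent: Iterable[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    support_a = support(a, transactions)
    if support_a == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(both, transactions) / support_a


if __name__ == "__main__":
    print(support({"milk", "bread"}, TRANSACTIONS))       # 2 of 4 transactions
    print(confidence({"milk"}, {"bread"}, TRANSACTIONS))  # 2 of the 3 milk transactions
```

In this toy data, {milk, bread} has support 0.5, and the rule milk => bread has confidence 2/3, because two of the three transactions containing milk also contain bread.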

- Outliers:
- Data objects that do not comply with the general behavior of the data

- Approaches to outliers
- Discard as noise or exceptions
- Keep for applications such as fraud detection
- Example: detecting fraudulent usage of credit cards

- Ways:
- Using statistical tests
- Using distance measures
- Using deviation-based methods
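
Two of the approaches listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended detector: the statistical test is assumed to be a plain z-score check, the distance-based check flags a value that has too few neighbors within a given radius, and the amounts, thresholds and function names are all hypothetical.

```python
import statistics
from typing import List, Sequence

# Hypothetical card transaction amounts; 95 is the suspicious one.
AMOUNTS = [10, 12, 11, 13, 12, 11, 95]


def zscore_outliers(values: Sequence[float], threshold: float = 2.0) -> List[float]:
    """Statistical approach: flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


def distance_outliers(values: Sequence[float], radius: float, min_neighbors: int) -> List[float]:
    """Distance-based approach: flag values with fewer than `min_neighbors` other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers


if __name__ == "__main__":
    print(zscore_outliers(AMOUNTS))
    print(distance_outliers(AMOUNTS, radius=5, min_neighbors=2))
```

Both checks flag only the value 95 on this data. The z-score variant is sensitive to sample size, since an extreme value also inflates the standard deviation it is measured against.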

- Description and modeling of trends
- For objects with changing behavior over time

- Ways:
- Applying other data mining tasks to time-related data
- Association analysis, classification, prediction, clustering & …

- Distinct ways
- time-series data analysis
- sequence or periodicity pattern matching
- similarity-based data analysis

- Example: stock market: predict future trends in prices
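
As one very simple instance of time-series analysis, a moving average smooths a price series so that its trend is easier to see. The sketch below is an illustration only: the prices are hypothetical, and the choice of a simple moving average is an assumption, since the slides do not name a specific technique.

```python
from typing import List, Sequence

# Hypothetical daily closing prices.
PRICES = [10.0, 11.0, 13.0, 12.0, 14.0, 15.0]


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Average of each run of `window` consecutive values in the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def is_upward_trend(series: Sequence[float], window: int) -> bool:
    """True if the smoothed series never decreases."""
    smoothed = moving_average(series, window)
    return all(later >= earlier for earlier, later in zip(smoothed, smoothed[1:]))


if __name__ == "__main__":
    print(moving_average(PRICES, window=3))
    print(is_upward_trend(PRICES, window=3))
```

The raw series dips from 13 to 12, but the three-day moving average rises throughout, which is the kind of trend description evolution analysis aims for.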

- Classification:
- Process of finding a model that distinguishes data classes
- Purpose: using the model to predict the class of new objects

- Deriving model:
- Based on the analysis of a set of training data
- data objects with known class labels

- Example:
- In a credit card company
- Classification of customers based on their payment history
- Prediction of a new customer’s credit worthiness

- A two-step process for classification:
- First: Learning or training step
- Building the classifier by analyzing or learning from training data

- Second: classifying step
- Using classifier for classification

- Accuracy of a classifier (on a given test set)
- Percentage of test set tuples correctly classified by classifier
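
The two steps and the accuracy measure can be sketched with a deliberately simple classifier. The example assumes a 1-nearest-neighbor rule, for which "training" only stores the labeled tuples. The features (number of late payments, credit utilization), the labels and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

LabeledTuple = Tuple[Tuple[float, float], str]

# Hypothetical customers: (late payments, credit utilization) -> class label.
TRAINING: List[LabeledTuple] = [
    ((0, 0.2), "good"),
    ((1, 0.3), "good"),
    ((5, 0.9), "risky"),
    ((6, 0.8), "risky"),
]
TEST: List[LabeledTuple] = [
    ((0, 0.1), "good"),
    ((4, 0.7), "risky"),
    ((2, 0.5), "risky"),
    ((1, 0.4), "good"),
]


def train_classifier(training_tuples: Sequence[LabeledTuple]) -> List[LabeledTuple]:
    """Step 1 (learning): a 1-NN model is just the stored training tuples."""
    return list(training_tuples)


def classify(model: Sequence[LabeledTuple], features: Sequence[float]) -> str:
    """Step 2 (classification): return the label of the closest training tuple."""
    nearest = min(model, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


def accuracy(model: Sequence[LabeledTuple], test_tuples: Sequence[LabeledTuple]) -> float:
    """Percentage of test tuples whose predicted label matches the known label."""
    correct = sum(1 for features, label in test_tuples if classify(model, features) == label)
    return 100.0 * correct / len(test_tuples)


if __name__ == "__main__":
    model = train_classifier(TRAINING)
    print(classify(model, (4, 0.7)))
    print(accuracy(model, TEST))
```

On this toy data three of the four test tuples are classified correctly, giving an accuracy of 75%. The test set is kept separate from the training set so that the estimate is not overly optimistic.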

- Classification methods:
- Decision tree, Naïve Bayesian classification, Neural network, k-nearest neighbor classification, …

- Decision tree induction :
- Learning of decision trees from class-labeled training tuples

- Decision tree: A flowchart-like tree structure
- Internal nodes: tests on attributes
- Branches: outcomes of the test
- Leaves: class labels

- Usage in classification:
- Prediction by tracing a path from the root to a leaf node
- Testing attribute values of new tuple against decision tree

- Decision trees can easily be converted to classification rules
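
The structure and both uses of a decision tree can be sketched with nested dictionaries. The tree below is a hand-written, hypothetical example rather than one induced from data. Internal nodes hold an attribute test, branches are keyed by test outcome, and leaves are class labels. The names `classify` and `to_rules` are illustrative.

```python
from typing import Dict, List, Tuple, Union

Tree = Union[str, Dict[str, object]]

# Hypothetical tree: internal nodes test an attribute, leaves are class labels.
TREE: Tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "does_not_buy", "yes": "buys"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "buys", "excellent": "does_not_buy"},
        },
    },
}


def classify(tree: Tree, tuple_: Dict[str, str]) -> str:
    """Trace a path from the root to a leaf by testing the tuple's attribute values."""
    node = tree
    while isinstance(node, dict):
        value = tuple_[node["attribute"]]
        node = node["branches"][value]
    return node


def to_rules(tree: Tree, conditions: Tuple[Tuple[str, str], ...] = ()) -> List[str]:
    """Emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        lhs = " AND ".join(f"{attr} = {val}" for attr, val in conditions) or "TRUE"
        return [f"IF {lhs} THEN class = {tree}"]
    rules: List[str] = []
    for value, subtree in tree["branches"].items():
        rules.extend(to_rules(subtree, conditions + ((tree["attribute"], value),)))
    return rules


if __name__ == "__main__":
    print(classify(TREE, {"age": "youth", "student": "yes"}))
    for rule in to_rules(TREE):
        print(rule)
```

Each leaf yields exactly one rule, so this tree with five leaves converts to five classification rules.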

- Bayesian classification
- Predicting the probability that a new tuple belongs to a particular class

- High accuracy and speed when applied to large databases
- Based on Bayes’ theorem
- Conditional probability

- Naïve Bayesian classifier
- Assumption: class conditional independence
- The assumption simplifies the computations
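
A minimal sketch of a naïve Bayesian classifier for categorical attributes shows how the independence assumption simplifies the computation: the class-conditional probability of a tuple becomes a product of per-attribute probabilities estimated by counting. The data, attribute meanings and function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class probability zero.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple

# Hypothetical customers: (payment behavior, credit utilization) with known class labels.
ROWS = [
    ("late", "high"), ("late", "low"), ("on_time", "high"),
    ("on_time", "low"), ("on_time", "low"), ("late", "low"), ("on_time", "high"),
]
LABELS = ["risky", "risky", "risky", "good", "good", "good", "good"]


def train_naive_bayes(rows: Sequence[Tuple[str, ...]], labels: Sequence[str]):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (class label, attribute index)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def posterior(model, row: Tuple[str, ...]) -> Dict[str, float]:
    """P(class | row) via Bayes' theorem under class conditional independence."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            score *= value_counts[(label, i)][value] / count  # P(attribute value | class)
        scores[label] = score
    evidence = sum(scores.values())
    if evidence == 0:
        return scores  # every class scored zero, nothing to normalize
    return {label: score / evidence for label, score in scores.items()}


if __name__ == "__main__":
    model = train_naive_bayes(ROWS, LABELS)
    print(posterior(model, ("late", "high")))
```

For the tuple ("late", "high") the unnormalized scores are 3/7 · 2/3 · 2/3 for "risky" and 4/7 · 1/4 · 1/4 for "good", so the posterior probability of "risky" is 16/19, roughly 0.84.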

- Clustering:
- The process of grouping a set of physical or abstract objects into classes of similar objects
- Generating class labels for objects that currently have no label

- Clustering based on this principle:
- Maximizing the intraclass similarity and
- Minimizing the interclass similarity

- Clustering also for facilitating taxonomy formation
- Hierarchical organization of observations

[Figure: database clustering example. Restaurant database → Preprocessing → Object view for clustering → Clustering → A set of similar object clusters → Summarization. Example cluster summaries: “White collar for dinner”, “Retired for lunch”, “Young at midnight”.]

- Define object-view
- Select relevant attributes
- Generate suitable input format for the clustering tool
- Define similarity measure
- Select parameter settings for the chosen clustering algorithm
- Run clustering algorithm
- Characterize the computed clusters
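
The steps above can be sketched end to end on a toy example. Everything specific here is an assumption made for illustration: the records and attribute names are hypothetical, Euclidean distance stands in for the similarity measure, and a basic k-means with a deterministic start (the first k points) stands in for "the chosen clustering algorithm", which the slides do not name.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# Steps 1-2: object view and relevant attributes (hypothetical restaurant visits).
RECORDS = [
    {"customer": "a", "age": 24, "visit_hour": 23},
    {"customer": "b", "age": 68, "visit_hour": 12},
    {"customer": "c", "age": 27, "visit_hour": 22},
    {"customer": "d", "age": 70, "visit_hour": 13},
    {"customer": "e", "age": 22, "visit_hour": 23},
    {"customer": "f", "age": 66, "visit_hour": 12},
]
ATTRIBUTES = ("age", "visit_hour")

Point = Tuple[float, ...]


def to_vectors(records: Sequence[dict], attributes: Sequence[str]) -> List[Point]:
    """Step 3: turn records into the numeric input format the algorithm expects."""
    return [tuple(float(r[a]) for a in attributes) for r in records]


def distance(p: Point, q: Point) -> float:
    """Step 4: similarity measure (smaller Euclidean distance means more similar)."""
    return math.dist(p, q)


def k_means(points: Sequence[Point], k: int, iterations: int = 10):
    """Steps 5-6: run a basic k-means with parameters k and iterations."""
    centroids = list(points[:k])  # deterministic start: first k points (assumption)
    assignment: List[int] = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return assignment, centroids


def characterize(points, assignment, attributes) -> Dict[int, dict]:
    """Step 7: summarize each cluster by its size and attribute means."""
    summary = {}
    for c in sorted(set(assignment)):
        members = [p for p, a in zip(points, assignment) if a == c]
        means = {attr: mean(col) for attr, col in zip(attributes, zip(*members))}
        summary[c] = {"size": len(members), **means}
    return summary


if __name__ == "__main__":
    points = to_vectors(RECORDS, ATTRIBUTES)
    assignment, _ = k_means(points, k=2)
    print(characterize(points, assignment, ATTRIBUTES))
```

On this data the two clusters correspond to young late-night visitors and older lunchtime visitors. Treating the hour as a plain number ignores that hours wrap around midnight, which a real similarity measure would need to handle.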

- Data collections are in many different formats
- Flat files
- Relational databases
- Object-oriented databases

- Flat file format:
- The simplest and most frequently used format in the traditional data analysis area

- Databases are more complex than flat files

- Challenge: adapting clustering algorithms so that they are more directly applicable to real-world databases
- Issues related to databases:
- Different types of objects in DB
- Relationships between objects: 1:1, 1:n & n:m
- Complexity in definition of object similarity
- Due to the presence of bags of values for an object

- Difficulty in selection of an appropriate similarity measure
- Due to the presence of different types for attributes of objects
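
The last two issues can be illustrated with one possible similarity measure for objects that mix attribute types and carry a bag of values. This is a sketch under stated assumptions, not the method of the cited methodology: the bag similarity is a multiset version of the Jaccard coefficient, numeric attributes are compared by range-normalized difference, categorical attributes by equality, and the per-attribute scores are simply averaged. All attribute names and values are hypothetical.

```python
from collections import Counter
from typing import Iterable


def bag_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Multiset Jaccard: shared occurrences divided by total occurrences."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())  # per-value maximum counts
    if union == 0:
        return 1.0  # two empty bags are treated as identical
    return sum((ca & cb).values()) / union  # per-value minimum counts


def numeric_similarity(x: float, y: float, value_range: float) -> float:
    """1.0 for equal values, falling linearly to 0.0 at the full attribute range."""
    return 1.0 - abs(x - y) / value_range


def object_similarity(o1: dict, o2: dict, age_range: float = 50.0) -> float:
    """Average of per-attribute similarities across numeric, categorical and bag attributes."""
    parts = [
        numeric_similarity(o1["age"], o2["age"], age_range),
        1.0 if o1["occupation"] == o2["occupation"] else 0.0,
        bag_similarity(o1["orders"], o2["orders"]),
    ]
    return sum(parts) / len(parts)


if __name__ == "__main__":
    # The bag of orders arises from a 1:n relationship between customer and orders.
    alice = {"age": 30, "occupation": "engineer", "orders": ["pizza", "pizza", "sushi"]}
    bob = {"age": 40, "occupation": "engineer", "orders": ["pizza", "sushi", "sushi"]}
    print(object_similarity(alice, bob))
```

Averaging with equal weights is the simplest choice. Weighting attributes differently, or choosing another bag measure, changes which objects end up in the same cluster, which is why the selection of the similarity measure is listed as a difficulty.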

- Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 1-55860-901-3.
- Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.
- Ryu, T., Eick, C., A Database Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).