
An Introduction to

Data Mining

Hosein Rostani Alireza Zohdi

Report 1 for the “Advanced Database” course

Supervisor: Dr. Masoud Rahgozar

December 2007

- Why data mining?
- Data mining applications
- Data mining functionalities
- Concept description
- Association analysis
- Outlier Analysis
- Evolution Analysis
- Classification
- Clustering

- Motivation:
- Wide availability of huge amounts of data
- Need for turning data into useful info & knowledge

- Data mining:
- Extracting or “mining” knowledge from large amounts of data
- Knowledge : useful patterns
- Semiautomatic process
- Focus on automatic aspects

- Prediction. Examples:
- Credit risk
- Customer switching to competitors
- Fraudulent phone calling card usage

- Associations. Examples:
- Related books to buy
- Related accessories to suggest, e.g. for a camera
- Causation discovery, e.g. in medicine

- Clusters. Example:
- Clusters of disease

- Concept description
- Characterization & discrimination

- Association analysis
- Outlier Analysis
- Evolution Analysis
- Classification and Prediction
- Clustering

- Description of concepts
- summarized, concise & precise

- Ways:
- Data characterization
- Summarizing the data of the target class in general terms

- Data discrimination
- Comparison of the target class with the contrasting class(es)

- Data characterization
- Examples of Output forms:
- Pie charts, bar charts, curves & multidimensional tables
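
A minimal Python sketch of characterization and discrimination. The customer records, the field names (`group`, `age`, `spend`) and both function names are hypothetical and do not come from the slides. Characterization summarizes the target class in general terms; discrimination compares that summary with the one for a contrasting class.

```python
from statistics import mean

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"group": "big_spender", "age": 45, "spend": 900},
    {"group": "big_spender", "age": 52, "spend": 1100},
    {"group": "budget", "age": 23, "spend": 150},
    {"group": "budget", "age": 31, "spend": 250},
]

def characterize(records, target):
    """Summarize the target class in general terms (count and averages)."""
    rows = [r for r in records if r["group"] == target]
    return {
        "count": len(rows),
        "avg_age": mean(r["age"] for r in rows),
        "avg_spend": mean(r["spend"] for r in rows),
    }

def discriminate(records, target, contrast):
    """Compare the target class with a contrasting class, feature by feature."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in ("avg_age", "avg_spend")}

summary = characterize(customers, "big_spender")
difference = discriminate(customers, "big_spender", "budget")
```

The resulting dictionaries could then feed any of the output forms listed above, such as a bar chart or a multidimensional table.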

- Mining frequent patterns
- For discovery of interesting associations within data

- Kinds of frequent patterns:
- Frequent itemset
- Set of items that frequently appear together, e.g. milk and bread

- Frequent subsequence
- E.g. a customer purchase pattern: first a PC, then a digital camera, then a memory card

- Frequent substructure
- Structural forms such as graphs, trees, or lattices

- Interestingness measures for association rules: support and confidence
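
Support and confidence can be illustrated with a short Python sketch. The transactions below are made-up market-basket data, and the helper names (`count`, `support`, `confidence`) are assumptions of this example. Support of an itemset is the fraction of transactions containing it; confidence of a rule A => C is the fraction of transactions containing A that also contain C.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

milk_bread_support = support({"milk", "bread"}, transactions)
milk_to_bread_confidence = confidence({"milk"}, {"bread"}, transactions)
```

Here {milk, bread} appears in 3 of 5 transactions, and 3 of the 4 transactions containing milk also contain bread.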

- Outliers:
- Data objects that do not comply with the general behavior of the data

- Approaches to outliers
- Discard as noise or exceptions
- Keep for applications such as fraud detection
- Example: detecting fraudulent usage of credit cards

- Ways:
- Using statistical tests
- Using distance measures
- Using deviation-based methods
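
Two of the approaches above can be sketched in Python. This is an illustration under stated assumptions, not a production detector: the z-score rule stands in for "statistical tests", the neighbor-count rule stands in for "distance measures", and the thresholds, function names and transaction amounts are all hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Statistical approach: flag values far from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance approach: flag values with too few other values within `radius`."""
    flagged = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbors < min_neighbors:
            flagged.append(v)
    return flagged

# Hypothetical card transaction amounts; 95 is the suspicious one.
amounts = [10, 12, 11, 13, 12, 11, 95]
by_zscore = zscore_outliers(amounts)
by_distance = distance_outliers(amounts, radius=5, min_neighbors=2)
```

Whether a flagged value is discarded as noise or kept as a possible fraud case depends on the application, as noted above.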

- Description and modeling of trends
- For objects with changing behavior over time

- Ways:
- Applying other data mining tasks on time related data
- Association analysis, classification, prediction, clustering & …

- Distinct ways
- time-series data analysis
- sequence or periodicity pattern matching
- similarity-based data analysis

- Example: predicting future price trends in the stock market
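
As a small illustration of time-series analysis, the sketch below smooths a price series with a simple moving average and checks whether the smoothed trend is rising. The price values, the window size and the function name are assumptions of this example; the slides do not prescribe a particular technique.

```python
def moving_average(series, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

# Hypothetical daily closing prices.
prices = [10, 11, 13, 12, 15, 17, 16, 19]
smoothed = moving_average(prices, window=3)

# The trend counts as rising if the smoothed series never decreases.
trend_is_up = all(b >= a for a, b in zip(smoothed, smoothed[1:]))
```

Smoothing hides the short dips at 12 and 16, so the underlying upward trend becomes visible.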

- Classification:
- Process of finding a model that distinguishes data classes
- Purpose: using the model to predict the class of new objects

- Deriving model:
- Based on the analysis of a set of training data
- data objects with known class labels

- Example:
- In a credit card company
- Classification of customers based on their payment history
- Prediction of a new customer’s credit worthiness


- A two-step process for classification:
- First: Learning or training step
- Building the classifier by analyzing or learning from training data

- Second: classifying step
- Using classifier for classification

- Accuracy of a classifier (on a given test set)
- Percentage of test set tuples correctly classified by classifier
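
The two steps and the accuracy measure can be sketched as follows. The classifier here is deliberately trivial (a single threshold on a hypothetical "number of late payments" attribute) and is only an assumption made to keep the example short; the data and names are invented for illustration.

```python
from statistics import mean

def train(training_set):
    """Learning step: pick a threshold midway between the two class means."""
    good = [x for x, label in training_set if label == "good"]
    risky = [x for x, label in training_set if label == "risky"]
    threshold = (mean(good) + mean(risky)) / 2
    return lambda late_payments: "risky" if late_payments > threshold else "good"

def accuracy(classifier, test_set):
    """Percentage of test tuples whose predicted class matches the known label."""
    correct = sum(1 for x, label in test_set if classifier(x) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical (late payments, class label) tuples.
training_set = [(0, "good"), (1, "good"), (5, "risky"), (6, "risky")]
test_set = [(2, "good"), (4, "risky"), (3, "risky"), (7, "risky")]

classifier = train(training_set)               # first step: learning
test_accuracy = accuracy(classifier, test_set)  # second step, then evaluation
```

The test set is kept separate from the training set so that the accuracy estimate is not inflated by tuples the classifier has already seen.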

- Classification methods:
- Decision tree, Naïve Bayesian classification, Neural network, k-nearest neighbor classification, …

- Decision tree induction :
- Learning of decision trees from class-labeled training tuples

- Decision tree: A flowchart-like tree structure
- Internal nodes: tests on attributes
- Branches: outcomes of the test
- Leaves: class labels

- Usage in classification:
- Prediction by tracing a path from the root to a leaf node
- Testing the attribute values of a new tuple against the decision tree

- Decision trees can easily be converted to classification rules
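
A sketch of both uses in Python, assuming a hand-written tree rather than an induced one. The nested-dictionary representation, the attribute names and the class labels are hypothetical. `classify` traces a path from the root to a leaf; `to_rules` emits one rule per root-to-leaf path.

```python
# Hypothetical tree: internal nodes are dicts, leaves are class-label strings.
tree = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "payment_history",
            "branches": {"good": "approve", "bad": "reject"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Return (conditions, class label) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):
        return [(conditions, node)]
    rules = []
    for outcome, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], outcome),))
    return rules

prediction = classify(tree, {"income": "low", "payment_history": "bad"})
rules = to_rules(tree)
```

Each rule reads as "IF all conditions hold THEN class", for example IF income = low AND payment_history = bad THEN reject.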

- Bayesian classification
- Predicting the probability that a new tuple belongs to a particular class

- High accuracy and speed in large databases
- Based on Bayes’ theorem
- Conditional probability

- Naïve Bayesian classifier
- Assumption: class conditional independence
- Simplifies the computations
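
A minimal naïve Bayesian classifier for categorical attributes, written as a sketch. The training tuples, attribute meanings and function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class drives that class's probability to zero. Under the independence assumption, the score of a class is its prior times the product of the per-attribute conditional probabilities.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class and attribute, value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def posterior(model, row):
    """Return P(class | row) for every class, assuming independent attributes."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            score *= value_counts[(label, i)][value] / n  # P(value | class)
        scores[label] = score
    evidence = sum(scores.values())
    if evidence == 0:
        return scores
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical tuples: (payment history, income) with a known class label.
rows = [("good", "high"), ("good", "low"), ("bad", "low"),
        ("good", "low"), ("good", "high"), ("bad", "high")]
labels = ["approve", "approve", "reject", "reject", "approve", "reject"]

model = train_naive_bayes(rows, labels)
probabilities = posterior(model, ("good", "low"))
```

For the tuple ("good", "low") the unnormalized scores are 1/6 for approve and 1/9 for reject, which normalize to 0.6 and 0.4.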

- Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects
- Generating class labels for objects currently without label

- Clustering based on this principle:
- Maximizing the intraclass similarity and
- Minimizing the interclass similarity

- Clustering also for facilitating taxonomy formation
- Hierarchical organization of observations
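
The grouping principle can be illustrated with k-means, one common clustering algorithm. The slides do not name a specific algorithm, so k-means, the fixed initial centroids and the sample points are all assumptions of this sketch. Each point is assigned to its nearest centroid, and each centroid is then moved to the mean of its assigned points.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update: move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D objects forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

The cluster index assigned to each point serves as a generated class label for objects that had none.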

[Figure: database clustering workflow. Restaurant database -> Preprocessing -> Object view for clustering -> Clustering -> A set of similar object clusters -> Summarization. Example cluster summaries: "White collar for dinner", "Retired for lunch", "Young at midnight".]

- Define object-view
- Select relevant attributes
- Generate suitable input format for the clustering tool
- Define similarity measure
- Select parameter settings for the chosen clustering algorithm
- Run clustering algorithm
- Characterize the computed clusters
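
The first three steps can be sketched in Python as building one record per object from related tables. The two tables, their column names and `build_object_view` are hypothetical, loosely modeled on a restaurant scenario. Values from the "many" side of a 1:n relationship are gathered into bags (lists) so that each customer becomes a single input object for a clustering tool.

```python
from collections import defaultdict

# Hypothetical relational rows: one customer has many visits (1:n).
customers = [
    {"id": 1, "occupation": "white collar"},
    {"id": 2, "occupation": "retired"},
]
visits = [
    {"customer_id": 1, "meal": "dinner", "amount": 40},
    {"customer_id": 1, "meal": "dinner", "amount": 55},
    {"customer_id": 2, "meal": "lunch", "amount": 15},
]

def build_object_view(customers, visits, attributes=("meal", "amount")):
    """Join visits onto customers, collecting the selected attributes into bags."""
    bags = defaultdict(lambda: {a: [] for a in attributes})
    for visit in visits:
        for a in attributes:
            bags[visit["customer_id"]][a].append(visit[a])
    # One flat record per customer: own attributes plus bags of related values.
    return [{**c, **bags[c["id"]]} for c in customers]

object_view = build_object_view(customers, visits)
```

The `attributes` parameter plays the role of "select relevant attributes"; a similarity measure and a clustering algorithm would then operate on `object_view`.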

- Data collections are in many different formats
- Flat files
- Relational databases
- Object-oriented databases

- Flat file format:
- The simplest and most frequently used format in the traditional data analysis area

- Databases are more complex than flat files

- Challenge: adapting clustering algorithms so that they apply more directly to real-world databases
- Issues related to databases:
- Different types of objects in DB
- Relationships between objects: 1:1, 1:n & n:m
- Complexity in definition of object similarity
- Due to the presence of bags of values for an object

- Difficulty in selection of an appropriate similarity measure
- Due to the presence of different types for attributes of objects
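
One way to handle both difficulties is to compute a similarity per attribute type and average the results. The sketch below is an assumption for illustration and is not the measure from the cited references: nominal attributes use exact match, numeric attributes use a range-scaled difference, and bags use a Jaccard-style overlap on multisets. All names, the equal weighting and the `age_range` value are hypothetical.

```python
from collections import Counter

def nominal_similarity(a, b):
    """1.0 if the two category values match, otherwise 0.0."""
    return 1.0 if a == b else 0.0

def numeric_similarity(a, b, value_range):
    """1.0 for equal numbers, falling linearly to 0.0 across `value_range`."""
    return 1.0 - abs(a - b) / value_range

def bag_similarity(bag_a, bag_b):
    """Jaccard-style overlap for bags: size of intersection over size of union."""
    ca, cb = Counter(bag_a), Counter(bag_b)
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0

def object_similarity(x, y, age_range=60):
    """Unweighted average of the per-attribute similarities."""
    parts = [
        nominal_similarity(x["occupation"], y["occupation"]),
        numeric_similarity(x["age"], y["age"], age_range),
        bag_similarity(x["meals"], y["meals"]),
    ]
    return sum(parts) / len(parts)

# Hypothetical objects with a nominal, a numeric and a bag-valued attribute.
a = {"occupation": "retired", "age": 70, "meals": ["lunch", "lunch", "dinner"]}
b = {"occupation": "retired", "age": 64, "meals": ["lunch", "dinner"]}
similarity = object_similarity(a, b)
```

Equal weights are the simplest choice; in practice the weights and the per-type measures would be tuned to the data.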

- Han, J., Kamber, M., Data Mining: Concepts and Techniques, Second Edition, Elsevier Inc., 2006, 770 p., ISBN 978-1-55860-901-3.
- Silberschatz, A., Korth, H. F., Sudarshan, S., Database System Concepts, Fifth Edition, McGraw-Hill, 2005, ISBN 0-07-295886-3.
- Ryu, T., Eick, C., A Database Clustering Methodology and Tool, in Information Sciences 171(1-3): 29-59 (2005).