
An Introduction



Presentation Transcript


  1. An Introduction Student Name: Riaz Ahmad Program: MSIT (2005-2007) Subject: Data Warehouse & Data Mining

  2. Terms Relevant to My Research
  • KDD process
  • Data warehouse
  • Materialized or indexed view
  • Data mining
  • Data mining techniques
  • Classification
  • Decision tree
  • ID3 algorithm
  • Objective
  • Conclusion

  3. KDD Process (overview diagram): Databases → Data Cleaning / Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

  4. Steps of a KDD Process
  • Learning the application domain: relevant prior knowledge and the goals of the application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of the effort!)
  • Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
  • Choosing the functions of data mining: summarization, classification, regression, association, clustering
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
  • Use of discovered knowledge

  5. Data Warehouse A data warehouse is a subject-oriented, integrated (consolidated), non-volatile (read-only), time-variant collection of data designed to support management's DSS needs. Any read-only collection of accumulated historical data can be called a data warehouse. A data warehouse is a database specifically structured for query and analysis, and it typically contains data representing the business history of an organization.

  6. Materialized or Indexed View A materialized view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables joined together. Materialized views can dramatically improve query performance and significantly decrease the load on the system. They matter more in a data warehouse environment than in an OLTP environment, and they pay off on huge databases, not on a table that has three records.
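To make the idea concrete, here is a minimal sketch that simulates a materialized view as a pre-computed summary table, using Python's built-in sqlite3 module. The table and column names (sales, region, amount, sales_by_region) are my own illustrative choices and are not from the original slides.

```python
import sqlite3

# In-memory database standing in for a small warehouse (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 250.0), ("south", 80.0), ("south", 120.0)],
)

# The "materialized view": the aggregate is computed once and stored, so later
# queries read the small summary table instead of scanning the detail rows.
con.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# Query the pre-computed summary instead of re-aggregating the base table.
for region, total in con.execute("SELECT region, total FROM sales_by_region"):
    print(region, total)
```

In a real system the summary table must be refreshed whenever the base data changes, which is the update policy discussed in the conclusion slide.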

  7. Data Mining Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic, and the patterns discovered must be meaningful in that they lead to some advantage. Put simply, data mining refers to extracting or mining "knowledge" from large amounts of data. Many other terms are used for it, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining is concerned with finding hidden relationships in business data that allow businesses to make predictions for future use.

  8. Data Mining Tasks or Techniques Classification, regression, segmentation, association, forecasting, text analysis, and advanced data exploration. Among these data mining tasks, only classification is selected for this research work.

  9. Classification Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends. Given a number of pre-specified classes, a new object, record, or individual is examined and assigned, based on a model, to one of these classes. Examples: Which credit applicants are low, medium, or high risk? Which hotel customers are likely or unlikely to return? Which residents are likely or unlikely to vote?

  10. Classification Techniques Decision tree based methods, rule-based methods, memory-based reasoning, neural networks, genetic algorithms, and Bayesian networks. Among these classification techniques, only the decision tree is selected.

  11. Decision Tree A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes. The topmost node in a tree is the root node. Equivalently, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision making. Construction starts with a root node; from this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible decision scenario and its outcome.
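As an illustration of this structure (a minimal sketch, not code from the original slides), the following Python represents a tree as nested tuples: internal nodes hold an attribute to test, branches are the attribute's values, and leaves are class labels. The particular tree shown is reconstructed from the buys_computer rules on slides 13 and 14.

```python
# Illustrative only: the buys_computer tree implied by the rules on the next
# slides, written as nested ("attribute", {value: subtree_or_leaf}) nodes.
tree = (
    "age",
    {
        "<30": ("student", {"no": "no", "yes": "yes"}),
        "30-40": "yes",
        ">40": ("credit_rating", {"excellent": "yes", "fair": "no"}),
    },
)

def predict(node, record):
    """Follow attribute tests from the root until a leaf (class label) is reached."""
    while isinstance(node, tuple):           # internal node: (attribute, branches)
        attribute, branches = node
        node = branches[record[attribute]]   # take the branch matching this record
    return node                              # leaf: the predicted class

print(predict(tree, {"age": "<30", "student": "yes"}))          # -> yes
print(predict(tree, {"age": ">40", "credit_rating": "fair"}))   # -> no
```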

  12. Decision Tree

  13. Generating Classification Rules from a Decision Tree
  • IF age = "<30" AND student = "no" THEN buys_computer = "no"
  • IF age = "<30" AND student = "yes" THEN buys_computer = "yes"
  • IF age = "30-40" THEN buys_computer = "yes"
  • IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  • IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
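These five rules translate directly into a short Python function (an illustration, not part of the original slides; the function name buys_computer is my own):

```python
def buys_computer(age, student=None, credit_rating=None):
    """Apply the five classification rules read off the decision tree above."""
    if age == "<30":
        return "yes" if student == "yes" else "no"
    if age == "30-40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "excellent" else "no"
    raise ValueError("unknown age category: %r" % age)

print(buys_computer("<30", student="no"))                # -> no
print(buys_computer(">40", credit_rating="excellent"))   # -> yes
```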


  15. ID3 Algorithm
  Originator of the ID3 algorithm: ID3 and its successors were developed by Ross Quinlan, who introduced it in the 1970s.
  Outline of the ID3 algorithm:
  ID3 (learning set S, attribute set A, attribute values V) returns a decision tree.
  Begin
  • Load the learning set, create the decision tree root node 'rootNode', and add the learning set S to the root node as its subset.
  • For rootNode, first compute Entropy(rootNode.subset).
  • If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records that all have the same value for the categorical attribute; return a leaf node labelled with that attribute value.
  • If Entropy(rootNode.subset) != 0, then compute the information gain for each attribute not yet used in splitting, and find the attribute A with maximum Gain(S, A).
  • Create child nodes of rootNode, one per value of A, and add them to rootNode in the decision tree.
  • For each child of rootNode, apply ID3(S, A, V) recursively until a node with entropy = 0 or a leaf node is reached.
  End ID3.
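As a concrete illustration of this outline (a minimal sketch, not the code used in the original work), the following Python implements ID3 over records stored as dictionaries; the function and variable names are my own.

```python
import math
from collections import Counter

def entropy(records, target):
    """Entropy of the class label distribution in a list of records."""
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(records, attribute, target):
    """Information gain of splitting the records on the given attribute."""
    total = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    """Return a leaf label, or a nested ("attribute", {value: subtree}) node."""
    if entropy(records, target) == 0:      # all records share one class: leaf
        return records[0][target]
    if not attributes:                     # no attributes left: majority class
        return Counter(r[target] for r in records).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    branches = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best], target)
    return (best, branches)
```

Run on the 14-record training set described on the following slides, this would pick the attribute with the highest gain (outlook) at the root, matching the worked calculation below.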

  16. Mathematical Formulae The following formulae are used for the calculation of entropy and information gain.
  • Entropy equation: Entropy(S) = -Σi pi log2 pi, where pi is the relative frequency of class i in dataset S.
  • Information gain equation: Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) * Entropy(Sv), where Sv is the subset of S for which attribute A has value v.

  17. Classification Experiments First, a training dataset (S) is taken for the classification experiment.

  18. Step (1): Entropy of the Original Dataset The entropy calculation for the dataset proceeds as follows.
  • First count the records with the 'no' class value, which are five (5), while the 'yes' class value records are nine (9). The total number of records is fourteen (14).
  • Relative frequency of the 'no' class: 5/14.
  • Relative frequency of the 'yes' class: 9/14.
  • The entropy of dataset S is calculated with the entropy formula above:
  • Entropy(5, 9) = -5/14 log2 5/14 - 9/14 log2 9/14 = 0.9403
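This number is easy to verify in Python (a quick sketch with a function name of my own choosing, not from the slides):

```python
from math import log2

def entropy_from_counts(*counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

print(round(entropy_from_counts(5, 9), 4))  # -> 0.9403, matching the slide
```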

  19. Step (2): Calculate the Gain of Each Input Attribute in the Dataset The following information is required for the gain calculation of the outlook attribute. To calculate an attribute's gain, first check the number of values this attribute takes, and then classify dataset S on the basis of each value. The outlook attribute has three values (rain, overcast, and sunny), so three subsets of S are possible on the basis of the outlook values.
  • I. The first subset (S1) contains five (5) records, based on the rain value of the outlook attribute.
  • II. The second subset (S2) contains four (4) records, based on the overcast value.
  • III. The third subset (S3) contains five (5) records, based on the sunny value.
  • Proportionality measure for S1: 5/14
  • Proportionality measure for S2: 4/14
  • Proportionality measure for S3: 5/14

  20. Step (2), continued The following steps are required to calculate the gain value of each attribute in the original dataset (S):
  • Calculate the entropy of each subset (S1, S2, S3)
  • Calculate the attribute entropy (outlook)
  • Calculate the gain of the attribute (outlook)

  21. Calculate the Entropy of Each Subset (S1, S2, S3)
  • The first subset S1 has three (3) 'yes' records and two (2) 'no' records, five (5) records in total: Entropy(3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
  • The second subset S2 has four (4) 'yes' records and no 'no' records, four (4) records in total: Entropy(4, 0) = -4/4 log2 4/4 = 0
  • The third subset S3 has three (3) 'no' records and two (2) 'yes' records, five (5) records in total: Entropy(3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971

  22. Calculate the Entropy of the Outlook Attribute The following formula is used for the calculation: Entropy(S1, S2, S3) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2) + |S3|/|S| * Entropy(S3). Entropy(5, 4, 5), i.e. Entropy(outlook) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694

  23. Calculate the Gain Value of the Outlook Attribute The following formula is used for the calculation: Gain(S, A) = Entropy(S) - Entropy(A), so Gain(S, outlook) = 0.9403 - 0.694 = 0.2463. The above three steps are repeated for the other three remaining input attributes. The following tables contain the gain of the attributes for the original set, the rain subset, and the sunny subset.
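The whole calculation for outlook can be reproduced in a few lines of Python (a verification sketch with my own function names, using only the class counts quoted on the slides):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

S = (5, 9)                          # 5 'no' and 9 'yes' records overall
subsets = [(2, 3), (0, 4), (2, 3)]  # class counts for rain, overcast, sunny
weighted = sum(sum(s) / sum(S) * entropy(s) for s in subsets)
print(round(entropy(S), 4))              # 0.9403
print(round(weighted, 3))                # 0.694
print(round(entropy(S) - weighted, 4))   # 0.2467 (the slide's 0.2463 comes from
                                         # subtracting the rounded values 0.9403 - 0.694)
```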

  24. Attributes Along with Gain Information (Original set, rain and sunny subsets)

  25. Step (3): Select the Maximum-Gain Attribute for the Classification of Dataset (S) In the above table, the attribute with the maximum gain value is outlook; its gain of 0.2463 is the highest value. After this, the dataset can be split into three different subsets on the basis of the outlook attribute values: rain, overcast, and sunny. Repeating the process on each subset completes the classification, which generates the decision tree and the classification rules.

  26. Decision Tree

  27. Decision Tree developed in my work

  28. Objectives 1. To integrate the decision tree with a data warehouse or database. 2. To reduce the time taken to construct the decision tree at the root node. The computational process for constructing the tree is highly complex and recursive in nature: it includes repeatedly calculating various values, i.e. the entropy of the dataset and the entropy and gain values of each input attribute. Here, I have pre-computed the results required at least for the selection and classification of the root node. There is no change in the intelligence of the approach; the required values are simply stored instead of being calculated at run time in memory. This integration gives a jump start to the construction of the classification model, enhancing the overall efficiency of the model.
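The idea can be sketched as follows (an illustration with my own names, assuming a lookup table of pre-computed class counts; in the original work these values live in a dependent table alongside the materialized view):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain_from_counts(total_counts, per_value_counts):
    """Gain computed purely from stored class counts, without scanning the raw records."""
    total = sum(total_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in per_value_counts.values())
    return entropy(total_counts) - weighted

# Pre-computed statistics, e.g. refreshed whenever new data is loaded into the
# materialized view (the counts here are the ones from the example slides).
precomputed = {
    "class_counts": (5, 9),
    "outlook": {"rain": (2, 3), "overcast": (0, 4), "sunny": (2, 3)},
    # ... counts for the remaining input attributes would be stored the same way
}

# Selecting the root attribute becomes a lookup-and-compare, not a dataset scan.
gains = {a: gain_from_counts(precomputed["class_counts"], v)
         for a, v in precomputed.items() if a != "class_counts"}
root = max(gains, key=gains.get)
print(root, round(gains[root], 4))   # -> outlook, ~0.2467
```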

  29. Pre-Calculated Structure

  30. Conclusion Classification algorithms are memory resident and calculate various statistical values at run time. Storing these statistical values, even just for the selection of the best attribute at the root node, greatly increases the performance of the classification algorithm. A materialized view holds the input training dataset, while the statistical values are stored in a dependent table. This table is updated according to the policy chosen; modern data warehouses offer many methods to refresh a materialized view. Each time a new target class is introduced or new data is loaded, the table containing the statistical values is updated accordingly. The accuracy of the algorithm is not affected in either a positive or a negative direction; the significant improvement is in efficiency, in the selection of the root-level attribute.
