
MC7403-DATA WAREHOUSING AND DATA MINING PREPARED BY M.SABARI RAMACHANDRAN AP/MCA



Presentation Transcript


  1. MC7403-DATA WAREHOUSING AND DATA MINING PREPARED BY M.SABARI RAMACHANDRAN AP/MCA

  2. UNIT-I

  3. DATA WAREHOUSE DATA WAREHOUSING INTRODUCTION TO THE DATA WAREHOUSE • A data warehouse is a repository of information gathered from multiple sources and stored under a unified schema • A data warehouse, also known as an enterprise data warehouse, is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources.

  4. OPERATIONAL DBMS VS DW • OLTP: Online transaction processing (OLTP) is a class of software programs capable of supporting transaction-oriented applications on the Internet. Typically, OLTP systems are used for order entry, financial transactions, customer relationship management (CRM) and retail sales. • OLAP: Online analytical processing (OLAP) is an approach to answering multidimensional analytical queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. • Distinct features of OLTP vs OLAP

  5. MULTIDIMENSIONAL DATA MODEL • A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

  6. SCHEMAS FOR MULTIDIMENSIONAL DATABASES • Star schema: The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables. • Snowflake schema: A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
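
A minimal sketch of the star schema described on this slide, using Python's built-in sqlite3 module. The table names (fact_sales, dim_product, dim_date), the columns and the sample rows are all hypothetical, chosen only to illustrate one fact table referencing two dimension tables; they do not come from the presentation.

```python
import sqlite3

# In-memory database; all table names and rows below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The fact table holds the measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories');
INSERT INTO dim_date    VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (2, 1, 10, 150.0), (1, 2, 1, 1000.0);
""")

# A typical analytical query: join the fact table to its dimensions
# and aggregate a measure along two dimension attributes.
rows = conn.execute("""
SELECT p.category, d.quarter, SUM(f.revenue)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d    ON d.date_id = f.date_id
GROUP BY p.category, d.quarter
ORDER BY p.category, d.quarter
""").fetchall()

for category, quarter, revenue in rows:
    print(category, quarter, revenue)
```

In a snowflake variant of the same sketch, dim_product would itself reference a separate category table instead of storing the category name directly.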

  7. OLAP OPERATIONS • Need for OLAP • Types of OLAP server • MOLAP (multidimensional OLAP) • ROLAP (relational OLAP) • HOLAP (hybrid OLAP)

  8. DATA WAREHOUSE ARCHITECTURE • Data warehouse design process • Design of a data warehouse • Design of a data warehouse business analysis process

  9. DATA WAREHOUSE ARCHITECTURE VIEWS • Top-down view • Data source view • Data warehouse view • Business query view

  10. QUERY & APPLICATION TOOLS • Ad hoc query tools: An ad hoc query is a query that cannot be determined prior to the moment it is issued. It is created to get information when the need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. • Reporting tools: BI (Business Intelligence) tools are used by business users to create basic, medium, and complex reports from the transactional data in the data warehouse, and by creating Universes using the Information Design Tool/UDT. Various SAP and non-SAP data sources can be used to create reports.

  11. INDEXING • Star Indexing • Bitmap Index • Fast Projection Index • Low Fast Index • Low/High cardinality Index • Low Index

  12. UNIT-II

  13. DATA MINING & DATA PREPROCESSING DATA MINING • Data mining is the process of discovering knowledge in large databases • Introduction to the KDD process: The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.

  14. KNOWLEDGE DISCOVERY FROM DATABASES • KDD is the process of mining, or extracting, knowledge from databases

  15. NEED FOR DATA PREPROCESSING • Data cleaning • Data reduction • Data transformation • Data integration

  16. DATA CLEANING • Handling missing values and removing noisy data from the original data DATA REDUCTION • Reducing the size of the data so that complex problems can be handled

  17. DATA DISCRETIZATION & CONCEPT HIERARCHY GENERATION • Types of discretization • Supervised discretization • Unsupervised discretization

  18. DATA INTEGRATION AND TRANSFORMATION • DATA INTEGRATION: Combining data from multiple sources into a coherent store, which helps avoid redundant (repeated) values • DATA TRANSFORMATION: Transforming the data from one form into another form suitable for mining

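  19. CONCEPT HIERARCHY GENERATION • A concept hierarchy is a level-by-level representation of a data concept, from low-level (specific) values up to high-level (general) concepts; building it is called concept hierarchy generation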

  20. UNIT-III

  21. ASSOCIATION RULE MINING • ASSOCIATION RULE MINING: Association rules are required to satisfy a minimum support and a minimum confidence threshold at the same time • Finding frequent itemsets using the minimum support threshold (min_sup) • Framing the rules • Finding strong association rules
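
A short sketch of the two measures named on this slide, assuming the standard definitions: support is the fraction of transactions that contain an itemset, and the confidence of A => B is support(A and B together) divided by support(A). The toy transactions, the threshold values and the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is strong when it meets both thresholds at the same time."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, transactions))       # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread transactions
```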

  22. DATA MINING FUNCTIONALITIES • Classification & prediction • Cluster analysis • Outlier analysis • Evolution analysis

  23. Association Rule Mining • Mining frequent itemsets with candidate generation. Example: Apriori algorithm • Mining frequent itemsets without candidate generation. Example: FP-growth • Mining using the vertical data format

  24. APRIORI ALGORITHM The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. • Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data • Improving the efficiency of the Apriori algorithm
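
The bottom-up procedure on this slide can be sketched as follows. This is a compact, unoptimized illustration under stated assumptions, not a tuned implementation: the function name `apriori`, the use of an absolute support count as the threshold, and the sample transactions are all choices made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Test the candidates against the data.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: extend frequent (k-1)-itemsets to size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactions over the items a, b, c.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(data, min_support_count=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum count of 3, the three single items and the three pairs are frequent, while {a, b, c} is generated as a candidate but rejected because it occurs in only two transactions.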

  25. FP-GROWTH • Construct the list L (frequent items in descending order of support count) • Construct the conditional pattern base table

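  26. VERTICAL DATA FORMAT • In this format, each item is stored with the set of IDs of the transactions that contain it, and frequent itemsets are mined by scanning and intersecting these transaction ID sets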

  27. VARIOUS KINDS OF ASSOCIATION RULES • Single-level association rules • Single-dimensional association rules • Multilevel association rules • Multidimensional association rules

  28. CONSTRAINT-BASED ASSOCIATION RULES • Introduction to rule mining CONSTRAINT-BASED ASSOCIATION RULES: ... Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining. Constraint-based mining provides user flexibility: the user supplies constraints on what is to be mined

  29. UNIT-IV

  30. CLASSIFICATION & PREDICTION • Data preprocessing for classification & prediction • Data cleaning • Relevance analysis • Data transformation & reduction

  31. CLASSIFICATION BY DECISION TREE INTRODUCTION • Decision tree algorithm • Splitting attributes • Attribute selection measures • Information Gain • Gain Ratio • Gini Index
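
A minimal sketch of the three attribute selection measures listed on this slide, assuming their usual textbook definitions for a categorical attribute: information gain (reduction in entropy), gain ratio (information gain divided by the split information) and the Gini index of a split (weighted Gini impurity of the partitions). The function names and the four-row sample are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the value of the splitting attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_index(values, labels):
    """Weighted Gini impurity after splitting on the attribute (lower is better)."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))


# Hypothetical data: the attribute "student" perfectly predicts the class.
student = ["yes", "yes", "no", "no"]
buys = ["yes", "yes", "no", "no"]
print(information_gain(student, buys), gain_ratio(student, buys), gini_index(student, buys))
```

A decision tree learner would compute one of these measures for every candidate attribute and split on the best one.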

  32. BAYESIAN CLASSIFICATION • Bayes' theorem • Naive Bayesian classification • Bayesian networks

  33. Bayes Theorem • P(h) = prior probability of hypothesis h • P(D) = prior probability of training data D • P(h|D) = probability of h given D • P(D|h) = probability of D given h
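
The slide defines the four quantities, but the formula that relates them is not present in this transcript. Using the slide's own symbols, Bayes' theorem in its standard form is:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```

Here P(h|D) is the posterior probability of the hypothesis after the training data D has been observed, and P(D|h) is the likelihood of the data under h.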

  34. RULE-BASED CLASSIFICATION • IF-THEN rules: A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form: • IF condition THEN conclusion. Let us consider a rule R1, R1: IF age = youth AND student = yes THEN buy_computer = yes. Points to remember: The IF part of the rule is called the rule antecedent or precondition. The THEN part of the rule is called the rule consequent. The antecedent (the condition) consists of one or more attribute tests, and these tests are logically ANDed. • Coverage • Accuracy
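
A small sketch of how the coverage and accuracy of rule R1 could be computed, assuming the common definitions: coverage is the fraction of all tuples that satisfy the antecedent, and accuracy is the fraction of covered tuples whose class matches the consequent. The four sample tuples and the helper names are hypothetical; only the rule itself comes from the slide.

```python
def r1_antecedent(t):
    """Antecedent of R1: age = youth AND student = yes."""
    return t["age"] == "youth" and t["student"] == "yes"


def coverage_and_accuracy(data, antecedent, class_attr, class_value):
    """Return (coverage, accuracy) of a rule on a list of tuples (dicts)."""
    covered = [t for t in data if antecedent(t)]
    correct = sum(t[class_attr] == class_value for t in covered)
    coverage = len(covered) / len(data)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical training tuples.
data = [
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "yes", "buy_computer": "no"},
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "no", "buy_computer": "yes"},
]

cov, acc = coverage_and_accuracy(data, r1_antecedent, "buy_computer", "yes")
print(cov, acc)  # R1 covers 2 of the 4 tuples and is right on 1 of those 2
```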

  35. CLASSIFICATION BY BACKPROPAGATION • Multilayer feed-forward neural network • Defining a network topology

  36. SUPPORT VECTOR MACHINES • When the data are linearly separable • When the data are linearly inseparable

  37. ASSOCIATIVE CLASSIFICATION • CBA • CMAR • CPAR

  38. LAZY LEARNERS • Eager learners • Lazy learners OTHER CLASSIFICATION METHODS • Fuzzy set approach • Genetic algorithm • Rough set approach

  39. PREDICTION • Introduction: Prediction in data mining estimates an unknown value of one attribute from the known values of other, related attributes. It does not necessarily concern future events; what matters is that the predicted values are unknown. Predicting a continuous value is known as numeric prediction, and regression analysis is generally used for it.

  40. Accuracy and error measures • Evaluating the Accuracy of a classifier

  41. Ensemble methods • Ensemble methods have been called the most influential development in data mining and machine learning in the past decade. They combine multiple models into one that is usually more accurate than the best of its components. Ensembles can provide a critical boost in industrial challenges (from investment timing to drug discovery, and from fraud detection to recommendation systems) where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms. • Model selection: Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well suited to the problem of model selection.

  42. UNIT-V

  43. CLUSTERING CLUSTER ANALYSIS • Cluster analysis is the process of grouping a set of similar objects into clusters • Types of data in cluster analysis • Similarity matrix • Dissimilarity matrix

  44. MAJOR CLUSTERING METHODS • Partitioning methods • Hierarchical methods • Density based methods • Grid based methods • Model based methods

  45. PARTITIONING METHODS • K-means algorithm

  46. K-Medoids Algorithm

  47. HIERARCHICAL METHODS • In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: • Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. • Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
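
The agglomerative ("bottom-up") strategy can be sketched in a few lines. This illustration assumes single linkage (the distance between two clusters is the distance between their closest pair of points) and Euclidean distance, and it stops at a requested number of clusters; these choices, the function name and the sample points are assumptions made for the example. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain."""
    clusters = [[p] for p in points]  # each observation starts in its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Hypothetical 2-D points forming two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, num_clusters=3)
print(clusters)
```

A divisive method would run in the opposite direction, starting from one cluster that holds all points and splitting it recursively.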

  48. AGNES • Agglomerative nesting • Bottom-up merging DIANA • Divisive analysis • Top-down splitting

  49. DENSITY-BASED METHODS • Introduction: Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also among the most cited in the scientific literature • DBSCAN • OPTICS • DENCLUE
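
A compact sketch of the DBSCAN idea described on this slide: points with at least `min_pts` neighbors within radius `eps` are core points that grow a cluster, and points reachable from no core point are marked as noise. The parameter names, the convention that a point counts as its own neighbor, the brute-force neighbor search and the sample points are assumptions of this illustration. It needs Python 3.8 or later for `math.dist`.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force region query; the point itself is included.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep growing
                queue.extend(j_nbrs)
    return labels


# Hypothetical points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike the partitioning methods, this approach does not take the number of clusters as input; the number found depends on `eps` and `min_pts`.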

  50. GRID-BASED METHODS • STING • WaveCluster • CLIQUE
