DATA MINING This PPT is Dedicated to my inner controller AMMA BHAGAVAN– ONENESS Founders. Developed by, S.V.G.REDDY, Associate professor, Dept.of CSE, GIT, GITAM UNIVERSITY.
What Motivated Data Mining? A huge amount of raw data is available. The motivation for data mining is to • analyse, • classify, • cluster, and • characterize that data, among other tasks.
What Is Data Mining? • The Databases are PreProcessed i.e. Cleaned and Integrated and the Data Warehouse is formed. • The Data Warehouse is Selected and Transformed as per the User Requirement and it is submitted to the Data Mining Engine. • The Data Mining Engine will run for ‘n’ iterations/ tuples. • As a Result, We will get some Patterns as Output. • Then the Patterns are Evaluated and finally we will get an Output which is Knowledge.
Data Mining—On What Kind of Data? RDBMS - A relational database is a collection of tables, each of which is assigned a unique name. • Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. • A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. • An ER data model represents the database as a set of entities and their relationships.
Data Mining—On What Kind of Data? DataWareHouse - A Data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
Data Mining—On What Kind of Data? Transactional Database - In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).
Data Mining—On What Kind of Data? Object-Relational Databases - Conceptually, the object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an object. Data and code relating to an object are encapsulated into a single unit. Each object has associated with it the following. • A set of variables that describe the object. These correspond to attributes in the entity-relationship and relational models. • A set of messages that the object can use to communicate with other objects, or with the rest of the database system. • A set of methods, where each method holds the code to implement a message. Upon receiving a message, the method returns a value in response. For instance, the method for the message get_photo(employee) will retrieve and return a photo of the given employee object.
Data Mining—On What Kind of Data? • A Temporal database typically stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics. • A Sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences. • A Time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).
Data Mining—On What Kind of Data? • Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases. • Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
Data Mining—On What Kind of Data? • Multimedia databases store image, audio, and video data. They are used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces that recognize spoken commands. • A Heterogeneous database consists of a set of interconnected, autonomous component databases. The components communicate in order to exchange information and answer queries. • A Legacy database is formed as a result of a long history of IT development. It is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.
Data Mining—On What Kind of Data? • Data Streams - Many applications involve the generation and analysis of a new kind of data, called stream data, where data flow in and out of an observation platform (or window) dynamically. • World Wide Web - The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access.
Data mining Functionalities • Characterization and Discrimination • Mining Frequent Patterns, Associations, and Correlations • Classification and Prediction • Cluster Analysis • Outlier Analysis • Evolution Analysis
Data mining Functionalities Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by executing an SQL query. Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries. For example, the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.
Data mining Functionalities Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. Frequent patterns lead to association rules such as:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
age(X, “20...29”) ∧ income(X, “20K...29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
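The support and confidence figures attached to an association rule can be computed directly from a transactional data set. The sketch below is only an illustration: the function names and the four sample transactions are assumptions made for this example and do not come from the slides.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    sup_a = support(transactions, a)
    if sup_a == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, both) / sup_a


# Hypothetical transactions, invented for this illustration.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer"}),
    frozenset({"milk", "bread"}),
    frozenset({"computer", "software", "printer"}),
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

Here the rule computer ⇒ software holds in 2 of the 4 transactions (support 50%), and in 2 of the 3 transactions that contain a computer (confidence about 67%).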
Data mining Functionalities Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. “How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Prediction, in contrast, models continuous-valued functions: it estimates missing or unavailable numerical values rather than class labels, for example the amount of revenue that each item will generate during an upcoming sale, or a future stock price.
Cluster Analysis “What is cluster analysis?” Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Outlier Analysis A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
Evolution Analysis Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
Are All of the Patterns Interesting? A data mining system has the potential to generate thousands or even millions of patterns, or rules. “So,” you may ask, “are all of the patterns interesting?” Typically not: only a small fraction of the patterns potentially generated would actually be of interest to any given user. Some guidelines are • Rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value. • Patterns are interesting if they are unexpected (contradicting a user’s belief). • Patterns that are expected can be interesting if they confirm a hypothesis that the user wished to validate, or resemble a user’s hunch.
Classification of Data Mining Systems Data mining systems are classified according to • the kinds of databases mined - relational, transactional, object-relational, or data warehouse mining systems, etc. • the kinds of knowledge mined - characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. • the kinds of techniques used - autonomous systems, interactive exploratory systems, query-driven systems. • the applications adapted - finance, telecommunications, DNA, stock markets, etc.
Data Mining Task Primitives • The set of task-relevant data to be mined • The kind of knowledge to be mined • The background knowledge to be used in the discovery process • The interestingness measures and thresholds for pattern evaluation • The expected representation for visualizing the discovered patterns
Major Issues in Data Mining
• Mining methodology and user interaction issues: mining different kinds of knowledge in databases; interactive mining of knowledge at multiple levels of abstraction; incorporation of background knowledge; data mining query languages and ad hoc data mining; presentation and visualization of data mining results; handling noisy or incomplete data; pattern evaluation (the interestingness problem).
• Performance issues: efficiency and scalability of data mining algorithms; parallel, distributed, and incremental mining algorithms.
• Issues relating to the diversity of database types: handling of relational and complex types of data; mining information from heterogeneous databases and global information systems.
Data Preprocessing Why Preprocess the Data? Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s data with respect to the sales at your branch. You carefully inspect the company’s database and data warehouse, identifying and selecting the attributes or dimensions to be included in your analysis, such as item, price, and units sold. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
Data cleaning Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values - • Ignore the tuple • Fill in the missing value manually • Use a global constant to fill in the missing value • Use the attribute mean to fill in the missing value • Use the attribute mean for all samples belonging to the same class as the given tuple • Use the most probable value to fill in the missing value.
Noisy Data - Noise is a random error or variance in a measured variable. Noise is removed in the following three ways • Binning - see the next slide • Regression - data can be smoothed by fitting the data to a function • Clustering - outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”
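Two of the missing-value strategies listed above (the attribute mean, and the attribute mean within the tuple's own class) can be sketched in a few lines. The function names and sample values are assumptions made for this illustration; missing values are represented as None.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def fill_with_mean(values: List[Optional[float]]) -> List[float]:
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(
    rows: List[Tuple[str, Optional[float]]]
) -> List[Tuple[str, Optional[float]]]:
    """Replace each missing value with the mean of its own class.

    A value stays None if its class has no known values at all.
    """
    by_class: Dict[str, List[float]] = {}
    for label, value in rows:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vs) for label, vs in by_class.items()}
    return [
        (label, class_means.get(label) if value is None else value)
        for label, value in rows
    ]


# Hypothetical income values (in thousands), invented for this illustration.
incomes = [30, None, 50, 40]
filled = fill_with_mean(incomes)

labelled = [("low", 10), ("low", None), ("high", 100), ("high", None), ("low", 20)]
filled_by_class = fill_with_class_mean(labelled)
```

The class-based variant fills the missing "low" value with 15 and the missing "high" value with 100, whereas a single global mean would pull both toward the middle.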
Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
Smoothing by bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
Smoothing by bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
Discrepancy Detection - Detecting that data cleaning is required for a particular data set is called discrepancy detection. It can be done by using knowledge about the data, including its metadata.
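The binning example above can be reproduced with a short script. This is a minimal sketch: the function names are chosen for this illustration, and the partitioning assumes the number of values divides evenly into the requested number of bins.

```python
from typing import List, Sequence


def equal_frequency_bins(sorted_values: Sequence[float], n_bins: int) -> List[List[float]]:
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("this sketch assumes the values divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [list(sorted_values[i * size:(i + 1) * size]) for i in range(n_bins)]


def smooth_by_means(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins: List[List[float]]) -> List[List[float]]:
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so the ends are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the price data from the slide, by_means and by_boundaries match the smoothed bins listed above.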
Data Integration It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration • Schema integration and object matching - for example, customer_id in one database and cust_number in another may refer to the same attribute. • Redundancy - an attribute may be redundant if it can be derived from other attributes. Hence we perform data integration by using • Metadata • Normalisation • Correlation analysis using the χ² (chi-square) test
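For two categorical attributes, the χ² statistic compares the observed counts in a contingency table with the counts expected if the attributes were independent: χ² = Σ (observed − expected)² / expected. The sketch below illustrates the calculation; the function name and the 2×2 table of counts are assumptions made for this example.

```python
from typing import List


def chi_square(observed: List[List[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two values of one attribute,
# columns are two values of another.
table = [
    [250, 200],
    [50, 1000],
]
chi2 = chi_square(table)
```

A large value such as the one obtained here (about 507.9 with one degree of freedom) suggests the two attributes are strongly correlated, so one of them may be redundant.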
Data Transformation The data are transformed or consolidated into forms appropriate for mining in the following ways. • Smoothing - can be done by binning, regression, or clustering. • Aggregation - for example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. • Generalization - low-level data are replaced by higher-level concepts, e.g., street can be generalized to city or country. • Normalization - the attribute values are scaled so as to fall within a small specified range, such as 0.0 to 1.0. • Attribute construction - new attributes are constructed from the given set of attributes and added to it.
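The slide does not name a particular normalization method. Two common choices are min-max normalization and z-score normalization, sketched below; the function names and sample values are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max(values: Sequence[float], new_min: float = 0.0, new_max: float = 1.0) -> List[float]:
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Centre values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mu) / sigma for v in values]


# Hypothetical prices, invented for this illustration.
prices = [200, 300, 400, 600, 1000]
scaled = min_max(prices)
standardized = z_score(prices)
```

Min-max keeps the relative spacing of the original values but is sensitive to extreme values; z-score is often preferred when the true minimum and maximum are unknown.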
Data Reduction The data to be mined are generally very large, so the data are reduced in the following ways • Data cube aggregation • Attribute subset selection • Dimensionality reduction • Numerosity reduction • Discretization and concept hierarchy generation
Data cube Aggregation By using a data cube, the sales for all quarters can be aggregated into yearly sales. Hence the large volume of quarterly data is reduced to a much smaller set of yearly totals.
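The aggregation from quarterly to yearly sales can be sketched as a simple group-and-sum. The function name and the sales figures below are hypothetical, chosen only to illustrate the reduction from eight values to two.

```python
from collections import defaultdict
from typing import Dict, Tuple


def aggregate_to_year(quarterly: Dict[Tuple[int, str], int]) -> Dict[int, int]:
    """Sum the sales of all quarters belonging to the same year."""
    yearly: Dict[int, int] = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        yearly[year] += amount
    return dict(yearly)


# Hypothetical quarterly sales in dollars, keyed by (year, quarter).
quarterly_sales = {
    (2008, "Q1"): 224_000, (2008, "Q2"): 408_000,
    (2008, "Q3"): 350_000, (2008, "Q4"): 586_000,
    (2009, "Q1"): 300_000, (2009, "Q2"): 416_000,
    (2009, "Q3"): 380_000, (2009, "Q4"): 600_000,
}
yearly_sales = aggregate_to_year(quarterly_sales)
```

The eight quarterly cells collapse to two yearly totals, which is enough for any analysis that only needs annual figures.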
Attribute Subset Selection Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music taste. Attribute subset selection is done in the following ways • Stepwise forward selection • Stepwise backward elimination • Combination of forward selection and backward elimination • Decision tree induction
Dimensionality Reduction
Wavelet Transforms - The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes. “How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data?” The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.
Principal Components Analysis - Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
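The truncation idea can be illustrated with the Haar wavelet, one of the simplest wavelet families. The slide does not say which wavelet is used, so the choice of Haar, the function names, and the sample vector are all assumptions of this sketch. It requires the vector length to be a power of two.

```python
import math
from typing import List, Sequence


def haar_dwt(x: Sequence[float]) -> List[float]:
    """Full orthonormal Haar transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in x]
    root2 = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Pairwise sums give the smoothed signal, pairwise differences the detail.
        approx = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = approx + detail
        length = half
    return out


def haar_idwt(coeffs: Sequence[float]) -> List[float]:
    """Inverse of haar_dwt."""
    out = [float(c) for c in coeffs]
    root2 = math.sqrt(2.0)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / root2)
            merged.append((a - d) / root2)
        out[:2 * length] = merged
        length *= 2
    return out


def keep_strongest(coeffs: Sequence[float], k: int) -> List[float]:
    """Zero all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


data = [2, 2, 0, 2, 3, 5, 4, 4]          # hypothetical data vector
coefficients = haar_dwt(data)            # same length as the input
compressed = keep_strongest(coefficients, 3)
approximation = haar_idwt(compressed)    # lossy reconstruction from 3 coefficients
```

Only the non-zero entries of compressed (and their positions) need to be stored, and haar_idwt turns them back into an approximation of the original vector.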
Numerosity Reduction “Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?” This can be done by • Regression and Log-Linear Models - in linear regression, the data are modeled to fit a straight line, so only the model parameters need to be stored. • Histograms - a histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. • Clustering - clustering techniques partition the objects into groups or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. • Sampling - it allows a large data set to be represented by a much smaller random sample (or subset) of the data. • Data Discretization and Concept Hierarchy Generation - see the next slide.
Discretization and Concept Hierarchy Generation for Numerical Data Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. This is done in the following ways • Binning - the values are distributed into bins covering chosen intervals. • Histogram Analysis - the histogram partitions the values into buckets. • Entropy-Based Discretization - the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization. • Interval Merging by χ² Analysis - ChiMerge is a χ²-based method. In contrast to top-down splitting methods, it employs a bottom-up approach by finding the best neighboring intervals and then merging these to form larger intervals, recursively. • Cluster Analysis - a clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups. • Discretization by Intuitive Partitioning - for example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34], obtained by, say, some sophisticated clustering analysis.
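One step of entropy-based discretization can be sketched as follows: every candidate split-point is scored by the weighted entropy of the class labels on either side, and the candidate with the lowest score wins. Using midpoints between adjacent distinct values as candidates is an assumption of this sketch, as are the function names and the sample data.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(samples: List[Tuple[float, str]]) -> Optional[Tuple[float, float]]:
    """Return (split_point, weighted_entropy) minimizing the weighted entropy.

    Returns None when there are fewer than two distinct attribute values.
    """
    values = sorted({v for v, _ in samples})
    best: Optional[Tuple[float, float]] = None
    for lo, hi in zip(values, values[1:]):
        split = (lo + hi) / 2  # candidate: midpoint between neighbours
        left = [label for v, label in samples if v <= split]
        right = [label for v, label in samples if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if best is None or info < best[1]:
            best = (split, info)
    return best


# Hypothetical (age, buys_product) samples, invented for this illustration.
samples = [(21, "no"), (25, "no"), (30, "no"), (45, "yes"), (52, "yes"), (60, "yes")]
split_point, split_entropy = best_split(samples)
```

Applying best_split again to the samples on each side of the chosen split-point yields the recursive, hierarchical discretization described above.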
Concept Hierarchy Generation for Categorical Data Here the concept hierarchy is generated in the following manner: if there are attributes such as street, city, state, and country, then the concept hierarchy is generated as street < city < state < country.
Data Warehouse Traditional (operational) databases use OLTP (on-line transaction processing), whereas a data warehouse uses OLAP (on-line analytical processing).
Multi dimensional data model (MDDB) The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look at each of these schema types.
Multi dimensional data model (MDDB) Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
Multi dimensional data model (MDDB) Snowflake schema: The snowflake schema is a variant of the star schema in which some dimension tables are normalized, thereby splitting the data into additional tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city.
Multi dimensional data model (MDDB) Fact constellation. A fact constellation schema is shown in Figure 3.6. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 3.4). The shipping table has five dimensions, or keys: item key, time key, shipper key, from location, and to location, and two measures: dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimensions tables for time, item, and location are shared between both the sales and shipping fact tables.
OLAP Operations in the MDDB
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. For example, a hierarchy for location can be defined as the total order “street < city < province or state < country.”
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. For example, a drill-down operation can be performed on the central cube by stepping down a concept hierarchy for time defined as “day < month < quarter < year.”
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The dice operation defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
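Roll-up, slice, and dice can be sketched on a tiny cube stored as a dictionary keyed by (city, quarter, item). The cube contents, the city-to-country mapping, and the function names are assumptions made for this illustration; a real OLAP server would implement these operations very differently.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Cell = Tuple[str, str, str]  # (location, quarter, item)

# Hypothetical units sold per (city, quarter, item).
cube: Dict[Cell, int] = {
    ("Vancouver", "Q1", "computer"): 100,
    ("Toronto", "Q1", "computer"): 150,
    ("Chicago", "Q1", "computer"): 200,
    ("Vancouver", "Q2", "computer"): 120,
    ("Toronto", "Q2", "phone"): 80,
    ("Chicago", "Q2", "phone"): 60,
}

# One level of the location concept hierarchy: city < country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def roll_up_location(cells: Dict[Cell, int], hierarchy: Dict[str, str]) -> Dict[Cell, int]:
    """Climb the location hierarchy from city to country, summing the measure."""
    rolled: Dict[Cell, int] = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        rolled[(hierarchy[city], quarter, item)] += units
    return dict(rolled)


def slice_quarter(cells: Dict[Cell, int], quarter: str) -> Dict[Tuple[str, str], int]:
    """Select one value on the time dimension; the result has one dimension fewer."""
    return {(loc, item): u for (loc, q, item), u in cells.items() if q == quarter}


def dice(cells: Dict[Cell, int], locations: Set[str], quarters: Set[str]) -> Dict[Cell, int]:
    """Select on two dimensions at once, keeping a smaller subcube."""
    return {k: u for k, u in cells.items() if k[0] in locations and k[1] in quarters}


by_country = roll_up_location(cube, CITY_TO_COUNTRY)
q1_slice = slice_quarter(cube, "Q1")
canada_q1 = dice(cube, {"Vancouver", "Toronto"}, {"Q1"})
```

Drill-down is the inverse of the roll-up shown here: it would require the finer-grained city-level cells to still be available.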
A Three-Tier DataWarehouse Architecture 1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. 2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations. 3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
A Three-Tier DataWarehouse Architecture Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
Types of OLAP Servers
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases, some database system vendors implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
CONCEPT DESCRIPTION/CHARACTERIZATION Data generalization summarizes data by replacing relatively low-level values (such as numeric values for an attribute age) with higher-level concepts (such as young, middle aged, and senior). Given the large amount of data stored in databases, it is useful to be able to describe concepts in concise and succinct terms at generalized (rather than low) levels of abstraction. Attribute-Oriented Induction for Data Characterization Before attribute-oriented induction - 1) First, data focusing should be performed before attribute-oriented induction. This step corresponds to the specification of the task-relevant data (i.e., data for analysis). The data are collected based on the information provided in the data mining query. 2) Specifying the set of relevant attributes. For example, suppose that the dimension birth place is defined by the attributes city, province or state, and country. Of these attributes, let’s say that the user has only thought to specify city. In order to allow generalization on the birth place dimension, the other attributes defining this dimension should also be included.
CONCEPT DESCRIPTION/CHARACTERIZATION 3) A correlation-based (Section 2.4.1) or entropy-based (Section 2.6.1) analysis method can be used to perform attribute relevance analysis and filter out statistically irrelevant or weakly relevant attributes from the descriptive mining process. Attribute-oriented induction takes place in two phases 1) Attribute Removal - If there is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or (2) its higher-level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation. 2) Attribute Generalization - If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.
CONCEPT DESCRIPTION/CHARACTERIZATION Attribute generalization can be controlled in 2 ways 1) Attribute generalization threshold control - sets one threshold for each attribute. If the number of distinct values in an attribute is greater than the attribute threshold, further attribute removal or attribute generalization should be performed. 2) Generalized relation threshold control - sets a threshold for the generalized relation. If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed.
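Attribute removal, attribute generalization, and attribute generalization threshold control can be combined in a short sketch. It uses a single threshold for all attributes rather than one per attribute, and it does not implement generalized relation threshold control. The function name, the sample relation, and the concept hierarchies are assumptions made for this illustration.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Hierarchies = Dict[str, List[Callable[[Any], Any]]]


def attribute_oriented_induction(
    rows: List[Row], hierarchies: Hierarchies, threshold: int
) -> Tuple[List[str], Counter]:
    """Generalize or remove attributes until each has at most `threshold` distinct values.

    `hierarchies` maps an attribute to its generalization operators, lowest level first.
    Returns the surviving attributes and the generalized tuples with their counts.
    """
    working = [dict(r) for r in rows]  # do not modify the caller's data
    attributes = list(working[0].keys())
    for attr in attributes:
        operators = list(hierarchies.get(attr, []))
        while len({r[attr] for r in working}) > threshold:
            if not operators:
                # Attribute removal: no (further) generalization operator exists.
                for r in working:
                    del r[attr]
                break
            generalize = operators.pop(0)  # attribute generalization: climb one level
            for r in working:
                r[attr] = generalize(r[attr])
    kept = [a for a in attributes if a in working[0]]
    # Identical generalized tuples are merged and their counts accumulated.
    counts = Counter(tuple(r[a] for a in kept) for r in working)
    return kept, counts


# Hypothetical initial working relation.
students = [
    {"name": "Jim", "age": 21, "birth_place": "Vancouver"},
    {"name": "Scott", "age": 24, "birth_place": "Montreal"},
    {"name": "Laura", "age": 27, "birth_place": "Seattle"},
    {"name": "Ann", "age": 23, "birth_place": "Toronto"},
]
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Montreal": "Canada",
                   "Toronto": "Canada", "Seattle": "USA"}
hierarchies: Hierarchies = {
    "age": [lambda a: f"{a // 5 * 5}-{a // 5 * 5 + 4}"],  # 21 -> "20-24"
    "birth_place": [lambda city: CITY_TO_COUNTRY[city]],  # city -> country
}

kept_attributes, generalized = attribute_oriented_induction(students, hierarchies, threshold=2)
```

In this run, name is removed because it has many distinct values and no hierarchy, while age and birth_place are each generalized one level, leaving two generalized tuples.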