
Data Warehousing and Mining




  1. Data Warehousing and Mining

  2. Data Warehousing • Data warehouses • Decision support • The multidimensional model • OLAP queries

  3. Data Warehousing

  4. Introduction • Traditional DBMSs are used by organizations for maintaining data to record day-to-day operations • On-line Transaction Processing (OLTP) • Such DBs have driven the growth of the DBMS industry, and will continue to be important • Organizations are increasingly analyzing current and historical data to identify useful patterns • Market research to target products to particular market segments (including political polling) • Decision support for high-level decision-making • On-line Analytic Processing (OLAP)

  5. Data Warehouses • High level decision making requires an overall view of all aspects of an organization • This can be prohibitively expensive with a large, global, distributed database • Many organizations have created large consolidated data warehouses, often gigabytes or terabytes in size • Information from several databases is consolidated into a data warehouse • By copying tables from many sources into one location • The focus is on identifying patterns and global analysis, so having data current is not a priority

  6. Architecture • [Diagram: external data sources and operational databases feed the data warehouse through extract, clean, transform, load, and refresh steps; a metadata repository is maintained alongside the warehouse]

  7. Creating a Warehouse • Extract data from operational databases and other sources • Clean data to minimize errors, and fill in missing information where possible • Transform data to reconcile semantic mismatches • Typically done by defining a view over the data sources • Load data by materializing, and storing, the views created in the transformation stage • Sort the data and generate summary information • Partition data and build indexes to increase efficiency
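
The transformation step is typically done by defining a view over the sources. A minimal sketch, assuming two hypothetical source tables, UsSales (amounts in dollars) and EuSales (amounts in euros), and an assumed fixed conversion rate:

    -- Reconcile a currency mismatch while unifying the sources
    CREATE VIEW AllSales (product_id, loc_id, time_id, sales) AS
    SELECT product_id, loc_id, time_id, sales           -- already in dollars
    FROM UsSales
    UNION ALL
    SELECT product_id, loc_id, time_id, sales * 1.10    -- assumed EUR-to-USD rate
    FROM EuSales;

The load step then materializes and stores this view in the warehouse.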

  8. Warehousing Issues • Semantic integration • Eliminate mismatches between data from many sources • e.g. different schemas, currencies • Heterogeneous sources • Data has to be accessed from a variety of source formats • Load, refresh, purge • Data must be loaded and periodically refreshed • Data that is too old should be purged • Metadata management • The source, loading date, and other information should be maintained for all data in the warehouse

  9. Decision Support • Traditional SQL queries are inadequate for typical decision support queries • WHERE clauses often contain many AND and OR conditions, and SQL handles OR conditions poorly • Statistical functions are not supported by SQL-92, so queries have to be embedded in host language programs • Many queries involve conditions over time, or aggregation of data over time • Users often need to submit several related queries, but SQL does not support optimization of such families of queries

  10. Analysis tools • Some systems support querying where the data is considered to be a multidimensional array • Typical queries involve group-by and aggregation operators, complex conditions, and statistical functions • OLAP applications • Some DBMSs are designed to support OLAP queries as well as traditional SQL queries • Relational DBMSs optimized for decision support • A third class of analysis tools supports exploratory data analysis • Identifying interesting patterns in the data: data mining

  11. Multidimensional Data Model • OLAP applications use ad hoc, complex queries involving group-by and aggregation operators • Typically OLAP queries are considered in terms of a multidimensional data model • The data can be represented as a multidimensional array • The focus is on a collection of numeric measures • Each measure depends on a set of dimensions • Each measure has one value for each combination of values of the dimensions

  12. A Multidimensional Dataset • [Diagram: the sales measure as a cube with dimensions product_id, location_id, and time_id; the slice where location_id = 1 is highlighted]

  13. MOLAP and ROLAP • Multidimensional data can be stored physically in a multidimensional array, MOLAP • The array must be disk-resident and persistent • Or, data can be stored in relations, ROLAP • The relation which relates the dimensions to a particular measure is known as the fact table • Each dimension (location, product, etc.) can have additional attributes stored in a dimension table • Fact tables are usually much larger than dimension tables

  14. Dimension Hierarchies • [Diagram: location hierarchy country → province → city; time hierarchy year → quarter → month/week → date] • Dimensions can be considered as hierarchies • Attributes have positions in the hierarchies • In many OLAP applications it is necessary to store information about time • Such data often cannot be represented by an SQL date or timestamp • e.g. date, week, month, quarter, year, holiday status, …

  15. Multidimensional DB Design • Star schemas are a common pattern in OLAP DBs • So called because the dimension tables form a star pattern around the fact table • The majority of the data is in the fact table • The fact table should have no redundancy and be in BCNF • To reduce size, system-generated dimension identifiers are used as the primary keys of dimensions • e.g. a product ID instead of a product name • Dimension tables are often not normalized • As they are static, update, insert, and delete anomalies are not an issue
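
A minimal star schema sketch for the running sales example; the table and column names follow the queries on later slides, and the remaining attributes are illustrative assumptions:

    CREATE TABLE Product  (product_id INTEGER PRIMARY KEY,
                           pName      VARCHAR(40));
    CREATE TABLE Location (loc_id     INTEGER PRIMARY KEY,
                           city       VARCHAR(40),
                           province   VARCHAR(40),
                           country    VARCHAR(40));
    CREATE TABLE Times    (time_id    INTEGER PRIMARY KEY,
                           cal_date   DATE,   -- "date" is reserved in some dialects
                           week       INTEGER,
                           month      INTEGER,
                           quarter    INTEGER,
                           year       INTEGER);
    -- Fact table: in BCNF, one measure value per combination of dimension keys
    CREATE TABLE Sales    (product_id INTEGER REFERENCES Product,
                           loc_id     INTEGER REFERENCES Location,
                           time_id    INTEGER REFERENCES Times,
                           sales      DECIMAL(12,2),
                           PRIMARY KEY (product_id, loc_id, time_id));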

  16. OLAP Queries • Influenced by SQL and by spreadsheets • A measure is commonly aggregated over one or more dimensions e.g. • Find total sales • Find total sales for each city, or state • Find the top five products ranked by sales • Aggregation may be at different levels of a dimension hierarchy • Roll-up total sales by city to get sales by state, or • Drill-down from total sales by state, to get sales by city • It is possible to drill-down on different dimensions
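
As a sketch, roll-up and drill-down along the location hierarchy differ only in the grouping attributes:

    -- Drill-down: sales by city within each province
    SELECT L.province, L.city, SUM(S.sales)
    FROM Sales S, Location L
    WHERE S.loc_id = L.loc_id
    GROUP BY L.province, L.city;

    -- Roll-up: the same measure aggregated to the province level
    SELECT L.province, SUM(S.sales)
    FROM Sales S, Location L
    WHERE S.loc_id = L.loc_id
    GROUP BY L.province;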

  17. OLAP Queries … • Pivoting sales on location and time produces the total sales for each location and time • The result is a cross-tabulation • Slicing a dataset is an equality selection on a dimension • Dicing a dataset is a range selection • [Figure: cross-tabulation of sales by province and year]

  18. OLAP and SQL • The queries shown below give the cross-tabulation shown previously:
    -- total
    SELECT SUM(S.sales) FROM Sales S
    -- details
    SELECT SUM(S.sales) FROM Sales S, Times T, Location L
    WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
    GROUP BY T.year, L.province
    -- sub-totals
    SELECT SUM(S.sales) FROM Sales S, Times T
    WHERE S.time_id = T.time_id GROUP BY T.year
    SELECT SUM(S.sales) FROM Sales S, Location L
    WHERE S.loc_id = L.loc_id GROUP BY L.province

  19. The CUBE Operator • The cross-tabulation shown previously is a roll-up on the location and time dimensions • With the sub-totals being roll-ups on time, and on location • Each roll-up query corresponds to an SQL GROUP BY query with different grouping criteria • With three dimensions (time, location, and product) how many such queries are there? • If there are k dimensions, there are 2^k possible GROUP BY queries generated by pivoting • These can all be generated by a single SQL:1999 CUBE query

  20. CUBE and ROLLUP • SQL:1999 extended the GROUP BY construct to provide better OLAP support • The GROUP BY clause with the CUBE keyword is equivalent to a collection of GROUP BY statements • With one GROUP BY statement for each subset of the k dimensions • The ROLLUP keyword can also be used, which distinguishes between the GROUP BY attributes • Subsets where the first attribute is null are not included in the result (except where all attributes are null)

  21. GROUP BY CUBE
    SELECT T.year, L.province, SUM(S.sales)
    FROM Sales S, Times T, Location L
    WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
    GROUP BY CUBE (T.year, L.province)
• The groups by province alone are not included in the equivalent rollup query (GROUP BY ROLLUP)
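
The corresponding rollup query is sketched below; ROLLUP produces only the groupings that form a prefix of the listed attributes, which is why the province-only groups are omitted:

    SELECT T.year, L.province, SUM(S.sales)
    FROM Sales S, Times T, Location L
    WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
    GROUP BY ROLLUP (T.year, L.province);
    -- Groupings produced: (year, province), (year), ()
    -- Omitted relative to CUBE: (province)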

  22. GROUP BY Lattice • The eight grouping lists for the set {product, location, time} are shown below • Note that the CUBE operation calculates the specified grouping and its child nodes in the lattice:
    {product, location, time}
    {product, location} {product, time} {location, time}
    {product} {location} {time}
    {}

  23. Trend Analysis • Find the percentage change in the total monthly sales of each product • Find the top five products ranked by total sales • Find the trailing n day moving average of sales • For each day compute the average daily sales over the preceding n days • The first two queries are hard to express in SQL • The third cannot be expressed in SQL if n is a parameter of the query • The SQL:1999 WINDOW clause allows such queries over a table viewed as a sequence

  24. Window Queries
    SELECT L.province, T.month, AVG(S.sales) OVER W AS movavg
    FROM Sales S, Times T, Location L
    WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
    WINDOW W AS (PARTITION BY L.province
                 ORDER BY T.month
                 RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
                           AND INTERVAL '1' MONTH FOLLOWING)
• The FROM and WHERE clauses form a normal SQL query • Call this query Temp • Temp contains the sales rows extended with time and location attributes • The WINDOW clause defines the windows over Temp

  25. Analyzing the Window Query • Temp is partitioned by the PARTITION BY clause • The result has one row for each row in the partition, rather than one row for each partition • The partitions are sorted by the ORDER BY clause • The WINDOW clause groups nearby records for each row • Value based, using RANGE (as in the example) • Based on the number of rows, using the ROWS clause • The aggregate function is computed for each row over its corresponding group, i.e. its window • There are new aggregate functions, RANK, and variants
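
A sketch of the ROWS clause and the RANK function mentioned above, using the inline window syntax; movavg3 is a three-row moving average and sales_rank ranks rows by sales within each province:

    SELECT L.province, T.month, S.sales,
           AVG(S.sales) OVER (PARTITION BY L.province ORDER BY T.month
                              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS movavg3,
           RANK() OVER (PARTITION BY L.province ORDER BY S.sales DESC) AS sales_rank
    FROM Sales S, Times T, Location L
    WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id;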

  26. Top N Queries • In analyzing trends it is common to want to find the 10 (or n) best, or worst, performers • e.g. find the top ten best selling products • Sort the products by sales • Return the answer in descending order • If there are a million products this is very wasteful • Instead guess (!) a sales value, c • All top ten performers sell better than c • Most products sell less than c • Add the selection sales > c

  27. Top N Queries …
    SELECT P.product_id, P.pName, S.sales
    FROM Product P, Sales S
    WHERE P.product_id = S.product_id AND S.loc_id = 1 AND S.time_id = 3
    ORDER BY S.sales DESC
    OPTIMIZE FOR 10 ROWS
• The OPTIMIZE construct is not in SQL:1999 but is supported by some DBMSs • The cutoff value (c) is chosen by the optimizer • Choosing the cutoff can be tricky • The effectiveness of the approach depends on how accurately the cutoff can be estimated
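
OPTIMIZE FOR n ROWS is a vendor extension (notably DB2). The later SQL:2008 standard added a declarative way to request only the top rows, which gives the optimizer the same information:

    SELECT P.product_id, P.pName, S.sales
    FROM Product P, Sales S
    WHERE P.product_id = S.product_id AND S.loc_id = 1 AND S.time_id = 3
    ORDER BY S.sales DESC
    FETCH FIRST 10 ROWS ONLY;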

  28. Online Aggregation
    SELECT L.province, AVG(S.sales)
    FROM Sales S, Location L
    WHERE S.loc_id = L.loc_id
    GROUP BY L.province
• This query may be expensive if the tables are large • If speed is of the essence it is possible to return data before the query is complete • Either return the current running average • Or use sampling and other statistical techniques to return an approximation • Note that the algorithms must be non-blocking
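
As a sketch of the sampling approach, many DBMSs support the SQL:2003 TABLESAMPLE clause, which reads only a fraction of the table to produce a fast approximate answer (alias placement and sampling methods vary by DBMS):

    SELECT L.province, AVG(S.sales)
    FROM Sales S TABLESAMPLE SYSTEM (1), Location L   -- scan roughly 1% of Sales
    WHERE S.loc_id = L.loc_id
    GROUP BY L.province;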

  29. Implementation Issues • The OLAP environment motivates some different implementation techniques • Indexing is very important in OLAP systems • Interactive response time is desired for queries over very large databases • OLAP systems are mostly read and rarely updated, which reduces the cost of maintaining indexes • New indexing techniques have been developed for OLAP systems • Bitmap indexes • Join indexes

  30. Bitmap Indexes • A bitmap index can speed up queries on sparse columns, i.e. columns with few possible values • One bit is allocated for each possible value • The indexes can be used to answer some queries directly • How many male customers have a rating of 3? • AND the M and 3 bit vectors and count the 1s • [Figure: bitmap indexes on the sex and rating columns]
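
Bitmap indexes are not part of the SQL standard; a sketch using Oracle-style syntax on an assumed Customers(sex, rating) table:

    CREATE BITMAP INDEX cust_sex_bidx    ON Customers (sex);
    CREATE BITMAP INDEX cust_rating_bidx ON Customers (rating);

    -- Answered by ANDing the M and 3 bit vectors and counting the 1s
    SELECT COUNT(*)
    FROM Customers
    WHERE sex = 'M' AND rating = 3;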

  31. Join Indexes • Joins are often expensive operations • Join indexes can be built to speed up specific join queries • A join index contains record IDs of matching records from different tables • e.g. the sales, products, and locations of all sales in B.C. • The index would contain the sales rids and their matching product and location rids • Only locations where province = 'BC' are included • The number of such indexes can be a problem where there are many similar queries

  32. Alternative Join Indexes • To reduce the number of join indexes separate indexes can be created on selected columns • Each index contains rids of dimension table records that meet the condition, and rids of matching fact table records • The separate join indexes have to be combined, using rid intersection, to compute a join query • The intersection can be performed more efficiently if the new indexes are bitmap indexes • Particularly if the selection columns are sparse • The result is a bitmapped join index

  33. Summary • Decision support is a rapidly growing area of database use and research • It involves the creation of large, consolidated data repositories called data warehouses • Warehouses are queried using sophisticated analysis techniques • Complex multidimensional queries influenced by both SQL and spreadsheets • New techniques for database design, indexing, view maintenance and querying must be supported

  34. Data Mining

  35. Introduction • Data mining consists of finding interesting trends in large datasets • Related to an area of statistics - exploratory data analysis • Such patterns should be identified with minimal user input • The knowledge discovery process has four steps • Data selection - find the target subset of the data • Data cleaning - remove noise and outliers, transform fields to common units and prepare the data for analysis • Data mining - apply data mining algorithms to find interesting trends or patterns • Evaluation - present results to end users

  36. Market Basket Analysis • A market basket is a collection of items purchased by a customer in a single transaction • Retailers commonly want to know which items are purchased together, to identify marketing opportunities • An itemset is a set of items bought in a transaction • The support of an itemset is the fraction of transactions that contain all the items in an itemset • e.g. if {milk, cookies} has 60% support then 60% of all transactions contain both milk and cookies • We may be interested in single item itemsets as they identify frequently purchased items
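
A sketch of computing support in SQL, assuming a hypothetical Purchases(transid, item) table with one row per item per transaction:

    -- Support of the itemset {milk, cookies}
    SELECT COUNT(*) * 1.0 /
           (SELECT COUNT(DISTINCT transid) FROM Purchases) AS support
    FROM (SELECT DISTINCT P1.transid
          FROM Purchases P1, Purchases P2
          WHERE P1.transid = P2.transid
          AND P1.item = 'milk' AND P2.item = 'cookies') T;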

  37. A Priori Property • The a priori property is that every subset of a frequent itemset is also a frequent itemset • The algorithm can proceed iteratively by first identifying frequent itemsets with only one item • Each frequent single-item itemset can then be extended with another item to generate larger candidate itemsets • Each iteration of the algorithm scans the transactions once • Increasing the size of the candidate itemsets by one item • The algorithm can be improved by only extending itemsets with items that are themselves frequent • The minimum support level has to be specified by the user
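
A sketch of one iteration in SQL, over the same assumed Purchases(transid, item) table; a minimum support of 100 transactions is an arbitrary choice:

    -- Frequent single-item itemsets
    SELECT item, COUNT(DISTINCT transid) AS supp
    FROM Purchases
    GROUP BY item
    HAVING COUNT(DISTINCT transid) >= 100;

    -- Candidate two-item itemsets, counted by another scan
    SELECT P1.item AS item1, P2.item AS item2,
           COUNT(DISTINCT P1.transid) AS supp
    FROM Purchases P1, Purchases P2
    WHERE P1.transid = P2.transid AND P1.item < P2.item
    GROUP BY P1.item, P2.item
    HAVING COUNT(DISTINCT P1.transid) >= 100;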

  38. Iceberg Queries • Find items that customers have purchased more than five times:
    SELECT P.custid, P.item, SUM(P.qty)
    FROM Purchases P
    GROUP BY P.custid, P.item
    HAVING SUM(P.qty) > 5
• If the number of {custid, item} pairs is large the relation may have to be sorted or hashed, but • The result set is probably small, just the tip of the iceberg • The query will waste time computing all of the groups, even though only a few will meet the HAVING condition • A modification of the a priori property suggests that we only need to consider • Customers that have purchased more than 5 items, and • Items that have been purchased more than 5 times
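
Applying that observation, the query can be rewritten (as a sketch) so customers and items that cannot contribute a qualifying group are filtered out before the expensive grouping:

    SELECT P.custid, P.item, SUM(P.qty)
    FROM Purchases P
    WHERE P.custid IN (SELECT custid FROM Purchases
                       GROUP BY custid HAVING SUM(qty) > 5)
    AND P.item IN (SELECT item FROM Purchases
                   GROUP BY item HAVING SUM(qty) > 5)
    GROUP BY P.custid, P.item
    HAVING SUM(P.qty) > 5;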

  39. Association Rules • An association rule is a rule of the form • {milk} → {cookies}, which states that: • If milk is purchased in a transaction then it is likely that cookies are also purchased in that transaction • There are two measures for association rules • The support is the percentage of transactions that contain {LHS ∪ RHS} • The same as the support for that itemset • The confidence is a measure of the strength of the rule • The percentage of times that cookies are purchased whenever milk is purchased, or • support({LHS ∪ RHS}) / support({LHS})

  40. Finding Association Rules • Users ask for association rules with given minimum support and confidence • First all frequent itemsets with the specified minimum support are found • As discussed previously • Once the frequent itemsets have been produced they are divided into LHS and RHS • The confidence measure is then tested for each possible LHS and RHS combination of the qualifying itemsets • The most expensive part of the algorithm is identifying the frequent itemsets
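
A sketch of the confidence test for the single-item rule {milk} → {cookies}, again over the assumed Purchases(transid, item) table:

    -- confidence = support({milk, cookies}) / support({milk})
    SELECT COUNT(DISTINCT P1.transid) * 1.0 /
           (SELECT COUNT(DISTINCT transid)
            FROM Purchases WHERE item = 'milk') AS confidence
    FROM Purchases P1, Purchases P2
    WHERE P1.transid = P2.transid
    AND P1.item = 'milk' AND P2.item = 'cookies';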

  41. More Complex Analysis • Association rules can be applied to sets of days • By using the date field as a grouping attribute • In calendric market basket analysis the user specifies a group of dates, or calendar, to test the rule over • Sequential patterns can be analyzed, where a customer purchases a given sequence of itemsets • Care must be taken when using association rules for predictive purposes • Rules like {milk} → {cookies} may not be causal

  42. Bayesian Networks • Finding causal relationships can be hard • Although two events (or purchases) are correlated there may not be a causal relationship between them • Each possible combination of causal relationships can be considered as a model of the world • Assign a score to each model based on its consistency with the observed data • Bayesian networks are graphs that can be used to describe such models • The number of models is exponential in the number of variables, so only a subset can be considered

  43. Predicting Attributes • An insurance company may want to predict whether or not customers are high risk • What information can they use to do this? • e.g. if a male aged between 16 and 25 drives a truck he is high risk • There is one attribute whose value is to be predicted, the dependent attribute, and • A number of predictor attributes • The general form of such rules is: • P1(X1) ∧ P2(X2) ∧ … ∧ Pk(Xk) ⇒ Y = c • The Pi are predicates and the Xi are predictor attributes

  44. Classification and Regression • The form of the predicates depends on the type of the predictor attribute • If the attribute is numerical, then numerical computations can be performed • The predicate is of the form low ≤ Xi ≤ high • If the attribute is categorical, then we must test whether values are equal • The predicate is of the form Xi ∈ {v1, …, vj} • age is numerical; car type and risk rating are categorical • (16 ≤ age ≤ 25) ∧ car ∈ {truck} ⇒ hiRisk = true • Rules are named for the type of the dependent attribute • Categorical – classification rules • Numerical – regression rules
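
The example rule can be read as a predicate over the predictor attributes; a sketch in SQL, assuming a hypothetical Drivers(age, car) table:

    SELECT D.age, D.car,
           CASE WHEN D.age BETWEEN 16 AND 25 AND D.car = 'truck'
                THEN 'true' ELSE 'false'
           END AS hiRisk   -- (16 <= age <= 25) AND car IN {truck} => hiRisk = true
    FROM Drivers D;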

  45. Classification and Regression • Support and confidence can be defined for classification and regression rules • Support – the support for a condition, C, is the percentage of records that satisfy C • The support for a rule C1 ⇒ C2 is the support for C1 ∧ C2 • Confidence – the confidence is the percentage of records satisfying C1 that also satisfy C2 • Classification and regression rules differ from association rules • They can involve several fields, not just a single set-valued field, on the left hand side of the rule

  46. Decision Trees • A collection of classification rules can be represented as a decision tree • Each internal node of the tree is labeled with a predictor attribute, referred to as a splitting attribute • Outgoing edges are labeled with predicates • Leaf nodes are labeled with values of the dependent attribute • Decision trees are constructed in two phases • In the growth phase an overly large tree is built • Trees are built by repeatedly splitting on the best remaining splitting criterion • The database is then partitioned on this criterion • The tree is then pruned to remove overspecialized rules

  47. Clustering • The goal is to partition a set of records into groups • Records in a group are similar, and records that belong to different groups are dissimilar • Each group is called a cluster, and each record should belong to only one cluster • Partitional clustering algorithms partition the data into groups based on some criterion • Hierarchical clustering algorithms generate a sequence of partitions • In the first partition each cluster consists of one record; the algorithm then merges two clusters in each step

  48. The End
