Data Warehousing & Data Mining Concepts

Data Warehousing & Data Mining Concepts David Chrestman Evan Rakestraw Andy Stevens

Data Warehousing • A data warehouse is a collection of information with a supporting system that is: • Intended for decision-support applications. • Optimized for data retrieval, not transaction processing.

Data Warehousing • A data warehouse is characterized as a subject-oriented, integrated, time-variant collection of data to support the decisions of management. • Data warehouses provide access to data for complex analysis, knowledge discovery, and decision making.

Data Warehousing • A data warehouse can be normalized or denormalized. • Data warehouse data is not changed often. • Outputs from data warehouses are usually tabular listings or formatted formal reports.

Data Warehousing • Data warehouses support high-performance demands on data with support for applications such as: • OLAP • DSS • Data Mining Applications

OLAP • Online Analytical Processing describes the analysis of complex data from the data warehouse. • OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than found on an average desktop.

DSS • DSS stands for Decision Support Systems and is also known as Executive Information Systems (EIS). • DSS supplies higher level data for complex decisions.

Data Mining • Data Mining is used for knowledge discovery. • Data Mining is the process of searching data for unanticipated new knowledge.

Data Warehousing • Compared with transactional databases, information in a data warehouse is regarded as non-real-time with periodic updating. • Warehouse updates are handled by the warehouse’s acquisition component that provides all required processing.

Data Warehousing Characteristics • Multidimensional Conceptual View • Generic Dimensionality • Unlimited Dimensions and Aggregation Levels • Unrestricted Cross-Dimensional Operations • Dynamic Sparse Matrix Handling • Client-Server Architecture

Data Warehousing Characteristics • Multi-User Support • Accessibility • Transparency • Intuitive Data Manipulations • Consistent Reporting Performance • Flexible Reporting

Data Warehousing • Data Warehouses are generally 10 times larger than the source databases, and are likely to be in terabytes. • The issue of their size has been dealt with using: • Enterprise-Wide Data Warehouses • Virtual Data Warehouses • Data Marts

Data Warehousing • Enterprise-Wide Data Warehouses are huge projects requiring massive investment of time and resources. • Virtual Data Warehouses provide views of operational databases that are materialized for efficient access. • Data Marts generally are targeted to a subset of an organization and are tightly focused.

Data Modeling For Warehouses • Multidimensional models take advantage of inherent relationships in data to populate data in multidimensional matrices called data cubes. • If there are more than three dimensions they may be called hypercubes.

Data Modeling For Warehouses • Examples of dimensions in a data warehouses would be a company’s fiscal periods, products, and regions. • Changing the dimensional orientation is easily accomplished using rotation (also called pivoting).

Two Dimensional Matrix Model

Three Dimensional Data Cube

Pivoted 3D Data Cube

Data Modeling For Warehouses • Roll-up display moves up the hierarchy, grouping into larger units along a dimension. • Example: Summing weekly data by month, quarter, or year. • Drill-down display moves down the hierarchy, dividing into smaller units along a dimension. • Example: Dividing country sales by region and then region by sub-region.

Roll-Up

Drill-Down

Data Modeling For Warehouses • The multidimensional storage model involves two types of tables: • Dimension Table: consists of tuples of attributes of the dimension. • Fact Table: has tuples (one per recorded fact), contains some measured variables, and identifies it with pointers to dimension tables.

Data Modeling For Warehouses • Two common multidimensional schemas are: • Star Schema: consists of a fact table with a single table for each dimension. • Snowflake Schema: a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalization

Star Schema

Snowflake Schema

Data Modeling For Warehouses • To support high performance access, data warehouse storage utilizes indexing techniques: • Bitmap Indexing: constructs a bit vector for each value in a domain being indexed. • Join Indexing: uses traditional indexes to maintain relationships between primary key and foreign key values

Building A Data Warehouse • Warehouse design should specifically support ad-hoc querying, which is accessing data with any meaningful combination of values for attributes in the dimension or fact tables. • This is because there is no way to anticipate all possible queries or analyses during the design phase.

Building A Data Warehouse • Acquisition of data for the warehouse involves the following steps: • Data must be extracted from multiple, heterogeneous sources • Data must be formatted for consistency within the warehouse • Data must be cleaned before loaded into the warehouse to ensure validity.

Building A Data Warehouse • Data must be fitted into the data model of the warehouse. • Data must be loaded into warehouse.

Building A Data Warehouse

Building A Data Warehouse • Due to the large size of data loaded into a warehouse, the updating policy must keep in mind a few questions: • How up-to-date must the data be? • Can the warehouse go offline, and for how long? • What is the storage availability? • What is the loading time?

Building A Data Warehouse • Data storage in a warehouse must be able to reflect the specialization of optimized access, which involves: • Storing data according to data model. • Maintaining required data structures. • Maintaining appropriate access paths. • Providing for time-variant data as new data is added.

Building A Data Warehouse • Supporting the updating of the warehouse data. • Refreshing the data. • Purging the data.

Building A Data Warehouse • Data warehouses must be designed with consideration to the environment in which they reside. Important considerations: • Usage projections • Fit of the data model • Characteristics of available sources • Design of the metadata component

Building A Data Warehouse • Modular component design • Design for manageability and change • Considerations of distributed and parallel architecture

Oracle & Data Warehouses • Oracle offers their “Oracle Warehouse Builder” (OWB) in Oracle 10g. • OWB manages the full life-cycle of data and metadata for the Oracle 10g Database. • Provides an easy to use, graphical environment to rapidly design, deploy, and manage business intelligence systems.

Oracle & Data Warehouses • Oracle Warehouse Builder: • Provides specific mapping operators to cleanse data upon loading • Extracts data from heterogeneous data sources • Provides users with a graphical environment to model the ETL processes • Reduces extraction, transformation and loading development times

Oracle & Data Warehouses • More information on OWB at: http://www.oracle.com/technology/products/warehouse/index.html

Data Warehouse Summary • A data warehouse is a collection of data used for research and decision support. • Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries. • Data warehouses must provide far greater and more efficient query support than is demanded of transactional databases.

Data Mining • Data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. • To date, it is not well-integrated with database management systems.

Goals of Data Mining • Prediction • Identification • Classification • Optimization

Prediction • Shows how certain attributes within the data will behave in the future. • Example: Certain seismic wave patterns may predict an earthquake with high probability.

Identification • Data patterns can be used to identify the existence of an item, and event, or an activity. • Example: Intruders trying to break a system may be identified by the programs executed, files accessed, and CPU time per session.

Classification • Catagorizes data so that different categories can be identified • Example: Customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers.

Optimization • Optimizes the use of limited resources • Maximize output variables such as sales or profits

Knowledge discovered in Data Mining • Association • Classification hierarchies • Sequential patterns • Patterns within time series • Clustering • For most applications the desired knowledge is a combination of these types.

Association Rules • These rules correlate the presence of a set of items with another range of values for another set of variables. • Example: When a person buys a gallon of milk that person is likely to buy a box of cereal.

Classification Hierarchies • Works from an existing set of events to create a hierarchy of classes. • Example: A model may be developed for the factors that determine the desirability of location of a store on a scale of 1-10.

Sequential Patterns • A sequence of actions or events is sought. • Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months.

Patterns Within Time Series • Similarities detected within positions of a time series of data • A sequence of data taken in intervals such as daily sales or daily closing stock prices. • Example: Two products show the same selling pattern in the summer but a different one in the winter.

Data Warehousing & Data Mining Concepts