Download
data warehousing data mining concepts n.
Skip this Video
Loading SlideShow in 5 Seconds..
Data Warehousing & Data Mining Concepts PowerPoint Presentation
Download Presentation
Data Warehousing & Data Mining Concepts

Data Warehousing & Data Mining Concepts

336 Views Download Presentation
Download Presentation

Data Warehousing & Data Mining Concepts

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Data Warehousing & Data Mining Concepts David Chrestman Evan Rakestraw Andy Stevens

  2. Data Warehousing • A data warehouse is a collection of information with a supporting system that is: • Intended for decision-support applications. • Optimized for data retrieval, not transaction processing.

  3. Data Warehousing • A data warehouse is characterized as a subject-oriented, integrated, time-variant collection of data to support the decisions of management. • Data warehouses provide access to data for complex analysis, knowledge discovery, and decision making.

  4. Data Warehousing • A data warehouse can be normalized or denormalized. • Data warehouse data is not changed often. • Outputs from data warehouses are usually tabular listings or formatted formal reports.

  5. Data Warehousing • Data warehouses support high-performance demands on data with support for applications such as: • OLAP • DSS • Data Mining Applications

  6. OLAP • Online Analytical Processing describes the analysis of complex data from the data warehouse. • OLAP tools use distributed computing capabilities for analyses that require more storage and processing power than found on an average desktop.

  7. DSS • DSS stands for Decision Support Systems and is also known as Executive Information Systems (EIS). • DSS supplies higher level data for complex decisions.

  8. Data Mining • Data Mining is used for knowledge discovery. • Data Mining is the process of searching data for unanticipated new knowledge.

  9. Data Warehousing • Compared with transactional databases, information in a data warehouse is regarded as non-real-time with periodic updating. • Warehouse updates are handled by the warehouse’s acquisition component that provides all required processing.

  10. Data Warehousing Characteristics • Multidimensional Conceptual View • Generic Dimensionality • Unlimited Dimensions and Aggregation Levels • Unrestricted Cross-Dimensional Operations • Dynamic Sparse Matrix Handling • Client-Server Architecture

  11. Data Warehousing Characteristics • Multi-User Support • Accessibility • Transparency • Intuitive Data Manipulations • Consistent Reporting Performance • Flexible Reporting

  12. Data Warehousing • Data Warehouses are generally 10 times larger than the source databases, and are likely to be in terabytes. • The issue of their size has been dealt with using: • Enterprise-Wide Data Warehouses • Virtual Data Warehouses • Data Marts

  13. Data Warehousing • Enterprise-Wide Data Warehouses are huge projects requiring massive investment of time and resources. • Virtual Data Warehouses provide views of operational databases that are materialized for efficient access. • Data Marts generally are targeted to a subset of an organization and are tightly focused.

  14. Data Modeling For Warehouses • Multidimensional models take advantage of inherent relationships in data to populate data in multidimensional matrices called data cubes. • If there are more than three dimensions they may be called hypercubes.

  15. Data Modeling For Warehouses • Examples of dimensions in a data warehouses would be a company’s fiscal periods, products, and regions. • Changing the dimensional orientation is easily accomplished using rotation (also called pivoting).

  16. Two Dimensional Matrix Model

  17. Three Dimensional Data Cube

  18. Pivoted 3D Data Cube

  19. Data Modeling For Warehouses • Roll-up display moves up the hierarchy, grouping into larger units along a dimension. • Example: Summing weekly data by month, quarter, or year. • Drill-down display moves down the hierarchy, dividing into smaller units along a dimension. • Example: Dividing country sales by region and then region by sub-region.

  20. Roll-Up

  21. Drill-Down

  22. Data Modeling For Warehouses • The multidimensional storage model involves two types of tables: • Dimension Table: consists of tuples of attributes of the dimension. • Fact Table: has tuples (one per recorded fact), contains some measured variables, and identifies it with pointers to dimension tables.

  23. Data Modeling For Warehouses • Two common multidimensional schemas are: • Star Schema: consists of a fact table with a single table for each dimension. • Snowflake Schema: a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalization

  24. Star Schema

  25. Snowflake Schema

  26. Data Modeling For Warehouses • To support high performance access, data warehouse storage utilizes indexing techniques: • Bitmap Indexing: constructs a bit vector for each value in a domain being indexed. • Join Indexing: uses traditional indexes to maintain relationships between primary key and foreign key values

  27. Building A Data Warehouse • Warehouse design should specifically support ad-hoc querying, which is accessing data with any meaningful combination of values for attributes in the dimension or fact tables. • This is because there is no way to anticipate all possible queries or analyses during the design phase.

  28. Building A Data Warehouse • Acquisition of data for the warehouse involves the following steps: • Data must be extracted from multiple, heterogeneous sources • Data must be formatted for consistency within the warehouse • Data must be cleaned before loaded into the warehouse to ensure validity.

  29. Building A Data Warehouse • Data must be fitted into the data model of the warehouse. • Data must be loaded into warehouse.

  30. Building A Data Warehouse

  31. Building A Data Warehouse • Due to the large size of data loaded into a warehouse, the updating policy must keep in mind a few questions: • How up-to-date must the data be? • Can the warehouse go offline, and for how long? • What is the storage availability? • What is the loading time?

  32. Building A Data Warehouse • Data storage in a warehouse must be able to reflect the specialization of optimized access, which involves: • Storing data according to data model. • Maintaining required data structures. • Maintaining appropriate access paths. • Providing for time-variant data as new data is added.

  33. Building A Data Warehouse • Supporting the updating of the warehouse data. • Refreshing the data. • Purging the data.

  34. Building A Data Warehouse • Data warehouses must be designed with consideration to the environment in which they reside. Important considerations: • Usage projections • Fit of the data model • Characteristics of available sources • Design of the metadata component

  35. Building A Data Warehouse • Modular component design • Design for manageability and change • Considerations of distributed and parallel architecture

  36. Oracle & Data Warehouses • Oracle offers their “Oracle Warehouse Builder” (OWB) in Oracle 10g. • OWB manages the full life-cycle of data and metadata for the Oracle 10g Database. • Provides an easy to use, graphical environment to rapidly design, deploy, and manage business intelligence systems.

  37. Oracle & Data Warehouses • Oracle Warehouse Builder: • Provides specific mapping operators to cleanse data upon loading • Extracts data from heterogeneous data sources • Provides users with a graphical environment to model the ETL processes • Reduces extraction, transformation and loading development times

  38. Oracle & Data Warehouses • More information on OWB at: http://www.oracle.com/technology/products/warehouse/index.html

  39. Data Warehouse Summary • A data warehouse is a collection of data used for research and decision support. • Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries. • Data warehouses must provide far greater and more efficient query support than is demanded of transactional databases.

  40. Data Mining • Data mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. • To date, it is not well-integrated with database management systems.

  41. Goals of Data Mining • Prediction • Identification • Classification • Optimization

  42. Prediction • Shows how certain attributes within the data will behave in the future. • Example: Certain seismic wave patterns may predict an earthquake with high probability.

  43. Identification • Data patterns can be used to identify the existence of an item, and event, or an activity. • Example: Intruders trying to break a system may be identified by the programs executed, files accessed, and CPU time per session.

  44. Classification • Catagorizes data so that different categories can be identified • Example: Customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, shoppers attached to name brands, and infrequent shoppers.

  45. Optimization • Optimizes the use of limited resources • Maximize output variables such as sales or profits

  46. Knowledge discovered in Data Mining • Association • Classification hierarchies • Sequential patterns • Patterns within time series • Clustering • For most applications the desired knowledge is a combination of these types.

  47. Association Rules • These rules correlate the presence of a set of items with another range of values for another set of variables. • Example: When a person buys a gallon of milk that person is likely to buy a box of cereal.

  48. Classification Hierarchies • Works from an existing set of events to create a hierarchy of classes. • Example: A model may be developed for the factors that determine the desirability of location of a store on a scale of 1-10.

  49. Sequential Patterns • A sequence of actions or events is sought. • Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months.

  50. Patterns Within Time Series • Similarities detected within positions of a time series of data • A sequence of data taken in intervals such as daily sales or daily closing stock prices. • Example: Two products show the same selling pattern in the summer but a different one in the winter.