Data Warehousing • Denis Manley • Enterprise Systems • FT228/3
Data Warehousing • A data warehouse is a repository for all an organisations data. It is designed to present data in very user-friendly formats to end-users. • Why is Data Warehousing necessary? • think of a typical company that uses computers for order processing, invoicing, etc. • Data is gathered for every transaction that occurs. Different departments probably use different systems, each designed for a specific task.
Data Warehousing (cont’d) • There will be lots of different database Tables, in different formats, on different machines. There will be tie-ups between data items, but those tie-ups may not be obvious outside the applications that manipulate the tables. • Also, the amount of data is constantly increasing. Some estimates state that the size of a typical data processing system doubles every two years. • Both standard, and ad-hoc, reports must be catered for, and they must run in a reasonable time-frame.
Data Warehousing (cont’d) • Some solutions to the issues listed can be implemented in the applications that manage specific data. • E.G. if you want to ask “How much does this customer owe?” then the original package is probably the one to use. • But if you want to ask “Was this ad campaign more successful than that one?”, you require data from more disparate sources, and one application may not provide all of it.
Data Warehousing (cont’d) • The new alternative is a Data Warehouse: • a D.W. is a way of organising data from a wide range of formats, possibly from different locations, so it’s all available in a single location. • Once this stage is complete, the collection of data is usually frequently replicated around multiple locations. • This means users have a local copy of the data they need to inspect. This improves query run-times, and reduces communications overheads.
Data Warehousing (cont’d) • The data is not only collected and joined, however. What the data means is described in a way that the end-users can easily understand, e.g. using the name “Customer Account Number” instead of “order>cacno” for a field. • The data must be quickly accessible, so that users can find data quickly and easily.
Data Warehousing process Operational data in legacy systems e.g. OLTP apps. Data fusion: Assembles diverse data Data Cleansing: “fixing” data Meta Data: shows transformations of data,origins of data, etc. Data Migration: load data and meta-data into warehouse periodically Data Warehouse Decision-support analyst queries
Data Warehousing Architecture • A data warewhouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores. • It comprises a number of components illustrated in figure 1
Data Warehousing Architecture Figure 1 : A datawarehouse architecture
Data Warehousing Architecture • Data Warehouse database • This is a cornerstone of the data warehouse (item 2 in figure 1) • Almost always implemented on a relational database management system • However very large databases, ad hoc processing and the need for flexible user views e.g. aggregates and drill downs are calling for different technological approaches: • Parallel relational database designs requiring the use of symmetric multiprocessors, Massively parallel processors and clusters
Data Warehousing Architecture • Speeding up traditional RDBMS • The use of Multidimensional databases (MDDBs) which are tightly coupled to on-line analytical processing
Data Warehousing Architecture • Sourcing, acquisation, cleanup and transformation of data • Implementing data warehouses involves extracting data from operational systems including legacy systems and putting it into a suitable format. • The various tools are illustrated as item 1 in figure 1 • These tools perform all the conversions, summarisations, key changes, structural changes, and condensations needed to transform disparate data into information can be used by decision support tools
Data Warehousing Architecture • It produces programs and control statements required to move data into the data warehouse form multiple operational systems • Maintains the metadata • Remove unwanted data • Converts to common data names and definitions • Calculates summaries • Establish defaults for missing data • Keep track of source data definition changes
Data Warehousing Architecture • The tools have to deal with: • Database hetergeneity: DBMS can be very different in data models, data access language etc. • Data hetergeneity: the difference in the way data is defined and used e.g. synonyms and different attributes for the same entity etc.
Data Warehousing Architecture • Metadata (item 3 figure1): data about data that describes the datawarehouse • Technical metadata: contains information for warehouse designers and administrators • Information about data sources • Transformation descriptions • Rules used to perform data clean up • Access authorisation, information delivery history, data acquisation history, data access etc • Business metadata: information that gives users an understanding of the information stored in the data warehouse. • Queries, reports images • Data warehouse operational information e.g. data history and ownership
Meta Data Versioning • In the operational environment, there tends to be a single instance of data and meta data at any one moment in time. • In the data warehouse environment, there are multiple definitions of data and meta data over an historically long period of time. • Therefore, versioning of data warehouse meta data is crucial to the success of the data warehouse vis-a-vis the end users’ ability to access and understand the data in the data warehouse.
Guidelines for Metadata Management • Develop an Information Directory that integrates technical and business metadata. • Keep the metadata current and accurate! • Maintain the time variant history of the metadata! • Provide meaningful descriptions and definitions (use business definitions, not technical definitions, where possible). • The end users must be educated about what metadata is, how to access it, how to use it, etc.
?? Meta Data Answers Questions for Users of the Data Warehouse • How do I find the data I need? • What is the original source of the data? • How was this summarization created? • What queries are available to access the data? • How have business definitions and terms changed over time? • How do product lines vary across organizations? • What business assumptions have been made?
The Role of Meta Datain the Data WarehouseArchitecture Meta Data enables data to become information, because with it you • Know what data you haveand • You can trust it!
Data marts (item 4 figure 1) • A data mart is a data store that is subsidary to a data warehouse of intergrated data. • The data mart is directed at a partition of data (subject area) that is created for the use of a dedicated group of users and is sometimes termed a “subject warehouse” • The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or more often placed on a separate physical store. • Data marts can be “dependent data marts” when the data is sourced from the data warehouse. • Independent data marts represent fragmented solutions to a range of business problems in the enterprise, however, such a concept should not be deployed as it doesn’t have the “data intergration” concept that’s associated with data warehouses.
Independent Data Marts Data Mart Systems of Record Extract, Transform, Clean, Integrate, Summarize, etc....
Independent Data Marts Extract, Transform, Clean, Integrate, Summarize, etc.... Three Times!
Independent Data Marts $ Significant and expensive duplication of effort and data.
Independent Data Marts $ $ Maintenance of proliferating unarchitected marts expensive and cumbersome.
md md Unarchitected Data Marts ? There may be metadata for some marts, but what about consistency & history?
md Atomic md Information Directory md Contrast Architected Data Warehouse Data Access Data Warehouse Dept’l Dept’l Dept’l Systems of Record Dependent (Architected) Departmental “Marts” with the Appropriate Subset of Metadata
Unarchitected Data Marts Data Warehouse md Dept’l Atomic Independent Data Marts vs. The Real Thing • Architected to meet organizational as well as departmental requirements • Data and results consistent • Redundancy is managed • Detailed history available for drill-down • Metadata is consistent! • Easy to do, but... • Are the extracts, transformations, integrations & loads consistent? • Is the redundancy managed? • What is the impact on the sources?
Independent Data marts • However, such marts are not necessarly all bad. • Often a valid solution to a pressing business problem: • Extremely urgent user requirements • The absence of a budget for a full data warehouse • The decentralisation of business units
Data Warehousing Architecture • Access Tools (item 5 figure 1) • The principal purpose of the data warehouse is to provide information for strategic decision making. • The main tools used to achieve this objective are: • Data query and reporting tools • Executive information system tools • On-line analytical processing tools • Data mining tools
Data Warehousing Architecture • Query and reporting tools • Reporting tools • Production reporting tools e.g. generate operational reports • Report writers: inexpensive desktop tools • Managed Query tools • Shield users from the complexities of SQL and database structures by inserting a metalayer between the users and the database
A Few Definitions • OLAP • On-Line Analytical Processing • A set of functionality that attempts to facilitate multidimensional analysis • Multidimensional Analysis • The ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and an understanding of that data • Has also been called as “drill-down”, “drill-across” and “slice and dice”
A Few Definitions • Hypercube • A means of visually representing multidimensional data • Star Schema • A means of aggregating data based on a set of known dimensions, attempting to store data multidimensionally in a two dimensional RDBMS • Snowflake Schema • An extension of the star schema by means of applying additional dimensions to the dimensions of a star schema in a relational environment
A Few Definitions • Multidimensional Database • Also known as MDDB or MDDBS • A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems • OLAP Tools • A set of software products that attempt to facilitate multidimensional analysis • Can incorporate data acquisition, data access, data manipulation, or any combination thereof
A Few Definitions • ROLAP • Relational OLAP • Using an RDBMS to implement an OLAP environment • Typically involves a star schema to provide the multidimensional capabilities • OLAP tool manipulates RDBMS star schema data
A Few Definitions • MOLAP (1) • Multidimensional OLAP • Using an MDDBS to store and access data • MDDBS “directly” manages data multidimensionally • Usually requires proprietary (non-SQL) access tools • MOLAP (2) • OLAP tool facilitates multidimensional capabilities without the need for a star schema • Often utilizes a 3-Tier environment, where middle tier server preprocesses data from an RDBMS • Some OLAP tools access an RDBMS directly and build “cubes” as a “fat” client
How Can OLAP Be Accomplished? • Use the Data Warehouse as the architected foundation for the organization’s informational processing requirements • Use the appropriate design techniques to ensure that the data required is at the appropriate degree of granularity at the atomic level, and the appropriate degree of summarization at the departmental level • Use the appropriate tools to either access relational data in a multidimensional manner, or manage multidimensional data
Data mining • A CSF for any business is its ability to effectively use information. • This strategic use can occur from discovering hidden, undected and frequently valuable information about customers suppliers retailers etc. • This information can be used to formulate effective business, marketing and sales strategies. • A relatively new technology that can be used to achieve this strategic advantage is “data mining”
Data visualisation • Data visualisation is a method of presenting the output of the previously mentioned methods in such a way that the problem or solution is clearly visible to domain experts and even casual observers. • It goes way beyond simple bar and pie charts • It is a collection of complex techniques that focus on determining the best way to display complex patterns on a 2-D computer monitor. • Such techniques involve experimenting with various colours, shapes 3-d images, sound and vittual reality to help users really see and feel the problem and solution.
DW administration and management(Item 6 figure 1) • Managing data warehouses involves: • Security and priority management • Monitoring updates from multiple sources • Data quality checks • Managing and updataing metadata • Replicating, subsetting and distributing data • Backup and recovery • Data warehouse storage management e.g. capacity planning, hierarchical storage management, purging of aged data.
Information delivery system(Item 7 figure 1) • The IDS is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm • IDS may be based on time of day or on completion of an external event. • IDS can be achieved by Client/Server architecture and now by Interner/intranet and World Wide Web.
Top 10 “Don’t’s” of Data Warehousing
10. Pre-selecting Your Technical Environment This is a very common trap in which many organizations find themselves. It is traditional to select the hardware, software and other physical, technical components of a system as one of the earliest activities. However, a data warehouse is an unstable environment from a sizing perspective. How do you know the hardware/RDBMS/end user tool is appropriate for your data warehouse before conducting even the first round of analysis? If at all possible, wait to select your technical environment until after you have analyzed the business requirements for information, data, and potential systems of record.
9. Allowing Access Tool to Determine Data Architecture This is an extension of #10, but is important enough to list by itself. If you select an end user tool before developing your data architecture, it is very likely that that architecture will suffer at the hand of design requirements delivered by the tool. If you have to sacrifice design requirements in order to meet functional requirements of a tool, it is probably time to put that tool aside and select another one.
8. Unarchitected Data Marts OK. Data marts are good; they are an essential part of the data warehouse architecture. But to build only a data mart and to ignore the rest of the data warehouse (specifically the atomic level data and centralized meta data) will lead you down a path that will be more expensive and deliver less quality of data than the alternative. The alternative is to architect and build the data warehouse incrementally, iteratively. Include data marts as departmental instances of the architecture, and populate them from the atomic level data. This will ensure accuracy across the architecture, and reduce costs by eliminating unnecessary population of stand-alone data marts.
7. Boiling the Ocean It is more efficient to implement the data warehouse in small, achievable and palatable chunks than to try to implement it all at once. When I say “boil the ocean”, I mean trying to do too many things for too many people all at the same time. There is an old adage: “You can have everything; where would you put it all?” The same holds true for a data warehouse. If you try to design, develop and implement a data warehouse that is all-encompassing as your first iteration, how will the users be able to use all that you delivered? And in the mean time, while you’ve been trying to meet all of their needs, you have failed to meet any needs. And users won’t forget that for a long time.
6. “If you build it they will come” If you design, develop and implement an operational system, such as an order processing system, that new system is typically going to replace an existing system. In other words, the old system goes away and the users have no choice but to use the new one. Not so with the d/w. “If you build it…” implies an analysis that includes only bottom-up activities. It is crucial to the success of a data warehouse that a top-down analysis of user requirements for information be conducted. After that, users must be tutored, mentored and otherwise have their hands held as part of the implementation of the data warehouse. Existence does not guarantee utilization and, therefore, value.
5. Putting ROI before RFI (Requirements for Information) It is very difficult to quantify the intangible benefits that a data warehouse can provide to an organization. How can you put a price on increased customer loyalty. Somewhere, sometime, someone has probably made this calculation. In most cases, however, the determination of how beneficial the data warehouse will be is based on criteria that was developed for operational systems. Just as you cannot use operational data to meet your strategic informational requirements, it is difficult to calculate the return on investment (ROI) of a data warehouse. In terms of benefits to the organization, it is more appropriate to concentrate on how well the data warehouse addresses the target users’ requirements for information.
4. No Committed User Involvement Write this down: The success of any data warehouse is directly proportional to the amount of end user participation! A data warehouse cannot be successful without active participate on the part of the target users. Period. If you do not have user participation, you will find yourself in a situation where you will build it and hope that they will come. If there is no serious user participation in a data warehouse project, you have to seriously question whether or not the organization truly needs a data warehouse.
3. No Dedicated DBA In many situations the lack of a dedicated database administrator (DBA) has prevented a data warehouse project to be complete 1) on time, or 2) successfully. “Borrowing” a DBA from the operational “pool” will only result in questions about the nature of the data warehouse data models and database design. It’s too flat, not normalized properly, too much redundancy, and other criticisms are well suited for an operational system’s database design, but not a data warehouse. Considering that “data” is the first word in “data warehouse”, be sure you have a dedicated database administration resource committed to this important project.
2. No Meta Data Meta data is like documentation and training: Everyone knows it is necessary, but it usually gets dropped somewhere along the route to implementation. For the data warehouse, meta data is more important than just your typical documentation. Remember, in order to turn data into information you have to have the data, know that you have it, be able to access it, and trust it. Meta data is the means by which the users will be able to understand and trust the data. A time-variant record of where data came from, what happened to it along the way, where it is in the data warehouse, and what vehicles exist to access it will spell the difference between success and frustration.