Topics to be covered in this lecture:- What is Data Warehouse History of data warehouse

Topics to be covered in this lecture:- • What is Data Warehouse • History of data warehouse • Data warehouse benefits • Data Warehouse Layers • Data Warehouse Architecture

What is Data Warehouse One of the most important approaches to the integration of data sources is based on a data warehouse architecture. In this architecture, data coming from multiple external data sources (EDSs) are extracted, filtered, merged, and stored in a central repository, called a data warehouse (DW). Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

The terms in the data warehousing definition above are explained as below: • Subject oriented: Data in an organization is organized in major objects or business processes. The common example of subject oriented data are customer, product, vendor and sale transaction. • Integrated: Data warehouse integrates data from various sources across departments within organization. Different data sources can have different ways to define a specific object for example product. However in a data warehouse there must be only one definition of products. This is achieved by using name conflict resolution in data warehouse. And we when we achieve this we say the data is integrated. • Nonvolatile: Data in data warehouse is not subject to change. Once placed in a data warehouse, data is not updated. • Time-variant: Data in data warehouse associates with time. The time can be a single moment or span of time. In any cases, the data is only relevant to time. • Support management’s decision making process the outcomes of a data warehouse are helping making decision based on historical data or facts. Then from business decisions, business processes can be optimized to increase the efficiency and effectiveness.

History of data warehouse Data warehouse were developed in the late 1980s to meet growing demands for data analysis and information management that could not be achieved by operational systems. Because the operational systems were designed in such as way that optimize for transactions only and number of operational or transaction systems were growing quickly across departments inside an organization that make the data integration more difficult. This created problems of data redundancy, data integration, analysis and performance in reporting. As a result, a separated system called data warehouse is designed to solved those problems. Data warehouse system can bring data from various source systems such as relational data management systems, flat files, spreadsheets, even remote data sources outside organization. This data then is organized in such a way that optimized for reporting purposes. User-friendly reporting tools provided by data warehouse system enable business users and decision makers to access data in the form of useful information with ease of use.

Data warehouse benefits • There are a lot of benefits that data warehouse brings to organizations: • Keep history data for analyzing even if the source systems do not maintain historical data. • Allow center point of accessing data across enterprises. • Improve data quality by cleansing and transforming data when loading it into data warehouse. • Give business user or decision makers “a single version of truth” or information is presented consistently. • Provide information instead of data to business users and decision makers. • Provide optimized query performance without impacting the operational systems. • Provide information to improve the business processes.

Data Warehouse Layers • Different data warehousing systems have different structures. Some may have an (Operational Data Store), while some may have (Multiple Data Marts). Some may have a small number of data sources, while some may have dozens of data sources. • In general, all data warehouse systems have the following layers: • Data Source Layer • Data Extraction Layer • Staging Area • ETL (Extraction, Transformation, Loading) Layer • Data Storage Layer • Data Logic Layer • Data Presentation Layer • Metadata Layer • System Operations Layer

The picture below shows the relationships among the different components of the data warehouse architecture:

Data Source Layer • This represents the different data sources that feed data into the data warehouse. The data source can be of any format -- plain text file, relational database, other types of database, Excel file, etc., can all act as a data source. • Many different types of data can be a data source: • Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data. • Web server logs with user browsing data. • Internal market research data. • Third-party data, such as census data, demographics data, or survey data. • All these data sources together form the Data Source Layer.

Data Extraction Layer Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing. Staging Area This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration. ETL (Extraction, Transformation, Loading) Layer This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. Data Storage Layer This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

Data Logic Layer This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but do affect what the report looks like. Data Presentation Layer This refers to the information that reaches the users. This can be in a form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent everyday. Usually an OLAP tool and/or a reporting tool is used in this layer. OLAP short for Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. Metadata Layer This is where information about the data stored in the data warehouse system is stored. A metadata tool is often to used to manage metadata. System Operations Layer This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.

Data Warehouse Architecture Is the key factor in building a good data warehouse for your business. Choosing the most suitable data warehouse architecture is a critical task in data warehouse life-cycle. In this section, we will discuss about the most popular data warehouse architectures such as dimensional data warehouse and enterprise data warehouse in a practical approach so you can make a decision to design a good data warehouse architecture for your business.

Popular Data Warehouse Architectures Enterprise data warehouse architecture -In the Bill Inmon’scorporate information factory (CIF) architecture or enterprise data warehouse architecture, the information from various source systems are consolidated into a central repository called an enterprise data warehouse. Dimensional data warehouse architecture - In Ralph Kimball’s data warehouse architecture, data is brought from throughout the enterprise into a central place called dimensional data warehouse. Like Inmon’s data warehouse architecture, the dimensional data warehouse also has enterprise focus. In Kimball’s data warehouse architecture, data mart is a subset of the tables linking together using star and snowflake schema. Unlike Inmon’s enterprise data warehouse architecture, analytic systems can access data directly from dimensional data warehouse. Federated data warehouse architecture - Federated data warehouse architecture provides an effective and practical approach for building a new data warehouse in an heterogeneous environment by integrating legacy data warehouse and business intelligence systems together. Data MartIn this architecture, data mart concept is just a logical distinction. The data mart is a subject area within the dimensional data warehouse.

1) Ralph Kimball Data Warehouse Architecture • Summary: Ralph Kimball data warehouse architecture which is known as dimensional data warehouse architecture. Ralph Kimball Data Warehouse Architecture

We will examine the elements of Ralph Kimball data warehouse architecture in detail: • Transaction applications are the operational systems created to capture business transactions. The data of transaction system usually stored in relational databases. • ETL. To bring data from transaction system in variety forms. ETL stands for extract, transform and load. The data in different formats is standardized and converted into format that ready to load into data warehouse. • Dimensional data warehouse is the heart of data warehouse architecture. The dimensional data warehouse contains enterprise data in high granular format. While Bill Inmon’s data warehouse architecture using ER modeling, dimensional data warehouse is designed using dimensional modeling. It means dimensional data warehouse consists of star schema or cubes. The analytic systems or reporting tools can access data from dimensional data warehouse directly. • Data mart. In this architecture, data mart concept is just a logical distinction. The data mart is a subject area within the dimensional data warehouse.

2) Bill Inmon Data Warehouse Architecture • In order to understand the Inmon’s data architecture, you can look at the picture from the left to right. There are some elements as follows: Bill Inmon’s Data Warehouse Architecture – CIF

Corporate Applications: These applications are operational systems, or transaction systems, that develops to support business. • Transaction systems are used to collect data from business transactions such as sales, marketing, material managements…etc and stored those data in various forms including relational data, hierarchical data or even spreadsheets. • ETL Processes. To bring data from transaction system, a process called ETL is used. ETL stands for extract, transform and load. ETL process consolidates data, transform it into a specific standard format and load it into a single repository called enterprise data warehouse, or EDW. ETL processes can run as a batch process periodically or a transaction-based for near real time data. ETL process is referred as data integration or data services. • Enterprise data warehouse is a central element in the Inmon’s data warehouse architecture, enterprise data warehouse is an integrated repository of atomic data. Data in the enterprise data warehouse is captured at a very lowest level of detail. Data in the enterprise data warehouse is stored in relational database and uses third normal database design. • Data marts are departmental views of information with subject oriented data. Data marts takes data from enterprise data warehouse. Aggregations can take place when data brings from enterprise data warehouse to data marts.

Kimball vs. Inmon Data Warehouse Architectures • Both Kimball and Inmon’s architectures share a same common feature that each has a single integrated repository of atomic data. In Inmon’s architecture, it is called enterprise data warehouse. And in Kimball’s architecture, it is known as dimensional data warehouse. Both architectures have an enterprise focus that supports information analysis across the organization. This approach enables to address the business requirements not only within a subject area but also across subject areas. • However there are some differences in the data warehouse architectures of both experts: • Kimball uses dimensional model such as star schemas or snowflakes to organize the data in dimensional data warehouse while Inmon uses ER model in enterprise data warehouse. Inmon only uses dimensional model for data marts only while Kimball uses it for all data • Inmon uses data marts as physical separation from enterprise data warehouse and they are built for departmental uses. While in Kimball’s architecture, it is unnecessary to separate the data marts from the dimensional data warehouse. • In dimensional data warehouse of Kimball, analytic systems can access data directly. While in Inmon’s architecture, analytic systems can only access data in enterprise data warehouse via data marts.

How to choose between Kimball vs. Inmon approach for building data warehouse? Here are the most important criteria how to choose between Kimball vsInmon approach. KPIs are commonly used by an organization to evaluate its success

3) Federated Data Warehouse Architecture • Introduction to federated data warehouse • Nowadays, corporate usually has a set of heterogeneous system landscape that contain transaction systems and business intelligence tools which provides analytical capabilities for each individual department needs. Each department views a business model under their own perspective. • For example a product in Sale can be defined as a material in Manufacturing and Equipment in service management. • In order to integrate those heterogeneous systems that aim to provide analytic capabilities across the different functions and departments, federated data warehouse was invented. • Federated data warehouse is a practical approach to achieve the “single version of the truth” across the organization. Federated data warehouse is used to integrate key business measures and dimensions. The foundations of federated data warehouse are common business model and common staging area.

Architecture of federated data warehouse • Regional federation possible in federated data warehouse • Big organization has various regions that provide businesses to customers of that region. Different regional data warehouses were built for each region to meet the specific business needs. A global data warehouse also was built to provide analytical capabilities to executive at global level, there are two data flows between regional and global data warehouses: • The difference between the regional and global data warehouse system is the nature of data resided in each system level. • Upward federation – only fact data are moved from regional data warehouse to global data warehouse. The aggregation of data can take place at global data warehouse after data integrated or during data movement. • Downward federation – in downward federation, the reference flows from global to regional level. This ensures the consistency and integrity of data across organization. Transactional data from corporate operational systems such as ERP, CRM… are sourced at global level and then extracted, transformed and loaded into respective regional data warehouse.

Functional federation possible in federated data warehouse Functional federated data warehouse is used when the organizations has different data warehouses system was built for specific applications such as ERP, CRM or subject specific. The components of functional federated data warehouse architecture includes data marts, custom built data warehouses, ETL tools, cross function reporting systems, real time data store and reporting.

Benefits of federated data warehouse • Ease of implementation – Federated data warehouse integrated all legacy data warehouses, business intelligence systems into a newer system that provides analytical capabilities across function. Federated data warehouse data not try to rebuild a new system which potentially causes major point of conflict. • Shorter implementation time – By integrating all legacy BI systems, federated data warehouse approach has a shorter implementation time in compare with lengthy processing of building enterprise data warehouse.

4) Data Mart • What is a Data Mart? • A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. • A data mart is a set of subject areas organized for decision making support based on specific needs of a group of business users or department. • There are two types of data marts: independent or stand-alone data mart and dependent data mart.

Stand-alone data mart A Stand-alone data mart focuses exclusively on one subject area and it is not designed in an enterprise context. For example, manufacturing has their data mart, human resources has their, finance has their and so on. stand-alone data mart gets data from multiple transaction systems in one subject area or department to support specific business needs. stand-alone data mart may use dimensional design or entity-relationship model. Business intelligence tools query data directly from data mart and present information to user. Stand-alone data mart takes very short time to build and bring the visible result to specific departments with less cost. However if you look at the whole system landscape where multiple data marts exist, you will see that different ETL tools need to built for different transaction systems in different technologies and the data is duplicate in several data marts. From business perspective, each data mart is built to address a set of specific business needs, what if the needs expand? And what if you want to analyze data across function or department? The inconsistent data, such as definition of product, will make the information comparison between departments impossible.

Dependent data mart • According to Bill Inmon, a dependent data mart is a place where its data comes from a data warehouse. Data in a data warehouse is aggregated, restructured, and summarized when it passes into the dependent data mart. • There are several benefits of building a dependent data mart: • Performance: when performance of a data warehouse becomes an issue, build one or two dependent data marts can solve the problem. Because the data processing is performed outside the data warehouse. • Security: by putting data outside data warehouse in dependent data marts, each department owns their data and has complete control over their data. • KPI tracking: dependent data marts are ideal places for building and tracking KPIs over long period of time.

Topics to be covered in this lecture:- What is Data Warehouse History of data warehouse