1 / 47

ISQS 6339, Business Intelligence Data Warehousing

ISQS 6339, Business Intelligence Data Warehousing . Zhangxi Lin Texas Tech University. 1. Outlines. So far students should have learned Basic concepts of business intelligence The definition and importance of data warehouse In this lecture, the following topics will be covered

toviel
Download Presentation

ISQS 6339, Business Intelligence Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ISQS 6339, Business IntelligenceData Warehousing Zhangxi Lin Texas Tech University 1

  2. Outlines • So far students should have learned • Basic concepts of business intelligence • The definition and importance of data warehouse • In this lecture, the following topics will be covered • SQL Server 2008 data mart case study • How to access data in a network directory • How to access SQL Server 2008 on the Citrix Server • How to load data from an Excel file to a database • Data warehouse overview • Data warehouse architecture • Data integration ISQS 6347, Data & Text Mining

  3. Data Warehousing Definitions and Concepts • Data warehouse • Video – Overview of data warehouse 2’38” A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format • Benefits of data warehouse 3’18”

  4. Data mart • Definition A localized data warehouse that stores only relevant data to a department or event an individual • Dependent data mart A subset that is created directly from a data warehouse • Independent data mart A small data warehouse designed for a strategic business unit or a department

  5. Data Mart- The IMW Case IMW, standing for Internet Media Works!, is an ASP in real estate information services. It is headquartered in Austin, Texas. CEO is Gary Anderson. Web page: http://www.inetworks.com

  6. Why need Data Mart? • Data mart complements the centralized data warehousing based on UDM model, for the situations where UDM cannot be used • Legacy databases • Data are from nondatabase sources • No physical connection the centralized data warehouse • Data are not clean

  7. Data Mart Structures • Fact tables • Measures • Dimension tables • Dimensions and Hierarchies • Attributes (or columns) • Dimensional modeling – Stars and Snowflakes

  8. Measures • A numeric quantity expressing some of the organization’s performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. • A measure is also called a fact • The table holding measure information is called as a fact table • Dimensions vs. Measures 2’38”

  9. Commrex Real Estate Operational Database • Users: property listors, webmaster, marketing manager of IMW • Objective: Encourage realtors to use the online ASP services with the best information services to increase IMW’s revenue. • Value Chain • Listors create their account • Listors post their real estate properties to the web-based database services and pay listing fees • Property buyers search the website-based database and buy properties from listors. This is the incentive for listors to use the ASP services • Business Processes • Listor sign up • Listor account management • Property data posting • Property search • Property database maintenance 9 9

  10. IMW’s Database ERD Model Property Listing Database Membership Database TransactionID M:1 Property ID UserID M:1 Listor ID Listor ID PropID Listor Name    M:M Property Type Property Type    Type Name Address Company ID City    Subtype 1 Chapter    Subtype 2 UpdateDate M:M Functions    Feature Specializations Subtype n          Company ID Legends Comp Name Primary Key    Address Secondary Key    Telephone # Link to a table    ISQS 6339, Data Mgmt & BI, Zhangxi Lin 10

  11. Commrex Data Warehousing • Users: CEO of IMW, IMW business analyst, IMW marketing manager • Analytic themes • Fast retrieval of business key performance indicators (KPIs) • Decision making on business promotions • Applications • Geographic distribution of property listings • Scorecard for main performance indicators • Dashboard • Questions • How to model data warehouse? • What are required in data transformation and preprocessing? • Any missing dimension for data ware housing? • How to perform routine data warehouse updates – frequency, timing, etc.

  12. IMW’s Data Warehouse Dimensional Model Property Listing Fact Membership Dimension Property Type Dimension Property ID Listor ID Listor ID Listor Name    PropType PropType    SubName Address Company ID … City    Chapter    UpdateDate Functions Features Specializations       Year Company ID Quarter Comp Name Legends Month Company Dimension    Primary Key Address Date       Secondary Key Telephone #    Link to a table ISQS 6339, Data Mgmt & BI, Zhangxi Lin 12

  13. Data Warehouse Overview

  14. * Opening Vignette - DirectTV • 13,000 employees, 2008 revenue $20billion, 50 million subscribers • Problem: the daily updated data warehouse did meet the business coping with high transactional data volume • Solution: using15-minutes update data warehouse, and GoldenGate integration system • Results: • Churn is reduced • Better targeted sales • Better service quality • Real-time order management and fraud detection

  15. Data Warehousing Characteristics • Basic characteristics of data warehousing • Subject oriented • Integrated • Time variant (time series) • Nonvolatile (not allow to change) • Others • Web based • Relational/multidimensional • Client/server • Real-time • Include metadata

  16. Data Warehousing Process Overview • Data in DW are constantly accumulated. • Organizations continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems • The number of users is constantly increasing. • The number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet • The organization using data warehouse relied on DW more and more

  17. Data Warehousing More Concepts • Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files • Enterprise data warehouse (EDW) A large-scale data warehouse used across the enterprise for decision support. It integrates different sources of information into a consolidated information system. • Metadata (Video 1’41”) Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use • Syntactic metadata, structural metadata, and semantic metadata

  18. Data Warehousing Process Overview

  19. Data Warehousing Process Overview • The major components of a data warehousing process • Data sources • Data extraction • Data loading • Comprehensive database • Metadata • Middleware tools

  20. Data Warehouse Architectures

  21. Three Parts of Data Warehouse • The data warehouse that contains the data and associated software • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse • Client (front-end) software that allows users to access and analyze data from the warehouse

  22. Three-Tier Data Warehouse

  23. Alternative Data Warehouse Architectures (1)

  24. Alternative Data Warehouse Architectures (2)

  25. Alternative Data Warehouse Architectures (3)

  26. Alternative Data Warehouse Architectures (4)

  27. Alternative Data Warehouse Architectures (5)

  28. Architectures Comparison

  29. Teradata’s EDW

  30. Hadoop – for BI in the Cloud era • Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. • Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. • Hadoop was inspired by Google's MapReduce, a software framework in which anapplication is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.

  31. Apache Hadoop •  The Apache Hadoop framework is composed of the following modules : • Hadoop Common - contains libraries and utilities needed by other Hadoop modules • Hadoop Distributed File System (HDFS). • Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications. • Hadoop MapReduce - a programming model for large scale data processing. ISQS 6339, Data Mgmt & BI

  32. MapReduce MapReduce is a framework for processing parallelizable  problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster  or a grid. 

  33. How Hadoop Operates ISQS 6339, Data Mgmt & BI

  34. Hadoop 2: Big data's big leap forward • The new Hadoop is the Apache Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed. • The biggest constraint on scale has been Hadoop’s job handling. All jobs in Hadoop are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck. • Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node. ISQS 6339, Data Mgmt & BI

  35. MapReduce 2.0 – YARN(Yet Another Resource Negotiator) ISQS 6339, Data Mgmt & BI

  36. Teradata Big Data Platform 林漳希 @清华大学

  37. Dell representation of the Hadoop ecosystem

  38. Nokia’s Big Data Architechture 林漳希 @清华大学

  39. Comparison between big data platform and traditional BI platform 林漳希 @清华大学

  40. Resolving legacy problem– Dual platform 林漳希 @清华大学

  41. Ten factors that potentially affect the architecture selection decision 1. Information interdependence between organizational units 2. Upper management’s information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. Constraints on resources 6. Strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues 10. Social/political factors

  42. Data Integration

  43. Data Integration • Integration that comprises three major processes: • data access, • data federation, and • change capture. • When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments • ETL Tools 4’56”

  44. Data Integration • Enterprise application integration (EAI) A technology thatprovides a vehicle for pushing data from source systems into a data warehouse, including application functionality integration. Recently service-oriented architecture (SOA) is applied • Enterprise information integration (EII) An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases • Extraction, transformation, and load (ETL) A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse)

  45. Transformation Tools: To purchase or to Build in-House • Issues affect whether an organization will purchase data transformation tools or build the transformation process itself • Data transformation tools are expensive • Data transformation tools may have a long learning curve • It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools • Important criteria in selecting an ETL tool • Ability to read from and write to an unlimited number of data source architectures • Automatic capturing and delivery of metadata • A history of conforming to open standards • An easy-to-use interface for the developer and the functional user

  46. Structure and Components of Business Intelligence MS SQL Server 2008 SSMS SSIS SSAS BIDS SSRS SAS EG SAS EM

  47. Exercise 1 – Walk through data warehousing process • Learning Objectives • To gain a general impression how to use SQL Server 2008 to implement a data mart • Tasks • Create your database with SSMS, named as ISQS6339_lastname • Import data from Commrex_2011.xls • Use SSMS to create a ERD diagram • Create a SSAS project using BIDS • Define data source, data source view, and cube • Deliverable: • One-page printout of the screenshot of the cube diagram 47

More Related