1 / 43

Data Warehousing

Data Warehousing. High quality information is vital to good decision making Data is different from information Organizations have a lot of data, may not have a lot of information  information gap. Data Warehousing.

fauna
Download Presentation

Data Warehousing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Warehousing • High quality information is vital to good decision making • Data is different from information • Organizations have a lot of data, may not have a lot of information •  information gap

  2. Data Warehousing Databases, in an organization, may have been developed in a fragmented manner over time  need to reorganize, filter, .. The data before mining it Most systems are designed for operational processing, not informational processing

  3. Data Warehousing Operational Processing: the processing of daily data Information processing: analysis of summarized data in order to support decision making

  4. Data Warehousing Ex 1: WalMart: what to stock, when, .. Ex 2: Airlines: how much to overbook a flight by (forecast no-shows, ..) Ex 3: Amazon: people who bought this book also bought these books

  5. Data Warehousing A lot of transactions are done via the web today  fully automated, lots of data, real time  opportunities to mine the data

  6. In this chapter How data warehouses relate to operational systems 3-tier architecture for data warehousing Extracting data from operational systems and loading them into a data warehouse Big data / NoSql Data mining

  7. Data Warehousing: Basic Concepts A data warehouse is: subject oriented (organized around key topics) Integrated (formats, names, ..) time variant (time dimension  trends, … non-updateable (loaded by operation systems, not by end users)

  8. Data Warehousing: Brief History Improvements in database technology Advances in computer hardware (cpu, memory, storage, ..) End user computing Advances in middleware products (networks, internet, ..)

  9. Need for Data Warehousing Business require an integrated, company-wide view of high-quality information IS department must separate informational from operational systems to improve performance in managing company data

  10. Need for a company-wide view Data scattered around company Different formats Different names/synonyms Free data vs structured data Inconsistent data (?) Missing data

  11. Operational vs informational systems Operational system: designed to manage day to day operations; reservation systems, sales transactions, .. Informational systems: designed to support decision making; trends, data mining, forecasting, planning, ..

  12. Informational systems A data warehouse is centralized A data warehouse adds value to data by improving its quality and consistency A data warehouse eliminates the contention/bottleneck for resource use

  13. Data Warehouse Architectures Independent data marts Dependent data marts Logical data marts

  14. Independent data marts Figure 9-2: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated Data warehouse = collection of data marts; contains both summary and detailed data Tools for users

  15. Data Mart A data warehouse that is limited in scope, whose data is generated from either a data warehouse or other source data systems Independent data mart = data mart filled with data extracted from the operational environment, not from a data warehouse

  16. Dependent data marts Independent data marts have some limitations: Need separate process for each data mart Possible consistency issues Limited drill down capabilities Scaling costs can be high

  17. Dependent data mart A data mart filled exclusively from an enterprise data warehouse and its reconciled data EDW = Enterprise Data Warehouse = a centralized, integrated data warehouse that is the entry point and single source of data for decision support applications

  18. Dependent data marts Figure 9-3: 4 steps from left to right Data is extracted from various sources Data is transformed and integrated EDW and data marts Tools for users, pulling data from EDW and/or data marts

  19. Logical data marts and Real time Data Warehousing Figure 9-4: 3 steps from left to right Data is extracted from various sources Data is transformed and integrated and fed real-time to data marts Tools for users, pulling data from EDW and/or data marts

  20. Real time Data Warehouse An Enterprise Data Warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near real-time relays business rules to the data warehouse

  21. Real time Data Warehouse Enable real time data analysis and response Capture customer data at time of event Analyze customer behavior and possibly predict customer response Develop rules for optimizing customer interaction Take immediate action with customer

  22. Real time Data Warehouse E-commerce, abandoned shopping cart can trigger instant promotional email message Fraud detection in credit card transaction (unusual pattern detected)  triggers call to credit card owner

  23. Three-Layer Data Architecture Figure 9-5: Operational data (Operational systems) Reconciled data (EDW) Derived data (Data marts) Note: for each data, there is metadata associated with it (describes the data, ..)

  24. Data warehouse data Status vs event data Figure 9-6 Status data = data before and after event Event data = data of the event

  25. Data warehouse data Transient vs periodic data Figure 9-7 Transient data = can be overwritten  less data needed, but less information Periodic data = never altered  more data needed, but more information

  26. Transient vs periodic data Likes of a post Do we just store the number of likes (transient)? Can we prevent somebody from voting twice? Probably not Do we store every like (periodic)?  aggregate to get total

  27. The Derived Data Layer Data layer associated with physical and logical data marts (from figure 9.5) Users interact with this layer for their decision support applications

  28. Characteristics of Derived Data The source of derived data is the reconciled data Typically, derived data is aggregated (per topic, user group, ..)  it looks more like information than data

  29. The Star Schema A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for the starts schema.

  30. The Star Schema Figure 9.9: a start schema includes: Dimension Tables (descriptive data) Fact table (actual data) ER model looks like a star

  31. The Star Schema Figure 9.10: example of a star schema: Dimension Tables: product, period, store Fact table: sales Primary key in fact table (sales) is a combination of all primary keys ion the dimension tables

  32. The Star Schema A row of the product table contains information for a particular product A row of the period table contains information about a unit of time A row of the store table contains information about a store

  33. The Star Schema A row of the sales table contains information about the sales of a particular product in a particular product during a particular unit of time: unit sold, dollars sold, and dollars cost Example with actual data in figure 9.11

  34. Fact table Can be big: example Sears Assumption: 1000 stores, 10000 products, 5 years of daily data  1000 X 100000 * ( 5 * 365) rows

  35. Grain of the Fact table = level of detail of the fact table = intersection of primary keys of the dimension tables The finer the grain, the bigger the table Tradeoff between size and information quality/detail

  36. Fact table Variations If we are interested in storing events, we can have a “factless” fact table  it contains foreign key values only; no extra data Example figure 9.14: an event took place at that time, between these people, at that facility, on that topic (course)

  37. BIG DATA A database whose size strains the ability of RDBMS to capture, manage, and process data efficiently  scalability issues Large scale data gathering and analytics (particularly in web applications)  big data  challenging for traditional relational database management systems

  38. BIG DATA NewSQL  Make SQL based, relational databases more scalable NoSQL  alternative databases to SQL

  39. NoSQL databases NoSQL Stands for Not Only SQL Less functionality than SQL Higher performance (fast), more scalable than SQL

  40. NoSQL Non relational No schema Scalable Hash tables and similar data structures

  41. NoSQL No joins No constraints No complex transactions (do this but if that, rollback the transaction)

  42. NoSQL Data Structures Hash tables: Key, Value pairs Key, Document pairs (can retrieve the document via its key, then query the document) Graph

  43. Hash tables Key, Value pairs MD Maryland, CA  California, NY  New York, .. Jan  January, Feb  February, .. originalPostId  the whole thread for this original post (this value contains a lot of data)

More Related