what is a data warehouse and why are so many schools setting them up l.
Skip this Video
Loading SlideShow in 5 Seconds..
What is a Data Warehouse? And Why Are So Many Schools Setting Them Up? PowerPoint Presentation
Download Presentation
What is a Data Warehouse? And Why Are So Many Schools Setting Them Up?

Loading in 2 Seconds...

play fullscreen
1 / 37

What is a Data Warehouse? And Why Are So Many Schools Setting Them Up? - PowerPoint PPT Presentation

  • Uploaded on

What is a Data Warehouse? And Why Are So Many Schools Setting Them Up?. Richard Goerwitz. What Is a Data Warehouse?. Nobody can agree So I’m not actually going to define a DW Don’t feel cheated, though By the end of this talk, you’ll

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'What is a Data Warehouse? And Why Are So Many Schools Setting Them Up?' - JasminFlorian

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
what is a data warehouse
What Is a Data Warehouse?
  • Nobody can agree
  • So I’m not actually going to define a DW
  • Don’t feel cheated, though
  • By the end of this talk, you’ll
    • Understand key concepts that underlie all warehouse implementations (“talk the talk”)
    • Understand the various components out of which DW architects construct real-world data warehouses
    • Understand what a data warehouse project looks like
why are schools setting up data warehouses
Why Are Schools Setting UpData Warehouses?
  • A data warehouse makes it easier to:
    • Optimize classroom, computer lab usage
    • Refine admissions ratings systems
    • Forecast future demand for courses, majors
    • Tie private spreadsheet data into central repositories
    • Correlate admissions and IR data with outcomes such as:
      • GPAs
      • Placement rates
      • Happiness, as measured by alumni surveys
    • Notify advisors when extra help may be needed based on
      • Admissions data (student vitals; SAT, etc.)
      • Special events: A-student suddenly gets a C in his/her major
      • Slower trends: Student’s GPA falls for > 2 semesters/terms
    • (Many other examples could be given!)
  • Better information = better decisions
    • Better admission decisions
    • Better retention rates
    • More effective fund raising, etc.
talking the talk
Talking The Talk
  • To think and communicate usefully about data warehouses you’ll need to understand a set of common terms and concepts:
    • OLTP
    • ODS
    • ETL
    • Star schema
    • Conformed dimension
    • Data mart
    • Cube
    • Metadata
  • Even if you’re not an IT person, pay heed:
    • You’ll have to communicate with IT people
    • More importantly:

Evidence shows that IT will only build a successful warehouse if you are intimately involved!

  • OLTP = online transaction processing
  • The process of moving data around to handle day-to-day affairs
    • Scheduling classes
    • Registering students
    • Tracking benefits
    • Recording payments, etc.
  • Systems supporting this kind of activity are called transactional systems
transactional systems
Transactional Systems
  • Transactional systems are optimized primarily for the here and now
    • Can support many simultaneous users
    • Can support heavy read/write access
    • Allow for constant change
    • Are big, ugly, and often don’t give people the data they want
      • As a result a lot of data ends up in shadow databases
      • Some ends up locked away in private spreadsheets
  • Transactional systems don’t record all previous data states
  • Lots of data gets thrown away or archived, e.g.:
    • Admissions data
    • Enrollment data
    • Asset tracking data (“How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?”)
simple transactional database
Simple Transactional Database
  • Map of Microsoft Windows Update Service (WUS) back-end database
    • Diagrammed using Sybase PowerDesigner
      • Each green box is a database “table”
      • Arrows are “joins” or foreign keys
      • This is simple for an OLTP back end
more complex example
More Complex Example
  • Recruitment Plus back-end database
  • Used by many admissions offices
  • Note again:
    • Green boxes are tables
    • Lines are foreign key relationships
    • Purple boxes are views
  • Considerable expertise is required to report off this database!
  • Imagine what it’s like for even more complex systems
    • Colleague
    • SCT Banner (over 4,000 tables)
the reporting problem
The “Reporting Problem”
  • Often we require OLTP data as a snapshot, in a spreadsheet or report
  • Reports require querying back-end OLTP support databases
  • But OLTP databases are often very complex, and typically
    • Contain many, often obscure, tables
    • Utilize cryptic, unintuitive field/column names
    • Don’t store all necessary historical data
  • As a result, reporting becomes a problem –
    • Requires special expertise
    • May require modifications to production OLTP systems
    • Becomes harder and harder for staff to keep up!
  • Ways of working around the reporting problem include:
    • Have OLTP system vendors do the work
      • Provide canned reports
      • Write reporting GUIs for their products
    • Hire more specialists
      • To create simplified views of OLTP data
      • To write reports, create snapshots
    • Periodically copy data from OLTP systems to a place where
      • The data is easier to understand
      • The data is optimized for reporting
      • Easily pluggable into reporting tools
  • ODS = operational data store
  • ODSs were an early workaround to the “reporting problem”
  • To create an ODS you
    • Build a separate/simplified version of an OLTP system
    • Periodically copy data into it from the live OLTP system
    • Hook it to operational reporting tools
  • An ODS can be an integration point or real-time “reporting database” for an operational system
  • It’s not enough for full enterprise-level, cross-database analytical processing
  • OLAP = online analytical processing
  • OLAP is the process of creating and summarizing historical, multidimensional data
    • To help users understand the data better
    • Provide a basis for informed decisions
    • Allow users to manipulate and explore data themselves, easily and intuitively
  • More than just “reporting”
  • Reporting is just one (static) product of OLAP
olap support databases
OLAP Support Databases
  • OLAP systems require support databases
  • These databases typically
    • Support fewer simultaneous users than OLTP back ends
    • Are structured simply; i.e., denormalized
    • Can grow large
      • Hold snapshots of data in OLTP systems
      • Provide history/time depth to our analyses
    • Are optimized for read (not write) access
    • Updated via periodic batch (e.g., nightly) ETL processes
etl processes
ETL Processes
  • ETL = extract, transform, load
    • Extract data from various sources
    • Transform and clean the data from those sources
    • Load the data into databases used for analysis and reporting
  • ETL processes are coded in various ways
    • By hand in SQL, UniBASIC, etc.
    • Using more general programming languages
    • In semi-automated fashion using specialized ETL tools like Cognos Decision Stream
  • Most institutions do hand ETL; but note well:
    • Hand ETL is slow
    • Requires specialized knowledge
    • Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
where does the data go
Where Does the Data Go?
  • What sort of a database do the ETL processes dump data into?
  • Typically, into very simple table structures
  • These table structures are:
    • Denormalized
    • Minimally branched/hierarchized
    • Structured into star schemas
so what are star schemas
So What Are Star Schemas?
  • Star schemas are collections of data arranged into star-like patterns
    • They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs)
    • Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors)
    • For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)
  • Star schemas should
    • Have descriptive column/field labels
    • Be easy for users to understand
    • Perform well on queries
a very simple star schema
A Very Simple Star Schema

Data Center UPS

Power Output









a more complex star schema
A More Complex Star Schema
  • Freshman survey data (HERI/CIRP)
  • Dimensions:
    • Questions
    • Survey years
    • Data about test takers
  • Facts:
    • Answer (text)
    • Answer (raw)
    • Count (1)
  • Oops
    • Not a star
    • Snowflaked!

Oops, answers should have been placed in their own dimension (creating a “factless fact table”). I’ll demo a better version of this star later!

data marts
Data Marts
  • One definition:
    • One or more star schemas that present data on a single or related set of business processes
  • Data marts should not be built in isolation
  • They need to be connected via dimensional tables that are
    • The same or subsets of each other
    • Hierarchized the same way internally
  • So, e.g., if I construct data marts for…
    • GPA trends, student major trends, enrollments
    • Freshman survey data, senior survey data, etc.
  • …I connect these marts via a conformed student dimension
    • Makes correlation of data across star schemas intuitive
    • Makes it easier for OLAP tools to use the data
    • Allows nonspecialists to do much of the work
simple data mart example
Simple Data Mart Example


Battery star

By battery


% charged


Input star

By phase



Output star

By phase



Sensor star

By sensor



Note conformed date, time dimensions!

cirp star data mart
CIRP Star/Data Mart
  • CIRP Freshman survey data
  • Corrected from a previous slide
  • Note the CirpAnswer dimension
  • Note student dimension (ties in with other marts)
rolap molap
  • ROLAP = OLAP via direct relational query
    • E.g., against a (materialized) view
    • Against star schemas in a warehouse
  • MOLAP = OLAP via multidimensional database (MDB)
    • MDB is a special kind of database
    • Treats data kind of like a big, fast spreadsheet
    • MDBs typically draw data in from a data warehouse
      • Built to work best with star schemas
data cubes
Data Cubes
  • The term data cube means different things to different people
  • Various definitions:
    • A star schema
    • Any DB view used for reporting
    • A three-dimensional array in a MDB
    • Any multidimensional MDB array (really a hypercube)
  • Which definition do you suppose is technically correct?
  • Metadata = data about data
  • In a data warehousing context it can mean many things
    • Information on data in source OLTP systems
    • Information on ETL jobs and what they do to the data
    • Information on data in marts/star schemas
    • Documentation in OLAP tools on the data they manipulate
  • Many institutions make metadata available via data malls or warehouse portals, e.g.:
    • University of New Mexico
    • UC Davis
    • Rensselear Polytechnic Institute
    • University of Illinois
  • Good ETL tools automate the setup of malls/portals!
the data warehouse
The Data Warehouse
  • OK now we’re experts in terms like OLTP, OLAP, star schema, metadata, etc.
  • Let’s use some of these terms to describe how a DW works:
    • Provides ample metadata – data about the data
    • Utilizes easy-to-understand column/field names
    • Feeds multidimensional databases (MDBs)
    • Is updated via periodic (mainly nightly) ETL jobs
    • Presents data in a simplified, denormalized form
    • Utilizes star-like fact/dimension table schemas
    • Encompasses multiple, smaller data “marts”
    • Supports OLAP tools (Access/Excel, Safari, Cognos BI)
    • Derives data from (multiple) back-end OLTP systems
    • Houses historical data, and can grow very big
a data warehouse is not
A Data Warehouse is Not…
  • Vendor and consultant proclamations aside, a data warehouse is not:
    • A project
      • With a specific end date
    • A product you buy from a vendor
      • Like an ODS (such as SCT’s)
      • A canned “warehouse” supplied by iStrategy
      • Cognos ReportNet
    • A database schema or instance
      • Like Oracle
      • SQL Server
    • A cut-down version of your live transactional database
kimball caserta s definition
Kimball & Caserta’s Definition
  • According to Ralph Kimball and Joe Caserta, a data warehouse is:

A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.

  • Another def.: The union of all the enterprise’s data marts
  • Aside: The Kimball model is not without some critics:
    • E.g., Bill Inmon
example data warehouse 1
Example Data Warehouse (1)
  • This one is RPI’s
  • 5 parts:
    • Sources
    • ETL stuff
    • DW proper
    • Cubes etc.
    • OLAP apps
example data warehouse 2
Example Data Warehouse (2)
  • Caltech’s DW
  • Five Parts:
    • Source systems
    • ETL processes
    • Data marts
    • FM/metadata
    • Reporting and analysis tools
    • Note: They’re also customers of Cognos!
so where is colorado college
So Where is Colorado College?
  • Phil Goldstein (Educause Center for Applied Research fellow) identifies the major deployment levels:
    • Level 1: Transactional systems only
    • Level 2a: ODS or single data mart; no ETL
    • Level 2: ODS or single data mart with ETL tools
    • Level 3a: Warehouse or multiple marts; no ETL; OLAP
    • Level 3b: Warehouse or multiple marts; ETL; OLAP
    • Level 3: Enterprise-wide warehouse or multiple marts; ETL tools; OLAP tools
  • Goldstein’s study was just released in late 2005
  • It’s very good; based on real survey data
  • Which level is Colorado College at?
implementing a data warehouse
Implementing a Data Warehouse
  • In many organizations IT people want to huddle and work out a warehousing plan, but in fact
    • The purpose of a DW is decision support
    • The primary audience of a DW is therefore College decision makers
    • It is College decision makers therefore who must determine
      • Scope
      • Priority
      • Resources
  • Decision makers can’t make these determinations without an understanding of data warehouses
  • It is therefore imperative that key decision makers first be educated about data warehouses
    • Once this occurs, it is possible to
      • Elicit requirements (a critical step that’s often skipped)
      • Determine priorities/scope
      • Formulate a budget
      • Create a plan and timeline, with real milestones and deliverables!
is this really a good plan
Is This Really a Good Plan?
  • Sure, according to Phil Goldstein (Educause Center for Applied Research)
  • He’s conducted extensive surveys on “academic analytics” (= business intelligence for higher ed)
  • His four recommendations for improving analytics:
    • Key decisionmakers must lead the way
    • Technologists must collaborate
      • Must collect requirements
      • Must form strong partnerships with functional sponsors
    • IT must build the needed infrastructure
      • Carleton violated this rule with Cognos BI
      • As we discovered, without an ETL/warehouse infrastructure, success with OLAP is elusive
    • Staff must train and develop deep analysis skills
  • Goldstein’s findings mirror closely the advice of industry heavyweights – Ralph Kimball, Laura Reeves, Margie Ross, Warren Thornthwaite, etc.
isn t a dw a huge undertaking
Isn’t a DW a Huge Undertaking?
  • Sure, it can be huge
  • Don’t hold on too tightly to the big-sounding word, “warehouse”
  • Luminaries like Ralph Kimball have shown that a data warehouse can be built incrementally
    • Can start with just a few data marts
    • Targeted consulting help will ensure proper, extensible architecture and tool selection
what takes up the most time
What Takes Up the Most Time?
  • You may be surprised to learn what DW step takes the most time
  • Try guessing which:
    • Hardware
    • Physical database setup
    • Database design
    • ETL
    • OLAP setup

Acc. to Kimball & Caserta, ETL will eat up 70% of the time.Other analysts give estimates ranging from 50% to 80%.

The most often underestimated part of the warehouse project!

  • Information is held in transactional systems
    • But transactional systems are complex
    • They don’t talk to each other well; each is a silo
    • They require specially trained people to report off of
  • For normal people to explore institutional data, data in transactional systems needs to be
    • Renormalized as star schemas
    • Moved to a system optimized for analysis
    • Merged into a unified whole in a data warehouse
  • Note: This process must be led by “customers”
    • Yes, IT people must build the infrastructure
    • But IT people aren’t the main customers
  • So who are the customers?
    • Admissions officers trying to make good admission decisions
    • Student counselors trying to find/help students at risk
    • Development offers raising funds that support the College
    • Alumni affairs people trying to manage volunteers
    • Faculty deans trying to right-size departments
    • IT people managing software/hardware assets, etc….