cs 345 topics in data warehousing n.
Skip this Video
Loading SlideShow in 5 Seconds..
CS 345: Topics in Data Warehousing PowerPoint Presentation
Download Presentation
CS 345: Topics in Data Warehousing

Loading in 2 Seconds...

play fullscreen
1 / 19

CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

  • Uploaded on

CS 345: Topics in Data Warehousing. Tuesday, September 28, 2004. Outline of Today’s Class. What is data warehousing? Transaction processing vs. data analysis Course logistics Data integration. A Brief History of Information Technology. The “dark ages”: paper forms in file cabinets

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about 'CS 345: Topics in Data Warehousing' - trey

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
cs 345 topics in data warehousing

CS 345:Topics in Data Warehousing

Tuesday, September 28, 2004

outline of today s class
Outline of Today’s Class
  • What is data warehousing?
  • Transaction processing vs. data analysis
  • Course logistics
  • Data integration
a brief history of information technology
A Brief History of Information Technology
  • The “dark ages”: paper forms in file cabinets
  • Computerized systems emerge
    • Initially for big projects like Social Security
    • Same functionality as old paper-based systems
  • The “golden age”: databases are everywhere
    • Most activities tracked electronically
    • Stored data provides detailed history of activity
  • The next step: use data for decision-making
    • The focus of this course!
    • Made possible by omnipresence of IT
    • Identify inefficiencies in current processes
    • Quantify likely impact of decisions
databases for decision support
Databases for Decision Support
  • 1st phase: Automating existing processes makes them more efficient.
    • Automation → Lots of well-organized, easily accessed data
  • 2nd phase: Data analysis allows for better decision-making.
    • Analyze data → better understanding
    • Better understanding → better decisions
  • “Data Entry” vs. “Thinking”
    • Data analysts are decision-makers: managers, executives, etc.
oltp vs olap
OLTP: On-Line Transaction Processing

Many short transactions (queries + updates)


Update account balance

Enroll in course

Add book to shopping cart

Queries touch small amounts of data (one record or a few records)

Updates are frequent

Concurrency is biggest performance concern

OLAP: On-Line Analytical Processing

Long transactions, complex queries


Report total sales for each department in each month

Identify top-selling books

Count classes with fewer than 10 students

Queries touch large amounts of data

Updates are infrequent

Individual queries can require lots of resources

why olap oltp don t mix 1
Why OLAP & OLTP don’t mix (1)

Different performance requirements

  • Transaction processing (OLTP):
    • Fast response time important (< 1 second)
    • Data must be up-to-date, consistent at all times
  • Data analysis (OLAP):
    • Queries can consume lots of resources
    • Can saturate CPUs and disk bandwidth
    • Operating on static “snapshot” of data usually OK
  • OLAP can “crowd out” OLTP transactions
    • Transactions are slow → unhappy users
  • Example:
    • Analysis query asks for sum of all sales
    • Acquires lock on sales table for consistency
    • New sales transaction is blocked
why olap oltp don t mix 2
Why OLAP & OLTP don’t mix (2)

Different data modeling requirements

  • Transaction processing (OLTP):
    • Normalized schema for consistency
    • Complex data models, many tables
    • Limited number of standardized queriesand updates
  • Data analysis (OLAP):
    • Simplicity of data model is important
      • Allow semi-technical users to formulate ad hocqueries
    • De-normalized schemas are common
      • Fewer joins → improved query performance
      • Fewer tables → schema is easier to understand
why olap oltp don t mix 3
Why OLAP & OLTP don’t mix (3)

Analysis requires data from many sources

  • An OLTP system targets one specific process
    • For example: ordering from an online store
  • OLAP integrates data from different processes
    • Combine sales, inventory, and purchasing data
    • Analyze experiments conducted by different labs
  • OLAP often makes use of historical data
    • Identify long-term patterns
    • Notice changes in behavior over time
  • Terminology, schemas vary across data sources
    • Integrating data from disparate sources is a major challenge
data warehouses
Data Warehouses
  • Doing OLTP and OLAP in the same database system is often impractical
    • Different performance requirements
    • Different data modeling requirements
    • Analysis queries require data from many sources
  • Solution: Build a “data warehouse”
    • Copy data from various OLTP systems
    • Optimize data organization, system tuning for OLAP
    • Transactions aren’t slowed by big analysis queries
    • Periodically refresh the data in the warehouse
course logistics
Course Logistics
  • Course web site:http://cs345.stanford.edu
  • Course format will be lecture-based
    • As opposed to a paper-reading course
  • Prerequisite:
    • Knowledge of SQL
assigned work
Assigned Work
  • Five homework assignments
    • One problem set
    • Four programming assignments
      • Not a lot of code to write
      • Emphasis will be on interacting with Oracle
  • Course project
    • Open-ended
    • Focus on a topic of your choosing
    • Any of these types:
      • Research project, or…
      • Programming project, or…
      • Survey of research literature
    • May be done individually or in groups of two
  • Final Exam
high level course outline
High-Level Course Outline
  • Logical Database Design
    • How should the data be modeled?
    • Designing the data warehouse schema
  • Query Processing
    • Analysis queries are hard to answer efficiently
    • What techniques are available to the DBMS?
  • Physical Database Design
    • How should the data be organized on disk?
    • What data structures should be used?
  • Data Mining
    • What use is all this data?
    • Which questions should we ask our data warehouse?
additional topics
Additional Topics
  • Related topics to be touched on briefly:
    • Data integration
    • Data cleaning
    • Approximate query answering
    • Data lineage
    • Data visualization
    • Incremental maintenance of materialized views
    • Answering queries using views
    • Indexing special data types (spatial, text, geographic)
    • Metadata management
  • Projects can be done in these areas
the textbook
The Textbook
  • “The Data Warehouse Toolkit”by Ralph Kimball and Margy Ross
  • Written by well-known data warehouse designer
  • Clearly written and readable
  • Lots of generic but realistic examples
  • Semi-technical (no math!)
  • Business-focused
  • We’ll use it for the first one-third of the course
  • Get the second edition!
course objectives
Course Objectives
  • Gain practical understanding of how data warehouses are built and used
  • Gain exposure to data modeling “best practices”
  • Learn techniques used to process complex queries over very large data sets
  • Understand the performance trade-offs that come from alternative data structures
  • Learn commonly-used methods for mining and analysis of large data sets
  • Become familiar with current research directions in data warehousing and related areas
loading the data warehouse
Loading the Data Warehouse

Data is periodically extracted

Data is cleansed and transformed

Users query the data warehouse

Data Staging Area

Data Warehouse

Source Systems


data integration is hard
Data Integration is Hard
  • Data warehouses combine data from multiple sources
  • Data must be translated into a consistent format
  • Data integration represents ~80% of effort for a typical data warehouse project!
  • Some reasons why it’s hard:
    • Metadata is poor or non-existent
    • Data quality is often bad
      • Missing or default values
      • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
    • Inconsistent semantics
      • What is an airline passenger?
federated databases

Rewritten Queries






Federated Databases
  • An alternative to data warehouses
  • Data warehouse
    • Create a copy of all the data
    • Execute queries against the copy
  • Federated database
    • Pull data from source systems as needed to answer queries
  • “lazy” vs. “eager” data integration







Data Warehouse

Federated Database

warehouses vs federation
Warehouses vs. Federation
  • Advantages of federated databases:
    • No redundant copying of data
    • Queries see “real-time” view of evolving data
    • More flexible security policy
  • Disadvantages of federated databases:
    • Analysis queries place extra load on transactional systems
    • Query optimization is hard to do well
    • Historical data may not be available
    • Complex “wrappers” needed to mediate between analysis server and source systems
  • Data warehouses are much more common in practice
    • Better performance
    • Lower complexity
    • Slightly out-of-date data is acceptable