
CS 345: Topics in Data Warehousing

Tuesday, October 19, 2004


Review of Thursday’s Class

  • Bridge tables

    • Hierarchies

    • Multi-Valued Dimensions

  • Extraction-Transformation-Load

    • Data staging area vs. data warehouse

    • Assigning surrogate keys

    • Detecting changed rows

    • Detecting duplicate dimension rows


Outline of Today’s Class

  • Database System Architecture

    • Memory management

    • Secondary storage (disk)

    • Query planning process

  • Joins

    • Nested Loop Join

    • Merge Join

    • Hash Join

  • Grouping

    • Sort vs. Hash


Memory Management

  • Modern operating systems use virtual memory

    • Address space larger than physical main memory

    • Layer of indirection between virtual addresses and physical addresses

    • Physical memory serves as a cache for most recently accessed memory locations

    • Infrequently accessed memory locations written to disk when cache fills up

  • Memory transferred in large blocks called pages

    • Disk access: latency is large relative to throughput

    • Benefit from locality of reference

    • Transfer overheads amortized over many requests


Memory Management

  • Database management systems (DBMSs) use a model similar to virtual memory

    • Data is stored persistently on disk

    • Main memory serves as cache for frequently accessed data

      • This memory cache is called the buffer pool

    • Data organized into blocks called pages

  • DBMSs override the operating system’s built-in virtual memory

    • Database operations are very data intensive

    • Careful management of disk access essential for good performance

    • DBMS has specialized knowledge of its data access patterns

    • Specialized memory management includes:

      • Prefetching pages likely to be needed in the future

      • Essential / frequently accessed pages pinned in memory

      • Multiple buffer pools with different parameters for different needs

        • Parameters: Page size, page replacement policy

        • Needs: Relations, indexes, temporary results, sort space
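
A minimal sketch of the buffer-pool idea in Python, assuming a fixed page capacity, an LRU replacement policy, and support for pinning. The class and method names are illustrative, not from any real DBMS:

    from collections import OrderedDict

    class BufferPool:
        """Toy buffer pool: caches pages in memory, evicts LRU unpinned pages."""
        def __init__(self, capacity, read_page_from_disk):
            self.capacity = capacity
            self.read_page_from_disk = read_page_from_disk  # called on a miss
            self.pages = OrderedDict()   # page_id -> page data, in LRU order
            self.pinned = set()          # page_ids that may never be evicted

        def get(self, page_id):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)   # mark as most recently used
                return self.pages[page_id]
            if len(self.pages) >= self.capacity:
                self._evict()
            self.pages[page_id] = self.read_page_from_disk(page_id)
            return self.pages[page_id]

        def pin(self, page_id):
            self.pinned.add(page_id)

        def _evict(self):
            # Evict the least recently used page that is not pinned
            for page_id in self.pages:
                if page_id not in self.pinned:
                    del self.pages[page_id]
                    return
            raise RuntimeError("all pages pinned; cannot evict")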


Disk Access Time

  • Seek cost + rotational latency + transfer cost

  • Seek cost

    • Move the disk head to the right place

    • Physical machinery

  • Rotational latency

    • Wait for the disk to spin

    • Average wait = ½ rotation

  • Transfer cost

    • Read the data while the disk spins

[Figure: disk drive diagram, labeling seek cost (head movement) and rotational latency (platter rotation)]
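
A back-of-the-envelope calculation of the three cost components. The drive parameters below are assumptions for illustration, not figures from the slides:

    # Assumed commodity-drive parameters (illustrative only)
    seek_ms = 9.0                            # average seek time
    rotation_ms = 60_000 / 7200              # one rotation at 7200 RPM ≈ 8.33 ms
    rotational_latency_ms = rotation_ms / 2  # average wait = half a rotation ≈ 4.17 ms
    transfer_ms = 64 / 1024 / 100 * 1000     # 64 KB page at 100 MB/s ≈ 0.625 ms

    access_ms = seek_ms + rotational_latency_ms + transfer_ms
    print(round(access_ms, 1))               # ≈ 13.8 ms, dominated by seek + rotation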


Random vs. Sequential I/O

  • Seek cost and rotational latency are overhead

    • Independent of the transfer size

    • ~ 20 ms on commodity disks

    • One large transfer is much cheaper than many small transfers

  • Random vs. Sequential I/O

    • I/O = Input/Output

    • Random I/O: Read/write an arbitrary page

    • Sequential I/O: Read/write the next page

    • Sequential I/O is 10-20x faster than random I/O

  • Avoiding random I/O will be a theme

    • Cost of sequential I/O is less

    • Prefetching easy when access patterns predictable
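
Continuing the assumed numbers from the disk-access sketch above, a rough model of why one large transfer beats many small ones:

    # Same assumed parameters as the disk-access sketch (illustrative)
    overhead_ms = 9.0 + 4.17     # seek + average rotational latency
    transfer_ms = 0.625          # one 64 KB page at ~100 MB/s

    pages = 10_000
    random_ms = pages * (overhead_ms + transfer_ms)    # overhead paid per page
    sequential_ms = overhead_ms + pages * transfer_ms  # overhead paid once, then stream
    print(round(random_ms / sequential_ms))            # ≈ 22: the 10-20x ballpark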


Query Planning Process

  • Many different query plans are possible

    • Logically equivalent ways of answering the same query

  • Job of the query optimizer: choose the most efficient plan for a given query

    • Most efficient = minimize resource consumption

    • Resources = CPU, disk I/O

    • Consuming fewer resources → faster query responses

  • Design of query optimizer

    • Rule-based

      • Apply heuristic transformations (e.g. push selections down)

    • Cost-based:

      • Estimate cost (resource usage) of each possible query plan

      • Choose the plan with lowest estimated cost

      • Cost estimation uses statistics collected by the optimizer

    • In practice, optimizers use a combination of the two

      • Search space of possible plans is too large to exhaustively explore

      • Heuristics guide which types of plans are considered for cost-based optimization
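
For example, the selection-pushdown heuristic rewrites σ_region='EU'(Sales ⋈ Store) into Sales ⋈ σ_region='EU'(Store), filtering Store before the join so that far fewer tuples flow into the join; cost-based optimization then prices the surviving candidate plans. (The table and column names here are illustrative.)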


[Figure: query processing pipeline. SQL query → parse → parse tree → query rewriting → best logical query plan → physical plan generation → best physical query plan → execute → result; optimizer statistics inform plan choices.]


Join Processing

  • OLAP queries involve joins

    • Fact table joins to dimension tables

    • Primary-key / Foreign-key joins

  • Choice of join algorithm has large impact on query cost

  • Naïve join processing algorithm. To join relations A and B:

    • Generate “Cartesian product” of A and B

      • Each A tuple is combined with each B tuple

    • Apply selection conditions to each tuple in the Cartesian product

  • Problem: Intermediate result can be huge!

    • Relation A has 10 million rows

    • Relation B has 100,000 rows

    • A x B has 1 trillion rows (many terabytes of data)

  • Need more efficient, logically equivalent procedure


Nested Loop Join (NLJ)


  • NLJ (conceptually)

    for each r ∈ R1 do

        for each s ∈ R2 do

            if r.C = s.C then output (r, s) pair
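
A runnable Python version of the same conceptual loop, representing tuples as dicts and the join column as C (names are illustrative):

    def nested_loop_join(r1, r2, key="C"):
        """Compare every R1 tuple with every R2 tuple; emit matching pairs."""
        result = []
        for r in r1:                      # outer loop over R1
            for s in r2:                  # inner loop over R2
                if r[key] == s[key]:      # join condition r.C = s.C
                    result.append((r, s))
        return result

    print(nested_loop_join([{"C": 1}, {"C": 2}], [{"C": 2}, {"C": 3}]))
    # [({'C': 2}, {'C': 2})]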


Nested Loop Join

  • In theory, multi-way nested loop joins are possible

    • Joining n relations → n nested “for” loops

  • In practice, only binary (2-way) join operators are implemented

    • Joining n relations → Cascade of n-1 binary joins

  • Nested-loop join terminology:

    • Outer input: scanned in outer loop of join

    • Inner input: scanned in inner loop of join

[Figure: cascade of binary joins, e.g. (A ⋈ B) ⋈ C]


Analysis of Nested Loop Join

  • Outer has m tuples, inner has n tuples

  • CPU complexity = O(m*n)

    • For each of m outer tuples, compare with each of n inner tuples

    • Quadratic running time

  • What about I/O complexity?

    • Depends on amount of memory available

    • We’ll assume B tuples per page

    • Consider four scenarios

      • Use 2 pages of memory

      • Use (m/B + n/B) pages of memory

      • Use (n/B + 1) pages of memory

      • Use (k + 1) pages of memory

        • For k < m/B


Nested Loop Join (2 pages)

  • One page for outer, one page for inner

  • NLJ algorithm:

    • Load 1st outer block into memory

      • Load all n/B blocks of inner, one after another

    • Load 2nd outer block into memory

      • Load all n/B blocks of inner, one after another

    • …

    • Load (m/B)th outer block into memory

      • Load all n/B blocks of inner, one after another

  • Total I/O cost = m/B * (n/B + 1)


Nested Loop Join (m/B+n/B pages)

  • Load both inputs entirely into memory

  • All computation takes place in memory

  • No further paging necessary

  • Total I/O cost = m/B + n/B

  • Requires a lot of memory

    • Practical if both inputs are small


Nested Loop Join (n/B + 1 pages)

  • Allocate 1 page for outer

  • Load entire inner into memory (n/B pages)

  • NLJ algorithm:

    • Load entire inner into memory

    • Load 1st outer block into memory

      • Inner is already in memory → no additional I/O needed

    • Load 2nd outer block into memory

      • Inner is already in memory → no additional I/O needed

    • …

    • Load (m/B)th outer block into memory

      • Inner is already in memory → no additional I/O needed

  • Total I/O cost = m/B + n/B

    • Same I/O cost as previous slide

    • Practical if inner input is small


Nested Loop Join (k+1 pages)

  • Allocate 1 page for inner

  • Load k pages of outer into memory at a time

  • NLJ algorithm:

    • Load 1st group of k outer blocks into memory

      • Load all n/B blocks of inner, one after another

    • Load 2nd group of k outer blocks into memory

      • Load all n/B blocks of inner, one after another

    • …

    • Load (m/kB)th group of k outer blocks into memory

      • Load all n/B blocks of inner, one after another

  • Total I/O cost = m/kB * (n/B + k)

    • Reduction from 2-page case by a factor of k
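
A sketch of the (k+1)-page strategy using a toy I/O model: the inputs are lists of pages and we count page reads, rather than modeling real disk access:

    def block_nlj_page_reads(outer_pages, inner_pages, k):
        """Read k outer pages at a time; rescan all of inner for each group."""
        reads = 0
        for i in range(0, len(outer_pages), k):
            reads += len(outer_pages[i:i + k])   # load a group of k outer pages
            reads += len(inner_pages)            # scan all n/B inner pages
        return reads

    # m/B = 1000 outer pages, n/B = 100 inner pages, k = 50:
    print(block_nlj_page_reads(list(range(1000)), list(range(100)), 50))
    # 3000 = 1000/50 * (100 + 50), matching m/kB * (n/B + k)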


Summary of NLJ Analysis

  • CPU usage is O(m*n)

    • Quadratic in size of inputs

  • I/O usage depends on memory allocated

    • If one input is small:

      • Make the small input the inner

      • I/O complexity is O(m/B + n/B)

      • Linear in size of inputs

      • The minimum possible

    • Otherwise:

      • Make the small input the outer

        • Slightly better than the other way around

      • I/O complexity is O(m/kB * (n/B + k))

        • When k+1 pages of memory are available

        • Approximately mn/(kB²)


Improving on Nested Loops Join

  • For each outer tuple, we scan the entire inner relation to find matching tuples

  • Maybe we can do better by using data structures

  • Two main ideas:

    • Sorting

    • Hashing


Sorting

  • Using sorting to improve NLJ

    • First we sort the inner relation

    • Then, for each outer tuple, find matching inner tuple(s) using binary search over sorted inner

  • CPU usage improves

    • O(mn) → O(m log n) + sort cost

    • Sub-quadratic

  • This strategy is called indexed nested-loop join

    • To be covered in more detail on Thursday

  • Implementation details

    • Use a B-Tree instead of a binary tree as the search structure

    • Larger branching factor → better I/O efficiency
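
A sketch of the sort-one-input strategy using Python’s bisect module for the binary search (a real DBMS would use a B-Tree index instead, as noted above):

    import bisect

    def sorted_inner_join(outer, inner, key="C"):
        """Sort the inner once, then binary-search it for each outer tuple."""
        inner = sorted(inner, key=lambda t: t[key])   # one-time sort cost
        keys = [t[key] for t in inner]
        result = []
        for r in outer:                               # m probes, O(log n) each
            i = bisect.bisect_left(keys, r[key])
            while i < len(keys) and keys[i] == r[key]:
                result.append((r, inner[i]))          # all matching inner tuples
                i += 1
        return result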


Merge Join

  • What if we sort both inputs instead of one?

    • No need to perform binary search for each outer tuple

    • Since outer tuples are also in sorted order, access to inner relation is structured

      • Binary search accesses pages in arbitrary order → random I/O

      • Merge join uses sequential I/O

  • CPU usage improves

    • O(mn) → O(m + n) + sort costs

[Figure: indexed NLJ vs. merge join. Indexed NLJ binary-searches the sorted inner for each outer tuple, touching inner pages in arbitrary order (random I/O); merge join scans both sorted inputs once, in order (sequential I/O).]
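
A sketch of merge join over two inputs already sorted on the join key, assuming the inner is the primary-key side (each key appears at most once there):

    def merge_join(outer, inner, key="C"):
        """One coordinated pass over two sorted inputs."""
        result, j = [], 0
        for r in outer:                  # outer is sorted, so the inner
            while j < len(inner) and inner[j][key] < r[key]:
                j += 1                   # cursor only ever moves forward
            if j < len(inner) and inner[j][key] == r[key]:
                result.append((r, inner[j]))
        return result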


External-Memory Sorting

  • How to efficiently sort large data sets?

    • Consider I/O complexity

  • Sort a relation with n tuples, B tuples per block

    • Case 1: Use n/B pages of memory

    • Case 2: Use k+1 pages of memory

  • Case 1: Use n/B pages of memory

    • Read the relation into memory

      • n/B sequential disk reads

    • Sort in main memory using quicksort

      • No disk access required

    • Write the sorted output back to disk

      • n/B sequential disk writes

    • Total: 2n/B disk I/Os (all sequential)

  • Requires lots of memory → only useful for small inputs


External-Memory Sorting

  • Case 2: Use k+1 pages of memory

    • k ≥ sqrt(n/B)

  • Two phases (run generation + merge)

  • Run generation

    • Repeat for each group of k blocks:

      • Read group of k blocks into memory

      • Sort using quicksort

      • Write sorted “run” back out to disk

  • Merge

    • Read 1st page from each run into memory

    • Merge to produce globally sorted output

    • Write output pages to disk as they are filled

    • As a page is exhausted, replace it with the next page of its run

  • Each block read + written once per phase

    • Total disk I/Os: 4n/B

[Figure: external sort. Unsorted input on disk is read into memory in groups, written back as sorted runs, then merged into the final output.]
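
A compact sketch of both phases, modeling the disk as a list of pages and using heapq.merge for the k-way merge (in a real system each run would be streamed from disk one page at a time):

    import heapq

    def external_sort(pages, k):
        """Run generation: sort k pages at a time. Merge: k-way merge of runs."""
        runs = []
        for i in range(0, len(pages), k):                 # run generation
            tuples = [t for page in pages[i:i + k] for t in page]
            runs.append(sorted(tuples))                   # in-memory quicksort
        return list(heapq.merge(*runs))                   # merge phase

    pages = [[5, 2], [9, 1], [7, 3], [8, 4]]              # B = 2 tuples per page
    print(external_sort(pages, k=2))                      # [1, 2, 3, 4, 5, 7, 8, 9]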


Hashing

  • Store inner input in a hash table

    • Tuples grouped together into buckets using a hash function

      • Apply hash function to join attributes (called the key) to determine bucket

    • Idea: all records with the same key must be in the same bucket

  • For each outer tuple

    • Hash key to determine which bucket it goes in

    • Scan that bucket for matching inner tuples

  • CPU usage improves

    • O(mn) → O(m + n)

      • Assuming a good hash function with few collisions

      • Collision = tuples with different keys hash to same bucket

  • Only useful for equality joins

    • Not for join conditions like WHERE A.foo > B.bar

    • 99% of all joins are equality joins
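
A sketch of this idea in Python, with a dict of lists standing in for the hash table and its buckets:

    from collections import defaultdict

    def hash_join(outer, inner, key="C"):
        """Build a hash table on the inner input, then probe it with the outer."""
        buckets = defaultdict(list)
        for s in inner:                          # build the table: O(n)
            buckets[s[key]].append(s)
        result = []
        for r in outer:                          # probe the table: O(m)
            for s in buckets.get(r[key], []):    # scan only the matching bucket
                result.append((r, s))
        return result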


Hash Join Analysis

  • Hash join terminology

    • Inner input is called build input

    • Outer input is called probe input

    • Build a hash table based on one input, probe into that hash table using the other

  • Probe has m tuples, build has n tuples

  • Using (n/B + 1) pages of memory:

    • Allocate 1 page for probe input

    • Read build input into memory + create hash table

      • Uses n/B pages

    • Read 1st page of probe input

      • Hash each tuple and search its bucket for matching build tuples

      • Requires no additional I/O

    • Repeat for 2nd and subsequent pages of probe input

    • I/O complexity = O(n/B + m/B)

    • Suitable when one input is small


Partitioned Hash Join

  • Two-phase algorithm (Partition + Join)

    • Uses k blocks of memory

  • Phase 1: Partition

    • Performed separately for build and probe inputs

    • Read 1st page into memory

    • Assign each tuple to a partition 1..k using hash function

    • Each partition is allocated 1 block of memory

      • Tuples assigned to same partition are placed together

      • When the block fills up, write it to disk

      • Each partition stored in a contiguous region on disk

    • Repeat for 2nd and subsequent pages

  • Phase 2: Join phase

    • Join Partition 1 of probe input with Partition 1 of build input using ordinary hash join

      • Use a secondary hash function

    • Repeat with Partition 2 of each input, etc.

  • A form of recursion

    • Partitioning creates smaller problem instances


Partition Phase

[Figure: partition phase. Unsorted input is read from disk page by page, each tuple is hashed into one of k in-memory partition blocks, and full blocks are written back to disk to form k partitions (similar to the merge phase of merge join, only backwards).]
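
A sketch of both phases, assuming everything fits in Python lists; hash(...) % k plays the role of the primary hash function, and Python’s dict supplies the secondary hash in the join phase:

    from collections import defaultdict

    def partition(tuples, k, key="C"):
        """Phase 1: fan tuples out into k partitions by hash of the join key."""
        parts = [[] for _ in range(k)]
        for t in tuples:
            parts[hash(t[key]) % k].append(t)
        return parts

    def partitioned_hash_join(probe, build, k=4, key="C"):
        """Phase 2: join partition i of probe with partition i of build."""
        result = []
        for p_part, b_part in zip(partition(probe, k, key),
                                   partition(build, k, key)):
            table = defaultdict(list)
            for s in b_part:                     # ordinary hash join per partition
                table[s[key]].append(s)
            for r in p_part:
                result.extend((r, s) for s in table.get(r[key], []))
        return result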


Summary of Join Methods

  • When 1 relation fits entirely in memory, all join methods have linear I/O complexity

  • When both relations are large:

    • NLJ has quadratic I/O complexity

    • Merge join and hash join have linear I/O complexity

  • CPU complexity

    • Quadratic for NLJ

    • N log N for merge join

    • Linear for hash join

  • Rules of Thumb:

    • Use nested loop join for very small inputs

    • Use merge join for large inputs that are already sorted

    • Use hash join for large, unsorted inputs


Query Plans with Multiple Joins

  • Two kinds of execution:

    • Pipelined

      • No need to store intermediate results on disk

      • Available memory must be shared between joins

    • Materialized

      • No memory contention between joins

      • Spend I/Os writing intermediate results to disk

  • Blocking operators

    • Sort, partition, build hash table

    • Pipelined execution is disallowed

  • Interesting orders

    • Sorted input → sorted output

      • NLJ: output is sorted by same key as outer input

      • Merge join: output is sorted by same key as inputs

    • Partitioned input → partitioned output

      • Hash join: output is partitioned by same key as inputs

    • Take advantage of interesting orders

      • Eliminate sorting / partitioning phases of later joins

[Figure: cascaded join plan over relations A, B, C]


Grouping

  • First partition tuples on grouping attributes

    • Tuples in same group placed together

    • Tuples in different groups separated

  • Then scan tuples in each partition, compute aggregate expressions

  • Two techniques for partitioning

    • Sorting

      • Sort by the grouping attributes

      • All tuples with same grouping attributes will appear together in sorted list

    • Hashing

      • Hash by the grouping attributes

      • All tuples with same grouping attributes will hash to same bucket

      • Sort / re-hash within each bucket to resolve collisions

