Synthesizing Representative I/O Workloads for TPC-H. J. Zhang *, A. Sivasubramaniam *, H. Franke, N. Gautam *, Y. Zhang, S. Nagar * Pennsylvania State University IBM T.J. Watson Rutgers University. Outline. Motivation Related Work Methodology Arrival Time Access Pattern Request Sizes


Synthesizing Representative I/O Workloads for TPC-H

J. Zhang*, A. Sivasubramaniam*, H. Franke, N. Gautam*, Y. Zhang, S. Nagar

* Pennsylvania State University

IBM T.J. Watson

Rutgers University


Outline

  • Motivation

  • Related Work

  • Methodology

    • Arrival Time

    • Access Pattern

    • Request Sizes

  • Accuracy of synthetic traces

  • Concluding Remarks


Motivation

  • I/O subsystems are critical for commercial services and in production environments.

  • Real applications are essential for system design and evaluation.

  • TPC-H is a decision-support workload for business enterprises.


Disadvantages of Traces

  • Not easily obtainable

  • Can be very large

  • Difficult to get statistical confidence

  • Very difficult to change workload behavior

  • Do not isolate the influence of a single parameter

  • On the other hand, a deeper understanding of the workload can:

    • Help generate a synthetic workload

    • Help in system design itself.


What do we need to synthesize?

  • Inter-arrival times (temporal behavior) of disk block requests.

  • Access pattern (spatial behavior) of blocks being referenced

  • Size (volume) of each I/O request.


Related work

  • Scientific Application I/O behavior

    • Time-series models for arrivals

    • Sequentiality/Markov models for access pattern

  • Commercial/production workloads

    • Self-similar arrival patterns

    • Sequentiality in TPC-H/TPC-D

  • No prior complete synthesis of all three attributes for TPC-H


Our TPC-H Workload

  • Trace Collection Platform

    • IBM Netfinity 8-way SMP with 2.5GB memory and 15 disks

    • Linux 2.4.17

    • DB2 UDB EE V7.2

  • TPC-H Configuration

    • Power Run of 22 queries

    • Partitioning tables across the disks

    • 30 GB dataset


Validation

[Flow diagram, labels in order: Original I/O traces, Identify characteristics, Generate synthetic traces, Disksim 2.0, CDF of response time]

Metrics

  • RMS: root-mean-square error of the differences between the two CDF curves

  • nRMS: RMS/m, where m is the average response time of the original trace
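The two metrics can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the CDF curves are sampled, so the sketch assumes vertical differences on an evenly spaced grid of response times, and the names `empirical_cdf`, `rms_and_nrms` and `points` are hypothetical.

```python
import bisect
import math

def empirical_cdf(samples, grid):
    """Fraction of samples that are <= g, for each grid point g."""
    ordered = sorted(samples)
    n = len(ordered)
    return [bisect.bisect_right(ordered, g) / n for g in grid]

def rms_and_nrms(original, synthetic, points=100):
    """Return (RMS, nRMS) for two lists of response times.

    Assumption: the CDFs are compared vertically on `points` evenly
    spaced response times up to the largest observed value.
    """
    top = max(max(original), max(synthetic))
    grid = [top * (i + 1) / points for i in range(points)]
    f = empirical_cdf(original, grid)
    g = empirical_cdf(synthetic, grid)
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / points)
    m = sum(original) / len(original)  # average response time of the original trace
    return rms, rms / m
```

Identical response-time lists give an RMS of zero; the further the synthetic CDF drifts from the original, the larger both values become.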


Overall Methodology

  • Arrival pattern characteristics

    • Investigate correlations

      • Time series

      • Self-similar

      • iid distributions

  • Access pattern characteristics

    • Sequentiality/pseudo sequentiality/randomness

    • Size characteristics

  • Investigating correlations between time, space and volume to get final synthesis


Arrival pattern

  • Statistical analysis

    • Auto-correlation function (ACF) plots

      • Shows the correlation between current inter-arrival time and one that is x-steps away
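As a rough illustration of what an ACF plot is built from, the sketch below computes the sample autocorrelation of a series of inter-arrival times at a given lag. The function names are hypothetical and the estimator is the common biased one (lagged covariance divided by the lag-0 sum of squares); the slides do not state which estimator was used.

```python
def inter_arrival_times(arrivals):
    """Differences between successive arrival timestamps."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0 <= lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant series: correlation is undefined, reported as 0 here
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var
```

Values near zero at every lag greater than zero suggest that successive inter-arrival times can be treated as independent draws.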


  • Fitting distributions

    • Tried hyper-exponential/normal/pareto

    • Used Maximum Likelihood Estimator (normal/pareto) and Expectation Maximization (hyper-exponential) to estimate distribution parameters

    • Use K-S test to measure goodness-of-fit

    • Maximum distance between fitted distribution and original CDF was ensured to be less than 0.1
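A minimal sketch of the fit-and-check step, using the Pareto case as an example. It assumes the standard closed-form Pareto maximum likelihood estimates and measures the K-S distance as the largest gap between the fitted CDF and the empirical CDF; the function names are hypothetical, and the hyper-exponential EM fit is not shown.

```python
import math

def fit_pareto_mle(samples):
    """Pareto MLE: scale x_m = min(x), shape alpha = n / sum(ln(x / x_m)).

    Needs positive samples with at least two distinct values.
    """
    x_m = min(samples)
    log_sum = sum(math.log(x / x_m) for x in samples)
    return x_m, len(samples) / log_sum

def pareto_cdf(x, x_m, alpha):
    return 0.0 if x < x_m else 1.0 - (x_m / x) ** alpha

def ks_distance(samples, cdf):
    """Largest vertical gap between `cdf` and the empirical CDF of `samples`."""
    ordered = sorted(samples)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(f - i / n), abs((i + 1) / n - f))
    return worst
```

A fit would then be accepted when `ks_distance(samples, lambda x: pareto_cdf(x, x_m, alpha))` comes out below 0.1, the threshold quoted above.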



Access Pattern (Location + Size)

  • Most studies use sequentiality to describe TPC-H

  • However, this is not always the case.

[Figure: three panels plotting Location against Arrival Time, one per query category]

  • Cat1: Q10, Q4, Q14

  • Cat2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21

  • Cat3: Q20, Q9, Q17


Category 1: Intermingling sequential streams

  • Consider the following:

    • Run: A strictly sequential set of I/O requests

    • Stream: A pseudo-sequential set of I/O requests that could be interrupted by another stream.

    • i.e. a stream could have several runs that are interrupted by runs of other streams.


Run and Stream

  • An example run of 5 requests: 1-4, 5-8, 9-10, 11-14, 15-18

  • A stream (pseudo-sequential) of 4 requests: 1-4, 7-8, 9-12, 11-14

  • An example trace:

    • Stream A: 1-4, 7-8, 9-12, 11-14

    • Stream B: 100-104, 105-108, 109-112

    • Trace: 1-4, 100-104, 7-8, 9-12, 105-108, 109-112, 11-14
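The run definition can be made concrete with a small sketch that splits a trace into strictly sequential runs. It assumes each request is an inclusive (first block, last block) pair and that "strictly sequential" means the next request starts at the block right after the previous one ends; `split_into_runs` is a hypothetical name.

```python
def split_into_runs(requests):
    """Group (first_block, last_block) requests into strictly sequential runs."""
    runs = []
    for first, last in requests:
        # Continue the current run only if this request starts right
        # after the last block of the previous request.
        if runs and first == runs[-1][-1][1] + 1:
            runs[-1].append((first, last))
        else:
            runs.append([(first, last)])
    return runs

# The interleaved example trace of streams A and B.
trace = [(1, 4), (100, 104), (7, 8), (9, 12), (105, 108), (109, 112), (11, 14)]
run_lengths = [len(run) for run in split_into_runs(trace)]
```

Under these assumptions the example trace breaks into five runs, because each switch between streams A and B ends the current run.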


Secondary Attributes

  • Run Length: # of requests in a run

  • Run Start location: start sector of run

  • Stream Length: # of requests in a stream

  • Inter-stream Jump Distance: spatial separation between start of run and previous request

  • Intra-stream Jump Distance: spatial separation between successive requests within a stream

  • Number of active streams (at any instant)

  • Interference Distance: number of requests between 2 successive requests in a stream

  • Derive empirical distributions for these from the trace
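One simple way to hold such an empirical distribution is a table of observed values and their counts, sampled in proportion to frequency. The class below is a hypothetical sketch of that idea, not the representation used by the authors.

```python
import random
from collections import Counter

class EmpiricalDistribution:
    """Discrete distribution built from the values observed in a trace."""

    def __init__(self, observations):
        self.counts = Counter(observations)
        self.total = sum(self.counts.values())
        self.values = sorted(self.counts)
        self.weights = [self.counts[v] for v in self.values]

    def probability(self, value):
        """Relative frequency of `value` (0.0 if it was never observed)."""
        return self.counts.get(value, 0) / self.total

    def sample(self, rng):
        """Draw one value with probability proportional to its count."""
        return rng.choices(self.values, weights=self.weights, k=1)[0]

# Example: run lengths measured from a trace (illustrative numbers only).
run_length_dist = EmpiricalDistribution([1, 1, 2, 2, 1])
drawn = run_length_dist.sample(random.Random(0))
```

The same class could serve for any of the attributes above, for example stream lengths or jump distances.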


Location Synthesis - Q10 (Time and size from trace)

  • LocIID: locations are i.i.d.

  • LocRUN: incorporate run length distribution and run start location distribution.

  • LocSTREAM: combine all stream and run statistics.
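The difference between the first two schemes can be sketched as follows. This is a simplified illustration under assumptions not in the slides: observed values are resampled directly, every request covers a fixed `request_blocks` blocks, and the function and parameter names are hypothetical.

```python
import random

def loc_iid(observed_locations, count, rng):
    """LocIID: draw every start location independently."""
    return [rng.choice(observed_locations) for _ in range(count)]

def loc_run(run_starts, run_lengths, request_blocks, count, rng):
    """LocRUN: draw a run start and a run length, then place the
    requests of that run back to back.

    Assumes every value in `run_lengths` is at least 1.
    """
    locations = []
    while len(locations) < count:
        start = rng.choice(run_starts)
        length = rng.choice(run_lengths)
        for i in range(length):
            locations.append(start + i * request_blocks)
    return locations[:count]

rng = random.Random(0)
iid_locations = loc_iid([5, 900, 4000], 4, rng)
run_locations = loc_run([100], [3], 64, 5, rng)
```

LocIID loses all sequentiality, while LocRUN keeps sequential runs but not the interleaving of streams, which is the part LocSTREAM adds.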


Request Size

  • Requests are one of

    • 64, 128, 192, 256, 320, 384, 448, 512 blocks

  • But the attributes (location, size, time) are not independent!


Correlations between size and location

[Figure: fraction of requests by request size, shown for all requests, run-start requests, and within-run requests]




Final Synthesis Methodology (Category 1)

  • Location: use LocSTREAM to generate start locations. There are two kinds of requests: run-start requests and within-run requests.

  • Time: use Pr(inter-arrival time | run-start request) and Pr(inter-arrival time | within-run request) to generate times.

  • Size:

    • For run-start requests, use Pr(size | inter-arrival time of run-start requests) to generate sizes.

    • For within-run requests, use Pr(size | within-run request) to generate sizes.
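Sampling from a conditional distribution such as Pr(size | request kind) can be sketched with a lookup table keyed by the condition. The helper names and the sample observations below are hypothetical; for run-start requests the key would be a binned inter-arrival time, and how to bin it is an assumption left open here.

```python
import random
from collections import defaultdict

def build_conditional(pairs):
    """Group observed values by their condition: {condition: [values]}."""
    table = defaultdict(list)
    for condition, value in pairs:
        table[condition].append(value)
    return dict(table)

def sample_given(table, condition, rng):
    """Draw a value from the empirical distribution Pr(value | condition)."""
    return rng.choice(table[condition])

# Illustrative (request kind, size in blocks) observations, not trace data.
observed = [("run_start", 64), ("within_run", 512), ("within_run", 512)]
size_given_kind = build_conditional(observed)
size = sample_given(size_given_kind, "within_run", random.Random(0))
```

Keeping the raw observations per condition preserves the observed correlation between the attributes, at the cost of storing more than a single marginal distribution.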




Validation of CDF of response times (Category 1)


Validation of CDF of response times (Category 2)


Validation of CDF of response times (Category 3)


Storage Requirements

[Figure: two plots of nRMS against storage fraction (x0.001)]


Contributions

  • A synthesis methodology that

    • Captures intermingling streams of requests

    • Exploits correlations between request attributes

  • An application of this methodology to TPC-H

  • Along the way (for TPC-H),

    • iid can capture arrival time characteristics

    • Strict sequentiality is not always the case


Backup slides


Validating arrival time synthesis


LocSTREAM

  • Use Pr(stream length) to generate stream lengths.

  • Use Pr(run length | stream length) to generate run lengths for each stream length.

  • Generate start location for each run:

    • Use Pr(inter-stream jump dist.) to generate the start location of the first run in the stream.

    • Use Pr(intra-stream jump distance | this stream) to generate other runs’ start location in this stream.

  • Use Pr(interference distance) to interleave all streams.
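The last step, interleaving, might look like the sketch below. The slides do not spell out the interleaving mechanism, so this assumes one plausible scheme: each request in a stream gets a slot that is spaced from the previous one by a sampled interference distance, and all requests are then ordered by slot. The function name and the sample distances are hypothetical.

```python
import random

def interleave(streams, interference_distances, rng):
    """Merge streams into one trace, keeping each stream's internal order."""
    slotted = []
    for stream_id, requests in enumerate(streams):
        slot = rng.random()  # offset in [0, 1) so streams rarely tie on a slot
        for request in requests:
            slotted.append((slot, stream_id, request))
            # Leave room for the sampled number of foreign requests.
            slot += 1 + rng.choice(interference_distances)
    slotted.sort()
    return [request for _, _, request in slotted]

stream_a = [(1, 4), (7, 8), (9, 12), (11, 14)]
stream_b = [(100, 104), (105, 108), (109, 112)]
merged = interleave([stream_a, stream_b], [0, 1, 2], random.Random(0))
```

Because slots only ever increase within a stream, each stream's requests keep their original order in the merged trace; only the mixing between streams varies with the sampled distances.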

