Components of a Scalable Distributed Relational Information Service

Dong Lu, June 14, 2005

Outline • Bird’s Eye View • What is RGIS? • Architecture • What components are studied in the thesis? • Size-Based Scheduling With Inaccurate Info • Fairness and efficiency as function of correlation • Other applications: beyond RGIS • DualPats: Characterizing and Predicting TCP Throughput on the Wide Area Network • Why TCP throughput prediction? • Flow size / TCP throughput correlation • Issues with simple benchmarking • DualPats algorithm and dynamic rate adjustment • Thesis Contributions

RGIS • Grid computing • Providing dependable, reliable, consistent, pervasive and unlimited computing resources • RGIS: Relational Grid Information Service • Represents globally distributed resources, including the network • Relational Model allows complex compositional queries • Relational Model is well studied; large user population • RGIS servers distributed among multiple organizations and sites

Query and Update Example • A query example • Find a set of 16 Linux machines on the same LAN such that each has more than 1 GB of memory and a link capacity above 100 Mb, and together they have at least 32 GB of memory • An update example • Host A has added 1 GB of memory and will be available from 1:00 PM to 6:00 PM Central Time

RGIS Architecture (diagram) • Clients: users, applications • Interfaces: web interface, canned queries, canned approximate queries, SOAP interface, authenticated direct interface, content delivery network interface (for loose consistency) • Query Manager and Rewriter: scoping rewrite, nondeterminism rewrite, time bounding (and iteration of query) • Update Manager • Updates are encrypted on the network using asymmetric cryptography; only those with appropriate keys have access • Oracle 9i front end: transactional inserts and updates using stored procedures, queries using select statements (uses the database’s access control) • RDBMS: Oracle 9i back end (Windows, Linux, Parallel Server, etc.), site-to-site • Schema, type hierarchy, indices, PL/SQL stored procedures for each object

RGIS Web Interface


Query Components • GridG: the first synthetic grid generator • Topology [Sigmetrics Performance Evaluation Review, Vol 30, No. 4, 2003] • Annotation [SC’03-1] • Query rewriting techniques to trade off query time and the result set size • Nondeterministic query [SC’03-2] • Scoped and approximate queries [GRID’03]

Update and CDN Components • Size-Based Scheduling with inaccurate info to minimize mean update time • Fairness and efficiency as function of correlation [MASCOTS’04-1] • P2P scheduling [LCR’04], one in submission • Web server scheduling, in submission • Other applications [MASCOTS’04-2] • Characterizing and predicting TCP throughput on the WAN to determine update transfer time • [ICDCS’05]

Update and CDN Components • Modeling and taming parallel TCP on the WAN to transfer updates faster • [IPDPS’05] • Fat-tree based end-system multicast to disseminate updates scalably • [WCW’04], one in submission

Scheduling Section Outline • Review of Size-Based Scheduling • Motivation • Simulation Setup • Simulation Results • New Applications

The scheduling problem • Scheduling is a general problem • Goal: minimize the mean response time; be fair • Updates arrive from the CDN at the scheduler, which decides which update to run next on the database (diagram shows queued updates labeled 10K, 8K, 6K, 3K) • Response time: the time from a job’s arrival to its completion

Review of Non-size-based scheduling • FCFS, PS, etc. • FCFS: First Come First Serve • Intuitive • Easiest to implement • PS: Processor Sharing • Fair: all jobs receive equal resources • Also easy to implement • Problem: these policies are unaware of job size information, which results in high mean response time
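
The two policies above can be sketched for a single server as follows. This is a minimal illustration, not the simulator used in the thesis; the function names and the `(arrival_time, size)` job format are assumptions made for the example.

```python
def fcfs_response_times(jobs):
    """jobs: list of (arrival_time, size). Returns response times in input order."""
    now = 0.0
    resp = [0.0] * len(jobs)
    for i in sorted(range(len(jobs)), key=lambda i: jobs[i][0]):
        arrival, size = jobs[i]
        now = max(now, arrival) + size  # wait for the server, then run to completion
        resp[i] = now - arrival
    return resp


def ps_response_times(jobs):
    """Processor sharing: every job in the system gets an equal share of the server."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    left = {}  # job index -> remaining work for jobs currently in the system
    resp = [0.0] * len(jobs)
    now, k, n = 0.0, 0, len(jobs)
    while k < n or left:
        if not left:  # server idle: jump to the next arrival
            now = max(now, jobs[order[k]][0])
        while k < n and jobs[order[k]][0] <= now:
            left[order[k]] = jobs[order[k]][1]
            k += 1
        share = len(left)
        j = min(left, key=left.get)  # job with the least remaining work finishes first
        work = left[j]
        next_arrival = jobs[order[k]][0] if k < n else float("inf")
        if now + work * share <= next_arrival:
            now += work * share  # each job receives `work` units of service
            for i in left:
                left[i] -= work
            del left[j]
            resp[j] = now - jobs[j][0]
        else:
            served = (next_arrival - now) / share
            for i in left:
                left[i] -= served
            now = next_arrival
    return resp


jobs = [(0, 10), (1, 2), (2, 1)]  # hypothetical workload
print(fcfs_response_times(jobs))
print(ps_response_times(jobs))
```

In this small hypothetical workload the two short jobs wait behind the long one under FCFS, while PS lets them finish early at the cost of delaying the long job.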

Review of size-based scheduling • SRPT, FSP, etc. • Use the job size (processing time, service time) information for scheduling • Optimal in mean response time • Fair? • Easy to implement? We use Job Size to refer to the Processing Time (Service Time) of the job

Shortest Remaining Processing Time (SRPT) • Always serve the job with the minimum remaining processing time first • Preemptive scheduling • Yields minimum mean response time [Schrage, Operations Research, 1968] • Surprisingly, it is fair for heavy-tailed job size distributions [Bansal and Harchol-Balter, Sigmetrics ‘01] • Easy to implement? • With accurate a priori job size information, YES • Otherwise, NO

Fair Sojourn Protocol (FSP) • Combines SRPT with PS • Preemptive scheduling • Mean response time is close to that of SRPT, and it is fairer than both SRPT and PS [Friedman, et al, Sigmetrics ‘03] • Easy to implement? • With accurate a priori job size information, YES • Otherwise, NO

Motivation • Size-based scheduling requires accurate knowledge of job sizes • In practice, a priori job size information is not always available • All the previous work assumes perfect knowledge of job sizes a priori • How does performance depend on quality of job size information?

Correlation • We study the performance of size-based schedulers as a function of the correlation coefficient (Pearson’s R) between actual job sizes and estimated job sizes.
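
For reference, Pearson’s R between actual and estimated sizes can be computed as in the sketch below. The function name and the sample values are hypothetical and only illustrate the definition.

```python
import math


def pearson_r(actual, estimated):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_e = sum(estimated) / n
    cov = sum((a - mean_a) * (e - mean_e) for a, e in zip(actual, estimated))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / math.sqrt(var_a * var_e)


actual = [1.0, 2.0, 3.0, 4.0]      # hypothetical job sizes
estimated = [2.0, 4.0, 6.0, 8.0]   # an estimator that is off by a constant factor
print(pearson_r(actual, estimated))
```

An estimator that is wrong by a constant factor still has R close to 1, which is consistent with the later point that what matters for size-based scheduling is preserving the ordering of job sizes.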

Trace generator • Inputs: correlation (Pearson’s R), distribution A, distribution B • Output: correlated random pairs of X and Y • X has distribution A • Y has distribution B • X and Y have correlation R

Trace generator algorithm • Algorithm: “Normal-To-Anything” • First developed by Cario and Nelson, in INFORMS Journal on Computing 10, 1 (1998). • We simplified the algorithm and were the first to introduce it into simulation studies of computer systems
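
The core idea of Normal-To-Anything can be sketched as: draw correlated standard normals, map them to uniforms through the normal CDF, then push the uniforms through the inverse CDFs of the target distributions. The code below is a simplified illustration and not the thesis implementation. In particular, `rho` is the correlation of the underlying normals, which generally differs from the Pearson’s R of the resulting (X, Y) pairs; a full generator would search for the `rho` that yields the desired R. The function names and the exponential and Pareto marginals are assumptions chosen for the example.

```python
import math
import random
from statistics import NormalDist


def exp_inv_cdf(mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    return lambda u: -mean * math.log(1.0 - u)


def pareto_inv_cdf(x_min, alpha):
    """Inverse CDF of a Pareto distribution with scale x_min and shape alpha."""
    return lambda u: x_min / (1.0 - u) ** (1.0 / alpha)


def correlated_pairs(n, rho, inv_cdf_a, inv_cdf_b, seed=0):
    """Return n pairs (x, y): x follows distribution A, y follows distribution B."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    eps = 1e-12  # keep uniforms strictly inside (0, 1) for the inverse CDFs
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 is standard normal and has correlation rho with z1
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1 = min(max(phi(z1), eps), 1.0 - eps)
        u2 = min(max(phi(z2), eps), 1.0 - eps)
        pairs.append((inv_cdf_a(u1), inv_cdf_b(u2)))
    return pairs


trace = correlated_pairs(5, 0.8, exp_inv_cdf(1.0), pareto_inv_cdf(1.0, 2.5))
for x, y in trace:
    print(f"{x:.3f} {y:.3f}")
```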

Scatter plot of example traces • Two panels plotting Y against X, one with R=0.78 and one with R=0.13

Performance metrics • Mean response time (also called sojourn time or turn-around time) • Slowdown: the ratio of a job’s response time to its size; used as the fairness metric
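
As a small worked example of the two metrics, assuming hypothetical response times and sizes for three jobs (the function names are also illustrative):

```python
def mean_response_time(response_times):
    """Average time from arrival to completion over all jobs."""
    return sum(response_times) / len(response_times)


def slowdowns(response_times, sizes):
    """Per-job slowdown: response time divided by the job's size (always >= 1)."""
    return [r / s for r, s in zip(response_times, sizes)]


response_times = [13.0, 2.0, 2.0]  # hypothetical values
sizes = [10.0, 2.0, 1.0]
print(mean_response_time(response_times))
print(slowdowns(response_times, sizes))
```

A slowdown of 1 means the job never waited; comparing slowdowns across job sizes shows whether a policy penalizes large or small jobs disproportionately.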

Simulator • Simulator • Supports M/G/1 and G/G/n/m queuing models • Simulator validation • Little’s law • Repeat the simulations in the FSP paper [Friedman, et al, Sigmetrics ‘03] • Compare with available theoretical results [Bansal and Harchol-Balter, Sigmetrics ‘01]
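
A Little’s law check (L = λW) on simulator output could look like the sketch below. It measures the time-average number of jobs in the system from arrival and departure events and compares it with arrival rate times mean response time. The function name and the sample times are assumptions; this is not the validation code from the thesis.

```python
import math


def time_average_jobs(arrivals, departures):
    """Time-average number of jobs in the system between first arrival and last departure."""
    events = sorted([(t, +1) for t in arrivals] + [(t, -1) for t in departures])
    area = 0.0       # integral of (number of jobs in system) over time
    in_system = 0
    prev = events[0][0]
    for t, delta in events:
        area += in_system * (t - prev)
        in_system += delta
        prev = t
    return area / (events[-1][0] - events[0][0])


arrivals = [0.0, 1.0, 2.0]       # hypothetical simulator output
departures = [13.0, 3.0, 4.0]    # departures[i] belongs to arrivals[i]
horizon = max(departures) - min(arrivals)
lam = len(arrivals) / horizon                                              # arrival rate
w = sum(d - a for a, d in zip(arrivals, departures)) / len(arrivals)      # mean response time
print(time_average_jobs(arrivals, departures), lam * w)
assert math.isclose(time_average_jobs(arrivals, departures), lam * w)
```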

Scheduling Policies • PS: Processor sharing • Size-based scheduling policies • SRPT: Ideal SRPT scheduler • SRPT-E: SRPT scheduler using estimated job size • FSP: Ideal Fair Sojourn Protocol • FSP-E: FSP scheduler using estimated job size • Each simulation is repeated 20 times and we present the average
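
The difference between SRPT and SRPT-E can be sketched with one single-server routine: the scheduler ranks jobs by estimated remaining size, while a job only completes once its actual work is done. With perfect estimates this reduces to ideal SRPT. The function name and job format are assumptions, and letting an underestimated job keep its (negative) estimated remaining size is one possible convention, not necessarily the one used in the thesis.

```python
import heapq


def srpt_e_response_times(jobs):
    """jobs: list of (arrival_time, actual_size, estimated_size).
    Returns response times in input order."""
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    left = [actual for _, actual, _ in jobs]  # actual remaining work per job
    resp = [0.0] * len(jobs)
    ready = []  # heap of (estimated remaining size, job index)
    now, k, n = 0.0, 0, len(jobs)
    while k < n or ready:
        if not ready:  # server idle: jump to the next arrival
            now = max(now, jobs[order[k]][0])
        while k < n and jobs[order[k]][0] <= now:
            i = order[k]
            heapq.heappush(ready, (jobs[i][2], i))
            k += 1
        est_left, i = heapq.heappop(ready)  # smallest estimated remaining size
        next_arrival = jobs[order[k]][0] if k < n else float("inf")
        if left[i] <= next_arrival - now:
            now += left[i]  # job finishes before anything else arrives
            resp[i] = now - jobs[i][0]
        else:
            ran = next_arrival - now  # run until the next arrival, then re-decide
            left[i] -= ran
            heapq.heappush(ready, (est_left - ran, i))
            now = next_arrival
    return resp


perfect = [(0, 10, 10), (1, 2, 2), (2, 1, 1)]   # estimates equal actual sizes
poor = [(0, 10, 1), (1, 2, 5), (2, 1, 20)]      # estimates invert the size ordering
print(srpt_e_response_times(perfect))
print(srpt_e_response_times(poor))
```

In this hypothetical workload the poor estimator inverts the ordering of job sizes, so the long job is never preempted and the short jobs wait behind it.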

Mean response time as a function of R

Slowdown (R=0.0224)

Slowdown (R=0.239)

Slowdown (R=0.4022)

Slowdown (R=0.5366)

Slowdown (R=0.7322)

Slowdown (R=0.9779)

Simulation Results: Conclusions • Performance heavily depends on correlation • SRPT-E and FSP-E can outperform PS given an effective job size estimator • Crossover point of performance metrics is a function of correlation • Also of job size distributions (See TR NWU-CS-04-33)

New Applications: Web server scheduling (TR NWU-CS-04-33) • Is file size a good estimator of a job’s service time (processing time)? Not really (R ≈ 0.14) • Figure: file size versus service time (wall clock time)

New Applications: Web server scheduling • Domain-based estimator: much more accurate prediction of the service time at low overhead

New Applications: P2P server side scheduling (LCR ’04) • “Server side” of current file sharing P2P applications is superficially similar to a web server • Both send back files upon request • However, a P2P application cannot even know the file size accurately a priori • Partial downloads • Our ongoing work shows that SRPT-E performs well using our time-series based job size estimators.

Scheduling Section Summary • Performance of size-based scheduling policies depends on correlation between size estimates and actual sizes • Fairness, mean response time, etc. • Estimator must preserve ordering of job sizes for high performance • Performance degrades as correlation degrades • Effective new estimators for Web and P2P

DualPats Overview • Algorithm for predicting the TCP throughput as a function of flow size • Minimal active probing • Dynamic probe rate adjustment • Explaining flow size / throughput correlation • Explaining why simple active probing fails • Large-scale empirical study

DualPats Section Outline • Why TCP Throughput Prediction? • Particulars of Study • Flow Size / TCP Throughput Correlation • Issues with Simple Benchmarking • DualPats Algorithm • Stability and Dynamic Rate Adjustment

Goal • A library call: BW = PredictTransfer(src,dst,numbytes); • Expected Time = numbytes/BW; • Ideally, we want a confidence interval: (BWLow,BWHigh) = PredictTransfer(src,dst,numbytes,p);
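
How an application might consume such a call can be sketched as below. The predictor itself is passed in as a function; the constant-bandwidth stubs, the host names, and the assumption that bandwidth is in bytes per second are all hypothetical, since the slide only specifies the call signature.

```python
def expected_transfer_time(predict, src, dst, numbytes):
    """Expected Time = numbytes / BW, with BW supplied by a predictor callable."""
    bw = predict(src, dst, numbytes)  # assumed unit: bytes per second
    if bw <= 0:
        raise ValueError("predicted bandwidth must be positive")
    return numbytes / bw


def transfer_time_interval(predict_interval, src, dst, numbytes, p):
    """Turn a bandwidth confidence interval into a (fastest, slowest) time interval."""
    bw_low, bw_high = predict_interval(src, dst, numbytes, p)
    return numbytes / bw_high, numbytes / bw_low


# Hypothetical stand-ins for a real predictor.
def stub_predict(src, dst, numbytes):
    return 1_000_000.0


def stub_predict_interval(src, dst, numbytes, p):
    return 1_000_000.0, 2_000_000.0


print(expected_transfer_time(stub_predict, "hostA", "hostB", 5_000_000))
print(transfer_time_interval(stub_predict_interval, "hostA", "hostB", 5_000_000, 0.95))
```

Note that the high end of the bandwidth interval gives the low end of the time interval, and vice versa.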

Available Bandwidth • Maximum rate a path can offer a flow without slowing other flows • pathchar, cprobe, nettimer, delphi, IGI, pathchirp, pathload … • mainly for traffic engineering • Available bandwidth can differ significantly from TCP throughput • Not real time, takes at least tens of seconds to run

Simple TCP Benchmarking • Benchmark paths with a single small probe • BW = ProbeSize/Time • Widely used: Network Weather Service (NWS) and others (e.g., the Remos benchmarking collector) • Not accurate for large transfers on the current high-speed Internet • Numerous papers show this and attempt to fix it

Fixing Simple TCP Benchmarking • Logs [Sundharshan]: correlate real transfer measurements with benchmarking measurements • Recent transfers needed • Similar size transfers needed • Measurements at application chosen times • CDF-matching [Swany]: correlate CDF of real transfer measurements with CDF of benchmarking measurements • Recent transfers still needed • Measurements at application chosen times

Analysis of TCP • Extensive research on TCP throughput modeling in the networking community • Really intended to build better TCPs • Difficult to use the models online because they depend on hard-to-measure parameters • Future loss rate and RTT

