Explicit Control in a Batch-aware Distributed File System

Explicit Control in a Batch-aware Distributed File System

John Bent

Douglas Thain

Andrea Arpaci-Dusseau

Remzi Arpaci-Dusseau

Miron Livny

University of Wisconsin, Madison



Grid computing

Physicists invent distributed computing!

Astronomers develop virtual supercomputers!



Grid computing

[Figure: remote compute clusters reach back to home storage across the Internet.]

If it looks like a duck . . .



Are existing distributed file systems adequate for batch computing workloads?

  • NO: internal decisions are inappropriate

    • Caching, consistency, replication

  • A solution: Batch-Aware Distributed File System (BAD-FS)

  • Combines knowledge with external storage control

    • Detailed information about the workload is known

    • Storage layer allows external control

    • External scheduler makes informed storage decisions

  • Combining information and control results in

    • Improved performance

    • More robust failure handling

    • Simplified implementation



Outline

  • Introduction

  • Batch computing

    • Systems

    • Workloads

    • Environment

    • Why not DFS?

  • Our answer: BAD-FS

    • Design

    • Experimental evaluation

  • Conclusion



Batch computing

  • Not interactive computing

  • Job description languages

  • Users submit

  • System itself executes

  • Many different batch systems

    • Condor

    • LSF

    • PBS

    • Sun Grid Engine
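To make this concrete, a batch job is described declaratively and handed to the system, which decides where and when it runs. A minimal sketch in Python, assuming a hypothetical job dictionary and submit() helper (illustrative only, not the syntax of Condor, LSF, PBS, or SGE):

# Hypothetical, simplified job description: the user states what to run
# and which files it reads/writes; the batch system decides where/when.
job = {
    "executable": "simulate",            # program to run on some compute node
    "arguments":  ["--events", "1000"],
    "inputs":     ["config.dat", "calibration.db"],
    "outputs":    ["results.out"],
    "requirements": {"memory_mb": 512},
}

def submit(job, queue):
    """Append the job to the system's queue; a CPU manager on some
    node later claims and executes it (illustrative stub)."""
    queue.append(job)

job_queue = []
submit(job, job_queue)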


Batch computing

[Figure: a scheduler at the home site dispatches queued jobs (1-4) across the Internet to compute nodes, each running a CPU manager; input and output data live on home storage.]



“Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.

Batch workloads

  • General properties

    • Large number of processes

    • Process and data dependencies

    • I/O intensive

  • Different types of I/O

    • Endpoint

    • Batch

    • Pipeline

  • Our focus: Scientific workloads

    • More generally applicable

    • Many others use batch computing

      • video production, data mining, electronic design, financial services, graphic rendering


Batch workloads

[Figure: workload structure. Pipelines of jobs pass pipeline data from stage to stage, stages across pipelines share batch datasets, and endpoint data enters and leaves at the top and bottom.]



Cluster-to-cluster (c2c)

  • Not quite p2p

    • More organized

    • Less hostile

    • More homogeneity

    • Correlated failures

  • Each cluster is autonomous

    • Run and managed by different entities

  • An obvious bottleneck is the wide-area link

[Figure: multiple autonomous clusters connected to the home store across the Internet.]

How to manage the flow of data into, within, and out of these clusters?



Why not DFS?


  • Distributed file system would be ideal

    • Easy to use

    • Uniform name space

    • Designed for wide-area networks

  • But . . .

    • Not practical

    • Embedded decisions are wrong



DFS’s make bad decisions

  • Caching

    • Must guess what and how to cache

  • Consistency

    • Output: Must guess when to commit

    • Input: Needs mechanism to invalidate cache

  • Replication

    • Must guess what to replicate



BAD-FS makes good decisions

  • Removes the guesswork

    • Scheduler has detailed workload knowledge

    • Storage layer allows external control

    • Scheduler makes informed storage decisions

  • Retains simplicity and elegance of DFS

  • Practical and deployable



Outline

  • Introduction

  • Batch computing

    • Systems

    • Workloads

    • Environment

    • Why not DFS?

  • Our answer: BAD-FS

    • Design

    • Experimental evaluation

  • Conclusion



Practical and deployable

  • User-level; requires no privilege

  • Packaged as a modified batch system

  • A new batch system which includes BAD-FS

  • General; will work on all batch systems

  • Tested thus far on multiple batch systems

[Figure: BAD-FS runs unprivileged alongside SGE on the remote cluster nodes, connected across the Internet to the home store.]


Contributions of BAD-FS

[Figure: the batch-computing picture extended with BAD-FS: storage managers and BAD-FS on the compute nodes, and a BAD-FS scheduler with the job queue at the home site.]

1) Storage managers

2) Batch-Aware Distributed File System

3) Expanded job description language

4) BAD-FS scheduler



BAD-FS knowledge

  • Remote cluster knowledge

    • Storage availability

    • Failure rates

  • Workload knowledge

    • Data type (batch, pipeline, or endpoint)

    • Data quantity

    • Job dependencies
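A minimal sketch of this knowledge as the scheduler might hold it, assuming an illustrative Python representation (field names are hypothetical, not BAD-FS's actual interface):

# Illustrative snapshot of scheduler knowledge (names are hypothetical).
cluster_knowledge = {
    "cluster-A": {"free_storage_mb": 40_000, "failure_rate_per_hour": 0.02},
    "cluster-B": {"free_storage_mb": 12_000, "failure_rate_per_hour": 0.10},
}

workload_knowledge = {
    "data": {
        "calib.db":   {"type": "batch",    "size_mb": 500},  # shared input
        "stage1.tmp": {"type": "pipeline", "size_mb": 200},  # job-to-job data
        "final.out":  {"type": "endpoint", "size_mb": 5},    # must return home
    },
    "jobs": {
        "stage1": {"reads": ["calib.db"],   "writes": ["stage1.tmp"], "after": []},
        "stage2": {"reads": ["stage1.tmp"], "writes": ["final.out"],  "after": ["stage1"]},
    },
}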



Control through volumes

  • Guaranteed storage allocations

    • Containers for job I/O

  • Scheduler

    • Creates volumes to cache input data

      • Subsequent jobs can reuse this data

    • Creates volumes to buffer output data

      • Destroys pipeline data; copies endpoint data home

    • Configures workload to access containers
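A minimal sketch of that control path, assuming hypothetical storage-manager operations (create_volume, mount, extract, destroy are illustrative names, not the actual BAD-FS API):

# Illustrative volume lifecycle driven by the scheduler; the storage-manager
# calls are hypothetical names.
def run_pipeline(storage, jobs, batch_file, peak_pipeline_mb, endpoint_files, home):
    # Guaranteed allocation caching the shared batch input; later jobs reuse it.
    cache = storage.create_volume(size_mb=batch_file.size_mb, source=home.url(batch_file))
    # Guaranteed allocation buffering intermediate (pipeline) output locally.
    scratch = storage.create_volume(size_mb=peak_pipeline_mb)
    for job in jobs:
        job.mount(cache, "/batch")        # configure the job to use the containers
        job.mount(scratch, "/scratch")
        job.run()
    storage.extract(scratch, endpoint_files, dest=home)  # copy endpoint output home
    storage.destroy(scratch)              # pipeline data is never shipped home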



Knowledge plus control

  • Enhanced performance

    • I/O scoping

    • Capacity-aware scheduling

  • Improved failure handling

    • Cost-benefit replication

  • Simplified implementation

    • No cache consistency protocol



I/O scoping

  • Technique to minimize wide-area traffic

  • Allocate storage to cache batch data

  • Allocate storage for pipeline and endpoint

  • Extract endpoint

[Figure: an AMANDA pipeline on a remote compute node. Per pipeline: 200 MB pipeline data, 500 MB batch data, 5 MB endpoint data. At steady state only 5 of 705 MB traverse the wide-area link; the BAD-FS scheduler keeps batch and pipeline data inside the cluster.]
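A back-of-the-envelope view of the scoping benefit using the AMANDA numbers above, assuming the batch data crosses the wide area once and is then cached at the cluster:

PIPELINE_MB, BATCH_MB, ENDPOINT_MB = 200, 500, 5    # per AMANDA pipeline (from the slide)

def wide_area_mb(num_pipelines, scoped):
    if scoped:
        # Batch data crosses the wide area once and is cached at the cluster;
        # pipeline data stays local; only endpoint data crosses per pipeline.
        return BATCH_MB + num_pipelines * ENDPOINT_MB
    # Remote I/O back to home storage: every byte of every pipeline crosses.
    return num_pipelines * (PIPELINE_MB + BATCH_MB + ENDPOINT_MB)

print(wide_area_mb(100, scoped=False))   # 70500 MB
print(wide_area_mb(100, scoped=True))    # 1000 MB: one batch fetch + 5 MB endpoint per pipeline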



Capacity-aware scheduling

  • Technique to avoid over-allocations

  • Scheduler runs only as many jobs as fit
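A minimal sketch of the admission test, assuming (hypothetically) that each job declares the volume space its batch and pipeline data require:

# Illustrative capacity check: run only as many jobs as their volume
# allocations fit into the cluster's storage (field names are hypothetical).
def admit_jobs(candidates, free_storage_mb):
    running = []
    for job in candidates:
        need = job["batch_mb"] + job["pipeline_mb"]   # space its volumes require
        if need <= free_storage_mb:
            free_storage_mb -= need
            running.append(job)
        # otherwise hold the job rather than over-allocate and thrash
    return running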


Capacity-aware scheduling

[Figure: the workload's pipelines and their batch datasets are scheduled in groups sized so that their volume allocations fit within available storage.]



Capacity-aware scheduling

  • 64 batch-intensive synthetic pipelines

    • Vary size of batch data

  • 16 compute nodes



Improved failure handling

  • Scheduler understands data semantics

    • Data is not just a collection of bytes

    • Losing data is not catastrophic

      • Output can be regenerated by rerunning jobs

  • Cost-benefit replication

    • Replicates only data whose replication cost is less than the cost of rerunning the job

  • Results in paper
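A rough sketch of the cost-benefit test, assuming simple time-based cost estimates (the actual model and the results are in the paper):

# Illustrative cost-benefit check: replicate intermediate data only when
# copying it is cheaper than the expected cost of regenerating it.
def should_replicate(data_mb, copy_bw_mb_per_s, rerun_seconds, failure_prob):
    replication_cost = data_mb / copy_bw_mb_per_s        # time to copy the data
    expected_rerun_cost = failure_prob * rerun_seconds   # expected time lost if it vanishes
    return replication_cost < expected_rerun_cost

# e.g. 200 MB at 10 MB/s (20 s) vs a 10% chance of redoing 3600 s of work (360 s)
print(should_replicate(200, 10, 3600, 0.10))   # True -> worth replicating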



Simplified implementation

  • Data dependencies known

  • Scheduler ensures proper ordering

  • No need for cache consistency protocol in cooperative cache
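A minimal sketch of why no consistency protocol is needed: because dependencies are declared, the scheduler simply withholds a job until everything it reads has been produced (illustrative Python, not BAD-FS code):

# Illustrative ordering: a job becomes eligible only after the jobs that
# produce its inputs have completed, so readers never see stale data.
def runnable(jobs, done):
    return [name for name, j in jobs.items()
            if name not in done and all(dep in done for dep in j["after"])]

jobs = {
    "stage1": {"after": []},
    "stage2": {"after": ["stage1"]},   # reads what stage1 wrote
}
done = set()
while len(done) < len(jobs):
    for name in runnable(jobs, done):
        done.add(name)                 # (run the job here)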



Real workloads

  • AMANDA

    • Astrophysics study of cosmic events such as gamma-ray bursts

  • BLAST

    • Biology search for proteins within a genome

  • CMS

    • Physics simulation of large particle colliders

  • HF

    • Chemistry study of non-relativistic interactions between atomic nuclei and electrons

  • IBIS

    • Ecology global-scale simulation of earth’s climate used to study effects of human activity (e.g. global warming)


Real workload experience

  • Setup

    • 16 jobs

    • 16 compute nodes

    • Emulated wide-area

  • Configurations

    • Remote I/O

    • AFS-like with /tmp

    • BAD-FS

  • Result is an order of magnitude improvement



BAD Conclusions

  • Existing DFS’s insufficient

  • Schedulers have workload knowledge

  • Schedulers need storage control

    • Caching

    • Consistency

    • Replication

  • Combining this control with knowledge

    • Enhanced performance

    • Improved failure handling

    • Simplified implementation



For more information

  • http://www.cs.wisc.edu/adsl

  • http://www.cs.wisc.edu/condor

  • Questions?



Why not BAD-scheduler and traditional DFS?

  • Cooperative caching

  • Data sharing

    • Traditional DFS

      • Assume sharing is the exception

      • Provision for arbitrary, unplanned sharing

    • In batch workloads, sharing is the rule

    • Sharing behavior is completely known

  • Data committal

    • Traditional DFS must guess when to commit

      • AFS uses close, NFS uses 30 seconds

    • Batch workloads precisely define when



Is capacity-aware scheduling important in the real world?

  • Heterogeneity of remote resources

  • Shared disk

  • Workloads are changing; some are very, very large.



Capacity-aware scheduling

  • Goal

    • Avoid overallocations

      • Cache thrashing

      • Write failures

  • Method

    • Breadth-first

    • Depth-first

    • Idleness
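For reference, the two traversal orders being compared, sketched over a grid of pipelines and stages; the capacity-aware scheduler additionally dispatches only what fits (illustrative sketch, not BAD-FS code):

# Illustrative traversal orders over jobs indexed (pipeline, stage).
def breadth_first(num_pipelines, depth):
    # Same stage of every pipeline first: maximizes sharing of batch data,
    # but needs pipeline volumes for all pipelines at once.
    return [(p, s) for s in range(depth) for p in range(num_pipelines)]

def depth_first(num_pipelines, depth):
    # One pipeline at a time: smallest pipeline-volume footprint,
    # but batch data may be evicted and re-fetched between pipelines.
    return [(p, s) for p in range(num_pipelines) for s in range(depth)]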



Capacity-aware scheduling evaluation

  • Workload

    • 64 synthetic pipelines

    • Varied pipe size

  • Environment

    • 16 compute nodes

  • Configuration

    • Breadth-first

    • Depth-first

    • BAD-FS

Failures correlate directly with workload throughput.

