
Local SSD Cache Avoids Network Throughput Limitations in Batch System Workers

This evaluation examines the use of local SSD caches to avoid network throughput limitations on batch system workers. It covers the use of a union file system for transparent redirection and compares the performance of different cache access methods.


Presentation Transcript


  1. Data Access Evaluation
     • ekpsg01: storage/a, ssda, aufs
     • HPDA Update Talk - Cache Access Methods
     • EKP Tuesday Computing Meeting

  2. Brief Overview
     • Goal: machine-local reads on batch system workers
     • Avoid network throughput limitation with host-local SSD caches
     • Persistent data on remote file server, copies on local devices
     • Union file system provides transparent redirection (sketched below)
     [Diagram: current work item; jobs on a worker read through a union fs backed by a local SSD and the remote file server]
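To make the redirection concrete, here is a minimal sketch of the staging step, assuming the branch paths listed on the test machine slide (slide 8); the input file name is a hypothetical placeholder, not from the slides.

    # Sketch only: stage one input file onto the local SSD branch so that
    # subsequent reads through the union mount are served locally.
    # INPUT_FILE is a hypothetical example name; files are placed before
    # the jobs start, matching the "manually placed" setup in slide 3.
    INPUT_FILE=example_input.root
    mkdir -p /scratch/ssda/storage/a
    cp /storage/a/"$INPUT_FILE" /scratch/ssda/storage/a/
    # Jobs keep using the same logical path; aufs resolves the lookup to
    # the SSD branch first, so the read never touches the network.
    ls -l /hpda/storage/hpda/"$INPUT_FILE"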

  3. Cache Device Evaluation Test
     • Artus JEC analysis (250 GB input) as reference
     • Read from /storage/a, ssda, or aufs (100% of input on ssda)
     • Via 32, 24, 16, 8, 4, or 2 concurrent processes
     • Tracked by /usr/bin/time and dstat (driver sketched below)
     [Diagram: jobs on the worker, monitored by time and dstat, read via aufs from the SSD (files manually placed) and the file server; test by Joram & Dominik]
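A hedged sketch of how one such test run could be driven: the actual Artus command line is not given in the slides, so run_artus.sh is a hypothetical stand-in, while dstat and /usr/bin/time are invoked with standard options.

    # Sketch only: run N concurrent analysis processes against one access
    # path while dstat samples host CPU, disk and network once per second.
    N=32                                   # also tested: 24, 16, 8, 4, 2
    INPUT_DIR=/hpda/storage/hpda           # or /storage/a or /scratch/ssda/storage/a
    dstat --time --cpu --disk --net --output "dstat_${N}.csv" 1 &
    DSTAT_PID=$!
    pids=()
    for i in $(seq 1 "$N"); do
        # /usr/bin/time -v records wall time, CPU and I/O per job
        /usr/bin/time -v ./run_artus.sh "$INPUT_DIR" > "job_${i}.log" 2>&1 &
        pids+=("$!")
    done
    wait "${pids[@]}"                      # wait for the jobs only
    kill "$DSTAT_PID"                      # then stop the monitor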

  4. Input Rate (Network + Drive)
     • Local reads (ssda & aufs) consistently faster; stall only at the hyperthreading (HT) barrier
     • Remote read stalls at ~1 Gb/s (~125 MB/s) and scales badly beyond 4 cores
     • => Local cache is adequate for improving scalability
     [Plot: input rate vs. concurrent processes; annotations mark the physical and logical core counts, the 1 Gb/s line, and the stalled file server]

  5. Event Rate (the thing that counts)
     • Host input speed translates (almost) directly into job event rate
     • Local reads consistently faster, no loss from the union file system
     • => Local cache delivers a consistent performance improvement

  6. Notable Conclusions
     • SSDs sufficient
        • Enough space for the J&D analysis
        • Scaled nicely
     • aufs sufficient
        • No notable performance loss
        • No problem from the dated version (used v2 vs. current v3)
     • Fileserver test problematic…
        • Peak network speed slower than expected (x0.1: ~115 MB/s observed on the 10 Gbit/s link)
        • Instantaneous network speed varied widely (2 MB/s - 115 MB/s)
        • Output directory broken for hours…
     • Interpolation suggests 40-80 cores sufficient for saturation
        • Got 64 in ekpsg0X, 150 in ekpblus

  7. BACKUP - Analysis
     • Artus Analysis
     • CMSSW_5_3_22
     • ~250 GB of 2012 data

  8. BACKUP - Test Machine
     • Host: EKPSG01
        • 64 GB RAM
        • 32 cores @ 2.60 GHz (16 physical + 16 logical)
        • SL6, kernel 2.6.32-504.3.3.el6.aufs21.x86_64
     • /scratch/ssda
        • Model: ADATA SX910
        • 512 GB, read 550 MB/s, write 530 MB/s
     • /storage/a
        • Fileserver A via 10 Gbit/s Ethernet
        • dd read ~115 MB/s (1 Gbit/s from FSA?)
     • /hpda/storage/hpda
        • aufs 2.0 mount (command sketched below)
        • br=/scratch/ssda/storage/a=ro:/storage/a=rw
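For reference, a minimal sketch of how this union mount could be created from the br= specification above, assuming root privileges on the aufs-patched kernel listed on this slide:

    # Sketch only: assemble the union mount from the branch specification.
    # The first branch (SSD) is read-only and takes lookup precedence;
    # writes fall through to the rw /storage/a branch.
    mount -t aufs -o br=/scratch/ssda/storage/a=ro:/storage/a=rw \
        none /hpda/storage/hpda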
