Update on replica management



  1. Update on replica management
  Costin.Grigoras@cern.ch

  2. Replica discovery algorithm
  • To choose the best SE for any operation (upload, download, transfer) we rely on a distance metric:
    • Based on the network distance between the client and all known IPs of the SE
    • Altered by the current SE status
      • Writing: usage + weighted write-reliability history
      • Reading: weighted read-reliability history
    • Static promotion/demotion factors per SE
    • Small random factor for democratic distribution (see the sketch below)
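A minimal sketch of how these components could be combined into one score; the class and method names, the weights, and the random-factor amplitude are assumptions for illustration, not the actual AliEn code.

```java
// Illustrative combination of the ranking components above; all names,
// weights and the 0.01 random amplitude are assumptions, not AliEn code.
import java.util.concurrent.ThreadLocalRandom;

public final class SeRanker {
    /** Lower score = better candidate SE for the operation. */
    static double score(double networkDistance, double reliabilityPenalty,
                        double staticFactor, boolean forWriting, double usagePenalty) {
        double s = networkDistance;
        s += reliabilityPenalty;                        // weighted read/write failure history
        if (forWriting)
            s += usagePenalty;                          // usage only matters when uploading
        s += staticFactor;                              // per-SE promotion (<0) / demotion (>0)
        s += ThreadLocalRandom.current().nextDouble() * 0.01; // democratic distribution
        return s;
    }
}
```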

  3. Network distance metric
  distance(IP1, IP2) ranges from 0 (closest) to 1 (farthest), increasing through:
  • same C-class network
  • same DNS domain name
  • same AS
  • f(RTT(IP1, IP2)), if known
  • same country + f(RTT(AS(IP1), AS(IP2)))
  • same continent + f(RTT(AS(IP1), AS(IP2)))
  • far, far away
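A hypothetical encoding of these tiers: only their ordering on the [0, 1] scale comes from the slide; the tier boundaries and the shape of f() are placeholders.

```java
// Hypothetical encoding of the distance tiers; only the 0-to-1 ordering is
// from the slide, the tier widths and f() below are assumptions.
final class NetworkDistance {
    enum Relation { SAME_C_CLASS, SAME_DNS_DOMAIN, SAME_AS, RTT_KNOWN,
                    SAME_COUNTRY, SAME_CONTINENT, FAR_AWAY }

    /** Assumed monotone mapping of an RTT (in ms) into [0, 1]. */
    static double f(double rttMs) {
        return Math.min(1.0, rttMs / 300.0);
    }

    static double distance(Relation r, double rttMs) {
        switch (r) {
            case SAME_C_CLASS:    return 0.0;
            case SAME_DNS_DOMAIN: return 0.1;
            case SAME_AS:         return 0.2;
            case RTT_KNOWN:       return 0.2 + 0.2 * f(rttMs); // f(RTT(IP1, IP2))
            case SAME_COUNTRY:    return 0.4 + 0.2 * f(rttMs); // + f(RTT between ASes)
            case SAME_CONTINENT:  return 0.6 + 0.2 * f(rttMs); // + f(RTT between ASes)
            default:              return 1.0;                  // far, far away
        }
    }
}
```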

  4. Network topology (figure)

  5. SE status component
  • Driven by the functional add/get tests (12/day)
    • Failing the last test => heavy demotion
  • Distance increases with a reliability factor:
    • ¾ last-day failures + ¼ last-week failures
    • http://alimonitor.cern.ch/stats?page=SE/table
  • The remaining free space is also taken into account for writing, with:
    • f(ln(free space / 5 TB))
    • Storages with a lot of free space are slightly promoted (with a cap on the promotion), while those running out of space are strongly demoted
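These terms could look like the sketch below: the ¾/¼ reliability split and the f(ln(free/5 TB)) term are from the slide, while the heavy-demotion value, the promotion cap, and the scaling constants are assumptions.

```java
// Sketch of the SE status adjustments; constants marked as assumed are not
// from the slide.
final class SeStatusFactors {
    /** Weighted failure history: 3/4 last day + 1/4 last week. */
    static double reliabilityFactor(double dayFailureRate, double weekFailureRate) {
        return 0.75 * dayFailureRate + 0.25 * weekFailureRate;
    }

    /** Failing the latest add/get test means a heavy demotion (assumed value). */
    static double lastTestPenalty(boolean lastTestFailed) {
        return lastTestFailed ? 100.0 : 0.0;
    }

    /** Free-space term for writing: negative = promotion, positive = demotion. */
    static double freeSpaceFactor(double freeBytes) {
        final double FIVE_TB = 5e12;
        double x = Math.log(freeBytes / FIVE_TB);   // ln(free space / 5 TB)
        return x > 0
            ? -0.1 * Math.min(x, 1.0)   // lots of space: slight promotion, capped
            : -0.5 * x;                 // running out of space: strong demotion
    }
}
```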

  6. What we gained
  • Maintenance-free system
    • Automatic discovery of resources combined with monitoring data
  • Efficient file upload and access
    • From the use of well-connected, functional SEs
    • The local copy is always preferred for reading; if there is a problem with it, the other copies are also close by (RTT is critical for remote reading)
    • Writing falls back to even more remote locations until the initial requirements are met

  7. Effects on the data distribution
  • Raw data-derived files stay clustered around CERN and the T1 that holds a copy
    • job splitting is thus efficient

  8. Effects on MC data distribution
  • Some simulation results are spread over ~all sites and in various combinations of SEs
    • yielding inefficient job splitting
    • which translates into more merging stages for the analysis
      • affecting some analysis types
      • overhead from more, shorter jobs
      • no consequence for job CPU efficiency
  (figure: a very bad case)

  9. Merging stages impact on trains
  • Merging stages are a minor contributor to the analysis turnaround time (few jobs, high priority)
  • Factors that do affect the turnaround:
    • Many trains starting at the same time in an already saturated environment
    • Sub-optimal splitting, with its overhead
    • Resubmission of a few pathological cases
  • The cut-off parameters in LPM could be used: at the price of dropping 2 out of 7413 jobs, the above analysis would finish in 5h

  10. How to fix the MC case
  • Old data: consolidate replica sets into larger, identical baskets for the job optimizer to split optimally (see the grouping sketch below)
  • With Markus’ help we are now in the testing phase on a large data set for a particularly bad train
    • 155 runs, 58K LFNs (7.5 TB), 43K transfers (1.8 TB)
    • target: 20 files / basket
  • Waiting for the next departure to evaluate the effect on the overall turnaround time of this train
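As an illustration of the grouping step, files can be keyed by the set of SEs holding their replicas, so that identical baskets emerge for the optimizer; the data structures below are assumptions, not the actual consolidation code.

```java
// Illustrative grouping: key each LFN by the (sorted) set of SEs that hold
// a replica of it, so files with identical replica sets form one basket.
import java.util.*;

final class ReplicaBaskets {
    /** replicas: lfn -> set of SE names; returns groups keyed by the SE set. */
    static Map<String, List<String>> groupBySeSet(Map<String, Set<String>> replicas) {
        Map<String, List<String>> groups = new TreeMap<>();
        replicas.forEach((lfn, ses) ->
            groups.computeIfAbsent(String.join(";", new TreeSet<>(ses)),
                                   k -> new ArrayList<>()).add(lfn));
        return groups;
    }
}
```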

  11. How to fix the MC case (2)
  • The algorithm tries to find the smallest number of operations that would yield large enough baskets
    • Taking SE distance into account (the same kind of metric as for the discovery; in particular, usage is also considered, and data is kept nearby for fallbacks, etc.)
  • jAliEn can now move replicas (delete after copy), copy to several SEs at the same time, and retry with a delay
  • TODO: implement a “master transfer” to optimize the two stages of the algorithm (first the copy & move operations, then deleting the extra replicas at the end)
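One possible greedy step, sketched under stated assumptions: pick as the consolidation target the SE that already holds the most replicas of the group, so the fewest copy/move operations remain. The real algorithm also weighs SE distance and usage, which this toy version omits.

```java
// Hedged sketch of one greedy consolidation step; not the actual algorithm.
import java.util.*;

final class Consolidator {
    /** Returns the SE hosting the largest share of the group's replicas. */
    static String targetSe(Collection<Set<String>> replicaSets) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> ses : replicaSets)
            for (String se : ses)
                counts.merge(se, 1, Integer::sum);      // replicas already on each SE
        return counts.entrySet().stream()
                     .max(Map.Entry.comparingByValue()) // most replicas = fewest moves
                     .map(Map.Entry::getKey)
                     .orElseThrow(IllegalStateException::new);
    }
}
```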

  12. Option for future MC productions
  • Miguel’s implementation of the output location extension:
    • “@disk=2,(SE1;SE2;!SE3)”
    • The distance to the indicated SEs is altered by +/- 1 (sketched below)
      • after the initial discovery, so broken SEs are eliminated while location is still taken into account
  • The set should be:
    • large enough (ln(subjobs)?)
    • set at submission time, per masterjob
    • with a different value each time, e.g.:
      • a space- and reliability-weighted random set of working SEs
  • Caveats:
    • Inefficiencies for writing and reading
    • Not using the entire storage space, and later on not using all the available CPUs for analysis (though a large production would)
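A sketch of applying such a hint after discovery: listed SEs get their distance lowered by 1, "!"-negated ones raised by 1. The parsing details and the distances map are assumptions based only on the syntax shown above.

```java
// Sketch of applying an "@disk=2,(SE1;SE2;!SE3)" hint to discovered distances.
import java.util.*;

final class OutputLocationHint {
    /** distances: SE name -> discovered distance; returns an adjusted copy. */
    static Map<String, Double> apply(Map<String, Double> distances, String spec) {
        Map<String, Double> adjusted = new HashMap<>(distances);
        int open = spec.indexOf('('), close = spec.indexOf(')');
        if (open < 0 || close < open)
            return adjusted;                            // no SE list in the spec
        for (String token : spec.substring(open + 1, close).split(";")) {
            boolean demote = token.startsWith("!");     // "!SE3" pushes the SE away
            String se = demote ? token.substring(1) : token;
            adjusted.computeIfPresent(se, (k, d) -> demote ? d + 1 : d - 1);
        }
        return adjusted;
    }
}
```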

  13. Summary
  • Two possibilities to optimize replica placement (with the current optimizer):
    • Implement in LPM the algorithm described before
    • Trigger the consolidation algorithm at the end of a production/job
  • And/or fix the “se_advanced” splitting method so that the SE sets become irrelevant
