Job Matching – CPU Power Requirements



  1. Job Matching – CPU Power Requirements
     Stephen Burke, RAL (with contributions from Marco Cecchi, Jeff Templon and Steve Traylen)
     WLCG GDB, CERN, 9th June 2010

  2. Outline
     • The Installed Capacity document was approved by the WLCG Management Board on the 3rd of February 2009
       • Presentations at various GDBs, e.g. February 2009
       • Progressive implementation over the last year
     • LHCb recently raised the question of how to scale queue time limits by the CPU power in the new publication scheme
     • I will discuss:
       • What problems we were trying to solve with the new scheme
       • What we did and how the implementation has progressed
       • What the problem is for JDL requirements, and how to solve it
     • NB I will only discuss publication of computing resources

  3. Schema Constraints
     • The GLUE 1 schema was defined a long time ago, and the only measures of CPU power are SpecInt and SpecFloat (the latter never used in practice)
       • GlueHostBenchmarkSI00, GlueHostBenchmarkSF00
       • SubCluster attributes – the original schema assumption was that SubClusters are homogeneous
       • Multicore CPUs were not foreseen – but we did add a LogicalCPUs attribute to count hyperthreaded CPUs in GLUE 1.2
     • In practice SubClusters are often heterogeneous, so publication is ambiguous – minimum or average?
     • Some sites scale times to a reference power which is very different to the average (usually lower), so the published SI00 value doesn’t match the real power of the system
     • SpecInt is obsolete – we need a managed transition to HEPSpec06
       • But non-LCG EGEE sites may not migrate?
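To make the "minimum or average?" ambiguity concrete, here is a minimal Python sketch. The node groups and their SI00 values are invented for illustration and are not taken from any real site; the function name is likewise hypothetical.

```python
def published_si00_candidates(node_groups):
    """Return (minimum, core-weighted average) SI00 for a SubCluster.

    node_groups is a list of (number_of_cores, si00_per_core) pairs,
    one pair per homogeneous group of worker nodes.
    """
    total_cores = sum(cores for cores, _ in node_groups)
    minimum = min(si00 for _, si00 in node_groups)
    average = sum(cores * si00 for cores, si00 in node_groups) / total_cores
    return minimum, average


# Hypothetical heterogeneous SubCluster: 100 old cores and 300 newer ones.
groups = [(100, 1000), (300, 2000)]
minimum, average = published_si00_candidates(groups)
print(minimum, average)  # 1000 1750.0
```

A single published SI00 cannot carry both numbers: the average suits a capacity total, while the minimum is the safe choice when scaling a time limit.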

  4. Goals
     • Publishing information in order to provide the WLCG management with a view of the total installed capacity and resource usage by VOs at sites. Focus on:
       • Static information that produces monthly reports
         • Publishing information to provide the WLCG management with a view of the total installed capacity at a site
         • Publishing information to provide the WLCG and VO management with a view of the resource assignment per VO at a site
       • Dynamic/ongoing information
         • Publishing information to allow VO operators to monitor the VO usage of the resources at a site
         • Allowing clients (consumers of the information published) to select the correct resources and their characteristics
     WLCG GDB, CERN, 11 February 2009

  5. Schema Usage Changes
     • Unused string attribute GlueHostProcessorOtherDescription used to publish the number of cores per CPU, and HEPSpec06 if calculated
       • Cores=4,Benchmark=9.40-HEP-SPEC06
     • GlueHostBenchmarkSI00 should become the average physical power per core
       • Scaled from HS06 if measured (SI00 == 250*HS06)
       • Installed capacity per SubCluster = LogicalCPUs*SI00
     • Multivalued string attribute GlueCECapability used to publish the effective CPU power used in the batch system
       • CPUScalingReferenceSI00=<ref CPU SI00>
       • The SI00 for which the published GlueCEMax{CPU/Wallclock}Time is valid
         • The reference CPU power if the batch system scales all times
         • The minimum CPU power if not
     • CECapability is also used to publish VO shares (e.g. Share=lhcb:11), and now glexec presence
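The arithmetic on this slide can be sketched in a few lines of Python. The sketch assumes the GlueHostProcessorOtherDescription string has exactly the comma-separated key=value form shown above. The function names are hypothetical, and parsing Cores as a float (so that a non-integer average would be accepted) is an assumption, not something the slide states.

```python
SI00_PER_HS06 = 250  # conversion given on the slide: SI00 == 250*HS06
HS06_SUFFIX = "-HEP-SPEC06"


def parse_other_description(text):
    """Parse e.g. 'Cores=4,Benchmark=9.40-HEP-SPEC06' into (cores, hs06).

    Either value is None if the corresponding key is absent.
    """
    cores = None
    hs06 = None
    for item in text.split(","):
        key, _, value = item.strip().partition("=")
        if key == "Cores":
            cores = float(value)  # assumption: may be a non-integer average
        elif key == "Benchmark" and value.endswith(HS06_SUFFIX):
            hs06 = float(value[: -len(HS06_SUFFIX)])
    return cores, hs06


def si00_from_hs06(hs06):
    """Scale a measured HEP-SPEC06 value to SI00."""
    return SI00_PER_HS06 * hs06


def installed_capacity_si00(logical_cpus, si00_per_core):
    """Installed capacity per SubCluster = LogicalCPUs*SI00."""
    return logical_cpus * si00_per_core


cores, hs06 = parse_other_description("Cores=4,Benchmark=9.40-HEP-SPEC06")
si00 = si00_from_hs06(hs06)
# 100 logical CPUs is an invented figure for illustration only.
print(cores, hs06, si00, installed_capacity_si00(100, si00))
```

With the slide's example value of 9.40 HS06, the scaled per-core power comes out at about 2350 SI00.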

  6. Other Changes
     • New YAIM needed to configure the new attributes
       • Released for SL4 in July 09, but SL5/CREAM was missed – eventually out in January
       • Currently 288/700 SubClusters publishing HEP-SPEC06, 625 publishing Cores, and 3727/4644 CEs publishing CPUScalingReferenceSI00
         • and 300/342 CEs supporting LHCb
       • Some sites have CPUScalingFactorSI00 due to a bug in the original docs
       • Some other errors as usual
     • APEL changed to read CPU power from CECapability instead of SubCluster
       • Falls back to SubCluster if necessary
       • Released for SL4 in March, but SL5/CREAM was missed – coming RSN
       • Also a bug fix for invalid assumptions about SubClusterUniqueID
     • Sites should probably not yet have changed the SubCluster SI00 – but some may have done so
     • gstat 2 is now testing compliance with the document

  7. JDL Matching Problem
     • JDLs should have a Requirement to select queues with a sufficient time limit
       • The required time should be scaled by the CPU power
           Requirements = other.GlueHostBenchmarkSI00*other.GlueCEPolicyMaxCPUTime > MinWork;
     • LHCb asked: how do we change that to use the CPUScalingReferenceSI00?
       • Conceptually the same, but this is a string in a multivalued attribute
       • classads has limited flexibility – and some “features”
       • It seems that we didn’t check this explicitly when the document was written – the focus was on accounting
     • Can we do it? Yes we can!
       • But it took some ingenuity …
       • Still tweaking the JDL and checking that it works in all cases
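As a hedged illustration of what the original Requirements expression computes, the Python sketch below treats MinWork as the product of a reference CPU power and the time a job needs at that power. The function names and all numbers are invented for the example. The time unit is whatever the queue publishes; the sketch only assumes both sides use the same one.

```python
def min_work(reference_si00, time_needed):
    """Work a job needs: time on a reference CPU times that CPU's SI00."""
    return reference_si00 * time_needed


def queue_is_sufficient(queue_si00, queue_max_cpu_time, work):
    """Mirror of: SI00 * MaxCPUTime > MinWork."""
    return queue_si00 * queue_max_cpu_time > work


# Hypothetical job: needs 2880 time units on a 1000 SI00 core.
work = min_work(1000, 2880)

# A faster queue with a shorter limit can still match ...
print(queue_is_sufficient(2000, 1500, work))  # True
# ... while a slower queue with a longer limit may not.
print(queue_is_sufficient(1000, 2000, work))  # False
```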

  8. Solution
     • Need to work around various classads problems, so a bit clunky
     • Can only cope with a fixed number of Capability attributes, but in practice this seems OK for now

       Lookup="CPUScalingReferenceSI00=*";
       cap = isList(other.GlueCECapability) ? other.GlueCECapability : { "dummy" };
       i0 = regexp(Lookup, cap[0]) ? 0 : undefined;
       i1 = isString(cap[1]) && regexp(Lookup, cap[1]) ? 1 : i0;
       i2 = isString(cap[2]) && regexp(Lookup, cap[2]) ? 2 : i1;
       i3 = isString(cap[3]) && regexp(Lookup, cap[3]) ? 3 : i2;
       i4 = isString(cap[4]) && regexp(Lookup, cap[4]) ? 4 : i3;
       i5 = isString(cap[5]) && regexp(Lookup, cap[5]) ? 5 : i4;
       index = isString(cap[6]) && regexp(Lookup, cap[6]) ? 6 : i5;
       i = isUndefined(index) ? 0 : index;
       ref = int(substr(cap[i],size(Lookup)-1));
       check1 = !isUndefined(index) && int(ref*other.GlueCEPolicyMaxWallClockTime) > MinWork;
       check2 = isUndefined(index) && other.GlueHostBenchmarkSI00*other.GlueCEPolicyMaxWallClockTime > MinWork;
       requirements= FQANmember(check1 || check2 ? "VOMS:OK" : "VOMS:nomatch", {"VOMS:OK"});
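The intent of the JDL above may be easier to follow in a general-purpose language. The Python sketch below is an illustration of the same decision logic, not a translation that the WMS could run: find a CPUScalingReferenceSI00 entry among the Capability strings, use it if present, and otherwise fall back to the SubCluster SI00. The function and parameter names are hypothetical. It differs from the JDL in two respects: it scans the whole list rather than only the first seven entries, and it matches the key with a prefix test rather than a regular expression.

```python
LOOKUP_PREFIX = "CPUScalingReferenceSI00="


def reference_si00(capabilities):
    """Return the reference SI00 from a Capability list, or None.

    Like the chained i0..index expressions in the JDL, a later matching
    entry takes precedence over an earlier one.
    """
    found = None
    for cap in capabilities or []:
        if isinstance(cap, str) and cap.startswith(LOOKUP_PREFIX):
            try:
                found = int(cap[len(LOOKUP_PREFIX):])
            except ValueError:
                pass  # malformed value: ignore this entry
    return found


def matches(capabilities, subcluster_si00, max_wallclock_time, min_work):
    """check1 || check2 from the JDL, with the SubCluster SI00 as fallback."""
    ref = reference_si00(capabilities)
    power = ref if ref is not None else subcluster_si00
    return power * max_wallclock_time > min_work


# Invented values for illustration.
caps = ["Share=lhcb:11", "CPUScalingReferenceSI00=2000"]
print(matches(caps, 1000, 100, 150000))                # True: 2000*100 > 150000
print(matches(["Share=lhcb:11"], 1000, 100, 150000))   # False: falls back to 1000
```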

  9. Future improvements
     • Recent classads versions have a regexps() function which can extract the value which matched a wildcard
     • Could implement a new function directly in the WMS
     • GLUE 2 has explicit attributes for these things, but:
       • Probably 1-2 years before we have GLUE 2 fully available
       • Although we could have static publishing for accounting fairly quickly – 3 months?
     • There may always be a need for new attributes
       • All GLUE 2 objects have attributes similar to Capability
       • Should ensure that JDL matching can match against them
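To show what "extract the value which matched a wildcard" would buy, the sketch below uses a capture group from Python's re module as a stand-in. It does not reproduce the ClassAd regexps() signature, which the slides do not give. The pattern and function name are assumptions, and the value is assumed to be an integer.

```python
import re

# Capture the digits after the key; anchored so unrelated strings are skipped.
PATTERN = re.compile(r"^CPUScalingReferenceSI00=(\d+)$")


def extract_reference_si00(capabilities):
    """Return the first captured reference SI00 as an int, or None."""
    for cap in capabilities:
        match = PATTERN.match(cap)
        if match:
            return int(match.group(1))
    return None


# Invented Capability list for illustration.
print(extract_reference_si00(["glexec", "CPUScalingReferenceSI00=1039"]))  # 1039
```

With value extraction of this kind, the index bookkeeping and the substr() offset in the current JDL would no longer be needed.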

  10. Questions?
