1 / 53

Chris B. Papineau Software Architect Oracle Corporation Denver, CO

Non-linear scaling of long running batch jobs: “The twelve days of degradation” Computer Measurement Group Conference 2011 Paper # 5002 Session #514. Chris B. Papineau Software Architect Oracle Corporation Denver, CO. Non-linear scaling of long running batch jobs Introduction.

dutch
Download Presentation

Chris B. Papineau Software Architect Oracle Corporation Denver, CO

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Non-linear scaling of long running batch jobs: “The twelve days of degradation”Computer Measurement Group Conference 2011Paper # 5002Session #514 Chris B. Papineau Software Architect Oracle CorporationDenver, CO

  2. Non-linear scaling of long running batch jobsIntroduction • Forecast for today: 100% Cloud free

  3. … four software fixes … three hardware upgrades … two OS patches And a bug fix for your custom program! Non-linear scaling of long running batch jobsIntroduction • The Twelve Days of Christmas • A Christmas carol which scales non-linearly • Singing all twelve verses takes more than 12 times as long as verse #1 alone • Why? Each of the twelve verses takes progressively longer to sing • Each verse adds something new and includes all previous gifts

  4. Non-linear scaling of long running batch jobsIntroduction Day #1: 1 gift Day #2: 3 gifts Day #6: 21 gifts Day #12: 78 gifts . . . Day #n: n(n+1)/2 gifts

  5. Non-linear scaling of long running batch jobsIntroduction TOTAL of all gifts, all 12 days: n=1 ∑ n(n+1)/2 n=12 …. = 364 total gifts …. But we digress…….

  6. Non-linear scaling of long running batch jobs Introduction • This customer case study will address ONE common cause of non-linearity in large batch jobs • It is NOT the only possible cause – just a fairly common one • Along the way, performance analysis principles used will be highlighted • The cause deals with resource leaks • …as opposed to memory leaks • The batch job in question comes from a typical Enterprise Resource Planning (ERP) package

  7. Nonlinear scaling of long running batch jobsAgenda • Problem Definition • Analysis • Results • Conclusion

  8. Nonlinear scaling of long running batch jobsProblem Definition • “If you don't know where you're going, chances are you will end up somewhere else.” • Yogi Berra

  9. Nonlinear scaling of long running batch jobsProblem Definition • The Problem • Several batch programs from a typical ERP package did not scale with increasing data input • They are all Electronic Data Interchange (EDI) related • Batch ‘A’ - creates EDI Receiving Advices • 3000 records: 55m • 6000 records: 396m • 9000 records: 722m • Batch ‘B’ - generates EDI Inbound Purchase Orders • 3000 records: 8m • 6000 records: 22m • 9000 records: 110m • Batch ‘C’ - processes EDI Inbound Purchase Orders • 3000 records: 6m • 6000 records: 22m • 9000 records: 45m

  10. Nonlinear scaling of long running batch jobsProblem Definition Runtime(minutes) Data showing the non-linear scaling of batch jobs • The runtime rises much more dramatically than the amount of input data. • Batch ‘A’ was by far the worst and most disruptive offender 9000 records Batch ‘B’ 6000 records Batch ‘C’ Batch ‘A’ 3000 records Batch ‘B’ Batch ‘C’ Batch ‘A’

  11. Nonlinear scaling of long running batch jobs Problem Definition • Platform: IBM iSeries + DB2 DBMS • Background on the ERP product in question: • Business Logic in the batch programs is implemented in the C language in the form of dynamically loaded functions stored in Dynamically Linked Libraries (DLLs) • These are called “Business Functions” (BSFNs) • This is the “Application code” • End users have access to the source code of the BSFNs • Power users may modify the code, or create new BSFNs • BSFNs, in turn, make calls to proprietary libraries of functions which are part of the ERP infrastructure • “Tools APIs” provide technical back-end functionality • Data caching • Database I/O • SQL statement construction • Middleware • Data formatting • Other assorted functionalities

  12. Nonlinear scaling of long running batch jobsAnalysis • Problem Definition • Analysis • Results • Conclusion

  13. Nonlinear scaling of long running batch jobs Analysis • You get a sharp pain in your eye when drinking a cup of coffee… • Do you: • Run chemical tests on the coffee, send to a lab, involve an ophthalmologist, check for food allergies OR…. • Take the spoon out of the cup?

  14. Always validate and verify the data before pursuing more detailed analysis Nonlinear scaling of long running batch jobs Analysis • Identify low-hanging fruit items first • “When your mother tells you she loves you, get a second opinion”- from Primal Fear • Run conditions verified: • Was the SAME data set used for each test? • √ The input data range for all the runs (3000 / 6000 / 9000 records) was identical, and the database was reset prior to each test – so these were repeatable and reliable “apples-apples” tests. • Was nothing else running during the tests? • √ The system was quiescent before each run • Operating system tuning • IBM iSeries experts analyzed temp space, memory pools, paging, etc… • √ No hardware or OS tuning opportunities which were obvious • …although Memory pools are a very common issue with UBEs on iSeries

  15. Nonlinear scaling of long running batch jobs Analysis • “Everybody Lies”- Gregory House, M.D.

  16. WHERE IS THE CODE SPENDING ITS TIME??? WHERE IS THE CODE SPENDING ITS TIME??? • Performance First Principles: Where is the code and/or system spending its time? Nonlinear scaling of long running batch jobs Analysis • With the “low hanging fruit” items addressed, the next step was to turn to First Principles of Performance Analysis: • Where is the Code or System Spending its time? • This involves collection of profiling data for each batch job. See: “Performance Engineering Parables”C. PapineauCMG conference 2010

  17. Nonlinear scaling of long running batch jobs Analysis • Numerous options exist for profiling tools, depending on operating System, language, etc • Quantify • Tprof • Performance Explorer (PEX) • The ERP package in question provides a customized debugging feature which produces output in plain text format • Enabled by a configuration file setting • Debugging data output to a text file • The Batch programs are instrumented to collect the following time-stamped debugging data: • BSFNs • SQL • Hierarchical BSFN Call Stack data • Proprietary technical functions • Middleware APIs • Tools APIs

  18. Plain text output can be used for instrumentation and targeted custom profiling Nonlinear scaling of long running batch jobs Analysis • A tool called Performance Workbench (PW) is provided with the ERP product which can parse this text-based data and produce HTML-based reports and call stacks as output • The text based data can also be manipulated and analyzed using spreadsheets, Perl programs, or other text parsing tools of choice • So – the time-stamped ERP debugging output is used as custom profiling data • The ERP product provided the custom instrumentation critical to this analysis. But…. • Custom instrumentation / profiling data can be added to ANY software package to which source code is available • Format can be specified as per needs, in whatever manner is most useful to the desired purposes • Plain text can be readily parsed into human readable format by C, Perl, VB, etc…. • For details on Performance Workbench – see Poster Session 598 • Thursday 5:00 PM - 6:30 PM Room: Woodrow Wilson 

  19. Nonlinear scaling of long running batch jobsAnalysis • Collect profiling logs – 3000 record case • Run batch jobs with ERP debugging feature enabled • 3000 record case provided enough of a robust sample size but without the excess overhead of the 6000 or 9000 record cases • PW run was run against the plain text data, producing these outputs: • HTML Code profile • Hierarchical Call Stack of Business Functions • Parse profiling logs with Excel spreadsheet, extract BSFN runtimes • Columnar / fixed-width nature of the data made it a simple matter to import to Excel and leverage graphical capabilities • We will focus on Batch ‘A’ for the purposes of this study, since it was the most severely degraded and illustrates the key analysis concepts most clearly. However, similar procedures were applied to all three jobs yielding similar results.

  20. Performance Workbench profile of batch job: BSFN call stack output BSFN: R47011DataBaseUpdateSection() Import Performance Workbench output into Excel – graph BSFN timing trends Individual listing of BSFN runtimes Graphical analysis of BSFN runtimes Parse the profiling data with Excel spreadsheet Nonlinear scaling of long running batch jobsAnalysis

  21. Performance Workbench HTML profiling data – “Where is the code spending its time?” Critical Business Functions in profile of Batch Job ‘A’: F4312EditLine F4312EndDoc Nonlinear scaling of long running batch jobs Analysis • The two BSFNs which consumed the majority of the time in Batch ‘A’ are indicated in the PW profile below • The BSFNs are called about 3000 times each • This sample size warrants a measuring the trend of those runtimes for the duration of the run • This is where Excel is useful….

  22. Business function runtimes steadily increase as batch program progresses Elapsed Time Elapsed Time Nonlinear scaling of long running batch jobs Analysis • The degradation seemed to be distributed throughout the code, and not occurring in one area • Here are the runtimes for the two key BSFNs for the 3000 record run, plotted as a function of elapsed batch time • They clearly rise steadily over the course of the batch job:

  23. Nonlinear scaling of long running batch jobsAnalysis • What could be causing such an “across the board” degradation • What do all the critical parts of the BSFN code have in common that would exhibit this uniform rise in runtime? • The question to ask is: • What do these batch jobs DO ???? • A look at the profiling data shows a decisive insight: • These jobs are very heavy users of data caching

  24. Nonlinear scaling of long running batch jobsAnalysis The PW profiling data showed thousands of calls to data caching APIs • The ERP package in question provides a caching API set which allows the developer to create lists of indexed data in memory. • These lists can be manipulated and modified in many of the same ways as a database, e.g: FETCH, DELETE, UPDATE, INSERT. • There are also advanced features similar to that which one would see in a Database Management System (DBMS), such as cursors and multiple indexes. • jdeCacheInit() initializes a cache object • jdeCacheTerminate() releases it

  25. Specific product knowledge is critical to performance analysis of complex systems Nonlinear scaling of long running batch jobs Analysis • A simple feature in the ERP product’s logging capability tracks the caches whose handles have been initialized via jdeCacheInit() but not been released via jdeCacheTerminate() • Many of the ERP products applications utilize caching, and the PW tool + the debug instrumentation recognizes this fact A Lengthy list of unterminated caches in the raw ERP debug logging

  26. Nonlinear scaling of long running batch jobs Analysis • de·us ex ma·chi·na[dey-uh s eks mah-kuh-nuh,dee-uh s eks mak-uh-nuh] noun • Any artificial or improbable device resolving the difficulties of a plot. • If the leap to data caching seems a bit improbable: • While the Performance Engineer does not have to possess expert level knowledge of the code in question, he or she must have access to such a Subject Matter Expert (SME). • In this case, the insight of the cache APIs dominating the profile – provided by such a SME - was a critical leap which shortened this performance analysis effort greatly. • A code profile is data, but data is of limited use if it does not provide information. "I should think you Jedi would have more respect for the difference between knowledge and wisdom." - Dexter Jettster, Star Wars: Attack of the Clones

  27. Nonlinear scaling of long running batch jobs Analysis • Upon investigation of the debug logs, the crux of this issue is clear: • A very large number (tens of thousands) of uniquely named caches (thousands) are used in all three of these batch jobs • These caches are not properly terminated during the course of the job. • Note that this is a different problem from either of the following, and should not be confused with same: • Large number of records in a single cache • Large number of references to a single cache without terminating • Neither of the above would cause the sort of phenomenon we’re seeing….

  28. Resource and memory leaks can have a critical impact on Performance Nonlinear scaling of long running batch jobs Analysis • When many caches are created but not destroyed when no longer needed, the internal list of cache names in jdeCacheInit() gets progressively longer, so the calls to jdeCacheInit() take longer and longer. • The jdeCacheInit() API is called everywhere, thus having the effect of making “all” areas of the code quite uniformly slower as the run progresses. • This is what we call a “resource leak” • The actual memory cost is not significant

  29. Nonlinear scaling of long running batch jobsAnalysis • The reason for the unique cache names for thread safe code • Two users must not be using the same cache in the same process • This methodology itself is not what causes the problem • Failing to destroy each uniquely named cache as soon as it is no longer needed is the true problem • This causes the unneeded cache names to accumulate in the internal buffers of the jdeCacheInit() and jdeCacheTerminate() APIs • …which in turn causes progressively longer calls over the course of processing the data. • The impact on the batch jobs over time is very significant. • The true problem is the large cache name “footprint” – the total number of unique cache names stored in the buffers at any given time must remain small.

  30. Nonlinear scaling of long running batch jobsAnalysis • jdeCache dynamic runtime cache naming: BSFN Code sample: /* Add (const JCHAR*)before szMathString*/ sprintf(szCacheName, _J(“GeneralLedgerCache-%ls"), szUniqueJObID); /* Use the new cache name */ CacheInitReslt = jdeCacheInit(hUser, hCache, szCacheName, WfIndex); • Rather than hard coding a cache name, the name for each cache is created by appending a unique number to a hardcoded string: • The unique number could be job number, order number, or something else to do with the business purposes of the app. • In this case, the variable Job ID (szUniqueJObID) gets sprintf()’ed to a hardcoded string, e.g. “GeneralLedgerCache-” • For each record processed, a new unique cache name is dynamically created • This is only a problem when the reference count does not go to zero when the program is finished with the cache, and the cache name remains in the list. • In this case – the practice leads to an extremely large number of distinct caches– which causes the slowness in the bowels of the cache APIs

  31. Nonlinear scaling of long running batch jobs Analysis • Why does the accumulation of cache names slow down the batch job by such a large factor? • All these thousands of cache names get stored in static internal buffers in the tools cache APIs. • They must be processed by jdeCacheInit() and jdeCacheTerminate() each time. • The larger the run, the more cache names, and the larger these buffers get.  This is the reason why larger runs become increasingly unscalable and run longer by non-linear factors. • There are two while loops in each call to jdeCacheInit() which become longer and longer each time it is called to create or access a cache. • jdeCacheInit() must loop through an ever increasing list of all existing cache names each time a cache is created. • jdeCacheTerminate(),similarly, must loop through the same bloated list to find the proper cache to terminate “Tweleve Drummers Drumming, Eleven pipers piping, ten lords-a-leaping….”

  32. Nonlinear scaling of long running batch jobsAnalysis • jdeCacheInit() • If there is no existing cache of the same name: Creates the new cache and adds the name to the internal list • If a cache of the indicated name already exists:Increments the reference count for existing caches • In either case – the entire list of cache names is searched sequentially • jdeCacheTerminate() • If the reference count is > 1: Only decreases the reference count • If the reference count = 1: then the cache is actually destroyed and removed from the list • IN either case – the entire list of cache names is searched sequentially

  33. Loop through every cache object Compare cache name with every other cache stored in the list Nonlinear scaling of long running batch jobsAnalysis • jdeCacheInit() Tools API internals - code segment

  34. Nonlinear scaling of long running batch jobs Analysis • Back to the ERP debug log … • There is a repeating pattern of seven caches, each of which has one instance for each EDI job number • There is a static name appended to a variable (EDI job #) • Each of these cache names is stored in the internal cache name list and must be scanned each time jdeCacheInit() or jdeCacheTerminate() is called • …The 21,000 days of Christmas - jdeCache() style

  35. Nonlinear scaling of long running batch jobsAnalysis • So where do these caches occur in the code? • This question must be answered if the problem is to be fixed at the application level. • This can be determined via a customized string search of the profiling data. • Standard text parsing tools, such as those supporting regular expressions, can be used for this task. • However, the PW tool provides this functionality as yet another built-in feature. • The profiling data can searched for user-specified strings, and an output file listing all the BSFNs containing those strings within their scope will be listed • The hardcoded portions of the seven cache names in Batch ‘A’ can be entered as search strings in Performance Workbench to locate the “unterminated cache” names indicated in the debug log.

  36. Performance analysis involves understanding the data and deriving original insights, not just rote number crunching Nonlinear scaling of long running batch jobsAnalysis • The BSFNs containing the seven cache strings within their scope are indicated, thus it is these BSFNs which contain the offending unterminated caches. • Note that a fundamental insight was necessary to once again simplify the problem scope: • Namely, the patterns in the cache names which essentially reduced 21,000 names to seven. This is why software performance work is truly engineering, not rote technician’s tasks. • It is NOT “linear thinking”

  37. The use of the proper profiling tools is critical to performance analysis. This may entail customized profiling and data parsing. Nonlinear scaling of long running batch jobsAnalysis • It should be noted here that the Performance Workbench tool – customized to this ERP product - has once again provided an insight vital to the timely resolution of this issue. • It has done so in a manner much more straightforward than that of tools from a third-party vendor. • The lesson here is that customized time-stamped profiling is often the decisive factor in quickly narrowing down the source of performance problems in cases where access to the source code is possible.

  38. Nonlinear scaling of long running batch jobs Analysis • For the 3,000 record case: • Caches in Batch ‘A’ which are not terminated: total of 21,070 unique caches in seven patterns in three Business Functions • BSFN: F4111ClearDetailStack (XT4111Z1) • F41UI001-<number> • X4111Z1GL-<number> • X4111Z1LPACM-<number> • X4111Z1QTY-<number> • X4111Z1ACM-<number> • BSFN: CacheProcessPOHeaderCache (B4302180) • B4302180F1706<number> • BSFN: B3003130 (B3003130) • B3003130-<number> • Caches in Batch ‘B’ which are not terminated: total of 558 unique caches in one pattern in one Business Function • BSFN: F41021DeleteCache (B4200370) • B4200370C-<number> • Caches in Batch ‘C’which are not terminated: total of 3000 unique caches in one pattern in one Business Function • BSFN: CacheProcessConfigurationID (B3201470) • B3201470<number>

  39. Nonlinear scaling of long running batch jobsAnalysis • Applications Solutions: • Fix the BSFN code to include appropriate calls to jdeCacheTerminate() • This was the case for Batch ‘A’ .. OR, if the code logic dictates: • Fix the BSFN code to eliminate the superfluous calls to jdeCacheInit() • In other words, did we REALLY need to initialize all those caches in the first place… • This was the case for Batch ‘B’ and ‘C’

  40. Nonlinear scaling of long running batch jobsAnalysis • Pseudocode showing the essence of the fix for case 1 /* if cache cursor is NULL AND there are no records in the cache, terminate the pointer */ if (GeneralLegderCacheCursor == NULL) { if (jdeCacheGetNumRecords(hGeneralLegderCache) == 0) { /* MISSING CALL to jdeCacheTerminateAll() */ jdeCacheTerminate(hUser, hGeneralLegderCache); } } So this is the answer to all the riddles….

  41. The ability to read and understand source code is very important to determining the cause of Performance Problems Nonlinear scaling of long running batch jobsAnalysis • This leads us to another vital factor in Performance Analysis: the ability to read and understand the code. • In order to comprehend details of the problem, one does not need to be an expert at writing code. • However, being adept at grasping the nuances of code written by others is essential. • In other words, you do not need to take off and land the plane, but you should be able to navigate in the air.

  42. Loop through every cache object Compare cache name with every other cache stored in the list Nonlinear scaling of long running batch jobs Analysis • Elephant in the living room: • The sequential search in the Tools code shown below • Linked list data storage, combined with efficient search algorithms such as bubble sort and shell sort, are very fundamental “cookie cutter” concepts. Why was this not addressed????? Sequential search…YIKES!!!!

  43. Science means solving Academic problems Engineering means solving Business problems. Nonlinear scaling of long running batch jobs Analysis • The answer was driven by business reasons: • Modifications to the Tools code layer take longer to deliver to customers that those to the application level, since the Tools code is proprietary and can only changed by the vendor. • Changes to Tools APIs impact every application which makes calls to them, and so there is a much longer regression testing cycle involved. • Changes to application code, conversely, can be delivered quickly and tested right in the customer’s environment. • When a customers’ business cycle is impacted, the fastest means of delivering a stable solution must be leveraged, so an application-level solution to the problem was delivered

  44. Nonlinear scaling of long running batch jobs Analysis • There actually WAS a Tools code change which resulted from this effort, which will tend to “cover up” the application programming oversight which caused this batch scalability predicament. • This fix used a binary tree algorithm instead of the flawed sequential search of a “flat” array in jdeCacheInit() and jdeCacheTerminate(). • It was delivered at a later date, as a part of a regularly scheduled Tools “service pack”. • Application-side solution will always work, even without this Tools fix • With this fix in place, however, ANY batch scaling problems caused by this issue should improve dramatically

  45. Throughput(init per sec) # of cache objects left open Nonlinear scaling of long running batch jobsAnalysis • Characterization of the problem…. • Internal testing by the Tools team using a test driver which leaves varying numbers of caches open: • Throughput drops severely as a function of the number of unique cache names left in the internal list • If n is the caches leaked per iteration • The throughput falls off as O(1/n)…at least …the more unique cache objects left open, the slower the program runs

  46. Nonlinear scaling of long running batch jobs Analysis • Have many customers have complained about scalability in these batch jobs? • Customers may not notice it if they are running sufficiently low volumes, even if the issue has long been present. • Others may have noticed the problem, but their runtime windows could be long enough so that it does not cause a business disruption. • Ergo – they don’t enter calls… …performance is often in the eye of the beholder

  47. Nonlinear scaling of long running batch jobsAgenda • Problem Definition • Analysis • Results • Conclusion

  48. Nonlinear scaling of long running batch jobs Results • Results: • Following the simple application code fix, the throughput of all three batch programs was dramatically improved. • Note that the runtime of the 9000 record use case for Batch ‘A’, on which this paper has focused, was reduced from 722 minutes to 30 minutes The use case that REALLY mattered • Anomalous results...likely due to data inconsistencies

  49. Nonlinear scaling of long running batch jobs Results • Terminating the cache objects clearly solved a significant resource leak in these batch programs. • After the fix was applied, the batch runtimes rose at a normal, linear rate to the amount of input data processed. • The disruptions to the customer’s business cycle were completely resolved.

  50. Nonlinear scaling of long running batch jobsAnalysis • Problem Definition • Analysis • Results • Conclusion

More Related