
Troubleshooting Data Movement

Troubleshooting Data Movement. Dan Gunter LBNL. Background. Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science) Basic question: Why did my transfer (or remote operation) fail? We want to answer this question before the users even ask it!


Presentation Transcript


  1. Troubleshooting Data Movement Dan Gunter LBNL

  2. Background • Work is part of SciDAC CEDPS (Center for Enabling Distributed Petascale Science) • Basic question: Why did my transfer (or remote operation) fail? • We want to answer this question before the users even ask it! • Instrument: middleware, applications, etc. • Monitor: gather data (in response to problems) • Diagnose: failures and performance problems

  3. Topics • Two broad categories of work: • Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG) • Adding new data through instrumentation of standard Grid middleware (e.g. GridFTP)

  4. CEDPS Troubleshooting Architecture

  5. Syslog-ng • Features: • Can filter logs based on level and content • Arbitrary number of sources and destinations • Provides remote logging • Can act as a proxy and tunnel through firewalls • Can execute programs • Send email, load database, etc. • Built-in log rotation • Timezone support

  6. Log collection using syslog-ng

  7. Log Parser • Normalizes unformatted logs to name=value pairs • Plug-in architecture to make it easy to add additional log file formats • We will provide a set of example parsers • But… • Parsers will be hard to write and maintain • If middleware and application developers follow the “logging best practices” document, the parser component will not be necessary
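The plug-in idea on this slide can be sketched in a few lines of Python. This is only an illustration: the `PARSERS` registry, the `parser` decorator, and the "timestamp LEVEL message" input format are hypothetical and are not the CEDPS log parser's actual interface.

```python
import re

# Hypothetical plug-in registry: format name -> parser function.
PARSERS = {}

def parser(name):
    """Register a function that turns one raw log line into a dict."""
    def register(func):
        PARSERS[name] = func
        return func
    return register

# Assumed input format, for illustration only: "<timestamp> <LEVEL> <free text>"
_SIMPLE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

@parser("simple")
def parse_simple(line):
    match = _SIMPLE.match(line)
    if match is None:
        return None  # the line is not in this format
    return {"ts": match["ts"], "level": match["level"], "msg": match["msg"]}

def quote(value):
    """Quote values that contain a space so the pair stays parseable."""
    return f'"{value}"' if " " in value else value

def normalize(fmt, line):
    """Return the line as space-separated name=value pairs, or None."""
    fields = PARSERS[fmt](line)
    if fields is None:
        return None
    return " ".join(f"{name}={quote(value)}" for name, value in fields.items())
```

Adding a new log format then means registering one more function, which is also why each format carries its own maintenance cost.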

  8. Missing Event Detector • Assumes all ‘start’ events should have a corresponding ‘end’ event • Looks for missing ‘end’ events • Generates a replacement ‘end’ event with an error code • Planning to develop more sophisticated anomaly detection capabilities as MDS trigger services.
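A minimal sketch of the matching logic, assuming events have already been parsed into dicts with `event` and `guid` keys. The choice of `status=-1` and the `msg` text for the replacement event are assumptions; the slide only says that an error code is attached.

```python
def find_missing_ends(events):
    """Return replacement 'end' events for every 'start' without an 'end'.

    `events` is a list of dicts with at least 'event' and 'guid' keys.
    """
    open_starts = {}  # (base event name, guid) -> start event
    for ev in events:
        base, _, suffix = ev["event"].rpartition(".")
        key = (base, ev["guid"])
        if suffix == "start":
            open_starts[key] = ev
        elif suffix == "end":
            open_starts.pop(key, None)
    # Anything still open never saw its 'end': emit a replacement.
    return [
        {"event": base + ".end", "guid": guid, "status": -1,
         "msg": "missing end event"}
        for (base, guid) in open_starts
    ]
```

A production detector would also need a timeout, since an operation that is still running has no 'end' event yet either.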

  9. Log Filter • Some sites may not want to forward potentially sensitive information to the VO archive • E.g.: usernames, user DN’s, IP addresses • Syslog-ng can filter entire events • But would prefer to just filter out the sensitive fields • Log filter will be able to remove or anonymize specific fields in the event
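Field-level filtering could look roughly like the following. The function name and its parameters are hypothetical, and hashing with SHA-256 is just one possible way to anonymize a field.

```python
import hashlib

def filter_fields(event, remove=(), anonymize=()):
    """Return a copy of `event` with sensitive fields dropped or hashed."""
    cleaned = {}
    for name, value in event.items():
        if name in remove:
            continue  # drop the field entirely
        if name in anonymize:
            # The same input always maps to the same token, so events can
            # still be correlated without exposing the original value.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            value = digest[:16]
        cleaned[name] = value
    return cleaned
```

For example, `filter_fields(ev, remove=("src.host",), anonymize=("DN",))` drops the host and replaces the DN with a stable token. A plain hash of a guessable value can be reversed by brute force, so a real deployment would probably prefer a keyed hash.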

  10. Database Loader • This component loads normalized logs into an SQL database • Ability to specify mapping of fields to database columns
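As an illustration of mapping fields to columns, here is a small sketch using Python's built-in `sqlite3` module. The table name `events` and the `FIELD_TO_COLUMN` mapping are assumptions; the slides do not say which database or schema the loader uses.

```python
import sqlite3

# Hypothetical mapping: log field name -> database column name.
FIELD_TO_COLUMN = {"ts": "ts", "event": "event", "guid": "guid", "status": "status"}

def create_table(conn):
    columns = ", ".join(f"{col} TEXT" for col in FIELD_TO_COLUMN.values())
    conn.execute(f"CREATE TABLE IF NOT EXISTS events ({columns})")

def load_event(conn, event):
    """Insert one normalized event; unmapped fields are ignored."""
    columns = list(FIELD_TO_COLUMN.values())
    # Fields missing from the event are stored as NULL.
    values = [event.get(field) for field in FIELD_TO_COLUMN]
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO events ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
```

Values are passed as bound parameters, so log content is never interpolated into the SQL text.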

  11. Sample Site Deployment

  12. Sample Grid Deployment

  13. Current Status

  14. Logging “Best Practices” Recommendations • Practices • Each logged event must contain a unique event name and an ISO-format timestamp (e.g. 2007-06-12T07:23:22.887332Z) • All system operations that might fail or experience performance variations should be wrapped with start and end events. • All logs from a given execution thread should be tagged with a globally unique ID (or GUID), such as a Universally Unique Identifier (UUID) • Log format • Logs should be lines of ASCII name=value pairs • Example:
      ts=2006-12-08T18:48:27.598448Z event=org.globus.gridFTP.transfer.start guid=ID file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
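A line in this format can be produced with the standard library alone. The helper below is a sketch; `format_event` and its parameters are hypothetical names, and quoting values that contain spaces follows the quoted `msg` fields used in the later examples.

```python
from datetime import datetime, timezone

def format_event(event, guid, fields=None, ts=None):
    """Build one log line: ts, event, guid, then any extra fields."""
    if ts is None:
        ts = datetime.now(timezone.utc)
    # ISO 8601 in UTC with microseconds and a trailing 'Z'.
    stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = [("ts", stamp), ("event", event), ("guid", guid)]
    pairs.extend((fields or {}).items())
    out = []
    for name, value in pairs:
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # quote values containing spaces
        out.append(f"{name}={text}")
    return " ".join(out)
```

Called with the event name, GUID, and fields from the slide's example, this reproduces the same layout: `ts=` first, then `event=`, `guid=`, and the remaining pairs in insertion order.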

  15. Event Names • Use a '.' as a separator and go from general to specific • Same as Java class names • First part of name should be used as a unique namespace (e.g.: org.globus) • Use start/end suffixes whenever possible • Helps immensely with troubleshooting • Examples • org.globus.gridFTP.start • org.globus.gridFTP.authn.start • org.globus.gridFTP.authn.end • org.globus.gridFTP.transfer.start • org.globus.gridFTP.transfer.end • org.globus.gridFTP.end • org.globus.MDS.response.start • org.globus.MDS.query.start • org.globus.MDS.query.end • org.globus.MDS.write.net.start • org.globus.MDS.write.net.end • org.globus.MDS.response.end

  16. Reporting Errors • Errors should be reported as part of the ‘end’ event if possible • Use ‘status=N’ (N >= 0 means success; a negative value means failure) • Not attempting to define other status codes • too hard to get agreement on these • Example:
      ts=2006-12-08T18:39:23.114369Z event=org.globus.authz.gridmap.end status=-1 DN="/O=CEDS/CN=Some User" msg="Cannot open gridmap file /etc/grid-security/grid-mapfile for reading" guid=F7D64975-069A-4152-A21F-57109AA46DFA level=ERROR
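One way to guarantee that every operation gets an 'end' event with a status is to wrap it in a context manager. This is a sketch under assumptions: `logged_operation` and the `emit` callback are hypothetical, and using `-1` for every failure is a simplification.

```python
from contextlib import contextmanager

@contextmanager
def logged_operation(emit, name, guid):
    """Emit '<name>.start', run the body, then '<name>.end' with a status.

    `emit` is any callable that accepts an event dict (hypothetical sink).
    """
    emit({"event": name + ".start", "guid": guid})
    try:
        yield
    except Exception as exc:
        # Failure: negative status and the message go on the 'end' event.
        emit({"event": name + ".end", "guid": guid, "status": -1,
              "msg": str(exc), "level": "ERROR"})
        raise
    else:
        emit({"event": name + ".end", "guid": guid, "status": 0})
```

With `with logged_operation(sink, "my.op", guid): ...`, the error message is carried on the 'end' event and the exception still propagates to the caller.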

  17. Error Reporting cont. • Depending on how the program is structured, it may be hard to propagate the error message to the ‘end’ event • Use the ‘error’ event name suffix in this situation • Examples:
      event=org.globus.gridFTP.write.error path="/home/grid" msg="write error, disk full"
      event=myprogram.input.error msg="invalid input"

  18. Globally Unique IDs • Use the ‘guid’ reserved name to allow a set of events to be correlated:
      event=org.globus.gridFTP.authn.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
      event=org.globus.gridFTP.authn.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
      event=org.globus.gridFTP.transfer.start guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
      event=org.globus.gridFTP.transfer.end guid=BFDD3DA5-7891-4885-A3AF-1C5E3B8EF9BB
      • UUID easy to generate • uuidgen, uuidlib • MD5 hash in hexadecimal • But free-form ‘id’ also allowed (e.g. process ID)
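In Python, the standard `uuid` module is an equivalent of `uuidgen`. The two helpers below are a hypothetical sketch of generating a GUID and of using it to correlate events that were already parsed into dicts.

```python
import uuid
from collections import defaultdict

def new_guid():
    """Random (version 4) UUID in the upper-case form shown on the slide."""
    return str(uuid.uuid4()).upper()

def group_by_guid(events):
    """Collect event names per guid, preserving their original order."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev["guid"]].append(ev["event"])
    return dict(groups)
```

Grouping by `guid` is what lets a single transfer's authn and transfer events be read as one sequence even when many transfers are interleaved in the same log.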

  19. Example: GridFTP
      ts=2006-12-08T18:39:23.114369Z event=org.globus.gridFTP.start prog=GridFTP-4.0.3 localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd id=56010
      ts=2006-12-08T18:39:23.114567Z event=org.globus.gridFTP.authn.start DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" id=56010
      ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" msg="123456 successfully authorized" localUser=uscmspool381 id=56010 status=0
      ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize=262144 numStreams=1 numStripes=1 destHost=129.79.4.64 id=56010
      ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 id=56010 status=0
      ts=2006-12-08T18:45:02.214386Z event=org.globus.gridFTP.end id=56010 status=226
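Because the lines are plain name=value pairs, they are easy to analyze. The sketch below parses such lines and computes the elapsed time of each start/end pair; the function names are hypothetical, and it assumes values are quoted with plain ASCII double quotes.

```python
import shlex
from datetime import datetime

def parse_line(line):
    """Split a name=value log line into a dict, honoring double quotes."""
    return dict(token.split("=", 1) for token in shlex.split(line))

def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")

def durations(lines, id_key="id"):
    """Seconds between each matching start/end pair, keyed by (base name, id)."""
    starts, result = {}, {}
    for line in lines:
        ev = parse_line(line)
        base, _, suffix = ev["event"].rpartition(".")
        key = (base, ev[id_key])
        if suffix == "start":
            starts[key] = parse_ts(ev["ts"])
        elif suffix == "end" and key in starts:
            result[key] = (parse_ts(ev["ts"]) - starts.pop(key)).total_seconds()
    return result
```

Applied to the transfer.start and transfer.end lines of the GridFTP example, this reports a transfer time of 336.35 seconds.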

  20. Logging API • We are not requiring any special library to generate log messages • We assume that programmers use one of the standard logging APIs (syslog, Java log4j, python logger, etc.) • Could also use ‘printf’, custom logging API, etc. • Syslog-ng can be used to forward any newline-delimited ASCII log file
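As an example of using a standard logging API, the sketch below plugs a custom formatter into Python's `logging` module so that every record comes out as a name=value line. Passing extra pairs through `extra={"fields": {...}}` is a convention invented for this sketch, not something the slides prescribe.

```python
import io
import logging
from datetime import datetime, timezone

class NameValueFormatter(logging.Formatter):
    """Render each record as 'ts=... event=... level=... name=value ...'."""

    def format(self, record):
        stamp = datetime.fromtimestamp(record.created, timezone.utc)
        pairs = {
            "ts": stamp.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            "event": record.getMessage(),  # the log message is the event name
            "level": record.levelname,
        }
        pairs.update(getattr(record, "fields", {}))
        return " ".join(f"{name}={value}" for name, value in pairs.items())

def make_logger(stream, name="nv-demo"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(NameValueFormatter())
    logger.handlers = [handler]
    return logger

def demo_line():
    """Log one event into an in-memory buffer and return the resulting line."""
    buf = io.StringIO()
    make_logger(buf).info("myprogram.input.start", extra={"fields": {"guid": "g1"}})
    return buf.getvalue().strip()
```

Pointing the handler at a file instead of a buffer yields a newline-delimited ASCII log of the kind syslog-ng can forward.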

  21. Status • Working with GT4 developers to add this to • GRAM4, GridFTP, MDS4, Java Core, C Core, Delegation Service • Working with OSG on deployment of syslog-ng to gather up logs

  22. Log Summarizer

  23. Issue • Would like to have detailed I/O logging for performance analysis • But detailed logs can be far too large and intrusive • For example, a trace of the I/O operations performed by a single GridFTP server capable of saturating a 10 Gigabit network will generate • O(20,000) log events / second • over 70 million per hour • Need: • ongoing report of I/O characteristics • negligible perturbation ( < 1% ) and storage
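The volume estimate is simple arithmetic, shown here as a small sketch. The 150 bytes per event used in the note below is an assumed figure for illustration only; the slides do not give an event size.

```python
def events_per_hour(events_per_second):
    """Number of log events produced in one hour at a steady rate."""
    return events_per_second * 3600

def log_bytes_per_hour(events_per_second, bytes_per_event):
    """Raw log volume per hour for a given (assumed) event size."""
    return events_per_hour(events_per_second) * bytes_per_event
```

At 20,000 events per second this gives 72,000,000 events per hour, consistent with "over 70 million". With the assumed 150 bytes per event, that would be about 10.8 GB of log data per hour from a single server.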

  24. Solution • Summarization • developer can log 1000’s of events/sec • run-time choice of summary granularity (easy to turn off by default) and algorithm • NetLogger Summarization Library • Summarizes logs before they are ever written to disk • Huge reduction of log volume, while retaining important information • Can be used for bottleneck analysis and performance anomaly detection • General-purpose tool can be extended to do different kinds of summarization (currently only does time-based)

  25. How summarization works • Code (log calls):
      for (i=0; i < N; i++) {
          nl_write(log, "loop.start", "id=i", 0);
          double v = do_work();
          nl_write(log, "loop.end", "id=i val=d", 0, v);
      }
      • Configuration (XML version):
      <config>
        <event-name-keys>event</event-name-keys>
        <function id='f1' type='tsumm' url='/tmp/summarized.log'>
          <param name='interval'>1</param>
          <param name='value'>val</param>
        </function>
        <event-group>
          <consume/>
          <function id='f1'>
            <param name='event'>loop.summ</param>
          </function>
          <id-keys>id</id-keys>
          <event>loop.start</event>
          <event>loop.end</event>
        </event-group>
      </config>
      [Figure: timeline marked at 0 sec, 1 sec, and 2 sec. Log calls: start/end of each loop. Output: summary events with average time, average value per time, etc.]
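The idea of time-based summarization can be illustrated independently of NetLogger. The following Python sketch is not the NetLogger Summarization Library; it is a hypothetical, simplified model that reduces start/end pairs to one summary per interval with an average elapsed time and an average value.

```python
from collections import defaultdict

def summarize(events, interval=1.0):
    """Reduce start/end pairs to one summary per time interval.

    `events` is an iterable of (timestamp_seconds, kind, ident, value)
    tuples; kind is 'start' or 'end', and value is only used on 'end'.
    """
    starts = {}
    buckets = defaultdict(lambda: {"count": 0, "time": 0.0, "value": 0.0})
    for ts, kind, ident, value in events:
        if kind == "start":
            starts[ident] = ts
        elif kind == "end" and ident in starts:
            # A pair is counted in the interval in which it ended.
            bucket = buckets[int(ts // interval)]
            bucket["count"] += 1
            bucket["time"] += ts - starts.pop(ident)
            bucket["value"] += value
    return [
        {"interval_start": index * interval, "count": b["count"],
         "avg_time": b["time"] / b["count"], "avg_value": b["value"] / b["count"]}
        for index, b in sorted(buckets.items())
    ]
```

However many loop iterations run, the output is bounded by one record per interval, which is where the reduction in log volume comes from.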

  26. Programmatic configuration • New NetLogger calls (slightly simplified):
      • Add events to summarizer:
          add_eventpair("my.event",   /* my.event.start / my.event.end */
                        "nbytes",     /* value field, e.g. nbytes=131024 */
                        "guid")       /* identifier field */
      • Set summary interval:
          set_interval(10)            /* 10 second summary interval */
      • Get summary statistics:
          I = get_stats("my.event")

  27. Sample GridFTP Deployment with Summarizer

  28. Anomaly Detection • Summarized events can be used for simple anomaly detection • Summarize disk and network throughput every 10 seconds • Generate an alarm if disk or network drops below threshold X for duration Y
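The threshold rule on this slide can be sketched as follows. The function and parameter names are hypothetical; `threshold` and `duration` correspond to X and Y, and each sample is assumed to be one 10-second summary.

```python
def find_alarms(samples, threshold, duration, interval=10):
    """Return start times of periods where throughput stays below
    `threshold` for at least `duration` seconds.

    `samples` holds one throughput value per `interval` seconds.
    """
    alarms = []
    low_since = None  # time at which the current low period began
    fired = False     # whether this low period already raised an alarm
    for index, value in enumerate(samples):
        now = index * interval
        if value < threshold:
            if low_since is None:
                low_since, fired = now, False
            # The sample covers [now, now + interval), so add one interval.
            if not fired and now + interval - low_since >= duration:
                alarms.append(low_since)
                fired = True
        else:
            low_since = None
    return alarms
```

Requiring the drop to persist for a duration avoids raising an alarm on a single slow interval.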

  29. Bottleneck Analysis • Can configure summarizer to just output a single summary at the end • Need to collect summary information at both client and server sides • Because the start/end events measure both... • time inside instrumented function (end_i - start_i) • time between successive calls (start_(i+1) - end_i) • ...there is potential for determining which functions are busy and which are mostly waiting • admittedly, this is somewhat complicated by OS buffering
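The two quantities named on this slide can be computed directly from a list of start/end timestamps. This is a hypothetical helper for illustration, assuming the calls are successive and ordered by start time.

```python
def busy_and_wait(calls):
    """Split elapsed time into time inside a function and time between calls.

    `calls` is a list of (start, end) timestamps for successive calls,
    ordered by start time.
    """
    # Time inside the instrumented function: end_i - start_i
    inside = sum(end - start for start, end in calls)
    # Time between successive calls: start_(i+1) - end_i
    between = sum(
        next_start - end
        for (_, end), (next_start, _) in zip(calls, calls[1:])
    )
    return inside, between
```

A function with a large `inside` total is busy, while one with a large `between` total is mostly waiting on something else, such as the other side of the transfer.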

  30. Summary • Two broad categories of work: • Gathering and normalizing existing data to allow analysis across sites (e.g. in OSG) • syslog-ng, log parser, db loader • missing event detector • anomaly detection • Adding new data through instrumentation of standard Grid middleware • best practices logging recommendations • summarizer

  31. More Information • CEDPS TS home page: • http://www.cedps.net/wiki/index.php/Troubleshooting • Best-practices sub-page: • http://www.cedps.net/wiki/index.php/LoggingBestPractice • CEDPS TS team • Brian Tierney, LBNL (Area lead) • Jen Schopf, ANL • Stu Martin, ANL • Laura Perlman, ISI

  32. Extra Slides

  33. NL summarizer performance [Chart; labels: NL mode, Log dest]

  34. NL summarizer overhead [Chart; label: I/O, compute overlap]
