High Performance Monitoring

High Performance Monitoring. WG on Storage Federations December 6, 2012 Andrew Hanushevsky, SLAC http://xrootd.org. Setting The Context. High Performance Monitoring


High Performance Monitoring

WG on Storage Federations

December 6, 2012

Andrew Hanushevsky, SLAC


Setting The Context
  • High Performance Monitoring
    • Collecting real-time information at statistically significant detail, at scale, without impacting client or server performance.
  • The relevant phrases
    • Real-time information
    • Statistically significant
    • Without impacting performance
    • At scale
At Scale
  • Thousands of users
  • 10,000 or more simultaneous jobs
  • 100,000 or more active files
  • Geographically distributed across
    • Thousands of data servers
    • Hundreds of millions of files
    • Hundreds of petabytes of data
  • Potentially billions of events every second!
Without Impacting Performance
  • This requires careful collection & reporting
    • Many trade-offs but generally
      • Highly encoded data to minimize traffic
        • Typically implies binary encoding
      • Offloading information serialization
        • More on this at the end
      • Network protocol that is fast and does not block
        • Typically implies using UDP
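
As a minimal sketch of the two trade-offs above (binary encoding and a non-blocking UDP transport), assuming a made-up record layout: the `EVENT` format, `encode_event`, and `make_sender` are hypothetical names for illustration and are not the xrootd wire format.

```python
import socket
import struct

# Hypothetical record layout (NOT the xrootd wire format):
# 1-byte event code, 4-byte file id, 8-byte offset, 4-byte length,
# all in network byte order with no padding.
EVENT = struct.Struct("!BIQI")


def encode_event(code, file_id, offset, length):
    """Pack one event into a fixed-size 17-byte binary record."""
    return EVENT.pack(code, file_id, offset, length)


def make_sender(host, port):
    """Return a send(payload) function backed by a non-blocking UDP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)

    def send(payload):
        try:
            sock.sendto(payload, (host, port))
            return True
        except OSError:
            # Covers BlockingIOError too: drop the packet rather than stall
            # the thread that is serving user data.
            return False

    return send
```

The sender never waits on the collector. If the datagram cannot be queued, it is simply dropped.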
Statistically Significant I
  • All events need not be 100% time accurate
    • No need to time-stamp each event
      • We can’t as server performance would suffer
    • So, we can report events in time-windows
      • Events are statistically post-distributed in the window
        • Note that events are reported in occurrence order
  • Any event is disposable
    • This means we can lose events
      • Allows use of non-blocking UDP packets for reporting
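
One way a collector might post-distribute events inside a window is sketched below. The slide does not say which distribution is used; even spacing is an assumption made here for illustration, and `post_distribute` is a hypothetical helper.

```python
def post_distribute(window_start, window_end, n_events):
    """Assign estimated timestamps to n_events reported in occurrence order.

    Assumption: events are spread evenly across the window, each placed at
    the midpoint of its slot. Occurrence order is preserved.
    """
    if n_events <= 0:
        return []
    step = (window_end - window_start) / n_events
    return [window_start + (i + 0.5) * step for i in range(n_events)]
```

For a 5-second window holding five events, each event lands one second apart, in the order the server reported them.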
Statistically Significant II
  • Statistical significance relies on a large sample
    • We want the big picture
      • This is monitoring, not accounting!
    • Build it up using a large number of events
      • And we can get a large number every second
      • But we don’t expect to get every event
  • This helps us achieve high performance
    • Yet provides a reasonably accurate picture
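
The claim that a lossy sample still gives a reasonably accurate picture can be illustrated with a small simulation. This is a sketch under one explicit assumption: events are lost at random, independently of their content. The function name and the 10% loss rate in the usage below are illustrative only.

```python
import random


def estimate_mean_with_loss(values, loss_rate, seed=0):
    """Estimate the mean of values after randomly dropping a fraction of them.

    Assumption: each event is lost independently with probability loss_rate.
    Returns (estimated_mean, number_of_events_kept).
    """
    rng = random.Random(seed)
    kept = [v for v in values if rng.random() >= loss_rate]
    if not kept:
        return None, 0
    return sum(kept) / len(kept), len(kept)
```

With a large number of events, the estimate from the surviving events stays close to the true mean. If losses correlate with load or event type, that no longer holds automatically.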
Real Time Information
  • Reporting events close to the time they happen
    • Regulated by the size of the window
      • Typically a few seconds (e.g. 5 or 10, maybe longer)
  • What information?
    • Practically anything that might happen...
      • Logins and logouts
      • File operations (open, close, remove, etc.)
      • File I/O (i.e. reads and writes)
      • Request redirections
A Practical Implementation
  • xrootd provides a wide range of monitoring data at high performance
  • Information is broken out into streams
    • Asynchronous information packets for
      • Periodic summary data
        • Summary stream
        • Low event rate allows it to be XML-based
      • Real time detail data
        • F, M, R, T streams
        • Potentially high event rates necessitate a binary format
Why Streams?
  • Allows one to easily
    • Group related information together
    • Independently select the level of detail in each group
    • Route information to different collectors
      • These can be specialized for each stream
    • Control the performance impact of each stream
      • Streams can be selectively enabled
    • Makes it easier to handle the raw data
The Summary Stream
  • Summary data periodically reported
    • Very large amount of data available
      • http://xrootd.org/doc/prod/xrd_monitoring.htm
    • Selectable by category
    • Centrally collected
      • Collector merges reporters
    • Fed into your favorite monitoring system
      • Ganglia, GRIS, Nagios, MonALISA, etc.
    • Relatively low amount of traffic – negligible impact
The Real Time Streams
  • Easily > 50 MB/sec of complex, interrelated asynchronous monitoring data
    • Collector needs to be fast and robust
    • May need to cross-reference certain streams
    • Store the data in an easily analyzable format
      • E.g. MySQL or ROOT files
    • Condense the information for suitable rendering
    • Send it to the rendering agent
      • E.g. via ActiveMQ to the dashboard
  • High amount of traffic – high impact
The Real Time M Stream
  • The Map stream
    • Server, user, and file names mapped to binary IDs
      • The IDs are used in other streams as backward references
        • Allows >100x compression of redundant information
    • Gross file events
      • Purges (auto-removals) & stage-ins (auto-transfers)
    • Client generated event data
      • Job name, site, and performance data
  • Selectable detail levels
    • Typically, less than 1% overhead
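
The name-to-ID mapping described above can be sketched as a simple dictionary encoder. `IdMapper` and its record layout are hypothetical and only illustrate the idea: the full name is sent once in a map record, and every later event carries a 4-byte ID instead.

```python
import struct


class IdMapper:
    """Assign a compact numeric ID to each distinct name (hypothetical helper)."""

    def __init__(self):
        self._ids = {}
        # Records that would be sent on the map stream: 4-byte ID + name.
        self.map_records = []

    def id_for(self, name):
        """Return the ID for name, emitting a map record the first time only."""
        ident = self._ids.get(name)
        if ident is None:
            ident = len(self._ids) + 1
            self._ids[name] = ident
            self.map_records.append(struct.pack("!I", ident) + name.encode("utf-8"))
        return ident
```

The saving depends on how long the names are and how often each one recurs in the other streams.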
The Real Time F Stream
  • The File stream
    • Per-file I/O summary information
      • Bytes read and written, broken down by method used
      • Sigma values for byte and operation counts
    • Per-file I/O progress information
      • Periodic report on bytes transferred
  • Selectable detail levels
    • 1 to 3% overhead
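
The slide does not say how the sigma values are produced. One common low-cost approach, assumed here, is for the server to keep only a count, a sum, and a sum of squares per file, and for the consumer to derive the standard deviation later. `SizeStats` is a hypothetical name.

```python
import math


class SizeStats:
    """Running accumulators from which a mean and sigma can be derived."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.sum_sq = 0

    def record(self, nbytes):
        """Cheap per-operation update: three integer additions."""
        self.count += 1
        self.total += nbytes
        self.sum_sq += nbytes * nbytes

    def sigma(self):
        """Population standard deviation: sqrt(E[x^2] - E[x]^2)."""
        if self.count == 0:
            return 0.0
        mean = self.total / self.count
        variance = self.sum_sq / self.count - mean * mean
        return math.sqrt(max(variance, 0.0))
```

The per-operation cost stays constant, and the square root is computed off the data server.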
The Real Time R Stream
  • The Redirect stream
    • Source to destination redirect information
      • Operation causing the redirect
    • Generated by any server that redirects clients
  • No selectable detail levels
    • Pretty much all of the information is needed
    • About 1% overhead
The Real Time T Stream
  • The Trace stream
    • Per-file I/O information
      • Offset and bytes read or written for each operation
        • Identical to a seek trace
  • Selectable detail levels
    • 3 to 5% overhead
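
Because each trace record carries an offset and a byte count, a consumer can reconstruct the seek pattern. The sketch below assumes a trace is available as a list of `(offset, nbytes)` pairs in occurrence order; `seek_distances` is a hypothetical helper, not part of xrootd.

```python
def seek_distances(trace):
    """Return the jump between the end of each operation and the next offset.

    trace: list of (offset, nbytes) tuples in occurrence order.
    A distance of 0 means sequential access; negative means a backward seek.
    """
    distances = []
    expected = None
    for offset, nbytes in trace:
        if expected is not None:
            distances.append(offset - expected)
        expected = offset + nbytes
    return distances
```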
Back To Offloading
  • Recall that xrootd monitoring is asynchronous and multi-stream
    • This means that the collector must time-order the data, as the server does not do this
      • Each packet has enough information to do this
    • We do this because serialization is very expensive
      • Extremely high impact in a multi-threaded application
    • The hard work is offloaded to another server
      • Allows the data server to concentrate on delivering user data not monitoring data
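
A collector-side sketch of the time-ordering step follows. The slide says only that each packet has enough information to do this; the `window_start` and `seq` fields here are hypothetical stand-ins for that information, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Packet:
    """A received monitoring packet (field names are assumptions)."""
    window_start: int                       # start time of the reporting window
    seq: int                                # per-server sequence number
    stream: str = field(compare=False)      # e.g. "f", "m", "r", "t"
    payload: bytes = field(compare=False, default=b"")


def time_order(packets):
    """Order packets from all streams by (window_start, seq)."""
    return sorted(packets)
```

The sort runs on the collector, so the data server never has to serialize its threads to emit packets in order.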
Conclusion I
  • High performance monitoring is hard work
    • It requires minute attention to detail
      • Data formats
      • Work load distribution
      • Non-blocking internal data structures
      • Information flow
  • We estimate that for xrootd it took about four person-years to achieve an extremely low level of server performance impact
    • Making real-time monitoring practical at scale
Conclusion II
  • Federations create an extreme scale system
    • Viewed as a single complex big data system
  • The outlined information is needed to assess it
    • Only practical with high performance monitoring
  • In essence
    • High performance real-time monitoring is a must to properly track federated storage systems