High Performance Monitoring. WG on Storage Federations, December 6, 2012. Andrew Hanushevsky, SLAC. http://xrootd.org


High Performance Monitoring

WG on Storage Federations

December 6, 2012

Andrew Hanushevsky, SLAC


Setting The Context

  • High Performance Monitoring

    • Collecting real-time information in statistically significant detail, at scale, without impacting client or server performance.

  • The relevant phrases

    • Real-time information

    • Statistically significant

    • Without impacting performance

    • At scale

At Scale

  • Thousands of users

  • 10,000 or more simultaneous jobs

  • 100,000 or more active files

  • Geographically distributed across

    • Thousands of data servers

    • Hundreds of millions of files

    • Hundreds of petabytes of data

  • Potentially billions of events every second!

Without Impacting Performance

  • This requires careful collection & reporting

    • Many trade-offs but generally

      • Highly encoded data to minimize traffic

        • Typically implies binary encoding

      • Offloading information serialization

        • More on this at the end

      • Network protocol that is fast and does not block

        • Typically implies using UDP
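
As a rough illustration of the two points above (compact binary encoding and a non-blocking UDP send), here is a minimal sketch. The record layout, field names, and helper functions are invented for this example and are not the xrootd wire format.

```python
import socket
import struct

# Hypothetical record layout (an assumption, NOT the xrootd wire format):
# 1-byte event code, 4-byte file ID, 8-byte offset, 4-byte length,
# all in network byte order with no padding.
RECORD = struct.Struct("!BIQI")


def encode_event(code: int, file_id: int, offset: int, length: int) -> bytes:
    """Pack one event into a fixed-size 17-byte binary record."""
    return RECORD.pack(code, file_id, offset, length)


def make_sender() -> socket.socket:
    """Create a UDP socket that never blocks the calling thread."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    return sock


def send_events(sock: socket.socket, collector: tuple, records: list) -> bool:
    """Send a batch of records as one datagram.

    If the send cannot complete immediately (or fails), the batch is
    dropped rather than retried, since any event is disposable.
    """
    try:
        sock.sendto(b"".join(records), collector)
        return True
    except OSError:  # includes BlockingIOError
        return False
```

The same event written as text (for example a line of key=value pairs) would take several times the 17 bytes used here, which is the motivation for binary encoding on high-rate streams.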

Statistically Significant I

  • All events need not be 100% time accurate

    • No need to time-stamp each event

      • We can’t as server performance would suffer

    • So, we can report events in time-windows

      • Events are statistically post-distributed in the window

        • Note that events are reported in occurrence order

  • Any event is disposable

    • This means we can lose events

      • Allows use of non-blocking UDP packets for reporting
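
One way a collector might post-distribute events inside a reporting window is to space them evenly while keeping the reported order. This is only a sketch under that assumption; the function name and the even-spacing rule are illustrative, not taken from xrootd.

```python
def spread_in_window(window_start: float, window_len: float, events: list) -> list:
    """Assign evenly spaced timestamps to events, preserving reported order.

    Returns a list of (timestamp, event) pairs. Even spacing is an
    assumption made for this sketch; other distributions are possible.
    """
    n = len(events)
    if n == 0:
        return []
    step = window_len / n
    # Place each event at the midpoint of its slot so that no event
    # lands exactly on a window boundary.
    return [(window_start + (i + 0.5) * step, ev) for i, ev in enumerate(events)]
```

For a 10-second window starting at t=100 with four events, the assigned times are 101.25, 103.75, 106.25, and 108.75.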

Statistically Significant II

  • Statistical significance relies on a large sample

    • We want the big picture

      • This is monitoring, not accounting!

    • Build it up using a large number of events

      • And we can get a large number every second

      • But we don’t expect to get every event

  • This helps us achieve high performance

    • Yet provides a reasonably accurate picture
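
The claim that a large sample survives some loss can be checked with a small simulation. The sketch below uses invented read sizes and an assumed 5% random loss rate; it only illustrates that an aggregate such as the mean barely moves when loss is independent of the event contents.

```python
import random


def mean_read_size(sizes: list) -> float:
    """Average of a list of read sizes in bytes."""
    return sum(sizes) / len(sizes)


rng = random.Random(2012)  # fixed seed so the run is repeatable

# Simulated read sizes in bytes; the distribution is made up for illustration.
all_reads = [rng.randint(1, 1_000_000) for _ in range(100_000)]

# Pretend roughly 5% of the events were lost in transit, at random.
received = [size for size in all_reads if rng.random() >= 0.05]

true_mean = mean_read_size(all_reads)
seen_mean = mean_read_size(received)
relative_error = abs(seen_mean - true_mean) / true_mean
```

This holds only while losses are unrelated to what the events contain; loss that favors particular servers or event types would bias the picture.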

Real Time Information

  • Reporting events close to the time they happen

    • Regulated by the size of the window

      • Typically on the order of seconds (e.g. 5 or 10, maybe longer)

  • What information?

    • Practically anything that might happen...

      • Logins and logouts

      • File operations (open, close, remove, etc.)

      • File I/O (i.e. reads and writes)

      • Request redirections

A Practical Implementation

  • xrootd provides a wide range of monitoring data at high performance

  • Information is broken out into streams

    • Asynchronous information packets for

      • Periodic summary data

        • Summary stream

        • Low event rate allows it to be XML-based

      • Real time detail data

        • F, M, R, T streams

        • Potentially high event rates necessitate a binary format

Why Streams?

  • Allows one to easily

    • Group related information together

    • Independently select the level of detail in each group

    • Route information to different collectors

      • These can be specialized for each stream

    • Control the performance impact of each stream

      • Streams can be selectively enabled

    • Makes it easier to handle the raw data
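
The routing and selective-enabling ideas above can be pictured as a simple dispatch table. The sketch below is hypothetical: the stream labels, host names, and ports are placeholders, and it does not reflect how xrootd is actually configured.

```python
from collections import defaultdict

# Hypothetical routing table: stream label -> collector address.
# Host names and ports are placeholders invented for this sketch.
ROUTES = {
    "summary": ("summary-collector.example.org", 9931),
    "f": ("detail-collector.example.org", 9930),
    "r": ("redirect-collector.example.org", 9932),
}
# Streams that do not appear in ROUTES are treated as disabled.


def route(packets: list) -> dict:
    """Group (stream, payload) pairs by destination, skipping disabled streams."""
    outbound = defaultdict(list)
    for stream, payload in packets:
        dest = ROUTES.get(stream)
        if dest is not None:
            outbound[dest].append(payload)
    return dict(outbound)
```

Here the "t" stream has no entry, so its packets are never produced for any collector, while "f" and "r" go to separate, specialized collectors.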

The Summary Stream

  • Summary data periodically reported

    • Very large amount of data available

      • http://xrootd.org/doc/prod/xrd_monitoring.htm

    • Selectable by category

    • Centrally collected

      • Collector merges reporters

    • Fed into your favorite monitoring system

      • Ganglia, GRIS, Nagios, MonALISA, etc.

    • Relatively low amount of traffic – negligible impact
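
To illustrate what "collector merges reporters" could look like for XML summaries, here is a small sketch. The XML element and attribute names are invented placeholders and do not follow the actual xrootd summary schema, which is described in the monitoring reference linked above.

```python
import xml.etree.ElementTree as ET

# Invented sample reports. The element and attribute names are placeholders,
# NOT the real xrootd summary format.
REPORTS = [
    '<report host="server1">'
    '<counter name="bytes_out">1200</counter>'
    '<counter name="connections">3</counter>'
    "</report>",
    '<report host="server2">'
    '<counter name="bytes_out">800</counter>'
    '<counter name="connections">5</counter>'
    "</report>",
]


def merge_reports(xml_docs: list) -> dict:
    """Sum each named counter across all reporting servers."""
    totals = {}
    for doc in xml_docs:
        root = ET.fromstring(doc)
        for counter in root.findall("counter"):
            name = counter.get("name")
            totals[name] = totals.get(name, 0) + int(counter.text)
    return totals
```

The merged totals are what would then be handed to a system such as Ganglia or Nagios.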

The Real Time Streams

  • Easily > 50 MB/s of complex, inter-related asynchronous monitoring data

    • Collector needs to be fast and robust

    • May need to cross-reference certain streams

    • Store the data in an easily analyzable format

      • E.g. MySQL or ROOT files

    • Condense the information for suitable rendering

    • Send it to the rendering agent

      • E.g. via ActiveMQ to the dashboard

  • High amount of traffic – high impact

The Real Time M Stream

  • The Map stream

    • Server, user, and file names mapped to binary IDs

      • The IDs are used in other streams as backward refs

        • Allows >100x compression of redundant information

    • Gross file events

      • Purges (auto-removals) & stage-ins (auto-transfers)

    • Client generated event data

      • Job name, site, and performance data

  • Selectable detail levels

    • Typically, less than 1% overhead
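
The name-to-ID mapping can be sketched as a simple interning table: each name is sent once on the map stream, and later records refer to it by ID. The class and attribute names below are hypothetical and chosen only for this illustration.

```python
class NameMap:
    """Assign a small integer ID to each name the first time it is seen."""

    def __init__(self):
        self._ids = {}
        # (id, name) pairs that would be sent once on the map stream.
        self.map_records = []

    def id_for(self, name: str) -> int:
        """Return the ID for a name, creating a map record on first use."""
        ident = self._ids.get(name)
        if ident is None:
            ident = len(self._ids) + 1
            self._ids[name] = ident
            self.map_records.append((ident, name))
        return ident
```

A long file path is transmitted a single time; every later I/O record for that file carries only the integer ID as a backward reference.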

The Real Time F Stream

  • The File stream

    • Per-file I/O summary information

      • Bytes read and written, broken down by method used

      • Sigma values for byte and operation counts

    • Per-file I/O progress information

      • Periodic report on bytes transferred

  • Selectable detail levels

    • 1 to 3% overhead
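
One inexpensive way to support sigma (standard deviation) values is for the server to keep only running sums and leave the arithmetic to the collector. The sketch below assumes that approach; whether xrootd computes its sigma values exactly this way is not stated in the slides, and the class is hypothetical.

```python
import math


class ReadStats:
    """Running count, sum, and sum of squares for per-file read sizes."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.sum_sq = 0

    def record(self, nbytes: int) -> None:
        """Update the accumulators for one read; three additions per event."""
        self.count += 1
        self.total += nbytes
        self.sum_sq += nbytes * nbytes

    def mean(self) -> float:
        return self.total / self.count

    def sigma(self) -> float:
        """Population standard deviation derived from the accumulated sums."""
        mean = self.mean()
        # Clamp at zero to guard against tiny negative values from rounding.
        return math.sqrt(max(self.sum_sq / self.count - mean * mean, 0.0))
```

For read sizes 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and sigma is 2, recovered from three integers rather than the full list of reads.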

The Real Time R Stream

  • The Redirect stream

    • Source to destination redirect information

      • Operation causing the redirect

    • Generated by any server that redirects clients

  • No selectable detail levels

    • Pretty much all of the information is needed

    • About 1% overhead

The Real Time T Stream

  • The Trace stream

    • Per-file I/O information

      • Offset and bytes read or written for each operation

        • Identical to a seek trace

  • Selectable detail levels

    • 3 to 5% overhead
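
Because the trace is a list of offsets and lengths, a collector can derive access-pattern metrics from it. As one example, the sketch below computes how much of a file's I/O was sequential; the (offset, length) tuple representation and the function name are assumptions made for illustration.

```python
def sequential_fraction(trace: list) -> float:
    """Fraction of operations that start exactly where the previous one ended.

    `trace` is a list of (offset, length) pairs in occurrence order.
    """
    if len(trace) < 2:
        return 1.0
    sequential = 0
    for (prev_off, prev_len), (off, _length) in zip(trace, trace[1:]):
        if off == prev_off + prev_len:
            sequential += 1
    return sequential / (len(trace) - 1)
```

For the trace (0,100), (100,100), (500,50), (550,50), (0,10), two of the four transitions are sequential, giving 0.5.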

Back To Offloading

  • Recall xrootd monitoring is async multi-stream

    • This means that the collector must time-order the data, as the server does not do this

      • Each packet has enough information to do this

    • We do this because serialization is very expensive

      • Extremely high impact in a multi-threaded application

    • The hard work is offloaded to another server

      • Allows the data server to concentrate on delivering user data, not monitoring data
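
The collector-side ordering step might look like the following sketch. It assumes each packet carries a server name, the start of its reporting window, and a per-server sequence number; these field names are hypothetical, and wrap-around of the sequence number is deliberately ignored here.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    server: str
    window_start: int  # seconds; start of the packet's reporting window
    seq: int           # per-server sequence number (assumed not to wrap here)
    payload: bytes


def time_order(packets: list) -> list:
    """Order packets by window, then server, then sequence number.

    Arrival order is ignored, since UDP datagrams from several streams
    may arrive interleaved or out of order.
    """
    return sorted(packets, key=lambda p: (p.window_start, p.server, p.seq))
```

The sort happens on the collector, so the data server never has to serialize its threads to produce ordered output.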

Conclusion I

  • High performance monitoring is hard work

    • It requires minute attention to detail

      • Data formats

      • Work load distribution

      • Non-blocking internal data structures

      • Information flow

  • We estimate that for xrootd it took about four person-years to achieve an extremely low level of server performance impact

    • Making real-time monitoring practical at scale

Conclusion II

  • Federations create an extreme scale system

    • Viewed as a single complex big data system

  • The outlined information is needed to assess it

    • Only practical with high performance monitoring

  • In essence

    • High performance real-time monitoring is a must to properly track federated storage systems