
Measuring and monitoring Microsoft’s enterprise network

Richard Mortier (mort), Rebecca Isaacs, Laurent Massoulié, Peter Key


Presentation Transcript


  1. Measuring and monitoring Microsoft’s enterprise network Richard Mortier (mort), Rebecca Isaacs, Laurent Massoulié, Peter Key

  2. We monitored our network… …and this is how… …and this is what we saw…
     • How did we monitor it?
     • What did we see?

  3. Microsoft CorpNet @ MSR Cambridge
     [Diagram labels: CORPNET; EMEA; MSRC; area1, area2, area3; Area 0; eBGP; LatinAmerica; NorthAmerica; AsiaPacific]

  4. Capture setup
     • MSRC site organized using IP subnets
         • Roughly one per wing plus one for datacenter
         • Datacenter is by far the most active
     • Captured using VLAN spanning
         • 1:1 mapping between (Ethernet) VLAN and IP subnet
         • Mapped all VLANs to one port (NS trace)…
         • …except datacenter, mapped to second port (DC trace)
     • Also took a capture at one VLAN’s Ethernet switch
         • Allowed us to estimate amount of traffic not captured
         • >99% traffic is routed (i.e. goes ‘off-VLAN’)
         • Missed printer, some subnet broadcast, some SMB

  5. Packet processing
     • Assigned packets to application
         • Used port numbers, RPC GUID, signature byte strings, server name
     • Assigned applications to category
         • ~40 applications → ~10 categories
     • Generated packet and flow records
         • Reduce disk IO, increase performance
         • Still took ~10 days per complete run
     • Python scripts processed records
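
The two-stage assignment on this slide (packet to application, then application to category) can be sketched roughly as below. This is only an illustration: the lookup tables, the function names and the rule order (payload signature before port number) are assumptions, and the RPC GUID and server-name rules mentioned on the slide are left out because the slides do not describe them.

```python
from typing import Optional

# Hypothetical rule tables; the slides do not list the real ~40 applications
# or ~10 categories, so these entries are placeholders for illustration.
PORT_TO_APP = {25: "smtp", 53: "dns", 80: "http", 443: "https", 445: "smb"}
SIGNATURES = {b"GET ": "http", b"POST ": "http", b"SSH-": "ssh"}
APP_TO_CATEGORY = {
    "smtp": "email",
    "dns": "name-service",
    "http": "web",
    "https": "web",
    "smb": "file",
    "ssh": "remote-access",
}


def classify_app(src_port: int, dst_port: int, payload: bytes = b"") -> Optional[str]:
    """Return an application name, or None if no rule matches."""
    # Assumed ordering: a payload signature is more specific than a port.
    for signature, app in SIGNATURES.items():
        if payload.startswith(signature):
            return app
    # Fall back to well-known ports, checking the destination port first.
    for port in (dst_port, src_port):
        if port in PORT_TO_APP:
            return PORT_TO_APP[port]
    return None


def classify_category(app: Optional[str]) -> str:
    """Map an application name to its category, or 'unknown'."""
    return APP_TO_CATEGORY.get(app, "unknown")
```

For example, `classify_category(classify_app(50000, 445))` falls through to the port table and yields the placeholder category `"file"`.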

  6. Problems with this setup
     • Duplication
         • No DC switch: some hosts directly connected to router
         • See their packets twice (on the way in and out)
         • Deduplicate both traces; careful selection from NS trace
     • IPSec transport mode deployment
         • Packet encapsulated in shim header plus trailer
         • IP protocol moved into trailer and header rewritten
         • Wrote custom capture tools to unpick encapsulation
     • Flow detection
         • Network flow ≠ transport flow ≠ application flow
         • Used IP 5-tuple and timeout = 90 seconds
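
The flow-detection rule at the end of this slide (IP 5-tuple plus a 90-second timeout) might look something like the following minimal sketch. The record layout, the `Flow` class and `build_flows` are hypothetical names introduced here; the sketch treats flows as unidirectional and assumes packets arrive in timestamp order, neither of which the slides confirm.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# (src IP, dst IP, src port, dst port, IP protocol number)
FiveTuple = Tuple[str, str, int, int, int]
FLOW_TIMEOUT = 90.0  # seconds, the value given on the slide


@dataclass
class Flow:
    key: FiveTuple
    first_seen: float
    last_seen: float
    packets: int = 0
    bytes: int = 0


def build_flows(
    packets: Iterable[Tuple[float, FiveTuple, int]],
    timeout: float = FLOW_TIMEOUT,
) -> List[Flow]:
    """Group (timestamp, five_tuple, length) records into flows.

    A packet extends the current flow for its 5-tuple unless more than
    `timeout` seconds have passed since that flow's previous packet, in
    which case the old flow is closed and a new one starts.
    """
    active: Dict[FiveTuple, Flow] = {}
    finished: List[Flow] = []
    for ts, key, length in packets:
        flow = active.get(key)
        if flow is not None and ts - flow.last_seen > timeout:
            finished.append(flow)  # idle for too long: close it
            flow = None
        if flow is None:
            flow = Flow(key, ts, ts)
            active[key] = flow
        flow.last_seen = ts
        flow.packets += 1
        flow.bytes += length
    finished.extend(active.values())  # flush flows still open at end of trace
    return finished
```

With this rule, two packets on the same 5-tuple that are 91 seconds apart end up in separate flows, while a gap of exactly 90 seconds keeps them together.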

  7. Trace characteristics

  8. Traffic classification

  9. Protocol distribution

  10. [Figure annotations]
      • # flows ~ # src ports suggests client behaviour
      • Flows using few src ports suggests server behaviour
      • Neither client nor server suggests peer-to-peer
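
The heuristic annotated on this figure can be sketched as a ratio of distinct source ports to flows. Grouping per host, the function name and both thresholds are assumptions made for this illustration; the slide gives no numeric cut-offs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def classify_hosts(
    flows: Iterable[Tuple[str, int]],
    client_ratio: float = 0.8,  # assumed threshold, not from the slides
    server_ratio: float = 0.2,  # assumed threshold, not from the slides
) -> Dict[str, str]:
    """Label each source host from (src_ip, src_port) pairs, one per flow."""
    flow_count: Dict[str, int] = defaultdict(int)
    src_ports: Dict[str, Set[int]] = defaultdict(set)
    for src_ip, src_port in flows:
        flow_count[src_ip] += 1
        src_ports[src_ip].add(src_port)

    labels: Dict[str, str] = {}
    for host, count in flow_count.items():
        ratio = len(src_ports[host]) / count
        if ratio >= client_ratio:
            # Roughly one new (ephemeral) source port per flow.
            labels[host] = "client"
        elif ratio <= server_ratio:
            # Many flows share a handful of source ports.
            labels[host] = "server"
        else:
            labels[host] = "peer-to-peer"
    return labels
```

A host with very few flows will always look like a client under this rule, so a real analysis would probably want a minimum flow count before labelling.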

  11. Traffic dynamics
      • Headlines: seasonal, highly volatile
      • Examine through
          • Autocorrelations
          • Variation per-application per-hour
          • Variation per-application per-host
          • Variation in heavy-hitter set
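
As a reminder of what a correlogram plots, here is a minimal sketch of the sample autocorrelation of a traffic time series, written in plain Python. The function name and the choice of the standard biased estimator are assumptions; the slides do not say how the autocorrelations were computed.

```python
from typing import List, Sequence


def autocorrelation(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelation for lags 0..max_lag (max_lag < len(series)).

    Each lagged sum of products of deviations from the mean is divided by
    the lag-0 sum, so the value at lag 0 is always 1.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev)
    if var == 0:
        raise ValueError("constant series has undefined autocorrelation")
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]
```

For an hourly byte count with a daily cycle, the correlogram would be expected to peak again near lag 24, which is the kind of seasonality the headline bullet refers to.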

  12. Correlograms: onsite traffic

  13. Correlograms: offsite traffic

  14. Variation per-application per-hour
      • Onsite (left)
      • Offsite (down)
      • Exponential decay
      • Light-tailed

  15. Variation per-application per-host
      • Onsite (left)
      • Offsite (down)
      • Linear decay
      • Heavy-tailed
      • Heavy hitters

  16. Implications for modelling
      • Timeseries modelling is hard
          • Tried ARMA, ARIMA models but per-application only
          • Exponentiation leads to large errors in forecasting
      • Client/server distinction unclear
          • Tried PCA, “projection pursuit method”
          • Neither found anything
          • PCA discovered singleton clusters in rank order...
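
One possible reading of the "exponentiation" bullet, assuming (the slides do not say so) that the ARMA/ARIMA models were fitted to log-transformed traffic volumes: an additive error in the log domain becomes a multiplicative error once the forecast is transformed back. Here $x_t$ is the true volume, $\hat{y}_t$ the forecast of $\log x_t$, and $\varepsilon_t$ the forecast error; all three symbols are introduced for this illustration.

```latex
\hat{y}_t = \log x_t + \varepsilon_t
\quad\Longrightarrow\quad
\hat{x}_t = e^{\hat{y}_t} = x_t \, e^{\varepsilon_t},
\qquad
\frac{\hat{x}_t - x_t}{x_t} = e^{\varepsilon_t} - 1 .
```

Under that assumption, a log-domain error of $\varepsilon_t = 1$ overestimates the volume by about 172% ($e - 1 \approx 1.72$), while $\varepsilon_t = -1$ underestimates it by about 63% ($e^{-1} - 1 \approx -0.63$).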

  17. Implications for endsystem measurement
      • Heavy hitter tracking a useful approach for network monitoring
      • Must be dynamic since heavy hitter set varies
          • between applications and
          • over time per-application
      • …but is it possible to define a baseline against which to detect (volume) anomalies?
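
A rough sketch of what dynamic heavy-hitter tracking could look like: recompute the heavy-hitter set in every time window and measure how much it changes from one window to the next. The definition of a heavy hitter used here (smallest set of hosts covering 80% of bytes), the Jaccard similarity measure and all function names are assumptions for illustration, not the authors' method.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set, Tuple


def heavy_hitters(volumes: Dict[str, int], fraction: float = 0.8) -> Set[str]:
    """Smallest set of top hosts covering at least `fraction` of all bytes."""
    total = sum(volumes.values())
    chosen: Set[str] = set()
    running = 0
    for host, vol in sorted(volumes.items(), key=lambda kv: kv[1], reverse=True):
        if running >= fraction * total:
            break
        chosen.add(host)
        running += vol
    return chosen


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set similarity in [0, 1]; 1.0 means the two sets are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def track(
    records: Iterable[Tuple[int, str, int]], fraction: float = 0.8
) -> List[Tuple[int, Set[str], float]]:
    """records: (window index, host, bytes).

    Returns one (window, heavy-hitter set, similarity to previous window)
    tuple per window; the first window is given similarity 1.0.
    """
    per_window: Dict[int, Counter] = {}
    for window, host, nbytes in records:
        per_window.setdefault(window, Counter())[host] += nbytes

    result: List[Tuple[int, Set[str], float]] = []
    previous: Optional[Set[str]] = None
    for window in sorted(per_window):
        current = heavy_hitters(per_window[window], fraction)
        similarity = 1.0 if previous is None else jaccard(previous, current)
        result.append((window, current, similarity))
        previous = current
    return result
```

To follow the per-application variation the slide mentions, the same functions could be run once per application by filtering the records first. A persistently low similarity between consecutive windows is one symptom of the difficulty raised in the final bullet: a fixed baseline is hard to define when the set itself keeps changing.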

  18. Questions?
