
Damien Garros, Network Reliability Engineer / Roblox

Introduction to Time Series Database for Network Engineer. CHINOG, May 23rd 2019. Damien Garros, Network Reliability Engineer / Roblox. Twitter @damgarros, GitHub @dgarros. What is Roblox? Educational platform for young software developers; gaming and social platform.


Presentation Transcript


  1. Introduction to Time Series Database for Network Engineer CHINOG, May 23rd 2019 Damien Garros, Network Reliability Engineer / Roblox Twitter @damgarros GitHub @dgarros

  2. What is Roblox ? • Educational platform for young software developers • Gaming and social platform • Core player audience is kids aged 9-12 • 2 million active developers • 90+ million monthly active users

  3. Agenda • Why do we need to learn that ? • Where are we today ? • Introduction to Time Series Database • Monitoring Stack @ Roblox • Q & A

  4. 1 Why do we need to learn that ?

  5. DATA IS THE NEW OIL FIND it EXTRACT it REFINE it MONETIZE it

  6. FIND - Some data we can get from a network

  7. As Network Engineers, we are sitting on a lot of data, but we don’t have the right tools

  8. EXTRACT & REFINE • How much data are you currently extracting from your network ? • How fast can you extract new data ?

  9. DHCP Relay pool status

  10. “MONETIZE” network data • Reduce time to root cause, for everyone in our org (no, it’s not the network) • Improve reliability • Enable new applications/systems/use cases • Increase the value of the network to the organization

  11. Edge Traffic per site/provider

  12. BGP interface Status tracking

  13. 2 Where are we today ?

  14. Legacy Network Monitoring Solution [Diagram labels: NMS, Logs, SNMP (Pull), Server (All in One), Devices]

  15. What tools are you using ?

  16. RRD Tools • Introduced in 1999 • Storage • Aggregation • Visualization • No query engine • Data retention is poor

  17. RRD Tools - Down Sampling [Graph panels: 1 Day, 1 Week, 1 Month]
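The down sampling on this slide can be illustrated with a minimal Python sketch. It is not how RRDtool is implemented; it only assumes that older data is kept as the average of fixed-size groups of samples. The function name `downsample` and the sample values are invented for the illustration.

```python
from statistics import mean


def downsample(samples, factor):
    """Average each consecutive group of `factor` samples into one point."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Six raw samples (made-up values) with a short spike to 100.
raw = [10, 20, 30, 40, 100, 0]

# Keep one averaged point for every two raw samples.
coarse = downsample(raw, 2)
```

After averaging, the spike to 100 is only visible as a value of 50, which is one way to read the "data retention is poor" remark on the previous slide: detail is lost once the raw samples are discarded.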

  18. Telemetry has been a hot topic in the network industry [Keywords: Telemetry Streaming, Kill SNMP, OpenConfig, gNMI]

  19. ... Network Monitoring Solution [Diagram labels: Telemetry Streaming, ???, Devices, Transport]

  20. [Diagram: PULL (SNMP) vs PUSH (Streaming)]

  21. What are others doing outside of the network industry ? • Push and Pull • Each component can scale out independently • The storage and visualization are decoupled • Store once, visualize as required [Diagram labels: Agent, Metrics, Logs, Store, Visualize, Alert]

  22. Datastore specialized by data format • Metrics / Time Series: numeric value evolving over time, constant interval (counters, CPU, number of peers) • Logs / Events: mostly text data, unpredictable interval • Structured Data: routing/forwarding table, configuration

  23. Open source projects - Monitoring / Alerting [Logo grid by category: Collector / Agent, Time Series Database, Alerting (Kapacitor, Elastalert), Visualization]

  24. Telegraf - The Swiss Army Knife • Plugin-driven agent / Extensible • Supported out of the box • Over 80 input plugins • Most databases (output) • Data manipulation • SNMP input plugin • Juniper / OpenConfig

  25. Cloud Based Solutions [Diagram labels: Agent (x4), Metrics, Store, Visualize, Alert]

  26. Reuse the same components for network devices [Diagram labels: Linux Based NOS, Legacy NOS, SNMP, Legacy Netconf / eAPI NOS, Streaming Telemetry Enabled, Agent, Custom Collector, Collector, Store, Visualize, Alert]

  27. 3 Time Series Database

  28. Modern Time Series Database • New generation of database optimized for time series data • Started around 2013, mainstream since 2016 • Powerful query engine • Decorrelate storage and visualization

  29. Introduction to Modern TSDB interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412 • Measurement name: what is it ? • Tags/Labels: contextual information • Value
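To make the three parts of a sample concrete, here is a small Python sketch that splits the line from the slide into measurement name, tags and value. The parser is illustrative only: `parse_sample` is a made-up helper, and it assumes the simple `name{key="value",...} number` shape shown on the slide, without escaped quotes or other edge cases.

```python
import re

# name{tags} value, as written on the slide
SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<tags>.*)\}\s+(?P<value>\S+)$')
# one key="value" pair inside the braces
TAG_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (measurement name, tags dict, numeric value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    tags = dict(TAG_RE.findall(match.group("tags")))
    return match.group("name"), tags, float(match.group("value"))


name, tags, value = parse_sample(
    'interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412'
)
```

Here `name` answers "what is it ?", `tags` carries the contextual information, and `value` is the number that evolves over time.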

  30. Introduction to Modern TSDB interface_output_bytes{device="br1-fra1"}

  31. Introduction to Modern TSDB interface_output_bytes{device="br1-fra1"}

  32. Introduction to Modern TSDB deriv(interface_output_bytes{device="br1-fra1"}[5m])*8
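The query above resembles PromQL, where `deriv` is documented as a per-second derivative computed with simple linear regression over the samples in the range. Assuming that is the engine in use, the following Python sketch shows the arithmetic behind the expression: fit a slope to (timestamp, counter) points, then multiply by 8 to go from bytes to bits. The function name and the sample points are invented for the illustration.

```python
def per_second_slope(points):
    """Least-squares slope of (timestamp_seconds, value) points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        raise ValueError("timestamps must not all be equal")
    return numerator / denominator


# Made-up byte counter sampled every 60 seconds: it grows by 6000 bytes per minute.
points = [(0, 1000), (60, 7000), (120, 13000)]

bytes_per_second = per_second_slope(points)
bits_per_second = bytes_per_second * 8  # same "* 8" as in the query
```

With these points the counter grows by 100 bytes per second, so the result is 800 bits per second.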

  33. Introduction to Modern TSDB sum by (device) (deriv(interface_output_bytes{device=~"br.*"}[5m]))
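As a rough illustration of what the aggregation in this query does, the Python sketch below keeps only the series whose `device` tag matches the regular expression and sums the remaining values per label. It assumes PromQL-like semantics, where the regex must match the whole tag value. `sum_by`, the interface names and the numbers are all made up for the example.

```python
import re
from collections import defaultdict


def sum_by(series, label, device_pattern):
    """Sum values per `label` for series whose device tag fully matches the pattern."""
    totals = defaultdict(float)
    for tags, value in series:
        if re.fullmatch(device_pattern, tags.get("device", "")):
            totals[tags.get(label, "")] += value
    return dict(totals)


# (tags, per-second rate) pairs with invented values.
series = [
    ({"device": "br1-fra1", "interface": "et-0/0/1"}, 100.0),
    ({"device": "br1-fra1", "interface": "et-0/0/2"}, 50.0),
    ({"device": "br2-fra1", "interface": "et-0/0/1"}, 25.0),
    ({"device": "spine1", "interface": "et-0/0/4"}, 999.0),
]

totals = sum_by(series, "device", r"br.*")
```

The `spine1` series is filtered out by the pattern, and the two `br1-fra1` interfaces collapse into a single total for that device.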

  34. Introduction to Modern TSDB deriv(interface_output_bytes{device="br1-fra1"} [5m]) / interface_speed{device="br1-fra1"}

  35. Introduction to Modern TSDB interface_output_bytes{device="spine1",interface="et-0/0/4"} 4569765412 interface_output_bytes{device="spine1",interface="et-0/0/4", role="leaf",site="fra1",provider="level3", intf_role="uplink"}

  36. Introduction to Modern TSDB sum by (provider) (deriv(interface_output_bytes{device="br1-fra1"}[5m]))

  37. 4 Monitoring @ Roblox

  38. Network Monitoring / Alerting @ Roblox • Created a Collector based on Netconf • Created an Alert Manager [Diagram labels: Visualize, Collect, Netconf, Alert]

  39. Collector - py-metric-collector • Dynamic Inventory • Dynamic Tagging • Sharding • Enum for State • Supports Junos and F5 • Supports multiple databases https://github.com/dgarros/py-metric-collector
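Two of the bullets above, "Sharding" and "Enum for State", can be sketched in a few lines of Python. This is not taken from py-metric-collector; it is a hypothetical illustration that assumes sharding means spreading devices across several collector instances, and that "Enum for State" means storing textual states as numbers. The names `OperState`, `encode_state` and `shard_for` and the numeric values are invented.

```python
import zlib
from enum import IntEnum


class OperState(IntEnum):
    # Hypothetical numeric encoding; a real collector may use other values.
    DOWN = 0
    UP = 1


def encode_state(text):
    """Turn a textual state such as 'up' into a number a TSDB can store."""
    return int(OperState[text.strip().upper()])


def shard_for(device, shard_count):
    """Pick a shard with a hash that is stable across processes and restarts."""
    return zlib.crc32(device.encode("utf-8")) % shard_count


devices = ["spine1", "br1-fra1", "br2-fra1"]

# Each collector instance would only poll the devices assigned to its shard.
assignment = {name: shard_for(name, 2) for name in devices}
```

A CRC is used here instead of Python's built-in `hash()` because the built-in string hash changes between interpreter runs, which would reshuffle devices on every restart.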

  40. Source of Data

  41. Network Monitoring / Alerting Stack [Diagram labels: Visualize, Collect, Netconf, Alert] Source of Truth: • Get list of devices • Get contextual information (role, site etc.) • Topology information • Device status • IP to interface mapping

  42. REFINE - Add more contextual data at runtime All: • Device Role • Site • Service Group • Junos Version Interfaces: • peer_role • Interface_role • Provider • circuit_id • geo_type
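A minimal Python sketch of this enrichment step, assuming the contextual data comes from a source of truth that can be looked up by device and by interface. The two lookup tables and the `enrich` helper are hypothetical; the tag values are the ones shown on slide 35.

```python
# Hypothetical extract of the source of truth, keyed by device name.
DEVICE_TAGS = {
    "spine1": {"role": "leaf", "site": "fra1"},
}

# Hypothetical per-interface context, keyed by (device, interface).
INTERFACE_TAGS = {
    ("spine1", "et-0/0/4"): {"provider": "level3", "intf_role": "uplink"},
}


def enrich(tags):
    """Return a copy of `tags` with device and interface context added.

    Source-of-truth values overwrite collected tags with the same key.
    """
    device = tags.get("device")
    interface = tags.get("interface")
    enriched = dict(tags)
    enriched.update(DEVICE_TAGS.get(device, {}))
    enriched.update(INTERFACE_TAGS.get((device, interface), {}))
    return enriched


collected = {"device": "spine1", "interface": "et-0/0/4"}
enriched = enrich(collected)
```

Once tags such as `provider` are attached at collection time, queries like the `sum by (provider)` example on slide 36 need no extra lookup at query time.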

  43. Alert Manager • Modular system to ingest alerts from any source • Advanced Suppression rules • Integration with external data sources • Group alerts (interface, bgp) based on topology information https://github.com/mayuresh82/alert_manager

  44. Thank You

  45. Resources

  46. Blogs / Videos / Tuto

  47. Links

  48. Links

  49. Pictures
