1 / 35

Monitoring best practices & tools for running highly available databases

Monitoring best practices & tools for running highly available databases. Miguel Anjo & Dawid Wojcik DM meeting – 20.May.2008. Oracle Real Application Clusters. Architecture. RAC1. RAC2. RAC5. RAC6. RAC3. RAC4. Highly Available databases – Oracle ‘services’.

fell
Download Presentation

Monitoring best practices & tools for running highly available databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Monitoring best practices &toolsfor runninghighly available databases Miguel Anjo & Dawid Wojcik DM meeting – 20.May.2008

  2. Oracle Real Application Clusters

  3. Architecture RAC1 RAC2 RAC5 RAC6 RAC3 RAC4

  4. Highly Available databases – Oracle ‘services’ • Resources distributed among Oracle services • Applications assigned to dedicated service • On node failure, resources re-distributed

  5. Highly Available databases – Apps and DB Release cycle Development service Validation service Production service Production service version 10.2.0.n Production service version 10.2.0.(n+1) • Applications’ release cycle • Database software release cycle Validation service version 10.2.0.(n+1)

  6. Why monitor? • Monitor (n.) • Computer Science. A program that observes, supervises, or controls the activities of other programs. • Need to keep all components in healthy state • We are prepared for single failures, some double failures • Commitment to give 24/7 best effort service • SW misbehavior affecting performance • Trends might indicate need to grow system • Security breaches

  7. Monitoring participants Presentation title - 7

  8. Monitoring participants Presentation title - 8

  9. What we monitor 25 database clusters 124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0 10 Tier1 sites for Streams replication 150+ Oracle ‘services’ / applications 2000+ user schemas 1M+ connections/day

  10. PDB-Backup • 2 node cluster • Using Oracle Clusterware • Running: • RACMon (monitoring agents) • StreamMon (monitoring agents) • Backups • Scripts repository • Monitored by Lemon. Set as Critical in Operator procedures

  11. Monitored components Servers Accessibility CDB state Tools: Lemon + RACMon + OEM Disk arrays Accessibility State given by controller Firmware, disk state, disk size, disk speed Tools: Lemon + RACMon Database SW Clusterware state Service accessibility Space available Oracle Streams Tools: RACMon + OEM + StreamMon Database usage OS CPU, I/O User Sessions, CPU, I/O User quotas, tablespace usage Bad usage (short connections, bind variables) Table fragmentation Tools: RACMon, Reports

  12. Best practises (I) No overhead to DB (monitored object) Monitor as much as possible Presentation layer simple & compact Possibility to drill down

  13. Best practises (II) Hierarchy of alarms and notifications Simplicity  reliability Centralized version vs. deployed everywhere Independent blocks (monitoring, dashboard, reporting) for HA

  14. Monitoring tools • Monitoring tools • Lemon, SLS • Basic Monitoring (in house development) • SQL scripts (reactive monitoring) • RACMon (in house development, openlab) • StreamMon (in house development , openlab) • OEM – Oracle Enterprise Manager (Grid Control) - openlab • Service oriented monitoring tools • Experiment reports • DB Availability & Performance Pages

  15. Basic monitoring • Checking every 5 minutes • Each failure  e-mail with error • 3 consecutive failures  SMS • Almost perfect for single instance databases • Limitations • On RAC, system survives to single HW failures • Users connect to ‘service’, not database instance • No other components (storage, clusterware) monitoring • Missing dashboard view

  16. DBA monitoring • SQL scripts – reactive monitoring (ad-hoc monitoring) • Pros: • Easy to use • Fast real time information • Cons: • No global overview • Diagnosing single problem • Requires expert knowledge

  17. RACMon requirements • Reliable (24/7) • Easy to use and configure • Provides up to date information (frequent runs) • Centralized – no configuration or deployment on RAC side • Web interface (RAC monitoring dashboard) – one common place for RACs’ status • Monitoring of Oracle services (DB and user level) and Oracle clusterware • Monitoring of ASM instances (diskgroups and failgroups) • Monitoring other parts of the infrastructure – backups, storage, … (easy extensibility) • Notification send via emails & SMSs to DBAs • Availability numbers (over extended periods of time) • Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)

  18. RACMon Architecture

  19. RACMon - examples

  20. RACMon - examples

  21. RACMon • Pros/Features: • Customized for our environment • Gives an overview of all our HW and RACs • Configurable alerts (via email and SMS) and alert levels (production or non-production systems) • Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon) • Cons: • Requires manpower for development

  22. Oracle Streams “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

  23. StreamMon

  24. StreamMon

  25. StreamMon • Streams availability and usage monitoring • Build in alerting in case of any error in streams stack • Pros: • Monitoring of all T1 sites in one place (streams monitoring not available in any other tool, including OEM) • Convenient and easy to use web interface • Advanced plotting utilities • Cons: • Required manpower for development (currently in maintenance only) • Uses not-standard libraries, requires customized server

  26. Oracle Enterprise Manager • Architecture: • Agent running on each server uploads information to central repository, if repository is not available, it caches data • Management Service provides insight into any monitored target details • Management Service based on set-up metrics and policies sends e-mails (SMSes) • Proactive monitoring possible (actions based on problem diagnostics)

  27. Oracle Enterprise Manager • Oracle Enterprise Manager Grid Control features

  28. Oracle Enterprise Manager • Pros: • Highly configurable alerts, metrics and notification policies • Advanced and easy to use web interface • Easy drill down • External product – fully supported • Cons: • Universal – requires more navigation • No global overview (per target oriented) • Customization for many target requires much work • Bugs may by intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances) • Manpower required for maintenance and configuration • Not reliable enough for 24/7 monitoring

  29. Weekly reports • Targeted to experiment DBAs and Coordinators • Information about • Bookkeeping – Application names, contacts • Resource usage – Sessions, CPU, Logical and Physical I/O • Security: Connection errors, expiring passwords, not used schemas • Space: consumed, fragmentation, recycle bin • Bad usage: short connections, queries missing bind variables

  30. Weekly reports PHP scripts Generate report over last 7 days Specific to one RAC cluster

  31. Weekly reports

  32. Weekly reports • Current functionality • Simple way to visualize whole DB usage • Concentrates on main users (dynamic) • Easy to spot problems (color coded) • Very good feedback from our users • Now working on user configurable reports

  33. DB availability and performance page • PHP, aggregation of other tools • Requested by experiments • Dashboard of “current” DB activity • Almost real time monitoring (up to last hour) • Application resource usage • No extra load • uses SLS, RACMon, StreamMon, weekly reports • Possibility to drill down

  34. DB availability and performance page

  35. Summary • Many monitoring components developed for our environment • Out of the box tools not sufficient • Open frameworks – new features easily added • Feedback given to Oracle Enterprise Manager development (openlab) • Very good feedback from T1s and experiments • Components included in experiment dashboards, WLCG ServiceMaps, SLS

More Related