Evaluation of a new Grid Engine Monitoring and Reporting Setup

Thomas Finnern

Abstract for Conference
  • Title
    • Evaluation of a new Grid Engine Monitoring and Reporting Setup
  • Summary
    • Dashboards and Event Correlation for Grid Engine with Splunk
  • Content
    • Splunk is a commercial software platform for collecting, searching, monitoring and analyzing machine data, providing interactive real-time dashboards that integrate multiple charts, reports and tables. We have been working on a Grid Engine setup based on the free branch, supporting standard reporting and simple job and fairshare debugging with easy chart generation and smart event correlation features. On top of this, we try to understand the added value of the enterprise branch, which supports integrated user authentication and role-based access controls. We plan to share our work in a publicly available Splunk Grid Engine app.
Outline
  • Grid Engine Data and Splunk
    • Integration into Splunk
    • Job Data: submit, prolog, epilog plus Infos on Errors
    • SoGE (Son of Grid Engine) Data: Messages and Accounting
    • System Data: Resources and Usage (numbers, projects)
  • Needs
    • Job Inspection (View 1)
    • Accounting and Reports (View 2)
    • Weekly Project Views (View 3)
    • Realtime System Data (View 4)
    • Role-Based Access
    • Grid Engine App …
  • Status and Conclusions
Integration into Splunk
  • Splunk Indexing
    • Index: Separate Directory used by all SoGE Data
    • Index: Events indexed by Time, Originating Host and Sourcetype
    • ASCII Store for Data Sets
    • Field Mapping on the Fly during Analysis
  • Splunk Data Input
    • Extra Syslog Port mapped to SoGE Index
      • Submit, Jobstart, Jobend, System Data
    • Splunk Forwarder
      • Running on Grid Engine Master
      • Reliable Upload to Splunk Server
      • Configured for Message File
      • Configured for Accounting File
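
Field mapping on the fly means that the index stores the raw ASCII event and fields are only extracted when a search runs. The following Python sketch illustrates that idea on a key=value event line. It is not Splunk's own extraction logic; the function name `extract_fields` and the sample event are hypothetical.

```python
import re

# Matches key=value where the value is either double-quoted or a bare token.
PAIR = re.compile(r'(\w[\w-]*)=(?:"([^"]*)"|(\S+))')


def extract_fields(raw_event):
    """Return a dict of the key=value pairs found in one raw event line."""
    fields = {}
    for key, quoted, bare in PAIR.findall(raw_event):
        # Exactly one of the two value groups is filled for each match.
        fields[key] = quoted if bare == "" else bare
    return fields


if __name__ == "__main__":
    # Hypothetical event in the style of the submit events described in this talk.
    event = 'eventtype="sgelog" sgeevent="submit" sgeuser=alice sgejobid=4711'
    print(extract_fields(event))
```

Because extraction happens at analysis time, new fields can be added to the events without re-indexing old data.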
Job Data via UDP Syslog
  • Setup
    • Perl Script in JSV (Job Submit Verification on Server), Prolog and Epilog
  • Submit
    • eventtype="sgelog" sgeevent="submit"
    • sgeuser sgejobid sgeroot sgecell sgehost
  • Prolog, Epilog
    • eventtype=sgelog sgeevent={prolog|epilog}
    • sgeuser sgejobid sgeroot sgecell sgehost sgequeue sgeslots sgetaskid sgearch sgehosts sgepe sgesubmithost
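
The setup above uses a Perl script; as a rough illustration of the same idea, here is a Python sketch that builds a key=value event and sends it as one UDP syslog datagram. The helper names `format_event` and `send_udp`, the priority value, and the host and port in the usage note are assumptions, not taken from the original scripts.

```python
import socket


def format_event(sgeevent, **fields):
    """Build one key=value event line for the given event name."""
    pairs = [("eventtype", "sgelog"), ("sgeevent", sgeevent)]
    pairs += sorted(fields.items())  # stable field order for readability
    return " ".join('{}="{}"'.format(key, value) for key, value in pairs)


def send_udp(message, host, port):
    """Send one message as a single UDP datagram with a syslog priority tag."""
    # <14> = facility "user" (1) * 8 + severity "informational" (6); assumed here.
    payload = "<14>" + message
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))
```

A submit hook could then call `send_udp(format_event("submit", sgeuser="alice", sgejobid=4711), "splunk.example.org", 5514)`, where the host name and port are placeholders. UDP delivery is not guaranteed, which is one reason the accounting and messages files go through the forwarder instead.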
Job Inspection (View 1): Find active users, jobs and hosts
  • index=bird | transaction sgejobid startswith=submit span=<time> | search sgejobid=* sgeuser=* | chart values(sgeproject) values(sgeuser) values(sgejobid) values(sgehost) by sgeuser
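
To make the intent of this search easier to follow, the Python sketch below mimics it loosely outside Splunk: events are grouped by job id, only groups that contain a submit event are kept, and the distinct jobs, hosts and projects are listed per user. It ignores the time span and does not reproduce Splunk's transaction semantics; the function name `active_by_user` and the sample data are hypothetical.

```python
from collections import defaultdict


def active_by_user(events):
    """Group events by job id and list distinct jobs, hosts and projects per user."""
    jobs = defaultdict(list)
    for event in events:
        if "sgejobid" in event:
            jobs[event["sgejobid"]].append(event)

    summary = defaultdict(lambda: {"sgejobid": set(), "sgehost": set(), "sgeproject": set()})
    for job_events in jobs.values():
        # Loosely mirrors startswith=submit: skip jobs without a submit event.
        if not any(e.get("sgeevent") == "submit" for e in job_events):
            continue
        users = {e["sgeuser"] for e in job_events if "sgeuser" in e}
        for user in users:
            for e in job_events:
                for key in ("sgejobid", "sgehost", "sgeproject"):
                    if key in e:
                        summary[user][key].add(e[key])
    return {user: {k: sorted(v) for k, v in cols.items()} for user, cols in summary.items()}


if __name__ == "__main__":
    sample = [
        {"sgeevent": "submit", "sgeuser": "alice", "sgejobid": "1", "sgehost": "master"},
        {"sgeevent": "prolog", "sgeuser": "alice", "sgejobid": "1", "sgehost": "node01"},
    ]
    print(active_by_user(sample))
```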
Job Data via TCP Splunk Forwarder
  • Setup
    • Splunk Forwarder (same RPM as on Splunk Master)
    • Reliable Data Collection over Splunk Protocol for accounting and messages files
  • Grid Engine Messages
    • Running on qmaster, optionally on worker node
    • sourcetype=ge_messages sgeevent="{I,W,E}" source sgejobid sgettaskid sgemessage sgescope
  • Grid Engine Accounting
    • Running on qmaster
    • sourcetype=ge_accounting source sgejobid sgedistro sgeuser sgear_submission_time sgearid sgecpu sgedepartment sgeend_time sgeexit_status sgefailed sgegranted_pe sgegroup sgehost sgeio sgeiow sgejobname sgemaxvmem sgemem sgepe_taskid sgepriority sgeproject sgequeue sgeru_idrss sgeru_inblock sgeru_ismrss sgeru_isrss sgeru_ixrss sgeru_majflt sgeru_maxrss sgeru_minflt sgeru_msgrcv sgeru_msgsnd sgeru_nivcsw sgeru_nsignals sgeru_nswap sgeru_nvcsw sgeru_oublock sgeru_stime sgeru_utime sgeru_wallclock sgeslots sgestart_time sgesubmission_time sgetask_number
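
The accounting file holds one colon-separated record per finished job. The Python sketch below shows how such a record could be mapped onto sge-prefixed field names like the ones listed above. The column order is assumed from the Grid Engine accounting(5) manual page and should be checked against the installed version; the `RENAME` mapping and the function name are guesses for illustration, not the actual Splunk field configuration.

```python
# Column order assumed from accounting(5); verify against your Grid Engine version.
ACCOUNTING_FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock", "ru_utime", "ru_stime",
    "ru_maxrss", "ru_ixrss", "ru_ismrss", "ru_idrss", "ru_isrss",
    "ru_minflt", "ru_majflt", "ru_nswap", "ru_inblock", "ru_oublock",
    "ru_msgsnd", "ru_msgrcv", "ru_nsignals", "ru_nvcsw", "ru_nivcsw",
    "project", "department", "granted_pe", "slots", "task_number",
    "cpu", "mem", "io", "category", "iow", "pe_taskid", "maxvmem",
    "arid", "ar_submission_time",
]

# Assumed mapping from file column names to the names used in the dashboards.
RENAME = {
    "qname": "queue",
    "hostname": "host",
    "owner": "user",
    "job_name": "jobname",
    "job_number": "jobid",
}


def parse_accounting_line(line):
    """Map one accounting record to sge-prefixed fields; None for comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    # Naive split: a category value containing ':' would shift later columns.
    values = line.split(":")
    return {
        "sge" + RENAME.get(name, name): value
        for name, value in zip(ACCOUNTING_FIELDS, values)
    }
```

In the setup described here this mapping is done inside Splunk at search time, so the sketch is only a way to inspect the file by hand.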
Reports and Accounting (View 2): Grid Engine Accounting (30d)
  • Pie "Sum CPU Seconds by Project"
  • Timechart "Sum CPU Seconds by Project"
  • Queue Timing "Wait Times and Wall Times by Queue"
  • Timechart "Jobs in Error by Project"
  • Table "All Values"
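
The aggregations behind the pie and queue timing panels are simple sums and averages over accounting records. A Python sketch under stated assumptions: records are dicts using the sge-prefixed field names from the accounting sourcetype, wait time is start minus submission, wall time is end minus start, and all three timestamps share one unit. The function names are hypothetical and the panels themselves are built with Splunk searches, not with this code.

```python
from collections import defaultdict


def cpu_seconds_by_project(records):
    """Sum the sgecpu field per project."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["sgeproject"]] += float(rec["sgecpu"])
    return dict(totals)


def mean_times_by_queue(records):
    """Return {queue: (mean wait, mean wall)} in the unit of the timestamps."""
    sums = defaultdict(lambda: [0, 0, 0])  # wait sum, wall sum, job count
    for rec in records:
        start = int(rec["sgestart_time"])
        entry = sums[rec["sgequeue"]]
        entry[0] += start - int(rec["sgesubmission_time"])
        entry[1] += int(rec["sgeend_time"]) - start
        entry[2] += 1
    return {queue: (wait / n, wall / n) for queue, (wait, wall, n) in sums.items()}
```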
System Data via UDP Syslog
  • Setup
    • Perl script running Commands qhost and qstat in Cron Job on Master and Slave
    • qhost provides Worker Node Resources, qstat shows Project Data
  • sgeevent="numbers"
    • sgelog: sgehost="global" sgeShareTotal sgeSumProject-<ProjectName> sgeSumJobs-Error sgeSumJobs-Waiting sgeSumJobs-SlotRun sgeSumJobs-Running sgeSumShares-<ProjectName> sgeSumShare-Store sgeSumShare-Memory sgeSumShare-Cores sgeSumShare-Hosts sgeSumCores-Total sgeSumCores-Available sgeSumDistro-<Name> sgeSumQueue-l<QName> sgeSumQueue-total sgeSumStore-mem_used sgeSumStore-h_vmem sgeSumStore-h_ftotal sgeSumStore-mem_total sgeSumStore-h_fused
  • sgeevent="projects"
    • sgelog: sgehost="global" sgeproject sgeshare sgestckt sgeovrts SlotInError sgeotckt sgetckts sgeftckt JobsInError SlotRunning sgemem JobsRunning sgeio sgecpu JobsWaiting
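
As an illustration of how summed values for a "numbers" event could be derived, the Python sketch below counts jobs by state. It assumes a list of qstat state codes as input and classifies them with simple rules (any code containing "E" is an error, "qw" is waiting, "r" is running); both the rules and the mapping onto the sgeSumJobs-* fields are assumptions, since the original Perl script is not shown.

```python
def summarize_job_states(states):
    """Count job state codes into the three sgeSumJobs-* totals."""
    counts = {
        "sgeSumJobs-Error": 0,
        "sgeSumJobs-Waiting": 0,
        "sgeSumJobs-Running": 0,
    }
    for state in states:
        if "E" in state:        # e.g. "Eqw"; checked first because it also contains "qw"
            counts["sgeSumJobs-Error"] += 1
        elif "qw" in state:     # e.g. "qw", "hqw"
            counts["sgeSumJobs-Waiting"] += 1
        elif "r" in state:      # e.g. "r", "dr"
            counts["sgeSumJobs-Running"] += 1
        # Other states (suspended, transferring, ...) are ignored in this sketch.
    return counts


if __name__ == "__main__":
    print(summarize_job_states(["r", "qw", "Eqw"]))
```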
Weekly Project View (View 3): Project Infos plus System Infos
  • Pies "Slots, CPU, IO and MEM by Project"
  • Timechart "Slots, Waiting Jobs and Tickets by Project"
Realtime System Data (View 4): System Health and Trends
  • Last 15 Minutes: Jobs, Slots, Waits and Errors
  • Timechart 24 Hours: Jobs, Slots, Waits and Errors
  • Timechart "Core Usage by Queue"
  • Sparklines: Trends for Unavailable Cores, Distros, Memory Usage
  • Table "All Summed Values"
Enterprise vs. Free Version
  • Enterprise
    • Price per Data Volume…
    • Limit / License Handling
    • More Data (> 500 M)
    • Role-Based Data Access
    • Auto Report
    • HA
  • Free
    • Limit on Data Volume …
    • Limit / License Handling
    • 500 Mbyte
    • Interactive Analysis
    • Overall Correlations
    • Easy Debugging
    • Puppet Install
Status and Conclusion
  • Status
    • Work in Progress
    • Fine-tuning Reports
    • Checking Data Consistency
    • Still Learning Splunk
  • Conclusions
    • Would like to buy
      • Role-Based Data Access
      • High Availability
  • Thank you for Listening