Tape Monitoring

Vladimír Bahyl, IT DSS
Storage Analytics Seminar
February 2011

Overview
  • From low level
    • Tape drives; libraries
  • Via middle layer
    • LEMON
    • Tape Log DB
  • To high level
    • Tape Log GUI
    • SLS
  • What is missing?
  • Conclusion
Low level – towards the vendors
  • Oracle Service Delivery Platform (SDP)
    • Automatically opens tickets with Oracle
    • We also receive notifications
    • Requires “hole” in the firewall, but quite useful
  • IBM TS3000 console
    • Central point collecting all information from 4 (out of 5) libraries
    • Call home via Internet (not modem)
    • Engineers come on site to fix issues
Low level – CERN usage
  • SNMP
    • Using it (traps) whenever available
    • Need MIB files with SNMPTT actuators:
    • IBM libraries send traps on errors
    • ACSLS sends activity traps
    • Event log messages on multiple lines concatenated into one
    • Forwarded via syslog to central store
    • Useful for tracking issues with library components (PTP)

EVENT ibm3584Trap004 . ibm3584Trap CRITICAL
FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4

Trap for library TapeAlert 004.
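The EXEC line above hands the trap off to a local script. A minimal sketch of what such a handler could do, with the function name and message format invented for illustration (this is not CERN's actual ibmlib-report-problem.sh):

```python
# Hypothetical handler logic behind an SNMPTT EXEC hook: SNMPTT passes
# the trap's source host ($A) and a severity on the command line, and
# the handler turns them into a one-line problem report.
from datetime import datetime, timezone

def build_report(library: str, severity: str, message: str) -> str:
    """Format a problem report line; a real handler would mail it or
    queue a vendor ticket instead of just returning it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    return f"{stamp} [{severity}] tape library {library}: {message}"

# SNMPTT would invoke the handler roughly as:
#   ibmlib-report-problem.sh ibmlib0 CRITICAL
report = build_report("ibmlib0", "CRITICAL", "TapeAlert 004 trap received")
```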


Middle layer – LEMON
  • Actuators constantly check local log files
  • 4 situations covered:
    • Tape drive not operational
    • Request stuck for at least 3600 seconds
    • Cartridge is write protected
    • Bad MIR (Media Information Record)
  • Ticket is created = email is sent
    • All relevant information is provided within the ticket to speed up the resolution
  • Workflow is followed to find a solution

Dear SUN Tape Drive maintainer team,

this is to report that a tape drive [email protected] has become non-operational.

Tape T05653 has been disabled.


01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0

01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error

01/28 15:33:05 10344 rlstape: TP033 - drive [email protected] not operational


Drive Name: T10B661D

Location: acs0,6,1,13

Serial Nr:

Volume ID: T05653

Library: SL8600_1

Model: T10000

Producer: STK

Density: 1000GC

Free Space: 0

Nb Files: 390


Pool Name: compass7_2

Tape Server: tpsrv963
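The actuator behaviour described above can be sketched as a pattern matcher over local log lines; the regular expressions and alarm names below are invented for illustration and are not the production LEMON sensor code:

```python
# Illustrative sketch of a LEMON-style actuator scanning log files for
# the four covered situations; patterns and names are assumptions.
import re
from typing import Optional

# One pattern per covered situation.
ALARMS = {
    "tapedrive_not_operational": re.compile(r"drive \S+ not operational"),
    "request_stuck":             re.compile(r"request \S+ stuck for (\d+) seconds"),
    "write_protected":           re.compile(r"cartridge \S+ is write.?protected"),
    "bad_mir":                   re.compile(r"bad MIR on \S+"),
}

def classify(line: str) -> Optional[str]:
    """Return the alarm raised by one log line, or None."""
    for alarm, pattern in ALARMS.items():
        match = pattern.search(line)
        if match is None:
            continue
        # A stuck request only raises an alarm after 3600 seconds.
        if alarm == "request_stuck" and int(match.group(1)) < 3600:
            return None
        return alarm
    return None
```

A matching line would then trigger ticket creation and the notification e-mail shown above.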

Middle layer – Tape Log DB
  • CASTOR log messages from all tape servers are processed and forwarded to central database
  • Allows correlation of independent errors (not a complete list):
    • X input/output errors with Y tapes on 1 drive
    • X write errors on Y tapes on 1 drive
    • X positioning errors on Y tapes on 1 drive
    • X bad MIRs for 1 tape on Y drives
    • X write/read errors on 1 tape on Y drives
    • X positioning errors on 1 tape on Y drives
    • Too many errors on a library
  • All logs archived for 120 days, split by VID and tape server
    • Q: What happened to this tape?
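The drive-centric correlations above amount to grouping errors by drive and counting distinct tapes. A toy sketch, with thresholds and record layout made up for illustration:

```python
# Hedged sketch of one Tape Log DB correlation: flag a drive once it has
# accumulated X I/O errors spread over at least Y distinct tapes.
from collections import defaultdict

def suspect_drives(errors, min_errors=5, min_tapes=3):
    """errors: iterable of (drive, vid) pairs, one per I/O error.
    Returns drives whose error pattern points at the drive itself
    rather than at a single bad tape."""
    per_drive = defaultdict(list)
    for drive, vid in errors:
        per_drive[drive].append(vid)
    return sorted(
        drive
        for drive, vids in per_drive.items()
        if len(vids) >= min_errors and len(set(vids)) >= min_tapes
    )

errors = [
    ("T10B661D", "T05653"), ("T10B661D", "T05654"), ("T10B661D", "T05655"),
    ("T10B661D", "T05653"), ("T10B661D", "T05654"),
    ("T10B662A", "T09999"), ("T10B662A", "T09999"),  # one bad tape, not the drive
]
suspects = suspect_drives(errors)
```

The tape-centric correlations are the same grouping with drive and VID swapped.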
Tape Log – the data
  • Origin: rtcpd & taped log messages
    • All tape servers sending data in parallel
  • Content: various file state information
  • Volume:
    • Depends on the activity of the tape infrastructure
    • Past 7 days: ~30 GB of text files (raw data)
  • Frequency:
    • Depends on the activity of the tape infrastructure
    • Easily > 1000 lines / second
  • Format: plain text
Tape Log – data transport
  • Protocol: (r)syslog log messages
  • Volume: ~150 KB/second
  • Accepted delays: YES/NO
    • YES: If the tape log server cannot upload processed data into the database, it will retry later, as it keeps a local text log file
    • NO: If the rsyslog daemon is not running on the tape log server, the lost messages will not be processed
  • Losses acceptable: YES (to some small extent)
    • The system is only used for statistics or slow reactive monitoring
    • Serious problem will reoccur elsewhere
    • We use TCP in order not to lose messages
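As a rough sketch, newline-framed syslog over TCP looks like the following; the host, port and tag are placeholders, and the real transport is plain rsyslog, not hand-written code:

```python
# Sketch of how a tape server forwards one log line over TCP syslog.
# TCP (rather than UDP) is what keeps messages from being silently lost.
import socket

def send_syslog_tcp(host: str, port: int, tag: str, message: str,
                    facility: int = 16, severity: int = 6) -> None:
    """Send one RFC 3164-style syslog line, newline-framed, as
    rsyslog's plain TCP transport does."""
    priority = facility * 8 + severity   # e.g. local0.info -> <134>
    frame = f"<{priority}>{tag}: {message}\n".encode()
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(frame)
```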
Tape Log – data storage
  • Medium: Oracle database
  • Data structure: 3 main tables
    • Accounting
    • Errors
    • Tape history
  • Amount of data in store:
    • 2 GB
    • 15-20 million records (2 years' worth of data)
  • Aging: no, data kept forever
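The three main tables can be pictured roughly as below; the column choices are guesses for illustration, and SQLite stands in for the production Oracle schema:

```python
# Toy sketch of the Tape Log DB's three main tables (accounting,
# errors, tape history); columns are assumptions, not the real schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE accounting   (vid TEXT, drive TEXT, bytes INTEGER, ts TEXT);
CREATE TABLE errors       (vid TEXT, drive TEXT, error TEXT,    ts TEXT);
CREATE TABLE tape_history (vid TEXT, event TEXT, ts TEXT);
""")
# "What happened to this tape?" becomes a query against tape_history.
db.execute("INSERT INTO tape_history VALUES ('T05653', 'DISABLED', '2011-01-28')")
rows = db.execute(
    "SELECT event FROM tape_history WHERE vid = 'T05653'"
).fetchall()
```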
Tape Log – data processing
  • No additional post-processing once the data is stored in the database
  • Data mining and visualization done online
    • Can take up to a minute
High level – Tape Log GUI
  • Oracle APEX on top of data in DB
  • Trends
    • Accounting
    • Errors
    • Media issues
  • Graphs
    • Performance
    • Problems
  • http://castortapeweb
Tape Log – pros and cons
  • Pros
    • Used by DG in his talk!
    • Using standard transfer protocol
    • Only uses in-house supported tools
    • Developed quickly; requires little/no support
  • Cons
    • Charting limitations
      • Can live with that; see point 1 – not worth supporting something special
    • Does not really scale
      • OK if only looking at last year’s data
High level – SLS
  • Service view for users
  • Live availability information as well as capacity/usage trends
    • Partially reuses Tape Log DB data
  • Information organized per VO
    • Text and graphs
  • Per day/week/month
  • Tape Service Manager on Duty
    • Weekly changing role to
      • Resolve issues
      • Talk to vendors
      • Supervise interventions
  • Acts on a twice-daily summary e-mail which monitors:
    • Drives stuck in (dis-)mounting
    • Drives out of production without a recorded reason
    • Requests running or queued for too long
    • Queue size too large
    • Supply tape pools running low
    • Too many tapes disabled since the last run
  • Goal: have one common place to watch
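Checks like those in the summary e-mail are simple threshold scans; a hedged sketch, with thresholds and field names invented for illustration:

```python
# Toy version of two TSMOD summary checks: requests queued too long
# and supply pools running low. Thresholds here are assumptions.
def summarize(requests, supply_pools,
              max_queue_age=4 * 3600, min_free_tapes=50):
    """requests: list of (request_id, seconds_queued) pairs;
    supply_pools: dict pool_name -> free tape count.
    Returns human-readable warning lines for the duty e-mail."""
    warnings = []
    for req_id, age in requests:
        if age > max_queue_age:
            warnings.append(f"request {req_id} queued for {age}s")
    for pool, free in supply_pools.items():
        if free < min_free_tapes:
            warnings.append(f"supply pool {pool} low: {free} tapes left")
    return warnings

lines = summarize([("r1", 20000), ("r2", 60)],
                  {"compass7_2": 12, "lhcb": 400})
```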
What is missing?
  • We often need the full chain
    • When was the tape last time successfully read?
    • On which drive?
    • What was the firmware of that drive?
  • Users hidden within upper layers
    • We do not know which exact user is right now reading/writing
    • The only information we have is the experiment name, and that is deduced from the stager hostname
  • Detailed investigations often require the request ID
Conclusion

  • CERN has extensive tape monitoring covering all layers
  • The monitoring is fully integrated with the rest of the infrastructure
  • It is flexible to support new hardware (e.g. higher capacity media)
  • The system is being improved as new requirements arise