Tape monitoring
Vladimír Bahyl, IT DSS TAB


Tape Monitoring



Storage Analytics Seminar

February 2011


Overview

  • From low level

    • Tape drives; libraries

  • Via middle layer

    • LEMON

    • Tape Log DB

  • To high level

    • Tape Log GUI

    • SLS

    • TSMOD


  • What is missing?

  • Conclusion

Low level – towards the vendors

  • Oracle Service Delivery Platform (SDP)

    • Automatically opens tickets with Oracle

    • We also receive notifications

    • Requires “hole” in the firewall, but quite useful

  • IBM TS3000 console

    • Central point collecting all information from 4 (out of 5) libraries

    • Call home via Internet (not modem)

    • Engineers come on site to fix issues

Low level – CERN usage

  • SNMP

    • Using it (traps) whenever available

    • Need MIB files with SNMPTT actuators:

    • IBM libraries send traps on errors

    • ACSLS sends activity traps


    • Event log messages on multiple lines concatenated into one

    • Forwarded via syslog to central store

    • Useful for tracking issues with library components (PTP)

EVENT ibm3584Trap004 . ibm3584Trap CRITICAL

FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'

EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL

NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4


Trap for library TapeAlert 004.
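The EXEC directive above hands each trap to a site script for reporting. A minimal Python sketch of what such an actuator could do, reusing the fields from the FORMAT line (the function and values are illustrative, not the actual ibmlib-report-problem.sh):

```python
# Illustrative sketch of an SNMPTT EXEC actuator: take the agent name,
# severity and decoded message that SNMPTT substitutes into its FORMAT
# line, and emit a one-line report for the central pipeline.
# None of this is the real ibmlib-report-problem.sh.

def format_trap_report(agent: str, severity: str, message: str) -> str:
    """Build a single report line, mirroring the FORMAT directive above."""
    return f"ON_BEHALF: {agent} SEVERITY: {severity} MESSAGE: {message}"

report = format_trap_report(
    "ibmlib2", "CRITICAL", "ASC/ASCQ 8001, Frame/Drive 2/10, TapeAlert 004"
)
print(report)
```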


Middle layer – LEMON

  • Actuators constantly check local log files

  • 4 situations covered:

    • Tape drive not operational

    • Request stuck for at least 3600 seconds

    • Cartridge is write protected

    • Bad MIR (Media Information Record)

  • Ticket is created and an e-mail is sent

    • All relevant information is provided within the ticket to speed up the resolution

  • Workflow is followed to find a solution

Dear SUN Tape Drive maintainer team,

this is to report that a tape drive T10B661D@tpsrv963 has become non-operational.

Tape T05653 has been disabled.


01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0

01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error

01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational


Drive Name: T10B661D

Location: acs0,6,1,13

Serial Nr:

Volume ID: T05653

Library: SL8600_1

Model: T10000

Producer: STK

Density: 1000GC

Free Space: 0

Nb Files: 390


Pool Name: compass7_2

Tape Server: tpsrv963
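The log excerpt in the ticket above shows the TP033 message such a check keys on. A sketch of how the "tape drive not operational" case could be detected in a log stream (the regular expression and function are illustrative; the real LEMON actuator is not shown on the slides):

```python
import re

# Illustrative sketch: scan taped/rtcpd log lines for the TP033
# "drive not operational" message and pull out the drive and tape server,
# the two pieces of information the ticket above reports.
TP033 = re.compile(
    r"TP033 - drive (?P<drive>\w+)@(?P<server>[\w.]+) not operational"
)

def find_dead_drives(lines):
    """Return (drive, server) pairs for every TP033 line seen."""
    return [
        (m.group("drive"), m.group("server"))
        for m in (TP033.search(line) for line in lines)
        if m
    ]

log = [
    "01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error",
    "01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963.cern.ch not operational",
]
print(find_dead_drives(log))  # [('T10B661D', 'tpsrv963.cern.ch')]
```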

Middle layer – Tape Log DB

  • CASTOR log messages from all tape servers are processed and forwarded to a central database

  • Allows correlation of independent errors (not a complete list):

    • X input/output errors with Y tapes on 1 drive

    • X write errors on Y tapes on 1 drive

    • X positioning errors on Y tapes on 1 drive

    • X bad MIRs for 1 tape on Y drives

    • X write/read errors on 1 tape on Y drives

    • X positioning errors on 1 tape on Y drives

    • Too many errors on a library

  • Archive of all logs for 120 days, split by VID and tape server

    • Q: What happened to this tape?
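The correlations listed above are essentially aggregations over the error records. An illustrative sketch using SQLite in place of the real Oracle schema (table and column names are invented): flag any drive that reported errors against several distinct tapes, which points at the drive rather than the media.

```python
import sqlite3

# Invented miniature of the errors table; the real schema lives in Oracle.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE errors (drive TEXT, vid TEXT, kind TEXT)")
db.executemany(
    "INSERT INTO errors VALUES (?, ?, ?)",
    [
        ("T10B661D", "T05653", "io"),
        ("T10B661D", "T09111", "io"),
        ("T10B661D", "T02222", "io"),
        ("T10A001A", "T05653", "position"),
    ],
)

# "X input/output errors with Y tapes on 1 drive": group by drive and
# count distinct tapes; several tapes failing on one drive blames the drive.
suspects = db.execute(
    """SELECT drive, COUNT(*) AS errors, COUNT(DISTINCT vid) AS tapes
       FROM errors GROUP BY drive
       HAVING COUNT(DISTINCT vid) >= 3"""
).fetchall()
print(suspects)  # [('T10B661D', 3, 3)]
```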

Tape Log – the data

  • Origin: rtcpd & taped log messages

    • All tape servers sending data in parallel

  • Content: various file state information

  • Volume:

    • Depends on the activity of the tape infrastructure

    • Past 7 days: ~30 GB of text files (raw data)

  • Frequency:

    • Depends on the activity of the tape infrastructure

    • Easily > 1000 lines / second

  • Format: plain text

Tape Log – data transport

  • Protocol: (r)syslog log messages

  • Volume: ~150 KB/second

  • Accepted delays: YES/NO

    • YES: If the tape log server cannot upload processed data into the database, it will retry later, as it keeps a local text log file

    • NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed

  • Losses acceptable: YES (to some small extent)

    • The system is only used for statistics or slow reactive monitoring

    • A serious problem will reoccur elsewhere

    • We use TCP in order not to lose messages
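The transport described above is plain syslog forwarded over TCP. A sketch of the framing involved (the header below is a simplified RFC 3164-style line; what rsyslog actually emits depends on its configured template):

```python
# Simplified sketch of a syslog message as forwarded over TCP: one
# newline-terminated line per message. TCP gives delivery guarantees that
# UDP syslog lacks, which is why losses stay small. Host and tag are
# illustrative; a real rsyslog header also carries a timestamp.

def syslog_frame(priority: int, host: str, tag: str, msg: str) -> bytes:
    return f"<{priority}>{host} {tag}: {msg}\n".encode()

frame = syslog_frame(134, "tpsrv963", "taped", "TP033 - drive not operational")
print(frame)  # b'<134>tpsrv963 taped: TP033 - drive not operational\n'
```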

Tape Log – data storage

  • Medium: Oracle database

  • Data structure: 3 main tables

    • Accounting

    • Errors

    • Tape history

  • Amount of data in store:

    • 2 GB

    • 15-20 million records (2 years' worth of data)

  • Aging: no, data kept forever
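A sketch of the three main tables named above, using SQLite in place of Oracle; only the table names come from the slide, every column is an invented example:

```python
import sqlite3

# The slide gives only the three table names; the columns here are an
# invented example, not the production Oracle schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE accounting   (ts TEXT, vid TEXT, drive TEXT, bytes INTEGER);
    CREATE TABLE errors       (ts TEXT, vid TEXT, drive TEXT, message TEXT);
    CREATE TABLE tape_history (ts TEXT, vid TEXT, event TEXT);
""")
tables = [row[0] for row in db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['accounting', 'errors', 'tape_history']
```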

Tape Log – data processing

  • No additional post-processing once the data is stored in the database

  • Data mining and visualization done online

    • Can take up to a minute

High level – Tape Log GUI

  • Oracle APEX on top of data in DB

  • Trends

    • Accounting

    • Errors

    • Media issues

  • Graphs

    • Performance

    • Problems

  • http://castortapeweb

Tape Log – pros and cons

  • Pros

    • Used by DG in his talk!

    • Using standard transfer protocol

    • Only uses in-house supported tools

    • Developed quickly; requires little/no support

  • Cons

    • Charting limitations

      • Can live with that; see point 1 – not worth supporting something special

    • Does not really scale

      • OK if only looking at last year’s data

High level – SLS

  • Service view for users

  • Live availability information as well as capacity/usage trends

    • Partially reuses Tape Log DB data

  • Information organized per VO

    • Text and graphs

  • Per day/week/month


High level – TSMOD

  • Tape Service Manager on Duty

    • Weekly changing role to

      • Resolve issues

      • Talk to vendors

      • Supervise interventions

  • Acts on twice-daily summary e-mail which monitors:

    • Drives stuck in (dis-)mounting

    • Drives not in production without any stated reason

    • Requests running or queued for too long

    • Queue size too large

    • Supply tape pools running low

    • Too many disabled tapes since the last run

  • Goal: have one common place to watch
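The twice-daily summary can be pictured as a set of independent checks whose non-empty findings are concatenated into one e-mail, giving that single place to watch; all thresholds, queue names and pool names below are invented for illustration:

```python
# Illustrative sketch of two of the checks feeding the TSMOD summary
# e-mail; every limit and name here is invented.

def check_long_queues(queues, limit=200):
    """Flag request queues above the (invented) size limit."""
    return [f"queue {name}: {size} requests"
            for name, size in queues.items() if size > limit]

def check_supply_pools(pools, minimum=50):
    """Flag supply tape pools running low on free cartridges."""
    return [f"pool {name}: only {free} free tapes"
            for name, free in pools.items() if free < minimum]

def build_summary(queues, pools):
    """Concatenate every non-empty finding into one report body."""
    findings = check_long_queues(queues) + check_supply_pools(pools)
    return "\n".join(findings) if findings else "Nothing to report."

summary = build_summary(
    queues={"atlas": 350, "cms": 40},
    pools={"supply_A": 12, "supply_B": 400},
)
print(summary)
```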

What is missing?

  • We often need the full chain

    • When was the tape last successfully read?

    • On which drive?

    • What was the firmware of that drive?

  • Users hidden within upper layers

    • We do not know which exact user is right now reading/writing

    • The only information we have is the experiment name, and that is deduced from the stager hostname

  • Detailed investigations often require the request ID


Conclusion

  • CERN has extensive tape monitoring covering all layers

  • The monitoring is fully integrated with the rest of the infrastructure

  • It is flexible to support new hardware (e.g. higher capacity media)

  • The system is being improved as new requirements arise