
CVMFS Post Mortem



Presentation Transcript


  1. CVMFS Post Mortem Doug Benjamin Duke University

  2. What happened?
     • PoolFileCatalog.xml became corrupt.
     • The relevant section of the file is:

         <File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
           <physical>
             <pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
           <physical>
             <pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
           </physical>
           <logical>
             <lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
           </logical>
         </File>

     • The first <pfn filetype="ROOT_All" name="/cvmfs/a ... entry is bogus: it is cut off mid-path, and the next <File> element starts on the same line.
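The corruption shown above leaves one more <File ...> opening than </File> closings, so even a very crude check would have flagged it. The sketch below is only an illustration under that assumption: the function name pfc_tags_balanced is hypothetical, and counting tags is not a real XML parse. Where libxml2's xmllint is installed, `xmllint --noout` gives a proper well-formedness check instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original cron script):
# crude structural check for a PoolFileCatalog.xml.
# It only compares the number of "<File " openings with "</File>"
# closings; it does not parse the XML.
pfc_tags_balanced() {
    catalog=$1
    # grep -o prints every match on its own line, so wc -l counts matches;
    # tr strips the padding that some wc implementations add.
    opens=$(grep -o '<File ' "$catalog" | wc -l | tr -d ' ')
    closes=$(grep -o '</File>' "$catalog" | wc -l | tr -d ' ')
    [ "$opens" -eq "$closes" ]
}
```

A publishing script could call `pfc_tags_balanced PoolFileCatalog.xml || exit 1` before the publish step, so that a truncated entry stops the run instead of reaching the Stratum 0 server.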

  3. What happened (2)
     • The lead CVMFS developer was cleaning the repository and triggered the publishing of the bogus file.
     • He did not know it was bogus (there is no way he would have known).
     • Within one hour, the Stratum 1 servers picked up the bogus file and published it.
     • Cron jobs on the Stratum 1 servers fetch files from the Stratum 0 server hourly.
     • CVMFS clients fetch files from the Stratum 1 servers whenever the time-to-live information expires or an automount of the CVMFS areas is triggered.

  4. How was the PFC created?
     • The PoolFileCatalog.xml is created by a cron script that runs the dq2-ls command in a loop, where $dir_list is

         dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"

       and ATLAS_POOLCOND_PATH is

         export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions

     • The loop itself:

         # loop over the directories
         for dir in $dir_list
         do
           # determine if there are any data sets
           ls -1 ${ATLAS_POOLCOND_PATH}/${dir}/* > /dev/null 2>&1
           if [ "$?" = "0" ]
           then
             echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> $LogFile 2>&1
             dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir} >> $PoolFileCatalogLog 2>&1
             # capture the exit status of dq2-ls with an "RC" prefix
             retcode=RC$?
             if [ $retcode != "RC0" ] ; then
               echo "Error - failed to update PoolFileCatalog - exiting " >> $LogFile 2>&1
               echo "Error - failed to update PoolFileCatalog - exiting "
               exit 1
             fi
           else
             echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> $LogFile 2>&1
           fi
         done

  5. What was the immediate fix?
     • The bogus lines were removed from the PoolFileCatalog.xml.
     • The cron job that does the file checkout and the eventual publishing was stopped and has not been restarted.

  6. Why did it happen?
     • It is not yet known why the PoolFileCatalog creation failed.
     • The logs did not give any indication of the failure.
     • There was no backup PFC file.

  7. Remediation steps
     • Ultimately, use Alessandro DeSalvo's sw-mgr code to get the datasets and create the PFC (it saves the older version).
     • This requires ATLAS software releases to be available on the conditions DB machine.
     • Steve Traylen is working on the CVMFS mounts; it is a bit tricky and troublesome.
     • Run the XML and file verification step from Misha Borodin in the cron job.

  8. Short term plans
     • Resume fetching of datasets to the machine.
     • This will be done manually (with the same script, without the publishing step).
     • PFC file creation will be run separately.
     • Add XML format verification.
     • Back up the PFC file (keep a few copies).
     • Once everything looks good, publish manually.
     • Update every day or so.
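The "keep a few copies" item can be sketched as a small rotation step that runs before a new PFC is written. This is only one possible approach, not the script actually used: the function name rotate_backups and the numbered-suffix naming (file.1 newest, file.N oldest) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep the last $2 copies of file $1 as
# file.1 (newest) ... file.N (oldest) before the file is overwritten.
rotate_backups() {
    file=$1
    keep=$2
    [ -f "$file" ] || return 0      # nothing to back up yet
    i=$keep
    # Shift file.(i-1) to file.i, starting from the oldest slot,
    # so no newer copy is overwritten before it has been moved.
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -f "$file.$prev" ]; then
            mv -f "$file.$prev" "$file.$i"
        fi
        i=$prev
    done
    # Preserve timestamps so the age of each backup stays visible.
    cp -p "$file" "$file.1"
}
```

Calling `rotate_backups PoolFileCatalog.xml 3` just before regenerating the catalog would leave the three previous versions on disk, which gives a known-good file to restore if a new one fails verification.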

  9. Intermediate plans
     • Once the ATLAS code is available:
     • Implement sw-mgr creation of the PFC and fetching of the datasets.
     • Initially this will be done by hand.
     • Ultimately it will be moved to a cron job.
     • E-mail notification will be added in case of failures.
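The e-mail notification item could look roughly like the wrapper below. It is a sketch under stated assumptions: the names run_or_notify and notify_failure, the subject line, and the address admin@example.org are all placeholders, and the default notifier relies on mailx being installed and configured on the host.

```shell
#!/bin/sh
# Placeholder address; override NOTIFY_ADDR in the cron environment.
NOTIFY_ADDR=${NOTIFY_ADDR:-admin@example.org}

# Send a message read from stdin; $1 is the subject.
# Assumes a working mailx on the machine.
notify_failure() {
    mailx -s "$1" "$NOTIFY_ADDR"
}

# Run a command; if it exits non-zero, send a notification and
# return the same exit code to the caller.
run_or_notify() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "command '$*' failed with exit code $rc" |
            notify_failure "PFC update failed"
    fi
    return "$rc"
}
```

In a cron script, each step would then be wrapped, for example `run_or_notify ./create_pfc.sh || exit 1`, where create_pfc.sh is a hypothetical script name. Because the exit code is passed through, existing error handling keeps working.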
