
Functionality Tests and Stress Tests on a StoRM Instance at CNAF



1. Functionality Tests and Stress Tests on a StoRM Instance at CNAF
   by Elisa Lanciotti (CNAF-INFN, CERN IT/PSS), Roberto Santinelli (CERN IT/PSS), Vincenzo Vagnoni (INFN Bologna)
   Storage Workshop at CERN, 2-3 July 2007

2. Contents
   • Setup description
   • Goal of the tests: tune the StoRM parameters so that it can be used as the T0D1 storage system at the CNAF T1
   • Tests on:
     • data throughput
     • access to data stored in StoRM
     • response of the system under stress
   • Summary and what is next to do

3. Setup description
   Two different instances have been used:
   • storm02.cr.cnaf.infn.it
     very small instance (intended for functionality tests and small-scale stress tests) where all StoRM services run in a single box (4 TB)
   • storm-fe.cr.cnaf.infn.it
     large instance (36 TB) used for throughput tests in write and read mode and for carrying out the stress tests
   Most of the results refer to this second setup.

4. Architecture of storm-fe.cr.cnaf.infn.it (diagram)
   • DNS-balanced front-ends (storm03, storm04): the FE accepts requests, authenticates them and queues them into the MySQL DB
   • Back-end (storm01, also hosting MySQL): the BE reads the requests from the DB and executes them on GPFS (running the GPFS clients)
   • 4 gridftpd servers, also running as GPFS servers
   • Please note the UI is a very old machine (PIII 1 GHz, 512 MB)

5. More details on the testbed
   • Front-ends: storm03, storm04
     dual AMD Opteron 2.2 GHz, 4 GB RAM
   • Back-end: storm01
     dual Intel Xeon 2.4 GHz, 2 GB RAM; also runs mysqld
   • 4 GPFS disk servers
     dual Intel Xeon 1.6 GHz, 4 GB RAM; also running gridftpd
   StoRM version 1.3-15

6. Throughput test description
   • Tests using low-level tools (preliminary to the FTS tests, which is what LHCb will finally use)
   • Multithreaded script: each thread keeps transferring the same source files (real LHCb DST and DIGI files, O(100 MB)) to always different destination files
   • Each thread does the following sequentially, for a configurable period of time:
     • PtP on StoRM
     • polling of StoRM until the destination TURL is ready (up to 10 retries, with exponentially increasing time between retries)
     • globus-url-copy (or lcg-cp) from source TURL to destination TURL (or from source SURL to destination TURL, respectively)
     • PutDone on StoRM
     • Ls on StoRM to compute the size transferred
     • iterate the previous actions until the total time is reached
   A sketch of one transfer thread is given below.
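A minimal sketch of one such transfer thread, assuming the StoRM clientSRM command-line client; the sub-command names (ptp, sptp, pd, ls), the endpoint options and the TURL parsing are illustrative and may differ from the actual test script:

   #!/bin/bash
   # One transfer thread: keeps copying the same source file to always-new
   # destination SURLs for a configurable amount of time.
   # NOTE: clientSRM sub-commands, options and output parsing are illustrative.
   ENDPOINT=httpg://storm-fe.cr.cnaf.infn.it:8444
   SRC_TURL=$1                                # TURL of a real LHCb DST/DIGI file, O(100 MB)
   DEST_DIR=srm://storm-fe.cr.cnaf.infn.it/lhcb/test/thread_$2
   DURATION=$3                                # seconds to keep transferring
   END=$(( $(date +%s) + DURATION ))

   while [ "$(date +%s)" -lt "$END" ]; do
     DEST_SURL=$DEST_DIR/file_$(date +%s%N)   # always a different destination file
     clientSRM ptp -e $ENDPOINT -s $DEST_SURL > ptp.out      # 1. PtP on StoRM
     # 2. poll StoRM until the destination TURL is ready (up to 10 retries,
     #    exponentially increasing sleep); the TURL parsing is hypothetical
     DEST_TURL=""
     for try in $(seq 0 9); do
       clientSRM sptp -e $ENDPOINT -s $DEST_SURL > sptp.out
       DEST_TURL=$(grep -o 'gsiftp://[^" ]*' sptp.out | head -1)
       [ -n "$DEST_TURL" ] && break
       sleep $(( 2 ** try ))
     done
     globus-url-copy "$SRC_TURL" "$DEST_TURL" || echo "transfer failed"  # 3. copy
     clientSRM pd -e $ENDPOINT -s $DEST_SURL                             # 4. PutDone
     clientSRM ls -e $ENDPOINT -s $DEST_SURL                             # 5. Ls to record the size
   done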

7. Throughput test description (cont'd)
   • Tuning of the optimal combination of nstreams × nprocesses × sources: short tests varying the number of streams and the number of source files (= threads) for each of the 7 sources (= T1 site endpoints), and varying the sources […]; a sketch of such a scan is given below
   • Use-case test: run a test with the best files × streams combination for a non-negligible time (>12 hours), using/emulating the full transfer chain (i.e. SRM v1 at the 7 source sites, SRM v2.2 at the StoRM destination, with lcg_utils and the StoRM clients)
   • Evaluation of the maximum throughput from CERN, using the source TURLs directly so that no delay is added by the SRM at the source
   • Running under these last (extreme, server-only) conditions for a sustained period of time (14 hours and more…)
   • Testing the removal capability offered by SRM v2, and by StoRM in particular: LHCb (struggling with the DC06 clean-up) would be very interested
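A sketch of how such a streams × files scan could be driven; the TURL lists and host names below are placeholders, not the actual endpoints used in the tests:

   #!/bin/bash
   # Illustrative scan over (streams x files) combinations.
   SRC_TURLS=(gsiftp://source-site.example/file1 gsiftp://source-site.example/file2)
   DEST_TURLS=(gsiftp://gridftp-host.example/gpfs/test/file1 gsiftp://gridftp-host.example/gpfs/test/file2)

   for NSTREAMS in 1 2 5 10; do
     for NFILES in 1 2; do
       START=$(date +%s)
       for i in $(seq 0 $((NFILES - 1))); do
         # -p sets the number of parallel data streams used by globus-url-copy
         globus-url-copy -p $NSTREAMS "${SRC_TURLS[$i]}" "${DEST_TURLS[$i]}" &
       done
       wait
       echo "streams=$NSTREAMS files=$NFILES elapsed=$(( $(date +%s) - START )) s"
     done
   done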

8. Throughput tests (use-case test)
   • Only using CERN, which (from previous tests) has been confirmed to be the best WAN-connected site to CNAF-StoRM
   • The throughput test is carried out without an SRM at the source (i.e. only input TURLs)
   • 150 MB/s sustained: 7.7 TB in 14 h
   • StoRM failures (no TURL returned, no PutDone status set, no info on the size retrieved): <0.5%
   • Transfer failures: 1-5% (depending on the source)

9. Throughput tests (from CERN only)
   • Total handled files: ~100K, i.e. at least 400K SRM interactions
   • Failure rate in copying: 0.2%** (** number of non-zero exit codes from globus-url-copy)
   • Failure rate due to StoRM: <0.1%
   • Amount of data (in 50K s): >14 TB
   • Bandwidth peak: 370 MB/s (from GridView: throughput from CERN)
   • Note: the Linux out-of-memory killer hit the desktop used as auxiliary client instance, and we started losing client processes

10. Removal test
   (plot: disk occupancy vs time)

   for i in $(seq -w 1 50); do
     clientSRM rmdir -r -e httpg://storm-fe.cr.cnaf.infn.it:8444 \
       -s srm://storm-fe.cr.cnaf.infn.it/lhcb/roberto/testrm/$i
   done

   17 TB of data spread over 50 directories deleted in 20 minutes

11. Access to data stored in StoRM (I)
   Preliminary operation: transfer of some LHCb datasets (1.3 TB) to StoRM.
   A test suite has been set up for basic functionality tests:
   • submit a job which opens a dataset in StoRM with ROOT
   • submit a job which runs DaVinci on datasets in StoRM
   The test job ships an executable (bash script) and an options file containing a list of SURLs. The executable:
   • downloads a tarball with an SRM 2.2 client and installs it locally on the WN
   • executes some client commands to get a TURL list from StoRM
   • runs DaVinci, which takes the TURL list in input and opens and reads the files stored in StoRM
   (A sketch of such an executable is given below.)
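A minimal sketch of what the job executable could look like; the tarball location, the clientSRM sub-command (ptg), the output parsing and the DaVinci invocation are assumptions for illustration, not the actual LHCb script:

   #!/bin/bash
   # Illustrative sketch of the test job executable run on the worker node.
   ENDPOINT=httpg://storm-fe.cr.cnaf.infn.it:8444

   # 1. download and install an SRM 2.2 client locally on the WN
   #    (tarball URL is hypothetical)
   wget -q http://some-repository.example/srm22-client.tar.gz
   tar xzf srm22-client.tar.gz
   export PATH=$PWD/srm22-client/bin:$PATH

   # 2. turn the shipped SURL list into a TURL list via StoRM
   #    (clientSRM ptg and the TURL parsing are illustrative)
   > turls.txt
   while read SURL; do
     clientSRM ptg -e $ENDPOINT -s "$SURL" > ptg.out
     grep -o '[a-z]*://[^" ]*' ptg.out | head -1 >> turls.txt
   done < surls.txt

   # 3. run DaVinci with an options file that points to turls.txt
   #    (invocation is illustrative)
   DaVinci.exe myDaVinci.opts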

12. Access to data stored in StoRM (II)
   • Results: on both storm02 and storm-fe the functionality tests are successful
   • Ongoing activity: repeat the test with many (~hundreds of) jobs accessing the data at the same time, to prove the feasibility of HEP experiment analysis
   • Next step: use the Ganga and DIRAC interfaces to submit the jobs (on the PPS)

13. First stress tests
   • Objectives:
     • test how many simultaneous requests the system can handle
     • see what happens when saturation is reached
   • Tests done on both systems: storm02 and storm-fe
   • First test: load the system with an increasing number of parallel jobs which make PtP (PrepareToPut) requests

14. Testbed description
   A main script on the UI launches NPROC parallel processes; each process works on its own destination directory in storm-fe (proc 1 → /lhcb/../dir1/, proc 2 → /lhcb/../dir2/, …, proc n → /lhcb/../dirn/). Each process:
   • first phase: lists the content of its destination directory in StoRM and removes all the files in it
   • second phase: performs N PtP requests to the system and polls it to get the TURL (no data transfer)
   Measurements:
   • total time to perform the N requests
   • percentage of failed requests
   (A sketch of the driver script is given below.)
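An illustrative sketch of such a driver; the clientSRM sub-commands, the directory layout and the failure accounting are simplified assumptions, not the exact test script:

   #!/bin/bash
   # Illustrative driver for the PtP stress test.
   ENDPOINT=httpg://storm-fe.cr.cnaf.infn.it:8444
   NPROC=${1:-100}     # number of parallel processes
   NREQ=${2:-50}       # PtP requests per process

   worker() {
     local dir=srm://storm-fe.cr.cnaf.infn.it/lhcb/stress/dir$1
     # phase 1: list the destination directory, then remove the files found in it
     clientSRM ls -e $ENDPOINT -s $dir > ls.$1.out
     # ... parse ls.$1.out and call clientSRM rm on each file (omitted) ...
     # phase 2: N PtP requests plus polling for the TURL, no data transfer
     local fail=0 start=$(date +%s)
     for i in $(seq 1 $NREQ); do
       clientSRM ptp -e $ENDPOINT -s $dir/file$i > /dev/null || fail=$((fail + 1))
       # ... poll with clientSRM sptp until the TURL is returned (omitted) ...
     done
     echo "proc $1: total time $(( $(date +%s) - start )) s, failed $fail/$NREQ"
   }

   for p in $(seq 1 $NPROC); do
     worker $p &
   done
   wait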

15. Preliminary results on storm-fe
   (plot: mean time per request vs number of parallel requests; a slight increase with the number of parallel requests is observed)
   (plots: examples of the distribution of the time per request for 500 (left) and 600 (right) parallel processes)

16. Results (II)
   Failed requests vs number of parallel requests:
   • almost no failures up to 500 parallel processes
   • for 600 parallel processes a non-negligible rate of failures is observed
   Causes of the failed requests: mainly 3 types of error were found:
   • "CGSI-gSoap: error reading token data: connection reset by peer" and "CGSI-gSoap: could not open connection! TCP connect failed in tcp_connect()"
   • Ls returns SRM_INTERNAL_ERROR: "client transport failed to execute the RPC. HTTP response: 0"
   • some client commands hung for hours (mainly StatusPtP)
   Almost 100% of the gSOAP-timeout failures occur in the first phase, when creating the destination directory or listing the content of the directories and deleting the files → specific tests are needed for rm, ls and mkdir.
   Almost no failures in the PtP-StatusPtP phase.

17. Ongoing activity on stress tests
   • Specific tests on the functionalities which have shown problems: Ls, rm, mkdir
   • Preliminary results:
     • mkdir: ~2% failures with 600 parallel jobs
     • rm: 6-7% failures with 600 parallel jobs
     All failures are due to gSOAP timeouts. More systematic tests are needed to study the dependency of the failure rate on the load of the system.
   • A very high load was noticed on the front-end during the test: 85% CPU usage; the back-end only at 15-20%
   • During the tests: collaboration with the StoRM developers to investigate and fix the problems found
   • Some optimization of the DB was done on the basis of the results of these tests

18. Data transfer using FTS 2.0
   • Simple tests of data transfer using FTS 2.0 from CASTOR at CERN (SRM v1) to StoRM at CNAF (SRM 2.2), already proved by Flavia Donno
   • FTS service endpoint of the CERN PPS: https://pps-fts.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer
   • Aim: run the throughput tests with the production instance of FTS (an illustrative submission command is shown below)
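As an illustration, a single transfer against that endpoint could be submitted with the gLite FTS command-line tools roughly as follows; the source and destination SURLs are hypothetical placeholders:

   # Submit one CASTOR -> StoRM transfer to the PPS FTS endpoint and poll its state.
   # The SURLs below are placeholders, not the actual files used in the tests.
   FTS=https://pps-fts.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer

   JOBID=$(glite-transfer-submit -s $FTS \
             srm://srm.cern.ch/castor/cern.ch/grid/lhcb/somefile.dst \
             srm://storm-fe.cr.cnaf.infn.it/lhcb/test/somefile.dst)

   glite-transfer-status -s $FTS $JOBID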

19. Summary and what is next to do
   So far:
   • estimation of the throughput: very good results obtained from several sites to StoRM
   • access to data: proved that DaVinci can access files stored in StoRM
   • first stress tests: very promising results; some tuning of the StoRM parameters is still ongoing
   • file transfer from CERN CASTOR to StoRM (SRM 2.2) via FTS 2.0
   Next to do:
   • continue the stress tests to tune the service parameters, in collaboration with the StoRM developers
   • on access to data: run many (~hundreds of) parallel jobs on the CNAF local batch system, and include Ganga and DIRAC in the job submission sequence

20. Acknowledgements
   Thanks to the CNAF storage staff and to the StoRM team for providing the resources for these tests and for their support.
