
gLite - PowerPoint PPT Presentation



PowerPoint Slideshow about 'gLite' - richard_edik



Put your hands

on gLite


gLite

<<

gLite (pronounced "gee-lite") is the next generation middleware for grid computing. Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centres as part of the EGEE Project, gLite provides a bleeding-edge, best-of-breed framework for building grid applications tapping into the power of distributed computing and storage resources across the Internet.

>>

see http://glite.web.cern.ch/glite/


Before the party begins…

Before you can use the grid, you must have a personal certificate: it should be stored on your UI under $HOME/.globus like this:

CB:~ gnegri$ ls -l .globus

total 0

-rw-r--r-- 1 gnegri staff 0 Sep 18 23:26 usercert.pem

-r-------- 1 gnegri staff 0 Sep 18 23:26 userkey.pem
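The listing shows the private key readable by its owner only, while the certificate is world-readable. A small Python sketch (not a gLite tool; the function name is made up for illustration) of how you might check this before creating a proxy, assuming a POSIX system:

```python
import os
import stat
import tempfile

def is_owner_only(path):
    """Return True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0

# Demonstration on a throwaway directory standing in for $HOME/.globus
with tempfile.TemporaryDirectory() as globus_dir:
    key = os.path.join(globus_dir, "userkey.pem")
    cert = os.path.join(globus_dir, "usercert.pem")
    for path in (key, cert):
        open(path, "w").close()
    os.chmod(key, 0o400)   # -r--------  as in the listing above
    os.chmod(cert, 0o644)  # -rw-r--r--  as in the listing above
    key_ok = is_owner_only(key)
    cert_owner_only = is_owner_only(cert)

print(key_ok, cert_owner_only)
```

On your UI you would call is_owner_only on $HOME/.globus/userkey.pem itself.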

The user certificate is not used directly: to raise the security level, you work with a time-limited proxy derived from it. There are basically 2 commands for creating a proxy:

grid-proxy-init <options>

voms-proxy-init <options>

Without options, voms-proxy-init does exactly the same as grid-proxy-init. If you also give a -voms <VO> option, you create a proxy with attributes read from a VOMS server (your proxy will tell the grid which VO you belong to, which privileges you have and your priority level on some resources… definitely, not all grid users are equal!)

The proxy is stored on the UI, under /tmp, with the name x509up_u<local_id>


Before the party begins…

Creating a proxy (either with grid-proxy-init or voms-proxy-init) is usually enough to run most gLite commands on the grid, but it may not be enough for submitting a job!

Job submission (passing the job from the local UI to the RB) can be done in two different ways:

glite-job-submit sends the job to the Network Server

glite-wms-job-submit sends the job to the WMProxy

The main difference is that the NS listens on a plain socket, while the WMProxy is a web service interface, which allows more flexibility and powerful features such as “bulk submission” (jobs are sent as a collection, possibly sharing their InputSandbox).

The glite-job-submit command only needs a valid proxy on your UI, while glite-wms-job-submit also requires a proxy delegated to the WMProxy server under a delegationID. The command to do this is

glite-wms-job-delegate-proxy -d <delegationID>

where delegationID is a user-defined string that is then passed when submitting the job (the option is mandatory):

glite-wms-job-submit --delegationid <delegationID> YourJob.jdl
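If you drive the two steps from a script, the order matters: delegate first, then submit with the same string. A hedged Python sketch that only builds and prints the two command lines (it does not run them); the helper names are invented here, the delegation string "guidone" is just an example, and the option spellings -d and --delegationid are the ones assumed from these slides:

```python
import shlex

def delegate_cmd(delegation_id):
    """Command line that delegates the proxy to the WMProxy under delegation_id."""
    return ["glite-wms-job-delegate-proxy", "-d", delegation_id]

def submit_cmd(delegation_id, jdl_path):
    """Command line that submits jdl_path using the previously delegated proxy."""
    return ["glite-wms-job-submit", "--delegationid", delegation_id, jdl_path]

# The same delegation string must be used in both steps
commands = [delegate_cmd("guidone"), submit_cmd("guidone", "YourJob.jdl")]
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list could be handed to subprocess.run on a UI where the gLite commands are installed.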


gLite elements

  • The gLite middleware is deployed through different elements:

  • UI - User Interface

  • RB - Resource Broker

  • LB - Logging & Bookkeeping

  • CE - Computing Element

  • WN - Worker Node

  • SE - Storage Element

  • BDII - Berkeley Database Information Index

  • LFC - LCG File Catalog

  • plus some other “service” elements


gLite: job workflow

RB: the heart of the grid. Sends the jobs on the grid and keeps track of them

BDII: LDAP database with info on LCG resources

UI: local machine on which the user defines his jobs.

All commands to the grid are issued from a UI

LB: an SQL database in which every change of status of a job is recorded

CE: the server of an LRMS (LSF, PBS, Torque…)

LFC: files stored on a SE are registered in the catalog

SE: output files are written on storage resources throughout the grid

WN: CPUs that actually execute the jobs


gLite: job workflow

  • The user defines his job on his User Interface by writing a JDL (see next slide).

  • The JDL is submitted to the Resource Broker.

  • From now on, the RB notifies the L&B about every change in status of the job.

  • The RB parses the JDL and queries the BDII in order to find the best CE matching the job requirements.

  • The RB sends the job to the Computing Element selected from the information published in the BDII.

  • The CE submits the job to its LRMS, which sends it to one of the underlying Worker Nodes.

  • Usually, at the end a job writes its output files to a Storage Element and, if the operation is successful, it registers them in the LFC catalog, so that they’ll be available to all grid users.

  • The log files are usually sent back to the RB and then to the UI, so that the user can check that the job has really run as expected.


Defining a job

A job is an executable that will run on a grid resource. In order to specify the executable (a simple command or a script), its arguments and its requirements, you have to write a JDL file.

JDL (Job Description Language) is a high-level language, based on the Classified Advertisement (ClassAd) language, used to describe the job’s characteristics and constraints. A JDL file consists of lines of the form

attribute = expression;

For example:

Executable = "/bin/echo";

For a full list of the attributes of the gLite JDL, please refer to the gLite documentation.
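To make the "attribute = expression;" form concrete, here is a minimal Python sketch that writes such lines from a dictionary. The function names are made up for illustration, and it only handles the two value kinds used in these slides (quoted strings and lists of strings); it is not a full JDL writer.

```python
def format_value(value):
    """Render a Python value as a JDL expression (strings and lists of strings only)."""
    if isinstance(value, str):
        return '"' + value + '"'
    if isinstance(value, (list, tuple)):
        return "{" + ",".join(format_value(item) for item in value) + "}"
    raise TypeError("unsupported value type: " + type(value).__name__)

def render_jdl_lines(attributes):
    """Return one 'attribute = expression;' line per dictionary entry."""
    return "\n".join(
        name + " = " + format_value(value) + ";" for name, value in attributes.items()
    )

jdl_text = render_jdl_lines({
    "Executable": "/bin/echo",
    "Arguments": "Hello World!",
})
print(jdl_text)
```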


HelloWorld.jdl

This is the (almost) simplest JDL possible

[

Executable = "/bin/echo";

Arguments = "Hello World!";

StdOutput = "HelloWorld.out";

StdError = "HelloWorld.err";

OutputSandbox = {"HelloWorld.out","HelloWorld.err"};

VirtualOrganisation = "atlas";

]

Note that the VirtualOrganisation attribute is not necessary if you created your proxy with voms-proxy-init -voms <VO>

You may submit it with

glite-wms-job-submit --delegationid <delegationID> HelloWorld.jdl

On submission, the RB returns a unique job identifier, the JobID, in the form

https://<RB_name>:9000/<unique_string>
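Since the JobID is an ordinary URL, a script can take it apart with standard tools. A small Python sketch (the function name is invented for illustration) using one of the JobIDs that appear in these slides:

```python
from urllib.parse import urlsplit

def parse_job_id(job_id):
    """Split a JobID into (RB host name, port, unique string)."""
    parts = urlsplit(job_id)
    return parts.hostname, parts.port, parts.path.lstrip("/")

rb_host, rb_port, unique = parse_job_id(
    "https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow"
)
print(rb_host, rb_port, unique)
```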


HelloWorld.jdl

To get the status of the job, you pass its JobID to the command

glite-wms-job-status

> glite-wms-job-status https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow

*************************************************************

BOOKKEEPING INFORMATION:

Status info for the Job : https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow

Current Status: Waiting

Submitted: Tue Sep 19 15:03:57 2006 CEST

*************************************************************

  • Possible job statuses are:

  • Submitted: job is entered by the user to the UI but not yet transferred to NS or WMP

  • Waiting: job has been accepted by the NS or WMP but not yet processed

  • Ready: job has been processed (matchmaking) but not yet transferred to the CE

  • Scheduled: job is waiting in the queue of the CE

  • Running: job is running on a WN

  • Done: job exited or it’s considered in a terminal state by CondorC

  • Aborted: job processing was aborted by WMS

  • Canceled: job has been canceled on user request

  • Cleared: output of the job has been retrieved after job successful conclusion
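If you poll the status from a script, you need to pull the "Current Status" line out of the command output and decide when to stop asking. A hedged Python sketch: the function names are invented, the sample text is trimmed from the output shown above, and treating Done, Aborted, Canceled and Cleared as the statuses after which the job runs no further is an assumption of this sketch based on the list above.

```python
import re

# Statuses after which the job will not run any further (assumption of this sketch)
FINAL_STATES = {"Done", "Aborted", "Canceled", "Cleared"}

def current_status(status_output):
    """Return the first word after 'Current Status:', or None if the line is missing."""
    match = re.search(r"^\s*Current Status:\s*(\w+)", status_output, re.MULTILINE)
    return match.group(1) if match else None

def is_final(status):
    return status in FINAL_STATES

sample = """\
BOOKKEEPING INFORMATION:
Status info for the Job : https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow
Current Status: Waiting
Submitted: Tue Sep 19 15:03:57 2006 CEST
"""
status = current_status(sample)
print(status, is_final(status))
```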


HelloWorld.jdl..?!

> glite-wms-job-status https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow

*************************************************************

BOOKKEEPING INFORMATION:

Status info for the Job : https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow

Current Status: Aborted

Logged Reason(s):

- File not available.Cannot read JobWrapper output, both from Condor and from Maradona.

- Job got an error while in the CondorG queue.

- Job got an error while in the CondorG queue.

- Job got an error while in the CondorG queue.

Status Reason: hit job shallow retry count (3)

Destination: cmsitbsrv01.fnal.gov:2119/jobmanager-condor-atlas

Submitted: Tue Sep 19 15:03:57 2006 CEST

*************************************************************

What happened?

We see from the output of the status command that the job tried to run on a machine in the US, at FNAL, which has condor as LRMS (not supported by LCG…).

Uhm… FNAL… Shouldn’t it be in OSG??


Investigations

  • How do we get info on a site?

  • There are 3 ways:

  • the LCG command lcg-infosites

  • the LCG command lcg-info

  • an LDAP query to the BDII

Geeks prefer the third one!

All you need is the name of a BDII.

The BDII is an LDAP database in which information about grid sites is collected and organized in a hierarchical schema named GlueSchema (its structure is well described in the gLite user guide, available on the gLite documentation web page).

If we want to use the BDII named exp-bdii.cern.ch, keeping in mind that it responds on port 2170 and has the base "mds-vo-name=local,o=grid", we may write

> ldapsearch -x -h exp-bdii.cern.ch -p 2170 -b "mds-vo-name=local,o=grid"

This will print out a lot of information about all the sites “registered” in it. Looking for the CE we’re investigating, cmsitbsrv01.fnal.gov:2119/jobmanager-condor-atlas, we would find that it’s actually published as an OSG site and with:

GlueCEInfoLRMSType: condor
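The ldapsearch output is plain "attribute: value" text with a blank line between entries, so it is easy to filter. A Python sketch under stated assumptions: the function name is invented, the sample is hand-written and heavily trimmed (real entries carry a dn line and many more attributes), and the "pbs" value for the second CE is a guess used only for illustration.

```python
def lrms_by_ce(ldif_text):
    """Map each GlueCEUniqueID to its GlueCEInfoLRMSType in blank-line separated entries."""
    mapping = {}
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                attrs[key] = value
        if "GlueCEUniqueID" in attrs and "GlueCEInfoLRMSType" in attrs:
            mapping[attrs["GlueCEUniqueID"]] = attrs["GlueCEInfoLRMSType"]
    return mapping

# Hand-written sample, not real BDII output
sample = """\
GlueCEUniqueID: cmsitbsrv01.fnal.gov:2119/jobmanager-condor-atlas
GlueCEInfoLRMSType: condor

GlueCEUniqueID: tbat01.nipne.ro:2119/jobmanager-lcgpbs-atlas
GlueCEInfoLRMSType: pbs
"""
lrms = lrms_by_ce(sample)
condor_ces = [ce for ce, kind in lrms.items() if kind == "condor"]
print(condor_ces)
```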


Investigations

Before solving the problem, another tip: in this case, the status command already gave good hints about why the job was aborted. Often, though, you need to dig deeper into the job history.

All information about a job is stored, almost persistently, in the LB (Logging & Bookkeeping), which is queried with the command

glite-wms-job-logging-info [options] <JobID>

Among the options, -v <0|1|2> tunes the verbosity from 0 (only the statuses are reported) to 2 (damn verbose!):

> glite-wms-job-logging-info -v 2 https://egee-rb-01.mi.infn.it:9000/BgWNAqxr_Vo1sNZu6uuXow


Requirements

How can we exclude sites with condor queues from our list of possible CEs?

The gLite JDL has an attribute named Requirements that fits perfectly!

The Requirements attribute tells the RB to choose only sites that satisfy the constraints the job needs in order to run properly. Note that the Requirements attribute holds a single expression, not a list: each constraint is written with comparison operators (==, !=, <, >, …), and if you have more than one you have to “concatenate” them with the boolean operators && and ||.

In our case the simplest Requirements would be

Requirements = other.GlueCEInfoLRMSType!="condor";

You may construct any expression you may need using the GlueSchema attributes of the CE (and SE).

As an example, suppose that your job has to run for approximately 1 day on a generic grid WN and needs a certain ATLAS software version, let’s say 11.0.42. The GlueSchema publishes the wall clock limit of a CE in minutes, in the attribute GlueCEPolicyMaxWallClockTime, so one day is 1440. Then you should add the following line to your JDL file:

Requirements = (other.GlueCEPolicyMaxWallClockTime > 1440) &&
Member("VO-atlas-release-11.0.42", other.GlueHostApplicationSoftwareRunTimeEnvironment);

Note that Member is a built-in function of the ClassAd language on which the gLite JDL is based.
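To see what the matchmaking does with such an expression, here is a Python sketch that applies two of the constraints from this slide to CE descriptions held in dictionaries. It only mimics the logic: the RB evaluates the real ClassAd expression, the function name is invented, and the software tags of the sample CEs are made up for illustration.

```python
def meets_requirements(ce):
    """Mimic: other.GlueCEInfoLRMSType != "condor" && Member(tag, software list)."""
    return (
        ce["GlueCEInfoLRMSType"] != "condor"
        and "VO-atlas-release-11.0.42" in ce["GlueHostApplicationSoftwareRunTimeEnvironment"]
    )

# Hypothetical attribute values for the two CEs seen in these slides
ces = [
    {
        "GlueCEUniqueID": "cmsitbsrv01.fnal.gov:2119/jobmanager-condor-atlas",
        "GlueCEInfoLRMSType": "condor",
        "GlueHostApplicationSoftwareRunTimeEnvironment": ["VO-atlas-release-11.0.42"],
    },
    {
        "GlueCEUniqueID": "tbat01.nipne.ro:2119/jobmanager-lcgpbs-atlas",
        "GlueCEInfoLRMSType": "pbs",
        "GlueHostApplicationSoftwareRunTimeEnvironment": ["VO-atlas-release-11.0.42"],
    },
]
matching = [ce["GlueCEUniqueID"] for ce in ces if meets_requirements(ce)]
print(matching)
```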


Who’s first?

Before trying to submit the HelloWorld.jdl with its brand new Requirements attribute, let’s introduce another attribute that plays an important role in the matchmaking: the Rank.

You may use the Rank to order the list of matching CEs by characteristics that may affect your jobs. Like the Requirements, the Rank is usually built from GlueSchema attributes.

As an example, you may prefer that your job be sent to the site with the most free CPUs, so that it is less likely to wait in the queue of an already busy site. Then you would add to your JDL

Rank = other.GlueCEStateFreeCPUs;

The Rank is a floating point number, and the matching CEs are ordered from the highest value to the lowest. The CE with the highest value will receive the job. To see the CEs matching your job in a Rank-ordered list you may issue

> glite-wms-job-list-match [--rank] your.jdl
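The ordering itself is nothing more than a sort from the highest Rank to the lowest. A Python sketch with invented CE names and free-CPU counts, using the Rank from this slide:

```python
def order_by_rank(ces, rank):
    """Return the CEs sorted from the highest rank value to the lowest."""
    return sorted(ces, key=rank, reverse=True)

# Hypothetical CEs; only the attribute name comes from the slide
ces = [
    {"name": "ce-a.example.org", "GlueCEStateFreeCPUs": 4},
    {"name": "ce-b.example.org", "GlueCEStateFreeCPUs": 120},
    {"name": "ce-c.example.org", "GlueCEStateFreeCPUs": 0},
]
ordered = order_by_rank(ces, rank=lambda ce: float(ce["GlueCEStateFreeCPUs"]))
print([ce["name"] for ce in ordered])
```

The first element of the ordered list is the CE that would receive the job.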


Back to HelloWorld

Now our HelloWorld.jdl looks like this

[

Executable = "/bin/echo";

Arguments = "Hello World!";

StdOutput = "HelloWorld.out";

StdError = "HelloWorld.err";

OutputSandbox = {"HelloWorld.out","HelloWorld.err"};

Requirements = other.GlueCEInfoLRMSType!="condor";

]

> glite-wms-job-submit --delegationid guidone HelloWorld.jdl

> glite-wms-job-status https://egee-rb-01.mi.infn.it:9000/HyUGIcK5n6JdobvQU1kdFw

*************************************************************

BOOKKEEPING INFORMATION:

Status info for the Job : https://egee-rb-01.mi.infn.it:9000/HyUGIcK5n6JdobvQU1kdFw

Current Status: Done (Success)

Exit code: 0

Status Reason: Job terminated successfully

Destination: tbat01.nipne.ro:2119/jobmanager-lcgpbs-atlas

Submitted: Tue Sep 19 16:03:29 2006 CEST

*************************************************************


Back to HelloWorld

Once a job finishes, you can retrieve the output files specified in the OutputSandbox back to your local UI:

> glite-wms-job-output https://egee-rb-01.mi.infn.it:9000/HyUGIcK5n6JdobvQU1kdFw

Connecting to the service https://193.205.78.5:7443/glite_wms_wmproxy_server

================================================================================

JOB GET OUTPUT OUTCOME

Output sandbox files for the job:

https://egee-rb-01.mi.infn.it:9000/HyUGIcK5n6JdobvQU1kdFw

have been successfully retrieved and stored in the directory:

/tmp/negri_HyUGIcK5n6JdobvQU1kdFw

================================================================================

> ls -l /tmp/negri_HyUGIcK5n6JdobvQU1kdFw

total 4

-rw-r--r-- 1 negri atlas 0 Sep 19 17:05 HelloWorld.err

-rw-r--r-- 1 negri atlas 13 Sep 19 17:05 HelloWorld.out

> cat /tmp/negri_HyUGIcK5n6JdobvQU1kdFw/HelloWorld.out

Hello World!


A more concrete case

A generic ATLAS job has these JDL attributes:

Rank = (other.GlueCEStateWaitingJobs == 0) ? ( (other.GlueCEStateFreeCPUs * 100) / ((other.GlueCEStateRunningJobs == 0) ? 1 : other.GlueCEStateRunningJobs) ) :

( -(other.GlueCEStateWaitingJobs * 100) / other.GlueCEStateRunningJobs) ;

Requirements =

(other.GlueCEStateStatus == "Production") &&

((other.GlueCEPolicyMaxCPUTime * other.GlueHostBenchmarkSI00) >= 120016) && (other.GlueHostMainMemoryRAMSize >= 500) &&

(other.GlueHostNetworkAdapterOutboundIP == true) &&

(Member("VO-atlas-release-11.0.42", other.GlueHostApplicationSoftwareRunTimeEnvironment) ||

Member("VO-atlas-offline-11.0.42", other.GlueHostApplicationSoftwareRunTimeEnvironment));

The Rank uses two nested constructs of the form “condition ? value1 : value2” and says

if a site has no waiting jobs, then use

(number of free CPUs * 100) / 1 if there are no running jobs

(number of free CPUs * 100) / number of running jobs if there are running jobs

else, if there are waiting jobs, use

- (number of waiting jobs * 100) / number of running jobs
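The same Rank can be written as a small Python function to try a few numbers by hand. This is only a transcription for illustration (the function name and the sample values are invented), and it uses ordinary floating point division, which may not match ClassAd arithmetic in every detail.

```python
def atlas_rank(free_cpus, running_jobs, waiting_jobs):
    """Transcription of the Rank expression above."""
    if waiting_jobs == 0:
        divisor = 1 if running_jobs == 0 else running_jobs
        return (free_cpus * 100) / divisor
    # Like the JDL expression, this branch does not guard against running_jobs == 0
    return -(waiting_jobs * 100) / running_jobs

idle_site = atlas_rank(free_cpus=10, running_jobs=0, waiting_jobs=0)
busy_site = atlas_rank(free_cpus=10, running_jobs=5, waiting_jobs=0)
queued_site = atlas_rank(free_cpus=0, running_jobs=8, waiting_jobs=4)
print(idle_site, busy_site, queued_site)
```

Sites with waiting jobs always get a negative Rank, so any site with an empty queue is preferred over them.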


Data management

The Storage Element is the service that allows a user or an application to store data for future retrieval. In gLite, every SE must have a GSIFTP server, which offers basically the same functionality as FTP, enhanced to support GSI security.

Files that are copied to a SE should then be registered in a catalog. A catalog is basically a database that maps the name of a file (logical file name) to its physical location (physical file name).

A file in a catalog may have more than one LFN (in principle, the LFN has nothing to do with the file’s real name) and more than one replica (that is, the same file may be present on two different SEs). What uniquely identifies it is the guid, grid unique identifier, a string of 40 bytes.


LCG File Catalog

gLite supports two different types of catalogs: LFC (LCG File Catalog) and RLS (Replica Location Service).

In this overview we’ll only deal with LFC, which is now the most used in ATLAS (the two catalogs are not synchronized!)

The catalog can be accessed using data management commands from the UI. Two environment variables must be set: the file catalog type and its address

export LCG_CATALOG_TYPE=lfc

export LFC_HOST=lfc-atlas-test.cern.ch

There are several LFC hosts on LCG and they are not synchronized, so you must use the same one consistently throughout your activity!

Usually there is one central LFC per VO, so in practice this risk is small.

LFNs in the LFC have a particular form: they are organized in a hierarchical, directory-like structure and look like this:

lfn:/grid/<VO>/<dir>/<filename>
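Because the LFN follows a fixed pattern, building one or taking it apart is simple string handling. A Python sketch with invented function names and an invented directory (users/negri) used only for illustration:

```python
LFN_PREFIX = "lfn:/grid/"

def make_lfn(vo, directory, filename):
    """Build lfn:/grid/<VO>/<dir>/<filename>."""
    return LFN_PREFIX + vo + "/" + directory.strip("/") + "/" + filename

def split_lfn(lfn):
    """Return (VO, directory, filename) for an LFN of the form above."""
    if not lfn.startswith(LFN_PREFIX):
        raise ValueError("not an LFC-style LFN: " + lfn)
    vo, _, rest = lfn[len(LFN_PREFIX):].partition("/")
    directory, _, filename = rest.rpartition("/")
    return vo, directory, filename

lfn = make_lfn("atlas", "users/negri", "HelloWorld.out")
print(lfn, split_lfn(lfn))
```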


LFC commands

There are, on the UI, some commands that directly interact with the LFC catalog.

Due to its particular LFN structure, files in the LFC catalog can be browsed as if they were in a unix filesystem. Try this:

> lfc-ls /grid/atlas

The lfc-ls command works just like ls on a local filesystem (it also accepts the -l option). In the same way, lfc-mkdir, lfc-chmod and lfc-chown behave almost like their Unix counterparts.

Easy as the LFC commands are, usually only lfc-ls is used.

Commands that write to the catalog or delete “directories” from it should be used with great caution: the risk is to cause inconsistencies between the catalog and the files on the SE.

The data management commands ensure that such inconsistencies are not created: they write to the catalog too, but they also check that no “harm” is done to the system.


Data management commands

Data management commands are of the form lcg-**.

Some of them only access the catalog:

lcg-aa    add alias

lcg-ra    remove alias

lcg-rf    register file

lcg-uf    unregister file

lcg-la    list aliases

lcg-lg    list guid

lcg-lr    list replicas

Some of them perform real data movement operations, usually updating the catalog to reflect the changes:

lcg-cp    copy a file locally (this command does not write to the catalog)

lcg-cr    copy a file to a SE and register it

lcg-del   delete a file (physically) together with its entry in the catalog

lcg-rep   replicate a file from one SE to another

In order for these commands to work, besides the 2 catalog variables, another env variable must be set:

export LCG_GFAL_INFOSYS=<BDII_address>:2170
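When the lcg commands are launched from a script rather than from an interactive shell, the three variables can be prepared in one place. A Python sketch (the function name is invented; the host names are the ones that appear in these slides and may not suit your VO):

```python
import os

def grid_env(lfc_host, bdii_host, base=None):
    """Return a copy of the environment with the three data management variables set."""
    env = dict(os.environ if base is None else base)
    env["LCG_CATALOG_TYPE"] = "lfc"
    env["LFC_HOST"] = lfc_host
    env["LCG_GFAL_INFOSYS"] = bdii_host + ":2170"
    return env

env = grid_env("lfc-atlas-test.cern.ch", "exp-bdii.cern.ch", base={})
for name in sorted(env):
    print(name + "=" + env[name])
```

The resulting dictionary can be passed as the env argument of subprocess.run when calling a data management command.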


Low level commands

There are some “low-level” commands made available to grid users that should be used with caution, working merely on the SE without updating the catalog. Anyway, 2 of them will prove to be real friends to anyone who has to look for files on the grid:

edg-gridftp-ls gsiftp://<SE_address>/<dir>/

globus-url-copy <src_file> <dest_file>

The first command lists the content of a directory on a remote SE, the second one is the base for every lcg tool that has to move data. The <src_file> and <dest_file> have to be in a fully qualified format:

file:///<abs_path>/<file_name> for local files

gsiftp://<SE_address>/<abs_path>/<file_name> for remote files
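A common slip is to forget that the path must be absolute, which is why a local URL ends up with three slashes. A Python sketch that builds both kinds of URL; the function names, the SE host name and the remote path are invented for illustration:

```python
def local_url(abs_path):
    """Build file:///<abs_path>/<file_name> from an absolute local path."""
    if not abs_path.startswith("/"):
        raise ValueError("an absolute path is required: " + abs_path)
    return "file://" + abs_path

def gsiftp_url(se_address, abs_path):
    """Build gsiftp://<SE_address>/<abs_path>/<file_name> for a file on a SE."""
    if not abs_path.startswith("/"):
        raise ValueError("an absolute path is required: " + abs_path)
    return "gsiftp://" + se_address + abs_path

src = local_url("/tmp/HelloWorld.out")
dest = gsiftp_url("se.example.org", "/data/atlas/HelloWorld.out")
print(src, dest)
```

The two strings are what you would pass as <src_file> and <dest_file>.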

Other useful low-level commands (to be used carefully!) are

edg-gridftp-rm <URL>

edg-gridftp-rmdir <URL>

edg-gridftp-rename <src_URL> <dest_URL>


Want to know more…

You may find all the information presented in these slides, and much more, on the gLite documentation web page

