This presentation details the ARDA-gLite Metadata Interface for efficient data management in GRID computing. It covers key concepts including metadata organization, schema management, and entry operations, along with a performance study comparing SOAP and TCP Streaming protocols. The ARDA interface aims to simplify metadata access and support robust data storage across various backends. Key implementations, user evaluations, and scalability metrics are discussed, making this a valuable resource for researchers and developers involved in GRID technologies.
Metadata Services on the GRID Nuno Santos ACAT’05 May 25th, 2005
Contents • Metadata on the GRID • ARDA-gLite Metadata Interface • The ARDA Implementation • Performance study: SOAP vs TCP Streaming
Metadata on the GRID • Metadata is data about data • Metadata on the GRID • Mainly information about files • Other information necessary for running jobs • Usually living on DBs • Need simple interface for Metadata access • Advantages • Easier to use by clients - no SQL, only metadata concepts • Common interface - clients don’t have to reinvent the wheel • Must be integrated in the File Catalogue • Also suitable for storing information about other resources
ARDA-gLite Metadata Interface • ARDA proposed an interface for Metadata access on the GRID • Designed jointly with the gLite/EGEE team • Incorporates feedback from GridPP • Endorsed by the EGEE standards committee (PTF) • Being implemented in gLite File Catalog (FiReMan) • Interface concepts • Metadata - Key-value pairs • Entry - Entities to which metadata is attached • Attribute – Holds information about an entry • Schema – A collection of attributes • Type – The type (int, float, string,…) • Name/Key – The name of the attribute • Value - Value of an entry's attribute • Entries are associated with schemas • Think of schemas as tables, attributes as columns, entries as rows
Interface Operations
• Schema management
    void createSchema(String schemaName, Attribute[] attributes)
    void dropSchema(String schemaName)
    void removeSchemaAttributes(String schemaName, String[] attributeNames)
    void addSchemaAttributes(String schemaName, Attribute[] attributes)
• Entry management
    void createEntry(MDEntry[] entries, String[] schemas)
    void removeEntry(String query)
    int setAttributes(String query, Attribute[] attributes)
    Attribute[] listAttributes(String entry)
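To make the schema and entry operations concrete, here is a minimal in-memory sketch. It is not the ARDA or gLite implementation. The class `ToyMetadataStore`, its simplified method signatures (one schema and one entry per call, entries addressed by name rather than by query), and the meaning of the integer returned by `set_attributes` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """A named collection of attributes, plus the entries attached to it."""
    attributes: dict = field(default_factory=dict)  # attribute name -> type name
    entries: dict = field(default_factory=dict)     # entry name -> {attribute: value}


class ToyMetadataStore:
    """Hypothetical in-memory stand-in for a metadata backend."""

    def __init__(self):
        self.schemas = {}

    def create_schema(self, schema_name, attributes):
        if schema_name in self.schemas:
            raise ValueError(f"schema exists: {schema_name}")
        self.schemas[schema_name] = Schema(attributes=dict(attributes))

    def add_schema_attributes(self, schema_name, attributes):
        self.schemas[schema_name].attributes.update(attributes)

    def remove_schema_attributes(self, schema_name, attribute_names):
        schema = self.schemas[schema_name]
        for name in attribute_names:
            schema.attributes.pop(name, None)
            # Dropping a column also drops its values from every row.
            for values in schema.entries.values():
                values.pop(name, None)

    def drop_schema(self, schema_name):
        del self.schemas[schema_name]

    def create_entry(self, schema_name, entry):
        self.schemas[schema_name].entries[entry] = {}

    def set_attributes(self, schema_name, entry, values):
        schema = self.schemas[schema_name]
        for name, value in values.items():
            if name not in schema.attributes:
                raise KeyError(f"unknown attribute: {name}")
            schema.entries[entry][name] = value
        # Assumption: the int result counts the attributes that were set.
        return len(values)

    def list_attributes(self, schema_name, entry):
        return dict(self.schemas[schema_name].entries[entry])


store = ToyMetadataStore()
store.create_schema("jobs", {"owner": "string", "events": "int"})
store.create_entry("jobs", "job001")
store.set_attributes("jobs", "job001", {"owner": "alice", "events": "1200"})
```

Values are kept as strings here, mirroring the `String value` field of the `Attribute` datatype. The type name is stored but not enforced in this sketch.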
Interface Operations
• Searching and retrieving entries
    MDResult query(MDQuery query)
    MDResult nextQuery(String token, MDQuery query)
    void endQuery(String token)
  Allows either stateful or stateless server implementations
• Datatypes
    Attribute { String schema; String name; String type; String value }
    MDEntry { String entry; Attribute[] attributes }
    MDQuery { String query; String queryType }
    MDResult { MDEntry[] entries; String token; Boolean done }
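The query, nextQuery and endQuery calls form an iterator over a result set. The sketch below shows one way a client could drain such an iterator. `ToyServer` is a hypothetical stateful server that ignores the query string and returns every stored entry in fixed-size chunks. Only the `entries`, `token` and `done` fields come from the `MDResult` datatype; the rest is assumed.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class MDResult:
    entries: list
    token: Optional[str]
    done: bool


class ToyServer:
    """Hypothetical stateful server: remembers an offset per open query."""

    def __init__(self, rows, chunk_size=2):
        self.rows = rows
        self.chunk_size = chunk_size
        self.sessions = {}  # token -> offset of the next row to send

    def query(self, query):
        token = uuid.uuid4().hex
        self.sessions[token] = 0
        return self.next_query(token, query)

    def next_query(self, token, query):
        start = self.sessions[token]
        end = start + self.chunk_size
        done = end >= len(self.rows)
        if done:
            del self.sessions[token]  # session closes at end of data
        else:
            self.sessions[token] = end
        return MDResult(entries=self.rows[start:end], token=token, done=done)

    def end_query(self, token):
        # Lets a client abandon a query before reading all the data.
        self.sessions.pop(token, None)


def fetch_all(server, query):
    """Client side: keep calling next_query until the server reports done."""
    result = server.query(query)
    entries = list(result.entries)
    while not result.done:
        result = server.next_query(result.token, query)
        entries.extend(result.entries)
    return entries


server = ToyServer(["e1", "e2", "e3", "e4", "e5"], chunk_size=2)
all_entries = fetch_all(server, "*")
```

Because nextQuery receives both the token and the original query, a stateless server could in principle encode the offset in the token itself and re-run the query on each call, rather than keeping a session table as this sketch does.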
ARDA Prototype • Validate proposed interface • Architecture: • Metadata organized in a hierarchy • Schemas can contain sub-schemas • Can inherit attributes • Analogy to file system: • Schema ↔ Directory; Entry ↔ File • Stability with large responses • Send large responses in chunks • Otherwise preparing large responses could crash server • Stateful server • DB → Server – Data streamed using DB cursors • Server → Client – Response sent in chunks
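The DB → Server step can be sketched with a database cursor that hands out rows in bounded chunks, so the server never has to build the whole response in memory. This example uses Python's built-in `sqlite3` module. The table name, columns and chunk size are made up for illustration, and the prototype's actual server is written in C++.

```python
import sqlite3


def stream_rows(conn, sql, chunk_size=100):
    """Yield lists of at most chunk_size rows, fetched lazily from a cursor."""
    cursor = conn.execute(sql)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield chunk


# Hypothetical table holding 250 entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (name TEXT, run INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(f"file{i}", i) for i in range(250)],
)

chunk_sizes = [
    len(chunk) for chunk in stream_rows(conn, "SELECT name, run FROM entries")
]
```

Each yielded chunk could then be forwarded to the client as one piece of the response, which matches the Server → Client step on the slide.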
ARDA Implementation • Backends • Currently: Oracle, PostgreSQL, SQLite • Two frontends • TCP Streaming • Chosen for performance • SOAP • Formal requirement of EGEE • Compare SOAP with TCP Streaming • Also implemented as standalone Python library • Data stored on filesystem
TCP Streaming Frontend • Text-based protocol (like SMTP, POP3,…) • Data streamed to client in single connection • Implementation • Server – C++, multiprocess • Clients – C++, Java, Python, Perl, Ruby
Example exchange:
    Client: listattr entry
    Server: 0 entry value1 value2 … <EOT>
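A client for a text protocol of this kind has to split each response into a status code and a payload. The sketch below shows the idea under several assumptions that the slides do not confirm: fields are newline-separated, the first field is a numeric status, and `<EOT>` stands for the ASCII end-of-transmission byte (0x04). The real wire format may differ.

```python
EOT = b"\x04"  # assumption: <EOT> is the ASCII end-of-transmission byte


def parse_response(raw: bytes):
    """Split one response into (status, payload_lines).

    Assumes newline-separated fields, a numeric status on the first line,
    and a trailing EOT marker.
    """
    body, sep, _ = raw.partition(EOT)
    if not sep:
        raise ValueError("incomplete response: no EOT marker")
    lines = body.decode("utf-8").splitlines()
    if not lines:
        raise ValueError("empty response")
    status = int(lines[0])
    return status, lines[1:]


status, payload = parse_response(b"0\nentry\nvalue1\nvalue2\n\x04")
```

A terminator-based framing like this lets the server start sending before it knows the total response size, which fits the streaming design described on the slide.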
SOAP Frontend • Most operations in interface implemented as simple SOAP calls • query() - based on iterators • Initial request – create session • Open cursor on DB • Return initial chunk of data and session token • Subsequent requests • Client calls nextQuery() using session token • Termination – session closed when: • End of data • Client calls endQuery() • Client timeout • Implementations • Server – gSOAP (C++) • Clients – Tested WSDL with gSOAP, ZSI (Python), Axis (Java)
Current Uses of the ARDA prototype • Evaluated by LHCb-bookkeeping • Migrated bookkeeping metadata to ARDA prototype • 20M entries, 15 GB • Feedback valuable in improving interface and fixing bugs • Interface found to be complete • ARDA prototype showing good scalability • Ganga (LHCb, ATLAS) • User analysis job management system • Stores job status on ARDA prototype • Highly dynamic metadata
Performance Study • SOAP increasingly used as standard protocol for GRID computing • Promising web services standard - Interoperability • Some potential weaknesses • XML encoding increases message size (4x to 10x typical) • XML processing is compute and memory intensive • How significant are these weaknesses? What is the cost of using SOAP? • ARDA metadata implementation ideal for comparing SOAP with a traditional RPC protocol
Benchmark Description • Protocols • TCP-S – TCP Streaming • SOAP – Clients with gSOAP (C++), Axis (Java) and ZSI (Python) • Operations • ping – A null RPC • add – Adds an entry • get – Gets all attributes of an entry • get (bulk) – Gets all attributes of several entries in a single operation • Entries • 60 attributes (ints, floats and strings) • 700 bytes on average • HTTP Keepalive/Persistent connections • HTTP Keepalive increases HTTP performance. Should improve SOAP performance. • gSOAP supports Keepalive. Axis and ZSI don’t. • TCP-S uses persistent TCP connections to compare with HTTP Keepalive
SOAP Data Overhead • Measure size overhead of XML encoding • Ping • 1000 requests • Minimal payload – less than 5 bytes per request • SOAP overhead around 8 times • Get attributes in bulk • Retrieve 1000 entries • Around 800KB of application data • Streaming in TCP • Iterators with SOAP – 4KB average SOAP packet payload • With keepalive • SOAP overhead around 2.5 times
[Figure: total data transferred (in KB)]
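One reason the relative overhead is much larger for pings than for bulk reads is that the XML envelope has a roughly fixed cost per message. The sketch below wraps a payload in a minimal SOAP 1.1 envelope and compares sizes. The element names `ping` and `arg` are hypothetical, HTTP headers are not counted, and real toolkits such as gSOAP emit more verbose messages, so the ratios here are not comparable with the measured figures on the slide.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace


def soap_envelope(operation: str, argument: str) -> bytes:
    """Wrap one string argument in a minimal SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, operation)
    ET.SubElement(call, "arg").text = argument
    return ET.tostring(envelope, encoding="utf-8")


def size_ratio(payload: str) -> float:
    """Encoded message size divided by raw payload size."""
    raw = payload.encode("utf-8")
    return len(soap_envelope("ping", payload)) / len(raw)


small_ratio = size_ratio("ping")      # a few bytes, like the ping benchmark
large_ratio = size_ratio("x" * 700)   # about one entry's worth of data
```

The fixed envelope dominates the small message and is amortized over the large one, which is the same direction as the difference between the ping and bulk results reported above.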
SOAP Toolkits performance • Test protocol performance • No work done on the backend • Switched 100 Mbit/s LAN • Language comparison • TCP-S with similar performance in all languages • SOAP performance varies strongly with toolkit • Protocols comparison • Keepalive improves performance significantly • On Java and Python, SOAP is several times slower than TCP-S
[Figure: 1000 pings]
Single client results (LAN) • Compare performance of different operations • C++ clients (gSOAP) • When backend must do work, differences between gSOAP and TCP-S are small • Bulk operations very important for performance • getBulk 4x faster than get
[Figure: 1000 pings / 1000 entries]
Single client results (WAN) • Client CERN, server Taiwan • ≈300 ms latency • Results dominated by latency • Execution time at server irrelevant • Large performance boost from latency hiding techniques: • keepalive – fewer TCP handshakes • bulk operations – fewer client/server interactions
[Figure: 1000 pings / 1000 entries]
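The effect of keepalive and bulk operations on a high-latency link can be estimated with a back-of-envelope model that counts round trips. The function below is an illustrative model, not a reproduction of the measurements: it assumes one round trip per TCP handshake and one per request/response, and it ignores server time and bandwidth. The function name and parameters are hypothetical.

```python
def estimated_seconds(operations, rtt_seconds, keepalive, bulk):
    """Estimate wall time as (handshakes + requests) * round-trip time."""
    # A bulk call replaces many requests with a single one.
    requests = 1 if bulk else operations
    # With keepalive the connection is opened once and then reused.
    handshakes = 1 if keepalive else requests
    return (handshakes + requests) * rtt_seconds


RTT = 0.3  # seconds, the approximate CERN to Taiwan latency quoted above

no_keepalive = estimated_seconds(1000, RTT, keepalive=False, bulk=False)
with_keepalive = estimated_seconds(1000, RTT, keepalive=True, bulk=False)
bulk_call = estimated_seconds(1000, RTT, keepalive=True, bulk=True)
```

Under these assumptions the model gives about 600 s without keepalive, about 300 s with keepalive, and under 1 s for a single bulk call. The bulk figure is a lower bound, since a large response is sent in several chunks and transfer time is not modelled.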
Scalability with Multiple Clients - Pings • Measure scalability of protocols • Switched 100 Mbit/s LAN • TCP-S 3x faster than gSOAP (with keepalive) • Poor performance without keepalive • Around 1000 ops/sec (both gSOAP and TCP-S)
[Figure: 1000 pings]
Scalability with Multiple Clients - getAttr • Measure scalability with realistic payload • Switched 100 Mbit/s LAN • All tests with keepalive • Smaller difference between gSOAP and TCP-S • TCP-S 2x faster (1000 vs 500 entries/sec) • Poor performance of non-bulk operations • 100 entries/sec
[Figure: 1000 entries]
Conclusions • A common Metadata Interface was developed by ARDA and gLite • Endorsed by the EGEE standards committee • Interface validated by ARDA prototype • Prototype in use by LHCb (bookkeeping, Ganga) and ATLAS (Ganga) • SOAP performance studied using ARDA implementation • Toolkit performance varies widely • Large SOAP overhead (over 100%)