Introduction to Data Management

Presentation Transcript


  1. Introduction to Data Management

  2. Scientific data management:
     – Large data volumes (tens of PB)
     – Distributed user base
     – Need for high-performance transfers
     – Need for data security (or not)
     – Scalability

  3. Data in “the Grid”? (Diagram labels: “The Grid”, Data, Data)

  4. Data in “the Cloud”? (Diagram labels: “The Cloud”, Data, Data)

  5. Transfer Protocols
     – GridFTP (http://www.ogf.org/documents/GFD.20.pdf), also known as “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf. RFC 3820)
     – HTTP(S)
     – WebDAV (RFC 4918)

  6. GridFTP – based on FTP
     • Ancient protocol: RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985)
     • Separate control and data connections
     • Extensions: RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389, 5797 (FEAT)

  7. Control connection: port 21 (FTP), port 2811 (GridFTP)
     • (Diagram labels: Client, Server)
     • Data connections and firewalls: active vs passive mode (PASV)
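The passive mode mentioned on slide 7 can be made concrete with a minimal sketch. In passive mode the client sends PASV on the control connection and the server answers with a 227 reply carrying six decimal numbers: four for the IP address and two for the data port (port = p1 × 256 + p2, per RFC 959). The client then opens the data connection itself, which is why passive mode tends to cope better with client-side firewalls. The function name `parse_pasv_reply` and the sample reply are illustrative assumptions, not taken from the slides, and the parser assumes the common parenthesised reply form.

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an FTP 227 reply to PASV.

    Assumes the widespread form '227 ... (h1,h2,h3,h4,p1,p2)'.
    """
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not reply.startswith("227") or match is None:
        raise ValueError("not a recognisable 227 PASV reply: %r" % reply)

    numbers = [int(group) for group in match.groups()]
    if any(n > 255 for n in numbers):
        raise ValueError("each field must fit in one byte: %r" % reply)

    host = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]  # high byte, then low byte
    return host, port


if __name__ == "__main__":
    # Hypothetical server reply using a documentation-only address.
    print(parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)"))
```

With the sample reply, the data connection would go to 192.0.2.10 on port 195 × 256 + 80 = 50000.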

  8. (Grid)FTP - “3rd party copying”

  9. GridFTP – extensions to FTP
     • GSI security (later RFC 3820)
     • Striping (and EBLOCK mode)
     • TCP buffer size control/negotiation?
     • Data channel authentication (DCAU)
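Why TCP buffer size matters for wide-area transfers can be shown with a little arithmetic. A single TCP stream can keep at most one buffer's worth of unacknowledged data in flight, so the usual rule of thumb is to size the buffer to the bandwidth-delay product (bandwidth × round-trip time). The sketch below is an illustration only: the function names and the example link (1 Gbit/s, 100 ms round trip, 64 KiB buffer) are assumptions, not figures from the slides.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_bits_per_s * rtt_s / 8


def max_throughput_bits_per_s(buffer_bytes: float, rtt_s: float) -> float:
    """Upper bound on single-stream throughput for a given buffer size."""
    return buffer_bytes * 8 / rtt_s


if __name__ == "__main__":
    # Assumed example link: 1 Gbit/s with a 100 ms round-trip time.
    bdp = bandwidth_delay_product_bytes(1e9, 0.1)
    print("buffer needed: %.1f MB" % (bdp / 1e6))

    # Assumed small buffer of 64 KiB on the same link.
    limit = max_throughput_bits_per_s(64 * 1024, 0.1)
    print("64 KiB buffer caps one stream at %.2f Mbit/s" % (limit / 1e6))
```

Under these assumed numbers one stream needs about 12.5 MB of buffer to fill the link, while a 64 KiB buffer caps it at roughly 5 Mbit/s. That gap is the motivation for buffer tuning and for the parallel or striped streams listed on the slide.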

  10. The Grid...
      • Ad hoc transfers between GridFTP endpoints
      • Initial user ingest? scp?
      • Hands-on with GridFTP: uberftp (cf. ftp)

  11. Moving data in (and to, and from) the Grid
      • “Manually”, with GridFTP
      • Portals, e.g. the NGS portal
      • GlobusOnline
      • FTS (as of 3.0, tbc)

  12. The gLite grid – daily TLA dose
      • EMI – European Middleware Initiative
      • UMD – Unified Middleware Distribution
      • EGI – European Grid Infrastructure
      • IGE – Initiative for Globus in Europe
      • NGI – National Grid Initiative

  13. The gLite grid – component TLAs
      • SE – Storage Element
      • SRM – Storage Resource Manager
      • LFC – LCG File Catalogue
      • FTS – File Transfer Service
      • BDII – Berkeley Database Information Index (LDAP)

  14. SRM (OGF GFD.129)
      – Control interface
      – Support for “spaces” (reserved areas)
      – Retention policies (replica, output, custodial)
      – Access latencies (offline, nearline, online)
      – Storage “type”: permanent, volatile
      Name resolution:
      • LFN – Logical File Name (optional); resolved by LFC into
      • GUID – Globally Unique Identifier; resolved by LFC into
      • SURL – Storage URL (or Site URL); resolved by SE into
      • TURL – Transfer URL (e.g. gsiftp)
      (Diagram labels: FTS, LFC, SRM, GridFTP, BDII, Storage Element)
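The LFN → GUID → SURL → TURL chain on slide 14 can be sketched as a pair of lookups followed by a protocol choice. This is a toy model, not the LFC or SRM API: the dictionaries stand in for the catalogue, every name and URL is a made-up placeholder, and the SE step simply swaps the URL scheme, whereas a real storage element may return a different host and path.

```python
from typing import Dict, List

# Toy catalogue tables. All entries are hypothetical placeholders.
LFN_TO_GUID: Dict[str, str] = {
    "lfn:/grid/demo/run1.dat": "guid:00000000-0000-0000-0000-000000000001",
}
GUID_TO_SURLS: Dict[str, List[str]] = {
    "guid:00000000-0000-0000-0000-000000000001": [
        "srm://se1.example.org/demo/run1.dat",
        "srm://se2.example.org/demo/run1.dat",
    ],
}


def lfc_lookup_guid(lfn: str) -> str:
    """Catalogue step 1: logical file name -> GUID."""
    return LFN_TO_GUID[lfn]


def lfc_lookup_surls(guid: str) -> List[str]:
    """Catalogue step 2: GUID -> one SURL per replica."""
    return GUID_TO_SURLS[guid]


def se_get_turl(surl: str, protocol: str = "gsiftp") -> str:
    """Storage element step: SURL -> TURL for the requested protocol.

    Simplification: only the scheme changes here.
    """
    prefix = "srm://"
    if not surl.startswith(prefix):
        raise ValueError("not an SRM URL: %r" % surl)
    return protocol + "://" + surl[len(prefix):]


def resolve(lfn: str, protocol: str = "gsiftp") -> List[str]:
    """Follow the whole chain and return a TURL for every replica."""
    guid = lfc_lookup_guid(lfn)
    return [se_get_turl(surl, protocol) for surl in lfc_lookup_surls(guid)]


if __name__ == "__main__":
    for turl in resolve("lfn:/grid/demo/run1.dat"):
        print(turl)
```

The point of the indirection is that users can keep one stable name while replicas are added, moved, or served over a different transfer protocol.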

  15. gLite – summary of basic data commands
      • lcg-cp <srcfile> <dstfile>: copy to/from SE, or between SEs (no LFC)
      • lcg-cr <srcfile> <dst>: copy file into SE, and register in LFC (guid)
      • lcg-del <guid>
      • lcg-rep <src> <dst>: replicate

  16. Exercises
      • Lots of small files (10^5, 10^6)
      • Large files (10^8 to 10^12)
      • Migration
      • Format migration, checksumming
      • Who can copy data? Write/modify?
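For the checksumming exercise, a minimal sketch of verifying a file before and after a copy or migration might look as follows. It reads the file in chunks so that large files do not have to fit in memory, and computes both Adler-32 (cheap) and SHA-256 (stronger). The function name `file_checksums`, the choice of these two algorithms, and the chunk size are assumptions for illustration; the slides do not prescribe any of them.

```python
import hashlib
import zlib
from typing import Tuple


def file_checksums(path: str, chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return (adler32_hex, sha256_hex) for the file at `path`.

    The file is read in chunks of `chunk_size` bytes.
    """
    adler = 1  # Adler-32 starting value
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            adler = zlib.adler32(chunk, adler)  # running checksum
            sha.update(chunk)
    return format(adler & 0xFFFFFFFF, "08x"), sha.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """Compare two files by checksum, e.g. source and migrated copy."""
    return file_checksums(path_a) == file_checksums(path_b)
```

Running `same_content(source, copy)` after a transfer gives a simple end-to-end check; for the "lots of small files" case the per-file overhead of opening and hashing becomes the thing to measure.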

  17. Exercises
      How is scientific data management different?
      • How do research disciplines differ?
      • What are the interdisciplinary benefits?
      How do grids and clouds differ?
      Can we trust the grids/clouds?
      Who leads the way? HEP? Industry?

  18. Storage Accounting – static
      Ongoing work...
      – Distributed storage systems
      – Temporary file copies created
      – Scheduled deletions
      – Inaccessible free space, reserved space
      – Filesystem/tape overheads
      – Timeliness and accuracy
      – Impact of compression

  19. GridFTP today
      • GridFTP – workhorse of WAN grid data (OGF standard)
      • The need for GSI (non-TLS)
      • Numerous LAN protocols...
      • ...moving towards more common standards? (e.g. HTTP)

  20. lcg-cr --vo dteam -l lfn:my_stuff -d srm-dteam.gridpp.rl.ac.uk file://`pwd`/foo.tmp
      guid:921ac0b8-82aa-61dc-0192-6effece
      Subsequent access and replication is by GUID

  21. Data Security
      • Data security is like data security everywhere...
      • Except that the devil is in the detail
      • And the details are always different...

  22. Data Security – Confidentiality
      • In flight, or at rest
      • The performance issue
      • And the time issue
      • Who can “activate” it?
      (Diagram labels: Data ×5)

  23. Data Security – Availability
      • LOCKSS again... clouds are good at this.
      • Somebody already thought about the difficult stuff...?
      • Liability, SLAs, ...
      (Diagram label: Data)

  24. Data Security – Availability
      DDoS
      • Intentional
      • Botnets
      • Unintentional

  25. Referencing Data
      • DOIs for data
      • DONA – Digital Object Numbering Authority
      • Granularity?
      • Licences, permissions
      • Implementing data policies

  26. Cloud Data – Cost
      • Clouds are elastic
      • Elasticity is good for (rapid) growth
      • Or shrinkth
      • Elasticity can be expensive, though
      • Compared to “traditional” data centre
      • Or in-house (but don’t underestimate this!)
      • Different cost models (Hybrids!)

  27. Infrastructure Security
      • End-to-end security
      • Authentication and authorisation
      • Developing a threat model
      • Protecting credentials
      • Usability of security
      • Anonymised??

  28. Infrastructure
      • Federated identity and single sign-on
      • Integration with existing infrastructures
      • Accounting
      • Securely...
      • Anonymously?
      • And billing

  29. The Role of Standards
      • Standards promote interoperation
      • And maturity (sometimes)
      • Interoperation solves problems
      • Sometimes
      • E.g. eggs and baskets
      • Standards are peer reviewed

  30. Other Data Services
      • iRODS – “data grid”
        – Successor to SRB
        – Server-side workflows: rules, microservices
      • Safety Deposit Box
        – Commercial product from Tessella
        – Data preservation

  31. NGS data services
      • NGS portal – https://portal.ngs.ac.uk/
      • http://www.ngs.ac.uk/tools/vbrowser
      • Databases: Oracle, MySQL

  32. EU Funded Data Projects
      • EUDAT (www.eudat.eu)
        – Collaborative iRODS-based infrastructure
        – Multidisciplinary, scalable, long tail
      • SCIDIP-ES (earth science), www.scidip-es.eu
      • SCAPE (www.scape-project.eu)
      • PANDATA (neutron/synchrotron), pan-data.eu

  33. New Stuff?
      • More mature approach to clouds?
      • CCN – Content Centric Networking
      • RAID --> ECC, “object” storage

  34. Exercises
      • Lots of small files (10^5, 10^6)
      • Large files (10^8 to 10^12)
      • Migration
      • Format migration, checksumming
      • Who can copy data? Write/modify?

  35. Exercises
      How is scientific data management different?
      • How do research disciplines differ? How much can be shared?
      • What are the interdisciplinary benefits?
      How do grids and clouds differ?
      Can we trust the grids/clouds?
      Who leads the way? HEP? Industry?

  36. References
      • www.ngs.ac.uk
      • www.ogf.org
      • UMD user guide https://edms.cern.ch/document/722398/
      • GridPP storage and data management group
        – http://www.gridpp.ac.uk/wiki/Grid_Storage
