Data Intensive Services for the LSDF

Presentation Transcript

  1. Data Intensive Services for the LSDF
     Jos van Wezel

  2. Intro
     • Past and Context
     • The Data Challenge ahead
     • LSDF at KIT
     • Software Services
     • Roadmap
     BioQuant, First Byte Symposium

  3. Steinbuch Centre for Computing
     Computer Centre of the Karlsruhe Institute of Technology
     • IT Services for KIT
     • High Performance Computing
     • Scientific Computing and Simulation
     • Large Scale Data Management & Analysis
     • Grid Computing
     • Cloud Computing
     • Virtualisation

  4. Data services at SCC
     • GridKa – LHC Tier 1 centre / 2002
       • WLCG Tier 1 centre
       • 10 PB storage, 16000 cores, 40 Gb/s networking
       • Dedicated to physics off-line computing
     • Biology contacts
       • Institute for Toxicology and Genetics at KIT / 2007
         • Initial use of ‘spare’ GridKa capacity
         • Indicating storage and computing needs
       • BioQuant / 2008
         • Prof. Dr. Wolfrum and Prof. Dr. Juling: joint proposal
         • Cooperation to procure storage for genomic research

  5. LSDF development time line at SCC/KIT
     • First ideas for LSDF / early 2008
       • Installation of an LSDF pilot with 150 TB storage and 4 servers
       • Development of initial concepts, i.e. tiered storage, Hadoop
       • Result: KIT proposal
     • SCC Helmholtz external review / spring 2009
       • LSDF is an excellent idea, but DO plan beyond KIT
       • Workshop held in 2/2009 to coordinate BioQuant and SCC efforts
     • Storage, funded by the State of Baden-Württemberg / late 2009
       • Tendering and negotiations by R. Eils (BioQuant) and R. Kupsch (SCC)
       • Storage for systems biology in Heidelberg (@BioQuant)
       • Storage for universities in Baden-Württemberg (@SCC)
       • Long-term digital archives (@SCC)
       • Storage support and services for state universities (@SCC)
     • Compute cluster for DIC & cloud research / late 2009
       • Bring computing (low latency) to storage
       • Use Hadoop to allow fast distributed data access

  6. LSDF Hardware today
     • Dedicated LSDF Data Acquisition Network
       • 10 Gb/s redundant backbone with 2 Nexus routers
       • Several KIT institutes: ITG, IPE, IAI, ANKA, GPI
       • Since 1 week: 10 Gb/s to BioQuant
     • File Servers and On-line Storage
       • IBM → 2 PB, 6 servers, SoFS
       • DDN → 750 TB, 8 servers, GPFS
     • Computing cluster
       • 464 cores, 2 TB total memory
       • Directly attached to storage (GPFS/DDN)
       • 110 TB HDFS, Hadoop native filesystem
       • Available from the Cloud environment OpenNebula
         • Users can deploy own dedicated VMs
         • Reliable, highly flexible, and very fast to deploy
     • Archival and off-line storage
       • Tape library
       • 6 LTO 5 drives
     Executive scientists: Serguei Bourov, Ariel Garcia

  7. LSDF realms
     • Access to LSDF (KIT) via standard protocols
       • Internal (inside firewall) via NFS/CIFS and DataBrowser
       • External (outside firewall) via ‘grid’ tools or http

  8. Users of the LSDF @ SCC/KIT
     • Biology
       • High Throughput Microscopy
       • Gene Sequencing
       • 0.5 PB/a, automated image processing
     • Synchrotron radiation facility (ANKA)
       • Tomography beamlines
       • 240 TB/a – 1 PB/a, data management
     • Climate research (IMK)
       • Several instruments mounted on satellites
       • 300 TB/a (till 2024), 20 years archiving
     • In development
       • BioQuant Archives
       • Biophysics (Nanoscopy, Nanoparticles, …)
       • Arts and Humanities (DARIAH)
       • Geophysics (Seismology, Applied seismic research)
       • Many others

  9. Measurement Data
     • Data is generated at increasing rates
     • Cost per byte measured is decreasing
     • Cost per byte of storage is decreasing

  10. Scientific data sources
      • In the past, big data resulted from simulations on supercomputers
      • Today, big data results from experiments, observations, measurements
      • Data is valuable because it is either unique or costly to obtain, or both

  11. From data to knowledge
      • The fourth paradigm
        • Experiment
        • Theory
        • Simulation
        • Data Exploration
      • Widely recognised, e.g.
        • “Riding the wave, How Europe can gain from the rising tide of scientific data.” Final report of the High Level Expert Group on Scientific Data, October 2010
        • Tony Hey, Stewart Tansley, Kristin Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, ISBN 978-0982544204, http://research.microsoft.com/en-us/collaboration/fourthparadigm/
        • Jim Gray, eScience Talk at NRC-CSTB meeting, Mountain View CA, 11 January 2007, http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

  12. A collaborative Data Infrastructure
      [Diagram labels: Scientific Experiments; DARIAH, CESSDA, LifeWatch, ENES, etc.; EUDAT, D4Science, ELIXIR, etc.; LSDF]

  13. Key demands of modern data driven science
      • Data storage and management beyond PetaBytes
      • Long-term digital archiving of raw and published data
      • Analysis with tools for data intensive computing
      • Visualisation and data mining tools for large amounts of measurement data
      • Integration of data handling with scientific workflows
      • Support and services from IT and data experts
      LSDF Blueprint

  14. Workflow: applicable for many data sources
      • Data is measured, buffered and validated in storage near the instrument (T0)
      • Data is curated, registered and moved into the LSDF (T1)
      • Data is processed for analysis
      • Each analysis step produces new, derived data that is also registered, stored and archived (T2)
      • New data is archived: immutable data
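The tiered workflow on this slide can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the names `DataSet`, `Registry`, `ingest` and `analyse` are hypothetical and do not come from any LSDF software, and the registry is a plain dictionary standing in for real registration and archiving services. It shows two ideas from the slide: derived data is registered as a new dataset at every step, and archived data is never overwritten.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataSet:
    """One registered, immutable dataset (hypothetical model)."""
    name: str
    tier: str                     # "T1" for curated raw data, "T2" for derived data
    parent: Optional[str] = None  # name of the dataset this one was derived from


class Registry:
    """In-memory stand-in for the registration and archive step."""

    def __init__(self) -> None:
        self._entries: Dict[str, DataSet] = {}

    def register(self, dataset: DataSet) -> DataSet:
        # Archived data is immutable: an existing name can never be overwritten.
        if dataset.name in self._entries:
            raise ValueError(f"{dataset.name!r} is already archived")
        self._entries[dataset.name] = dataset
        return dataset

    def lineage(self, name: str) -> List[str]:
        """Walk from a dataset back to the raw data it came from."""
        chain: List[str] = []
        current = self._entries.get(name)
        while current is not None:
            chain.append(current.name)
            current = self._entries.get(current.parent) if current.parent else None
        return chain


def ingest(registry: Registry, name: str, validated: bool) -> DataSet:
    """T0 -> T1: only data validated near the instrument enters the facility."""
    if not validated:
        raise ValueError(f"{name!r} failed validation at T0")
    return registry.register(DataSet(name, "T1"))


def analyse(registry: Registry, source: DataSet, step: str) -> DataSet:
    """Each analysis step registers a new derived dataset (T2)."""
    derived = DataSet(f"{source.name}.{step}", "T2", parent=source.name)
    return registry.register(derived)
```

Calling `ingest` once and `analyse` twice produces three registered datasets, and `lineage` on the last one lists the chain back to the raw measurement. Registering the same name a second time raises `ValueError`, which is how the sketch models immutability.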

  15. LSDF developments
      LSDF Infrastructure Technologies “for happy users”
      Scientific Experiments, Applications, Communities
      • Software for scientific data
        • Data management
        • Secure access and global authentication
        • Archival and bit preservation
        • Persistent identifiers
      • Data intensive computing
        • Storage and computing optimisation
        • Storage and file system design
      • Community services
        • Helpdesk and support
        • Integration of existing applications
      • Storage for the state of Baden-Württemberg
        • Scientific data (BioQuant)
        • Universities, archives, libraries
        • Desktop data

  16. Data Services
      • File systems, file protocols, databases
        • GPFS, NFS, CIFS, GridFTP, Oracle, MySQL
      • Hadoop
        • Shared cluster-wide file system, Map/Reduce framework
      • Cloud/OpenNebula
        • Fast deployment of virtual machines
      • iRODS
        • Rule-Oriented Data System
      • Automated processing of large image stacks
        • Kepler workflow engine
      • Data Ingest
      • Meta Data
      • ADALAPI
      • Data Browser
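The Map/Reduce programming model named on this slide can be sketched in a few lines of plain Python. This is not the Hadoop API and nothing here runs on a cluster; the three phases are ordinary functions, and the record layout (a stack identifier paired with an image size in MB) is a made-up example chosen to echo the image-stack use case.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[Pair], mapper: Callable[[Pair], List[Pair]]) -> List[Pair]:
    """Apply the mapper to every input record and collect the emitted pairs."""
    pairs: List[Pair] = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each group of values into one result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def size_mapper(record: Pair) -> List[Pair]:
    # Hypothetical record: (image stack id, size of one image in MB).
    stack_id, size_mb = record
    return [(stack_id, size_mb)]


def sum_reducer(key: str, values: List[int]) -> int:
    return sum(values)


def total_size_per_stack(records: Iterable[Pair]) -> Dict[str, int]:
    """Total size per image stack, computed as map -> shuffle -> reduce."""
    return reduce_phase(shuffle(map_phase(records, size_mapper)), sum_reducer)
```

In a real framework the map and reduce calls run in parallel on the nodes that hold the data, which is the point of attaching the compute cluster directly to storage; the sketch only shows the data flow.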

  17. Meta Data
      Meta data describes the contents of data
      • Everybody uses meta data:
        • File name and extension (e.g. picture.jpg, budget.xls, Readme.doc)
        • Location (e.g. /…/EU-projects/2011/Fishy/budget.xls)
        • Personal know-how
      → Sufficient for small file systems, desktops
      Try to locate a file or info somewhere in a file system
      • 15 years old?
      • in the file system of a colleague?
      • in a 100 PetaByte file system?

  18. Access to large scale data
      • Separate frameworks for data and meta data
      • Good scalability and access
      • Complicates transparent access

  19. Hierarchical Catalog System (Repository)
      • Sustainable and easily extensible for large amounts of data (size and number)
      • Independent of data formats
      • Performance by distributed access
      • Safety by redundancy
      • Use of open standards
      [Diagram labels: APIs and Tools; Catalogs; LSDF Systems (Computing, Storage); Meta data scheme repository; Logical Project Catalog (DB): LPN → LDN, meta data; Logical Directory Catalog (DB): LDN → LDN, LFN; Logical File Catalogs (DB): LFN → Physical File Name; example entries: Zebrafish I, Zebrafish II, ANKA BL1, Material research, Digital objects in Arts and Humanities, Generic file tree]
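The three catalog levels in the diagram (LPN → LDN, LDN → LDN or LFN, LFN → physical file name) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class `HierarchicalCatalog`, its method names, and the use of in-memory dictionaries in place of the catalog databases are hypothetical and not taken from the LSDF implementation.

```python
from typing import Dict, List


class HierarchicalCatalog:
    """Toy three-level catalog: projects, directories, files."""

    def __init__(self) -> None:
        self.projects: Dict[str, List[str]] = {}     # LPN -> list of LDNs
        self.directories: Dict[str, List[str]] = {}  # LDN -> list of LDNs or LFNs
        self.files: Dict[str, str] = {}              # LFN -> physical file name

    def add_project(self, lpn: str, ldns: List[str]) -> None:
        self.projects[lpn] = list(ldns)

    def add_directory(self, ldn: str, children: List[str]) -> None:
        self.directories[ldn] = list(children)

    def add_file(self, lfn: str, physical_name: str) -> None:
        self.files[lfn] = physical_name

    def resolve(self, lpn: str) -> Dict[str, str]:
        """Return {LFN: physical file name} for every file under a project."""
        found: Dict[str, str] = {}
        pending = list(self.projects[lpn])  # raises KeyError for unknown projects
        seen = set()
        while pending:
            name = pending.pop()
            if name in seen:  # guard against cyclic directory entries
                continue
            seen.add(name)
            if name in self.files:
                found[name] = self.files[name]
            pending.extend(self.directories.get(name, []))
        return found
```

Because applications only see logical names, a file can be moved to other storage by updating one LFN entry, which is one way to read the slide's claim of independence from data formats and locations.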

  20. DataBrowser
      • API: Data and meta data organization
      • GUI: File, data and project explorer
        • Easy to use
        • Extensible
        • World-wide access
      • Functions:
        • Data management
        • Queries in metadata catalogues
        • Up-/Download
        • Control of data analysis and visualisation workflows

  21. ADALAPI (Abstract Data Access Layer API)
      • Java class library
      • Seamless application access to LSDF
      • Independent of transfer protocol and location
      • Authentication
        • X.509 certificates
        • user/passwd
      • Protocols and file systems
        • local files
        • gsiftp
        • sftp
        • http(s)
        • hdfs
      [Diagram labels: Client software (Grid, Applications, Tools, Cloud, DataBrowser, Scientific exp., Workstations, Visualization, DAQ, …); LSDF Storage Infrastructure]
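The idea of an access layer that is independent of transfer protocol can be sketched as a dispatcher keyed on the URL scheme. ADALAPI itself is a Java class library and its real interface is not shown on the slide, so this Python sketch is purely illustrative: `AccessLayer`, `register`, `read` and `read_local` are hypothetical names, and only a local-file backend is implemented (assuming POSIX-style paths). Backends for gsiftp, sftp, http(s) or hdfs would be registered in the same way.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

Reader = Callable[[str], bytes]


class AccessLayer:
    """Route read requests to a backend chosen by URL scheme."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}

    def register(self, scheme: str, reader: Reader) -> None:
        self._readers[scheme] = reader

    def read(self, url: str) -> bytes:
        # A plain path without a scheme is treated as a local file.
        scheme = urlsplit(url).scheme or "file"
        reader = self._readers.get(scheme)
        if reader is None:
            raise ValueError(f"no backend registered for scheme {scheme!r}")
        return reader(url)


def read_local(url: str) -> bytes:
    """Backend for local files; accepts 'file:///path' or a bare path."""
    path = urlsplit(url).path
    with open(path, "rb") as handle:
        return handle.read()


def default_layer() -> AccessLayer:
    """Build a layer with the one backend this sketch implements."""
    layer = AccessLayer()
    layer.register("file", read_local)
    return layer
```

Client code calls `layer.read(url)` and never branches on the protocol itself; adding a new protocol means registering one more reader, with no change to the callers.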

  22. Conclusions
      • Important services have been deployed
      • Different communities at KIT are successfully using the LSDF (storing as well as on-line computing)
      • Development of new tools is in progress
      Roadmap
      • LSDF will grow, adding users and hardware
      • Contributing to EUDAT and Helmholtz Association infrastructures
      • Adding software and community services and support to hardware services

  23. The Steinbuch Centre for Computing at KIT congratulates BioQuant on its successful LSDF4LS launch. We are proud to cooperate with them and look forward to jointly advancing science by deploying innovative large scale data services.

  24. You have the data, we have the technology
      Thank you very much for your attention
      Jos.vanWezel@kit.edu
      Many thanks to: Serguei Bourov, Ariel Garcia, Rainer Kupsch, Achim Streit, Rainer Stotzka and all other KIT colleagues making LSDF happen