The Large Scale Data Management and Analysis Project (LSDMA)

Dr. Andreas Heiss, SCC, KIT

Presentation Transcript
Overview
  • Introducing KIT and SCC
  • Big Data
  • Infrastructures at KIT: GridKa and the Large Scale Data Facility (LSDF)
  • Large Scale Data Management and Analysis (LSDMA)
  • Summary and Outlook
Introducing KIT

KIT is both

  • a state university with research and teaching, and
  • a research center of the Helmholtz Association with program-oriented provident research

Objectives:

  • research
  • teaching
  • innovation

Numbers:

  • 24,000 students
  • 9,400 employees
  • 3,200 PhD researchers
  • 370 professors
  • 790 million EUR annual budget in 2012
Introducing Steinbuch Center for Computing
  • Provisioning and development of IT services for KIT and beyond
  • R&D
    • High Performance Computing
    • Grids and Clouds
    • Big Data
  • ~ 200 employees in total
    • 50% scientists
    • 50% technicians, administrative personnel and student assistants
  • named after Karl Steinbuch, professor at Karlsruhe University, creator of the term “Informatik” (German term for computer science)
Big Data

[Google Trends comparison: “cloud computing” vs. “big data” vs. “grid computing”, 2010–2013]

Big Data 2000 years ago
  • clearly defined purpose for collecting data: tax lists of all tax payers
  • data collection
    • distributed
    • analog
    • time-consuming
  • distributed storage of data
  • tedious data aggregation

“In those days Caesar Augustus issued a decree that a census should be taken of the entire Roman world.”

(Luke 2:1)

Big Data today

One buzzword … various challenges!

Industry

  • Data mining
  • Business intelligence
  • Get additional information from (often) already existing data
  • Data aggregation
  • Typically O(10) or O(100) TB
  • A new field to make money: products and services
  • Market shared between a few ‘big players’ and many start-ups / spin-offs

Science

  • Handling huge amounts of data
    • petabytes
    • distributed data sources and/or storage
    • (global) data management
    • high throughput
    • data preservation
Definition ofData Science

Venn diagram by Drew Conway (IA Ventures)

Big Data in science: LHC at CERN
  • Goals
    • search for the origin of mass
    • understanding the early state of the universe
  • LHC
    • went live in 2008
    • four detectors
    • main discovery so far: a Higgs boson

Trigger chain:

  • 40 MHz (1,000 TB/s equivalent) from the detector
  • Level 1 (hardware): 100 kHz (100 GB/s digitized)
  • Level 2 (online farm): 5 kHz (5 GB/s)
  • Level 3 (online farm): 300 Hz (250 MB/s), distributed to the worldwide LHC community

2012: 25 PB of data taken

Goal for 2015: 500 Hz at Level 3
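A quick sketch makes the reduction through the trigger chain concrete. The rates are taken from the slide; the per-level reduction factors are derived from them here, not stated in the original:

```python
# Event rates at each stage of the LHC trigger chain, as quoted on
# the slide. The reduction factors below are derived from them.
rates_hz = {
    "detector": 40e6,          # 40 MHz raw collision rate
    "level1_hardware": 100e3,  # 100 kHz after the Level-1 hardware trigger
    "level2_online": 5e3,      # 5 kHz after the Level-2 online farm
    "level3_online": 300.0,    # 300 Hz written out for the LHC community
}

# Reduction factor achieved by each successive trigger level.
stages = list(rates_hz.items())
for (prev_name, prev_rate), (name, rate) in zip(stages, stages[1:]):
    print(f"{prev_name} -> {name}: reduction factor {prev_rate / rate:,.1f}")

# Overall: from 40 MHz at the detector down to 300 Hz on storage.
overall = rates_hz["detector"] / rates_hz["level3_online"]
print(f"overall: 40 MHz -> 300 Hz, factor ~{overall:,.0f}")
```

The hardware trigger alone discards 399 of every 400 events; the two online-farm levels together contribute only another factor of a few hundred.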

Big Data in science: LHC at CERN (continued)

  • O(1000) physicists, distributed worldwide, analyze the data

Worldwide LHC Computing Grid – Hierarchical Tier Structure

Hierarchy of services, response times and availability:

  • 1 Tier-0 center at CERN
    • copy of all raw data (tape)
    • first pass reconstruction
  • 11 Tier-1 centers worldwide
    • 2 to 3 distributed copies of raw data
    • large-scale data reprocessing
    • storage of simulated data from Tier-2 centers
    • tape storage
  • ~150 Tier-2 centers worldwide
    • user analysis
    • simulations
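With the 25 PB taken in 2012, the copy counts above give a rough raw-data tape footprint. This is a back-of-the-envelope estimate of mine, not an official WLCG figure:

```python
# Back-of-the-envelope estimate of the tape footprint of one year's
# raw data across the tiers, using the 25 PB taken in 2012 and the
# copy counts from the slide (not an official WLCG figure).
raw_2012_pb = 25.0

tier0_copies = 1              # one full raw copy on tape at CERN
tier1_copies_range = (2, 3)   # 2 to 3 distributed copies across the Tier-1s

low = raw_2012_pb * (tier0_copies + tier1_copies_range[0])
high = raw_2012_pb * (tier0_copies + tier1_copies_range[1])
print(f"raw-data tape footprint: {low:.0f}-{high:.0f} PB")  # 75-100 PB
```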

The hierarchical model has since been relaxed: from a strict hierarchy to a mesh. (Diagram courtesy of Ian Bird, CERN)

Big Data in science: synchrotron light sources
  • Dectris Pilatus 6M
    • 2463 x 2527 pixels
    • 7 MB images
    • 25 frames/s
    • 175 MB/s
    • Several TB/day
  • Data no longer fits on a USB drive
  • Users are usually not affiliated with the synchrotron lab
  • Users from physics, biology, chemistry, material sciences, …
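The detector numbers above can be cross-checked with a few lines of arithmetic. The figures come from the slide; the 24 h value is an illustrative upper bound, not a stated number:

```python
# Sanity-check of the Pilatus 6M data rates quoted on the slide.
image_mb = 7        # ~7 MB per image
frames_per_s = 25   # 25 frames per second

rate_mb_s = image_mb * frames_per_s
print(f"sustained rate: {rate_mb_s} MB/s")  # 175 MB/s, matching the slide

# Upper bound for 24 h of continuous acquisition; real beamtimes are
# shorter, hence "several TB/day" on the slide.
per_day_tb = rate_mb_s * 86400 / 1e6   # MB -> TB, 1e6 MB per TB
print(f"24 h upper bound: ~{per_day_tb:.2f} TB/day")
```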
Big Data in science: high throughput imaging
  • Imaging machines / microscope
    • 1–100 frames/s => up to 800 MB/s => O(10) TB/day

Reconstruction of zebrafish early embryonic development

Big Data in science
  • Many research areas where data growth is very fast
    • Biology, chemistry, earth sciences, …
  • Data sets have become too big to take home
  • Data rates require dedicated IT infrastructures to record and store
  • Data analysis requires farms and clusters; single PCs are not sufficient
  • Collaborations require distributed infrastructures and networks
  • Data management becomes a challenge
  • Fewer IT-experienced and IT-interested people than, e.g., in physics
Definition of Data Science

[Venn diagram by Drew Conway (IA Ventures), annotated with the typical positions of physicists and of biologists, chemists, …]

KIT infrastructures: GridKa

German WLCG Tier-1 Center

  • Supports all LHC experiments + Belle II + several small communities and older experiments
  • >10,000 cores
  • Disk space: 12 PB, tape space: 17 PB
  • 6x10 Gbit/s network connectivity
  • ~ 15% of LHC data permanently stored at GridKa
  • Services: file transfer, workload management, file catalog, …
  • Global Grid User Support (GGUS): service development and operation of the trouble ticket system for the world-wide LHC Grid
  • Annual international GridKa School
    • 2013: ~140 participants from 19 countries
GridKa Experiences
  • evolving demands and usage patterns
    • no common workflows
  • hardware is commodity, software is not
  • hierarchical storage with tape is challenging
  • data access and I/O is the central issue
    • Different users / user communities have different data access methods and access patterns!
  • on-site experiment representation highly useful
KIT infrastructure: Large Scale Data Facility

Main goals

  • provision of storage for multiple research groups at KIT and U-Heidelberg
  • support of research groups in data analysis

Resources and access

  • 6 PB of on-line storage
  • 6 PB of archival storage
  • 100 GbE connection between LSDF@KIT and U-Heidelberg
  • analysis cluster with 58 × 8 cores
  • variety of storage protocols
  • jointly funded by Helmholtz Association and state of Baden-Württemberg
LSDF experiences
  • high demand for storage, analysis and archival
  • research groups vary in
    • research topics (from genetic sequencing to geophysics)
    • size
    • IT expertise
    • need for services and protocols
  • Important needs common to many user groups
    • sharing data with other groups
    • data security and preservation
    • ‘consulting’
  • many small groups depend on LSDF
The Large Scale Data Management and Analysis (LSDMA) project: facts and figures
  • Helmholtz portfolio extension
  • initial project duration: 2012-2016
  • partners:
  • project coordinator: Achim Streit (KIT)
  • sustainability: inclusion of activities into respective Helmholtz program-oriented funding in 2015
  • next annual international symposium: September 24th at KIT

LSDMA: Dual Approach

Data Life Cycle Labs

Joint R&D with scientific user communities

  • optimization of the data life cycle
  • community-specific data analysis tools and services

Data Services Integration Team

Generic methods R&D

  • data analysis tools and services common to several DLCLs
  • interface between federated data infrastructures and DLCLs/communities
Selected LSDMA activities (I)

DLCL Energy (KIT, U-Ulm)

  • analyzing stereoscopic satellite images with Hadoop to estimate the efficiency of solar energy
  • privacy policies for personal energy data

DLCL Key Technologies (KIT, U-Heidelberg, U-Dresden)

  • optimization of tomographical reconstruction using data-intensive computing
  • visualization for high throughput microscopy

DLCL Health (FZJ)

  • workflow support for data-intensive parameter studies
  • efficient metadata administration and indexing
Selected LSDMA activities (II)

DLCL Earth&Environment (KIT, DKRZ)

  • MongoDB for data and metadata of meteorological satellite data
  • data replication within the European EUDAT project using iRODS

DLCL Structure of Matter (DESY, GSI, HTW)

  • Development of a portal for PETRA-III data
  • Determining the computing requirements for FAIR data analysis

DSIT (all partners)

  • Federated identity management
  • Archive
  • Federated storage (e.g. dCache)
LSDMA Challenges

Within communities

  • focus on data analysis
  • high fluctuation of computing experts
  • running tools and services

Lessons learned

  • interoperable AAI crucial
  • data privacy very challenging, both legally and technically
  • communities need evolution, not revolution
  • needs can be very specific

Communities differ in

  • previous knowledge
  • level of specification of the data life cycle
  • tools and services used

Needs driven by

  • increasing amount of data
  • cooperation between groups
  • policies
    • open access/data
    • long-term preservation
Summary and Outlook
  • data facilities and R&D very important for KIT
  • extensive experience at GridKa and LSDF
  • wide variety of user communities
  • often very specific needs
  • interoperable AAI and data privacy are crucial topics
  • today, data is important to basically all research topics
  • more projects at state, national and international levels to come
  • LSDMA: research on generic data methods, workflows and services, plus community-specific support and R&D