The Data Deluge and the Grid

  • The Data Deluge

  • The Large Hadron Collider

  • The LHC Data Challenge

  • The Grid

  • Grid Applications

  • GridPP

  • Conclusion

Steve Lloyd

Queen Mary University of London

[email protected]

The Data Deluge

Expect massive increases in the amount of data being collected in several diverse fields over the next few years:

  • Astronomy - Massive sky surveys

  • Biology - Genome databases etc.

  • Earth Observing

  • Digitisation of paper, film, tape records etc to create Digital Libraries, Museums . . .

  • Particle Physics - Large Hadron Collider

  • . . .

    1 PByte ~ 1000 TBytes ~ 1M GBytes ~ 1.4M CDs

    [PByte = Petabyte, TByte = Terabyte, GByte = Gigabyte]
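
The conversions above can be checked with a few lines of Python. This is only a sketch: it assumes decimal prefixes (1 PByte = 10^15 bytes) and a CD capacity of about 700 MB, which is an assumption and not a figure from the slide. The function name is hypothetical.

```python
# Decimal (SI) prefixes are assumed throughout.
GB = 10**9
TB = 10**12
PB = 10**15
CD_BYTES = 700 * 10**6  # assumed capacity of one CD, not stated on the slide


def petabyte_equivalents(n_pb=1):
    """Express n_pb petabytes in terabytes, gigabytes and CDs."""
    total_bytes = n_pb * PB
    return {
        "TBytes": total_bytes // TB,
        "GBytes": total_bytes // GB,
        "CDs": total_bytes / CD_BYTES,
    }


if __name__ == "__main__":
    for unit, value in petabyte_equivalents().items():
        print(f"1 PByte ~ {value:,.0f} {unit}")
```

With these assumptions the CD count comes out slightly above 1.4 million, in line with the slide.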

Digital Sky Project

Federating new astronomical surveys:

~ 40,000 square degrees

~ 1/2 trillion pixels (1 arc second)

~ 1 TB x multi-wavelengths

> 1 billion sources

Integrated catalogue and image database:

  • Digital Palomar Observatory Sky Survey

  • 2 Micron All Sky Survey

  • NRAO VLA Sky Survey

  • VLA FIRST Radio Survey

    Later:

  • ROSAT

  • IRAS

  • Westerbork 327 MHz Survey
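
The pixel count quoted for the federated surveys follows directly from the sky area and the pixel size. The sketch below reproduces it. The 2 bytes per pixel used for the storage estimate is an assumption chosen to illustrate how ~1 TB per wavelength could arise; it is not stated on the slide, and the function names are hypothetical.

```python
ARCSEC_PER_DEGREE = 3600


def pixel_count(area_sq_deg, pixel_arcsec=1.0):
    """Number of square pixels of side pixel_arcsec needed to tile the area."""
    pixels_per_sq_deg = (ARCSEC_PER_DEGREE / pixel_arcsec) ** 2
    return area_sq_deg * pixels_per_sq_deg


def image_size_tb(n_pixels, bytes_per_pixel=2):
    """Storage per wavelength band; bytes_per_pixel is an assumed value."""
    return n_pixels * bytes_per_pixel / 1e12


if __name__ == "__main__":
    n = pixel_count(40_000)  # ~40,000 square degrees at 1 arc second
    print(f"{n:.3e} pixels, about {image_size_tb(n):.2f} TB per band")
```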

Sloan Digital Sky Survey

Survey 10,000 square degrees of Northern Sky over 5 years

  • ~ 1 million spectra

  • positions and images of 100 million objects

  • 5 wavelength bands

  • ~ 40 TB

VISTA

Visible and Infrared Survey Telescope for Astronomy

Virtual Observatories

Crab Nebula: X-ray, Optical, Infra-red, Radio

Jet in M87: Chandra X-ray, HST optical, Gemini mid-IR, VLA radio

NASA’s Earth Observing System

Galapagos Oil Spill:

1 TB/day

ESA EO Facilities

Diagram labels:

  • Missions and sensors: LANDSAT 7, TERRA/MODIS, AVHRR, SEAWIFS, SPOT, IRS-P3

  • Sites: KIRUNA (S) - ESRANGE, TROMSO (N), MATERA (I), MASPALOMAS (E), NEUSTRELITZ (D)

  • Boxes: HISTORICAL ARCHIVES, STANDARD PRODUCTION CHAINS, PRODUCTS, ESRIN, USERS

GOME analysis detected ozone thinning over Europe, 31 Jan 2002

Species 2000

To enumerate all ~1.7 million known species of plants, animals, fungi and microbes on Earth

A federation of initially 18 taxonomic databases - eventually ~ 200 databases

Genomics

The LHC

The Large Hadron Collider (LHC) will be a 14 TeV centre-of-mass proton-proton collider operating in the existing 26.7 km LEP tunnel at CERN. Due to start operation > 2006

  • 1,232 superconducting main dipoles of 8.3 Tesla

  • 788 quadrupoles

  • 2,835 bunches of 10^11 protons per bunch spaced by 25 ns

Particle Physics Questions

  • Need to discover (confirm) Higgs Particle

    • Study its properties

    • Prove that Higgs couplings depend on masses

  • Other unanswered questions:

    • Does Supersymmetry exist?

    • How are quarks and leptons related?

    • Why are there 3 sets of quarks and leptons?

    • What about Gravity?

    • Anything unexpected?

The LHC

The LEP/LHC Tunnel

LHC Experiments

LHC will house 4 experiments:

  • ATLAS and CMS are large 'General Purpose' detectors designed to detect everything and anything

  • LHCb is a specialised experiment designed to study CP violation in the b quark system

  • ALICE is a dedicated Heavy Ion Physics Detector

Schematic View of the LHC

The ATLAS Experiment

ATLAS consists of:

  • An inner tracker to measure the momentum of each charged particle

  • A calorimeter to measure the energies carried by the particles

  • A muon spectrometer to identify and measure muons

  • A huge magnet system for bending charged particles for momentum measurement

    A total of > 10^8 electronic channels

The ATLAS Detector

Simulated ATLAS Higgs Event

LHC Event Rates

  • The LHC proton bunches collide every 25 ns and each collision yields ~20 proton-proton interactions superimposed in the detector, i.e.

    • 40 MHz x 20 = 8x10^8 pp interactions/sec

  • The (110 GeV) Higgs cross section is 24.2 pb.

  • A good channel is H → γγ with a branching ratio of 0.19% and a detector acceptance of ~50%

    • At full (10^34 cm^-2 s^-1) LHC luminosity this gives 10^34 x 24.2x10^-12 x 10^-24 x 0.0019 x 0.5

      = 2x10^-4 H → γγ per second

      A 2x10^-4 needle in an 8x10^8 haystack
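
The needle-and-haystack arithmetic above can be reproduced as follows. This is a sketch using only the numbers on the slide; the function names are hypothetical. The unit conversion relies on 1 pb = 10^-12 barn and 1 barn = 10^-24 cm^2.

```python
PB_TO_CM2 = 1e-12 * 1e-24  # picobarn -> barn -> cm^2


def signal_rate(luminosity_cm2_s, cross_section_pb, branching_ratio, acceptance):
    """Detected signal events per second for one decay channel."""
    return (luminosity_cm2_s * cross_section_pb * PB_TO_CM2
            * branching_ratio * acceptance)


def interaction_rate(crossing_rate_hz, interactions_per_crossing):
    """Total proton-proton interactions per second."""
    return crossing_rate_hz * interactions_per_crossing


if __name__ == "__main__":
    needle = signal_rate(1e34, 24.2, 0.0019, 0.5)
    haystack = interaction_rate(40e6, 20)
    print(f"signal: {needle:.1e} per second")
    print(f"background: {haystack:.1e} per second")
    print(f"one signal event roughly every {1 / needle / 60:.0f} minutes")
```

The unrounded signal rate is about 2.3x10^-4 per second, which the slide rounds to 2x10^-4.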

'Online' Data Reduction

Selecting interesting events based on progressively more detector information:

  • Collision Rate: 40 MHz, 40 TB/sec

  • Level 1 Special Hardware Trigger: 10^4 - 10^5 Hz, 10-100 GB/sec

  • Level 2 Embedded Processor Trigger: 10^2 - 10^3 Hz, 1-10 GB/sec

  • Level 3 Processor Farm: 10 - 100 Hz, 100-200 MB/sec

  • Raw Data Storage

  • Offline Data Reconstruction
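
To get a feel for how much each trigger level has to throw away, the event rates on this slide can be turned into rejection factors. The sketch below assumes that the rate quoted with each level is that level's output rate, which the flattened slide does not state explicitly. The names are hypothetical.

```python
# (stage name, lowest rate in Hz, highest rate in Hz), taken from the slide.
# Assumption: each rate is the OUTPUT rate of the named stage.
STAGES = [
    ("Collisions", 40e6, 40e6),
    ("Level 1", 1e4, 1e5),
    ("Level 2", 1e2, 1e3),
    ("Level 3", 10.0, 100.0),
]


def rejection_factors(stages):
    """Smallest and largest reduction factor of each stage vs the previous one."""
    factors = []
    for (_, prev_lo, prev_hi), (name, lo, hi) in zip(stages, stages[1:]):
        factors.append((name, prev_lo / hi, prev_hi / lo))
    return factors


def overall_reduction(stages):
    """Smallest and largest reduction from the first stage to the last."""
    _, first_lo, first_hi = stages[0]
    _, last_lo, last_hi = stages[-1]
    return first_lo / last_hi, first_hi / last_lo


if __name__ == "__main__":
    for name, smallest, largest in rejection_factors(STAGES):
        print(f"{name}: keeps 1 event in {smallest:,.0f} to {largest:,.0f}")
    low, high = overall_reduction(STAGES)
    print(f"Overall: 1 event in {low:,.0f} to {high:,.0f}")
```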

Offline Analysis

  • Raw Data from Detector: 1-2 MB/event @ 100-400 Hz

  • Total Data per year from one experiment: 1 to 8 PBytes (10^15 Bytes)

  • Data Reconstruction (Digits to Energy/momentum etc)

  • Event Summary Data: 0.5 MB/event

  • Analysis Event Selection

  • Analysis Object Data: 10 kB/event

  • Physics Analysis
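
The yearly raw-data volume follows from the event size and the recording rate once a running time is fixed. The sketch below assumes about 10^7 seconds of data taking per year, a common rule of thumb that is not stated on the slide. With that assumption the slide's range of 1 to 8 PBytes is reproduced. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 1e7  # assumed effective running time per year


def yearly_raw_data_pb(event_size_mb, rate_hz, seconds=SECONDS_PER_YEAR):
    """Raw data volume per year in PBytes (1 PByte = 10^15 bytes)."""
    bytes_per_year = event_size_mb * 1e6 * rate_hz * seconds
    return bytes_per_year / 1e15


if __name__ == "__main__":
    low = yearly_raw_data_pb(event_size_mb=1, rate_hz=100)
    high = yearly_raw_data_pb(event_size_mb=2, rate_hz=400)
    print(f"{low:.0f} to {high:.0f} PBytes per year")
```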

Computing Resources Required

CPU Power (Reconstruction, Simulation, User Analysis etc)

  • 20 Million SpecInt2000

  • (A 1 GHz PC is rated at ~400 SpecInt2000)

  • i.e. 50,000 of yesterday/today's PCs

'Tape' Storage

  • 20,000 TB

Disk Storage

  • 2,500 TB

Analysis carried out throughout the world by hundreds of Physicists
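
The PC count on this slide is a straight division of the required CPU power by the rating of one machine. A minimal sketch, using only the slide's figures and a hypothetical function name:

```python
def pcs_needed(total_specint2000, specint2000_per_pc):
    """Number of PCs of a given rating needed to reach the total CPU power."""
    return total_specint2000 / specint2000_per_pc


if __name__ == "__main__":
    n = pcs_needed(20e6, 400)  # 20 Million SpecInt2000, ~400 per 1 GHz PC
    print(f"{n:,.0f} PCs")
    print(f"tape to disk ratio: {20_000 / 2_500:.0f} to 1")
```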

Worldwide Collaboration

CMS: 1800 physicists, 150 institutes, 32 countries

Solutions

  • Centralised Solution:

    • Put all resources at CERN

      • Funding agencies certainly won't place all their investment at CERN

      • Sociological problems

  • Distributed solution:

    • exploit established computing expertise & infrastructure in national labs and universities

    • reduce dependence on links to CERN

    • tap additional funding sources (spin off)

      Is the Grid the solution?

What is The Grid?

Analogy with the Electricity Power Grid:

  • Unlimited ubiquitous distributed computing

  • Transparent access to multipetabyte distributed databases

  • Easy to plug in

  • Complexity of infrastructure hidden

The Grid

Five emerging models:

  • Distributed Computing

    • synchronous processing

  • High-Throughput Computing

    • asynchronous processing

  • On-Demand Computing

    • dynamic resources

  • Data-Intensive Computing

    • databases

  • Collaborative Computing

    • scientists

Ian Foster and Carl Kesselman, editors, “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann, 1999, http://www.mkp.com/grids

The Grid

Ian Foster / Carl Kesselman:

"A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."

The Grid

  • Dependable - Need to rely on remote equipment as much as the machine on your desk

  • Consistent - Machines need to communicate so need consistent environments and interfaces

  • Pervasive - The more resources that participate in the same system the more useful they all are

  • Inexpensive - Important for pervasiveness - i.e. built using commodity PCs and disks

The Grid

  • You simply submit your job to the 'Grid' - you shouldn't have to know where the data you want is or where the job will run. The Grid software (Middleware) will take care of:

    • running the job where the data is or

    • moving the data to where there is CPU power available
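
The two options above can be illustrated with a toy placement rule. This is a sketch of the idea only, not how any real Grid middleware works: the `Site` class, the `place_job` function and the rule "prefer data locality, otherwise pick the site with the most free CPUs" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical Grid site with some free CPUs and locally stored datasets."""
    name: str
    free_cpus: int
    datasets: set


def place_job(dataset, sites):
    """Return (site, needs_transfer) for a job that reads `dataset`."""
    # Option 1: run the job where the data already is, if a CPU is free there.
    for site in sites:
        if dataset in site.datasets and site.free_cpus > 0:
            return site, False
    # Option 2: move the data to the site with the most free CPUs.
    candidates = [s for s in sites if s.free_cpus > 0]
    if not candidates:
        raise RuntimeError("no free CPUs at any site")
    return max(candidates, key=lambda s: s.free_cpus), True


if __name__ == "__main__":
    sites = [
        Site("A", free_cpus=0, datasets={"run1"}),
        Site("B", free_cpus=5, datasets=set()),
        Site("C", free_cpus=2, datasets={"run2"}),
    ]
    for dataset in ("run2", "run1"):
        site, transfer = place_job(dataset, sites)
        action = "move data to" if transfer else "run in place at"
        print(f"{dataset}: {action} site {site.name}")
```

The user only names the dataset; where the job runs and whether data moves is decided by the placement rule, which is the point the slide makes about middleware.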

The Grid for the Scientist

[Cartoon labels: E = mc^2, @#%&*!, Grid Middleware]

“Putting the bottleneck back in the Scientist’s mind”

Grid Tiers

  • For the LHC we envisage a 'Hierarchical' structure based on several 'Tiers' since the data mostly originates at one place:

    • Tier-0 - CERN - the source of the data

    • Tier-1 - ~ 10 Major Regional Centres (inc UK)

    • Tier-2 - smaller more specialised Regional Centres (4 in UK?)

    • Tier-3 - University Groups

    • Tier-4 - My laptop? Mobile Phone?

  • Doesn't need to be hierarchical, e.g. for Biologists a hierarchy is probably not desirable

Grid Services

  • Applications: Cosmology, Chemistry, Environment, Biology, Particle Physics

  • Application Toolkits: Data-Intensive Applications Toolkit, Remote Visualization Applications Toolkit, Problem Solving Applications Toolkit, Remote Instrumentation Applications Toolkit, Collaborative Applications Toolkit, Distributed Computing Applications Toolkit

  • Grid Services (Middleware): Resource-independent and application-independent services

    • authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection

  • Grid Fabric (Resources): Resource-specific implementations of basic services

    • e.g., Transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass

Problems

  • Scalability

    • Will it scale to thousands of processors, thousands of disks, PetaBytes of data, Terabits/sec of IO?

  • Wide-area distribution

    • How to distribute, replicate, cache, synchronise, catalogue the data?

    • How to balance local ownership of resources with the requirements of the whole?

  • Adaptability/Flexibility

    • Need to adapt to rapidly changing hardware and costs, new analysis methods etc.

SETI@home

  • A distributed computing project - not really a Grid project

  • You pull the data from them rather than they submit the job to you

    • total of 4,591,332 users

    • 963,646,331 results received

    • 1,545,634 years of cpu time

    • 3.3x10^21 floating point operations

    • 125 different cpu types

    • 143 different operating systems

Arecibo telescope in Puerto Rico

SETI@home

Entropia

  • Uses idle cycles on Home PCs for profit and non-profit projects:

  • Mersenne Prime Search

    • 42,519 machines active

    • 560 years of cpu per day

  • [email protected]

    • 60,000 Machines

    • 1,400 years of cpu time

NASA Information Power Grid

  • Knit together widely distributed computing, data, instrumentation and human resources to address complex large scale computing and data analysis problems

Collaborative Engineering

Unitary Plan Wind Tunnel

  • Multi-source Data Analysis

  • Real-time collection

  • Archival storage

Other Grid Applications

  • Distributed Supercomputing

    • Simultaneous execution across multiple supercomputers

  • Smart Instruments

    • Enhance the power of scientific instruments by providing access to data archives, online processing capabilities and visualisation

    • e.g. coupling Argonne’s Photon Source to a supercomputer

GridPP

http://www.gridpp.ac.uk

GridPP Overview

Provide architecture and middleware

Future LHC Experiments

Running US Experiments

Build prototype Tier-1 and Tier-2s in the UK and implement middleware in experiments

Use the Grid with simulation data

Use the Grid with real data

The Prototype UK Tier-1

March 2003

  • 560 CPUs (450 MHz - 1.4 GHz)

  • 50 TB Disk

  • 35 TB Tape in use (theoretical tape capacity 366 TB)

Conclusions

  • Enormous data challenges in the next few years.

  • The Grid is the likely solution.

  • The Web gives ubiquitous access to distributed information.

  • The Grid will give ubiquitous access to computing resources and hence knowledge.

  • Many Grid projects and testbeds starting to take off.

  • GridPP is building a UK Grid for Particle Physicists to prepare for future LHC Data.
