Programming models for data-intensive computing
A multi-dimensional problem
• Sophistication of the target user
  • N(data analysts) > N(computational scientists)
• Level of expressivity
  • A high level is important for interactive analysis
• Volume of data
  • The complex gigabyte vs. the enormous petabyte
• Scale and nature of the platform
  • How important are reliability, failure handling, etc.?
  • What QoS guarantees are needed, and where are they enforced?
Separating concerns
• What things carry over from conventional HPC?
  • Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
• What things carry over from conventional data management?
  • The need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
  • Streaming databases, streaming data systems
• What is unique to "data HPC"?
  • New needs at the platform level
  • New tradeoffs between the high-level model and the platform
Current models
• Data-parallel
  • A space of data objects
  • A set of operators on those objects
• Streaming
• Scripting
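As a minimal sketch (a hypothetical illustration, not from the original slides), the data-parallel view of "a space of data objects plus a set of operators" can be expressed with Python's built-in functional primitives; the readings and conversion functions are invented:

```python
# Minimal sketch of the data-parallel view: a space of data objects
# (a list of temperature readings) and a small set of operators applied
# uniformly across that space. Hypothetical example data.
from functools import reduce

readings = [21.3, 22.1, 19.8, 24.0, 23.5]            # the data objects

to_fahrenheit = lambda c: c * 9 / 5 + 32             # element-wise operator
above_70f     = lambda f: f > 70.0                   # selection operator

fahrenheit = list(map(to_fahrenheit, readings))      # apply operator over the space
hot        = list(filter(above_70f, fahrenheit))     # restrict the space
total      = reduce(lambda a, b: a + b, fahrenheit)  # aggregation operator

print(fahrenheit, hot, total / len(fahrenheit))
```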
Conclusions
• Current HPC programming models fail to address important data-intensive needs
• There is an urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
  • Ask people for their "top 20" questions
  • Ethnographic studies
• A need to revisit the "stack" from the perspective of data-intensive HPC applications
Programming models for data-intensive computing
• Will the flat message-passing model scale beyond 1M cores?
• How does multi-level parallelism (e.g., GPUs) affect data-intensive computing (DIC)?
• MapReduce, Dryad, Swift: what applications do they support? How well suited are they for PDEs?
• How will 1,000-core PCs change DIC?
• Powerful data-centric programming primitives that express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
• If we design a supercomputer for DIC, what are the requirements?
• What if storage controllers allow application-level control? Permit cross-layer control
• New frameworks for reliability and availability (going beyond checkpointing)
• How will different models and frameworks interoperate?
• How do we support people who want large shared memory?
Programming models
• Data parallel: MapReduce
• Loosely synchronized chunks of work: Dryad, Swift, scripting
• Libraries, e.g., Ntropy
• Expressive power vs. scale
• BigTable (HBase)
• Streaming, online
• Dataflow
• What operators do we need for data-intensive computing, beyond map and reduce?
  • Sum, average, …
• Two main models
  • Data parallel
  • Streaming
• Goal: "use within 30 minutes; still discovering new power in 2 years' time"
• Integration with programming environments
• Working remotely with large datasets
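To make the data-parallel item above concrete, here is a toy, single-process sketch of the map/reduce pattern computing per-key sums and averages; the record layout and function names are invented, and a real framework would distribute the map, shuffle, and reduce phases across many nodes:

```python
# Toy single-process map/reduce sketch: per-key sum and average.
from collections import defaultdict

records = [("sensor_a", 3.0), ("sensor_b", 5.0), ("sensor_a", 7.0)]

def map_phase(record):
    key, value = record
    yield key, (value, 1)                 # emit (sum contribution, count)

def reduce_phase(key, values):
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, {"sum": total, "avg": total / count}

grouped = defaultdict(list)               # the "shuffle": group pairs by key
for record in records:
    for key, pair in map_phase(record):
        grouped[key].append(pair)

results = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(results)   # {'sensor_a': {'sum': 10.0, 'avg': 5.0}, 'sensor_b': {...}}
```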
Example scenario: take a dataset, put it in the time domain and the frequency domain, and plot the result
• Multiple levels of abstraction? All-pairs.
• There are many ways to express things at the high level; the challenge is implementing them
• "Users don't want to compile anymore"
• Who are we targeting? Specialists or generalists?
• Focus on the need for rapid decision making
• Composable models
• Dimensions of the problem
  • Level of expressivity
  • Volume of data
  • Scale of platform: reliability, failure, etc.
  • Gauge the size of the problem you are asking to solve
  • QoS guarantees
  • Ability to practice on smaller datasets
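The "time domain, frequency domain, plot the result" scenario might look like the following NumPy/Matplotlib sketch; the file name and sampling rate are assumptions made for illustration:

```python
# Sketch of the interactive scenario above: load a time series, transform it
# to the frequency domain, and plot both views. File name and sampling rate
# are assumed.
import numpy as np
import matplotlib.pyplot as plt

fs = 1000.0                                   # assumed sampling rate in Hz
signal = np.loadtxt("timeseries.txt")         # hypothetical dataset

spectrum = np.fft.rfft(signal)                # frequency-domain representation
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)

fig, (ax_t, ax_f) = plt.subplots(2, 1)
ax_t.plot(np.arange(signal.size) / fs, signal)   # time domain
ax_f.plot(freqs, np.abs(spectrum))               # frequency domain
ax_t.set_xlabel("time (s)")
ax_f.set_xlabel("frequency (Hz)")
plt.show()
```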
Types of data and nature of the operators
• Select, e.g., on a spatial region; temporal operators
• Data scrubbing: data transposition, transforms
• Data normalization
• Statistical analysis operators
• Look at LINQ
• Aggregation: combine
• Smart segregation to fit on the hardware
• Need to deal with distributed data; e.g., column-oriented stores can help with that
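A small sketch of the operator types listed above (select on a spatial region, normalization, aggregation) expressed over a tabular dataset with pandas; the file, column names, and bounding box are hypothetical:

```python
# Hypothetical table of observations with columns: lat, lon, value, station.
import pandas as pd

df = pd.read_csv("observations.csv")

# Select: restrict to a spatial region (assumed bounding box)
region = df[df["lat"].between(40.0, 45.0) & df["lon"].between(-80.0, -70.0)]

# Data normalization: zero mean, unit variance
region = region.assign(
    value_norm=(region["value"] - region["value"].mean()) / region["value"].std()
)

# Aggregation / combine: per-station statistics
summary = region.groupby("station")["value"].agg(["count", "mean", "max"])
print(summary)
```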
Moving forward
• Ethnographic studies (e.g., Borgman)
• Ask for people's top 20 questions/scenarios
  • Astronomers
  • Environmental science
  • Chemistry …
  • …
• E.g., SciDB is reaching out to these communities
DIC hardware architecture
• A different compute-to-I/O balance
  • 0.1 B/flop for a supercomputer ("all memory to disk in 5 minutes" is an unrealizable goal)
  • Assume that it should be greater: Amdahl's balance rules
  • See the Alex Szalay paper
• GPU-like systems, but with more memory per core
• Future streaming rates: what are they?
• Innovative networking, data routing
• Heterogeneous systems, perhaps; e.g., M vs Ws
• Reliability: where is it implemented?
  • What about software failures?
• A special OS?
• New ways of combining hardware and software?
  • Within a system, and/or between systems
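A back-of-the-envelope version of the compute-to-I/O balance argument above, in the spirit of the Amdahl-style rules cited in the slide; all machine parameters here are illustrative assumptions, not measurements of any particular system:

```python
# Illustrative compute-to-I/O balance calculation (assumed numbers only).
peak_flops   = 1.0e15          # assumed 1 Pflop/s machine
io_bandwidth = 1.0e14          # assumed 100 TB/s aggregate I/O (optimistic)

bytes_per_flop = io_bandwidth / peak_flops
print(f"I/O balance: {bytes_per_flop:.2f} B/flop")    # 0.10 B/flop

# How long would it take to stream a 10 PB dataset through that I/O system?
dataset_bytes = 10e15
print(f"Full scan time: {dataset_bytes / io_bandwidth / 60:.1f} minutes")
```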
Modeling
• "Query estimation" and status monitoring for DIC applications
1000-core PCs
• Increase the data management problem
• Enable a wider range of users to do DIC
• A more complex memory hierarchy (e.g., 200 separate memories)
• We'll have amazing games with realistic physics
Infinite bandwidth
• Do everything in the cloud
MapReduce-related thoughts
• MapReduce is library-based, which makes optimization more difficult; consider type checking and annotations
• Are there opportunities for optimization if we incorporate these ideas into extensible languages?
• Ways to enforce/leverage/enable domain-specific semantics
• Interoperability/portability?
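One way to read the annotation idea above is that declaring domain-specific semantics (e.g., that a reducer is associative and commutative) would give a runtime the information a purely library-based MapReduce lacks. The decorator below is a hypothetical sketch of that idea, not an API of any real framework:

```python
# Hypothetical annotation exposing algebraic properties of a reducer so that
# a runtime could safely insert map-side combiners or reorder work.
from typing import Callable

def algebraic(*, associative: bool, commutative: bool):
    def mark(fn: Callable) -> Callable:
        fn.associative = associative      # metadata an optimizer could inspect
        fn.commutative = commutative
        return fn
    return mark

@algebraic(associative=True, commutative=True)
def add(a: float, b: float) -> float:
    return a + b

# A runtime could check these flags before applying partial (map-side) reduction.
print(add.associative, add.commutative)   # True True
```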
Most important ideas
• Current HPC practice works poorly for DIC; make it easier for the domain scientist and enable new types of science
• Gap analysis: articulate what we can do with MPI and MapReduce, what we cannot do with either, and why
• Propagating information between layers