Programming models for data-intensive computing


Presentation Transcript


  1. Programming models for data-intensive computing

  2. A multi-dimensional problem
     • Sophistication of the target user
       • N(data analysts) > N(computational scientists)
     • Level of expressivity
       • A high level is important for interactive analysis
     • Volume of data
       • The complex gigabyte vs. the enormous petabyte
     • Scale and nature of platform
       • How important are reliability, failure handling, etc.?
       • What QoS is needed? Where is it enforced?

  3. Separating concerns
     • What things carry over from conventional HPC?
       • Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
     • What things carry over from conventional data management?
       • Need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
       • Streaming databases, streaming data systems
     • What is unique to "data HPC"?
       • New needs at the platform level
       • New tradeoffs between the high level and the platform

  4. Current models
     • Data-parallel
       • A space of data objects
       • A set of operators on those objects
     • Streaming
     • Scripting
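The data-parallel model above (a space of data objects plus whole-dataset operators) can be sketched in a few lines. This is a minimal illustrative stand-in, not a real system's API: the `Dataset` class and its method names are invented for the example, and the serial list comprehensions stand in for per-partition execution on separate nodes.

```python
from functools import reduce

class Dataset:
    """A space of data objects, split into partitions (illustrative sketch)."""
    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        # Apply fn to every object; each partition could run on a different node.
        return Dataset([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        # Keep only the objects satisfying pred, partition by partition.
        return Dataset([[x for x in p if pred(x)] for p in self.partitions])

    def fold(self, fn, zero):
        # Reduce within each partition, then combine the partial results.
        partials = [reduce(fn, p, zero) for p in self.partitions]
        return reduce(fn, partials, zero)

# Example: sum of squares over a dataset stored in two partitions.
ds = Dataset([[1, 2, 3], [4, 5]])
total = ds.map(lambda x: x * x).fold(lambda a, b: a + b, 0)  # 55
```

The point of the model is that the user writes only `map`/`filter`/`fold`; partitioning and placement stay hidden behind the operators.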

  5. Conclusions
     • Current HPC programming models fail to address important data-intensive needs
     • There is an urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
       • Ask people for their "top 20" questions
       • Ethnographic studies
     • A need to revisit the "stack" from the perspective of data-intensive HPC applications

  6. Programming models for data-intensive computing
     • Will a flat message-passing model scale to >1M cores?
     • How does multi-level parallelism (e.g., GPUs) impact DIC?
     • MapReduce, Dryad, Swift: what applications do they support? How suited are they for PDEs?
     • How will 1K-core PCs change DIC?
     • Powerful data-centric programming primitives that express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
     • If we design a supercomputer for DIC, what are the requirements?
     • What if storage controllers allowed application-level control? Permit cross-layer control
     • New frameworks for reliability and availability (going beyond checkpointing)
     • How will different models and frameworks interoperate?
     • How do we support people who want large shared memory?

  7. Programming models
     • Data parallel – MapReduce
     • Loosely synchronized chunks of work – Dryad, Swift, scripting
     • Libraries – e.g., Ntropy
     • Expressive power vs. scale
     • BigTable (HBase)
     • Streaming, online
     • Dataflow
     • What operators do we need for data-intensive computing, beyond map/reduce?
       • Sum, Average, …
     • Two main models
       • Data parallel
       • Streaming
     • Goal: "usable within 30 minutes; still discovering new power 2 years later"
     • Integration with programming environments
     • Working remotely with large datasets
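Under the streaming model above, operators such as Sum and Average become incremental state machines: each element updates a small running state, so the operator never needs the whole (possibly unbounded) stream. A minimal sketch, with an invented `RunningAverage` class:

```python
class RunningAverage:
    """Streaming Average operator: O(1) state per stream, updated per element."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        # Process one stream element as it arrives.
        self.count += 1
        self.total += x

    def value(self):
        # Current answer, available at any point in the stream.
        return self.total / self.count if self.count else 0.0

avg = RunningAverage()
for x in [2.0, 4.0, 9.0]:     # stand-in for an unbounded stream
    avg.update(x)
# avg.value() is now 5.0
```

Sum, Count, Min, and Max follow the same shape; the interesting design question from the slide is which non-trivial operators (joins, windows, transforms) deserve the same first-class treatment.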

  8. Example task: take a dataset, put it in the time domain, then the frequency domain, and plot the result
     • Multiple levels of abstraction? All-Pairs.
     • Note that there are many ways to express things at the high level; the challenge is implementing them
     • "Users don't want to compile anymore"
     • Who are we targeting? Specialists or generalists?
     • Focus on the need for rapid decision making
     • Composable models
     • Dimensions of the problem
       • Level of expressivity
       • Volume of data
       • Scale of platform – reliability, failure, etc.
     • Gauge the size of the problem you are asking to solve
     • QoS guarantees
     • Ability to practice on smaller datasets

  9. Types of data and the nature of the operators
     • Select, e.g., on a spatial region; temporal operators
     • Data scrubbing: data transposition, transforms
     • Data normalization
     • Statistical analysis operators
     • Look at LINQ
     • Aggregation – combine
     • Smart segregation to fit on the hardware
     • Need to deal with distributed data – e.g., column-oriented stores can help with that
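A quick illustration of why column-oriented stores help with select-then-aggregate workloads like those above: each attribute lives in its own array, so a select on one attribute touches only that column, and aggregation scans a single contiguous array. The table, column names, and helper functions below are all invented for the sketch:

```python
# Column-oriented layout: one array per attribute, rows aligned by index.
columns = {
    "lat":  [10.0, 42.5, 43.1, 70.2],
    "lon":  [ 5.0, 11.2, 12.9, 30.0],
    "temp": [21.0, 14.5, 13.8, -2.0],
}

def select(col, pred):
    """Row indices where pred holds; reads only the one column it filters on."""
    return [i for i, v in enumerate(columns[col]) if pred(v)]

def aggregate(col, rows):
    """Average of one column over the selected rows."""
    vals = [columns[col][i] for i in rows]
    return sum(vals) / len(vals)

# Select a spatial region by latitude, then average the temperature there.
region = select("lat", lambda v: 40.0 <= v <= 50.0)
mean_temp = aggregate("temp", region)
```

A row store would have to read every field of every row for the same query; the columnar layout reads two of three columns and never touches `lon`.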

 10. What things carry over from conventional HPC?
     • Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
     • What things carry over from conventional data management?
       • Need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
     • What is unique to data HPC?

 11. Moving forward
     • Ethnographic studies (e.g., Borgman)
     • Ask for people's top 20 questions/scenarios
       • Astronomers
       • Environmental science
       • Chemistry
       • …
     • E.g., see how SciDB is reaching out to communities

 12. DIC hardware architecture
     • A different compute-I/O balance
       • 0.1 B/flop for a supercomputer ("all memory to disk in 5 minutes" is an unrealizable goal)
       • Assume that it should be greater: Amdahl
       • See Alex Szalay's paper
     • GPU-like systems, but with more memory per core
     • Future streaming rates – what are they?
     • Innovative networking – data routing
     • Perhaps heterogeneous systems – e.g., M vs Ws
     • Reliability – where is it implemented?
       • What about software failures?
     • A special OS?
     • New ways of combining hardware and software?
       • Within a system, and/or between systems
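The Amdahl balance argument above can be made concrete with a back-of-envelope number. Amdahl's rule of thumb calls for roughly one bit of sequential I/O per second for every instruction per second (an "Amdahl number" near 1); the machine figures below are illustrative, not measured:

```python
def amdahl_number(io_bytes_per_sec, instructions_per_sec):
    """Bits of I/O per instruction executed; ~1 for an Amdahl-balanced system."""
    return (io_bytes_per_sec * 8) / instructions_per_sec

# A hypothetical petascale machine: 10**15 instructions/s, 100 GB/s of disk I/O.
a = amdahl_number(100e9, 1e15)   # 0.0008 bits per instruction
```

Even with generous assumptions, the hypothetical machine sits three to four orders of magnitude below balance, which is the quantitative core of the "assume it should be greater" bullet.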

 13. Modeling
     • "Query estimation" and status monitoring for DIC applications

 14. 1000-core PCs
     • Increase the data management problem
     • Enable a wider range of users to do DIC
     • More complex memory hierarchy – 200 memories
     • We'll have amazing games with realistic physics

 15. Infinite bandwidth
     • Do everything in the cloud

 16. MapReduce-related thoughts
     • MapReduce is library-based, which makes optimization more difficult. Type checking? Annotations?
     • Are there opportunities for optimization if we incorporate its ideas into extensible languages?
     • Ways to enforce/leverage/enable domain-specific semantics
     • Interoperability/portability?
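To see why "library-based" limits optimization, here is a minimal MapReduce sketch (word count). The `map_reduce` function is invented for illustration, not any real framework's API; the point is that the user's mapper and reducer arrive as opaque function values, so the framework can neither type-check them ahead of time nor fuse or reorder them the way a compiler for a built-in language construct could:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, shuffle by key, then reduce."""
    intermediate = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):          # map phase: emit (key, value) pairs
            intermediate[k].append(v)     # shuffle: group values by key
    return {k: reducer(k, vs) for k, vs in intermediate.items()}

counts = map_reduce(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda k, vs: sum(vs),
)
# counts == {"a": 2, "b": 2, "c": 1}
```

An extensible-language embedding, by contrast, would let the compiler see inside `mapper` and `reducer`, which is the optimization opportunity the slide is asking about.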

 17. Most important ideas
     • Current HPC practice fails for DIC, and how badly it fails matters. Make it easier for the domain scientist; enable new types of science
     • Gap analysis: articulate what we can do with MPI and MapReduce, what we can't do with either, and why
     • Propagating information between layers
