Programming models for data-intensive computing
A multi-dimensional problem
• Sophistication of the target user
  • N(data analysts) > N(computational scientists)
• Level of expressivity
  • A high level is important for interactive analysis
• Volume of data
  • The complex gigabyte vs. the enormous petabyte
• Scale and nature of the platform
  • How important are reliability, failure handling, etc.?
  • What QoS guarantees are needed, and where are they enforced?
Separating concerns
• What things carry over from conventional HPC?
  • Parallel file systems, collective I/O, workflow, MPI, OpenMP, PETSc, ESMF, etc.
• What things carry over from conventional data management?
  • The need for abstractions and data-level APIs: R, SPSS, MATLAB, SQL, NetCDF, HDF, Kepler, Taverna
  • Streaming databases, streaming data systems
• What is unique to "data HPC"?
  • New needs at the platform level
  • New tradeoffs between the high-level model and the platform
Current models
• Data-parallel
  • A space of data objects
  • A set of operators on those objects
• Streaming
• Scripting
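As a minimal sketch (a hypothetical illustration, not from the original slides), the data-parallel view of "a space of data objects plus a set of operators" can be expressed with Python's built-in functional primitives; the readings and conversion functions are invented:

```python
# Minimal sketch of the data-parallel view: a space of data objects
# (a list of temperature readings) and a small set of operators applied
# uniformly across that space. Hypothetical example data.
from functools import reduce

readings = [21.3, 22.1, 19.8, 24.0, 23.5]            # the data objects

to_fahrenheit = lambda c: c * 9 / 5 + 32             # element-wise operator
above_70f     = lambda f: f > 70.0                   # selection operator

fahrenheit = list(map(to_fahrenheit, readings))      # apply operator over the space
hot        = list(filter(above_70f, fahrenheit))     # restrict the space
total      = reduce(lambda a, b: a + b, fahrenheit)  # aggregation operator

print(fahrenheit, hot, total / len(fahrenheit))
```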
Conclusions
• Current HPC programming models fail to address important data-intensive needs
• There is an urgent need for a careful gap analysis aimed at identifying important things that cannot [easily] be done with current tools
  • Ask people for their "top 20" questions
  • Ethnographic studies
• A need to revisit the "stack" from the perspective of data-intensive HPC applications
Programming models for data-intensive computing
• Will the flat message-passing model scale beyond 1M cores?
• How does multi-level parallelism (e.g., GPUs) affect data-intensive computing (DIC)?
• MapReduce, Dryad, Swift: what applications do they support? How well suited are they for PDEs?
• How will 1,000-core PCs change DIC?
• Powerful data-centric programming primitives that express high-level parallelism in a natural way while shielding physical configuration issues: what do we need?
• If we design a supercomputer for DIC, what are the requirements?
• What if storage controllers allow application-level control? Permit cross-layer control
• New frameworks for reliability and availability (going beyond checkpointing)
• How will different models and frameworks interoperate?
• How do we support people who want large shared memory?
Programming models
• Data parallel: MapReduce
• Loosely synchronized chunks of work: Dryad, Swift, scripting
• Libraries, e.g., Ntropy
• Expressive power vs. scale
• BigTable (HBase)
• Streaming, online
• Dataflow
• What operators do we need for data-intensive computing, beyond map and reduce?
  • Sum, average, …
• Two main models
  • Data parallel
  • Streaming
• Goal: "use within 30 minutes; still discovering new power in 2 years' time"
• Integration with programming environments
• Working remotely with large datasets
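To make the data-parallel item above concrete, here is a toy, single-process sketch of the map/reduce pattern computing per-key sums and averages; the record layout and function names are invented, and a real framework would distribute the map, shuffle, and reduce phases across many nodes:

```python
# Toy single-process map/reduce sketch: per-key sum and average.
from collections import defaultdict

records = [("sensor_a", 3.0), ("sensor_b", 5.0), ("sensor_a", 7.0)]

def map_phase(record):
    key, value = record
    yield key, (value, 1)                 # emit (sum contribution, count)

def reduce_phase(key, values):
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, {"sum": total, "avg": total / count}

grouped = defaultdict(list)               # the "shuffle": group pairs by key
for record in records:
    for key, pair in map_phase(record):
        grouped[key].append(pair)

results = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(results)   # {'sensor_a': {'sum': 10.0, 'avg': 5.0}, 'sensor_b': {...}}
```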
Example scenario: take a dataset, put it in the time domain and the frequency domain, and plot the result
• Multiple levels of abstraction? All-pairs.
• There are many ways to express things at the high level; the challenge is implementing them
• "Users don't want to compile anymore"
• Who are we targeting? Specialists or generalists?
• Focus on the need for rapid decision making
• Composable models
• Dimensions of the problem
  • Level of expressivity
  • Volume of data
  • Scale of platform: reliability, failure, etc.
  • Gauge the size of the problem you are asking to solve
  • QoS guarantees
  • Ability to practice on smaller datasets
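The "time domain, frequency domain, plot the result" scenario might look like the following NumPy/Matplotlib sketch; the file name and sampling rate are assumptions made for illustration:

```python
# Sketch of the interactive scenario above: load a time series, transform it
# to the frequency domain, and plot both views. File name and sampling rate
# are assumed.
import numpy as np
import matplotlib.pyplot as plt

fs = 1000.0                                   # assumed sampling rate in Hz
signal = np.loadtxt("timeseries.txt")         # hypothetical dataset

spectrum = np.fft.rfft(signal)                # frequency-domain representation
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)

fig, (ax_t, ax_f) = plt.subplots(2, 1)
ax_t.plot(np.arange(signal.size) / fs, signal)   # time domain
ax_f.plot(freqs, np.abs(spectrum))               # frequency domain
ax_t.set_xlabel("time (s)")
ax_f.set_xlabel("frequency (Hz)")
plt.show()
```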
Types of data and nature of the operators
• Select, e.g., on a spatial region; temporal operators
• Data scrubbing: data transposition, transforms
• Data normalization
• Statistical analysis operators
• Look at LINQ
• Aggregation: combine
• Smart segregation to fit on the hardware
• Need to deal with distributed data; e.g., column-oriented stores can help with that
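A small sketch of the operator types listed above (select on a spatial region, normalization, aggregation) expressed over a tabular dataset with pandas; the file, column names, and bounding box are hypothetical:

```python
# Hypothetical table of observations with columns: lat, lon, value, station.
import pandas as pd

df = pd.read_csv("observations.csv")

# Select: restrict to a spatial region (assumed bounding box)
region = df[df["lat"].between(40.0, 45.0) & df["lon"].between(-80.0, -70.0)]

# Data normalization: zero mean, unit variance
region = region.assign(
    value_norm=(region["value"] - region["value"].mean()) / region["value"].std()
)

# Aggregation / combine: per-station statistics
summary = region.groupby("station")["value"].agg(["count", "mean", "max"])
print(summary)
```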
Moving forward
• Ethnographic studies (e.g., Borgman)
• Ask for people's top 20 questions/scenarios
  • Astronomers
  • Environmental science
  • Chemistry …
  • …
• E.g., SciDB is reaching out to these communities
DIC hardware architecture
• A different compute-to-I/O balance
  • 0.1 B/flop for a supercomputer ("all memory to disk in 5 minutes" is an unrealizable goal)
  • Assume that it should be greater: Amdahl's balance rules
  • See the Alex Szalay paper
• GPU-like systems, but with more memory per core
• Future streaming rates: what are they?
• Innovative networking, data routing
• Heterogeneous systems, perhaps; e.g., M vs Ws
• Reliability: where is it implemented?
  • What about software failures?
• A special OS?
• New ways of combining hardware and software?
  • Within a system, and/or between systems
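A back-of-the-envelope version of the compute-to-I/O balance argument above, in the spirit of the Amdahl-style rules cited in the slide; all machine parameters here are illustrative assumptions, not measurements of any particular system:

```python
# Illustrative compute-to-I/O balance calculation (assumed numbers only).
peak_flops   = 1.0e15          # assumed 1 Pflop/s machine
io_bandwidth = 1.0e14          # assumed 100 TB/s aggregate I/O (optimistic)

bytes_per_flop = io_bandwidth / peak_flops
print(f"I/O balance: {bytes_per_flop:.2f} B/flop")    # 0.10 B/flop

# How long would it take to stream a 10 PB dataset through that I/O system?
dataset_bytes = 10e15
print(f"Full scan time: {dataset_bytes / io_bandwidth / 60:.1f} minutes")
```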
Modeling
• "Query estimation" and status monitoring for DIC applications
1000-core PCs
• Increase the data management problem
• Enable a wider range of users to do DIC
• A more complex memory hierarchy (e.g., 200 separate memories)
• We'll have amazing games with realistic physics
Infinite bandwidth
• Do everything in the cloud
MapReduce-related thoughts
• MapReduce is library-based, which makes optimization more difficult; consider type checking and annotations
• Are there opportunities for optimization if we incorporate these ideas into extensible languages?
• Ways to enforce/leverage/enable domain-specific semantics
• Interoperability/portability?
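One way to read the annotation idea above is that declaring domain-specific semantics (e.g., that a reducer is associative and commutative) would give a runtime the information a purely library-based MapReduce lacks. The decorator below is a hypothetical sketch of that idea, not an API of any real framework:

```python
# Hypothetical annotation exposing algebraic properties of a reducer so that
# a runtime could safely insert map-side combiners or reorder work.
from typing import Callable

def algebraic(*, associative: bool, commutative: bool):
    def mark(fn: Callable) -> Callable:
        fn.associative = associative      # metadata an optimizer could inspect
        fn.commutative = commutative
        return fn
    return mark

@algebraic(associative=True, commutative=True)
def add(a: float, b: float) -> float:
    return a + b

# A runtime could check these flags before applying partial (map-side) reduction.
print(add.associative, add.commutative)   # True True
```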
Most important ideas
• Current HPC practice works poorly for DIC; make it easier for the domain scientist and enable new types of science
• Gap analysis: articulate what we can do with MPI and MapReduce, what we cannot do with either, and why
• Propagating information between layers