
Big Data and HP(D)C




  1. Big Data and HP(D)C Gagan Agrawal Ohio State

  2. Answering the Questions • Definition: I like the volume, velocity, variety (and veracity) • My collaborations and associated challenges • Genomic data - scale, parallelization of analysis • Scientific simulation data - scale, data management solutions • Geosensing data - noise, data management and processing • Broader Observations from Application Collaborations • Everything is a challenge! • No training in parallelization • Or even in writing efficient code!

  3. Answers (Contd.) • Big Data and HP(D)C • Hype exists because there is a real (commercial) market • Database community has always addressed these problems • Big Data commercial efforts look at their work • HP(D)C community is not really in focus • e.g., our HPDC 2004 work was almost rediscovered as the NoDB SIGMOD 2012 paper! • What do we bring? • More experience with scientific applications • Real need, but little incentive for commercial or DB efforts • More insight into performance, parallelization, general programming models, and fault-tolerance than the database community

  4. Other Thoughts • Onus on HPC Community • Database, Cloud, and Viz communities have been active for a while now • Abstractions like MapReduce are neat! • So are Parallel and Streaming Visualization solutions • Many existing solutions deliver very low performance • Do people realize how slow Hadoop really is? • And, yet, it is one of the most successful open source projects? • We need to make our presence felt • Programming model design and implementation community hasn't even looked at `big-data' applications • We must engage application scientists • Who are often stuck in `I don't want to deal with the mess'
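The MapReduce abstraction praised above can be captured in a few lines. The sketch below is illustrative only, written in plain Python rather than any particular framework (the names `map_reduce`, `mapper`, and `reducer` are my own): a mapper emits key-value pairs, pairs are grouped by key, and a reducer folds each group.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted pairs by key,
    then reduce each group to a single value."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1) per word,
# the reducer sums the counts for each word.
lines = ["big data and hpdc", "big data is big"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(counts["big"])  # 3
```

Real MapReduce runtimes add partitioning, shuffling, and fault tolerance around this same core, which is where much of Hadoop's overhead (and value) lies.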

  5. Open Questions • How do we develop parallel data analysis solutions? • Hadoop? • MPI + file I/O calls? • SciDB - array analytics? • Parallel R? • Desiderata • No reloading of data (rules out SciDB and Hadoop) • Performance while implementing new algorithms (rules out parallel R) • Transparency with respect to data layouts and parallel architectures

  6. Data Management/Reduction Solutions • Must provide server-side data sub-setting, aggregation, and sampling • Without reloading data into a `system' • Our Approach: Light-weight data management solutions • Automatic Data Virtualization • Support a virtual (e.g. relational) view over NetCDF, HDF5, etc. • Support sub-setting and aggregation using a high-level language • A new sampling approach based on bit-vectors • Create lower-resolution representative datasets • Measure loss of information with respect to key statistical measures
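The bit-vector sampling idea on this slide can be sketched in a few lines. This is a minimal illustration, not the actual approach from the talk: a bit vector marks which records survive into the lower-resolution representative dataset, and information loss is then measured against a key statistical measure (here, the mean). All function names are hypothetical.

```python
import random

def make_bitvector(n, fraction, seed=0):
    """Return 0/1 flags selecting roughly `fraction` of n records.
    A fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return [1 if rng.random() < fraction else 0 for _ in range(n)]

def apply_bitvector(data, bits):
    """Keep only the records whose bit is set."""
    return [x for x, b in zip(data, bits) if b]

def mean(xs):
    return sum(xs) / len(xs)

data = [float(i) for i in range(10000)]
bits = make_bitvector(len(data), fraction=0.25)
subset = apply_bitvector(data, bits)

# Loss of information with respect to one statistical measure: the mean.
loss = abs(mean(data) - mean(subset)) / mean(data)
print(f"kept {len(subset)} of {len(data)} records")
```

The appeal of the bit vector is that it is cheap to store and compose (e.g. AND-ing a sampling vector with a sub-setting predicate's vector), which fits the server-side, no-reload requirement above.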
