
Working with Big Data in the Geosciences - Finding the Needle in the Haystack

Sangmi Pallickara, Computer Science Department, Colorado State University, sangmi@cs.colostate.edu. Big Data in Geosciences: volume, velocity, variety. Storage must be spread over a collection of machines.


Presentation Transcript


  1. Working with Big Data in the Geosciences - Finding the Needle in the Haystack • Sangmi Pallickara • Computer Science Department • Colorado State University • sangmi@cs.colostate.edu

  2. Big Data in Geosciences • Volume • Velocity • Variety

  3. Storage must be spread over a collection of machines • Avoid central coordinators • Cope with failures • Preserve data locality without introducing storage imbalances and the accompanying query hotspots • Support range queries and fast ingest of new data

  4. Galileo Design Considerations • Symmetric storage nodes • No special-function or “controller” nodes • Storage and retrievals may go to any node, and will be forwarded to the targeted node(s) • Incremental scale-up • Failure-resiliency • Accounts for geospatial component in data
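The "any node accepts a request and forwards it" idea on this slide can be illustrated with a minimal sketch. The slides do not describe how Galileo actually picks the target node, so everything below is an assumption made for illustration: the `grid_cell` and `owner_index` helpers, the 10-degree cell size, and the `Node` class are hypothetical and are not Galileo's API. Hashing a coarse grid cell keeps nearby observations on the same node, which is one simple way to account for the geospatial component.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


def grid_cell(lat: float, lon: float, cell_deg: float = 10.0) -> tuple[int, int]:
    """Map a coordinate to a coarse grid cell so nearby points share a cell.

    The 10-degree cell size is an arbitrary choice for this sketch.
    """
    return (int((lat + 90.0) // cell_deg), int((lon + 180.0) // cell_deg))


def owner_index(cell: tuple[int, int], n_nodes: int) -> int:
    """Deterministically pick the node responsible for a grid cell."""
    digest = hashlib.sha256(f"{cell[0]}:{cell[1]}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes


@dataclass
class Node:
    """A storage node; all nodes are identical (no controller node)."""

    index: int
    cluster: list[Node] = field(default_factory=list, repr=False)
    blocks: list[dict] = field(default_factory=list)
    forwarded: int = 0

    def store(self, block: dict) -> int:
        """Store a block locally or forward it; return the owning node's index."""
        cell = grid_cell(block["lat"], block["lon"])
        target = owner_index(cell, len(self.cluster))
        if target != self.index:
            # This node is not responsible for the cell: forward the request.
            self.forwarded += 1
            return self.cluster[target].store(block)
        self.blocks.append(block)
        return self.index


def make_cluster(n: int) -> list[Node]:
    """Create n symmetric nodes that all know about each other."""
    nodes = [Node(i) for i in range(n)]
    for node in nodes:
        node.cluster = nodes
    return nodes


if __name__ == "__main__":
    cluster = make_cluster(4)
    fort_collins = {"lat": 40.57, "lon": -105.08, "temperature": 281.2}
    # The same block sent to two different entry nodes lands on the same owner.
    print(cluster[0].store(dict(fort_collins)) == cluster[3].store(dict(fort_collins)))
```

A plain modulo over the node count would reshuffle most cells when a node joins, so a real system aiming for incremental scale-up would need a more careful placement scheme; the sketch only shows the forwarding behaviour.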

  5. Galileo key features • Support for large numbers (10^9) of small files • High throughput storage and retrieval • Data is multidimensional with multiple types • Time-series data • Support for exact match and range queries (with wildcards) along multiple dimensions • Support for multiple data formats • netCDF, BUFR, HDF 4/5, and data from the Defense Meteorological Satellite Program
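To make "exact match and range queries (with wildcards) along multiple dimensions" concrete, here is a small sketch of how such a query could be evaluated against one record. The query representation is an assumption for illustration, not Galileo's query interface: `None` stands for a wildcard, a two-element tuple for an inclusive range, and any other value for an exact match. The `matches` and `run_query` names and the sample records are hypothetical.

```python
from typing import Any, Iterable


def matches(record: dict[str, Any], query: dict[str, Any]) -> bool:
    """Return True if the record satisfies every constraint in the query.

    Constraint forms (assumed for this sketch):
      None        -> wildcard, any value is accepted
      (low, high) -> inclusive range
      other value -> exact match
    """
    for dimension, constraint in query.items():
        if constraint is None:
            continue  # wildcard: this dimension is unconstrained
        value = record.get(dimension)
        if value is None:
            return False  # the record lacks a constrained dimension
        if isinstance(constraint, tuple):
            low, high = constraint
            if not (low <= value <= high):
                return False
        elif value != constraint:
            return False
    return True


def run_query(records: Iterable[dict[str, Any]], query: dict[str, Any]) -> list[dict[str, Any]]:
    """Linear scan; a real system would use an index instead."""
    return [r for r in records if matches(r, query)]


if __name__ == "__main__":
    records = [
        {"lat": 40.5, "lon": -105.0, "temperature": 275.0, "snow_depth": 0.0},
        {"lat": 41.2, "lon": -104.8, "temperature": 268.0, "snow_depth": 0.3},
        {"lat": 39.7, "lon": -105.2, "temperature": 279.5, "snow_depth": 0.0},
    ]
    query = {
        "lat": (40.0, 42.0),             # range along one dimension
        "lon": None,                     # wildcard
        "temperature": (270.0, 280.0),   # range along another dimension
        "snow_depth": 0.0,               # exact match
    }
    print(run_query(records, query))
```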

  6. Planned/Ongoing deployments for Galileo • International Centre for Radio Astronomy Research • Australian SKA Pathfinder telescope • ~ 1 PB of time-series data • CSU Atmospheric Sciences & Precision Wind (Boulder) • Short-term wind forecast predictions • CSU Civil & Environmental Engineering department • Sustainable management of watershed systems • Climate.org

  7. Related work • Google File System • BigTable • Distributed Hash Table (DHT) based systems • Pastry, Chord, Dynamo, and CAN • SciDB • MongoDB

  8. Dataset used in performance evaluations • Sourced from NOAA NAM Project • Dimensions/Features: • Geospatial: Latitude, Longitude • Time Series: Start Time, End Time • Temperature • Relative Humidity • Wind Speed • Snow Depth • Composed of 1 billion files (8 TB)

  9. Storage Throughput • Each block is about 8 KB of data • 56,000 blocks per second in a system with 48 nodes
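The figures on this slide can be turned into a rough data rate. The arithmetic below is a back-of-the-envelope sketch, not a reported result: it assumes 1 KB = 1000 bytes, treats "about 8 KB" as exactly 8 KB, assumes load is spread evenly over the 48 nodes, and assumes one block per file when estimating how long the 1-billion-file dataset from the previous slide would take to ingest. The function names are made up for the example.

```python
BLOCK_SIZE_BYTES = 8 * 1000      # "about 8 KB"; assumes 1 KB = 1000 bytes
BLOCKS_PER_SECOND = 56_000       # aggregate rate from the slide
NODE_COUNT = 48                  # cluster size from the slide
TOTAL_FILES = 1_000_000_000      # dataset size from the previous slide


def aggregate_mb_per_second(blocks_per_second: int, block_bytes: int) -> float:
    """Aggregate storage rate in MB/s (decimal megabytes)."""
    return blocks_per_second * block_bytes / 1e6


def per_node_mb_per_second(blocks_per_second: int, block_bytes: int, nodes: int) -> float:
    """Per-node rate in MB/s, assuming a perfectly even spread across nodes."""
    return aggregate_mb_per_second(blocks_per_second, block_bytes) / nodes


def hours_to_ingest(total_blocks: int, blocks_per_second: int) -> float:
    """Hours needed to store total_blocks at a sustained aggregate rate."""
    return total_blocks / blocks_per_second / 3600


if __name__ == "__main__":
    print(f"aggregate: {aggregate_mb_per_second(BLOCKS_PER_SECOND, BLOCK_SIZE_BYTES):.0f} MB/s")
    print(f"per node:  {per_node_mb_per_second(BLOCKS_PER_SECOND, BLOCK_SIZE_BYTES, NODE_COUNT):.1f} MB/s")
    # Assumes one block per file, which the slides do not state explicitly.
    print(f"ingest of full dataset: {hours_to_ingest(TOTAL_FILES, BLOCKS_PER_SECOND):.1f} hours")
```

Under these assumptions the aggregate rate works out to roughly 448 MB/s, a little over 9 MB/s per node, and about five hours to ingest one billion blocks at the sustained rate.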

  10. Query Performance

  11. Thank you! • Galileo • http://galileo.cs.colostate.edu • Sangmi Pallickara • sangmi@cs.colostate.edu • http://www.cs.colostate.edu/~sangmi
