
The unique qualities & responsibilities of a geographical cyberinfrastructure

This overview explores the challenges of organizing and computing with big data in GIScience, while highlighting the unique responsibilities of geographical cyberinfrastructure. It also discusses the emerging data opportunities and the need for systematic approaches in spatial analysis. The goal is to attract and engage the GIScience community in developing innovative tools and workflows for managing and analyzing massive geospatial datasets.

Presentation Transcript


  1. The unique qualities & responsibilities of a geographical cyberinfrastructure. Mark Gahegan, Centre for eResearch & Computer Science, University of Auckland

  2. Overview • Data-intensive GIScience: from data poor to drowning in 40 years • Challenges of organizing Big Data for GIScience • Challenges of computing with Big Data for GIScience

  3. Data poor to drowning: the case of remote sensing

  4. Early remote sensing platform

  5. 1980s: 30m x 30m pixels

  6. 2000s: 2.5m x 2.5m pixels

  7. Airborne sensor platform (much cheaper and more flexible than satellite)

  8. One of the latest unmanned remote sensing platforms

  9. How much data so far? • NASA’s Earth Observation System (EOS) program has about 4.2 petabytes (2010) • 430 times larger than the DT-LoC • 3 times smaller than the output from the Large Hadron Collider in a single year • Similar sized collections can be expected in Europe and Asia • EOS contains mostly satellite data…not air photos, map or field data • What about ‘Volunteered’ data? • And “The long tail of dark data…”?

  10. How does that compare to other science disciplines? • Large Hadron Collider (Physics) • 10-14 PB per year • A 20km high stack of DVDs or 400,000 large PC disks • Genomics (Biology) • Imaging sequencers: data volume doubling every 6 months • Can’t back it up to tape fast enough
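The growth rate and size comparisons on slides 9 and 10 can be checked with a little arithmetic. The sketch below is illustrative only: the function name and the 1 TB starting volume are assumptions, while the 4.2 PB, 430x and six-month doubling figures come from the slides.

```python
def projected_volume(initial_tb: float, years: float, doubling_months: float = 6.0) -> float:
    """Volume after `years` if it doubles every `doubling_months` (steady doubling assumed)."""
    return initial_tb * 2 ** (years * 12.0 / doubling_months)

# Hypothetical starting point: 1 TB of sequencer output.
growth_after_5_years_tb = projected_volume(1.0, 5.0)  # ten doublings

# Slide 9: EOS holds about 4.2 PB and is 430 times larger than the DT-LoC,
# which implies a DT-LoC size of roughly 10 TB (using 1 PB = 1000 TB).
eos_tb = 4.2 * 1000
implied_dt_loc_tb = eos_tb / 430
```

Ten doublings in five years is a factor of 1024, which is why backup to tape cannot keep pace with a collection growing at that rate.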

  11. Big Data Challenges for CI • Storing unprecedented volumes of data (and accelerating) • Data production passed storage capacity in 2007 • Cost differential is increasing; rate of data production is increasing • Describing what we have in ways that are helpful to future users (and our future selves) • Metadata and semantics for describing content (this tends to be producer-focused) • But also use-case metadata and emergent relationships (tends to be consumer-focused) • Finding what we need, in the context of our current task • Semantically enabled search engines that can use the above descriptions (ideally from within analytical tools and workflows) • Working out what we do not need to keep • Because it will not be used again or offers no ‘information gain’ • Because it is easier to recreate than to store • Governing data collections well, within their communities of use • Information and knowledge portals • Effective governance of data resources • Quality control strategies, including peer review and rewarding excellent contributions

  12. The Big Data responsibilities of CyberGIS • Create successful tools and languages to describe and find data, so that reuse is actively encouraged • Enable the analysis, • re-educate to reset the expectations… • The data that we collect forms a natural history of the changing planet on which we live • The same cannot be said for many other sciences... • This ongoing record is more important than the individual research we each engage in • Note we may not anticipate the questions that future researchers may need to answer using our data

  13. Emerging data opportunities… What do these four different spatial analysis tasks have in common? • Find traffic bottlenecks…? • Compute earthquake epicenters…? • Track Influenza epidemics…? • Perform land cover classification…?

  14. ‘Fourth paradigm’: science led from Big Data

  15. BUT

  16. HPC Analysis challenges of massive data in Earth Science Re-express (spatial) analysis algorithms so that they scale across HPC hardware AND Big Data: • Geometry: Point / line / region / volume—algebra, selection, transformation, projection • Topology: connectivity, route-finding, friction • (Spatial) statistics: classification, interpolation • Point pattern analysis / discovery: cluster detection The challenge is to be SYSTEMATIC, not piecemeal

  17. What’s limiting the task? • Memory? • 1TB on a single compute node now • 2-8TB on some equipment (e.g. SGI UV) • CPU? • Tightly bound—needs a lot of inter-process communication • Embarrassingly parallel—can be perfectly decomposed • Data? • Random? Linear? Blocky? (Degree of locality of reference) • Replication? • Communications? • Data channels, infiniband, metadata • Nothing? • Not everything needs to be parallelized…just the limiting segments
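The closing point of slide 17, that only the limiting segments need to be parallelized, matches the reasoning of Amdahl's law. The slide does not name that law; it is brought in here as a standard result, and the function name and sample fractions below are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the runtime is spread over `n_workers`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# Hypothetical job in which 90% of the runtime is the limiting, parallelizable segment.
speedup_10 = amdahl_speedup(0.9, 10)
speedup_1000 = amdahl_speedup(0.9, 1000)

# The serial remainder caps the achievable speedup at 1 / (1 - parallel_fraction).
speedup_ceiling = 1.0 / (1.0 - 0.9)
```

With a 90% parallel fraction, ten workers give a little over a 5x speedup and no number of workers can exceed 10x, so profiling to find the limiting segment comes before any reengineering.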

  18. Domain Decomposition
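As a minimal sketch of domain decomposition for a raster operation, the example below splits a grid into row blocks, gives each block a one-row halo of neighbouring cells, and computes a 3x3 focal mean per block independently. The focal-mean operation, the function names and the thread pool are all assumptions chosen for illustration; the slide itself does not prescribe a method.

```python
from concurrent.futures import ThreadPoolExecutor

def focal_mean(grid, row_lo, row_hi):
    """3x3 mean for rows row_lo..row_hi-1, using only cells that exist inside `grid`."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r in range(row_lo, row_hi):
        out_row = []
        for c in range(n_cols):
            window = [grid[rr][cc]
                      for rr in range(max(0, r - 1), min(n_rows, r + 2))
                      for cc in range(max(0, c - 1), min(n_cols, c + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

def process_block(grid, start, end, halo=1):
    """Cut rows start..end-1 plus a halo, then compute only the core rows."""
    lo = max(0, start - halo)
    hi = min(len(grid), end + halo)
    block = grid[lo:hi]
    return focal_mean(block, start - lo, end - lo)

def decomposed_focal_mean(grid, n_blocks):
    """Split the grid into row blocks, process each independently, stitch in order."""
    n_rows = len(grid)
    bounds = [(i * n_rows // n_blocks, (i + 1) * n_rows // n_blocks)
              for i in range(n_blocks)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda b: process_block(grid, b[0], b[1]), bounds)
        return [row for part in parts for row in part]

# Small synthetic raster (hypothetical values) to compare both routes.
grid = [[float((r * 7 + c * 3) % 11) for c in range(8)] for r in range(10)]
whole = focal_mean(grid, 0, len(grid))
tiled = decomposed_focal_mean(grid, 3)
```

The halo is what makes the blocks independent: each block carries the neighbouring rows it needs, so no communication is required during the computation. Operations with a larger or data-dependent neighbourhood (route-finding, for example) do not decompose this cleanly.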

  19. Or to put it another way… • How do GISci algorithms map onto the well-understood supercomputing templates? • Dense Grids • Sparse Grids • Computational Fluid Dynamics • N-Body interactions • Monte Carlo • Data Intensive • etc... (Are all our algorithms covered by these templates?)

  20. Cost of reengineering vs. slowdown for GIS algorithms (figure; labels: Slowdown, Utility, Cost of reengineering)

  21. Sticky CyberGIS How to attract and keep the community involved? • Outreach & community engagement • Compelling and appealing functionality • Data and method repositories • Workflows • Semantic interoperability • Killer Apps… • Incentives to contribute • Continuity

  22. Computational workflows embedded in social media • Scripts, workflows, simulations, experimental plans, statistical models, ... • Repeatable, reproducible, comparable and reusable • Sharing propagates expertise and builds reputation • One can be ‘friends with an experiment’ in a science social network http://myexperiment.org

  23. Semantically translating map data. SemDat Web Service: http://semdat.bestgrid.org/semdat/

  24. Killer App example from geosciences: earthquake modelling (GEON: Chaitan Baru, SDSC). Diagram labels (from Leinen, 2004): Seismic Hazard Model • Seismicity (ANSS) • Paleoseismology • Local site effects • Geologic structure (USArray) • Faults (USArray) • Rupture dynamics (SAFOD, ANSS, USArray) • Stress transfer (InSAR, PBO, SAFOD) • Crustal motion (PBO) • Crustal deformation (InSAR) • Seismic velocity (USArray)

  25. Conclusions • Big Data creates new ways of approaching GIScience: discovery-led rather than theory-led • Need to scale up our storage • Useful data is the data that can be reused… • Scalable GIScience methods are needed now • Domain decomposition has always been the challenge for GIScience, and still is • A systematic analysis of algorithm bottlenecks and amenability to parallelization has been missing for 20 years • Such an analysis is an ongoing task…as new parallel HPC and data paradigms become possible • Re-educate to reset expectations among researchers • Use the best technologies and tools from other disciplines that have made this leap, especially bioinformatics, computational chemistry, high energy physics

  26. Questions?

  27. Fourth paradigm and data complexity • Experiment & Measurement • Analytical Theory • Numerical Simulations • Data Intensive Computing Data fusion + data mining + synthesis/learning + explanation http://research.microsoft.com/en-us/collaboration/fourthparadigm/

  28. Utilizing massive data to discover and explain is not as easy as you might think… • Poor and sparse samples, surrogates, bias… • As the number of dimensions increases, it becomes increasingly difficult to add in any data point without giving rise to some kind of statistically significant ‘pattern’ or ‘cluster’ • And parametric distributions become unreliable • It is very difficult to discover useful things that are unknown by experts
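One related, well-documented effect behind the dimensionality warning on slide 28 is distance concentration: in high-dimensional spaces the nearest and farthest neighbours of a point end up at nearly the same distance, so distance-based 'clusters' lose meaning. The sketch below demonstrates that effect only, which is a stand-in rather than the slide's exact statistical claim; the uniform random data, point counts, seed and function name are all assumptions.

```python
import math
import random

def min_max_distance_ratio(n_dims: int, n_points: int = 200, seed: int = 0) -> float:
    """Ratio of nearest to farthest distance from one random query point.

    Points are drawn uniformly from the unit hypercube (an assumption).
    A ratio near 0 means neighbours are well separated by distance;
    a ratio near 1 means every point is about equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return min(dists) / max(dists)

ratio_2d = min_max_distance_ratio(2)
ratio_500d = min_max_distance_ratio(500)
```

Under these assumptions the 2-dimensional ratio stays small while the 500-dimensional ratio moves close to 1, which is one reason cluster detection methods tuned on map-like data need care when many attribute dimensions are added.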

  29. Aligning heterogeneous definitions in content, schema. We need to capture the meaning of data, not just the data itself • STANDARD DEFINITIONS • data content: rock types, time scale, … • data schema (figure labels: Era, Eon, Period, Series; from www.GEONgrid.org)

  30. OneGeology interoperability portal. Data from different countries can be integrated, despite using different geologic categories/legends

  31. Complete connected neighborhood of a research article or dataset (Alfred knowledge browser)
