
MASSIVE Terrain Datasets: On the Importance of Efficient Algorithms

Lars Arge, Department of Computer Science (Datalogisk Institut), Aarhus University. Regional one-day course in computer science, 20 March 2006. Outline: massive (terrain) data; scalability problems (I/O bottleneck).




  1. MASSIVE Terrain Datasets: On the Importance of Efficient Algorithms. Lars Arge, Department of Computer Science (Datalogisk Institut), Aarhus University. Regional one-day course in computer science, 20 March 2006

  2. Massive Terrain Datasets: Outline
     • Massive (terrain) data
     • Scalability problems (I/O bottleneck)
     • Processing massive terrain data: flow modeling on grid terrains
     • Summary

  3. Massive Data

  4. Massive Data
     • Massive datasets are being collected everywhere
     • Storage management software is a billion-dollar industry
     Examples (2002):
     • Phone: AT&T 20TB phone call database, wireless tracking
     • Consumer: WalMart 70TB database, buying patterns (supermarket checkout)
     • Web: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
     • Geography: NASA satellites generate 1.2TB per day

  5. Example: Satellite Images
     • Terabyte image database

  6. Example: Grid Terrain Data
     • Grid terrain data increasingly available
       • NASA SRTM mission acquired 30m data for around 80% of earth land mass
       • US data readily available through USGS National Map Seamless Data Distribution System
     • Appalachian Mountains (800km x 800km)
       • 100m resolution ⇒ ~64M cells ⇒ ~128MB raw data (~500MB when processing)
       • ~1.2GB at 30m resolution
       • ~12GB at 10m resolution (much of US available from USGS)
       • ~1.2TB at 1m resolution (selected, mostly military, availability)
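The 100m figures for the Appalachian Mountains follow from simple arithmetic, sketched below. The helper `grid_size` is hypothetical, and the 2 bytes per cell is an assumption inferred from ~128MB for ~64M cells; the slide does not state the cell encoding.

```python
def grid_size(extent_km, resolution_m, bytes_per_cell=2):
    """Return (cells, raw_bytes) for a square grid terrain.

    bytes_per_cell=2 is an assumption inferred from the slide's
    ~128MB for ~64M cells; it is not stated in the talk.
    """
    cells_per_side = round(extent_km * 1000 / resolution_m)
    cells = cells_per_side * cells_per_side
    return cells, cells * bytes_per_cell


# 800km x 800km at 100m resolution: 8000 x 8000 cells.
cells, raw_bytes = grid_size(800, 100)
print(cells, raw_bytes)
```

Halving the cell size quadruples the number of cells, which is why the data volume grows so quickly at finer resolutions.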

  7. Example: LIDAR Terrain Data
     • Massive (irregular) point sets (1-10m resolution)
     • Becoming relatively cheap and easy to collect
     • NC floodplain mapping program: www.ncfloodmaps.com
       • Collected LIDAR for all NC after Hurricane Floyd in 1999
       • Still processing it

  8. Hurricane Floyd
     • Sep. 15, 1999, 7 am and 3 pm

  9. Example: LIDAR Terrain Data
     • US LIDAR data becoming available:
       • www.ncfloodmaps.com
       • USGS Center for LIDAR Information Coordination and Knowledge (CLICK)
       • NOAA LIDAR Data Retrieval Tool (LDART)

  10. Scalability Problems

  11. Scalability Problems: I/O-Bottleneck
     • I/O is often the bottleneck when handling massive datasets
     • Disk access is 10^6 times slower than main memory access
       “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
     • Disk systems try to amortize the large access time by transferring large contiguous blocks of data
     • Need to store and access data to take advantage of blocks (locality)
     [Figure: disk drive diagram with labels read/write head, read/write arm, track, magnetic surface]

  12. Scalability Problems: Block Access Matters
     • Example: Reading an array from disk
       • Array size N = 10 elements
       • Disk block size B = 2 elements
       • Main memory size M = 4 elements (2 blocks)
     • Difference between N and N/B is large since block size is large
       • Example: N = 256 × 10^6, B = 8000, 1ms disk access time
         ⇒ N I/Os take 256 × 10^3 sec ≈ 4266 min ≈ 71 hr
         ⇒ N/B I/Os take 256/8 sec = 32 sec
     [Figure: two orderings of the ten array elements (1 2 10 9 5 6 3 4 8 7 and 1 5 2 6 3 8 9 4 7 10). Algorithm 1: loads N = 10 blocks. Algorithm 2: loads N/B = 5 blocks.]
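The block counts in this example can be reproduced with a small simulation. This is a sketch under stated assumptions, not the talk's own code: the two orderings are read as on-disk layouts, the elements are visited in increasing order, and main memory uses LRU replacement. The names `count_block_loads` and `io_seconds` are hypothetical.

```python
from collections import OrderedDict


def count_block_loads(layout, block_size, memory_blocks):
    """Count block loads when reading the values in increasing order.

    `layout` lists the values in the order they are stored on disk.
    Main memory holds `memory_blocks` blocks; LRU replacement is an
    assumption of this sketch.
    """
    position = {value: index for index, value in enumerate(layout)}
    cache = OrderedDict()  # block id -> None, least recently used first
    loads = 0
    for value in sorted(layout):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
            continue
        loads += 1
        cache[block] = None
        if len(cache) > memory_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return loads


def io_seconds(num_ios, access_ms=1.0):
    """Total time in seconds for `num_ios` disk accesses of `access_ms` each."""
    return num_ios * access_ms / 1000.0


scattered = [1, 5, 2, 6, 3, 8, 9, 4, 7, 10]  # consecutive values in different blocks
blocked = [1, 2, 10, 9, 5, 6, 3, 4, 8, 7]    # consecutive values share a block
loads_scattered = count_block_loads(scattered, block_size=2, memory_blocks=2)
loads_blocked = count_block_loads(blocked, block_size=2, memory_blocks=2)
print(loads_scattered, loads_blocked)

# The large example: N = 256 x 10^6 elements, B = 8000, 1ms per access.
print(io_seconds(256_000_000), io_seconds(256_000_000 // 8000))
```

Under these assumptions the scattered layout triggers one load per element, while the blocked layout needs only one load per block.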

  13. Scalability Problems: Block Access Matters
     • Most programs developed without memory considerations
       • Infinite memory
       • Uniform access cost
     • Run on large datasets because OS moves blocks as needed
     • Modern operating systems use sophisticated paging and prefetching strategies
       • But if a program makes scattered accesses, even a good OS cannot take advantage of block access
     ⇒ Scalability problems!
     [Figure: plot of running time against data size; label: RAM]

  14. Scalability: Hierarchical Memory
     • Block access not only important on disk level
     • Machines have complicated memory hierarchy
       • Levels get larger and slower
       • Block transfers on all levels
     • We focus on disk level
     [Figure: plot of running time against data size (label: RAM); memory hierarchy diagram with levels L1, L2, RAM]

  15. Processing Massive Terrain Data: Flow

  16. Flow on Terrains
     • Modeling of water flow on terrains has many important applications
       • Predict location of streams
       • Predict areas susceptible to floods
       • Compute watersheds
       • Predict erosion
       • Predict vegetation distribution
       • …
     • Conceptually flow is modeled using two basic attributes
       • Flow direction: The direction water flows at a point
       • Flow accumulation: Amount of water flowing through a point
     • Flow accumulation used to compute other hydrological attributes, e.g. drainage network, topographic convergence index…

  17. Flow Directions on Grid Terrains
     • Common terrain representation: Grid
     • Flow directions: Water in each cell flows to downslope neighbor(s)
     • Commonly used:
       • Single flow direction (SFD or D8): Flow to downslope neighbor
       • Multiple flow direction (MFD): Flow to all downslope neighbors
     [Figure: SFD and MFD panels on a grid with cell heights 3 2 4 7 5 8 7 1 9]
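The two flow direction models can be sketched as follows. Several things here are assumptions rather than content of the talk: the figure's heights are read as the rows 3 2 4 / 7 5 8 / 7 1 9, SFD picks the lowest of the eight neighbors (ignoring the longer diagonal distance that a steepest-descent rule would account for), and cells of equal height do not flow to each other. All function names are hypothetical.

```python
# Offsets of the eight neighbors of a grid cell.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def downslope_neighbors(grid, row, col):
    """Return the (row, col) positions of all strictly lower neighbors."""
    rows, cols = len(grid), len(grid[0])
    lower = []
    for dr, dc in NEIGHBOR_OFFSETS:
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] < grid[row][col]:
            lower.append((r, c))
    return lower


def sfd_direction(grid, row, col):
    """SFD: the single lowest downslope neighbor, or None if there is none."""
    lower = downslope_neighbors(grid, row, col)
    if not lower:
        return None
    return min(lower, key=lambda rc: grid[rc[0]][rc[1]])


def mfd_directions(grid, row, col):
    """MFD: every downslope neighbor receives flow."""
    return downslope_neighbors(grid, row, col)


grid = [[3, 2, 4],
        [7, 5, 8],
        [7, 1, 9]]
sfd = sfd_direction(grid, 1, 1)   # center cell with height 5
mfd = mfd_directions(grid, 1, 1)
print(sfd, mfd)
```

For the center cell, SFD sends all water to the cell of height 1, while MFD spreads it over the cells of height 3, 2, 4 and 1.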

  18. Flow Accumulation on Grid Terrains
     • Flow accumulation
       • Initially one unit of water in each cell
       • Water distributed from each cell according to flow direction(s)
       • Flow accumulation of cell is total flow through it
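A minimal in-memory sketch of flow accumulation with single flow directions is given below. It is not the TerraFlow algorithm; it processes cells from highest to lowest so that every cell has received all its inflow before it passes water on. The assumptions are the same as for any simple SFD rule: each cell flows to its lowest strictly lower neighbor, and flat areas are ignored. The function name is hypothetical.

```python
def flow_accumulation_sfd(grid):
    """Return a grid holding the total flow through each cell (SFD model)."""
    rows, cols = len(grid), len(grid[0])
    flow = [[1] * cols for _ in range(rows)]  # one unit of water per cell
    cells = sorted(((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
                   reverse=True)              # highest cells first
    for height, r, c in cells:
        best = None  # lowest strictly lower neighbor found so far
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                if grid[nr][nc] < height and (
                        best is None or grid[nr][nc] < grid[best[0]][best[1]]):
                    best = (nr, nc)
        if best is not None:
            flow[best[0]][best[1]] += flow[r][c]  # push all water downslope
    return flow


heights = [[3, 2, 4],
           [7, 5, 8],
           [7, 1, 9]]
accumulation = flow_accumulation_sfd(heights)
for row in accumulation:
    print(row)
```

In this small example all water ends up in the two local minima (heights 2 and 1), whose accumulations sum to the number of cells.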

  19. Flow Accumulation Example (Panama dataset)

  20. Flow Modeling on Massive Grid Terrains
     • Environmental researchers at Duke University had problems computing flow accumulation for the Appalachian Mountains
       • Recall ~128MB raw data and ~500MB when processing
       ⇒ Running time: 14 days
     • It could be much worse; recall
       • ~1.2GB at 30m resolution
       • ~12GB at 10m resolution
       • ~1.2TB at 1m resolution

  21. Flow Modeling on Massive Grid Terrains
     • We surveyed other flow accumulation software
     • GRASS (leading open-source GIS)
       • Killed after 17 days on a 50MB dataset (6700 x 4300 grid)
     • TARDEM (specialized hydrology software)
       • Could handle 50MB dataset
       • Killed after 20 days on a 240MB dataset (12000 x 10000 grid)
       • CPU utilization 5%, 3GB swap file
     • ArcGIS (leading commercial GIS)
       • Could handle the 240MB dataset
       • Sometimes very slow:
         • 3 days to process 490MB dataset
         • 1 day to process 560MB dataset
       • Does not work for datasets larger than 2GB

  22. Flow Accumulation Scalability Problem
     • Natural algorithm may require ~N I/Os
       • “Push” flow down the terrain by visiting cells in height order
       ⇒ Problem since cells of same height scattered over terrain
     • Natural to try “tiling” (ArcGIS?)
       • But computation in different tiles not independent
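Why visiting cells in height order can cost ~N I/Os is illustrated by the sketch below. It counts block loads for a grid stored row by row, once for a plain row-order scan and once for a height-order visit. The terrain is a hypothetical worst case in which height order runs down the columns, one block holds one grid row, and memory uses LRU replacement; none of these parameters come from the talk.

```python
from collections import OrderedDict


def block_loads(visit_order, cols, block_size, memory_blocks):
    """Count block loads for visiting (row, col) cells of a row-major grid."""
    cache = OrderedDict()  # block id -> None, least recently used first
    loads = 0
    for row, col in visit_order:
        block = (row * cols + col) // block_size
        if block in cache:
            cache.move_to_end(block)
        else:
            loads += 1
            cache[block] = None
            if len(cache) > memory_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return loads


rows = cols = 8
# Hypothetical terrain: heights increase down each column, then column by column.
height = [[c * rows + r for c in range(cols)] for r in range(rows)]
cells = [(r, c) for r in range(rows) for c in range(cols)]  # row-major order
by_height = sorted(cells, key=lambda rc: height[rc[0]][rc[1]], reverse=True)

loads_height_order = block_loads(by_height, cols, block_size=8, memory_blocks=2)
loads_row_order = block_loads(cells, cols, block_size=8, memory_blocks=2)
print(loads_height_order, loads_row_order)
```

With N = 64 cells and B = 8, the row-order scan loads N/B blocks while the height-order visit loads a block for every cell.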

  23. TerraFlow
     • We developed theoretically I/O-optimal algorithms using ~N/B I/Os
     • Avoiding scattered access by:
       • Grid storing input: Data duplication
       • Grid storing flow: “Lazy write”
     • Implementation was very efficient
       • Appalachian Mountains flow accumulation in 3 hours!
     • Developed into comprehensive software package for flow computation on massive grids (www.cs.duke.edu/geo*/terraflow)
       • Efficient: 2-1000 times faster than other software on massive grids
       • Scalable: 1 billion elements! (>2GB data)
       • Flexible: Different flow modeling (direction) methods

  24. TerraFlow
     • Significant speedup over ArcInfo for large datasets
       • East-Coast (100m): TerraFlow: 8.7 Hours, ArcInfo: 78 Hours
       • Washington state (10m): TerraFlow: 63 Hours, ArcInfo: %
     • Incorporated in GRASS 5.0.2 and later
     • Recently also extensions for ArcGIS 8 and 9
     [Figure: chart of running time (hours) on a 500 MHz Alpha running FreeBSD 4.0; series: TerraFlow 512, TerraFlow 128, ArcInfo 512, ArcInfo 128; datasets: Hawaii 56M, Cumberlands 80M, Lower NE 256M, East-Coast 491M, Midwest 561M, Washington 2G]

  25. Denmark?

  26. Denmark Terrain Data
     • Mainly two data suppliers in Denmark
       • Kort & Matrikelstyrelsen
       • COWI A/S
     • Grid/vector models based on paper maps/orthophotos
     • LIDAR data for major cities
     • Unfortunately not available online (and not free)
     • But obviously increasing interest in terrain data/applications

  27. New Project
     • New (NABIIT) project: Development of algorithms and software for processing massive terrain data
     • COWI A/S
       • Problems processing LIDAR data during production and analysis (e.g. railroad noise)
     • Spatial analysis unit, Danish Institute of Agricultural Sciences
       • Use data, e.g. to comply with EU directives
     • Computer science, Aarhus University
       • Efficient algorithms
     • Focus on
       • Terrain modeling, terrain flow analysis, influence of simplification

  28. Example Sub-Projects
     • Terrain modeling, e.g.:
       • Terrain models from “raw” LIDAR
         Process >10G raw data in a few hours using only 128M memory
     • Terrain analysis, e.g.:
       • Erosion modeling (USLE factor computation)
       • Watershed hierarchy computation
         NC Neuse basin at 10m resolution (~400M cells) in 3 hours

  29. Summary

  30. Summary
     • Massive datasets appear everywhere
     • This leads to scalability problems
       • Due to hierarchical memory and slow I/O
     • I/O-efficient algorithms greatly improve scalability
     • Terrain data:
       • Massive grid data exists
       • New technologies are creating massive and very detailed datasets
       • Processing capabilities lag behind

  31. Summary - Resources
     • Google Earth: http://earth.google.com/
     • USGS National Map: http://seamless.usgs.gov
     • USGS Center for LIDAR Information: http://lidar.cr.usgs.gov
     • NC floodmaps: http://www.ncfloodmaps.com
     • NOAA LIDAR Data Retrieval Tool: http://www.csc.noaa.gov/crs/tcm/about_ldart.html
     • TerraFlow: http://www.cs.duke.edu/geo*/terraflow
     • Duke STREAM project: http://terrain.cs.duke.edu
     • Kort & Matrikelstyrelsen: http://www.kms.dk
     • COWI A/S: http://www.cowi.dk
     • Geoforum: http://www.geoforum.dk/

  32. THANKS/TAK
     Lars Arge, large@daimi.au.dk
