
Efficient Quadtree Construction for Indexing Large-Scale Point Data on GPUs: Bottom-Up vs. Top-Down

This paper compares the top-down and bottom-up approaches for constructing quadtrees on GPUs, focusing on their efficiency in indexing large-scale point data. It discusses the key ideas and design of both approaches, presents experimental results, and provides suggestions for future work.



  1. Efficient Quadtree Construction for Indexing Large-Scale Point Data on GPUs: Bottom-Up vs. Top-Down Jianting Zhang1,2,4, Le Gruenwald3 1 Department of Computer Science, CUNY City College (CCNY) 2 Department of Computer Science, CUNY Graduate Center 3 School of Computer Science, the University of Oklahoma 4 Visiting Professor, Nvidia Corporation CISE/IIS Medium Collaborative Research Grants 1302423/1302439: "Spatial Data and Trajectory Data Management on GPUs"

  2. Outline • Introduction & Background • GPU Quadtree Data Layout • The Top-Down Approach • Phase 1: Identifying Leaf Quadrants • Phase 2: Constructing Quadtree from Leaf Quadrants • The Bottom-Up Approach • Key Ideas and Conceptual Design • Phase 1: Identifying "Full Quadrants" • Phase 2: Identifying Valid Quadtree Nodes • Phase 2: Populating INDICATOR and F_POS Arrays • Experiments • Summary and Future Work

  3. Introduction & Background OGC Simple Features Specification (SFS) for SQL Methods for Geometry: Basic Methods Dimension, GeometryType, SRID, Envelope, AsText, AsBinary, IsEmpty, IsSimple, Boundary Methods for testing Spatial Relations Equals, Disjoint, Intersects, Touches, Crosses, Within, Contains, Overlaps, Relate [more general] Methods that support Spatial Analysis Distance, Buffer, ConvexHull, Intersection, Union, Difference, SymDifference Point-to-Polygon KNN Distance Point-in-Polygon Test Point-to-Polyline NN Distance GEOS/JTS, GDAL/OGR, GRASS, PostGIS/PostgreSQL ArcMap/ArcGIS, Oracle, SQLServer ......
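Among the SFS operations listed above, the point-in-polygon test is the one most directly tied to point indexing. As a concrete illustration (not taken from the paper), here is a minimal sequential ray-casting sketch of the even-odd point-in-polygon test; the function name and polygon representation are assumptions for this example:

```python
def point_in_polygon(px, py, poly):
    """Even-odd (ray-casting) point-in-polygon test.
    poly: list of (x, y) vertices of a simple polygon, in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a horizontal ray from (px, py) to +infinity cross this edge?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses the ray's y-level
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside
```

Spatial libraries such as GEOS/JTS implement the same predicate with robust geometric predicates and edge-case handling; this sketch only conveys the idea.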

  4. Introduction & Background • Dec. 2017 • 21.1 billion transistors • 5120 processors • 14.90 TFLOPS (FP32) • 7.450 TFLOPS (FP64) • Max bandwidth 651.3 GB/s • PCI-E peripheral device • 250W • Suggested retail price: $2999 ASCI Red: 1997 First 1 Teraflops (sustained) system with 9298 Intel Pentium II Xeon processors (in 72 Cabinets) What can we do today using a device that is MUCH more powerful than ASCI Red 20 years ago?

  5. Introduction & Background David Wentzlaff, "Computer Architecture", Princeton University course on Coursera Spatial Data Management vs. Computer Architecture: how to fill the big gap effectively?

  6. Introduction & Background Data Parallelisms → Parallel Primitives → Parallel Libraries → Parallel Hardware Source: http://parallelbook.com/sites/parallelbook.com/files/SC11_20111113_Intel_McCool_Robison_Reinders.pptx The appendix of this paper lists 7 parallel primitives and their variants used in all of our implementations: a principled tradeoff between efficiency and usability
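To make the primitive-based style concrete, here is a sequential Python sketch (illustrative only, not the paper's CUDA/Thrust implementation) of two primitives used throughout the deck, exclusive_scan and reduce_by_key:

```python
def exclusive_scan(xs, init=0):
    """Exclusive prefix sum: out[i] = init + xs[0] + ... + xs[i-1]."""
    out, acc = [], init
    for x in xs:
        out.append(acc)
        acc += x
    return out

def reduce_by_key(keys, vals):
    """Sum runs of values that share the same (consecutive) key,
    mirroring the segmented-reduction primitive on sorted keys."""
    out_keys, out_vals = [], []
    for k, v in zip(keys, vals):
        if out_keys and out_keys[-1] == k:
            out_vals[-1] += v
        else:
            out_keys.append(k)
            out_vals.append(v)
    return out_keys, out_vals
```

On a GPU these run in parallel (e.g., via Thrust); the sequential versions only pin down the semantics.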

  7. Related Works Point Indexing on GPUs: • Hashing • Space Partition based Trees – good for visualization and interpretability in 2D/3D • Kd-Tree (computer graphics) • Octree (3D) • Quadtree (popular in 2D geospatial applications) • All top-down: direct CUDA implementations that sort points multiple times, suffer from unbalanced workloads, and exploit limited parallelism Previous works on GPU-Accelerated Quadtree Construction: • (Kelly and Breslow, 2011) – each thread processes a quadtree node and loops over all points under the node • (Gluck and Danner, 2014) – each thread block processes a quadtree node and the points under it • (Nvidia CUDA SDK, after 2013) – each warp processes one of four quadrants; the main purpose is to demonstrate support for dynamic parallelism (used as a weak baseline) • (Nour and Tu, 2018) – uses atomic operations for dynamic memory allocation to reduce memory footprint

  8. Our GPU Quadtree Data Layout

  9. Our Top-Down Approach: key idea and running example • Identify leaf nodes level-by-level in phase 1 (Zhang et al. 2012) • Sort points in all non-leaf quadrants at all levels

  10. Step-by-step illustration of Phase 1 of the Top-Down Approach (Zhang et al. 2012) For k from 1 to M levels:
1 Transform point dataset P to key set PK using Z-ordering at level k
2 Sort_by_key using PK as the key and P as the value
3 Reduce_by_key to count the numbers of points in partitioned quadrants
4 Identify leaf quadrants (further subdivision not needed) and identify the points that fall within them
5 Move the identified points to the front of the point array and record the level boundary
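The keying and counting steps above can be sketched sequentially as follows; `morton_key` and `quadrant_counts` are illustrative stand-ins for the GPU transform / sort_by_key / reduce_by_key primitives, assuming integer cell coordinates at the given level:

```python
from itertools import groupby

def morton_key(x, y, level):
    """Z-order (Morton) code: interleave the top `level` bits of the
    integer cell coordinates, y-bit above x-bit at each level."""
    key = 0
    for b in range(level - 1, -1, -1):
        key = (key << 2) | (((y >> b) & 1) << 1) | ((x >> b) & 1)
    return key

def quadrant_counts(points, level):
    """Phase-1 style: key each point by its Z-order code, sort by key,
    then count points per non-empty quadrant (sequential stand-in for
    transform + sort_by_key + reduce_by_key)."""
    keys = sorted(morton_key(x, y, level) for x, y in points)
    return [(k, sum(1 for _ in g)) for k, g in groupby(keys)]
```

Sibling quadrants share a key prefix, which is what lets later steps move up a level by simply dividing the key by 4.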

  11. Top-Down Approach: Construct quadtree in phase 2 Inputs: l_key: array of Morton codes of leaf quadrants; n_point: array of numbers of points in leaf quadrants; indicator: leaf/non-leaf indicator array; max_level Output: f_pos, length (Section 3.1)
Algorithm td_leafquad2tree
1 t_key ← l_key
2 for k = 0 .. max_level-1
3   t_key ← transform(t_key) … (t_key[k] /= 4)
4   (p_key, n_child) += reduce_by_key(t_key)
// filling length and f_pos for leaf nodes
5 n_map ← exclusive_scan(indicator)
6 length ← gather_if(n_point, n_map, indicator)
7 p_pos ← exclusive_scan(n_point)
8 f_pos ← gather_if(p_pos, n_map, indicator)
// filling length and f_pos for non-leaf nodes
9 length ← copy_if(n_child, ~indicator)
10 n_child ← replace_if(n_child, indicator, 0)
11 c_pos ← exclusive_scan(n_child)
12 f_pos ← copy_if(c_pos, ~indicator)
• Data-parallel design using primitives • Only involves element-wise operations • Maximizes parallelism • Workload balance
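As a hedged illustration of steps 5-8 (filling length and f_pos for leaf nodes), here is a sequential Python sketch; `leaf_length_fpos` is a made-up name, and plain compaction over the indicator array stands in for the gather_if primitive:

```python
def exclusive_scan(xs):
    """Exclusive prefix sum over a list of counts."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def leaf_length_fpos(n_point, indicator):
    """For each leaf quadrant (indicator[i] == 1): length is its point
    count, f_pos its prefix-sum offset into the sorted point array."""
    p_pos = exclusive_scan(n_point)          # step 7: offsets of all quadrants
    length = [n for n, ind in zip(n_point, indicator) if ind]  # step 6
    f_pos = [p for p, ind in zip(p_pos, indicator) if ind]     # step 8
    return length, f_pos
```

Because every step is a scan, gather, or element-wise transform over whole arrays, the workload per element is uniform, which is the basis of the slide's workload-balance claim.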

  12. Our Bottom-Up Approach: key idea and running example • Compute all non-empty quadrants (termed "full quadrants") at the finest level in phase 1 • Construct the quadtree by removing non-qualified quadtree nodes in phase 2 • Needs to sort points only once, at the finest level • nt: threshold (# of points); nk: # of points at node k; np: # of points at the parent node of node k • A quadtree node is a leaf node if: nk > nt AND it is at the finest level, OR nk <= nt AND np > nt
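The leaf-node condition can be written as a small predicate; the parameter names mirror the slide's nk, np and nt (`np_` avoids shadowing, and `at_finest` is an assumed flag for this sketch):

```python
def is_leaf(nk, np_, nt, at_finest):
    """Leaf test for a bottom-up-built quadtree node with nk points,
    whose parent holds np_ points, under point threshold nt."""
    # Leaf if it still overflows but cannot be subdivided further,
    # or if it is small enough while its parent overflows.
    return (nk > nt and at_finest) or (nk <= nt and np_ > nt)
```

A node with nk <= nt whose parent also satisfies np <= nt is not a valid node at all: the parent (or an even coarser ancestor) becomes the leaf instead, which is exactly what phase 2's node removal enforces.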

  13. Bottom-up Approach: computing full quadrants in phase 1 First three steps of phase 1 of the top-down approach at a single (finest) level:
1 Transform point dataset P to key set l_key using Z-ordering at the finest level
2 Sort_by_key using l_key as the key and P as the value
3 Reduce_by_key to count numbers of points in partitioned quadrants and set nlen
Major source of improved efficiency
First four steps of phase 2 of the top-down approach at a single (finest) level:
4 t_key ← l_key
5 for k = 0 .. max_level-1
6   t_key ← transform(t_key) … (t_key[k] /= 4)
7   (pkey, clen) += reduce_by_key(t_key)
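The level-by-level aggregation of steps 4-7 can be sketched sequentially; `level_up` is an illustrative stand-in for transform (key // 4) followed by reduce_by_key, assuming the quadrant keys are already sorted:

```python
def level_up(keys, npoints):
    """Aggregate non-empty quadrants one level up. Parent Morton key is
    key // 4; each parent accumulates its children's point counts (pn)
    and the number of non-empty children (pc, the slide's clen)."""
    pk, pn, pc = [], [], []
    for k, n in zip(keys, npoints):
        p = k // 4
        if pk and pk[-1] == p:   # same parent as previous (keys sorted)
            pn[-1] += n
            pc[-1] += 1
        else:
            pk.append(p)
            pn.append(n)
            pc.append(1)
    return pk, pn, pc
```

Since only quadrant keys and counts (not the points themselves) flow up the levels, the points are sorted exactly once, at the finest level, which the slide identifies as the major source of improved efficiency.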

  14. Bottom-up Approach: filling indicator and f_pos arrays Inputs: pkey: array of Morton codes of quadrants; clen: numbers of non-empty sub-quadrants; nlen: array of the numbers of points in these quadrants; nt: # of points threshold Output: indicator, f_pos
Algorithm genValidQuadrants
1 tpos ← exclusive_scan(clen)
2 tmap ← scatter([0..|clen|], tpos) // computing parent node offsets for access in Step 5
3 tmap ← inclusive_scan(tmap, maximum)
4 tlen ← clen // clen is modified by remove_if; a copy is used
5 {pkey, clen, nlen, tmap} ← remove_if({pkey, tlen, nlen, tmap}, (nlen, nt)) // remove unqualified nodes whose parent nodes hold fewer than nt points
6 indicator ← transform(clen, (nt)) // leaf node?
7 nlen ← replace_if(nlen, indicator, 0) // adjust nlen/clen to prepare for computing positions
8 clen ← replace_if(clen, ~indicator, 0)
9 ppos ← exclusive_scan(nlen)
10 cpos ← exclusive_scan(clen)
11 f_pos ← transform({ppos, cpos}, indicator) // filling f_pos
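As an illustration of the filtering logic (step 5 and the indicator computation), here is a sequential sketch; `valid_quadrants` is a hypothetical helper that assumes each quadrant's parent point count has already been gathered into `parent_nlen` (the role of the tmap machinery in steps 1-3):

```python
def valid_quadrants(pkey, nlen, parent_nlen, nt):
    """Keep a quadrant only if its parent overflows the threshold nt
    (otherwise the parent itself becomes the leaf and this node is
    removed). Mark the survivors that fit under nt as leaves."""
    kept = [(k, n) for k, n, pn in zip(pkey, nlen, parent_nlen) if pn > nt]
    keys = [k for k, _ in kept]
    lens = [n for _, n in kept]
    indicator = [1 if n <= nt else 0 for n in lens]  # 1 = leaf node
    return keys, lens, indicator
```

The removal is why a copy of clen (tlen) is needed in the real algorithm: remove_if compacts the arrays in place, invalidating the original child counts.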

  15. Observations/Discussions The duality between the two approaches • Top-Down Phase 1 is more complex while Bottom-Up phase 1 is simpler • Top-Down Phase 2 is simple but Bottom-Up phase 2 is relatively more complex (logic-wise) • Top-Down phase 1 uses a double-array for reordering on sorted and unsorted points at each level • Bottom-Up phase 2 • Analogy to the comparisons between merge sort and quick sort What does Bottom-Up lose in exchange for single-pass sort? • BFS ordering of quadtree nodes and points they indexed

  16. Experiment Setup Data: NYC taxi trip dataset • ~170 million pickup locations in 2009 • using accumulative 1-12 month data as 12 datasets for scalability tests Hardware/software configurations • 4-core/8-thread Intel i7-6700K CPU @ 4.00GHz, Ubuntu 18.04 Linux • Nvidia RTX 2080 Ti with 4352 (68*64) CUDA cores @ 1.65 GHz, 11 GB GDDR memory, CUDA 10.1 with Compute Capability 7.5 Baselines: • CUDA SDK sample code as a weak baseline (requires pre-allocating memory for a full pyramid; max_level set to 14 due to out-of-memory errors) • Top-Down approach as a strong baseline • Both top-down and bottom-up approaches use max_level=16 and nt=200 (For reference: GTX 1650 with 4 GB, ~$150) GPU memory footprint: Top-Down 5.99 GB, Bottom-Up 3.15 GB

  17. Table 1 Runtimes of Three Approaches (in milliseconds): CUDA SDK Sample Code (SDK), Top-Down (TD) and Bottom-up (BU) and Speedups

  18. Table 2 Breakdown Runtimes (in milliseconds) of the Top-Down Approach (TD) and Bottom-Up Approach I: Initialization time – GPU memory allocation and CPU->GPU data transfer P1 and P2: phase 1 and phase 2 Speedup=(TD-P1+TD-P2)/(BU-P1+BU-P2)

  19. Summary and Future Work • Developed a bottom-up approach to quadtree construction on GPUs, which is different from existing top-down approaches • Provided a data parallel design and parallel primitive-based implementation for the proposed bottom-up approach • Experiments show that the bottom-up approach is capable of indexing approximately 170 million taxi pickup locations in NYC in less than 200 milliseconds on a commodity Nvidia RTX 2080 Ti GPU and is 3.4X and 4.9X faster than the top-down approach with and without including CPU/GPU data transfer time, respectively • Future work: fine-tune the memory management to reduce memory allocation/deallocation time and minimize memory footprint • Integrate the indexing techniques into GPU-based data management/analytics systems

  20. Q&A http://www-cs.ccny.cuny.edu/~jzhang/ jzhang@cs.ccny.cuny.edu
