
Parallel OpenCL-based Solver for Large-Scale DNS of Incompressible Flows

This presentation describes a parallel OpenCL-based solver for large-scale Direct Numerical Simulations (DNS) of incompressible flows. The solver simulates incompressible turbulent flows with heat transfer using high-order numerical schemes, with high spatial and temporal resolution, and handles complex physical phenomena in simplified geometries. It has been validated on several test cases, including a differentially heated cavity, a surface-mounted cube, an impinging jet, a square cylinder, and a square duct. The time-step algorithm couples the Navier-Stokes momentum equations, an energy equation, and a Poisson equation for the pressure. The code uses a two-level MPI+OpenMP parallelization and is extended to utilize GPUs with OpenCL.


Presentation Transcript


  1. Towards a parallel OpenCL-based solver for large-scale DNS of incompressible flows on heterogeneous systems
  A. V. Gorobets (1,2), F. X. Trias (1), A. Oliva (1)
  (1) Heat and Mass Transfer Technological Center, Technical University of Catalonia, Barcelona, Spain
  (2) CAA LAB of KIAM RAS, Moscow, Russia

  2. Applications with high computing power demands
  Direct numerical simulations of incompressible flows
  ● Incompressible turbulent flows with heat transfer
  ● High-order numerical schemes
  ● Simplified geometry but complex physical phenomena
  ● High space and time resolution, long time integration period
  ● Validation basis for turbulence models
  Test cases:
  ● Differentially heated cavity, Ra up to 10^12 (110M nodes)
  ● Surface-mounted cube, Re = 7300 (16M nodes)
  ● Impinging jet, Re = 20000 (100M nodes)
  ● Square cylinder, Re = 22000 (75M nodes)
  ● Square duct, Re_tau = 1200 (170M nodes)

  3. A CFD algorithm for incompressible flows
  ● Navier-Stokes system to solve (see the equation sketch below)
  The algorithm of the time step
  ● Discrete system for pressure-velocity coupling:
  • The predictor velocity field is obtained explicitly
  • The correction is obtained from the Poisson equation
  • The resulting velocity field is obtained
  • The energy equation is solved explicitly
  ● Fractional step projection method: the predictor velocity, the unknown velocity and the mass conservation equation lead to the Poisson equation
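  The slide's equations are not reproduced in the transcript; the following is a hedged reconstruction using the standard explicit fractional-step projection formulation that matches the description above (notation assumed, not necessarily the authors' exact discretization):

  % Hedged reconstruction of the missing equations (standard fractional-step projection)
  \begin{align*}
    &\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
       = -\nabla p + \nu\,\nabla^{2}\mathbf{u} + \mathbf{f},
       \qquad \nabla\cdot\mathbf{u} = 0
       && \text{Navier-Stokes system} \\
    &\mathbf{u}^{p} = \mathbf{u}^{n}
       + \Delta t \left[ -(\mathbf{u}^{n}\cdot\nabla)\mathbf{u}^{n}
       + \nu\,\nabla^{2}\mathbf{u}^{n} + \mathbf{f}^{n} \right]
       && \text{predictor velocity, explicit} \\
    &\nabla^{2} p^{n+1} = \frac{1}{\Delta t}\,\nabla\cdot\mathbf{u}^{p}
       && \text{Poisson equation} \\
    &\mathbf{u}^{n+1} = \mathbf{u}^{p} - \Delta t\,\nabla p^{n+1},
       \qquad \nabla\cdot\mathbf{u}^{n+1} = 0
       && \text{correction and mass conservation}
  \end{align*}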

  4. The Poisson solver for cases with one periodic direction
  ● The method is based on a time-consuming preprocessing
  ● Allows flexible configuration for different parallel systems, CPU groups and problem sizes
  ● The 3D geometry is restricted to extruded 2D shapes with a constant mesh step in the extrusion direction, so-called "2.5D"
  ● A multigrid-like extension makes the solver applicable to fully 3D configurations
  The algorithm of the solver (see the sketch below):
  • Fourier diagonalization using FFT uncouples the 3D problem into a set of independent 2D problems (planes)
  • A Schur-complement-based direct method is used to solve the planes that correspond to the lower Fourier frequencies
  • The preconditioned conjugate gradient (PCG) method is used to solve the remaining planes
  • The inverse FFT restores the solution of the 3D problem
  Figure: uncoupling of a 3D domain using FFT
  A. Gorobets, F. X. Trias, M. Soria and A. Oliva, "A scalable parallel Poisson solver for three-dimensional problems with one periodic direction", Computers & Fluids, 39 (2010) 525-538
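  A minimal sketch in C of the solver structure described above, assuming a plane-contiguous storage of the Fourier coefficients; the helper functions (fft_x, ifft_x, solve_plane_direct, solve_plane_pcg) are hypothetical, not the authors' API:

  #include <stddef.h>

  void fft_x(double *v, int Nx, int Nyz);             /* forward FFT along the periodic direction */
  void ifft_x(double *v, int Nx, int Nyz);            /* inverse FFT                              */
  void solve_plane_direct(int k, const double *rhs, double *phi, int Nyz); /* Schur-complement DSD */
  void solve_plane_pcg(int k, const double *rhs, double *phi, int Nyz);    /* PCG for one plane    */

  void poisson_solve(double *rhs, double *phi, int Nx, int Nyz, int D)
  {
      fft_x(rhs, Nx, Nyz);                            /* uncouple into Nx independent planes */
      for (int k = 0; k < Nx; ++k) {
          const double *r = rhs + (size_t)k * Nyz;
          double *p = phi + (size_t)k * Nyz;
          if (k < D)
              solve_plane_direct(k, r, p, Nyz);       /* lower Fourier frequencies: direct method */
          else
              solve_plane_pcg(k, r, p, Nyz);          /* remaining planes: PCG                    */
      }
      ifft_x(phi, Nx, Nyz);                           /* restore the solution of the 3D problem   */
  }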

  5. The MPI+OpenMP two-level parallelization
  MPI parallelization
  ● The domain is decomposed into P = Px x Pyz parts in all 3 directions
  ● The periodic direction is decomposed into Px parts: the FFT is replicated within 1D groups and each FFT is sequential
  ● Each plane is decomposed into Pyz parts
  ● The first D planes are solved with the direct method, the remaining planes with the iterative method
  OpenMP parallelization (see the sketch below)
  ● The explicit part is parallelized by decomposing loops over nodes
  ● FFT stages (1, 4) are parallelized in the same way: the set of sub-vectors is divided between threads; the FFT itself stays sequential!
  ● The set of planes is divided between threads on stages 2 and 3
  ● I/O and MPI communications are performed by thread 0
  A. Gorobets, F. X. Trias, R. Borrell, O. Lehmkuhl, A. Oliva, "Hybrid MPI+OpenMP parallelization of an FFT-based 3D Poisson solver with one periodic direction", Computers & Fluids (2011)
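  A minimal OpenMP sketch of the thread-level work distribution described above, assuming the periodic direction is contiguous during the FFT stages (any transpose or reordering between stages is omitted); fft_1d and solve_plane_pcg_local are hypothetical helpers:

  #include <stddef.h>
  #include <omp.h>

  void fft_1d(double *v, int n);                                          /* sequential 1D FFT    */
  void solve_plane_pcg_local(int k, double *rhs, double *phi, int Nyz);   /* PCG on local plane   */

  void fft_stage(double *v, int Nx, int Nyz)
  {
      /* stages 1 and 4: each thread transforms its own share of the Nyz sub-vectors */
      #pragma omp parallel for schedule(static)
      for (int j = 0; j < Nyz; ++j)
          fft_1d(v + (size_t)j * Nx, Nx);             /* the FFT itself stays sequential */
  }

  void plane_stage(double *rhs, double *phi, int Nx, int Nyz, int D)
  {
      /* stages 2 and 3: the set of planes is divided between threads */
      #pragma omp parallel for schedule(dynamic)
      for (int k = D; k < Nx; ++k)
          solve_plane_pcg_local(k, rhs + (size_t)k * Nyz, phi + (size_t)k * Nyz, Nyz);
  }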

  6. Speedups with MPI+OpenMP
  ● In total P = Px x Pyz x Pt CPU cores are engaged
  ● With Px = 8 and Pt = 8 the solver reaches its limits on 8-core nodes
  ● However, Pyz = 200 is still far from its limits; 12800 cores was just the maximum available for the tests
  ● Up to Pyz = 800 it works well, so we estimate the limit as at least 800 x 8 x 8 = 51200 CPU cores
  ● The symmetric extension for cases with domain symmetry allows this number to be scaled up to 4 times, leading beyond 10^5 cores
  Figures: speedup varying Pt (Pyz = 64, Px = 1, MVS-100K); speedups varying Px (Pyz = 200, Pt = 8, Lomonosov supercomputer); speedup varying Pyz (Pt = 8, Px = 1, Lomonosov)
  Mesh2: 330 million nodes; Mesh3: 1003 million nodes

  7. Extension of the parallel model
  Figures: a "usual" supercomputer vs. a supercomputer with hybrid nodes
  ● Nodes of a supercomputer are coupled by means of MPI, within the distributed memory model and MIMD parallelism
  ● OpenMP provides parallelization inside multi-core nodes, within the shared memory model and the same MIMD parallelism
  ● OpenCL engages GPUs, where SIMD is used at the level of the SMs (streaming multiprocessors)

  8. Using GPUs
  A program for a heterogeneous system consists of a CPU code, the so-called host code, and a GPU code, the so-called kernel code.
  ● CUDA (Compute Unified Device Architecture) is currently the most popular, but works only on NVidia hardware
  ● OpenCL (Open Computing Language) works on both NVidia and AMD (ATI)
  OpenCL was chosen for its portability.
  Comparison of GPU vs. CPU:
  ● Streaming processors with SIMD parallelism
  ● Many more cores – hundreds to thousands
  ● Latency hiding for memory access
  ● Complex memory hierarchy
  ● 5-6 times higher memory bandwidth
  ● 2-3 times higher power consumption
  ● >3 times more expensive (NVidia)
  Figures: OpenCL memory model; NVidia Fermi architecture

  9. Equipment
  Supercomputer K100 (www.kiam.ru), computing module by NIISI RAS:
  ● Computing node: 12 CPU cores at 2.93 GHz, 96 GB RAM DDR3; 3 GPUs NVidia C2050: 3 x 448 = 1344 cores, 3 x 2.5 = 7.5 GB GDDR5, attached via PCI-Express
  ● Interconnection: 5.6 Gbit/s, latency ~10^-6 s
  Reference CPU:
  ● Intel Xeon X5670, 2.93 GHz: 12 Gflops, 32 GB/s
  GPU devices:
  ● NVidia Tesla C2050 (K100): 515 Gflops, 144 GB/s
  ● AMD Radeon 7970: 947 Gflops, 264 GB/s
  ● NVidia GeForce GTX480: 164 Gflops, 177 GB/s
  ● AMD Radeon 6970: 683 Gflops, 176 GB/s

  10. The overall algorithm and basic operations
  ● Momentum equation
  • diffusion operator (3 x op. 1)
  • convection operator (3 x op. 2)
  • mass force term (op. 1)
  • summation of terms (3 x op. 3.3)
  ● Energy equation
  • diffusion operator (op. 1)
  • convection operator (op. 2)
  • summation of terms (op. 3.3)
  ● Intermediate calculations
  • get predictor fields (some linear field operations)
  • boundary conditions for velocity and temperature
  • calculate the divergence field, the r.h.s. for the Poisson equation (op. 1 x 3)
  ● Mass conservation equation – the Poisson solver
  • FFT (op. 4)
  • DSD for planes 1..D – on CPU
  • CG for planes D..Nx: SpMV (op. 5), dot product
  • IFFT (op. 4)
  ● Intermediate calculations
  • gradient calculation (op. 1 x 3)
  • linear field operations to get the resulting velocity field
  ● Time integration step, average fields processing
  • linear field operations
  ● Halo update
  Basic operations (a sketch of op. 3 is given below):
  1. Diffusion stencil product
  2. Convection stencil product
  3. Linear field operations: y = a1*x1 + … + an*xn + c, n from 1 to 4
  4. FFT and IFFT
  5. Sparse matrix-vector product
  Communications, complete: GPU → CPU: D planes of the r.h.s. vector; GPU → CPU → GPU: halos, dot products; CPU → GPU: D planes of the solution vector
  Communications, to do: GPU → CPU → GPU: halos
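  A minimal sketch of basic operation 3 (linear field operation) as a standalone OpenCL kernel for the two-operand case; the kernel and argument names are assumptions, not the solver's actual interface:

  /* y = a1*x1 + a2*x2 + c over n nodes (two-operand case of op. 3) */
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable

  __kernel void lin_field_op(__global double *y,
                             __global const double *x1,
                             __global const double *x2,
                             const double a1, const double a2, const double c,
                             const int n)
  {
      const int i = get_global_id(0);
      if (i < n)                     /* guard: global size may be rounded up to the workgroup */
          y[i] = a1 * x1[i] + a2 * x2[i] + c;
  }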

  11. Details on stencil product computing pattern
  Figure: extended subdomain border – subdomain nodes, borders between subdomains, halo nodes (stencil, subdomain, halo); diffusion stencil
  Diffusion stencil product
  ● It is in fact a 19-diagonal SpMV u = Dv for "3D" vectors of size N = Nx x Nyz
  ● The matrix consists of Nx equal blocks, hence only one "2D" block of size Nyz is stored
  ● The stencil is stored as 19 fields flattened into 1D vectors
  ● Offsets in all directions are pre-calculated for fast access to neighboring nodes: for the center node (i,j,k), positioned in memory at the P-th array position, (i+1,j,k) is at P+1, (i,j+1,k) is at P+off_y and (i,j,k+1) is at P+off_z, instead of calculating the position directly by
  #define _AMD_3D(i,j,k) ((i-OFFX) + (j-OFFY)*Nx + (k-OFFZ)*Nxy)
  where OFFX, OFFY, OFFZ are the halo offsets (see the illustration below)
  ● Preprocessor macro programming is extensively used
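  A small self-contained illustration (made-up sizes, no halo offsets) of the offset-based neighbor access described above: the position macro is evaluated once for the center node and neighbors are reached by adding precomputed strides:

  #include <stdio.h>

  #define NX 60
  #define NXY (60 * 80)
  #define POS3D(i, j, k) ((i) + (j) * NX + (k) * NXY)   /* direct evaluation of the position */

  int main(void)
  {
      const int off_y = NX, off_z = NXY;                /* precomputed strides                */
      const int i = 10, j = 20, k = 30;
      const int P = POS3D(i, j, k);                     /* evaluate the macro once            */

      /* neighbors follow from P by simple additions (both columns print equal values) */
      printf("%d %d\n", POS3D(i + 1, j, k), P + 1);
      printf("%d %d\n", POS3D(i, j + 1, k), P + off_y);
      printf("%d %d\n", POS3D(i, j, k + 1), P + off_z);
      return 0;
  }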

  12. Details on stencil product computing pattern

  // array field access macros:
  #define _AMD_2D(i,j) ((i-OFFY) + (j-OFFZ)*TY)
  #define _AMD_3D(i,j,k) ((i-OFFX) + (j-OFFY)*TX + (k-OFFZ)*TXTY)

  // stencil coefficients and advection field product macro:
  #define MULSTENCIL(aa) \
  { \
      for(int s=0; s<st_ssize##aa; s++){ /* loop over stencil nodes */ \
          const double buf1=bufStencil[p2d]; bufStencil+=sfsize; /* left coef  */ \
          const double buf2=bufStencil[p2d]; bufStencil+=sfsize; /* right coef */ \
          const int offst3Daa = offst3D##aa*(s+1); /* offset between neighboring levels */ \
          pr += buf1*sfphi[p3d0-offst3Daa] + buf2*sfphi[p3d0+offst3Daa]; \
      } \
  }

  __kernel void STENCIL_GPU(__global double *sfphi, __global double *prphi,
                            __global double *bufStencil)
  {
      const int I = get_global_id(0)/ls;
      const int lID = get_global_id(0)%ls;
      int p2d, p3d0;
      {
          const int i1 = I%NY+offY, i2 = I/NY+offZ;
          p2d = _AMD_2D(i1,i2);
          p3d0 = _AMD_3D(offX+lID,i1,i2);
      }
      /* central node */
      double pr = bufStencil[p2d]*(sfphi[p3d0]); bufStencil+=sfsize;
      MULSTENCIL(0); MULSTENCIL(1); MULSTENCIL(2);
      prphi[p3d0]=pr;
  }

  // compile-time constants generated for a particular case and device:
  #define TX 70
  #define TY 92
  #define TXTY 6440
  #define NX 60
  #define NY 80
  #define NZ 160
  #define NYZ 12800
  #define OFFX 1
  #define OFFY -4
  #define OFFZ -4
  #define offX 6
  #define offY 2
  #define offZ 2
  #define add 0
  #define sfsize 0
  #define ls 1
  #define ssizemax 0
  #define offst3D0 1
  #define offst3D1 70
  #define offst3D2 6440
  #define st_ssize0 0
  #define st_ssize1 0
  #define st_ssize2 0
  #define st_pos0 0
  #define st_pos1 0
  #define st_pos2 0

  13. Details on stencil product computing pattern
  Convection stencil product
  ● It is much more complicated, since C = C(v), where v is the current velocity field, and since the mesh is staggered the index calculation becomes messy
  ● To compute each resulting coefficient of the stencil, a variable number of pre-calculated values is needed; this number is less than or equal to the size of the stencil legs
  ● The data structure that represents a stencil is the following:
  int *st_nr      // number of "pre-coefficient" positions for each node
  int *st_spos    // stencil data position for each node
  int *st_i       // indexes of neighbours for each stencil node
  double *st_coef // stencil coefficients
  ● Computing each stencil coefficient looks like this (a host-side sketch is given below):
  // stencil coefficients and advection field product macro:
  #if(ssizemax==1) /* low-order scheme */
  #define GET_BCPV(padv) \
    (((nr>0) ? bcoef[0]*padv[bi[0]] : 0.0) + \
     ((nr>1) ? bcoef[1]*padv[bi[1]] : 0.0))
  #else /* high-order scheme */
  #define GET_BCPV(padv) \
    (((nr>0) ? bcoef[0]*padv[bi[0]] : 0.0) + \
     ((nr>1) ? bcoef[1]*padv[bi[1]] : 0.0) + \
     ((nr>2) ? bcoef[2]*padv[bi[2]] : 0.0) + \
     ((nr>3) ? bcoef[3]*padv[bi[3]] : 0.0))
  #endif
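  A hedged host-side sketch in plain C of how one convection stencil coefficient could be assembled from the st_nr/st_spos/st_i/st_coef structure above; the exact memory layout is an assumption:

  /* Assemble one C(v) stencil coefficient: sum of nr pre-coefficients times the
     advection field values at the pre-computed neighbor indexes (cf. GET_BCPV). */
  double stencil_coef(const int *st_nr, const int *st_spos,
                      const int *st_i, const double *st_coef,
                      const double *adv,       /* current advection (velocity) field   */
                      int node)                /* index of the stencil node            */
  {
      const int nr  = st_nr[node];             /* number of pre-coefficient positions  */
      const int pos = st_spos[node];           /* start of this node's stencil data    */
      double c = 0.0;
      for (int s = 0; s < nr; ++s)             /* nr <= size of the stencil legs       */
          c += st_coef[pos + s] * adv[st_i[pos + s]];
      return c;
  }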

  14. Poisson solver GPUization
  ● The DSD solver stays fully on the CPU – too much MPI traffic and complexity otherwise
  ● CG – approximate-inverse or two-level preconditioning with block-CG
  ● Multi-SpMV, multi-dot – communications are combined for all remaining planes
  ● Communication and computation overlap for MPI and GPU-CPU transfers (see the sketch below)
  The algorithm of the solver:
  • Fourier diagonalization using FFT uncouples the 3D problem into a set of independent 2D problems (planes)
  • A Schur-complement-based direct method is used to solve the planes that correspond to the lower Fourier frequencies
  • The preconditioned conjugate gradient (PCG) method is used to solve the remaining planes
  • The inverse FFT restores the solution of the 3D problem
  Operations of the solver:
  • FFT (op. 4)
  • DSD for planes 1..D – on CPU
  • CG for planes D..Nx: SpMV (op. 5), dot product
  • IFFT (op. 4)
  Communications:
  GPU → CPU: D planes of the r.h.s. vector
  GPU → CPU → GPU: halos, dot products
  CPU → GPU: D planes of the solution vector
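  A minimal sketch of the GPU-CPU overlap described above, using two in-order OpenCL command queues: the first D planes of the r.h.s. are read back asynchronously while the CG kernel for the remaining planes runs; dsd_solve_planes is a hypothetical stand-in for the CPU-side direct solver:

  #include <CL/cl.h>

  void dsd_solve_planes(double *rhs_D, int D);        /* hypothetical CPU-side direct solve */

  void overlapped_poisson_step(cl_command_queue q_compute, cl_command_queue q_copy,
                               cl_kernel cg_kernel, cl_mem d_rhs, double *h_rhs_D,
                               size_t plane_bytes, int D, size_t gsz, size_t lsz)
  {
      cl_event copy_done;

      /* device -> host copy of the first D planes, non-blocking, on its own queue */
      clEnqueueReadBuffer(q_copy, d_rhs, CL_FALSE, 0, (size_t)D * plane_bytes,
                          h_rhs_D, 0, NULL, &copy_done);
      clFlush(q_copy);

      /* GPU: CG for the remaining planes proceeds concurrently on the compute queue */
      clEnqueueNDRangeKernel(q_compute, cg_kernel, 1, NULL, &gsz, &lsz, 0, NULL, NULL);
      clFlush(q_compute);

      /* CPU: wait only for the copy, then run the Schur-complement (DSD) direct solver */
      clWaitForEvents(1, &copy_done);
      dsd_solve_planes(h_rhs_D, D);

      clFinish(q_compute);                            /* join with the GPU part */
      clReleaseEvent(copy_done);
  }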

  15. Performance on GPU
  The test case:
  ● A typical DHC (differentially heated cavity) case is used
  ● 4th-order scheme
  ● Mesh sizes 430K, 770K, 1.2M nodes
  ● Imitates a typical computing load
  Figures: net performance in Gflops (counting fairly the number of operations in the algorithms) and % of peak; 1 CPU core vs. 1 GPU; 6-core Xeon vs. 1 GPU (assuming a typical OpenMP speedup of 4.5 out of 6 cores)

  16. Overall multi-GPU parallel model
  ● CPUs only for control, MPI communications and DSD – if the GPU is much faster:
  MPI – 1st and 2nd level domain decomposition; OpenCL – 3rd level
  ● CPUs and GPUs computing together – if the CPU is comparable to the GPU:
  MPI – 1st and 2nd level domain decomposition; OpenMP – 3rd level; OpenCL – 3rd level

  17. Technology of porting operations
  ● Each operation has a standalone, separate test that loads all the input data from a file
  ● Each operation has a GPUized version of the CPU code
  ● Every scalar known at kernel compilation time is turned into a defined constant (see the sketch after this list)
  ● Loops are unrolled, branching is replaced with preprocessor commands
  ● Positioning is pre-calculated and added as defined constants
  ● Divisions are minimized
  ● Data access is reorganized to achieve coalescing
  ● Data is properly aligned
  ● Usage of registers is minimized
  ● The workgroup size is optimized for the particular device
  ● Additional correctness control is implemented for each operation
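  A minimal sketch of the "scalars as defined constants" technique from the list above: mesh- and device-specific values are passed to the OpenCL compiler as -D options when the kernel is built at run time (function and parameter names are assumptions):

  #include <CL/cl.h>
  #include <stdio.h>

  cl_program build_stencil_program(cl_context ctx, cl_device_id dev, const char *src,
                                   int nx, int ny, int off_y, int off_z)
  {
      char options[256];
      /* same role as the #define block of slide 12, generated per mesh and device */
      snprintf(options, sizeof(options),
               "-D NX=%d -D NY=%d -D off_y=%d -D off_z=%d -cl-mad-enable",
               nx, ny, off_y, off_z);

      cl_int err;
      cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
      err = clBuildProgram(prog, 1, &dev, options, NULL, NULL);
      if (err != CL_SUCCESS)
          fprintf(stderr, "kernel build failed: %d\n", err);
      return prog;
  }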

  18. General conclusions
  ● AMD is much faster than NVidia
  ● On CPU-like kernels the NVidia 2050 is ~5-10 times slower than the AMD 7970 (the record so far is 12)
  ● On kernels optimized for NVidia, the 2050 is ~2-3 times slower than the AMD 7970
  ● AMD software is shameful!
  ● "CUDA is easier than OpenCL" – a dramatic delusion
  ● OpenCL portability is an illusion: it works on everything, but may work very slowly until optimized for the particular architecture
  ● "CUDA is faster than OpenCL" – a very questionable statement: in most cases performance is equal, though we observed a 10-15% slowdown on certain kernels
  ● What works well on NVidia works well on AMD; what works well on AMD may turn out 10 times slower on NVidia (NVidia → AMD: yes; AMD → NVidia: no)
  ● The optimization priority should be the NVidia architecture, since it is the weaker one
  ● OpenCL works well on CPU: OpenCL vs. OpenMP showed roughly 4.3 vs. 4.5 speedup on a 6-core Xeon

  19. Flops vs. GB/s
  ● The 7970 has ~950 Gflops and 264 GB/s, hence we need 3.6 flops per byte to reach those flops
  ● The 2050 has 515 Gflops and 144 GB/s, so the same 3.6 flops per byte
  ● This means around 30 flops per double!
  ● Our stencil product is roughly 0.25 flops per byte, i.e. about 2 flops per double…
  ● The GTX480, with its pitiful 164 Gflops, outperformed the 2050 by around 30%
  Memory bandwidth is far more important than flops.
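  The arithmetic behind these figures, spelled out (double precision, 8 bytes per value):

  \begin{align*}
    \text{Radeon 7970:}\quad & \frac{947\ \text{Gflops}}{264\ \text{GB/s}} \approx 3.6\ \text{flop/byte}
        \;\approx\; 29\ \text{flops per double} \\
    \text{Tesla C2050:}\quad & \frac{515\ \text{Gflops}}{144\ \text{GB/s}} \approx 3.6\ \text{flop/byte} \\
    \text{stencil product:}\quad & \approx 0.25\ \text{flop/byte} \;\approx\; 2\ \text{flops per double}
        \;\ll\; 3.6\ \text{flop/byte}
  \end{align*}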

  20. CUDA vs. OpenCL
  CUDA
  ● Much better infrastructure – manuals, tutorials, support
  ● In rare cases slightly better performance
  ● More software and libraries available
  ● Compiled together with the CPU code
  ● Vendor-locked
  ● Oriented towards one particular architecture
  OpenCL
  ● More advanced and elegant programming model
  ● Compiled at run time of the CPU code
  ● Supported by major hardware vendors – NVidia, AMD, Intel, Apple, …
  ● Fits a wide range of architectures – CPU, GPU, …
  ● No Fortran support

  21. AMD vs. NVidia
  NVidia
  ● Much better infrastructure
  ● Better software
  ● Brutally overpriced GPGPUs – 10x the price for the same chip
  ● Weaker architecture, poor and sometimes very poor performance
  AMD
  ● Digests non-adapted CPU code much better, hence easier to program
  ● Bigger memory bandwidth and computing power
  ● Several times faster, sometimes by an order of magnitude, partly due to the architectural advantage
  ● The software is a shame! It needs a window session to be opened, hangs, fails
  A mixture of an AMD GPU chip and NVidia software would be ideal.

  22. CPU vs. GPU
  Power consumption
  ● ~100 W for a 6-core Xeon
  ● ~250 W for a GPU
  → The speedup must be >2.5 vs. the Xeon, or >12 vs. 1 core
  Price
  ● ~$1K for a Xeon
  ● ~$3K for a Tesla
  → The speedup must be >3 vs. the Xeon, or >13.5 vs. 1 core
  Complexity, effort, … ?
  … but if the system is already available, who cares about speedup? CPU+GPU must just be faster than CPU.
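  The break-even thresholds spelled out, assuming the ~4.5x OpenMP speedup of a 6-core Xeon quoted on slide 15:

  \begin{align*}
    \text{power:}\quad & \frac{250\ \mathrm{W}}{100\ \mathrm{W}} = 2.5 \ \text{(vs. Xeon)},
        \qquad 2.5 \times 4.5 \approx 11\text{-}12 \ \text{(vs. 1 core)} \\
    \text{price:}\quad & \frac{\$3\mathrm{K}}{\$1\mathrm{K}} = 3 \ \text{(vs. Xeon)},
        \qquad 3 \times 4.5 = 13.5 \ \text{(vs. 1 core)}
  \end{align*}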

  23. Conclusions
  Figure: our recent DNS of a square duct at Re_tau = 1200, 170M nodes
  ● GPU is not for kids
  ● Things are changing rapidly and the future is not clear
  ● The large-scale multi-level parallel CFD solver for hybrid systems is to be finished soon!
  Thanks for your attention!

  24. Symmetric extension
  Symmetric extension for cases with mesh symmetries (see the equation sketch below)
  ● The original system to solve is a linear system
  ● Changing the basis: the vector is decomposed into symmetric and skew-symmetric parts
  ● Applying the change of basis to the linear system decouples it into two sub-systems
  The algorithm is the following:
  • Transform the right-hand-side sub-vector
  • Solve the two decoupled systems
  • Reconstruct the solution
  ● Stages 1 and 3 require a point-to-point communication
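  A hedged reconstruction of the missing equations, assuming a mirror symmetry that splits the unknowns into two halves and gives the matrix a block form (the slide's exact notation is not recoverable):

  \begin{align*}
    & A x = b, \qquad
      A = \begin{pmatrix} A_1 & A_2 \\ A_2 & A_1 \end{pmatrix}, \qquad
      x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \\
    & x_s = \tfrac{1}{2}\,(x_1 + x_2), \qquad
      x_a = \tfrac{1}{2}\,(x_1 - x_2)
      && \text{symmetric / skew-symmetric parts} \\
    & (A_1 + A_2)\,x_s = b_s, \qquad (A_1 - A_2)\,x_a = b_a
      && \text{two decoupled half-size systems} \\
    & x_1 = x_s + x_a, \qquad x_2 = x_s - x_a
      && \text{reconstruction of the solution}
  \end{align*}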

  25. CPU vs GPU
  Profiler timings (columns: name, N calls, tavg (s), tsum (s), % of total time):

  cr_info: Total 1 37.945032 37.945032 100.00
  cr_info: GPU write 10 0.018545 0.185451 0.49
  cr_info: r_cdm 10 0.385833 3.858332 10.17
  cr_info: rc_cdm 10 0.250128 2.501276 6.59
  cr_info: prod_nl_stencil_scaf 30 0.083371 2.501128 6.59
  cr_info: rd_cdm 10 0.080982 0.809824 2.13
  cr_info: prod_stencil_scaf 40 0.026940 1.077591 2.84
  cr_info: rf_cdm 10 0.026800 0.267998 0.71
  cr_info: op: cl_3vecf 10 0.027911 0.279112 0.74
  cr_info: r_cdm_GPUzed 10 0.021045 0.210450 0.55
  cr_info: rc_cdm_GPU 10 0.014247 0.142465 0.38
  cr_info: NL_STENCIL_GPU 10 0.014245 0.142453 0.38
  cr_info: rd_cdm_GPU 10 0.003601 0.036014 0.09
  cr_info: D STENCIL_GPU 10 0.003600 0.036001 0.09
  cr_info: rf_cdm_GPU 10 0.001275 0.012755 0.03
  cr_info: F STENCIL_GPU 10 0.001274 0.012741 0.03
  cr_info: op: cl_3vecf_GPUzed 10 0.001917 0.019167 0.05
  cr_info: GPU read 1 0.050634 0.050634 0.13

  cr_info: Total 1 24.007549 24.007549 100.00
  cr_info: GPU write 10 0.038326 0.383260 1.60
  cr_info: r_cdm 10 0.371780 3.717800 15.49
  cr_info: rc_cdm 10 0.233489 2.334894 9.73
  cr_info: prod_nl_stencil_scaf 30 0.077825 2.334750 9.73
  cr_info: rd_cdm 10 0.082973 0.829735 3.46
  cr_info: prod_stencil_scaf 40 0.027621 1.104841 4.60
  cr_info: rf_cdm 10 0.027535 0.275351 1.15
  cr_info: op: cl_3vecf 10 0.027771 0.277709 1.16
  cr_info: r_cdm_GPUzed 10 0.009513 0.095134 0.40
  cr_info: rc_cdm_GPU 10 0.006281 0.062814 0.26
  cr_info: NL_STENCIL_GPU 10 0.006280 0.062797 0.26
  cr_info: rd_cdm_GPU 10 0.001634 0.016338 0.07
  cr_info: D STENCIL_GPU 10 0.001632 0.016320 0.07
  cr_info: rf_cdm_GPU 10 0.000616 0.006161 0.03
  cr_info: F STENCIL_GPU 10 0.000614 0.006140 0.03
  cr_info: op: cl_3vecf_GPUzed 10 0.000976 0.009764 0.04
  cr_info: GPU read 1 0.099515 0.099515 0.41

  ==================================================
  Convection: 0.490571 sec, 43.9387 GFlops, 17.189 Gb/s
  ==================================================
  Diffusion: 0.132634 sec, 75.2388 GFlops
  ==================================================

  cr_info: Total 1 25.387849 25.387849 100.00
  cr_info: GPU write 10 0.014997 0.149967 0.59
  cr_info: r_cdm 10 0.241860 2.418604 9.53
  cr_info: rc_cdm 10 0.159842 1.598423 6.30
  cr_info: prod_nl_stencil_scaf 30 0.053276 1.598270 6.30
  cr_info: rd_cdm 10 0.047335 0.473349 1.86
  cr_info: prod_stencil_scaf 40 0.015823 0.632934 2.49
  cr_info: rf_cdm 10 0.015979 0.159794 0.63
  cr_info: op: cl_3vecf 10 0.018693 0.186928 0.74
  cr_info: r_cdm_GPUzed 10 0.013812 0.138117 0.54
  cr_info: rc_cdm_GPU 10 0.009267 0.092669 0.37
  cr_info: NL_STENCIL_GPU 10 0.009266 0.092657 0.36
  cr_info: rd_cdm_GPU 10 0.002371 0.023713 0.09
  cr_info: D STENCIL_GPU 10 0.002369 0.023691 0.09
  cr_info: rf_cdm_GPU 10 0.000870 0.008705 0.03
  cr_info: F STENCIL_GPU 10 0.000869 0.008685 0.03
  cr_info: op: cl_3vecf_GPUzed 10 0.001298 0.012977 0.05
  cr_info: GPU read 1 0.038492 0.038492 0.15

  26. Performance on GPU

  ==================================================
  Convection: 0.231484 sec, 93.1169 GFlops, 36.4277 Gb/s
  ==================================================
  Diffusion: 0.068738 sec, ~150 GFlops
  ==================================================

  cr_info: name N tavg (s) tsum (s) percentage
  cr_info: Total 1 15.700642 15.700642 100.00
  cr_info: GPU write 10 0.025440 0.254405 1.62
  cr_info: r_cdm 10 0.231576 2.315758 14.75
  cr_info: rc_cdm 10 0.147438 1.474383 9.39
  cr_info: prod_nl_stencil_scaf 30 0.049142 1.474256 9.39
  cr_info: rd_cdm 10 0.049354 0.493539 3.14
  cr_info: prod_stencil_scaf 40 0.016448 0.657906 4.19
  cr_info: rf_cdm 10 0.016456 0.164559 1.05
  cr_info: op: cl_3vecf 10 0.018319 0.183186 1.17
  cr_info: r_cdm_GPUzed 10 0.006380 0.063796 0.41
  cr_info: rc_cdm_GPU 10 0.004099 0.040988 0.26
  cr_info: NL_STENCIL_GPU 10 0.004097 0.040972 0.26
  cr_info: rd_cdm_GPU 10 0.001122 0.011223 0.07
  cr_info: D STENCIL_GPU 10 0.001120 0.011202 0.07
  cr_info: rf_cdm_GPU 10 0.000444 0.004441 0.03
  cr_info: F STENCIL_GPU 10 0.000442 0.004424 0.03
  cr_info: op: cl_3vecf_GPUzed 10 0.000708 0.007084 0.05
  cr_info: GPU read 1 0.067318 0.067318 0.43

  cr_info: Total 1 25.387849 25.387849 100.00
  cr_info: GPU write 10 0.014997 0.149967 0.59
  cr_info: r_cdm 10 0.241860 2.418604 9.53
  cr_info: rc_cdm 10 0.159842 1.598423 6.30
  cr_info: prod_nl_stencil_scaf 30 0.053276 1.598270 6.30
  cr_info: rd_cdm 10 0.047335 0.473349 1.86
  cr_info: prod_stencil_scaf 40 0.015823 0.632934 2.49
  cr_info: rf_cdm 10 0.015979 0.159794 0.63
  cr_info: op: cl_3vecf 10 0.018693 0.186928 0.74
  cr_info: r_cdm_GPUzed 10 0.013812 0.138117 0.54
  cr_info: rc_cdm_GPU 10 0.009267 0.092669 0.37
  cr_info: NL_STENCIL_GPU 10 0.009266 0.092657 0.36
  cr_info: rd_cdm_GPU 10 0.002371 0.023713 0.09
  cr_info: D STENCIL_GPU 10 0.002369 0.023691 0.09
  cr_info: rf_cdm_GPU 10 0.000870 0.008705 0.03
  cr_info: F STENCIL_GPU 10 0.000869 0.008685 0.03
  cr_info: op: cl_3vecf_GPUzed 10 0.001298 0.012977 0.05
  cr_info: GPU read 1 0.038492 0.038492 0.15
