Mehran Maghoumi 4V82 Seminar (November 2012) - PowerPoint PPT Presentation

Presentation Transcript

  1. Fast Evaluation of GP Trees on GPGPU by Optimizing Hardware Scheduling (Maitre, O., Lachiche, N., & Collet, P., 2010). Presented by Mehran Maghoumi, 4V82 Seminar (November 2012)

  2. Seminar Outline
  • Goals of the Paper
  • Overview of CUDA and GPGPU
  • Description of the Problem
  • Implementation
  • The Experiment
  • Results
  • Conclusion & Future Work

  3. Goals of the Paper
  • Speed up GP evolution by:
  • Using parallel programming methods to evaluate GP trees quickly
  • Exploiting the hardware scheduling mechanism to maximize performance
  • Examine the influence of the following on the developed system:
  • Tree depth
  • Tree shapes

  4. What is CUDA and GPGPU?
  • CUDA: Compute Unified Device Architecture
  • Compile C programs and run them on the GPU
  • CUDA C: an extension of C with additional keywords
  • GPGPU: General-Purpose computing on Graphics Processing Units
  • A typical NVIDIA GPU of the time can perform 224 operations at the same time.
  • The aim is to take existing programs and modify them so that they can be executed in parallel.
  • CUDA vs. RapidMind (OpenGL) & Accelerator (DirectX)

  5. CUDA Paradigms
  • Thread (close to the classical definition): an independent process that executes instructions on data.
  • Block: threads can be arranged in 2D arrays called blocks.
  • MP (multiprocessor): a block is executed by an MP; each MP can execute 4 blocks at a time.
  • A visualization of this configuration: 5 degrees of freedom

  6. Description of the Problem
  • The workflow of image rendering is analogous to that of evolutionary algorithms:
  • Pixel shader algorithm → Fitness function
  • Pixels → Genomes
  • Image → Population
  • The benchmark task is a regression problem.
  • Hardware scheduler: to hide memory stalls, bundles of 32 threads (warps) are frozen and replaced by another bundle that is ready to execute.

  7. Description of the Problem (cont.)
  • To optimize for the hardware scheduler:
  • Use 32 fitness cases
  • Each fitness case is evaluated by a single thread
  • Each individual is evaluated in a single block
  • Each MP is responsible for executing up to 4 blocks
  • The card is fully utilized (pipelines are fully loaded), so in theory performance should be maximized.

  8. Implementation
  • Reverse Polish Notation (RPN / postfix) is used to represent GP individuals, speeding up transfer from RAM to VRAM.
  • The RPN interpreter uses the card's very fast shared memory to interpret the individuals.
  • To measure speedup, trees are evaluated on the CPU and then flattened and transferred to the GPU for evaluation:
  • CPU delay = evaluation time
  • GPU delay = evaluation time + upload to VRAM + download from VRAM

  9. Implementation [figure omitted from transcript]

  10. Experiment Parameters [table omitted from transcript]

  11. Influence of Tree Depth [results figure omitted from transcript]

  12. Influence of Tree Shapes [results figure omitted from transcript]

  13. Conclusion
  • Speedups ranging from ×50 up to ×250 were observed.
  • With newer "Fermi" cards, speedups of up to ×5,000 could be obtained.
  • Evaluation is usually the most time-consuming part of GP, but GPGPU makes that no longer the case:
  • The standard GP evolutionary engine becomes the bottleneck.
  • Trees need to be flattened before they are passed to the GPU.
  • Methods such as linear GP can be used to eliminate this bottleneck.

  14. Future Work
  • Implement a load-balancing system using CUDA's built-in capabilities to speed up evaluation when trees have different shapes.
  • Optimize the GP engine to eliminate the bottleneck it incurs.
  • Develop a GP framework on top of CUDA so that others can use it without knowing any CUDA.

  15. References [list omitted from transcript]