
Exploring Performance Improvement Opportunities in Directive-Based GPU Programming

Published in: 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP)

Authors: Rokiatou Diarra, Alain Merigot, Bastien Vincke

GPUs offer impressive computing power, but exploiting it effectively remains an open problem. Kernel-based programming models such as CUDA or OpenCL allow direct programming of the GPU architecture and can deliver excellent performance. However, these programming models require significant code changes, often tedious and error-prone, before an optimized program is obtained. Directive-based programming models (such as OpenMP and OpenACC) are now available for GPUs and can offer a good trade-off between performance, portability, and development cost. In this paper, we conduct a comparative performance study of OpenACC, OpenMP 4.5, and CUDA, which is essential for facilitating parallel programming on GPUs. To identify the most significant performance issues, we port a suite of representative benchmarks and three computer vision applications to OpenACC, OpenMP, and CUDA. Beyond runtime, we explore factors that influence performance, such as register count, workload, and grid and block sizes. The results of this work show that both OpenACC and OpenMP are good alternatives to kernel-based programming models, provided some careful manual optimization is performed. Through analysis of the generated PTX files, we find that kernel launch generally incurs a systematic overhead under OpenMP, but for most applications this is not a major issue provided the kernel has a sufficient workload.