APPENDIX E
Setting up CUDA
E.1 INSTALLATION

CUDA SDK installation is typically a trouble-free experience. Nvidia has shifted in recent SDK releases to a single-archive approach: you only have to download a single file from https://developer.nvidia.com/cuda-downloads and either run it (if you select the executable variant) or feed it to your system's software administration tool (if you select the deb/rpm/pkg variant for Linux or Mac OS X). The archive contains the Nvidia toolchain, along with reference and tutorial documentation and sample programs. Nvidia's site is the ultimate source of information for the proper installation steps. In this section we highlight some of the pitfalls that can plague someone making their first steps into the CUDA world:

• Always install the Nvidia display driver that accompanies the CUDA release you select to install, or a newer one. Failure to do so may cause your CUDA programs, or even the samples that come with the toolkit, to fail to run, even if they manage to compile without a problem.

• In multiuser systems such as Linux, installing the CUDA sample programs in a system-wide location (e.g., /opt or /usr) creates problems for users who try to compile or modify these programs, due to permission limitations. In that case, we have to copy the entire samples directory to our home directory before performing any of these actions, as shown below.
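For example, on a typical Linux system the copy and a test build could be performed as follows (we assume here the default /usr/local/cuda installation prefix; adjust the paths to match your own setup):

$ cp -r /usr/local/cuda/samples ~/cuda-samples
$ cd ~/cuda-samples
$ make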
E.2 ISSUES WITH GCC

The Nvidia CUDA Compiler driver has been known to have incompatibility problems with newer GCC versions. The remedy for obtaining a properly working toolchain is to install an older GCC version. In the case of the Ubuntu Linux distribution, the following commands will accomplish just that (for this particular example, version 4.4 was chosen, but the ultimate choice depends on the CUDA SDK installed and the particular Linux distribution):

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 50
$ sudo update-alternatives --config gcc
The response to the last command is the following prompt that allows the switch to the alternate C compiler:
  Selection    Path                Priority   Status
------------------------------------------------------------
  0            /usr/bin/g++-4.7    50         auto mode
  1            /usr/bin/g++-4.4    10         manual mode
* 2            /usr/bin/g++-4.7    50         manual mode

Press enter to keep the current choice [*], or type selection number:
The same sequence should be repeated separately for the C++ compiler:

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.4 50
$ sudo update-alternatives --config g++
As mentioned, this is a CUDA SDK version-specific issue, so your mileage may vary. For example, CUDA 6.5.14 is able to work properly with GCC 4.8.2, successfully compiling all Nvidia-supplied CUDA sample projects.
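After switching, it is worth confirming which compiler versions are active before invoking nvcc; a quick check could look like this:

$ gcc --version
$ g++ --version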
E.3 RUNNING CUDA WITHOUT AN Nvidia GPU

CUDA is available in all the GPUs that Nvidia has been producing for some time. However, a big share of the machines on the market are equipped with AMD GPUs or Intel graphics solutions, neither of which is capable of running CUDA programs. Fortunately, there are two solutions:

1. Emulate a CUDA GPU on the CPU by compiling CUDA programs with the -deviceemu switch. Example:

$ nvcc -deviceemu hello.cu -o hello
$ ./hello
The problem with this approach is that the -deviceemu switch has been deprecated. The last CUDA release supporting it is 2.3. Although we can still download and use this version, it requires an old GCC compiler (version 4.3), which is no longer available as an option in the majority of contemporary Linux distributions. Additionally, recent CUDA releases have a wealth of additional features, making CUDA 2.3 a viable option only when the alternative given next cannot be used.

2. Use the Ocelot1 framework. Ocelot is a dynamic compilation framework (i.e., code is compiled at run-time) providing various back-end targets for CUDA programs. Ocelot allows CUDA programs to be executed on both AMD GPUs and x86 CPUs. Employing Ocelot requires a number of additional steps during compilation. Here is an example for a CUDA program called hello.cu (a minimal sketch of such a program is given at the end of this section). The first step involves the compilation of the device code into PTX format:

$ nvcc -arch=sm_20 -cuda hello.cu -o hello.cu.cpp
1 http://code.google.com/p/gpuocelot/, last visited in July 2013.
The hello.cu.cpp file that is produced contains the device PTX code in the form of an array of bytes. The next step compiles this intermediate source code into an object file:

$ g++ -o hello.cu.o -c -Wall -g hello.cu.cpp
The last step links the object file (or files, for larger projects) with the Ocelot shared library, along with a number of libraries required by Ocelot (GLU, GLEW, and glut), into the final executable file:

$ g++ -o hello hello.cu.o -lglut -locelot -lGLEW -lGLU -L/usr/l/checkout/gpuocelot/ocelot/build_local/lib/
In this code, the -L switch points to the location where the Ocelot library is installed. The only concern with using Ocelot for running CUDA programs is that certain attributes of the virtual platform may not match what is normally expected from a physical GPU, making code that adapts to GPU characteristics (such as the number of SMs) likely to fail or to perform in unexpected ways. For example, warpSize is set equal to the block size.
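The hello.cu program used in the examples of this section is not listed in the text; a minimal stand-in (our own sketch, with an invented kernel) that would work with the commands above could be:

#include <cstdio>

// Trivial kernel: each thread prints its index. Device-side
// printf requires compute capability 2.0 or higher, which is
// consistent with the -arch=sm_20 flag used above.
__global__ void hello()
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();         // launch 1 block of 4 threads
    cudaDeviceSynchronize();   // wait for the kernel to finish and flush its output
    return 0;
}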
E.4 RUNNING CUDA ON OPTIMUS-EQUIPPED LAPTOPS

Nvidia Optimus is an energy-saving technology that automatically performs GPU switching, i.e., switching between two graphics adapters, based on the workload generated by client applications. The two adapters are typically an Intel CPU-integrated unit that draws minimal power but offers low performance, and an Nvidia GPU that offers higher performance at the expense of higher power consumption. On Linux, the switch between the two GPUs is not seamless; instead, it requires third-party software called Bumblebee.2 Distribution-specific installation instructions are provided at http://bumblebee-project.org/install.html. In Ubuntu, Bumblebee can be installed with the following commands, which ensure that the respective repository is added to the system prior to the installation procedure:

$ sudo add-apt-repository ppa:bumblebee/stable
$ sudo apt-get update
$ sudo apt-get install bumblebee bumblebee-nvidia virtualgl
A program that requires the use of the Nvidia GPU for graphical display should be invoked through optirun, as the following example illustrates:

$ optirun glxspheres64
Polygons in scene: 62464
Visual ID of window: 0x20
Context is Direct
OpenGL Renderer: GeForce GTX 870M/PCIe/SSE2
2 http://bumblebee-project.org.
322.733922 frames/sec - 360.171056 Mpixels/sec
325.732381 frames/sec - 363.517338 Mpixels/sec
322.636157 frames/sec - 360.061951 Mpixels/sec
where the optirun command is the GPU switcher provided by Bumblebee, and glxspheres64 is a program used to test graphics performance in OpenGL. The use of optirun is not required if the CUDA program does not utilize screen graphics capabilities.
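A quick way to confirm that Bumblebee can power up and access the discrete GPU is to run a driver query tool under optirun (we assume here that the nvidia-smi utility installed with the display driver is in the PATH):

$ optirun nvidia-smi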
E.5 COMBINING CUDA WITH THIRD-PARTY LIBRARIES

Nvidia offers Parallel NSight as the primary IDE for CUDA development on all supported platforms. However, Parallel NSight does not accommodate easy integration of other tools, such as Qt's toolchain, into a project. Although the sequence of commands required is straightforward, making NSight or Qt's qmake generate this sequence is a challenge. For this reason, we present a makefile that can perform this task while remaining easy to modify and adapt. The key points are (a sketch of the resulting command sequence follows the list):

• Have all .cu files compiled separately by nvcc. The compilation has to generate relocatable device code (--device-c).

• Have all the .c/.cpp files that do not contain device code compiled separately by gcc/g++.

• Have all the object files (.o) linked by nvcc, with relocatable device code enabled (-rdc=true).
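Stripped of project-specific details, and using placeholder file names, the command sequence implied by these points looks as follows (the architecture flags mirror those used in the makefile below and should be matched to your GPU):

$ nvcc --device-c -arch=compute_20 -code=sm_21 kernel.cu -o kernel.o
$ g++ -c main.cpp -o main.o
$ nvcc -rdc=true -arch=compute_20 -code=sm_21 main.o kernel.o -o app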
During compilation, all the necessary include-file directories should be supplied. Similarly, during the linking process, all the library directories and the dynamic/static libraries needed must be specified. We use the fractal generation test case of Section 5.12.1 as an example. That particular project is made up of the following files:

• kernel.cu: Contains device code for calculating the Mandelbrot fractal set. It also contains a host front-end function for launching the kernel, copying the results from the device to the host, and using them to color the pixels of a QImage object. Hence, it needs to include Qt header files.

• kernel.h: Header file containing the declaration of the host function that launches the GPU kernel. Needed by main.cpp (a hypothetical sketch is shown below).

• main.cpp: Program entry point, responsible for parsing user input, calling the host front-end function in kernel.cu, and saving the output with the assistance of Qt.
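To illustrate the interface between these files, a kernel.h along the following lines would be consistent with the description above (the function name and parameters are hypothetical, not the project's actual code):

#ifndef KERNEL_H
#define KERNEL_H

class QImage;   // forward declaration; the full Qt header is only needed by kernel.cu

// Host front-end (hypothetical signature): launches the Mandelbrot
// kernel over the given region of the complex plane and colors the
// pixels of *img with the results.
void computeMandelbrot(double upperX, double upperY,
                       double lowerX, double lowerY, QImage *img);

#endif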
Given the above, the following makefile can be used to compile and link the project:

NVCC = nvcc
CC = g++

CUDA_COMPILE_FLAGS = --device-c -arch=compute_20 -code=sm_21
CUDA_LINK_FLAGS = -rdc=true -arch=compute_20 -code=sm_21
QT_COMPILE_FLAGS = -I/usr/include/qt5/QtCore -I/usr/include/qt5/QtGui -I/usr/include/qt5
QT_LINK_FLAGS = -L/usr/lib/x86_64-linux-gnu -lQtGui -lQtCore -lpthread

mandelbrotCUDA: main.o kernel.o
	${NVCC} ${CUDA_LINK_FLAGS} ${QT_LINK_FLAGS} $^ -o $@

main.o: main.cpp kernel.h
	${CC} ${QT_COMPILE_FLAGS} -c main.cpp

kernel.o: kernel.cu kernel.h
	${NVCC} ${CUDA_COMPILE_FLAGS} ${QT_COMPILE_FLAGS} -c kernel.cu

clean:
	rm *.o
In the above, the automatic variable $@ represents the target, i.e., mandelbrotCUDA for that particular rule, and $^ represents all the dependencies listed. A similar procedure can be applied for integrating MPI and CUDA, or any possible combination of tools. As an example, let's consider the makefile used in the MPI-CUDA implementation of the AES block cipher described in Section 5.12.2. This makefile describes the creation of three targets: two standalone GPU implementations, aesCUDA and aesCUDAStreams, and the MPI-enhanced version, aesMPI:

NVCC = nvcc
CC = g++

CUDA_LINK_FLAGS = -rdc=true -arch=compute_20 -code=sm_21
CUDA_COMPILE_FLAGS = -g --device-c -arch=compute_20 -code=sm_21
CC_COMPILE_FLAGS = -g -I/usr/include/openmpi
CC_LINK_FLAGS = -lm -lstdc++ -lmpi -L/usr/lib -lpthread -lmpi_cxx

all: aesMPI aesCUDA aesCUDAStreams

aesMPI: main.o rijndael_host.o rijndael_device.o
	${NVCC} ${CUDA_LINK_FLAGS} ${CC_LINK_FLAGS} $^ -o $@

main.o: main.cpp rijndael.h
	${CC} ${CC_COMPILE_FLAGS} -c main.cpp

rijndael_host.o: rijndael_host.cu rijndael.h rijndael_device.h
	${NVCC} ${CUDA_COMPILE_FLAGS} ${CC_COMPILE_FLAGS} -c rijndael_host.cu

rijndael_device.o: rijndael_device.cu rijndael.h rijndael_device.h
	${NVCC} ${CUDA_COMPILE_FLAGS} ${CC_COMPILE_FLAGS} -c rijndael_device.cu
aesCUDA: aesCUDA.o rijndael_host.o rijndael_device.o
	${NVCC} ${CUDA_LINK_FLAGS} ${CC_LINK_FLAGS} $^ -o $@

aesCUDA.o: aesCUDA.cu rijndael.h
	${NVCC} ${CUDA_COMPILE_FLAGS} ${CC_COMPILE_FLAGS} -c aesCUDA.cu

aesCUDAStreams: aesCUDA.o rijndael_host_streams.o rijndael_device.o
	${NVCC} ${CUDA_LINK_FLAGS} ${CC_LINK_FLAGS} $^ -o $@

rijndael_host_streams.o: rijndael_host_streams.cu rijndael.h rijndael_device.h
	${NVCC} ${CUDA_COMPILE_FLAGS} ${CC_COMPILE_FLAGS} -c rijndael_host_streams.cu

clean:
	rm *.o
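With this makefile in place, all three executables can be produced with a single command, issued from the project's directory; individual targets can also be built by naming them explicitly:

$ make all
$ make aesCUDA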
An issue that can be encountered is that the third-party libraries to be linked with your project may require the generation of "position-independent code." This can be accomplished by using the -fPIC or -fPIE compiler flags. These flags cannot be passed directly to nvcc; instead, they have to be "passed through" to the GCC compiler. The appropriate nvcc flag to achieve this is -Xcompiler '-fPIC'.
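For example, a device-code compilation command with position-independent code enabled would look as follows (the file name is a placeholder):

$ nvcc -Xcompiler '-fPIC' --device-c -arch=compute_20 -code=sm_21 kernel.cu -o kernel.o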