GPU Computing

Two nodes (nodeib-51 and nodeib-52) are available for GPU computing. Each node is equipped with dual Intel Xeon(R) E5-2640 CPUs @ 2.00GHz (Ivy Bridge), 64 GiB of memory, and one NVIDIA Grid K2 card with the following specifications:

Number of GPUs	2
Total NVIDIA CUDA cores	3072 (1536 per GPU)
Total memory size	8 GB GDDR5 (4 GB per GPU)

For more detailed technical specifications of this card, please refer to NVIDIA's GRID K2 documentation.

Both nodes have the CUDA v7.0 programming environment installed, available through the cuda/7.0 module. To use these nodes, submit your jobs with the "--gres=gpu" option, which instructs SLURM to allocate nodes that contain GPUs.

NVIDIA GPUs cannot execute Intel x86 code. This means that you should recompile your code for the NVIDIA GPU, i.e. for its CUDA cores, and use the following CUDA libraries for your scientific calculations:

cuBLAS: NVIDIA GPU version of the BLAS library
cuFFT: NVIDIA GPU library for Fast Fourier Transforms (FFT)
cuRAND: High-performance random number generator
cuSPARSE: A collection of basic linear algebra subroutines for sparse matrices
NPP: A collection of image and signal processing primitives
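
As an illustration of how one of these libraries is called, the following is a minimal sketch that computes y = alpha * x + y with the cuBLAS routine cublasSaxpy. The helper name saxpy_on_gpu, the vector length and the input values are hypothetical choices made for this example only; the cuBLAS calls are from the cublas_v2.h API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: computes y = alpha * x + y on the GPU with cuBLAS.
// Returns 0 on success and 1 if any CUDA or cuBLAS call fails.
int saxpy_on_gpu(int n, float alpha, const float *x, float *y) {
    float *d_x = NULL, *d_y = NULL;   // device (GPU) copies of x and y
    cublasHandle_t handle = NULL;
    int failed = 1;

    if (cudaMalloc((void **)&d_x, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc((void **)&d_y, n * sizeof(float)) == cudaSuccess &&
        cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS &&
        // Copy the input vectors from host memory to device memory.
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1) == CUBLAS_STATUS_SUCCESS &&
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
        // The computation itself runs on the GPU.
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
        // Copy the result back to host memory.
        cublasGetVector(n, sizeof(float), d_y, 1, y, 1) == CUBLAS_STATUS_SUCCESS) {
        failed = 0;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(d_x);   // cudaFree(NULL) is a no-op
    cudaFree(d_y);
    return failed;
}

int main(void) {
    enum { N = 8 };   // illustrative vector length
    float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = (float)i; y[i] = 1.0f; }

    if (saxpy_on_gpu(N, 2.0f, x, y) != 0) {
        fprintf(stderr, "saxpy_on_gpu failed\n");
        return 1;
    }
    // With these inputs each y[i] should equal 2*i + 1.
    for (int i = 0; i < N; ++i) printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}
```

Assuming the file is saved as saxpy.cu (a hypothetical name), it has to be linked against cuBLAS, for example with "nvcc -o saxpy saxpy.cu -lcublas".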

Programming Models

There are two dominant programming platforms/models for GPU computing: CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language). CUDA was developed by NVIDIA and allows programmers to use CUDA-enabled GPUs for general-purpose processing, while OpenCL is a framework for writing code for a variety of processors, such as CPUs, GPUs, FPGAs and DSPs. Although the two programming models differ in many ways, they are similar in several respects:

device and memory models are quite similar
they separate device and host code and memory
they provide a data-parallel computation model
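
The points above can be sketched in CUDA with a vector addition: host and device memory are allocated separately, data is copied between them explicitly, and the same kernel runs in many threads, each handling one element. This is only a sketch under stated assumptions; the names vector_add and run_vector_add, the vector size and the choice of 256 threads per block are hypothetical and not taken from this cluster's documentation.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Device code: each GPU thread adds one pair of elements (data parallelism).
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the last, partial block
        c[i] = a[i] + b[i];
    }
}

// Host code (hypothetical helper): returns the number of wrong elements,
// or -1 if a memory allocation or CUDA call fails. Expects n >= 1.
int run_vector_add(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_a = (float *)malloc(bytes);            // host memory
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;    // device memory
    int result = -1;

    if (h_a && h_b && h_c) {
        for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

        if (cudaMalloc((void **)&d_a, bytes) == cudaSuccess &&
            cudaMalloc((void **)&d_b, bytes) == cudaSuccess &&
            cudaMalloc((void **)&d_c, bytes) == cudaSuccess &&
            cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
            cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
            int threads = 256;                           // threads per block (assumed)
            int blocks = (n + threads - 1) / threads;    // enough blocks to cover n
            vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

            // The blocking cudaMemcpy waits for the kernel to finish first.
            if (cudaGetLastError() == cudaSuccess &&
                cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
                result = 0;
                for (int i = 0; i < n; ++i) {
                    if (h_c[i] != 3.0f * i) ++result;    // count mismatches
                }
            }
        }
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return result;
}

int main(void) {
    int errors = run_vector_add(1 << 16);
    if (errors < 0) {
        fprintf(stderr, "CUDA or allocation error\n");
        return 1;
    }
    printf("vector_add finished with %d wrong elements\n", errors);
    return errors == 0 ? 0 : 1;
}
```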

Programming with CUDA

A typical CUDA program has both device code and host code. The term device refers to the GPU and its memory, and the term host to the CPU and its memory; as a result, one portion of a CUDA program runs on the CPU and another on the GPU. For example, a simple HelloWorld program in C looks as follows:

#include <stdio.h>

int main(void) {

    printf("Hello, World!\n");
    return 0;

}

In this example, the program runs exclusively on the CPU. The same program, extended with device code (code that runs on the GPU), is:

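#include <stdio.h>

// Device code: an empty kernel that runs on the GPU
__global__ void kernel(void) {
}

int main(void) {

    // Kernel launch from host code
    kernel<<< 1, 1 >>>();
    printf("Hello, World!\n");
    return 0;

}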

This version of the HelloWorld program contains device code, that is, code that is executed on the GPU. The device portion is identified by the __global__ qualifier and is encapsulated in a function called kernel, which is otherwise written like a standard C function. Back in the host portion of the code, note the statement "kernel<<< 1, 1 >>>();". The special triple angle bracket notation (<<< and >>>) indicates a "kernel launch", i.e. a call from host code to device code.
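
The kernel above is empty, so the GPU produces no visible output. The following sketch is a hedged variation in which the kernel itself prints a message. The two values inside <<< >>> are the number of thread blocks and the number of threads per block. The names hello_kernel and launch_hello and the 2 x 4 launch configuration are hypothetical. Because a kernel launch returns to the host before the kernel has finished, the host calls cudaDeviceSynchronize() to wait for the GPU and to collect its output.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Device code: every thread prints its own block and thread index.
__global__ void hello_kernel(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Host code (hypothetical helper): launches the kernel and waits for it.
// Returns 0 on success and 1 if the launch or the execution fails.
int launch_hello(int blocks, int threads) {
    hello_kernel<<<blocks, threads>>>();

    // Errors in the launch itself (e.g. an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Wait for the GPU to finish; this also flushes the device-side printf output.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main(void) {
    // 2 blocks of 4 threads each: the kernel body runs 8 times in total.
    return launch_hello(2, 4);
}
```

The order in which the lines from different blocks appear is not guaranteed, since blocks are scheduled independently.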

To compile a CUDA program, load the CUDA environment (the cuda/7.0 module) and call the nvcc compiler; the resulting executable is then submitted to a GPU node through SLURM. For example, assuming a CUDA program named HelloWorld.cu, compile and submit it as follows:

% nvcc -o HelloWorld HelloWorld.cu
% srun --gres=gpu ./HelloWorld
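
A quick way to confirm that a job submitted this way actually sees a GPU is to query the CUDA runtime. The sketch below, compiled and submitted in the same way as above, lists the devices visible to the job; the helper name list_gpus is hypothetical, and how many devices are reported depends on how SLURM is configured on the cluster.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per visible GPU.
// Returns the number of GPUs, or -1 if the CUDA runtime reports an error
// (which includes the case where no GPU is visible to the job).
int list_gpus(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            return -1;
        }
        printf("GPU %d: %s, compute capability %d.%d, %d multiprocessors, %.1f GiB\n",
               dev, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return count;
}

int main(void) {
    int count = list_gpus();
    if (count < 0) {
        return 1;
    }
    printf("%d GPU(s) visible to this job\n", count);
    return 0;
}
```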

For more information about programming with CUDA, please refer to the NVIDIA CUDA C Programming Guide.