This tutorial provides information about GPU resources at SHARCNET and an introduction to the use of GPUs for performing general purpose calculations.
This tutorial is a work in progress; if you have any suggestions for information to add, or find anything that is incorrect or broken, please submit a ticket to the problem tracking system.
One should understand the software development process and know how to use the bash shell.
GPU is an acronym for graphics processing unit. It is a special-purpose co-processor that assists the traditional central processor (CPU) with graphics tasks. A GPU is designed to process parallel data streams at extremely high throughput. It does not have the flexibility of a typical CPU, but it can speed up some calculations by over an order of magnitude. Recent architectural changes have made GPUs more flexible, and new general-purpose software stacks allow them to be programmed with far greater ease.
The use of GPUs in HPC is targeted at data-intensive applications that spend nearly all of their time in a particular set of mathematical kernels. These kernels must exhibit fine-grained parallelism, both in terms of processing many independent streams of data and of pipelining the operations on each stream. This emphasis on data parallelism means that GPUs will not help complex programs whose speedup is constrained by Amdahl's Law (i.e., programs dominated by serial sections).
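As a concrete illustration, the following minimal CUDA program (CUDA is introduced below; the file name, array size and block size here are arbitrary choices for this sketch) adds two large arrays element by element, with one GPU thread assigned to each element. Every element is independent of the others, which is exactly the kind of fine-grained data parallelism that maps well onto a GPU.

/* vecadd.cu -- a minimal sketch; compile with something like: nvcc vecadd.cu -o vecadd */
#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one element of the sum; the elements are completely
   independent, so thousands of threads can run in parallel. */
__global__ void vecAdd(int n, const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                 /* one million elements */
    size_t bytes = n * sizeof(float);

    /* allocate and initialize host arrays */
    float *a = (float *) malloc(bytes);
    float *b = (float *) malloc(bytes);
    float *c = (float *) malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* allocate device arrays and copy the input data to the GPU */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **) &d_a, bytes);
    cudaMalloc((void **) &d_b, bytes);
    cudaMalloc((void **) &d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* launch one thread per element, grouped into blocks of 256 threads */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(n, d_a, d_b, d_c);

    /* copy the result back (this call also waits for the kernel to finish) */
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expect 3.0)\n", c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}

The launch configuration <<<blocks, threads>>> maps one thread to each array element; a GPU relies on having many thousands of such threads in flight to hide memory latency.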
This is a 4-GPU NVIDIA Tesla S870 system located at Wilfrid Laurier University. The combined peak theoretical single-precision performance of this GPU server is over 2 TFLOPS. The host system is an HP DL160G5, containing 2 quad-core Xeon processors at 2.0 GHz and 32 GB of system RAM. It also mounts the work and scratch filesystems from silky.
tope is installed with the CUDA 1.1 SDK and toolkit. The path to the toolkit is set automatically for users when they log in. See below for example usage.
SHARCNET has a number of disparate workstations on its network that contain GPGPU-capable GPUs. These machines are not publicly available, so users must request access by submitting a ticket to the problem tracking system.
More to come...
CUDA, or "Compute Unified Device Architecture", is an NVIDIA SDK and associated toolkit for programming GPUs to perform general-purpose computation, implemented as an extension to standard C. As well as a fully featured API for parallel programming on the GPU, CUDA includes standard numerical libraries, including FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines), a visual profiler, and numerous examples illustrating the use of CUDA in a wide variety of applications.
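For instance, the bundled BLAS library (CUBLAS) can be called directly from ordinary C code without writing any GPU kernels. The following is a minimal sketch (the file name, vector length and scaling factor are arbitrary choices) using the original single-precision CUBLAS interface to compute y = 3*x + y on the GPU; it would be compiled with something like nvcc cublas_saxpy.cu -lcublas -o cublas_saxpy.

/* cublas_saxpy.cu -- a minimal sketch of calling the CUBLAS library */
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int n = 1024;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();                               /* initialize CUBLAS */

    /* allocate vectors on the GPU and copy the host data over */
    float *d_x, *d_y;
    cublasAlloc(n, sizeof(float), (void **) &d_x);
    cublasAlloc(n, sizeof(float), (void **) &d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    /* y = 3*x + y, computed on the GPU by the library */
    cublasSaxpy(n, 3.0f, d_x, 1, d_y, 1);

    /* copy the result back and check one element */
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    printf("y[0] = %f (expect 5.0)\n", y[0]);

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    free(x); free(y);
    return 0;
}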
For up to date information, tutorials, forums, etc. see the official NVIDIA CUDA site.
The SDK is a good way to learn about CUDA: one can compile the examples and learn how the toolkit works. As we have a mix of different systems, all with varying capabilities (such is life on the bleeding edge), it is suggested that users try the supported versions on the corresponding systems. We will make these available.
tope is the Tesla S870 (compute capability 1.0) system. It currently has the CUDA v1.1 SDK and toolkit installed. To compile the examples in the SDK, one should perform the following (or similar) steps:
ssh tope.sharcnet.ca
cp -rL /opt/sharcnet/cuda-toolkit+sdk/current /work/$USER/cuda_sdk
cd /work/$USER/cuda_sdk
Before proceeding, one has to modify the SDK common makefile as follows:
CUDA_INSTALL_PATH ?="/opt/sharcnet/cuda-toolkit+sdk/current"
OPENGLLIB := -lGL -lGLU /usr/lib64/libglut.so.3
Once the modifications are made, return to the root SDK directory and attempt to compile all of the examples in the SDK:
cd /work/$USER/cuda_sdk
make
This will fail for a few of the examples; to compile a particular example on its own, do the following. To see all of the examples:
[sn_user1@tope cuda_sdk]$ ls projects/
alignedTypes BlackScholes convolutionTexture eigenvalues imageDenoising matrixMulDrv nbody scalarProd simpleCUFFT simpleTextureDrv
asyncAPI boxFilter cppIntegration fastWalshTransform lineOfSight MersenneTwister oceanFFT scan simpleGL SobelFilter
bandwidthTest clock deviceQuery fluidsGL Mandelbrot MonteCarlo particles scanLargeArray simpleStreams template
binomialOptions convolutionFFT2D dwtHaar1D histogram256 marchingCubes MonteCarloMultiGPU postProcessGL simpleAtomics simpleTemplates transpose
bitonic convolutionSeparable dxtc histogram64 matrixMul multiGPU reduction simpleCUBLAS simpleTexture
Now, to compile the transpose example, build it in its project directory, then go to the executable directory and run the GPU-accelerated CUDA program:
[sn_user1@tope cuda_sdk]$ cd projects/transpose/
[sn_user1@tope transpose]$ make
[sn_user1@tope transpose]$ cd ../../bin/linux/release/
[sn_user1@tope release]$ ./transpose
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time: 3.298 ms
Optimized transpose average time: 0.305 ms
Test PASSED
Press ENTER to exit...
[sn_user1@tope release]$
Further Information
To learn more about CUDA, one should read the Programming Guide, which can be found in the docs directory of the SDK on tope. The most recent CUDA documentation can be found online at the CUDA Zone. An excellent article series on CUDA was written by Rob Farber in Dr. Dobb's Journal; it is highly recommended.
AMD Stream Processing
Under Construction
© 2008,2009, Hugh Merz, SHARCNET