This tutorial provides information about GPU resources at SHARCNET and an introduction to the use of GPUs for performing general purpose calculations.
This tutorial is a work in progress; if you have any suggestions for information to add, or find anything that is incorrect or broken, please submit a ticket to the problem tracking system.
One should understand the software development process and know how to use the bash shell.
GPU is an acronym for graphics processing unit. It is a special-purpose co-processor that assists the traditional central processor (CPU) with graphics tasks. A GPU is designed to process parallel data streams at extremely high throughput. It does not have the flexibility of a typical CPU, but it can speed up some calculations by over an order of magnitude. Recent architectural changes have made GPUs more flexible, and new general-purpose software stacks allow them to be programmed with far greater ease.
The use of GPUs in HPC is targeted at data-intensive applications that spend nearly all of their time in a particular set of mathematical kernels. These kernels must exhibit fine-grained parallelism, both in terms of processing many independent streams of data and of pipelining the operations on each stream. This emphasis on data parallelism means that GPUs will not help complex programs whose speedup is constrained by Amdahl's Law (i.e., programs dominated by serial sections).
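As a concrete illustration, the following minimal CUDA program (CUDA is introduced below; the file name, array size and block size here are arbitrary choices for this sketch) adds two large arrays element by element, with one GPU thread assigned to each element. Every element is independent of the others, which is exactly the kind of fine-grained data parallelism that maps well onto a GPU.

/* vecadd.cu -- a minimal sketch; compile with something like: nvcc vecadd.cu -o vecadd */
#include <stdio.h>
#include <stdlib.h>

/* Each thread computes one element of the sum; the elements are completely
   independent, so thousands of threads can run in parallel. */
__global__ void vecAdd(int n, const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                 /* one million elements */
    size_t bytes = n * sizeof(float);

    /* allocate and initialize host arrays */
    float *a = (float *) malloc(bytes);
    float *b = (float *) malloc(bytes);
    float *c = (float *) malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* allocate device arrays and copy the input data to the GPU */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **) &d_a, bytes);
    cudaMalloc((void **) &d_b, bytes);
    cudaMalloc((void **) &d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* launch one thread per element, grouped into blocks of 256 threads */
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(n, d_a, d_b, d_c);

    /* copy the result back (this call also waits for the kernel to finish) */
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expect 3.0)\n", c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}

The launch configuration <<<blocks, threads>>> maps one thread to each array element; a GPU relies on having many thousands of such threads in flight to hide memory latency.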
This is a 4-GPU NVIDIA Tesla S870 system located at Wilfrid Laurier University. The combined peak theoretical single-precision performance of this GPU server is over 2 TFLOPS. The host system is an HP DL160G5, containing 2 quad-core Xeon processors at 2.0 GHz and 32 GB of system RAM. It also mounts the work and scratch filesystems from silky.
tope is installed with the CUDA 1.1 SDK and toolkit. The path to the toolkit is set automatically for users when they log in. See below for example usage.
SHARCNET has a number of disparate workstations on its network that contain GPGPU-capable GPUs. These machines are not publicly available, so users must request access by submitting a ticket to the problem tracking system.
More to come...
CUDA, or "Compute Unified Device Architecture", is an NVIDIA SDK and associated toolkit for programming GPUs to perform general-purpose computation, implemented as an extension to standard C. As well as a fully featured API for parallel programming on the GPU, CUDA includes standard numerical libraries, including FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines), a visual profiler, and numerous examples illustrating the use of CUDA in a wide variety of applications.
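For instance, the bundled BLAS library (CUBLAS) can be called directly from ordinary C code without writing any GPU kernels. The following is a minimal sketch (the file name, vector length and scaling factor are arbitrary choices) using the original single-precision CUBLAS interface to compute y = 3*x + y on the GPU; it would be compiled with something like nvcc cublas_saxpy.cu -lcublas -o cublas_saxpy.

/* cublas_saxpy.cu -- a minimal sketch of calling the CUBLAS library */
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int n = 1024;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();                               /* initialize CUBLAS */

    /* allocate vectors on the GPU and copy the host data over */
    float *d_x, *d_y;
    cublasAlloc(n, sizeof(float), (void **) &d_x);
    cublasAlloc(n, sizeof(float), (void **) &d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    /* y = 3*x + y, computed on the GPU by the library */
    cublasSaxpy(n, 3.0f, d_x, 1, d_y, 1);

    /* copy the result back and check one element */
    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);
    printf("y[0] = %f (expect 5.0)\n", y[0]);

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    free(x); free(y);
    return 0;
}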
For up to date information, tutorials, forums, etc. see the official NVIDIA CUDA site.
The SDK is a good way to learn about CUDA: one can compile the examples and learn how the toolkit works. As we have a mix of different systems, all with varying capabilities (such is life on the bleeding edge), it is suggested that users try the supported versions on the corresponding systems. We will make these available.
tope is the Tesla S870 (compute capability 1.0) system. It currently has the CUDA v1.1 SDK and toolkit installed. To compile the examples in the SDK, one should perform the following (or similar) steps:
ssh tope.sharcnet.ca
cp -rL /opt/sharcnet/cuda-toolkit+sdk/current /work/$USER/cuda_sdk
cd /work/$USER/cuda_sdk
Before proceeding, one has to modify the SDK common makefile as follows:
CUDA_INSTALL_PATH ?="/opt/sharcnet/cuda-toolkit+sdk/current"
OPENGLLIB := -lGL -lGLU /usr/lib64/libglut.so.3
Once the modifications are made, return to the root SDK directory and attempt to compile all of the examples in the SDK:
cd /work/$USER/cuda_sdk
make
This will fail for a few of the examples; to compile a particular example on its own, do the following. To see all of the examples:
[sn_user1@tope cuda_sdk]$ ls projects/
alignedTypes BlackScholes convolutionTexture eigenvalues imageDenoising matrixMulDrv nbody scalarProd simpleCUFFT simpleTextureDrv
asyncAPI boxFilter cppIntegration fastWalshTransform lineOfSight MersenneTwister oceanFFT scan simpleGL SobelFilter
bandwidthTest clock deviceQuery fluidsGL Mandelbrot MonteCarlo particles scanLargeArray simpleStreams template
binomialOptions convolutionFFT2D dwtHaar1D histogram256 marchingCubes MonteCarloMultiGPU postProcessGL simpleAtomics simpleTemplates transpose
bitonic convolutionSeparable dxtc histogram64 matrixMul multiGPU reduction simpleCUBLAS simpleTexture
Now, to compile the transpose example, build it in its project directory, then go to the executable directory and run the GPU-accelerated CUDA program:
[sn_user1@tope cuda_sdk]$ cd projects/transpose/
[sn_user1@tope transpose]$ make
[sn_user1@tope transpose]$ cd ../../bin/linux/release/
[sn_user1@tope release]$ ./transpose
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time: 3.298 ms
Optimized transpose average time: 0.305 ms
Test PASSED
Press ENTER to exit...
[sn_user1@tope release]$
Further Information
To learn more about CUDA, one should read the Programming Guide, which can be found in the docs directory of the SDK on tope. The most recent CUDA documentation can be found online at the CUDA Zone. An excellent article series on CUDA was written by Rob Farber in Dr. Dobb's Journal; it is highly recommended.
AMD Stream Processing
Under Construction
© 2008,2009, Hugh Merz, SHARCNET