STA 250 Lecture 16

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

Today is the "Efficient Computing: Parallelization and GPUs" module (otherwise known as the GPU module), lecture 1.

Reminder: Homework 3 due Wednesday.

Note: No code swap for homework 3 (not enough code to make it worthwhile)

On the menu for today...

  1. Intro to GPUs

  2. Programming GPUs

Credit: Lots of slides taken from the web!

GPUs @ UCD

For HW4 you will have a chance to program some algorithms on some GPUs kindly provided by NVIDIA and Duncan Temple Lang. The main GPUs (Tesla K20s) reside on lipschitz and pearson.

To obtain access, please email nnismail@ucdavis.edu with your UCD email and public key. Note: Stat/Biostat students who already have access to these servers do not need to email Nehad.

# To login:
ssh username@lipschitz.ucdavis.edu
ssh username@pearson.ucdavis.edu

Background and History

Graphics Processing Units (GPUs) are specialized units designed for rendering computer graphics.

They work very differently from CPUs (central processing units), which perform the bulk of the tasks on your computer.

Rendering high-definition computer graphics quickly and smoothly requires billions of simple calculations to be performed in seconds. GPUs are designed specifically for this task.

In recent years, there has been a great deal of progress in using GPUs for more general-purpose calculations, not just graphics.

NVIDIA (with its CUDA language) is at the forefront of this effort. The other main GPU producer, AMD (with the OpenCL language, backed by Apple), also offers potential in this arena.

Low-Level Programming for GPUs

About CUDA

CUDA is...

  • a set of C/C++ libraries and extensions allowing the coder to use the GPU
  • a fine-grained, low-level language (the user controls all memory management, synchronization, etc.)
  • for NVIDIA GPUs only (will not work on AMD GPUs)

  • There are also newer, higher-level interfaces to CUDA that do much of the dirty work for you; we will see these in later lectures.

Before we talk specifics... what you need to know...

Types of Parallelism

Two main types of parallelism:

  • Type I: Task Parallelism:
    The idea is to parallelize different tasks that do not depend on other uncompleted tasks.
    The tasks being parallelized can be completely different.

    Example: Computing multivariate normal densities:
  1. Compute Cholesky decomposition
  2. Compute inverse of Cholesky factor
  3. Compute determinant of Cholesky factor
  4. Finish computing the density
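
To see where these steps come from, here is the standard identity (added for context, not from the original slide): with Sigma factored by its lower-triangular Cholesky factor L, the log-density of a d-dimensional normal is

\log p(x) = -\frac{d}{2}\log(2\pi) - \sum_{i=1}^{d}\log L_{ii} - \frac{1}{2}\left\| L^{-1}(x-\mu) \right\|^{2}, \qquad \Sigma = LL^{\top}

Once step 1 (the Cholesky factorization) is complete, steps 2 and 3 act on the same factor L but do not depend on each other, so they can be executed as parallel tasks.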

Task Parallelism Example


Credit: CS264 (N. Pinto)

Parallelism II: Data Parallelism

GPUs are not especially useful for task parallelism (CPUs are), but they are well suited to a different kind of parallelism: data parallelism.

Type II: Data Parallelism:
Perform the same task on multiple pieces of data.

Examples:

  • Matrix multiplication: same task (multiplication), on multiple pieces of data (matrix elements)
  • Numerical integration: same task (function evaluation), on multiple pieces of data (integration grid)
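
As a concrete sketch of data parallelism in CUDA, here is a minimal vector-addition kernel (hypothetical; the kernel name and signature are illustrative, not from the course examples). Each thread performs the same task (an addition) on a different piece of data (one array element):

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    // Each thread computes its own global index...
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ...and handles exactly one element (the guard allows n to be
    // smaller than the total number of threads launched).
    if (i < n)
        c[i] = a[i] + b[i];
}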

Data Parallelism Example


Credit: CS264 (N. Pinto)

CPU vs. GPU


ALU: Arithmetic Logic Unit (thing that does calculations!) Credit: CS264 (N. Pinto)

  • CPU: Lots of fast memory (cache), few ALUs
  • GPU: Little fast memory, lots of ALUs


Credit: CS264 (N. Pinto)

Grids and Blocks

When programming in CUDA it is generally up to the programmer to determine the grid/block structure. The choice of grid/block arrangement can have a large impact on efficiency, so we will see some experiments to select appropriate sizes.

  • 3D Grid: gridDim.x, gridDim.y, gridDim.z
  • 3D Block: blockDim.x, blockDim.y, blockDim.z

Note: You do not need to use all dimensions (e.g., a 1D grid of 2D blocks is fine).

Later on, we will also see more general approaches to automatically determining the block/grid structure.
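
For example, a launch configuration might look like the following sketch (hypothetical kernel and variable names; a 1D grid of 2D blocks, as mentioned in the note above):

dim3 block(16, 16);                  // blockDim.x = 16, blockDim.y = 16 (256 threads per block)
dim3 grid((n + 255) / 256);          // 1D grid: gridDim.x blocks, gridDim.y = gridDim.z = 1

my_kernel<<<grid, block>>>(d_data, n);   // my_kernel and d_data are placeholders

// Inside the kernel, a global 1D index can be recovered from the
// built-in variables, e.g.:
//   int i = blockIdx.x * (blockDim.x * blockDim.y)
//         + threadIdx.y * blockDim.x + threadIdx.x;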

10-Series Tesla Architecture

Terminology:

  • Host: The CPU
  • Device: The GPU
  • Kernel: Function that runs on the device
  • Thread: Think of a thread as a series of calculations/operations

So...

  • Kernels are typically executed by lots of threads
  • One kernel is executed at a time
  • Threads are cheap to launch on GPUs
  • Gains in efficiency come with using large numbers of threads to perform calculations in parallel

CUDA

Basics of CUDA:

  • Memory management: The GPU has its own memory, which must be allocated, (possibly) initialized, and freed.
  • Data transfer: Data required by the GPU is copied from the host to the device.
  • Kernel launch: The kernel is launched, with specified grid/block configuration.
  • Result transfer: If needed, the results must be copied back from the CUDA device to the host.

Notes:

  • These distinctions are becoming blurred as the CUDA API develops and hybrid CPU-GPU systems emerge. For example, using pinned memory, host memory can be accessed directly by the GPU (time permitting, we may see some examples of this).
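
Putting the four steps above together, here is a minimal end-to-end sketch (hypothetical kernel and variable names, not one of the course examples):

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: double each element of an array.
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_x[1024];                                        // host array
    for (int i = 0; i < n; i++) h_x[i] = (float) i;

    float *d_x;
    cudaMalloc((void **) &d_x, bytes);                      // 1. memory management: allocate on the device
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);    // 2. data transfer: host -> device

    scale<<<(n + 255) / 256, 256>>>(d_x, n);                // 3. kernel launch: grid of 256-thread blocks

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);    // 4. result transfer: device -> host
    cudaFree(d_x);                                          // free the device memory

    printf("h_x[10] = %f\n", h_x[10]);                      // expect 20.0
    return 0;
}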

CUDA: Hello World!

  • example0: Hello world! (See code).

  • example1: Illustrating cudaMalloc and cudaMemcpy (See code).

  • example2: Hello world! (See code).
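
For reference, a minimal hello-world kernel might look like the sketch below (hypothetical; the actual example0/example2 code may differ). Note that device-side printf requires a GPU of compute capability 2.0 or higher:

#include <stdio.h>

__global__ void hello(void)
{
    // Each thread prints its own block/thread coordinates.
    printf("Hello world from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();          // 2 blocks of 4 threads = 8 greetings
    cudaDeviceSynchronize();    // wait for the kernel (and its output) to complete
    return 0;
}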

Arrays of Threads

Thread Batching

The NVIDIA Compiler

CUDA code is compiled by the NVIDIA compiler nvcc, which functions in much the same way as gcc and g++ for those familiar with C and C++. Linking and header files require care (just as they do with vanilla C/C++).
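
For instance, extra include paths and libraries are passed to nvcc in the familiar gcc/g++ style (the file, path, and library names below are purely illustrative):

# Hypothetical example: compile a CUDA source file that uses cuBLAS,
# pointing nvcc at extra headers and libraries.
nvcc my_prog.cu -o my_prog -I/path/to/extra/include -L/path/to/extra/lib -lcublas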

Example CUDA Program

My example, modified from some code on the NVIDIA forums:

See CUDA_example_01.cu

Compile with:

nvcc CUDA_example_01.cu # plain: makes a.out
nvcc CUDA_example_01.cu -use_fast_math -o CUDA_example_01.out

Run with:

./CUDA_example_01.out

Compiling Schematic

Kernel Memory Access

CUDA Variable Types

CUDA Variable Performance

CUDA Variable Scale

CUDA Performance Example


Credit: CS264 (N. Pinto)

Perspective on GPUs

What tasks are they good for?

  • ☺ Numerical integration (nearly always)
  • ☺ (Very) slow iteration MCMC (use within-iteration parallelism)
  • ☺ "Simple" bootstraps
  • ☺ Particle Filtering (Sequential Monte Carlo)
  • ☺ (Extremely difficult) brute force optimization
  • ☺ Large matrix calculations (with sufficient expertise)
  • ☺ Single-use applications

Perspective on GPUs

What tasks are they not good for?

  • ☹ Fast iteration MCMC
  • ☹ "Difficult" bootstraps
  • ☹ (Many) optimization problems
  • ☹ Methodological work (portable code) [may change]
  • ☹ Any problem that is not worth the additional effort...

Resources

Getting started:

  • Find a CUDA-enabled computer and install CUDA first!
  • For those without an NVIDIA GPU, use Pearson + Gauss
  • NVIDIA GPU Computing SDK has lots of (rich) examples
  • Work through lecture examples, try to modify
  • See lecture links on course website for more examples

Appendix: Installation (Mac & Linux)

CUDA on my MacBook Pro (OS X 10.8)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors x (  8) CUDA Cores/MP:    48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  ...

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M

Testing CUDA

$ ./bandwidthTest # locate bandwidthTest to find in SDK
 Device 0: GeForce 320M

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         1024.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         1565.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         6448.6

That is enough for today... :)