STA 250 Lecture 18

Advanced Statistical Computation

Paul D. Baines

Homework 4

Homework 4 has been posted. It should be a fun foray into the world of GPUs.

You will get to see the potential and the drawbacks of using GPUs via higher-level languages.

For Q2(d), just do the best you can, i.e., analyze as many of the datasets as you can. Given the time constraints, it is fine to run for fewer iterations, just enough to get things running and to allow for runtime comparisons.

For example, my RCUDA code takes a long time to run on dataset 5. So if you only manage to run, say, the first 3 datasets, that is fine. Or run datasets 4 and 5 for a very small number of iterations (e.g., 100). Obviously the posterior samples are not useful at that point, but this homework is more about learning CUDA, so that is fine.

Sampling Truncated Normal Random Variates

I have posted a link to a paper by Christian Robert about efficient sampling for truncated normal variates.

As explained last week, sampling truncated random variates on GPUs is easier using rejection sampling, but some care is needed to handle cases where rejection sampling is prohibitively inefficient. For the MCMC sampling in Q2 it is very important that your sampling function handles all cases well (otherwise the MCMC will break).

Key things to do:

  • Add a maxtries argument to avoid getting stuck in an endless while loop
  • Handle the corner cases via rejection sampling, as per the Robert paper (see the sketch after this list)
  • Approximate methods may also be acceptable if their empirical performance is verified
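
To make this concrete, here is a minimal host-side sketch in Python/numpy of a lower-truncated normal sampler (the homework version would live inside your CUDA kernel, not in Python). The 0.45 cutoff and the maxtries fallback values are illustrative choices, not prescriptions from the paper.

import numpy as np

def rtruncnorm_tail(a, rng, maxtries=1000):
    """Sample Z ~ N(0,1) truncated to [a, inf) using the
    translated-exponential proposal of Robert (1995)."""
    # Optimal rate for the Exp proposal shifted to start at a:
    alpha = (a + np.sqrt(a * a + 4.0)) / 2.0
    for _ in range(maxtries):
        x = a + rng.exponential(1.0 / alpha)            # propose
        if rng.uniform() <= np.exp(-0.5 * (x - alpha) ** 2):
            return x                                    # accept
    # Defensive fallback after maxtries draws: return the proposal
    # mean (an approximation) rather than hanging the MCMC.
    return a + 1.0 / alpha

def rtruncnorm(mu, sigma, lo, rng, maxtries=1000):
    """Sample from N(mu, sigma^2) truncated to [lo, inf)."""
    a = (lo - mu) / sigma        # standardized lower bound
    if a < 0.45:                 # illustrative cutoff: naive rejection
        for _ in range(maxtries):    # is efficient this far left
            z = rng.standard_normal()
            if z >= a:
                return mu + sigma * z
        return mu + sigma * max(a, 0.0)  # effectively unreachable
    return mu + sigma * rtruncnorm_tail(a, rng, maxtries)

# Example usage:
rng = np.random.default_rng(0)
draws = [rtruncnorm(0.0, 1.0, 3.0, rng) for _ in range(5)]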

CURAND

Be careful with how you set up the curand generators across kernel launches and threads; a PyCUDA sketch of the recommended pattern follows the quote below.

As per the CUDA documentation:

For the highest quality parallel pseudorandom number generation, each experiment should be assigned a unique seed. Within an experiment, each thread of computation should be assigned a unique sequence number. If an experiment spans multiple kernel launches, it is recommended that threads between kernel launches be given the same seed, and sequence numbers be assigned in a monotonically increasing way. If the same configuration of threads is launched, random state can be preserved in global memory between launches to avoid state setup time.

See: http://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-overview
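
A minimal, hypothetical PyCUDA sketch of this pattern (the kernel and variable names are mine, not from the homework). It assumes the default XORWOW generator, whose per-thread state is 48 bytes; check sizeof(curandState) if in doubt.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# no_extern_c=True because curand_kernel.h is a C++ header:
mod = SourceModule("""
#include <curand_kernel.h>

extern "C" {
// One-time setup: same seed for all threads, unique sequence per thread.
__global__ void setup_rng(curandState *state, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &state[i]);
}

// Each launch continues the same streams: read the state from global
// memory, use it, and write it back.
__global__ void draw_normals(curandState *state, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState local = state[i];
        out[i] = curand_normal(&local);
        state[i] = local;
    }
}
}
""", no_extern_c=True)

n = 512
# 48 bytes per curandState for the default XORWOW generator (assumed):
state = drv.mem_alloc(n * 48)
out = np.zeros(n, dtype=np.float32)

setup_rng = mod.get_function("setup_rng")
draw_normals = mod.get_function("draw_normals")

setup_rng(state, np.uint64(1234), np.int32(n), block=(128,1,1), grid=(4,1))
draw_normals(state, drv.Out(out), np.int32(n), block=(128,1,1), grid=(4,1))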

RCUDA

RCUDA provides full bindings to the NVIDIA CUDA API for R users, i.e., a mechanism to call any function in the CUDA API from within R. In addition, it provides higher-level functionality that can hide some of the memory management associated with CUDA.

  • Kernels still need to be written in CUDA C.
  • Kernels are compiled to PTX code using nvcc --ptx.
  • Kernels are loaded into R via modules.

PyCUDA

From http://mathema.tician.de/software/pycuda/

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won’t detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia’s C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA’s driver API at your disposal, if you wish.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA’s base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation.

Example

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

# Define the kernel within a SourceModule within Python:

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

# Extract the kernel function
multiply_them = mod.get_function("multiply_them")

Example cont...

In PyCUDA:

m = SourceModule("""
// Define kernel
""")
my_kernel = m.get_function("multiply_them")

Is the analogue of the following RCUDA code:

system("nvcc --ptx -o foo.ptx foo.cu")
m <- loadModule("foo.ptx")
my_kernel <- m$multiply_them

Example cont...

For PyCUDA:

# Sample two random vectors:
a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)

# Storage for the result:
dest = np.zeros_like(a)

# Launch the kernel as a regular function:
multiply_them(drv.Out(dest), drv.In(a), drv.In(b),
              block=(400,1,1), grid=(1,1))

print(dest - a*b)

The kernel launch specifies the block and grid dimensions, just as with RCUDA's .cuda function.

See hello_world.py in the course GitHub repo for a slight modification of this example.

PyCUDA Basics

Just as with RCUDA, we have to be very careful to ensure that the supplied data types match what CUDA is expecting.

For PyCUDA, arguments to the kernel should be numpy datatypes (typically float32, int32, or vectors thereof). For example, to pass the value 100 as an int, use:

n = np.int32(100)

The block and grid sizes must be specified as regular Python ints:

kernel_func(..., n, ..., block=(int(n),1,1), grid=(1,1))

Do not wrap scalar arguments with drv.In() or drv.InOut(); those are only used for pointer (array) arguments.
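
Putting these conventions together, a small hypothetical launch might look like the following (reusing the imports from the earlier example; the scale kernel is purely illustrative):

mod = SourceModule("""
__global__ void scale(float *dest, float *a, float c, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dest[i] = c * a[i];
}
""")
scale = mod.get_function("scale")

a = np.random.randn(1000).astype(np.float32)
dest = np.zeros_like(a)
n = np.int32(len(a))      # scalar: pass bare, as an np.int32
c = np.float32(2.0)       # scalar: pass bare, as an np.float32

# Arrays are wrapped in drv.In()/drv.Out(); block/grid sizes are plain ints:
scale(drv.Out(dest), drv.In(a), c, n, block=(128,1,1), grid=(8,1))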

PyCUDA vs. RCUDA

See PyCUDA/example1 in the GitHub repo.

Evaluating 10^8 normal densities:

PyCUDA SourceModule time:   0.153s
scipy time:                28.126s
RCUDA time (total):         2.052s
  Copy to device:           0.727s
  Kernel:                   0.028s
  Copy from device:         1.195s
Vanilla R time:             6.264s

Clearly PyCUDA is smarter with copying than naive usage of RCUDA.

Note: Those doing the homework with PyCUDA will likely get far faster code than unoptimized RCUDA code.

CUBLAS

Neither RCUDA nor PyCUDA currently offers access to the CUBLAS library.

The reason is that CUBLAS is built on top of the CUDA runtime API, while PyCUDA wraps the driver API. As the CUDA documentation explains, [CUDA] is composed of two APIs:

  • A low-level API called the CUDA driver API,
  • A higher-level API called the CUDA runtime API that is implemented on top of the CUDA driver API. These APIs are mutually exclusive: An application should use either one or the other.

Ditto for RCUDA.

See the CUDA documentation for more details.

PyCUDA

See the code examples in the course GitHub repo for GPUArray usage and other more involved functionality.

GPUArray provides neat functionality: arrays on the GPU can be treated very similarly to regular numpy arrays. They have sum, min, max, and other methods, and can be added, subtracted, multiplied, etc.

To transfer back to the host, use the get() method.

PyCUDA also enables direct generation of random numbers into a GPUArray (via pycuda.curandom), as in the sketch below.
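
For instance, a minimal sketch of these idioms (the array names are illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.curandom as curandom

# Transfer a numpy array to the device:
a_gpu = gpuarray.to_gpu(np.random.randn(400).astype(np.float32))

# Elementwise arithmetic stays on the GPU:
b_gpu = 2 * a_gpu + 1

# Reductions (these return single-element device arrays):
total = gpuarray.sum(b_gpu).get()
biggest = gpuarray.max(b_gpu).get()

# Generate uniform random numbers directly into a GPUArray:
u_gpu = curandom.rand((400,), dtype=np.float32)

# Transfer back to the host:
b = b_gpu.get()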

That is enough for today... :)