STA 250 :: Homework Policy
For all questions you must show your work. This enables us to understand your thought process, award partial credit, and prevent crude cheating. Please see the code of conduct in the Syllabus for rules about collaborating on homeworks.
For questions requiring computing, if you use R, Python, or any other programming environment, you must turn in a printout of your output with your solutions.
In addition, a copy of your code must be uploaded to the appropriate HW directory of your forked GitHub repo within 24 hours of the homework due date.
Homework 04
Due: In Class, Fri December 6th
Assigned: Wednesday November 27th
GPU Module
In this homework you will implement several calculations using both
the CPU and GPU. For the GPU-related problems, you are welcome to
use either R (RCUDA), Python (PyCUDA) or C (CUDA C).
- Sync your fork of the course GitHub repo to include the latest updates using the instructions provided here.
- In this question, you will implement a kernel to obtain samples from a truncated normal random variable, and test your code.
- Write a kernel in CUDA C to obtain samples from a truncated normal random variable of the form:
\[
X \sim \textrm{TN}\left(\mu,\sigma^{2};\left(a,b\right)\right) \equiv N(\mu,\sigma^{2})\mathbf{1}_{\left\{X\in\left(a,b\right)\right\}} .
\]
Your function should have (roughly) the following structure:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand_kernel.h>
#include <math_constants.h>
extern "C"
{
__global__ void
truncnorm_kernel(
float *x, // Vector to contain returned samples
int n, // Number of samples to return
float *mu, // Vector of mu's
float *sigma, // Vector of sigma's
float *a, // Vector of lower-truncation values
float *b) // Vector of upper-truncation values
{
// Compute the (flattened) thread index, check that it is < n,
// initialize the per-thread curand state, and sample x[idx] from
// TN(mu[idx], sigma[idx]^2; (a[idx], b[idx])) ...
return;
}
} // END extern "C"
Important notes:
- In the above prototype, the function is vectorized for all arguments, i.e., `mu`, `sigma`, `a` and `b` are all vectors. You could add more arguments to allow for recycling of scalar arguments if desired.
- The above format provides the simplest interface, i.e., each RNG needs to be initialized within its thread (unless you do something more sophisticated). If you want to develop more efficient code, consider passing in pre-initialized random number generator states (and the number of RNGs). However, I strongly urge you to implement a simple version first and get it working before trying to optimize for speed.
- As explained in class, the inverse-CDF method is feasible, but requires that you create a `__device__` function for both the CDF and inverse CDF of a normal distribution.
- The simplest sampling method to implement is rejection sampling, for which you will want to specify the maximum number of attempts. This method will work well in most settings.
- For truncation regions in the tail of the distribution, rejection sampling may be inefficient or infeasible, so you will need fall-back code for these cases. One option is to implement the algorithms in Robert (2009). Another option is a simpler but less accurate approximate method discussed in class. Note that, given more time, the methods in the Robert paper are far superior to the approximate methods discussed.
- To select the grid/block dimensions, it is recommended that you use the function inside `utility.R` in the GitHub repo.
- Compile your CUDA kernel using `nvcc` and check it can be launched properly.
The basic steps for doing this in RCUDA are as follows:
- Compile to ptx with `nvcc --ptx -o rtruncnorm.ptx rtruncnorm.cu`
- Use `loadModule` to load in the ptx file
- Copy a vector to the device to store the samples
- Select the grid/block dimensions
- Launch the kernel with `.cuda`
- Copy the samples back to the CPU via `[]` or `copyFromDevice`
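For concreteness, a minimal RCUDA sketch of these steps might look roughly like the following (the kernel and file names match the example above; the exact `.cuda` argument names and any float-vs-double handling may differ depending on your RCUDA version, so treat this as an outline rather than finished code):

library(RCUDA)

m <- loadModule("rtruncnorm.ptx")      # ptx file produced by nvcc --ptx
k <- m$truncnorm_kernel                # grab the kernel by name

N <- 10000L
mu <- rep(2, N); sigma <- rep(1, N)    # vectorized parameters
a  <- rep(0, N); b     <- rep(1.5, N)

cu_x <- copyToDevice(rep(0, N))        # device vector to hold the samples

# grid/block dimensions, e.g. from the helper in utility.R
bdim <- c(512L, 1L, 1L)
gdim <- c(as.integer(ceiling(N / 512)), 1L, 1L)

.cuda(k, cu_x, N, mu, sigma, a, b, gridDim = gdim, blockDim = bdim)

samples <- cu_x[]                      # copy the samples back to the CPU
mean(samples)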
- Sample 10,000 random variables from \(TN(2,1;(0,1.5))\), and verify that the sample mean (roughly) matches the theoretical expected value (see class notes for details).
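The theoretical value can be computed directly from the standard truncated normal mean formula; a small R helper (the name `tn_mean` is just illustrative) is:

tn_mean <- function(mu, sigma, a, b) {
  # E[X] for X ~ TN(mu, sigma^2; (a, b)):
  # mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))
  alpha <- (a - mu) / sigma
  beta  <- (b - mu) / sigma
  mu + sigma * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha))
}
tn_mean(2, 1, 0, 1.5)   # compare to the mean of your 10,000 GPU samples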
- Write an R/Python/C function for sampling truncated normal random variables (possibly using a different algorithm). You may also use the code provided in the GitHub repo. Sample 10,000 random variables from this function and verify that the mean (roughly) matches the theoretical value.
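As a starting point, a naive vectorized rejection sampler in R might look like the sketch below; the name `rtruncnorm_cpu` and the `maxtries` argument are illustrative, and the tail-region fall-back discussed above is left out:

rtruncnorm_cpu <- function(n, mu, sigma, a, b, maxtries = 2000) {
  # Recycle scalar arguments to length n
  mu    <- rep(mu,    length.out = n)
  sigma <- rep(sigma, length.out = n)
  a     <- rep(a,     length.out = n)
  b     <- rep(b,     length.out = n)
  x    <- rep(NA_real_, n)
  need <- seq_len(n)                        # indices still awaiting a sample
  for (try in seq_len(maxtries)) {
    if (length(need) == 0) break
    prop <- rnorm(length(need), mu[need], sigma[need])
    ok   <- prop > a[need] & prop < b[need] # accept proposals inside (a, b)
    x[need[ok]] <- prop[ok]
    need <- need[!ok]
  }
  if (length(need) > 0)
    warning("maxtries reached; a tail-specific method is needed for these cases")
  x
}

# Sanity check against the theoretical mean of TN(2, 1; (0, 1.5))
mean(rtruncnorm_cpu(10000, 2, 1, 0, 1.5))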
- Time your (R/Py/C)CUDA function and your pure R/Python/C function for \(n=10^{k}\), \(k=1,2,\ldots,8\). Plot the total runtimes of both functions (y-axis) against \(n\) on a log scale (x-axis). At what point did/do you expect the GPU function to outperform the non-GPU function? You may also want to decompose the GPU runtimes into copy-to/kernel/copy-back times for further detailed analysis.
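A possible timing harness, assuming the `rtruncnorm_cpu` sketch above and a hypothetical wrapper `rtruncnorm_gpu` around your kernel launch (including the host-device copies), is:

ns <- 10^(1:8)
times <- sapply(ns, function(n) c(
  gpu = system.time(rtruncnorm_gpu(n, 2, 1, 0, 1.5))["elapsed"],
  cpu = system.time(rtruncnorm_cpu(n, 2, 1, 0, 1.5))["elapsed"]))

matplot(log10(ns), t(times), type = "b", pch = 1:2, col = 1:2,
        xlab = "log10(n)", ylab = "total runtime (seconds)")
legend("topleft", legend = c("GPU", "CPU"), pch = 1:2, col = 1:2, lty = 1:2)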
- Verify that both your GPU and CPU code work for \(a=-\infty\) and/or \(b=+\infty\).
- Verify that both your GPU and CPU code work for truncation regions in the tail of the distribution e.g., \(a=-\infty,b=-10,\mu=0,\sigma=1\).
- In this question you will implement Probit MCMC, i.e., fitting a Bayesian Probit regression model using MCMC. This model turns out to be computationally convenient, lending itself to a Gibbs sampling algorithm in which each full conditional distribution can be sampled directly. The model is as follows:
\[ Y_i | Z_i \sim \mathbf{1}_{\left\{Z_{i}>0\right\}} \\
Z_i | \beta \sim N(x_{i}^{T}\beta,1) \\
\beta \sim N(\beta_0,\Sigma_0) ,
\]
where \(\beta_0\) is a \(p\times{}1\) vector corresponding to the prior mean, and \(\Sigma_{0}^{-1}\) is the prior precision matrix. Note that for convenience we supply \(\Sigma_{0}^{-1}\) as an argument to the probit MCMC function, to allow for flat priors for \(\beta\).
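For reference, a standard derivation gives the full conditionals needed for the Gibbs sampler:
\[
Z_i \mid \beta, y_i \sim \textrm{TN}\left(x_i^{T}\beta, 1; (0,\infty)\right) \textrm{ if } y_i = 1,
\qquad
Z_i \mid \beta, y_i \sim \textrm{TN}\left(x_i^{T}\beta, 1; (-\infty,0)\right) \textrm{ if } y_i = 0,
\]
\[
\beta \mid Z \sim N\left(V\left(\Sigma_{0}^{-1}\beta_0 + X^{T} Z\right), V\right),
\qquad V = \left(\Sigma_{0}^{-1} + X^{T} X\right)^{-1}.
\]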
- Write a C/R/Python function `probit_mcmc_cpu` to sample from the posterior distribution of \(\beta\) using the CPU only. The recommended function prototype is:
probit_mcmc_cpu(
y, # vector of length n
X, # (n x p) design matrix
beta_0, # (p x 1) prior mean
Sigma_0_inv, # (p x p) prior precision
niter, # number of post burnin iterations
burnin # number of burnin iterations
)
Your function should return the posterior samples of \(\beta\) as a matrix/array or `mcmc` object (if using `R`). The posterior samples of \(Z\) do not need to be returned (and should not be stored -- they will take up too much memory!).
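A minimal sketch of this sampler in R, assuming a CPU truncated normal sampler from question 1 (called `rtruncnorm_cpu` here, an illustrative name) and the full conditionals given above, could look like:

probit_mcmc_cpu <- function(y, X, beta_0, Sigma_0_inv, niter, burnin) {
  n <- nrow(X); p <- ncol(X)
  V <- solve(Sigma_0_inv + crossprod(X))   # posterior covariance of beta | Z
  R <- chol(V)                             # for drawing beta ~ N(m, V)
  lo <- ifelse(y == 1, 0, -Inf)            # truncation region implied by y_i
  hi <- ifelse(y == 1, Inf, 0)
  beta <- rep(0, p)
  out  <- matrix(NA_real_, nrow = niter, ncol = p)
  for (it in seq_len(burnin + niter)) {
    # Z_i | beta, y_i ~ TN(x_i' beta, 1; (lo_i, hi_i))
    Z <- rtruncnorm_cpu(n, as.vector(X %*% beta), 1, lo, hi)
    # beta | Z ~ N(V (Sigma_0_inv beta_0 + X' Z), V)
    m <- V %*% (Sigma_0_inv %*% beta_0 + crossprod(X, Z))
    beta <- as.vector(m + t(R) %*% rnorm(p))
    if (it > burnin) out[it - burnin, ] <- beta
  }
  out                                      # wrap with coda::mcmc() if desired
}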
- Write an (R/Py/C)CUDA function `probit_mcmc_gpu` to sample from the posterior distribution of \(\beta\) using the CPU and GPU. The recommended function prototype is:
probit_mcmc_gpu(
y, # vector of length n
X, # (n x p) design matrix
beta_0, # (p x 1) prior mean
Sigma_0_inv, # (p x p) prior precision
niter, # number of post burnin iterations
burnin, # number of burnin iterations
block_dims, # block dimensions
grid_dims # grid dimensions
)
You can also compute the block and grid dimensions within your function if preferred. Note that the GPU should only be used for the sampling of the \(Z_i\) vector. Your function should return the posterior samples of \(\beta\) as a matrix/array or `mcmc` object (if using `R`). The posterior samples of \(Z\) do not need to be returned (and should not be stored -- they will take up too much memory!).
- Test your code by fitting the mini dataset `mini_test.txt`. This dataset can be generated by running the file `sim_probit.R`, supplied in the course GitHub repo. The first column of the dataset corresponds to `y`, with all other columns corresponding to `X`. Assume prior parameters \(\beta_{0}=\vec{0}\) and \(\Sigma_{0}^{-1}=0\). Verify that both functions give posterior means/medians that are at least relatively close to the true values (in `mini_pars.txt`, also generated
when you run `sim_probit.R`) and the estimates produced by standard GLM functions.
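For the GLM comparison, something along the following lines should suffice in R (assuming the design matrix in `mini_test.txt` already contains an intercept column, so R's automatic intercept is dropped with `- 1`; adjust the file-reading options as needed):

dat <- read.table("mini_test.txt", header = TRUE)
y <- dat[, 1]
X <- as.matrix(dat[, -1])

# Reference fit with a probit link, for comparison with the posterior means/medians
glm_fit <- glm(y ~ X - 1, family = binomial(link = "probit"))
coef(glm_fit)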
- Run `sim_probit.R` to create each of the following datasets. Then analyze as many as possible, using both your CPU and GPU code:
- `data_01.txt`: \(n=1000\) for `niter=2000`, `burnin=500`.
- `data_02.txt`: \(n=10000\) for `niter=2000`, `burnin=500`.
- `data_03.txt`: \(n=100000\) for `niter=2000`, `burnin=500`.
- `data_04.txt`: \(n=1000000\) for `niter=2000`, `burnin=500`.
- `data_05.txt`: \(n=10000000\) for `niter=2000`, `burnin=500`.
Again, the first column of each dataset corresponds to `y`, with all other columns corresponding to `X`. Record the runtimes of `probit_mcmc_cpu` and `probit_mcmc_gpu` for each of the datasets.
Note I: If your code is too slow, you may also run fewer iterations. It may take several minutes to fit each probit model using the GPU code.
Note II: The number of iterations used here is not sufficient to obtain reliable posterior estimates. This exercise is for illustration and learning purposes; if you want to actually use the estimates, be sure to run the chains for longer and check the usual diagnostics!
- Discuss the relative performance of your CPU and GPU code. At what point do you think your GPU code would become competitive with the CPU code?
(: Happy Coding! :)