For all questions you must show your work. This enables us to understand your thought process, give partial credit, and prevent crude cheating. Please see the code of conduct in the Syllabus for rules about collaborating on homework.
For questions requiring computing, if you use R, Python, or any other programming environment then you must turn in a printout of your output with your solutions. In addition, a copy of your code must be uploaded to the appropriate HW directory of your forked GitHub repo within 24 hours of the homework due date.
In this homework you will fit a Bayesian model (or models) using MCMC, validate your code using a simulation study, and then apply your method to a real dataset.
In the HW1/BayesLogit directory of your GitHub repo on Gauss, run the script BLR_sim.sh by executing the command:

sarray ./BLR_sim.sh

After the job completes, you should have 200 data files and 200 parameter files inside the data folder. The datasets are generated with the following specifications:
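The exact generating specifications used by BLR_sim.sh are not reproduced here, but the general shape of such a simulator can be sketched as follows. This is an illustrative stand-in, not the provided script: the choice of an intercept plus one standard-normal covariate, the true coefficients, and the range of trial counts are all assumptions.

```python
import numpy as np

def simulate_blr_dataset(n=100, beta=(0.5, -0.5), m_max=10, seed=0):
    """Simulate one logistic-regression dataset (illustrative only).

    NOTE: the actual specifications used by BLR_sim.sh are not shown in
    this handout; the design, coefficients, and trial counts below are
    placeholder assumptions.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # n x 2 design
    m = rng.integers(1, m_max + 1, size=n)         # trials per observation
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # logistic success probs
    y = rng.binomial(m, p)                         # successes per observation
    return m, y, X, beta
```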
Write a script blr_fit.[R,py] to fit the Bayesian Logistic Regression model to the dataset you have copied to your laptop/desktop. If using R, you should write a function with prototype:
"bayes.logreg" <- function(m,y,X,beta.0,Sigma.0.inv,
niter=10000,burnin=1000,
print.every=1000,retune=100,
verbose=TRUE){
...
}
If using Python, use:

def bayes_logreg(m, y, X, beta_0, Sigma_0_inv,
                 niter=10000, burnin=1000,
                 print_every=1000, retune=100,
                 verbose=True):
    ...
In both cases the arguments are as follows:

m           : Vector containing the number of trials for each observation (of length \(n\))
y           : Vector containing the number of successes for each observation (of length \(n\))
X           : Design matrix (of dimension \(n\times{}p\))
beta.0      : Prior mean for \(\beta\) (of length \(p\))
Sigma.0.inv : Prior precision (inverse covariance) matrix for \(\beta\) (of dimension \(p\times{}p\))
niter       : Number of iterations to run the MCMC after the burnin period
burnin      : Number of iterations for the burnin period (draws will not be saved)
print.every : Print an update to the user after every period of this many iterations
retune      : Retune the proposal parameters every retune iterations. No tuning should be done after the burnin period is completed
verbose     : If TRUE then print lots of debugging output, else be silent

Run the code on the dataset, and verify that it produces posterior credible intervals that cover the true parameter values (at least to some extent).
The script must output the \(1,2,\ldots,99\%\) percentiles of the marginal posterior distributions for \(\beta_{0}\) and \(\beta_{1}\) to file as a 99-row, 2-column .csv (with no header).
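One reasonable way to implement bayes_logreg in Python is a Gaussian random-walk Metropolis sampler with acceptance-rate-based retuning during burnin, sketched below. The sampler choice, the target acceptance window, and the scale-adjustment factors are assumptions of this sketch; any correct MCMC scheme matching the prototype above is acceptable.

```python
import numpy as np

def bayes_logreg(m, y, X, beta_0, Sigma_0_inv,
                 niter=10000, burnin=1000,
                 print_every=1000, retune=100,
                 verbose=True):
    """Random-walk Metropolis sketch for Bayesian logistic regression.

    Assumes a Gaussian prior with mean beta_0 and precision Sigma_0_inv.
    The retuning rule (aim for a 25-50% acceptance rate) is one common
    heuristic, not a requirement of the assignment.
    """
    m = np.asarray(m, float); y = np.asarray(y, float); X = np.asarray(X, float)
    beta_0 = np.asarray(beta_0, float); Sigma_0_inv = np.asarray(Sigma_0_inv, float)
    p = X.shape[1]
    rng = np.random.default_rng()

    def log_post(beta):
        eta = X @ beta
        # Binomial log-likelihood (constants dropped) + Gaussian log-prior
        loglik = np.sum(y * eta - m * np.logaddexp(0.0, eta))
        d = beta - beta_0
        return loglik - 0.5 * d @ Sigma_0_inv @ d

    beta = beta_0.copy()
    lp = log_post(beta)
    step = 0.1                      # proposal scale, retuned during burnin
    accepted = 0
    draws = np.empty((niter, p))

    for t in range(burnin + niter):
        prop = beta + step * rng.standard_normal(p)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:    # Metropolis accept step
            beta, lp = prop, lp_prop
            accepted += 1
        # Retune the proposal scale every `retune` iterations, burnin only
        if t < burnin and (t + 1) % retune == 0:
            rate = accepted / retune
            if rate < 0.25:
                step *= 0.8
            elif rate > 0.50:
                step *= 1.2
            accepted = 0
        if t >= burnin:
            draws[t - burnin] = beta
        if verbose and (t + 1) % print_every == 0:
            print(f"iteration {t + 1}: beta = {beta}")
    return draws

# The required output is then the 1..99% percentiles of each marginal,
# written with no header (99 rows, one column per coefficient), e.g.:
#   q = np.percentile(draws, np.arange(1, 100), axis=0)
#   np.savetxt("percentiles.csv", q, delimiter=",")
```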
Copy BLR_fit.[R,py] to the BayesLogit directory of your repo on Gauss. Modify the code to take a single command-line argument (an integer from 1 to 200) and analyze the corresponding dataset from the results directory. Please see the files BLR_fit_skeleton.R and/or BLR_fit_skeleton.py for further details. Use the script BLR_fit_R.sh or BLR_fit_py.sh to fit the 200 simulated datasets. These are array job scripts, so run using:
sarray BLR_fit_[R,py].sh
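The command-line handling in the modified script can be sketched as below. The file-name patterns are assumptions modeled on the blr_res_1001.csv ... blr_res_1200.csv listing (job i maps to id 1000 + i); check BLR_fit_skeleton.[R,py] for the actual conventions, including the input file names, which are hypothetical here.

```python
import sys

def job_paths(arg, data_dir="data", results_dir="results"):
    """Map the single command-line argument (1..200) to input/output paths.

    The blr_data_*/blr_pars_* input names are hypothetical; only the
    blr_res_* output pattern is shown in the handout's results listing.
    """
    i = int(arg)
    if not 1 <= i <= 200:
        raise ValueError("dataset index must be between 1 and 200")
    sim_id = 1000 + i
    data_file = f"{data_dir}/blr_data_{sim_id}.csv"   # assumed input name
    pars_file = f"{data_dir}/blr_pars_{sim_id}.csv"   # assumed parameter name
    res_file = f"{results_dir}/blr_res_{sim_id}.csv"
    return data_file, pars_file, res_file

if __name__ == "__main__" and len(sys.argv) > 1:
    print(job_paths(sys.argv[1]))
```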
After analyzing each dataset, you should have 200 results files in the results directory:
$ ls results/
blr_res_1001.csv blr_res_1002.csv ... blr_res_1200.csv
Again, each file should contain 99 rows and 2 columns corresponding to the \(1,2,\ldots,99\%\) percentiles of the marginal posterior distributions of \(\beta_{0}\) and \(\beta_{1}\), e.g.:
$ head results/blr_res_1001.csv -n 2
-0.615697082535097,0.594913779947118
-0.610734788568127,0.601946320177968
Run the post-processing script blr_post_process.R on Gauss using:
sbatch blr_post_process.sh
The result will be a file coverage_line_plot.pdf showing the empirical vs. nominal coverage for your analysis of the 200 datasets. Numerical results will also be stored in coverage_summaries.txt and coverage_summaries.tex. If your coverage lines stray outside the guiding bands then you may want to check your code for bugs or run your MCMC for more iterations. (The guiding bands are based on a simple pointwise approximate Binomial calculation, so it is possible that your lines could fall outside the bands, but if they do so consistently then you likely have a problem!) If your coverage plot passes the test then congratulations -- you have coded your Bayesian Logistic Regression MCMC algorithm successfully.
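The empirical-vs-nominal coverage calculation behind the plot can be sketched as follows. This illustrates the idea only; the provided blr_post_process.R may differ in detail (for instance, in which nominal levels it evaluates). For nominal level \(L\%\) the central credible interval runs from the \((50 - L/2)\)th to the \((50 + L/2)\)th percentile, and the empirical coverage is the fraction of the 200 simulations whose true parameter falls inside that interval.

```python
import numpy as np

def empirical_coverage(percentiles, truth, levels=range(2, 100, 2)):
    """Empirical coverage of central posterior credible intervals.

    percentiles : array of shape (S, 99, p), the 1..99% percentiles
                  collected from the S results files
    truth       : array of shape (S, p), the true parameter values
    Returns a dict mapping nominal level (%) to a length-p array of
    empirical coverage proportions. Even levels only, so that the
    interval endpoints land on stored integer percentiles.
    """
    percentiles = np.asarray(percentiles)
    truth = np.asarray(truth)
    out = {}
    for L in levels:
        lo = percentiles[:, 50 - L // 2 - 1, :]   # (50 - L/2)th percentile
        hi = percentiles[:, 50 + L // 2 - 1, :]   # (50 + L/2)th percentile
        out[L] = np.mean((lo <= truth) & (truth <= hi), axis=0)
    return out
```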
What to turn in:

- coverage_summaries.txt or coverage_summaries.tex (placed inside a full report)
- coverage_line_plot.pdf
Next up, time to analyze a real dataset using your MCMC code. The dataset breast_cancer.txt in the GitHub repo is taken from the UCI Machine Learning Repository (see the repository entry for full references). The dataset has 10 covariates and a single response variable (either "M" for malignant or "B" for benign).
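Loading the data and recoding the response for use with your sampler can be sketched as below. The file layout assumed here (whitespace-delimited, header row, response in the last column) is a guess; inspect breast_cancer.txt and adjust. The coding of "M" as 1 and the column order are assumptions of this sketch.

```python
import numpy as np

def load_breast_cancer(path="breast_cancer.txt"):
    """Load the breast cancer data and build a logistic-regression design.

    ASSUMED layout: whitespace-delimited, one header row, 10 numeric
    covariate columns, and a final "M"/"B" response column. Each subject
    contributes a single Bernoulli trial, so m is a vector of ones.
    """
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    header, body = rows[0], rows[1:]
    covariates = np.array([[float(v) for v in r[:-1]] for r in body])
    y = np.array([1 if r[-1] == "M" else 0 for r in body])  # M (malignant) -> 1
    X = np.column_stack([np.ones(len(y)), covariates])      # prepend intercept
    m = np.ones(len(y), dtype=int)                          # one trial each
    return m, y, X, header
```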