For all questions you must show your work. This enables us to understand your thought process, give partial credit, and prevent crude cheating. Please see the code of conduct in the Syllabus for rules about collaborating on homeworks.
For questions requiring computing, if you use R, python or any other programming environment then you must turn in a printout of your output with your solutions. In addition, a copy of your code must be uploaded to the appropriate HW directory of your forked GitHub repo within 24 hours of the homework due date.
In this homework you will implement the "Bag of Little Bootstraps" for a large dataset using the array job capabilities of Gauss, develop a MapReduce algorithm and run it via Hadoop Streaming on Amazon Elastic MapReduce, and, finally, perform some basic data manipulation using Hive (again on Elastic MapReduce). These exercises are largely tasters to introduce you to the data technologies; if you want to dig deeper into genuine statistical problems, the final projects will provide an opportunity to do so.
To take a look at the data, log in to gauss and type:

ls /home/pdbaines/data/blb_lin_reg*
You should see a listing of files of the form:
-rw-rw-r-- 1 pdbaines pdbaines 7.5G Sep 12 19:12 blb_lin_reg_data.bin
-rw-rw-r-- 1 pdbaines pdbaines 479 Sep 12 19:00 blb_lin_reg_data.desc
-rw-rw-r-- 1 pdbaines pdbaines 15G Sep 12 18:34 blb_lin_reg_data.txt
-rw-rw-r-- 1 pdbaines pdbaines 78M Sep 12 16:02 blb_lin_reg_medm.bin
-rw-rw-r-- 1 pdbaines pdbaines 476 Sep 12 16:02 blb_lin_reg_medm.desc
-rw-rw-r-- 1 pdbaines pdbaines 146M Sep 12 15:52 blb_lin_reg_medm.txt
-rw-rw-r-- 1 pdbaines pdbaines 3.2M Sep 13 13:16 blb_lin_reg_mini.bin
-rw-rw-r-- 1 pdbaines pdbaines 473 Sep 13 13:16 blb_lin_reg_mini.desc
-rw-rw-r-- 1 pdbaines pdbaines 6.0M Sep 16 17:53 blb_lin_reg_mini.txt
NOTE: Since the data for this question is somewhat large, you will **not** be downloading or copying the data. All files must be read directly from these locations, e.g., open("/home/pdbaines/data/blb_lin_reg_data.txt","rb").
For this question you will be analyzing the data contained in blb_lin_reg_data.*. There are up to three files that may be of use:

blb_lin_reg_data.txt: Dataset containing 1,000,000 observations. Each row corresponds to an observation, each column to a variable. The first 1,000 columns correspond to covariates (X1,X2,...,X1000), and the final column corresponds to the response variable y. The data are comma-separated, and the uncompressed file is approximately 15Gb.

blb_lin_reg_data.bin: Binary file produced by the bigmemory package in R, containing the same data as blb_lin_reg_data.txt but in a compressed and pre-processed format.

blb_lin_reg_data.desc: File describing the properties needed by the bigmemory package in R to read in the dataset.

If you wish to examine blb_lin_reg_data.txt you must use the less text reader, i.e., less /home/pdbaines/data/blb_lin_reg_data.txt. Using editors such as vi or nano is not advisable when working with such large files.
In tackling this assignment, you may wish to download and use blb_lin_reg_mini.[txt,bin,desc]. These files contain mini versions of the full dataset with d=40 covariates and n=10,000 observations that can be used to rapidly test, develop and debug your code (they can also be used on your laptop instead of gauss).
The model we will be fitting to the data is just the standard linear regression model:
\[ y_i = x_{i,1} \beta_{1} + x_{i,2} \beta_{2} + ... + x_{i,1000} \beta_{1000} + \epsilon_i , \qquad \epsilon_i \sim N(0,\sigma^2) , \]with the \(\epsilon_i\)'s independent. Note that the model does not include an intercept term. Denote the maximum likelihood estimator of \(\beta=(\beta_1,\ldots,\beta_{1000})\) by \(\hat\beta\).
The goal of this question is to find the (marginal / elementwise) standard error of \(\hat\beta\) i.e., \(SE(\hat\beta_1),\ldots,SE(\hat\beta_{1000})\).
To do this, you will be using the batch processing pipeline on gauss to implement a BLB procedure to estimate the SEs of the 1,000 regression coefficients. Use the following specifications: s = 5 distinct subsets, r = 50 bootstrap samples per subset, and subsamples of size \(b \approx n^{\gamma}\).
To access the data using R it is strongly recommended to use the attach.big.matrix function in conjunction with the .desc file. To access the data in python, simply use open in conjunction with the csv module.
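For example, the following python sketch (illustrative only; the function and variable names are not part of any supplied course code) streams the comma-separated file and keeps only a pre-selected set of rows, so the full 15Gb file is never held in memory:

import csv
import numpy as np

def read_rows(path, keep):
    # Stream the file line-by-line and retain only the 0-based row indices in 'keep'.
    rows = []
    with open(path, "r") as f:
        for i, record in enumerate(csv.reader(f)):
            if i in keep:
                rows.append([float(v) for v in record])
    return np.array(rows)

# Illustrative usage on the mini dataset: read five arbitrary rows.
keep = {0, 10, 100, 1000, 9999}
sub = read_rows("/home/pdbaines/data/blb_lin_reg_mini.txt", keep)
X, y = sub[:, :-1], sub[:, -1]  # the final column is the response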
The following subparts will guide you through this process of obtaining the BLB SE estimates.
Begin by creating a script named BLB_lin_reg_job.R, BLB_lin_reg_job.py or similar (depending on whether you intend to use R, python or some other programming language).
You must write code that reads in the data and samples \(b\approx n^{\gamma}\) rows (without replacement) from the full dataset. In R you can use sample to select the row numbers; in python you can use np.random.choice (only available in numpy v1.7.1+). Once the rows are selected, the subsample must be extracted from the full dataset. For selecting rows in python you may utilize the function in read_some_lines.py.

Next, your code must bootstrap resample a dataset of size n from the subsample of size b. Rather than explicitly constructing the resampled dataset, draw the bootstrap weights from a multinomial distribution: in R you can use rmultinom, in python you can use np.random.multinomial. The entries of the randomly sampled multinomial tell you how many times each of the b data points is replicated in the bootstrap dataset.

To fit the linear model to the weighted bootstrap dataset, in R you can use the weights argument to lm to specify the number of times each data point occurs; unfortunately, python offers no such capability at present. Fortunately, you can use Prof. Baines' code on GitHub (in Stuff/HW2/BLB/wls.py) to fit a weighted least squares regression to the data.

In summary, within BLB_lin_reg_job.[R,py] your code should subsample a dataset of size b from the full dataset, then bootstrap resample to size n and fit a weighted linear regression. Lastly, you just need to modify the code to do this rs times as a batch job.
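For illustration, here is a hedged python sketch of the bootstrap-and-fit step. It draws the multinomial weights with np.random.multinomial and, since the interface of the supplied wls.py is not reproduced here, fits the weighted least squares by rescaling each row by the square root of its weight (function and data names are purely illustrative):

import numpy as np

def blb_bootstrap_fit(X_sub, y_sub, n):
    # One bootstrap replicate for a subsample of size b = X_sub.shape[0].
    b = X_sub.shape[0]
    # Number of times each of the b subsampled points appears in a bootstrap
    # sample of size n (the weights sum to n).
    w = np.random.multinomial(n, np.ones(b) / b)
    sw = np.sqrt(w)[:, None]  # square-root weights, as a column vector
    # Weighted least squares == ordinary least squares on sqrt-weight-scaled data.
    beta, _, _, _ = np.linalg.lstsq(X_sub * sw, y_sub * sw.ravel(), rcond=None)
    return beta

# Illustrative usage with fake data (dimensions roughly matching the mini dataset).
X_sub = np.random.randn(600, 40)
y_sub = X_sub @ np.ones(40) + np.random.randn(600)
beta_hat = blb_bootstrap_fit(X_sub, y_sub, n=10000)

Scaling each row by the square root of its weight and solving ordinary least squares gives exactly the weighted least squares solution, so this matches what lm(..., weights=w) computes in R.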
There are s*r=250 bootstrap datasets that must be analyzed (r=50 bootstrap samples from each of s=5 distinct subsets). This could be organized as a batch job on gauss in a number of ways, the most obvious being:

s jobs, each performing r linear regressions, or,
r jobs, each performing s linear regressions, or,
rs jobs, each performing 1 linear regression.

Here you will use the last of these options (rs jobs, each running a single linear regression). The supplied file BLB_lin_reg_run.sh provides the configuration to run a batch job on Gauss with rs separate sub-jobs.
Each sub-job will be passed a single job index in 1,2,...,250. From this index, your code must compute the corresponding s_index (in 1,2,...,5) and r_index (in 1,2,...,50). All jobs sharing the same s_index must subsample identical rows (e.g., by seeding the random number generator using s_index before subsampling), while each distinct r_index must bootstrap resample a different dataset. The estimated regression coefficients from each job should be written to the file outfile, where (R):
outfile = paste0("output/","coef_",sprintf("%02d",s_index),"_",sprintf("%02d",r_index),".txt")
or (Python),
outfile = "output/coef_%02d_%02d.txt" % (s_index,r_index)
When your script is prepared and fully tested, submit the batch job by executing:
sarray ./BLB_lin_reg_run_[R,py].sh
Please note that you may need to modify BLB_lin_reg_run_[R,py].sh to reflect the appropriate filenames/paths for your environment. Always remember to check the resulting *.[err,out] files (written to the dump directory) to figure out if/why things went wrong. Results will be written to the output directory. (Note: each job should comfortably take less than ten minutes, or else your code is highly inefficient... :) Once all jobs have completed, you should have rs=250 different output files:
$ ls output/ -alh
-rw-rw---- 1 pdbaines pdbaines 19K Sep 12 22:18 coef_01_01.txt
...
-rw-rw---- 1 pdbaines pdbaines 19K Sep 12 22:18 coef_05_49.txt
-rw-rw---- 1 pdbaines pdbaines 19K Sep 12 22:18 coef_05_50.txt
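For reference, the combination performed in the processing step is conceptually simple: within each subset, take the elementwise standard deviation of the r=50 bootstrap coefficient vectors, then average these standard deviations across the s=5 subsets. The python sketch below shows the idea, assuming (purely as an illustration, not as a description of the supplied script) that each coef_SS_RR.txt file holds one estimated coefficient per line:

import numpy as np

s, r = 5, 50

se_by_subset = []
for s_index in range(1, s + 1):
    # Stack the r bootstrap coefficient vectors for this subset: shape (r, d).
    betas = np.vstack([np.loadtxt("output/coef_%02d_%02d.txt" % (s_index, r_index))
                       for r_index in range(1, r + 1)])
    # Bootstrap SE estimate from this subset: elementwise SD across the replicates.
    se_by_subset.append(betas.std(axis=0, ddof=1))

# BLB SE estimate: average the per-subset SE estimates across the s subsets.
blb_se = np.mean(se_by_subset, axis=0)
np.savetxt("blb_se_estimates.txt", blb_se)  # hypothetical output filename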
Next, edit and run the supplied script BLB_lin_reg_process.R. This can be run via:

sbatch BLB_lin_reg_process.sh

Depending on how you stored the output files, you may need to edit the script very slightly, but in most cases no modification will be needed. This will read each of the output files and produce the BLB SE estimates, which will be placed in the final directory. What to turn in: your BLB_lin_reg_job.[R,py,*] code.
The next question asks you to develop a MapReduce algorithm and run it via Hadoop Streaming on Amazon Elastic MapReduce: you will bin bivariate observations and count how many fall into each bin. The output of your mapper/reducer pair should consist of lines of the form:

x_lo,x_hi,y_lo,y_hi,count

where \(x_{lo},x_{hi},y_{lo},y_{hi}\) are the bin boundaries, and count is the number of observations falling within the bin.
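As a rough illustration only (the format of the input data and the bin width are not specified here, so both are assumptions): a Hadoop Streaming mapper could emit the four bin boundaries as a tab-terminated key with a count of 1, and the reducer could sum the counts for each key. For example, a mapper sketch:

#!/usr/bin/env python
# mapper.py -- illustrative sketch; assumes each input line holds an (x, y) pair
# separated by whitespace and uses bins of width 0.1 (both are assumptions).
import sys
import math

WIDTH = 0.1

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 2:
        continue
    x, y = float(parts[0]), float(parts[1])
    x_lo = math.floor(x / WIDTH) * WIDTH
    y_lo = math.floor(y / WIDTH) * WIDTH
    # Key = the four bin boundaries; value = a count of 1.
    print("%.1f,%.1f,%.1f,%.1f\t1" % (x_lo, x_lo + WIDTH, y_lo, y_lo + WIDTH))

and a matching reducer sketch:

#!/usr/bin/env python
# reducer.py -- illustrative sketch; relies on records arriving grouped by key
# (which Hadoop Streaming, or the sort in the test pipeline below, provides).
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key == current_key:
        current_count += int(value)
    else:
        if current_key is not None:
            print("%s,%d" % (current_key, current_count))  # x_lo,x_hi,y_lo,y_hi,count
        current_key, current_count = key, int(value)
if current_key is not None:
    print("%s,%d" % (current_key, current_count))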
Your mapper and reducer can be tested on the mini dataset in a non-Hadoop environment by running:

cat out_mini.txt | ./mapper.py | sort -k1,1 | ./reducer.py

Do not run your Hadoop job until this works correctly! Note that you may need to change the permissions on your mapper and reducer scripts using:

chmod u+x *.py
Note that files on the HDFS are not visible via a regular ls. To access the HDFS, use hadoop fs as done in class (lecture 09). First off, you need to create a directory on the HDFS:

hadoop fs -mkdir data

and then copy the data from S3 into it:

hadoop distcp s3://sta250bucket/bsamples data/

Remember to check this was successful by running hadoop fs -ls data/.
Next, edit and run the supplied script hadoop_hw.sh to launch your Hadoop Streaming job. The progress of the job can be monitored from the master node using lynx.

When the job has finished, copy the binned output back to your own machine, e.g.:

scp -i yourkeypair.pem -r hadoop@(amazon_dns):~/binned-output ./

The output files can then be combined using cat (Linux or Mac; for Windows users, do this on the Amazon cluster before scp-ing back to your laptop/desktop) and the results plotted using the supplied script PlotRes.R.

What to turn in:
Next up, it is time to try out some basic data manipulation with Hive via Amazon EMR. Your goal in this question is to load in the groups.txt data stored in the course S3 bucket at s3://sta250bucket/groups.txt. For testing purposes it is recommended that you use s3://sta250bucket/mini_groups.txt.