STA 250 :: Homework Policy

For all questions you must show your work. This enables us to understand your thought process, give partial credit, and prevent crude cheating. Please see the code of conduct in the Syllabus for the rules about collaborating on homework assignments.

For questions requiring computing: if you use R, Python, or any other programming environment, you must turn in a printout of your output with your solutions.
In addition, a copy of your code must be uploaded to the appropriate HW directory of your forked GitHub repo within 24 hours of the homework due date.


Homework 02

Due: In Class, Wed November 13th

Assigned: Thursday Oct 31st

Big Data Module


In this homework you will implement the "Bag of Little Bootstraps" for a large dataset using the array-job capabilities of Gauss, develop a MapReduce algorithm and run it via Hadoop Streaming on Amazon Elastic MapReduce, and, finally, perform some basic data manipulation using Hive (again on Elastic MapReduce). These exercises are largely tasters to introduce you to the data technologies; if you want to dig deeper into genuine statistical problems, the final projects will provide an opportunity to do so.

  1. Sync your fork of the course GitHub repo to include the latest updates using the instructions provided here.


  2. In this question, you will implement the "Bag of Little Bootstraps" (BLB) procedure of Kleiner et al. (2012) on a simulated linear regression dataset.
    1. First, create a file called BLB_lin_reg_job.R, blb_lin_reg_job.py, or similar (depending on whether you intend to use R, Python, or some other programming language). You must write code that reads in the data and samples \(b\approx n^{\gamma}\) rows (without replacement) from the full dataset. In R you can use sample to select the row numbers; in Python you can use np.random.choice (only available in numpy v1.7.1+). Once the rows are selected, the subsample must be extracted from the full dataset. For selecting rows in Python you may use the function in read_some_lines.py.
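      A minimal Python sketch of this step (the filename, n, and gamma below are illustrative placeholders, not the assignment's actual values):

          import numpy as np

          # Illustrative values only -- use the n and gamma specified for the assignment.
          n = 1000000
          gamma = 0.7
          b = int(np.ceil(n ** gamma))

          # Choose b distinct row numbers uniformly at random (requires numpy v1.7.1+).
          rows = np.random.choice(n, size=b, replace=False)

          # np.loadtxt is fine for a small test file; for the full dataset you would
          # extract only the selected rows (e.g., via the supplied read_some_lines.py).
          subsample = np.loadtxt("blb_lin_reg_data.txt")[rows, :]
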
    2. From the extracted subset, you must now draw a bootstrap sample of size n. This is done by sampling the replication counts from a \(\mathrm{Multinomial}\left(n,\frac{1}{b}1_{b}\right)\) distribution. In R you can use rmultinom; in Python you can use np.random.multinomial. The entries of the randomly sampled multinomial vector tell you how many times each of the b data points is replicated in the bootstrap dataset.
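      Continuing the sketch above, the bootstrap counts can be drawn in one line:

          # One count per subsampled row; the counts are non-negative and sum to n.
          weights = np.random.multinomial(n, np.ones(b) / b)
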
    3. Fit the linear model to the bootstrap dataset. In R you can use the weights argument to lm to specify the number of times each data point occurs; unfortunately, Python offers no such capability at present. Fortunately, you can use Prof. Baines' code on GitHub (in Stuff/HW2/BLB/wls.py) to fit a weighted least squares regression to the data.
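      If you are working in Python and prefer not to depend on wls.py, the weighted fit can also be computed directly with numpy. A rough sketch, continuing the snippets above (it assumes the response is the last column of the subsample and the remaining columns are predictors -- check this against the actual data layout):

          # Weighted least squares "by hand": solve (X'WX) beta = X'Wy,
          # where the diagonal of W holds the multinomial bootstrap counts.
          X = np.column_stack((np.ones(b), subsample[:, :-1]))  # intercept + predictors
          y = subsample[:, -1]                                   # assumed response column
          XtW = X.T * weights                                    # scales column i of X.T by weights[i]
          beta_hat = np.linalg.solve(XtW.dot(X), XtW.dot(y))
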
    4. By this point, if you run BLB_lin_reg_job.[R,py], your code should subsample a dataset of size b from the full dataset, then bootstrap resample to size n and fit a weighted linear regression. Lastly, you just need to modify the code to do this rs times as a batch job. There are s*r=250 bootstrap datasets that must be analyzed (r=50 bootstrap samples from each of s=5 distinct subsets). This could be organized as a batch job on Gauss in a number of ways, the most obvious being:
      • s jobs, each performing r linear regressions, or,
      • r jobs, each performing s linear regressions, or,
      • rs jobs, each performing 1 linear regression.
      In this assignment we will use the third option (rs jobs, each running a single linear regression). The supplied file BLB_lin_reg_run_[R,py].sh provides the configuration to run a batch job on Gauss with rs separate sub-jobs.
      • The jobs are numbered 1,2,...,250.
      • Your R/Python code must extract the job number from the command line.
      • Given the job number, you must figure out the values of s_index (in 1,2,...,5) and r_index (in 1,2,...,50); one possible mapping and seeding scheme is sketched at the end of this part.
      • Each job with the same value of s_index must subsample identical rows.
      • Jobs with the same s_index but different r_index must produce different bootstrap resamples.
      • The coefficients from each linear regression should be written to a file with name outfile, where (R):
              outfile = paste0("output/","coef_",sprintf("%02d",s_index),"_",sprintf("%02d",r_index),".txt")
              
        or (Python),
        
              outfile = "output/coef_%02d_%02d.txt" % (s_index,r_index)
              
      When your script is prepared and fully tested, submit the batch job by executing:
      
          sarray ./BLB_lin_reg_run_[R,py].sh
          
      Please note that you may need to modify BLB_lin_reg_run_[R,py].sh to reflect the appropriate filenames/paths for your environment. Always remember to check the resulting *.[err,out] files (written to the dump directory) to figure out if/why things went wrong. Results will be written to the output directory. (Note: Each job should comfortably take less than ten minutes, or else your code is highly inefficient... :)
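      A rough Python sketch of the bookkeeping described above (it assumes the job number arrives as the first command-line argument and shows one possible index mapping and seeding scheme -- check BLB_lin_reg_run_[R,py].sh for how the argument is actually passed):

          import sys
          import numpy as np

          job = int(sys.argv[1])          # job number in 1,2,...,250
          s, r = 5, 50
          s_index = (job - 1) // r + 1    # e.g., jobs 1-50 -> subsample 1, 51-100 -> subsample 2, ...
          r_index = (job - 1) % r + 1

          # Seed on s_index before subsampling so every job sharing an s_index
          # selects identical rows; reseed on the job number afterwards so each
          # job draws a different bootstrap resample.
          np.random.seed(s_index)
          # ... subsample b rows here ...
          np.random.seed(1000 + job)      # offset to avoid reusing the subsample seeds
          # ... draw the multinomial weights, fit the model, write the coefficients ...
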
    5. By this point, you should have rs=250 different output files:
      
        $ ls output/ -alh
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_01_01.txt
        ...
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_05_49.txt
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_05_50.txt
          
      Next, edit and run the supplied script BLB_lin_reg_process.R. This can be run via:
      
          sbatch BLB_lin_reg_process.sh
          
      Depending on how you stored the output files, you may need to edit the script very slightly, but in most cases no modification will be needed. This will read each of the files and produce BLB SE estimates. These will be placed in the final directory.
    6. Produce an index plot of the obtained SEs, i.e., a scatterplot with the index on the x-axis (1,2,...,d) and the values of the SEs on the y-axis.
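      A minimal plotting sketch in Python (the path to the SE estimates is an illustrative placeholder -- use whatever file BLB_lin_reg_process.R actually writes to the final directory):

          import numpy as np
          import matplotlib.pyplot as plt

          se = np.loadtxt("final/blb_se.txt")        # hypothetical path to the BLB SE estimates
          plt.scatter(np.arange(1, len(se) + 1), se)
          plt.xlabel("Coefficient index")
          plt.ylabel("BLB standard error")
          plt.savefig("blb_se_index_plot.png")
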
    7. What to turn in:
      • Your code file BLB_lin_reg_job.[R,py,*]
      • Your index plot of the SE values obtained in the previous part
      • A brief discussion of the algorithm and how well it worked (the SEs should be \(\approx{}0.01\) for this example).


  3. In this question you will implement a MapReduce algorithm using the Hadoop Streaming API on Amazon Elastic MapReduce (EMR). First, follow the instructions to get started using EMR.
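    As a starting point, a skeleton mapper/reducer pair for Hadoop Streaming might look like the sketch below. The input format (two whitespace-separated numeric values per line) and the bin width are assumptions for illustration only -- the linked instructions define the actual dataset and binning scheme.

        #!/usr/bin/env python
        # mapper.py -- emit one "x_bin,y_bin <tab> 1" record per input point.
        import sys

        BIN = 0.1                          # illustrative bin width
        for line in sys.stdin:
            fields = line.split()
            if len(fields) != 2:
                continue                   # skip malformed lines
            x, y = float(fields[0]), float(fields[1])
            x_lo = BIN * (x // BIN)        # lower edge of the bin containing x
            y_lo = BIN * (y // BIN)
            print("%.1f,%.1f\t1" % (x_lo, y_lo))

        #!/usr/bin/env python
        # reducer.py -- keys arrive sorted, so counts for the same bin are adjacent;
        # sum them and emit one "x_bin,y_bin <tab> count" record per occupied bin.
        import sys

        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key == current_key:
                count += int(value)
            else:
                if current_key is not None:
                    print("%s\t%d" % (current_key, count))
                current_key, count = key, int(value)
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
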
  4. What to turn in:

    1. The 2D histogram or density estimate produced from your MapReduce results.
    2. A description of your MapReduce algorithm. This should address the following questions:

      • What was your Map function?
      • What was your Reduce function?
      • What were the important implementation choices?
      • How well/quickly did your program run?
      • Any comments/improvements you could make?


  5. Next up, it is time to try out some basic data manipulation with Hive via Amazon EMR. Your goal in this question is to load in the groups.txt data stored in the course S3 bucket at s3://sta250bucket/groups.txt. For testing purposes it is recommended that you use s3://sta250bucket/mini_groups.txt.

    1. Fire up a Hadoop+Hive cluster as in the MapReduce question above, and figure out how to get the data loaded into Hive.
    2. Compute the within-group means for the 1000 groups.
    3. Compute the within-group variances for the 1000 groups.
    4. Make a scatterplot of the within-group means (x-axis) vs. the within-group variances (y-axis).

(: Happy Coding! :)