STA 250 :: Advanced Statistical Computing (UCD, Fall 2013)

Code + goodies used in Prof. Baines' STA 250 Course (UC Davis, Fall 2013)



STA 250 :: Homework Policy

For all questions you must show your work. This enables us to understand your thought process, award partial credit, and deter crude cheating. Please see the code of conduct in the Syllabus for rules about collaborating on homeworks.

For questions requiring computing, if you use R, python, or any other programming environment, you must turn in a printout of your output with your solutions.
In addition, a copy of your code must be uploaded to the appropriate HW directory of your forked GitHub repo within 24 hours of the homework due date.


Homework 02

Due: In Class, Wed November 13th

Assigned: Thursday Oct 31st

Big Data Module


In this homework you will implement the "Bag of Little Bootstraps" for a large dataset using the array job capabilities of Gauss, develop a MapReduce algorithm and run it via Hadoop Streaming on Amazon Elastic MapReduce, and, finally, perform some basic data manipulation using Hive (again on Elastic MapReduce). These exercises are largely tasters to introduce you to the data technologies; if you want to dig deeper into genuine statistical problems, the final projects will provide an opportunity to do so.

  1. Sync your fork of the course GitHub repo to include the latest updates using the instructions provided here.


  2. In this question, you will implement the "Bag of Little Bootstraps" (BLB) procedure of Kleiner et al. (2012) on a simulated linear regression dataset.
    • First, log in to gauss and type:
    • 
          ls /home/pdbaines/data/blb_lin_reg*
          
      You should see a listing of files of the form:
      
      -rw-rw-r--  1 pdbaines pdbaines 7.5G Sep 12 19:12 blb_lin_reg_data.bin
      -rw-rw-r--  1 pdbaines pdbaines  479 Sep 12 19:00 blb_lin_reg_data.desc
      -rw-rw-r--  1 pdbaines pdbaines  15G Sep 12 18:34 blb_lin_reg_data.txt
      -rw-rw-r--  1 pdbaines pdbaines  78M Sep 12 16:02 blb_lin_reg_medm.bin
      -rw-rw-r--  1 pdbaines pdbaines  476 Sep 12 16:02 blb_lin_reg_medm.desc
      -rw-rw-r--  1 pdbaines pdbaines 146M Sep 12 15:52 blb_lin_reg_medm.txt
      -rw-rw-r--  1 pdbaines pdbaines 3.2M Sep 13 13:16 blb_lin_reg_mini.bin
      -rw-rw-r--  1 pdbaines pdbaines  473 Sep 13 13:16 blb_lin_reg_mini.desc
      -rw-rw-r--  1 pdbaines pdbaines 6.0M Sep 16 17:53 blb_lin_reg_mini.txt
          
      NOTE: Since the data for this question is somewhat large, you will **not** be downloading or copying the data. All files must be read directly from these locations e.g., open("/home/pdbaines/data/blb_lin_reg_data.txt","rb").

      For this question you will be analyzing the data contained in blb_lin_reg_data.*. Three files may be of use:

      • blb_lin_reg_data.txt: Dataset containing 1,000,000 observations. Each row corresponds to an observation, each column to a variable. The first 1,000 columns correspond to covariates (X1,X2,...,X1000), and the final column corresponds to the response variable y. The data are comma-separated, and the uncompressed file is approximately 15 GB.
      • blb_lin_reg_data.bin: Binary file produced by the bigmemory package in R, containing the same data as blb_lin_reg_data.txt but in compressed and pre-processed format.
      • blb_lin_reg_data.desc: File describing the properties needed by the bigmemory package in R to read in the dataset.

      If you wish to examine blb_lin_reg_data.txt you must use the less text reader, i.e., less /home/pdbaines/data/blb_lin_reg_data.txt. Using editors such as vi or nano is not advisable when working with such large files.

      In tackling this assignment, you may wish to download and use blb_lin_reg_mini.[txt,bin,desc]. These files contain mini versions of the full dataset with d=40 covariates and n=10,000 observations that can be used to rapidly test, develop and debug your code (they can also be used on your laptop instead of gauss).



      The model we will be fitting to the data is just the simple linear regression model:

      \[ y_i = x_{i,1} \beta_{1} + x_{i,2} \beta_{2} + ... + x_{i,1000} \beta_{1000} + \epsilon_i , \qquad \epsilon_i \sim N(0,\sigma^2) , \]

      with the \(\epsilon_i\)'s independent. Note that the model does not include an intercept term. Denote the maximum likelihood estimator of \(\beta=(\beta_1,\ldots,\beta_{1000})\) by \(\hat\beta\).

      The goal of this question is to find the (marginal / elementwise) standard error of \(\hat\beta\) i.e., \(SE(\hat\beta_1),\ldots,SE(\hat\beta_{1000})\).

      To do this, you will be using the batch processing pipeline on gauss to implement a BLB procedure to estimate the SE's of the 1,000 regression coefficients. Use the following specifications:

      • s = 5
      • r = 50
      • gamma = 0.7

      To access the data using R it is strongly recommended to use the attach.big.matrix function in conjunction with the .desc file. To access the data in python, simply use open in conjunction with the csv module.
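      For example, a minimal Python sketch of this (shown on the mini dataset; reading the entire file into memory like this is only sensible for the mini/medium versions, not the 15 GB file):

          # Minimal sketch: read a comma-separated data file with open() + csv.
          # Only do this with the mini/medium files; for the full data, read just
          # the rows you need (e.g., via read_some_lines.py) or use bigmemory in R.
          import csv
          import numpy as np

          datapath = "/home/pdbaines/data/blb_lin_reg_mini.txt"
          with open(datapath, "r") as f:
              data = np.array([[float(v) for v in row] for row in csv.reader(f)])

          X = data[:, :-1]   # covariate columns X1,...,Xd
          y = data[:, -1]    # response column y
          print(X.shape, y.shape)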

      The following subparts will guide you through this process of obtaining the BLB SE estimates.

    1. First, create a file called BLB_lin_reg_job.R, BLB_lin_reg_job.py or similar (depending on whether you intend to use R or python or some other programming language). You must write code that reads in the data and samples \(b\approx n^{\gamma}\) rows (without replacement) from the full dataset. In R you can use sample to select the row numbers; in python you can use np.random.choice (only available in numpy v1.7.1+). Once the rows are selected, the subsample must be extracted from the full dataset. For selecting rows in Python you may use the function in read_some_lines.py.
    2. From the extracted subset, you must now draw a bootstrap sample of size n. This is done by sampling from a \(Multinomial\left(n,\frac{1}{b}1_{b}\right)\) distribution. In R you can use rmultinom; in python you can use np.random.multinomial. The entries of the randomly sampled multinomial vector tell you how many times each of the b data points is replicated in the bootstrap dataset.
    3. Fit the linear model to the bootstrap dataset. In R you can use the weights argument to lm to specify the number of times each data point occurs; unfortunately, python offers no such capability at present. Fortunately, you can use Prof. Baines' code on GitHub (in Stuff/HW2/BLB/wls.py) to fit a weighted least squares regression to the data.
    4. By this point, if you run BLB_lin_reg_job.[R,py], your code should subsample a dataset of size b from the full dataset, then bootstrap resample to size n and fit a weighted linear regression. Lastly, you just need to modify the code to do this rs times as a batch job. There are s*r=250 bootstrap datasets that must be analyzed (r=50 bootstrap samples from each of s=5 distinct subsets). This could be organized as a batch job on gauss in a number of ways, the most obvious being:
      • s jobs, each performing r linear regressions, or,
      • r jobs, each performing s linear regressions, or,
      • rs jobs, each performing 1 linear regression.
      In this assignment we will use the third option (rs jobs, each running a single linear regression). The supplied file BLB_lin_reg_run_[R,py].sh provides the configuration to run a batch job on Gauss with rs separate sub-jobs. (A rough Python sketch of one possible job script is given after this list of steps.)
      • The jobs are numbered 1,2,...,250.
      • Your R/python code must extract the job number from the command line.
      • Given the job number, you must figure out the corresponding values of s_index (in 1,2,...,5) and r_index (in 1,2,...,50).
      • Every job with the same value of s_index must subsample identical rows.
      • Each job must draw its own bootstrap resample, i.e., jobs with the same s_index but different r_index must produce different bootstrap datasets.
      • The coefficients from each linear regression should be written to a file with name outfile, where (R):
      • 
              outfile = paste0("output/","coef_",sprintf("%02d",s_index),"_",sprintf("%02d",r_index),".txt")
              
        or (Python),
        
              outfile = "output/coef_%02d_%02d.txt" % (s_index,r_index)
              
      When your script is prepared and fully tested, submit the batch job by executing:
      
          sarray ./BLB_lin_reg_run_[R,py].sh
          
      Please note that you may need to modify BLB_lin_reg_run_[R,py].sh to reflect the appropriate filenames/paths for your environment. Always remember to check the resulting *.[err,out] files (written to the dump directory) to figure out if/why things went wrong. Results will be written to the output directory. (Note: Each job should comfortably take less than ten minutes, or else your code is highly inefficient... :)
    5. By this point, you should have rs=250 different output files:
      
        $ ls output/ -alh
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_01_01.txt
        ...
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_05_49.txt
        -rw-rw---- 1 pdbaines pdbaines  19K Sep 12 22:18 coef_05_50.txt
          
      Next, edit and run the supplied script BLB_lin_reg_process.R. This can be run via:
      
          sbatch BLB_lin_reg_process.sh
          
      Depending on how you stored the output files, you may need to edit the script very slightly, but in most cases no modification will be needed. The script reads each of the coefficient files and produces the BLB SE estimates, which are placed in the final directory. (The aggregation it performs is sketched after this list of steps.)
    6. Produce an index plot of the obtained SE's, i.e., a scatterplot with the index (1,2,...,d) on the x-axis and the value of the SE on the y-axis.
    7. What to turn in:
      • Your code file BLB_lin_reg_job.[R,py,*]
      • Your index plot of the SE values obtained in the previous part
      • Brief discussion of the algorithm and how well it worked for this example (the SE's should be \(\approx{}0.01\) for this example).
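      For reference, here is a rough Python sketch of how the pieces above might fit together in a single BLB_lin_reg_job.py. It is a sketch under several assumptions, not the required solution: it reads the mini dataset fully into memory (the full data must be read in place, e.g., row by row or via bigmemory in R), it substitutes a plain normal-equation weighted least squares for the supplied wls.py, and the job-number-to-(s_index, r_index) mapping and seeding scheme are just one simple way to satisfy the requirements listed above.

          # Rough sketch of BLB_lin_reg_job.py: one sub-job = one bootstrap fit.
          # Assumptions: mini dataset read fully into memory; simple normal-equation
          # WLS instead of the supplied wls.py; job number given as the first
          # command-line argument (1, 2, ..., s*r).
          import sys
          import csv
          import numpy as np

          s, r, gamma = 5, 50, 0.7

          # Step 4: recover s_index (1..5) and r_index (1..50) from the job number 1..250.
          job = int(sys.argv[1])
          s_index = (job - 1) // r + 1
          r_index = (job - 1) % r + 1

          # Read the data (mini version only -- do not read the 15 GB file this way).
          datapath = "/home/pdbaines/data/blb_lin_reg_mini.txt"
          with open(datapath, "r") as f:
              data = np.array([[float(v) for v in row] for row in csv.reader(f)])
          X, y = data[:, :-1], data[:, -1]
          n, d = X.shape
          b = int(round(n ** gamma))

          # Step 1: subsample b rows without replacement. Seeding on s_index alone
          # means every job with the same s_index selects identical rows.
          np.random.seed(s_index)
          rows = np.random.choice(n, size=b, replace=False)
          X_sub, y_sub = X[rows, :], y[rows]

          # Step 2: bootstrap resample to size n via multinomial counts; reseeding on
          # (s_index, r_index) gives each replicate its own resample.
          np.random.seed(1000 * s_index + r_index)
          weights = np.random.multinomial(n, np.ones(b) / b)

          # Step 3: weighted least squares, solving (X'WX) beta = X'Wy.
          XtW = X_sub.T * weights
          beta_hat = np.linalg.solve(XtW.dot(X_sub), XtW.dot(y_sub))

          # Step 4 (output): write the coefficients, one per line, to output/coef_SS_RR.txt
          # (adjust the format to whatever BLB_lin_reg_process.R expects).
          outfile = "output/coef_%02d_%02d.txt" % (s_index, r_index)
          np.savetxt(outfile, beta_hat)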

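      The supplied BLB_lin_reg_process.R takes care of step 5, but it may help to see what the aggregation amounts to: following Kleiner et al. (2012), for each coefficient you take the standard deviation across the r=50 bootstrap replicates within each subsample, and then average those s=5 standard deviations. A rough Python sketch of that calculation, together with the index plot for step 6, assuming the output files are named as above:

          # Rough sketch: combine the s*r coefficient files into BLB SE estimates
          # and draw the index plot. Assumes files output/coef_SS_RR.txt with one
          # coefficient per line, as written by the job script sketched above.
          import numpy as np
          import matplotlib
          matplotlib.use("Agg")          # no display needed on the cluster
          import matplotlib.pyplot as plt

          s, r = 5, 50
          # coefs[i, j, :] = coefficient vector from subsample i+1, replicate j+1
          coefs = np.array([[np.loadtxt("output/coef_%02d_%02d.txt" % (i, j))
                             for j in range(1, r + 1)]
                            for i in range(1, s + 1)])

          # SD over the r replicates within each subsample, then average over subsamples.
          se_blb = coefs.std(axis=1, ddof=1).mean(axis=0)

          plt.scatter(np.arange(1, len(se_blb) + 1), se_blb)
          plt.xlabel("Coefficient index")
          plt.ylabel("BLB standard error estimate")
          plt.savefig("blb_se_index_plot.png")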

  3. In this question you will implement a MapReduce algorithm using the Hadoop Streaming API on Amazon Elastic MapReduce (EMR). First, follow the instructions to get started using EMR.
    • The goal of your MapReduce algorithm is to produce a bivariate histogram of a large sample of data. Your input is a large set of files containing tab-separated pairs of values, corresponding to the (x,y) samples. There are a total of 180,000,000 observations across the 100 input files (each approximately 60MB). Your MapReduce output must be a set of frequencies for bivariate bins of width 0.1 by 0.1. For example, for bin \((5.5 < x \leq 5.6, 10.4 < y \leq 10.5)\), your algorithm must return the number of data points in the bin. It is recommended that the output from your reducer be in csv format as follows:
      
            x_lo,x_hi,y_lo,y_hi,count
            
      where \(x_{lo},x_{hi},y_{lo},y_{hi}\) are the bin boundaries, and count is the number of observations falling within the bin. (A rough sketch of one possible mapper/reducer pair is given after this list of steps.)
    • For testing purposes you are encouraged to play with the mini version of the data contained in the GitHub repo (mini_out.txt). The file mini_out.txt can be tested with your mapper and reducer in a non-Hadoop environment by running:
      
            cat mini_out.txt | ./mapper.py | sort -k1,1 | ./reducer.py
            
      Do not run your Hadoop job until this works correctly! Note that you may need to change the permissions on your mapper and reducer scripts using:
      
            chmod u+x *.py
            
    • When launching your Hadoop cluster, follow the instructions in class (i.e., an Interactive Hive configuration), and select "Large" for the master, and 5 "Medium" compute nodes.
    • Once logged in, to see what files you have, type ls. To access the HDFS, use hadoop fs as done in class (lecture 09). First off, you need to create a directory on the HDFS:
      
            hadoop fs -mkdir data
            
    • Next, it is time to get the data onto the HDFS. This can be done by typing:
      
            hadoop distcp s3://sta250bucket/bsamples data/
            
      Remember to check this was successful by running hadoop fs -ls data/.
    • To launch your MapReduce job, use the script hadoop_hw.sh.
    • Job status can be checked by opening another ssh session and using lynx.
    • Once successful, copy the results directory to the local filesystem on your AWS machine, and then scp to your laptop with:
      
            scp -i yourkeypair.pem -r hadoop@(amazon_dns):~/binned-output ./
            
    • To concatenate the results files back together, use cat (Linux or Mac; For Windows users, do this on the Amazon cluster before scp-ing back to your laptop/desktop).
    • Produce a 2D-histogram of the results. If your output is in the correct format then you may use the (ugly) code in the GitHub repo to do this. See: PlotRes.R. (A rough Python alternative is sketched after the turn-in list below.)
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
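    For concreteness, here is a rough sketch of one possible mapper.py / reducer.py pair for this job. It is one way of doing it under stated assumptions, not the required solution: the mapper emits a key of the form x_lo,y_lo (that key format is an assumption) with a count of 1, and the reducer sums counts per key and writes the recommended x_lo,x_hi,y_lo,y_hi,count csv lines.

          #!/usr/bin/env python
          # Sketch of one possible mapper: read tab-separated (x, y) pairs from stdin
          # and emit "x_lo,y_lo<TAB>1" for the 0.1-by-0.1 bin containing each point.
          import sys
          import math

          for line in sys.stdin:
              fields = line.strip().split("\t")
              if len(fields) != 2:
                  continue
              x, y = float(fields[0]), float(fields[1])
              # For half-open bins (lo, hi], the upper edge is the value rounded up
              # to the next multiple of 0.1; exact boundary hits are vanishingly rare.
              x_hi = math.ceil(x * 10.0) / 10.0
              y_hi = math.ceil(y * 10.0) / 10.0
              print("%.1f,%.1f\t1" % (x_hi - 0.1, y_hi - 0.1))

    and the matching reducer (relying on Hadoop Streaming delivering the mapper output sorted by key):

          #!/usr/bin/env python
          # Sketch of one possible reducer: sum the counts for each bin key and write
          # csv lines of the form x_lo,x_hi,y_lo,y_hi,count.
          import sys

          def emit(key, count):
              x_lo, y_lo = [float(v) for v in key.split(",")]
              print("%.1f,%.1f,%.1f,%.1f,%d" % (x_lo, x_lo + 0.1, y_lo, y_lo + 0.1, count))

          current_key, count = None, 0
          for line in sys.stdin:
              fields = line.strip().split("\t")
              if len(fields) != 2:
                  continue
              key, value = fields
              if key == current_key:
                  count += int(value)
              else:
                  if current_key is not None:
                      emit(current_key, count)
                  current_key, count = key, int(value)

          if current_key is not None:
              emit(current_key, count)

    Both scripts can be checked locally with the cat | mapper | sort | reducer pipeline above before launching the Hadoop job.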

    What to turn in:

    1. The 2D histogram or density estimate produced from your MapReduce results.
    2. A description of your MapReduce algorithm. This should address the following questions:

      • What was your Map function?
      • What was your Reduce function?
      • What were the important implementation choices?
      • How well/quickly did your program run?
      • Any comments/improvements you could make?
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
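    If you prefer to plot in Python rather than with PlotRes.R, a minimal sketch along the following lines should work. It assumes your concatenated reducer output is saved locally as binned_counts.csv (a hypothetical filename) in the x_lo,x_hi,y_lo,y_hi,count format recommended above.

          # Minimal sketch: draw a 2D histogram (heatmap) from the binned counts.
          # "binned_counts.csv" is a hypothetical name for the concatenated reducer
          # output with columns x_lo,x_hi,y_lo,y_hi,count and no header line.
          import numpy as np
          import matplotlib.pyplot as plt

          dat = np.loadtxt("binned_counts.csv", delimiter=",")
          x_lo, y_lo, counts = dat[:, 0], dat[:, 2], dat[:, 4]

          # Rebuild a regular grid of 0.1-by-0.1 bins and fill in the observed counts.
          x_edges = np.arange(x_lo.min(), x_lo.max() + 0.2, 0.1)
          y_edges = np.arange(y_lo.min(), y_lo.max() + 0.2, 0.1)
          grid = np.zeros((len(y_edges) - 1, len(x_edges) - 1))
          ix = np.searchsorted(x_edges, x_lo + 0.05) - 1
          iy = np.searchsorted(y_edges, y_lo + 0.05) - 1
          grid[iy, ix] = counts

          plt.imshow(grid, origin="lower", aspect="auto",
                     extent=[x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]])
          cbar = plt.colorbar()
          cbar.set_label("count")
          plt.xlabel("x")
          plt.ylabel("y")
          plt.savefig("hist2d.png")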


  4. Next up, time to try out some basic data manipulation with Hive via Amazon EMR. Your goal in this question is to load in the groups.txt data stored in the course S3 bucket at s3://sta250bucket/groups.txt. For testing purposes it is recommended that you use s3://sta250bucket/mini_groups.txt.

    1. Fire up a Hadoop+Hive cluster as in the MapReduce question above, and figure out how to get the data loaded into Hive.
    2. Compute the within-group means for the 1000 groups.
    3. Compute the within-group variances for the 1000 groups.
    4. Make a scatterplot of the within-group means (x-axis) vs. the within-group variances (y-axis). (A short plotting sketch is given after this list.)
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
    • Do not forget to terminate the session when you are done!!!
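    For part 4, once the group-level means and variances are out of Hive and saved locally, the scatterplot itself is straightforward; a hedged sketch follows (the filename group_stats.csv and the group,mean,variance column layout are assumptions about how you export your query results):

          # Sketch for part 4: within-group means (x) vs. within-group variances (y).
          # Assumes a hypothetical local export "group_stats.csv" with numeric columns
          # group,mean,variance and no header -- adjust to however you save your output.
          import numpy as np
          import matplotlib.pyplot as plt

          stats = np.loadtxt("group_stats.csv", delimiter=",")
          plt.scatter(stats[:, 1], stats[:, 2])
          plt.xlabel("Within-group mean")
          plt.ylabel("Within-group variance")
          plt.savefig("group_means_vs_variances.png")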

(: Happy Coding! :)