STA 250 -- Lecture 02

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

On the menu for today...

  1. Coding warmups

  2. Working remotely: ssh, sftp, scp etc.

  3. Filesystem basics

  4. Git + GitHub

  5. Gauss: The Stat Cluster

  6. Python basics

  7. R basics

Notetakers for today: Christopher Aden, Xiongtao Dai, Shan-Yu Liu

Coding Warmup

Working Remotely: Basics

Remote Logins

SSH stands for "Secure Shell", and is a protocol that allows users to log in to remote machines. For example, you can log in to the Stat Dept cluster ("Gauss") while having a cup of coffee in Australia.

We will use ssh a lot in this course.

Poll: Who has used ssh before?

Working Remotely: Basics II

Copying Files

SSH is great for logging into a remote machine, but you will need a mechanism to transfer files between your laptop/desktop and, say, Gauss.

Enter scp and sftp. This is where using Windows becomes painful.

ssh

To ssh into a remote machine is nice and easy:

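ssh pdbaines@gauss.cse.ucdavis.edu

To enable X11 forwarding (so graphical windows from Gauss can display on your machine):

ssh -X pdbaines@gauss.cse.ucdavis.edu

To also print verbose debugging output:

ssh -vX pdbaines@gauss.cse.ucdavis.edu

Obviously, log in to your own account on Gauss, not mine.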

ssh for Windows users:

scp

scp stands for secure copy, and allows you to copy files between your local machine and a remote machine (e.g., Gauss). For example, to copy foo.txt from my Desktop to my "Research" folder on Gauss:

scp ~/Desktop/foo.txt pdbaines@gauss.cse.ucdavis.edu:~/Research/

To copy directories you need to pass -r for recursive copying. For example, to copy your Desktop directory to Gauss:

scp -r ~/Desktop pdbaines@gauss.cse.ucdavis.edu:~/

For more, try man scp or http://linux.die.net/man/1/scp.
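If you find yourself typing the same scp command over and over, a small wrapper can help. What follows is only a sketch: the function name gauss_push and the GAUSS_USER and DRY_RUN variables are made up for illustration and are not part of scp. With DRY_RUN=1 it just prints the command it would have run, so you can check the arguments before anything is copied.

```shell
#!/bin/sh
# Hypothetical helper: copy a local file or folder to a folder on Gauss.
# gauss_push, GAUSS_USER and DRY_RUN are made-up names for this sketch.
GAUSS_HOST="gauss.cse.ucdavis.edu"

gauss_push() {
    push_src="$1"
    push_dest="${2:-}"                  # remote folder, relative to your home directory
    push_user="${GAUSS_USER:-yourusername}"

    # Build the command piece by piece in the positional parameters.
    set -- scp
    if [ -d "$push_src" ]; then
        set -- "$@" -r                  # folders need -r (recursive copy)
    fi
    set -- "$@" "$push_src" "${push_user}@${GAUSS_HOST}:~/${push_dest}"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"                       # print the command instead of running it
    else
        "$@"
    fi
}

# Dry run: prints the scp command without contacting Gauss.
DRY_RUN=1 GAUSS_USER=alice gauss_push /tmp Research/
```

Drop the DRY_RUN=1 prefix once the printed command looks right.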

rsync

rsync is a useful tool for more complicated file transfers. For example, suppose you are copying 100,000 files from your laptop to Gauss and the file transfer fails midway (after 2 hours). With rsync it is trivial to resume the transfer and avoid recopying already copied files.

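For example, to copy my "libraries" folder from my laptop to Gauss, skipping any files that were already uploaded the first time around:

rsync --ignore-existing -av libraries pdbaines@gauss.cse.ucdavis.edu:~/

Note that there is no trailing slash on libraries: with one, rsync copies the folder's contents straight into the destination rather than the folder itself. The -a flag already implies recursive copying.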

You can also add --dry-run if you want to see what files will be copied but not actually do the copy.

For more, try man rsync or http://linux.die.net/man/1/rsync.
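To get a feel for --ignore-existing and --dry-run without touching Gauss, you can rehearse an interrupted transfer between two local folders. This is a minimal sketch: the temporary folders and file names are invented for the demo, and the block does nothing if rsync is not installed.

```shell
#!/bin/sh
# Local rehearsal: "laptop" and "remote" are both temporary folders here,
# so nothing is sent over the network. All paths are made up for the demo.
rsync_demo=$(mktemp -d)
mkdir -p "$rsync_demo/laptop/libraries" "$rsync_demo/remote/libraries"
echo "alpha" > "$rsync_demo/laptop/libraries/a.txt"
echo "beta"  > "$rsync_demo/laptop/libraries/b.txt"

# Pretend the first transfer died after copying only a.txt.
echo "alpha" > "$rsync_demo/remote/libraries/a.txt"

if command -v rsync >/dev/null 2>&1; then
    # Preview: --dry-run lists what would be sent without copying anything.
    rsync -av --ignore-existing --dry-run "$rsync_demo/laptop/libraries" "$rsync_demo/remote/"

    # Real run: a.txt already exists at the destination, so only b.txt is copied.
    # The source has no trailing slash, so the folder itself is matched up.
    rsync -av --ignore-existing "$rsync_demo/laptop/libraries" "$rsync_demo/remote/"

    ls "$rsync_demo/remote/libraries"
else
    echo "rsync is not installed on this machine"
fi
```

For a real transfer, swap the second path for yourusername@gauss.cse.ucdavis.edu:~/ as in the command above.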

Gauss: The Stat Dept Cluster

Aside: Authentication Keys

To log in to Gauss you need to create a public/private key pair. On Mac/Linux this is trivial:

# Make the key:
ssh-keygen -t rsa

This creates a public key (id_rsa.pub) and a private key (id_rsa) that reside in the ~/.ssh directory. You will need to email the public key (id_rsa.pub, never the private key) to help@cse.ucdavis.edu to get access to Gauss. Full instructions:
http://wiki.cse.ucdavis.edu/support:general:security:ssh#setup.
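If you would like to see what ssh-keygen produces before creating your real key, here is a throwaway rehearsal. It is a sketch only: the file name id_rsa_demo and the temporary folder are made up, and the empty passphrase (-N "") is there purely to keep the demo non-interactive. For your real key you would normally set a passphrase.

```shell
#!/bin/sh
# Demo only: the key pair is written to a throwaway folder, not to ~/.ssh.
key_demo=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
    # -q quiet, -t key type, -N passphrase (empty here), -f output file
    ssh-keygen -q -t rsa -N "" -f "$key_demo/id_rsa_demo"

    # Two files appear: the private key and the matching .pub file.
    ls -l "$key_demo"

    # The public half is a single line of text beginning with "ssh-rsa".
    cut -c1-7 "$key_demo/id_rsa_demo.pub"

    # Print the key's fingerprint, handy for checking you sent the right one.
    ssh-keygen -l -f "$key_demo/id_rsa_demo.pub"
else
    echo "ssh-keygen is not installed on this machine"
fi
```

Only the .pub file is ever meant to leave your machine.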

Once your account is ready, copy the public key to Gauss, e.g.:

scp ~/.ssh/id_rsa.pub yourusername@gauss.cse.ucdavis.edu:~/

Now you should be able to login:

ssh yourusername@gauss.cse.ucdavis.edu

Aside: Scripts

Since you will be logging in to Gauss many times, it is advisable to create a script to save you some typing. Create a file called, say, gauss_ssh containing:

#!/bin/bash
ssh -vX myusername@gauss.cse.ucdavis.edu

You may also need to change the permissions to make the script executable:

chmod u+x gauss_ssh

To run the script (and login to Gauss), just type:

./gauss_ssh

Nice. That saved a few keystrokes. Now you are a pro. If you are not already a proficient script writer, then I highly recommend becoming one!
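As a first step towards script writing, here is a slightly more general variant that takes the username as an argument. Treat it as a sketch: the DRY_RUN switch and the default username are made up for illustration, and the script is written to a temporary folder so that the whole create, chmod, run cycle can be tried without connecting to anything.

```shell
#!/bin/sh
# Write a hypothetical, slightly more general login script to a temp folder.
script_demo=$(mktemp -d)

cat > "$script_demo/gauss_ssh" <<'EOF'
#!/bin/sh
# Usage: ./gauss_ssh [username]
# Set DRY_RUN=1 to print the command instead of connecting (made-up switch).
user="${1:-myusername}"
cmd="ssh -X ${user}@gauss.cse.ucdavis.edu"
if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
else
    exec $cmd
fi
EOF

# Make the script executable for its owner, then confirm the permissions.
chmod u+x "$script_demo/gauss_ssh"
ls -l "$script_demo/gauss_ssh"

# Dry run: prints the ssh command rather than logging in.
DRY_RUN=1 "$script_demo/gauss_ssh" alice
```

Without DRY_RUN set, running ./gauss_ssh yourusername logs in as before.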

Archiving/Compressing/Uncompressing

It is frequently necessary to compress large files into smaller ones using standard tools. While .zip archives are familiar to Windows users, the recommended [un]compression tool for this class is tar. Other formats such as .bz2 are also fine, and you may use your preferred approach. Again, on Windows it is usually easier to handle archives via a GUI such as WinZip.

To create a .tar.gz archive from a single file on Mac/Linux:

tar -cvzf myarchive.tar.gz big_file.txt

To create an archive containing a whole directory:

tar -cvzf myarchive.tar.gz dir_to_compress/

To uncompress a .tar.gz archive into the current directory:

tar -xvzf myarchive.tar.gz
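It is worth trying a full round trip once before you rely on an archive. The sketch below works in a temporary folder with made-up file contents. It also uses two options not shown above: -t to list an archive's contents without extracting, and -C to pick the folder that tar works in.

```shell
#!/bin/sh
# Round trip in a throwaway folder: create, list, extract, compare.
tar_demo=$(mktemp -d)
mkdir -p "$tar_demo/dir_to_compress" "$tar_demo/unpacked"
echo "some data" > "$tar_demo/dir_to_compress/big_file.txt"

# Create the archive. -C switches into the folder first, so the archive
# stores the relative path dir_to_compress/... rather than the full temp path.
tar -czf "$tar_demo/myarchive.tar.gz" -C "$tar_demo" dir_to_compress

# List the contents without extracting anything (t = list).
tar -tzf "$tar_demo/myarchive.tar.gz"

# Extract into a different folder, again chosen with -C.
tar -xzf "$tar_demo/myarchive.tar.gz" -C "$tar_demo/unpacked"

# Check that the extracted file matches the original byte for byte.
cmp "$tar_demo/dir_to_compress/big_file.txt" \
    "$tar_demo/unpacked/dir_to_compress/big_file.txt" && echo "round trip OK"
```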

Linux File Editing

When logged into Gauss you will frequently need to edit files.

Since Gauss provides no GUI, this can be tricky for first-timers.

You have two main options:

  1. Nano
  2. Vi(m)

Basic Editing with nano

To open a file (e.g., foo.txt) with nano:

nano foo.txt

Note: If foo.txt does not exist, it will be created (but not saved).

Just type/edit the file as you see it.

  • To quit: CTRL+X (it will prompt for save)
  • All commands are listed on the bottom of the screen

Nano is nice and easy to use (but not very powerful).

Basic Editing with Vi(m)

Vi is an old but popular text editor originally written for Unix in the 1970s.

It is fundamentally different from most text editors in that it has two distinct modes:

  • Normal mode: Commands typed are designed to control the session, not appear in the document.
  • Insert mode: All typing appears in the document itself.

To enter insert mode press i; to return to normal mode press Esc.

When in "Insert mode", the bottom of the screen should read:

-- INSERT --

Vi(m) Basics

All of the following commands must be typed from normal mode, not insert mode:

  • :w Save (i.e., write) to file
  • :w foobar.txt Save to file with filename foobar.txt
  • :q Quit the program
  • :q! Quit without saving
  • /foo Search for the text 'foo' within the document
  • n Repeat the search (jump to next occurrence)
  • G Jump to last line of file
  • :10 Jump to line ten of the file
  • yy Copy the current line to the clipboard
  • 10yy Copy ten lines (the current line and the nine below it) to the clipboard
  • y$ Copy from the cursor to the end of the line

Vi(m) Basics cont...

  • p Paste the contents of the clipboard
  • dd Delete the current line
  • 10dd Delete ten lines (the current line and the nine below it)
  • u Undo the last command
  • :%s/foo/bar/cg Search and replace the text "foo" with "bar" globally, confirming first (hence the cg)

See: Vi Cheat Sheet for more.
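The search-and-replace pattern from the list above also works outside the editor. As a rough sketch, sed accepts the same s/foo/bar/g syntax and applies it to a whole file non-interactively (there is no confirmation step, so the c flag does not apply). The file names and contents here are invented for the demo.

```shell
#!/bin/sh
# Non-interactive counterpart of :%s/foo/bar/g using sed.
sed_demo=$(mktemp -d)
printf 'foo one\nno match here\nfoo and foo\n' > "$sed_demo/notes.txt"

# Replace every "foo" with "bar" and write the result to a new file,
# leaving the original untouched.
sed 's/foo/bar/g' "$sed_demo/notes.txt" > "$sed_demo/notes_new.txt"

cat "$sed_demo/notes_new.txt"
```

Writing to a new file first makes it easy to compare before overwriting anything.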

Git Basics

Why use GitHub?

  • What is GitHub?
    • Rich web-based hosting of git repositories
  • Why use GitHub?
    • Lots of additional features: issues, social components, search, documentation
    • Unlimited free public repositories
    • Private repositories require a subscription

Register for free use of 5 private repositories via an educational account at http://github.com/edu.

GitHub Basics

Gauss: Batch Jobs

When you log in to Gauss, you are logging in to the "head node". The head node is designed to manage the system, not do major calculations: that is what the compute nodes are for.

If you just type R (or python) at the command line, you will be running R/python on the head node. Generally speaking, never do this.

To use the compute nodes on Gauss you need to submit a batch, or array, job.

The compute nodes on Gauss cannot be used interactively: you need to provide a script that will run, and make sure your script saves the output/writes to file to store the results.

In light of this non-interactive nature, it is helpful to develop code locally and make sure it is running properly, before running batch jobs on Gauss.
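To make the "write everything to file" point concrete, here is a minimal sketch of a non-interactive job script that you can test locally. Several things in it are assumptions: these notes do not say which scheduler Gauss runs, so the #SBATCH lines assume SLURM (to the shell they are just comments), and RESULTS_DIR, the file names, and the toy calculation are invented for illustration.

```shell
#!/bin/sh
# Sketch of a non-interactive job script. The #SBATCH lines are only read by
# the SLURM scheduler (an assumption about Gauss); the shell ignores them.
#SBATCH --job-name=demo_job
#SBATCH --output=demo_job_%j.out

# Hypothetical output location; falls back to a temporary folder.
results_dir="${RESULTS_DIR:-$(mktemp -d)}"
results_file="$results_dir/results.txt"

# Stand-in for the real computation. In a real job this would be something
# like running your R or Python script. Everything it prints, including
# error messages, is captured in the results file.
{
    echo "job started"
    echo "sum of 1..10 = $(awk 'BEGIN { for (i = 1; i <= 10; i++) s += i; print s }')"
    echo "job finished"
} > "$results_file" 2>&1

echo "results written to $results_file"
```

Running this on your laptop first, and checking the results file, is exactly the kind of local testing recommended above.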

Using Gauss:

Python Basics

R Basics

To develop on your laptop in R I recommend using RStudio (http://rstudio.org). It provides a comprehensive IDE, and makes it easy to debug.

Who is new to R? Who is planning to use R?

Some resources for learning R: http://www.ats.ucla.edu/stat/r/

You could try: http://tryr.codeschool.com/

Or Google "R tutorial" or "R introduction" and pick your favourite guide.

We'll spend more time on R in future weeks.

That is enough for today... :)