Code + goodies used in Prof. Baines' STA 250 Course (UC Davis, Fall 2013)
The course is organized around the following key topics:
To cover these topics, the course will be broken into four modules: (i) Bayesian Inference and Computation, (ii) Statistics with "Big Data", (iii) Optimization and the EM Algorithm, and (iv) Efficient Computing: Parallelization and GPUs.
The course is designed to equip students with the basic skills required to tackle challenging problems at the forefront of modern statistical applications. For statistics PhD students, there are many rich research topics in these areas. For masters students, and PhD students from other fields, the course is intended to cultivate practical skills that are required of the modern statistician/data scientist, and that can be used in your own field of research or future career.
Before we get into the fun stuff, the first few classes will serve as a "boot camp" to make sure everyone has the mathematical and programming background to tackle the challenges later in the course. We will also use the first few weeks to become familiar with some of the key datasets that we will use throughout the course.
Please complete the pre-course survey!
Grading for the course will be broken down with the following weighting:
There is no final exam for the course.
Please note that since the topics for the course are taken from a variety of different areas, there will
be no single textbook for the class. Below is a list of useful references. It is not required that
you purchase any of these books, and they will primarily serve as additional references
for material presented in class.
Bayesian Data Analysis (2nd Edition), Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B.,
Chapman \& Hall/CRC Texts in Statistical Science (2003) \
Available at http://www.amazon.com/Bayesian-Analysis-Chapman-Statistical-Science/dp/158488388X
and other online retailers. We will be focusing on material in Chapters 2, 3, 6 and 11.
Monte Carlo Statistical Methods (2nd Edition), Robert, C.P. and Casella, G., Springer (2004).
Available at http://www.amazon.com/Monte-Statistical-Methods-Springer-Statistics/dp/0387212396.
We will be focusing on material in Chapters 3, 5, 6 and 12.
A First Course in Bayesian Statistical Methods, Hoff, P. D., Springer (2009)
The full text of this book is available in electronic form via the UC Davis library
for free. To access the full text as a PDF, go to
https://vpn.lib.ucdavis.edu/content/m28q35/,DanaInfo=www.springerlink.com+#section=59276&page=11&locus=51
and log in with your UC Davis username and Kerberos password.
Hard copies are available from:
http://www.amazon.com/Bayesian-Statistical-Methods-Springer-Statistics/dp/0387922997
and other online retailers.
The Python tutorial at: http://docs.python.org/2/tutorial/
For some R references see: http://heather.cs.ucdavis.edu/r.html. For beginners interested in learning R, you may want to start with http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Lecture | Date | Topic | Notes |
---|---|---|---|
01 | Mon 30th Sep: | Course Overview, Demos | |
02 | Wed 2nd Oct: | Boot Camp -- Basics, R, Python | |
03 | Mon 7th Oct: | Boot Camp -- Gauss, Linux, Stats | |
04 | Wed 9th Oct: | Bayes I -- Introduction to Bayes | Homework 0 Due |
05 | Mon 14th Oct: | Bayes II -- MCMC/Bayesian Computing | |
06 | Wed 16th Oct: | Bayes III -- Inference/Model Checking | |
07 | Mon 21st Oct: | Bayes IV -- Applications/Extras | |
08 | Wed 23rd Oct: | Big Data I -- Types of "Big" Data | |
09 | Mon 28th Oct: | Big Data II -- "Big" data strategies | Homework 1 Due |
10 | Wed 30th Oct: | Big Data III -- "Big" data computation | |
11 | Mon 4th Nov: | Big Data IV -- Applications/Extras | |
12 | Wed 6th Nov: | EM I -- Introduction to EM | |
-- | Mon 11th Nov: | NO CLASS -- VETERANS DAY | |
13 | Wed 13th Nov: | EM II -- Variations on EM | Homework 2 Due |
14 | Mon 18th Nov: | EM III -- Parametrization, Convergence | |
15 | Wed 20th Nov: | EM IV -- Efficient algorithms | |
16 | Mon 25th Nov: | GPUs I -- Overview of GPUs | Homework 3 Due |
17 | Wed 27th Nov: | GPUs II -- Programming GPUs | |
18 | Mon 2nd Dec: | GPUs III -- High-level GPU interfaces | |
19 | Wed 4th Dec: | GPUs IV -- Applications/extras | |
-- | Fri 6th Dec: | Homework 4 Due | |
-- | Mon 9th Dec: | Final Project Due | |
The basic outline for homeworks is below. All due dates are subject to change.
Each homework will be followed by a "code-swapping" assignment, whereby each student will be
assigned to write a short critique of the homework code submitted by another student in the
course. The plan is for R users to critique Python code, and Python users to critique R code,
so that students are exposed to different programming models.
and, last, but by no means least:
LaTeX and Markdown. Other formats that can be converted to .pdf, such as .docx, .txt, etc., can be used when taking notes in class, but these must be converted into either LaTeX or Markdown prior to submission. Scanned versions of handwritten notes are not permitted as these cannot be edited by others later. With the anticipated course enrollment it is expected that each student will only need to take notes once during the quarter. Lecture notes will be available to all students.

Below are very basic project descriptions to give you a flavor of what to expect. Full final project descriptions and datasets will be provided later in the course. All final projects require a written report, and possibly an oral presentation.
Students are also welcome to submit their own proposals for final projects. This can be
especially useful for PhD students with specific problems that they would like to address using
the tools learned in class.
Bayesian Project: This project will give you the opportunity to apply your newly acquired knowledge of Bayesian statistics and computational strategies to a complex model for real-world data (provided by the instructor). You will be required to demonstrate, using simulation results, that you are able to solve the problem effectively, and also to draw conclusions based on real data.
Big Data Project: This project will extend some of the skills developed in the "Big" data module. It will involve a computationally challenging analysis of a "big" dataset, including model development, refinement, verification and application.
EM Project: Throughout the course you will be introduced to the Expectation-Maximization (EM) algorithm, and many more sophisticated extensions of it. This final project will require you to derive several of these algorithms for a specific statistical model. Once derived, your job will be to implement the algorithms and run simulations to compare their performance. The final report will detail your algorithms, explain any implementation decisions made, and summarize your findings.
GPU Project: In the fourth and final module of the course you will be introduced to Graphics Processing Units (GPUs) and how they can be used for statistical computation. Using the tools you have learned in class, this final project will require you to implement a statistical analysis that makes use of the power of the GPU. You will be required to implement, debug, test, optimize and evaluate your code.