STA 250 Lecture 11

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

Today is Big Data lecture 4. First code-swaps due today.

On the menu for today...

  1. Intro to Python

  2. Hive Wrap-up

  3. Bayes + Big Data

  4. Big Data Wrap-up

Python

Since Homework 2 recommends the use of Python for your MapReduce scripts (although, as discussed in the last lecture, other languages can be used too), let's talk about the basics of Python.

Python Basics

There are two major versions of Python available: Python 2 and Python 3.

Unlike with R releases, there are actually substantial differences between versions, and it is common to run into code that works on one version but not another.

  • Python 2 versions most used at the time of writing are 2.6 and 2.7.

  • Python 3 versions most used at the time of writing are 3.2 and 3.3.

  • It doesn't particularly matter which version you use, just try to use the same version across different machines (e.g., laptop, Gauss, Amazon).

  • Obviously, writing Python 3 compatible code will give your work a longer shelf-life.
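To illustrate the version differences mentioned above, here is a minimal sketch of two of the most visible ones (print and integer division); try each snippet under both interpreters:

# print is a statement in Python 2 but a function in Python 3;
# the parenthesized form works under both:
print("hello")
# the Python 2-only form (a SyntaxError under Python 3):
# print "hello"

# Integer division also differs: 7/2 is 3 in Python 2 but 3.5 in Python 3.
# The // operator gives floor division under both versions:
print(7 // 2)    # 3 in both Python 2 and Python 3
print(7 / 2.0)   # 3.5 in both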

Python Overview

From http://www.python.org/about/

  • very clear, readable syntax
  • strong introspection capabilities
  • intuitive object orientation
  • natural expression of procedural code
  • full modularity, supporting hierarchical packages
  • exception-based error handling
  • very high level dynamic data types
  • extensive standard libraries and third party modules for virtually every task
  • extensions and modules easily written in C, C++ (or Java for Jython, or .NET languages for IronPython)
  • embeddable within applications as a scripting interface

Python for Data Analysis

Python is maturing as a feasible language for data analysis. It is not currently as powerful as R, given R's huge collection of packages implementing essentially every statistical procedure ever invented.

Nevertheless, Python plays well with Big Data (better than R, although R can still be used), and its elegance and power for file handling and text processing far outstrip R's. For other types of data analysis, I'd typically stick to R (ignoring other considerations such as collaborators, community usage, etc.).
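As a small taste of the file handling and text processing just mentioned, here is a minimal sketch (the filename words.txt is hypothetical) that streams a file line by line and tallies word counts:

# Tally word counts in a plain-text file (assumes 'words.txt' exists)
counts = {}
with open("words.txt") as f:
    for line in f:      # iterates line by line; the file is never fully in memory
        for word in line.strip().split():
            counts[word] = counts.get(word, 0) + 1
print(counts)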

Running Python

Python can be run in an interactive manner (as with R), or to execute scripts.

e.g.,

pdbaines@gauss:~$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 

or:

python myscript.py           # run script
python myscript.py arg1 arg2 # with command line arguments
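Inside a script, command line arguments are available via sys.argv. A minimal sketch of what a (hypothetical) myscript.py might contain:

# myscript.py: echo back any command line arguments
import sys

print("Script name: " + sys.argv[0])  # sys.argv[0] is the script name itself
for arg in sys.argv[1:]:              # remaining entries are arg1, arg2, ...
    print("Got argument: " + arg)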

Python basics

>>> foo = 6
>>> bar = 8.5
>>> type(foo)
<type 'int'>
>>> type(bar)
<type 'float'>
>>> foo + bar
14.5
>>> baz = "hello"
>>> baz + " class"
'hello class'

Python basics

if statements begin with if condition: (the parentheses are optional) followed by an indented block of code. Once the indented block ends, the if statement is considered closed.

>>> i = 0
>>> if (i>0):
...     print("i is bigger than zero")
... else:
...     print("i is not bigger than zero")
... 
i is not bigger than zero

>>> i = 1
>>> if i == True:
...     print("Yes")
...     print("i is indeed True")
...
Yes
i is indeed True

Python basics

for loops iterate over elements of a sequence. e.g.,

>>> stuff = ['hello', 789, True, False]
>>> for i in stuff:
...     print("i is: " + str(i))
... 
i is: hello
i is: 789
i is: True
i is: False

Python basics

To iterate over numbers (e.g., for (i in 1:n){...}) use range(0,n).

>>> for i in range(0,5):
...     print(i)
... 
0
1
2
3
4

Note: no 5. range(0,n) is not inclusive of the last value!

Python basics

Additional functionality is provided by Python modules (key ones: numpy, scipy, matplotlib).

>>> import numpy as np # access functions within numpy using np.*()
>>> foo = np.array([1,2,3,4,5])
>>> foo
array([1, 2, 3, 4, 5])
>>> foo[0] # zero indexing
1
>>> foo[0:3] # slicing: the 'to' index is not inclusive!
array([1, 2, 3])
>>> foo[0:1]
array([1])
>>> foo[1:]
array([2, 3, 4, 5])
>>> foo[:1]
array([1])
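numpy is demonstrated above; for a quick taste of the other two modules, here is a minimal sketch using scipy.stats and matplotlib (the output filename density.png is arbitrary; the Agg backend renders to file, handy on headless machines such as Gauss):

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use("Agg")            # render to file, no display required
import matplotlib.pyplot as plt

x = np.linspace(-4.0, 4.0, 200)
plt.plot(x, stats.norm.pdf(x))   # standard normal density
plt.savefig("density.png")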

Python basics

Be careful with slicing!

>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:]
>>> bar
array([3, 4, 5])
>>> foo[3] = 100
>>> foo
array([  1,   2,   3, 100,   5])
>>> bar
array([  3, 100,   5])

Variables such as foo and bar can be thought of as tags that point to memory locations. Actual copies can be made if needed.

See: http://www.precheur.org/python/copy_list

Python basics

To copy a slice (more than one way to do this):

>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:].copy()
>>> foo[3] = 10
>>> foo
array([ 1,  2,  3, 10,  5])
>>> bar
array([3, 4, 5])

Python basics

To convert between strings, ints and floats:

>>> foo = "5.6"
>>> float(foo)
5.6
>>> str(6.5)
'6.5'

Rounding, flooring etc:

>>> import math
>>> math.floor(5.6)
5.0
>>> round(10.566,2)
10.57

Hive

For the homework you won't need to do too much with Hive. Just two things:

  • Load in the data
  • Aggregate values across groups (specifically, compute within-group means and variances)

Since this is fairly limited, I will not give you any pointers as to how to do it (unlike most other HW questions where you get more of a template). Googling and digging around to find out how to do things is an important skill. :)

Hint:

GROUP BY

Bayes + Big Data

We have done two modules so far: Bayes and Big Data. Let's put them together and talk about Bayesian Approaches to Big Data.

Specifically, let's discuss the Consensus Monte Carlo approach presented by Steve Scott (Google) at the Stat Dept seminar last month.

To the board!
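For those reading these notes later, here is a minimal sketch of the consensus idea, not Scott et al.'s exact algorithm: run a separate sampler on each shard of the data, then combine the shard-level draws using precision-based weights. The Gaussian model, the shard count S, and the direct posterior simulation below are all illustrative assumptions (in practice each shard would run its own MCMC):

import numpy as np

np.random.seed(0)
S, n, ndraws = 10, 100000, 1000
y = np.random.normal(2.0, 1.0, n)    # simulated data, true mean = 2
shards = np.array_split(y, S)

draws = np.empty((S, ndraws))
for s in range(S):
    m = len(shards[s])
    # Shard-level posterior for the mean: N(ybar_s, 1/m) under a flat
    # prior (known unit variance); sampled directly here for simplicity
    draws[s, :] = np.random.normal(shards[s].mean(), np.sqrt(1.0 / m), ndraws)

# Consensus step: average the j-th draw across shards, weighting each
# shard by the inverse of its shard-level posterior variance
w = 1.0 / draws.var(axis=1)                             # one weight per shard
consensus = (w[:, None] * draws).sum(axis=0) / w.sum()  # ndraws combined draws

print(consensus.mean())  # close to the full-data posterior mean, y.mean()

In this Gaussian example the precision-weighted combination reproduces the full-data posterior (up to Monte Carlo error); for non-Gaussian posteriors the combination is only approximate.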

Big Data Wrap-up

Take-home message from Big Data Module:

  • Big Data requires computational skills and methodological wizardry
  • We have seen two major methodological approaches to Big Data: the Bag of Little Bootstraps and Consensus Monte Carlo
  • A key computational skill for Big Data is the ability to work with data that is (i) too large to store in memory, and/or (ii) stored across multiple machines.
  • The MapReduce paradigm provides a powerful, scalable framework for computing with Big Data (see the word-count sketch after this list)
  • Hadoop provides a platform for distributed computing with MapReduce
  • Higher-level interfaces to Hadoop such as Hive can shorten time-to-productivity (albeit with some loss of computational efficiency)
  • Cloud computing provides easy access to on-demand scalable computing infrastructure
  • Hopefully you now have the skills to actually do something with Big Data! :)
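As one final concrete illustration (the file names are hypothetical), a minimal Hadoop-streaming-style word count in Python; Hadoop streaming feeds each script via stdin/stdout and sorts the mapper output by key before it reaches the reducer:

# wc_mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# wc_reducer.py: sum the counts for each word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, int(count)
if current is not None:
    print(current + "\t" + str(total))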

That is enough for today... :)

[Image: "Keep Off" sign]

Wed: Optimization + EM.