STA 250 Lecture 11

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

Today is Big Data lecture 4. First code-swaps due today.

On the menu for today...

  1. Intro to Python

  2. Hive Wrap-up

  3. Bayes + Big Data

  4. Big Data Wrap-up

Python

Since Homework 2 recommends the use of Python for your MapReduce scripts (although, as discussed in the last lecture, other languages can be used too), let's talk about the basics of Python.

Python Basics

There are two major versions of Python available: Python 2 and Python 3.

Unlike with R releases, there are actually substantial differences between versions, and it is common to run into code that works on one version but not another.

  • Python 2 versions most used at the time of writing are 2.6 and 2.7.

  • Python 3 versions most used at the time of writing are 3.2 and 3.3.

  • It doesn't particularly matter which version you use, just try to use the same version across different machines (e.g., laptop, Gauss, Amazon).

  • Obviously, writing Python 3 compatible code will give your work a longer shelf-life.
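To illustrate the version differences mentioned above, here is a minimal sketch of two of the most visible ones (print and integer division); try each snippet under both interpreters:

# print is a statement in Python 2 but a function in Python 3;
# the parenthesized form works under both:
print("hello")
# the Python 2-only form (a SyntaxError under Python 3):
# print "hello"

# Integer division also differs: 7/2 is 3 in Python 2 but 3.5 in Python 3.
# The // operator gives floor division under both versions:
print(7 // 2)    # 3 in both Python 2 and Python 3
print(7 / 2.0)   # 3.5 in both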

Python Overview

From http://www.python.org/about/

  • very clear, readable syntax
  • strong introspection capabilities
  • intuitive object orientation
  • natural expression of procedural code
  • full modularity, supporting hierarchical packages
  • exception-based error handling
  • very high level dynamic data types
  • extensive standard libraries and third party modules for virtually every task
  • extensions and modules easily written in C, C++ (or Java for Jython, or .NET languages for IronPython)
  • embeddable within applications as a scripting interface

Python for Data Analysis

Python is maturing as a feasible language for data analysis. It is not currently as powerful as R, given R's huge collection of packages implementing essentially every statistical procedure ever invented.

Nevertheless, Python plays well with Big Data (better than R, although R can still be used), and its elegance and power for file handling and text processing far outstrip R's. For other types of data analysis, I'd typically stick to R (ignoring other considerations such as collaborators, community usage, etc.).
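As a small taste of the file handling and text processing just mentioned, here is a minimal sketch (the filename words.txt is hypothetical) that streams a file line by line and tallies word counts:

# Tally word counts in a plain-text file (assumes 'words.txt' exists)
counts = {}
with open("words.txt") as f:
    for line in f:      # iterates line by line; the file is never fully in memory
        for word in line.strip().split():
            counts[word] = counts.get(word, 0) + 1
print(counts)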

Running Python

Python can be run in an interactive manner (as with R), or to execute scripts.

e.g.,

pdbaines@gauss:~$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 

or:

python myscript.py           # run script
python myscript.py arg1 arg2 # with command line arguments
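Inside a script, command line arguments are available via sys.argv. A minimal sketch of what a (hypothetical) myscript.py might contain:

# myscript.py: echo back any command line arguments
import sys

print("Script name: " + sys.argv[0])  # sys.argv[0] is the script name itself
for arg in sys.argv[1:]:              # remaining entries are arg1, arg2, ...
    print("Got argument: " + arg)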

Python basics

>>> foo = 6
>>> bar = 8.5
>>> type(foo)
<type 'int'>
>>> type(bar)
<type 'float'>
>>> foo + bar
14.5
>>> baz = "hello"
>>> baz + " class"
'hello class'

Python basics

if statements begin with if condition: (the parentheses are optional) followed by an indented block of code. Once the indented block ends, the if statement is considered closed.

>>> i = 0
>>> if (i>0):
...     print("i is bigger than zero")
... else:
...     print("i is not bigger than zero")
... 
i is not bigger than zero

>>> i = 1
>>> if i == True:
...     print("Yes")
...     print("i is indeed True")
...
Yes
i is indeed True

Python basics

for loops iterate over elements of a sequence. e.g.,

>>> stuff = ['hello', 789, True, False]
>>> for i in stuff:
...     print("i is: " + str(i))
... 
i is: hello
i is: 789
i is: True
i is: False

Python basics

To iterate over numbers (e.g., for (i in 1:n){...}) use range(0,n).

>>> for i in range(0,5):
...     print(i)
... 
0
1
2
3
4

Note: no 5. range(0,n) is not inclusive of the last value!

Python basics

Additional functionality is provided by Python modules (key ones: numpy, scipy, matplotlib).

>>> import numpy as np # access functions within numpy using np.*()
>>> foo = np.array([1,2,3,4,5])
>>> foo
array([1, 2, 3, 4, 5])
>>> foo[0] # zero indexing
1
>>> foo[0:3] # slicing: the 'to' index is not inclusive!
array([1, 2, 3])
>>> foo[0:1]
array([1])
>>> foo[1:]
array([2, 3, 4, 5])
>>> foo[:1]
array([1])
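numpy is demonstrated above; for a quick taste of the other two modules, here is a minimal sketch using scipy.stats and matplotlib (the output filename density.png is arbitrary; the Agg backend renders to file, handy on headless machines such as Gauss):

import numpy as np
from scipy import stats
import matplotlib
matplotlib.use("Agg")            # render to file, no display required
import matplotlib.pyplot as plt

x = np.linspace(-4.0, 4.0, 200)
plt.plot(x, stats.norm.pdf(x))   # standard normal density
plt.savefig("density.png")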

Python basics

Be careful with slicing!

>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:]
>>> bar
array([3, 4, 5])
>>> foo[3] = 100
>>> foo
array([  1,   2,   3, 100,   5])
>>> bar
array([  3, 100,   5])

Variables such as foo and bar can be thought of as tags that point to memory locations. Actual copies can be made if needed.

See: http://www.precheur.org/python/copy_list

Python basics

To copy a slice (more than one way to do this):

>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:].copy()
>>> foo[3] = 10
>>> foo
array([ 1,  2,  3, 10,  5])
>>> bar
array([3, 4, 5])

Python basics

To convert between strings, ints and floats:

>>> foo = "5.6"
>>> float(foo)
5.6
>>> str(6.5)
'6.5'

Rounding, flooring etc:

>>> import math
>>> math.floor(5.6)
5.0
>>> round(10.566,2)
10.57

Hive

For the homework you won't need to do too much with Hive. Just two things:

  • Load in the data
  • Aggregate values across groups (specifically, compute within-group means and variances)

Since this is fairly limited, I will not give you any pointers as to how to do it (unlike most other HW questions where you get more of a template). Googling and digging around to find out how to do things is an important skill. :)

Hint:

GROUP BY

Bayes + Big Data

We have done two modules so far: Bayes and Big Data. Let's put them together and talk about Bayesian Approaches to Big Data.

Specifically, let's discuss the Consensus Monte Carlo approach presented by Steve Scott (Google) at the Stat Dept seminar last month.

To the board!
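For those reading these notes later, here is a minimal sketch of the consensus idea, not Scott et al.'s exact algorithm: run a separate sampler on each shard of the data, then combine the shard-level draws using precision-based weights. The Gaussian model, the shard count S, and the direct posterior simulation below are all illustrative assumptions (in practice each shard would run its own MCMC):

import numpy as np

np.random.seed(0)
S, n, ndraws = 10, 100000, 1000
y = np.random.normal(2.0, 1.0, n)    # simulated data, true mean = 2
shards = np.array_split(y, S)

draws = np.empty((S, ndraws))
for s in range(S):
    m = len(shards[s])
    # Shard-level posterior for the mean: N(ybar_s, 1/m) under a flat
    # prior (known unit variance); sampled directly here for simplicity
    draws[s, :] = np.random.normal(shards[s].mean(), np.sqrt(1.0 / m), ndraws)

# Consensus step: average the j-th draw across shards, weighting each
# shard by the inverse of its shard-level posterior variance
w = 1.0 / draws.var(axis=1)                             # one weight per shard
consensus = (w[:, None] * draws).sum(axis=0) / w.sum()  # ndraws combined draws

print(consensus.mean())  # close to the full-data posterior mean, y.mean()

In this Gaussian example the precision-weighted combination reproduces the full-data posterior (up to Monte Carlo error); for non-Gaussian posteriors the combination is only approximate.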

Big Data Wrap-up

Take-home message from Big Data Module:

  • Big Data requires computational skills and methodological wizardry
  • We have seen two major methodological approaches to Big Data: the Bag of Little Bootstraps and Consensus Monte Carlo
  • A key computational skill for Big Data is the ability to work with data that is (i) too large to store in memory, and/or (ii) stored across multiple machines.
  • The MapReduce paradigm provides a powerful, scalable framework for computing with Big Data (see the word-count sketch after this list)
  • Hadoop provides a platform for distributed computing with MapReduce
  • Higher-level interfaces to Hadoop such as Hive can shorten time-to-productivity (albeit with some loss of computational efficiency)
  • Cloud computing provides easy access to on-demand scalable computing infrastructure
  • Hopefully you now have the skills to actually do something with Big Data! :)
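As one final concrete illustration (the file names are hypothetical), a minimal Hadoop-streaming-style word count in Python; Hadoop streaming feeds each script via stdin/stdout and sorts the mapper output by key before it reaches the reducer:

# wc_mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# wc_reducer.py: sum the counts for each word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current:
        total += int(count)
    else:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, int(count)
if current is not None:
    print(current + "\t" + str(total))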

That is enough for today... :)

[Image: "Keep Off" sign]

Wed: Optimization + EM.