Welcome to STA 250!
Today is Big Data
lecture 4. First code-swaps due today.
On the menu for today...
Intro to Python
Hive Wrap-up
Bayes + Big Data
Big Data Wrap-up
Paul D. Baines
Since Homework 2 recommends the use of Python for your MapReduce scripts (although, as discussed in the last lecture, other languages can be used too), let's talk about the basics of Python.
There are two major versions of Python available: Python 2 and Python 3.
Unlike with R releases, there are actually substantial differences between versions, and it is common to run into code that works on one version but not another.
Python 2 versions most used at the time of writing are 2.6 and 2.7.
Python 3 versions most used at the time of writing are 3.2 and 3.3.
It doesn't particularly matter which version you use, just try to use the same version across different machines (e.g., laptop, Gauss, Amazon).
Obviously, writing Python 3 compatible code will give your work a longer shelf-life.
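To make the version differences concrete, here is a small sketch (run it under either version) showing one of the best-known changes, integer division:

```python
import sys

# Which major version is running:
print(sys.version_info.major)

# Integer division changed between versions:
# in Python 2, 7 / 2 == 3; in Python 3 it is 3.5.
# Floor division with // behaves the same in both.
print(7 // 2)   # 3 on both versions
print(7 / 2)    # 3 under Python 2, 3.5 under Python 3
```

Using `//` when you want floor division is one easy way to keep code working across versions.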
From http://www.python.org/about/
Python is maturing as a feasible language for data analysis. It is not currently as powerful as R, given R's huge collection of packages implementing essentially every statistical procedure ever invented.
Nevertheless, Python plays well with Big Data (better than R, although R can still be used), and its beauty, elegance and power for file handling and text processing far outstrip R's. For other types of data analysis, I'd typically stick to R (ignoring other considerations such as collaborators, community usage etc.).
Python can be run in an interactive manner (as with R), or to execute scripts.
e.g.,
pdbaines@gauss:~$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
or:
python myscript.py # run script
python myscript.py arg1 arg2 # with command line arguments
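Command-line arguments arrive in the list `sys.argv`. A minimal script sketch (the filename `myscript.py` and the helper `parse_args` are just illustrative names):

```python
import sys

def parse_args(argv):
    """Split an argv-style list into (script_name, list_of_arguments)."""
    # argv[0] is the script name; the rest are the arguments
    return argv[0], argv[1:]

if __name__ == "__main__":
    name, args = parse_args(sys.argv)
    print("script: " + name)
    for a in args:
        print("argument: " + a)
```

Running `python myscript.py arg1 arg2` would print the script name followed by each argument.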
>>> foo = 6
>>> bar = 8.5
>>> type(foo)
<type 'int'>
>>> type(bar)
<type 'float'>
>>> foo + bar
14.5
>>> baz = "hello"
>>> baz + " class"
'hello class'
if
statements are delimited by if (condition):
and then indented blocks of code.
Once the indentation block terminates, the if
statement is considered closed.
>>> i = 0
>>> if (i > 0):
...     print("i is bigger than zero")
... else:
...     print("i is not bigger than zero")
...
i is not bigger than zero
>>> i = 1
>>> if i == True:
...     print("Yes")
...     print("i is indeed True")
...
Yes
i is indeed True
for
loops iterate over elements of a sequence. e.g.,
>>> stuff = ['hello', 789, True, False]
>>> for i in stuff:
...     print("i is: " + str(i))
...
i is: hello
i is: 789
i is: True
i is: False
To iterate over numbers (e.g., R's for (i in 1:n){...}), use range(0,n).
>>> for i in range(0,5):
...     print(i)
...
0
1
2
3
4
Note: no 5. range(0,n) is not inclusive of the last value!
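A small sketch of range in a loop, the Python counterpart of an R-style accumulation over 1:n:

```python
# range(0, n) starts at 0 and excludes n, so this sums 0+1+2+3+4.
total = 0
for i in range(0, 5):
    total += i
print(total)  # 10

# range(n) is shorthand for range(0, n)
print(list(range(3)))  # [0, 1, 2]
```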
Additional functionality is provided by Python modules (key ones: numpy, scipy, matplotlib).
>>> import numpy as np # access functions within numpy using np.*()
>>> foo = np.array([1,2,3,4,5])
>>> foo
array([1, 2, 3, 4, 5])
>>> foo[0] # zero indexing
1
>>> foo[0:3] # slicing: the "to" index is not inclusive!
array([1, 2, 3])
>>> foo[0:1]
array([1])
>>> foo[1:]
array([2, 3, 4, 5])
>>> foo[:1]
array([1])
Be careful with slicing!
>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:]
>>> bar
array([3, 4, 5])
>>> foo[3] = 100
>>> foo
array([ 1, 2, 3, 100, 5])
>>> bar
array([ 3, 100, 5])
Variables such as foo and bar can be thought of as tags that point to memory locations. Actual copies can be made if needed.
To copy a slice (more than one way to do this):
>>> foo = np.array([1,2,3,4,5])
>>> bar = foo[2:].copy()
>>> foo[3] = 10
>>> foo
array([ 1,  2,  3, 10,  5])
>>> bar
array([3, 4, 5])
To convert between strings, ints and floats:
>>> foo = "5.6"
>>> float(foo)
5.6
>>> str(6.5)
'6.5'
Rounding, flooring etc:
>>> import math
>>> math.floor(5.6)
5.0
>>> round(10.566,2)
10.57
For the homework you won't need to do too much with Hive. Just two things:
Since this is fairly limited, I will not give you any pointers as to how to do it (unlike most other HW questions where you get more of a template). Googling and digging around to find out how to do things is an important skill. :)
Hint:
GROUP BY
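For intuition only: the effect of a GROUP BY aggregation (grouped sums, say) can be mimicked in plain Python. The rows and column meanings below are made up for illustration; in Hive you would express this in HiveQL over a real table.

```python
from collections import defaultdict

# Hypothetical rows: (key, value) pairs, e.g. (group_label, count)
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# Roughly what "SELECT key, SUM(value) ... GROUP BY key" computes:
totals = defaultdict(int)
for key, value in rows:
    totals[key] += value

print(dict(totals))  # {'a': 4, 'b': 6}
```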
We have done two modules so far: Bayes and Big Data. Let's put them together and talk about Bayesian approaches to Big Data.
Specifically, let's discuss the Consensus Monte Carlo approach presented by Steve Scott (Google) at the Stat Dept seminar last month.
To the board!
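A rough numerical sketch of the Consensus Monte Carlo idea: run a sampler on each data shard separately, then combine the per-shard draws with a precision-weighted average. Here the "per-shard posterior draws" are just simulated Gaussian samples (an assumption for illustration), not output of a real MCMC run:

```python
import numpy as np

rng = np.random.default_rng(0)

def consensus_combine(shard_draws):
    """Combine per-shard posterior draws by a precision-weighted average.

    shard_draws: list of 1-D arrays of draws, one array per shard.
    """
    # Weight each shard by the inverse of its empirical draw variance
    weights = np.array([1.0 / np.var(d) for d in shard_draws])
    draws = np.vstack(shard_draws)  # shape: (num_shards, num_draws)
    # Weighted average across shards, draw by draw
    return (weights[:, None] * draws).sum(axis=0) / weights.sum()

# Stand-in "shard posteriors": Gaussian draws centered near 1.0
shards = [rng.normal(loc=m, scale=1.0, size=1000) for m in (0.9, 1.0, 1.1)]
combined = consensus_combine(shards)
print(combined.mean())  # close to 1.0
```

The averaging rule is exact when each shard-level posterior is Gaussian; otherwise it is an approximation.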
Take-home message from Big Data Module:
Source. Wed: Optimization + EM.*