STA 250 - Lecture 9

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

On the menu for today...

  1. Distributed File Systems

  2. MapReduce

  3. Hadoop

  4. Example: Word Count

Recap: Distributed File Systems

Last lecture we briefly discussed distributed file systems. These file systems manage data "distributed" across many machines.

To program with data stored in this way we need a programming framework/paradigm: we don't want to manually keep track of where the data is located, pass it back and forth, etc.; we want the programming model to take care of that for us.

Enter MapReduce.

MapReduce

MapReduce is the name of both the programming model and its implementation.

The programming model is actually quite simple...

MapReduce

There are two steps to writing a MapReduce program:

  1. Map: For every data element, a function is applied to that element, and it returns a (key,value) pair.

  2. Reduce: For every element with the same key, a function is applied to combine the values.
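
Outside of any distributed system, the same two steps can be mimicked in a few lines of plain Python. The toy data below is purely illustrative; this is only a sketch of the logic, since the point of MapReduce is that the framework handles the grouping and the distribution across machines for you.

from itertools import groupby

# toy output of a "map" step: one (key, value) pair per data element
mapped = [("a", 2), ("b", 5), ("a", 3), ("b", 1)]

# group pairs that share a key (Hadoop does this via a sort/shuffle)
mapped.sort(key=lambda kv: kv[0])

# "reduce": combine the values within each key (here, a sum)
for key, group in groupby(mapped, key=lambda kv: kv[0]):
    print('%s\t%d' % (key, sum(v for _, v in group)))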

Classic Example 1

Suppose we want to count the number of times each word appears in a document. Each word in the document is a data element. For example:

Angry Bob was angry that little Bob was angry at big Bob.

Classic Example 1 cont...

For the map step we need to decide on what (key,value) pairs to emit.

Map: For each word, emit: (word, 1). Result:

(Angry, 1) 
(Bob, 1) 
(was, 1) 
(angry, 1) 
(that, 1)
(little, 1)
(Bob, 1)
(was, 1) 
(angry, 1) 
(at, 1) 
(big, 1) 
(Bob, 1)

Classic Example 1 cont...

Next, for the reduce step, (conceptually) all key-value pairs with the same key (i.e., the same words) are combined.

(Angry, 1) 
----------
(Bob, 1)
(Bob, 1)
(Bob, 1)
----------
(was, 1)
(was, 1)
----------
(angry, 1)
(angry, 1)
----------
...

Classic Example 1 cont...

Lastly, we need a reduce function to apply to the set of values within each unique key. Here, that is just a sum.

Result:

(Angry, 1)
(Bob, 3)
(was, 2)
(angry, 2)
(that, 1)
(little, 1)
(at, 1)
(big, 1)

Voila!

Classic Example 1: Pseudocode Implementation

map(String key, String value): 
    // key: document name, 
    // value: document contents 
    for each word w in value: 
        EmitIntermediate(w, "1"); 

reduce(String key, Iterator values):
    // key: a word 
    // values: a list of counts
    int word_count = 0;
    for each v in values:
        word_count += ParseInt(v); 
    Emit(key, AsString(word_count));

The input to map is a set of documents: the key is the document (file) name, and the value holds the document's contents.

Working with Hadoop

Hadoop is written in Java, and MapReduce programs were traditionally written directly in Java (and often still are, for speed).

Fortunately, the Hadoop Streaming API allows for MapReduce programs to run using any executable or script.

Therefore, for non-Java users, it is often convenient to write MapReduce programs in Python using the Hadoop Streaming interface.

So let's write the word count example in Python...

Classic Ex 1: Python Implementation (Map)

Key note: the output of each mapper must be <string><tab><string>, with one line per emitted (key, value) pair.

#!/usr/bin/env python
# From: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a (word, 1) pair for each word
    for word in words:
        # write results to STDOUT (standard output); what we output here will
        # be the input for the Reduce step, i.e. the input for reducer.py
        print '%s\t%s' % (word, 1)

Classic Ex 1: Python Implementation (Reduce)

Key note: Hadoop sorts the mapper output by key before it reaches the reducer, so all lines for the same word arrive consecutively.

#!/usr/bin/env python
# From: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN (the sorted output of the mappers)
for line in sys.stdin:
    line = line.strip()
    # parse the (word, count) pair produced by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently discard this line
        continue
    # this works because Hadoop sorts the map output by key
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
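
Tip: before running on Hadoop, the two scripts can be tested locally with an ordinary shell pipeline, e.g. cat pg20417.txt | ./mapper.py | sort -k1,1 | ./reducer.py (after making both scripts executable with chmod +x); the sort step mimics Hadoop's grouping of the map output by key.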

Hadoop Overview

  1. Namenode: Master server that manages the filesystem namespace and regulates access to files by clients
  2. Datanode: One per node in the cluster; each manages the storage attached to the node it runs on
  3. Jobtracker: Manages the assignment of map and reduce tasks to the tasktrackers
  4. Tasktrackers: Execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases

Reference: http://wiki.apache.org/hadoop/ProjectDescription

About the HDFS

Since Hadoop is designed to scale to many machines, one key feature is fault tolerance, i.e., if one machine breaks down, the whole system doesn't break.

Therefore, redundancy is built into the system. By default, each file is replicated three times (although this can be configured). Files are replicated across different datanodes.

Large files are broken into "chunks" (blocks), with a default chunk size of 64MB. Working with chunks rather than whole files makes both data transfer and replication easier.
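
(Both settings are configurable; for Hadoop 1.x this is typically done via the dfs.replication and dfs.block.size properties in hdfs-site.xml.)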

Getting files to the HDFS

$ hadoop fs -mkdir test
$ hadoop fs -copyFromLocal pg* test/
$ hadoop fs -ls
  Found 1 items
  drwxr-xr-x   - pdbaines supergroup          0 2013-10-24 17:27 /user/pdbaines/test

$ hadoop fs -ls test/
  Found 3 items
  -rw-r--r--   1 pdbaines supergroup     674570 2013-10-24 17:27 /user/pdbaines/test/pg20417.txt
  -rw-r--r--   1 pdbaines supergroup    1573150 2013-10-24 17:27 /user/pdbaines/test/pg4300.txt
  -rw-r--r--   1 pdbaines supergroup    1423803 2013-10-24 17:27 /user/pdbaines/test/pg5000.txt
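
(The second column of the file listing, 1 here, is the replication factor; this output presumably comes from a single-node install, where replication is often set to 1.)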

Running the MapReduce Example

#!/bin/bash

HADOOP_HOME=/usr/local/Cellar/hadoop/1.2.1
JAR=libexec/contrib/streaming/hadoop-streaming-1.2.1.jar
HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"

$HSTREAMING \
    -file mapper.py    -mapper mapper.py \
    -file reducer.py   -reducer reducer.py \
    -input test/pg* -output test-output

Cleanup

$ hadoop fs -copyToLocal test-output ./
$ ls -alh test-output
$ hadoop fs -rmr test-output

Checking Hadoop Job Progress/Status
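
For a local Hadoop 1.x install like the one above, job progress can typically be monitored through the jobtracker's web UI (http://localhost:50030 with the default configuration; the namenode/HDFS UI is at http://localhost:50070), or from the command line with hadoop job -list.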

Example 2

Let's do another example, this time involving numbers.

Data:

100 -0.92681706290969
23  2.14354753714106
77  -0.347487277923425
85  0.180645560906827
44  -4.25740698971681
13  0.449197843116687
...

Task: Compute within-group means.
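
One possible approach with Hadoop Streaming is sketched below. This is only a sketch, not a worked solution: the file names mapper.py/reducer.py are illustrative, and the input is assumed to be whitespace-separated "group value" lines as above.

#!/usr/bin/env python
# mapper.py (sketch): emit tab-separated (group, value) pairs
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) != 2:
        continue                          # skip malformed lines
    print('%s\t%s' % (fields[0], fields[1]))

#!/usr/bin/env python
# reducer.py (sketch): input arrives sorted by group, so accumulate until the group changes
import sys

current_group, total, n = None, 0.0, 0

def emit(group, total, n):
    if group is not None and n > 0:
        print('%s\t%s' % (group, total / n))

for line in sys.stdin:
    group, value = line.strip().split('\t', 1)
    if group != current_group:
        emit(current_group, total, n)     # finished the previous group
        current_group, total, n = group, 0.0, 0
    total += float(value)
    n += 1

emit(current_group, total, n)             # don't forget the last group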

Example 3

Same data as before. Let's take it one step further...

Data:

100 -0.92681706290969
23  2.14354753714106
77  -0.347487277923425
85  0.180645560906827
44  -4.25740698971681
13  0.449197843116687
...

Task: Compute within-group variances.
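
The mapper from the means sketch above can be reused unchanged; only the reducer needs to change, accumulating n, the sum, and the sum of squares within each group. Again, this is just a sketch (and the one-pass formula below can be numerically unstable; a more careful version would use a Welford-style running update):

#!/usr/bin/env python
# reducer.py (sketch): per-group sample variance from n, sum(x) and sum(x^2)
import sys

current_group, n, s1, s2 = None, 0, 0.0, 0.0

def emit(group, n, s1, s2):
    if group is not None and n > 1:
        print('%s\t%s' % (group, (s2 - s1 * s1 / n) / (n - 1)))

for line in sys.stdin:
    group, value = line.strip().split('\t', 1)
    x = float(value)
    if group != current_group:
        emit(current_group, n, s1, s2)    # finished the previous group
        current_group, n, s1, s2 = group, 0, 0.0, 0.0
    n += 1
    s1 += x
    s2 += x * x

emit(current_group, n, s1, s2)            # don't forget the last group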

Fancier MapReduce

The basic single-pass MapReduce model can do a lot, but many algorithms require iteration. Fortunately, MapReduce jobs can be iterated: run map and reduce, check for convergence; if convergence has not been reached, run another map and reduce, check again, and so on.

We will not directly cover this given time constraints, but this functionality allows for more statistical (and complicated) MapReduce applications.

Note: Direct "chaining" of MapReduce jobs is hard if using the streaming API (easy if MR jobs are written in Java). Alternatives include Cascading and Yelp's mrjob.

I recommend mrjob: it interfaces nicely with EMR (more on this shortly).
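
For a flavor of mrjob, here is a sketch of the word count example written as an MRJob class (class and file names are illustrative):

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit a (word, 1) pair for each word in the line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the counts for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

A script like this runs locally with python word_count.py input.txt, and can be pointed at Elastic MapReduce with the -r emr runner (given AWS credentials).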

What next?

The MapReduce paradigm is very powerful, but coding and debugging MR jobs can be difficult and time-consuming. Simple tasks that could be coded in a few lines of a higher-level programming language can be difficult or inefficient to express in the MR paradigm.

Just as with parallel programming in general, the goal is to abstract things to a higher level and avoid the details where possible.

Higher-Level Interfaces to Hadoop/MapReduce

There are many projects/extensions of Hadoop. Most notably for us:

  1. Hive: Allows users to project structure onto data, and utilize basic SQL-type queries. Can still use custom map/reduce programs where desired. We will use Hive in the next homework (more on Wed). See: http://hive.apache.org/

  2. Pig: Framework for easier data analysis using Hadoop. Consists of a higher-level (SQL-style) language called "Pig Latin". Essentially acts as a compiler to translate Pig Latin programs into MapReduce tasks. See: http://pig.apache.org/

  3. Mahout: Scalable library for Machine Learning using Hadoop. Includes: Collaborative Filtering, K-means, Random Forests etc. See: http://mahout.apache.org/

Where to use Hadoop

Currently Hadoop is not installed on Gauss (for various reasons, it isn't ideal to set up MapReduce-style programs on Gauss, as this conflicts with the SLURM-style job management currently in place). It will likely appear in the near future, but won't be suitable for large-scale processing.

As far as I am aware, none of the other Stat servers currently have a working Hadoop install either (again, they likely will soon...).

Having a local install on your machine can be helpful for debugging, but where to actually run stuff...?

Hadoop + MapReduce on Amazon

On Wednesday we will take a look at how to use Hadoop via Amazon Web Services (AWS) using ElasticMapReduce (EMR).

That is enough for today... :)