Welcome to STA 250!
On the menu for today...
Distributed File Systems
MapReduce
Hadoop
Example: Word Count
Paul D. Baines
Last lecture we briefly discussed distributed file systems. These file systems manage data "distributed" across many machines.
To program with data stored in this way, we need a programming framework/paradigm: we don't want to manually keep track of where the data is located, pass it back and forth, etc.; we want the programming model to take care of that for us.
Enter MapReduce.
MapReduce is the name of both the programming model and its implementation.
The programming model is actually quite simple...
There are two steps to a MapReduce program:
Map: For every data element, a function is applied to that element, returning a (key,value) pair.
Reduce: For all elements sharing the same key, a function is applied to combine their values.
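Ignoring how the work is distributed across machines, the model can be sketched in a few lines of plain Python. (This is just an illustration; the function and argument names below are not part of any framework.)

from itertools import groupby

def map_reduce(data, map_fn, reduce_fn):
    # Map: apply map_fn to every data element; collect all (key, value) pairs
    pairs = [kv for element in data for kv in map_fn(element)]
    # Group: bring together all pairs that share a key
    # (a real MapReduce framework does this "shuffle" step for us)
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    # Reduce: combine the values within each key
    return [(key, reduce_fn(key, [v for _, v in group]))
            for key, group in grouped]

The word count example that follows corresponds to map_fn returning [(word, 1)] for each word and reduce_fn summing the values.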
Suppose we want to count the number of times each word in a document appears. Each word in the document is a data element. For example:
Angry Bob was angry that little Bob was angry at big Bob.
For the map step we need to decide on what (key,value) pairs to emit.
Map: For each word, emit: (word, 1). Result:
(Angry, 1)
(Bob, 1)
(was, 1)
(angry, 1)
(that, 1)
(little, 1)
(Bob, 1)
(was, 1)
(angry, 1)
(at, 1)
(big, 1)
(Bob, 1)
Next, for the reduce step, (conceptually) all key-value pairs with the same key (i.e., the same word) are grouped together.
(Angry, 1)
----------
(Bob, 1)
(Bob, 1)
(Bob, 1)
----------
(was, 1)
(was, 1)
----------
(angry, 1)
(angry, 1)
----------
...
Lastly, we need a reduce function to apply to the set of values within each unique key. Here, that is just a sum.
Result:
(Angry, 1)
(Bob, 3)
(was, 2)
(angry, 2)
(that, 1)
(little, 1)
(at, 1)
(big, 1)
Voila!
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int word_count = 0;
  for each v in values:
    word_count += ParseInt(v);
  Emit(key, AsString(word_count));
The input to map is a set of (key, value) pairs: the key is a document name, and the value holds that document's contents.
Hadoop is written in Java, and MapReduce programs were traditionally written directly in Java (and often still are, for speed).
Fortunately, the Hadoop Streaming API allows for MapReduce programs to run using any executable or script.
Therefore, for non-Java users, it is often convenient to write MapReduce programs in Python using the Hadoop Streaming interface.
So let's write the word count example in Python...
Key note: the output of each mapper must be <string><tab><string>
(one line per data element).
#!/usr/bin/env python
# From: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit (word, 1) for each word
    for word in words:
        # write results to STDOUT (standard output); what we output here will
        # be the input for the Reduce step, i.e. the input for reducer.py
        print '%s\t%s' % (word, 1)
#!/usr/bin/env python
# Reducer for the word count example (companion to mapper.py above).
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN: tab-separated (word, count) pairs, sorted by word
for line in sys.stdin:
    line = line.strip()
    # parse the (word, count) pair produced by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number: silently discard this line
        continue
    # this works because Hadoop sorts the map output by key before it
    # reaches the reducer, so each word arrives in a contiguous block
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# don't forget to output the last word!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Since Hadoop is designed to scale to many machines, one key feature is fault tolerance: if one machine breaks down, the whole system doesn't break.
Therefore, redundancy is built into the system. By default, each file is replicated three times (although this can be configured). Files are replicated across different datanodes.
Large files are broken into "chunks", with a default chunk size of 64MB. Working with chunks rather than whole files simplifies both data transfer and replication.
$ hadoop fs -mkdir test
$ hadoop fs -copyFromLocal pg* test/
$ hadoop fs -ls
Found 1 items
drwxr-xr-x - pdbaines supergroup 0 2013-10-24 17:27 /user/pdbaines/test
$ hadoop fs -ls test/
Found 3 items
-rw-r--r-- 1 pdbaines supergroup 674570 2013-10-24 17:27 /user/pdbaines/test/pg20417.txt
-rw-r--r-- 1 pdbaines supergroup 1573150 2013-10-24 17:27 /user/pdbaines/test/pg4300.txt
-rw-r--r-- 1 pdbaines supergroup 1423803 2013-10-24 17:27 /user/pdbaines/test/pg5000.txt
#!/bin/bash
# Location of the local Hadoop install and of the streaming jar:
HADOOP_HOME=/usr/local/Cellar/hadoop/1.2.1
JAR=libexec/contrib/streaming/hadoop-streaming-1.2.1.jar
HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"
# Ship mapper.py and reducer.py with the job and run it via Hadoop Streaming:
$HSTREAMING \
  -file mapper.py -mapper mapper.py \
  -file reducer.py -reducer reducer.py \
  -input test/pg* -output test-output
$ hadoop fs -copyToLocal test-output ./
$ ls -alh test-output
$ hadoop fs -rmr test-output
HDFS Administrator: http://localhost:50070
MapReduce Administrator: http://localhost:50030
Task Tracker: http://localhost:50060
Let's do another example, this time involving numbers.
Data:
100 -0.92681706290969
23 2.14354753714106
77 -0.347487277923425
85 0.180645560906827
44 -4.25740698971681
13 0.449197843116687
...
Task: Compute within-group means.
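As with word count, the mapper emits (key, value) pairs and the reducer combines the values within each key; only the reduce function changes. A rough sketch in the same streaming style (the file names here are hypothetical):

#!/usr/bin/env python
# mean_mapper.py (hypothetical name): emit "group<TAB>value" for each
# input line of the form "group value"
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) != 2:
        continue  # skip malformed lines
    print '%s\t%s' % (fields[0], fields[1])

#!/usr/bin/env python
# mean_reducer.py (hypothetical name): Hadoop sorts the mapper output by
# key, so all values for a group arrive consecutively; keep a running sum
# and count, and emit the group mean whenever the group changes.
import sys

current_group = None
current_sum = 0.0
current_n = 0

for line in sys.stdin:
    group, value = line.strip().split('\t', 1)
    try:
        value = float(value)
    except ValueError:
        continue
    if group == current_group:
        current_sum += value
        current_n += 1
    else:
        if current_group is not None:
            print '%s\t%s' % (current_group, current_sum / current_n)
        current_group = group
        current_sum = value
        current_n = 1

# emit the mean for the final group
if current_group is not None:
    print '%s\t%s' % (current_group, current_sum / current_n)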
Let's take it one step further...
Data:
100 -0.92681706290969
23 2.14354753714106
77 -0.347487277923425
85 0.180645560906827
44 -4.25740698971681
13 0.449197843116687
...
Task: Compute within-group variances.
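One common one-pass approach (again a sketch, reusing the same mapper, with a hypothetical file name): accumulate the count n, the sum of the values, and the sum of the squared values within each group, then emit the sample variance s^2 = (sum of x^2 - (sum of x)^2 / n) / (n - 1).

#!/usr/bin/env python
# var_reducer.py (hypothetical name): one-pass within-group sample variance
import sys

def emit(group, n, s, ss):
    # sample variance is only defined for groups with at least two values
    if n > 1:
        print '%s\t%s' % (group, (ss - s * s / n) / (n - 1))

current_group = None
n = 0
s = 0.0    # running sum of x
ss = 0.0   # running sum of x^2

for line in sys.stdin:
    group, value = line.strip().split('\t', 1)
    try:
        x = float(value)
    except ValueError:
        continue
    if group == current_group:
        n += 1
        s += x
        ss += x * x
    else:
        if current_group is not None:
            emit(current_group, n, s, ss)
        current_group = group
        n, s, ss = 1, x, x * x

# emit the variance for the final group
if current_group is not None:
    emit(current_group, n, s, ss)

(Caveat: the one-pass sum-of-squares formula can be numerically unstable when the values are large relative to their spread; a two-pass or Welford-style running update is safer in practice.)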
The basic single-pass MapReduce model can do a lot, but many algorithms require iteration. Fortunately, MapReduce jobs can be iterated: run map and reduce, check for convergence, and if convergence has not been reached, run another map and reduce pass, check again, and so on.
We will not directly cover this given time constraints, but this functionality allows for more statistical (and complicated) MapReduce applications.
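As a rough illustration (not something we will run in class), the iteration can be driven by an ordinary script that launches one streaming pass per iteration and checks a problem-specific convergence criterion between passes. Everything below (file names and the convergence check) is hypothetical; the streaming command mirrors the run script shown earlier.

#!/usr/bin/env python
# iterate.py (hypothetical): drive an iterative algorithm by launching one
# streaming MapReduce pass per iteration; the output of pass k becomes the
# input of pass k+1.
import subprocess

HADOOP_HOME = "/usr/local/Cellar/hadoop/1.2.1"
JAR = "libexec/contrib/streaming/hadoop-streaming-1.2.1.jar"
HSTREAMING = "%s/bin/hadoop jar %s/%s" % (HADOOP_HOME, HADOOP_HOME, JAR)

def run_pass(input_dir, output_dir):
    cmd = ("%s -file mapper.py -mapper mapper.py "
           "-file reducer.py -reducer reducer.py "
           "-input %s -output %s" % (HSTREAMING, input_dir, output_dir))
    subprocess.check_call(cmd, shell=True)

def converged(old_dir, new_dir):
    # placeholder: compare the two outputs in a problem-specific way
    return False

max_iter = 20
current = "iter-0"
run_pass("test", current)      # initial pass over the raw data
for it in range(1, max_iter + 1):
    nxt = "iter-%d" % it
    run_pass(current, nxt)     # each pass reads the previous pass's output
    if converged(current, nxt):
        break
    current = nxt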
Note: Direct "chaining" of MapReduce jobs is hard if using the streaming API (easy if MR jobs are written in Java). Alternatives include Cascading and Yelp's mrjob.
I recommend mrjob: it interfaces nicely with EMR (more on this shortly).
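For a flavor of mrjob, a word count job looks roughly like the sketch below (based on mrjob's MRJob interface; check the mrjob documentation for the version you install):

#!/usr/bin/env python
# word_count_mrjob.py (illustrative sketch using mrjob)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # the input key is ignored; line is one line of input text
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

mrjob also supports multi-step jobs, which addresses the chaining issue noted above, and can target a local run, a Hadoop cluster, or EMR.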
The MapReduce paradigm is very powerful, but coding and debugging MR jobs can be difficult and time-consuming. Simple tasks that could be coded in a few lines of a higher-level programming language can be difficult or inefficient in the MR paradigm.
Just as with parallel programming in general, the goal is to abstract things to a higher level and avoid the details where possible.
There are many projects/extensions of Hadoop. Most notably for us:
Hive: Allows users to project structure onto data, and utilize basic SQL-type queries. Can still use custom map/reduce programs where desired. We will use Hive in the next homework (more on Wed). See: http://hive.apache.org/
Pig: Framework for easier data analysis using Hadoop. Consists of a higher-level (SQL-style) language called "Pig Latin". Essentially acts as a compiler to translate Pig Latin programs into MapReduce tasks. See: http://pig.apache.org/
Mahout: Scalable library for Machine Learning using Hadoop. Includes: Collaborative Filtering, K-means, Random Forests etc. See: http://mahout.apache.org/
Currently, Hadoop is not installed on Gauss (for various reasons, it isn't ideal to set up MapReduce-style programs on Gauss, as it conflicts with the SLURM-style job management currently in place). It will likely appear in the near future, but won't be suitable for large-scale processing.
As far as I am aware, none of the other Stat servers currently have a working Hadoop install either (again, they likely will soon...).
Having a local install on your machine can be helpful for debugging, but where to actually run stuff...?
On Wednesday we will take a look at how to use Hadoop via Amazon Web Services (AWS) using ElasticMapReduce (EMR).
Wed: Hadoop + MapReduce + Amazon = EMR!