STA 250 Lecture 10

Advanced Statistical Computation

Paul D. Baines

Welcome to STA 250!

Today is Big Data lecture 3. On the menu for today...

Cloud Computing on Amazon
ElasticMapReduce
Working with AWS
Introduction to Hive
Hive Examples

Hadoop + MapReduce on Amazon

Today we will take a look at how to use Hadoop via Amazon Web Services (AWS) using ElasticMapReduce (EMR).

As mentioned in lecture 01, Amazon have kindly given us an educational grant to cover the computing costs for the class.

Launching a Hadoop Cluster on Amazon

MapReduce and Hadoop functionality on Amazon is provided through the Elastic MapReduce service.

This allows you to launch configured Hadoop clusters with varying numbers of nodes, compute power, software etc.

These clusters can also be launched with Hive and Pig pre-configured and ready to use.

Lets see how to launch an interactive session and then login.

Amazon Sessions

EMR jobs come in two types:

Interactive: The machine is created and launched (and any customized configurations or additional software is setup). The user must then login and run his/her scripts.

Crucial note: THE JOB COSTS MONEY AS SOON AS IT STARTS RUNNING. WHEN YOU HAVE FINISHED RUNNING YOUR SCRIPTS, YOU MUST TERMINATE THE JOB via:
https://console.aws.amazon.com/elasticmapreduce/

Non-Interactive: Similar to Gauss batch jobs, all scripts are supplied to Amazon and the machine is setup, scripts setup, output written to the desired destination, and then the job is automatically terminated upon completion.

Storing Data on AWS

Data on AWS is designed to be stored in Amazon's S3 system (http://aws.amazon.com/s3).

S3 stores data inside what are known as buckets.

For homework 2 you will need to retrieve data from the course bucket, and bring that data into Hadoop. For example, to bring in the mini_groups.txt data used for example 2 last lecture into the data directory on your HDFS:

hadoop distcp s3://sta250bucket/mini_groups.txt data/

Note: You must already have created the data directory for this command to work.

Uploading Data to S3

To upload data to S3, do so via the S3 console:

https://console.aws.amazon.com/s3/

The console is fairly self-explanatory.

NOTE: Storing data on AWS costs money, so once you are done. Delete the data!

For homework 1 you will just be using Prof. Baines' data, so no need to upload any data.

Moving Files to EMR (Interactive Jobs)

For non-iteractive EMR jobs, you will need to get your scripts to the machine. Fortunately, this is (almost) the same procedure you already have been using for Gauss: scp.

Only one extra wrinkle: you will need to provide your private keyfile when issuing the scp command.

For example:

scp -i mykey.pem *.py hadoop@ec2-54-202-210-38.us-west-2.compute.amazonaws.com:~/

Q: What does this command do?

Hadoop Streaming on AWS

To run a Hadoop Streaming job on AWS you need to launch your cluster, get the data into your HDFS, get your scripts to HDFS and then launch the MapReduce job in the same manner as described in Lecture 09.

To check the status of your job you can either:

Open another ssh session and run lynx http://localhost:9100/
Setup FoxyProxy to view the status from Firefox. See:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html

Hadoop Streaming on EMR Overview

The basic steps (Note: other workflows are possible, this isn't necessarily the best way to do things!):

Launch the EMR cluster from the AWS website
Set up your data directory in the HDFS
Transfer your data into the HDFS
Copy your scripts to the cluster using scp
Run the MapReduce job
Debug, check status using lynx (and cross your fingers...)
Transfer the results out of the HDFS
Copy the results to your local machine via scp
Terminate your job via the AWS console!!!
Celebrate.

Hive

Hive provides higher-level functionality for working with Hadoop.

Good places to start:

From the wiki:

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).

Understanding Hive

Since Hive imposes structure on the data to be analyzed (unlike vanilla Hadoop), there are two main things to grasp:

The Data Definition Language (DDL): For creating databases, tables from the data
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The Data Manipulation Language (DML): For modifying tables (either via LOAD-ing in data, or INSERT-ing it via queries)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
The hive Language: For querying tables to find what you are interested in.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

Data Structure in Hive

From the Wiki:

Databases: Namespaces that separate tables and other data units from naming confliction.
Tables: Homogeneous units of data which have the same schema.
Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria. For example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example all US data from 2009-12-23 is a partition of the page_views table. Therefore, if you run analysis on only the US data for 2009-12-23, you can run that query only on the relevant partition of the table thereby speeding up the analysis significantly.
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.

Primitive Data Types in Hive

Integers
- TINYINT - 1 byte integer
- SMALLINT - 2 byte integer
- INT - 4 byte integer
- BIGINT - 8 byte integer
Boolean type
- BOOLEAN - TRUE/FALSE
Floating point numbers
- FLOAT - single precision
- DOUBLE - Double precision
String type
- STRING - sequence of characters in a specified character set

More Data Types in Hive

In addition to the primitive data types we can also construct complex ones via:

Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a
Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed using M['group']
Arrays (indexable lists): The elements in the array have to be in the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example for an array A having the elements ['a', 'b', 'c'], A[1] retruns b.

Creating (Empty) Tables

To create an (empty) table with a known schema is fairly easy. For example:

CREATE TABLE mytable (
 firstname STRING,
 lastname STRING,
 statphd BOOLEAN,
 hwscore INT
 );

Then:

SHOW TABLES;

Putting Data in to Tables

To fill a table with data, there are two options:

INSERT data from another table
LOAD data from a file

Public Datasets on Amazon

Google n-grams: Details at http://aws.amazon.com/datasets/8172056142375670
Amazon S3: DIR=s3://datasets.elasticmapreduce/ngrams/books/
American-English 1-grams (3.0Gb): $DIR/20090715/eng-us-all/1gram/data
American-English 4-grams (135.0Gb): $DIR/20090715/eng-us-all/4gram/data
English 4-grams (293.5Gb): $DIR/20090715/eng/4gram/data
IMPORTANT NOTE: These datasets are hosted in the us-east-1 region. If you process these from other regions you will be charged data transfer fees.
Sloan Digital Sky Survey: Details at: http://aws.amazon.com/datasets/Astronomy/2797
Daily Global Weather Measurements: Details at: http://aws.amazon.com/datasets/Climate/2759
1000 Genomes Project: Details at: http://aws.amazon.com/datasets/Biology/4383
Amazon S3: http://s3.amazonaws.com/1000genomes

Hive on EMR

Lets play around with the Google ngrams data on EMR.

Quick note: The Google n-grams datasets are stored in the Sequence File Format (compressed). This is a commonly used file format for Hadoop. See: http://wiki.apache.org/hadoop/SequenceFile.

That is enough for today... :)

Keep Off

Source. Mon: Big Data Wrap-up.*