Welcome to STA 250!
Today is Big Data lecture 3. On the menu for today...
Cloud Computing on Amazon
ElasticMapReduce
Working with AWS
Introduction to Hive
Hive Examples
Paul D. Baines
Today we will take a look at how to use Hadoop via Amazon Web Services (AWS) using ElasticMapReduce (EMR).
As mentioned in lecture 01, Amazon have kindly given us an educational grant to cover the computing costs for the class.
MapReduce and Hadoop functionality on Amazon is provided through the Elastic MapReduce service.
This allows you to launch configured Hadoop clusters with varying numbers of nodes, compute power, software etc.
These clusters can also be launched with Hive and Pig pre-configured and ready to use.
Let's see how to launch an interactive session and then log in.
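Once the cluster is running, you log in to the master node over ssh with your private keyfile (the keyfile name and master node address below are placeholders; use your own key and the address shown in the EMR console):
ssh -i mykey.pem hadoop@ec2-54-202-210-38.us-west-2.compute.amazonaws.com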
EMR jobs come in two types: interactive sessions and non-interactive (batch) jobs.
Crucial note: THE JOB COSTS MONEY AS SOON AS IT STARTS RUNNING. WHEN YOU HAVE FINISHED RUNNING YOUR SCRIPTS,
YOU MUST TERMINATE THE JOB via:
https://console.aws.amazon.com/elasticmapreduce/
Data on AWS is designed to be stored in Amazon's S3 system (http://aws.amazon.com/s3).
S3 stores data inside what are known as buckets.
For homework 2 you will need to retrieve data from the course bucket and bring that data into Hadoop. For example, to bring the mini_groups.txt data used for example 2 last lecture into the data directory on your HDFS:
hadoop distcp s3://sta250bucket/mini_groups.txt data/
Note: You must already have created the data directory for this command to work.
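If the directory does not exist yet, you can create it first with the standard HDFS command, e.g.:
hadoop fs -mkdir data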
To upload data to S3, do so via the S3 console:
https://console.aws.amazon.com/s3/
The console is fairly self-explanatory.
NOTE: Storing data on AWS costs money, so once you are done, delete the data!
For homework 1 you will just be using Prof. Baines' data, so no need to upload any data.
For non-interactive EMR jobs, you will need to get your scripts to the machine. Fortunately, this is (almost) the same procedure you have already been using for Gauss: scp.
The only extra wrinkle: you will need to provide your private keyfile when issuing the scp command.
For example:
scp -i mykey.pem *.py hadoop@ec2-54-202-210-38.us-west-2.compute.amazonaws.com:~/
Q: What does this command do?
To run a Hadoop Streaming job on AWS you need to launch your cluster, get the data into your HDFS, get your scripts to HDFS and then launch the MapReduce job in the same manner as described in Lecture 09.
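A rough sketch of such a launch is below; the streaming jar location varies with the EMR/Hadoop version (check on your own cluster), and the script and directory names are just placeholders:
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input data/mini_groups.txt \
  -output groups_out \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py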
To check the status of your job you can either:
Open another ssh session and run lynx http://localhost:9100/
Set up FoxyProxy to view the status from Firefox. See:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html
The basic steps (Note: other workflows are possible, this isn't necessarily the best way to do things!):
Launch the cluster and log in to the master node.
scp your scripts to the master node.
Copy the data from S3 into your HDFS.
Launch the Hadoop Streaming job.
Monitor progress with lynx (and cross your fingers...).
Hive provides higher-level functionality for working with Hadoop.
Good places to start:
From the wiki:
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDF's), aggregations (UDAF's), and table functions (UDTF's).
Since Hive imposes structure on the data to be analyzed (unlike vanilla Hadoop), there are two main things to grasp:
The Data Definition Language (DDL): For creating databases and tables from the data
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The Data Manipulation Language (DML): For modifying tables (either by LOAD-ing in data, or INSERT-ing it via queries)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
The Hive query language (QL): For querying tables to find what you are interested in.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
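For a first taste of the query language, a simple query against the mytable table created later in these slides might look like this (illustrative only):
SELECT firstname, AVG(hwscore) AS avg_score
FROM mytable
WHERE statphd = TRUE
GROUP BY firstname;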
From the Wiki:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema.
Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all US data from 2009-12-23 is a partition of the page_views table. Therefore, if you run analysis on only the US data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly.
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
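As a minimal sketch (the page_views schema here is hypothetical, echoing the wiki's example, and the column and bucket choices are just for illustration), a partitioned and bucketed table can be declared like this:
CREATE TABLE page_views (
  userid BIGINT,
  url STRING
)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;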
TINYINT - 1 byte integer
SMALLINT - 2 byte integer
INT - 4 byte integer
BIGINT - 8 byte integer
BOOLEAN - TRUE/FALSE
FLOAT - single precision
DOUBLE - double precision
STRING - sequence of characters in a specified character set
In addition to the primitive data types we can also construct complex ones via:
Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is a (zero-based) index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
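A quick illustration (the table and column names are hypothetical; note that in the DDL these complex types are written with angle brackets):
CREATE TABLE complex_demo (
  c STRUCT<a:INT, b:INT>,
  m MAP<STRING, INT>,
  arr ARRAY<STRING>
);
SELECT c.a, m['group'], arr[1]
FROM complex_demo;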
To create an (empty) table with a known schema is fairly easy. For example:
CREATE TABLE mytable (
firstname STRING,
lastname STRING,
statphd BOOLEAN,
hwscore INT
);
Then:
SHOW TABLES;
To fill a table with data, there are two options:
INSERT data from another table
LOAD data from a file
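For example (the file path and the other_table table are placeholders for illustration), data can be LOAD-ed into the mytable table created above, or INSERT-ed from a query on another table:
LOAD DATA LOCAL INPATH '/home/hadoop/students.txt'
OVERWRITE INTO TABLE mytable;
INSERT OVERWRITE TABLE mytable
SELECT firstname, lastname, statphd, hwscore
FROM other_table;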
Amazon hosts several public datasets on S3, including the Google Books n-grams. Writing DIR=s3://datasets.elasticmapreduce/ngrams/books/ for brevity:
US English 1-grams: $DIR/20090715/eng-us-all/1gram/data
US English 4-grams: $DIR/20090715/eng-us-all/4gram/data
English 4-grams (293.5Gb): $DIR/20090715/eng/4gram/data
IMPORTANT NOTE: These datasets are hosted in the us-east-1 region. If you process these from other regions you will be charged data transfer fees.
Sloan Digital Sky Survey: Details at: http://aws.amazon.com/datasets/Astronomy/2797
Daily Global Weather Measurements: Details at: http://aws.amazon.com/datasets/Climate/2759
1000 Genomes Project: Details at: http://aws.amazon.com/datasets/Biology/4383
Amazon S3: http://s3.amazonaws.com/1000genomes
Let's play around with the Google ngrams data on EMR.
Quick note: The Google n-grams datasets are stored in the Sequence File Format (compressed). This is a commonly used file format for Hadoop. See: http://wiki.apache.org/hadoop/SequenceFile.
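As a sketch of how you might point Hive at one of these datasets (the column names below follow Amazon's n-grams documentation, but treat the exact schema as an assumption to double-check), you can declare an external table over the SequenceFiles on S3 and query it directly:
CREATE EXTERNAL TABLE eng_us_1grams (
  gram STRING,
  year INT,
  occurrences BIGINT,
  pages BIGINT,
  books BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data';
SELECT gram, SUM(occurrences) AS total
FROM eng_us_1grams
WHERE year >= 2000
GROUP BY gram
ORDER BY total DESC
LIMIT 20;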
Mon: Big Data Wrap-up.