Automating AWS EMR Life-cycle

Automate the Whole AWS EMR Process Using a Bootstrap Action and an EMR Step

Azam
6 min read · Jan 3, 2021
Photo by Christopher Gower on Unsplash

In this tutorial, we will spin up an EMR cluster, install all dependencies using a bootstrap action, then use an EMR step to download images from Kaggle, process them with Python, store the processed images in S3, and finally terminate the cluster.
In short, this tutorial highlights how a Bootstrap Action and an EMR Step can automate the configuration of EMR, process some data, and terminate the cluster once processing is done.

  1. Create a Bootstrap Action shell script
  2. Create a Python script for converting images from PNG to JPG
  3. Create an EMR Step shell script
  4. Launch EMR from the AWS Console, including the Bootstrap Action and EMR Step
  5. Launch EMR using a CloudFormation template (CFT), including the Bootstrap Action and EMR Step
  6. Problems faced

Getting Started with Bootstrap Action

Basically, a bootstrap action is used to install required packages after the instances launch but before the cluster is ready. It has one special advantage: it installs the packages on every node in the cluster, exactly as specified in the script.

Let's install some of our favorite Python packages, plus git for cloning a repository.

#!/bin/bash
pip3 install --user pandas
pip3 install --user kaggle
pip3 install --user opencv-python
# Pillow provides the PIL module used by the conversion script below
pip3 install --user pillow
sudo yum install git -y

The pip3 install commands above work only on EMR versions greater than 5.20, because older EMR versions default to Python 2.7. If you want to install Python packages on an older EMR version, use the snippet below.

#!/bin/bash
# 1st step is to make python3 the default python
sudo ln -fs /usr/bin/python3.6 /etc/alternatives/python
sudo ln -fs /usr/bin/pip-3.6 /etc/alternatives/pip
sudo pip install --upgrade pip
# now you can install packages
/usr/local/bin/pip3.6 install --user pandas
# the rest stays the same
sudo yum install git -y

Let's convert the images from PNG to JPG using Python

  1. Import required libraries
from PIL import Image
import pandas as pd

2. Open the image as PNG and save it as JPG

im = Image.open('some_path/image_1.png')
im.save('some_path/image_1.jpg')
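
One caveat: JPEG has no alpha channel, so Pillow raises an OSError if you save an RGBA (or palette) PNG directly as JPG. A safer sketch of the same conversion, where the mode check is my addition rather than part of the original script:

im = Image.open('some_path/image_1.png')
# JPEG does not support transparency; convert RGBA/LA/P images to RGB first
if im.mode in ('RGBA', 'LA', 'P'):
    im = im.convert('RGB')
im.save('some_path/image_1.jpg')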

3. Final Snippet

from PIL import Image
import pandas as pd

# convert training dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-training-dataset.csv")
for i in df.id:
    s_path = "/home/hadoop/raw_data/boneage-training-dataset/boneage-training-dataset/" + str(i) + '.png'
    d_path = "/home/hadoop/clean_data/boneage-training-dataset/" + str(i) + '.jpg'
    im = Image.open(s_path)
    im.save(d_path)

# convert testing dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-test-dataset.csv")
for i in df['Case ID']:
    s_path = "/home/hadoop/raw_data/boneage-test-dataset/boneage-test-dataset/" + str(i) + '.png'
    d_path = "/home/hadoop/clean_data/boneage-test-dataset/" + str(i) + '.jpg'
    im = Image.open(s_path)
    im.save(d_path)

The dataset used can be found on Kaggle; it includes a CSV file that maintains a mapping of each image to its ID. In the code above, I use that CSV to iterate through each image.

Create an EMR Step

1. Configure Kaggle

#!/bin/bash
cd /home/hadoop/
aws s3 cp s3://bootstrap-and-emr-step-demo/kaggle.json /home/hadoop/kaggle.json
# configuring kaggle for downloading the dataset
mkdir /home/hadoop/.kaggle
sudo cp /home/hadoop/kaggle.json /home/hadoop/.kaggle/

In order to download a dataset from Kaggle, we need to authenticate using an API token; see the Kaggle API documentation for a detailed explanation.
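
For reference, kaggle.json is just a two-field credential file. A minimal sketch of creating it by hand, with placeholder values (generate the real token from your Kaggle account page):

import json
import os
import stat

# Placeholder credentials; download the real kaggle.json from your Kaggle account settings
creds = {"username": "<your-kaggle-username>", "key": "<your-api-key>"}

os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
path = os.path.expanduser("~/.kaggle/kaggle.json")
with open(path, "w") as f:
    json.dump(creds, f)
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # the kaggle CLI warns if the token is readable by others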

2. Clone Git Repo

# cloning repo 
git clone -b master https://github.com/kazam1920/png_to_jpg.git /home/hadoop/png_to_jpg

3. Download dataset

# downloading dataset from kaggle
~/.local/bin/kaggle datasets download kmader/rsna-bone-age --path /home/hadoop/raw_data --unzip

4. Create directories for the processed images

mkdir /home/hadoop/clean_data
mkdir /home/hadoop/clean_data/boneage-training-dataset
mkdir /home/hadoop/clean_data/boneage-test-dataset
mkdir /home/hadoop/final
sudo cp raw_data/boneage-training-dataset.csv /home/hadoop/clean_data/
sudo cp raw_data/boneage-test-dataset.csv /home/hadoop/clean_data/

5. Execute the Python script

python3 /home/hadoop/png_to_jpg/png_to_jpg.py

The command above works only on EMR versions greater than 5.20; to execute the Python script on an older EMR version, use the snippet below.

/usr/bin/python3.6 /home/hadoop/png_to_jpg/png_to_jpg.py

6. Move the processed data to S3

hdfs dfs -put /home/hadoop/clean_data /user/hadoop/clean_data
s3-dist-cp --src /user/hadoop/clean_data --dest s3://bootstrap-and-emr-step-demo/clean_data

s3-dist-cp copies data in parallel, and it can only be used when the data is already in HDFS. We could instead copy straight from the local filesystem with aws s3 cp --recursive /home/hadoop/clean_data s3://bootstrap-and-emr-step-demo/clean_data, but if we have huge data, say in the TBs, aws s3 cp takes far too long.

7. Terminate Cluster

# Extract cluster-id from job-flow.json
cluster_id=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")
# Disable termination-protection
aws emr modify-cluster-attributes --cluster-id ${cluster_id} --no-termination-protected
# terminate cluster
aws emr terminate-clusters --cluster-ids ${cluster_id}

Now we have all of our scripts ready.

Launch EMR From AWS Console

Before getting started, there are a few things we need to do:
1. Upload the bootstrap.sh and step.sh files to S3 (a minimal upload sketch follows this list)
2. Make sure the S3 bucket holding the .sh files and the EMR cluster are in the same region
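
For the upload itself, here is a minimal boto3 sketch. It assumes the bucket name used throughout this post (bootstrap-and-emr-step-demo) and the ap-south-1 region; substitute your own.

import boto3

# Bucket name and region are the ones used in this post; replace with your own
s3 = boto3.client("s3", region_name="ap-south-1")
s3.upload_file("bootstrap.sh", "bootstrap-and-emr-step-demo", "bootstrap.sh")
s3.upload_file("step.sh", "bootstrap-and-emr-step-demo", "step.sh")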

1. From AWS console → Services → EMR → Create Cluster → Advanced options

In software configuration, select the EMR version and required applications

2. In the Steps section, set Step type → Custom JAR

JAR location → s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments → <s3 path of your step.sh file>

After adding Step

3. Select the required nodes, instance types, and root device EBS volume size

4. General cluster settings → Additional Options → Bootstrap Actions

Select logging path for storing logs

Script location → <s3 path of your Bootstrap.sh file>

After adding bootstrap action

5. Security Section

In this section, select the EC2 key pair and the EC2 security groups for the master and core nodes → Create cluster

6. Outcomes

1. Starting stage
2. Bootstrapping stage
3. Running step stage
4. Cluster terminated

Once the step executes successfully, it will terminate the cluster.
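
If you would rather script the whole console walkthrough above, here is a minimal boto3 sketch of the same launch. The release label, instance types, count, and IAM role names are placeholder assumptions of mine, not values from this post; the bootstrap, step, and log paths reuse the bucket from earlier.

import boto3

emr = boto3.client("emr", region_name="ap-south-1")

response = emr.run_job_flow(
    Name="bootstrap-and-emr-step-demo",
    ReleaseLabel="emr-5.32.0",  # any release > 5.20 for the pip3 commands above
    Applications=[{"Name": "Hadoop"}],  # add JupyterHub, Hive, etc. as needed
    LogUri="s3://bootstrap-and-emr-step-demo/logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "<your-key-pair>",
        # keep the cluster alive after bootstrap; step.sh terminates it explicitly
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": True,  # step.sh disables this before terminating
    },
    BootstrapActions=[
        {
            "Name": "Install dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://bootstrap-and-emr-step-demo/bootstrap.sh"
            },
        }
    ],
    Steps=[
        {
            "Name": "process-images",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": ["s3://bootstrap-and-emr-step-demo/step.sh"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])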

Launch EMR Using CFT

In practice, EMR creation is mostly done using a CFT or Terraform. Below is a CFT snippet to add to your existing EMR CFT.

"Resources": {
"cluster": {
"Type": "AWS::EMR::Cluster",
"Properties": {
"Applications": [
{
"Name": "Hadoop"
},
{
"Name": "JupyterHub"
},
{
"Name": "Hive"
},
],
"BootstrapActions": [
{
"Name": "Custom action",
"ScriptBootstrapAction": {
"Path": "<s3 path of bootstrap.sh>"
}
}
],
...................................................
"Teststeps": {
"Properties": {
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Args": [
""<s3 path of step.sh>"
],
"Jar": "s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar"
},
"JobFlowId": {
"Ref": "cluster"
},
"Name": "Teststeps"
},
"Type": "AWS::EMR::Step"
}

Parsed CFT JSON, shown for reference

Problems faced while working with bootstrap action

Terminated with errors Master instance (instance-id) failed attempting to download bootstrap action 1 file from S3

This problem occurred because I was creating the EMR cluster in one region while the S3 bucket storing bootstrap.sh was in a different region.

Terminated with errors On the master instance (instance-id), bootstrap action 1 returned a non-zero return code

This problem occurred due to errors in the bootstrap.sh commands, which can be debugged easily if you have enabled logging.
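
When a bootstrap action or step fails, it also helps to pull the cluster's state-change reason programmatically. A small sketch, assuming boto3 and a hypothetical <cluster-id> placeholder:

import boto3

emr = boto3.client("emr", region_name="ap-south-1")
# <cluster-id> is a placeholder; use the ID of the failed cluster
status = emr.describe_cluster(ClusterId="<cluster-id>")["Cluster"]["Status"]
print(status["State"])
print(status["StateChangeReason"])  # surfaces messages like the bootstrap errors above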

The bootstrap action, Python, and EMR step code can be found on GitHub.
