Automating AWS EMR Life-cycle

Automate the Whole AWS EMR Process Using a Bootstrap Action and an EMR Step

Azam
6 min read · Jan 3, 2021
Photo by Christopher Gower on Unsplash

In this tutorial, we will spin up an EMR cluster, install all dependencies using a bootstrap action, then use an EMR step to download images from Kaggle, process them with Python, store the processed images in S3, and finally terminate the cluster.
In short, this tutorial highlights how a Bootstrap Action and an EMR Step can automate the configuration of EMR, process some data, and terminate the cluster once processing is done.

  1. Create a Bootstrap Action shell script
  2. Create a Python script for converting images from PNG to JPG
  3. Create an EMR Step shell script
  4. Launch EMR from the AWS Console, including the Bootstrap Action and EMR Step
  5. Launch EMR using a CloudFormation template (CFT), including the Bootstrap Action and EMR Step
  6. Problems faced

Getting Started with Bootstrap Action

Basically, a bootstrap action is used to install required packages after the instances launch but before the cluster is ready. It has one special advantage: it installs the packages on every node in the cluster, exactly as specified in the script.

Let's install some of our favorite Python packages, plus git for cloning a repository.

#!/bin/bash
pip3 install --user pandas
pip3 install --user kaggle
pip3 install --user opencv-python
# Pillow provides the PIL module used by the conversion script below
pip3 install --user pillow
sudo yum install git -y

The pip3 install commands above work only on EMR versions greater than 5.20, because older EMR versions default to Python 2.7. If you want to install Python packages on an older EMR version, use the snippet below.

#!/bin/bash
# 1st step is to make python3 the default python
sudo ln -fs /usr/bin/python3.6 /etc/alternatives/python
sudo ln -fs /usr/bin/pip-3.6 /etc/alternatives/pip
sudo pip install --upgrade pip
# now you can install packages
/usr/local/bin/pip3.6 install --user pandas
# the rest stays the same
sudo yum install git -y

Let's convert the images from PNG to JPG using Python

  1. Import required libraries
from PIL import Image
import pandas as pd

2. Open the image as PNG and save it as JPG

im = Image.open('some_path/image_1.png')
im.save('some_path/image_1.jpg')
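
One caveat: JPEG has no alpha channel, so Pillow raises an OSError if you save an RGBA (or palette) PNG directly as JPG. A safer sketch of the same conversion, where the mode check is my addition rather than part of the original script:

im = Image.open('some_path/image_1.png')
# JPEG does not support transparency; convert RGBA/LA/P images to RGB first
if im.mode in ('RGBA', 'LA', 'P'):
    im = im.convert('RGB')
im.save('some_path/image_1.jpg')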

3. Final Snippet

from PIL import Image
import pandas as pd

# convert training dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-training-dataset.csv")
for i in df.id:
    s_path = "/home/hadoop/raw_data/boneage-training-dataset/boneage-training-dataset/" + str(i) + '.png'
    d_path = "/home/hadoop/clean_data/boneage-training-dataset/" + str(i) + '.jpg'
    im = Image.open(s_path)
    im.save(d_path)

# convert testing dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-test-dataset.csv")
for i in df['Case ID']:
    s_path = "/home/hadoop/raw_data/boneage-test-dataset/boneage-test-dataset/" + str(i) + '.png'
    d_path = "/home/hadoop/clean_data/boneage-test-dataset/" + str(i) + '.jpg'
    im = Image.open(s_path)
    im.save(d_path)

The dataset used can be found on Kaggle; it includes a CSV file that maintains a mapping of each image to its ID. In the code above, I use that CSV to iterate through each image.

Create an EMR Step

1. Configure Kaggle

#!/bin/bash
cd /home/hadoop/
aws s3 cp s3://bootstrap-and-emr-step-demo/kaggle.json /home/hadoop/kaggle.json
# configuring kaggle for downloading the dataset
mkdir /home/hadoop/.kaggle
sudo cp /home/hadoop/kaggle.json /home/hadoop/.kaggle/

In order to download a dataset from Kaggle, we need to authenticate using an API token; see the Kaggle API documentation for a detailed explanation.
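
For reference, kaggle.json is just a two-field credential file. A minimal sketch of creating it by hand, with placeholder values (generate the real token from your Kaggle account page):

import json
import os
import stat

# Placeholder credentials; download the real kaggle.json from your Kaggle account settings
creds = {"username": "<your-kaggle-username>", "key": "<your-api-key>"}

os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
path = os.path.expanduser("~/.kaggle/kaggle.json")
with open(path, "w") as f:
    json.dump(creds, f)
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # the kaggle CLI warns if the token is readable by others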

2. Clone Git Repo

# cloning repo 
git clone -b master https://github.com/kazam1920/png_to_jpg.git /home/hadoop/png_to_jpg

3. Download dataset

# downloading dataset from kaggle
~/.local/bin/kaggle datasets download kmader/rsna-bone-age --path /home/hadoop/raw_data --unzip

4. Create directories for the processed images

mkdir /home/hadoop/clean_data
mkdir /home/hadoop/clean_data/boneage-training-dataset
mkdir /home/hadoop/clean_data/boneage-test-dataset
mkdir /home/hadoop/final
sudo cp raw_data/boneage-training-dataset.csv /home/hadoop/clean_data/
sudo cp raw_data/boneage-test-dataset.csv /home/hadoop/clean_data/

5. Execute the Python script

python3 /home/hadoop/png_to_jpg/png_to_jpg.py

The command above works only on EMR versions greater than 5.20; to execute the Python script on an older EMR version, use the snippet below.

/usr/bin/python3.6 /home/hadoop/png_to_jpg/png_to_jpg.py

6. Move the processed data to S3

hdfs dfs -put /home/hadoop/clean_data /user/hadoop/clean_data
s3-dist-cp --src /user/hadoop/clean_data --dest s3://bootstrap-and-emr-step-demo/clean_data

s3-dist-cp copies data in parallel, and it can only be used when the data is already in HDFS. We could instead copy straight from the local filesystem with aws s3 cp --recursive /home/hadoop/clean_data s3://bootstrap-and-emr-step-demo/clean_data, but if we have huge data, say in the TBs, aws s3 cp takes far too long.

7. Terminate Cluster

# Extract cluster-id from job-flow.json
cluster_id=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")
# Disable termination-protection
aws emr modify-cluster-attributes --cluster-id ${cluster_id} --no-termination-protected
# terminate cluster
aws emr terminate-clusters --cluster-ids ${cluster_id}

Now we have all of our scripts ready.

Launch EMR From AWS Console

Before getting started, there are a few things we need to do:
1. Upload the bootstrap.sh and step.sh files to S3 (a minimal upload sketch follows this list)
2. Make sure the S3 bucket holding the .sh files and the EMR cluster are in the same region
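
For the upload itself, here is a minimal boto3 sketch. It assumes the bucket name used throughout this post (bootstrap-and-emr-step-demo) and the ap-south-1 region; substitute your own.

import boto3

# Bucket name and region are the ones used in this post; replace with your own
s3 = boto3.client("s3", region_name="ap-south-1")
s3.upload_file("bootstrap.sh", "bootstrap-and-emr-step-demo", "bootstrap.sh")
s3.upload_file("step.sh", "bootstrap-and-emr-step-demo", "step.sh")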

1. From AWS console → Services → EMR → Create Cluster → Advanced options

In software configuration, select the EMR version and required applications

2. In the Steps section, set Step type → Custom JAR

JAR location → s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments → <s3 path of your step.sh file>

After adding Step

3. Select the required nodes, instance types, and root device EBS volume size

4. General cluster settings → Additional Options → Bootstrap Actions

Select logging path for storing logs

Script location → <s3 path of your Bootstrap.sh file>

After adding bootstrap action

5. Security Section

In this section, select the EC2 key pair and the EC2 security groups for the master and core nodes → Create cluster

6. Outcomes

1. Starting stage
2. Bootstrapping stage
3. Running step stage
4. Cluster terminated

Once the step executes successfully, it will terminate the cluster.
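
If you would rather script the whole console walkthrough above, here is a minimal boto3 sketch of the same launch. The release label, instance types, count, and IAM role names are placeholder assumptions of mine, not values from this post; the bootstrap, step, and log paths reuse the bucket from earlier.

import boto3

emr = boto3.client("emr", region_name="ap-south-1")

response = emr.run_job_flow(
    Name="bootstrap-and-emr-step-demo",
    ReleaseLabel="emr-5.32.0",  # any release > 5.20 for the pip3 commands above
    Applications=[{"Name": "Hadoop"}],  # add JupyterHub, Hive, etc. as needed
    LogUri="s3://bootstrap-and-emr-step-demo/logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "<your-key-pair>",
        # keep the cluster alive after bootstrap; step.sh terminates it explicitly
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": True,  # step.sh disables this before terminating
    },
    BootstrapActions=[
        {
            "Name": "Install dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://bootstrap-and-emr-step-demo/bootstrap.sh"
            },
        }
    ],
    Steps=[
        {
            "Name": "process-images",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                "Args": ["s3://bootstrap-and-emr-step-demo/step.sh"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])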

Launch EMR Using CFT

In practice, EMR creation is mostly done using a CFT or Terraform. Below is a CFT snippet to add to your existing EMR CFT.

"Resources": {
"cluster": {
"Type": "AWS::EMR::Cluster",
"Properties": {
"Applications": [
{
"Name": "Hadoop"
},
{
"Name": "JupyterHub"
},
{
"Name": "Hive"
},
],
"BootstrapActions": [
{
"Name": "Custom action",
"ScriptBootstrapAction": {
"Path": "<s3 path of bootstrap.sh>"
}
}
],
...................................................
"Teststeps": {
"Properties": {
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Args": [
""<s3 path of step.sh>"
],
"Jar": "s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar"
},
"JobFlowId": {
"Ref": "cluster"
},
"Name": "Teststeps"
},
"Type": "AWS::EMR::Step"
}

Parsed CFT JSON, shown for reference

Problems faced while working with bootstrap action

Terminated with errors Master instance (instance-id) failed attempting to download bootstrap action 1 file from S3

This problem occurred because I was creating the EMR cluster in one region while the S3 bucket storing bootstrap.sh was in a different region.

Terminated with errors On the master instance (instance-id), bootstrap action 1 returned a non-zero return code

This problem occurred due to errors in the bootstrap.sh commands, which can be debugged easily if you have enabled logging.
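
When a bootstrap action or step fails, it also helps to pull the cluster's state-change reason programmatically. A small sketch, assuming boto3 and a hypothetical <cluster-id> placeholder:

import boto3

emr = boto3.client("emr", region_name="ap-south-1")
# <cluster-id> is a placeholder; use the ID of the failed cluster
status = emr.describe_cluster(ClusterId="<cluster-id>")["Cluster"]["Status"]
print(status["State"])
print(status["StateChangeReason"])  # surfaces messages like the bootstrap errors above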

The bootstrap action, Python, and EMR step code can be found on GitHub.
