Automating the AWS EMR Lifecycle
Automate the Whole AWS EMR Process Using a Bootstrap Action and an EMR Step
In this tutorial, we will spin up an EMR cluster, install all dependencies using a bootstrap action, then use an EMR step to download images from Kaggle, process them with Python, store the processed images in S3, and terminate the cluster.
This tutorial highlights how to use a Bootstrap Action and an EMR Step to automate EMR configuration, process some data, and terminate the cluster once processing is done.
- Create a Bootstrap Action shell Script
- Create a Python script for converting images from PNG to JPG
- Create an EMR Step shell script
- Launch EMR from AWS Console including Bootstrap Action and EMR step
- Launch EMR using CFT including Bootstrap Action and EMR step
- Problems faced
Getting Started with Bootstrap Action
Basically, a bootstrap action is used to install required packages before the cluster is ready for work. It has one special advantage: it runs on every node in the cluster, so each node gets the packages specified in the script.
Let's install some of our favorite Python packages, plus Git for cloning a repository.
#!/bin/bash
pip3 install --user pandas
pip3 install --user kaggle
pip3 install --user opencv-python
sudo yum install git -y
The pip3 commands above work only on EMR versions greater than 5.20, because older EMR versions default to Python 2.7. If you want to install Python packages on an older EMR version, use the snippet below.
#!/bin/bash
# 1st step is to make python3 the default python
sudo ln -fs /usr/bin/python3.6 /etc/alternatives/python
sudo ln -fs /usr/bin/pip-3.6 /etc/alternatives/pip
sudo pip install --upgrade pip
# now you can install packages
/usr/local/bin/pip3.6 install --user pandas
# rest everything will be the same
sudo yum install git -y
Let's convert the images from PNG to JPG using Python
1. Import required libraries
from PIL import Image
import pandas as pd
2. Open Image as png and save it as jpg
im = Image.open('some_path/image_1.png')
im.save('some_path/image_1.jpg')
3. Final Snippet
from PIL import Image
import pandas as pd

# convert training dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-training-dataset.csv")
for i in df.id:
    s_path = "/home/hadoop/raw_data/boneage-training-dataset/boneage-training-dataset/" + str(i) + ".png"
    d_path = "/home/hadoop/clean_data/boneage-training-dataset/" + str(i) + ".jpg"
    im = Image.open(s_path)
    im.save(d_path)

# convert testing dataset
df = pd.read_csv(r"/home/hadoop/raw_data/boneage-test-dataset.csv")
for i in df['Case ID']:
    s_path = "/home/hadoop/raw_data/boneage-test-dataset/boneage-test-dataset/" + str(i) + ".png"
    d_path = "/home/hadoop/clean_data/boneage-test-dataset/" + str(i) + ".jpg"
    im = Image.open(s_path)
    im.save(d_path)
The dataset used can be found on Kaggle. It includes a CSV file that maintains a mapping of each image to its ID; in the code above I have used that CSV to iterate through each image.
Create an EMR Step
1. Configure kaggle
#!/bin/bash
cd /home/hadoop
aws s3 cp s3://bootstrap-and-emr-step-demo/kaggle.json /home/hadoop/kaggle.json
# configuring kaggle for downloading dataset
mkdir /home/hadoop/.kaggle
sudo cp /home/hadoop/kaggle.json /home/hadoop/.kaggle/
In order to download a dataset from Kaggle, we need to authenticate using an API token; see the Kaggle API documentation for a detailed explanation.
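For reference, the token is a small JSON file downloaded from your Kaggle account page. A minimal sketch of creating it by hand follows; the username and key values here are placeholders, not real credentials:

```shell
# Write a placeholder Kaggle API token (replace with your real credentials)
mkdir -p "$HOME/.kaggle"
cat > "$HOME/.kaggle/kaggle.json" <<'EOF'
{"username":"your-kaggle-username","key":"your-api-key"}
EOF
# The kaggle CLI warns about world-readable tokens, so restrict permissions
chmod 600 "$HOME/.kaggle/kaggle.json"
```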
2. Clone Git Repo
# cloning repo
git clone -b master https://github.com/kazam1920/png_to_jpg.git /home/hadoop/png_to_jpg
3. Download dataset
# downloading dataset from kaggle
~/.local/bin/kaggle datasets download kmader/rsna-bone-age --path /home/hadoop/raw_data --unzip
4. Creating directories for processing images
mkdir /home/hadoop/clean_data
mkdir /home/hadoop/clean_data/boneage-training-dataset
mkdir /home/hadoop/clean_data/boneage-test-dataset
mkdir /home/hadoop/final
sudo cp raw_data/boneage-training-dataset.csv /home/hadoop/clean_data/
sudo cp raw_data/boneage-test-dataset.csv /home/hadoop/clean_data/
5. Executing python script
python3 /home/hadoop/png_to_jpg/png_to_jpg.py
The command above works only on EMR versions greater than 5.20; to execute a Python script on an older EMR version, use the snippet below.
/usr/bin/python3.6 /home/hadoop/png_to_jpg/png_to_jpg.py
6. Moving processed data to s3
hdfs dfs -put /home/hadoop/clean_data
s3-dist-cp --src /user/hadoop/clean_data --dest s3://bootstrap-and-emr-step-demo/clean_data
s3-dist-cp is used to copy data in parallel, and we can use it only if our data is in HDFS. We could instead run aws s3 cp /user/hadoop/clean_data s3://bootstrap-and-emr-step-demo/clean_data directly, but if we have huge data, say in TBs, then aws s3 cp will take too much time.
7. Terminate Cluster
# Extract cluster-id from job-flow.json
cluster_id=$(cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId")
# Disable termination-protection
aws emr modify-cluster-attributes --cluster-id ${cluster_id} --no-termination-protected
# terminate cluster
aws emr terminate-clusters --cluster-ids ${cluster_id}
Now we have all of our scripts ready.
Launch EMR From AWS Console
Before getting started, there are a few things we need to do:
1. Upload the bootstrap.sh and step.sh files to S3
2. Make sure the S3 bucket where the .sh files are stored and the EMR cluster are in the same region
1. From AWS console → Services → EMR → Create Cluster → Advanced options
2. In Steps section step type → Custom Jar
JAR Location → s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar
Arguments → <s3 path of your step.sh file>
3. Select the required nodes, instance type, Root device EBS volume size
4. General cluster settings → Additional Options → Bootstrap Actions
Script location → <s3 path of your Bootstrap.sh file>
5. Security Section
In this section select EC2 key pair, EC2 security group for master and core node → create cluster
6. Outcome
Once the step is executed successfully, it will terminate the cluster.
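The console steps above can also be scripted with the AWS CLI. A rough sketch is shown below, assuming the bucket, key pair, instance type, and region from this tutorial; substitute your own values:

```shell
# Hypothetical names -- replace the bucket, key pair, and region with yours
aws emr create-cluster \
  --name "png-to-jpg-demo" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --bootstrap-actions Path=s3://bootstrap-and-emr-step-demo/bootstrap.sh \
  --steps Type=CUSTOM_JAR,Name=png-to-jpg,ActionOnFailure=CANCEL_AND_WAIT,Jar=s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://bootstrap-and-emr-step-demo/step.sh] \
  --use-default-roles \
  --region ap-south-1
```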
Launch EMR Using CFT
EMR creation is mostly done using a CFT (CloudFormation template) or Terraform; below I have mentioned a CFT snippet to be added to your existing EMR CFT.
"Resources": {
"cluster": {
"Type": "AWS::EMR::Cluster",
"Properties": {
"Applications": [
{
"Name": "Hadoop"
},
{
"Name": "JupyterHub"
},
{
"Name": "Hive"
}
],
"BootstrapActions": [
{
"Name": "Custom action",
"ScriptBootstrapAction": {
"Path": "<s3 path of bootstrap.sh>"
}
}
],
...................................................
"Teststeps": {
"Properties": {
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Args": [
""<s3 path of step.sh>"
],
"Jar": "s3://ap-south-1.elasticmapreduce/libs/script-runner/script-runner.jar"
},
"JobFlowId": {
"Ref": "cluster"
},
"Name": "Teststeps"
},
"Type": "AWS::EMR::Step"
}
The parsed CFT JSON is shown above for reference.
Problems faced while working with bootstrap action
Terminated with errors Master instance (instance-id) failed attempting to download bootstrap action 1 file from S3
This problem occurred because I was creating the EMR cluster in one region while the S3 bucket where I had stored bootstrap.sh was in a different region.
Terminated with errors On the master instance (instance-id), bootstrap action 1 returned a non-zero return code
This problem occurred due to errors in the bootstrap.sh commands, which can be easily debugged if you have enabled logging.
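If logging was enabled at cluster creation, EMR uploads each bootstrap action's output to the log bucket under the cluster and instance IDs. A sketch of pulling the stderr of the first bootstrap action is shown below; the bucket name, cluster ID, and instance ID are placeholders:

```shell
# Placeholders: substitute your log bucket, cluster id, and instance id
aws s3 cp \
  s3://my-emr-log-bucket/j-XXXXXXXXXXXXX/node/i-0123456789abcdef0/bootstrap-actions/1/stderr.gz .
gunzip -f stderr.gz
cat stderr
```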