Deep Learning on a spot instance - save money and run only when needed on AWS!

Disclaimer:

At some point (perhaps mid-2021), AWS seems to have disabled Spot as an option for P-class instances. It may be possible to request a limit increase to run a P spot instance. See notes on instance types at the end.

Background

I had a web scraping job that collected a lot of news articles I wanted to summarize using machine learning (PEGASUS, to be specific). Summarization required a large EC2 instance that I didn't want to leave running, so I decided to "bid" on a spot instance each day when I needed it. Spot prices were often 25 cents on the dollar, and the instance ran only when needed, so overall my deep learning predictions cost about 98% less than an on-demand instance running around the clock.
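(Back-of-the-envelope math, with illustrative numbers rather than an exact bill: paying roughly 25% of the on-demand rate for about 2 of every 24 hours works out to 0.25 × 2/24 ≈ 2% of the cost of the same instance running on demand all the time, i.e. about 98% cheaper.)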

Below is a breakdown of how to request a spot instance and run a task on it, whether for deep learning or whatever your use case may be! I used Python/boto3 for this.

Python-Initiated Spot Request

Import

Import packages and create client

import boto3
import base64
import datetime

client = boto3.client("ec2", "us-east-2")
import block

User Data

Define the user data - this is what's executed once the instance boots

userData = '''
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
sudo shutdown -P +120; aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
--//
'''
the userdata we'll embed to run on launch of the spot instance

cloud-config controls when the userdata runs: the [scripts-user, always] entry tells cloud-init to run user scripts on every boot, not just the first boot of the instance.

You'll see that userdata here is a bash script that:

  • sets a shutdown for 2 hours after launch (a fail-safe to prevent cost overruns): sudo shutdown -P +120
  • fetches a bash script from S3: aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/
  • runs that retrieved script as the ubuntu user: sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'

Here, s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh is a shell script I fetch and execute to set up anything the instance needs (package installations/configuration) and then run the prediction. An example of what this could look like is at the end of this post for reference.
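One debugging tip: if the script doesn't appear to run, cloud-init writes userdata output to /var/log/cloud-init-output.log on Ubuntu AMIs, and that's the first place I'd look.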

Instance Specification

Then I define the instance specification:

launchSpecification = {
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::MY-AWS-ACCOUNT-ID:instance-profile/MY-INSTANCE-IAM",
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1", # note: '/dev/xvda' on some instance types
                "Ebs": {
                    "DeleteOnTermination": True,
                    "VolumeSize": 80,
                    "VolumeType": "standard",
                },
            },
        ],
        "ImageId": "ami-MY-AMI-ID",
        "KeyName": "MY-SSH-KEY",
        "SecurityGroups": ["MY-SG-1-NAME", "MY-SG-2-NAME"],
        "InstanceType": "m4.xlarge",
        "Placement": {
            "AvailabilityZone": "us-east-2a",
        },
        "SecurityGroupIds": ["sg-MY-SG-1", "sg-MY-SG-2"],
        "UserData": base64.b64encode(
            userData.encode(
                "utf-8"
            )
        ).decode("ascii"),
    }
launch specification
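Before committing to a bid in the next step, it can be worth peeking at recent spot prices for the instance type. A minimal sketch reusing the client and imports from above (the one-hour lookback is arbitrary; the instance type and availability zone mirror the launch specification):

# recent spot prices for the instance type/AZ in the launch specification
history = client.describe_spot_price_history(
    InstanceTypes=["m4.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    AvailabilityZone="us-east-2a",
    StartTime=datetime.datetime.now() - datetime.timedelta(hours=1),
)
for record in history["SpotPriceHistory"]:
    print(record["Timestamp"], record["SpotPrice"])
optional: checking recent spot prices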

Make request

And finally I make the request with the price I'm willing to pay:

client.request_spot_instances(
    DryRun=False,
    SpotPrice="0.30",
    ClientToken=datetime.datetime.now().isoformat()[:10],
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification=launchSpecification
    )

ClientToken is a unique string that AWS uses to ensure your request isn't submitted multiple times. datetime.datetime.now().isoformat()[:10] is today's date (YYYY-MM-DD), which means I can't accidentally spin up multiple instances on the same day.
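request_spot_instances also returns the request ID, so if you want the script to wait and confirm fulfillment you can poll it. A minimal sketch (the polling loop is my addition, not part of the original flow; it reuses the client and launch specification from above):

import time

# same request as above, but keeping the response so we can poll it
response = client.request_spot_instances(
    DryRun=False,
    SpotPrice="0.30",
    ClientToken=datetime.datetime.now().isoformat()[:10],
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification=launchSpecification,
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

# poll until AWS fulfills (or rejects) the request
while True:
    desc = client.describe_spot_instance_requests(SpotInstanceRequestIds=[request_id])
    state = desc["SpotInstanceRequests"][0]["State"]
    print("spot request state:", state)
    if state in ("active", "failed", "cancelled", "closed"):
        break
    time.sleep(15)
optional: polling the spot request until it's fulfilled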

Final script

import boto3
import base64
import datetime

client = boto3.client("ec2", "us-east-2")

userData = '''
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
sudo shutdown -P +120; aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
--//
'''

launchSpecification = {
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::MY-AWS-ACCOUNT-ID:instance-profile/MY-INSTANCE-IAM",
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1", # note: '/dev/xvda' on some instance types
                "Ebs": {
                    "DeleteOnTermination": True,
                    "VolumeSize": 80,
                    "VolumeType": "standard",
                },
            },
        ],
        "ImageId": "ami-MY-AMI-ID",
        "KeyName": "MY-SSH-KEY",
        "SecurityGroups": ["MY-SG-1-NAME", "MY-SG-2-NAME"],
        "InstanceType": "m4.xlarge",
        "Placement": {
            "AvailabilityZone": "us-east-2a",
        },
        "SecurityGroupIds": ["sg-MY-SG-1", "sg-MY-SG-2"],
        "UserData": base64.b64encode(
            userData.encode(
                "utf-8"
            )
        ).decode("ascii"),
    }


client.request_spot_instances(
    DryRun=False,
    SpotPrice="0.30",
    ClientToken=datetime.datetime.now().isoformat()[:10],
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification=launchSpecification
    )
final python script

For Reference:

Instance type notes:

If you hit an error like botocore.exceptions.ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded, check https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-2#Limits: to see which instance types you're able to use. When I searched for "spot" in June 2022, I saw that "All P Spot Instance Requests" only allowed "0 vCPUs", which explained why my old script with p2.xlarge was failing!

For deep learning you'll want a GPU instance. AWS seems to have created a new class that has GPUs and can be run as spot (the DL instance type), but that's around $15+ per hour! You may be able to request a limit increase to run P instances if you need a GPU.
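You can also check those quotas from Python instead of the console. A minimal sketch using the Service Quotas API (the "Spot" name filter is an assumption; adjust it to whatever your account lists):

import boto3

sq = boto3.client("service-quotas", "us-east-2")

# page through all EC2 quotas and print the spot-related ones
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "Spot" in quota["QuotaName"]:
            print(quota["QuotaName"], "->", quota["Value"])
optional: listing spot quotas via the Service Quotas API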

run_full_prediction_flow_on_ec2.sh

#cloud-boothook
#!/bin/bash
set -e
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
PATH=/home/ubuntu/.local/bin:$PATH

# Set up directories (-p creates parents and won't fail if they already exist, which matters under set -e)
mkdir -p /home/ubuntu/analysis/pred

# fetch the data from S3
aws s3 sync s3://MY_BUCKET/data  /home/ubuntu/analysis/data
aws s3 sync s3://MY_BUCKET/model  /home/ubuntu/analysis/model

cd /home/ubuntu/analysis/model

# Install packages
pip3 install --user -r requirements.txt
pip3 install --user pipenv ipython
pipenv --site-packages
pipenv install

# Predict
pipenv run python run_predictions.py

# Send predictions to S3
aws s3 sync /home/ubuntu/analysis/pred s3://MY_BUCKET/pred/`date +"%Y%m%d"` 

# Shutdown after making predictions to limit costs. This was commented out in case someone happened to copy and paste this accidentally...
# sudo shutdown
s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh