Deep Learning on a spot instance - save money and run only when needed on AWS!
Disclaimer:
At some point (perhaps mid-2021), AWS seems to have disabled Spot as an option for P-class instances. It may be possible to request a limit increase to run a P spot instance. See notes on instance types at the end.
Background
I had a web scraping job that collected a lot of news articles I wanted to summarize using machine learning (PEGASUS, to be specific). Summarization required a large EC2 instance that I didn't want to leave running - so I decided to "bid" on a spot instance every day when I needed it. Spot prices were often 25 cents on the dollar, and the instance only ran when needed, so overall my deep learning predictions were 98% cheaper than leaving an on-demand instance running all the time.
Below is a breakdown of how to request a spot instance and run a task on it for deep learning or whatever your use case may be! I used python/boto3 for this.
Python Initiated Spot Request
Import
Import packages and create client
import boto3
import base64
import datetime
client = boto3.client("ec2", "us-east-2")
User Data
Define the user data - this is what's executed once the instance boots
userData = '''
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
sudo shutdown -P +120; aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
--//
'''
The cloud-config part controls when the user data script runs - the [scripts-user, always] entry tells cloud-init to run it on every boot, not just the first launch.
You'll see that userdata here is a bash script that:
- sets the shutdown time for 2 hours from launch (as a fail-safe to prevent cost overruns)
sudo shutdown -P +120
- fetches a bash script
aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/
- runs that retrieved bash script
sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
Here, s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh
is a shell script I'm fetching and executing to set up anything the instance needs (package installations/configuration) and then run the prediction. An example of what this could look like is at the end of this post for reference.
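If you'd rather not hand-write the MIME boundaries, the same multipart user data can be built with Python's standard email package. This is just a sketch of an alternative, not what I originally used - the shell command is the same one shown above:

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# cloud-config part: tell cloud-init to run user scripts on every boot
cloud_config = """#cloud-config
cloud_final_modules:
- [scripts-user, always]
"""

# shell script part: fail-safe shutdown, then fetch and run the prediction script
shell_script = """#!/bin/bash
sudo shutdown -P +120; aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
"""

message = MIMEMultipart()

part = MIMEText(cloud_config, "cloud-config")
part.add_header("Content-Disposition", 'attachment; filename="cloud-config.txt"')
message.attach(part)

part = MIMEText(shell_script, "x-shellscript")
part.add_header("Content-Disposition", 'attachment; filename="userdata.txt"')
message.attach(part)

userData = message.as_string()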
Instance Specification
Then I define the instance specification:
launchSpecification = {
"IamInstanceProfile": {
"Arn": "arn:aws:iam::MY-AWS-ACCOUNT-ID:instance-profile/MY-INSTANCE-IAM",
},
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1", # note: '/dev/xvda' on some instance types
"Ebs": {
"DeleteOnTermination": True,
"VolumeSize": 80,
"VolumeType": "standard",
},
},
],
"ImageId": "ami-MY-AMI-ID",
"KeyName": "MY-SSH-KEY",
"SecurityGroups": ["MY-SG-1-NAME", "MY-SG-2-NAME"],
"InstanceType": "m4.xlarge",
"Placement": {
"AvailabilityZone": "us-east-2a",
},
"SecurityGroupIds": ["sg-MY-SG-1", "sg-MY-SG-2"],
"UserData": base64.b64encode(
userData.encode(
"utf-8"
)
).decode("ascii"),
}
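One thing to note: the instance profile referenced in IamInstanceProfile needs S3 permissions, since the user data pulls the run script from S3 and the run script syncs data and predictions. A rough sketch of attaching such a policy to the role behind the profile - the role name, policy name, and bucket ARNs below are placeholders:

import json
import boto3

iam = boto3.client("iam")

# Inline policy: read the script/data from the bucket and write predictions back
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-bucket-name",
                "arn:aws:s3:::my-bucket-name/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="MY-INSTANCE-ROLE",  # the role inside the instance profile (placeholder)
    PolicyName="spot-prediction-s3-access",  # placeholder name
    PolicyDocument=json.dumps(s3_policy),
)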
Make request
And finally I make the request, specifying the maximum hourly price I'm willing to pay:
client.request_spot_instances(
DryRun=False,
SpotPrice="0.30",
ClientToken=datetime.datetime.now().isoformat()[:10],
InstanceCount=1,
Type="one-time",
LaunchSpecification=launchSpecification
)
ClientToken is a unique string that AWS uses to ensure your request isn't submitted multiple times. datetime.datetime.now().isoformat()[:10] is today's date (YYYY-MM-DD), which means I can't accidentally spin up multiple instances in the same day.
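Before settling on a SpotPrice, it can help to look at recent spot prices for the instance type. A quick check like this (not part of the flow above, just a sanity check using the same client):

import datetime

# Last 24 hours of m4.xlarge spot prices in the zone we're requesting
history = client.describe_spot_price_history(
    InstanceTypes=["m4.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    AvailabilityZone="us-east-2a",
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=1),
)
for record in history["SpotPriceHistory"]:
    print(record["Timestamp"], record["SpotPrice"])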
Final script
import boto3
import base64
import datetime
client = boto3.client("ec2", "us-east-2")
userData = '''
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
sudo shutdown -P +120; aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh /tmp/ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_full_prediction_flow_on_ec2.sh > run_pred.log 2>&1'
--//
'''
launchSpecification = {
"IamInstanceProfile": {
"Arn": "arn:aws:iam::MY-AWS-ACCOUNT-ID:instance-profile/MY-INSTANCE-IAM",
},
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1", # note: '/dev/xvda' on some instance types
"Ebs": {
"DeleteOnTermination": True,
"VolumeSize": 80,
"VolumeType": "standard",
},
},
],
"ImageId": "ami-MY-AMI-ID",
"KeyName": "MY-SSH-KEY",
"SecurityGroups": ["MY-SG-1-NAME", "MY-SG-2-NAME"],
"InstanceType": "m4.xlarge",
"Placement": {
"AvailabilityZone": "us-east-2a",
},
"SecurityGroupIds": ["sg-MY-SG-1", "sg-MY-SG-2"],
"UserData": base64.b64encode(
userData.encode(
"utf-8"
)
).decode("ascii"),
}
client.request_spot_instances(
DryRun=False,
SpotPrice="0.30",
ClientToken=datetime.datetime.now().isoformat()[:10],
InstanceCount=1,
Type="one-time",
LaunchSpecification=launchSpecification
)
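request_spot_instances only submits the request - it doesn't wait for it to be fulfilled. If you want to confirm fulfillment and grab the instance ID, you can capture the response and poll the request; a minimal sketch (not part of my original flow):

import time

# same request as above, this time keeping the response
response = client.request_spot_instances(
    DryRun=False,
    SpotPrice="0.30",
    ClientToken=datetime.datetime.now().isoformat()[:10],
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification=launchSpecification,
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

# Poll until fulfilled (or give up after ~5 minutes)
for _ in range(30):
    described = client.describe_spot_instance_requests(
        SpotInstanceRequestIds=[request_id]
    )
    request = described["SpotInstanceRequests"][0]
    print(request["Status"]["Code"])
    if request["Status"]["Code"] == "fulfilled":
        print("Instance ID:", request["InstanceId"])
        break
    time.sleep(10)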
For Reference:
Instance type notes:
If you hit an error like botocore.exceptions.ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded
check https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-2#Limits: to see what instance types you're able to use. When I searched for "spot" in June 2022, I saw that "All P Spot Instance Requests" only allowed "0 vCPUs", which explains why my old script with p2.xlarge was failing!
For deep learning you'll want a GPU instance. AWS seems to have created a new class that has GPUs and can be run as spot - the DL instance type - but that's like $15+ per hour! You may be able to request a limit increase to run P instances if you need a GPU.
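If you'd rather check those spot limits from code instead of the console, the Service Quotas API can list them. A rough sketch - matching quota names on the substring "Spot" is an assumption about how AWS labels them:

import boto3

quotas = boto3.client("service-quotas", "us-east-2")

# List EC2 quotas and print the spot-related ones
# (e.g. "All P Spot Instance Requests" with a value of 0 vCPUs)
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        if "Spot" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])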
run_full_prediction_flow_on_ec2.sh
#cloud-boothook
#!/bin/bash
set -e
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
PATH=/home/ubuntu/.local/bin:$PATH
# Set up directories
mkdir -p /home/ubuntu/analysis
mkdir -p /home/ubuntu/analysis/pred
# fetch the data from S3
aws s3 sync s3://MY_BUCKET/data /home/ubuntu/analysis/data
aws s3 sync s3://MY_BUCKET/model /home/ubuntu/analysis/model
cd /home/ubuntu/analysis/model
# Install packages
pip3 install --user -r requirements.txt
pip3 install --user pipenv ipython
pipenv --site-packages
pipenv install
# Predict
pipenv run python run_predictions.py
# Send predictions to S3
aws s3 sync /home/ubuntu/analysis/pred s3://MY_BUCKET/pred/`date +"%Y%m%d"`
# Shutdown after making predictions to limit costs. This was commented out in case someone happened to copy and paste this accidentally...
# sudo shutdown