AWS GPU Instances for Batch Processing - Save on Deep Learning

While EC2 instances are great for running services, the price can add up - especially if you need access to a GPU (e.g., for deep learning).

I have a batch process I run daily that takes over an hour on a P2 instance (updating https://8020news.com/today). This would really add up at the on-demand price of $0.90 an hour (especially if I didn't have a way to start and stop it on a schedule!).

So I wrote a Python script (called by a cron job) that requests a GPU spot instance; the instance shuts itself down once the job completes (or after 60 minutes if it doesn't, as a cost-saving measure).
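
For reference, the cron entry is just a daily invocation of that script - something like the line below (the path and time are hypothetical; adjust to your setup):

# Run the spot request script every morning at 06:00
0 6 * * * /usr/bin/python3 /home/me/request_gpu_spot.py >> /home/me/spot_request.log 2>&1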

The trickiest part was figuring out the UserData, but AWS has an example of using a MIME multi-part file here: https://aws.amazon.com/premiumsupport/knowledge-center/execute-user-data-ec2/ and there are others around on Stack Overflow, etc.
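
If the UserData doesn't seem to run, the cloud-init output log on the instance is a good first place to look - for example, a quick check over SSH (the key path and address here are placeholders):

ssh -i SSH_KEYNAME.pem ubuntu@INSTANCE_IP 'tail -n 50 /var/log/cloud-init-output.log'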

Why a Spot Instance?

A P2 instance (with an NVIDIA K80) is the cheapest GPU instance on EC2, and it still costs $0.90 an hour. A P3 has a much more powerful GPU (an NVIDIA V100), but that starts at $3.06 an hour. You can usually get a spot instance for about 1/4 the cost - so if you have a batch process that doesn't need to run at an exact time (and you don't mind the small chance it will be interrupted while running), spot instances are perfect! In fact, you end up saving about 98% relative to an on-demand instance running 24/7!
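
A quick back-of-the-envelope check on that figure, assuming the job runs about 1.5 hours a day and the spot price comes in around a quarter of the $0.90 on-demand rate:

# Rough monthly cost: always-on on-demand vs. a ~1.5 hour/day spot run
on_demand_hourly = 0.90                    # p2.xlarge on-demand price ($/hr)
spot_hourly = on_demand_hourly / 4         # roughly 1/4 the cost on spot

always_on = on_demand_hourly * 24 * 30     # ~$648/month running 24/7
daily_batch = spot_hourly * 1.5 * 30       # ~$10/month for the batch job

print(f"~{1 - daily_batch / always_on:.0%} saved")  # ~98%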

Python Script

Here's the script I'm using; obviously you'd have to fill in ACCOUNT_NUMBER, SECURITY_GROUP, AMI_NAME, SSH_KEYNAME, S3_BUCKET_NAME, and your shell script name!

import base64
import datetime

import boto3

# EC2 client in the region where the spot request will be made
client = boto3.client("ec2", "us-east-2")
response = client.request_spot_instances(
    DryRun=False,
    SpotPrice="0.30",  # maximum hourly price you're willing to pay
    # Date prefix (YYYY-MM-DD) keeps the request idempotent for the day
    ClientToken=datetime.datetime.now().isoformat()[:10],
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "IamInstanceProfile": {
            "Arn": "arn:aws:iam::ACCOUNT_NUMBER:instance-profile/p2-pred",
        },
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sda1",  
                "Ebs": {
                    "DeleteOnTermination": True,
                    "VolumeSize": 80,
                    "VolumeType": "standard",
                },
            },
        ],
        "ImageId": "AMI_NAME",
        "KeyName": "SSH_KEYNAME",
        "SecurityGroups": ["SECURITY_GROUP"],
        "InstanceType": "p2.xlarge",
        "Placement": {
            "AvailabilityZone": "us-east-2b",  
        },
        "SecurityGroupIds": ["sg-ABCD"],
        "UserData": base64.encodestring(
            """
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
cd /tmp; sudo shutdown -P +60; aws s3 cp s3://S3_BUCKET_NAME/run_prediction_on_ec2.sh ./ && sudo runuser -l ubuntu -c '/bin/bash /tmp/run_prediction_on_ec2.sh >run_prediction_on_ec2.log 2>&1'
--//
""".encode(
                "utf-8"
            )
        ).decode("ascii"),
    },
)
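
The request itself is asynchronous - the script returns as soon as the spot request is submitted, not when the instance launches. If you want to check on it (say, a few minutes later), here's a minimal sketch using the same client; the request ID comes straight out of the response above:

# Look up the spot request submitted above
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

status = client.describe_spot_instance_requests(
    SpotInstanceRequestIds=[request_id]
)["SpotInstanceRequests"][0]

print(status["State"])                # "open", "active", "closed", ...
print(status["Status"]["Message"])    # human-readable status message
print(status.get("InstanceId", "not fulfilled yet"))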