Deep Learning on a spot instance - save money and run only when needed on AWS!
Patrick Russo
Disclaimer:
At some point (perhaps mid-2021), AWS seems to have disabled Spot as an option for P-class instances. It may be possible to request a limit increase to run a P spot instance. See notes on instance types at the end.
Background
I had a web scraping job that collected a lot of news articles I wanted to summarize using machine learning (PEGASUS, to be specific). Summarization required a large EC2 instance that I didn't want to leave running - so I decided to "bid" on a spot instance every day when I needed it. Spot prices were often about 25 cents on the dollar, and the instance ran only when needed, so overall my deep learning predictions were roughly 98% cheaper than on an on-demand instance running all the time.
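As a rough check on that figure, here is a small sketch of the arithmetic. The 25% price ratio comes from the paragraph above; the 2 hours of runtime per day is an assumption for illustration (it matches the shutdown fail-safe used later in this post).

```python
# Rough cost comparison: a spot instance run briefly each day versus an
# on-demand instance left running all the time.
SPOT_PRICE_RATIO = 0.25  # spot price as a fraction of on-demand (from the text)
HOURS_PER_DAY = 2        # assumed daily runtime, for illustration only


def relative_cost(price_ratio: float, hours_per_day: float) -> float:
    """Cost as a fraction of an always-on, on-demand instance."""
    return price_ratio * (hours_per_day / 24)


savings = 1 - relative_cost(SPOT_PRICE_RATIO, HOURS_PER_DAY)
print(f"Savings: {savings:.1%}")  # prints "Savings: 97.9%"
```

Under those assumptions the savings come out at about 98%, in line with the figure quoted above.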
Below is a breakdown of how to request a spot instance and run a task on it, for deep learning or whatever your use case may be! I used Python/boto3 for this.
Python Initiated Spot Request
Import
Import packages and create client
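A minimal sketch of this step. The region name is an assumption (us-east-2 appears in the console link near the end of this post), and the helper name make_ec2_client is hypothetical. Credentials are assumed to come from your environment or AWS config.

```python
import datetime  # used later to build a per-day ClientToken


def make_ec2_client(region_name: str = "us-east-2"):
    """Create the EC2 client used to submit the spot request.

    The default region is an assumption; change it to match your setup.
    """
    # boto3 is a third-party package (pip install boto3). It is imported
    # here so the helper can be defined before boto3 is installed.
    import boto3

    return boto3.client("ec2", region_name=region_name)
```

Calling make_ec2_client() then gives you the client object used in the request step.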
User Data
Define the user data - this is what's executed once the instance boots
cloud-config controls when userdata runs (in this case userdata runs when your instance launches).
You'll see that userdata here is a bash script that:
sets the shutdown time for 2 hours from launch (as a fail-safe to prevent cost overruns): sudo shutdown -P +120
fetches a bash script: aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh ./
Here, s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh is a shell script that I fetch and execute to set up anything the instance needs (package installations/configurations) and then run the prediction. An example of what this could look like is at the end of this post for reference.
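The user data described above can be sketched as follows. This is a simplified version under stated assumptions: it uses a plain bash script as user data (which cloud-init runs on first boot) rather than the cloud-config wrapper mentioned above, and the last line of the script, which executes the fetched file, is an assumption based on the description. The helper name encode_user_data is hypothetical. The user data is base64-encoded because request_spot_instances expects it in that form.

```python
import base64

# Bash script the instance runs on first boot (plain shell user data;
# the cloud-config wrapper is omitted in this sketch).
USER_DATA_SCRIPT = """#!/bin/bash
# Fail-safe: power off 2 hours after launch to prevent cost overruns.
sudo shutdown -P +120
# Fetch the prediction script from S3.
aws s3 cp s3://my-bucket-name/run_full_prediction_flow_on_ec2.sh ./
# Assumed step: execute the fetched script.
bash ./run_full_prediction_flow_on_ec2.sh
"""


def encode_user_data(script: str) -> str:
    """Base64-encode user data for the spot LaunchSpecification."""
    return base64.b64encode(script.encode("utf-8")).decode("ascii")


user_data = encode_user_data(USER_DATA_SCRIPT)
```

The aws s3 cp call assumes the AMI ships with the AWS CLI and that the instance has an IAM role with read access to the bucket.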
Instance Specification
Then I define the instance specification:
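A sketch of what such a specification might look like. Every identifier below (AMI ID, key pair, security group, instance profile) is a hypothetical placeholder rather than a value from the original script; p2.xlarge is the instance type mentioned in the notes at the end of this post.

```python
import base64


def build_launch_specification(user_data_b64: str) -> dict:
    """Return a LaunchSpecification dict for request_spot_instances."""
    return {
        "ImageId": "ami-0123456789abcdef0",            # hypothetical AMI ID
        "InstanceType": "p2.xlarge",                   # GPU type from the notes below
        "KeyName": "my-key-pair",                      # hypothetical key pair name
        "SecurityGroupIds": ["sg-0123456789abcdef0"],  # hypothetical security group
        # Hypothetical instance profile; it needs read access to the S3 bucket.
        "IamInstanceProfile": {"Name": "my-instance-profile"},
        "UserData": user_data_b64,                     # must already be base64-encoded
    }


# Placeholder user data so this sketch runs on its own.
placeholder_user_data = base64.b64encode(b"#!/bin/bash\n").decode("ascii")
launch_specification = build_launch_specification(placeholder_user_data)
```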
Make request
And finally I make the request with the price I'm willing to pay
ClientToken is a unique string that AWS uses to ensure your request isn't submitted multiple times. datetime.datetime.now().isoformat()[:10] is the current date (YYYY-MM-DD), which means that I can't accidentally spin up multiple instances on the same day.
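The request step might look roughly like this. The helper names and the maximum price of "0.30" are assumptions for illustration; the client argument is expected to be a boto3 EC2 client, and the launch specification is whatever dict you defined in the previous step.

```python
import datetime


def build_client_token(now=None) -> str:
    """Per-day token (YYYY-MM-DD), so same-day requests share one token."""
    now = now or datetime.datetime.now()
    return now.isoformat()[:10]


def request_spot_instance(client, launch_specification: dict, max_price: str = "0.30"):
    """Submit a one-time spot request through a boto3 EC2 client."""
    return client.request_spot_instances(
        SpotPrice=max_price,  # max hourly price in USD, as a string (hypothetical value)
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification=launch_specification,
        ClientToken=build_client_token(),
    )
```

With a real client, the response contains a SpotInstanceRequests list describing the submitted request.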
Final script
For Reference:
Instance type notes:
If you hit an error like botocore.exceptions.ClientError: An error occurred (MaxSpotInstanceCountExceeded) when calling the RequestSpotInstances operation: Max spot instance count exceeded, check https://us-east-2.console.aws.amazon.com/ec2/v2/home?region=us-east-2#Limits: to see which instance types you're able to use. When I searched for "spot" in June 2022, I saw that "All P Spot Instance Requests" allowed only "0 vCPUs", which explains why my old script with p2.xlarge was failing!
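If you would rather get a readable hint than a traceback for this error, one option is to inspect the error code on the exception. The sketch below is an assumption about how you might structure that, with hypothetical helper names; it relies only on the response attribute that botocore's ClientError carries.

```python
def is_spot_limit_error(exc: Exception) -> bool:
    """True if exc looks like a ClientError for the spot instance limit."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "MaxSpotInstanceCountExceeded"


def submit_or_explain(submit):
    """Call submit(); print a hint on the spot limit error, re-raise anything else."""
    try:
        return submit()
    except Exception as exc:
        if is_spot_limit_error(exc):
            print("Spot limit reached for this instance type; check the EC2 Limits page.")
            return None
        raise
```

Here submit would be a zero-argument function that wraps your request_spot_instances call.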
For deep learning you'll want a GPU instance. AWS seems to have created a new class that has GPUs and can be run as spot - the DL instance type - but that's like $15+ per hour! You may be able to request a limit increase to run P instances if you need a GPU.