Model training in the cloud
Maximizing your GPU dollars — Training a model in Azure Machine Learning Services with 3 lines of code
Learn how to train models in the Azure cloud with almost no effort and just a couple of dollars
Happy new year fellows!
However, I was surprised to read that, despite being the second-biggest cloud provider in terms of adoption, Azure didn’t make it into the competition because the service was hard to use. We are talking here about Azure Machine Learning Services, Microsoft’s solution for training custom Machine Learning models in the cloud. Although I think the service is relatively simple, I will agree and acknowledge that today’s documentation about how the service works is not the best it could be (probably due to the recent (big) changes introduced in September 2018). Such changes left a mixture of old and new documentation, which makes it difficult to know which one is the right one. I hope we fix this soon.
In the meantime, I am here to show how easy it would be to implement the very same problem from the original post using Azure Machine Learning Services (AML from now on, since I’m a lazy person). To prove to you that it is easy, I promise to use just 3 lines of code (plus some Python imports, don’t be so harsh).
We are trying to find out which cloud provider offers the best deal for training a model using GPU-based compute hardware. I will follow the rules as stated in the original post; however, please read the disclosures at the end for my opinion on the method.
The author proposes solving the popular Kaggle Cats vs Dogs classification competition by training a Convolutional Neural Network combined with Transfer Learning. The chosen framework was PyTorch. I will use most of the author’s code to keep the comparison as fair as possible, but I had to change some lines since the API he used (fast.ai) has changed.
Problem type: Classification
Number of classes: 2 (cats, dogs)
Input: Images (25,000; 50% cats, 50% dogs)
Proposed model: Convolutional Neural Network
Base model: ResNet50
Framework: PyTorch + fast.ai
Microsoft has a variety of services tailored for Machine Learning and AI; however, the most suitable for this talk is by far AML. It provides a cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. The current version of AML uses a code-first approach with Python, which means that the whole process is managed using this language. It can be executed from a notebook or from the IDE of your choice.
To make this post more catchy, I will first show how to solve the problem, but I explained each of the pieces in detail in this post. We can achieve this in less than 20 minutes by creating two scripts (actually 1, because we already have one of them):
- A Python script called train.py, containing all the logic to train the model. This script is exactly the same one you would use in any other cloud provider or even your own machine to train the model, so we don’t have to create this one. The one I’m using is here (PyTorch + fast.ai) which is an updated version of the one used by the original post.
- A Python script called run.py, containing all the controller logic to create and submit a Machine Learning experiment in AML. This script is specific to AML.
Prepare your environment
As I stated before, AML uses a code-first approach to create, manage and publish machine learning experiments, so you will need to install some libraries in your environment. Your environment could be your local computer using PyCharm, Spyder, VS Code or any other IDE of your choice, or it can be a notebook running in the cloud or locally. You need to install two libraries: Azure and Azure ML SDK.
!pip install azure
!pip install --upgrade azureml-sdk[notebooks,automl]
The run.py script
First, we are going to create a workspace in AML to work with. The workspace is the cloud resource you will use to create, manage, and publish machine learning experiments. To create a workspace you need the ID of the subscription you are going to use, a name for the workspace, and a location to deploy the resource. The location parameter is important since it defines which compute hardware will be available for your training job. I’m using East US.
from azureml.core import Workspace

# Create the workspace, or reuse it if it already exists (exist_ok=True).
ws = Workspace.create(
    name="aa-ml-aml-workspace",
    subscription_id="1a2b3c4d-5a7b-5a7b-9a0b1c2d3e5f6g6",  # placeholder: use your own subscription ID
    resource_group="AdvanceAnalytics.ML",
    location="eastus",
    exist_ok=True)
Then, we are going to create an execution environment for PyTorch. The execution environment is the hardware and software configuration you are going to use to train the model. AML has some preconfigured execution environments for TensorFlow and PyTorch (they are called estimators in AML, but I don’t like the name since in Data Science an estimator has a different meaning; I hope we change it soon):
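from azureml.train.dnn import PyTorch

src = PyTorch(
    source_directory=r'.\fastai',   # project files copied to the execution target
    compute_target='amlcompute',    # let Azure provision a VM for this job
    vm_size='Standard_NC6',         # GPU-enabled VM size
    entry_script='train.py',        # training script inside source_directory
    use_gpu=True,
    pip_packages=['fastai'])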
This method will create a PyTorch execution environment. Parameters are:
- source_directory: All the files in source_directory will be copied to the execution target (this is usually your project root directory).
- compute_target specifies where you are going to execute this job. The value ‘amlcompute’ signals that we want Azure to provision a VM for this specific job. The machine will be created and, once the job is done, destroyed. Pretty cool feature. Other types are available, including Databricks, HDInsight (Spark), custom VMs, and your local computer.
- vm_size specifies the type of hardware to use. In this case, Standard_NC6 machines come with 6 vCPUs, 56 GiB of RAM, and one NVIDIA Tesla K80 GPU with 12 GiB of GPU memory.
- entry_script specifies the training script you want to execute. This file should be inside source_directory.
- use_gpu specifies that we want GPU-enabled libraries.
- pip_packages allows you to specify which additional packages you need in the execution environment. In this case, since the PyTorch execution environment already has everything PyTorch needs, the only missing package is fast.ai.
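Since entry_script has to live inside source_directory, it can be handy to check that locally before submitting anything to the cloud. Here is a minimal sketch of such a check; the helper name entry_script_in_source_dir is my own invention and is not part of the AML SDK:

```python
from pathlib import Path


def entry_script_in_source_dir(source_directory: str, entry_script: str) -> bool:
    """Return True if entry_script is an existing file inside source_directory.

    Hypothetical local helper, not an AML SDK function.
    """
    root = Path(source_directory).resolve()
    script = (root / entry_script).resolve()
    # The script must exist and must not escape the source directory
    # (for example through a "../" in the relative path).
    return script.is_file() and root in script.parents
```

For the setup in this post you would call something like entry_script_in_source_dir('fastai', 'train.py') from your project root and only create the estimator if it returns True.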
Finally, we create an experiment, indicating the workspace and a name, and we run it:
from azureml.core.run import Run
from azureml.core.experiment import Experiment

experiment = Experiment(workspace=ws, name="azureml-benchmark")
run = experiment.submit(src)
What is happening under the hood is that Azure is preparing a new Docker image for executing PyTorch code with GPU support, copying all the assets we need, installing all the packages we specified, creating a VM, and deploying the image in the VM. Finally, the script is executed and, once it is done, the VM is destroyed.
You can check progress in the Azure Portal or in Python by:
run.wait_for_completion(show_output = True)
With show_output = True, you will connect to the output stream, so it will be like having a console connected to the VM.
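If you would rather not block on the output stream, you could also poll the run status yourself. The sketch below assumes the run object exposes get_status(), as the AML Run class does, and that "Completed", "Failed" and "Canceled" are the terminal status strings; treat both the status names and the helper poll_until_done as assumptions of mine:

```python
import time

# Assumed terminal status strings; check them against your SDK version.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_done(run, interval_seconds=30.0, max_polls=120):
    """Call run.get_status() until a terminal state or max_polls is reached.

    Returns the last status seen. Hypothetical helper, not part of the SDK.
    """
    status = run.get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(interval_seconds)
        status = run.get_status()
        polls += 1
    return status
```

Calling poll_until_done(run) on the run submitted earlier returns the final status, or the last one seen if the polling budget runs out first.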
What else? Nothing!
That’s all. Isn’t that awesome? We provisioned the complete hardware + software stack with 3 lines of code! The entire script is available on GitHub.
The training process took 7.21 minutes. The cost of a Standard_NC6 VM is $1.56/hour as of January 2nd, 2019. So it costs 0.19 USD to train the model (*)
I am not posting the other cloud providers’ numbers here since I couldn’t verify the prices published in the original post, probably because they are from October 2018 or I’m reading them wrong. You are more than welcome to check them.
However, I doubt this is the correct way to measure the cost of the training process, since the time spent provisioning the VM also incurs some cost. For instance, in my case, it took around 7 minutes to spin up everything, install dependencies, prepare the environment, etc. (the same amount of time as training). How quickly you can plug your code into the provisioned hardware will also impact the cost. Storage, networking, and other resources consumed by the training VM also need to be taken into account for a correct cost estimate. In addition to that, choosing the framework is no longer a trivial task. Some services from cloud providers, including AWS, Azure, and GCP, have optimized hardware acceleration in place when the code is written in, for instance, TensorFlow. You have to be careful about this.
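For the curious, the arithmetic behind that number is simply the training time converted to hours and multiplied by the hourly rate. A tiny sketch (the function name training_cost_usd is just for illustration):

```python
def training_cost_usd(minutes: float, hourly_rate_usd: float) -> float:
    """Cost of keeping a VM busy for `minutes` at `hourly_rate_usd` per hour."""
    return minutes / 60.0 * hourly_rate_usd


# 7.21 minutes on a Standard_NC6 at $1.56/hour (figures from this post).
cost = training_cost_usd(7.21, 1.56)
print(f"{cost:.2f} USD")  # prints "0.19 USD"
```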
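To get a feeling for how much the provisioning time matters, here is a rough sketch that adds it to the bill. It assumes, and this is my simplification, that the roughly 7 minutes of setup are charged at the same $1.56/hour VM rate as training, and it ignores storage and networking entirely:

```python
def billed_cost_usd(training_min: float, provisioning_min: float,
                    hourly_rate_usd: float) -> float:
    """Cost of training plus provisioning time, billed at one flat hourly rate.

    Assumption: setup minutes cost the same as training minutes.
    """
    return (training_min + provisioning_min) / 60.0 * hourly_rate_usd


training_only = billed_cost_usd(7.21, 0.0, 1.56)
with_setup = billed_cost_usd(7.21, 7.0, 1.56)
print(f"training only: {training_only:.2f} USD")  # prints "training only: 0.19 USD"
print(f"with setup:    {with_setup:.2f} USD")     # prints "with setup:    0.37 USD"
```

Under that assumption the bill roughly doubles, which is why comparing providers on training time alone can be misleading.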
Anyway, the main idea of this post was about showing how to use AML to solve the problem. Coding is always fun! And cheaper than shopping!
If you are interested in learning how Azure Machine Learning works, check