Model training in the cloud

Maximizing your GPU dollars — Training a model in Azure Machine Learning Services with 3 lines of code

Learn how to simply train models in Azure Cloud with no-effort, but a couple of dollars

Happy new year fellows!

The contest

We are trying to find out which cloud provider offers the best deal for training a model using a GPU-based compute hardware. I will follow the rules as stated in the original post, however, please read the disclosures at the end for my opinion around the method.

The objective

About AML

Microsoft has a variety of services tailored for Machine Learning and AI, however, the most suitable for this talk is by far AML. It provides a cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. The current version of AML uses a code-first approach with Python, which means that the whole process is managed using this language. It can be executed from a notebook or from the IDE of your choice.

  1. A Python script called run.py, containing all the controller logic to create and submit a Machine Learning experiment in AML. This script is specific to AML.

Prepare your environment

As I stated before, AML uses a code-first approach to create, manage and publish machine learning experiments. So you will need to install some libraries in your environment. Your environment could be your local computer using PyCharm, Spider, VS Code or any other IDE of your choice, or it can be a notebook running in the cloud or locally. You need to install two libraries: Azure and Azure ML SDK.

!pip install azure
!pip install -upgrade azureml-sdk[notebooks,automl]

The run.py script

First, we are going to create a workspace in AML to work with. The workspace is the cloud resource you will use to create, manage, and publish machine learning experiments. To create a workspace you need the subscription ID of the subscription you are going to use, a name for the workspace and a location to deploy the resource. The location parameter is important since it will define which compute hardware will be available for your training job. I’m using East of US.

import azureml.core
from azureml.core import Workspace
ws = Workspace.create(
name = "aa-ml-aml-workspace",
subscription_id = "1a2b3c4d-5a7b-5a7b-9a0b1c2d3e5f6g6>"
resource_group = "AdvanceAnalytics.ML",
location = 'eastus',
exist_ok = True)
from azureml.train.dnn import PyTorchsrc = PyTorch(
source_directory = r'.\fastai',
compute_target='amlcompute',
vm_size='Standard_NC6',
entry_script = 'train.py',
use_gpu = True,
pip_packages = ['fastai'])
  • compute_target specified where are you going to execute this job. The value ‘amlcompute’ signals we want Azure to provision a VM for this specific job. The machine will be created and once the job is done it will be destroyed. Pretty cool feature. Other types are available including (Databricks, HDInsight (Spark), custom VMs, local computer)
  • vm_size specified the type of hardware to use. In this case, Standard_NC6 are powered by NVIDIA Tesla K80 with 8 GiB, 6 vCPU, and 56 GiB of RAM.
  • entry_script specified which is the training script you want to execute. This file should be inside of source_directory.
  • use_gpu specifies that we want GPU-enabled libraries.
  • pip_packages allows you to specify which additional packages you need in the execution environment. In this case, since PyTorch execution environment has everything that is needed for PyTorch, the only package that is missing is fast.ai.
from azureml.core.run import Run
from azureml.core.experiment import Experiment
experiment = Experiment(workspace=ws, name="azureml-benchmark")
run = experiment.submit(src)
run.wait_for_completion(show_output = True)

What else? Nothing!

That’s all. Isn’t that awesome? We provisioned the complete hardware + software stack with 3 lines of code! The entire script is available on GitHub

The results

The training process took 7.21 minutes. The cost per hour of a Standard_NC6 VM is of $1.56/hour as January 2nd, 2019. So it costs 0.19 USD to train the model (*)

(*) Disclosure

I am not posting here the other cloud providers numbers since I couldn’t verify the prices published in the original post. Probably because they are from October 2018 or I’m reading them wrong. You are more than welcome to check them.

Solution Architect at the Office of CTO @ Microsoft. Machine Learning and Advanced Analytics. Sensemaking by engaging first hand. Frustrated sociologist.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store