The Holy Bible of Azure Machine Learning Service. A walk-through for the believer (Part 1)

A week ago I wrote a blog post taking on a challenge: training a deep learning model in the cloud using a GPU. I showed how to achieve that with Azure Machine Learning Services (AML from now on) using just 3 lines of code plus some imports. However, figuring out which 3 lines those are is not an easy task.

If you are an AML believer, get in line for your miracle, because you are about to discover how this piece of technology works and write your own 3 lines of code. The post is a bit long, but I promise to talk about every single detail.


Azure Machine Learning Services (AML) provides a cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. I’m gonna cover three main topics in a sequence of posts:

  • Part 1: Train the trainer in the cloud using AML. I will show you here how to create a run.py script to start training your models in the cloud with a couple of lines of code and without changing a single line of your training routine. ← this post!
  • Part 2: Making your trainer smarter for AML. I will show you here how you can modify your training routine script with some AML-specific lines of code to allow debugging, tracking, monitoring and managing outputs. If you are familiar with MLflow, then you will find these capabilities a bit similar. [Link]
  • Part 3: Promoting your model to production. I will show you how you can productionize your model using AML, making it ready to be consumed by end users through web services. [Link]

Before you start: The current version of AML uses a code-first approach with Python, which means that the whole process is managed using this language. It can be executed from a notebook or from the IDE of your choice. Grab the Python IDE of your preference and your Azure subscription ID and keep reading. If you don’t have an Azure Subscription yet, you can grab a free one here with 200 USD of credits.

To install AML libraries you have to:

pip install azureml-sdk[notebooks]

[January 2019] If you face an error like Cannot uninstall ‘PyYAML’, there is a workaround:

pip install azure-cli-core==2.0.54
pip install azureml-sdk[notebooks] --ignore-installed

To warm-up, let’s build some vocabulary about all the pieces inside AML:

A workspace is the cloud resource you will use to create, manage, and publish machine learning experiments when dealing with AML. It also provides a way to share resources and collaborate within teams in your organization. For instance, each Data Science team can use a specific workspace to share all the compute resources allocated to it across the projects they work on. A workspace can be created from the Portal (pretty straightforward), from the CLI or even from Python:

import azureml.core
from azureml.core.workspace import Workspace

ws = Workspace.create(name=workspace_name,
                      subscription_id=subscription_id,
                      resource_group=resource_group,
                      location=workspace_region,
                      exist_ok=True)

Probably the most important parameter you have to specify here is location since it will define which compute hardware will be available for your training jobs. Check the link to see which VMs are available on which zone.

About the workspace name: There seems to be a not-well-documented restriction on its length. It can’t be longer than 26 characters. If you exceed that, your workspace will be created correctly, but you may face some weird errors when submitting training jobs. I’m tracking this down with the product group at Microsoft.
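A tiny guard along these lines can catch the problem before you call Workspace.create. This is just an illustrative helper, and the 26-character limit is the one observed above, not a documented API constraint:

```python
def validate_workspace_name(name, max_len=26):
    """Warn early about workspace names that exceed the (undocumented)
    ~26-character limit that can cause job-submission errors later."""
    if len(name) > max_len:
        raise ValueError(
            f"Workspace name '{name}' is {len(name)} characters long; "
            f"names longer than {max_len} characters may cause weird "
            f"errors when submitting training jobs.")
    return name
```

You would run it on the name right before passing it to Workspace.create.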

An experiment is a logical container for your proposed solution to the problem you want to model. A workspace can have multiple experiments running at the same time. Experiments don’t just work as containers for your solution; they also allow you to track your progress around how well your solution is doing. Such progress is tracked using metrics you can define. If your problem is a classification problem, you will probably want to track the Accuracy or the MAP your model is getting. Each experiment can have multiple metrics being tracked.

Your experiment will be associated with a folder on your local computer. Such a folder contains all the resources (code files, assets, data, etc) you need to solve the problem. The folder will typically be associated with a code repository. This is not required, but it will allow you to collaborate among different Data Scientists in the same experiment. The repository can be hosted in any service, from GitHub to Azure DevOps.

You create an experiment using Python by simply indicating the name of the experiment and the Workspace associated with it.

from azureml.core import Experiment
experiment = Experiment(workspace=ws, name=experiment_name)

You can check all the experiments in the workspace in the Azure Portal too, but you can’t create a new one there.

Inside an experiment, you have Runs. A Run is a particular instance of the experiment. Each time you submit your experiment to Azure and execute it, it will create a Run. You will typically collect metrics across different runs, for instance, the accuracy the model is getting, in order to compare them. This is how you track your model’s progress and manage how it evolves. The run can also generate outputs. Typically, one of the outputs will be the model itself (a file).

You create a run for your experiment by executing the submit method of the experiment object.

run = experiment.submit(script_run_configuration)
run.wait_for_completion(show_output = True)

As you can see, you have to specify a configuration. We are going to see it in a minute. Once a run is submitted, the training process for your experiment will start. The method is asynchronous, meaning it will not block until the run is done. You will typically want to wait for it, and wait_for_completion does that for you. show_output=True indicates that you want to see the output of the process in your console. The output is a live stream, so you can see exactly what’s going on. Kind of cool!

You can also check Runs’ progress in the Azure Portal by opening the experiment you are using:

In turn, each Run might have multiple iterations (aka child runs). This is particularly relevant for Azure Automated ML, a technology for automating hyper-parameter tuning and model selection (if you are curious about this, I have a post on this topic comparing Google AutoML, AutoKeras, Azure Automated ML and auto-sklearn). Each iteration is a “try” in the tuning of the model. Let’s not focus on this now; I promise to cover it in more detail in another post.

What, where and how to execute the training

Do you remember that for submitting a Run you have to specify a configuration parameter? Ok, say hello to the Script Run Configuration. This object will instruct the experiment what to execute, where and how.

However, dealing with Script Run Configurations is a bit difficult. Lucky for me and you, there is an easier, more flexible, yet powerful abstraction for them. So say goodbye to the Script Run Configuration and hello to the Estimator. The Estimator is an abstraction that allows you to build a Script Run Configuration from high-level specifications.

You create an Estimator as follows:

from azureml.train.dnn import Estimator

exec_environment = Estimator(source_directory=working_dir,
                             entry_script='train.py',
                             script_params=script_params,
                             compute_target=compute_target,
                             pip_packages=['scikit-learn'],
                             conda_packages=['scikit-learn'])

Let’s see its parameters:

The Estimator has to tell AML what work to do. You instruct that by specifying two parameters (or three): source_directory indicates the folder where all your assets are located locally in your environment. You have to make sure that everything you need is reachable from this directory; every single script, asset or whatever should be here. Why? Because this folder will be copied to the compute target you use for training. To picture it, let’s say that you want to train your model in a Docker container in a VM in the cloud. This folder will be copied inside the Docker container.

Of course, inside this directory, you may have a lot of Python files. That’s why you also have to specify the entry_script parameter. This is the source_directory-relative path for the script that is gonna be called when the training job starts. If your script needs some parameters, just use the script_params argument which is just a simple dictionary.

script_params = {
    '--regularization': 0.8
}
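On the training side, each key in script_params arrives as a regular command-line argument, so your entry script can read it with argparse. A minimal sketch of what a train.py could do with the dictionary above (the default value here is illustrative, not something AML defines):

```python
import argparse

def parse_training_args(argv=None):
    """Parse the command-line arguments that AML forwards to the
    entry script from the script_params dictionary."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, default=0.5,
                        help='regularization rate for the model')
    return parser.parse_args(argv)

# Inside train.py you would simply call:
# args = parse_training_args()
# and use args.regularization in your training routine.
```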

Important: The size of the directory. This folder can be no bigger than 300 MB as of January 2019. The reason is that bigger folders take time to copy and delay the execution. You will get an error if you try something bigger. You may wonder… what about my data set! Easy, you have a couple of options: one is to download your data from the internet as part of your training script. Another option is to use the attached blob storage that comes with your workspace. Each AML workspace has one. You can upload the data you need to this storage and then access it from your training script. I’ll show you how in Part 2.
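Since an oversized source_directory only fails once you submit, you may want to check it locally first. A minimal sketch using only the standard library (the 300 MB limit is the one mentioned above):

```python
import os

def folder_size_mb(path):
    """Return the total size of a directory tree in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

def check_source_directory(path, limit_mb=300):
    """Fail fast if the folder exceeds the AML snapshot size limit."""
    size = folder_size_mb(path)
    if size > limit_mb:
        raise ValueError(
            f"{path} is {size:.1f} MB, over the {limit_mb} MB limit; "
            f"move large data sets out of the source directory.")
    return size
```

Run check_source_directory(working_dir) before building the Estimator to catch the problem early.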

The compute hardware where you run your job is called a Compute Target. You control where to run the estimator using the compute_target parameter. There are 3 categories where you can run your training job:

  • Local: Your local machine, or a cloud-based virtual machine (VM) that you use as a development and experimentation environment. Why would you want to use AML with your local environment? Does it make any sense? Yes, actually, for two reasons: first, you might want to try your code on your local machine to see if it works. Second, you might not be interested in training in the cloud, but you still want to use the cloud to track your training process, collect metrics, logs, etc. If you have heard about MLflow, you can think of AML in the same way.
  • Managed compute: This is a compute resource that is managed by AML. You have two types of Managed Compute: Run-based or Persistent. Run-based is hardware that lasts just for the time the Run is executing. You submit a job, the hardware is provisioned and configured, the job runs, and finally the hardware is deallocated. End of story. You only get charged for the time the hardware was up. Persistent, on the other hand, means that you want compute hardware dedicated to your workspace for all the runs. This allows you to control cost more effectively because you can set how much of this hardware can be used and how much it can scale. A Persistent target is typically a cluster of machines and allows you to specify the min and max number of nodes you want to provision, how they should scale up, when they have to scale down and so on. Scaling happens automatically for you.
  • Attached compute: You can also bring your own Azure cloud compute and attach it to Azure Machine Learning. This can be a DataBricks cluster, an HDInsight cluster or a Docker VM.

Possible values for this are:

  • local: The default value. It indicates that it will run in your own machine.
  • amlcompute: It indicates that you want to use AML managed compute.
  • remote: It indicates you want to use a remote compute resource managed outside of AML.

Example: Run-based amlcompute

exec_environment = Estimator(source_directory=working_dir,
                             entry_script='train.py',
                             script_params=script_params,
                             compute_target='amlcompute',
                             vm_size='Standard_NC6',
                             pip_packages=['scikit-learn'],
                             conda_packages=['scikit-learn'])

Example: Persistent amlcompute

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.dnn import Estimator

compute_name = 'amlcompute_gpu'

# Check if the compute target already exists. If not, create it
if compute_name in ws.compute_targets:
    my_compute_target = ws.compute_targets[compute_name]
else:
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC6',
        min_nodes=1,
        max_nodes=3)
    my_compute_target = ComputeTarget.create(ws,
                                             compute_name,
                                             provisioning_config)

exec_environment = Estimator(source_directory=working_dir,
                             entry_script='train.py',
                             script_params=script_params,
                             compute_target=my_compute_target,
                             pip_packages=['scikit-learn'],
                             conda_packages=['scikit-learn'])

Tip: Even though Persistent Compute Targets can be created from Python, you will typically provision this from Azure Portal. The administrator of the workspace will have the permissions to configure the compute hardware for all the users in the workspace and then each user will just run their experiments in the provisioned compute target. This allows you to control cost more effectively.

Example: Adding a Managed Compute Target in Azure Machine Learning using the East US data center, with a Standard NC6 machine (GPU), low priority (cheaper machines, but availability is not guaranteed) and scaling up to 3 nodes. Scale-down happens automatically after 120 minutes of no use.

Example: Remote VM

from azureml.core.compute import RemoteCompute, ComputeTarget
from azureml.train.dnn import Estimator

attach_config = RemoteCompute.attach_configuration(address='0.0.0.0',
                                                   ssh_port=22,
                                                   username='<username>',
                                                   password='<password>')
attach_compute = ComputeTarget.attach(ws, 'attachvm', attach_config)

exec_environment = Estimator(source_directory=working_dir,
                             entry_script='train.py',
                             script_params=script_params,
                             compute_target=attach_compute,
                             pip_packages=['scikit-learn'],
                             conda_packages=['scikit-learn'])

Limitations: AML only supports virtual machines that run Ubuntu. You can use a system-built conda environment, an existing Python environment, or a Docker container. When you execute by using a Docker container, you need to have Docker Engine running on the VM. Your VM or docker image should have conda installed. Keep reading to find out why.

Tip: As with Persistent Compute Targets, Remote Targets will typically be registered from the Azure Portal.

Adding a Remote compute target from the Azure Portal.

Other compute targets, such as the Databricks and HDInsight clusters mentioned above, are also supported.

Finally, the Estimator allows us to specify how our code is going to be executed. A couple of parameters will allow us to do that:

  • use_gpu: Default False. Indicates if the hardware supports GPU. If true, the image deployed in the virtual machine will have all the drivers and distributions to support GPUs.
  • use_docker: Default True. Indicates if the job will be submitted as a Docker image into the compute target.
  • custom_docker_base_image: Lets you specify a custom Docker base image for the container. For instance: “microsoft/cntk”. All images must have conda installed, or the job will fail.
  • pip_packages: Allows you to specify which extra Python packages you want AML to install using pip in the target compute. If you use a custom docker image, then you will probably have everything you need there, but if you are using the default docker image, some packages may be missing.
  • conda_packages: The same as pip_packages, but for the conda environment. This is the reason why you have to have conda installed in your image (even if you don’t use it).

If you are working with PyTorch or TensorFlow, I have good news for you. We have preconfigured Estimators ready to be used. They work in the same way as the Estimator class, but all the libraries required for TensorFlow or PyTorch are already there.

from azureml.train.dnn import TensorFlow

exec_env = TensorFlow(source_directory=project_folder,
                      compute_target=compute_target,
                      script_params=script_params,
                      entry_script='train.py',
                      use_gpu=True)

If you are dealing with really big deep learning models, you may have heard about Distributed Training. For instance, TensorFlow with Horovod allows you to achieve that. We support such scenarios in AML. The execution environments have a parameter called distributed_backend where you can specify how each node of the training cluster will communicate to achieve distributed training. Of course, your compute target should be a cluster type. When node_count is specified, AML will create clusters instead of single VMs.

estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='tf_horovod_word2vec.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_backend='mpi',
                       use_gpu=True)

For a complete example of this, see the following Jupyter notebook from the Azure Machine Learning team where they implement distributed training with Horovod and TensorFlow.

Putting all the pieces together

Ok, you are now familiar with all the pieces you need. Let’s see a complete example. This is the scenario: You have a train.py script that uses PyTorch to create a deep learning model for image classification using Transfer Learning with ResNet. The model solves the Cat vs Dogs Kaggle problem. Your model uses the API from fast.ai since it provides a high-level abstraction. You want to use a GPU to speed up the training. You create a run.py script that will look like this:

from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.train.dnn import PyTorch

ws = Workspace.create(name='aa-ml-aml-workspace',
                      subscription_id='1234-12345678-12345678-12',
                      resource_group='AA.Aml.Experiments',
                      location='eastus',
                      exist_ok=True)

src = PyTorch(source_directory=r'.\classification-with-resnet',
              compute_target='amlcompute',
              vm_size='Standard_NC6',
              entry_script='train.py',
              use_gpu=True,
              pip_packages=['fastai'])

experiment = Experiment(workspace=ws, name='classify-with-resnet')
run = experiment.submit(src)
run.wait_for_completion(show_output=True)

That’s all! You can track progress in the console as we said before:

You can find the repository where this is implemented in my GitHub repository.

Conclusions

As you can see, AML is a really powerful service for managing the whole life cycle of the training process. In the next post, I will show you how you can modify your training script to let AML track performance, capture the model output and enable debugging. Stay tuned!

Solution Architect at the Office of CTO @ Microsoft. Machine Learning and Advanced Analytics. Sensemaking by engaging first hand. Frustrated sociologist.