Train a TensorFlow model

The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

2020-09-16

This tutorial demonstrates how run a TensorFlow job at scale using Azure ML. You will train a TensorFlow model to classify handwritten digits (MNIST) using a deep neural network (DNN) and log your results to the Azure ML service.

Prerequisites

If you don’t have access to an Azure ML workspace, follow the setup tutorial to configure and create a workspace.

Set up development environment

The setup for your development work in this tutorial includes the following actions:

Import required packages
Connect to a workspace
Create an experiment to track your runs
Create a remote compute target to use for training

Import azuremlsdk package

library(azuremlsdk)

Load your workspace

Instantiate a workspace object from your existing workspace. The following code will load the workspace details from a config.json file if you previously wrote one out with write_workspace_config().

ws <- load_workspace_from_config()

Or, you can retrieve a workspace by directly specifying your workspace details:

ws <- get_workspace("<your workspace name>", "<your subscription ID>", "<your resource group>")

Create an experiment

An Azure ML experiment tracks a grouping of runs, typically from the same training script. Create an experiment to track the runs for training the TensorFlow model on the MNIST data.

exp <- experiment(workspace = ws, name = "tf-mnist")

If you would like to track your runs in an existing experiment, simply specify that experiment’s name to the name parameter of experiment().

Create a compute target

By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn’t already exist in your workspace.

You may need to wait a few minutes for your compute cluster to be provisioned if it doesn’t already exist.

cluster_name <- "gpucluster"

compute_target <- get_compute(ws, cluster_name = cluster_name)
if (is.null(compute_target))
{
  vm_size <- "STANDARD_NC6"
  compute_target <- create_aml_compute(workspace = ws, 
                                       cluster_name = cluster_name,
                                       vm_size = vm_size, 
                                       max_nodes = 4)
  
  wait_for_provisioning_completion(compute_target, show_output = TRUE)
}

Prepare the training script

A training script called tf_mnist.R has been provided for you in the train-with-tensorflow/ subfolder of this vignette. The Azure ML SDK provides a set of logging APIs for logging various metrics during training runs. These metrics are recorded and persisted in the experiment run record, and can be be accessed at any time or viewed in the run details page in Azure Machine Learning studio.

In order to collect and upload run metrics, you need to do the following inside the training script:

Import the azuremlsdk package

library(azuremlsdk)

Add the log_metric_to_run() function to track our primary metric, “accuracy”, for this experiment. If you have your own training script with several important metrics, simply create a logging call for each one within the script.

log_metric_to_run("accuracy",
                  sess$run(accuracy,
                  feed_dict = dict(x = mnist$test$images, y_ = mnist$test$labels)))

See the reference for the full set of logging methods log_*() available from the R SDK.

Create an estimator

An Azure ML estimator encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs are run as containerized jobs on the specified compute target.

To create the estimator, define the following:

The directory that contains your scripts needed for training (source_directory). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
The training script that will be executed (entry_script).
The compute target (compute_target), in this case the AmlCompute cluster you created earlier.
Any environment dependencies required for training. For full control over your training environment (instead of using the defaults), you can create a custom Docker image to use for your remote run, which is what we’ve done in this example. The Docker image includes the necessary packages for TensorFlow GPU training. The Dockerfile used to build the image is included in the train-with-tensorflow/ folder for reference. See the r_environment() reference for the full set of configurable options. Pass the environment object to the environment parameter in estimator.

env <- r_environment("tensorflow-env", custom_docker_image = "amlsamples/r-tensorflow:latest")

est <- estimator(source_directory = "train-with-tensorflow",
                 entry_script = "tf_mnist.R",
                 compute_target = compute_target,
                 environment = env)

Submit the job

Finally submit the job to run on your cluster. submit_experiment() returns a Run object that you can then use to interface with the run.

run <- submit_experiment(exp, est)

You can view the run’s details as a table. Clicking the “Web View” link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.

plot_run_details(run)

Model training happens in the background. Wait until the model has finished training before you run more code.

wait_for_run_completion(run, show_output = TRUE)

View run metrics

Once your job has finished, you can view the metrics collected during your TensorFlow run.

metrics <- get_run_metrics(run)
metrics

Clean up resources

Delete the resources once you no longer need them. Don’t delete any resource you plan to still use.

Delete the compute cluster:

delete_compute(compute_target)

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.