
Distributed training with PyTorch and Azure ML

By Beatriz Stollnitz, Principal Cloud Advocate at Microsoft

Overview of distributed training

When a model or dataset is too large to train efficiently on a single GPU, we can speed up training by distributing the work across several processes, each running on its own GPU, and potentially across several machines. In this post we use PyTorch's DistributedDataParallel (DDP) approach, which keeps a copy of the model on every process and splits the data among them, and we run the job on a multi-instance GPU cluster in Azure ML.

Adding distributed training to Azure ML code

On the Azure ML side, three pieces of the job configuration matter for distributed training: a compute cluster with more than one GPU instance, an environment whose base image includes CUDA and Open MPI support, and a command job that specifies the instance count and the PyTorch distribution settings:

    # Compute cluster: multiple GPU instances for the distributed job.
    cluster = AmlCompute(
        ...
        type="amlcompute",
        ...
    )
    # Environment based on a curated image that includes Open MPI, CUDA, and cuDNN.
    environment = Environment(image="mcr.microsoft.com/azureml/" +
                              "openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04:latest",
                              conda_file=CONDA_PATH)
    # Command job: run on 2 instances, with 4 processes (typically one per GPU)
    # on each, using PyTorch's native distributed launcher.
    job = command(
        ...
        resources=dict(instance_count=2),
        distribution=dict(type="PyTorch", process_count_per_instance=4),
        ...
    )
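
Not shown in the snippets above is the step that actually submits the job. Below is a minimal sketch, assuming an authenticated azure.ai.ml.MLClient for the target workspace; the client setup is an assumption for illustration, not code from the original post:

    # Hedged sketch: provision the cluster and submit the distributed job.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())
    ml_client.begin_create_or_update(cluster)       # create or update the GPU cluster
    returned_job = ml_client.create_or_update(job)  # submit the command job
    print(returned_job.studio_url)                  # URL to monitor the run in the studio
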
When the job runs, Azure ML sets the following environment variables on each instance, which our training code will rely on later:
  • WORLD_SIZE — The total number of processes across all instances.
  • NODE_RANK — The index of the current instance. The first instance has NODE_RANK zero.
  • MASTER_ADDR — The IP address of the first instance.
  • MASTER_PORT — An available port on the first instance.
  • LOCAL_RANK — The index of the current process within its instance.
  • RANK — The global index of the current process (among all processes on all instances); see the sketch after this list for how these variables relate to one another.
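
As a rough illustration (this sketch is not from the original article), here is how the variables fit together for the job configured above, assuming ranks are assigned in the usual node-major order:

    import os

    # Sketch only: instance_count=2 and process_count_per_instance=4 as above.
    world_size = int(os.environ["WORLD_SIZE"])   # 8 processes in total (2 x 4)
    node_rank = int(os.environ["NODE_RANK"])     # 0 or 1
    local_rank = int(os.environ["LOCAL_RANK"])   # 0..3 within this instance
    rank = int(os.environ["RANK"])               # 0..7 across all instances
    # Assumed rank layout: processes on instance 0 come first, then instance 1.
    assert rank == node_rank * 4 + local_rank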

Adding distributed training to PyTorch code

To make the training script itself distributed, we first initialize the torch.distributed package by calling init_process_group (shown below), which needs two pieces of information:

  • The backend, which determines how the processes communicate with each other. The available backends are “gloo,” “mpi,” and “nccl.” We choose “nccl” because we want distributed GPU training.
  • The initialization method, which determines how to obtain the information needed during training. This information can come from TCP, a shared file system, or environment variables. We choose environment variable initialization, so that PyTorch looks for the environment variables that Azure ML sets automatically.
    # Initialize the process group using the environment variables set by Azure ML.
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    rank = int(os.environ["RANK"])              # global index of this process
    local_rank = int(os.environ["LOCAL_RANK"])  # index of this process on its instance
    device = torch.device("cuda", local_rank)   # bind this process to one GPU
    from torch import nn
    ...
    # Wrap the model in DistributedDataParallel so gradients are synchronized
    # across all processes during the backward pass.
    model = nn.parallel.DistributedDataParallel(
        module=NeuralNetwork().to(device), device_ids=[local_rank])
    # Save the model from a single process only, so that multiple processes
    # don't write the same file at once.
    if rank == 0:
        save_model(model_dir, model)
    # DistributedSampler gives each process a distinct shard of the training data.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
    train_loader = DataLoader(train_data,
                              batch_size=batch_size,
                              sampler=train_sampler)
    for epoch in range(epochs):
        ...
        # Tell the sampler which epoch this is (before iterating train_loader for
        # the epoch) so the data is reshuffled differently every epoch.
        train_sampler.set_epoch(epoch)
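
For context, here is a minimal sketch of what one training step inside that loop might look like; loss_fn and optimizer are illustrative names, not code from the original post. The key point is that DDP averages the gradients across all processes during loss.backward(), so every process applies the same update and the model replicas stay in sync:

    # Hedged sketch of the inner training loop; loss_fn and optimizer are assumptions.
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        loss = loss_fn(model(x), y)   # forward pass on this process's shard of the data
        optimizer.zero_grad()
        loss.backward()               # DDP all-reduces (averages) gradients across processes
        optimizer.step()              # every process applies the same update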

Additional Resources:

Train compute-intensive models with Azure Machine Learning – Training | Microsoft Learn

Part 1: Training and Deploying Your PyTorch Model in the Cloud with Azure ML

Part 2: Training Your PyTorch Model Using Components and Pipelines in Azure ML

Part 3: Faster Training and Inference Using the Azure Container for PyTorch in Azure ML

Article originally posted here. Reposted with permission.
