Wednesday, January 29, 2025

Distributed training with PyTorch and Azure ML

By Beatriz Stollnitz, Principal Cloud Advocate at Microsoft

Overview of distributed training

Adding distributed training to Azure ML code

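    from azure.ai.ml import command
    from azure.ai.ml.entities import AmlCompute, Environment

    # Compute cluster that runs the training job.
    cluster = AmlCompute(
        ...
        type="amlcompute",
        ...
    )

    # Environment based on a Docker image with OpenMPI and CUDA.
    environment = Environment(image="mcr.microsoft.com/azureml/" +
                              "openmpi4.1.0-cuda11.1-cudnn8-ubuntu20.04:latest",
                              conda_file=CONDA_PATH)

    # Run 4 processes on each of 2 instances, for 8 processes in total.
    job = command(
        ...
        resources=dict(instance_count=2),
        distribution=dict(type="PyTorch", process_count_per_instance=4),
        ...
    )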

Azure ML sets the following environment variables for each process:

  • WORLD_SIZE: The total number of processes across all instances.
  • NODE_RANK: The index of the current instance. The first instance has NODE_RANK zero.
  • MASTER_ADDR: The IP address of the first instance.
  • MASTER_PORT: An available port on the first instance.
  • LOCAL_RANK: The index of the current process within its instance.
  • RANK: The global index of the current process (among all processes on all instances).
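
To make the relationship between these variables concrete, here is a minimal sketch that lists the values for a job with 2 instances and 4 processes per instance, as configured in the job definition. It assumes that global ranks are numbered instance by instance, which is the usual convention but is not stated in this article, and it treats WORLD_SIZE as the total number of processes. The helper name `process_layout` is hypothetical.

```python
from typing import Dict, List


def process_layout(instance_count: int,
                   process_count_per_instance: int) -> List[Dict[str, int]]:
    """List the rank-related variables for every process in a job.

    Assumption: global ranks are numbered instance by instance, so
    RANK = NODE_RANK * process_count_per_instance + LOCAL_RANK.
    """
    world_size = instance_count * process_count_per_instance
    layout = []
    for node_rank in range(instance_count):
        for local_rank in range(process_count_per_instance):
            layout.append({
                "WORLD_SIZE": world_size,
                "NODE_RANK": node_rank,
                "LOCAL_RANK": local_rank,
                "RANK": node_rank * process_count_per_instance + local_rank,
            })
    return layout


if __name__ == "__main__":
    for variables in process_layout(instance_count=2,
                                    process_count_per_instance=4):
        print(variables)
```

Under that assumption, the second process on the second instance has NODE_RANK 1, LOCAL_RANK 1, and RANK 5.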

Adding distributed training to PyTorch code

To initialize the process group, we specify two things:

  • The backend, which determines how the processes communicate with each other. The available backends are “gloo,” “mpi,” and “nccl.” We choose “nccl” because we want distributed GPU training.
  • The initialization method, which determines how the processes obtain the information they need to find each other. This information can be provided over TCP, through a shared file system, or through environment variables. We’ll choose environment variable initialization, so that PyTorch looks for the environment variables that Azure ML sets automatically.
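
With environment variable initialization, the process group needs `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` to be set. As a rough sketch, a training script could check for them up front and fail with a clear message when it is launched outside a distributed job. The helper name `read_distributed_env` is hypothetical and is not part of PyTorch or Azure ML, and the values in the example are made up.

```python
import os
from typing import Dict, Mapping

# Variables read by "env://" initialization, plus LOCAL_RANK, which the
# training code uses to pick a GPU.
REQUIRED_VARIABLES = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE",
                      "RANK", "LOCAL_RANK")


def read_distributed_env(
        environ: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Return the distributed settings, or raise if any variable is missing."""
    missing = [name for name in REQUIRED_VARIABLES if name not in environ]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing))
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "rank": int(environ["RANK"]),
        "local_rank": int(environ["LOCAL_RANK"]),
    }


if __name__ == "__main__":
    # Made-up values for illustration; Azure ML sets the real ones.
    example = {"MASTER_ADDR": "10.0.0.4", "MASTER_PORT": "6105",
               "WORLD_SIZE": "8", "RANK": "5", "LOCAL_RANK": "1"}
    print(read_distributed_env(example))
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to try on a local machine.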
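    import os

    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    # Join the process group, reading the settings from environment variables.
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    # Each process uses the GPU that matches its index within the instance.
    device = torch.device("cuda", local_rank)
    ...
    # Wrap the model so that gradients are synchronized across processes.
    model = nn.parallel.DistributedDataParallel(
        module=NeuralNetwork().to(device), device_ids=[local_rank])

    # Save the model from a single process only.
    if rank == 0:
        save_model(model_dir, model)

    # Give each process its own subset of the training data.
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
    train_loader = DataLoader(train_data,
                              batch_size=batch_size,
                              sampler=train_sampler)

    for epoch in range(epochs):
        ...
        # Shuffle the data differently in each epoch.
        train_sampler.set_epoch(epoch)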
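
To see why the sampler matters, the following sketch imitates in plain Python how a distributed sampler can split a dataset: shuffle the indices with a seed derived from the epoch, pad them so they divide evenly, and give each rank every `world_size`-th index. This is a simplified illustration, not PyTorch's implementation. The function name `shard_indices` and the use of Python's `random` module are assumptions made for the example.

```python
import random
from typing import List


def shard_indices(dataset_size: int, world_size: int, rank: int,
                  epoch: int, seed: int = 0) -> List[int]:
    """Return the dataset indices one process would train on in one epoch."""
    indices = list(range(dataset_size))
    # Every process uses the same seed, so all of them build the same
    # shuffled order; a different epoch gives a different seed.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating indices from the start so every rank gets the
    # same number of samples.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total_size = per_rank * world_size
    while len(indices) < total_size:
        indices += indices[:total_size - len(indices)]
    # Each rank takes a strided slice of the shared order.
    return indices[rank:total_size:world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_indices(dataset_size=10, world_size=4,
                                  rank=rank, epoch=0))
```

With 10 samples and 4 processes, each process receives 3 indices, and together the shards cover the whole dataset, with 2 indices repeated as padding. Passing the epoch into the seed mirrors the role that `set_epoch` plays in the training loop.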

Additional Resources:

Train compute-intensive models with Azure Machine Learning – Training | Microsoft Learning Path

Train compute-intensive models with Azure Machine Learning – Training | Microsoft Learn

Part 1: Training and Deploying Your PyTorch Model in the Cloud with Azure ML

Part 2: Training Your PyTorch Model Using Components and Pipelines in Azure ML

Part 3: Faster Training and Inference Using the Azure Container for PyTorch in Azure ML

Article originally posted here. Reposted with permission.
