Documentation for PyTorch DDP configuration across multiple nodes to address a time-out issue.
The description below is from Aleksei Rutkovskii, who demonstrates how he resolved the issue for multi-node training with the PyTorch DDP module, specifically on SuperPOD.
The problem was resolved by configuring the network interface used by the PyTorch distributed backend.
For the Gloo backend:
export GLOO_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
For the NCCL backend:
export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
It could be useful to have this information on the PyTorch training documentation page.
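These commands select whichever interface carries the default route. As a minimal sketch (not from Aleksei's report; the DEFAULT_IF variable name and the explicit ibs2 override are illustrative assumptions), the detection and an override for a faster interface can be combined like this:
# Detect the default-route interface (assumes iproute2 and a single IPv4 default route).
DEFAULT_IF=$(ip -o -4 route show to default | awk '{print $5}')
export NCCL_SOCKET_IFNAME=$DEFAULT_IF   # socket interface for the NCCL backend
export GLOO_SOCKET_IFNAME=$DEFAULT_IF   # socket interface for the Gloo backend
# If the default route runs over a slow Ethernet link, point NCCL at the
# Infiniband interface instead (ibs2 is the name observed on the SuperPOD nodes below):
# export NCCL_SOCKET_IFNAME=ibs2
echo "Socket interface: $NCCL_SOCKET_IFNAME"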
--------------------------------------------[Description]---------------------------------------------
- Initial Problem: I observed timeout issues during multi-node DDP training on the gpu_preempt partition. It appeared to be using the en0 Ethernet interface, which led to the problem.
- Interface Check: I ran ip addr show to list all available network interfaces. It showed that:
- en0 (Ethernet) and ibs2 (Infiniband) were UP.
- All other Ethernet (eno*) and Infiniband (ib0 to ib8) interfaces were DOWN.
- Decision: Since ibs2 is a high-speed Infiniband interface, I switched NCCL communication to use ibs2 instead of en0 to reduce latency and avoid timeouts (a quick per-node check is sketched after this list).
- Solution: I applied the following environment variables to configure NCCL for Infiniband communication:
export NCCL_SOCKET_IFNAME=ibs2   # Use ibs2 for NCCL communication
export NCCL_IB_DISABLE=0         # Enable Infiniband support in NCCL
export NCCL_NET_GDR_LEVEL=0      # Adjust for compatibility
export NCCL_P2P_LEVEL=SYS        # Use system-level P2P communication
- Master Address Configuration: I set the MASTER_ADDR to the IP address of ibs2 on the master node:
MASTER_HOST=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
MASTER_IP=$(srun --nodes=1 --ntasks=1 --exclusive -w $MASTER_HOST ip -o -4 addr show ibs2 | awk '{print $4}' | cut -d/ -f1)
export MASTER_ADDR=$MASTER_IP
export MASTER_PORT=12345
- Result: After applying these settings, DDP started working correctly on two gpu_preempt nodes using the ibs2 Infiniband interface.
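Before launching, it can help to confirm that the chosen interface is actually UP and has an IPv4 address on every allocated node. A short sketch (assuming a Slurm allocation and that the Infiniband interface is named ibs2, as above):
# Print the hostname and the ibs2 address on each allocated node.
srun --ntasks-per-node=1 bash -c 'hostname; ip -o -4 addr show ibs2'
# The resolved rendezvous settings can be echoed as well before training starts.
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"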
- Final sbatch script:
#!/bin/bash -l
# SLURM SUBMIT SCRIPT

# Set the partition and constraints
#SBATCH -p gpu-preempt
#SBATCH --constraint=a100-80g

# Set the number of nodes and tasks per node
#SBATCH --nodes=2
#SBATCH --gres=gpu:1              # Number of GPUs per node
#SBATCH --ntasks-per-node=1       # 1 task per node
#SBATCH --cpus-per-task=8         # Number of CPUs per task
#SBATCH --time=0-02:00:00
#SBATCH -o /work/pi_mzink_umass_edu/SPRITE/multi-node-examples/%j.txt   # %j will be replaced with the job ID
# Show the available network interfaces and their state (debugging output in the job log)
ip -o link show | awk -F': ' '{print $2}'
ip addr show ib0
ip link show ib*
ip addr show
# Debugging flags (optional)
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
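# (Sketch, not part of the original script.) If NCCL_DEBUG=INFO is too noisy, the
# output can be narrowed to the initialization and networking subsystems, which is
# where interface selection is reported; the subsystem list here is an assumption.
# export NCCL_DEBUG_SUBSYS=INIT,NET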
# Set the network interface name to Infiniband (ibs2)
export NCCL_SOCKET_IFNAME=ibs2
# (Optional) Adjust NCCL settings for Infiniband communication
export NCCL_IB_DISABLE=0       # Enable Infiniband
export NCCL_NET_GDR_LEVEL=0    # Adjust based on your setup
export NCCL_P2P_LEVEL=SYS      # Use system-level P2P communication

# Set the network interface name manually (recommended for multi-node).
# This is an alternative to the explicit ibs2 setting above:
# export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

# (Optional) If you encounter issues with NCCL, switch to GLOO
# export GLOO_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

# Load the necessary module
module load miniconda/22.11.1-1
# Load your Python environment, for example with conda
conda activate ddp-env-test
# Specify the master address and port for DDP using the Infiniband interface
MASTER_HOST=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
MASTER_IP=$(srun --nodes=1 --ntasks=1 --exclusive -w $MASTER_HOST ip -o -4 addr show ibs2 | awk '{print $4}' | cut -d/ -f1)
export MASTER_ADDR=$MASTER_IP
export MASTER_PORT=12345
# Display MASTER_ADDR for debugging
echo "MASTER_HOST: $MASTER_HOST"
echo "MASTER_IP (ibs2): $MASTER_IP"
echo "MASTER_PORT: $MASTER_PORT"
# Run the training script with DDP configuration
srun python ./imagenet/lightning/train.py fit \
    --trainer.num_nodes=2 \
    --trainer.devices=1 \
    --trainer.accelerator=gpu
# --trainer.sync_batchnorm=true
# --trainer.num_sanity_val_steps=0
# --trainer.logger=true
# --trainer.replace_sampler_ddp=false
# --trainer.reload_dataloaders_every_n_epochs=1
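
Once the script above is saved (for example as train_ddp.sbatch; the filename is an assumption), it can be submitted and the NCCL debug lines in the job log checked to confirm which interface was selected:
sbatch train_ddp.sbatch
# NCCL_DEBUG=INFO makes NCCL report its network setup in the job output
# (the log path comes from the #SBATCH -o directive above; <jobid> is the Slurm job ID).
grep "NCCL INFO" /work/pi_mzink_umass_edu/SPRITE/multi-node-examples/<jobid>.txt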