Documentation for PyTorch DDP configuration across multiple nodes to address a time-out issue.
The description below is from Aleksei Rutkovskii, who demonstrates how he resolved the issue for multi-node training with the PyTorch DDP module, specifically on SuperPOD.
The problem was resolved by configuring the network interface used by the PyTorch distributed backend.
For the Gloo backend:
export GLOO_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
For the NCCL backend:
export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')
It could be useful to have this information on the PyTorch training documentation page.
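These commands select whichever interface carries the default route. As a minimal sketch (not from Aleksei's report; the DEFAULT_IF variable name and the explicit ibs2 override are illustrative assumptions), the detection and an override for a faster interface can be combined like this:
# Detect the default-route interface (assumes iproute2 and a single IPv4 default route).
DEFAULT_IF=$(ip -o -4 route show to default | awk '{print $5}')
export NCCL_SOCKET_IFNAME=$DEFAULT_IF   # socket interface for the NCCL backend
export GLOO_SOCKET_IFNAME=$DEFAULT_IF   # socket interface for the Gloo backend
# If the default route runs over a slow Ethernet link, point NCCL at the
# Infiniband interface instead (ibs2 is the name observed on the SuperPOD nodes below):
# export NCCL_SOCKET_IFNAME=ibs2
echo "Socket interface: $NCCL_SOCKET_IFNAME"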
--------------------------------------------[Description]---------------------------------------------
- Initial Problem: I observed timeout issues during multi-node DDP training on the gpu_preempt partition. It appeared to be using the en0 Ethernet interface, which led to the problem.
- Interface Check: I ran ip addr show to list all available network interfaces. It showed that:
- en0 (Ethernet) and ibs2 (Infiniband) were UP.
- All other Ethernet (eno*) and Infiniband (ib0 to ib8) interfaces were DOWN.
- Decision: Since ibs2 is a high-speed Infiniband interface, I switched NCCL communication to use ibs2 instead of en0 to reduce latency and avoid timeouts (a quick per-node check is sketched after this list).
- Solution: I applied the following environment variables to configure NCCL for Infiniband communication:
export NCCL_SOCKET_IFNAME=ibs2   # Use ibs2 for NCCL communication
export NCCL_IB_DISABLE=0         # Enable Infiniband support in NCCL
export NCCL_NET_GDR_LEVEL=0      # Adjust for compatibility
export NCCL_P2P_LEVEL=SYS        # Use system-level P2P communication
- Master Address Configuration: I set the MASTER_ADDR to the IP address of ibs2 on the master node:
MASTER_HOST=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
MASTER_IP=$(srun --nodes=1 --ntasks=1 --exclusive -w $MASTER_HOST ip -o -4 addr show ibs2 | awk '{print $4}' | cut -d/ -f1)
export MASTER_ADDR=$MASTER_IP
export MASTER_PORT=12345
- Result: After applying these settings, DDP started working correctly on two gpu_preempt nodes using the ibs2 Infiniband interface.
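Before launching, it can help to confirm that the chosen interface is actually UP and has an IPv4 address on every allocated node. A short sketch (assuming a Slurm allocation and that the Infiniband interface is named ibs2, as above):
# Print the hostname and the ibs2 address on each allocated node.
srun --ntasks-per-node=1 bash -c 'hostname; ip -o -4 addr show ibs2'
# The resolved rendezvous settings can be echoed as well before training starts.
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"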
- Final sbatch script:
#!/bin/bash -l
# SLURM SUBMIT SCRIPT

# Set the partition and constraints
#SBATCH -p gpu-preempt
#SBATCH --constraint=a100-80g

# Set the number of nodes and tasks per node
#SBATCH --nodes=2
#SBATCH --gres=gpu:1              # Number of GPUs per node
#SBATCH --ntasks-per-node=1       # 1 task per node
#SBATCH --cpus-per-task=8         # Number of CPUs per task
#SBATCH --time=0-02:00:00
#SBATCH -o /work/pi_mzink_umass_edu/SPRITE/multi-node-examples/%j.txt   # %j will be replaced with the job ID
# Show the available network interfaces and their state (debugging output in the job log)
ip -o link show | awk -F': ' '{print $2}'
ip addr show ib0
ip link show ib*
ip addr show
# Debugging flags (optional)
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
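# (Sketch, not part of the original script.) If NCCL_DEBUG=INFO is too noisy, the
# output can be narrowed to the initialization and networking subsystems, which is
# where interface selection is reported; the subsystem list here is an assumption.
# export NCCL_DEBUG_SUBSYS=INIT,NET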
# Set the network interface name to Infiniband (ibs2)
export NCCL_SOCKET_IFNAME=ibs2
# (Optional) Adjust NCCL settings for Infiniband communication
export NCCL_IB_DISABLE=0       # Enable Infiniband
export NCCL_NET_GDR_LEVEL=0    # Adjust based on your setup
export NCCL_P2P_LEVEL=SYS      # Use system-level P2P communication

# Set the network interface name manually (recommended for multi-node).
# This is an alternative to the explicit ibs2 setting above:
# export NCCL_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

# (Optional) If you encounter issues with NCCL, switch to GLOO
# export GLOO_SOCKET_IFNAME=$(ip -o -4 route show to default | awk '{print $5}')

# Load the necessary module
module load miniconda/22.11.1-1
# Load your Python environment, for example with conda
conda activate ddp-env-test
# Specify the master address and port for DDP using the Infiniband interface
MASTER_HOST=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
MASTER_IP=$(srun --nodes=1 --ntasks=1 --exclusive -w $MASTER_HOST ip -o -4 addr show ibs2 | awk '{print $4}' | cut -d/ -f1)
export MASTER_ADDR=$MASTER_IP
export MASTER_PORT=12345
# Display MASTER_ADDR for debugging
echo "MASTER_HOST: $MASTER_HOST"
echo "MASTER_IP (ibs2): $MASTER_IP"
echo "MASTER_PORT: $MASTER_PORT"
# Run the training script with DDP configuration
srun python ./imagenet/lightning/train.py fit \
    --trainer.num_nodes=2 \
    --trainer.devices=1 \
    --trainer.accelerator=gpu
# --trainer.sync_batchnorm=true
# --trainer.num_sanity_val_steps=0
# --trainer.logger=true
# --trainer.replace_sampler_ddp=false
# --trainer.reload_dataloaders_every_n_epochs=1
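
Once the script above is saved (for example as train_ddp.sbatch; the filename is an assumption), it can be submitted and the NCCL debug lines in the job log checked to confirm which interface was selected:
sbatch train_ddp.sbatch
# NCCL_DEBUG=INFO makes NCCL report its network setup in the job output
# (the log path comes from the #SBATCH -o directive above; <jobid> is the Slurm job ID).
grep "NCCL INFO" /work/pi_mzink_umass_edu/SPRITE/multi-node-examples/<jobid>.txt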