gigl.common.utils.torch_training

Functions

get_distributed_backend

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

Args:
    use_cuda (bool): Whether CUDA is used for training.

Returns:
    Optional[str]: The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise.
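
A minimal sketch of how the returned backend might feed into process-group initialization. The import path mirrors this module; the environment-variable rendezvous (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) is an assumption about the surrounding setup, not something this module prescribes.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import get_distributed_backend

    # Backend is NCCL or GLOO depending on the use_cuda flag; None means
    # distributed training is not enabled for this process.
    backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
    if backend is not None:
        # Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are already
        # set in the environment (e.g. by the Kubeflow PyTorchJob launcher).
        dist.init_process_group(backend=backend)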

get_rank

The rank is set automatically by the Kubeflow PyTorchJob launcher.

Returns:
    int: The index of the current process involved in distributed training.
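
A small sketch of the usual rank-0 guard, assuming get_rank takes no arguments as the summary suggests; the model and checkpoint path are purely illustrative.

    import torch

    from gigl.common.utils.torch_training import get_rank

    model = torch.nn.Linear(4, 2)  # stand-in for a trained model
    # Only rank 0 writes artifacts so N workers do not produce N copies.
    if get_rank() == 0:
        torch.save(model.state_dict(), "/tmp/model.pt")  # illustrative path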

get_world_size

The world size is set automatically by the Kubeflow PyTorchJob launcher.

Returns:
    int: Total number of processes involved in distributed training.
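
A sketch of a common use: sharding inputs across workers with the rank and world size. The file names are hypothetical, and both helpers are assumed to take no arguments.

    from gigl.common.utils.torch_training import get_rank, get_world_size

    # Hypothetical list of input shards; worker i takes every
    # world_size-th file starting at its own rank.
    files = [f"part-{i:05d}.tfrecord" for i in range(100)]
    rank, world_size = get_rank(), get_world_size()
    my_files = files[rank::world_size]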

is_distributed_available_and_initialized

Returns:
    bool: True if torch.distributed is available and the default process group has been initialized, False otherwise.
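
A sketch of guarding collective operations with this check so that single-process debug runs skip them cleanly; the loss tensor is illustrative.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import (
        is_distributed_available_and_initialized,
    )

    loss = torch.tensor(0.123)
    # Average the loss across workers only when a process group exists;
    # in a single-process run the collective call is simply skipped.
    if is_distributed_available_and_initialized():
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= dist.get_world_size()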

is_distributed_local_debug

For local debugging purposes only. Sets the environment variables needed for distributed training on a local machine.

Returns:
    bool: If True, should_distribute exits early and distributed training is enabled.
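
For context only: these are the environment variables that torch.distributed's default env:// rendezvous reads for a single-machine run. The exact variables set (and the flag that toggles local-debug mode) inside is_distributed_local_debug are internal to gigl, so this block is illustrative rather than a description of its implementation.

    import os

    # Illustrative single-machine, single-process "cluster" settings.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")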

should_distribute

Determines whether the process should be configured for distributed training.

Returns:
    bool: True if the process should be configured for distributed training.
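
A sketch of how these helpers might compose at training startup; the call order inside gigl's own trainers is not documented here, so treat this as one plausible arrangement.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import (
        get_distributed_backend,
        should_distribute,
    )

    def maybe_init_distributed() -> None:
        # Initialize the default process group only when the environment
        # (e.g. a Kubeflow PyTorchJob) indicates distributed training.
        if should_distribute() and not dist.is_initialized():
            backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
            dist.init_process_group(backend=backend)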