gigl.common.utils.torch_training

Functions

get_distributed_backend

Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

Args:
    use_cuda (bool): Whether CUDA is used for training.

Returns:
    Optional[str]: The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise.
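
A minimal sketch of how the returned backend might feed into process-group initialization. The import path mirrors this module; the environment-variable rendezvous (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) is an assumption about the surrounding setup, not something this module prescribes.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import get_distributed_backend

    # Backend is NCCL or GLOO depending on the use_cuda flag; None means
    # distributed training is not enabled for this process.
    backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
    if backend is not None:
        # Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are already
        # set in the environment (e.g. by the Kubeflow PyTorchJob launcher).
        dist.init_process_group(backend=backend)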

get_rank

The rank is set automatically by the Kubeflow PyTorchJob launcher.

Returns:
    int: The index of the current process involved in distributed training.
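
A small sketch of the usual rank-0 guard, assuming get_rank takes no arguments as the summary suggests; the model and checkpoint path are purely illustrative.

    import torch

    from gigl.common.utils.torch_training import get_rank

    model = torch.nn.Linear(4, 2)  # stand-in for a trained model
    # Only rank 0 writes artifacts so N workers do not produce N copies.
    if get_rank() == 0:
        torch.save(model.state_dict(), "/tmp/model.pt")  # illustrative path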

get_world_size

The world size is set automatically by the Kubeflow PyTorchJob launcher.

Returns:
    int: Total number of processes involved in distributed training.
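
A sketch of a common use: sharding inputs across workers with the rank and world size. The file names are hypothetical, and both helpers are assumed to take no arguments.

    from gigl.common.utils.torch_training import get_rank, get_world_size

    # Hypothetical list of input shards; worker i takes every
    # world_size-th file starting at its own rank.
    files = [f"part-{i:05d}.tfrecord" for i in range(100)]
    rank, world_size = get_rank(), get_world_size()
    my_files = files[rank::world_size]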

is_distributed_available_and_initialized

Returns:
    bool: True if torch.distributed is available and the default process group has been initialized, False otherwise.
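
A sketch of guarding collective operations with this check so that single-process debug runs skip them cleanly; the loss tensor is illustrative.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import (
        is_distributed_available_and_initialized,
    )

    loss = torch.tensor(0.123)
    # Average the loss across workers only when a process group exists;
    # in a single-process run the collective call is simply skipped.
    if is_distributed_available_and_initialized():
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss /= dist.get_world_size()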

is_distributed_local_debug

For local debugging purposes only. Sets the environment variables needed for distributed training on a local machine.

Returns:
    bool: If True, should_distribute exits early and distributed training is enabled.
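
For context only: these are the environment variables that torch.distributed's default env:// rendezvous reads for a single-machine run. The exact variables set (and the flag that toggles local-debug mode) inside is_distributed_local_debug are internal to gigl, so this block is illustrative rather than a description of its implementation.

    import os

    # Illustrative single-machine, single-process "cluster" settings.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")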

should_distribute

Determines whether the process should be configured for distributed training.

Returns:
    bool: True if the process should be configured for distributed training.
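
A sketch of how these helpers might compose at training startup; the call order inside gigl's own trainers is not documented here, so treat this as one plausible arrangement.

    import torch
    import torch.distributed as dist

    from gigl.common.utils.torch_training import (
        get_distributed_backend,
        should_distribute,
    )

    def maybe_init_distributed() -> None:
        # Initialize the default process group only when the environment
        # (e.g. a Kubeflow PyTorchJob) indicates distributed training.
        if should_distribute() and not dist.is_initialized():
            backend = get_distributed_backend(use_cuda=torch.cuda.is_available())
            dist.init_process_group(backend=backend)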