gigl.common.utils.torch_training
Functions
Returns the distributed backend based on whether distributed training is enabled and whether CUDA is used.

Args:
    use_cuda (bool): Whether CUDA is used for training.

Returns:
    Optional[str]: The distributed backend (NCCL or GLOO) if distributed training is enabled, None otherwise.
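A minimal sketch of how such a backend choice is commonly made with torch.distributed, assuming NCCL for GPU training and GLOO for CPU training. The helper name `pick_backend` and its parameters are illustrative, not the module's actual signature:

```python
# Illustrative sketch (not the library's implementation) of picking a
# torch.distributed backend from a CUDA flag. `pick_backend` is hypothetical.
from typing import Optional

import torch
import torch.distributed as dist


def pick_backend(use_cuda: bool, distributed: bool = True) -> Optional[str]:
    """Return "nccl" for GPU training, "gloo" for CPU, or None if not distributed."""
    if not distributed:
        return None
    return dist.Backend.NCCL if use_cuda else dist.Backend.GLOO


if __name__ == "__main__":
    print(pick_backend(use_cuda=torch.cuda.is_available()))
```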
This is automatically set by the Kubeflow PyTorchJob launcher.

Returns:
    int: The index of the process involved in distributed training.
This is automatically set by the Kubeflow PyTorchJob launcher.

Returns:
    int: Total number of processes involved in distributed training.
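Launchers such as the Kubeflow PyTorchJob controller expose these values to each worker through the standard `RANK` and `WORLD_SIZE` environment variables. A small sketch of reading them (helper names are hypothetical):

```python
# Illustrative sketch: reading the rank and world size injected by the
# distributed launcher via environment variables.
import os


def read_rank() -> int:
    # RANK is the index of this process within the distributed job.
    return int(os.environ.get("RANK", 0))


def read_world_size() -> int:
    # WORLD_SIZE is the total number of processes in the distributed job.
    return int(os.environ.get("WORLD_SIZE", 1))


if __name__ == "__main__":
    print(f"rank={read_rank()} world_size={read_world_size()}")
```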
For local debugging purposes only. Sets the necessary environment variables for distributed training on a local machine.

Returns:
    bool: If True, should_distribute exits early and enables distributed training.
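A minimal sketch of what such a local setup can look like, assuming a single-process (world-size-1) process group; the helper name `setup_local_debug_env` and the default address/port values are assumptions, not the module's actual behavior:

```python
# Illustrative sketch: set the environment variables torch.distributed's
# default env:// rendezvous expects, so one local process can initialize a
# process group for debugging. Helper name and defaults are assumptions.
import os

import torch.distributed as dist


def setup_local_debug_env() -> bool:
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    return True  # signal callers (e.g. should_distribute) to enable distributed mode


if __name__ == "__main__":
    setup_local_debug_env()
    dist.init_process_group(backend=dist.Backend.GLOO)
    print("initialized:", dist.is_initialized())
    dist.destroy_process_group()
```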
Determines whether the process should be configured for distributed training.

Returns:
    bool: True if the process is configured for distributed training.
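One common way to make this decision is to check that torch.distributed is available and that the launcher reported more than one process; the exact check used by GiGL may differ, so treat this as a sketch:

```python
# Illustrative sketch: decide whether to run in distributed mode based on
# torch.distributed availability and the launcher-provided WORLD_SIZE.
import os

import torch.distributed as dist


def should_distribute() -> bool:
    return dist.is_available() and int(os.environ.get("WORLD_SIZE", 1)) > 1


if __name__ == "__main__":
    print("distributed:", should_distribute())
```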