gigl.distributed.DistLinkPredictionDataset

Bases: DistDataset

This class inherits from GraphLearn-for-PyTorch’s DistDataset class. We override __init__ to support positive and negative edges and labels, and override share_ipc to correctly serialize these new fields. We additionally introduce a build function for storing the partitioned graph inside this class. We assume that data in this class resides only in CPU RAM and do not support data in GPU memory, which simplifies the logic and tooling required compared to the base DistDataset class.

Methods

__init__

Initializes the fields of the DistLinkPredictionDataset class. This function is called upon each serialization of the DistLinkPredictionDataset instance.

build

Provided some partition graph information, this method stores these tensors inside of the class for subsequent live subgraph sampling using a GraphLearn-for-PyTorch NeighborLoader.

from_ipc_handle

get_edge_feature

get_edge_types

get_graph

get_node_feature

get_node_label

get_node_types

init_edge_features

Initialize the edge feature storage.

init_graph

Initialize the graph storage and build the object of Graph.

init_node_features

Initialize the node feature storage.

init_node_labels

Initialize the node label storage.

init_node_split

Initialize the node split.

load

Load a certain dataset partition from partitioned files and create in-memory objects (Graph, Feature or torch.Tensor).

load_vineyard

random_node_split

Performs a node-level random split by adding train_idx, val_idx and test_idx attributes to the DistDataset object.

share_ipc

Serializes the member variables of the DistLinkPredictionDataset class.

Initializes the fields of the DistLinkPredictionDataset class. This function is called upon each serialization of the DistLinkPredictionDataset instance.

Args:
rank (int): Rank of the current process

world_size (int): World size of the current process

edge_dir (Literal["in", "out"]): Edge direction of the provided graph

The arguments below are only expected to be provided when re-serializing an instance of the DistLinkPredictionDataset class after build() has been called:

graph_partition (Optional[Union[Graph, Dict[EdgeType, Graph]]]): Partitioned Graph Data

node_feature_partition (Optional[Union[Feature, Dict[NodeType, Feature]]]): Partitioned Node Feature Data

edge_feature_partition (Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]): Partitioned Edge Feature Data

node_partition_book (Optional[Union[PartitionBook, Dict[NodeType, PartitionBook]]]): Node Partition Book

edge_partition_book (Optional[Union[PartitionBook, Dict[EdgeType, PartitionBook]]]): Edge Partition Book

positive_edge_label (Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]): Positive Edge Label Tensor

negative_edge_label (Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]): Negative Edge Label Tensor

node_ids (Optional[Union[torch.Tensor, Dict[NodeType, torch.Tensor]]]): Node IDs on the current machine

num_train (Optional[Mapping[NodeType, int]]): Number of training nodes on the current machine. Will be a dict if heterogeneous.

num_val (Optional[Mapping[NodeType, int]]): Number of validation nodes on the current machine. Will be a dict if heterogeneous.

num_test (Optional[Mapping[NodeType, int]]): Number of test nodes on the current machine. Will be a dict if heterogeneous.

Provided some partition graph information, this method stores these tensors inside of the class for subsequent live subgraph sampling using a GraphLearn-for-PyTorch NeighborLoader.

Note that this method will clear the following fields from the provided partition_output:
  • partitioned_edge_index

  • partitioned_node_features

  • partitioned_edge_features

We do this to decrease the peak memory usage during the build process by removing these intermediate assets.
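The clearing behaviour described above can be pictured with a minimal sketch. ToyPartitionOutput and ToyDataset are hypothetical stand-ins rather than GiGL classes; only the three field names are taken from the note. The idea shown is that the dataset takes ownership of each asset and the reference held by the partition output is dropped, so only one copy stays alive.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyPartitionOutput:
    """Hypothetical stand-in for PartitionOutput; only the field names come from the docs."""

    partitioned_edge_index: Optional[List[List[int]]] = None
    partitioned_node_features: Optional[List[List[float]]] = None
    partitioned_edge_features: Optional[List[List[float]]] = None


class ToyDataset:
    """Hypothetical stand-in for the dataset; not the GiGL implementation."""

    def __init__(self) -> None:
        self.edge_index: Optional[List[List[int]]] = None
        self.node_features: Optional[List[List[float]]] = None
        self.edge_features: Optional[List[List[float]]] = None

    def build(self, partition_output: ToyPartitionOutput) -> None:
        # Take ownership of each asset, then clear the field on the input
        # so the intermediate reference no longer keeps the data alive.
        self.edge_index = partition_output.partitioned_edge_index
        partition_output.partitioned_edge_index = None

        self.node_features = partition_output.partitioned_node_features
        partition_output.partitioned_node_features = None

        self.edge_features = partition_output.partitioned_edge_features
        partition_output.partitioned_edge_features = None


output = ToyPartitionOutput(
    partitioned_edge_index=[[0, 1], [1, 2]],
    partitioned_node_features=[[0.1], [0.2], [0.3]],
    partitioned_edge_features=[[1.0], [2.0]],
)
dataset = ToyDataset()
dataset.build(output)
```

In this sketch, once build() returns, the three fields on the input are None and the data is reachable only through the dataset.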

Args:

partition_output (PartitionOutput): Partitioned Graph to be stored in the DistLinkPredictionDataset class

splitter (Optional[NodeAnchorLinkSplitter]): A function that takes in an edge index and returns:

  • a tuple of train, val, and test node ids, if homogeneous

  • a dict[NodeType, tuple[train, val, test]] of node ids, if heterogeneous

Optional as not all datasets need to be split on, e.g. if we’re doing inference.
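Because the splitter may return either a bare tuple or a dict keyed by node type, calling code may want to normalize the two forms before counting nodes per split. The following is a minimal sketch under that assumption. normalize_split, count_split, and the DEFAULT_NODE_TYPE key are hypothetical helpers introduced for illustration and are not part of GiGL, and plain lists stand in for tensors.

```python
from collections.abc import Mapping
from typing import Dict, Sequence, Tuple

# (train, val, test) node ids; plain sequences stand in for tensors here.
Split = Tuple[Sequence[int], Sequence[int], Sequence[int]]

# Hypothetical placeholder key used when the splitter returns a bare tuple.
DEFAULT_NODE_TYPE = "default"


def normalize_split(result) -> Dict[str, Split]:
    """Return a dict keyed by node type, whichever form the splitter produced."""
    if isinstance(result, Mapping):
        return dict(result)
    return {DEFAULT_NODE_TYPE: result}


def count_split(result) -> Dict[str, Tuple[int, int, int]]:
    """Count train, val, and test nodes per node type."""
    return {
        node_type: (len(train), len(val), len(test))
        for node_type, (train, val, test) in normalize_split(result).items()
    }


tuple_counts = count_split(([0, 1, 2], [3], [4, 5]))
dict_counts = count_split({"user": ([0], [1], [2, 3])})
```

Counts of this kind correspond to the num_train, num_val, and num_test values that the class stores per machine.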

During serialization, the initialized Feature type does not immediately contain the feature and id2index tensors. These fields are initially set to None, and are only populated when we retrieve the size, retrieve the shape, or index into one of these tensors. This can also be done manually with the feature.lazy_init_with_ipc_handle() function.
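The lazy population described above can be illustrated with a small sketch. ToyLazyFeature is a hypothetical class written for this page, not the GraphLearn-for-PyTorch Feature type. It borrows the lazy_init_with_ipc_handle name only to mirror the text, and a nested list stands in for the shared tensor handle.

```python
from typing import List, Optional, Tuple


class ToyLazyFeature:
    """Hypothetical illustration of lazy initialization; not the real Feature class."""

    def __init__(self, ipc_handle: List[List[float]]) -> None:
        self._ipc_handle = ipc_handle
        # Left unset until the data is first needed.
        self.feature_tensor: Optional[List[List[float]]] = None

    def lazy_init_with_ipc_handle(self) -> None:
        # Materialize the data once; later calls are no-ops.
        if self.feature_tensor is None:
            self.feature_tensor = [list(row) for row in self._ipc_handle]

    @property
    def shape(self) -> Tuple[int, int]:
        self.lazy_init_with_ipc_handle()
        rows = self.feature_tensor
        return (len(rows), len(rows[0]) if rows else 0)

    def __getitem__(self, index: int) -> List[float]:
        self.lazy_init_with_ipc_handle()
        return self.feature_tensor[index]


feature = ToyLazyFeature([[1.0, 2.0], [3.0, 4.0]])
populated_before_access = feature.feature_tensor is not None
feature_shape = feature.shape
populated_after_access = feature.feature_tensor is not None
```

Code that inspects the underlying fields directly, rather than going through size, shape, or indexing, would see None until one of those accesses or an explicit lazy_init_with_ipc_handle() call has happened.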

Initialize the edge feature storage.

Args:
edge_feature_data (torch.Tensor or numpy.ndarray): A tensor of the raw edge feature data, should be a dict for heterogeneous graph edges. (default: None)

id2idx (torch.Tensor or numpy.ndarray): A tensor that maps edge id to index, should be a dict for heterogeneous graph edges. (default: None)

split_ratio (float): The proportion (between 0 and 1) of edge feature data allocated to the GPU, should be a dict for heterogeneous graph edges. (default: 0.0)

device_group_list (List[DeviceGroup]): A list of device groups used for edge feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)

device (torch.device): The target CUDA device rank used for edge feature lookups when the GPU part is not None. (default: None)

with_gpu (bool): A Boolean value indicating whether the Feature uses UnifiedTensor. If True, the Feature consists of UnifiedTensor; otherwise the Feature is a PyTorch CPU Tensor and split_ratio, device_group_list and device will be invalid. (default: True)

dtype (torch.dtype): The data type of edge feature elements. If not specified, it will be automatically inferred. (default: None)
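The role of id2idx can be shown with a short sketch: features are stored in some row order, and id2idx translates an edge id into the row that holds its features. lookup_features is a hypothetical helper written for illustration and plain lists stand in for tensors; it is not the library's lookup code.

```python
from typing import List, Sequence


def lookup_features(
    feature_data: Sequence[Sequence[float]],
    id2idx: Sequence[int],
    ids: Sequence[int],
) -> List[List[float]]:
    """Fetch feature rows for the given ids via the id-to-row-index mapping."""
    return [list(feature_data[id2idx[i]]) for i in ids]


# Rows are stored in the order of edge ids 2, 0, 1.
feature_data = [[2.0], [0.0], [1.0]]
# id2idx[edge_id] gives the row that stores that edge's features.
id2idx = [1, 2, 0]

rows = lookup_features(feature_data, id2idx, ids=[0, 2])
```

Here edge 0 lives in row 1 and edge 2 lives in row 0, so the lookup returns their features in the order the ids were requested.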

Initialize the graph storage and build the object of Graph.

Args:
edge_index (torch.Tensor or numpy.ndarray): Edge index for the graph topology, a 2D CPU tensor/numpy.ndarray (homo). A dict should be provided for a heterogeneous graph. (default: None)

edge_ids (torch.Tensor or numpy.ndarray): Edge ids for graph edges, a CPU tensor (homo) or a Dict[EdgeType, torch.Tensor] (hetero). (default: None)

edge_weights (torch.Tensor or numpy.ndarray): Edge weights for graph edges, a CPU tensor (homo) or a Dict[EdgeType, torch.Tensor] (hetero). (default: None)

layout (str): The edge layout representation for the input edge index, should be 'COO', 'CSR' or 'CSC'. (default: 'COO')

graph_mode (str): Mode in graphlearn_torch’s Graph, 'CPU', 'ZERO_COPY' or 'CUDA'. (default: 'ZERO_COPY')

directed (bool): A Boolean value indicating whether the graph topology is directed. (default: False)

device (torch.device): The target CUDA device rank used for graph operations when graph mode is not "CPU". (default: None)
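To make the layout options concrete, the sketch below converts a COO edge list (parallel arrays of source and destination nodes) into CSR form (a row pointer array plus a column index array). coo_to_csr is a hypothetical helper on plain Python lists, shown only to illustrate the two layouts; it is not how graphlearn_torch performs the conversion.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(
    rows: Sequence[int], cols: Sequence[int], num_nodes: int
) -> Tuple[List[int], List[int]]:
    """Convert COO (rows, cols) into CSR (indptr, indices)."""
    # Count the outgoing edges of every node.
    indptr = [0] * (num_nodes + 1)
    for r in rows:
        indptr[r + 1] += 1
    # A prefix sum turns the counts into row start offsets.
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]

    # Place each destination into the next free slot of its source row.
    indices = [0] * len(cols)
    next_slot = indptr[:-1].copy()
    for r, c in zip(rows, cols):
        indices[next_slot[r]] = c
        next_slot[r] += 1
    return indptr, indices


# Edges: 2->0, 0->1, 1->2, 0->2
indptr, indices = coo_to_csr(rows=[2, 0, 1, 0], cols=[0, 1, 2, 2], num_nodes=3)
```

The neighbors of node i are then indices[indptr[i]:indptr[i + 1]]. CSC is the same construction with the roles of rows and cols swapped.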

Initialize the node feature storage.

Args:
node_feature_data (torch.Tensor or numpy.ndarray): A tensor of the raw node feature data, should be a dict for heterogeneous graph nodes. (default: None)

id2idx (torch.Tensor or numpy.ndarray): A tensor that maps node id to index, should be a dict for heterogeneous graph nodes. (default: None)

sort_func: Function for reordering node features. Currently, reordering is only supported for features of homogeneous nodes. (default: None)

split_ratio (float): The proportion (between 0 and 1) of node feature data allocated to the GPU, should be a dict for heterogeneous graph nodes. (default: 0.0)

device_group_list (List[DeviceGroup]): A list of device groups used for node feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)

device (torch.device): The target CUDA device rank used for node feature lookups when the GPU part is not None. (default: None)

with_gpu (bool): A Boolean value indicating whether the Feature uses UnifiedTensor. If True, the Feature consists of UnifiedTensor; otherwise the Feature is a PyTorch CPU Tensor and split_ratio, device_group_list and device will be invalid. (default: True)

dtype (torch.dtype): The data type of node feature elements. If not specified, it will be automatically inferred. (default: None)
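As a rough illustration of what split_ratio expresses, the sketch below divides a number of feature rows into a GPU part and a CPU part. split_rows is a hypothetical helper, and the truncating rule it uses is an assumption made for this example; the exact rule applied by graphlearn_torch is not stated on this page.

```python
from typing import Tuple


def split_rows(num_rows: int, split_ratio: float) -> Tuple[int, int]:
    """Return (gpu_rows, cpu_rows) for a given split ratio."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    # Assumption for this sketch: truncate toward zero.
    gpu_rows = int(num_rows * split_ratio)
    return gpu_rows, num_rows - gpu_rows


cpu_only = split_rows(8, 0.0)
quarter_on_gpu = split_rows(8, 0.25)
```

With the default of 0.0, every row stays in CPU memory, which matches the CPU-only assumption stated for DistLinkPredictionDataset.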

Initialize the node label storage.

Args:
node_label_data (torch.Tensor or numpy.ndarray): A tensor of the raw node label data, should be a dict for heterogeneous graph nodes. (default: None)

id2idx (torch.Tensor or numpy.ndarray): A tensor that maps global node id to local index, and should be None for a GLT (non-v6d) graph. (default: None)

Initialize the node split.

Args:
node_split (tuple): A tuple containing the train, validation, and test node indices. (default: None)

Load a certain dataset partition from partitioned files and create in-memory objects (Graph, Feature or torch.Tensor).

Args:
root_dir (str): The directory path to load the graph and feature partition data.

partition_idx (int): Partition idx to load.

graph_mode (str): Mode for creating graphlearn_torch’s Graph, including CPU, ZERO_COPY or CUDA. (default: ZERO_COPY)

input_layout (str): Layout of the input graph, including CSR, CSC or COO. (default: COO)

feature_with_gpu (bool): A Boolean value indicating whether the created Feature objects of node/edge features use UnifiedTensor. If True, the Feature consists of UnifiedTensor; otherwise the Feature is a PyTorch CPU Tensor and device_group_list and device will be invalid. (default: True)

graph_caching (bool): A Boolean value indicating whether to load the full graph topology instead of the partitioned one.

device_group_list (List[DeviceGroup], optional): A list of device groups used for feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)

whole_node_label_file (str): The path to the whole node labels which are not partitioned. (default: None)

device: The target CUDA device rank used for graph operations when graph mode is not "CPU" and for feature lookups when the GPU part is not None. (default: None)

During serialization, the initialized Feature type does not immediately contain the feature and id2index tensors. These fields are initially set to None, and are only populated when we retrieve the size, retrieve the shape, or index into one of these tensors. This can also be done manually with the feature.lazy_init_with_ipc_handle() function.

Performs a node-level random split by adding train_idx, val_idx and test_idx attributes to the DistDataset object. All nodes except those in the validation and test sets will be used for training.

Args:
num_val (int or float): The number of validation samples. If float, it represents the ratio of samples to include in the validation set.

num_test (int or float): The number of test samples in case of "train_rest" and "random" split. If float, it represents the ratio of samples to include in the test set.
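The split semantics described above (integer counts or float ratios for validation and test, with all remaining nodes used for training) can be sketched in plain Python. toy_random_node_split is a hypothetical function written for illustration. The seed parameter and the rounding of ratios are assumptions, and the function returns index lists instead of setting train_idx, val_idx and test_idx attributes.

```python
import random
from typing import List, Tuple, Union


def toy_random_node_split(
    num_nodes: int,
    num_val: Union[int, float],
    num_test: Union[int, float],
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train_idx, val_idx, test_idx) from a random permutation of node ids."""

    def resolve(count: Union[int, float]) -> int:
        # A float is read as a ratio of all nodes; the rounding rule is an assumption.
        if isinstance(count, float):
            return round(num_nodes * count)
        return count

    n_val, n_test = resolve(num_val), resolve(num_test)
    if n_val + n_test > num_nodes:
        raise ValueError("validation and test sets exceed the number of nodes")

    order = list(range(num_nodes))
    random.Random(seed).shuffle(order)

    val_idx = order[:n_val]
    test_idx = order[n_val:n_val + n_test]
    # Everything not used for validation or test is used for training.
    train_idx = order[n_val + n_test:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = toy_random_node_split(10, num_val=0.2, num_test=3)
```

The three index lists are disjoint and together cover every node exactly once.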

Serializes the member variables of the DistLinkPredictionDataset class.

Returns:
int: Rank on current machine

int: World size across all machines

Literal["in", "out"]: Graph Edge Direction

Optional[Union[Graph, Dict[EdgeType, Graph]]]: Partitioned Graph Data

Optional[Union[Feature, Dict[NodeType, Feature]]]: Partitioned Node Feature Data

Optional[Union[Feature, Dict[EdgeType, Feature]]]: Partitioned Edge Feature Data

Optional[Union[torch.Tensor, Dict[NodeType, torch.Tensor]]]: Node Partition Book Tensor

Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Edge Partition Book Tensor

Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Positive Edge Label Tensor

Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Negative Edge Label Tensor

Optional[Union[int, Dict[NodeType, int]]]: Number of training nodes on the current machine. Will be a dict if heterogeneous.

Optional[Union[int, Dict[NodeType, int]]]: Number of validation nodes on the current machine. Will be a dict if heterogeneous.

Optional[Union[int, Dict[NodeType, int]]]: Number of test nodes on the current machine. Will be a dict if heterogeneous.
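The relationship between share_ipc and __init__ can be pictured as a round trip: share_ipc returns the member variables as a tuple, and a new instance is rebuilt by passing that tuple back to the constructor. ToyIpcDataset below is a hypothetical class with only four fields. It assumes that from_ipc_handle unpacks the handle positionally into __init__, and it uses pickle as a stand-in for crossing a process boundary.

```python
import pickle
from typing import Dict, List, Optional, Tuple


class ToyIpcDataset:
    """Hypothetical illustration of the share_ipc round trip; not the GiGL class."""

    def __init__(
        self,
        rank: int,
        world_size: int,
        edge_dir: str,
        graph_partition: Optional[Dict[int, List[int]]] = None,
    ) -> None:
        self.rank = rank
        self.world_size = world_size
        self.edge_dir = edge_dir
        self.graph_partition = graph_partition

    def share_ipc(self) -> Tuple:
        # The order of the tuple matches the positional order of __init__.
        return (self.rank, self.world_size, self.edge_dir, self.graph_partition)

    @classmethod
    def from_ipc_handle(cls, ipc_handle: Tuple) -> "ToyIpcDataset":
        # Assumption for this sketch: the handle is unpacked into the constructor.
        return cls(*ipc_handle)


original = ToyIpcDataset(rank=0, world_size=2, edge_dir="out", graph_partition={0: [1, 2]})
handle = pickle.loads(pickle.dumps(original.share_ipc()))
rebuilt = ToyIpcDataset.from_ipc_handle(handle)
```

Keeping the order of the returned tuple aligned with the constructor signature is what makes this pattern work, which is why the optional partition arguments of __init__ are documented as being supplied only on re-serialization.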