gigl.distributed.DistLinkPredictionDataset#
- class gigl.distributed.dist_link_prediction_dataset.DistLinkPredictionDataset(rank: int, world_size: int, edge_dir: Literal['in', 'out'], graph_partition: Graph | Dict[EdgeType, Graph] | None = None, node_feature_partition: Feature | Dict[NodeType, Feature] | None = None, edge_feature_partition: Feature | Dict[EdgeType, Feature] | None = None, node_partition_book: PartitionBook | Dict[NodeType, PartitionBook] | None = None, edge_partition_book: PartitionBook | Dict[EdgeType, PartitionBook] | None = None, positive_edge_label: Tensor | Dict[EdgeType, Tensor] | None = None, negative_edge_label: Tensor | Dict[EdgeType, Tensor] | None = None, node_ids: Tensor | Dict[NodeType, Tensor] | None = None, num_train: int | Dict[NodeType, int] | None = None, num_val: int | Dict[NodeType, int] | None = None, num_test: int | Dict[NodeType, int] | None = None)#
Bases: DistDataset
This class inherits from GraphLearn-for-PyTorch’s DistDataset class. We override the __init__ functionality to support positive and negative edges and labels. We also override the share_ipc function to correctly serialize these new fields. We additionally introduce a build function for storing the partitioned data inside of this class. We assume data in this class lives only in CPU RAM and do not support data in GPU memory, which simplifies the logic and tooling required compared to the base DistDataset class.
Methods
- __init__: Initializes the fields of the DistLinkPredictionDataset class.
- build: Provided some partition graph information, this method stores these tensors inside of the class for subsequent live subgraph sampling using a GraphLearn-for-PyTorch NeighborLoader.
- from_ipc_handle
- get_edge_feature
- get_edge_types
- get_graph
- get_node_feature
- get_node_label
- get_node_types
- init_edge_features: Initialize the edge feature storage.
- init_graph: Initialize the graph storage and build the object of Graph.
- init_node_features: Initialize the node feature storage.
- init_node_labels: Initialize the node label storage.
- init_node_split: Initialize the node split.
- load: Load a certain dataset partition from partitioned files and create in-memory objects (Graph, Feature or torch.Tensor).
- load_vineyard
- random_node_split: Performs a node-level random split by adding train_idx, val_idx and test_idx attributes to the DistDataset object.
- share_ipc: Serializes the member variables of the DistLinkPredictionDataset class.
- __init__(rank: int, world_size: int, edge_dir: Literal['in', 'out'], graph_partition: Graph | Dict[EdgeType, Graph] | None = None, node_feature_partition: Feature | Dict[NodeType, Feature] | None = None, edge_feature_partition: Feature | Dict[EdgeType, Feature] | None = None, node_partition_book: PartitionBook | Dict[NodeType, PartitionBook] | None = None, edge_partition_book: PartitionBook | Dict[EdgeType, PartitionBook] | None = None, positive_edge_label: Tensor | Dict[EdgeType, Tensor] | None = None, negative_edge_label: Tensor | Dict[EdgeType, Tensor] | None = None, node_ids: Tensor | Dict[NodeType, Tensor] | None = None, num_train: int | Dict[NodeType, int] | None = None, num_val: int | Dict[NodeType, int] | None = None, num_test: int | Dict[NodeType, int] | None = None) None #
Initializes the fields of the DistLinkPredictionDataset class. This function is called upon each serialization of the DistLinkPredictionDataset instance.
- Args:
rank (int): Rank of the current process
world_size (int): World size of the current process
edge_dir (Literal["in", "out"]): Edge direction of the provided graph
- The below arguments are only expected to be provided when re-serializing an instance of the DistLinkPredictionDataset class after build() has been called:
graph_partition (Optional[Union[Graph, Dict[EdgeType, Graph]]]): Partitioned Graph Data
node_feature_partition (Optional[Union[Feature, Dict[NodeType, Feature]]]): Partitioned Node Feature Data
edge_feature_partition (Optional[Union[Feature, Dict[EdgeType, Feature]]]): Partitioned Edge Feature Data
node_partition_book (Optional[Union[PartitionBook, Dict[NodeType, PartitionBook]]]): Node Partition Book
edge_partition_book (Optional[Union[PartitionBook, Dict[EdgeType, PartitionBook]]]): Edge Partition Book
positive_edge_label (Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]): Positive Edge Label Tensor
negative_edge_label (Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]): Negative Edge Label Tensor
node_ids (Optional[Union[torch.Tensor, Dict[NodeType, torch.Tensor]]]): Node IDs on the current machine
num_train (Optional[Union[int, Dict[NodeType, int]]]): Number of training nodes on the current machine. Will be a dict if heterogeneous.
num_val (Optional[Union[int, Dict[NodeType, int]]]): Number of validation nodes on the current machine. Will be a dict if heterogeneous.
num_test (Optional[Union[int, Dict[NodeType, int]]]): Number of test nodes on the current machine. Will be a dict if heterogeneous.
- __weakref__#
list of weak references to the object (if defined)
- build(partition_output: PartitionOutput, splitter: NodeAnchorLinkSplitter | None = None) None #
Provided some partition graph information, this method stores these tensors inside of the class for subsequent live subgraph sampling using a GraphLearn-for-PyTorch NeighborLoader.
- Note that this method will clear the following fields from the provided partition_output:
partitioned_edge_index
partitioned_node_features
partitioned_edge_features
We do this to decrease the peak memory usage during the build process by removing these intermediate assets.
- Args:
partition_output (PartitionOutput): Partitioned Graph to be stored in the DistLinkPredictionDataset class
splitter (Optional[NodeAnchorLinkSplitter]): A function that takes in an edge index and returns:
a tuple of train, val, and test node ids, if homogeneous
a dict[NodeType, tuple[train, val, test]] of node ids, if heterogeneous
Optional as not all datasets need to be split on, e.g. if we’re doing inference.
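As a rough illustration of the memory-saving behaviour described for build(), the sketch below moves partitioned data into a dataset object and then clears the source fields. ToyPartitionOutput, ToyBuildDataset, toy_build and toy_splitter are hypothetical stand-ins written for this example and are not part of GiGL; plain Python lists stand in for tensors, and the splitter simply returns a (train, val, test) tuple of node ids.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for illustration only; these are not GiGL types.
EdgeIndex = List[Tuple[int, int]]
Split = Tuple[List[int], List[int], List[int]]


@dataclass
class ToyPartitionOutput:
    partitioned_edge_index: Optional[EdgeIndex]
    partitioned_node_features: Optional[List[List[float]]]
    partitioned_edge_features: Optional[List[List[float]]]


class ToyBuildDataset:
    def __init__(self) -> None:
        self.edge_index: Optional[EdgeIndex] = None
        self.node_features: Optional[List[List[float]]] = None
        self.edge_features: Optional[List[List[float]]] = None
        self.split: Optional[Split] = None

    def toy_build(
        self,
        partition_output: ToyPartitionOutput,
        splitter: Optional[Callable[[EdgeIndex], Split]] = None,
    ) -> None:
        # Run the optional splitter while the edge index is still available.
        if splitter is not None and partition_output.partitioned_edge_index is not None:
            self.split = splitter(partition_output.partitioned_edge_index)
        # Take ownership of each field, then drop the reference held by
        # partition_output so the data is not kept alive in two places.
        self.edge_index = partition_output.partitioned_edge_index
        partition_output.partitioned_edge_index = None
        self.node_features = partition_output.partitioned_node_features
        partition_output.partitioned_node_features = None
        self.edge_features = partition_output.partitioned_edge_features
        partition_output.partitioned_edge_features = None


def toy_splitter(edge_index: EdgeIndex) -> Split:
    # Deterministic toy rule: first source node is val, second is test, rest train.
    sources = sorted({src for src, _ in edge_index})
    return sources[2:], sources[:1], sources[1:2]


output = ToyPartitionOutput(
    partitioned_edge_index=[(0, 1), (1, 2), (2, 0), (3, 1)],
    partitioned_node_features=[[0.0], [1.0], [2.0], [3.0]],
    partitioned_edge_features=None,
)
dataset = ToyBuildDataset()
dataset.toy_build(output, splitter=toy_splitter)
```

After the call, the dataset owns the edge index and features, and the corresponding fields on the partition output are None. Passing splitter=None skips the split, as in the inference case.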
- property edge_features: Feature | Dict[EdgeType, Feature] | None#
During serialization, the initialized Feature type does not immediately contain the feature and id2index tensors. These fields are initially set to None, and are only populated when we retrieve the size, retrieve the shape, or index into one of these tensors. This can also be done manually with the feature.lazy_init_with_ipc_handle() function.
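The lazy population described here can be pictured with a small, self-contained sketch. LazyFeature and its lazy_init method are hypothetical names used only for illustration; they mimic the behaviour described (data stays None until the size is requested, an item is indexed, or initialization is triggered manually) and do not reproduce GraphLearn-for-PyTorch's Feature implementation.

```python
from typing import Callable, List, Optional


class LazyFeature:
    """Hypothetical stand-in: holds a loader instead of the data until first use."""

    def __init__(self, loader: Callable[[], List[List[float]]]) -> None:
        self._loader = loader
        self.data: Optional[List[List[float]]] = None  # None until first access

    def lazy_init(self) -> None:
        # Plays the role of a manual initialization call; safe to call repeatedly.
        if self.data is None:
            self.data = self._loader()

    def size(self) -> int:
        self.lazy_init()
        return len(self.data)

    def __getitem__(self, index: int) -> List[float]:
        self.lazy_init()
        return self.data[index]


load_calls: List[int] = []


def load_rows() -> List[List[float]]:
    # Record each call so we can check that loading happens only once.
    load_calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


feature = LazyFeature(load_rows)
populated_before = feature.data is not None  # nothing loaded yet
row = feature[1]                             # first access triggers loading
populated_after = feature.data is not None
num_rows = feature.size()                    # reuses the already loaded data
```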
- init_edge_features(edge_feature_data: Tensor | ndarray | Dict[Tuple[str, str, str], Tensor | ndarray] | None = None, id2idx: Tensor | ndarray | Dict[Tuple[str, str, str], Tensor | ndarray] | None = None, split_ratio: float | Dict[Tuple[str, str, str], float] = 0.0, device_group_list: List[DeviceGroup] | None = None, device: int | None = None, with_gpu: bool = True, dtype: dtype | None = None)#
Initialize the edge feature storage.
- Args:
- edge_feature_data (torch.Tensor or numpy.ndarray): A tensor of the raw edge feature data; should be a dict for heterogeneous graph edges. (default: None)
- id2idx (torch.Tensor or numpy.ndarray): A tensor that maps edge id to index; should be a dict for heterogeneous graph edges. (default: None)
- split_ratio (float): The proportion (between 0 and 1) of edge feature data allocated to the GPU; should be a dict for heterogeneous graph edges. (default: 0.0)
- device_group_list (List[DeviceGroup]): A list of device groups used for edge feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)
- device (torch.device): The target cuda device rank used for edge feature lookups when the GPU part is not None. (default: None)
- with_gpu (bool): A Boolean value indicating whether the Feature uses UnifiedTensor. If True, Feature consists of UnifiedTensor; otherwise Feature is a PyTorch CPU Tensor, and split_ratio, device_group_list and device will be invalid. (default: True)
- dtype (torch.dtype): The data type of edge feature elements. If not specified, it will be automatically inferred. (default: None)
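To make the role of id2idx concrete, here is a minimal sketch of an id-to-row lookup. It uses a plain dict and Python lists in place of tensors, and the names local_edge_ids and lookup_edge_features are hypothetical; the actual storage and lookup inside GraphLearn-for-PyTorch's Feature may differ.

```python
from typing import Dict, List

# Rows of edge features held locally, in storage order (illustrative values).
edge_feature_data: List[List[float]] = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]

# Ids of the edges whose features are stored above, in the same order.
local_edge_ids: List[int] = [7, 2, 5]

# id2idx maps an edge id to the row index that holds its features.
id2idx: Dict[int, int] = {edge_id: row for row, edge_id in enumerate(local_edge_ids)}


def lookup_edge_features(edge_ids: List[int]) -> List[List[float]]:
    """Fetch feature rows for the requested edge ids via the id2idx mapping."""
    return [edge_feature_data[id2idx[edge_id]] for edge_id in edge_ids]


looked_up = lookup_edge_features([5, 7])
```

Edge 5 is stored in row 2 and edge 7 in row 0, so the lookup returns those two rows in the requested order.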
- init_graph(edge_index: Tensor | ndarray | Dict[Tuple[str, str, str], Tensor | ndarray] | None = None, edge_ids: Tensor | ndarray | Dict[Tuple[str, str, str], Tensor | ndarray] | None = None, edge_weights: Tensor | ndarray | Dict[Tuple[str, str, str], Tensor | ndarray] | None = None, layout: str | Dict[Tuple[str, str, str], str] = 'COO', graph_mode: str = 'ZERO_COPY', directed: bool = False, device: int | None = None)#
Initialize the graph storage and build the object of Graph.
- Args:
- edge_index (torch.Tensor or numpy.ndarray): Edge index for the graph topology; a 2D CPU tensor or numpy.ndarray (homo). A dict should be provided for a heterogeneous graph. (default: None)
- edge_ids (torch.Tensor or numpy.ndarray): Edge ids for graph edges; a CPU tensor (homo) or a Dict[EdgeType, torch.Tensor] (hetero). (default: None)
- edge_weights (torch.Tensor or numpy.ndarray): Edge weights for graph edges; a CPU tensor (homo) or a Dict[EdgeType, torch.Tensor] (hetero). (default: None)
- layout (str): The edge layout representation for the input edge index; should be ‘COO’, ‘CSR’ or ‘CSC’. (default: ‘COO’)
- graph_mode (str): Mode in graphlearn_torch’s Graph: ‘CPU’, ‘ZERO_COPY’ or ‘CUDA’. (default: ‘ZERO_COPY’)
- directed (bool): A Boolean value indicating whether the graph topology is directed. (default: False)
- device (torch.device): The target cuda device rank used for graph operations when graph mode is not “CPU”. (default: None)
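The layout argument distinguishes COO, CSR and CSC edge representations. As a reminder of how they relate, the following sketch converts a COO edge list into CSR arrays using plain Python lists; coo_to_csr is a hypothetical helper for illustration, not a graphlearn_torch function. CSC is obtained the same way with the row and column roles swapped.

```python
from typing import List, Tuple


def coo_to_csr(rows: List[int], cols: List[int], num_nodes: int) -> Tuple[List[int], List[int]]:
    """Convert a COO edge list (rows[i] -> cols[i]) into CSR (indptr, indices)."""
    # Count the edges leaving each source node.
    indptr = [0] * (num_nodes + 1)
    for r in rows:
        indptr[r + 1] += 1
    # A prefix sum turns the counts into start offsets per source node.
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    # Place each destination in the next free slot of its source's range.
    next_slot = indptr[:-1].copy()
    indices = [0] * len(rows)
    for r, c in zip(rows, cols):
        indices[next_slot[r]] = c
        next_slot[r] += 1
    return indptr, indices


coo_rows = [0, 2, 0, 1]
coo_cols = [1, 0, 2, 2]

# CSR: neighbours grouped by source node.
csr_indptr, csr_indices = coo_to_csr(coo_rows, coo_cols, num_nodes=3)

# CSC: the same routine applied to the transposed edge list.
csc_indptr, csc_indices = coo_to_csr(coo_cols, coo_rows, num_nodes=3)
```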
- init_node_features(node_feature_data: Tensor | ndarray | Dict[str, Tensor | ndarray] | None = None, id2idx: Tensor | ndarray | Dict[str, Tensor | ndarray] | Sequence | Dict[str, Sequence] | None = None, sort_func=None, split_ratio: float | Dict[str, float] = 0.0, device_group_list: List[DeviceGroup] | None = None, device: int | None = None, with_gpu: bool = True, dtype: dtype | None = None)#
Initialize the node feature storage.
- Args:
- node_feature_data (torch.Tensor or numpy.ndarray): A tensor of the raw node feature data; should be a dict for heterogeneous graph nodes. (default: None)
- id2idx (torch.Tensor or numpy.ndarray): A tensor that maps node id to index; should be a dict for heterogeneous graph nodes. (default: None)
- sort_func: Function for reordering node features. Currently, reordering is only supported for features of homogeneous nodes. (default: None)
- split_ratio (float): The proportion (between 0 and 1) of node feature data allocated to the GPU; should be a dict for heterogeneous graph nodes. (default: 0.0)
- device_group_list (List[DeviceGroup]): A list of device groups used for node feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)
- device (torch.device): The target cuda device rank used for node feature lookups when the GPU part is not None. (default: None)
- with_gpu (bool): A Boolean value indicating whether the Feature uses UnifiedTensor. If True, Feature consists of UnifiedTensor; otherwise Feature is a PyTorch CPU Tensor, and split_ratio, device_group_list and device will be invalid. (default: True)
- dtype (torch.dtype): The data type of node feature elements. If not specified, it will be automatically inferred. (default: None)
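As a rough picture of what split_ratio expresses, the sketch below divides feature rows into a portion earmarked for the GPU and a remainder kept on the CPU. This is an arithmetic illustration only: split_rows is a hypothetical helper, taking the leading rows is an assumption made for simplicity, and how GraphLearn-for-PyTorch actually selects and places rows is not covered here. Since DistLinkPredictionDataset assumes data lives in CPU RAM, the default ratio of 0.0 is the case that applies to this class.

```python
from typing import List, Tuple

Rows = List[List[float]]


def split_rows(rows: Rows, split_ratio: float) -> Tuple[Rows, Rows]:
    """Return (gpu_part, cpu_part) given the fraction of rows meant for the GPU."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    # Assumption for this sketch: the first rows form the GPU portion.
    num_gpu_rows = int(len(rows) * split_ratio)
    return rows[:num_gpu_rows], rows[num_gpu_rows:]


node_rows: Rows = [[float(i)] for i in range(8)]

# A quarter of the rows would be earmarked for the GPU.
gpu_part, cpu_part = split_rows(node_rows, 0.25)

# With the default ratio of 0.0, everything stays on the CPU.
gpu_default, cpu_default = split_rows(node_rows, 0.0)
```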
- init_node_labels(node_label_data: Tensor | ndarray | Dict[str, Tensor | ndarray] | None = None, id2idx: Tensor | ndarray | Dict[str, Tensor | ndarray] | Sequence | Dict[str, Sequence] | None = None)#
Initialize the node label storage.
- Args:
- node_label_data (torch.Tensor or numpy.ndarray): A tensor of the raw node label data; should be a dict for heterogeneous graph nodes. (default: None)
- id2idx (torch.Tensor or numpy.ndarray): A tensor that maps global node id to local index; should be None for a GLT (non-v6d) graph. (default: None)
- init_node_split(node_split: Tuple[Tensor | ndarray | Dict[str, Tensor | ndarray], Tensor | ndarray | Dict[str, Tensor | ndarray], Tensor | ndarray | Dict[str, Tensor | ndarray]] | None = None)#
Initialize the node split.
- Args:
- node_split (tuple): A tuple containing the train, validation, and test node indices. (default: None)
- load(*args, **kwargs)#
Load a certain dataset partition from partitioned files and create in-memory objects (Graph, Feature or torch.Tensor).
- Args:
- root_dir (str): The directory path to load the graph and feature partition data.
- partition_idx (int): Partition idx to load.
- graph_mode (str): Mode for creating graphlearn_torch’s Graph, including CPU, ZERO_COPY or CUDA. (default: ZERO_COPY)
- input_layout (str): Layout of the input graph, including CSR, CSC or COO. (default: COO)
- feature_with_gpu (bool): A Boolean value indicating whether the created Feature objects of node/edge features use UnifiedTensor. If True, Feature consists of UnifiedTensor; otherwise Feature is a PyTorch CPU Tensor, and device_group_list and device will be invalid. (default: True)
- graph_caching (bool): A Boolean value indicating whether to load the full graph topology instead of the partitioned one.
- device_group_list (List[DeviceGroup], optional): A list of device groups used for feature lookups. The GPU part of the feature data will be replicated on each device group in this list during initialization. GPUs with peer-to-peer access to each other should be placed in the same device group. (default: None)
- whole_node_label_file (str): The path to the whole node labels which are not partitioned. (default: None)
- device: The target cuda device rank used for graph operations when graph mode is not “CPU” and for feature lookups when the GPU part is not None. (default: None)
- property node_features: Feature | Dict[NodeType, Feature] | None#
During serialization, the initialized Feature type does not immediately contain the feature and id2index tensors. These fields are initially set to None, and are only populated when we retrieve the size, retrieve the shape, or index into one of these tensors. This can also be done manually with the feature.lazy_init_with_ipc_handle() function.
- random_node_split(num_val: float | int, num_test: float | int)#
Performs a node-level random split by adding train_idx, val_idx and test_idx attributes to the DistDataset object. All nodes except those in the validation and test sets will be used for training.
- Args:
- num_val (int or float): The number of validation samples. If float, it represents the ratio of samples to include in the validation set.
- num_test (int or float): The number of test samples in case of "train_rest" and "random" split. If float, it represents the ratio of samples to include in the test set.
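The int-or-float convention for num_val and num_test can be sketched as follows. toy_random_node_split and resolve_count are hypothetical helpers written for this example using only the standard library; they mirror the described behaviour (a float is read as a ratio, an int as an absolute count, and all remaining nodes go to training) but are not the GraphLearn-for-PyTorch implementation.

```python
import random
from typing import List, Tuple, Union


def resolve_count(value: Union[int, float], total: int) -> int:
    # A float is treated as a ratio of the total, an int as an absolute count.
    if isinstance(value, float):
        return int(total * value)
    return value


def toy_random_node_split(
    num_nodes: int,
    num_val: Union[int, float],
    num_test: Union[int, float],
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    n_val = resolve_count(num_val, num_nodes)
    n_test = resolve_count(num_test, num_nodes)
    if n_val + n_test > num_nodes:
        raise ValueError("validation and test sets exceed the number of nodes")
    # Shuffle node ids with a seeded generator so the split is reproducible.
    node_ids = list(range(num_nodes))
    random.Random(seed).shuffle(node_ids)
    val_idx = node_ids[:n_val]
    test_idx = node_ids[n_val:n_val + n_test]
    # Every node not used for validation or testing is used for training.
    train_idx = node_ids[n_val + n_test:]
    return train_idx, val_idx, test_idx


# num_val given as a ratio (25% of 20 nodes), num_test as an absolute count.
train_idx, val_idx, test_idx = toy_random_node_split(20, num_val=0.25, num_test=3)
```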
- share_ipc()
Serializes the member variables of the DistLinkPredictionDataset class.
- Returns:
int: Rank on current machine
int: World size across all machines
Literal["in", "out"]: Graph Edge Direction
Optional[Union[Graph, Dict[EdgeType, Graph]]]: Partitioned Graph Data
Optional[Union[Feature, Dict[NodeType, Feature]]]: Partitioned Node Feature Data
Optional[Union[Feature, Dict[EdgeType, Feature]]]: Partitioned Edge Feature Data
Optional[Union[torch.Tensor, Dict[NodeType, torch.Tensor]]]: Node Partition Book Tensor
Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Edge Partition Book Tensor
Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Positive Edge Label Tensor
Optional[Union[torch.Tensor, Dict[EdgeType, torch.Tensor]]]: Negative Edge Label Tensor
Optional[Union[int, Dict[NodeType, int]]]: Number of training nodes on the current machine. Will be a dict if heterogeneous.
Optional[Union[int, Dict[NodeType, int]]]: Number of validation nodes on the current machine. Will be a dict if heterogeneous.
Optional[Union[int, Dict[NodeType, int]]]: Number of test nodes on the current machine. Will be a dict if heterogeneous.
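The relationship between the serialized tuple and the constructor can be sketched with a toy class. ToyIpcDataset is hypothetical and carries only four fields; the sketch assumes that from_ipc_handle rebuilds an instance by passing the tuple positionally to __init__, which is why the returned values follow the constructor's argument order. The real class additionally relies on shared-memory handles for tensors, which this example does not model.

```python
from typing import Any, Optional, Tuple

IpcHandle = Tuple[int, int, str, Optional[Any]]


class ToyIpcDataset:
    """Hypothetical class showing a share_ipc / from_ipc_handle round trip."""

    def __init__(
        self,
        rank: int,
        world_size: int,
        edge_dir: str,
        graph_partition: Optional[Any] = None,
    ) -> None:
        self.rank = rank
        self.world_size = world_size
        self.edge_dir = edge_dir
        self.graph_partition = graph_partition

    def share_ipc(self) -> IpcHandle:
        # Return the fields in the same order that __init__ accepts them.
        return (self.rank, self.world_size, self.edge_dir, self.graph_partition)

    @classmethod
    def from_ipc_handle(cls, ipc_handle: IpcHandle) -> "ToyIpcDataset":
        # Assumption for this sketch: rebuild by calling the constructor
        # with the tuple unpacked positionally.
        return cls(*ipc_handle)


original = ToyIpcDataset(rank=0, world_size=2, edge_dir="out", graph_partition={"edges": [(0, 1)]})
handle = original.share_ipc()
restored = ToyIpcDataset.from_ipc_handle(handle)
```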