gigl.distributed.build_dataset

gigl.distributed.dataset_factory.build_dataset(serialized_graph_metadata: SerializedGraphMetadata, distributed_context: DistributedContext, sample_edge_direction: Literal['in', 'out'] | str, should_load_tensors_in_parallel: bool = True, partitioner: DistLinkPredictionDataPartitioner | None = None, dataset: DistLinkPredictionDataset | None = None, tf_dataset_options: TFDatasetOptions = TFDatasetOptions(batch_size=10000, file_buffer_size=104857600, deterministic=False, use_interleave=True, num_parallel_file_reads=64, ram_budget_multiplier=0.5), splitter: NodeAnchorLinkSplitter | None = None, _ssl_positive_label_percentage: float | None = None, _dataset_building_port: int = 10000) → DistLinkPredictionDataset

Launches a spawned process that builds and returns a DistLinkPredictionDataset instance from the provided SerializedGraphMetadata.

Args:

serialized_graph_metadata (SerializedGraphMetadata): Metadata about TFRecords that are serialized to disk.

distributed_context (DistributedContext): Distributed context containing information for master_ip_address, rank, and world size.

sample_edge_direction (Union[Literal["in", "out"], str]): Whether edges in the graph are directed inward or outward. Note that this is listed as a possible string to satisfy type check, but in practice must be a Literal["in", "out"].

should_load_tensors_in_parallel (bool): Whether tensors should be loaded from serialized information in parallel or in sequence across the [node, edge, pos_label, neg_label] entity types.

partitioner (Optional[DistLinkPredictionDataPartitioner]): Initialized partitioner to partition the graph inputs. If provided, this must be a DistLinkPredictionDataPartitioner or subclass of it. If not provided, will initialize a DistLinkPredictionDataPartitioner instance using provided edge assign strategy.

dataset (Optional[DistLinkPredictionDataset]): Initialized dataset class to store the graph inputs. If provided, this must be a DistLinkPredictionDataset or subclass of it. If not provided, will initialize a DistLinkPredictionDataset instance using provided edge_dir.

tf_dataset_options (TFDatasetOptions): Options provided to a tf.data.Dataset to tune how serialized data is read.

splitter (Optional[NodeAnchorLinkSplitter]): Optional splitter to use for splitting the graph data into train, val, and test sets. If not provided (None), no splitting will be performed.

_ssl_positive_label_percentage (Optional[float]): Percentage of edges to select as self-supervised labels. Must be None if supervised edge labels are provided in advance. Slotted for refactor once this functionality is available in the transductive splitter directly.

_dataset_building_port (int): The RPC port to use to build the dataset. WARNING: You don't need to configure this unless you run into port conflict issues. Slotted for refactor: in the future, the port will be automatically assigned based on availability. Currently defaults to gigl.distributed.constants.DEFAULT_MASTER_DATA_BUILDING_PORT.
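
If the default dataset-building port does conflict with another service, one option is to check which ports are free before overriding _dataset_building_port. The sketch below uses only the Python standard library; the helper names is_port_available and first_available_port are hypothetical and are not part of GiGL. It only inspects the local machine at the moment of the check, so another process may still claim the port afterwards.

```python
import socket


def is_port_available(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration; not part of GiGL.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True


def first_available_port(
    start: int = 10000, attempts: int = 100, host: str = "127.0.0.1"
) -> int:
    """Return the first bindable port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_available(port, host):
            return port
    raise RuntimeError(f"No free port found in [{start}, {start + attempts})")
```

The chosen value could then be passed as _dataset_building_port. Whether every machine in the job must agree on the same port is not stated here, so treat the result as a candidate to verify rather than a guaranteed fix.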

Returns:

DistLinkPredictionDataset: Built GraphLearn-for-PyTorch Dataset class.
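
A minimal sketch of a call, assuming the SerializedGraphMetadata and DistributedContext objects have already been constructed elsewhere. The wrapper name build_link_prediction_dataset is hypothetical; the import path is taken from the signature above. Because sample_edge_direction is annotated to admit any str, the wrapper checks the value before handing it to build_dataset.

```python
from typing import Any

# The only values that are valid in practice, per the parameter description.
VALID_EDGE_DIRECTIONS = ("in", "out")


def build_link_prediction_dataset(
    serialized_graph_metadata: Any,
    distributed_context: Any,
    sample_edge_direction: str = "in",
) -> Any:
    """Hypothetical wrapper around build_dataset; not part of GiGL."""
    if sample_edge_direction not in VALID_EDGE_DIRECTIONS:
        raise ValueError(
            f"sample_edge_direction must be one of {VALID_EDGE_DIRECTIONS}, "
            f"got {sample_edge_direction!r}"
        )

    # Imported lazily so the check above runs even where GiGL is not installed.
    from gigl.distributed.dataset_factory import build_dataset

    # All other arguments are left at their documented defaults.
    return build_dataset(
        serialized_graph_metadata=serialized_graph_metadata,
        distributed_context=distributed_context,
        sample_edge_direction=sample_edge_direction,
    )
```

Leaving partitioner, dataset, and splitter unset relies on the defaults described above: a default partitioner and dataset are created, and no train/val/test split is performed.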