gigl.common.data.dataloaders.TFDatasetOptions#

class gigl.common.data.dataloaders.TFDatasetOptions(batch_size: int = 10000, file_buffer_size: int = 104857600, deterministic: bool = False, use_interleave: bool = True, num_parallel_file_reads: int = 64, ram_budget_multiplier: float = 0.5)#

Bases: object

Options for tuning a tf.data.Dataset.

Choosing whether to use interleave is not straightforward. We’ve found that interleave is faster for large numbers (>100) of small (<20 MB) files. However, this is highly variable; you should run your own benchmarks to find the best settings for your use case.
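To build intuition for what interleaved reading does, here is a minimal pure-Python sketch of round-robin interleaving over multiple file streams. This is an illustration of the scheduling concept only, not GiGL or TensorFlow code; `tf.data.Dataset.interleave` additionally overlaps real file I/O in parallel.

```python
def interleave(streams, cycle_length):
    """Round-robin over up to `cycle_length` streams at a time, yielding one
    element from each in turn. When a stream is exhausted, a pending stream
    takes its slot -- a simplified model of interleave with block_length=1."""
    streams = [iter(s) for s in streams]
    active = streams[:cycle_length]
    pending = streams[cycle_length:]
    while active:
        still_active = []
        for stream in active:
            try:
                yield next(stream)
                still_active.append(stream)
            except StopIteration:
                # Replace the exhausted stream with the next pending one.
                if pending:
                    still_active.append(pending.pop(0))
        active = still_active

# Three "files" of two records each, reading two files at a time:
records = interleave([["0a", "0b"], ["1a", "1b"], ["2a", "2b"]], cycle_length=2)
```

With many small files, this pattern keeps several readers busy at once instead of draining one file before opening the next, which is why interleave tends to win in that regime.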

Deterministic processing is much slower (roughly 2×, i.e. 100%) for larger (>10M entities) datasets, but has very little impact on smaller datasets.

Args:

batch_size (int): How large each batch should be while processing the data.

file_buffer_size (int): The size of the buffer to use when reading files.

deterministic (bool): Whether to use deterministic processing; if False, the order of elements can be non-deterministic.

use_interleave (bool): Whether to use tf.data.Dataset.interleave to read files in parallel; if not set, num_parallel_file_reads will be used.

num_parallel_file_reads (int): The number of files to read in parallel if use_interleave is False.

ram_budget_multiplier (float): The multiplier of the total system memory to set as the tf.data RAM budget.
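As a usage sketch, the snippet below mirrors the documented signature with a stand-alone dataclass so it runs without GiGL installed; in real code you would instead `from gigl.common.data.dataloaders import TFDatasetOptions`. The field names and defaults come from the signature above; everything else is illustrative.

```python
from dataclasses import dataclass

@dataclass
class TFDatasetOptions:
    # Names and defaults mirror the documented signature.
    batch_size: int = 10_000
    file_buffer_size: int = 104_857_600  # 100 MiB per-file read buffer
    deterministic: bool = False
    use_interleave: bool = True
    num_parallel_file_reads: int = 64  # used only when use_interleave=False
    ram_budget_multiplier: float = 0.5  # fraction of system RAM for tf.data

# Defaults are a reasonable starting point; override per workload, e.g.
# reproducible element ordering at the cost of throughput:
opts = TFDatasetOptions(deterministic=True, batch_size=5_000)
```

Since this is a plain dataclass, options are constructed once and passed to the data-loading code; tuning is a matter of overriding individual keyword arguments and benchmarking.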

Methods

__init__

__delattr__(name)#

Implement delattr(self, name).

__eq__(other)#

Return self==value.

__hash__()#

Return hash(self).

__init__(batch_size: int = 10000, file_buffer_size: int = 104857600, deterministic: bool = False, use_interleave: bool = True, num_parallel_file_reads: int = 64, ram_budget_multiplier: float = 0.5) None#
__repr__()#

Return repr(self).

__setattr__(name, value)#

Implement setattr(self, name, value).

__weakref__#

list of weak references to the object (if defined)