gigl.common.data.TFDatasetOptions#
- class gigl.common.data.dataloaders.TFDatasetOptions(batch_size: int = 10000, file_buffer_size: int = 104857600, deterministic: bool = False, use_interleave: bool = True, num_parallel_file_reads: int = 64, ram_budget_multiplier: float = 0.5)#
Bases: object
Options for tuning a tf.data.Dataset.
Choosing whether to use interleave is not straightforward. We’ve found that interleave is faster for large numbers (>100) of small (<20M) files, though this is highly variable; you should run your own benchmarks to find the best settings for your use case.
Deterministic processing is much slower (up to 100%) for larger (>10M entities) datasets, but has very little impact on smaller ones.
- Args:
  batch_size (int): How large each batch should be while processing the data.
  file_buffer_size (int): The size of the buffer to use when reading files.
  deterministic (bool): Whether to use deterministic processing; if False, the order of elements can be non-deterministic.
  use_interleave (bool): Whether to use tf.data.Dataset.interleave to read files in parallel; if not set, num_parallel_file_reads will be used.
  num_parallel_file_reads (int): The number of files to read in parallel if use_interleave is False.
  ram_budget_multiplier (float): The multiplier of the total system memory to set as the tf.data RAM budget.
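As a usage sketch, the snippet below mirrors the documented signature with a stand-in dataclass (the real class lives in gigl.common.data.dataloaders and may differ internally; the field names and defaults here are taken from the signature above, while the dataclass decorator and frozen-ness are assumptions):

```python
from dataclasses import dataclass


# Illustrative stand-in mirroring the documented signature of
# gigl.common.data.dataloaders.TFDatasetOptions; in real code you would
# import the class from gigl instead of defining it.
@dataclass
class TFDatasetOptions:
    batch_size: int = 10_000
    file_buffer_size: int = 100 * 1024 * 1024  # 104857600 bytes (100 MiB)
    deterministic: bool = False
    use_interleave: bool = True
    num_parallel_file_reads: int = 64
    ram_budget_multiplier: float = 0.5


# The defaults favor throughput. For reproducible element ordering on a
# small number of large files, you might disable interleave and turn on
# deterministic processing:
opts = TFDatasetOptions(
    deterministic=True,
    use_interleave=False,
    num_parallel_file_reads=16,
)
```

Because deterministic=False allows tf.data to reorder elements for speed, only opt into determinism when reproducibility matters more than throughput.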
Methods
- __delattr__(name)#
Implement delattr(self, name).
- __eq__(other)#
Return self==value.
- __hash__()#
Return hash(self).
- __init__(batch_size: int = 10000, file_buffer_size: int = 104857600, deterministic: bool = False, use_interleave: bool = True, num_parallel_file_reads: int = 64, ram_budget_multiplier: float = 0.5) → None#
- __repr__()#
Return repr(self).
- __setattr__(name, value)#
Implement setattr(self, name, value).
- __weakref__#
list of weak references to the object (if defined)