gigl.src.common.models.layers.CountMinSketch#

class gigl.src.common.models.layers.count_min_sketch.CountMinSketch(width: int = 2000, depth: int = 10)#

Bases: object

A probabilistic data structure that can be used to estimate the frequency of an item in a stream. For full details, please refer to the paper: https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf

There is also a good blog post from Redis discussing its applications: https://redis.io/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/

How accurate is the estimation?

Denote the total count by N, the width of the table by w, and the depth of the table by d. For each of the d rows, the hashed count exceeds the true count by less than 2N/w with probability at least 1/2 (by Markov's inequality, since the expected over-count in a row is at most N/w). Because the final estimate is the minimum over the d rows, it exceeds the true count by more than 2N/w with probability less than 1/2^d.
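A minimal back-of-the-envelope sketch of that bound with the constructor defaults (width=2000, depth=10); the total count N below is a hypothetical value chosen for illustration:

width, depth = 2000, 10
N = 1_000_000                       # hypothetical total count of items added
error_bound = 2 * N / width         # over-count exceeded only with small probability
failure_probability = 0.5 ** depth
print(error_bound)                  # 1000.0 -> estimates stay within +1000 of the true count ...
print(failure_probability)          # ~0.00098 -> ... except with probability below 0.1%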

Currently, this implementation is single-threaded; we may want to optimize it if we see performance issues.

Methods

__init__

add

Add an item to the sketch

add_torch_long_tensor

Add all items in a torch long tensor to the sketch

estimate

Return the estimated count of the item

estimate_torch_long_tensor

Return the estimated count of all items in a torch long tensor

get_table

Return the internal state of the table, for testing purposes

total

Return the total number of items seen so far

__init__(width: int = 2000, depth: int = 10)#
__weakref__#

list of weak references to the object (if defined)

add(item: Any, delta: int = 1) None#

Add an item to the sketch

add_torch_long_tensor(tensor: LongTensor) None#

Add all items in a torch long tensor to the sketch

estimate(item: Any) int#

Return the estimated count of the item

estimate_torch_long_tensor(tensor: LongTensor) LongTensor#

Return the estimated count of all items in a torch long tensor

get_table() ndarray#

Return the internal state of the table, for testing purposes

total() int#

Return the total number of items seen so far
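A brief usage sketch based on the methods documented above. Estimates may overcount slightly (never undercount), so the printed values are indicative; whether total() sums the deltas passed to add is an assumption noted in the comments.

import torch

from gigl.src.common.models.layers.count_min_sketch import CountMinSketch

sketch = CountMinSketch(width=2000, depth=10)

# Count individual items.
sketch.add("user_42")            # increment by the default delta of 1
sketch.add("user_42", delta=4)   # increment by an explicit delta

# Count every id in a torch long tensor.
node_ids = torch.tensor([7, 7, 7, 13], dtype=torch.long)
sketch.add_torch_long_tensor(node_ids)

print(sketch.estimate("user_42"))                   # >= 5; never less than the true count
print(sketch.estimate_torch_long_tensor(node_ids))  # per-item estimates, e.g. tensor([3, 3, 3, 1])
print(sketch.total())                               # 9 here, assuming total() sums all deltas added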