gigl.src.common.models.layers.CountMinSketch#
- class gigl.src.common.models.layers.count_min_sketch.CountMinSketch(width: int = 2000, depth: int = 10)#
Bases:
object
A probability data structure that can be used to estimate the frequency of an item in a stream. For full details please refer to the paper https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf
There is also a good blog from Redis to talk about its application https://redis.io/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/
- How accurate is the estimation?
Denote the total count is N, and the width of the table is w, the depth of the table is d. For each row (in d), we have atleast 1/2 of the probability that the hashed count is less than 2N/w Then, with d rows, the final results is more than 2N/w larger than the actual count with a probability less than 1/2^d
Currently, this implementation only uses single thread, and we might want to optimize it if we see performance issue
Methods
Add an item to the sketch
Add all items in a torch long tensor to the sketch
Return the estimated count of the item
Return the estimated count of all items in a torch long tensor
Return the internal state of the table, for testing purpose
Return the total number of items seen so far
- __init__(width: int = 2000, depth: int = 10)#
- __weakref__#
list of weak references to the object (if defined)
- add(item: Any, delta: int = 1) None #
Add an item to the sketch
- add_torch_long_tensor(tensor: LongTensor) None #
Add all items in a torch long tensor to the sketch
- estimate(item: Any) int #
Return the estimated count of the item
- estimate_torch_long_tensor(tensor: LongTensor) LongTensor #
Return the estimated count of all items in a torch long tensor
- get_table() ndarray #
Return the internal state of the table, for testing purpose
- total() int #
Return the total number of items seen so far