In this article implementation of the Counting Bloom Filter is going to be discussed. Following are the topics that are going to be covered.
- Why do we need Probabilistic Data Structures?
- What are some areas where it can be applied?
- What is the membership type of problem?
- How is Counting Bloom Filter different from traditional Bloom Filter?
- Introduction to Counting Bloom Filter and Implementation.
Why do we need Probabilistic Data Structures?
In this Big Data Age, every company was processing a large amount of data each day. e.g., IBM has stated that people create 2.5 quintillion bytes of data every day, this number is growing constantly, so companies need some technique to quickly understand the data to find value and act on it. However, the traditional technologies which include data structures and algorithms, become ineffective when dealing with Big Data. The best solution for this type of problem is using Probabilistic Data Structures.
A deterministic data structure can also perform all the operations that a probabilistic data structure does but only with low data sets. If the data set is too big and couldn’t fit into the memory, then the deterministic data structure fails and is simply not feasible.
PDS(Probabilistic Data Structures) always provide approximated answers but with reliable ways to estimate possible errors. Fortunately, the potential losses and errors are fully compensated for by extremely low memory requirements, constant query time, and scaling, the factors that become essential in Big Data applications.
What are some areas we apply to?
As mentioned previously the advantage of PDS appears in Big Data Problems like
- Membership Problem where we need to find whether the particular element is present in the data or not?
- In problems like finding the number of unique elements present in the data.
- Counting the frequency of a particular element in the whole data.
What is the membership type of problem?
A membership problem for a dataset is a task to decide whether some element belongs to the dataset or not. Unlike deterministic data structures which find the place where the exact match occurred. In many of the scenarios, we don’t need to know which element from the set has been matched, only that a match has been made and, therefore it is possible to store only signatures of the elements rather than the whole value.
How is counting Bloom Filter different from traditional Bloom Filter?
The main disadvantage of classical Bloom Filter is deletion operation is not possible in that i.e., only the element can be inserted and whether the element is present in the data or not can be checked. Even if anyone try to change the bits in the bit array corresponding to the k positions it may lead to false negatives. Fortunately, missing deletion support is not a problem for many real-world applications,
Counting Bloom Filter and its Implementation
The most popular extension of the classical Bloom filter that supports deletion is the Counting Bloom filter, proposed by Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder in 2000. Counting Bloom Filter introduces an array of m counters {Cj}mj=1 corresponding to each bit in the filter’s array.
The Counting Bloom filter allows approximating the number of times each element has been seen in the filter by incrementing the corresponding counter every time the element is added. The associated CountingBloomFilter data structure contains a bit array and the array of counters of length m, all initialized to zeros. When a new element is inserted into CountingBloomFilter, first compute its corresponding bit-positions, then for each position, we increment the associated counter, These are some basic ideas to code in any preferred language.
Pseudo Code:
Input: Element x belongs to D
Input: Counting Bloom Filter with m counters {Cj}j=1m and k hash functions {hi}i=1k
Algo:for i=1 to k do
j=hi(x)
Cj=Cj+1
Next to test whether an element is present or not we check the counters corresponding to k positions if the value of all the counters is greater than 0 implies the element is probably present.
Input: Element x belongs to D
Input: Counting Bloom Filter with m counters and k hash functions.
Algo:for i=1 to k do
j=hi(x)
if CountingBloomFilter[j]<1 then
return False
return True
Now, finally the important deletion part. The deletion is quite similar to the insertion but in reverse. To delete an element x , we compute all k hash values hi = {hi(x )}ki=1 and decrease the corresponding counters.
Below is the implementation of the above Algorithm
Python3
import math from fnvhash import fnv1a_32 from bitarray import bitarray from bitarray.util import ba2int,int2ba class CBloomFilter(): def __init__( self , n,Counter_size,bucket_size,no_hashfn): self .n = n self .N = Counter_size self .m = bucket_size self .k = no_hashfn self .bit_array = [] for i in range ( self .m): count = bitarray( self .N) count.setall( 0 ) self .bit_array.append(count) def hash ( self ,item,seed): return fnv1a_32(item.encode(),seed) % self .m def add( self , item): for i in range ( self .k): index = self . hash (item,i) cur_val = ba2int( self .bit_array[index]) new_array = int2ba(cur_val + 1 ,length = self .N) self .bit_array[index] = new_array def check( self , item): for i in range ( self .k): index = self . hash (item,i) cur_val = ba2int( self .bit_array[index]) if ( not cur_val> 0 ): return False return True def remove( self ,item): if ( self .check(item)): for i in range ( self .k): index = self . hash (item,i) cur_val = ba2int( self .bit_array[index]) new_array = int2ba(cur_val - 1 ,length = self .N) self .bit_array[index] = new_array print ( 'Element Removed' ) else : print ( 'Element is probably not exist' ) |
In the above code for hashing the elements FNV hash function(you may also use the murmur hash function) is used which is widely used in probabilistic data structures. n is the expected no of elements that you are going to enter into the filter. Counter_size is the number of bits for each counter in the bucket. bucket_size is the m number of buckets we are going to use in the filter and finally, the no_hashfn is k the number of hash functions that we are going to use.
Now test the class with some examples save the file as counting_bloom_filter.py now create another file for testing.
Python3
from counting_bloom_filter import CBloomFilter from random import shuffle n = 20 # no of items to add N = 4 # size of the each counter m = 150 # total number of the buckets k = 5 # number of hash functions cbloomf = CBloomFilter(n, N, m, k) print ( "Size of bit array:{}" . format (cbloomf.m)) print ( "Size of each bucket:{}" . format (cbloomf.N)) print ( "Number of hash functions:{}" . format (cbloomf.k)) # words to be added word_present = [ 'abound' , 'abounds' , 'abundance' , 'abundant' , 'accessible' , 'bloom' , 'blossom' , 'bolster' , 'bonny' , 'bonus' , 'bonuses' , 'coherent' , 'cohesive' , 'colorful' , 'comely' , 'comfort' , 'gems' , 'generosity' , 'generous' , 'generously' , 'genial' ] # word not added word_absent = [ 'bluff' , 'cheater' , 'hate' , 'war' , 'humanity' , 'racism' , 'hurt' , 'nuke' , 'gloomy' , 'facebook' , 'neveropen' , 'twitter' ] for item in word_present: cbloomf.add(item) shuffle(word_present) shuffle(word_absent) test_words = word_present[: 10 ] + word_absent shuffle(test_words) for word in test_words: if cbloomf.check(word): if word in word_absent: print ( "'{}' is a false positive!" . format (word)) else : print ( "'{}' is probably present!" . format (word)) else : print ( "'{}' is definitely not present!" . format (word)) |
Output:
Properties:
The Counting Bloom filter inherits all the properties of the classical Bloom filter, including false-positive error estimation and recommendations for the optimal choice of m and k refer.