Introduction
Every day on the internet, more than 2.5 quintillion bytes of data are created. This data is increasing in terms of variety, velocity and volume, hence called big data. To analyze this data, one has to collect this data, store it in a safe place, clean it and then perform analysis. One of the major problems faced by big data engineers is dealing with unuseful or redundant data. A lot of time and memory is used to store and analyze this extra data which turns out to be fruitless in the end. Thus, the removal of duplicate data becomes extremely essential to cut the analysis cost and reduce redundancy.
Data cleaning can be done using various techniques but before cleaning the data, it is necessary to know the amount of useful data present in the dataset. Therefore, before the removal of duplicate data from a data stream or database, it is necessary to have knowledge of distinct or unique data present. A way to do so is by hashing the elements of the universal set using the Flajolet Martin Algorithm. The FM algorithm is used in a database query, big data analytics, spectrum sensing in cognitive radio sensor networks, and many more areas. It shows superior performance as compared with many other methods to find distinct elements in a stream of data.
This article was published as a part of the Data Science Blogathon
Table of contents
Flajolet Martin Algorithm
Flajolet Martin Algorithm, also known as FM algorithm, is used to approximate the number of unique elements in a data stream or database in one pass. The highlight of this algorithm is that it uses less memory space while executing.
Pseudo Code-Stepwise Solution:
- Selecting a hash function h so each element in the set is mapped to a string to at least log2n bits.
- For each element x, r(x)= length of trailing zeroes in h(x)
- R= max(r(x))
=> Distinct elements= 2R
Need of Flajolet Marting Algorithm
The Flajolet-Martin algorithm is a potent tool for ascertaining the count of distinct elements within a database. Its utility shines in scenarios where memory size is substantial and processing the complete dataset poses challenges. Key uses and advantages include:
- Scalability: This algorithm is inherently scalable and capable of tackling large datasets. It efficiently estimates unique element counts without necessitating the storage of the entire dataset in memory.
- Memory Efficiency: The Flajolet-Martin algorithm demands minimal memory. It constructs a compact data representation by utilizing hash functions and bit manipulations, effectively counting unique elements.
- Speed: Suited for real-time applications, the algorithm is computationally efficient. It swiftly generates counts of distinct items, requiring relatively modest computing resources. This agility is ideal for swift, on-the-fly calculations.
Working with Flajolet Martin Algorithm
Let us compare this algorithm with our conventional algorithm using python code. Assume we have an array(stream in code) of data of length 20 with 8 unique elements. Using the brute force approach to find the number of unique elements in the array, each element is taken into consideration. Another array(st_unique in code) is formed for unique elements. Initially, the new array is empty(st_unique length equals zero), so naturally, the first element is not present in it. The first element is considered to be unique as it does not exist in the new array and thus a copy of the first element is inserted into the new array(1 is appended to st_unique).
Similarly, all the elements are checked, if they are already present in the new array, they are not considered to be unique, else a copy of the element is inserted into the new array. Running the brute force algorithm for our array, we will get 8 elements in the new array. If each element takes 20 bytes of data, the new array will take 8*20= 160 bytes memory to run the algorithm.
Python Code
For the same array, if use the FM algorithm for the same array, we define a variable(maxnum in code) that stores the maximum number of zeroes at the end. For each value in the array(stream in code), we run a loop to covert its hash function in the form ax+b mod c,(a=1, b=6 and c=32 in this case) into binary(we place [2:] at the end because in python when converted to binary, the number starts with ‘0b’). We run another loop to find if the number of zeroes at the end exceeds the maximum number of zeroes.
In this case, if each variable occupies 2 bytes of data, the whole program takes 4*20= 80 bytes of data, i.e half of the memory used in the above case. Here, we have considered the variables maxnum, sum, val, and j. We have not considered i, time, and stream as they are common in both of the codes.
stream=[1,2,3,4,5,6,4,2,5,9,1,6,3,7,1,2,2,4,2,1]
print('Using Flajolet Martin Algorithm:')
import time
start_time = time.time()
maxnum=0
for i in range(0,len(stream)):
val= bin((1*stream[i] + 6) % 32)[2:]
sum=0
for j in range(len(val)-1,0,-1):
if val[j]=='0':
sum+=1
else:
break
if sum>maxnum:
maxnum=sum
print('distict elements', 2**maxnum)
print("--- %s seconds ---" % (time.time() - start_time))
For small values of m(where m is the number of unique elements), the brute force approach can work, but for large data sets or data streams, where m is very large, a lot of space is required. The compiler may not let us run the algorithm in some cases.
This is where the Flajolet Martin Algorithm can be used. Not only does it occupy less memory, but it also shows better results in terms of time in seconds when the python code is run which can be shown in our output as we calculated seconds taken by both algorithms by using time.time() in python. As shown in the output, it can clearly be said that the FM algorithm takes very little time as compared to the conventional algorithm.
Drawbacks of the Algorithm
It is important to choose the hash parameters wisely while implementing this algorithm as it has been proven practically that the FM Algorithm is very sensitive to the hash function parameters. The hash function used in the above code is of the form ax+b mod c where x is an element in the stream and a,b,c are 1,6,32 respectively where a,b,c are the hash function parameters. For any other values of a,b,c the algorithm may not give the same result. Thus, it is important to observe the stream and then assign proper values to a,b and c.
Conclusion
This approach is used to maintain a count of distinct values seen so far, given a large number of values. For example, getting an approximation of the number of distinct URLs surfed by a person on the web. Many companies want to check how many unique users logged in to their website the previous day to check if their advertisement was successful or not. Here, the FM algorithm is an excellent solution for these companies.
If you want to learn more of these data science concepts, enroll in our Blackbelt program that offers exclusive modules for data science enthusiasts. Explore the program now.
Frequently Asked Questions
A. The Flajolet-Martin algorithm is a probabilistic method to estimate the cardinality or number of distinct elements in a dataset or data stream. It employs hash functions and bit manipulation for efficient approximations.
A. The Flajolet-Martin algorithm uses hash-based techniques to estimate unique elements in a data stream. It leverages bit patterns to approximate cardinality, making it efficient for large-scale datasets with minimal memory usage.
A. In data analytics, the Flajolet-Martin algorithm aids in estimating distinct elements within large datasets. It offers a memory-efficient way to understand dataset diversity, which is crucial for insights and decision-making.
A. In machine learning, the Flajolet-Martin algorithm isn’t a central component. However, its probabilistic approach might find applications in tasks where estimating the cardinality of distinct features is relevant, such as feature engineering or data preprocessing.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.