Wednesday, January 8, 2025
Google search engine
HomeData Modelling & AIIntroduction to Rolling Hash – Data Structures and Algorithms

Introduction to Rolling Hash – Data Structures and Algorithms

A rolling hash is a hash function that is used to efficiently compute a hash value for a sliding window of data. It is commonly used in computer science and computational biology, where it can be used to detect approximate string matches, find repeated substrings, and perform other operations on sequences of data.

The idea behind a rolling hash is to compute the hash value for a fixed-size window of data, and then “roll” the window one position at a time and re-compute the hash value. By using a sliding window, the rolling hash can avoid re-computing the hash value from scratch for each position, which can be time-consuming for large data sets. Instead, the rolling hash only needs to update the hash value for the new data that is added or removed from the window.

Examples:

Example 1

  • let’s say we have a string “hello world” and we want to search for the substring “world” within it. We can use the Rabin-Karp hash function to compute the hash value of each possible substring of length 5 (the length of the substring “world”) and compare it to the hash value of the target substring “world”.
  • To compute the hash value of a substring, the Rabin-Karp algorithm uses a rolling hash function that maintains a running hash value of the current substring and updates the hash value as the sliding window moves across the input string.
  • For instance, the Rabin-Karp rolling hash function might use the following formula to compute the hash value of a substring: hash(substring) = (c1 * a^(k-1) + c2 * a^(k-2) + … + ck * a^0) mod m where c1, c2, …, ck are the ASCII values of the characters in the substring, a is a prime number (e.g., 31), k is the length of the substring, and m is a large prime number (e.g., 10^9 + 7).
  • Using this rolling hash function, we can compute the hash value of each possible substring of length 5 in the string “hello world”, and compare it to the hash value of the target substring “world” to search for a match. If a match is found, we can return the index of the matching substring in the input string.
  • Rolling hashes are useful for many applications where data needs to be efficiently compressed, searched, or compared in a continuous and incremental manner.

Example 2

  • Suppose we have a large file (e.g., several gigabytes in size) that we want to deduplicate by identifying and removing duplicate blocks of data. To do this efficiently, we can use a rolling hash to compute the hash value of each block of data and compare it to the hash values of previously seen blocks.
  • One common rolling hash used for this purpose is the Adler-32 hash function, which is a checksum algorithm that is fast and simple to compute. The Adler-32 hash function works by combining two 16-bit values (s1 and s2) that are updated for each byte in the input data.
  • To compute the Adler-32 hash of a block of data, we can use the following formula: hash(data) = s2 * 65536 + s1 where s1 is the sum of all bytes in the data modulo 65521, and s2 is the sum of all s1 values modulo 65521.
  • Using this rolling hash function, we can compute the hash value of each block of data in the file as we read it, and compare it to the hash values of previously seen blocks stored in a hash table or a bloom filter. If a match is found, we can skip the duplicate block and save space by only storing a reference to the previously seen block.

The most commonly used algorithm for rolling hashes is the Rabin-Karp algorithm, which uses a polynomial hash function to compute the hash value for the window of data. The polynomial hash function uses the coefficients of a polynomial to represent the window of data, and the hash value is computed by evaluating the polynomial at a fixed point in a finite field. The hash value can be efficiently updated by removing the first coefficient from the polynomial and adding the next coefficient from the sliding window, which can be done in constant time.

Here’s an implementation of the rolling hash algorithm in Python:

  • This function takes in a string s, a window size window_size, and two optional arguments base and mod. The base argument is used as the base for the polynomial hash function, and the mod is the modulus used to reduce the hash values.
  • The function first initializes some variables, including a list power that stores the powers of the base modulo the mod, a list hash_values that stores the hash values of all substrings of length window_size in s, and a variable current_hash that stores the hash value of the current window.
  • The function then uses a for loop to precompute the powers of the base modulo the mod. It then computes the hash value of the first window using another for loop that iterates over the characters in the first window.
  • Finally, the function uses another for loop to compute the hash values of the rest of the substrings. It first removes the contribution of the first character in the current window, shifts the window by one character, and adds the new character to the hash. It then stores the hash value of the current window in the hash_values list.
  • The function returns the hash_values list, which contains the hash values of all substrings of length window_size in s.

Below is the implementation for the above approach:

C++




#include <bits/stdc++.h>
using namespace std;
 
vector<int> rolling_hash(string s, int window_size,
                         int base = 26,
                         int mod = 1000000007)
{
    int n = s.length();
    vector<int> power(n + 1, 1);
    vector<int> hash_values(n - window_size + 1);
 
    // Precompute the powers of the base modulo the mod
    for (int i = 1; i <= n; i++) {
        power[i] = (power[i - 1] * base) % mod;
    }
 
    // Compute the hash value of the first window
    int current_hash = 0;
    for (int i = 0; i < window_size; i++) {
        current_hash = (current_hash * base + s[i]) % mod;
    }
    hash_values[0] = current_hash;
 
    // Compute the hash values of the rest of the substrings
    for (int i = 1; i <= n - window_size; i++) {
        // Remove the contribution of the first character in
        // the window
        current_hash = (current_hash
                        - power[window_size - 1] * s[i - 1])
                       % mod;
 
        // Shift the window by one character and add the new
        // character to the hash
        current_hash
            = (current_hash * base + s[i + window_size - 1])
              % mod;
 
        hash_values[i] = current_hash;
    }
    return hash_values;
}
 
// Driver code
int main()
{
 
    // Input string
    string s = "abcabcabc";
 
    // Window size
    int window_size = 3;
 
    // Calculate rolling hash values
    vector<int> hash_values = rolling_hash(s, window_size);
 
    // Print the hash values
    for (auto& val : hash_values) {
        cout << val << " ";
    }
    cout << endl;
 
    return 0;
}
 
// this code is contributed by Prajwal kandekar


Java




import java.util.*;
 
public class Main {
    public static ArrayList<Integer>
    rolling_hash(String s, int window_size, int base,
                 int mod)
    {
        int n = s.length();
        ArrayList<Integer> power
            = new ArrayList<Integer>(n + 1);
        ArrayList<Integer> hash_values
            = new ArrayList<Integer>(n - window_size + 1);
        // Precompute the powers of the base modulo the mod
        for (int i = 0; i <= n; i++) {
            power.add(1);
        }
        for (int i = 1; i <= n; i++) {
            power.set(i, (power.get(i - 1) * base) % mod);
        }
 
        // Compute the hash value of the first window
        int current_hash = 0;
        for (int i = 0; i < window_size; i++) {
            current_hash
                = (current_hash * base + s.charAt(i)) % mod;
        }
        hash_values.add(current_hash);
 
        // Compute the hash values of the rest of the
        // substrings
        for (int i = 1; i <= n - window_size; i++) {
            // Remove the contribution of the first
            // character in the window
            current_hash = (current_hash
                            - power.get(window_size - 1)
                                  * s.charAt(i - 1))
                           % mod;
 
            // Shift the window by one character and add the
            // new character to the hash
            current_hash = (current_hash * base
                            + s.charAt(i + window_size - 1))
                           % mod;
 
            hash_values.add(current_hash);
        }
        return hash_values;
    }
 
    // Driver code
    public static void main(String[] args)
    {
        // Input string
        String s = "abcabcabc";
 
        // Window size
        int window_size = 3;
 
        // Calculate rolling hash values
        ArrayList<Integer> hash_values
            = rolling_hash(s, window_size, 26, 1000000007);
 
        // Print the hash values
        for (int val : hash_values) {
            System.out.print(val + " ");
        }
        System.out.println();
    }
}


Python3




# Python Implementation
from typing import List
 
 
def rolling_hash(s: str, window_size: int, base: int = 26, mod: int = 10**9 + 7) -> List[int]:
    """
    Calculates the rolling hash values of all substrings of length window_size in string s.
    Uses the polynomial rolling hash algorithm with base and mod as constants.
 
    :param s: the input string
    :param window_size: the size of the rolling window
    :param base: the base for the polynomial hash function
    :param mod: the modulus for the polynomial hash function
    :return: a list of hash values of all substrings of length window_size in s
    """
    n = len(s)
    power = [1] * (n + 1)
    hash_values = [0] * (n - window_size + 1)
 
    # Precompute the powers of the
    # base modulo the mod
    for i in range(1, n + 1):
        power[i] = (power[i - 1] * base) % mod
 
    # Compute the hash value of
    # the first window
    current_hash = 0
    for i in range(window_size):
        current_hash = (current_hash * base + ord(s[i])) % mod
 
    hash_values[0] = current_hash
 
    # Compute the hash values of the
    # rest of the substrings
    for i in range(1, n - window_size + 1):
 
        # Remove the contribution of the
        # first character in the window
        current_hash = (
            current_hash - power[window_size - 1] * ord(s[i - 1])) % mod
 
        # Shift the window by one character
        # and add the new character
        # to the hash
        current_hash = (current_hash * base + ord(s[i + window_size - 1])) % mod
 
        hash_values[i] = current_hash
 
    return hash_values
 
 
# Driver code
def main():
 
    # Input string
    s = "abcabcabc"
 
    # Window size
    window_size = 3
 
    # Calculate rolling hash values
    hash_values = rolling_hash(s, window_size)
 
    # Print the hash values
    print(hash_values)
 
 
if __name__ == "__main__":
    main()


C#




// C# Implementation
using System;
using System.Collections.Generic;
 
class Program {
    static List<int> rolling_hash(string s, int window_size,
                                  int base_, int mod)
    {
        int n = s.Length;
        List<int> power = new List<int>(n + 1);
        List<int> hash_values
            = new List<int>(n - window_size + 1);
        // Precompute the powers of the base modulo the mod
        for (int i = 0; i <= n; i++) {
            power.Add(1);
        }
        for (int i = 1; i <= n; i++) {
            power[i] = (power[i - 1] * base_) % mod;
        }
 
        // Compute the hash value of the first window
        int current_hash = 0;
        for (int i = 0; i < window_size; i++) {
            current_hash
                = (current_hash * base_ + s[i]) % mod;
        }
        hash_values.Add(current_hash);
 
        // Compute the hash values of the rest of the
        // substrings
        for (int i = 1; i <= n - window_size; i++) {
            // Remove the contribution of the first
            // character in the window
            current_hash
                = (current_hash
                   - power[window_size - 1] * s[i - 1])
                  % mod;
 
            // Shift the window by one character and add the
            // new character to the hash
            current_hash = (current_hash * base_
                            + s[i + window_size - 1])
                           % mod;
 
            hash_values.Add(current_hash);
        }
        return hash_values;
    }
 
    // Driver code
    static void Main(string[] args)
    {
        // Input string
        string s = "abcabcabc";
 
        // Window size
        int window_size = 3;
 
        // Calculate rolling hash values
        List<int> hash_values
            = rolling_hash(s, window_size, 26, 1000000007);
 
        // Print the hash values
        foreach(int val in hash_values)
        {
            Console.Write(val + " ");
        }
        Console.WriteLine();
    }
}
 
// This code is contributed by Tapesh(tapeshdua420)


Javascript




function rollingHash(s, windowSize, base = 26, mod = 1000000007) {
  const n = s.length;
  const power = new Array(n + 1).fill(1);
  const hashValues = new Array(n - windowSize + 1);
 
  // Precompute the powers of the base modulo the mod
  for (let i = 1; i <= n; i++) {
    power[i] = (power[i - 1] * base) % mod;
  }
 
  // Compute the hash value of the first window
  let currentHash = 0;
  for (let i = 0; i < windowSize; i++) {
    currentHash = (currentHash * base + s.charCodeAt(i)) % mod;
  }
  hashValues[0] = currentHash;
 
  // Compute the hash values of the rest of the substrings
  for (let i = 1; i <= n - windowSize; i++) {
    // Remove the contribution of the first character in the window
    currentHash =
      ((currentHash - power[windowSize - 1] * s.charCodeAt(i - 1)) % mod + mod) %
      mod;
 
    // Shift the window by one character and add the new character to the hash
    currentHash =
      (currentHash * base + s.charCodeAt(i + windowSize - 1)) % mod;
 
    hashValues[i] = currentHash;
  }
  return hashValues;
}
 
// Driver code
const s = "abcabcabc";
 
// Window size
const windowSize = 3;
 
// Calculate rolling hash values
const hashValues = rollingHash(s, windowSize);
 
// Print the hash values
console.log(hashValues.join(" "));
 
//This code is contributed by Tushar Rokade


Output

[68219, 68919, 69544, 68219, 68919, 69544, 68219]

Time Complexity: O(n)
Auxiliary Space: O(n)

Applications of the rolling hash algorithm:

  • Approximate string matching: The rolling hash can be used to quickly identify regions of two strings that are likely to be similar, which can be useful for detecting plagiarism or identifying similar documents.
  • Repeated substring detection: The rolling hash can be used to identify repeated substrings in a long sequence of data, which can be useful for compressing the data or finding patterns in the data.
  • Data stream processing: The rolling hash can be used to efficiently process large streams of data, such as network packets or sensor data, by computing a hash value for each window of data and using the hash value to identify patterns or anomalies in the data.
  • File synchronization: Rolling hashes can be used to efficiently detect changes in large files, such as multimedia files or database backups. By computing rolling hashes for fixed-size chunks of the file, it is possible to quickly identify regions of the file that have been modified or deleted, which can be used to synchronize the file with a remote backup or a different version of the file.
  • Data deduplication: Rolling hashes can be used to detect duplicate blocks of data in a large data set, such as a file system or a backup archive. By computing rolling hashes for fixed-size chunks of the data, it is possible to quickly identify blocks of data that have already been stored, which can be used to save storage space and reduce backup times.
  • Network intrusion detection: Rolling hashes can be used to identify network traffic that matches known patterns of attack or malicious behavior, such as denial-of-service attacks or malware infections. By computing rolling hashes for packets of network traffic, it is possible to quickly identify patterns of behavior that are indicative of an attack, which can be used to trigger an alarm or block the traffic.
  • Genome analysis: Rolling hashes are widely used in computational biology to analyze large sequences of genetic data, such as DNA or RNA sequences. By computing rolling hashes for fixed-size segments of the sequence, it is possible to quickly identify regions of the sequence that are likely to be functionally important or evolutionarily conserved, which can be used to guide experimental design or annotate the sequence.
  • Video processing: Rolling hashes can be used to efficiently detect duplicate or near-duplicate frames in a video stream, which can be used for video compression or to identify video content that has been illegally copied or distributed. By computing rolling hashes for fixed-size frames of the video, it is possible to quickly identify regions of the video that are likely to be similar, which can be used to group the frames into clusters and identify the most representative frames.
  • Cloud storage: Rolling hashes can be used to efficiently detect changes in large data sets that are stored in the cloud, such as cloud backups or cloud-based file systems. By computing rolling hashes for fixed-size chunks of the data, it is possible to quickly identify regions of the data that have been modified, which can be used to synchronize the data with a remote backup or a different version of the data.
  • Machine learning: Rolling hashes can be used to preprocess large data sets for use in machine learning models, such as image recognition or speech recognition. By computing rolling hashes for fixed-size chunks of the data, it is possible to reduce the dimensionality of the data while preserving the most salient features, which can be used to speed up training times and improve model accuracy.

Advantages of the rolling hash algorithm:

  • Speed: Rolling hash is very fast and efficient in calculating the hash value of a substring. It is especially useful when dealing with large strings, where calculating the hash value of the entire string would be too slow.
  • Memory usage: Rolling hash uses very little memory, as it only needs to store the hash value of the current window and the hash value of the previous window.
  • Partial matching: Rolling hash allows for partial matching, which means that you can compare two substrings and quickly determine whether they are equal or not. This is useful for string searching and pattern-matching algorithms.

Disadvantages of the rolling hash algorithm:

  • Hash collisions: Rolling hash can produce hash collisions, which means that two different substrings can have the same hash value. This can lead to false matches and incorrect results.
  • Window size: Rolling hash requires a fixed window size, which means that it may not be suitable for all types of strings. If the window size is too small, it may miss important features of the string. If the window size is too large, it may be too slow and memory-intensive.
  • Limited hash space: Rolling hash has a limited hash space, which means that there is a finite number of hash values that can be generated. This can limit the accuracy and reliability of the hash function.

Feeling lost in the world of random DSA topics, wasting time without progress? It’s time for a change! Join our DSA course, where we’ll guide you on an exciting journey to master DSA efficiently and on schedule.
Ready to dive in? Explore our Free Demo Content and join our DSA course, trusted by over 100,000 neveropen!

RELATED ARTICLES

Most Popular

Recent Comments