
Histogram Reconstruction

29 July 2024


In this article, we discuss "Diagram to Histogram" reconstruction, also known as "Interval Finding". When dealing with statistical data, a diagram is often drawn with single points (or the corresponding numbers), like the stars in the figure below, although it is actually meant to be a histogram whose bins have a certain width.

Analyzing the Problem:

In this problem statement, the assumption is that all intervals stick seamlessly together i.e., there are no gaps and no overlaps. The right edge of a bin is identical to the left edge of the following bin. Given N points, thus N bins, the task is to find (N + 1) bin edges. Every given point is located in the exact center of its X interval. This gives N equations for (N + 1) unknown quantities, so the system is underdetermined. There are two suggestions:

The bins shall be as uniform as possible. Mathematically, the variance of the bin widths shall be minimized.
One bin edge is supplied directly as a fixed value, and all other edges are derived from it.

Calculations:

Notation and assumptions:

Let x_i be the x-coordinates of the given points (their y-values are irrelevant here).
Let w_i be the respective bin widths.
Let e_i be the coordinates of the bin edges, so that bin i extends from e_i to e_{i+1}.
Assume that the x_i are sorted in ascending order.

Below is the image to illustrate the above concepts:

From the above representation, two simple relations that can be easily verified are:

$x_{i + 1} - x_i = \frac12(w_{i + 1} + w_i)$

$x_i = \frac12(e_i + e_{i + 1})$

Applying the first relation repeatedly simplifies the calculations, especially for even N, and is the reason why different results are obtained for odd and even values of N. The second relation, rearranged to e_{i+1} = 2x_i - e_i, is directly the recursion formula used to fill in the array e[] (the bin edges on the X-axis of the histogram).
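The recursion can be sketched in a few lines of C. This is only a minimal sketch, assuming that the left-most edge e_0 is already known; the helper name fill_edges_from_left is hypothetical and is not part of the implementation given later in this article.

```c
/* Hypothetical helper: fill e[0..n] from a known left-most edge e0.
   The second relation x[i] = (e[i] + e[i+1]) / 2, solved for the
   right edge, gives e[i+1] = 2*x[i] - e[i].
   The caller must provide room for n + 1 values in e. */
void fill_edges_from_left(int n, const double *x, double e0, double *e)
{
    e[0] = e0;
    for (int i = 0; i < n; ++i)
        e[i + 1] = 2.0 * x[i] - e[i];
}
```

Called with the six bin centers 20.5, 79, 160.5, 217, 267, 332.5 and e0 = 4, it reproduces the edges 4, 37, 121, 200, 234, 300, 365 used in the example further below.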

How to minimize the Variance?

The idea is the same as in any minimization problem: find the point where the derivative is 0. The difficulty is to reduce the formula for the variance to a single unknown variable, which can then be used to find the minimum.

The variance of the bin widths, with n = N denoting the number of bins, is given by:

$V=\frac1n\sum_{i=0}^{n-1}(w_i-\bar{w})^2=\frac1n\sum_{i=0}^{n-1}w_i^2-\bar{w}^2$

The mean width in the above formula is given by:

$\bar{w}=\frac1n\sum_{i=0}^{n-1}w_i=\frac1n(e_n-e_0)$
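Both formulas translate directly into code. The following is a minimal sketch, assuming the edges are already available in an array e[] with n + 1 entries; the function name bin_width_variance is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: variance of the n bin widths w[i] = e[i+1] - e[i],
   computed as mean(w^2) - mean(w)^2. Returns 0 for invalid input. */
double bin_width_variance(int n, const double *e)
{
    if (n < 1 || e == NULL)
        return 0.0;

    double sum_sq = 0.0;
    for (int i = 0; i < n; ++i) {
        const double w = e[i + 1] - e[i];
        sum_sq += w * w;
    }

    /* The sum of all widths telescopes to e[n] - e[0] */
    const double mean = (e[n] - e[0]) / n;
    return sum_sq / n - mean * mean;
}
```

For the edges 4, 37, 121, 200, 234, 300, 365 of the example further below, the mean width is 361 / 6 and the variance comes out at about 400.47.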

Differentiating the variance with respect to an arbitrary quantity z gives:

$\frac{dV}{dz}=\frac2n\sum_{i=0}^{n-1}w_i\frac{dw_i}{dz}-\frac2n\bar{w}\sum_{i=0}^{n-1}\frac{dw_i}{dz}$

The first relation derived above is now applied iteratively to express every w_i in terms of e₀ (that is, z = e₀), starting from w_0 = 2x_0 - 2e_0. For example:

$w_i=2x_i-2x_{i-1}-w_{i-1}=2x_i-4x_{i-1}+2x_{i-2}+w_{i-2}=\dots=(-1)^{i+1}\cdot2e_0+4\sum_{j=0}^{i-1}(-1)^{i-j}x_j+2x_i$
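As a worked instance of this step: in the closed form, each width depends on e₀ only through its first term, so the derivatives are constants with alternating sign. Inserting them into the derivative of the variance and setting it to zero gives the conditions below. This is a sketch that follows from the definitions above; it is not taken from the original derivation.

```latex
\frac{dw_i}{de_0} = (-1)^{i+1}\cdot 2

\frac{dV}{de_0} = \frac4n\sum_{i=0}^{n-1}(-1)^{i+1}w_i
                - \frac4n\,\bar{w}\sum_{i=0}^{n-1}(-1)^{i+1}

% even n: the second sum vanishes, so dV/de_0 = 0 means
\sum_{i=0}^{n-1}(-1)^{i}w_i = 0
\quad\Longleftrightarrow\quad
w_0 + w_2 + \dots + w_{n-2} = w_1 + w_3 + \dots + w_{n-1}

% odd n: the second sum equals -1, so dV/de_0 = 0 means
\sum_{i=0}^{n-1}(-1)^{i}w_i = \bar{w}
```

The second sum behaves differently for even and odd n, which is where the two separate formulas for e₀ come from.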

Setting the derivative to zero and inserting the above expressions yields the value of e₀:

When N is odd, then

$e_0=\frac1{n^2-1}\sum_{i=0}^{n-1}(-1)^ix_i(2n^2-2in-n-1)$

When N is even, then

$e_0=\frac1n\sum_{i=0}^{n-1}(-1)^ix_i(2n-2i-1)$
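The two closed forms can be evaluated directly in floating point. The following is a minimal sketch under that assumption; the function name min_variance_e0 is hypothetical, and the implementation later in this article uses an integer trick instead of the explicit case distinction shown here.

```c
/* Hypothetical helper: minimum-variance left edge e0 from the two
   closed forms, evaluated in floating point. Requires n >= 2. */
double min_variance_e0(int n, const double *x)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        /* Weight of x[i]: 2n^2 - 2in - n - 1 for odd n,
           2n - 2i - 1 for even n */
        const double weight = (n & 1)
            ? 2.0 * n * n - 2.0 * i * n - n - 1.0
            : 2.0 * n - 2.0 * i - 1.0;

        /* Alternating sign (-1)^i */
        sum += (i & 1) ? -weight * x[i] : weight * x[i];
    }

    /* Divisor: n^2 - 1 for odd n, n for even n */
    return sum / ((n & 1) ? (double)n * n - 1.0 : (double)n);
}
```

With the six centers 20.5, 79, 160.5, 217, 267, 332.5 of the example further below, this evaluates to 21.5 / 6 ≈ 3.583, which agrees with the first value of e_rec0 in the printed output.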

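Alternatively, choosing z = e_N gives a simpler formula:

When N is odd, then

$e_N = \frac1{N^2-1}\sum_{i=0}^{N-1}(-1)^ix_i(2iN + N - 1)$

When N is even, then

$e_N = \frac1N\sum_{i=0}^{N-1}(-1)^{i+1}x_i(2i + 1)$

In both cases the last point x_{N-1} enters with a positive weight; for even N this means that the alternation starts with a minus sign.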

Approach: The idea is to run two consecutive loops: the first computes e₀ or e_N according to the derived formula, and the second fills in the remaining elements of the array e[]. Both loops work iteratively, each step building on the result of the previous one; the second loop derives every element of e[] from its neighbour using the second relation given above. The time complexity is therefore O(N) for all variants.

Note:

The rounding effect of integer division is used to handle both the odd and the even case of N with the same code.
The C functions also work in C++ if the keyword register is omitted from the implementation below.
The program requires a compiler that supports C99 (for C) or C++14 (for C++).
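The first note can be made concrete. In the implementation below, m is n for odd n and 2 for even n, j is m * n, and the loop counter starts at m / 2 and advances by m. The following minimal sketch compares the magnitude of the weight that this integer arithmetic assigns to x[i] in the e_N variant with the magnitude written in the closed forms; both helper names are hypothetical.

```c
/* Hypothetical helpers: magnitude of the weight of x[i] in e_N. */

/* As produced by the integer arithmetic of the implementation */
double weight_by_integer_trick(int n, int i)
{
    const int m = (n & 1) ? n : 2;  /* step of the loop counter */
    const int j = m * n;            /* n*n (odd n) or 2*n (even n) */

    /* m / 2 and j / 2 truncate: they are (n-1)/2 and (n*n-1)/2
       for odd n, and 1 and n for even n */
    return (double)(m / 2 + i * m) / (double)(j / 2);
}

/* As written in the closed forms for e_N */
double weight_by_formula(int n, int i)
{
    return (n & 1)
        ? (2.0 * i * n + n - 1.0) / ((double)n * n - 1.0)
        : (2.0 * i + 1.0) / n;
}
```

For n = 3 and i = 0, for example, both functions return 0.25, and for n = 6 and i = 0 both return 1/6.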

Below is the implementation of the above approach:

C

// C program for the above approach 
#include <stdio.h> 
#include <stdlib.h> 
#define N 6 
  
// Function to fill the array elements 
// e[] from the end 
double* pointsToIntervalsN( 
    int n, const double* x, 
    double* e) 
{ 
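    // Reject invalid input and an output buffer e that overlaps x
    // such that inputs would be overwritten before they are read
    if (n < 2 || !x || (e && e < x && e + n >= x))
        return NULL;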
  
    // If e is a NULL pointer, then 
    // allocate the array 
    if (e 
        || (e 
            = (double*)malloc( 
                (n + 1) * sizeof(double)))) { 
  
        // Find the value of m on the 
        // basis of odd or even value of N 
        const int m = n & 1 ? n : 2; 
        const int j = m * n; 
        register double sum = 0.; 
  
        // Accumulate the alternating weighted sum:
        // i counts up while x is read from the front
        for (int i = m / 2; i < j; i += m) { 
            sum = i * *x++ - sum; 
        } 
        sum /= j / 2; 
  
        // Note: m/2 and j/2 above are 
        // integer divisions! 
        for (e[n] = sum; n--; e[n] = sum) 
            sum = 2 * *--x - sum; 
    } 
  
    // Including e==NULL for the case 
    // of malloc error 
    return e; 
} 
  
// Function to fill the output array 
// from the front 
double* pointsToIntervals0(const int n, 
                           const double* x, 
                           double* e) 
{ 
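    // Reject invalid input and an output buffer e that overlaps x
    // such that inputs would be overwritten before they are read
    if (n < 2 || !x || (e && e >= x && e < x + n))
        return NULL;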
  
    if (e 
        || (e 
            = (double*)malloc( 
                (n + 1) * sizeof(double)))) { 
  
        const int m = n & 1 ? n : 2; 
        const int j = m * n; 
        register double sum = 0.; 
  
        // Accumulate the alternating weighted sum:
        // i counts up while x is read from the back
        x += n; 
  
        for (int i = m / 2; i < j; 
             i += m) { 
            sum = i * *--x - sum; 
        } 
  
        // Update the value of sum 
        sum /= j / 2; 
  
        *e = sum; 
        for (int i = 0; i < n; 
             e[++i] = sum) 
            sum = 2 * x[i] - sum; 
    } 
  
    // Return the updated e 
    return e; 
} 
  
// Function to derive all e values
// from a single fixed e value (e_base)
double* pointsToIntervalsFix(const int n, 
                             const double* x, 
                             double e_base, 
                             double* e) 
{ 
    // Base Case 
    if (n < 1 || !x) 
        return NULL; 
  
    int k = 0; 
  
    // Perform Binary Search for e_base 
    for (int l = n; l > 1; l /= 2) 
        if (e_base > x[k + l / 2]) 
            k += (l + 1) / 2; 
  
    // At this point e_base is either the left
    // or the right edge of the bin around x[k]
    if (e_base > x[k])
        ++k;

    // Now e_base is the edge e[k]

    // Reject an output buffer e that overlaps x
    // such that inputs would be overwritten
    // before they are read
    if (e && e + k >= x && e < x + n)
        return NULL;

    // If e is a NULL pointer, then
    // allocate the array
    if (e || (e = (double*)malloc( 
                  (n + 1) * sizeof(double)))) { 
        e[k] = e_base; 
  
        // Fill in both sides of array 
        // e[] starting from k 
        for (int i = k; i--; e[i] = e_base) 
            e_base = 2 * x[i] - e_base; 
  
        for (e_base = e[k]; k < n; 
             e[++k] = e_base) 
            e_base = 2 * x[k] - e_base; 
    } 
  
    return e; 
} 
  
// Driver Code 
int main() 
{ 
    double e_orig[N + 1] 
        = { 4, 37, 121, 200, 234, 300, 365 }; 
    double x[N], e_recN[N + 1], e_rec0[N + 1]; 
    double e_base = 235.4, e_fix[N + 1]; 
  
    // Make x the mean values of the 
    // neighbouring e_orig values: 
    for (int i = N; i--; 
         x[i] = (e_orig[i + 1] + e_orig[i]) / 2) 
        ; 
  
    // Function Call 
    pointsToIntervalsN(N, x, e_recN); 
    pointsToIntervals0(N, x, e_rec0); 
    pointsToIntervalsFix(N, x, e_base, e_fix); 
  
    printf("Example for n = %d:", N); 
    printf("\nx     "); 
    for (int i = 0; i < N; ++i) 
        printf("% .3f", x[i]); 
  
    printf("\ne_orig "); 
  
    for (int i = 0; i <= N; ++i) 
        printf("% .3f", e_orig[i]); 
  
    printf("\ne_recN "); 
  
    for (int i = 0; i <= N; ++i) 
        printf("% .3f", e_recN[i]); 
  
    printf("\ne_rec0 "); 
  
    for (int i = 0; i <= N; ++i) 
        printf("% .3f", e_rec0[i]); 
  
    printf("\ne_fix  "); 
  
    for (int i = 0; i <= N; ++i) 
        printf("% .3f", e_fix[i]); 
  
    return 0; 
} 

Output:

Example for n = 6:
x      20.500 79.000 160.500 217.000 267.000 332.500
e_orig  4.000 37.000 121.000 200.000 234.000 300.000 365.000
e_recN  3.583 37.417 120.583 200.417 233.583 300.417 364.583
e_rec0  3.583 37.417 120.583 200.417 233.583 300.417 364.583
e_fix   5.400 35.600 122.400 198.600 235.400 298.600 366.400

Caveats and Prospects:

Both methods, minimum variance and fixed edge, occasionally fail in such a way that one or more histogram bins with negative width are obtained, i.e., some edges satisfy e_i > e_{i+1} and appear not to be properly sorted. This typically happens when arbitrary x values are taken as input.
Try fixing the bin with the "most negative" width of all, hoping that all others will then look reasonable too.
This brings us to the prospects: the two methods illustrated above are by far not the only possible ways to formulate an extra condition. Instead of bin widths that are "as equal as possible", one could assume a tendency such as a linear increase of the w_i, or an absolute minimum for all w_i.
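Detecting the failure described in the caveats is straightforward once the edges are known. The following is a minimal sketch, assuming the reconstructed edges are in an array e[] with n + 1 entries; the function name most_negative_bin is hypothetical, and the sketch only locates the offending bin without repairing it.

```c
/* Hypothetical helper: index of the bin with the most negative width
   e[i+1] - e[i], or -1 if every bin has non-negative width. */
int most_negative_bin(int n, const double *e)
{
    int worst = -1;
    double worst_width = 0.0;

    for (int i = 0; i < n; ++i) {
        const double w = e[i + 1] - e[i];
        if (w < worst_width) {
            worst_width = w;
            worst = i;
        }
    }
    return worst;
}
```

For the edges 1.5, -1.5, 3.5, 16.5, which is what the minimum-variance formula for odd N yields for the centers 0, 1, 10, the widths are -3, 5, 13 and the function returns 0. For the properly sorted edges of the example above it returns -1.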
