Thursday, January 9, 2025
Google search engine
HomeData Modelling & AIHistogram Reconstruction

Histogram Reconstruction

In this article, we will discuss “Diagram to Histogram” also known as “Interval Finding”. While dealing with statistical data, diagrams are represented with single points(or the corresponding numbers) like the stars, which is supposed to be a histogram with a certain width as shown in the below figure.

Analyzing the Problem:

In this problem statement, the assumption is that all intervals stick seamlessly together i.e., there are no gaps and no overlaps. The right edge of a bin is identical to the left edge of the following bin. Given N points, thus N bins, the task is to find (N + 1) bin edges. Every given point is located in the exact center of its X interval. This gives N equations for (N + 1) unknown quantities, so the system is underdetermined. There are two suggestions:

  • The bins shall be as uniform. Mathematically, the variance of their bin widths shall be minimized.
  • One supplies directly one fixed value for a bin edge, only the others are derived from that.

Calculations:

Here are few assumptions:

  • Let xi be the coordinates of the point (their y values are completely irrelevant for us).
  • wi the respective bin widths.
  • ei the coordinates of the edges of the bin.
  • Assume that the xi is sorted in ascending order.

Below is the image to illustrate the above concepts:

Relation between considered quantities

From the above representation, two simple relations that can be easily verified are:

x_{i + 1} - x_i = \frac12(w_{i + 1} + w_i)

x_i = \frac12(e_i + e_{i + 1})

The above formula simplifies the calculations especially for even N and is the reason why different results are obtained for odd and even values of N. The latter is directly the recursion formula that needs to fill in the array e[](the points on the X-axis of the histogram).

How to minimize the Variance?

The idea is similar to all the minimization problems i.e., check for the derivative 0. The problem is to reduce its formula to only one unknown variable that we can then work with it to find the minimum value.

  • The variance is given by:

V=\frac1n\sum_{i=0}^{n-1}(w_i-\bar{w})^2=\frac1n\sum_{i=0}^{n-1}w_i^2-\bar{w}^2

  • The value of mean value in the above formula is given by:

\bar{w}=\frac1n\sum_{i=0}^{n-1}w_i=\frac1n(e_n-e_0)

  • Derive the above equation with respect to an arbitrary quantity z as:

 \frac{dV}{dz}=\frac2n\sum_{i=0}^{n-1}w_i\frac{dw_i}{dz}-\frac2n\bar{w}\sum_{i=0}^{n-1}\frac{dw_i}{dz}

  • Applying iteratively the first equation derived above by replacing wi with (z = e0). For Example:

w_i=2x_i-2x_{i-1}-w_{i-1}=2x_i-4x_{i-1}+2x_{i-2}+w_{i-2}=\dots=(-1)^{i+1}\cdot2e_0+4\sum_{j=0}^{i-1}(-1)^{i-j}x_j+2x_i

  • Putting all the above values obtained to find the value of e0:

When N is odd, then

e_0=\frac1{n^2-1}\sum_{i=0}^{n-1}(-1)^ix_i(2n^2-2in-n-1)

When N is even, then

e_0=\frac1n\sum_{i=0}^{n-1}(-1)^ix_i(2n-2i-1)

  • Alternatively, using the value of z = eN, gives a more simple formula as:

When N is odd, then

e_N = \frac1{N^2-1}\sum_{i=0}^{n-1}(-1)^ix_i(2iN + N - 1)

When N is even, then

e_N = \frac1n\sum_{i = 0}^{N - 1}(-1)^ix_i(2i + 1)

Approach: The idea to solve the given problem is to iterate two nested loops, one for find the value of e0 or eN according to the derived formula and another loop for finding the elements of the array e[]. Therefore, the time complexity will be O(N) for all variants. Both loops work recursively and the second loop find the elements of the array e[] from the previous according to the second equation given above.

Note:

Below is the implementation of the above approach:

C




// C program for the above approach
#include <stdio.h>
#include <stdlib.h>
#define N 6
  
// Function to fill the array elements
// e[] from the end
double* pointsToIntervalsN(
    int n, const double* x,
    double* e)
{
    // Check for array overlap
    if (n < 2 || !x || e < x && e + n >= x)
        return NULL;
  
    // If e is a NULL pointer, then
    // allocate the array
    if (e
        || (e
            = (double*)malloc(
                (n + 1) * sizeof(double)))) {
  
        // Find the value of m on the
        // basis of odd or even value of N
        const int m = n & 1 ? n : 2;
        const int j = m * n;
        register double sum = 0.;
  
        // Count i and x downwards
        for (int i = m / 2; i < j; i += m) {
            sum = i * *x++ - sum;
        }
        sum /= j / 2;
  
        // Note: m/2 and j/2 above are
        // integer divisions!
        for (e[n] = sum; n--; e[n] = sum)
            sum = 2 * *--x - sum;
    }
  
    // Including e==NULL for the case
    // of malloc error
    return e;
}
  
// Function to fill the output array
// from the front
double* pointsToIntervals0(const int n,
                           const double* x,
                           double* e)
{
    // Check for overlaps
    if (n < 2 || !x || e >= x && e < x + n)
        return NULL;
  
    if (e
        || (e
            = (double*)malloc(
                (n + 1) * sizeof(double)))) {
  
        const int m = n & 1 ? n : 2;
        const int j = m * n;
        register double sum = 0.;
  
        // Count i down and x
        // from the front
        x += n;
  
        for (int i = m / 2; i < j;
             i += m) {
            sum = i * *--x - sum;
        }
  
        // Update the value of sum
        sum /= j / 2;
  
        *e = sum;
        for (int i = 0; i < n;
             e[++i] = sum)
            sum = 2 * x[i] - sum;
    }
  
    // Return the updated e
    return e;
}
  
// Function to find thefixed single
// e value from which all other e's
// are derived
double* pointsToIntervalsFix(const int n,
                             const double* x,
                             double e_base,
                             double* e)
{
    // Base Case
    if (n < 1 || !x)
        return NULL;
  
    int k = 0;
  
    // Perform Binary Search for e_base
    for (int l = n; l > 1; l /= 2)
        if (e_base > x[k + l / 2])
            k += (l + 1) / 2;
  
    // The e_base is either the left
    // or the right edge of the bin
    // around x[k]
    if (e_base > x[k])
        ++k;
  
    // Now it's the left.
  
    // Assume e is filled the left side
    // first, the right side of e can
    // overlap with x
    if (e + k >= x && e < x + n)
        return NULL;
  
    // If the right side is filled
    // first, so that the left side
    // of e can overlap with x
    if (e || (e = (double*)malloc(
                  (n + 1) * sizeof(double)))) {
        e[k] = e_base;
  
        // Fill in both sides of array
        // e[] starting from k
        for (int i = k; i--; e[i] = e_base)
            e_base = 2 * x[i] - e_base;
  
        for (e_base = e[k]; k < n;
             e[++k] = e_base)
            e_base = 2 * x[k] - e_base;
    }
  
    return e;
}
  
// Driver Code
int main()
{
    double e_orig[N + 1]
        = { 4, 37, 121, 200, 234, 300, 365 };
    double x[N], e_recN[N + 1], e_rec0[N + 1];
    double e_base = 235.4, e_fix[N + 1];
  
    // Make x the mean values of the
    // neighbouring e_orig values:
    for (int i = N; i--;
         x[i] = (e_orig[i + 1] + e_orig[i]) / 2)
        ;
  
    // Function Call
    pointsToIntervalsN(N, x, e_recN);
    pointsToIntervals0(N, x, e_rec0);
    pointsToIntervalsFix(N, x, e_base, e_fix);
  
    printf("Example for n = %d:", N);
    printf("\nx     ");
    for (int i = 0; i < N; ++i)
        printf("% .3f", x[i]);
  
    printf("\ne_orig ");
  
    for (int i = 0; i <= N; ++i)
        printf("% .3f", e_orig[i]);
  
    printf("\ne_recN ");
  
    for (int i = 0; i <= N; ++i)
        printf("% .3f", e_recN[i]);
  
    printf("\ne_rec0 ");
  
    for (int i = 0; i <= N; ++i)
        printf("% .3f", e_rec0[i]);
  
    printf("\ne_fix  ");
  
    for (int i = 0; i <= N; ++i)
        printf("% .3f", e_fix[i]);
  
    return 0;
}


Output:

Example for n = 6:
x      20.500 79.000 160.500 217.000 267.000 332.500
e_orig  4.000 37.000 121.000 200.000 234.000 300.000 365.000
e_recN  3.583 37.417 120.583 200.417 233.583 300.417 364.583
e_rec0  3.583 37.417 120.583 200.417 233.583 300.417 364.583
e_fix   5.400 35.600 122.400 198.600 235.400 298.600 366.400

Caveats and Prospects:

  • Both the methods i.e., the minimum variance and fixed edge occasionally fail in such a way that one or more histogram bins with negative width is obtained, i.e., there are some ei > e(i + 1) seemingly not properly sorted. This always happens when any random X values is taken as input.
  • Try fixing the bin with the “most negative” width of all, hoping that all others will then look reasonable too.
  • This brings us to the prospects because the two methods illustrated above are by far not the only possibilities to formulate an extra condition. Instead of “possibly equal” bin widths, assume a tendency like linear increase of the wi or an absolute minimum of all wi.
Feeling lost in the world of random DSA topics, wasting time without progress? It’s time for a change! Join our DSA course, where we’ll guide you on an exciting journey to master DSA efficiently and on schedule.
Ready to dive in? Explore our Free Demo Content and join our DSA course, trusted by over 100,000 neveropen!

RELATED ARTICLES

Most Popular

Recent Comments