Data binning, also called bucketing, is a data pre-processing method used to reduce the effect of small observation errors. The original values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin, such as the bin mean or median. This smooths the input data and may also reduce the chance of overfitting on small datasets.
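The replacement step can be sketched as follows. This is a minimal illustration, assuming the bins have already been formed and that the bin mean is used as the representative value; the function name `smooth_by_bin_means` and the sample bins are hypothetical.

```python
from statistics import mean


def smooth_by_bin_means(bins):
    """Replace every value in each bin with that bin's mean."""
    smoothed = []
    for b in bins:
        # An empty bin has no mean, so it is passed through unchanged.
        smoothed.append([mean(b)] * len(b) if b else [])
    return smoothed


# Illustrative bins only; any list of lists of numbers works.
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))
```

Swapping `mean` for `statistics.median` gives smoothing by bin medians with no other change.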
There are two methods of dividing data into bins:
- Equal Frequency Binning: every bin holds the same number of values.
- Equal Width Binning: every bin spans the same width w = (max - min) / (no of bins), so the bin boundaries are min, min + w, min + 2w, ..., min + nw, where n is the number of bins.
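The boundary formula for equal width binning can be worked through in a few lines. This is a sketch only; the helper name `equal_width_edges` is an assumption made for illustration, and the width is kept as a float so that the last boundary lands on the maximum.

```python
def equal_width_edges(values, n_bins):
    """Return the n_bins + 1 boundaries min, min + w, ..., min + n_bins * w."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins  # bin width, kept as a float
    return [lo + i * w for i in range(n_bins + 1)]


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
# w = (215 - 5) / 3 = 70
print(equal_width_edges(data, 3))  # [5.0, 75.0, 145.0, 215.0]
```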
Equal frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] [15, 35, 50, 55] [72, 92, 204, 215]
Equal Width:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215]
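The equal width result is uneven because most of the values sit near the low end of the range. A small sketch that counts values per bin makes this visible; it assumes half-open bins with the maximum placed in the last bin, every value lying within the boundaries, and a hypothetical helper named `bin_index`.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, edges):
    """Index of the bin [edges[i], edges[i + 1]) that holds value.

    Assumes edges[0] <= value <= edges[-1]; the maximum is placed in the last bin.
    """
    i = bisect_right(edges, value) - 1
    return min(i, len(edges) - 2)


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
edges = [5, 75, 145, 215]  # boundaries for 3 bins of width 70
counts = Counter(bin_index(v, edges) for v in data)
print(sorted(counts.items()))  # [(0, 9), (1, 1), (2, 2)]
```

Equal frequency binning on the same data would put four values in every bin, at the cost of bins with very different widths.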
Code: Implementation of the binning technique
Python
# equal frequency
def equifreq(arr1, m):
    # sort a copy so that each bin holds consecutive values
    arr1 = sorted(arr1)
    # base bin size; the first `extra` bins take one more value,
    # so no element is dropped when len(arr1) is not divisible by m
    n, extra = divmod(len(arr1), m)
    start = 0
    for i in range(m):
        end = start + n + (1 if i < extra else 0)
        print(arr1[start:end])
        start = end


# equal width
def equiwidth(arr1, m):
    min1 = min(arr1)
    # keep the width as a float so the last boundary reaches max(arr1)
    w = (max(arr1) - min1) / m
    arri = [[] for _ in range(m)]
    for j in arr1:
        # half-open bins [min1 + i*w, min1 + (i+1)*w); the maximum goes
        # in the last bin, and w == 0 (all values equal) maps to bin 0
        i = min(int((j - min1) / w), m - 1) if w else 0
        arri[i].append(j)
    print(arri)


# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# no of bins
m = 3

print("equal frequency binning")
equifreq(data, m)

print("\n\nequal width binning")
equiwidth(data, m)
Output:
equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]


equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]