True Error vs Sample Error

26 July 2024


True Error

The true error is the probability that a hypothesis misclassifies a single instance drawn at random from the population. Here the population is the full set of possible instances, not only the data that has been collected.

Consider a hypothesis h(x) and the true (target) function f(x) over a population P. The true error is the probability that h misclassifies an instance x drawn at random from P:

$T.E. = Prob[f(x) \neq h(x)]$
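
The definition can be checked numerically when the population is fully known. The sketch below is only an illustration under invented assumptions: x is uniform on [0, 1), the target `f` labels x > 0.5 as positive, and the hypothesis `h` uses the threshold 0.6 instead. These functions are hypothetical and are not part of the text above. Under those assumptions h and f disagree only when 0.5 < x <= 0.6, so the true error is 0.1, and a large random draw should come close to it.

```python
import random


def f(x):
    """Assumed target function: positive when x > 0.5."""
    return x > 0.5


def h(x):
    """Assumed hypothesis: same rule with a slightly wrong threshold."""
    return x > 0.6


def estimate_true_error(n_draws, seed=0):
    """Approximate Prob[f(x) != h(x)] by drawing x uniformly from [0, 1)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_draws):
        x = rng.random()
        if f(x) != h(x):
            mistakes += 1
    return mistakes / n_draws


approx_true_error = estimate_true_error(200_000)
print(approx_true_error)  # close to 0.1 under the assumptions above
```

In practice the population is not available like this, which is why the sample error is used as a stand-in.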

Sample Error

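The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples in S that h misclassifies.

$S.E. = \frac{1}{n} \sum_{x \in S}\delta(f(x) \neq h(x))$

where n is the number of examples in S and $\delta(\cdot)$ is 1 when its argument is true and 0 otherwise.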

$\text{Sample Error} = \frac{\text{Number of misclassified instances}}{\text{Total number of instances}}$

Equivalently, in terms of the confusion-matrix counts (true positives TP, false positives FP, false negatives FN, true negatives TN):

$S.E. = \frac{FP + FN}{TP + FP + FN + TN}$
$S.E. = 1 - \frac{TP + TN}{TP + FP + FN + TN}$
$S.E. = 1 - Accuracy$
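
The identity between the count-based formula and 1 - Accuracy can be confirmed with a few lines of code. This is a minimal sketch: the function names and the four counts are hypothetical values chosen for illustration, not data from the text.

```python
import math


def sample_error_from_counts(tp, fp, fn, tn):
    """Sample error as (FP + FN) / (TP + FP + FN + TN)."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one example is required")
    return (fp + fn) / total


def accuracy_from_counts(tp, fp, fn, tn):
    """Accuracy as (TP + TN) / (TP + FP + FN + TN)."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one example is required")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 15, 4, 3, 11

se = sample_error_from_counts(tp, fp, fn, tn)
acc = accuracy_from_counts(tp, fp, fn, tn)

# The two expressions agree up to floating-point rounding.
print(math.isclose(se, 1 - acc))
```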

Suppose hypothesis h misclassifies 7 of the 33 examples in the sample S. Then the sample error is:

$S.E. = \frac{7}{33} \approx 0.21$
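
The same number falls out of the summation form of the sample error when it is applied to label lists. The sketch below assumes made-up labels: `y_true` plays the role of f(x) and `y_pred` the role of h(x), arranged so that exactly 7 of 33 predictions are wrong. The helper name `sample_error` is an assumption for this example.

```python
def sample_error(y_true, y_pred):
    """Fraction of positions where the prediction differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    # The comparison t != p plays the role of the indicator delta(f(x) != h(x)).
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)


# Made-up labels: 33 examples, the first 7 predicted incorrectly.
y_true = [1] * 33
y_pred = [0] * 7 + [1] * 26

error_7_of_33 = sample_error(y_true, y_pred)
print(round(error_7_of_33, 2))  # 0.21
```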

Bias & Variance

Bias: Bias is the difference between the average prediction of the hypothesis (averaged over the training sets it could have been learned from) and the correct value. A high-bias hypothesis oversimplifies the problem because the model is too simple to capture the underlying pattern. It tends to have high training error and high test error.

$Bias = E[h(x)]- f(x)$

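Variance: Variance measures how much the predictions of the hypothesis change when it is learned from different training samples. A high-variance hypothesis is overly complex: it follows the training data too closely and does not generalize well to new data.

$Variance = E[(h(x) - E[h(x)])^2]$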
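
Both quantities can be approximated by simulation when the target function is known. The sketch below is an illustration under assumptions that are not in the text: the target is f(x) = x^2 on [0, 1], training labels carry Gaussian noise, and the learner is a deliberately simple constant model that always predicts the mean training label. The expectations in the two formulas are replaced by averages over many independently drawn training sets, evaluated at one query point `x0`.

```python
import random
import statistics


def f(x):
    """Assumed target function."""
    return x * x


def train_constant_model(rng, n=20, noise=0.1):
    """Learn a constant hypothesis: always predict the mean training label."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, noise) for x in xs]
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


rng = random.Random(0)
x0 = 0.9  # query point at which bias and variance are measured

# Predictions h(x0) from hypotheses learned on 2000 different training sets.
predictions = [train_constant_model(rng)(x0) for _ in range(2000)]

bias = statistics.fmean(predictions) - f(x0)   # E[h(x0)] - f(x0)
variance = statistics.pvariance(predictions)   # E[(h(x0) - E[h(x0)])^2]

print(f"bias={bias:.3f}, variance={variance:.4f}")
```

With these choices the constant model should show a large negative bias at `x0` and a small variance, which is the oversimplifying behaviour described for high-bias hypotheses.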

Confidence Interval

The true error generally cannot be computed directly, because the whole population is not available. It can instead be estimated with a confidence interval that is a function of the sample error.

Below are the steps for building the confidence interval:

Randomly draw a sample S of n examples (independently of each other) from the population P, where n should be greater than 30.
Calculate the sample error of h on S.

Here we assume that the sample error is an unbiased estimator of the true error. With approximately the chosen level of confidence, the true error then lies in the interval:

$T.E. = S.E. \pm z_{s} \sqrt{\frac{S.E. (1- S.E.)}{n}}$

where $z_{s}$ is the z-score corresponding to a confidence level of s percent:

Confidence level (%)	50	80	90	95	99	99.5
Z-score	0.67	1.28	1.64	1.96	2.58	2.81
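
The interval formula can be sketched as a small helper. This is an illustrative implementation, not a reference one: the function name `true_error_interval` is an assumption, the lookup holds only a few of the z-scores tabulated above, and clamping the bounds to [0, 1] is an added design choice because an error rate cannot fall outside that range. The helper does not enforce n > 30, so the caller is responsible for that condition.

```python
import math

# Two-sided z-scores for a few confidence levels (in percent).
Z_SCORES = {50: 0.67, 80: 1.28, 90: 1.64, 95: 1.96, 99: 2.58}


def true_error_interval(sample_error, n, confidence=95):
    """Return (low, high) bounds: S.E. +/- z * sqrt(S.E. * (1 - S.E.) / n)."""
    if not 0.0 <= sample_error <= 1.0:
        raise ValueError("sample_error must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    if confidence not in Z_SCORES:
        raise ValueError(f"no z-score stored for {confidence}% confidence")
    z = Z_SCORES[confidence]
    half_width = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    # Clamp to [0, 1] (an assumption of this sketch, not part of the formula).
    low = max(0.0, sample_error - half_width)
    high = min(1.0, sample_error + half_width)
    return low, high


# 7 misclassified examples out of a sample of 33, at 95% confidence.
low, high = true_error_interval(7 / 33, 33, confidence=95)
print(f"{low:.3f} to {high:.3f}")
```

For a sample error of 7/33 on 33 examples, the 95% interval runs from roughly 0.07 to 0.35, which shows how wide the estimate remains for a sample this small.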