Friday, December 27, 2024

True Error vs Sample Error

True Error

The true error is the probability that the hypothesis misclassifies a single instance drawn at random from the population. Here the population represents all possible data for the problem, not only the examples that were collected.

Consider a hypothesis h(x) and a true/target function f(x) defined over a population P. The true error is the probability that h misclassifies an instance x drawn at random from P:

T.E. = Prob[f(x) \neq h(x)]
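The definition above can be made concrete with a small simulation. The sketch below is only an illustration: the target function, the hypothesis, and the uniform population on [0, 1) are all hypothetical choices, picked so that the true error is known in advance (0.1) and can be compared with the simulated value.

```python
import random


def target(x: float) -> bool:
    """Hypothetical target function f: positive when x > 0.5."""
    return x > 0.5


def hypothesis(x: float) -> bool:
    """Hypothetical hypothesis h: positive when x > 0.6 (threshold slightly off)."""
    return x > 0.6


def estimate_true_error(n_draws: int, seed: int = 0) -> float:
    """Approximate Prob[f(x) != h(x)] by drawing x uniformly from [0, 1)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_draws):
        x = rng.random()
        if target(x) != hypothesis(x):
            mistakes += 1
    return mistakes / n_draws


# f and h disagree only for x in (0.5, 0.6], so the true error is 0.1.
# The simulated value should be close to that.
print(estimate_true_error(200_000))
```

In practice the population cannot be sampled without limit like this, which is why the true error has to be estimated from a finite sample.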

Sample Error

The sample error of h with respect to the target function f and the data sample S is the proportion of examples in S that h misclassifies.

S.E. = \frac{1}{n} \sum_{x \in S}\delta(f(x) \neq h(x))

where n is the number of examples in S, and \delta(f(x) \neq h(x)) is 1 if f(x) \neq h(x) and 0 otherwise. Equivalently:

Sample \, Error = \frac{Number\, of\, misclassified \, instances}{Total \, Number \, of \, Instances}

In terms of the confusion-matrix counts (TP, FP, FN, TN), the sample error can also be written as:

  • S.E.  = \frac{FP + FN}{TP + FP + FN + TN}
  • S.E.  = 1 - \frac{TP + TN}{TP + FP + FN + TN}
  • S.E. = 1- Accuracy

Suppose hypothesis h misclassifies 7 of the 33 examples in the sample S. Then the sample error is:

S.E. = \frac{7}{33} \approx 0.21
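The sample error formulas can be sketched in a few lines of Python. The function names and the label lists below are hypothetical; the labels are constructed so that 7 of 33 examples are misclassified (4 false negatives and 3 false positives), matching the worked example.

```python
def sample_error(true_labels, predicted_labels):
    """Fraction of examples where the prediction differs from the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("sample must not be empty")
    misclassified = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return misclassified / len(true_labels)


def sample_error_from_counts(tp, fp, fn, tn):
    """Sample error from confusion-matrix counts: (FP + FN) / total."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix must not be empty")
    return (fp + fn) / total


# Hypothetical sample: 20 positives and 13 negatives.
true_labels = [1] * 20 + [0] * 13
# 4 positives predicted as 0 (FN) and 3 negatives predicted as 1 (FP).
predicted_labels = [0] * 4 + [1] * 16 + [1] * 3 + [0] * 10

print(round(sample_error(true_labels, predicted_labels), 2))
print(round(sample_error_from_counts(tp=16, fp=3, fn=4, tn=10), 2))
```

Both calls compute the same quantity, once from the raw labels and once from the confusion-matrix counts.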

Bias & Variance

Bias: Bias is the difference between the average prediction of the hypothesis and the correct value. A high-bias hypothesis oversimplifies the problem: the model is too simple to capture the underlying pattern, so it underfits and tends to have both high training error and high test error.

Bias = E[h(x)]- f(x)

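Variance: Variance measures how much the predictions of the hypothesis change from one training set to another. A high-variance hypothesis is overly complex: it fits the training data very closely, including its noise, and does not generalize well to unseen data.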

Var(X)  = E[(X - E[X])^2]
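The two formulas above can be approximated numerically by retraining a learner on many independently drawn training sets and looking at its predictions at one fixed point. The sketch below is a minimal illustration under assumed conditions: the target f(x) = x^2, the noise level, and the deliberately simple "constant" learner are all hypothetical, chosen to show a high-bias, low-variance case.

```python
import random
import statistics


def target(x: float) -> float:
    """Hypothetical target function f(x) = x^2."""
    return x * x


def fit_constant(xs, ys):
    """Hypothetical high-bias learner: always predicts the mean training label."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def bias_and_variance(x0, n_train=20, n_repeats=2000, noise=0.1, seed=0):
    """Estimate Bias = E[h(x0)] - f(x0) and Var = E[(h(x0) - E[h(x0)])^2]."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        # Draw a fresh noisy training set and retrain the learner on it.
        xs = [rng.random() for _ in range(n_train)]
        ys = [target(x) + rng.gauss(0, noise) for x in xs]
        h = fit_constant(xs, ys)
        predictions.append(h(x0))
    mean_prediction = statistics.fmean(predictions)
    bias = mean_prediction - target(x0)
    variance = statistics.pvariance(predictions, mu=mean_prediction)
    return bias, variance


bias, variance = bias_and_variance(x0=0.9)
print(f"bias = {bias:.3f}, variance = {variance:.4f}")
```

Because the constant learner ignores x, its average prediction stays near 1/3 while f(0.9) = 0.81, so the bias is large and negative, whereas the variance across training sets stays small.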

Confidence Interval

Generally, the true error cannot be computed directly, because the whole population is rarely available. Instead, it is estimated with a confidence interval, which is computed as a function of the sample error.

The steps for building the confidence interval are:

  • Randomly draw a sample S of n instances (independently of each other) from the population P, where n should be greater than 30.
  • Calculate the sample error of h on S.

Here we assume that the sample error is an unbiased estimator of the true error, which requires S to be drawn independently of h (for example, S must not be the data h was trained on). With approximately s% confidence, the true error then lies in the interval:

T.E. = S.E. \pm  z_{s} \sqrt{\frac{S.E. (1- S.E.)}{n}}

where z_{s} is the z-score corresponding to a confidence level of s%:

Confidence level (%)   50     80     90     95     99     99.5
z-score                0.67   1.28   1.64   1.96   2.58   2.81
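The interval formula can be sketched with the Python standard library alone. The function name `true_error_interval` is hypothetical, and clamping the bounds to [0, 1] is an added assumption (an error rate cannot fall outside that range). The z-score is computed from the standard normal distribution rather than read from the table.

```python
import math
from statistics import NormalDist


def true_error_interval(sample_error, n, confidence=0.95):
    """Approximate interval: S.E. +/- z * sqrt(S.E. * (1 - S.E.) / n)."""
    if not 0.0 <= sample_error <= 1.0:
        raise ValueError("sample_error must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    # Assumption: clamp to [0, 1], since an error rate cannot leave that range.
    return max(0.0, sample_error - margin), min(1.0, sample_error + margin)


# Example: 7 misclassified examples out of n = 33.
low, high = true_error_interval(7 / 33, 33, confidence=0.95)
print(f"95% interval for the true error: ({low:.3f}, {high:.3f})")
```

With only 33 examples the interval is wide, roughly 0.07 to 0.35, which shows how much uncertainty a small sample leaves about the true error.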

True Error vs Sample Error

True Error | Sample Error
The true error is the probability that a randomly drawn instance from the population is misclassified. | The sample error is the fraction of the sample that is misclassified.
The true error measures the error over the whole population. | The sample error measures the error on the sample only.
The true error is difficult to calculate. It is estimated by a confidence interval built from the sample error. | The sample error is easy to calculate: count the misclassified examples and divide by the sample size.
The true error can be affected by poor data collection methods, selection bias, or non-response bias. | Sampling error can take the form of population-specific error (surveying the wrong people), selection error, sample-frame error (selecting the wrong frame for the sample), and non-response error (respondents failing to respond).

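Implementation:

The snippet below shows how to compute confidence intervals with SciPy. It builds normal-approximation intervals around the mean of randomly generated data at several confidence levels; the same approach applies to a sample error.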

Python3




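# imports
import numpy as np
import scipy.stats as st

# define sample data: 10,000 random integers in [10, 30)
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# confidence levels for the intervals
confidence_levels = [0.90, 0.95, 0.99, 0.995]
for confidence in confidence_levels:
    # normal-approximation interval around the sample mean;
    # the keyword is `confidence` in SciPy >= 1.9 (it was `alpha` in older releases)
    low, high = st.norm.interval(confidence=confidence, loc=np.mean(data), scale=st.sem(data))
    print(f"{confidence * 100:g}%: ({low}, {high})")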


# confidence Interval
90%: (17.868667310403545, 19.891332689596453)
95%: (17.67492277275104, 20.08507722724896)
99%: (17.29626006422982, 20.463739935770178)
99.5%: (17.154104780989755, 20.60589521901025)
