Statistical Distribution of Error

Business leaders are used to making difficult strategic decisions under uncertainty. Data analytics systems can help by reducing that uncertainty, but this sometimes creates a false sense of security. This human behavior, combined with raw data errors and algorithm errors, affects corporate performance. To explain the combined economic effect of these errors, we use the statistical distribution of errors.

Error Distribution

Statistical distribution of errors for the baseline versus the analytic model of a temperature-prediction algorithm used in price hedging. The empirical distributions were derived from historical data and simulations over a ten-year period at a one-hour sampling rate.

On this graph we can distinguish three areas of interest: the No Cost Band, the Economic Loss Area, and the Risk Band.

The example shown here is almost ideal. The No Cost Band, ranging from 0 to 2.5°C, contains about half of the entire area under the curve, so these errors carry no economic penalty. For errors this small there is no need to compensate for the temperature, so the cost is zero.
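As a rough illustration, the share of errors falling inside the No Cost Band can be estimated directly from a sample of absolute prediction errors. The sketch below takes the 2.5°C band limit from the example; the function name `no_cost_share` and the sample values are hypothetical and are not taken from the data behind the graph.

```python
def no_cost_share(abs_errors, band_limit=2.5):
    """Return the fraction of absolute errors (in °C) at or below band_limit."""
    if not abs_errors:
        raise ValueError("abs_errors must not be empty")
    inside = sum(1 for e in abs_errors if e <= band_limit)
    return inside / len(abs_errors)


# Made-up absolute temperature errors in °C, for illustration only.
sample_errors = [0.4, 1.1, 2.0, 2.5, 3.2, 4.8, 6.0, 9.5]
share = no_cost_share(sample_errors)
print(f"Share of errors in the No Cost Band: {share:.0%}")
```

A share close to one half would correspond to the situation described for the example graph.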

The Economic Loss Area is usually characterized by a simple (if not linear) relation between prediction error and economic loss. From that relation and the statistical distribution of the error we can derive the implied economic loss. The true loss arises only from those portions of this graph, if any, that lie above the baseline graph.
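The derivation described above can be sketched as a weighted sum over a binned error distribution. This is only one possible reading: it assumes a linear loss above the 2.5°C band with a hypothetical cost of 100 per degree, evaluates the loss at bin centers, and counts as "true loss" only the probability mass by which the model histogram exceeds the baseline histogram. All function names, parameters, and sample values are assumptions made for illustration.

```python
def histogram(abs_errors, bin_width, n_bins):
    """Empirical probabilities per bin; errors past the last bin are clamped into it."""
    counts = [0] * n_bins
    for e in abs_errors:
        idx = min(int(e / bin_width), n_bins - 1)
        counts[idx] += 1
    return [c / len(abs_errors) for c in counts]


def linear_loss(error, band_limit=2.5, cost_per_degree=100.0):
    """Assumed loss relation: zero inside the band, linear above it."""
    return max(0.0, error - band_limit) * cost_per_degree


def implied_loss(probs, bin_width, loss=linear_loss):
    """Expected loss per sample, with the loss evaluated at each bin center."""
    return sum(p * loss((i + 0.5) * bin_width) for i, p in enumerate(probs))


def loss_over_baseline(model_probs, baseline_probs, bin_width, loss=linear_loss):
    """Loss from the bins where the model curve lies above the baseline curve."""
    return sum(
        max(0.0, pm - pb) * loss((i + 0.5) * bin_width)
        for i, (pm, pb) in enumerate(zip(model_probs, baseline_probs))
    )


# Made-up absolute errors in °C, for illustration only.
model = histogram([0.5, 0.5, 3.5, 4.5], bin_width=1.0, n_bins=5)
baseline = histogram([0.5, 3.5, 3.5, 3.5], bin_width=1.0, n_bins=5)
model_loss = implied_loss(model, bin_width=1.0)
extra_loss = loss_over_baseline(model, baseline, bin_width=1.0)
```

With finer bins and a loss relation calibrated to the actual business case, the same sums approximate the integral of the loss over the error distribution.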

Events in the Risk Band have highly damaging effects, far greater than the error level alone may imply. In general we describe the risk level using extreme value statistics (the Gumbel distribution) derived from the empirical distribution of errors. In most cases the threshold at which losses become catastrophic is set externally. In our example a temperature error over 21°C cannot be compensated because of technical limitations, leading to the catastrophic destruction of the equipment and warehouse contents. Our algorithm was designed to eliminate all errors over 14.75°C, leaving this band empty.
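To make the extreme-value step concrete, the sketch below fits a Gumbel distribution to block maxima of the absolute error and evaluates the probability of exceeding the 21°C threshold. It uses a simple method-of-moments fit, which is an assumption chosen for brevity and not necessarily the estimation method behind the graph; the yearly maxima are invented values and the function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(block_maxima):
    """Method-of-moments estimate of the Gumbel location and scale."""
    mean = statistics.fmean(block_maxima)
    std = statistics.stdev(block_maxima)  # needs at least two values
    scale = std * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def exceedance_probability(threshold, location, scale):
    """P(block maximum > threshold) under the fitted Gumbel distribution."""
    z = (threshold - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small probabilities
    return -math.expm1(-math.exp(-z))


# Invented yearly maxima of the absolute temperature error in °C.
yearly_maxima = [9.8, 11.2, 10.5, 12.9, 10.1, 11.7, 13.4, 10.9, 12.2, 11.0]
location, scale = fit_gumbel_moments(yearly_maxima)
p_catastrophic = exceedance_probability(21.0, location, scale)
print(f"Estimated yearly probability of an error above 21°C: {p_catastrophic:.2e}")
```

An algorithm that truly caps errors at 14.75°C would leave the band empty by construction; a fit like this is mainly useful for the baseline or for checking how far the observed maxima sit from the external threshold.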