Statistical Distribution of Error


Business leaders are used to making difficult strategic decisions under uncertainty. Data analytics systems can help by reducing that uncertainty, but this sometimes creates a false sense of security. That human behavior, combined with raw data errors and algorithm errors, affects corporate performance. To explain the combined economic effect of these errors, we use the statistical distribution of errors.

Statistical distribution of errors for the baseline vs. the analytic model of a temperature-predicting algorithm used in price hedging. The empirical distributions were derived from historical data and simulations over a ten-year period with a one-hour sampling rate.

On this graph we can distinguish three areas of interest:

- **No Cost Band**: the narrow range next to the left end, where the error is irrelevant because most systems have a certain resilience. Ideally the entire distribution should be found here.
- **Economic Loss Area**: this area represents most of the graph and describes the algorithm's contribution to losses compared to an ideal solution.
- **Risk Band**: this area is found at the right end of the graph and contains very low probability but highly damaging events.

The example shown here is almost ideal. The **No Cost Band**, ranging from 0 to 2.5°C,
contains about half of the entire area under the curve, so these errors carry no economic
penalty. For errors this small there is no need to compensate the temperature, so the cost is zero.
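To make the band split concrete, the share of samples falling in each band can be estimated directly from a list of absolute prediction errors. The sketch below is illustrative only: the function name `band_shares` and its parameters are hypothetical, the 2.5°C limit is the one quoted above, and the 21°C default is the catastrophic threshold quoted later in this section.

```python
def band_shares(abs_errors_c, no_cost_limit_c=2.5, risk_threshold_c=21.0):
    """Fraction of error samples in each band (thresholds in degrees C).

    Both thresholds are parameters; the defaults are the values quoted in
    the text for this particular example.
    """
    n = len(abs_errors_c)
    no_cost = sum(e <= no_cost_limit_c for e in abs_errors_c)
    risk = sum(e > risk_threshold_c for e in abs_errors_c)
    return {
        "no_cost": no_cost / n,
        "economic_loss": (n - no_cost - risk) / n,
        "risk": risk / n,
    }
```

For the example distribution described here, one would expect `no_cost` to come out near 0.5 and `risk` to be 0.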

The **Economic Loss Area** is usually characterized by a simple (if not linear) relation
between prediction error and economic loss. From that relation and the statistical
distribution of the error we can derive the implied economic loss. The true loss comes only
from those portions of this graph, if any, that lie above the baseline graph.
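As a rough illustration of how a loss relation and an error distribution combine, the sketch below averages an assumed cost function over the empirical error samples. The linear cost measured from the edge of the No Cost Band, the `cost_per_degree` parameter, and the function name `implied_loss` are assumptions made for this example, not the relation used to produce the graph.

```python
def implied_loss(abs_errors_c, cost_per_degree, no_cost_limit_c=2.5):
    """Mean loss per sample under an ASSUMED piecewise-linear cost.

    Errors up to no_cost_limit_c cost nothing; above it, each extra
    degree of error is assumed to cost cost_per_degree.
    """
    excess = [max(e - no_cost_limit_c, 0.0) for e in abs_errors_c]
    return cost_per_degree * sum(excess) / len(excess)
```

Evaluating the same function on the baseline errors and on the analytic model errors gives two figures that can be compared; with a non-linear relation, only the cost expression inside the function would change.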

Events in the **Risk Band** have highly damaging effects, much greater than the error level
alone may imply. In general we describe the risk level using extreme value statistics (the Gumbel
distribution) derived from the empirical distribution of errors. In most cases the threshold
at which losses become catastrophic is set externally. In our example, a temperature error over 21°C
cannot be compensated because of technical limitations, leading to the catastrophic destruction
of the equipment and warehouse content. Our algorithm was designed to eliminate all errors over
14.75°C, leaving this band empty.
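One common way to obtain a Gumbel description from empirical errors is to take block maxima (for instance the largest error per day) and fit the location and scale parameters. The sketch below uses a simple method-of-moments fit with the Python standard library only; the block size, the fitting method, and all function names are assumptions for illustration and are not necessarily what was used here.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def block_maxima(abs_errors_c, block_size):
    """Largest error in each full consecutive block (e.g. 24 hourly samples)."""
    last_start = len(abs_errors_c) - block_size
    return [max(abs_errors_c[i:i + block_size])
            for i in range(0, last_start + 1, block_size)]


def fit_gumbel_moments(maxima):
    """Method-of-moments estimates of the Gumbel location and scale."""
    scale = statistics.stdev(maxima) * math.sqrt(6) / math.pi
    loc = statistics.fmean(maxima) - EULER_GAMMA * scale
    return loc, scale


def exceedance_probability(threshold_c, loc, scale):
    """P(block maximum > threshold) under the fitted Gumbel distribution."""
    z = (threshold_c - loc) / scale
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail
    return -math.expm1(-math.exp(-z))
```

With hourly errors in a list `errors`, `fit_gumbel_moments(block_maxima(errors, 24))` yields parameters that can be passed to `exceedance_probability(21.0, loc, scale)` to estimate how often a daily maximum would cross the catastrophic threshold.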