On the origins of regression dilution

Figure 1: OLS in the presence of errors in both axes. The red line ignores x errors; the blue line ignores y errors. (Sample: real meteorological data for Ballypatrick; Tmax / deg.C and sun-hours per month.)


Consider two data samples with independent, normally (Gaussian) distributed errors, representing two quantities having a linear relationship.

It is well established that ordinary least squares (OLS) analysis will produce the best linear unbiased estimate of the true relationship if the magnitude of the x error is negligible in relation to the y error.

OLS is based upon minimisation of the mean square error. In the case of the symmetric Gaussian distribution, this minimisation condition corresponds to the peak of the distribution, which is coincident with the mean. With negligible x errors this provides a reliable mean value of y/x and hence the best estimate of the true relationship available from the sample.
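This baseline case is easy to check by simulation. The sketch below uses made-up values (a true slope of 2, intercept of 1, and Gaussian y errors only); the fitted OLS line recovers the underlying relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
true_slope, true_intercept = 2.0, 1.0   # assumed values for illustration

x = rng.uniform(0.0, 10.0, 5000)        # x measured essentially without error
y = true_intercept + true_slope * x + rng.normal(0.0, 1.0, 5000)  # Gaussian y error

# ordinary least squares fit of a straight line
slope, intercept = np.polyfit(x, y, 1)
print(f"slope={slope:.3f}  intercept={intercept:.3f}")  # close to 2 and 1
```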

Consider the effect of non-negligible normally distributed errors in x.

The distribution of errors in 1/x will not be Gaussian and will be skewed towards zero. The degree of skew is determined by the standard deviation in relation to the mean: how much of the tail of the distribution approaches zero, where the errors in 1/x become disproportionately large.
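This skew can be checked by simulation. The sketch below (arbitrary mean of 5 and standard deviation of 1) draws a Gaussian x sample and examines 1/x: the positive sample skewness means a long tail of large values, with the bulk of the distribution sitting below the mean, i.e. towards zero:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 5.0, 1.0                 # illustrative values; sigma/mu controls the skew
x = rng.normal(mu, sigma, 100_000)
u = 1.0 / x

# sample skewness of 1/x: positive => long tail of large values,
# with the bulk (peak) of the distribution below the mean
m = u.mean()
skew = np.mean((u - m) ** 3) / u.std() ** 3
print(f"mean={m:.4f}  median={np.median(u):.4f}  skew={skew:.2f}")
```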

If OLS is applied in this situation, effectively ignoring the x errors, this inserts an additional error into the y sample. The distribution of the total y error can be found by convolution of the probability density functions of y and 1/x.

Convolution of the Gaussian pdf of y with the skewed pdf of 1/x will skew y. The minimisation condition will now converge on the skewed peak, which is no longer coincident with the mean: it is closer to zero. Erroneously taking this to be representative of y will give an artificially low estimate of y/x and thus of the linear relationship being sought.

This problem is referred to as regression dilution or attenuation bias.
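The size of the attenuation can be checked numerically. A standard result from the errors-in-variables literature (not derived here) is that the fitted slope is scaled down by the reliability ratio var(x_true) / (var(x_true) + var(err_x)). A minimal sketch with assumed values:

```python
import numpy as np

rng = np.random.default_rng(2)
true_slope, n = 2.0, 100_000                    # assumed values for illustration

x_true = rng.normal(0.0, 2.0, n)                # latent x, variance 4
x_obs = x_true + rng.normal(0.0, 1.0, n)        # observed x, error variance 1
y = true_slope * x_true + rng.normal(0.0, 1.0, n)

# OLS on the error-laden x gives an attenuated slope
fitted, _ = np.polyfit(x_obs, y, 1)
reliability = 4.0 / (4.0 + 1.0)                 # var(x_true) / var(x_obs) = 0.8
print(f"fitted={fitted:.3f}  predicted={true_slope * reliability:.3f}")
```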

Convolution is a commutative operation, so convolving the pdf of y with the pdf of 1/x is identical to convolving the pdf of 1/x with the pdf of y.

To visualise the effect without doing the detailed maths: convolution with a Gaussian is the same process as applying a low-pass Gaussian filter. The skewed pdf of 1/x will be broadened by that of y, but there will be no phase change, i.e. the peak will remain in essentially the same place.

In the other sense, the asymmetric pdf of 1/x will induce a phase change in the pdf of y, shifting its peak towards zero and leading to the regression dilution issue already discussed.
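The peak shift can be demonstrated numerically on a grid, without any regression at all. In the sketch below a Gaussian pdf (peak at 5, standing in for y) is convolved with an illustrative asymmetric kernel (a one-sided exponential with its mode at zero and a tail towards lower values, standing in for the skewed 1/x error pdf; the exact 1/x distribution is not needed to show the effect). The peak of the result lands below 5:

```python
import numpy as np

t = np.linspace(-20.0, 20.0, 4001)       # grid, step 0.01
dt = t[1] - t[0]

# symmetric Gaussian pdf for y, peak (and mean) at 5
g = np.exp(-0.5 * ((t - 5.0) / 1.0) ** 2)

# illustrative skewed kernel: mode at 0, long tail towards negative values
k = np.where(t <= 0, np.exp(t / 2.0), 0.0)

# convolution of the two densities on the same grid
h = np.convolve(g, k, mode="same") * dt

peak_g = t[np.argmax(g)]
peak_h = t[np.argmax(h)]
print(f"peak of y pdf: {peak_g:.2f}  peak after convolution: {peak_h:.2f}")
```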

If the pdf of y is very broad in relation to the skew in 1/x, the shift in y will be negligible. This corresponds to the case err_x &lt;&lt; err_y, where the regression dilution is negligible and the OLS result is a valid best estimator of the true linear relationship.
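The same limit can be seen in a regression simulation: holding the y error fixed and growing the x error from zero (assumed values below), the fitted OLS slope starts at the true value and is progressively attenuated:

```python
import numpy as np

rng = np.random.default_rng(4)
true_slope, n = 2.0, 50_000             # assumed values for illustration
x_true = rng.normal(0.0, 2.0, n)
y = true_slope * x_true + rng.normal(0.0, 1.0, n)   # fixed y error

slopes = []
for err_x in (0.0, 0.5, 1.0, 2.0):      # growing x error
    x_obs = x_true + rng.normal(0.0, err_x, n)
    slope, _ = np.polyfit(x_obs, y, 1)
    slopes.append(slope)
    print(f"err_x={err_x:3.1f}  fitted slope={slope:.3f}")
```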

Contrariwise, if this technique is applied to a scatter plot of two variables both of which carry significant experimental error, the fitted OLS slope will generally be significantly in error.

The two slopes derived in the example plot differ by a factor of about 2.5. The true slope probably lies between the two.
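That bracketing behaviour is easy to reproduce in a simulation with comparable errors on both axes (assumed values below): the y-on-x slope comes out low, the inverse of the x-on-y slope comes out high, and the true slope lies between them:

```python
import numpy as np

rng = np.random.default_rng(3)
true_slope, n = 2.0, 50_000              # assumed values for illustration

x_true = rng.normal(0.0, 2.0, n)
x = x_true + rng.normal(0.0, 1.0, n)     # comparable errors in both variables
y = true_slope * x_true + rng.normal(0.0, 2.0, n)

slope_lo, _ = np.polyfit(x, y, 1)        # y-on-x: attenuated, too shallow
slope_hi = 1.0 / np.polyfit(y, x, 1)[0]  # inverse of x-on-y: too steep
print(f"low={slope_lo:.3f}  true={true_slope:.3f}  high={slope_hi:.3f}")
```

The geometric mean of the two slopes (reduced major axis regression) is one common compromise, though it is only exact under particular assumptions about the two error variances.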