Linear regression with noise in both variables

Two sets of synthetic data having a linear relationship, but with significant random noise, are regressed.

Neither variable conforms to the requirement of having minimal error that leads to the least squares regression being the ‘best estimator’ of the underlying linear relationship, so the regression is performed as both x-against-y and y-against-x and the results are compared.

This example graph shows a typical result where the true linear relationship lies somewhere between the two least-squares results.

It is clear that not being aware of this effect and taking a simple regression fit to be the ‘best estimator’ can lead to grave errors in assessing the underlying linear relationship.

This graph is part of an article on the subject.

https://climategrog.wordpress.com/2014/03/08/on-inappropriate-use-of-ols/

The following code will create similar random data series and reproduce the graph. Select text with mouse and use copy / paste to take a copy.

The script is for gnuplot a cross-platform plotting utility and also calls awk utility, which is also available on most platforms.

Setting xnz=0 corresponds to negligible x-errors and will reproduce valid ‘best estimation’ results. An example graph can be seen here:
https://climategrog.wordpress.com/?attachment_id=854

### assessing regression dilution in the presence of x-variable noise
### script suitable for gnuplot v 4.6 , can be supplied to gnuplot "load" command. Also requires awk (or gawk, mawk, etc) installed.
### awk and gnuplot are available on lin/win/max platforms.


# synthetic data with radiative f/b from "temp" term in col 2, randomised temp in col 1;  as for dRad vs dTemp plots

  f0(x)=m0*x+c0;  inv_f0(x)=(x-c0)/m0
  f1(x)=m1*x+c1;  inv_f1(x)=(x-c1)/m1
  gf(x)=gm*x+c

# increase 'runs' to find average bisector slope, graph is simply that of last run

runs=1; ols_sum=0; do for [rep = 1:runs] {
# call awk to create synthetic data:
 datafile= "rand.dat"
 fb=3.0; ynz=1.; xnz=2.;   vars=sprintf("fb=%.1f; ynz=%.1f; xnz=%.1f;",fb,ynz,xnz);
 awkcmd='BEGIN{srand(); max=1024;'.vars.\
        ' for (i=-max/2.;i<max/2.;i++) { {printf( "%.3f %.3f\n",i/max+(rand()-0.5)*xnz,fb*(i/max+(rand()-0.5)*ynz))} } }'
 print "error=", res=system (cmd="awk '".awkcmd."' > ".datafile."; echo $?")

 m0=c0=m1=c1=1.;
 fit f0(x) datafile u 1:2 via m0,c0
 fit f1(x) datafile u 2:1 via m1,c1

 gm=sqrt(m0/m1);
 x_sect=(c1+m1*c0)/(1-m0*m1)
 c=y_sect=f0(x_sect); print y_sect=inv_f1(x_sect)

unset label;  set key b r
set title "linear regression in the presence of error in the abscissa "
set label 1 sprintf ("conventional regression slope = %.3f",m0) at graph 0.05,0.95
set label 2 sprintf ("reciprocal of inv. reg. slope = %.3f",1/m1) at graph  0.05,0.85
set label 3 sprintf ("inverse regression slope = %.3f",m1) at graph  0.05,0.9
#set label 4 sprintf ("geometric .mean of slopes = %.3f",gm) at graph 0.05,0.8
set label 4 sprintf ("x variable noise = %3d \%",xnz*100) at graph 0.05,0.8
set label 5 sprintf ("y variable noise = %3d \%",ynz*100) at graph 0.05,0.75

if (rep==runs){
plot datafile t sprintf("synthetic data ( linear +  random noise )",ynz*100) \
, f0(x) tit sprintf("regression slope = %.2f ",m0) \
, inv_f1(x) tit"inverted regression fit " \
, x*fb tit  sprintf("true slope = %.1f", fb)
}

ols_sum=ols_sum+m0; rep
} # done

print sprintf ("mean slope over %d runs = %.3f",runs, ols_sum/(runs+0.))


Advertisements