On inappropriate use of OLS

http://climategrog.files.wordpress.com/2013/11/ols_scatterplot_regression2.png?w=670

Figure 1. Conventional and inverse OLS linear regression fits to some real, observed climate variables.

Least squares regression is a very useful technique, widely used in almost all branches of science. It is usually one of the first techniques that is taught in schools for analysing experimental data.

It is also a technique that is misapplied almost as often as it is used correctly.

It can be shown that, under certain conditions, the least squares linear regression is the best estimate of the true relationship that can be derived from the available data. In statistics this is often called the ‘best linear unbiased estimator’ of the slope. (Some, who enjoy contrived acronyms, abbreviate this to “BLUE”.)

However, it has been known since at least 1878 that this will under-estimate the slope if there is measurement error in the x-variable. (R. J. Adcock) [0]

There are two main conditions for this result to be an accurate estimation of the slope. One is that the deviations of the data from the true relationship are ‘normally’ or Gaussian distributed, that is to say, of a random nature. This condition can be violated by significant periodic components in the data or by an excessive number of outlying data points. The latter may often occur when only a small number of data points is available and the noise, even if random in nature, is not sufficiently sampled to average out.

The other main condition is that there be negligible error ( or non-linear variability ) in the x variable. If this condition is not met, the OLS result derived from the data will almost always under-estimate the slope of the true relationship. This effect is sometimes referred to as regression dilution. The degree to which the slope is under-estimated is determined by the nature of both the x and y errors, but most strongly by those in x, since these are required to be negligible for OLS to give the best estimation.
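For the classical errors-in-variables case the size of this effect can be written down explicitly. If the underlying relationship is y = beta * x and the observed x carries independent random error, a standard textbook result ( stated here for reference, with sigma_x^2 the variance of the true x signal and sigma_u^2 the variance of the x errors ) is:

expected OLS slope = beta * sigma_x^2 / ( sigma_x^2 + sigma_u^2 )

so the fitted slope is attenuated towards zero by this ratio and only approaches the true beta when the x errors are negligible.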

In this discussion “errors” can be understood to mean both observational inaccuracies and any variability due to factors other than the supposed linear relationship that the regression seeks to determine.

In certain circumstances regression dilution can be corrected for, but to do so some knowledge of the nature and size of both the x and y errors is required. Typically this is not available beyond knowing whether the x variable is a ‘controlled variable’ with negligible error, although several techniques have been developed to estimate the error in the estimation of the slope [7].

A controlled variable can usually be attained in a controlled experiment, or when studying a time series, provided that the date and time of observations have been recorded in a precise and consistent manner. It is typically not the case when both sets of data are observations of different variables, as when comparing two quantities in climatology.

One way to demonstrate the problem is to invert the x and y axes and repeat the OLS fit. If the result were valid irrespective of orientation, the first slope would be the reciprocal of the second. However, this is only the case when there are very small errors in both variables, ie. the data is highly correlated ( grouped closely around a straight line ). In the case of one controlled variable and one error-prone variable, the inverted result will be incorrect. In the case of two datasets containing observational error, both results will be wrong and the correct result will generally lie somewhere in between.

Another way to check the result is to examine the cross-correlation between the residual ( model – y ) and the independent variable x, repeating for incrementally larger values of the fitted ratio. Depending on the nature of the data, it will often be obvious that the OLS result does not represent the minimum correlation between the ordinate and the regressor, ie. it does not optimally account for the co-variability of the two quantities.

In the latter situation, the two regression fits can be taken as bounding the likely true value, but some knowledge of the relative errors is needed to decide where in that range the best estimate lies. There are a number of techniques, such as bisecting the angle, taking the geometric mean ( the square root of the product ), or some other average, but ultimately these are no more objective unless driven by some knowledge of the relative errors. Clearly bisection would not be correct if one variable had low error, since the true slope would then be close to the OLS fit done with that quantity on the x axis.



Figure 2. A typical example of linear regression of two noisy variables produced from synthetic randomised data. ( Click to enlarge graph and access code to reproduce data and graph. )



Figure 2b. A typical example of correct application of linear regression to data with negligible x-errors. The regressed slope is very close to the true value, so close as to be indistinguishable visually. ( Click to enlarge )

The larger the x-errors, the greater the skew in the distribution and the greater the dilution effect.
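The effect is easy to reproduce. The following minimal sketch ( in awk, the language used for the other scripts on this site, and not the code behind the figures above ) generates a synthetic linear relationship with arbitrary, illustrative noise levels on both variables, then prints the conventional OLS slope, the inverted ( x on y ) slope and their geometric mean:

#!/bin/awk -f
# Sketch only: demonstrate regression dilution on synthetic data.
# True relationship y = a*x + b; independent noise is added to BOTH x and y.
BEGIN {
  srand(42); n=1000; a=2.0; b=1.0; xerr=1.0; yerr=1.0
  for (i=1; i<=n; i++) {
    xt = 10*i/n                       # true x, spread over 0..10
    x  = xt + xerr*gauss()            # observed x with error
    y  = a*xt + b + yerr*gauss()      # observed y with error
    sx+=x; sy+=y; sxy+=x*y; sxx+=x*x; syy+=y*y
  }
  slope_yx = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # OLS of y against x
  slope_xy = (n*sxy - sx*sy) / (n*syy - sy*sy)   # OLS of x against y
  print "true slope          :", a
  print "OLS y vs x          :", slope_yx         # biased low (diluted)
  print "1 / ( OLS x vs y )  :", 1/slope_xy       # biased high
  print "geometric mean      :", sqrt(slope_yx/slope_xy)
}
# crude normal deviate: sum of 12 uniforms, mean 0, sd ~1
function gauss(   s, k) { s=0; for (k=0; k<12; k++) s+=rand(); return s-6 }

No input file is needed since everything happens in the BEGIN block. The conventional slope should come out below the true value of 2 and the inverted one above it, illustrating the bounding behaviour described above.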

An Illustration: the Spencer simple model.

The following case is used to illustrate the issue with ‘climate-like’ data. However, it should be emphasised that the problem is an objective mathematical one, the principle of which is independent of any particular test data used. Whether the following model is an accurate representation of climate ( it is not claimed to be ) has no bearing on the regression problem.

In a short article on his site, Dr. Roy Spencer provided a simple, single-slab ocean climate model with a predetermined feedback variable built into it. He observed that attempting to derive the climate sensitivity in the usual way consistently under-estimated the known feedback used to generate the data.

By specifying that sensitivity (with a total feedback parameter) in the model, one can see how an analysis of simulated satellite data will yield observations that routinely suggest a more sensitive climate system (lower feedback parameter) than was actually specified in the model run.

And if our climate system generates the illusion that it is sensitive, climate modelers will develop models that are also sensitive, and the more sensitive the climate model, the more global warming it will predict from adding greenhouse gasses to the atmosphere.

This is a very important observation. Regressing noisy radiative flux change against noisy temperature anomalies does consistently produce incorrectly high sensitivity. However, it is not an illusion created by the climate system; it is an illusion created by the incorrect application of OLS regression. Where there is error on both variables, the OLS slope is no longer an accurate estimate of the underlying linear relationship being sought.

Dr Spencer was kind enough to provide an implementation of the simple model in the form of a spreadsheet download so that others may experiment and verify the effect.

To demonstrate this problem, the spreadsheet provided was modified to duplicate the dRad vs dTemp graph but with the axes inverted, ie. using exactly the same data for each run but additionally displaying it the other way around. Thus the ‘trend line’ provided by the spreadsheet is calculated with the variables inverted. No changes were made to the model.

Three values of the feedback parameter were used in turn: 0.9 and 1.9, which Roy Spencer suggests represent the range of IPCC values, and 5.0, which he proposes as closer to the value he has derived from satellite observational data.

Here is a snap-shot of the spreadsheet showing a table of results from nine runs for each feedback parameter value. Both the conventional and the inverted regression slopes and their geometric mean have been tabulated.

Figure 3. Snap-shot of spreadsheet, click to enlarge.

Firstly, this confirms Roy Spencer’s observation that the regression of dRad against dTemp consistently and significantly under-estimates the feedback parameter used to create the data ( and hence over-estimates the climate sensitivity of the model ). In this limited test the error is between a third and a half of the correct value. Only one value of the conventional least squares slope is greater than the respective feedback parameter value.

Secondly, it is noted that the geometric mean of the two OLS regressions does provide a reasonably close estimate of the true feedback parameter for the value derived from satellite observations. Variations are fairly evenly spread either side: the mean is only slightly higher than the true value and the standard deviation is about 9% of the mean.

However, for the two lower feedback values, representing the IPCC range of climate sensitivities, while the usual OLS regression is substantially less than the true value, the geometric mean over-estimates it and does not provide a reliable correction over the range of feedbacks.

All the feedbacks represent a net negative feedback ( otherwise the climate system would be fundamentally unstable ). However, the IPCC range of values represents less negative feedbacks, and thus a less stable climate. This can be seen reflected in the degree of variability in the data plotted in the spreadsheet. The standard deviations of the slopes are also somewhat higher. This is not unexpected with less feedback controlling variations.

It can be concluded that the ratio of the proportional variability in the two quantities changes as a function of the degree of feedback in the system. The geometric mean of the two slopes does not provide a good estimation of the true feedback for the less stable configurations which have greater variability. This is in agreement with Isobe et al 1990 [7] which considers the merits of several regressions methods.

The simple model helps to see how this relates to Rad / Temp plots and climate sensitivity. However, the problem of regression dilution is a totally general mathematical result and can be reproduced from two series having a linear relationship with added random changes, as shown above.

What the papers say

A quick review of several recent papers on the problems of determining climate sensitivity shows a general lack of appreciation of the regression dilution problem.

Dessler 2010 b [1] :

Estimates of Earth’s climate sensitivity are uncertain, largely because of uncertainty in the long-term cloud feedback.

Spencer & Braswell 2011 [2] :

Abstract: The sensitivity of the climate system to an imposed radiative imbalance remains the largest source of uncertainty in projections of future anthropogenic climate change.

There seems to be agreement that this is the key problem in assessing future climate trends. However, many authors seem unaware of the regression problem and much published work on this issue seems to rely heavily on the false assumption that OLS regression of dRad against dTemp can be used to correctly determine this ratio, and hence various sensitivities and feedbacks.

Trenberth 2010 [3] :

To assess climate sensitivity from Earth radiation observations of limited duration and observed sea surface temperatures (SSTs) requires a closed and therefore global domain, equilibrium between the fields, and robust methods of dealing with noise. Noise arises from natural variability in the atmosphere and observational noise in precessing satellite observations.

Whether or not the results provide meaningful insight depends critically on assumptions, methods and the time scales ….

Indeed so. Unfortunately, they then go on to contradict earlier work by Lindzen and Choi, which did not rely on OLS regression, while themselves relying on an inappropriate use of regression.

Spencer and Braswell 2011 [2]

As shown by SB10, the presence of any time-varying radiative forcing decorrelates the co-variations between radiative flux and temperature. Low correlations lead to regression-diagnosed feedback parameters biased toward zero, which corresponds to a borderline unstable climate system.

This is an important paper, highlighting the need to use lagged regression to avoid the decorrelating effect of delays in the response. However, it does not deal with the further attenuation due to regression dilution: the analysis is ultimately still based on the regression of two error-laden variables. Thus it is likely that this paper still over-estimates sensitivity.

Dessler 2011 [4] :

Using a more realistic value of σ(dF_ocean)/σ(dR_cloud) = 20, regression of TOA flux vs. dTs yields a slope that is within 0.4% of lambda.

Then in the conclusion of the paper, emphasis added:

warming). Rather, the evolution of the surface and atmosphere during ENSO variations are dominated by oceanic heat transport. This means in turn that regressions of TOA fluxes vs. δTs can be used to accurately estimate climate sensitivity or the magnitude of climate feedbacks.

Also from a previous paper:

Dessler 2010 b [1]

The impact of a spurious long-term trend in either dRall-sky or dRclear-sky is estimated by adding in a trend of ±0.5 W/m2/decade into the CERES data. This changes the calculated feedback by ±0.18 W/m2/K. Adding these errors in quadrature yields a total uncertainty of 0.74 and 0.77 W/m2/K in the calculations, using the ECMWF and MERRA reanalyses, respectively. Other sources of uncertainty are negligible.

The author is apparently unaware that the inaccuracy of regressing two uncontrolled variables is a major source of uncertainty and error.

Lindzen & Choi 2011 [5]

[Our] new method does moderately well in distinguishing positive from negative feedbacks and in quantifying negative feedbacks. In contrast, we show that simple regression methods used by several existing papers generally exaggerate positive feedbacks and even show positive feedbacks when actual feedbacks are negative.

… but we see clearly that the simple regression always under-estimates negative feedbacks and exaggerates positive feedbacks.

Here the authors have clearly noted that there is a problem with the regression-based techniques and go into some detail in quantifying it, though they do not explicitly identify it as being due to the presence of uncertainty in the x-variable.

The L&C papers, to their credit, recognise that regression seriously under-estimates the slope and utilise other techniques to more correctly determine the ratio.

It seems the latter authors are exceptional in looking at the sensitivity question without relying on inappropriate use of linear regression. It is certainly part of the reason that their results are considerably lower than almost all other authors on this subject.

Forster & Gregory 2006 [8]

For less than perfectly correlated data, OLS regression of Q-N against δTs will tend to underestimate Y values and therefore overestimate the equilibrium climate sensitivity (see Isobe et al. 1990).

Another important reason for adopting our regression model was to reinforce the main conclusion of the paper: the suggestion of a relatively small equilibrium climate sensitivity. To show the robustness of this conclusion, we deliberately adopted the regression model that gave the highest climate sensitivity (smallest Y value). It has been suggested that a technique based on total least squares regression or bisector least squares regression gives a better fit, when errors in the data are uncharacterized (Isobe et al. 1990). For example, for 1985–96 both of these methods suggest YNET of around 3.5 ± 2.0 W m-2 K-1 ( a 0.7–2.4-K equilibrium surface temperature increase for 2 × CO2 ), and this should be compared to our 1.0–3.6-K range quoted in the conclusions of the paper.

Here, the authors explicitly state the regression problem and its effect on the results of their study on sensitivity. However, when writing in 2005, they apparently feared that it would impede the acceptance of what was already a low value of climate sensitivity if they presented the mathematically more accurate, but lower figures.

It is interesting to note that Roy Spencer, in a non peer-reviewed article, found a very similar figure of 3.66 W/m2/K by comparing ERBE data to MSU-derived temperatures following Mt Pinatubo. [10]

So Forster and Gregory felt constrained to bury their best estimate of climate sensitivity, and the discussion of the regression problem, in an appendix. In view of the ‘gatekeeper’ activities revealed in the Climategate emails, this may have been a wise judgement in 2005.

Now, eight years after the publication of F&G 2006, proper application of the best mathematical techniques available to correct this systematic over-estimation of climate sensitivity is long overdue.

 

Conclusion

Inappropriate use of linear regression can produce spurious, significantly low estimates of the true slope of a linear relationship if both variables have significant measurement error or other perturbing factors.

This is precisely the case when attempting to regress modelled or observed radiative flux against surface temperatures in order to estimate sensitivity of the climate system.

In the sense that this regression is conventionally done in climatology, it will under-estimate the net feedback factor (often denoted as ‘lambda’). Since climate sensitivity is defined as the reciprocal of this term, this results in an over-estimation of climate sensitivity.
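As a simple worked illustration of that arithmetic ( the 60% figure is hypothetical, chosen within the range of errors seen above ): if regression dilution causes the feedback parameter to be estimated at 60% of its true value, the inferred sensitivity, 1 / ( 0.6 × lambda ), is about 1.67 times the true value, ie. roughly 67% too high.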

If an incorrect evaluation of climate sensitivity from observations is used as a basis for the choice of parametrised inputs to climate models, the resulting model will be over sensitive and produce exaggerated warming. Similarly faulty analyses of their output will further inflate the apparent model sensitivity.

This situation may account for the difference between regression-based estimations of climate sensitivity and those produced by other methods. Many techniques to reduce this effect are available in the broader scientific literature, though there is no single, generally applicable solution to the problem.

Those using linear regression to assess climate sensitivity need to account for this significant source of error when supplying uncertainty values in published estimates of climate sensitivity, or take steps to address this issue.

 
 

References

 [1] Dessler 2010 b “A Determination of the Cloud Feedback from Climate Variations over the Past Decade”

http://geotest.tamu.edu/userfiles/216/dessler10b.pdf

 [2] Spencer and Braswell 2011: “On the Misdiagnosis of Surface Temperature Feedbacks from Variations in Earth’s Radiant Energy Balance”

http://www.mdpi.com/2072-4292/3/8/1603/pdf

 [3] Trenberth et al 2010 “Relationships between tropical sea surface temperature and top‐of‐atmosphere radiation”

http://www.mdpi.com/2072-4292/3/9/2051/pdf

 [4] Dessler 2011 “Cloud variations and the Earth’s energy budget”

http://geotest.tamu.edu/userfiles/216/Dessler2011.pdf

 [5] Lindzen & Choi 2011 “On the Observational Determination of Climate Sensitivity and Its Implications”

http://www-eaps.mit.edu/faculty/lindzen/236-Lindzen-Choi-2011.pdf

 [6] Nic Lewis : “A Sensitive Matter: How The IPCC Buried Evidence Showing Good News About Global Warming “
http://www.thegwpf.org/content/uploads/2014/02/A-Sensitive-Matter-Foreword-inc.pdf

 [7] Isobe et al 1990 “Linear Regression in Astronomy I”

http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?journal=ApJ..&year=1990&volume=.364&letter=.&db_key=AST&page_ind=116&data_type=GIF&type=SCREEN_GIF&classic=YES

 [8] Forster & Gregory 2006 “The Climate Sensitivity and Its Components Diagnosed from Earth Radiation Budget Data”

http://www.image.ucar.edu/idag/Papers/Forster_sensitivity.pdf

 [0] Adcock, R.J., 1878. “A Problem in Least Squares”. The Analyst, 5, 53–54.

Quirino Paris 2004 : “Robust Estimators of Errors-In-Variables Models Part I”

http://arelibrary.ucdavis.edu/working_papers/files/04-007.pdf

 [10] R. Spencer “Revisiting the Pinatubo Eruption as a Test of Climate Sensitivity”

http://www.drroyspencer.com/2010/06/revisiting-the-pinatubo-eruption-as-a-test-of-climate-sensitivity/

Gaussian low-pass filter



Click to enlarge graph.

The graph is scaled to unit frequency (eg 1 Hz or one cycle per year …).

It illustrates how a 159-day gaussian will filter daily data. For monthly data, 12 × 159 / 365 = 5.23, so the nearest equivalent would be sigma = 5 months. However, since that is slightly short of the intended value and a gaussian leaks a little ( it never falls fully to zero ), sigma = 6 months is a better choice to suppress a 12-month cycle.

Code ( use copy / paste within code block ).


#!/bin/awk -f

# pass input through 3 sigma gaussian filter where sigma, if not given, is 2 data points wide
# usage : ./gauss.awk filename <sigma=2> <scale_factor=1>
# optional scale_factor simply scales the output
# use  OFMT="%6.4f"
# sigma can be compared to the period of the -3dB point of the filter
# result is centred, ie not shifted. dataset shortened by half a window at each end

# operates on files with two columns of numeric data
#  eg date as integer year or floating point decimal
# check whether data is continuous !!
# nov 2012  OFMT="%10.8f"


# FWHM (-3dB) = 2*sigma*sqrt(2*ln2)
# half height freq = sigma * 1.1774
# 3 sigma gives 99.7% accuracy
# eg. sigma=6 months can be used to remove most of an annual signal
#  though other filters with a zero pass may be a better choice


BEGIN { OFMT="%10.8f"
# ARGV[1]=filename; argv[0] is script name, hence ARGC>=1

  if (ARGV[1]=="") { print " # usage : ./gauss.awk filename <sigma=2> <scale_factor=1>" ; exit 1}

  pi= 3.14159265359811668006

  if  ( ARGC >3 ) {scaleby=ARGV[3];ARGV[3]=""} else {scaleby=1};
  if  ( ARGC >2 ) {sigma=ARGV[2];ARGV[2]=""} else {sigma=2};

  print "# filtering "ARGV[1]" with gaussian of sigma= ",sigma
  root2pi_sigma=sqrt(2*pi)*sigma;
  two_sig_sqr=2.0*sigma*sigma;

  gw=3*sigma-1;  # gauss is approx zero at 3 sigma, use 3 sig window


# calculate normalised gaussian coeffs
  for (tot_wt=j=0;j<=gw;j++) {tot_wt+=gwt[-j]=gwt[j]=exp(-j*j/two_sig_sqr)/ root2pi_sigma};
  tot_wt=2*tot_wt-gwt[0];
  tot_wt/=scaleby;
  for (j=-gw;j<=gw;j++) {gwt[j]/=tot_wt};

# strip off last .xxx part of file name
# improve this  (doesn't work with paths like ../filename)

  if (ARGV[1]=="-") {
    out_file="-";
  }
   else {
    split(ARGV[1],fn,".");
    basename=fn[1]
    out_file=basename"-gauss"sigma".dat";
    print "# ",out_file >out_file;
  }

  ln=-1;
}

( ($0 !~ /^#/) && ($0 !~ /^[" ",\t]*$/)  ) {
  xdata[++ln]=$1;
  ydata[ln]=$2;

  if (ln>2*gw)
  {
    gauss=0
    for (j=-2*gw;j<=0;j++) {gauss+=ydata[ln+j]*gwt[j+gw]}
    if (out_file=="-")
      print xdata[ln-gw],gauss
    else {
      print NR,xdata[ln-gw],gauss
      print xdata[ln-gw],gauss >> out_file;
    }
  }
else
  {
#    print  $1,$2;

  }
}


END {
  if (ARGV[1]=="") { exit 1}
  print "# gausssian window width = "gw+gw+1",done"
  print "# output file = "out_file

# for (i=0;i<=gw;i++){print gwt[i]}
}
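
As a usage example ( the file name is hypothetical ), filtering a monthly time series with the six-month sigma discussed above would be:

./gauss.awk monthly_anomalies.dat 6

This echoes the filtered series to the screen and also writes it to monthly_anomalies-gauss6.dat.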


Lanczos high-pass filter


The graph shows the frequency response of a 3-lobe Lanczos high-pass filter with -3dB periods of 461 and 269 days, respectively.

These are chosen such that a periodicity of one year is either fully included in or fully excluded from the filter’s pass-band: the 269-day filter will pass only periods shorter than 365 days, fully blocking the annual cycle, while the 461-day filter ensures that everything, including the annual signal, is unattenuated and excludes anything with a period longer than about 1.6 years.

The equivalent low-pass filter is detailed here:

http://climategrog.wordpress.com/2013/11/28/lanczos-filter-script/

The frequency axis is in units of “per year”, not “year” as labelled.

Code: ( select text in code block and use copy/paste to get the code )


#!/bin/awk -f

#  usage: ./lanczos_HP.awk filename [-3dB_period=15] [lobes=5]"

# purpose: convolve 2nd column of file with lanczos windowed sinc function
# column 1 data should be evenly spaced without gaps.
# optional second parameter is the period of the half-power point (-3dB) attenuation, expressed in data intervals
#    default = 15 ; eg. 15 monthly periods : half power at 15mo, close to 100% pass at 12 mo.
#                   to exclude 12mo, use 9 . This will pass 6mo and higher freq unattenuated.
#                   for daily data use 461 and 269 days , respectively.
# optional third parameter is the number of lobes in impulse response. More gives sharper transition but more frequency distortion.
#    default=5; 3 and 4 are good. >5 may produce significant ringing artefacts


function sinc(x) {
  if (x==0) return 1; else return (sin(x)/x);
}

BEGIN{  OFMT = "%8.6f"

  twopi=6.28318530717959

  if ( ARGC >3 ) {lobes=ARGV[3];ARGV[3]=""} else {lobes=5};
  if ( ARGC >2 ) {period=ARGV[2];ARGV[2]=""} else {period=15};
  if ( ARGC <2) { print "### usage: ./lanczos_HP.awk filename [-3dB_period=15] [lobes=5]"; err=1; exit 1; }
  else {len=ARGV[2]; ARGC=2;}

  pi= 3.14159265359811668006 ;
  halfk=int(period/2*lobes);
  kwid= 2*halfk;         # kernel = -halfk....+halfk = kwid+1 pts ie odd number
  w_taper=twopi/kwid;    # first lanczos lobe -pi to +pi to fill window
  w=twopi/period;        # sinc passes zero pi, 2*pi, 3*pi ....

# calculate normalised kernel coeffs
  for (tot_wt=j=0;j<=halfk;j++) {tot_wt+=gwt[-j]=gwt[j]=sinc(j*w)*sinc(j*w_taper)};
  tot_wt=2*tot_wt-gwt[0];
  for (j=-halfk;j<=halfk;j++) {gwt[j]/=tot_wt};

# for HP filter, negate the normalised LP kernel and add a unit impulse at the centre
# (spectral inversion):
  for (j=-halfk;j<=halfk;j++) {gwt[j]=-gwt[j]}
  gwt[0]+=1;

# strip off last .xxx part of file name
# improve this  (doesn't work with paths like ../filename)

  if (ARGV[1]=="-") {
    out_file="-";
  }
  else {
    split(ARGV[1],fn,".");
    basename=fn[1]
    out_file=basename"-lanc"lobes"LP_"period".dat";
    print "# ",out_file >out_file;
    print "# ",out_file
  }

  ln=-1;  # init linenumber counter
}

($0 !~ /^#/)&&($0 != "") {    # strip out comments, headers and blank lines
  xdata[++ln]=$1;
  ydata[ln]=$2;

  if (ln>2*halfk)
  {
    sum=0
    for (j=-2*halfk;j<=0;j++) {sum+=ydata[ln+j]*gwt[j+halfk]}
    if (out_file=="-")
      print xdata[ln-halfk],sum
    else {
      print NR,xdata[ln-halfk],sum
      print xdata[ln-halfk],sum >> out_file;
    }
  } #endif ln

}


END {
    if (err) exit err;
    if (FNR<=kwid) {print " ### insufficient data to fill "kwid+1" point filter buffer"; exit 1}

    print "# "lobes"-lobe lanczos kernel width = "2*halfk+1",done"

    print "# output file = "out_file
# uncomment following to dump kernel (impluse response):
# sum=0; for (j=-halfk;j<=halfk;j++) {sum+= gwt[j]; printf ("%.4f\n", gwt[j])}
# print "# abs sum of coeffs = ",sum;

}
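
As a usage example on a hypothetical daily data file, the two variants discussed above ( 3-lobe kernels with -3dB periods of 269 and 461 days ) would be invoked as:

./lanczos_HP.awk daily_ice_area.dat 269 3
./lanczos_HP.awk daily_ice_area.dat 461 3

The first passes only sub-annual variation and blocks the annual cycle; the second keeps the annual cycle and removes anything much longer than about 1.6 years.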


Lanczos low-pass filter



Click to enlarge graph.

The frequency axis is in units of “per year”, not “year” as labelled.

Equivalent high-pass filter:

http://climategrog.wordpress.com/2013/11/28/lanczos-high-pass-filter/

Discussion: http://climategrog.wordpress.com/?attachment_id=659

Source Code: ( to get code, use browser copy / paste function within code block below).


#!/bin/awk -f

#  usage: ./lanczos_LP.awk filename [-3dB_period=15] [lobes=5]"

# purpose: convolve 2nd column of file with lanczos windowed sinc function
# column 1 data should be evenly spaced without gaps.
# optional second parameter is the period of the half-power point (-3dB) attenuation, expressed in data intervals
#    default = 15 ; eg. 15 monthly periods : half power at 15mo, close to zero pass at 12 mo.
# optional third parameter is the number of lobes in impulse response. More gives sharper transition but more overshoot distortion of frequency.
# output comments give pass-band and stop band info. this is not sent to output file.
#    default=5; 3 and 4 are good. >5 may produce significant ringing artefacts.


function sinc(x) {
  if (x==0) return 1; else return (sin(x)/x);
}

BEGIN{  OFMT = "%8.6f"

  twopi=6.28318530717959

  if ( ARGC >3 ) {lobes=ARGV[3];ARGV[3]=""} else {lobes=5};
  if ( ARGC >2 ) {period=ARGV[2];ARGV[2]=""} else {period=15};
  if ( ARGC <2) { print "### usage: ./lanczos_LP.awk filename [-3dB_period=15] [lobes=5]"; err=1; exit 1; }
  else {len=ARGV[2]; ARGC=2;}

  pi= 3.14159265359811668006 ;
  halfk=int(period/2*lobes);
  kwid= 2*halfk;         # kernel = -halfk....+halfk = kwid+1 pts ie odd number
  w_taper=twopi/kwid;    # first lanczos lobe -pi to +pi to fill window
  w=twopi/period;        # sinc passes zero pi, 2*pi, 3*pi ....

# calculate normalised kernel coeffs
  for (tot_wt=j=0;j<=halfk;j++) {tot_wt+=gwt[-j]=gwt[j]=sinc(j*w)*sinc(j*w_taper)};
  tot_wt=2*tot_wt-gwt[0];
  for (j=-halfk;j<=halfk;j++) {gwt[j]/=tot_wt};

# strip off last .xxx part of file name
# improve this  (doesn't work with paths like ../filename)

  if (ARGV[1]=="-") {
    out_file="-";
  }
  else {
    split(ARGV[1],fn,".");
    basename=fn[1]
    out_file=basename"-lanc"lobes"LP_"period".dat";
    print "# ",out_file >out_file;
    print "# ",out_file
  }

  ln=-1;  # init linenumber counter
}

($0 !~ /^#/)&&($0 != "") {    # strip out comments, headers and blank lines
  xdata[++ln]=$1;
  ydata[ln]=$2;

  if (ln>2*halfk)
  {
    sum=0
    for (j=-2*halfk;j<=0;j++) {sum+=ydata[ln+j]*gwt[j+halfk]}
    if (out_file=="-")
      print xdata[ln-halfk],sum
    else {
      print NR,xdata[ln-halfk],sum
      print xdata[ln-halfk],sum >> out_file;
    }
  } #endif ln

}


END {
    if (err) exit err;
    if (FNR<=kwid) {print " ### insufficient data to fill "kwid+1" point filter buffer"; exit 1}

    print "# "lobes"-lobe lanczos kernel width = "2*halfk+1",done"
    if (lobes==5) printf ("# "lobes"-lobe lanczos LP on %s: zero-pass at %.3f; 99%% pass > %.3f \n",ARGV[1],period/1.258,period/.752)
    if (lobes==4) printf ("# "lobes"-lobe lanczos LP on %s: zero-pass at %.3f; 99%% pass > %.3f \n",ARGV[1],period/1.325,period/.698)
    if (lobes==3) printf ("# "lobes"-lobe lanczos LP on %s: zero-pass at %.3f; 99%% pass > %.3f \n",ARGV[1],period/1.436,period/.588)

    print "# output file = "out_file
# uncomment following to dump kernel (impluse response):
# sum=0; for (j=-halfk;j<=halfk;j++) {sum+= gwt[j]; printf ("%.4f\n", gwt[j])}
# print "# abs sum of coeffs = ",sum;

}

simple running-mean script ( as base for triple RM )

The following script will calculate a simple running mean. On its own this is a very poor filter, see accompanying article :

http://climategrog.wordpress.com/2013/05/19/triple-running-mean-filters/

This script is intended for use by the triple-running-mean script available here:

http://climategrog.wordpress.com/2013/11/02/triple-running-mean-script/

Select code with mouse to copy elsewhere.

#!/bin/awk -f

# calculate running mean of width 'window'
# line number taken as x if file is a single column of data
# [usage] ./runmean.awk filename <window>
# default window=12
# strips empty and comment lines

# operates on files with two columns of numeric data
#  eg date as integer year or floating point decimal
# check whether data is continuous and evenly spaced !!


BEGIN{  OFMT = "%10.8f"
# ARGV[1]=fn; argv[0] is script name, hence ARGC>=1

  if (ARGV[1]=="") { print " ### error, no input file given." ; err=1; exit 1}

  ln=0;
  if  ( ARGC >2 ) {window=ARGV[2];ARGV[2]=""} else {window=12};
  if (int(window)!=window) {window=int(window+0.5); window_adj=1; }

  print "# filtering "ARGV[1]" with running mean  window = ",window

  if (ARGV[1]=="-") {
    out_file="-";
  }
   else {
    split(ARGV[1],fn,".");
    basename=fn[1]
    out_file=basename"-rmean"window".dat";

    print "# ",out_file >out_file;
  }
}  # end BEGIN


{ # body
if ( ($0 !~ /^#/) && ($0 !~ /^[" ",\t]*$/)  )  {  # strip empty and comment lines

  if ($2==""){  # for single column data files
    xdata[++ln]=NR;
    ydata[ln]=$1;
  }else{
    xdata[++ln]=$1;
    ydata[ln]=$2;
  }

  sum+=ydata[ln];

  if (ln>=window)
  {
    mid_x=(xdata[ln]+xdata[ln-(window-1)])/2;  # -1 to centre correctly
    ymean=(sum)/window;
    sum-= ydata[ln-window+1];  # remove oldest
    if (out_file=="-")
      print mid_x,ymean;
    else {
      print NR,mid_x,ymean
      print mid_x,ymean >> out_file;
    }
  } # endif
 } # non empty
} # end body

END {
  if (err==1) { exit 1}
  if (window_adj==1) print "# warning : window rounded to nearest integer, window  = ",window
  print "# running mean window = "window",done"
  print "# output file = "out_file
}
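
A usage example with a hypothetical file, including the stdin form ( "-" ) that the triple running mean script below relies on:

./runmean.awk monthly_data.dat 12
cat monthly_data.dat | ./runmean.awk - 12 > filtered.dat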

triple running-mean script

The following script will call the simple running mean three times with appropriate window sizes to effect a triple running mean, as described in the article ( amended to use an asymmetric kernel to minimise negative leakage ):

http://climategrog.wordpress.com/2013/05/19/triple-running-mean-filters/

It requires the runmean.awk script found here:

http://climategrog.wordpress.com/2013/11/02/574/

Select code with mouse to copy elsewhere.

#!/bin/bash

# call runmean.awk three times to compose triple running mean
# usage: ./r3m.sh file window_len   ; default window is 12 data point

if [ "x$1" == "x"  ]; then echo "$0 : err no file name,   usage: $0 filename "
 exit 1;
else fn=$1
fi

if [ "x$2" == "x"  ]; then win=12; else win=$2; fi

#win2=`awk "BEGIN{print $win/ 1.3371}"`
#win3=`awk "BEGIN{print $win/ 1.3371/ 1.3371}"`


# asymmetric stages with following window ratios:
k=1.15; k2=1.58

win2=`awk "BEGIN{print $win/ "$k"}"`
win3=`awk "BEGIN{print $win/ "$k2" }"`

outfile=`echo $fn | awk '{ print substr($1,1,length($1)-4) }'`
outfile+="-3rm"$win".dat"

echo  "# triple running mean :  $win  $win2  $win3 " > $outfile

# echo $fn; echo $win; echo $win2; echo $win3

cat $fn | ./runmean.awk - $win |  ./runmean.awk - $win2 |  ./runmean.awk - $win3 >> $outfile

#echo "# cat $fn | ./runmean.awk - $win |  ./runmean.awk - $win2 |  ./runmean.awk - $win3 >> $outfile"

echo "# triple running mean :  $win  $win2  $win3 "
echo "# outfile = "$outfile;

On identifying inter-decadal variation in NH sea ice

Introduction

The variation in the magnitude of the annual cycle in arctic sea ice area has increased notably since the minimum of 2007. This means that using a unique annual cycle fitted to all the data leaves a strong residual ( or “anomaly” ) in post-2007 years, making it difficult or impossible to visualise how the data has evolved during that period. This leads to the need for an adaptive method to evaluate the typical annual cycle, in order to render the inter-annual variations more intelligible.

Before attempting to assess any longer-term averages or fit linear “trends” to the data, it is also necessary to identify any repetitive variations of a similar time-scale to that of the fitting period, in order to avoid spurious results.

See also: an examination of sub-annual variation

http://climategrog.wordpress.com/?attachment_id=460

Method

Short-term anomalies: adapting calculations to decadal variability.

Adaptation can be achieved by identifying different periods in the data that have different degrees of variability and calculating the average annual variation for each period.

Details of how the average seasonal profile for each segment was calculated are shown in the appendix.


Figure 1. The post-2007 interval. (click to enlarge)

The data was split into separate periods that were identified by their different patterns of variation. For example, the period from 1997 to 2007 has a notably more regular annual cycle (leading to smaller variation in the anomalies). The earlier periods have larger inter-annual variations, similar to those of the most recent segment.

An approximately periodic variation was noted in the data, in the rough form of a rectified sine wave. In order to avoid corrupting the average slope calculations by spanning from a peak to a trough of this pattern, a mathematical sine function was adjusted manually to approximate the period and magnitude shown in the data. It was noted that, despite the break in this pattern in the early 2000s, its phase did not alter in the post-2007 section. However, there was a notable drop in level ( about 0.5×10^6 km^2 ). Labels indicate the timing of several notable climate events which may account for some of the deviations from the observed cyclic pattern. These are included for ease of reference without necessarily implying a particular interpretation or causation.



Figure 2. Identifying periods for analysis

The early period ( pre-1988 ) was also separated out, since it was derived from a notably different instrument on a satellite with a much longer global coverage period. It is mostly from the Nimbus 7 mission, which was in an orbit giving 6-day global coverage. This data was somehow processed to produce a 2-day interval time-series, although documentation of the processing method seems light.

Later data, initially from the US defence meteorology platforms, starts in 1988 and has total polar coverage twice per day, producing a daily time-series.

In order to maintain the correct relationship between the different segments, the mean value of each segment, taken from the constant term of the harmonic model used to derive the seasonal variations, was retained. The anomaly time-series was reconstructed by adding the deviations from the harmonic model to this mean, for each segment in turn. The calculated anomalies were extended beyond the base period at each end, so as to provide an overlap from which to determine suitable points for splicing the separate records together. A light low-pass filter was applied to the residual ‘anomalies’ to remove high-frequency detail and improve visibility.

The result of this processing can be compared to the anomaly graph provided at the source of this data. The extent of the lines showing mean rates of change indicates the periods of the data used. These were chosen to be an integral number of cycles of the repetitive, circa five-year pattern.



Figure 3. Showing composite adaptive anomaly



Figure 4. Showing Cryosphere Today anomaly derived with single seasonal cycle

Discussion

The average slopes for each segment are shown and clearly indicate that the decrease in ice area was accelerating from the beginning of the satellite era until 2007. The derived values suggest this was a roughly parabolic acceleration. This was a cause for legitimate concern around 2007 and there was much speculation as to what it implied for the future development of arctic ice coverage. Many scientists suggested “ice free” summers by 2013 or shortly thereafter.

The rate of ice loss since 2007 is very close to that of the 1990s but is clearly less pronounced than it was from 1997 to 2007, a segment of the data which in itself shows a clear downward curvature, indicating accelerating ice loss.

Since some studies estimate that much of the ice is now thinner, younger ice of only one or two years of age, recent conditions should be a more sensitive indicator of change. Clearly the predicted positive feedbacks, run-away melting and catastrophic collapse of Arctic ice are not in evidence. The marked deceleration since 2007 indicates that either the principal driver of the melting has abated or there is a strongly negative regional feedback operating to counteract it.

The 2013 summer minimum is around 3.58 million km^2. Recent average rate of change is -0.043 million km^2 per year.

While it is pointless to suggest any one set of conditions will be maintained in an ever varying climate system, with the current magnitude of the annual variation and the average rate of change shown in the most recent period, it would take 83 years to reach ice free conditions at the summer minimum.

Some have preferred to redefine “ice free” to mean less than 1 million km^2 of sea ice remaining at the summer minimum. On that basis the “ice free” summer figure reduces to 61 years.

Conclusion

In order to extract and interpret inter-decadal changes in NH sea ice coverage, it is essential to characterise and remove short-term variation. The described adaptive method allows a clearer picture of the post-2007 period to be gained. It shows that what could be interpreted as a parabolic acceleration in the preceding part of the satellite record is no longer continuing and has been replaced by a decelerating change in ice area. At the time of this analysis, the current decadal rate of change is about -430,000 km^2 per decade, against an annual average of around 8.7 million km^2 and a summer minimum of around 3.6 million km^2.

Whether the recent deceleration will continue or revert to the earlier acceleration is beyond current understanding and prediction capabilities. However, it is clear that it would require a radical change for either an ice-free ( or less than one million km^2 ) arctic summer minimum to occur in the coming years.

Unfortunately data of this quality covers a relatively short period of climate history. This means it is difficult to conclude much except that the earlier areal acceleration has changed to a deceleration, with the current rate of decline being about half that of the 1997-2007 period. This is clearly at odds with the suggestion that the region is dominated by a positive feedback.

It is also informative to put this in the context of the lesser, but opposite, tendency of increasing sea ice coverage around Antarctica over the same period.


Figure 5. Showing changes in Arctic and Antarctic sea ice coverage




 
 


Appendix: approximating the typical seasonal cycle

Spectral analysis of the daily ice area data shows many harmonics of the annual variation, the amplitude generally diminishing with frequency. Such harmonics can be used to reconstruct a good approximation of the asymmetric annual cycle.

Here the typical cycle of seasonal variation for each period was estimated by fitting a constant plus a 7-harmonic model to the data ( the 7th harmonic having a period of about 52 days ). The residual difference between the data and this model was then taken as the anomaly for the segment and low-pass filtered to remove 45-day (8th harmonic) and shorter variations.
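The fitting code itself is not reproduced here, but the idea can be sketched in a few lines of awk ( a sketch only, not the code used for this article ). It assumes evenly spaced data with decimal years in column 1 and area in column 2, covering a whole number of years, in which case a direct projection onto each harmonic is a good approximation to the least-squares fit:

#!/bin/awk -f
# Sketch only: estimate a seasonal cycle as a constant plus the first 7 annual
# harmonics by direct projection, then print the residual "anomaly".
# Assumes column 1 is decimal year, column 2 is ice area, evenly spaced and
# covering a whole number of years.
($0 !~ /^#/) && ($0 != "") { t[++n]=$1; y[n]=$2 }
END {
  twopi=6.28318530717959
  for (i=1;i<=n;i++) mean+=y[i]/n
  for (k=1;k<=7;k++) {                       # projection onto each harmonic
    for (i=1;i<=n;i++) {
      a[k]+=2/n*y[i]*cos(twopi*k*t[i])
      b[k]+=2/n*y[i]*sin(twopi*k*t[i])
    }
  }
  for (i=1;i<=n;i++) {
    model=mean
    for (k=1;k<=7;k++) model+=a[k]*cos(twopi*k*t[i])+b[k]*sin(twopi*k*t[i])
    print t[i], y[i]-model                   # residual anomaly for this segment
  }
}

The light low-pass filtering of the residual described above would be applied as a separate step, for example with the gaussian script given earlier.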

This reveals an initial recovery from the 2007 minimum and a later drift downwards to a new minimum in 2012. By the time of the annual minimum for 2013, another strong recovery seems to have started.

This repetitive pattern, which was identified in the full record, is used to define trough-to-trough or peak-to-peak periods over which to calculate the average rate of change for each segment of the data. This was done by fitting a linear model to the unfiltered anomaly data ( using a non-linear least squares technique).

Supplementary Information

The fitted average rate of change for the respective periods, in chronological order ( million km^2 per year )

-0.020
-0.047
-0.088
-0.043

The cosine amplitude, half the total annual variation, of the fundamental of the harmonic model for successive data segments ( million km^2 )

pre 1988 -4.43
1988-1997 -4.47
1997-2007 -4.62
post 2007 -5.14

Mathematical approximation of repetitive pattern used to determine sampling intervals, shown in figure 2.

cosine period 11.16 years ( 5.58 year repetition )
cosine amplitude 0.75 x 10^6 km^2

Resources

Data source: http://arctic.atmos.uiuc.edu/cryosphere/timeseries.anom.1979-2008
(data downloaded 16th Sept 2013)

An extensive list of official sources of both arctic and antarctic sea ice data can be found here:

http://wattsupwiththat.com/reference-pages/sea-ice-page/