Data corruption by running mean “smoothers”

[See update at end of article]

Running means are often used as a simple low-pass filter, usually without much understanding of their defects, and the operation is often referred to as a “smoother”. In fact it does not even “smooth” very well, since it lets through enough high frequencies to give a spiky result.

Running means are fast and easy to implement, and since most people have some understanding of what an average does, the idea of a running average seems easily understood. Sadly it is not that simple, and running averages often cause serious corruption of the data.

So it smooths the data to an extent, but what else does it do?

The problem with an evenly weighted average is that the data is effectively masked by a rectangular window. The frequency response of a rectangular window is the sinc function [1], so the effect on the frequency content of the data is that of applying the sinc function as a frequency filter. The sinc function oscillates and has negative lobes that actually invert part of the signal the filter was intended to remove. This can introduce all sorts of undesirable artefacts into the data.
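As a quick numerical illustration (not part of the article's scripts; the 12-point window is used purely as an example), a few lines of python show the negative lobes directly:

    import numpy as np

    N = 12                                  # e.g. a 12-month running mean
    k = np.arange(N) - (N - 1) / 2.0        # centred tap positions
    f = np.linspace(0.001, 0.5, 1000)       # frequency in cycles per sample

    # zero-phase frequency response of the uniform (rectangular) window
    H = np.array([np.sum(np.cos(2 * np.pi * fi * k)) / N for fi in f])

    i = np.argmin(H)
    print("most negative gain %.3f at f = %.4f cycles/sample (roughly 1.43/N)" % (H[i], f[i]))
    # a gain of about -0.22 means that band comes through at ~22% amplitude with inverted sign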

An example of one of the problems can be seen here:

http://www.woodfortrees.org/plot/rss/from:1980/plot/rss/from:1980/mean:60/plot/rss/from:1980/mean:30/mean:22/mean:17

Figure 1. Comparing the effects of different filters on a climate data time series (60 month running mean vs 30 month triple running mean [blue]).

It can be noted that some of the peaks and troughs in the running mean are exactly wrong: where the raw data has a peak, the running mean produces a trough. This is clearly undesirable.

The data is “smoother” than it was, but its sense is perverted. This highlights the difference between simply “smoothing” data and applying an appropriately chosen low-pass filter. The two are not the same, although the terms are often treated as synonymous.
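The inversion is easy to reproduce numerically. A minimal sketch, using an odd 13-point window purely so that the output stays exactly centred, and a sinusoid chosen to sit on the negative lobe:

    import numpy as np

    N = 13                                  # odd window so that 'same' mode stays centred
    period = N / 1.4303                     # period that falls on the negative lobe
    t = np.arange(400)
    x = np.cos(2 * np.pi * t / period)      # raw data: a regular series of peaks

    y = np.convolve(x, np.ones(N) / N, mode="same")

    core = slice(30, -30)                   # ignore the ends
    gain = np.dot(y[core], x[core]) / np.dot(x[core], x[core])
    print("effective gain at this period: %.3f" % gain)
    # negative: wherever the input has a peak, the "smoothed" output has a trough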

Some other filters, such as the gaussian, are much better behaved. However, a gaussian response never falls to zero, so there is always some leakage of what we would like to remove. That is often acceptable but not always ideal.

Figure 2. Comparing the magnitude of the frequency responses of gaussian and running mean filters. Note, however, that every other lobe of the running mean response is negative in sign, actually inverting that part of the data.
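A rough way to put numbers on this comparison (the gaussian width of sigma = 6 samples is an arbitrary choice for illustration):

    import numpy as np

    def response(weights, f):
        """Zero-phase gain of a symmetric, centred kernel at frequencies f (cycles/sample)."""
        k = np.arange(len(weights)) - (len(weights) - 1) / 2.0
        return np.array([np.sum(weights * np.cos(2 * np.pi * fi * k)) for fi in np.atleast_1d(f)])

    box = np.ones(12) / 12                  # 12-point running mean

    sigma = 6.0                             # illustrative gaussian width (samples)
    k = np.arange(-30, 31)
    gauss = np.exp(-0.5 * (k / sigma) ** 2)
    gauss /= gauss.sum()

    f_annual = 1 / 12.0                     # the frequency a 12-point mean is meant to remove
    f_lobe = 1.4303 / 12.0                  # centre of the running mean's first negative lobe

    for name, w in [("running mean", box), ("gaussian", gauss)]:
        g = response(w, [f_annual, f_lobe])
        print("%-12s gain at 1/12: %+.4f   gain at 1.43/12: %+.4f" % (name, g[0], g[1]))

    # running mean: essentially zero at 1/12, but strongly negative (inverting) at 1.43/12
    # gaussian:     never goes negative, but never quite reaches zero either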

Below is a comparison of two filters ( running mean and gaussian ) applied to some synthetic climate-like data generated from random numbers. Click to see the full extent of the graph.

Figure 3. Artefacts introduced by a simple running mean filter, compared with a gaussian, on synthetic (AR1) climate-like data.

As well as the inversion defect, which is again found here around 1970, some of the peaks get bent sideways into an asymmetric form; this aberration can be noted around 1958 and 1981. When comparing two datasets in order to attribute causation or to measure the response times of events, this could be very disruptive and lead to totally false conclusions.
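An experiment of this kind is easy to set up. A minimal sketch, assuming AR(1) red noise as the synthetic series and an arbitrary gaussian width:

    import numpy as np

    rng = np.random.default_rng(1)

    # synthetic "climate-like" monthly series: AR(1) red noise
    n = 600
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = 0.9 * x[i - 1] + rng.standard_normal()

    def smooth(data, weights):
        weights = np.asarray(weights, dtype=float)
        return np.convolve(data, weights / weights.sum(), mode="same")

    k = np.arange(-18, 19)
    gauss = np.exp(-0.5 * (k / 6.0) ** 2)    # illustrative gaussian, sigma = 6 samples

    rm = smooth(x, np.ones(12))              # 12-point running mean
    gs = smooth(x, gauss)                    # gaussian of comparable width

    core = slice(30, -30)                    # keep clear of edge effects
    print("largest disagreement between the two 'smooths': %.2f (series std %.2f)"
          % (np.abs(rm[core] - gs[core]).max(), x.std()))
    # plotting x, rm and gs together shows the running mean inverting or skewing
    # some of the sharper peaks that the gaussian merely attenuates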

 

Triple running mean filters

Another solution is to improve the running mean’s frequency response.

The sinc function has the maximum of its troublesome negative lobe where tan(πx) = πx. Solving this gives x = 1.4303, not the 1.3371 originally stated here.
[Thanks to Peter Mott for pointing out the error.]
However, simply targeting the peak of the lobe does not produce optimal results: somewhat reduced ratios leave less residual, so the value 1.3371 is retained in what follows.

Now if a second running mean with a period shorter by this ratio is passed after the first one, it will filter out the inverted data, but it will also produce another, smaller, positive lobe.

A similar operation with a third, still shorter window will kill the new lobe, and by this stage any residual leakage is small enough that it is unlikely to cause problems.
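As a rough check on this reasoning, assuming a 12-point base window with the reduced windows rounded to 9 and 7 points (as in the example further down), the residual stop-band leakage of the cascade can be compared with the single running mean:

    import numpy as np

    def cascade_gain(windows, f):
        """Combined zero-phase gain of a cascade of centred running means."""
        g = np.ones_like(f)
        for n in windows:
            k = np.arange(n) - (n - 1) / 2.0
            g = g * np.array([np.sum(np.cos(2 * np.pi * fi * k)) / n for fi in f])
        return g

    f = np.linspace(1.0 / 12, 0.5, 4000)     # the band a 12-point mean is meant to remove
    single = cascade_gain([12], f)
    triple = cascade_gain([12, 9, 7], f)     # 12 / 1.3371 ~ 9, / 1.3371 again ~ 7

    print("single running mean: worst stop-band gain %+.3f" % single[np.argmax(np.abs(single))])
    print("triple running mean: worst stop-band gain %+.3f" % triple[np.argmax(np.abs(triple))])
    # worst-case leakage drops from roughly 20% of the signal (inverted) to below 1%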

The triple running mean has the advantage that it has a zero in its frequency response that will totally remove a precise frequency, as well as letting very little of the higher frequencies through. If there is a fixed, known frequency to be eliminated, this can be a better choice than a gaussian filter of similar period.

The simple and triple running means are both shown in figure 1 above, where it can be seen that the triple running mean does not invert the peaks as the commonly used simple running mean does.

Example.

With monthly data it is often desirable to remove the annual variation. This can be approximated by the 12, 9, 7 month triple running mean shown:

12 / 1.3371 = 8.975
12 / 1.3371 / 1.3371 = 6.712

It can be seen that the second stage is pretty accurate but the final one is rather approximate; even so, the error in the third stage is not large.
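A minimal python sketch of this 12, 9, 7 month filter, applied to an arbitrary toy series, might look as follows (the scripts linked in the appendix are the proper implementations):

    import numpy as np

    def running_mean(x, n):
        """Centred running mean (an even n carries a half-sample offset; see the update)."""
        return np.convolve(x, np.ones(n) / n, mode="same")

    def triple_running_mean(x, n=12, ratio=1.3371):
        """Cascade of three running means with windows n, n/ratio, n/ratio^2 (12 -> 12, 9, 7)."""
        for w in (n, int(round(n / ratio)), int(round(n / ratio ** 2))):
            x = running_mean(x, w)
        return x

    # toy monthly series: slow trend + annual cycle + noise
    rng = np.random.default_rng(0)
    t = np.arange(360)
    series = 0.005 * t + np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(t.size)

    smoothed = triple_running_mean(series)

    core = slice(24, -24)                    # keep clear of edge effects
    annual = np.sin(2 * np.pi * t / 12)
    print("correlation of the output with the annual cycle: %.3f"
          % np.corrcoef(smoothed[core], annual[core])[0, 1])
    # close to zero: the annual cycle has been removed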



Figure 4. Comparing frequency response of gaussian and triple running mean filters.

A similar operation on daily data would use: 365, 273, 204

365.242 / 1.3371 = 273.16
365.242 / 1.3371 / 1.3371 = 204.29

Another advantage is that the data from the triple running mean (r3m) filter really is “smooth”, since it does not let through the high frequencies that a simple running mean does. If the aim is simply to “smooth” the data, rather than to target a specific frequency, an r3m filter with half the nominal width often gives a smoother result without losing as much information, as was shown in figure 1.

This defect in the smoothing can be seen in the example plot: for instance, there is a spike near 1986 in the simple running mean. Worst of all, this is not even a true spike in the data getting through the filter; it is an artefact.

Another example is the official NOAA [2] presentation of the sunspot number (SSN) taken from SIDC [3].

In 2005, Svalgaard et al. published a prediction of the cycle 24 peak [4]. That prediction has proved to be remarkably accurate. It would be even more remarkable if SIDC were to apply a “smoothing” filter that did not invert and displace the peak and reduce its value.

“Using direct polar field measurements, now available for four solar cycles, we predict that the approaching solar cycle 24 (~2011 maximum) will have a peak smoothed monthly sunspot number of 75 ± 8, making it potentially the smallest cycle in the last 100 years.”

SIDC processing converts a later trough into the peak value of cycle 24. The supposed peak aligns with the lowest monthly value in the last 2.5 years of data. Clearly the processing is doing more than the intended “smoothing”.

The filter used in this case is a running mean with the first and last points given half weighting. It is essentially the same as a simple running mean and shares the same defects. Apparently this filter was introduced by the Zürich observatory at the end of the 19th century, when all these calculations had to be done by hand (and perhaps the defects were less well understood). The method has been retained to provide consistency with the historical record. This practice is currently under review.
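The similarity is easy to verify. Comparing the most negative gain of a flat 12-point mean with the 13-point, half-weighted-ends version described in [3] (a quick sketch, not SIDC's own code):

    import numpy as np

    # SIDC-style 13-point filter: flat mean with half-weight end points [3]
    w_sidc = np.r_[0.5, np.ones(11), 0.5] / 12.0
    w_flat = np.ones(12) / 12.0

    def min_gain(w):
        k = np.arange(len(w)) - (len(w) - 1) / 2.0          # centred taps
        f = np.linspace(0.001, 0.5, 2000)                   # cycles per month
        return min(np.sum(w * np.cos(2 * np.pi * fi * k)) for fi in f)

    print("flat 12-month running mean: worst gain %.3f" % min_gain(w_flat))
    print("SIDC 13-point smoother:     worst gain %.3f" % min_gain(w_sidc))
    # both filters still pass the worst band inverted, at roughly 20% amplitude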

While it may have been a reasonable compromise in the 19th century, there seems little reason, other than ignorance of the problems, to use simple running mean “smoothers” in the 21st century.

Conclusion

Referring to a filter as a “smoother” is often a sign that the user is seeking a visual effect and may be unaware that this can fundamentally change the data in unexpected ways.


Wider appreciation of the corruption introduced by using running mean filters would be beneficial in many fields of study.

 

Refs.

  [1] Plot of sinc function http://mathworld.wolfram.com/SincFunction.html

  [2] NOAA/Space Weather Prediction Center http://www.swpc.noaa.gov/SolarCycle/index.html

  [3] SIDC sunspot data: http://sidc.oma.be/sunspot-data/
SIDC readme: http://sidc.oma.be/html/readme.txt
SIDC applies a 13 point running mean with the first and last points weighted 50%. This is a slight improvement on a flat running mean but shares the same tendency to invert certain features in the data.

  [4] Svalgaard, L., E. W. Cliver, and Y. Kamide (2005), Sunspot cycle 24: Smallest cycle in 100 years?, Geophys. Res. Lett., 32, L01104, doi:10.1029/2004GL021664.
http://www.leif.org/research/Cycle%2024%20Smallest%20100%20years.pdf

Appendix

Scripts to automatically effect a triple-running-mean are provided here:

http://climategrog.wordpress.com/2013/11/02/triple-running-mean-script/

Example of how to effect a triple running mean on Woodfortrees.org:

http://www.woodfortrees.org/plot/rss/from:1980/plot/rss/from:1980/mean:60/plot/rss/from:1980/mean:30/mean:22/mean:17

Example of a triple running mean in a spreadsheet:

https://www.dropbox.com/s/gp34rlw06mcvf6z/R3M.xls

[Update]

The main object of this article was to raise awareness of the strong, unintentional distortions introduced by the ubiquitous running mean “smoother”.

Filter design is a whole field of study in itself, and even an introduction to it would be beyond the scope of this short article. However, one aim here was to suggest some useful replacements for the simple running average and to provide implementations that can easily be adopted. To that end, a small adjustment has been made to the r3m.sh script provided, and another, higher quality filter is introduced:

http://climategrog.wordpress.com/?attachment_id=659

A script to implement a low-pass Lanczos filter is provided here: http://climategrog.wordpress.com/2013/11/28/lanczos-filter-script/

An equivalent high-pass filter is provided here: http://climategrog.wordpress.com/2013/11/28/lanczos-high-pass-filter/
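The linked scripts are the reference implementations; purely as an illustration of the idea, a generic Lanczos (windowed-sinc) low-pass can be sketched as follows, with the cutoff period and number of taps chosen arbitrarily:

    import numpy as np

    def lanczos_lowpass(cutoff_period, half_width):
        """Lanczos (windowed-sinc) low-pass weights.
        cutoff_period: period, in samples, below which variability is removed.
        half_width:    taps either side of centre (2 * half_width + 1 in total)."""
        fc = 1.0 / cutoff_period
        k = np.arange(-half_width, half_width + 1)
        w = 2 * fc * np.sinc(2 * fc * k) * np.sinc(k / half_width)
        return w / w.sum()                  # unit gain at zero frequency

    # e.g. keep variability slower than about 24 months, using 37 taps
    w = lanczos_lowpass(24, 18)
    # low = np.convolve(data, w, mode="same")
    # high = data - low                     # one simple way to obtain a high-pass series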

High-pass filters may be used, for example, to isolate sub-annual variability in order to investigate the presence or absence of a lunar influence in daily data.

An example is the 66 day filter used in this analysis:

http://climategrog.wordpress.com/?attachment_id=460

The following points arose in discussion of the article.

Vaughan Pratt points out that shortening the window by a factor of 1.2067 (rather than 1.3371 originally suggested in this article) reduces the stop-band leakage. This provides a useful improvement.
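The effect of the two ratios can be checked numerically; a rough sketch, assuming a 12-month base window with the reduced windows rounded to whole months:

    import numpy as np

    def cascade_gain(windows, f):
        """Combined zero-phase gain of a cascade of centred running means."""
        g = np.ones_like(f)
        for n in windows:
            k = np.arange(n) - (n - 1) / 2.0
            g = g * np.array([np.sum(np.cos(2 * np.pi * fi * k)) / n for fi in f])
        return g

    f = np.linspace(1.0 / 12, 0.5, 4000)     # stop band for a 12-month base window

    for ratio in (1.3371, 1.2067):
        windows = [12, int(round(12 / ratio)), int(round(12 / ratio ** 2))]
        g = cascade_gain(windows, f)
        print("ratio %.4f -> windows %s: worst leakage %+.4f"
              % (ratio, windows, g[np.argmax(np.abs(g))]))
    # the 1.2067 ratio (12, 10, 8) roughly halves the worst-case stop-band leakage here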

Further optimisation can be attained by reducing negative leakage peaks at the cost of accepting slightly more positive leakage. Since the residual negative peaks are still inverting and hence corrupting the data, this will generally be preferable to simply reducing net residuals irrespective of sign.

The asymmetric triple running-mean is shown in the comparison of the frequency responses, along with a Lanczos filter, here:

http://climategrog.wordpress.com/?attachment_id=660

The Pratt configuration and the asymmetric 3RM result in identical averaging intervals when set to remove the annual cycle from monthly data. Both result in a choice of 8, 10 and 12 month windows.

The difference will have an effect when filtering longer periods or higher resolutions, such as daily data.

If this is implemented in a spreadsheet, it should be noted that each average over an even interval will result in a 0.5 month shift in the data, since it is not correctly centred. In a triple running mean with even windows this results in a 1.5 month shift with respect to the original data.
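To make the offset concrete, here is a small sketch using the even 12, 10, 8 month windows of the Pratt configuration; the placement of each average mimics one common spreadsheet layout and is only illustrative:

    import numpy as np

    def uncentred_even_mean(x, n):
        """Mean of x[i:i+n] written back at i + n//2, a common spreadsheet placement;
        for even n the true centre is i + n/2 - 0.5, so each pass shifts features by half a sample."""
        out = np.zeros_like(x)
        for i in range(len(x) - n + 1):
            out[i + n // 2] = x[i:i + n].mean()
        return out

    x = np.zeros(101)
    x[50] = 1.0                              # unit impulse at t = 50

    y = x.copy()
    for n in (12, 10, 8):                    # three even windows, as in the Pratt configuration
        y = uncentred_even_mean(y, n)

    t = np.arange(len(x))
    print("centroid of the filtered impulse: %.2f (the impulse was at 50.00)"
          % (np.sum(t * y) / np.sum(y)))
    # drifts by 3 x 0.5 = 1.5 samples, the shift described above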

In this case the 1.3371 formula originally suggested in the article, giving 12, 9, 7 month averages and producing just one 0.5 month lag, may be preferable.

None of these issues apply if the scripts provided accompanying the article are used, since they all correctly centre the data.

A more technical discussion of cascading running-mean filters to achieve other profiles can be found in this 1992 paper, suggested by Pekka Pirilä, which should serve as a starting point for further study of the subject.

http://www.cwu.edu/~andonie/MyPapers/Gaussian%20Smoothing_96.pdf
