Guest Essay by Kip Hansen
This essay is the third and last in a series of essays about Averages — their use and misuse. My interest is in the logical and scientific errors, the informational errors, that can result from what I have playfully coined “The Laws of Averages”.
Both the word and the concept “average” are subject to a great deal of confusion and misunderstanding among the general public, and both have seen an overwhelming amount of “loose usage” even in scientific circles, peer-reviewed journal articles and scientific press releases not excepted. For that reason, I gave a refresher on Averages in Part 1 of this series. If your maths or science background is near the great American average, I suggest you take a quick look at the primer in Part 1, then read Part 2, before proceeding.
Why is it a mathematical sin to average a series of averages?
“Dealing with data can sometimes cause confusion. One common data mistake is averaging averages. This can often be seen when trying to create a regional number from county data.” — Data Don’ts: When You Shouldn’t Average Averages
“Today a client asked me to add an “average of averages” figure to some of his performance reports. I freely admit that a nervous and audible groan escaped my lips as I felt myself at risk of tumbling helplessly into the fifth dimension of “Simpson’s Paradox”– that is, the somewhat confusing statement that averaging the averages of different populations produces the average of the combined population.” — Is an Average of Averages Accurate? (Hint: NO!)
“Simpson’s paradox… is a phenomenon in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined. It is sometimes given the descriptive title reversal paradox or amalgamation paradox.” — the Wiki “Simpson’s Paradox”
Averaging averages is only valid when the sets of data (groups, cohorts, numbers of measurements) are all equal in size, or very nearly so: they must contain the same number of elements, represent the same area, the same volume, the same number of patients, the same number of opinions. And, as with all averages, the data itself must be physically and logically homogeneous (not heterogeneous) and physically and logically commensurable (not incommensurable). [If this is unclear, please see Part 1 of this series.]
For example, if one has four 6th Grade classes, each containing exactly 30 pupils, and wishes to find the average height of the 6th Grade students, one can go about it in two ways: 1) average each class (sum the heights of its 30 students and divide by 30), then sum the four class averages and divide by four, an average of the averages; or 2) combine all four classes into one set of 120 students, sum the heights, and divide by 120. The results will be the same.
The contrary example is four classes of 6th Grade students of differing sizes: 30, 40, 20, and 60. Finding four class averages and then averaging the averages gives one answer, quite different from the answer obtained by summing the heights of all 150 students and dividing by 150. Why? Because the individual students in the class of only 20 students and the individual students in the class of 60 students have unequal effects on the overall average. For the average to be valid, each student should represent 1/150, or about 0.67%, of the overall average. But when averaged by class, each class accounts for 25% of the overall average. Thus each student in the class of 20 counts for 25%/20 = 1.25% of the overall average, whereas each student in the class of 60 counts for only 25%/60 ≈ 0.42%. Similarly, students in the classes of 30 and 40 count for 0.83% and 0.625% respectively. Each student in the smallest class affects the overall average three times as much as each student in the largest class, contrary to the ideal of each student having an equal effect on the average.
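The class-size arithmetic above can be checked in a few lines. This is a minimal sketch: the class sizes (30, 40, 20, 60) come from the text, but the class-average heights are hypothetical, invented only to show that the two methods disagree.

```python
# Four 6th Grade classes of unequal size. Averaging the four class
# averages weights each class equally (25% each); pooling all 150
# students weights each student equally (1/150 each). The two
# results differ. Heights (in inches) are hypothetical.
class_sizes = [30, 40, 20, 60]
class_avgs = [58.0, 59.5, 57.0, 60.5]

# Unweighted "average of averages": each class counts as 25%
avg_of_avgs = sum(class_avgs) / len(class_avgs)

# Pooled average: total of all heights divided by total pupils,
# so each student counts as 1/150
total_height = sum(n * avg for n, avg in zip(class_sizes, class_avgs))
pooled_avg = total_height / sum(class_sizes)

print(avg_of_avgs)           # 58.75
print(round(pooled_avg, 2))  # 59.27
```

Note that the pooled average is just the *weighted* average of the class averages, with class size as the weight; that is why the two methods agree only when the groups are the same size.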
For our readers in Indiana (that’s one of the states in the US), we could look at Per Capita Personal Income of the Indianapolis metro area:
This information is provided by the Indiana Business Research Center in an article titled: “Data Don’ts: When You Shouldn’t Average Averages”.
As you can see, averaging the averages of the counties gives a PCPI of $40,027, while aggregating first and then averaging gives a truer figure of $40,527: a difference, in this case an error, of $500, or about 1.2%. Of interest to those in Indiana: only the top three earning counties have PCPI higher than the state average by either method, and eight counties are below it.
If this seems trivial to you, consider that various claims of “striking new medical discoveries” and “hottest year ever” are based on just these sorts of differences: effect sizes in the range of single-digit (or even fractional) percentage points, or tenths or hundredths of a degree.
To compare with climatology: the published anomalies from the 30-year climate reference period (1981-2010) for the month of June 2017 range from 0.38°C (ECMWF) down to 0.21°C (UAH), with the Tokyo Climate Center weighing in with a middle value of 0.36°C. The range (0.17°C) is nearly 25% of the total temperature increase for the last century (0.71°C). Even looking at only the two highest figures, 0.38°C and 0.36°C, the difference of 0.02°C is about 5% of the total anomaly.
How exactly these averages are produced matters a very great deal to the final result. And it matters not at all whether one is averaging absolute values or anomalies: the magnitude of the induced error can be huge.
Related, but not identical, is Simpson’s Paradox.
Simpson’s Paradox, or more correctly the Simpson-Yule effect, is a phenomenon that occurs in statistics and probabilities (and thus with averages), often seen in medical studies and various branches of social sciences, in which a result (a trend or effect difference, for example) seen when comparing groups of data disappears or reverses itself when the groups (of data) are combined.
Some examples of Simpson’s Paradox are famous. One with implications for today’s hot topics involved claimed bias in admission ratios for men and women at UC Berkeley. Here’s how one author explained it:
“In 1973, UC Berkeley was sued for gender bias, because their graduate school admission figures showed obvious bias against women.
Men were much more successful in admissions than women, leading Berkeley to be “one of the first universities to be sued for sexual discrimination”. The lawsuit failed, however, when statisticians examined each department separately. Graduate departments have independent admissions systems, so it makes sense to check them separately—and when you do, there appears to be a bias in favor of women.”
In this instance, the combined (amalgamated) data across all departments gave the less informative view of the situation.
Of course, like many famous examples, the UC Berkeley story is a Scientific Urban Legend – the numbers and mathematical phenomenon are true, but there never was a gender bias lawsuit. Real story here.
Another famous example of Simpson’s Paradox was featured (more or less correctly) on the long-running TV series Numb3rs. (full disclosure: I have watched all episodes of this series over the years, some multiple times). I have heard that some people like sports statistics, so this one is for you. It “involves the batting averages of players in professional baseball. It is possible for one player to have a higher batting average than another player each year for a number of years, but to have a lower batting average across all of those years.”
This chart makes the paradox clear:
Each individual year, Justice has a slightly better batting average, but when the three years are combined, Jeter has the slightly better stat. This is Simpson’s Paradox, results reversing when multiple groups of data are considered separately or aggregated.
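The chart’s arithmetic can be checked directly. The hit and at-bat totals below are the figures commonly cited for this example (the 1995–1997 seasons); the reversal comes from the weights, since the two players’ at-bats are distributed very differently across the three years.

```python
# Simpson's Paradox in the Jeter/Justice batting averages.
# Data: (hits, at-bats) per season, as commonly cited for this example.
jeter   = {1995: (12, 48),   1996: (183, 582), 1997: (190, 654)}
justice = {1995: (104, 411), 1996: (45, 140),  1997: (163, 495)}

# Justice has the higher average in every individual season...
for year in jeter:
    jh, jab = jeter[year]
    dh, dab = justice[year]
    assert dh / dab > jh / jab

# ...but Jeter has the higher average over the combined seasons,
# because most of Jeter's at-bats fall in his best years.
j_hits, j_ab = (sum(v) for v in zip(*jeter.values()))
d_hits, d_ab = (sum(v) for v in zip(*justice.values()))
print(f"{j_hits / j_ab:.3f} vs {d_hits / d_ab:.3f}")  # 0.300 vs 0.298
```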
In climatology, the various groups go to great lengths to avoid the downsides of averaging averages. As we will see in comments, representatives of the various methodologies will weigh in and defend their methods.
One group will claim that they do not average at all; they engage in “spatial prediction”, which somehow magically produces a prediction that they then simply label the Global Average Surface Temperature (all while denying having performed averaging). They do, of course, start with daily, monthly, and annual averages (though not real averages; more on this later).
Another expert might weigh in and say that they definitely don’t average temperatures; they only average anomalies. That is, they find the anomalies first and then average those. If pressed hard enough, this faction will admit that the averaging has long since been accomplished: the local station data (daily average dry bulb temperature) is averaged repeatedly to arrive at monthly averages, then annual averages; sometimes multiple stations are averaged to produce a “cell” average; and then these annual or climatic averages are subtracted from the present absolute temperature average (monthly or annual, depending on the process) to leave a remainder, which is called the “anomaly”. Then, of course, the anomalies are averaged. The anomalies may or may not, depending on the system, represent equal areas of the Earth’s surface. [See the first section for the error involved in averaging averages that do not represent the same fraction of the aggregated whole.] This group, and nearly all others, rely on “not real averages” at the root of their method.
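The chain of averages just described can be sketched schematically. All the numbers here are invented; the point is only that every step, including the one that produces the “anomaly”, operates on quantities that are themselves already averages.

```python
# Schematic of the anomaly pipeline: daily Hi-Lo midranges are
# averaged into a monthly mean, and a long-term baseline mean for
# that month is subtracted to leave the "anomaly". All values are
# hypothetical, in degrees F.
def daily_midrange(hi, lo):
    """The climatological 'daily average': (Hi + Lo) / 2."""
    return (hi + lo) / 2

june_days = [(75, 55), (78, 58), (80, 60)]  # (Hi, Lo) for each day
monthly_mean = sum(daily_midrange(h, l) for h, l in june_days) / len(june_days)

baseline_june = 66.5                        # hypothetical 30-year June mean
anomaly = monthly_mean - baseline_june      # an average minus an average

print(round(monthly_mean, 2), round(anomaly, 2))  # 67.67 1.17
```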
Climatology has an averaging problem, but the real one is not so much the one discussed above. In climatology, the daily average temperature used in calculations is not an average of the air temperatures experienced or recorded at the weather station during the 24-hour period under consideration. It is the arithmetic mean of the lowest and highest recorded temperatures (the Lo and Hi, or Min and Max) for that period. It is not the average of all the hourly temperature readings, for instance, even when those are recorded and reported. No matter how many measurements are recorded, the daily average is calculated by summing the Lo and the Hi and dividing by two.
Does this make a difference? That is a tricky question.
Temperatures have been recorded as High and Low (Min-Max) for 150 years or more. That’s just how it was done, and in order to remain consistent, that’s how it is done today.
A data download of temperature records for weather station WBAN:64756, Millbrook, NY, for December 2015 through February 2016 gives temperature readings every five minutes. The data set includes values for “DAILYMaximumDryBulbTemp” and “DAILYMinimumDryBulbTemp”, followed by “DAILYAverageDryBulbTemp”, all in degrees F. DAILYAverageDryBulbTemp is the arithmetic mean of the two preceding values (Max and Min). It is this last value that is used in climatology as the Daily Average Temperature. On a typical December day the recorded values look like this:
Daily Max 43 — Daily Min 34 — Daily Average 38 (the arithmetic mean is really 38.5, however, the algorithm apparently rounds x.5 down to x)
However, the Daily Average of All Recorded Temperatures is 37.3°F.
The differences on this one day:
Difference between reported Daily Average of Hi-Lo and actual average of recorded Hi-Lo numbers = 0.5 °F due to rounding algorithm.
Difference between reported Daily Average and the more correct Daily Average Using All Recorded Temps = 0.667 °F
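The two definitions at play in the Millbrook numbers can be written out. The short list of readings below is invented (a real day at this station has 288 five-minute samples), but it reproduces the structural point: the Hi-Lo midrange and the mean of all readings are different statistics and generally differ.

```python
# Two competing "daily average" definitions, in degrees F.
def hilo_average(temps):
    """Climatology-style daily average: midrange of Min and Max."""
    return (min(temps) + max(temps)) / 2

def all_readings_average(temps):
    """Arithmetic mean of every recorded reading."""
    return sum(temps) / len(temps)

# Hypothetical day: cool most of the day, brief afternoon warm spike
readings = [34, 34, 35, 35, 35, 36, 36, 37, 40, 43, 39, 36]

print(hilo_average(readings))                    # 38.5
print(round(all_readings_average(readings), 2))  # 36.67
```

The two definitions coincide only when the readings happen to be distributed symmetrically around the midrange, which real daily temperature traces rarely are.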
Other days in January and February show differences between the reported Daily Average and the Average of All Recorded Temperatures ranging from 0.1°F through 1.25°F, to a high of 3.17°F on January 5, 2016.
This is not a scientific sampling, but it is a quick ground-truth case study. It shows that the numbers being averaged from the very start (the Daily Average Temperatures officially recorded at surface stations, the unmodified basic data themselves) are not calculated to any notable degree of accuracy or precision. Rather, they are calculated “the way we always have”: by finding the mean between the highest and lowest temperatures in a 24-hour period. That does not give us what we would normally expect as “the average temperature during that day”, but some other number: a simple mean of the Daily Lo and the Daily Hi, which the above chart reveals to be quite different. The average absolute difference for the two-month sample is 1.3°F. The average of all differences, signs included, is 0.39°F.
The magnitude of these daily differences? Up to, or greater than, the commonly reported annual global temperature anomalies. It does not matter one whit whether the differences are up or down; what matters is that they imply that the numbers being used to influence policy decisions are inaccurate all the way down to the basic daily temperature reports from single weather stations. Inaccurate data never produces accurate results. Personally, I do not think this problem disappears when using “only anomalies” (as some will loudly claim in comments): the basic, ground-floor data is incorrectly, inaccurately, and imprecisely calculated.
But, but, but… I know, I can hear the complaints now. The usual chorus of:
- It all averages out in the end (it does not)
- But what about the Law of Large Numbers? (magical thinking)
- We are not concerned with absolute values, only anomalies.
The first two are specious arguments.
The last I will address. The answer lies in the “why” of the differences described above. The reason for the difference (beyond the simple rounding of fractional degrees to whole degrees) is that the air temperature at any given weather station is not distributed symmetrically through the day. Graphed minute to minute, or hour to hour, one would not see a symmetric, bell-shaped “normal distribution”, which would look like this:
If air temperature were distributed symmetrically through the day, then the currently used Daily Average Dry Bulb Temperature (the arithmetic mean of the day’s Hi and Lo) would be correct and would not differ from the Daily Average of All Recorded Temperatures for the day.
But real air surface temperatures look much more like these three days from January and February 2016 in Millbrook, NY:
Air temperature at a weather station does not start at the Lo, climb evenly and steadily to the Hi, and then slide back down evenly to the next Lo. That is a myth; any outdoorsman (hunter, sailor, camper, explorer, even jogger) knows it. Yet in climatology, the Daily Average Temperature, and all subsequent weekly, monthly, and yearly averages, are calculated on this false basis. At first this was out of necessity: weather stations used Min-Max recording thermometers, which were often checked only once per day, the recording tabs being reset at that time. Now it is out of respect for convention and consistency. We cannot go back and undo the facts, but we need to acknowledge that the Daily Averages from those Min-Max/Hi-Lo readings do not represent the actual Daily Average Temperature, in either accuracy or precision. This insistence on consistency means that the error ranges illustrated in the above example affect all Global Average Surface Temperature calculations that use station data as their source.
Note: The example used here is of winter days in a temperate climate. The situation is representative, but not necessarily quantitatively — both the signs and the sizes of the effects will be different for different climates, different stations, different seasons. The effect cannot be obviated through statistical manipulation or reducing the station data to anomalies.
Any anomalies derived by subtracting climatic-scale averages from current temperatures will not tell us whether the average absolute temperature at any one station is rising or falling (or by how much). They will tell us only that the mean between the daily Hi and Lo temperatures is rising or falling, which is an entirely different thing. A day with a very low low for an hour or two in the early morning, followed by high temperatures for most of the rest of the day, has the same Hi-Lo mean as a day with very low lows for 12 hours and a short hot spike in the afternoon. These two types of days do not have the same actual average temperature, and anomalies cannot illuminate the difference. A climatic shift from one regime to the other would not show up in the anomalies, yet the environment would be greatly affected by such a shift.
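The two day-shapes just described can be made concrete. The hourly values below are invented, but both days share the same Lo (20) and Hi (50), so their Hi-Lo midranges are identical even though their true hourly means are far apart.

```python
# Two hypothetical 24-hour days with identical Min (20) and Max (50),
# hence identical Hi-Lo midranges, but very different true means.
day_brief_cold = [20] * 2 + [45] * 21 + [50]  # short cold snap, warm all day
day_long_cold  = [20] * 22 + [50] * 2         # cold all day, short hot spike

def midrange(temps):
    return (min(temps) + max(temps)) / 2

def true_mean(temps):
    return sum(temps) / len(temps)

# Same midrange, so the same "daily average" in the station record...
print(midrange(day_brief_cold), midrange(day_long_cold))  # 35.0 35.0

# ...but over 20 degrees apart in actual mean temperature.
print(round(true_mean(day_brief_cold), 1),
      round(true_mean(day_long_cold), 1))                 # 43.1 22.5
```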
What can we know from the use of these imprecise “daily averages” (and all the other numbers) derived from them?
There are some who question whether an actual Global Average Surface Temperature exists at all (see “Does a Global Temperature Exist?”).
On the other hand, Steven Mosher so aptly informed us recently:
“The global temperature exists. It has a precise physical meaning. It’s this meaning that allows us to say…The LIA [Little Ice Age] was cooler than today…it’s the meaning that allows us to say the day side of the planet is warmer than the night side…The same meaning that allows us to say Pluto is cooler than Earth and Mercury is warmer.”
What such global averages based on questionably derived “daily averages” cannot tell us is that this year or that year was warmer or cooler by some fraction of a degree. The calculation error, the measurement error, of the commonly used station Daily Average Dry Bulb Temperature is equal in magnitude (or nearly so) to the long-term global temperature change. The historic temperature record cannot be corrected for this fault, and modern digital records would require recalculation of Daily Averages from scratch. Even then, the two data sets would not be comparable quantitatively, and possibly not even qualitatively.
So, “Yes, It Matters”
It matters a lot how and what one averages. It matters all the way up and down through the magnificent mathematical wonderland that represents the computer programs that read these basic digital records from thousands of weather stations around the world and transmogrify them into a single number.
It matters especially when that single number is then subsequently used as a club to beat the general public and our political leaders into agreement with certain desired policy solutions that will have major — and many believe negative — repercussions on society.
It is not enough to correctly mathematically calculate the average of a data set.
It is not enough to be able to defend the methods your Team uses to calculate the [more-often-abused-than-not] Global Averages of data sets.
Even when averages are computed from homogeneous, physically and logically commensurable data, they return a single number which can then, incorrectly, be assumed to be a summary or fair representation of the whole set.
Averages, in any and all cases, by their very nature give only a very narrow view of the information in a data set. If accepted as representative of the whole, an average acts as a Beam of Darkness, hiding and obscuring the bulk of the information; instead of leading us to better understanding, it can reduce our understanding of the subject under study.
Averaging averages is fraught with danger and must be viewed cautiously. Averaged averages should be considered suspect until proven otherwise.
In climatology, Daily Average Temperatures have been, and continue to be, calculated inaccurately and imprecisely from daily minimum and maximum temperatures, a fact that casts doubt on the whole Global Average Surface Temperature enterprise.
Averages are good tools but, like hammers or saws, must be used correctly to produce beneficial and useful results. The misuse of averages diminishes rather than improves understanding, confuses rather than clarifies, and muddies scientific and policy decisions.
[July 25, 2016 – 12:15 EDT]
Those wanting more data about the differences between Tmean (the mean of the Daily Min and Daily Max) and Taverage (the arithmetic average of all 24 recorded hourly temps; some use T24 for this), both quantitatively and in annual trends, should refer to “Spatiotemporal Divergence of the Warming Hiatus over Land Based on Different Definitions of Mean Temperature” by Chunlüe Zhou & Kaicun Wang [Nature Scientific Reports | 6:31789 | DOI: 10.1038/srep31789]. Contrary to assertions in comments that the trends of these differently defined “average” temperatures are the same, Zhou and Wang show this figure and caption: (h/t David Fair)
Figure 4. The (a,d) annual, (b,e) cold, and (c,f) warm seasonal temperature trends (unit: °C/decade) from the Global Historical Climatology Network-Daily version 3.2 (GHCN-D, [T2]) and the Integrated Surface Database-Hourly (ISD-H, [T24]) are shown for 1998–2013. The GHCN-D is an integrated database of daily climate summaries from land surface stations across the globe, which provides available Tmax and Tmin at approximately 10,400 stations from 1998 to 2013. The ISD-H consists of global hourly and synoptic observations available at approximately 3,400 stations from over 100 original data sources. Regions A1, A2 and A3 (inside the green regions shown in the top left subfigure) are selected in this study.
# # # # #
Author’s Comment Policy:
I am always eager to read your ideas and opinions, and to answer your questions about the subject of the essay, which in this case is Averages, their uses and misuses.
If you hope that I will respond or reply to your comment, please address your comment explicitly to me — such as “Kip: I wonder if you could explain…..”
As regular visitors know, I do not respond to Climate Warrior comments from either side of the Great Climate Divide — feel free to leave your mandatory talking points but do not expect a response from me.
The ideas presented in this essay, particularly in the Climatology section, are likely to stir controversy and raise objections. For this reason, it is especially important to remain on-point, on-topic in your comments and try to foster civil discussion.
I understand that opinions may vary.
I am interested in examples of the misuse of averages, the proper use of averages, and I expect that many of you will have lots of varying opinions regarding the use of averages in Climate Science.
# # # # #