Guest essay by Bob Koss
Being an old retired guy with time on my hands, this summer I decided to find out just how well GHCN-Monthly follows their own methodology in regard to data collection. What I discovered is, they don’t. My remarks below relate strictly to the GHCN monthly unadjusted dataset on which their final adjusted dataset is based. At the end of this article are links to some verifications of what I discuss.
For those unfamiliar with the organizations involved, a few terms are defined.
The Global Historical Climatology Network (GHCN), a part of the National Climatic Data Center (NCDC), is the repository other global temperature data analysts turn to for many of their data sources. Monthly Climatic Data for the World (MCDW) is also a part of NCDC and separately compiles a less extensive set of monthly data than GHCN. The US Historical Climatology Network (USHCN) is a network of stations entirely within the continental US and is also part of NCDC. The Met Office is a UK data source whose stations in many cases overlap with those in the NCDC sources.
GHCN created a table of data sources ranked by priority (quality), highest first. The highest priority data is to be used whenever multiple sources are available for the same station. This rule might as well not exist, since they don’t follow it. Evidently it is a rule only for PR purposes, not one they feel bound to follow.
Here is their description of that rule from the methodology paper linked near the end of this post.
The data integration phase begins by assembling and
merging the various source level data sets. Although a single
datum may be provided by more than one source, only one
value is added to version 3 for any particular month. The
datum is selected based on availability and a hierarchical
process involving priority levels based on the reliability and
quality of the source. Data from sources considered to be of
higher quality and reliability are used preferentially over
other sources. Table 3 lists the sources, and their order of
assemblage (highest priority listed first). For example, if a
non-missing datum is present for the same date/location
from data source M (MCDW) and data source P (CLIMAT
bulletin), the datum from data source M will be placed in the
data set. The source from which each datum originated is
indicated in the version 3 data set by a source flag as shown
in the table. Daily reconstruction of the data set using this
method ensures that any changes made in the source data
sets get incorporated into GHCN-M while also allowing for
the reproduction of the version 3 data set by other institutions
Here is Table 3, mentioned in the above quote.
Table 3. Source Data Sets From Which GHCN-M Version 3 is Constructed and Maintained

Priority   Source Data Set                                   Source Flag
1          Datzilla (Manual/Expert Assessment)               Z
2          USHCN-M Version 2                                 U
3          World Weather Records                             W
4          KNMI Netherlands (DeBilt only)                    N
5          Colonial Era Archive                              J
6          MCDW (DSI 3500)                                   M
7          MCDW quality controlled but not yet published     C
8          UK Met Office CLIMAT                              K
9          CLIMAT bulletin                                   P
10         GHCN-M Version 2                                  G (a)

(a) For any station incorporated from GHCN-M version 2 that had multiple time series (“duplicates”) for mean temperature, the ‘G’ flag is replaced by a number from 0 to 9 that corresponds to the particular duplicate in version 2 from which it originated. This number is the 12th digit in the version 2
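The selection rule quoted above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not GHCN's actual code. The priority order is copied from Table 3; the function name, the dictionary layout, and the -9999 missing-value sentinel are assumptions made for the example, and the numeric duplicate flags (0 to 9) that replace 'G' are left out for brevity.

```python
# Source flags in the order given by Table 3, highest priority first.
SOURCE_PRIORITY = ["Z", "U", "W", "N", "J", "M", "C", "K", "P", "G"]
RANK = {flag: position for position, flag in enumerate(SOURCE_PRIORITY)}

MISSING = -9999  # assumed sentinel for a missing monthly value


def select_datum(candidates):
    """Pick one value for a single station-month.

    candidates maps a source flag to the value that source reports
    (hypothetical layout). Returns (flag, value) from the highest-priority
    source that has a non-missing value, or None if no source has one.
    """
    available = [
        (RANK[flag], flag, value)
        for flag, value in candidates.items()
        if flag in RANK and value != MISSING
    ]
    if not available:
        return None
    _, flag, value = min(available)  # lowest rank number = highest priority
    return flag, value
```

Under this rule, a station-month offered by both MCDW (M) and Met Office (K) would always come out flagged M, and K would appear only where M has no value.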
Around June 6th, 2014, GHCN rolled back a higher quality source to a lower one by changing 2013 data from MCDW to Met Office data (16,000+ months of data). This resulted in numerous value changes and an increase in the amount of missing data. Those changes remained for over a month, until I noticed them while comparing my June 3rd file with one from early July. I inquired about the changes. The next day, July 10th, the higher quality source was re-inserted. A couple of days later I was told by one of the head GHCN team members that it was “an unintentional processing problem that occurred with one of our ingest streams”. They did update their status.txt file, unsurprisingly in about as low-key a way as possible.
I find their reason unpersuasive. Why are they even touching 2013 data, unless it is to overwrite it with a higher quality source? I wouldn’t expect them to still be streaming 2013 data; I would expect them to have it at hand, archived on site. They rebuild their dataset daily. What competent organization would not do a sanity check on each new build by running a simple data comparison against the previous dataset?
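For illustration, the kind of comparison meant here might look like the sketch below. It assumes both builds have already been parsed into dictionaries keyed by (station, year, month), each holding a (value, source flag) pair. That layout, the function name, and the -9999 missing sentinel are assumptions for the example, not anything GHCN publishes.

```python
MISSING = -9999  # assumed sentinel for a missing monthly value


def compare_builds(old, new):
    """Report how station-months present in the old build fare in the new one.

    old and new map (station, year, month) -> (value, source_flag).
    Only months that had a real value in the old build are examined.
    """
    report = {"value_changed": [], "source_changed": [], "newly_missing": []}
    for key, (old_value, old_flag) in old.items():
        if old_value == MISSING:
            continue
        new_value, new_flag = new.get(key, (MISSING, " "))
        if new_value == MISSING:
            report["newly_missing"].append(key)
            continue
        if new_value != old_value:
            report["value_changed"].append(key)
        if new_flag != old_flag:
            report["source_changed"].append(key)
    return report
```

A build in which thousands of prior-year entries suddenly land in source_changed or newly_missing is the sort of thing such a check would flag before the file is published.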
My latest query, about a week ago, concerns their continued use of lower quality data at least as far back as 2001. For Australia between 2003 and 2013, 98% of the data is sourced to Met Office, even though the higher quality MCDW has much of that data available. I don’t understand why they aren’t using the higher priority MCDW data. For each year since 2001 there are 2000-3000 pieces of Met Office data still in use, and less than a third of that relates to Australia. Other countries in the database might also still be listed with inferior data simply because their data hasn’t been properly upgraded. A couple of emails were exchanged, but no reason was given and no changes were made. At this point I think it is questionable whether GHCN will thoroughly investigate and upgrade to higher quality sources where appropriate. It will be a pleasant surprise if they do.
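A tally like the one above can be reproduced from the daily data file with something along these lines. This is a rough sketch that assumes the fixed-width layout described in the GHCN-M v3 readme as I understand it: an 11-character station ID, a 4-digit year, a 4-character element code, then twelve 8-character groups, each a 5-digit value followed by three one-character flags, the last being the source flag. Check the column positions against the readme before relying on it. The function name is made up for the example, and the country prefix should be looked up in the country code list distributed with the data.

```python
from collections import Counter

MISSING = -9999  # assumed sentinel for a missing monthly value


def source_shares(lines, country_prefix, first_year, last_year):
    """Fraction of non-missing monthly values carrying each source flag.

    lines: iterable of fixed-width data lines (assumed v3 layout).
    country_prefix: leading digits of the station ID for one country.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith(country_prefix):
            continue
        year = int(line[11:15])
        if not first_year <= year <= last_year:
            continue
        for month_index in range(12):
            start = 19 + 8 * month_index
            field = line[start:start + 8]
            if len(field) < 8:
                break  # truncated line; stop reading this record
            if int(field[:5]) == MISSING:
                continue
            counts[field[7]] += 1  # third flag in the group = source flag
    total = sum(counts.values())
    if total == 0:
        return {}
    return {flag: n / total for flag, n in counts.items()}
```

Running it once per country over the same span of years would show whether the heavy reliance on the K flag is confined to one country or spread more widely.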
Below is a graphic example of how much difference the data source can make in the monthly temperature record. I’m not saying all stations have differences of such a magnitude, or that this shows the largest/smallest difference, or that all stations go in a similar direction. I haven’t checked, but wouldn’t be surprised if the differences tilted quite a bit in one direction.
Some digging in July led me to find that the entire continent of Australia is devoid of data for September, October and November 2011. They did have September and October data in v3.0 when it was superseded by v3.1 in early November 2011. When v3.1 launched it discarded October, leaving only September intact. At some point since then they also discarded September. I emailed them about this on July 31st and a couple of times since then. The latest word is that they are trying to get Met Office to re-transmit the data. MCDW has much of that data, and since GHCN considers MCDW a higher quality source than Met Office, I don’t understand why they aren’t using that instead.
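A gap of this kind is easy to detect mechanically. The sketch below assumes the data has already been parsed into (station, year, month, value) records. The record layout, the function name, and the -9999 missing sentinel are assumptions for the example.

```python
MISSING = -9999  # assumed sentinel for a missing monthly value


def empty_months(records, country_prefix, year):
    """List the months of a year in which no station of a country has data.

    records: iterable of (station, year, month, value) tuples.
    country_prefix: leading digits of the station ID for one country.
    """
    months_with_data = set()
    for station, record_year, month, value in records:
        if (
            record_year == year
            and station.startswith(country_prefix)
            and value != MISSING
        ):
            months_with_data.add(month)
    return [month for month in range(1, 13) if month not in months_with_data]
```

Run against each daily file, a result that grows from one release to the next would show when a previously populated month was dropped.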
Final example for today. On October 2nd this year they deleted all the August data for the rest of the world (ROW), leaving only USHCN data in the database. They even deleted data for US stations that are not part of USHCN. Amazingly, they still managed to add ROW data for September during the deletion period. The August ROW data was missing until October 8th, when they re-inserted it. I still don’t know why they deleted it. I mentioned it in an email about a week ago, but no reason has been provided. The data deletion did increase the mean value of the remaining August data by 0.9C. Was there some announcement concerning global temperatures for summer or August during that period?
With such erratic data handling, the accuracy of their product is questionable.
This post is already long enough, so I’ll end here.
Free paper on GHCN v3 methodology. pg. 11 explains source priority and processing. http://onlinelibrary.wiley.com/doi/10.1029/2011JD016187/pdf
Daily issued data files along with status.txt, a readme, and other stuff.
Published MCDW data by station(ends 2011).
Published MCDW data by month. Current to Aug 2014.
A compilation of annual data concerning the 2013 roll-back, October 2014 deletion, and the missing Australian data in 2011.