[Go-essp-tech] Handling missing data in the CMIP5 archive

Fri Jun 24 10:31:01 MDT 2011

Hi all,

In this discussion, I think we should distinguish between two cases:

1) Isolated time-slices are missing from a time-series (say, less than 
0.1% of samples are missing).  This might happen, for example,  if a few 
"history" files get lost and can't be regenerated.  We shouldn't expect 
entire files produced by CMOR to be missing in this case, just some time 
samples.

2) A group decides only to provide model output for a subset of the 
requested period, and so there are whole portions of a time-series 
missing.  For example, suppose a group only chooses to save 3-hourly 
data from its historical run for the years 1960-1969 and years 
1990-2005.  The data requested for years 1970-1989 would be missing.

Keep in mind also that some requested time-series contain gaps by 
design.  For example the 3-d aero data for the RCP runs is to be 
collected only for years 2010, 2020, 2040, 2060, 2080, and 2100.  Note 
that the gaps are not all of equal length.  These data are monthly, so 
most months are "missing" by design.

I think when large portions of a time-series are missing (case 2), the 
user will easily notice this by inspecting the file names, as long as 
the gap is *not* contained within the file itself.  This leads to the 
suggestion that entire files may be omitted, but within a single file 
the data should be complete (although isolated time-slices might be 
entirely filled with "missing" values).  I don't think we can generate 
new types of files for CMIP5 that are "empty"; it's too late for changes 
of this kind.  Also I don't think seeing a file of size near zero is any 
easier than checking the time periods explicitly given in the names of 
the files.
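
As a rough sketch of the file-name check described above (not an official 
tool): the helper below assumes monthly files whose names end in a 
"_YYYYMM-YYYYMM.nc" time range, and the function names and sample file 
names are purely illustrative.  Sub-daily data such as the 3-hourly case 
would need a longer time stamp.

```python
import re


def month_index(yyyymm):
    """Convert a 'YYYYMM' string to a running month count."""
    return int(yyyymm[:4]) * 12 + int(yyyymm[4:6]) - 1


def gaps_between_files(filenames):
    """Return (last_covered, next_covered) 'YYYYMM' pairs bracketing each gap.

    Assumes each name ends in '_YYYYMM-YYYYMM.nc'; names that do not
    match are ignored.
    """
    spans = []
    for name in filenames:
        match = re.search(r"_(\d{6})-(\d{6})\.nc$", name)
        if match:
            spans.append((match.group(1), match.group(2)))
    spans.sort()

    gaps = []
    for (_, end), (start, _) in zip(spans, spans[1:]):
        # Consecutive files should be exactly one month apart.
        if month_index(start) - month_index(end) > 1:
            gaps.append((end, start))
    return gaps


# Hypothetical file names, for illustration only.
files = [
    "tas_Amon_MODEL_historical_r1i1p1_196001-196912.nc",
    "tas_Amon_MODEL_historical_r1i1p1_199001-200512.nc",
]
print(gaps_between_files(files))
```

For the two sample names, the function reports one gap between 196912 and 
199001, which is the same information a user would get by reading the 
names by eye.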

Revisiting what to do about the *isolated* missing time-slices of case 
1, my original suggestion was to omit these (or fill them with missing 
values), but Bryan felt strongly they should always be included and 
filled with missing values.  Others have pointed out that one can fairly 
easily infer from the time-coordinate whether or not a time slice has 
been omitted, whereas if the entire time slice were filled with "missing 
values", one would have to read in the data itself to determine whether 
there was any valid data.   On the other hand if anyone failed to read 
in the time-coordinates, and instead simply read all the time-slices 
that were available, and then *assumed* no time-slices were omitted, 
they would likely perform a flawed analysis and might never notice.  
They would be less likely to do this if all the time-slices were 
actually written, but isolated ones were filled with missing values.
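
To illustrate the point about the time-coordinate, here is a minimal 
sketch of how omitted slices could be inferred without reading the data.  
It assumes the time values are already in memory as plain numbers and are 
meant to be regularly spaced (e.g. daily or 3-hourly); monthly axes 
expressed in days are not evenly spaced and would need calendar-aware 
handling.  The function name is hypothetical.

```python
def omitted_slices(times, step):
    """Return the expected time values that are absent from the axis.

    times: sorted time-coordinate values (e.g. days since a reference date)
    step:  the nominal spacing between consecutive samples
    """
    missing = []
    for prev, nxt in zip(times, times[1:]):
        # Number of nominal steps between two neighbouring samples.
        n_steps = round((nxt - prev) / step)
        # Anything beyond one step means samples were left out.
        missing.extend(prev + k * step for k in range(1, n_steps))
    return missing


# Illustrative axis with the sample at t = 3.0 omitted.
print(omitted_slices([0.0, 1.0, 2.0, 4.0, 5.0], 1.0))
```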

So, I'm inclined to allow some flexibility, summarized below, since 
unless folks are careful they'll make mistakes no matter what we decide:

When isolated time-slices in a dataset are lost and it is impossible to 
recover them, it is recommended that those isolated missing time-slices be:
1) filled entirely with the "missing data" value, or
2) entirely omitted from the file (making sure the time-coordinate 
reflects their absence)
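
Under option 1, a reader has to look at the data to find the affected 
slices.  A minimal sketch of that check follows, assuming each time slice 
has been flattened into a list of numbers and that the missing value is 
1.0e20 (an assumption here; the actual value should be taken from the 
variable's attributes in the file).  The function name is hypothetical.

```python
def all_missing_slices(data, missing_value):
    """Return indices of time slices in which every value is missing.

    data: sequence of time slices, each a flat sequence of numbers
    """
    return [
        index
        for index, time_slice in enumerate(data)
        if all(value == missing_value for value in time_slice)
    ]


MISSING = 1.0e20  # assumed missing-data value, for illustration only
data = [
    [1.0, 2.0],          # fully valid slice
    [MISSING, MISSING],  # lost slice, filled entirely with missing values
    [3.0, MISSING],      # partly masked slice, still contains valid data
]
print(all_missing_slices(data, MISSING))
```

Only the second slice is reported; a slice that is merely partly masked 
(for example over land or ocean) is not treated as lost.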

When significant portions of a time-series are omitted (either by design 
or otherwise), one should simply not create files for those portions of 
the time-series.  This might require the user to divide data normally 
found in a single file into two files.  For example, if 100 years of 
monthly mean data are normally packaged into a single file, but a 
decade (i.e., 120 consecutive samples) is unavailable (say years 
40-49), the user should write instead two files, the first with 40 years 
of data and the second with the last 50 years of data.
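
The split in this example can be sketched as follows.  This assumes the 
run's years are numbered 0-99, so that dropping years 40-49 leaves 40 
years before the gap and 50 after it; the function name is hypothetical, 
and each resulting segment would correspond to one output file.

```python
def contiguous_segments(available_years):
    """Group a sorted list of years into (first, last) contiguous runs."""
    segments = []
    for year in available_years:
        if segments and year == segments[-1][1] + 1:
            # Year follows directly on the current run: extend it.
            segments[-1] = (segments[-1][0], year)
        else:
            # A gap (or the very first year): start a new run.
            segments.append((year, year))
    return segments


# Years 0-99 with the decade 40-49 unavailable.
years = [y for y in range(100) if not 40 <= y <= 49]
print(contiguous_segments(years))
```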

Further discussion invited.

Best regards,
Karl

On 6/24/11 7:47 AM, Bentley, Philip wrote:
> Hi George,
>
> Your chosen solution to use a metadata attribute ('nodata' in your case)
> to flag that a given file is an empty/null file is exactly the solution
> that I had in mind for CMIP5 files comprised of all missing data.
>
> Unfortunately - and the reason I didn't pursue it on mailing lists -
> such files would not, I think, be CF-compliant and as such would likely
> trip up current netCDF client software (and certainly the tools likely
> to be used for analysing CMIP5 datasets).
>
> Although it's probably too late to use this device on the CMIP5 project,
> nonetheless I wonder if it isn't worth making a proposal along these
> lines to the CF mailing list?
>
> Regards,
> Phil
>
>> -----Original Message-----
>> From: go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of George J. Huffman
>> Sent: 24 June 2011 14:23
>> To: Kettleborough, Jamie
>> Cc: go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] Handling missingdata in the CMIP5 archive
>>
>> Hi all - to quote a different context ... in the
>> Precipitation Processing System, which is the processing
>> center for TRMM and GPM satellite project data, the choice is
>> to provide a file whether or not the data actually exist.  If
>> parts of the file are missing, they are filled with the
>> missing value, as you'd expect.  If the entire contents of
>> the file are unavailable, the metadata in the header includes
>> a "nodata=true" flag and no space is wasted.  To follow up on
>> the "early failure" comments, if you process the header for
>> the nodata flag, you'd immediately hit it, and if you don't,
>> you'd immediately hit read failures.  As a visual check, the
>> all-missing file's size is tiny compared to the usual file
>> that has data.
>>
>> George
>>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech