[Go-essp-tech] Handling missing data in the CMIP5 archive
ag.stephens at stfc.ac.uk
Fri Jun 24 06:52:23 MDT 2011
Dear all,
Please note my last message (below) was stuck in my mailbox for 6 hours. Hence it does not account for the debate re-opening.
Please ignore.
Ag
From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of ag.stephens at stfc.ac.uk
Sent: 24 June 2011 13:47
To: taylor13 at llnl.gov; stockhause at dkrz.de
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
Thanks Karl,
I appreciate the clarity of the guidance. This will help us proceed.
Kind regards,
Ag
From: Karl Taylor [mailto:taylor13 at llnl.gov]
Sent: 24 June 2011 01:17
To: Martina Stockhause
Cc: Stephens, Ag (STFC,RAL,RALSP); Lawrence, Bryan (STFC,RAL,RALSP); go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
Dear all,
O.K. I didn't read Martina and Frank's email until now. Perhaps my last email should have read in summary:
If a time-slice that should have been included as part of the requested model output is entirely missing, that dataset will not be assigned a DOI until that time-slice has been:
1) recovered (ideally)
2) filled entirely with "missing" values (if it is impossible to recover the actual data).
By special exception, DOIs may be assigned to datasets (without requiring infilling of missing data) when a modeling group purposely omits some portion of a time-series, as long as the remaining portion is likely to be of interest. In no case should a single file contain non-contiguous portions of the time series.
Best regards,
Karl
On 6/23/11 7:22 AM, Martina Stockhause wrote:
Dear Ag,
we propose to fill up the gaps, see 2.5.1 in CF:
http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.5/cf-conventions.html#missing-data
which refers to the NetCDF Guide:
http://www.unidata.ucar.edu/software/netcdf/docs/netcdf.html#Attribute-Conventions
However, there are examples where this does not make much sense, e.g. if
somebody intentionally puts 1970-1999 and 2070-2099 into one dataset.
So it remains up to the QC manager to decide whatever makes sense.
In all cases an appropriate comment is required.
Regards... Martina & Frank
On 06/23/2011 01:51 PM, ag.stephens at stfc.ac.uk wrote:
Dear Karl and Bryan,
There was discussion on the handling of missing data a while back. Do we have a policy decision on this issue? It would be great to know exactly where we stand in terms of whether a missing time step will fail QC and hence need fixing before replication (and subsequent DOIs) can be considered.
It looks to me like there are valid arguments either way, so I think what we need is an authoritative decision that we can all follow.
Thanks
Ag
________________________________________
From: Karl Taylor [taylor13 at llnl.gov]
Sent: 04 May 2011 20:40
To: Lawrence, Bryan (STFC,RAL,RALSP)
Cc: go-essp-tech at ucar.edu; Stephens, Ag (STFC,RAL,RALSP)
Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
Hi Bryan,
Oh, I left something out. Why is it lots of work for the user to notice by looking at the time axis that the spacing between coordinates is greater than normal, and thus some time slices have clearly been skipped? For daily data, for example, if the interval between two successive time-coordinates is 10 days, then 9 samples must be missing.
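The check described above can be sketched in a few lines. This is only an illustration, assuming a regularly spaced time axis expressed as numbers (e.g. days since a reference date); the function name `find_time_gaps` and its parameters are hypothetical and do not come from any CMIP5 or QC tool. Monthly data on a real calendar would need a calendar-aware comparison instead of a constant step.

```python
def find_time_gaps(times, step=1.0, tol=1e-6):
    """Return (earlier, later, n_missing) for every gap in a regular time axis.

    `times` are numeric time coordinates in the same units as `step`
    (assumption: constant spacing, e.g. 1.0 for daily data in days).
    """
    gaps = []
    for earlier, later in zip(times, times[1:]):
        interval = later - earlier
        if interval > step + tol:
            # An interval of k steps means k - 1 samples were skipped.
            n_missing = round(interval / step) - 1
            gaps.append((earlier, later, n_missing))
    return gaps


# Daily data with a 10-day interval between two successive coordinates.
daily = [0.0, 1.0, 2.0, 12.0, 13.0]
print(find_time_gaps(daily))  # [(2.0, 12.0, 9)]
```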
I will concede that for some software and for some purposes having time-slices included that are completely filled with the missing_value flag could provide some advantages, so I guess I wouldn't object to requiring this, but I think it's a judgment call that's not that clear-cut.
cheers,
Karl
On 5/4/11 11:42 AM, Bryan Lawrence wrote:
Hi Karl
I think we're somewhat at cross purposes.
My view is that if the time-slices have actually been lost, we
shouldn't necessarily reject the data as being useless.
Agreed.
I agree,
however, that we should encourage the modeling groups to try to
recover or reproduce the lost time slices to make their output more
complete.
Agreed.
If that is impossible, I still think in many cases
analysts will want access to the portions of the time-series that
are available.
In which case we should require them to write missing data fields for
that portion. That should be trivial for them to do, and save the
consumers a vast amount of time (i.e. use the CF missing data flag; we're
not suggesting they have to re-run anything unless they want to).
This is Ag's option 2c, which you don't seem to mention.
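As a rough sketch of what writing missing data fields for the lost portion could look like, the snippet below pads a regular time axis so that every skipped step gets a slice filled entirely with the missing value. It is an illustration under stated assumptions, not anyone's actual procedure: the name `fill_missing_slices`, the flat-list representation of a time slice, and the fill value 1.0e20 are all assumptions. In a real file the value would be the variable's own `missing_value` / `_FillValue` attribute and the data would be written with a netCDF library.

```python
MISSING_VALUE = 1.0e20  # assumed fill value; use the variable's own missing_value


def fill_missing_slices(times, slices, step=1.0, missing_value=MISSING_VALUE):
    """Return (times, slices) on a complete, regularly spaced time axis.

    Each slice is a flat list of grid-point values. Skipped time steps are
    inserted as slices made entirely of `missing_value`.
    """
    if not times:
        return [], []
    n_points = len(slices[0])
    full_times = [times[0]]
    full_slices = [list(slices[0])]
    for t, data in zip(times[1:], slices[1:]):
        # Number of time steps skipped between the previous sample and t.
        n_skipped = round((t - full_times[-1]) / step) - 1
        for _ in range(n_skipped):
            full_times.append(full_times[-1] + step)
            full_slices.append([missing_value] * n_points)
        full_times.append(t)
        full_slices.append(list(data))
    return full_times, full_slices


# Two grid points per slice; time steps 2.0 and 3.0 were lost.
new_times, new_slices = fill_missing_slices(
    [0.0, 1.0, 4.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
)
print(new_times)
print(new_slices)
```

Nothing is re-run here: the existing values are copied unchanged and only the gaps are flagged, so code that reads the data sees a complete time axis.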
Consider, for example, a 1000 year control run with a decade missing
in the middle (perhaps all contained in a single lost file). Don't
you think many researchers will make use of the two portions of the
time-series that *are* available, and shouldn't the available data
be assigned a DOI?
As I recall, data not passing QC level 2 won't normally be replicated
and wouldn't be assigned a DOI. Is this correct?
Correct.
Cheers
Bryan
best regards,
Karl
On 5/4/11 1:08 AM, Bryan Lawrence wrote:
Hi Karl
There are two issues noted in your email: (1) missing variables, and
(2) missing time slices in a sequence.
I agree that (1) is something to be noted; I think (2) is something
that should cause failure, and require a response as Ag has
suggested. I don't think it's too much to ask a modelling group to
either provide the missing data, or provide missing data flags -
but actual missing files in a sequence should be an error and a
failure!
I think we should be holding a candle for the users here. The
reality is that no code is going to read the metadata to find
missing data, whereas code can read and understand missing data
flags.
Bryan
Dear Ag,
There is another possible way of handling the "missing data"
issue. I'm not sure that a dataset should be required to be
complete (i.e., required to include all time slices) to be
considered eligible for DOI assignment. That is, we could relax
the criteria. Note that I don't think we require *all* variables
requested within a single dataset to be present, so some datasets
will indeed be incomplete but be eligible for a DOI. I think the
QC procedure should be to check with the modeling group, and if
they can't supply the missing time-slices, then we somehow note
this flaw in the dataset documentation and if other QC checks are
passed, assign it a DOI.
The criteria for getting a DOI should be that there are no known
errors in the data itself, and that there are no major problems
with the metadata. In this case the data will be reliable, and
analysts will be welcome to use it and publish results, so I
think it should be assigned a DOI.
What do others think?
Best regards,
Karl
On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
Dear all,
At BADC we have come across our first "missing data" issue in the
CMIP5 datasets we are ingesting. We have an example of some
missing months for a particular set of variables that was
revealed when running the QC code from DKRZ.
It would be very useful for the CMIP5 archive managers to make an
authoritative statement about how we should handle missing data
time steps in the archive.
I propose the following response when a Data Node receives a
dataset in which time steps are missing:
1. QC manager (i.e. whoever runs the QC code) informs Data
Provider that there is missing data in a dataset (specifying
full DRS structure and date range missing).
2a. If Data Provider says "no, cannot provide this data" then
the affected datasets cannot get a DOI and cannot be part of
the "crystallised archive". STOP
2b. Data Provider re-generates files, data is re-ingested, new
version is generated, QC is re-run, all is good. STOP
2c. Data Provider cannot re-generate but wants to pass QC - so
needs to create the required files full of missing data.
3. Data Provider creates missing data files and sends, data
re-ingested, new version is generated, QC re-run, all good.
STOP
In cases 2a and 2c it would also be very useful if the dataset is
annotated to inform the user which dates have been FILLED with
missing data. This would, I believe, be in the QC logs but we
might want a more prominent record of this if possible.
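The branches of the proposal above (2a, 2b, 2c) can be written down as a small lookup, purely as a sketch. The names `ProviderResponse` and `qc_outcome` and the keys of the returned dictionary are hypothetical; they are not part of the DKRZ QC code. DOI eligibility for 2b and 2c is still subject to the re-run QC passing.

```python
from enum import Enum


class ProviderResponse(Enum):
    CANNOT_PROVIDE = "2a"       # data cannot be supplied at all
    REGENERATED = "2b"          # provider re-generates the files
    FILLED_WITH_MISSING = "2c"  # provider supplies files full of missing data


def qc_outcome(response):
    """Map the Data Provider's response to the outcome in the proposal."""
    if response is ProviderResponse.CANNOT_PROVIDE:
        # No DOI, not part of the crystallised archive; annotate the dates.
        return {"reingest": False, "doi_eligible": False, "annotate_dates": True}
    if response is ProviderResponse.REGENERATED:
        # New version is ingested and QC is re-run.
        return {"reingest": True, "doi_eligible": True, "annotate_dates": False}
    if response is ProviderResponse.FILLED_WITH_MISSING:
        # Filled files are ingested, QC is re-run, filled dates are annotated.
        return {"reingest": True, "doi_eligible": True, "annotate_dates": True}
    raise ValueError(f"unknown response: {response!r}")


for response in ProviderResponse:
    print(response.value, qc_outcome(response))
```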
Cheers,
Ag
BADC
--
Scanned by iCritical.
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence