[Go-essp-tech] Handling missing data in the CMIP5 archive
Bentley, Philip
philip.bentley at metoffice.gov.uk
Thu May 5 03:52:37 MDT 2011
Hi Karl,
As a modelling centre affected (or should that be afflicted!) by this
particular issue, it's probably time for us to chime in with our own 2
cents worth.
There are a variety of technical and human reasons why there are
occasional small temporal gaps in the model data that we have submitted
to the CMIP5 archive: model crashes/restarts, files not making it into
our archive system, start/end dates not specified exactly in conformance
with the CMIP5 experiment plan, etc, etc. (Given the number of
experiments that MOHC is conducting I don't think it would be humanly
possible for us to get everything right all the time :-).
If it were a trivial matter to identify and fix these small bits of missing
data I can assure you that we would have done that. The reality,
however, is that the complexities (and, yes, quirks) of the UM, together
with the software integration aspects of the CMOR library, mean that it is
by no means a trivial technical issue. And like the rest of the
CMIP5/ESGF endeavour - that's you guys! - we have very few resources
spread fairly thinly. Hence we have had to make decisions on where to
prioritise our efforts. Do we fix occasional small gaps in data
time-series, or do we focus on CFMIP2, TAMIP, 60-level models, or invest
*significant* effort into understanding and using the CMIP5
questionnaire? (In the latter case, to the not inconsiderable benefit of
other modelling centres.)
So, in the same spirit in which the compliance rules were relaxed with
regard to provision of model metadata via the CMIP5 questionnaire, we
would hope that similar flexibility be extended to the submission of
model data, some of which may contain occasional small portions of
missing data. Not surprisingly perhaps, we believe that it is far
preferable to have 99% of the data for a particular simulation available
in the archive than have it rejected (or non-DOI'd) because of, say, 1
missing month or year.
Also, given that we have been submitting model data to the archive since
last October, it would seem somewhat, er, punitive to introduce a
stricter data compliance rule at this stage in the game!
For our part we will endeavour to minimise the size/number of temporal
gaps in our submitted data. And, as time and resources permit, we will
investigate technical solutions that will enable us to supply files of
missing data where we do have such gaps. In the meantime we will
continue to utilise the appropriate mechanisms (e.g. the CMIP5
questionnaire) to flag up data quality issues such as this.
Regards,
Phil
________________________________
From: go-essp-tech-bounces at ucar.edu
[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
Sent: 04 May 2011 20:41
To: Bryan Lawrence
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
Hi Bryan,
Oh, I left something out. Why is it lots of work for the user
to notice by looking at the time axis that the spacing between
coordinates is greater than normal, and thus some time slices have
clearly been skipped? For daily data, for example, if the interval
between two successive time-coordinates is 10 days, then 9 samples must
be missing.
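
The check described above can be sketched in a few lines of Python. This is only an illustration, assuming the time axis is already available as a sorted list of numbers in units of days and that the nominal spacing is constant; the function name `find_gaps` and its parameters are hypothetical and not part of any CMIP5 or QC tool. Monthly data on a real calendar would need calendar-aware handling instead of a fixed step.

```python
def find_gaps(times, step=1.0, tol=1e-6):
    """Return (previous, current, n_missing) for each gap in a time axis.

    times: sorted time coordinates, in the same units as step
           (assumed here to be days, with a constant nominal spacing).
    """
    gaps = []
    for prev, curr in zip(times, times[1:]):
        interval = curr - prev
        if interval > step + tol:
            # An interval of N steps means N - 1 samples were skipped.
            n_missing = round(interval / step) - 1
            gaps.append((prev, curr, n_missing))
    return gaps


# Daily data with a 10-day interval between two successive coordinates.
daily = [0.0, 1.0, 2.0, 12.0, 13.0]
print(find_gaps(daily))  # [(2.0, 12.0, 9)]
```

The 10-day interval is reported as 9 missing samples, matching the daily example above.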
I will concede that for some software and for some purposes
having time-slices included that are completely filled with the
missing_value flag could provide some advantages, so I guess I wouldn't
object to requiring this, but I think it's a judgment call that's not
that clear-cut.
cheers,
Karl
On 5/4/11 11:42 AM, Bryan Lawrence wrote:
Hi Karl
I think we're somewhat at cross purposes.
My view is that if the time-slices have actually been lost, we
shouldn't necessarily reject the data as being useless.
Agreed.
I agree, however, that we should encourage the modeling groups to try
to recover or reproduce the lost time slices to make their output more
complete.
Agreed.
If that is impossible, I still think in many cases analysts will want
access to the portions of the time-series that are available.
In which case we should require them to write missing data fields for
that portion. That should be trivial for them to do, and would save the
consumers a vast amount of time. (I.e. use the CF missing data flag;
we're not suggesting they have to re-run anything unless they want to.)
This is Ag's option 2c, which you don't seem to mention.
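
As a rough illustration of what writing missing data fields involves, the sketch below pads a time series in memory with all-missing fields wherever time steps were skipped. It assumes a constant time step, fields held as plain Python lists, and a fill value of 1.0e20; the function name `pad_with_missing` and the constant `MISSING` are hypothetical. Writing the result to a netCDF file, with the fill value declared in the variable's `missing_value` / `_FillValue` attributes, is not shown.

```python
MISSING = 1.0e20  # assumed fill value; must match the file's missing_value attribute


def pad_with_missing(times, fields, step=1.0, missing=MISSING):
    """Insert all-missing fields wherever the time axis skips steps.

    times:  sorted time coordinates with a constant nominal step (assumption).
    fields: one flat list of values per time coordinate.
    Returns new (times, fields) lists with the gaps filled.
    """
    out_times, out_fields = [], []
    for i, (t, field) in enumerate(zip(times, fields)):
        if i > 0:
            n_missing = round((t - times[i - 1]) / step) - 1
            for k in range(1, n_missing + 1):
                out_times.append(times[i - 1] + k * step)
                out_fields.append([missing] * len(field))
        out_times.append(t)
        out_fields.append(field)
    return out_times, out_fields


times, fields = pad_with_missing([0.0, 1.0, 4.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(times)      # [0.0, 1.0, 2.0, 3.0, 4.0]
print(fields[2])  # [1e+20, 1e+20]
```

Nothing is re-run or interpolated here: the existing values are left untouched and only placeholder slices are added.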
Consider, for example, a 1000 year control run with a decade missing
in the middle (perhaps all contained in a single lost file). Don't
you think many researchers will make use of the two portions of the
time-series that *are* available, and shouldn't the available data
be assigned a DOI?
As I recall, data not passing QC level 2 won't normally be replicated
and wouldn't be assigned a DOI. Is this correct?
Correct.
Cheers
Bryan
best regards,
Karl
On 5/4/11 1:08 AM, Bryan Lawrence wrote:
Hi Karl
There are two issues noted in your email: (1) missing variables, and
(2) missing time slices in a sequence.
I agree that (1) is something to be noted; I think (2) is something
that should cause failure, and require a response as Ag has
suggested. I don't think it's too much to ask a modelling group to
either provide the missing data, or provide missing data flags -
but actual missing files in a sequence should be an error and a
failure!
I think we should be holding a candle for the users here. The
reality is that no code is going to read the metadata to find
missing data, whereas code can read and understand missing data
flags.
Bryan
Dear Ag,
There is another possible way of handling the "missing data"
issue. I'm not sure that a dataset should be required to be
complete (i.e., required to include all time slices) to be
considered eligible for DOI assignment. That is, we could relax
the criteria. Note that I don't think we require *all* variables
requested within a single dataset to be present, so some datasets
will indeed be incomplete but be eligible for a DOI. I think the
QC procedure should be to check with the modeling group, and if
they can't supply the missing time-slices, then we somehow note
this flaw in the dataset documentation and if other QC checks are
passed, assign it a DOI.
The criteria for getting a DOI should be that there are no known
errors in the data itself, and that there are no major problems
with the metadata. In this case the data will be reliable, and
analysts will be welcome to use it and publish results, so I
think it should be assigned a DOI.
What do others think?
Best regards,
Karl
On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
Dear all,
At BADC we have come across our first "missing data" issue in the
CMIP5 datasets we are ingesting. We have an example of some
missing months for a particular set of variables that was
revealed when running the QC code from DKRZ.
It would be very useful for the CMIP5 archive managers to make an
authoritative statement about how we should handle missing data
time steps in the archive.
I propose the following response when a Data Node receives a dataset
in which time steps are missing:
1. QC manager (i.e. whoever runs the QC code) informs Data
   Provider that there is missing data in a dataset (specifying
   full DRS structure and date range missing).
2a. If Data Provider says "no, cannot provide this data" then
    the affected datasets cannot get a DOI and cannot be part of
    the "crystallised archive". STOP
2b. Data Provider re-generates files, data is re-ingested, new
    version is generated, QC is re-run, all is good. STOP
2c. Data Provider cannot re-generate but wants to pass QC - so
    needs to create the required files full of missing data.
3. Data Provider creates missing data files and sends, data
   re-ingested, new version is generated, QC re-run, all good.
   STOP
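
The branching in the proposal above can be written down as a small decision function. This is only a sketch of the proposed procedure, not an existing QC tool: the function name `next_steps`, the action strings, and the use of the labels "2a", "2b" and "2c" as input values are illustrative assumptions, and DOI eligibility is still subject to the re-run QC passing.

```python
def next_steps(provider_response):
    """Map a Data Provider's response to (doi_possible, actions).

    provider_response: "2a", "2b" or "2c", following the option labels
    in the proposal. doi_possible is True when the dataset can still
    go forward to a DOI once the re-run QC passes.
    """
    reingest = ["re-ingest data", "generate new version", "re-run QC"]
    if provider_response == "2a":
        # Provider cannot supply the data: no DOI, not in the crystallised archive.
        return False, ["exclude from crystallised archive"]
    if provider_response == "2b":
        return True, ["provider re-generates files"] + reingest
    if provider_response == "2c":
        # Step 3: provider sends files full of missing data instead.
        return True, ["provider creates files full of missing data"] + reingest
    raise ValueError("unknown response: %r" % (provider_response,))


print(next_steps("2c"))
```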
In cases 2a and 2c it would also be very useful if the dataset is
annotated to inform the user which dates have been FILLED with
missing data. This would, I believe, be in the QC logs but we
might want a more prominent record of this if possible.
Cheers,
Ag
BADC
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence