[Go-essp-tech] Handling missing data in the CMIP5 archive

Thu May 5 03:52:37 MDT 2011

Hi Karl,

As a modelling centre affected (or should that be afflicted!) by this
particular issue, it's probably time for us to chime in with our own 2
cents worth.

There are a variety of technical and human reasons why there are
occasional small temporal gaps in the model data that we have submitted
to the CMIP5 archive: model crashes/restarts, files not making it into
our archive system, start/end dates not specified exactly in conformance
with the CMIP5 experiment plan, etc, etc. (Given the number of
experiments that MOHC is conducting I don't think it would be humanly
possible for us to get everything right all the time :-).

If it were a trivial matter to identify and fix these small bits of missing
data I can assure you that we would have done that. The reality,
however, is that the complexities (and, yes, quirks) of the UM, together
with the software integration aspects of the CMOR library, mean that it is
by no means a trivial technical issue.  And like the rest of the
CMIP5/ESGF endeavour - that's you guys! - we have very few resources
spread fairly thinly.  Hence we have had to make decisions on where to
prioritise our efforts. Do we fix occasional small gaps in data
time-series, or do we focus on CFMIP2, TAMIP, 60-level models, or invest
*significant* effort into understanding and using the CMIP5
questionnaire? (In the latter case, to the not inconsiderable benefit of
other modelling centres.)

So, in the same spirit in which the compliance rules were relaxed with
regard to provision of model metadata via the CMIP5 questionnaire, we
would hope that similar flexibility be extended to the submission of
model data, some of which may contain occasional small portions of
missing data.  Not surprisingly perhaps, we believe that it is far
preferable to have 99% of the data for a particular simulation available
in the archive than have it rejected (or non-DOI'd) because of, say, 1
missing month or year.

Also, given that we have been submitting model data to the archive since
last October, it would seem somewhat, er, punitive to introduce a
stricter data compliance rule at this stage in the game!

For our part we will endeavour to minimise the size/number of temporal
gaps in our submitted data. And, as time and resources permit, we will
investigate technical solutions that will enable us to supply files of
missing data where we do have such gaps. In the meantime we will
continue to utilise the appropriate mechanisms (e.g. the CMIP5
questionnaire) to flag up data quality issues such as this.

Regards,

Phil

________________________________

	From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
	Sent: 04 May 2011 20:41
	To: Bryan Lawrence
	Cc: go-essp-tech at ucar.edu
	Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive

	Hi Bryan,

	Oh, I left something out.  Why is it lots of work for the user to
	notice by looking at the time axis that the spacing between
	coordinates is greater than normal, and thus some time slices have
	clearly been skipped?  For daily data, for example, if the interval
	between two successive time-coordinates is 10 days, then 9 samples
	must be missing.
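
A minimal sketch of the check described above, assuming the time axis is an ascending list of numeric coordinates with a fixed nominal spacing (the daily case). The function name `find_time_gaps` and its parameters are hypothetical and do not come from any CMIP5 or QC tool. Monthly data on a real calendar has uneven spacing and would need a calendar-aware comparison instead.

```python
def find_time_gaps(times, step=1.0):
    """Return (previous_time, next_time, n_missing) for each gap in `times`.

    `times` is an ascending list of time coordinates expressed in the same
    units as `step` (for example, days since a reference date).
    """
    gaps = []
    for prev, nxt in zip(times, times[1:]):
        # An interval of k steps between neighbours means k - 1 samples are absent.
        n_missing = round((nxt - prev) / step) - 1
        if n_missing > 0:
            gaps.append((prev, nxt, n_missing))
    return gaps


# Daily data with a 10-day interval between two successive coordinates.
daily = [0.0, 1.0, 2.0, 12.0, 13.0]
print(find_time_gaps(daily))  # [(2.0, 12.0, 9)]
```

The 10-day interval yields 9 missing samples, matching the arithmetic in the message.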

	I will concede that for some software and for some purposes having
	time-slices included that are completely filled with the
	missing_value flag could provide some advantages, so I guess I
	wouldn't object to requiring this, but I think it's a judgment call
	that's not that clear-cut.

	cheers,
	Karl

	On 5/4/11 11:42 AM, Bryan Lawrence wrote: 

		Hi Karl

		I think we're somewhat at cross purposes.

			My view is that if the time-slices have actually been lost, we
			shouldn't necessarily reject the data as being useless.

		Agreed.

			I agree,
			however, that we should encourage the modeling groups to try to
			recover or reproduce the lost time slices to make their output more
			complete.

		Agreed.

			If that is impossible, I still think in many cases
			analysts will want access to the portions of the time-series that
			are available.

		In which case we should require them to write missing data fields for
		that portion. That should be trivial for them to do, and save the
		consumers a vast amount of time.  (i.e. use the CF missing data flag; we're
		not suggesting they have to re-run anything unless they want to).
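
A rough sketch of what writing missing data fields could look like, under stated assumptions: each time slice is held as a flat list of values, the time axis has a fixed step, and the flag value is 1.0e20. The names `pad_with_missing` and `MISSING_VALUE` are hypothetical, and a real variable should use whatever `missing_value` its own file declares. This only pads in-memory data; it does not write netCDF or go through CMOR.

```python
MISSING_VALUE = 1.0e20  # assumed flag; take the real one from the variable's attributes


def pad_with_missing(times, fields, step=1.0, missing=MISSING_VALUE):
    """Insert all-missing fields wherever the time axis skips one or more steps.

    times:  ascending list of time coordinates, in the same units as `step`
    fields: list of flat lists of values, one per entry in `times`
    Returns new (times, fields) lists with the gaps filled by the flag value.
    """
    if not times:
        return [], []
    npoints = len(fields[0])
    out_times, out_fields = [times[0]], [fields[0]]
    for t, f in zip(times[1:], fields[1:]):
        # Number of slices absent between the last kept time and this one.
        n_missing = round((t - out_times[-1]) / step) - 1
        for _ in range(n_missing):
            out_times.append(out_times[-1] + step)
            out_fields.append([missing] * npoints)
        out_times.append(t)
        out_fields.append(f)
    return out_times, out_fields


times, fields = pad_with_missing([0.0, 1.0, 4.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(times)
```

Nothing is re-run or interpolated here: the skipped slices are filled entirely with the flag, so reading code sees a regular time axis and can mask the gap.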

		This is Ag's option 2c, which you don't seem to mention.

			Consider, for example, a 1000 year control run with a decade missing
			in the middle (perhaps all contained in a single lost file).  Don't
			you think many researchers will make use of the two portions of the
			time-series that *are* available, and shouldn't the available data
			be assigned a DOI?

			As I recall, data not passing QC level 2 won't normally be replicated
			and wouldn't be assigned a DOI.  Is this correct?

		Correct.

		Cheers
		Bryan

			best regards,
			Karl

			On 5/4/11 1:08 AM, Bryan Lawrence wrote:

				Hi Karl

				There are two issues noted in your email: (1) missing variables, and
				(2) missing time slices in a sequence.

				I agree that (1) is something to be noted, I think (2) is something
				that should cause failure, and require a response as Ag has
				suggested. I don't think it's too much to ask a modelling group to
				either provide the missing data, or provide missing data flags -
				but actual missing files in a sequence should be an error and a
				failure!

				I think we should be holding a candle for the users here. The
				reality is that no code is going to read the metadata to find
				missing data, whereas code can read and understand missing data
				flags.

				Bryan

				Dear Ag,

				There is another possible way of handling the "missing data"
				issue. I'm not sure that a dataset should be required to be
				complete (i.e., required to include all time slices) to be
				considered eligible for DOI assignment. That is, we could relax
				the criteria. Note that I don't think we require *all* variables
				requested within a single dataset to be present, so some datasets
				will indeed be incomplete but be eligible for a DOI.  I think the
				QC procedure should be to check with the modeling group, and if
				they can't supply the missing time-slices, then we somehow note
				this flaw in the dataset documentation and if other QC checks are
				passed, assign it a DOI.

				The criteria for getting a DOI should be that there are no known
				errors in the data itself, and that there are no major problems
				with the metadata.  In this case the data will be reliable, and
				analysts will be welcome to use it and publish results, so I
				think it should be assigned a DOI.

				What do others think?

				Best regards,
				Karl

				On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:

				Dear all,

				At BADC we have come across our first "missing data" issue in the
				CMIP5 datasets we are ingesting. We have an example of some
				missing months for a particular set of variables that was
				revealed when running the QC code from DKRZ.

				It would be very useful for the CMIP5 archive managers to make an
				authoritative statement about how we should handle missing data
				time steps in the archive.

				I propose the following response when a Data Node receives a
				dataset in which time steps are missing:

				   1. QC manager (i.e. whoever runs the QC code) informs Data
				   Provider that there is missing data in a dataset (specifying
				   full DRS structure and date range missing).

				   2a. If Data Provider says "no, cannot provide this data" then
				   the affected datasets cannot get a DOI and cannot be part of
				   the "crystallised archive". STOP

				   2b. Data Provider re-generates files, data is re-ingested, new
				   version is generated, QC is re-run, all is good. STOP

				   2c. Data Provider cannot re-generate but wants to pass QC - so
				   needs to create the required files full of missing data.

				   3. Data Provider creates missing data files and sends, data
				   re-ingested, new version is generated, QC re-run, all good.
				   STOP
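
The branches of the proposal above can be summarised as a small decision function. This is only an illustrative sketch: `ProviderResponse`, `qc_outcome` and the dictionary keys are hypothetical names, not part of the DKRZ QC code or any ESGF tool, and the `annotate` flag simply follows the message's wording that cases 2a and 2c should be annotated.

```python
from enum import Enum


class ProviderResponse(Enum):
    """Hypothetical labels for the Data Provider's reply (steps 2a to 2c)."""
    CANNOT_PROVIDE = "2a"
    REGENERATED = "2b"
    FILLED_WITH_MISSING = "2c"


def qc_outcome(response):
    """Map the Data Provider's reply to the outcome described in the proposal."""
    if response is ProviderResponse.CANNOT_PROVIDE:
        # 2a: no DOI and not part of the "crystallised archive".
        return {"doi_eligible": False, "reingest": False, "annotate": True}
    if response is ProviderResponse.REGENERATED:
        # 2b: files re-generated, re-ingested as a new version, QC re-run.
        return {"doi_eligible": True, "reingest": True, "annotate": False}
    if response is ProviderResponse.FILLED_WITH_MISSING:
        # 2c then 3: files full of missing data supplied, re-ingested, QC re-run.
        return {"doi_eligible": True, "reingest": True, "annotate": True}
    raise ValueError(f"unknown response: {response!r}")


print(qc_outcome(ProviderResponse.FILLED_WITH_MISSING))
```

Here `doi_eligible` means only that the dataset can go forward once the re-run QC passes; it does not replace the QC checks themselves.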

				In cases 2a and 2c it would also be very useful if the dataset is
				annotated to inform the user which dates have been FILLED with
				missing data. This would, I believe, be in the QC logs but we
				might want a more prominent record of this if possible.

				Cheers,

				Ag
				BADC

				--
				Bryan Lawrence
				Director of Environmental Archival and Associated Research
				(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
				STFC, Rutherford Appleton Laboratory
				Phone +44 1235 445012; Fax ... 5848;
				Web: home.badc.rl.ac.uk/lawrence

		--
		Bryan Lawrence
		Director of Environmental Archival and Associated Research
		(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
		STFC, Rutherford Appleton Laboratory
		Phone +44 1235 445012; Fax ... 5848; 
		Web: home.badc.rl.ac.uk/lawrence
