[Go-essp-tech] Handling missing data in the CMIP5 archive

Thu May 5 04:07:18 MDT 2011

Hi Phil

You guys have known about this error (and others like it) for weeks.

Are you really saying it's a vast effort to have a short Python code that 
provides missing data sequences when *you (MOHC)*
 a) already know exactly what is missing,
 b) know how long it is missing for, and
 c) have written your own internal q.c. report for it (which must take a 
comparable amount of time to running a "provide missing data" code)?
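
A minimal sketch of what such a "provide missing data" code might look like, assuming monthly data, a fill value of 1.0e20 and mid-month time stamps. None of those choices come from this thread, and the function names are hypothetical; a real version would write the fields out via the centre's own tooling.

```python
from datetime import date

# Assumed flag value; use whatever missing_value the variable declares.
FILL_VALUE = 1.0e20


def months_between(start, end):
    """Yield (year, month) for every month from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def missing_month_records(start, end, n_points):
    """Build one all-missing field per absent month (mid-month time stamp)."""
    return [
        {"time": date(y, m, 15), "values": [FILL_VALUE] * n_points}
        for (y, m) in months_between(start, end)
    ]


# Hypothetical gap: November 1985 through February 1986, 4 grid points.
records = missing_month_records((1985, 11), (1986, 2), n_points=4)
for rec in records:
    print(rec["time"].isoformat(), len(rec["values"]), "missing values")
```

The inputs are exactly a) and b) above: which months are absent and for how long.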

We then find it immediately too, because we're running q.c. code, and we 
could fix it too (and would, but then it wouldn't be *your* data any 
more). Every user will do this too (or cock up because they don't spot 
the problem).

I know we're all spread thin, and I can lose this argument in the 
context of CMIP5 q.c. up to level 2, but I'm not ready to give in for q.c. 
level 3.

In particular, a DOI is not a URL alone. It has to mean something a wee 
bit more than that if you want it to be taken seriously as a 
publication. Speaking personally, I won't go to bat with the journals and 
ISI unless it is!

I know you're spread thin, but we are all spread thin. I'm not 
suggesting you have to rerun the model here!

Cheers
Bryan

> Hi Karl,
> 
> As a modelling centre affected (or should that be afflicted!) by this
> particular issue, it's probably time for us to chime in with our own
> 2 cents worth.
> 
> There are a variety of technical and human reasons why there are
> occasional small temporal gaps in the model data that we have
> submitted to the CMIP5 archive: model crashes/restarts, files not
> making it into our archive system, start/end dates not specified
> exactly in conformance with the CMIP5 experiment plan, etc, etc.
> (Given the number of experiments that MOHC is conducting I don't
> think it would be humanly possible for us to get everything right
> all the time :-).
> 
> If it were a trivial matter to identify and fix these small bits of
> missing data I can assure you that we would have done that. The
> reality, however, is that the complexities (and, yes, quirks) of the
> UM, together with the software integration aspects of the CMOR
> library, mean that it is by no means a trivial technical issue.  And
> like the rest of the CMIP5/ESGF endeavour - that's you guys! - we
> have very few resources spread fairly thinly.  Hence we have had to
> make decisions on where to prioritise our efforts. Do we fix
> occasional small gaps in data time-series, or do we focus on CFMIP2,
> TAMIP, 60-level models, or invest *significant* effort into
> understanding and using the CMIP5 questionnaire? (In the latter case,
> to the not inconsiderable benefit of other modelling centres.)
> 
> So, in the same spirit in which the compliance rules were relaxed
> with regard to provision of model metadata via the CMIP5
> questionnaire, we would hope that similar flexibility be extended to
> the submission of model data, some of which may contain occasional
> small portions of missing data.  Not surprisingly perhaps, we
> believe that it is far preferable to have 99% of the data for a
> particular simulation available in the archive than have it rejected
> (or non-DOI'd) because of, say, 1 missing month or year.
> 
> Also, given that we have been submitting model data to the archive
> since last October, it would seem somewhat, er, punitive to
> introduce a stricter data compliance rule at this stage in the game!
> 
> For our part we will endeavour to minimise the size/number of
> temporal gaps in our submitted data. And, as time and resources
> permit, we will investigate technical solutions that will enable us
> to supply files of missing data where we do have such gaps. In the
> meantime we will continue to utilise the appropriate mechanisms
> (e.g. the CMIP5 questionnaire) to flag up data quality issues such
> as this.
> 
> Regards,
> 
> Phil
> 
> 
> 
> ________________________________
> 
> 	From: go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> 	Sent: 04 May 2011 20:41
> 	To: Bryan Lawrence
> 	Cc: go-essp-tech at ucar.edu
> 	Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5
> archive
> 
> 
> 	Hi Bryan,
> 
> 	Oh, I left something out.  Why is it lots of work for the user
> 	to notice by looking at the time axis that the spacing between
> 	coordinates is greater than normal, and thus some time slices have
> 	clearly been skipped?  For daily data, for example, if the interval
> 	between two successive time-coordinates is 10 days, then 9 samples
> 	must be missing.
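
The check described here can be sketched in a few lines, assuming the time axis is a list of numeric coordinates (e.g. days since a reference date) with a fixed nominal step. The name `find_gaps` is hypothetical.

```python
def find_gaps(times, step):
    """Return (previous, next, n_missing) for every interval wider than `step`."""
    gaps = []
    for prev, nxt in zip(times, times[1:]):
        # An interval of k steps means k - 1 samples were skipped.
        n_missing = round((nxt - prev) / step) - 1
        if n_missing > 0:
            gaps.append((prev, nxt, n_missing))
    return gaps


# Daily data; the samples at 4.5 through 12.5 are absent.
times = [0.5, 1.5, 2.5, 3.5, 13.5, 14.5]
print(find_gaps(times, step=1.0))  # [(3.5, 13.5, 9)]
```

A 10-day interval in daily data is reported as 9 missing samples, as in the example above. Monthly data on a real calendar has a varying step, so it would need a calendar-aware comparison rather than a fixed `step`.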
> 
> 	I will concede that for some software and for some purposes
> 	having time-slices included that are completely filled with the
> 	missing_value flag could provide some advantages, so I guess I
> 	wouldn't object to requiring this, but I think it's a judgment call
> 	that's not that clear-cut.
> 
> 	cheers,
> 	Karl
> 
> 	On 5/4/11 11:42 AM, Bryan Lawrence wrote:
> 
> 		Hi Karl
> 
> 		I think we're somewhat at cross purposes.
> 
> 
> 			My view is that if the time-slices have actually been lost, we
> 			shouldn't necessarily reject the data as being useless.
> 
> 		Agreed.
> 
> 
> 			I agree, however, that we should encourage the modeling groups
> 			to try to recover or reproduce the lost time slices to make
> 			their output more complete.
> 
> 		Agreed.
> 
> 
> 			If that is impossible, I still think in many cases analysts
> 			will want access to the portions of the time-series that are
> 			available.
> 
> 		In which case we should require them to write missing data
> 		fields for that portion. That should be trivial for them to do,
> 		and save the consumers a vast amount of time (i.e. use the CF
> 		missing data flag; we're not suggesting they have to re-run
> 		anything unless they want to).
> 
> 		This is Ag's option 2c, which you don't seem to mention.
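
As a rough illustration of option 2c, the sketch below pads a series so that every skipped time step carries a field full of the missing-data flag. It assumes a fixed step, fields held as plain lists, and a flag value of 1.0e20; the name `pad_with_missing` is hypothetical and nothing here is an existing tool.

```python
# Assumed flag value; it should match the variable's missing_value attribute.
FILL_VALUE = 1.0e20


def pad_with_missing(times, fields, step, fill=FILL_VALUE):
    """Return a regular (times, fields) pair with fill-valued fields in the gaps."""
    out_times, out_fields = [times[0]], [fields[0]]
    width = len(fields[0])
    for t, field in zip(times[1:], fields[1:]):
        last = out_times[-1]
        n_missing = round((t - last) / step) - 1
        # Insert one all-missing field for each skipped time step.
        for k in range(1, n_missing + 1):
            out_times.append(last + k * step)
            out_fields.append([fill] * width)
        out_times.append(t)
        out_fields.append(field)
    return out_times, out_fields


# Daily series with the samples at 2.5 and 3.5 absent.
times = [0.5, 1.5, 4.5]
fields = [[280.1, 281.0], [280.3, 281.2], [279.9, 280.8]]
new_times, new_fields = pad_with_missing(times, fields, step=1.0)
print(new_times)
```

The original samples are passed through untouched; only the gaps are filled.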
> 
> 
> 			Consider, for example, a 1000 year control run with a decade
> 			missing in the middle (perhaps all contained in a single lost
> 			file).  Don't you think many researchers will make use of the
> 			two portions of the time-series that *are* available, and
> 			shouldn't the available data be assigned a DOI?
> 
> 
> 
> 			As I recall, data not passing QC level 2 won't normally be
> 			replicated and wouldn't be assigned a DOI.  Is this correct?
> 
> 		Correct.
> 
> 		Cheers
> 		Bryan
> 
> 
> 
> 			best regards,
> 			Karl
> 
> 			On 5/4/11 1:08 AM, Bryan Lawrence wrote:
> 
> 				Hi Karl
> 
> 				There are two issues noted in your email: (1) missing
> 				variables, and (2) missing time slices in a sequence.
> 
> 				I agree that (1) is something to be noted, I think (2) is
> 				something that should cause failure, and require a response as
> 				Ag has suggested. I don't think it's too much to ask a
> 				modelling group to either provide the missing data, or provide
> 				missing data flags - but actual missing files in a sequence
> 				should be an error and a failure!
> 
> 				I think we should be holding a candle for the users here. The
> 				reality is that no code is going to read the metadata to find
> 				missing data, whereas code can read and understand missing
> 				data flags.
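
To illustrate the consumer side of this point, here is a small sketch of analysis code that honours a missing-data flag, assuming a flag value of 1.0e20 and a series held as a plain list. The function name is hypothetical.

```python
# Assumed flag value; real code would read it from the variable's attributes.
FILL_VALUE = 1.0e20


def mean_ignoring_missing(values, fill=FILL_VALUE):
    """Average the valid samples; return None if every sample is flagged."""
    valid = [v for v in values if v != fill]
    return sum(valid) / len(valid) if valid else None


series = [1.0, 2.0, FILL_VALUE, FILL_VALUE, 3.0]
print(mean_ignoring_missing(series))  # 2.0
```

With the flagged steps present the series stays regular and the code needs no knowledge of where the gap is. If the two steps were simply absent, the same code would run without complaint and treat the three samples as consecutive.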
> 
> 				Bryan
> 
> 
> 				Dear Ag,
> 
> 				There is another possible way of handling the "missing data"
> 				issue. I'm not sure that a dataset should be required to be
> 				complete (i.e., required to include all time slices) to be
> 				considered eligible for DOI assignment. That is, we could
> 				relax the criteria. Note that I don't think we require *all*
> 				variables requested within a single dataset to be present, so
> 				some datasets will indeed be incomplete but be eligible for a
> 				DOI.  I think the QC procedure should be to check with the
> 				modeling group, and if they can't supply the missing
> 				time-slices, then we somehow note this flaw in the dataset
> 				documentation and if other QC checks are passed, assign it a
> 				DOI.
> 
> 				The criteria for getting a DOI should be that there are no
> 				known errors in the data itself, and that there are no major
> 				problems with the metadata.  In this case the data will be
> 				reliable, and analysts will be welcome to use it and publish
> 				results, so I think it should be assigned a DOI.
> 
> 				What do others think?
> 
> 				Best regards,
> 				Karl
> 
> 				On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
> 
> 				Dear all,
> 
> 				At BADC we have come across our first "missing data" issue in
> 				the CMIP5 datasets we are ingesting. We have an example of
> 				some missing months for a particular set of variables that was
> 				revealed when running the QC code from DKRZ.
> 
> 				It would be very useful for the CMIP5 archive managers to make
> 				an authoritative statement about how we should handle missing
> 				data time steps in the archive.
> 
> 				I propose the following response when a Data Node receives a
> 				dataset in which time steps are missing:
> 
> 				   1. QC manager (i.e. whoever runs the QC code) informs Data
> 				   Provider that there is missing data in a dataset
> 				   (specifying full DRS structure and date range missing).
> 
> 				   2a. If Data Provider says "no, cannot provide this data"
> 				   then the affected datasets cannot get a DOI and cannot be
> 				   part of the "crystallised archive". STOP
> 
> 				   2b. Data Provider re-generates files, data is re-ingested,
> 				   new version is generated, QC is re-run, all is good. STOP
> 
> 				   2c. Data Provider cannot re-generate but wants to pass QC -
> 				   so needs to create the required files full of missing data.
> 
> 				   3. Data Provider creates missing data files and sends, data
> 				   re-ingested, new version is generated, QC re-run, all good.
> 				   STOP
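
The proposed steps can be read as a small decision procedure. The sketch below is only one reading of it, with hypothetical names (`Outcome`, `handle_missing_timesteps`) and two yes/no answers standing in for the Data Provider's response.

```python
from enum import Enum


class Outcome(Enum):
    NO_DOI = "2a: no DOI, not part of the crystallised archive"
    REGENERATED = "2b: files re-generated, re-ingested, QC re-run"
    FILLED = "2c/3: missing-data files supplied, re-ingested, QC re-run"


def handle_missing_timesteps(can_regenerate, will_fill):
    """Map the Data Provider's answer (after step 1) to a proposed outcome."""
    if can_regenerate:
        return Outcome.REGENERATED
    if will_fill:
        return Outcome.FILLED
    return Outcome.NO_DOI


print(handle_missing_timesteps(can_regenerate=False, will_fill=True).value)
```

Step 1 (notifying the Data Provider with the DRS structure and missing date range) is assumed to have happened before this function is called.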
> 
> 				In cases 2a and 2c it would also be very useful if the dataset
> 				is annotated to inform the user which dates have been FILLED
> 				with missing data. This would, I believe, be in the QC logs
> 				but we might want a more prominent record of this if possible.
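
One way such an annotation could be derived, sketched under the assumption that time steps are labelled with strings and fields are plain lists using a flag value of 1.0e20. The name `filled_ranges` is hypothetical; where the resulting record would be stored is not settled in this thread.

```python
def filled_ranges(times, fields, fill=1.0e20):
    """Collapse consecutive all-missing time steps into (first, last) ranges."""
    ranges = []
    open_range = False
    for t, field in zip(times, fields):
        if all(v == fill for v in field):
            if open_range:
                # Extend the range that is currently open.
                ranges[-1] = (ranges[-1][0], t)
            else:
                ranges.append((t, t))
                open_range = True
        else:
            open_range = False
    return ranges


F = 1.0e20
times = ["1985-10", "1985-11", "1985-12", "1986-01", "1986-02", "1986-03"]
fields = [[1.0], [F], [F], [2.0], [F], [3.0]]
for first, last in filled_ranges(times, fields):
    print("FILLED with missing data:", first, "to", last)
```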
> 
> 				Cheers,
> 
> 				Ag
> 				BADC
> 
> 				--
> 				Bryan Lawrence
> 				Director of Environmental Archival and
> Associated Research
> 				(NCAS/British Atmospheric Data Centre
> and NCEO/NERC NEODC)
> 				STFC, Rutherford Appleton Laboratory
> 				Phone +44 1235 445012; Fax ... 5848;
> 				Web: home.badc.rl.ac.uk/lawrence

--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence