[Go-essp-tech] Handling missing data in the CMIP5 archive

Thu May 5 04:20:06 MDT 2011

Hi Phil

Actually, having written that and thought a bit longer, I think the
situation is analogous to atm.chem.phys discussion papers.

I have no problem with passing q.c level 2 in this situation, and 
allowing folks access to the data (analogous to a discussion paper for 
atm.chem.phys).

We now have what is effectively a discussion paper and a referee's report:
"you have unflagged missing data" - both public.

When the final version is received for formal publication, we expect it 
to have been fixed (using flags or a rerun, whatever).  At which point it 
is eligible to get a persistent DOI to a fixed dataset (and move from 
being a discussion paper to being a real paper).

Otherwise the position (wrt a doi publication) is analogous to "we put a
lot of work into our paper, we don't want to do any more because we're
busy" ... which doesn't and shouldn't fly, for a publication.

Cheers
Bryan

> Hi Phil
> 
> You guys have known about this error (and others like it) for weeks.
> 
> Are you really saying it's a vast effort to have a short python code
> that provides missing data sequences when *you (mohc)*
>  a) already know exactly what is missing,
>  b) know how long it is missing for, and
>  c) have written your own internal q.c. report for it (which must
> take a comparable amount of time to running a "provide missing data"
> code)?
> 
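For illustration only, here is a minimal sketch of the sort of short Python code meant above: given the times that are present and the expected spacing, it pads each gap with fields full of a missing-data value. The function name `fill_time_gaps`, the `MISSING` constant and the plain-list data layout are assumptions made for this sketch; a real version would read and write netCDF files and take the missing value from the variable's own attributes.

```python
MISSING = 1.0e20  # assumed placeholder; use the variable's own missing_value


def fill_time_gaps(times, fields, step, missing=MISSING):
    """Return a complete (times, fields) pair with gaps padded.

    times:  ascending time coordinate values, expected to be `step` apart
    fields: one list of values per time (all of the same length)
    """
    if not times:
        return [], []
    width = len(fields[0])
    out_times, out_fields = [times[0]], [list(fields[0])]
    for t, field in zip(times[1:], fields[1:]):
        last = out_times[-1]
        # Number of time slices skipped between `last` and `t`.
        n_missing = round((t - last) / step) - 1
        for k in range(1, n_missing + 1):
            out_times.append(last + k * step)
            out_fields.append([missing] * width)
        out_times.append(t)
        out_fields.append(list(field))
    return out_times, out_fields


# Hypothetical example: slices at t=2 and t=3 were lost.
times, fields = fill_time_gaps([0.0, 1.0, 4.0],
                               [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                               step=1.0)
print(times)
print(fields)
```

This assumes a uniform time step; monthly data on a real calendar would need calendar-aware handling of the expected spacing.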
> We then find it immediately too, because we're running q.c. code, and
> we could fix it too (and would, but then it wouldn't be *your* data
> any more). Every user will do this too (or cock up because they
> don't spot the problem).
> 
> I know we're all spread thin, and I can lose this argument in the
> context of CMIP5 q.c up to level 2, but I'm not ready to give in for
> q.c. level 3.
> 
> In particular, a DOI is not a URL alone. It has to mean something a
> wee bit more than that if you want it to be taken seriously as a
> publication. Speaking personally, I won't go to bat with the journals
> and ISI unless it is!
> 
> I know you're spread thin, but we are all spread thin. I'm not
> suggesting you have to rerun the model here!
> 
> Cheers
> Bryan
> 
> > Hi Karl,
> > 
> > As a modelling centre affected (or should that be afflicted!) by
> > this particular issue, it's probably time for us to chime in with
> > our own 2 cents worth.
> > 
> > There are a variety of technical and human reasons why there are
> > occasional small temporal gaps in the model data that we have
> > submitted to the CMIP5 archive: model crashes/restarts, files not
> > making it into our archive system, start/end dates not specified
> > exactly in conformance with the CMIP5 experiment plan, etc, etc.
> > (Given the number of experiments that MOHC is conducting I don't
> > think it would be humanly possible for us to get everything right
> > all the time :-).
> > 
> > If it was a trivial matter to identify and fix these small bits of
> > missing data I can assure you that we would have done that. The
> > reality, however, is that the complexities (and, yes, quirks) of
> > the UM, together with the software integration aspects of the CMOR
> > library, mean that it is by no means a trivial technical issue.  And
> > like the rest of the CMIP5/ESGF endeavour - that's you guys! - we
> > have very few resources spread fairly thinly.  Hence we have had
> > to make decisions on where to prioritise our efforts. Do we fix
> > occasional small gaps in data time-series, or do we focus on
> > CFMIP2, TAMIP, 60-level models, or invest *significant* effort
> > into understanding and using the CMIP5 questionnaire? (In the
> > latter case, to the not inconsiderable benefit of other
> > modelling centres.)
> > 
> > So, in the same spirit in which the compliance rules were relaxed
> > with regard to provision of model metadata via the CMIP5
> > questionnaire, we would hope that similar flexibility be extended
> > to the submission of model data, some of which may contain
> > occasional small portions of missing data.  Not surprisingly
> > perhaps, we believe that it is far preferable to have 99% of the
> > data for a particular simulation available in the archive than
> > have it rejected (or non-DOI'd) because of, say, 1 missing month
> > or year.
> > 
> > Also, given that we have been submitting model data to the archive
> > since last October, it would seem somewhat, er, punitive to
> > introduce a stricter data compliance rule at this stage in the
> > game!
> > 
> > For our part we will endeavour to minimise the size/number of
> > temporal gaps in our submitted data. And, as time and resources
> > permit, we will investigate technical solutions that will enable us
> > to supply files of missing data where we do have such gaps. In the
> > meantime we will continue to utilise the appropriate mechanisms
> > (e.g. the CMIP5 questionnaire) to flag up data quality issues such
> > as this.
> > 
> > Regards,
> > 
> > Phil
> > 
> > 
> > 
> > ________________________________
> > 
> > 	From: go-essp-tech-bounces at ucar.edu
> > 	[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> > 	Sent: 04 May 2011 20:41
> > 	To: Bryan Lawrence
> > 	Cc: go-essp-tech at ucar.edu
> > 	Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5
> > 	archive
> > 
> > 	Hi Bryan,
> > 	
> > 	Oh, I left something out.  Why is it lots of work for the user
> > 	to notice by looking at the time axis that the spacing between
> > 	coordinates is greater than normal, and thus some time slices have
> > 	clearly been skipped?  For daily data, for example, if the
> > 	interval between two successive time-coordinates is 10 days, then
> > 	9 samples must be missing.
> > 	
> > 	I will concede that for some software and for some purposes
> > 	having time-slices included that are completely filled with the
> > 	missing_value flag could provide some advantages, so I guess I
> > 	wouldn't object to requiring this, but I think it's a judgment call
> > 	that's not that clear-cut.
> > 
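The check described above (comparing the spacing between successive time coordinates with the normal spacing) could be sketched roughly as follows. The function name `find_time_gaps`, its parameters and the example values are assumptions made for illustration, and the sketch assumes a uniform expected step in the same units as the time axis.

```python
def find_time_gaps(times, step, tol=1e-6):
    """Return (t_before, t_after, n_missing) for each gap in a time axis.

    times: time coordinate values in ascending order (e.g. days since a
           reference date)
    step:  expected spacing between successive coordinates, same units
    """
    gaps = []
    for t0, t1 in zip(times, times[1:]):
        interval = t1 - t0
        if interval > step * (1 + tol):
            # An interval of k steps means k - 1 samples were skipped.
            n_missing = round(interval / step) - 1
            gaps.append((t0, t1, n_missing))
    return gaps


# Daily data with a 10-day interval between two successive coordinates.
daily = [0.5, 1.5, 2.5, 12.5, 13.5]
print(find_time_gaps(daily, step=1.0))  # [(2.5, 12.5, 9)]
```

Every consumer of the data would need to run something like this, and handle the result, before using a series with unflagged gaps.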
> > 	cheers,
> > 	Karl
> > 	
> > 	On 5/4/11 11:42 AM, Bryan Lawrence wrote:
> > 		Hi Karl
> > 		
> > 		I think we're somewhat at cross purposes.
> > 		
> > 			My view is that if the time-slices have actually been lost, we
> > 			shouldn't necessarily reject the data as being useless.
> > 		
> > 		Agreed.
> > 		
> > 			I agree,
> > 			however, that we should encourage the modeling groups to try to
> > 			recover or reproduce the lost time slices to make their output more
> > 			complete.
> > 		
> > 		Agreed.
> > 		
> > 			If that is impossible, I still think in many cases
> > 			analysts will want access to the portions of the time-series that
> > 			are available.
> > 		
> > 		In which case we should require them to write missing data fields for
> > 		that portion. That should be trivial for them to do, and save the
> > 		consumers a vast amount of time.  (ie use the CF missing data flag, we're
> > 		not suggesting they have to re-run anything unless they want to).
> > 		
> > 		This is Ag's option 2c, which you don't seem to mention.
> > 		
> > 			Consider, for example, a 1000 year control run with a decade missing
> > 			in the middle (perhaps all contained in a single lost file).  Don't
> > 			you think many researchers will make use of the two portions of the
> > 			time-series that *are* available, and shouldn't the available data
> > 			be assigned a DOI?
> > 			
> > 			As I recall, data not passing QC level 2 won't normally be replicated
> > 			and wouldn't be assigned a DOI.  Is this correct?
> > 		
> > 
> > 		Correct.
> > 		
> > 		Cheers
> > 		Bryan
> > 		
> > 			best regards,
> > 			Karl
> > 			
> > 			On 5/4/11 1:08 AM, Bryan Lawrence wrote:
> > 				Hi Karl
> > 				
> > 				There are two issues noted in your email: (1) missing variables, and
> > 				(2) missing time slices in a sequence.
> > 				
> > 				I agree that (1) is something to be noted, I think (2) is something
> > 				that should cause failure, and require a response as Ag has
> > 				suggested. I don't think it's too much to ask a modelling group to
> > 				either provide the missing data, or provide missing data flags -
> > 				but actual missing files in a sequence should be an error and a
> > 				failure!
> > 				
> > 				I think we should be holding a candle for the users here. The
> > 				reality is that no code is going to read the metadata to find
> > 				missing data, whereas code can read and understand missing data
> > 				flags.
> > 				
> > 				Bryan
> > 				
> > 				
> > 				Dear Ag,
> > 				
> > 				There is another possible way of handling the "missing data"
> > 				issue. I'm not sure that a dataset should be required to be
> > 				complete (i.e., required to include all time slices) to be
> > 				considered eligible for DOI assignment. That is, we could relax
> > 				the criteria. Note that I don't think we require *all* variables
> > 				requested within a single dataset to be present, so some datasets
> > 				will indeed be incomplete but be eligible for a DOI.  I think the
> > 				QC procedure should be to check with the modeling group, and if
> > 				they can't supply the missing time-slices, then we somehow note
> > 				this flaw in the dataset documentation and if other QC checks are
> > 				passed, assign it a DOI.
> > 				
> > 				The criteria for getting a DOI should be that there are no known
> > 				errors in the data itself, and that there are no major problems
> > 				with the metadata.  In this case the data will be reliable, and
> > 				analysts will be welcome to use it and publish results, so I
> > 				think it should be assigned a DOI.
> > 				
> > 				What do others think?
> > 				
> > 				Best regards,
> > 				Karl
> > 				
> > 				On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
> > 				Dear all,
> > 				
> > 				At BADC we have come across our first "missing data" issue in the
> > 				CMIP5 datasets we are ingesting. We have an example of some
> > 				missing months for a particular set of variables that was
> > 				revealed when running the QC code from DKRZ.
> > 				
> > 				It would be very useful for the CMIP5 archive managers to make an
> > 				authoritative statement about how we should handle missing data
> > 				time steps in the archive.
> > 				
> > 				I propose the following response when a Data Node receives a
> > 				dataset in which time steps are missing:
> > 				
> > 				   1. QC manager (i.e. whoever runs the QC code) informs Data
> > 				   Provider that there is missing data in a dataset (specifying
> > 				   full DRS structure and date range missing).
> > 				   
> > 				   2a. If Data Provider says "no, cannot provide this data" then
> > 				   the affected datasets cannot get a DOI and cannot be part of
> > 				   the "crystallised archive". STOP
> > 				   
> > 				   2b. Data Provider re-generates files, data is re-ingested, new
> > 				   version is generated, QC is re-run, all is good. STOP
> > 				   
> > 				   2c. Data Provider cannot re-generate but wants to pass QC - so
> > 				   needs to create the required files full of missing data.
> > 				   
> > 				   3. Data Provider creates missing data files and sends, data
> > 				   re-ingested, new version is generated, QC re-run, all good.
> > 				   STOP
> > 				
> > 				In cases 2a and 2c it would also be very useful if the dataset is
> > 				annotated to inform the user which dates have been FILLED with
> > 				missing data. This would, I believe, be in the QC logs but we
> > 				might want a more prominent record of this if possible.
> > 				
> > 				Cheers,
> > 				
> > 				Ag
> > 				BADC
> > 				
> > 				--
> > 				Bryan Lawrence
> > 				Director of Environmental Archival and Associated Research
> > 				(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> > 				STFC, Rutherford Appleton Laboratory
> > 				Phone +44 1235 445012; Fax ... 5848;
> > 				Web: home.badc.rl.ac.uk/lawrence
> > 		
> 
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence