[Go-essp-tech] Handling missing data in the CMIP5 archive

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Wed May 4 02:08:20 MDT 2011


Hi Karl

There are two issues noted in your email: (1) missing variables, and (2)
missing time slices in a sequence.

I agree that (1) is something to be noted, but I think (2) is something
that should cause failure and require a response, as Ag has suggested. I
don't think it's too much to ask a modelling group either to provide the
missing data or to provide missing data flags - but actual missing files
in a sequence should be an error and a failure!
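
To make the "error and a failure" concrete, here is a rough sketch of a
sequence check for monthly data. It is only an illustration: the function
names and the (year, month) representation are my own assumptions, not
anything taken from the DKRZ QC code.

```python
from typing import Iterable, List, Tuple

YearMonth = Tuple[int, int]  # (year, month), month in 1..12


def expected_months(start: YearMonth, end: YearMonth) -> List[YearMonth]:
    """Return every (year, month) from start to end, inclusive."""
    first = start[0] * 12 + (start[1] - 1)
    last = end[0] * 12 + (end[1] - 1)
    return [(i // 12, i % 12 + 1) for i in range(first, last + 1)]


def find_missing_months(
    present: Iterable[YearMonth], start: YearMonth, end: YearMonth
) -> List[YearMonth]:
    """List the months in [start, end] that have no time slice."""
    have = set(present)
    return [ym for ym in expected_months(start, end) if ym not in have]


def check_sequence(
    present: Iterable[YearMonth], start: YearMonth, end: YearMonth
) -> None:
    """Fail loudly if any month in the sequence is absent."""
    missing = find_missing_months(present, start, end)
    if missing:
        gaps = ", ".join(f"{y:04d}-{m:02d}" for y, m in missing)
        raise ValueError(f"missing time slices: {gaps}")
```

A hypothetical QC run would call check_sequence() once per variable and
pass the date range from the error message back to the data provider.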

I think we should be holding a candle for the users here. The reality is 
that no code is going to read the metadata to find missing data, whereas 
code can read and understand missing data flags. 
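
As a minimal sketch of why flags are friendlier to user code, the snippet
below skips flagged values when averaging. The fill value of 1.0e20 and
the helper name are assumptions made for illustration; real code should
read the missing-value attribute from the file itself.

```python
import math
from typing import Iterable, Optional

FILL_VALUE = 1.0e20  # assumed missing-data flag, for illustration only


def mean_ignoring_missing(
    values: Iterable[float], fill_value: float = FILL_VALUE
) -> Optional[float]:
    """Average the values, skipping any that equal the missing-data flag.

    Returns None when every value is flagged as missing.
    """
    valid = [
        v for v in values
        if not math.isclose(v, fill_value, rel_tol=1e-6)
    ]
    if not valid:
        return None
    return sum(valid) / len(valid)
```

With a time series padded by flagged values, the analysis still runs and
the gap is visible in the data itself rather than only in the metadata.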

Bryan

> Dear Ag,
> 
> There is another possible way of handling the "missing data" issue. 
> I'm not sure that a dataset should be required to be complete
> (i.e., required to include all time slices) to be considered
> eligible for DOI assignment.  That is, we could relax the criteria. 
> Note that I don't think we require *all* variables requested within
> a single dataset to be present, so some datasets will indeed be
> incomplete but be eligible for a DOI.  I think the QC procedure
> should be to check with the modeling group, and if they can't supply
> the missing time-slices, then we somehow note this flaw in the
> dataset documentation and if other QC checks are passed, assign it a
> DOI.
> 
> The criteria for getting a DOI should be that there are no known
> errors in the data itself, and that there are no major problems with
> the metadata.  In this case the data will be reliable, and analysts
> will be welcome to use it and publish results, so I think it should
> be assigned a DOI.
> 
> What do others think?
> 
> Best regards,
> Karl
> 
> On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
> > Dear all,
> > 
> > At BADC we have come across our first "missing data" issue in the
> > CMIP5 datasets we are ingesting. We have an example of some
> > missing months for a particular set of variables that was revealed
> > when running the QC code from DKRZ.
> > 
> > It would be very useful for the CMIP5 archive managers to make an
> > authoritative statement about how we should handle missing data
> > time steps in the archive.
> > 
> > I propose the following response when a Data Node receives a dataset
> > in which time steps are missing:
> >   1. QC manager (i.e. whoever runs the QC code) informs Data
> >   Provider that there is missing data in a dataset (specifying
> >   full DRS structure and date range missing).
> >   
> >   2a. If Data Provider says "no, cannot provide this data" then the
> >   affected datasets cannot get a DOI and cannot be part of the
> >   "crystallised archive". STOP
> >   
> >   2b. Data Provider re-generates files, data is re-ingested, new
> >   version is generated, QC is re-run, all is good. STOP
> >   
> >   2c. Data Provider cannot re-generate but wants to pass QC - so
> >   needs to create the required files full of missing data.
> >   
> >   3. Data Provider creates missing data files and sends, data
> >   re-ingested, new version is generated, QC re-run, all good. STOP
> > 
> > In cases 2a and 2c it would also be very useful if the dataset is
> > annotated to inform the user which dates have been FILLED with
> > missing data. This would, I believe, be in the QC logs but we
> > might want a more prominent record of this if possible.
> > 
> > Cheers,
> > 
> > Ag
> > BADC

--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

