[Go-essp-tech] Handling missing data in the CMIP5 archive

Wed May 4 12:42:15 MDT 2011

Hi Karl

I think we're somewhat at cross purposes.

> My view is that if the time-slices have actually been lost, we
> shouldn't necessarily reject the data as being useless. 

Agreed.

> I agree,
> however, that we should encourage the modeling groups to try to
> recover or reproduce the lost time slices to make their output more
> complete.

Agreed.

> If that is impossible, I still think in many cases
> analysts will want access to the portions of the time-series that
> are available.

In which case we should require them to write missing data fields for
that portion. That should be trivial for them to do, and would save the
consumers a vast amount of time (i.e. use the CF missing data flag; we're
not suggesting they have to re-run anything unless they want to).
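
A minimal sketch of the idea, in plain Python rather than a real NetCDF
library. It assumes the flag is a single sentinel value (in CF this would
be declared through the _FillValue or missing_value attribute); the value
1.0e20 and the names fill_gaps and mean_ignoring_missing are illustrative
assumptions, not anything prescribed in this thread.

```python
# Assumed sentinel standing in for the CF missing data flag.
FILL_VALUE = 1.0e20


def fill_gaps(expected_times, available):
    """Return one value per expected time step.

    Steps absent from `available` (a dict of time -> value) are written
    as FILL_VALUE, so the series stays contiguous in time.
    """
    return [available.get(t, FILL_VALUE) for t in expected_times]


def mean_ignoring_missing(values, fill_value=FILL_VALUE):
    """Consumer side: average only the values that are not flagged."""
    valid = [v for v in values if v != fill_value]
    if not valid:
        return None  # nothing but missing data
    return sum(valid) / len(valid)


# Hypothetical series of six time steps with steps 2 and 3 lost.
expected = [0, 1, 2, 3, 4, 5]
available = {0: 280.0, 1: 281.0, 4: 282.0, 5: 283.0}

series = fill_gaps(expected, available)
mean = mean_ignoring_missing(series)
```

The producer only has to pad the gap once; every consumer can then skip
flagged values with a single test instead of working out which files are
absent.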

This is Ag's option 2c, which you don't seem to mention.
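
For context, step 1 of Ag's proposal (telling the Data Provider which date
range is missing) could be sketched roughly as below. This assumes monthly
data identified by (year, month) pairs; the function names are
hypothetical and this is not the DKRZ QC code.

```python
def month_index(year, month):
    """Map (year, month) to a single increasing integer."""
    return year * 12 + (month - 1)


def from_index(index):
    """Inverse of month_index."""
    return (index // 12, index % 12 + 1)


def missing_month_ranges(present):
    """Return inclusive (start, end) ranges of months missing between
    the first and last month present in the sequence."""
    indices = sorted({month_index(y, m) for y, m in present})
    gaps = []
    for prev, nxt in zip(indices, indices[1:]):
        if nxt - prev > 1:
            gaps.append((from_index(prev + 1), from_index(nxt - 1)))
    return gaps


# Hypothetical list of months found in a dataset.
present = [(2000, 1), (2000, 2), (2000, 5), (2000, 6), (2001, 1)]
gaps = missing_month_ranges(present)
```

Each reported range is then either re-generated (2b) or filled with
missing data (2c).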

> Consider, for example, a 1000 year control run with a decade missing
> in the middle (perhaps all contained in a single lost file).  Don't
> you think many researchers will make use of the two portions of the
> time-series that *are* available, and shouldn't the available data
> be assigned a DOI?

> As I recall, data not passing QC level 2 won't normally be replicated
> and wouldn't be assigned a DOI.  Is this correct?

Correct.

Cheers
Bryan

> best regards,
> Karl
> 
> On 5/4/11 1:08 AM, Bryan Lawrence wrote:
> > Hi Karl
> > 
> > There are two issues noted in your email: (1) missing variables, and
> > (2) missing time slices in a sequence.
> > 
> > I agree that (1) is something to be noted; I think (2) is something
> > that should cause failure, and require a response as Ag has
> > suggested. I don't think it's too much to ask a modelling group to
> > either provide the missing data, or provide missing data flags -
> > but actual missing files in a sequence should be an error and a
> > failure!
> > 
> > I think we should be holding a candle for the users here. The
> > reality is that no code is going to read the metadata to find
> > missing data, whereas code can read and understand missing data
> > flags.
> > 
> > Bryan
> > 
> >> Dear Ag,
> >> 
> >> There is another possible way of handling the "missing data"
> >> issue. I'm not sure that a dataset should be required to be
> >> complete (i.e., required to include all time slices) to be
> >> considered eligible for DOI assignment.  That is, we could relax
> >> the criteria. Note that I don't think we require *all* variables
> >> requested within a single dataset to be present, so some datasets
> >> will indeed be incomplete but be eligible for a DOI.  I think the
> >> QC procedure should be to check with the modeling group, and if
> >> they can't supply the missing time-slices, then we somehow note
> >> this flaw in the dataset documentation and if other QC checks are
> >> passed, assign it a DOI.
> >> 
> >> The criteria for getting a DOI should be that there are no known
> >> errors in the data itself, and that there are no major problems
> >> with the metadata.  In this case the data will be reliable, and
> >> analysts will be welcome to use it and publish results, so I
> >> think it should be assigned a DOI.
> >> 
> >> What do others think?
> >> 
> >> Best regards,
> >> Karl
> >> 
> >> On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
> >>> Dear all,
> >>> 
> >>> At BADC we have come across our first "missing data" issue in the
> >>> CMIP5 datasets we are ingesting. We have an example of some
> >>> missing months for a particular set of variables that was
> >>> revealed when running the QC code from DKRZ.
> >>> 
> >>> It would be very useful for the CMIP5 archive managers to make an
> >>> authoritative statement about how we should handle missing data
> >>> time steps in the archive.
> >>> 
> >>> I propose the following response when a Data Node receives a
> >>> dataset in which time steps are missing:
> >>>    1. QC manager (i.e. whoever runs the QC code) informs Data
> >>>    Provider that there is missing data in a dataset (specifying
> >>>    full DRS structure and date range missing).
> >>>    
> >>>    2a. If Data Provider says "no, cannot provide this data" then
> >>>    the affected datasets cannot get a DOI and cannot be part of
> >>>    the "crystallised archive". STOP
> >>>    
> >>>    2b. Data Provider re-generates files, data is re-ingested, new
> >>>    version is generated, QC is re-run, all is good. STOP
> >>>    
> >>>    2c. Data Provider cannot re-generate but wants to pass QC - so
> >>>    needs to create the required files full of missing data.
> >>>    
> >>>    3. Data Provider creates missing data files and sends, data
> >>>    re-ingested, new version is generated, QC re-run, all good.
> >>>    STOP
> >>> 
> >>> In cases 2a and 2c it would also be very useful if the dataset is
> >>> annotated to inform the user which dates have been FILLED with
> >>> missing data. This would, I believe, be in the QC logs but we
> >>> might want a more prominent record of this if possible.
> >>> 
> >>> Cheers,
> >>> 
> >>> Ag
> >>> BADC
> > 
> > --
> > Bryan Lawrence
> > Director of Environmental Archival and Associated Research
> > (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> > STFC, Rutherford Appleton Laboratory
> > Phone +44 1235 445012; Fax ... 5848;
> > Web: home.badc.rl.ac.uk/lawrence

--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence