[Go-essp-tech] Handling missing data in the CMIP5 archive
Karl Taylor
taylor13 at llnl.gov
Wed May 4 10:34:37 MDT 2011
Hi Bryan,
My view is that if the time-slices have actually been lost, we shouldn't
necessarily reject the data as being useless. I agree, however, that we
should encourage the modeling groups to try to recover or reproduce the
lost time slices to make their output more complete. If that is
impossible, I still think in many cases analysts will want access to the
portions of the time-series that are available.
Consider, for example, a 1000 year control run with a decade missing in
the middle (perhaps all contained in a single lost file). Don't you
think many researchers will make use of the two portions of the
time-series that *are* available, and shouldn't the available data be
assigned a DOI?
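To make the control-run example concrete, here is a minimal sketch of how an analyst might identify the usable portions of a time series with a gap. It assumes the time axis can be reduced to a list of integer model years; the function name `contiguous_segments` and the year values are hypothetical illustrations, not part of any CMIP5 or QC tool.

```python
def contiguous_segments(years):
    """Split a collection of integer years into gap-free (start, end) runs."""
    segments = []
    start = prev = None
    for y in sorted(set(years)):
        if start is None:
            start = prev = y          # first year opens the first run
        elif y == prev + 1:
            prev = y                  # still contiguous, extend the run
        else:
            segments.append((start, prev))  # gap found, close the run
            start = prev = y
    if start is not None:
        segments.append((start, prev))
    return segments


# Hypothetical 1000-year control run with years 501-510 lost.
available = [y for y in range(1, 1001) if not 501 <= y <= 510]
print(contiguous_segments(available))  # [(1, 500), (511, 1000)]
```

Each returned pair marks a stretch of the run that could be analysed on its own, and the spaces between pairs are the date ranges that would need to be noted in the dataset documentation.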
As I recall, data not passing QC level 2 won't normally be replicated
and wouldn't be assigned a DOI. Is this correct?
best regards,
Karl
On 5/4/11 1:08 AM, Bryan Lawrence wrote:
> Hi Karl
>
> There are two issues noted in your email: (1) missing variables, and (2)
> missing time slices in a sequence.
>
> I agree that (1) is something to be noted, but I think (2) is something that
> should cause failure, and require a response as Ag has suggested. I
> don't think it's too much to ask a modelling group to either provide the
> missing data, or provide missing data flags - but actual missing files in
> a sequence should be an error and a failure!
>
> I think we should be holding a candle for the users here. The reality is
> that no code is going to read the metadata to find missing data, whereas
> code can read and understand missing data flags.
>
> Bryan
>
>> Dear Ag,
>>
>> There is another possible way of handling the "missing data" issue.
>> I'm not sure that a dataset should be required to be complete
>> (i.e., required to include all time slices) to be considered
>> eligible for DOI assignment. That is, we could relax the criteria.
>> Note that I don't think we require *all* variables requested within
>> a single dataset to be present, so some datasets will indeed be
>> incomplete but be eligible for a DOI. I think the QC procedure
>> should be to check with the modeling group, and if they can't supply
>> the missing time-slices, then we somehow note this flaw in the
>> dataset documentation and if other QC checks are passed, assign it a
>> DOI.
>>
>> The criteria for getting a DOI should be that there are no known
>> errors in the data itself, and that there are no major problems with
>> the metadata. In this case the data will be reliable, and analysts
>> will be welcome to use it and publish results, so I think it should
>> be assigned a DOI.
>>
>> What do others think?
>>
>> Best regards,
>> Karl
>>
>> On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
>>> Dear all,
>>>
>>> At BADC we have come across our first "missing data" issue in the
>>> CMIP5 datasets we are ingesting. We have an example of some
>>> missing months for a particular set of variables that was revealed
>>> when running the QC code from DKRZ.
>>>
>>> It would be very useful for the CMIP5 archive managers to make an
>>> authoritative statement about how we should handle missing data
>>> time steps in the archive.
>>>
>>> I propose the following response when a Data Node receives a dataset
>>> in which time steps are missing:
>>> 1. QC manager (i.e. whoever runs the QC code) informs Data
>>> Provider that there is missing data in a dataset (specifying
>>> full DRS structure and date range missing).
>>>
>>> 2a. If Data Provider says "no, cannot provide this data" then the
>>> affected datasets cannot get a DOI and cannot be part of the
>>> "crystallised archive". STOP
>>>
>>> 2b. Data Provider re-generates files, data is re-ingested, new
>>> version is generated, QC is re-run, all is good. STOP
>>>
>>> 2c. Data Provider cannot re-generate but wants to pass QC - so
>>> needs to create the required files full of missing data.
>>>
>>> 3. Data Provider creates missing data files and sends, data
>>> re-ingested, new version is generated, QC re-run, all good. STOP
>>>
>>> In cases 2a and 2c it would also be very useful if the dataset is
>>> annotated to inform the user which dates have been FILLED with
>>> missing data. This would, I believe, be in the QC logs but we
>>> might want a more prominent record of this if possible.
>>>
>>> Cheers,
>>>
>>> Ag
>>> BADC
> --
> Bryan Lawrence
> Director of Environmental Archival and Associated Research
> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> STFC, Rutherford Appleton Laboratory
> Phone +44 1235 445012; Fax ... 5848;
> Web: home.badc.rl.ac.uk/lawrence