[Go-essp-tech] Handling missing data in the CMIP5 archive

Michael Lautenschlager lautenschlager at dkrz.de
Thu May 5 04:05:20 MDT 2011


Hi Bryan and Karl,

I followed your email discussion from last night, having originally 
expected it to go in a different direction. At this point I would like 
to support Bryan's argument that missing_value flags should be strongly 
requested in the data files, because I expect less work in data 
processing and data quality checks with these flags than without them.
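
To illustrate what filling with missing-value flags could mean in 
practice, here is a minimal Python sketch that pads a time series with 
flag-filled slices wherever the time axis skips steps. The function name 
pad_missing_slices, the list-based data layout and the flag value 1.0e20 
are assumptions made for illustration, not part of any CMIP5 tool; a real 
file would use the value declared in its own missing_value attribute.

```python
# Assumed flag value for illustration only; use whatever the file's
# missing_value attribute actually declares.
MISSING_VALUE = 1.0e20


def pad_missing_slices(times, slices, step=1.0, fill=MISSING_VALUE):
    """Insert flag-filled slices wherever the time axis skips steps.

    times  -- sorted time coordinates on a regular grid (e.g. days)
    slices -- one list of values per time coordinate
    step   -- nominal spacing between successive time coordinates
    """
    if len(times) != len(slices):
        raise ValueError("times and slices must have the same length")
    if not times:
        return [], []

    width = len(slices[0])
    out_times = [times[0]]
    out_slices = [list(slices[0])]
    for t, s in zip(times[1:], slices[1:]):
        prev = out_times[-1]
        # Number of time steps skipped between prev and t.
        n_skipped = round((t - prev) / step) - 1
        for k in range(n_skipped):
            out_times.append(prev + (k + 1) * step)
            out_slices.append([fill] * width)
        out_times.append(t)
        out_slices.append(list(s))
    return out_times, out_slices
```

With such padding, downstream code sees a regular time axis and can mask 
the flagged values instead of having to consult the metadata.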

But I would like to bring another aspect into the discussion. How much 
missing data can we accept before we reject an atomic data set (i.e. the 
complete time series of a model run)? Clearly data quality is reduced 
as the amount of missing data increases, but where should the threshold 
be? What we can do, and will do, is identify it in the quality 
information at QC-L3 and in the DOI data publication. The amount of 
missing data will be reflected in the completeness report; the method of 
reconstruction, if there is one, will be reflected in the accuracy 
report. I think we have to distinguish between reconstruction from the 
original model results after a post-processing error and 
reconstruction by re-running the model because the original model 
results are lost.
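
A completeness figure of this kind could be derived from the time axis 
alone, along the lines Karl describes below (for daily data, a 10-day 
interval between successive coordinates means 9 missing samples). The 
following Python sketch is only an illustration: the function names are 
hypothetical, a regular time step is assumed, and it is not the QC-L3 
implementation.

```python
def missing_steps(times, step=1.0):
    """Return (previous, next, n_missing) for every gap in a time axis.

    times -- sorted time coordinates (e.g. days since a reference date)
    step  -- nominal spacing between successive coordinates (assumed regular)
    """
    gaps = []
    for prev, cur in zip(times, times[1:]):
        n_missing = round((cur - prev) / step) - 1
        if n_missing > 0:
            gaps.append((prev, cur, n_missing))
    return gaps


def completeness(times, step=1.0):
    """Fraction of expected time steps that are present.

    Only gaps between the first and last coordinate are counted; steps
    missing before the first or after the last coordinate cannot be
    detected from the axis itself.
    """
    if not times:
        return 0.0
    expected = round((times[-1] - times[0]) / step) + 1
    return len(times) / expected


# Karl's example: daily data with a 10-day interval between two coordinates.
daily_axis = [0, 1, 2, 12, 13]
gaps = missing_steps(daily_axis)     # one gap with 9 missing samples
fraction = completeness(daily_axis)  # 5 of 14 expected steps present
```

Where the threshold on such a fraction should lie is exactly the open 
question.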

Aggregations of both the completeness reports and the accuracy reports 
of the individual atomic data sets will be visible, along with other 
quality information, on the DOI landing page after finalisation of QC-L3.

  Best wishes, Michael



On 05.05.2011 10:41, Bryan Lawrence wrote:
> Hi Karl
>
> Agreed. It's a judgement call, which is why Ag brought it up :-). What
> I've tried to do in these emails is bring out the exact use case, which
> seemed to be missing in the previous exchanges.
>
> That said, it's my contention that we should require the missing data
> flags.  In any other context, I would consider something that said it was
> a set from a..z to be incomplete if it had 24 members; thus of lower
> quality. How am I, the user, to know whether the two missing data
> points are known to be missing, or did I stuff up on the download?  Oh
> yes, I have to read the metadata ... but we have 1.5 million of these
> datasets!  Even accepting most folks aren't interested in it all, it
> might get a bit wearing ....
>
> Cheers
> Bryan
>
>
>
>> Hi Bryan,
>>
>> Oh, I left something out.  Why is it lots of work for the user to
>> notice by looking at the time axis that the spacing between
>> coordinates is greater than normal, and thus some time slices have
>> clearly been skipped?  For daily data,  for example, if the interval
>> between two successive time-coordinates is 10 days, then 9 samples
>> must be missing.
>>
>> I will concede that for some software and for some purposes having
>> time-slices included that are completely filled with the
>> missing_value flag could provide some advantages, so I guess I
>> wouldn't object to requiring this, but I think it's a judgment call
>> that's not that clear-cut.
>>
>> cheers,
>> Karl
>>
>> On 5/4/11 11:42 AM, Bryan Lawrence wrote:
>>> Hi Karl
>>>
>>> I think we're somewhat at cross purposes.
>>>
>>>> My view is that if the time-slices have actually been lost, we
>>>> shouldn't necessarily reject the data as being useless.
>>> Agreed.
>>>
>>>> I agree,
>>>> however, that we should encourage the modeling groups to try to
>>>> recover or reproduce the lost time slices to make their output
>>>> more complete.
>>> Agreed.
>>>
>>>> If that is impossible, I still think in many cases
>>>> analysts will want access to the portions of the time-series that
>>>> are available.
>>> In which case we should require them to write missing data fields
>>> for that portion. That should be trivial for them to do, and save
>>> the consumers a vast amount of time.  (i.e. use the CF missing data
>>> flag; we're not suggesting they have to re-run anything unless
>>> they want to).
>>>
>>> This is Ag's option 2c, which you don't seem to mention.
>>>
>>>> Consider, for example, a 1000 year control run with a decade
>>>> missing in the middle (perhaps all contained in a single lost
>>>> file).  Don't you think many researchers will make use of the two
>>>> portions of the time-series that *are* available, and shouldn't
>>>> the available data be assigned a DOI?
>>>> As I recall, data not passing QC level 2 won't normally be
>>>> replicated and wouldn't be assigned a DOI.  Is this correct?
>>> Correct.
>>>
>>> Cheers
>>> Bryan
>>>
>>>> best regards,
>>>> Karl
>>>>
>>>> On 5/4/11 1:08 AM, Bryan Lawrence wrote:
>>>>> Hi Karl
>>>>>
>>>>> There are two issues noted in your email: (1) missing variables,
>>>>> and (2) missing time slices in a sequence.
>>>>>
>>>>> I agree that (1) is something to be noted; I think (2) is
>>>>> something that should cause failure, and require a response as
>>>>> Ag has suggested. I don't think it's too much to ask a modelling
>>>>> group to either provide the missing data, or provide missing
>>>>> data flags - but actual missing files in a sequence should be an
>>>>> error and a failure!
>>>>>
>>>>> I think we should be holding a candle for the users here. The
>>>>> reality is that no code is going to read the metadata to find
>>>>> missing data, whereas code can read and understand missing data
>>>>> flags.
>>>>>
>>>>> Bryan
>>>>>
>>>>>> Dear Ag,
>>>>>>
>>>>>> There is another possible way of handling the "missing data"
>>>>>> issue. I'm not sure that a dataset should be required to be
>>>>>> complete (i.e., required to include all time slices) to be
>>>>>> considered eligible for DOI assignment.  That is, we could relax
>>>>>> the criteria. Note that I don't think we require *all* variables
>>>>>> requested within a single dataset to be present, so some
>>>>>> datasets will indeed be incomplete but be eligible for a DOI.
>>>>>> I think the QC procedure should be to check with the modeling
>>>>>> group, and if they can't supply the missing time-slices, then
>>>>>> we somehow note this flaw in the dataset documentation and if
>>>>>> other QC checks are passed, assign it a DOI.
>>>>>>
>>>>>> The criteria for getting a DOI should be that there are no known
>>>>>> errors in the data itself, and that there are no major problems
>>>>>> with the metadata.  In this case the data will be reliable, and
>>>>>> analysts will be welcome to use it and publish results, so I
>>>>>> think it should be assigned a DOI.
>>>>>>
>>>>>> What do others think?
>>>>>>
>>>>>> Best regards,
>>>>>> Karl
>>>>>>
>>>>>> On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> At BADC we have come across our first "missing data" issue in
>>>>>>> the CMIP5 datasets we are ingesting. We have an example of
>>>>>>> some missing months for a particular set of variables that was
>>>>>>> revealed when running the QC code from DKRZ.
>>>>>>>
>>>>>>> It would be very useful for the CMIP5 archive managers to make
>>>>>>> an authoritative statement about how we should handle missing
>>>>>>> data time steps in the archive.
>>>>>>>
>>>>>>> I propose the following response when a Data Node receives a
>>>>>>> dataset in which time steps are missing:
>>>>>>>      1. QC manager (i.e. whoever runs the QC code) informs Data
>>>>>>>      Provider that there is missing data in a dataset
>>>>>>>      (specifying full DRS structure and date range missing).
>>>>>>>
>>>>>>>      2a. If Data Provider says "no, cannot provide this data"
>>>>>>>      then the affected datasets cannot get a DOI and cannot be
>>>>>>>      part of the "crystallised archive". STOP
>>>>>>>
>>>>>>>      2b. Data Provider re-generates files, data is re-ingested,
>>>>>>>      new version is generated, QC is re-run, all is good. STOP
>>>>>>>
>>>>>>>      2c. Data Provider cannot re-generate but wants to pass QC -
>>>>>>>      so needs to create the required files full of missing
>>>>>>>      data.
>>>>>>>
>>>>>>>      3. Data Provider creates missing data files and sends, data
>>>>>>>      re-ingested, new version is generated, QC re-run, all good.
>>>>>>>      STOP
>>>>>>>
>>>>>>> In cases 2a and 2c it would also be very useful if the dataset
>>>>>>> is annotated to inform the user which dates have been FILLED
>>>>>>> with missing data. This would, I believe, be in the QC logs
>>>>>>> but we might want a more prominent record of this if possible.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Ag
>>>>>>> BADC
>>>>> --
>>>>> Bryan Lawrence
>>>>> Director of Environmental Archival and Associated Research
>>>>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>>>>> STFC, Rutherford Appleton Laboratory
>>>>> Phone +44 1235 445012; Fax ... 5848;
>>>>> Web: home.badc.rl.ac.uk/lawrence

