[Go-essp-tech] Handling missing data in the CMIP5 archive

Estanislao Gonzalez gonzalez at dkrz.de
Fri Jun 24 05:59:19 MDT 2011


Hi,

I don't think we are properly communicating what QC means in the CMIP5 
context, so I doubt we will get it right for such special cases. We 
should try, of course...

It is possible that the user "sees" the missing file, but I don't think 
it's probable. A gap is just too hard to spot, in my opinion. If it is 
noticed only once the data is being worked with, it's just the same as 
if the data had been marked as non-existent. Well, it's actually worse 
in my opinion, since I guess the user will think it was a download 
problem and try to fetch the file again, and if it can't be found, maybe 
conclude it has been removed and start inquiring why that happened. This 
triggers the behavior you talked about, Jamie, but I think a lot of time 
will be required until this is sorted out. I'd assume that if a problem 
arises with the file content, at least the file headers will be looked 
at, and that is, in my opinion, where the user should find the required 
information.

I don't know how the "tagging" of the missing information is done, but I 
suppose you can put a flag and info into the header of the file without 
filling the arrays with any data at all (if we had been using the NetCDF 4 
format with compression, the compression would be so good that it really 
wouldn't matter what's there...). In the "worst" scenario a NaN will be, 
IMHO, more meaningful than trying to find out why the file is not there...
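
As a rough illustration of the NaN option (a minimal sketch only, not an 
agreed CMIP5 tool): assuming the time coordinate is regularly spaced, the 
absent time-steps can be listed and their slots filled with NaN. The 
function name fill_gaps, its arguments and the sample numbers are all 
hypothetical, and each time-slice is reduced to a single number to keep 
the example short.

```python
import math


def fill_gaps(times, values, step):
    """Place a gappy series on a regular time axis, filling gaps with NaN.

    times  -- sorted time coordinates actually present (e.g. hours)
    values -- one value per entry in times
    step   -- assumed regular spacing of the axis (e.g. 3 for 3-hourly data)

    Returns (full_times, full_values, missing_times).
    """
    present = dict(zip(times, values))
    # Number of intervals between the first and last available time-step.
    n_steps = round((times[-1] - times[0]) / step)

    full_times, full_values, missing = [], [], []
    for i in range(n_steps + 1):
        t = times[0] + i * step
        full_times.append(t)
        if t in present:
            full_values.append(present[t])
        else:
            # Unavailable time-step: keep the slot, mark it as NaN.
            full_values.append(math.nan)
            missing.append(t)
    return full_times, full_values, missing


# Hypothetical 3-hourly series in which hours 9 and 12 are unavailable.
full_t, full_v, gaps = fill_gaps([0, 3, 6, 15, 18],
                                 [1.0, 2.0, 3.0, 4.0, 5.0], 3)
print("unavailable time-steps:", gaps)
```

The list of unavailable time-steps returned next to the filled series is 
the kind of information that could also be reported to the user, 
whichever way the gap ends up being represented in the file.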

But I think there will be more kinds of users than we can envisage, so 
it's really hard (if not impossible) to come up with a solution that 
covers all use cases.
I think we should concentrate on different ways of communicating this 
situation to the end user and try to apply as many as possible.

I don't think that omitting the file is "expressive" enough; it will 
probably be misunderstood.

My 2c anyway. Thanks,
Estani


On 24.06.2011 13:24, Kettleborough, Jamie wrote:
> Hello Karl, Bryan, Ag, Martin, Michael...
>
> It's not clear to me that filling unavailable data in time series with
> the missing data flag is the right thing to do for all data users.  I
> agree we want to make using the data as painless as possible, but I
> wonder if filling unavailable data with missing is actually going to be
> the least painful thing.
>
> If you are an impacts modeller using the 3 hour data as forcing/boundary
> condition data for a model then you probably want it very visible that
> there is a section of a time series unavailable.   I don't think you'd
> want missing data, instead you'd want to add in some sort of synthetic
> data for the unavailable period. In this case it's very useful to have
> the visual clue of missing files to say you need to do something
> special.
>
> I know this is only one data use, and there are others - like deriving a
> climatology  - where you cope with unavailable data by dropping it from
> your sample, and take the hit of larger error/noise.  But even in this
> case I'm not confident that all data users will be using software
> packages that take account of missing data in the right way when
> deriving statistics. Again the visual clue of having missing files to
> tell you you need to do something special in a certain period can be
> useful.
>
> There are also non-functional issues - like who wants to
> store files full of missing data, and who wants to transfer them.  Though
> these considerations will probably be small in the context of the data
> volumes of the CMIP5 archive.
>
> An alternative to always filling with missing data is to leave gaps
> where data is unavailable, and provide the information and help that
> users need to use the data.  The kind of information that might be
> provided is why data is unavailable and hints/tips/scripts to help deal
> with unavailable data.  This leaves it to the user to decide what is the
> best thing to do in their particular case. I'm not sure how hard it
> would be to write a script that would fill gaps with missing for any
> CMIP5 atomic data set - depending on how often data is unavailable and
> how many data users want missing data it might be a worthwhile thing to
> do (of course someone has to find the time to do it - I'm not
> volunteering.)
>
> I realise this alternative is still not great - but it is a messy
> situation when you lose data due to model crashes or disk problems or
> whatever, so I don't think there is a 'great' solution.
>
> Jamie
>
>> Message: 2
>> Date: Thu, 23 Jun 2011 16:25:44 -0700
>> From: Karl Taylor<taylor13 at llnl.gov>
>> Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
>> To: "ag.stephens at stfc.ac.uk"<ag.stephens at stfc.ac.uk>
>> Cc: "go-essp-tech at ucar.edu"<go-essp-tech at ucar.edu>
>> Message-ID:<4E03CB78.8010909 at llnl.gov>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Dear Ag,
>>
>> I think the only thing not yet decided about this is whether
>> time-slices that can't be recovered would be required to be
>> included but filled with the missing data flag, or if they
>> could simply be omitted entirely.
>>
>> Bryan seemed to feel strongly that all time-series should be
>> present, although some time-slices could be entirely filled
>> with "missing".  My opinion was that the user should extract
>> the time-coordinate, which would indicate which time-samples
>> were included (so there would be no reason to generate any
>> time-slices entirely filled with "missing").
>>
>> If no one else has a strong opinion, let's go with Bryan's preference.
>>
>> In summary:
>>
>> If an entire time-slice is missing, before the data will be
>> assigned a DOI that time-slice should be:
>>
>> 1) recovered  (ideally)
>> 2) if impossible to recover, the time-slice should be
>> entirely filled with "missing".
>>
>> Best regards,
>> Karl
>>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


