[Go-essp-tech] Handling missing data in the CMIP5 archive

Fri Jun 24 06:47:10 MDT 2011

Hello Estani,

You are right - when I was talking about 'visual' clues I'd missed the obvious: it is hard to see what *isn't there* in a list of things that look pretty much the same apart from some date strings.  What is more likely is that someone's analysis will fail in some way. Where this is the case I'd prefer the failure to be early so I know something special has happened (annoying though it is).  I think unavailable times in a coordinate array are more likely to send an early signal that there is something unusual about this data than data arrays filled with missing data.
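
To illustrate the point about the time coordinate, here is a minimal sketch. It assumes the time values have already been read from the files into an array and that the nominal sampling interval is known. The function name `find_time_gaps` and the 1.5x tolerance are illustrative choices, not part of any CMIP5 tool.

```python
import numpy as np

def find_time_gaps(times, step):
    """Return (before, after) pairs of neighbouring time values
    whose spacing is clearly larger than the nominal `step`."""
    times = np.asarray(times, dtype=float)
    diffs = np.diff(times)
    # The 1.5x factor is an assumed tolerance against small
    # irregularities in the coordinate values.
    gap_idx = np.nonzero(diffs > 1.5 * step)[0]
    return [(float(times[i]), float(times[i + 1])) for i in gap_idx]

# 3-hourly samples (hours since some reference); 12..18 are unavailable
times = [0, 3, 6, 9, 21, 24]
print(find_time_gaps(times, step=3))
```

A check like this can run before any analysis starts, so the unusual period is flagged early rather than after statistics have been computed.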

I also agree that if a file is missing on a user's machine they don't immediately know where it got lost - it could be a transfer failure to their machine (though I'm not sure how likely it is that they will end up with no file, rather than a truncated file).  They can find out, though, by going back to the 'listing' of what files are in the CMIP5 archive or - if it existed - the information that tells users what data is unavailable.
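
Going back to the 'listing' could be as simple as a set comparison. This is only a sketch: it assumes the archive listing and the local directory contents are both available as lists of file names, and the file names and the helper `files_missing_locally` are made up for illustration.

```python
def files_missing_locally(archive_listing, local_files):
    """Return archive file names that are absent from the local copy."""
    return sorted(set(archive_listing) - set(local_files))

# Made-up file names, for illustration only
archive_listing = [
    "tas_3hr_MODEL_rcp45_r1i1p1_2006.nc",
    "tas_3hr_MODEL_rcp45_r1i1p1_2007.nc",
    "tas_3hr_MODEL_rcp45_r1i1p1_2009.nc",
]
local_files = [
    "tas_3hr_MODEL_rcp45_r1i1p1_2006.nc",
    "tas_3hr_MODEL_rcp45_r1i1p1_2009.nc",
]

print(files_missing_locally(archive_listing, local_files))
```

Note this only separates transfer losses from the archive's own contents: here 2007 was lost in transfer, while 2008 is absent from the archive itself and would only show up through the information about unavailable data.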

So, despite missing some obvious point in my original e-mail, I'm still not convinced that filling with missing data is the right thing to do.
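
For users who do want missing values, the gap-filling script floated in the original e-mail (quoted below) might look roughly like this. It is a sketch under stated assumptions: the time axis is nominally regular with a known step, time is the leading array dimension, and 1.0e20 is taken as the missing value. The name `fill_gaps_with_missing` is hypothetical.

```python
import numpy as np

def fill_gaps_with_missing(times, data, step, missing=1.0e20):
    """Put `data` on a regular time axis, setting unavailable
    time-slices to `missing` (1.0e20 is an assumed default)."""
    times = np.asarray(times, dtype=float)
    data = np.asarray(data, dtype=float)
    # Number of slices on the complete, regular axis
    n = int(round((times[-1] - times[0]) / step)) + 1
    full_times = times[0] + step * np.arange(n)
    # Start with everything missing, then copy in what is available
    full_data = np.full((n,) + data.shape[1:], missing, dtype=float)
    idx = np.rint((times - times[0]) / step).astype(int)
    full_data[idx] = data
    return full_times, full_data

# Three available slices of a two-point field; hours 6 and 9 are unavailable
times = [0, 3, 12]
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full_times, full_data = fill_gaps_with_missing(times, data, step=3)
print(full_times)
print(full_data)
```

Doing this on the user side leaves the choice with the user: the same gaps could instead be filled with synthetic data or dropped from the sample.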

Jamie

(sorry I know I don't know much about CMIP5 QC or DOIs or ESG infrastructure)

> -----Original Message-----
> From: Estanislao Gonzalez [mailto:gonzalez at dkrz.de] 
> Sent: 24 June 2011 12:59
> To: Kettleborough, Jamie
> Cc: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
> 
> Hi,
> 
> I don't think we are properly communicating what QC means in the 
> CMIP5 context, so I doubt we will get it right for such 
> special cases. We should try, of course...
> 
> That the user "sees" the missing file is possible, but I 
> don't think it's probable. It's just too hard to see in my 
> opinion. If it's realized only after the data is being worked 
> with, it's just the same as if it is marked as non existing. 
> Well, it's actually worse in my opinion, since I guess the 
> user will be thinking it was a download problem and the file 
> should be gathered again, and if it can't be found, maybe 
> think it has been removed and start inquiring why that 
> happened. This triggers the behavior you talked about, Jamie, 
> but I think a lot of time will be required until this is 
> sorted out. I'd assume that if a problem arises with the file 
> content, at least the file headers will be seen, and that is, 
> in my opinion, where the user should find the required information.
> 
> I'm not aware of how the "tagging" of the missing information 
> is done, but I suppose you can put a flag and info into the header 
> of the file, without filling the arrays with any data at all 
> (if we had been using NetCDF 4 format with compression, the 
> compression would be so good that it really doesn't matter 
> what's there..). In the "worst" scenario a NaN will be, IMHO, 
> more meaningful than trying to find out why the file is not there...
> 
> But I think there will be more users than those we can 
> picture, so it's really hard (if not impossible) to come up 
> with a solution that covers our use cases.
> I think we should concentrate on different ways of 
> communicating this situation to the end user and try to apply 
> as many as possible.
> 
> I don't think that omitting the file is "expressive" enough 
> and it will probably be misunderstood.
> 
> My 2c anyway. Thanks,
> Estani
> 
> 
> On 24.06.2011 13:24, Kettleborough, Jamie wrote:
> > Hello Karl, Bryan, Ag, Martin, Michael...
> >
> > It's not clear to me that filling unavailable data in time series
> > with the missing data flag is the right thing to do for all data
> > users.  I agree we want to make using the data as painless as
> > possible, but I wonder if filling unavailable data with missing is
> > actually going to be the least painful thing.
> >
> > If you are an impacts modeller using the 3 hour data as
> > forcing/boundary condition data for a model then you probably want
> > it very visible that there is a section of a time series
> > unavailable.  I don't think you'd want missing data, instead you'd
> > want to add in some sort of synthetic data for the unavailable
> > period. In this case it's very useful to have the visual clue of
> > missing files to say you need to do something special.
> >
> > I know this is only one data use, and there are others - like
> > deriving a climatology - where you cope with unavailable data by
> > dropping it from your sample, and take the hit of larger
> > error/noise.  But even in this case I'm not confident that all data
> > users will be using software packages that take account of missing
> > data in the right way when deriving statistics. Again the visual
> > clue of having missing files to tell you you need to do something
> > special in a certain period can be useful.
> >
> > There are also the non-functional issues - like who wants to store
> > files full of missing data, who wants to transfer them.  Though
> > these considerations will probably be small in the context of the
> > data volumes of the CMIP5 archive.
> >
> > An alternative to always filling with missing data is to leave gaps
> > where data is unavailable, and provide the information and help
> > that users need to use the data.  The kind of information that
> > might be provided is why data is unavailable and hints/tips/scripts
> > to help deal with unavailable data.  This leaves it to the user to
> > decide what is the best thing to do in their particular case. I'm
> > not sure how hard it would be to write a script that would fill
> > gaps with missing for any CMIP5 atomic data set - depending on how
> > often data is unavailable and how many data users want missing data
> > it might be a worthwhile thing to do (of course someone has to find
> > the time to do it - I'm not volunteering.)
> >
> > I realise this alternative is still not great - but it is a messy
> > situation when you lose data due to model crashes or disk problems
> > or whatever, so I don't think there is a 'great' solution.
> >
> > Jamie
> >
> >> Message: 2
> >> Date: Thu, 23 Jun 2011 16:25:44 -0700
> >> From: Karl Taylor<taylor13 at llnl.gov>
> >> Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 
> >> archive
> >> To: "ag.stephens at stfc.ac.uk"<ag.stephens at stfc.ac.uk>
> >> Cc: "go-essp-tech at ucar.edu"<go-essp-tech at ucar.edu>
> >> Message-ID:<4E03CB78.8010909 at llnl.gov>
> >> Content-Type: text/plain; charset="iso-8859-1"
> >>
> >> Dear Ag,
> >>
> >> I think the only thing not yet decided about this is whether
> >> time-slices that can't be recovered would be required to be
> >> included but filled with the missing data flag, or if they could
> >> simply be omitted entirely.
> >>
> >> Bryan seemed to feel strongly that all time-series should be
> >> present, although some time-slices could be entirely filled with
> >> "missing".  My opinion was that the user should extract the
> >> time-coordinate, which would indicate which time-samples were
> >> included (so there would be no reason to generate any time-slices
> >> entirely filled with "missing").
> >>
> >> If no one else has a strong opinion, let's go with Bryan's
> >> preference.
> >>
> >> In summary:
> >>
> >> If an entire time-slice is missing, before the data will be
> >> assigned a DOI that time-slice should be:
> >>
> >> 1) recovered  (ideally)
> >> 2) if impossible to recover, the time-slice should be entirely
> >> filled with "missing".
> >>
> >> Best regards,
> >> Karl
> >>
> > _______________________________________________
> > GO-ESSP-TECH mailing list
> > GO-ESSP-TECH at ucar.edu
> > http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> 
> 
> --
> Estanislao Gonzalez
> 
> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches 
> Klimarechenzentrum (DKRZ) - German Climate Computing Centre 
> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> 
> Phone:   +49 (40) 46 00 94-126
> E-Mail:  gonzalez at dkrz.de
> 
>