[Go-essp-tech] Handling missing data in the CMIP5 archive

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Fri Jun 24 05:24:06 MDT 2011


Hello Karl, Bryan, Ag, Martin, Michael...

It's not clear to me that filling unavailable data in time series with
the missing data flag is the right thing to do for all data users.  I
agree we want to make using the data as painless as possible, but I
wonder if filling unavailable data with missing is actually going to be
the least painful thing.

If you are an impacts modeller using the 3-hourly data as forcing/boundary
condition data for a model then you probably want it to be very visible
that a section of a time series is unavailable.  I don't think you'd
want missing data; instead you'd want to add in some sort of synthetic
data for the unavailable period. In this case it's very useful to have
the visual clue of missing files to say you need to do something
special.
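
As a rough illustration of making such a gap visible without relying on
fill values, here is a minimal sketch that scans a time coordinate for
breaks. The function name find_gaps, the integer hours and the 3-hour
step are assumptions made for the example; none of this is part of any
CMIP5 tool.

```python
def find_gaps(times, step):
    """Return (first_missing, last_missing) pairs for each break in `times`.

    `times` is assumed to be a sorted list of integer time values (for
    example hours since some reference date) and `step` the expected
    spacing between consecutive samples.
    """
    gaps = []
    for prev, curr in zip(times, times[1:]):
        if curr - prev > step:
            # Everything strictly between prev and curr is unavailable.
            gaps.append((prev + step, curr - step))
    return gaps


# Hypothetical 3-hourly series with the samples at hours 9 and 12 lost.
available = [0, 3, 6, 15, 18]
print(find_gaps(available, step=3))
```

A user who wants synthetic data for the unavailable period could use the
reported ranges to decide where it needs to be generated.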

I know this is only one data use, and there are others, like deriving a
climatology, where you cope with unavailable data by dropping it from
your sample and take the hit of larger error/noise.  But even in this
case I'm not confident that all data users will be using software
packages that take account of missing data in the right way when
deriving statistics. Again, the visual clue of having missing files to
tell you that you need to do something special in a certain period can
be useful.
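
To show what can go wrong, here is a minimal sketch comparing a mean
that ignores the missing data flag with one that honours it. The value
1.0e20 is only an assumed fill value for the example, and the function
names and temperatures are made up.

```python
MISSING = 1.0e20  # assumed missing data flag, for illustration only


def naive_mean(values):
    """Mean that treats the missing data flag as if it were real data."""
    return sum(values) / len(values)


def masked_mean(values, missing=MISSING):
    """Mean over valid samples only; None if every sample is missing."""
    valid = [v for v in values if v != missing]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical temperature series with two unavailable time-slices.
series = [280.0, 282.0, MISSING, MISSING, 284.0]
print(naive_mean(series))   # dominated by the fill value
print(masked_mean(series))  # mean of the three valid samples
```

Software that behaves like naive_mean gives a plausible-looking but
meaningless number, with nothing to warn the user.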

There are also non-functional issues, like who wants to store files
full of missing data, and who wants to transfer them?  These
considerations will probably be small, though, in the context of the
data volumes of the CMIP5 archive.

An alternative to always filling with missing data is to leave gaps
where data is unavailable, and provide the information and help that
users need to use the data.  The kind of information that might be
provided is why data is unavailable and hints/tips/scripts to help deal
with unavailable data.  This leaves it to the user to decide what is the
best thing to do in their particular case. I'm not sure how hard it
would be to write a script that would fill gaps with the missing data
flag for any CMIP5 atomic data set. Depending on how often data is
unavailable and how many data users want missing data, it might be a
worthwhile thing to do (of course someone has to find the time to do
it; I'm not volunteering).
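
For a single one-dimensional series, the core of such a gap-filling
script might look like the sketch below. It assumes integer time values
on a regular step and a fill value of 1.0e20; the name fill_gaps is
hypothetical, and a real script would also have to deal with files,
calendars and multi-dimensional fields.

```python
MISSING = 1.0e20  # assumed missing data flag, for illustration only


def fill_gaps(times, values, step, missing=MISSING):
    """Pad a series onto a regular time axis, filling gaps with `missing`.

    `times` is assumed to be a sorted list of integers lying on multiples
    of `step` from the first sample; `values` holds one value per time.
    """
    if not times:
        return [], []
    by_time = dict(zip(times, values))
    full_times = list(range(times[0], times[-1] + step, step))
    full_values = [by_time.get(t, missing) for t in full_times]
    return full_times, full_values


# Hypothetical 3-hourly series with the samples at hours 9 and 12 lost.
times = [0, 3, 6, 15, 18]
values = [1.0, 2.0, 3.0, 6.0, 7.0]
full_times, full_values = fill_gaps(times, values, step=3)
print(full_times)
print(full_values)
```

Keeping this as an optional step leaves the archive with visible gaps
while still serving users who prefer a complete time axis.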

I realise this alternative is still not great, but it is a messy
situation when you lose data due to model crashes or disk problems or
whatever, so I don't think there is a 'great' solution.

Jamie

> 
> Message: 2
> Date: Thu, 23 Jun 2011 16:25:44 -0700
> From: Karl Taylor <taylor13 at llnl.gov>
> Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
> To: "ag.stephens at stfc.ac.uk" <ag.stephens at stfc.ac.uk>
> Cc: "go-essp-tech at ucar.edu" <go-essp-tech at ucar.edu>
> Message-ID: <4E03CB78.8010909 at llnl.gov>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Dear Ag,
> 
> I think the only thing not yet decided about this is whether 
> time-slices that can't be recovered would be required to be 
> included but filled with the missing data flag, or if they 
> could simply be omitted entirely.
> 
> Bryan seemed to feel strongly that all time-series should be 
> present, although some time-slices could be entirely filled 
> with "missing".  My opinion was that the user should extract 
> the time-coordinate, which would indicate which time-samples 
> were included (so there would be no reason to generate any 
> time-slices entirely filled with "missing").
> 
> If no one else has a strong opinion, let's go with Bryan's preference.
> 
> In summary:
> 
> If an entire time-slice is missing, before the data will be 
> assigned a DOI that time-slice should be:
> 
> 1) recovered  (ideally)
> 2) if impossible to recover, the time-slice should be 
> entirely filled with "missing".
> 
> Best regards,
> Karl
> 

