[Go-essp-tech] Handling missing data in the CMIP5 archive
Bryan Lawrence
bryan.lawrence at stfc.ac.uk
Thu May 5 04:07:18 MDT 2011
Hi Phil
You guys have known about this error (and others like it) for weeks.
Are you really saying it's a vast effort to have a short python code that
provides missing data sequences when *you (MOHC)*
a) already know exactly what is missing,
b) know how long it is missing for, and
c) have written your own internal q.c. report for it (which must take a
comparable amount of time to running a "provide missing data" code)?
We then find it immediately too, because we're running q.c. code, and we
could fix it too (and would, but then it wouldn't be *your* data any
more). Every user will do this too (or cock up because they don't spot
the problem).
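
For illustration only, here is a minimal sketch of what such a "provide missing data" code could look like. It is not MOHC's or BADC's actual code. It assumes the data are already in memory as NumPy arrays, that the time axis has a fixed step, and that 1.0e20 is the declared missing value; the function name pad_missing_slices and all parameter names are hypothetical. Writing the result back out through CMOR is not shown.

```python
import numpy as np

FILL_VALUE = 1.0e20  # assumed missing_value; use whatever the file declares


def pad_missing_slices(time, data, step=1.0, fill_value=FILL_VALUE):
    """Return (full_time, full_data) on a gap-free axis with spacing `step`.

    `data` has time as its first dimension. Slices absent from `time`
    are filled entirely with `fill_value`; existing slices are copied.
    """
    time = np.asarray(time, dtype=float)
    data = np.asarray(data, dtype=float)
    n_steps = int(round((time[-1] - time[0]) / step)) + 1
    full_time = time[0] + step * np.arange(n_steps)
    full_data = np.full((n_steps,) + data.shape[1:], fill_value)
    # Position of each existing slice on the complete axis.
    index = np.rint((time - time[0]) / step).astype(int)
    full_data[index] = data
    return full_time, full_data


# Five slices on a daily axis, with days 3 and 4 absent.
time = [0, 1, 2, 5, 6]
data = np.arange(5 * 2, dtype=float).reshape(5, 2)
full_time, full_data = pad_missing_slices(time, data)

# Downstream code can then mask the padded slices in the usual way.
masked = np.ma.masked_values(full_data, FILL_VALUE)
print(full_time)
print(masked.mask.any(axis=1))
```

The point of the sketch is that the padded slices are visible to any code that honours the missing data flag, without the reader having to inspect the time axis.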
I know we're all spread thin, and I can lose this argument in the
context of CMIP5 q.c. up to level 2, but I'm not ready to give in for q.c.
level 3.
In particular, a DOI is not a URL alone. It has to mean something a wee
bit more than that if you want it to be taken seriously as a
publication. Speaking personally, I won't go to bat with the journals and
ISI unless it is!
I know you're spread thin, but we are all spread thin. I'm not
suggesting you have to rerun the model here!
Cheers
Bryan
> Hi Karl,
>
> As a modelling centre affected (or should that be afflicted!) by this
> particular issue, it's probably time for us to chime in with our own
> 2 cents worth.
>
> There are a variety of technical and human reasons why there are
> occasional small temporal gaps in the model data that we have
> submitted to the CMIP5 archive: model crashes/restarts, files not
> making it into our archive system, start/end dates not specified
> exactly in conformance with the CMIP5 experiment plan, etc, etc.
> (Given the number of experiments that MOHC is conducting I don't
> think it would be humanly possible for us to get everything right
> all the time :-).
>
> If it was a trivial matter to identify and fix these small bits of
> missing data I can assure you that we would have done that. The
> reality, however, is that the complexities (and, yes, quirks) of the
> UM, together with the software integration aspects of the CMOR
> library, mean that it is by no means a trivial technical issue. And
> like the rest of the CMIP5/ESGF endeavour - that's you guys! - we
> have very few resources spread fairly thinly. Hence we have had to
> make decisions on where to prioritise our efforts. Do we fix
> occasional small gaps in data time-series, or do we focus on CFMIP2,
> TAMIP, 60-level models, or invest *significant* effort into
> understanding and using the CMIP5
> questionnaire! (In the latter case, to the not inconsiderable benefit
> of other modelling centres.)
>
> So, in the same spirit in which the compliance rules were relaxed
> with regard to provision of model metadata via the CMIP5
> questionnaire, we would hope that similar flexibility be extended to
> the submission of model data, some of which may contain occasional
> small portions of missing data. Not surprisingly perhaps, we
> believe that it is far preferable to have 99% of the data for a
> particular simulation available in the archive than have it rejected
> (or non-DOI'd) because of, say, 1 missing month or year.
>
> Also, given that we have been submitting model data to the archive
> since last October, it would seem somewhat, er, punitive to
> introduce a stricter data compliance rule at this stage in the game!
>
> For our part we will endeavour to minimise the size/number of
> temporal gaps in our submitted data. And, as time and resources
> permit, we will investigate technical solutions that will enable us
> to supply files of missing data where we do have such gaps. In the
> meantime we will continue to utilise the appropriate mechanisms
> (e.g. the CMIP5 questionnaire) to flag up data quality issues such
> as this.
>
> Regards,
>
> Phil
>
>
>
> ________________________________
>
> From: go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> Sent: 04 May 2011 20:41
> To: Bryan Lawrence
> Cc: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5
> archive
>
>
> Hi Bryan,
>
> Oh, I left something out. Why is it lots of work for the user
> to notice by looking at the time axis that the spacing between
> coordinates is greater than normal, and thus some time slices have
> clearly been skipped? For daily data, for example, if the interval
> between two successive time-coordinates is 10 days, then 9 samples
> must be missing.
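
The check described above can be sketched in a few lines. This is an illustrative sketch only, assuming a time axis held as plain numbers with a fixed nominal step (for example daily data in "days since ..." units); the function name find_time_gaps and its parameters are hypothetical. Axes with uneven nominal spacing, such as monthly means on a real calendar, would need a calendar-aware comparison instead.

```python
import numpy as np


def find_time_gaps(time, step=1.0, tol=1e-6):
    """Return (start, end, n_missing) for each gap in a regular time axis.

    `time` holds coordinate values in the same units as `step`;
    a gap is any interval noticeably larger than `step`.
    """
    time = np.asarray(time, dtype=float)
    intervals = np.diff(time)
    gaps = []
    for i in np.nonzero(intervals > step + tol)[0]:
        n_missing = int(round(intervals[i] / step)) - 1
        gaps.append((float(time[i]), float(time[i + 1]), n_missing))
    return gaps


# Daily axis where the interval between 3 and 13 is 10 days,
# so 9 samples are missing, as in the example above.
days = [0, 1, 2, 3, 13, 14, 15]
print(find_time_gaps(days))
```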
>
> I will concede that for some software and for some purposes
> having time-slices included that are completely filled with the
> missing_value flag could provide some advantages, so I guess I
> wouldn't object to requiring this, but I think it's a judgment call
> that's not that clear-cut.
>
> cheers,
> Karl
>
> On 5/4/11 11:42 AM, Bryan Lawrence wrote:
>
> Hi Karl
>
> I think we're somewhat at cross purposes.
>
>
> My view is that if the time-slices have actually
> been lost, we
> shouldn't necessarily reject the data as being
> useless.
>
> Agreed.
>
>
> I agree,
> however, that we should encourage the modeling
> groups to try to
> recover or reproduce the lost time slices to
> make their output more
> complete.
>
> Agreed.
>
>
> If that is impossible, I still think in many
> cases
> analysts will want access to the portions of the
> time-series that
> are available.
>
> In which case we should require them to write missing
> data fields for
> that portion. That should be trivial for them to do, and
> save the
> consumers a vast amount of time. (i.e. use the CF missing
> data flag, we're
> not suggesting they have to re-run anything unless they
> want to).
>
> This is Ag's option 2c, which you don't seem to mention.
>
>
> Consider, for example, a 1000 year control run
> with a decade missing
> in the middle (perhaps all contained in a single
> lost file). Don't
> you think many researchers will make use of the
> two portions of the
> time-series that *are* available, and shouldn't
> the available data
> be assigned a DOI?
>
>
>
> As I recall, data not passing QC level 2 won't
> normally be replicated
> and wouldn't be assigned a DOI. Is this
> correct?
>
> Correct.
>
> Cheers
> Bryan
>
>
>
> best regards,
> Karl
>
> On 5/4/11 1:08 AM, Bryan Lawrence wrote:
>
> Hi Karl
>
> There are two issues noted in your
> email: (1) missing variables, and
> (2) missing time slices in a sequence.
>
> I agree that (1) is something to be
> noted; I think (2) is something
> that should cause failure, and require a
> response as Ag has
> suggested. I don't think it's too much
> to ask a modelling group to
> either provide the missing data, or
> provide missing data flags -
> but actual missing files in a sequence
> should be an error and a
> failure!
>
> I think we should be holding a candle
> for the users here. The
> reality is that no code is going to read
> the metadata to find
> missing data, whereas code can read and
> understand missing data
> flags.
>
> Bryan
>
>
> Dear Ag,
>
> There is another possible way of
> handling the "missing data"
> issue. I'm not sure that a dataset
> should be required to be
> complete (i.e., required to include all
> time slices) to be
> considered eligible for DOI assignment.
> That is, we could relax
> the criteria. Note that I don't think we
> require *all* variables
> requested within a single dataset to be
> present, so some datasets
> will indeed be incomplete but be
> eligible for a DOI. I think the
> QC procedure should be to check with the
> modeling group, and if
> they can't supply the missing
> time-slices, then we somehow note
> this flaw in the dataset documentation
> and if other QC checks are
> passed, assign it a DOI.
>
> The criteria for getting a DOI should be
> that there are no known
> errors in the data itself, and that
> there are no major problems
> with the metadata. In this case the
> data will be reliable, and
> analysts will be welcome to use it and
> publish results, so I
> think it should be assigned a DOI.
>
> What do others think?
>
> Best regards,
> Karl
>
> On 4/28/11 3:12 AM,
> ag.stephens at stfc.ac.uk wrote:
>
> Dear all,
>
> At BADC we have come across our first
> "missing data" issue in the
> CMIP5 datasets we are ingesting. We have
> an example of some
> missing months for a particular set of
> variables that was
> revealed when running the QC code from
> DKRZ.
>
> It would be very useful for the CMIP5
> archive managers to make an
> authoritative statement about how we
> should handle missing data
> time steps in the archive.
>
> I propose the following response when a
> Data Node receives a dataset in which
> time steps are missing:
>
> 1. QC manager (i.e. whoever runs the
> QC code) informs Data
> Provider that there is missing data
> in a dataset (specifying
> full DRS structure and date range
> missing).
>
> 2a. If Data Provider says "no, cannot
> provide this data" then
> the affected datasets cannot get a
> DOI and cannot be part of
> the "crystallised archive". STOP
>
> 2b. Data Provider re-generates files,
> data is re-ingested, new
> version is generated, QC is re-run,
> all is good. STOP
>
> 2c. Data Provider cannot re-generate
> but wants to pass QC - so
> needs to create the required files
> full of missing data.
>
> 3. Data Provider creates missing data
> files and sends, data
> re-ingested, new version is
> generated, QC re-run, all good.
> STOP
>
> In cases 2a and 2c it would also be very
> useful if the dataset is
> annotated to inform the user which dates
> have been FILLED with
> missing data. This would, I believe, be
> in the QC logs but we
> might want a more prominent record of
> this if possible.
>
> Cheers,
>
> Ag
> BADC
>
> --
> Bryan Lawrence
> Director of Environmental Archival and
> Associated Research
> (NCAS/British Atmospheric Data Centre
> and NCEO/NERC NEODC)
> STFC, Rutherford Appleton Laboratory
> Phone +44 1235 445012; Fax ... 5848;
> Web: home.badc.rl.ac.uk/lawrence
>
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence