[Go-essp-tech] Handling missing data in the CMIP5 archive
Bryan Lawrence
bryan.lawrence at stfc.ac.uk
Thu May 5 04:20:06 MDT 2011
Hi Phil
Actually, having written that and thought a bit longer, I think the
situation is analogous to atm.chem.phys discussion papers.
I have no problem with passing q.c level 2 in this situation, and
allowing folks access to the data (analogous to a discussion paper for
atm.chem.phys).
We now have what is effectively a discussion paper and a referee's report:
"you have unflagged missing data" - both public.
When the final version is received for formal publication, we expect it
to have been fixed (using flags or a rerun, whatever). At which point it
is eligible to get a persistent DOI to a fixed dataset (and move from
being a discussion paper to being a real paper).
Otherwise the position (wrt a doi publication) is analogous to "we put a
lot of work into our paper, we don't want to do any more because we're
busy" ... which doesn't and shouldn't fly, for a publication.
Cheers
Bryan
> Hi Phil
>
> You guys have known about this error (and others like it) for weeks.
>
> Are you really saying it's a vast effort to have a short python code
> that provides missing data sequences when *you (MOHC)*
> a) already know exactly what is missing,
> b) know how long it is missing for, and
> c) have written your own internal q.c. report for it (which must
>    take a comparable amount of time to running a "provide missing
>    data" code)?
>
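For illustration only, a short Python sketch of what such a "provide missing data" code might look like is given here. It is not MOHC's or BADC's code: the function name `fill_time_gaps`, the list-based fields, the regular `step` and the fill value 1.0e20 are all assumptions, and a real fix would have to go through CMOR and write netCDF files.

```python
from typing import List, Tuple

# Assumed fill value; in practice use the value declared in the
# variable's missing_value attribute.
MISSING_VALUE = 1.0e20


def fill_time_gaps(
    times: List[float],
    fields: List[List[float]],
    step: float = 1.0,
    missing_value: float = MISSING_VALUE,
) -> Tuple[List[float], List[List[float]]]:
    """Rebuild a regular time axis, padding skipped slices with missing_value."""
    if not times:
        return [], []
    n_points = len(fields[0])
    # Index each existing slice by its position on the regular axis.
    by_index = {round((t - times[0]) / step): f for t, f in zip(times, fields)}
    n_steps = round((times[-1] - times[0]) / step) + 1
    full_times, full_fields = [], []
    for i in range(n_steps):
        full_times.append(times[0] + i * step)
        # Slices that were never written become all-missing fields.
        full_fields.append(by_index.get(i, [missing_value] * n_points))
    return full_times, full_fields


# Hypothetical example: days 2 and 3 were never written.
times, fields = fill_time_gaps(
    [0.0, 1.0, 4.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
)
print(times)
print(fields)
```

The sketch assumes a fixed spacing between time coordinates, so monthly data on a real calendar would need the expected dates generated differently.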
> We then find it immediately too, because we're running q.c. code, and
> we could fix it too (and would, but then it wouldn't be *your* data
> any more). Every user will do this too (or cock up because they
> don't spot the problem).
>
> I know we're all spread thin, and I can lose this argument in the
> context of CMIP5 q.c up to level 2, but I'm not ready to give in for
> q.c. level 3.
>
> In particular, a DOI is not a URL alone. It has to mean something a
> wee bit more than that if you want it to be taken seriously as a
> publication. Speaking personally, I won't go to bat with the journals
> and ISI unless it is!
>
> I know you're spread thin, but we are all spread thin. I'm not
> suggesting you have to rerun the model here!
>
> Cheers
> Bryan
>
> > Hi Karl,
> >
> > As a modelling centre affected (or should that be afflicted!) by
> > this particular issue, it's probably time for us to chime in with
> > our own 2 cents worth.
> >
> > There are a variety of technical and human reasons why there are
> > occasional small temporal gaps in the model data that we have
> > submitted to the CMIP5 archive: model crashes/restarts, files not
> > making it into our archive system, start/end dates not specified
> > exactly in conformance with the CMIP5 experiment plan, etc, etc.
> > (Given the number of experiments that MOHC is conducting I don't
> > think it would be humanly possible for us to get everything right
> > all the time :-).
> >
> > If it was a trivial matter to identify and fix these small bits of
> > missing data I can assure you that we would have done that. The
> > reality, however, is that the complexities (and, yes, quirks) of
> > the UM, together with the software integration aspects of the CMOR
> > library, mean that this is by no means a trivial technical issue. And
> > like the rest of the CMIP5/ESGF endeavour - that's you guys! - we
> > have very few resources spread fairly thinly. Hence we have had
> > to make decisions on where to prioritise our efforts. Do we fix
> > occasional small gaps in data time-series, or do we focus on
> > CFMIP2, TAMIP, 60-level models, or invest *significant* effort
> > into understanding and using the CMIP5 questionnaire! (In the
> > latter case, to the not inconsiderable benefit of other modelling
> > centres.)
> >
> > So, in the same spirit in which the compliance rules were relaxed
> > with regard to provision of model metadata via the CMIP5
> > questionnaire, we would hope that similar flexibility be extended
> > to the submission of model data, some of which may contain
> > occasional small portions of missing data. Not surprisingly
> > perhaps, we believe that it is far preferable to have 99% of the
> > data for a particular simulation available in the archive than
> > have it rejected (or non-DOI'd) because of, say, 1 missing month
> > or year.
> >
> > Also, given that we have been submitting model data to the archive
> > since last October, it would seem somewhat, er, punitive to
> > introduce a stricter data compliance rule at this stage in the
> > game!
> >
> > For our part we will endeavour to minimise the size/number of
> > temporal gaps in our submitted data. And, as time and resources
> > permit, we will investigate technical solutions that will enable us
> > to supply files of missing data where we do have such gaps. In the
> > meantime we will continue to utilise the appropriate mechanisms
> > (e.g. the CMIP5 questionnaire) to flag up data quality issues such
> > as this.
> >
> > Regards,
> >
> > Phil
> >
> > ________________________________
> >
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> > Sent: 04 May 2011 20:41
> > To: Bryan Lawrence
> > Cc: go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] Handling missing data in the CMIP5 archive
> >
> > Hi Bryan,
> >
> > Oh, I left something out. Why is it lots of work for the user to
> > notice by looking at the time axis that the spacing between
> > coordinates is greater than normal, and thus some time slices have
> > clearly been skipped? For daily data, for example, if the
> > interval between two successive time-coordinates is 10 days, then
> > 9 samples must be missing.
> >
> > I will concede that for some software and for some purposes having
> > time-slices included that are completely filled with the
> > missing_value flag could provide some advantages, so I guess I
> > wouldn't object to requiring this, but I think it's a judgment call
> > that's not that clear-cut.
> >
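As a minimal sketch of the time-axis check Karl describes (not code from this thread), the following Python compares successive time coordinates against the expected spacing. The function name `find_time_gaps`, the fixed `step` and the sample axis are illustrative assumptions; real CMIP5 files would need their time units and calendar decoded first.

```python
from typing import List, Tuple


def find_time_gaps(
    times: List[float], step: float = 1.0, tol: float = 1e-6
) -> List[Tuple[float, float, int]]:
    """Return (previous, next, n_missing) for every interval wider than step."""
    gaps = []
    for prev, curr in zip(times, times[1:]):
        interval = curr - prev
        if interval > step + tol:
            # An interval of k steps means k - 1 samples were skipped.
            n_missing = round(interval / step) - 1
            gaps.append((prev, curr, n_missing))
    return gaps


# Hypothetical daily axis (days since a reference date) jumping from day 2 to day 12.
daily_axis = [0.0, 1.0, 2.0, 12.0, 13.0]
gaps = find_time_gaps(daily_axis)
print(gaps)  # [(2.0, 12.0, 9)]: a 10-day interval, so 9 samples missing
```

Every user or tool reading the data would have to run a check like this before analysis, which is the burden Bryan argues should sit with the data provider instead.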
> > cheers,
> > Karl
> >
> > On 5/4/11 11:42 AM, Bryan Lawrence wrote:
> > Hi Karl
> >
> > I think we're somewhat at cross purposes.
> >
> > > My view is that if the time-slices have actually been lost, we
> > > shouldn't necessarily reject the data as being useless.
> >
> > Agreed.
> >
> > > I agree, however, that we should encourage the modeling groups to
> > > try to recover or reproduce the lost time slices to make their
> > > output more complete.
> >
> > Agreed.
> >
> > > If that is impossible, I still think in many cases analysts will
> > > want access to the portions of the time-series that are available.
> >
> > In which case we should require them to write missing data fields
> > for that portion. That should be trivial for them to do, and save
> > the consumers a vast amount of time. (i.e. use the CF missing data
> > flag; we're not suggesting they have to re-run anything unless they
> > want to).
> >
> > This is Ag's option 2c, which you don't seem to mention.
> >
> > > Consider, for example, a 1000 year control run with a decade
> > > missing in the middle (perhaps all contained in a single lost
> > > file). Don't you think many researchers will make use of the two
> > > portions of the time-series that *are* available, and shouldn't
> > > the available data be assigned a DOI?
> > >
> > > As I recall, data not passing QC level 2 won't normally be
> > > replicated and wouldn't be assigned a DOI. Is this correct?
> >
> > Correct.
> >
> > Cheers
> > Bryan
> >
> > > best regards,
> > > Karl
> >
> > On 5/4/11 1:08 AM, Bryan Lawrence wrote:
> > Hi Karl
> >
> > There are two issues noted in your email: (1) missing variables, and
> > (2) missing time slices in a sequence.
> >
> > I agree that (1) is something to be noted, I think (2) is something
> > that should cause failure, and require a response as Ag has
> > suggested. I don't think it's too much to ask a modelling group to
> > either provide the missing data, or provide missing data flags -
> > but actual missing files in a sequence should be an error and a
> > failure!
> >
> > I think we should be holding a candle for the users here. The
> > reality is that no code is going to read the metadata to find
> > missing data, whereas code can read and understand missing data
> > flags.
> >
> > Bryan
> >
> >
> > Dear Ag,
> >
> > There is another possible way of handling the "missing data"
> > issue. I'm not sure that a dataset should be required to be
> > complete (i.e., required to include all time slices) to be
> > considered eligible for DOI assignment. That is, we could relax
> > the criteria. Note that I don't think we require *all* variables
> > requested within a single dataset to be present, so some datasets
> > will indeed be incomplete but be eligible for a DOI. I think the
> > QC procedure should be to check with the modeling group, and if
> > they can't supply the missing time-slices, then we somehow note
> > this flaw in the dataset documentation and if other QC checks are
> > passed, assign it a DOI.
> >
> > The criteria for getting a DOI should be that there are no known
> > errors in the data itself, and that there are no major problems
> > with the metadata. In this case the data will be reliable, and
> > analysts will be welcome to use it and publish results, so I
> > think it should be assigned a DOI.
> >
> > What do others think?
> >
> > Best regards,
> > Karl
> >
> > On 4/28/11 3:12 AM, ag.stephens at stfc.ac.uk wrote:
> > Dear all,
> >
> > At BADC we have come across our first "missing data" issue in the
> > CMIP5 datasets we are ingesting. We have an example of some
> > missing months for a particular set of variables that was
> > revealed when running the QC code from DKRZ.
> >
> > It would be very useful for the CMIP5 archive managers to make an
> > authoritative statement about how we should handle missing data
> > time steps in the archive.
> >
> > I propose the following response when a Data Node receives a
> > dataset in which time steps are missing:
> >
> > 1. QC manager (i.e. whoever runs the QC code) informs Data
> >    Provider that there is missing data in a dataset (specifying
> >    full DRS structure and date range missing).
> >
> > 2a. If Data Provider says "no, cannot provide this data" then
> >     the affected datasets cannot get a DOI and cannot be part of
> >     the "crystallised archive". STOP
> >
> > 2b. Data Provider re-generates files, data is re-ingested, new
> >     version is generated, QC is re-run, all is good. STOP
> >
> > 2c. Data Provider cannot re-generate but wants to pass QC - so
> >     needs to create the required files full of missing data.
> >
> > 3. Data Provider creates missing data files and sends, data
> >    re-ingested, new version is generated, QC re-run, all good.
> >    STOP
> >
> > In cases 2a and 2c it would also be very useful if the dataset is
> > annotated to inform the user which dates have been FILLED with
> > missing data. This would, I believe, be in the QC logs but we
> > might want a more prominent record of this if possible.
> >
> > Cheers,
> >
> > Ag
> > BADC
> >
> > --
> > Bryan Lawrence
> > Director of Environmental Archival and Associated Research
> > (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
> > STFC, Rutherford Appleton Laboratory
> > Phone +44 1235 445012; Fax ... 5848;
> > Web: home.badc.rl.ac.uk/lawrence
--
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848;
Web: home.badc.rl.ac.uk/lawrence