<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META content="text/html; charset=us-ascii" http-equiv=Content-Type>

<META name=GENERATOR content="MSHTML 8.00.6001.19019"></HEAD>

<BODY bgColor=#ffffff text=#000000>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>Hi Karl,</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>As a modelling centre affected (or should that be 

afflicted!) by this particular issue, it's probably time for us to chime in with 

our own 2 cents' worth.</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>There are a variety of technical and human reasons why 

there are occasional small temporal gaps in the model data that we have 

submitted to the CMIP5 archive: model crashes/restarts, files not making it into 

our archive system, start/end dates not specified exactly in conformance with 

the CMIP5 experiment plan, etc, etc.&nbsp;(Given&nbsp;the number of 

experiments&nbsp;that MOHC is&nbsp;conducting, I don't think it would be humanly 

possible for us to get everything right all the time :-).</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>If it were a trivial matter to identify and fix these 

small bits of missing data I can assure you that we would have done that. The 

reality, however, is that the complexities (and, yes, quirks)&nbsp;of the UM, 

together with the software integration aspects of the CMOR library, mean that this is 

by no means a trivial technical issue.&nbsp; And like the rest of the CMIP5/ESGF 

endeavour - that's&nbsp;you guys! -&nbsp;we have&nbsp;very few resources spread 

fairly thinly.&nbsp; Hence we have had to make decisions on where to prioritise 

our efforts. Do we fix occasional small gaps in data time-series, or do we focus 

on CFMIP2, TAMIP, 60-level models, or invest *significant* effort 

into&nbsp;understanding and using&nbsp;the CMIP5 questionnaire! (In the latter 

case, to the not inconsiderable benefit of other modelling 

centres.)</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>So, in the same spirit in which the compliance rules 

were relaxed with regard to provision of model metadata via the CMIP5 

questionnaire, we would hope that&nbsp;similar flexibility&nbsp;be extended to 

the&nbsp;submission of model&nbsp;data, some of which may contain occasional 

small portions of missing data.&nbsp; Not surprisingly perhaps, we believe that 

it is far preferable to have 99% of&nbsp;the data for a 

particular&nbsp;simulation available in the archive than have it rejected (or 

non-DOI'd) because of, say, 1 missing month or year.</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>Also, given that we have been submitting model data to 

the archive since last October, it would seem somewhat, er,&nbsp;punitive to 

introduce a stricter data compliance rule at this stage in the 

game!</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>For our part we will endeavour to minimise the 

size/number of temporal gaps in our submitted data. And, as time and resources 

permit,&nbsp;we will investigate technical solutions that will enable us to 

supply files of missing data where we do have such gaps. In the meantime we will 

continue to utilise the appropriate mechanisms (e.g. the CMIP5 questionnaire) to 

flag up data quality issues such as this.</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>Regards,</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011>Phil</SPAN></FONT></DIV>

<DIV dir=ltr align=left><FONT color=#0000ff size=2 face=Tahoma><SPAN 

class=085485508-05052011></SPAN></FONT>&nbsp;</DIV><BR>

<BLOCKQUOTE 

style="BORDER-LEFT: #0000ff 2px solid; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px" 

dir=ltr>

  <DIV dir=ltr lang=en-us class=OutlookMessageHeader align=left>

  <HR tabIndex=-1>

  <FONT size=2 face=Tahoma><B>From:</B> go-essp-tech-bounces@ucar.edu 

  [mailto:go-essp-tech-bounces@ucar.edu] <B>On Behalf Of </B>Karl 

  Taylor<BR><B>Sent:</B> 04 May 2011 20:41<BR><B>To:</B> Bryan 

  Lawrence<BR><B>Cc:</B> go-essp-tech@ucar.edu<BR><B>Subject:</B> Re: 

  [Go-essp-tech] Handling missing data in the CMIP5 archive<BR></FONT><BR></DIV>

  <DIV></DIV><FONT face="Times New Roman">Hi Bryan,<BR><BR>Oh, I left something 

  out.&nbsp; Why is it lots of work for the user to notice by looking at the 

  time axis that the spacing between coordinates is greater than normal, and 

  thus some time slices have clearly been skipped?&nbsp; For daily data,&nbsp; 

  for example, if the interval between two successive time-coordinates is 10 

  days, then 9 samples must be missing.&nbsp;&nbsp;&nbsp; <BR><BR>I will concede 

  that for some software and for some purposes having time-slices included that 

  are completely filled with the missing_value flag could provide some 

  advantages, so I guess I wouldn't object to requiring this, but I think it's a 

  judgment call that's not that 

  clear-cut.<BR><BR>cheers,<BR>Karl</FONT><BR><BR>On 5/4/11 11:42 AM, Bryan 

  Lawrence wrote: 

  <BLOCKQUOTE cite=mid:201105041942.15959.bryan.lawrence@stfc.ac.uk type="cite"><PRE wrap="">Hi Karl

I think we're somewhat at cross purposes.

</PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">My view is that if the time-slices have actually been lost, we

shouldn't necessarily reject the data as being useless. 

</PRE></BLOCKQUOTE><PRE wrap="">Agreed.

</PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">I agree,

however, that we should encourage the modeling groups to try to

recover or reproduce the lost time slices to make their output more

complete.

</PRE></BLOCKQUOTE><PRE wrap="">Agreed.

</PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">If that is impossible, I still think in many cases

analysts will want access to the portions of the time-series that

are available.

</PRE></BLOCKQUOTE><PRE wrap="">In which case we should require them to write missing data fields for 

that portion. That should be trivial for them to do, and save the 

consumers a vast amount of time.  (ie use the CF missing data flag, we're 

not suggesting they have to re-run anything unless they want to).

This is Ag's option 2c, which you don't seem to mention.

</PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">Consider, for example, a 1000 year control run with a decade missing

in the middle (perhaps all contained in a single lost file).  Don't

you think many researchers will make use of the two portions of the

time-series that *are* available, and shouldn't the available data

be assigned a DOI?

</PRE></BLOCKQUOTE><PRE wrap=""></PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">As I recall, data not passing QC level 2 won't normally be replicated

and wouldn't be assigned a DOI.  Is this correct?

</PRE></BLOCKQUOTE><PRE wrap="">Correct.

Cheers

Bryan

</PRE>

    <BLOCKQUOTE type="cite"><PRE wrap="">best regards,

Karl

On 5/4/11 1:08 AM, Bryan Lawrence wrote:

</PRE>

      <BLOCKQUOTE type="cite"><PRE wrap="">Hi Karl

There are two issues noted in your email: (1) missing variables, and

(2) missing time slices in a sequence.

I agree that (1) is something to be noted, I think (2) is something

that should cause failure, and require a response as Ag has

suggested. I don't think it's too much to ask a modelling group to

either provide the missing data, or provide missing data flags -

but actual missing files in a sequence should be an error and a

failure!

I think we should be holding a candle for the users here. The

reality is that no code is going to read the metadata to find

missing data, whereas code can read and understand missing data

flags.

Bryan

</PRE>

        <BLOCKQUOTE type="cite"><PRE wrap="">Dear Ag,

There is another possible way of handling the "missing data"

issue. I'm not sure that a dataset should be required to be

complete (i.e., required to include all time slices) to be

considered eligible for DOI assignment.  That is, we could relax

the criteria. Note that I don't think we require *all* variables

requested within a single dataset to be present, so some datasets

will indeed be incomplete but be eligible for a DOI.  I think the

QC procedure should be to check with the modeling group, and if

they can't supply the missing time-slices, then we somehow note

this flaw in the dataset documentation and if other QC checks are

passed, assign it a DOI.

The criteria for getting a DOI should be that there are no known

errors in the data itself, and that there are no major problems

with the metadata.  In this case the data will be reliable, and

analysts will be welcome to use it and publish results, so I

think it should be assigned a DOI.

What do others think?

Best regards,

Karl

On 4/28/11 3:12 AM, <A class=moz-txt-link-abbreviated href="mailto:ag.stephens@stfc.ac.uk">ag.stephens@stfc.ac.uk</A> wrote:

</PRE>

          <BLOCKQUOTE type="cite"><PRE wrap="">Dear all,

At BADC we have come across our first "missing data" issue in the

CMIP5 datasets we are ingesting. We have an example of some

missing months for a particular set of variables that was

revealed when running the QC code from DKRZ.

It would be very useful for the CMIP5 archive managers to make an

authoritative statement about how we should handle missing data

time steps in the archive.

I propose the following response when a Data Node receives a

dataset

</PRE></BLOCKQUOTE></BLOCKQUOTE><PRE wrap="">in which time steps are missing:

</PRE>

        <BLOCKQUOTE type="cite">

          <BLOCKQUOTE type="cite"><PRE wrap="">   1. QC manager (i.e. whoever runs the QC code) informs Data

   Provider that there is missing data in a dataset (specifying

   full DRS structure and date range missing).

   2a. If Data Provider says "no, cannot provide this data" then

   the affected datasets cannot get a DOI and cannot be part of

   the "crystallised archive". STOP

   2b. Data Provider re-generates files, data is re-ingested, new

   version is generated, QC is re-run, all is good. STOP

   2c. Data Provider cannot re-generate but wants to pass QC - so

   needs to create the required files full of missing data.

   3. Data Provider creates missing data files and sends, data

   re-ingested, new version is generated, QC re-run, all good.

   STOP

In cases 2a and 2c it would also be very useful if the dataset is

annotated to inform the user which dates have been FILLED with

missing data. This would, I believe, be in the QC logs but we

might want a more prominent record of this if possible.

Cheers,

Ag

BADC--


</PRE></BLOCKQUOTE></BLOCKQUOTE><PRE wrap="">--

Bryan Lawrence

Director of Environmental Archival and Associated Research

(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)

STFC, Rutherford Appleton Laboratory

Phone +44 1235 445012; Fax ... 5848;

Web: home.badc.rl.ac.uk/lawrence

</PRE></BLOCKQUOTE></BLOCKQUOTE><PRE wrap="">--

Bryan Lawrence

Director of Environmental Archival and Associated Research

(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)

STFC, Rutherford Appleton Laboratory

Phone +44 1235 445012; Fax ... 5848; 

Web: home.badc.rl.ac.uk/lawrence

</PRE></BLOCKQUOTE></BLOCKQUOTE></BODY></HTML>