<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Hi folks,<br>

      <br>

      To add to this important topic, I would like to raise a few

      comments and highlight possible guidance we could give, from a

      data producer and provider perspective.<br>

      <br>

      Sorry for the long email, but it was not easy to condense it

      further.<br>

      <br>

      1. Versions as we have them now are too high level (dataset level) to be

      useful to users. They are in some sense useful to data

      providers, but clearly not enough in this context either.<br>

      2. tracking_ids are very useful. As things stand now, they are the

      most robust basis we have for building a version information system for users.<br>

      3. Checksums are useful but not at all error prone and they are

      costly. A few months back it was not a good idea to build

      a version information system for users on top of them.<br>

      <br>

      We developed a prototype version information system for users. It

      illustrates the methodological approach and covers only IPSL

      results.<br>

      <br>

      1. we need a list of problems<br>

      2. we need a list of the files affected by a given problem<br>

      3. we need a list of (file, problem) statuses, i.e. (corrected, not

      corrected)<br>

      <br>

      This page provides errata for our IPSL-CM results only.<br>

      <a class="moz-txt-link-freetext"

href="http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl">http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl</a>

      <br>

      <br>

      The interesting part is that you can provide a list of tracking_ids

      (example: netcdf_tracking_id.txt, attached). <br>

      The system will tell you:<br>

      - whether the file is from the latest dataset version or not (not

      very useful information, I agree)<br>

      - if not, whether the file has really changed compared to the previous dataset

      version (this is useful: the dataset version changed, but not the

      file I'm interested in)<br>

      - the history of corrections made to those files (example: <a

        class="moz-txt-link-freetext"

href="http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227">http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227</a>)

      <br>

      - if you don't have the latest version of a given file, you have

      access to the list of problems that have been solved.<br>

      - if you have the latest version of a given file BUT a problem

      still needs to be solved, you can make an informed decision.<br>

      <br>
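      To make the three lists and the last two reports above more concrete,
      here is a minimal sketch in Python. It is only an illustration, not the
      code behind the IPSL errata page: the names Problem, ErrataRegistry,
      set_status, files_affected and report are hypothetical, and keying the
      status on the tracking_id is an assumption.<br>
      <br>

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """One entry in the list of problems (an erratum or ticket)."""
    problem_id: int
    description: str


@dataclass
class ErrataRegistry:
    """Hypothetical in-memory store; not the IPSL implementation."""
    problems: dict = field(default_factory=dict)  # problem_id -> Problem
    status: dict = field(default_factory=dict)    # (tracking_id, problem_id) -> True if corrected

    def add_problem(self, problem):
        self.problems[problem.problem_id] = problem

    def set_status(self, tracking_id, problem_id, corrected):
        # Only problems that are already in the list can be attached to a file.
        if problem_id not in self.problems:
            raise KeyError(problem_id)
        self.status[(tracking_id, problem_id)] = corrected

    def files_affected(self, problem_id):
        # List of files (identified by tracking_id) affected by a given problem.
        return sorted(t for (t, p) in self.status if p == problem_id)

    def report(self, tracking_id):
        # Split the problems known for one file into solved and still open.
        result = {"corrected": [], "not corrected": []}
        for (t, p), corrected in sorted(self.status.items()):
            if t == tracking_id:
                result["corrected" if corrected else "not corrected"].append(p)
        return result


# Made-up sample data, for illustration only.
registry = ErrataRegistry()
registry.add_problem(Problem(1, "wrong units attribute"))
registry.add_problem(Problem(2, "missing time steps"))
registry.set_status("id-a", 1, corrected=True)
registry.set_status("id-a", 2, corrected=False)
registry.set_status("id-b", 1, corrected=False)

print(registry.files_affected(1))
print(registry.report("id-a"))
```

      With this made-up data, the report for "id-a" lists problem 1 as
      corrected and problem 2 as not corrected, which is the information a
      user needs to decide whether to keep using the file.<br>
      <br>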

      I agree it needs some formal thinking. The attached PDF provides a

      few steps towards this.<br>

      <br>

      We suggest that part of this information can be captured during

      publication and after the fact (new published version = comments

      and list of issues (tickets)).<br>

      <br>

      We suggest leveraging the ESGF search system as the placeholder

      and entry point for this information.<br>

      <br>

      File-level versioning is what users want.<br>

      <br>

      thanks.<br>

      S&eacute;bastien<br>

      <br>

      On 06/03/2013 10:38, Kettleborough, Jamie wrote:<br>

    </div>

    <blockquote

cite="mid:F9565F07C6F37743B851EF445FE7B6320224B9@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk"

      type="cite">


      <div dir="ltr" align="left">

        <div dir="ltr" align="left"><span class="903553309-06032013">Hello,</span></div>

        <div dir="ltr" align="left"><span class="903553309-06032013"></span>&nbsp;</div>

        <div dir="ltr" align="left"><span class="903553309-06032013">is

            there a straw man document (or anything like that) around

            thoughts/proposals on versioning&nbsp;in ESG?&nbsp; I think it would

            be great to get some user review (both data providers and

            data consumers) of this if possible.</span></div>

        <div dir="ltr" align="left"><span class="903553309-06032013"></span>&nbsp;</div>

        <div dir="ltr" align="left"><span class="903553309-06032013">Thanks,</span></div>

        <div dir="ltr" align="left"><span class="903553309-06032013"></span>&nbsp;</div>

        <div dir="ltr" align="left"><span class="903553309-06032013">Jamie</span></div>

        <br>

      </div>

      <br>

      <blockquote style="BORDER-LEFT: #000000 2px solid; PADDING-LEFT:

        5px; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px" dir="ltr">

        <div dir="ltr" class="OutlookMessageHeader" lang="en-us"

          align="left">

          <hr tabindex="-1">

          <font face="Tahoma" size="2"><b>From:</b>

            <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech-bounces@ucar.edu">go-essp-tech-bounces@ucar.edu</a>

            [<a class="moz-txt-link-freetext" href="mailto:go-essp-tech-bounces@ucar.edu">mailto:go-essp-tech-bounces@ucar.edu</a>]

            <b>On Behalf Of </b>Christensen, Sigurd W.<br>

            <b>Sent:</b> 05 March 2013 19:57<br>

            <b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a><br>

            <b>Subject:</b> [Go-essp-tech] Towards versioning in ESG<br>

          </font><br>

        </div>

        <div dir="ltr" align="left">

          <p><font face="Arial">Folks,</font></p>

          <p><font face="Arial">Thanks for the opportunity to discuss

              versioning on today's call.</font></p>

          <p>&nbsp;</p>

          <p><span class="897273719-05032013"><font face="Arial">As

                others have expressed, in the December 21 and March 4

                postings on this topic, my main concern is that

                versioning serve the needs of the end user.&nbsp; We should

                provide an easy way&nbsp;for the end user to determine

                whether data and metadata&nbsp;the user&nbsp;has previously

                retrieved and used in an analysis is still current, or

                has been revised in a way that might affect the

                analysis.&nbsp;</font></span></p>

          <p><span class="897273719-05032013"></span>&nbsp;</p>

          <p><font face="Arial">I agreed to post to this list&nbsp;<span

                class="897273719-05032013">a consideration</span>&nbsp;I

              mentioned on&nbsp;<span class="897273719-05032013">today's</span>

              call: observational datasets that routinely are extended

              through time as current data become available. This

              situation was also raised on this list by George Huffman

              on December 21, 2012. I agree with his thought that

              provoking a new version each time a new data increment is

              added is unwieldy both for the data producers and for the

              users.</font></p>

          <p>&nbsp;</p>

          <p><span class="897273719-05032013"><font face="Arial">I also

                support George's notion that we consider the standards

                for DOIs (Digital Object Identifiers) in conjunction

                with the discussion of versioning.</font></span></p>

          <p><span class="897273719-05032013"></span>&nbsp;</p>

          <p><span class="897273719-05032013"><font face="Arial">A final

                thought for now: I&nbsp;feel that&nbsp;we should make information

                available to the users about what changed with a new

                version.</font></span></p>

          <p><span class="897273719-05032013"></span>&nbsp;</p>

          <p><span class="897273719-05032013"><font face="Arial">&nbsp; - Sig

                Christensen</font></span></p>

          <p><span class="897273719-05032013"></span>&nbsp;</p>

          <p><span class="897273719-05032013"></span>&nbsp;</p>

          <hr tabindex="-1">

          <p><font face="Tahoma"><b>From:</b>

              <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech-bounces@ucar.edu">go-essp-tech-bounces@ucar.edu</a>

              [<a class="moz-txt-link-freetext" href="mailto:go-essp-tech-bounces@ucar.edu">mailto:go-essp-tech-bounces@ucar.edu</a>]

              <b>On Behalf Of </b>Drach, Bob<br>

              <b>Sent:</b> Monday, March 04, 2013 21:26<br>

              <b>To:</b> Taylor, Karl Taylor<br>

              <b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a><br>

              <b>Subject:</b> Re: [Go-essp-tech] definition of dataset

              version<br>

            </font><br>

          </p>

        </div>

        <div style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: #000000;

          FONT-SIZE: 10pt">

          Hi Karl,<br>

          <br>

          As you suggest, the broader question is what guidance we

          should give to data providers and users on usage of the

          dataset version, file tracking ID, and file checksum.<br>

          <br>

          It's true that the dataset version may not be of much use to

          data users if they don't record when the data was downloaded.

          But since the version indicates the date of publication, it

          still might give some indication when a dataset has gone out

          of date. The tracking ID is a random UUID generated by CMOR,

          and is meant as a 'bar code' to track the data through ESGF.

          Since it's a global attribute that is visible on the data

          portal, it is relatively easy for a user to discover and

          compare with the file value. However, its usage and purpose

          haven't been well defined, and in some cases data providers

          have probably modified data in place without changing the

          tracking ID (hopefully not too often). Checksums are

          definitive, but trivial modifications can't be made without

          changing the checksum.<br>

          <br>
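          As a small illustration of the last point, the sketch below rewrites
          the same payload twice with a fresh random tracking ID each time and
          compares the checksums. The file layout, the helper names regenerate
          and checksum, and the use of SHA-256 are all assumptions made for the
          example; this is not how CMOR or ESGF actually store or hash files.<br>
          <br>

```python
import hashlib
import uuid


def regenerate(payload):
    """Stand-in for rewriting a file: same payload, fresh random tracking ID."""
    tracking_id = str(uuid.uuid4())
    header = ("tracking_id=" + tracking_id + "\n").encode("ascii")
    return tracking_id, header + payload


def checksum(content):
    # SHA-256 is an assumption; the thread does not say which algorithm is used.
    return hashlib.sha256(content).hexdigest()


# Made-up payload standing in for the actual data values.
payload = b"tas: 281.4 282.0 283.1\n"
first_id, first_file = regenerate(payload)
second_id, second_file = regenerate(payload)

print("same payload:    ", first_file.endswith(payload) and second_file.endswith(payload))
print("same tracking ID:", first_id == second_id)
print("same checksum:   ", checksum(first_file) == checksum(second_file))
```

          The payload is identical in both files, yet the checksum differs
          because the tracking ID inside the file differs.<br>
          <br>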

          To answer your question, the timestamp in the ESGF SOLR index

          is associated with the dataset as a whole, and indicates the

          publication time.<br>

          <br>

          I'm opening the discussion to the GO-ESSP list for comments.<br>

          <br>

          --Bob<br>

          <br>

          <div style="FONT-FAMILY: Times New Roman; COLOR: rgb(0,0,0);

            FONT-SIZE: 16px">

            <hr tabindex="-1">

            <div style="DIRECTION: ltr" id="divRpF600567"><font

                face="Tahoma" size="2"><b>From:</b> Karl Taylor

                [<a class="moz-txt-link-abbreviated" href="mailto:taylor13@llnl.gov">taylor13@llnl.gov</a>]<br>

                <b>Sent:</b> Monday, March 04, 2013 3:45 PM<br>

                <b>To:</b> Drach, Bob<br>

                <b>Cc:</b> Williams, Dean N.; Painter, Jeff; Ganzberger,

                Michael<br>

                <b>Subject:</b> Re: definition of dataset version<br>

              </font><br>

            </div>

            <div><font face="Times New Roman">Hi Bob,<br>

                <br>

                I think the "version numbers" assigned to datasets are

                pretty unhelpful to most users.&nbsp; Most users won't record

                or remember what version they have downloaded.&nbsp; Perhaps

                some users will know what *date* they downloaded data,

                and all users can determine the tracking_id's and

                chksums for their files, so we should provide support

                for determining whether files are current based on this

                information.<br>

                <br>

                Is the date recorded by ESGF assigned to a dataset or to

                each file? &nbsp; If it's assigned to a dataset, then I'm not

                sure that will be much use either.<br>

                <br>

                I think when a user asks us whether a file is current or

                not, based on the checksum or tracking_id, we should

                return the following information:<br>

                <br>

                "You have the latest version of this file"&nbsp; -- if the

                checksum provided by the user is identical to the latest

                file version in the CMIP archive.<br>

                "A newer variant of the file exists, but differences are

                unlikely to affect your analysis"&nbsp; --&nbsp; if the only

                changes made have been to some subset of the file's

                global attributes that we think will not lead to

                misinterpretation of the data itself.<br>

                "A new version of the file exists and should be used in

                place of the one you downloaded"&nbsp; --&nbsp; otherwise<br>

                <br>

                We would list the set of global attributes that could be

                wrong in case 2.<br>

                <br>
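                The three responses above could be sketched roughly as follows
                in Python. This is only an illustration under stated
                assumptions: the function file_status, the older_versions
                mapping and the BENIGN_ATTRIBUTES set are hypothetical, and
                the two attributes listed as harmless are simply the tracking
                ID and history attributes mentioned elsewhere in this thread,
                not an agreed list.<br>
                <br>

```python
# Assumption: attributes considered harmless to change. The real list would
# have to be agreed on; it is not defined anywhere in this thread.
BENIGN_ATTRIBUTES = frozenset({"tracking_id", "history"})

LATEST = "You have the latest version of this file"
MINOR = ("A newer variant of the file exists, but differences are "
         "unlikely to affect your analysis")
REPLACE = ("A new version of the file exists and should be used in "
           "place of the one you downloaded")


def file_status(user_checksum, latest_checksum, older_versions):
    """Return one of the three messages for a checksum supplied by a user.

    older_versions is a hypothetical mapping from the checksum of each
    superseded file to a pair (data_changed, changed_attributes) describing
    how the latest file differs from it.
    """
    if user_checksum == latest_checksum:
        return LATEST
    known = older_versions.get(user_checksum)
    if known is not None:
        data_changed, changed_attributes = known
        if not data_changed and set(changed_attributes).issubset(BENIGN_ATTRIBUTES):
            return MINOR
    # Unknown checksums and real changes both fall into the "otherwise" case.
    return REPLACE


# Made-up records, for illustration only.
older = {
    "aaa": (False, ["history"]),  # only a harmless attribute changed
    "bbb": (True, []),            # the data itself changed
    "ccc": (False, ["units"]),    # an important attribute changed
}
for digest in ("zzz", "aaa", "bbb", "ccc"):
    print(digest, "->", file_status(digest, "zzz", older))
```

                Treating an unknown checksum as the third case is a design
                choice of this sketch: when nothing is known about the user's
                file, the cautious answer is to recommend the latest one.<br>
                <br>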

                We could use tracking_id's rather than chksums, but we

                would have to weed out the cases where a critically

                important global attribute had been modified, but the

                tracking_id hadn't. &nbsp; [I'd guess that there aren't any

                cases where the data itself has been modified without

                changing the chksum, but there might be quite a few

                cases where important global attributes have been

                changed.]<br>

                <br>

                Would the above be practical?<br>

                <br>

                Karl<br>

                <br>

                <br>

              </font>

              <div class="moz-cite-prefix">On 3/4/13 1:21 PM, Drach, Bob

                wrote:<br>

              </div>

              <blockquote type="cite">

                <style id="owaParaStyle" type="text/css">P {

        MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px

}

BODY {

        FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: #000000; FONT-SIZE: 10pt

}


</style>

                <div style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR:

                  rgb(0,0,0); FONT-SIZE: 10pt">

                  Hi Karl,<br>

                  <br>

                  Dean requested that we have a conversation about

                  dataset versioning on the GO-ESSP telecon tomorrow.

                  I'm curious about your views on the subject.

                  <br>

                  <br>

                  Specifically, the question arose for the case where a

                  modeling group has regenerated data through CMOR, to

                  replace data lost in a disk crash. The data providers

                  assert that the data is identical to the published

                  version. However, because it has been regenerated the

                  checksums and tracking IDs differ. The question is

                  whether the data should be published with the previous

                  version number or should be considered a new version.<br>

                  <br>

                  At the moment we leave the choice to the data

                  publishers, and the publishing client by default

                  generates a new version number when any file in a

                  dataset has been added, deleted, or modified. However,

                  this leaves some ambiguous cases, such as when:<br>

                  <br>

                  - the metadata has been modified, but the actual data

                  is unchanged;<br>

                  - the data has been regenerated through CMOR, such

                  that all data and metadata fields are unchanged, with

                  the sole exception of the tracking ID (and therefore

                  the checksum has changed as well).<br>

                  <br>

                  My opinion is that an updated version number should be

                  a signal to the end users that something significant

                  has changed that is worth their attention. If nothing

                  has changed except the tracking ID and history

                  attributes, the dataset should be republished with the

                  original version number. There may be similar cases

                  where minor metadata modifications don't warrant a new

                  version number. On the other hand, modification of

                  metadata that guides processing - axis definitions,

                  units, dataset identification fields, etc., should

                  trigger a new version number.<br>

                  <br>

                  This approach has the implication that the tracking ID

                  and checksum of a file could change even though the

                  parent dataset version stays the same.<br>

                  <br>

                  Any thoughts on the matter?<br>

                  <br>

                  --Bob<br>

                </div>

              </blockquote>

              <br>

            </div>

          </div>

        </div>

      </blockquote>

      <br>


      <br>

      <pre wrap="">_______________________________________________

GO-ESSP-TECH mailing list

<a class="moz-txt-link-abbreviated" href="mailto:GO-ESSP-TECH@ucar.edu">GO-ESSP-TECH@ucar.edu</a>

<a class="moz-txt-link-freetext" href="http://mailman.ucar.edu/mailman/listinfo/go-essp-tech">http://mailman.ucar.edu/mailman/listinfo/go-essp-tech</a>

</pre>

    </blockquote>

    <br>

    <br>

    <pre class="moz-signature" cols="72">-- 

S&eacute;bastien Denvil

IPSL, P&ocirc;le de mod&eacute;lisation du climat

UPMC, Case 101, 4 place Jussieu,

75252 Paris Cedex 5

Tour 45-55 2&egrave;me &eacute;tage Bureau 209

Tel: 33 1 44 27 21 10

Fax: 33 1 44 27 39 02

</pre>

  </body>

</html>