<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">Hi folks,<br>
<br>
To add to this important topic, I would like to raise a few
comments and highlight possible guidance we could give, from a
data producer and provider perspective.<br>
<br>
Sorry for the long email, but it was not easy to condense it any
further.<br>
<br>
1. Versions as we have them now are too high level (dataset level) to
be useful to users. They are in some sense useful to data
providers, but clearly not enough in that context either.<br>
2. tracking_ids are very useful. As things stand now, they are the
most robust basis we have for building a version information system for users.<br>
3. Checksums are useful but not at all error prone and they are
costly. A few months back it was not a good idea to build a version
information system for users on top of them.<br>
<br>
We developed a prototype version information system for users. It
highlights the methodological approach and only covers IPSL
results.<br>
<br>
1. We need a list of problems.<br>
2. We need a list of files affected by a given problem.<br>
3. We need a list of (file, problem) statuses, i.e. (corrected, not
corrected).<br>
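<br>
As a rough illustration, these three lists could be held in a structure like the
one below. This is only a sketch: the names Problem, Status and ErrataRegistry are
hypothetical and do not describe the schema used by the IPSL prototype.<br>
<pre>
```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    CORRECTED = "corrected"
    NOT_CORRECTED = "not corrected"


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier, e.g. a ticket number
    description: str


@dataclass
class ErrataRegistry:
    # List 1: known problems, keyed by problem id.
    problems: dict = field(default_factory=dict)
    # Lists 2 and 3: (tracking_id, problem_id) mapped to a Status.
    status: dict = field(default_factory=dict)

    def add_problem(self, problem):
        self.problems[problem.problem_id] = problem

    def mark(self, tracking_id, problem_id, status):
        # Refuse to attach a status to a problem that was never declared.
        if problem_id not in self.problems:
            raise KeyError(f"unknown problem: {problem_id}")
        self.status[(tracking_id, problem_id)] = status

    def files_affected_by(self, problem_id):
        # Files are identified by their tracking_id.
        return sorted(t for (t, p) in self.status if p == problem_id)

    def problems_for(self, tracking_id):
        return {p: s for (t, p), s in self.status.items() if t == tracking_id}
```
</pre>
Keying the status on the (file, problem) pair means that the list of files
affected by a problem and the list of problems affecting a file both come from the same data.<br>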
<br>
This page provides errata related to our IPSL-CM results only.<br>
<a class="moz-txt-link-freetext"
href="http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl">http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl</a>
<br>
<br>
The interesting part is that you can provide a list of tracking_ids
(example netcdf_tracking_id.txt attached). <br>
The system will tell you:<br>
- whether the file is from the latest dataset version or not. (Not
very useful information, I agree.)<br>
- if not, whether the file has really changed compared to the previous dataset
version. (This is useful: the dataset version changed but not the
file I'm interested in.)<br>
- the history of corrections made to those files (example: <a
class="moz-txt-link-freetext"
href="http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227">http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227</a>)
<br>
- if you don't have the latest version of a given file, you have
access to the list of problems that have been solved.<br>
- if you have the latest version of a given file BUT a problem
still needs to be solved, you can make an informed decision.<br>
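<br>
The answers listed above could be produced along the following lines. This is a
minimal sketch under stated assumptions: FileRecord, report and check are
hypothetical names, and the record fields are guesses at what such a system
would need to store, not the prototype's actual implementation.<br>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical errata record for one file, looked up by tracking_id."""
    in_latest_dataset: bool   # file belongs to the latest dataset version
    file_changed: bool        # file was replaced in a later dataset version
    solved: list = field(default_factory=list)    # problems fixed in newer versions
    pending: list = field(default_factory=list)   # problems not yet corrected


def report(record):
    """Return the messages the system would give for one file."""
    messages = []
    if record.in_latest_dataset:
        messages.append("file is from the latest dataset version")
    elif not record.file_changed:
        messages.append("dataset version changed, but this file did not")
    else:
        messages.append("a newer version of this file exists; problems solved: "
                        + ", ".join(record.solved))
    if record.pending:
        messages.append("problems still to be solved: " + ", ".join(record.pending))
    return messages


def check(tracking_ids, index):
    """Map each tracking_id to its report; unknown ids are flagged."""
    return {
        tid: report(index[tid]) if tid in index else ["unknown tracking_id"]
        for tid in tracking_ids
    }
```
</pre>
Here index stands for whatever lookup table the service keeps from tracking_id to
record. A user would pass in the tracking_ids read from the downloaded files.<br>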
<br>
I agree it needs some formal thinking. The attached pdf provides a
few steps towards this.<br>
<br>
We suggest that part of this information can be captured during
publication and after the fact (new published version = comments
and list of issues (tickets)).<br>
<br>
We suggest leveraging the ESGF search system as a placeholder
and the entry point for this information.<br>
<br>
File-level versioning is what the users want.<br>
<br>
thanks.<br>
Sébastien<br>
<br>
On 06/03/2013 10:38, Kettleborough, Jamie wrote:<br>
</div>
<blockquote
cite="mid:F9565F07C6F37743B851EF445FE7B6320224B9@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=ISO-8859-1">
<meta name="GENERATOR" content="MSHTML 8.00.6001.19400">
<div dir="ltr" align="left">
<div dir="ltr" align="left"><span class="903553309-06032013">Hello,</span></div>
<div dir="ltr" align="left"><span class="903553309-06032013"></span> </div>
<div dir="ltr" align="left"><span class="903553309-06032013">is
there a straw man document (or anything like that) around
thoughts/proposals on versioning in ESG? I think it would
be great to get some user review (both data providers and
data consumers) of this if possible.</span></div>
<div dir="ltr" align="left"><span class="903553309-06032013"></span> </div>
<div dir="ltr" align="left"><span class="903553309-06032013">Thanks,</span></div>
<div dir="ltr" align="left"><span class="903553309-06032013"></span> </div>
<div dir="ltr" align="left"><span class="903553309-06032013">Jamie</span></div>
<br>
</div>
<br>
<blockquote style="BORDER-LEFT: #000000 2px solid; PADDING-LEFT:
5px; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px" dir="ltr">
<div dir="ltr" class="OutlookMessageHeader" lang="en-us"
align="left">
<hr tabindex="-1">
<font face="Tahoma" size="2"><b>From:</b>
<a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech-bounces@ucar.edu">go-essp-tech-bounces@ucar.edu</a>
[<a class="moz-txt-link-freetext" href="mailto:go-essp-tech-bounces@ucar.edu">mailto:go-essp-tech-bounces@ucar.edu</a>]
<b>On Behalf Of </b>Christensen, Sigurd W.<br>
<b>Sent:</b> 05 March 2013 19:57<br>
<b>To:</b> <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a><br>
<b>Subject:</b> [Go-essp-tech] Towards versioning in ESG<br>
</font><br>
</div>
<div dir="ltr" align="left">
<p><font face="Arial">Folks,</font></p>
<p><font face="Arial">Thanks for the opportunity to discuss
versioning on today's call.</font></p>
<p> </p>
<p><span class="897273719-05032013"><font face="Arial">As
others have expressed, in the December 21 and March 4
postings on this topic, my main concern is that
versioning serve the needs of the end user. We should
provide an easy way for the end user to determine
whether data and metadata the user has previously
retrieved and used in an analysis is still current, or
has been revised in a way that might affect the
analysis. </font></span></p>
<p><span class="897273719-05032013"></span> </p>
<p><font face="Arial">I agreed to post to this list <span
class="897273719-05032013">a consideration</span> I
mentioned on <span class="897273719-05032013">today's</span>
call: observational datasets that routinely are extended
through time as current data become available. This
situation was also raised on this list by George Huffman
on December 21, 2012. I agree with his thought that
provoking a new version each time a new data increment is
added is unwieldy both for the data producers and for the
users.</font></p>
<p> </p>
<p><span class="897273719-05032013"><font face="Arial">I also
support George's notion that we consider the standards
for DOIs (Digital Object Identifiers) in conjunction
with the discussion of versioning.</font></span></p>
<p><span class="897273719-05032013"></span> </p>
<p><span class="897273719-05032013"><font face="Arial">A final
thought for now: I feel that we should make information
available to the users about what changed with a new
version.</font></span></p>
<p><span class="897273719-05032013"></span> </p>
<p><span class="897273719-05032013"><font face="Arial"> - Sig
Christensen</font></span></p>
<p><span class="897273719-05032013"></span> </p>
<p><span class="897273719-05032013"></span> </p>
<hr tabindex="-1">
<p><font face="Tahoma"><b>From:</b>
<a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech-bounces@ucar.edu">go-essp-tech-bounces@ucar.edu</a>
[<a class="moz-txt-link-freetext" href="mailto:go-essp-tech-bounces@ucar.edu">mailto:go-essp-tech-bounces@ucar.edu</a>]
<b>On Behalf Of </b>Drach, Bob<br>
<b>Sent:</b> Monday, March 04, 2013 21:26<br>
<b>To:</b> Taylor, Karl Taylor<br>
<b>Cc:</b> <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a><br>
<b>Subject:</b> Re: [Go-essp-tech] definition of dataset
version<br>
</font><br>
</p>
</div>
<div style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: #000000;
FONT-SIZE: 10pt">
Hi Karl,<br>
<br>
As you suggest, the broader question is what guidance we
should give to data providers and users on usage of the
dataset version, file tracking ID, and file checksum.<br>
<br>
It's true that the dataset version may not be of much use to
data users if they don't record when the data was downloaded.
But since the version indicates the date of publication, it
still might give some indication when a dataset has gone out
of date. The tracking ID is a random UUID generated by CMOR,
and is meant as a 'bar code' to track the data through ESGF.
Since it's a global attribute that is visible on the data
portal, it is relatively easy for a user to discover and
compare with the file value. However its usage and purpose
haven't been well defined, and in some cases data providers
have probably modified data in place without changing the
tracking ID (hopefully not too often). Checksums are
definitive, but trivial modifications can't be made without
changing the checksum.<br>
<br>
To answer your question, the timestamp in the ESGF SOLR index
is associated with the dataset as a whole, and indicates the
publication time.<br>
<br>
I'm opening the discussion to the GO-ESSP list for comments.<br>
<br>
--Bob<br>
<br>
<div style="FONT-FAMILY: Times New Roman; COLOR: rgb(0,0,0);
FONT-SIZE: 16px">
<hr tabindex="-1">
<div style="DIRECTION: ltr" id="divRpF600567"><font
face="Tahoma" size="2"><b>From:</b> Karl Taylor
[<a class="moz-txt-link-abbreviated" href="mailto:taylor13@llnl.gov">taylor13@llnl.gov</a>]<br>
<b>Sent:</b> Monday, March 04, 2013 3:45 PM<br>
<b>To:</b> Drach, Bob<br>
<b>Cc:</b> Williams, Dean N.; Painter, Jeff; Ganzberger,
Michael<br>
<b>Subject:</b> Re: definition of dataset version<br>
</font><br>
</div>
<div><font face="Times New Roman">Hi Bob,<br>
<br>
I think the "version numbers" assigned datasets are
pretty unhelpful to most users. Most users won't record
or remember what version they have downloaded. Perhaps
some users will know what *date* they downloaded data,
and all users can determine the tracking_id's and
chksums for their files, so we should provide support
for determining whether files are current based on this
information.<br>
<br>
Is the date recorded by ESGF assigned to a dataset or to
each file? If it's assigned to a dataset, then I'm not
sure that will be much use either.<br>
<br>
I think when a user asks us whether a file is current or
not, based on the checksum or tracking_id, we should
return the following information:<br>
<br>
"You have the latest version of this file" -- if the
checksum provided by the user is identical to the latest
file version in the CMIP archive.<br>
"A newer variant of the file exists, but differences are
unlikely to affect your analysis" -- if the only
changes made have been to some subset of the file's
global attributes that we think will not lead to
misinterpretation of the data itself.<br>
"A new version of the file exists and should be used in
place of the one you downloaded" -- otherwise<br>
<br>
We would list the set of global attributes that could be
wrong in case 2.<br>
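<br>
A minimal sketch of the three responses above, assuming the service already
knows which global attributes differ between the two files and whether the data
itself differs. The function name, its parameters and the HARMLESS_ATTRIBUTES
set are hypothetical. Which attributes count as harmless would have to be agreed on.<br>
<pre>
```python
# Hypothetical set of global attributes whose change is assumed not to
# lead to misinterpretation of the data.
HARMLESS_ATTRIBUTES = frozenset({"history", "contact", "references"})


def file_currency(user_checksum, latest_checksum, changed_attributes, data_changed):
    """Return one of the three answers for a user's copy of a file."""
    # Case 1: identical checksum means the user has the latest file.
    if user_checksum == latest_checksum:
        return "You have the latest version of this file"
    changed = set(changed_attributes)
    # Case 2: only harmless global attributes differ; list them for the user.
    if not data_changed and changed and changed.issubset(HARMLESS_ATTRIBUTES):
        return ("A newer variant of the file exists, but differences are "
                "unlikely to affect your analysis (changed attributes: "
                + ", ".join(sorted(changed)) + ")")
    # Case 3: anything else.
    return ("A new version of the file exists and should be used in place "
            "of the one you downloaded")
```
</pre>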
<br>
We could use tracking_id's rather than chksums, but we
would have to weed out the cases where a critically
important global attribute had been modified, but the
tracking_id hadn't. [I'd guess that there aren't any
cases where the data itself has been modified without
changing the chksum, but there might be quite a few
cases where important global attributes have been
changed.]<br>
<br>
Would the above be practical?<br>
<br>
Karl<br>
<br>
<br>
</font>
<div class="moz-cite-prefix">On 3/4/13 1:21 PM, Drach, Bob
wrote:<br>
</div>
<blockquote type="cite">
<style id="owaParaStyle" type="text/css">P {
        MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px
}
BODY {
        FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: #000000; FONT-SIZE: 10pt
}
</style>
<div style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR:
rgb(0,0,0); FONT-SIZE: 10pt">
Hi Karl,<br>
<br>
Dean requested that we have a conversation about
dataset versioning on the GO-ESSP telecon tomorrow.
I'm curious about your views on the subject.
<br>
<br>
Specifically, the question arose for the case where a
modeling group has regenerated data through CMOR, to
replace data lost in a disk crash. The data providers
assert that the data is identical to the published
version. However, because it has been regenerated the
checksums and tracking IDs differ. The question is
whether the data should be published with the previous
version number or should be considered a new version.<br>
<br>
At the moment we leave the choice to the data
publishers, and the publishing client by default
generates a new version number when any file in a
dataset has been added, deleted, or modified. However,
this leaves some ambiguous cases, such as when:<br>
<br>
- the metadata has been modified, but the actual data
is unchanged;<br>
- the data has been regenerated through CMOR, such
that all data and metadata fields are unchanged, with
the sole exception of the tracking ID (and therefore
the checksum has changed as well).<br>
<br>
My opinion is that an updated version number should be
a signal to the end users that something significant
has changed that is worth their attention. If nothing
has changed except the tracking ID and history
attributes, the dataset should be republished with the
original version number. There may be similar cases
where minor metadata modifications don't warrant a new
version number. On the other hand, modification of
metadata that guides processing - axis definitions,
units, dataset identification fields, etc., should
trigger a new version number.<br>
<br>
This approach has the implication that the tracking ID
and checksum of a file could change even though the
parent dataset version stays the same.<br>
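<br>
The rule described above can be sketched as follows. This is only an
illustration: the function name and the labels in SIGNIFICANT are hypothetical
stand-ins for "data, axis definitions, units, dataset identification fields",
and the publishing client does not currently work this way.<br>
<pre>
```python
# Changes that alone should not trigger a new version number.
IGNORABLE = frozenset({"tracking_id", "history"})
# Hypothetical labels for changes that guide processing or alter the data.
SIGNIFICANT = frozenset({"data", "axis_definitions", "units", "dataset_id"})


def version_decision(changed_fields):
    """Decide how to republish given the set of fields that changed."""
    changed = set(changed_fields) - IGNORABLE
    if not changed:
        # Only the tracking ID and/or history changed.
        return "republish with original version"
    if changed.intersection(SIGNIFICANT):
        return "new version"
    # Minor metadata changes remain a judgment call for the publisher.
    return "publisher decides"
```
</pre>
For the regenerated-data case, version_decision(["tracking_id", "history"]) keeps
the original version even though the checksum of each file has changed.<br>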
<br>
Any thoughts on the matter?<br>
<br>
--Bob<br>
</div>
</blockquote>
<br>
</div>
</div>
</div>
</blockquote>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
GO-ESSP-TECH mailing list
<a class="moz-txt-link-abbreviated" href="mailto:GO-ESSP-TECH@ucar.edu">GO-ESSP-TECH@ucar.edu</a>
<a class="moz-txt-link-freetext" href="http://mailman.ucar.edu/mailman/listinfo/go-essp-tech">http://mailman.ucar.edu/mailman/listinfo/go-essp-tech</a>
</pre>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Sébastien Denvil
IPSL, Pôle de modélisation du climat
UPMC, Case 101, 4 place Jussieu,
75252 Paris Cedex 5
Tour 45-55 2ème étage Bureau 209
Tel: 33 1 44 27 21 10
Fax: 33 1 44 27 39 02
</pre>
</body>
</html>