<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML dir=ltr><HEAD>

<META content="text/html; charset=us-ascii" http-equiv=Content-Type>

<META name=GENERATOR content="MSHTML 8.00.6001.19400"></HEAD>

<BODY bgColor=#ffffff fpstyle="1" ocsi="0">

<DIV dir=ltr align=left>

<P><FONT face=Arial>Folks,</FONT></P>

<P><FONT face=Arial>Thanks for the opportunity to discuss versioning on today's 

call.</FONT></P>

<P><FONT face=Arial></FONT>&nbsp;</P>

<P><SPAN class=897273719-05032013><FONT face=Arial>As others expressed in the 

December 21 and March 4 postings on this topic, my main concern is that 

versioning serve the needs of the end user.&nbsp; We should provide an easy 

way for the end user to determine whether data and metadata the user has 

previously retrieved and used in an analysis are still current, or have been 

revised in a way that might affect the 

analysis.&nbsp;</FONT></SPAN></P>

<P><SPAN class=897273719-05032013><FONT face=Arial></FONT></SPAN><FONT 

face=Arial></FONT>&nbsp;</P>

<P><FONT face=Arial>I agreed to post to this list&nbsp;<SPAN 

class=897273719-05032013>a consideration</SPAN>&nbsp;I mentioned on&nbsp;<SPAN 

class=897273719-05032013>today's</SPAN> call: observational datasets that 

are routinely extended through time as current data become available. This 

situation was also raised on this list by George Huffman on December 21, 2012. I 

agree with his view that triggering a new version each time a new data 

increment is added is unwieldy both for the data producers and for the 

users.</FONT></P>

<P><FONT face=Arial></FONT>&nbsp;</P>

<P><SPAN class=897273719-05032013><FONT face=Arial>I also support George's 

notion that we consider the standards for DOIs (Digital Object Identifiers) in 

conjunction with the discussion of versioning.</FONT></SPAN></P>

<P><SPAN class=897273719-05032013><FONT face=Arial></FONT></SPAN>&nbsp;</P>

<P><SPAN class=897273719-05032013><FONT face=Arial>A final thought for now: 

I&nbsp;feel that&nbsp;we should tell users what changed in each new 

version.</FONT></SPAN></P>

<P><SPAN class=897273719-05032013><FONT face=Arial></FONT></SPAN>&nbsp;</P>

<P><SPAN class=897273719-05032013><FONT face=Arial>&nbsp; - Sig 

Christensen</FONT></SPAN></P>

<P><SPAN class=897273719-05032013><FONT face=Arial></FONT></SPAN>&nbsp;</P>

<P><SPAN class=897273719-05032013><FONT face=Arial></FONT></SPAN>&nbsp;</P>

<P>

<HR tabIndex=-1>

</P>

<P><FONT face=Tahoma><B>From:</B> go-essp-tech-bounces@ucar.edu 

[mailto:go-essp-tech-bounces@ucar.edu] <B>On Behalf Of </B>Drach, 

Bob<BR><B>Sent:</B> Monday, March 04, 2013 21:26<BR><B>To:</B> Taylor, Karl 

Taylor<BR><B>Cc:</B> go-essp-tech@ucar.edu<BR><B>Subject:</B> Re: [Go-essp-tech] 

definition of dataset version<BR></FONT><BR></P></DIV>

<DIV></DIV>

<DIV 

style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: #000000; FONT-SIZE: 10pt">Hi 

Karl,<BR><BR>As you suggest, the broader question is what guidance we should 

give to data providers and users on usage of the dataset version, file tracking 

ID, and file checksum.<BR><BR>It's true that the dataset version may not be of 

much use to data users if they don't record when the data was downloaded. But 

since the version indicates the date of publication, it still might give some 

indication of when a dataset has gone out of date. The tracking ID is a random UUID 

generated by CMOR, and is meant as a 'bar code' to track the data through ESGF. 

Since it's a global attribute that is visible on the data portal, it is 

relatively easy for a user to discover and compare with the file value. However, 

its usage and purpose haven't been well defined, and in some cases data 

providers have probably modified data in place without changing the tracking ID 

(hopefully not too often). Checksums are definitive, but even trivial modifications 

change the checksum.<BR><BR>To answer your question, the 

timestamp in the ESGF SOLR index is associated with the dataset as a whole, and 

indicates the publication time.<BR><BR>I'm opening the discussion to the GO-ESSP 

list for comments.<BR><BR>--Bob<BR><BR>
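<P>To illustrate the point about tracking IDs and checksums, here is a minimal sketch. It assumes a toy file layout (a text header carrying the tracking_id, followed by the data) and SHA-256 as the checksum; real CMOR output is netCDF, and the helper names <CODE>make_file_bytes</CODE> and <CODE>checksum</CODE> are hypothetical. It only shows that rewriting identical data with a fresh random UUID yields a different checksum.</P>

```python
import hashlib
import uuid


def make_file_bytes(data: bytes, tracking_id: str) -> bytes:
    """Toy stand-in for a CMOR-written file (assumed layout, not netCDF):
    one header line holding the tracking_id, then the data payload."""
    header = f"tracking_id={tracking_id}\n".encode("ascii")
    return header + data


def checksum(file_bytes: bytes) -> str:
    """Hex digest over the whole file, attributes included (SHA-256 assumed)."""
    return hashlib.sha256(file_bytes).hexdigest()


payload = b"tas: 287.1 287.4 288.0"  # made-up data values

# The same data written twice; each write gets its own random UUID.
original = make_file_bytes(payload, str(uuid.uuid4()))
regenerated = make_file_bytes(payload, str(uuid.uuid4()))

# Compare the payloads (everything after the header line) and the checksums.
same_data = original.split(b"\n", 1)[1] == regenerated.split(b"\n", 1)[1]
same_checksum = checksum(original) == checksum(regenerated)

print("data identical:    ", same_data)
print("checksum identical:", same_checksum)
```

<P>In this sketch the payloads match while the whole-file checksums do not, so a checksum comparison alone cannot tell a trivial attribute change from a change to the data.</P>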

<DIV style="FONT-FAMILY: Times New Roman; COLOR: rgb(0,0,0); FONT-SIZE: 16px">

<HR tabIndex=-1>


<DIV style="DIRECTION: ltr" id=divRpF600567><FONT size=2 

face=Tahoma><B>From:</B> Karl Taylor [taylor13@llnl.gov]<BR><B>Sent:</B> Monday, 

March 04, 2013 3:45 PM<BR><B>To:</B> Drach, Bob<BR><B>Cc:</B> Williams, Dean N.; 

Painter, Jeff; Ganzberger, Michael<BR><B>Subject:</B> Re: definition of dataset 

version<BR></FONT><BR></DIV>

<DIV></DIV>

<DIV><FONT face="Times New Roman">Hi Bob,<BR><BR>I think the "version numbers" 

assigned to datasets are pretty unhelpful to most users.&nbsp; Most users won't 

record or remember what version they have downloaded.&nbsp; Perhaps some users 

will know what *date* they downloaded data, and all users can determine the 

tracking_id's and chksums for their files, so we should provide support for 

determining whether files are current based on this information.<BR><BR>Is the 

date recorded by ESGF assigned to a dataset or to each file? &nbsp; If it's 

assigned to a dataset, then I'm not sure that will be much use either.<BR><BR>I 

think when a user asks us whether a file is current or not, based on the 

checksum or tracking_id, we should return the following information:<BR><BR>"You 

have the latest version of this file"&nbsp; -- if the checksum provided by the 

user is identical to the latest file version in the CMIP archive.<BR>"A newer 

variant of the file exists, but differences are unlikely to affect your 

analysis"&nbsp; --&nbsp; if the only changes made have been to some subset of 

the file's global attributes that we think will not lead to misinterpretation of 

the data itself.<BR>"A new version of the file exists and should be used in 

place of the one you downloaded"&nbsp; --&nbsp; otherwise<BR><BR>We would list 

the set of global attributes that could be wrong in case 2.<BR><BR>We could use 

tracking_id's rather than chksums, but we would have to weed out the cases where 

a critically important global attribute had been modified, but the tracking_id 

hadn't. &nbsp; [I'd guess that there aren't any cases where the data itself has 

been modified without changing the chksum, but there might be quite a few cases 

where important global attributes have been changed.]<BR><BR>Would the above be 

practical?<BR><BR>Karl<BR><BR><BR></FONT>
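<P>The three responses proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the archive is assumed to hold a record for both the user's file and the latest file, a separate digest of the variable data is assumed to be available, and the names <CODE>FileRecord</CODE>, <CODE>classify</CODE>, and <CODE>BENIGN_ATTRIBUTES</CODE> (including its contents) are hypothetical, not part of ESGF or CMOR.</P>

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set, Tuple


class Status(Enum):
    LATEST = "You have the latest version of this file"
    MINOR = ("A newer variant of the file exists, but differences are "
             "unlikely to affect your analysis")
    REPLACE = ("A new version of the file exists and should be used in "
               "place of the one you downloaded")


# Hypothetical set of global attributes whose modification is judged
# unlikely to lead to misinterpretation of the data.
BENIGN_ATTRIBUTES = frozenset({"history", "tracking_id", "creation_date"})


@dataclass(frozen=True)
class FileRecord:
    checksum: str                  # checksum of the whole file
    data_digest: str               # assumed digest of the variable data only
    global_attrs: Dict[str, str]


def changed_attrs(old: FileRecord, new: FileRecord) -> Set[str]:
    """Names of global attributes that differ between the two records."""
    keys = set(old.global_attrs) | set(new.global_attrs)
    return {k for k in keys
            if old.global_attrs.get(k) != new.global_attrs.get(k)}


def classify(user_file: FileRecord, latest: FileRecord) -> Tuple[Status, Set[str]]:
    """Return the status message and the attributes that differ."""
    if user_file.checksum == latest.checksum:
        return Status.LATEST, set()
    changed = changed_attrs(user_file, latest)
    only_benign = bool(changed) and changed.issubset(BENIGN_ATTRIBUTES)
    if user_file.data_digest == latest.data_digest and only_benign:
        return Status.MINOR, changed
    return Status.REPLACE, changed


old = FileRecord("c1", "d1", {"history": "2012-06-01 CMOR", "units": "K"})
new = FileRecord("c2", "d1", {"history": "2013-02-10 CMOR", "units": "K"})
status, attrs = classify(old, new)
print(status.value, sorted(attrs))
```

<P>Returning the set of changed attributes alongside the status is one way to list, for the second case, which global attributes may be out of date in the user's copy.</P>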

<DIV class=moz-cite-prefix>On 3/4/13 1:21 PM, Drach, Bob wrote:<BR></DIV>

<BLOCKQUOTE type="cite">

  <STYLE id=owaParaStyle type=text/css>

<!--

p

        {margin-top:0;

        margin-bottom:0}

-->

BODY {direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;}P {margin-top:0;margin-bottom:0;}BODY {scrollbar-base-color:undefined;scrollbar-highlight-color:undefined;scrollbar-darkshadow-color:undefined;scrollbar-track-color:undefined;scrollbar-arrow-color:undefined}</STYLE>


  <DIV 

  style="FONT-FAMILY: Tahoma; DIRECTION: ltr; COLOR: rgb(0,0,0); FONT-SIZE: 10pt">Hi 

  Karl,<BR><BR>Dean requested that we have a conversation about dataset 

  versioning on the GO-ESSP telecon tomorrow. I'm curious about your views on 

  the subject. <BR><BR>Specifically, the question arose for the case where a 

  modeling group has regenerated data through CMOR, to replace data lost in a 

  disk crash. The data providers assert that the data is identical to the 

  published version. However, because it has been regenerated the checksums and 

  tracking IDs differ. The question is whether the data should be published with 

  the previous version number or should be considered a new version.<BR><BR>At 

  the moment we leave the choice to the data publishers, and the publishing 

  client by default generates a new version number when any file in a dataset 

  has been added, deleted, or modified. However, this leaves some ambiguous 

  cases, such as when:<BR><BR>- the metadata has been modified, but the actual 

  data is unchanged;<BR>- the data has been regenerated through CMOR, such that 

  all data and metadata fields are unchanged, with the sole exception of the 

  tracking ID (and therefore the checksum has changed as well).<BR><BR>My 

  opinion is that an updated version number should be a signal to the end users 

  that something significant has changed that is worth their attention. If 

  nothing has changed except the tracking ID and history attributes, the dataset 

  should be republished with the original version number. There may be similar 

  cases where minor metadata modifications don't warrant a new version number. 

  On the other hand, modification of metadata that guides processing (axis 

  definitions, units, dataset identification fields, etc.) should trigger a new 

  version number.<BR><BR>This approach has the implication that the tracking ID 

  and checksum of a file could change even though the parent dataset version 

  stays the same.<BR><BR>Any thoughts on the 

matter?<BR><BR>--Bob<BR></DIV></BLOCKQUOTE><BR></DIV></DIV></DIV></BODY></HTML>