[Go-essp-tech] definition of dataset version

Mon Mar 4 19:26:27 MST 2013

Hi Karl,

As you suggest, the broader question is what guidance we should give to data providers and users on usage of the dataset version, file tracking ID, and file checksum.

It's true that the dataset version may not be of much use to data users if they don't record when the data was downloaded. But since the version indicates the date of publication, it still might give some indication of when a dataset has gone out of date. The tracking ID is a random UUID generated by CMOR, and is meant as a 'bar code' to track the data through ESGF. Since it's a global attribute that is visible on the data portal, it is relatively easy for a user to discover and compare with the value in their own file. However, its usage and purpose haven't been well defined, and in some cases data providers have probably modified data in place without changing the tracking ID (hopefully not too often). Checksums are definitive, but even trivial modifications can't be made without changing the checksum.
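To illustrate the checksum comparison, here is a minimal Python sketch of how a user might check a downloaded file against a published checksum. It assumes SHA-256 and a checksum string copied by hand from the portal; the algorithm actually recorded for a given file may differ, and the function names are made up for this example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_checksum, algorithm="sha256"):
    """Compare a local file with a published checksum (hypothetical helper).

    The published value is normalized for case and surrounding whitespace.
    """
    return file_checksum(path, algorithm) == published_checksum.strip().lower()
```

Any in-place change to the file, including a change to a single global attribute, makes matches_published return False, which is why the checksum alone cannot distinguish trivial changes from significant ones.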

To answer your question, the timestamp in the ESGF SOLR index is associated with the dataset as a whole, and indicates the publication time.

I'm opening the discussion to the GO-ESSP list for comments.

--Bob

________________________________
From: Karl Taylor [taylor13 at llnl.gov]
Sent: Monday, March 04, 2013 3:45 PM
To: Drach, Bob
Cc: Williams, Dean N.; Painter, Jeff; Ganzberger, Michael
Subject: Re: definition of dataset version

Hi Bob,

I think the "version numbers" assigned to datasets are pretty unhelpful to most users.  Most users won't record or remember which version they have downloaded.  Perhaps some users will know what *date* they downloaded data, and all users can determine the tracking_id's and chksums for their files, so we should provide support for determining whether files are current based on this information.

Is the date recorded by ESGF assigned to a dataset or to each file?   If it's assigned to a dataset, then I'm not sure that will be much use either.

I think when a user asks us whether a file is current or not, based on the checksum or tracking_id, we should return the following information:

1. "You have the latest version of this file"  -- if the checksum provided by the user is identical to the latest file version in the CMIP archive.
2. "A newer variant of the file exists, but differences are unlikely to affect your analysis"  --  if the only changes made have been to some subset of the file's global attributes that we think will not lead to misinterpretation of the data itself.
3. "A new version of the file exists and should be used in place of the one you downloaded"  --  otherwise

We would list the set of global attributes that could be wrong in case 2.
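The three responses could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the archive is assumed to keep, for every file version it has published, the file checksum, a separate digest of the data payload, and the global attributes. FileRecord, data_digest, and the contents of NONCRITICAL_ATTRIBUTES are hypothetical names and choices, not existing ESGF structures.

```python
from dataclasses import dataclass

LATEST = "You have the latest version of this file"
VARIANT = ("A newer variant of the file exists, but differences are "
           "unlikely to affect your analysis")
REPLACE = ("A new version of the file exists and should be used in "
           "place of the one you downloaded")

# Hypothetical set: attributes whose change alone is assumed to be harmless.
NONCRITICAL_ATTRIBUTES = frozenset({"tracking_id", "history", "creation_date"})


@dataclass
class FileRecord:
    checksum: str            # checksum of the whole file
    data_digest: str         # hypothetical digest of the data, excluding global attributes
    global_attributes: dict


def changed_attributes(old, new):
    """Names of global attributes that were added, removed, or modified."""
    return {key for key in set(old) | set(new) if old.get(key) != new.get(key)}


def currency_status(user_checksum, known_files, latest):
    """Return (message, sorted list of changed attribute names).

    known_files maps the checksum of every published file version to its record.
    """
    if user_checksum == latest.checksum:
        return LATEST, []
    mine = known_files.get(user_checksum)
    if mine is None:
        # Unknown checksum: nothing can be compared, so fall through to "otherwise".
        return REPLACE, []
    changed = changed_attributes(mine.global_attributes, latest.global_attributes)
    if mine.data_digest == latest.data_digest and changed <= NONCRITICAL_ATTRIBUTES:
        return VARIANT, sorted(changed)
    return REPLACE, sorted(changed)
```

The second element of the result is the list of global attributes that differ, which is what would be reported to the user in the second case.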

We could use tracking_id's rather than chksums, but we would have to weed out the cases where a critically important global attribute had been modified, but the tracking_id hadn't.   [I'd guess that there aren't any cases where the data itself has been modified without changing the chksum, but there might be quite a few cases where important global attributes have been changed.]

Would the above be practical?

Karl

On 3/4/13 1:21 PM, Drach, Bob wrote:
Hi Karl,

Dean requested that we have a conversation about dataset versioning on the GO-ESSP telecon tomorrow. I'm curious about your views on the subject.

Specifically, the question arose for the case where a modeling group has regenerated data through CMOR, to replace data lost in a disk crash. The data providers assert that the data is identical to the published version. However, because it has been regenerated the checksums and tracking IDs differ. The question is whether the data should be published with the previous version number or should be considered a new version.

At the moment we leave the choice to the data publishers, and the publishing client by default generates a new version number when any file in a dataset has been added, deleted, or modified. However, this leaves some ambiguous cases, such as when:

- the metadata has been modified, but the actual data is unchanged;
- the data has been regenerated through CMOR, such that all data and metadata fields are unchanged, with the sole exception of the tracking ID (and therefore the checksum has changed as well).

My opinion is that an updated version number should be a signal to the end users that something significant has changed that is worth their attention. If nothing has changed except the tracking ID and history attributes, the dataset should be republished with the original version number. There may be similar cases where minor metadata modifications don't warrant a new version number. On the other hand, modification of metadata that guides processing (axis definitions, units, dataset identification fields, etc.) should trigger a new version number.

This approach has the implication that the tracking ID and checksum of a file could change even though the parent dataset version stays the same.
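As a rough sketch of this policy in Python, under several assumptions: each file is described by a digest of its data plus its global attributes, the version number is an integer publication date of the form YYYYMMDD, and MINOR_ATTRIBUTES holds exactly the attributes named above. The function names and the dictionary layout are hypothetical and do not describe the existing publishing client.

```python
from datetime import date

# Attributes whose change alone does not warrant a new dataset version.
# Assumption: only the two named above; the set could be extended.
MINOR_ATTRIBUTES = frozenset({"tracking_id", "history"})


def needs_new_version(old_files, new_files):
    """Decide whether republication should get a new version number.

    Both arguments map a file name to a dict with the keys
    'data_digest' (digest of the data only) and 'attributes' (global attributes).
    """
    if set(old_files) != set(new_files):
        return True  # a file was added or deleted
    for name, old in old_files.items():
        new = new_files[name]
        if old["data_digest"] != new["data_digest"]:
            return True  # the data itself changed
        keys = set(old["attributes"]) | set(new["attributes"])
        changed = {k for k in keys
                   if old["attributes"].get(k) != new["attributes"].get(k)}
        if not changed <= MINOR_ATTRIBUTES:
            return True  # metadata other than tracking ID and history changed
    return False


def version_for_publication(current_version, old_files, new_files, today=None):
    """Keep the current version, or return today's date as YYYYMMDD."""
    if needs_new_version(old_files, new_files):
        return int((today or date.today()).strftime("%Y%m%d"))
    return current_version
```

In the disk-crash case, where only tracking_id and history differ, needs_new_version returns False and the dataset keeps its original version even though every file checksum has changed.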

Any thoughts on the matter?

--Bob
