[Go-essp-tech] Towards versioning in ESG

Sébastien Denvil sebastien.denvil at ipsl.jussieu.fr
Wed Mar 6 03:14:08 MST 2013


Hi folks,

To add to this important topic, I would like to raise a few comments and 
highlight possible guidance we could give, from a data producer and 
provider perspective.

Sorry for the long email, but it was not easy to condense it further.

1. Versions as we have them now are too high level (dataset level) to be 
useful to users. They are in some sense useful to data providers, but 
clearly not sufficient in that context either.
2. tracking_ids are very useful. As things stand now, they are the most 
robust basis we have for building a version information system for users.
3. Checksums are useful but not at all error prone, and they are costly. 
A few months back it was not a good idea to build a version information 
system for users on top of them.
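
To illustrate the cost in point 3: a checksum requires reading every byte 
of a file, whereas a tracking_id is a single global attribute. A minimal 
sketch in Python, assuming SHA-256 (this thread does not say which 
algorithm the archive records) and a hypothetical helper name 
file_checksum:

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    The whole file is read once, so the cost grows with the file size.
    The algorithm name is an assumption; use whatever the archive records.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large files, but the run time 
is still proportional to the volume of data to be checked.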

We developed a prototype version information system for users. It 
highlights the methodological approach and only covers IPSL results.

1. We need a list of problems.
2. We need a list of the files affected by a given problem.
3. We need a list of (file, problem) statuses, i.e. (corrected, not corrected).
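
The three lists above can be sketched as a small data model. This is only 
an illustration, not the code behind our prototype: the class and method 
names (Problem, Errata, mark, files_affected, open_problems) are 
hypothetical, and files are assumed to be identified by their tracking_id.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    CORRECTED = "corrected"
    NOT_CORRECTED = "not corrected"


@dataclass(frozen=True)
class Problem:
    problem_id: int      # hypothetical identifier, e.g. an erratum number
    description: str


@dataclass
class Errata:
    # List 1: problem_id -> Problem
    problems: dict = field(default_factory=dict)
    # Lists 2 and 3: (tracking_id, problem_id) -> Status
    status: dict = field(default_factory=dict)

    def add_problem(self, problem):
        self.problems[problem.problem_id] = problem

    def mark(self, tracking_id, problem_id, status):
        """Record that a file is affected by a problem, and whether it is fixed."""
        if problem_id not in self.problems:
            raise KeyError(f"unknown problem {problem_id}")
        self.status[(tracking_id, problem_id)] = status

    def files_affected(self, problem_id):
        """Return the tracking_ids of all files affected by one problem."""
        return sorted(t for (t, p) in self.status if p == problem_id)

    def open_problems(self, tracking_id):
        """Return the problems not yet corrected in one file."""
        return [
            self.problems[p]
            for (t, p), s in self.status.items()
            if t == tracking_id and s is Status.NOT_CORRECTED
        ]
```

Keeping the status on the (file, problem) pair, rather than on the file or 
the problem alone, is what lets one problem be corrected in some files and 
still open in others.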

This page provides errata related to our IPSL-CM results only.
http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl

The interesting part is that you can provide a list of tracking_ids 
(example netcdf_tracking_id.txt attached).
The system will tell you:
- whether the file is from the latest dataset version or not. (Not very 
useful information, I agree.)
- if not, whether the file has really changed compared to the previous 
dataset version. (This is useful: the dataset version changed but not 
the file I'm interested in.)
- the history of corrections made to those files (example: 
http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227).
- if you don't have the latest version of a given file, you have access 
to the list of problems that have been solved.
- if you have the latest version of a given file BUT a problem still 
needs to be solved, you can make a proper decision.
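
The kind of lookup described above can be sketched as follows. This is a 
rough illustration under stated assumptions, not the actual service: the 
FileRecord fields, the report function and the in-memory registry are all 
hypothetical, and "the file changed" is approximated by a changed 
tracking_id, which would overstate changes for data regenerated through 
CMOR without modification.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    tracking_id: str
    dataset_version: str          # dataset version this file was published in
    latest_dataset_version: str   # newest published version of that dataset
    latest_tracking_id: str       # tracking_id of this file in the newest version
    corrections: list = field(default_factory=list)   # problems fixed in this file
    fixed_later: list = field(default_factory=list)   # fixed only in a newer file
    still_open: list = field(default_factory=list)    # known but not yet corrected


def report(tracking_ids, registry):
    """Return one status line per distinct tracking_id, plus problem details."""
    seen = set()
    lines = []
    for tid in tracking_ids:
        tid = tid.strip()
        if not tid or tid in seen:   # skip blank lines and repeated IDs
            continue
        seen.add(tid)
        rec = registry.get(tid)
        if rec is None:
            lines.append(f"{tid}: unknown tracking_id")
            continue
        if rec.dataset_version == rec.latest_dataset_version:
            state = "from the latest dataset version"
        elif rec.tracking_id != rec.latest_tracking_id:
            state = "superseded by a newer file"
        else:
            state = "dataset version changed, file unchanged"
        lines.append(f"{tid}: {state}")
        details = (
            ("corrected", rec.corrections),
            ("fixed in newer file", rec.fixed_later),
            ("still open", rec.still_open),
        )
        for label, items in details:
            for item in items:
                lines.append(f"  {label}: {item}")
    return lines
```

With a text file holding one tracking_id per line, something like 
report(open("netcdf_tracking_id.txt"), registry) would produce the report, 
given a registry dict keyed by tracking_id.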

I agree it needs some formal thinking. The attached pdf provides a few 
steps towards this.

We suggest that part of this information can be captured during 
publication and after the fact (new published version = comments and 
list of issues (tickets)).

We suggest leveraging the ESGF search system as the placeholder and 
entry point for this information.

File level versioning is what the users want.

Thanks.
Sébastien

On 06/03/2013 10:38, Kettleborough, Jamie wrote:
> Hello,
> is there a straw man document (or anything like that) around 
> thoughts/proposals on versioning in ESG?  I think it would be great to 
> get some user review (both data providers and data consumers) of this 
> if possible.
> Thanks,
> Jamie
>
>
>     ------------------------------------------------------------------------
>     *From:* go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Christensen,
>     Sigurd W.
>     *Sent:* 05 March 2013 19:57
>     *To:* go-essp-tech at ucar.edu
>     *Subject:* [Go-essp-tech] Towards versioning in ESG
>
>     Folks,
>
>     Thanks for the opportunity to discuss versioning on today's call.
>
>     As others have expressed, in the December 21 and March 4 postings
>     on this topic, my main concern is that versioning serve the needs
>     of the end user.  We should provide an easy way for the end user
>     to determine whether data and metadata the user has previously
>     retrieved and used in an analysis is still current, or has been
>     revised in a way that might affect the analysis.
>
>     I agreed to post to this list a consideration I mentioned on
>     today's call: observational datasets that routinely are extended
>     through time as current data become available. This situation was
>     also raised on this list by George Huffman on December 21, 2012. I
>     agree with his thought that provoking a new version each time a
>     new data increment is added is unwieldy both for the data
>     producers and for the users.
>
>     I also support George's notion that we consider the standards for
>     DOIs (Digital Object Identifiers) in conjunction with the
>     discussion of versioning.
>
>     A final thought for now: I feel that we should make information
>     available to the users about what changed with a new version.
>
>       - Sig Christensen
>
>     ------------------------------------------------------------------------
>
>     *From:* go-essp-tech-bounces at ucar.edu
>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Drach, Bob
>     *Sent:* Monday, March 04, 2013 21:26
>     *To:* Taylor, Karl Taylor
>     *Cc:* go-essp-tech at ucar.edu
>     *Subject:* Re: [Go-essp-tech] definition of dataset version
>
>     Hi Karl,
>
>     As you suggest, the broader question is what guidance we should
>     give to data providers and users on usage of the dataset version,
>     file tracking ID, and file checksum.
>
>     It's true that the dataset version may not be of much use to data
>     users if they don't record when the data was downloaded. But since
>     the version indicates the date of publication, it still might give
>     some indication when a dataset has gone out of date. The tracking
>     ID is a random UUID generated by CMOR, and is meant as a 'bar
>     code' to track the data through ESGF. Since it's a global
>     attribute that is visible on the data portal, it is relatively
>     easy for a user to discover and compare with the file value.
>     However its usage and purpose haven't been well defined, and in
>     some cases data providers have probably modified data in place
>     without changing the tracking ID (hopefully not too often).
>     Checksums are definitive, but trivial modifications can't be made
>     without changing the checksum.
>
>     To answer your question, the timestamp in the ESGF SOLR index is
>     associated with the dataset as a whole, and indicates the
>     publication time.
>
>     I'm opening the discussion to the GO-ESSP list for comments.
>
>     --Bob
>
>     ------------------------------------------------------------------------
>     *From:* Karl Taylor [taylor13 at llnl.gov]
>     *Sent:* Monday, March 04, 2013 3:45 PM
>     *To:* Drach, Bob
>     *Cc:* Williams, Dean N.; Painter, Jeff; Ganzberger, Michael
>     *Subject:* Re: definition of dataset version
>
>     Hi Bob,
>
>     I think the "version numbers" assigned datasets are pretty
>     unhelpful to most users.  Most users won't record or remember what
>     version they have downloaded.  Perhaps some users will know what
>     *date* they downloaded data, and all users can determine the
>     tracking_id's and chksums for their files, so we should provide
>     support for determining whether files are current based on this
>     information.
>
>     Is the date recorded by ESGF assigned to a dataset or to each
>     file?   If it's assigned to a dataset, then I'm not sure that will
>     be much use either.
>
>     I think when a user asks us whether a file is current or not,
>     based on the checksum or tracking_id, we should return the
>     following information:
>
>     "You have the latest version of this file"  -- if the checksum
>     provided by the user is identical to the latest file version in
>     the CMIP archive.
>     "A newer variant of the file exists, but differences are unlikely
>     to affect your analysis"  --  if the only changes made have been
>     to some subset of the file's global attributes that we think will
>     not lead to misinterpretation of the data itself.
>     "A new version of the file exists and should be used in place of
>     the one you downloaded"  --  otherwise
>
>     We would list the set of global attributes that could be wrong in
>     case 2.
>
>     We could use tracking_id's rather than chksums, but we would have
>     to weed out the cases where a critically important global
>     attribute had been modified, but the tracking_id hadn't.   [I'd
>     guess that there aren't any cases where the data itself has been
>     modified without changing the chksum, but there might be quite a
>     few cases where important global attributes have been changed.]
>
>     Would the above be practical?
>
>     Karl
>
>
>     On 3/4/13 1:21 PM, Drach, Bob wrote:
>>     Hi Karl,
>>
>>     Dean requested that we have a conversation about dataset
>>     versioning on the GO-ESSP telecon tomorrow. I'm curious about
>>     your views on the subject.
>>
>>     Specifically, the question arose for the case where a modeling
>>     group has regenerated data through CMOR, to replace data lost in
>>     a disk crash. The data providers assert that the data is
>>     identical to the published version. However, because it has been
>>     regenerated the checksums and tracking IDs differ. The question
>>     is whether the data should be published with the previous version
>>     number or should be considered a new version.
>>
>>     At the moment we leave the choice to the data publishers, and the
>>     publishing client by default generates a new version number when
>>     any file in a dataset has been added, deleted, or modified.
>>     However, this leaves some ambiguous cases, such as when:
>>
>>     - the metadata has been modified, but the actual data is unchanged;
>>     - the data has been regenerated through CMOR, such that all data
>>     and metadata fields are unchanged, with the sole exception of the
>>     tracking ID (and therefore the checksum has changed as well).
>>
>>     My opinion is that an updated version number should be a signal
>>     to the end users that something significant has changed that is
>>     worth their attention. If nothing has changed except the tracking
>>     ID and history attributes, the dataset should be republished with
>>     the original version number. There may be similar cases where
>>     minor metadata modifications don't warrant a new version number.
>>     On the other hand, modification of metadata that guides
>>     processing - axis definitions, units, dataset identification
>>     fields, etc., should trigger a new version number.
>>
>>     This approach has the implication that the tracking ID and
>>     checksum of a file could change even though the parent dataset
>>     version stays the same.
>>
>>     Any thoughts on the matter?
>>
>>     --Bob
>
>
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech


-- 
Sébastien Denvil
IPSL, Pôle de modélisation du climat
UPMC, Case 101, 4 place Jussieu,
75252 Paris Cedex 5

Tour 45-55 2ème étage Bureau 209
Tel: 33 1 44 27 21 10
Fax: 33 1 44 27 39 02

-------------- next part --------------
153b2789-a134-48b1-ac13-41a9e9993c8f
6b71c053-ce65-4283-b42e-7c4004ee077d
55554428-231a-4db9-96bc-bdf64b057679
cd7cfef2-229d-4c14-a770-38eb390617a1
85e73ab8-c755-4c6c-8afb-61c7a8ccce5b
03489edf-a01e-4fbe-b3bc-e77c8d452159
215e8f55-f2e8-4057-93af-1d4d01d9f98d
64a30b52-b228-4c3b-be78-c1735afb8019
4bf43b46-b37a-4133-8d5a-c44b1c124bb8
57745e88-6d55-48aa-b2e4-501837c643fd
46759257-e100-4611-8094-52b96bb13c97
b983236f-2c58-4fd8-845d-e5e9c6e8c734
5570329f-c0c4-4fb5-8345-804e76a54b85
04e4679a-a36b-445b-a7bf-bdf15b567099
df5f7ff0-82f7-4c7a-9806-365a3a692ad5
64a30b52-b228-4c3b-be78-c1735afb8019
04e4679a-a36b-445b-a7bf-bdf15b567099
df5f7ff0-82f7-4c7a-9806-365a3a692ad5
-------------- next part --------------
A non-text attachment was scrubbed...
Name: versionning.management.pdf
Type: file/pdf
Size: 844704 bytes
Desc: not available
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20130306/449fdf9f/attachment-0002.bin 

