[Go-essp-tech] Towards versioning in ESG

Mon Mar 18 08:03:44 MDT 2013

Thanks Martin, yep, happy to discuss later in the year too.

Cheers!
Chris

On 3/13/13 4:25 AM, "martin.juckes at stfc.ac.uk" <martin.juckes at stfc.ac.uk>
wrote:

>Hi Chris,
>
>Interesting -- but we also need to do more work on capturing metadata
>relevant to versions that is not readily available at present. Or perhaps
>develop a tool which can summarise changes to binary data in a human
>friendly way -- but I suspect that we would not be able to make much
>progress in that direction.
>
>We will probably need some focussed discussions on this later in the year,
>
>cheers,
>Martin
>
>
>________________________________
>From: Mattmann, Chris A (388J) [chris.a.mattmann at jpl.nasa.gov]
>Sent: 09 March 2013 18:32
>To: Juckes, Martin (STFC,RAL,RALSP); sebastien.denvil at ipsl.jussieu.fr;
>go-essp-tech at ucar.edu
>Subject: Re: [Go-essp-tech] Towards versioning in ESG
>
>Hey Martin,
>
>This makes a lot of sense.
>
>Maybe Apache Tika could help in building such an API?
>
>http://tika.apache.org/
>
>Its main goal is to be a "Babel Fish" for extracting useful text and
>metadata from any kind of file, starting with IANA's set of 1200+ rich
>file types, and including basic support for HDF5, NetCDF and some other
>relevant science data formats. There are efforts underway to try and
>integrate GDAL into Tika as well to understand GIS formats, so it may be
>of use here.
>
>Cheers,
>Chris
>
>
>From: "martin.juckes at stfc.ac.uk" <martin.juckes at stfc.ac.uk>
>Date: Wednesday, March 6, 2013 5:56 AM
>To: "sebastien.denvil at ipsl.jussieu.fr" <sebastien.denvil at ipsl.jussieu.fr>,
>"go-essp-tech at ucar.edu" <go-essp-tech at ucar.edu>
>Subject: Re: [Go-essp-tech] Towards versioning in ESG
>
>Hello All,
>
>Some more thoughts on version control etc.
>
>I was included in a recent exchange between a user (Urs Beyerle of ETH)
>and Karl, and one interesting aspect of that conversation was that Urs was
>comparing ESGF with systems such as git. For ESGF a lot of effort has
>been put into creating robust and useful metadata in the files. Git and
>systems like it have nothing to do with in-file metadata; instead they
>provide a system of tracking metadata about the files, and keep the
>structure in which this metadata is stored hidden from the users. Git
>does version control at the repository level, and users are happy with
>this because they have the tools to extract the kind of information they
>want. They do not have to navigate a directory structure to find out
>about versions. What we have at present, which is a great step forward,
>is a system of keeping multiple versions in the archive. What we need are
>tools to allow users (and archive managers) to extract useful information
>about their files from the archive metadata. Synchro-data (from IPSL)
>does some of this.
>
>One of the problems is that users like in-file metadata in their data
>files, because it is accessible to data processing software, but want to
>have features which can only be supported by external meta-data about how
>the file has been published, moved into later versions, replaced etc.
>Creating an API which allows data processing software to access external
>metadata will be important.
>
>Cheers,
>Martin
>
>From: go-essp-tech-bounces at ucar.edu
>[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Sébastien Denvil
>Sent: 06 March 2013 10:14
>To: go-essp-tech at ucar.edu
>Subject: Re: [Go-essp-tech] Towards versioning in ESG
>
>Hi folks,
>
>to add to this important topic I would like to raise a few comments and
>highlight possible guidance we could give, from a data producer and
>provider perspective.
>
>Sorry for this long email, but it was not easy to pack it more than this.
>
>1. Versions as we have them now are too high level (dataset level) to be
>useful to the users. They are in some sense useful to data providers, but
>they are clearly not enough in that context either.
>2. tracking_ids are very useful. As things stand now, they are the most
>robust thing we have on which to build a version information system for
>users.
>3. checksums are useful but not at all error prone and they are costly.
>A few months back it was not a good idea to build a version information
>system for users on top of them.
>
>We developed a prototype version information system for users. It
>highlights the methodological approach and only covers IPSL results.
>
>1. we need a list of problems
>2. we need a list of files affected by a given problem
>3. we need a list of (file, problem) statuses, i.e. (corrected, not corrected)
>
>This page provides errata related to our IPSL-CM results only.
>http://icmc.ipsl.fr/research/international-projects/cmip5/errata-ipsl
>
>The interesting part is that you can provide a list of tracking_ids
>(example netcdf_tracking_id.txt attached).
>The system will tell you:
>- whether the file is from the latest dataset version or not. (Not very
>useful information, I agree)
>- if not, whether the file has really changed compared to the previous
>dataset version. (This is useful: the dataset version changed but not the
>file I'm interested in)
>- the history of corrections made to those files (example:
>http://icmc.ipsl.fr/research/international-projects/cmip5/87-research/international-projects/cmip5/errata/227)
>- if you don't have the latest version of a given file, you have access to
>the list of problems that have been solved.
>- if you have the latest version of a given file BUT a problem still needs
>to be solved, you can make a proper decision.
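A minimal Python sketch of the three lists and the tracking_id lookup described above. It is not the IPSL implementation: the in-memory dictionaries, the `FileInfo` fields, the `report` function, and all identifiers such as "P1" and "id-old" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    in_latest_dataset: bool  # file belongs to the latest dataset version
    changed: bool            # file differs from the previous dataset version


# 1. list of problems (hypothetical identifiers and descriptions)
PROBLEMS = {
    "P1": "wrong units attribute",
    "P2": "incorrect time axis",
}

# 2. and 3. files affected by a problem, with the (file, problem) status
STATUS = {
    ("id-old", "P1"): "corrected",
    ("id-new", "P2"): "not corrected",
}

# Per-file information, keyed by tracking_id (hypothetical values)
FILES = {
    "id-old": FileInfo(in_latest_dataset=False, changed=True),
    "id-new": FileInfo(in_latest_dataset=True, changed=False),
}


def report(tracking_id):
    """Summarise what is known about one file, looked up by tracking_id."""
    info = FILES.get(tracking_id)
    if info is None:
        return {"tracking_id": tracking_id, "known": False}
    issues = {p: s for (t, p), s in STATUS.items() if t == tracking_id}
    return {
        "tracking_id": tracking_id,
        "known": True,
        "in_latest_dataset": info.in_latest_dataset,
        "changed": info.changed,
        # problems already fixed in a later version of the file
        "solved": sorted(p for p, s in issues.items() if s == "corrected"),
        # problems that still affect the file
        "open": sorted(p for p, s in issues.items() if s == "not corrected"),
    }


if __name__ == "__main__":
    for tid in ["id-old", "id-new", "id-unknown"]:
        print(report(tid))
```

A user holding a list of tracking_ids would call `report` once per identifier; an unknown identifier is reported as such rather than guessed at.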
>
>I agree it needs some formal thinking. The attached pdf provides a few
>steps towards this.
>
>We suggest that part of this information can be captured during
>publication and after the fact (new published version = comments and list
>of issues (tickets)).
>
>We suggest leveraging the ESGF search system as a placeholder and the
>entry point for this information.
>
>File level versioning is what the users want.
>
>thanks.
>Sébastien
>
>Le 06/03/2013 10:38, Kettleborough, Jamie a écrit :
>Hello,
>
>is there a straw man document (or anything like that) around
>thoughts/proposals on versioning in ESG?  I think it would be great to
>get some user review (both data providers and data consumers) of this if
>possible.
>
>Thanks,
>
>Jamie
>
>
>________________________________
>From: go-essp-tech-bounces at ucar.edu
>[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Christensen, Sigurd W.
>Sent: 05 March 2013 19:57
>To: go-essp-tech at ucar.edu
>Subject: [Go-essp-tech] Towards versioning in ESG
>
>Folks,
>
>Thanks for the opportunity to discuss versioning on today's call.
>
>
>
>As others have expressed, in the December 21 and March 4 postings on this
>topic, my main concern is that versioning serve the needs of the end
>user.  We should provide an easy way for the end user to determine
>whether data and metadata the user has previously retrieved and used in
>an analysis is still current, or has been revised in a way that might
>affect the analysis.
>
>
>
>I agreed to post to this list a consideration I mentioned on today's
>call: observational datasets that routinely are extended through time as
>current data become available. This situation was also raised on this
>list by George Huffman on December 21, 2012. I agree with his thought
>that provoking a new version each time a new data increment is added is
>unwieldy both for the data producers and for the users.
>
>
>
>I also support George's notion that we consider the standards for DOIs
>(Digital Object Identifiers) in conjunction with the discussion of
>versioning.
>
>
>
>A final thought for now: I feel that we should make information available
>to the users about what changed with a new version.
>
>
>
>  - Sig Christensen
>
>
>
>
>
>________________________________
>
>From: go-essp-tech-bounces at ucar.edu
>[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Drach, Bob
>Sent: Monday, March 04, 2013 21:26
>To: Taylor, Karl Taylor
>Cc: go-essp-tech at ucar.edu
>Subject: Re: [Go-essp-tech] definition of dataset version
>Hi Karl,
>
>As you suggest, the broader question is what guidance we should give to
>data providers and users on usage of the dataset version, file tracking
>ID, and file checksum.
>
>It's true that the dataset version may not be of much use to data users
>if they don't record when the data was downloaded. But since the version
>indicates the date of publication, it still might give some indication
>when a dataset has gone out of date. The tracking ID is a random UUID
>generated by CMOR, and is meant as a 'bar code' to track the data through
>ESGF. Since it's a global attribute that is visible on the data portal,
>it is relatively easy for a user to discover and compare with the file
>value. However its usage and purpose haven't been well defined, and in
>some cases data providers have probably modified data in place without
>changing the tracking ID (hopefully not too often). Checksums are
>definitive, but even a trivial modification to a file changes its
>checksum.
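A small Python illustration of the point about tracking IDs and checksums, under stated assumptions: `fake_file` is a hypothetical stand-in for a NetCDF file (a one-line header carrying the tracking ID, followed by the data), and SHA-256 is used only as an example checksum. It shows that regenerating a file with a fresh random UUID changes the checksum even when the data are byte-for-byte identical.

```python
import hashlib
import uuid


def fake_file(tracking_id: str, data: bytes) -> bytes:
    """Hypothetical stand-in for a file: tracking_id header, then the data."""
    header = f"tracking_id={tracking_id}\n".encode("ascii")
    return header + data


def checksum(content: bytes) -> str:
    """Checksum of the whole file, header included."""
    return hashlib.sha256(content).hexdigest()


data = b"identical model output"

# The same data written twice, each time with a new random UUID
first = fake_file(str(uuid.uuid4()), data)
regenerated = fake_file(str(uuid.uuid4()), data)

# Compare the data portion (everything after the header line)
same_data = first.split(b"\n", 1)[1] == regenerated.split(b"\n", 1)[1]
# Compare the whole-file checksums
same_checksum = checksum(first) == checksum(regenerated)

print("data identical:", same_data)
print("checksums identical:", same_checksum)
```

The data portions compare equal while the whole-file checksums differ, which is why a checksum alone cannot distinguish a meaningful change from a regenerated tracking ID.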
>
>To answer your question, the timestamp in the ESGF SOLR index is
>associated with the dataset as a whole, and indicates the publication
>time.
>
>I'm opening the discussion to the GO-ESSP list for comments.
>
>--Bob
>________________________________
>From: Karl Taylor [taylor13 at llnl.gov]
>Sent: Monday, March 04, 2013 3:45 PM
>To: Drach, Bob
>Cc: Williams, Dean N.; Painter, Jeff; Ganzberger, Michael
>Subject: Re: definition of dataset version
>Hi Bob,
>
>I think the "version numbers" assigned to datasets are pretty unhelpful to
>most users.  Most users won't record or remember what version they have
>downloaded.  Perhaps some users will know what *date* they downloaded
>data, and all users can determine the tracking_id's and chksums for their
>files, so we should provide support for determining whether files are
>current based on this information.
>
>Is the date recorded by ESGF assigned to a dataset or to each file?   If
>it's assigned to a dataset, then I'm not sure that will be much use
>either.
>
>I think when a user asks us whether a file is current or not, based on
>the checksum or tracking_id, we should return the following information:
>
>"You have the latest version of this file"  -- if the checksum provided
>by the user is identical to the latest file version in the CMIP archive.
>"A newer variant of the file exists, but differences are unlikely to
>affect your analysis"  --  if the only changes made have been to some
>subset of the file's global attributes that we think will not lead to
>misinterpretation of the data itself.
>"A new version of the file exists and should be used in place of the one
>you downloaded"  --  otherwise
>
>We would list the set of global attributes that could be wrong in case 2.
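The three responses proposed above could be sketched in Python roughly as follows. This is only an illustration, not an ESGF interface: the `FileVersion` record, the `body_checksum` field (a checksum of everything except the global attributes), the `BENIGN_ATTRIBUTES` set, and the extra "not found" case are all assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical list of global attributes whose change is judged harmless
BENIGN_ATTRIBUTES = {"history", "contact", "comment"}

LATEST = "You have the latest version of this file"
MINOR = ("A newer variant of the file exists, but differences are unlikely "
         "to affect your analysis")
REPLACE = ("A new version of the file exists and should be used in place of "
           "the one you downloaded")
UNKNOWN = "Checksum not found in the archive"  # extra case, not in the proposal


@dataclass
class FileVersion:
    checksum: str            # checksum of the whole file
    body_checksum: str       # assumed: checksum excluding global attributes
    global_attributes: dict


def check_currency(user_checksum, versions):
    """Classify a user's file; `versions` holds one file's history, oldest first.

    Returns (message, changed_attributes).
    """
    latest = versions[-1]
    if user_checksum == latest.checksum:
        return LATEST, []
    mine = next((v for v in versions if v.checksum == user_checksum), None)
    if mine is None:
        return UNKNOWN, []
    keys = set(mine.global_attributes) | set(latest.global_attributes)
    changed = sorted(
        k for k in keys
        if mine.global_attributes.get(k) != latest.global_attributes.get(k)
    )
    # Case 2: only harmless global attributes differ
    if mine.body_checksum == latest.body_checksum and set(changed) <= BENIGN_ATTRIBUTES:
        return MINOR, changed
    return REPLACE, []
```

In the second case the function also returns the names of the attributes that differ, matching the suggestion to list the global attributes that could be wrong.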
>
>We could use tracking_id's rather than chksums, but we would have to weed
>out the cases where a critically important global attribute had been
>modified, but the tracking_id hadn't.   [I'd guess that there aren't any
>cases where the data itself has been modified without changing the
>chksum, but there might be quite a few cases where important global
>attributes have been changed.]
>
>Would the above be practical?
>
>Karl
>
>On 3/4/13 1:21 PM, Drach, Bob wrote:
>Hi Karl,
>
>Dean requested that we have a conversation about dataset versioning on
>the GO-ESSP telecon tomorrow. I'm curious about your views on the subject.
>
>Specifically, the question arose for the case where a modeling group has
>regenerated data through CMOR, to replace data lost in a disk crash. The
>data providers assert that the data is identical to the published
>version. However, because it has been regenerated the checksums and
>tracking IDs differ. The question is whether the data should be published
>with the previous version number or should be considered a new version.
>
>At the moment we leave the choice to the data publishers, and the
>publishing client by default generates a new version number when any file
>in a dataset has been added, deleted, or modified. However, this leaves
>some ambiguous cases, such as when:
>
>- the metadata has been modified, but the actual data is unchanged;
>- the data has been regenerated through CMOR, such that all data and
>metadata fields are unchanged, with the sole exception of the tracking ID
>(and therefore the checksum has changed as well).
>
>My opinion is that an updated version number should be a signal to the
>end users that something significant has changed that is worth their
>attention. If nothing has changed except the tracking ID and history
>attributes, the dataset should be republished with the original version
>number. There may be similar cases where minor metadata modifications
>don't warrant a new version number. On the other hand, modification of
>metadata that guides processing - axis definitions, units, dataset
>identification fields, etc., should trigger a new version number.
>
>This approach has the implication that the tracking ID and checksum of a
>file could change even though the parent dataset version stays the same.
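One way to express the rule above is the Python sketch below. It is an illustration under assumptions, not the behaviour of the publishing client: the function name `needs_new_version` is hypothetical, and any attribute outside the ignorable set (tracking ID and history, as named in the text) is conservatively treated as significant.

```python
# Attributes whose change alone does not warrant a new version number
# (tracking ID and history, as named in the text).
IGNORABLE = {"tracking_id", "history"}


def needs_new_version(old_attrs, new_attrs, data_changed):
    """Return True if the republished dataset should get a new version number.

    old_attrs / new_attrs: metadata of the published and regenerated file.
    data_changed: True if the actual data values differ.
    """
    if data_changed:
        return True
    keys = set(old_attrs) | set(new_attrs)
    changed = {k for k in keys if old_attrs.get(k) != new_attrs.get(k)}
    # Assumption: anything outside IGNORABLE (units, axis definitions,
    # dataset identification fields, ...) is treated as significant.
    return bool(changed - IGNORABLE)


if __name__ == "__main__":
    published = {"tracking_id": "a", "history": "run 1", "units": "K"}
    regenerated = {"tracking_id": "b", "history": "run 2", "units": "K"}
    print(needs_new_version(published, regenerated, data_changed=False))
```

A finer-grained policy could add other minor attributes to `IGNORABLE`, which is the open question raised in the message.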
>
>Any thoughts on the matter?
>
>--Bob
>
>
>
>
>
>_______________________________________________
>
>GO-ESSP-TECH mailing list
>
>GO-ESSP-TECH at ucar.edu
>
>http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
>
>
>
>--
>
>Sébastien Denvil
>
>IPSL, Pôle de modélisation du climat
>
>UPMC, Case 101, 4 place Jussieu,
>
>75252 Paris Cedex 5
>
>
>
>Tour 45-55 2ème étage Bureau 209
>
>Tel: 33 1 44 27 21 10
>
>Fax: 33 1 44 27 39 02
>
>
>--
>Scanned by iCritical.