[Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

Karl Taylor taylor13 at llnl.gov
Fri Dec 21 11:17:29 MST 2012


Hi all,

The original idea was to make more use of "tracking_id", which is a 
global attribute uniquely identifying each file.  If this were relied on 
to distinguish among different versions, a data provider could make a 
decision to assign the same tracking_id to a file that replaces an older 
file, when only trivial differences exist (as in Stephane's case).  I'm 
not sure why a "batch number" is needed if we already have "tracking_id".
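
A minimal sketch of how a tool might apply this idea, assuming each catalog entry records both the file's tracking_id and its whole-file checksum. The FileRecord type, its field names, and the three labels are hypothetical illustrations, not part of any existing ESGF tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical catalog entry; the field names are assumptions."""
    tracking_id: str  # value of the file's tracking_id global attribute
    checksum: str     # checksum of the whole file (data plus metadata)


def classify_replacement(old: FileRecord, new: FileRecord) -> str:
    """Classify a replacement file relative to the file it replaces.

    A different tracking_id is taken to mean the provider considers the
    file a genuinely new version. The same tracking_id with a different
    checksum is taken to mean the provider judged the change trivial.
    """
    if old.tracking_id != new.tracking_id:
        return "new version"
    if old.checksum != new.checksum:
        return "trivial replacement"
    return "identical"
```

Under this scheme the checksum is used only to verify a transfer, while the decision about whether users need to download the file again rests on the tracking_id chosen by the data provider.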

cheers,
Karl

On 12/21/12 4:50 AM, martin.juckes at stfc.ac.uk wrote:
> Hello All,
>
> I basically agree, but would say it is more than best practice -- creating a new publication version every time a file changes is the agreed approach and the only way we can track changes in the current system.
>
> As has been pointed out, it would be very desirable to have a better way of keeping track of and communicating the nature of the changes between data versions. In the long term, a clean and robust approach is likely to need additional hooks in the file metadata, which we clearly don't want to introduce in CMIP5. In the short term, it might be possible to set up a wiki where modelling groups can list changes, tagged with (for example) institute, model, experiment and date. In the very short term, we could set up a google-doc spreadsheet with 5 columns: institute, model, experiment, date of file modification, URL (pointing to a page maintained by the modelling group). This would be simple and very useful.
>
> In the longer term, one problem is that the data producer cannot record what is different in a new version when they are producing the data, because the version number is, by design, assigned at a later date. This is a common enough problem, and the common solution is for the producer to stamp products with a batch number and then record information against the batch number. I think it would be a useful step to add a global "batch" attribute to the CMOR2 standard.
>
> Data checksums would also be very useful. This is not technically difficult -- it is just a question of agreeing how to do it. A useful first step might be to add a feature in CMOR2 to calculate checksums for each variable (e.g. the fletcher32 checksum which is used by the NetCDF library) and add them as a variable attribute. One step we could take for CMIP5 would be to record somewhere, for each publication unit, the number of files and the number of files that are new, changed, or have changed data. But this would require running code at the data publication stage to compare new and old versions of each file, and the overhead in CPU time may be beyond our resources.
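
The per-publication-unit bookkeeping described above could be sketched roughly as follows, assuming each version is available as a manifest mapping file name to checksum. The function names, the manifest layout, and the choice of SHA-256 are assumptions made for illustration. Note that a whole-file checksum cannot tell a metadata-only change from a data change, so "changed" here covers both.

```python
import hashlib
from typing import Dict


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_versions(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, int]:
    """Count new, changed, unchanged and removed files between two manifests.

    Both arguments map file name -> checksum for one publication unit.
    """
    counts = {"total": len(new), "new": 0, "changed": 0,
              "unchanged": 0, "removed": 0}
    for name, checksum in new.items():
        if name not in old:
            counts["new"] += 1
        elif old[name] != checksum:
            counts["changed"] += 1
        else:
            counts["unchanged"] += 1
    counts["removed"] = sum(1 for name in old if name not in new)
    return counts
```

If the manifest for the previous version is stored at publication time, only the new files need to be read and hashed, which may reduce the CPU overhead mentioned above.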
>
> cheers,
> Martin
> ________________________________________
> From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu] on behalf of Sébastien Denvil [sebastien.denvil at ipsl.jussieu.fr]
> Sent: 21 December 2012 11:03
> To: Stéphane Senesi
> Cc: go-essp-tech at ucar.edu; esgf-devel at lists.llnl.gov
> Subject: Re: [Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?
>
> Hello all,
>
> indeed Stéphane the best practice as things stand now is to republish a
> new version.
>
> I agree with Estani's idea that we might want to have checksums
> over the data part only (or versioning over the data part and versioning
> over the metadata part).
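
As a rough illustration of separate checksums for data and metadata, the sketch below assumes a file has already been read into two plain Python structures: a dict of global attributes and a dict mapping variable name to raw bytes. That in-memory layout and both function names are hypothetical; reading a real NetCDF file would need a NetCDF library, which is left out here.

```python
import hashlib
import json
from typing import Dict


def data_checksum(variables: Dict[str, bytes]) -> str:
    """Hash variable names and raw values only, in sorted name order."""
    digest = hashlib.sha256()
    for name in sorted(variables):
        encoded_name = name.encode("utf-8")
        payload = variables[name]
        # Length prefixes keep name/payload boundaries unambiguous.
        digest.update(len(encoded_name).to_bytes(8, "big"))
        digest.update(encoded_name)
        digest.update(len(payload).to_bytes(8, "big"))
        digest.update(payload)
    return digest.hexdigest()


def metadata_checksum(attributes: Dict[str, str]) -> str:
    """Hash the attributes via a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(attributes, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Made-up example: only the 'history' attribute differs between the files.
variables = {"tas": bytes([0, 1, 2, 3])}
old_attrs = {"history": "attributes fixed on day A", "title": "example"}
new_attrs = {"history": "attributes fixed on day B", "title": "example"}

same_data = data_checksum(variables) == data_checksum(dict(variables))
same_metadata = metadata_checksum(old_attrs) == metadata_checksum(new_attrs)
```

In a case like Stéphane's, same_data would be true and same_metadata false, so a tool could report the replacement as a metadata-only change.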
>
> In Europe IS-ENES phase 1 will come to an end early next year. In this
> context we plan to write a data management white paper. I will draft a
> first version of this paper by mid-January. This white paper will
> emphasize best practices and recommendations in particular with respect
> to versioning (and how to keep track of changes and version history). I
> already have a few names I plan to ask for advice and/or contributions. If
> you want to contribute or share your ideas on this, do not hesitate to ping me.
>
> regards and merry Christmas!
> Sébastien
>
> On 21/12/2012 11:27, Stéphane Senesi wrote:
>> Estanislao,
>>
>> Estanislao Gonzalez wrote, On 21/12/2012 10:25:
>>> Hi All,
>>>
>>> IMHO:
>>> new checksum == new version
>>>
>>> As the system is designed it means we will have more versions than
>>> required if (and only if) the same data is regenerated with some
>>> unimportant meta-data change (In all other cases, a new version is
>>> compelling). I don't see a reason why this particular scenario should
>>> happen often, but perhaps I'm wrong.
>>> So if I were you Stéphane, I would make sure the data is generated
>>> "exactly" as the old one (same checksum). If that's not possible, then
>>> it's not the same data, ergo new version.
>>>
>>> In the future, we might want to have checksums over the data part only
>>> and skip the meta-data (or at most include just some meta-data entries,
>>> which I wouldn't by the way)
>> In my case, the only metadata that I cannot re-construct is a small part
>> of the 'history' attribute, namely the date at which the (other, useful)
>> meta-data attributes were fixed ...  But I understand the argument.
>>
>> The need for a coordinated way to describe and publish (clear text)
>> version information remains strong. Not sure about the usefulness of a
>> CIM on version information ;-)
>>
>> S
>>
>>> . But as the system is designed right now
>>> that's not possible.
>>>
>>> My 2c,
>>> Estani
>>>
>>> On 21.12.2012 10:04, Kettleborough, Jamie wrote:
>>>> Hello Karl,
>>>>
>>>> Is there any documentation anywhere on what changes should or
>>>> shouldn't trigger new versions? Is it worth adding this advice to
>>>> that
>>>> documentation?
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>>
>>>>> -------------------------
>>>>> FROM: go-essp-tech-bounces at ucar.edu
>>>>> [mailto:go-essp-tech-bounces at ucar.edu] ON BEHALF OF Karl Taylor
>>>>> SENT: 20 December 2012 17:51
>>>>> TO: stephen.pascoe at stfc.ac.uk
>>>>> CC: esgf-devel at lists.llnl.gov; go-essp-tech at ucar.edu
>>>>> SUBJECT: Re: [Go-essp-tech] [esgf-devel] Changing checksums without
>>>>> changing version number ?
>>>>>
>>>>> Hi Stephane and Stephen,
>>>>>
>>>>> I agree with Stephen that what you should do is not obvious, but I
>>>>> think the least confusing option is indeed to publish as a new version.
>>>>> We'll need to work to make information about how versions differ (or
>>>>> don't) easily accessible, so as to avoid folks reanalyzing output when
>>>>> it is unnecessary.
>>>>>
>>>>> Karl
>>>>>
>>>>> On 12/20/12 2:42 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>>>
>>>>>> Hi Stephane,
>>>>>>
>>>>>> This is a difficult use case to resolve. I have broadened the
>>>>>> thread to go-essp-tech because it affects the whole plan of how we
>>>>>> keep track of changing data.
>>>>>>
>>>>>> My opinion is that you should publish this data as a new version.
>>>>>> We have been assuming that each dataset version has a stable set of
>>>>>> checksums. We'd like to build tools around this assumption that
>>>>>> check the integrity of the archive (admittedly we haven't got there
>>>>>> yet).
>>>>>>
>>>>>> If you republish files as the same versions but with different
>>>>>> checksums we cannot tell that only the metadata has changed. Thinking
>>>>>> about the archive as a whole, we have to assume that any
>>>>>> file-versions that change checksum could be different data and flag
>>>>>> it as such. It would be better to create a new version and document
>>>>>> that this version is a trivial change.
>>>>>>
>>>>>> Unfortunately we don't have a good system for documenting these
>>>>>> version transitions. BADC did produce a web-app for this some months
>>>>>> ago but it didn't catch on [1]. Also there is a wiki page [2] where
>>>>>> you can note down data provider issues but I doubt any users know it
>>>>>> exists. If you record what you've done in one of those places the
>>>>>> knowledge will not be lost.
>>>>>>
>>>>>> [1] http://cmip-ingest1.badc.rl.ac.uk/drscomm/search/ (Contact
>>>>>> Ag Stephens for details: CC'd above)
>>>>>> [2] http://esgf.org/wiki/CMIP5ProviderFAQ
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>
>>>>>> ---
>>>>>> Stephen Pascoe +44 (0)1235 445980
>>>>>> Centre of Environmental Data Archival
>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>>>>>> 0QX, UK
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: owner-esgf-devel at lists.llnl.gov
>>>>>> [mailto:owner-esgf-devel at lists.llnl.gov] On Behalf Of Stéphane Senesi
>>>>>> Sent: 19 December 2012 16:58
>>>>>> To: esgf-devel at lists.llnl.gov
>>>>>> Subject: [esgf-devel] Changing checksums without changing version
>>>>>> number ?
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> We experienced a failure of a disk array for our CMIP5 data's ESG
>>>>>> datanode. We are able to produce again the same data and metadata,
>>>>>> except that, in each NetCDF file, a small part of the "history"
>>>>>> metadata is changed (namely the date of the last setting for some
>>>>>> metadata).
>>>>>> Hence, the checksum does change, and we have no way to avoid it.
>>>>>>
>>>>>> We can either re-publish the affected datasets with a new version
>>>>>> number or with the same version number.
>>>>>>
>>>>>> In the first case, all users may think that the data is new, and
>>>>>> will have to consider if they want to download it again, and, if they
>>>>>> do, may eventually complain that we generate additional nonsensical
>>>>>> work.
>>>>>>
>>>>>> In the second one, meticulous users will complain that the
>>>>>> checksums in our thredds catalog are not the same as the checksums
>>>>>> for the files they have already downloaded.
>>>>>>
>>>>>> What is the best way forward? I suspect it is the second one,
>>>>>> because checksums are not supposed to be data identifiers but are only
>>>>>> used to check for data corruption immediately after the transfer. But
>>>>>> does everybody agree with that?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> --
>>>>>> Stéphane Sénési
>>>>>> Ingénieur - équipe Assemblage du Système Terre Centre National de
>>>>>> Recherches Météorologiques Groupe de Météorologie à Grande Echelle et
>>>>>> Climat
>>>>>>
>>>>>> CNRM/GMGEC/ASTER
>>>>>> 42 Av Coriolis
>>>>>> F-31057 Toulouse Cedex 1
>>>>>>
>>>>>> +33.5.61.07.99.31 (Fax :....9610)
>
> --
> Sébastien Denvil
> IPSL, Pôle de modélisation du climat
> UPMC, Case 101, 4 place Jussieu,
> 75252 Paris Cedex 5
>
> Tour 45-55 2ème étage Bureau 209
> Tel: 33 1 44 27 21 10
> Fax: 33 1 44 27 39 02
>
>
