[Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

Stéphane Senesi Stephane.Senesi at meteo.fr
Fri Dec 21 03:27:37 MST 2012


Estanislao,

Estanislao Gonzalez wrote, On 21/12/2012 10:25:
> Hi All,
>
> IMHO:
> new checksum == new version
>
> As the system is designed, this means we will have more versions than
> required if (and only if) the same data is regenerated with some
> unimportant meta-data change (in all other cases, a new version is
> compelling). I don't see a reason why this particular scenario should
> happen often, but perhaps I'm wrong.
> So if I were you Stéphane, I would make sure the data is generated
> "exactly" as the old one (same checksum). If that's not possible, then
> it's not the same data, ergo new version.
>
> In the future, we might want to have checksums over the data part only
> and skip the meta-data (or at most include just some meta-data entries,
> which I wouldn't by the way)

In my case, the only metadata that I cannot reconstruct is a small part
of the 'history' attribute, namely the date on which the other (useful)
metadata attributes were fixed. But I understand the argument.
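
To illustrate the data-only checksum idea quoted above, here is a minimal
sketch in Python. It is not how ESGF computes checksums today: the file is
modelled as a plain dictionary of attributes plus a dictionary of raw
variable bytes, and the names data_checksum, full_checksum, the variables
"tas" and "pr", and the dates in the 'history' strings are all hypothetical
choices made for the example.

```python
import hashlib


def data_checksum(variables):
    """SHA-256 over variable names and raw data bytes only (no attributes).

    `variables` is a hypothetical mapping {variable_name: bytes}.
    """
    digest = hashlib.sha256()
    for name in sorted(variables):  # fixed order keeps the digest reproducible
        for chunk in (name.encode("utf-8"), variables[name]):
            # Length prefix so that name/data boundaries cannot be confused.
            digest.update(len(chunk).to_bytes(8, "big"))
            digest.update(chunk)
    return digest.hexdigest()


def full_checksum(attributes, variables):
    """SHA-256 that also covers the attributes, standing in for a whole-file checksum."""
    digest = hashlib.sha256()
    for key in sorted(attributes):
        digest.update(f"{key}={attributes[key]}\n".encode("utf-8"))
    digest.update(data_checksum(variables).encode("ascii"))
    return digest.hexdigest()


# Same data, regenerated with a different date in the 'history' attribute.
variables = {"tas": bytes([1, 2, 3, 4]), "pr": bytes([5, 6, 7, 8])}
original = {"history": "attributes fixed 2012-06-01", "institution": "CNRM"}
regenerated = {"history": "attributes fixed 2012-12-19", "institution": "CNRM"}

# The attribute-aware checksum differs between the two copies.
print(full_checksum(original, variables) != full_checksum(regenerated, variables))  # True
# The data-only checksum depends on `variables` alone, so it is unaffected.
print(data_checksum(variables) == data_checksum(dict(variables)))  # True
```

With such a scheme, a regenerated file whose only difference is the
'history' date would keep its data-only checksum, while a whole-file
checksum would still change.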

The need for a coordinated way to describe and publish (clear text) 
version information remains strong. Not sure about the usefulness of a 
CIM on version information ;-)

S

> . But as the system is designed right now
> that's not possible.
>
> My 2c,
> Estani
>
> On 21.12.2012 10:04, Kettleborough, Jamie wrote:
>> Hello Karl,
>>
>> Is there any documentation anywhere on what changes should or
>> shouldn't trigger new versions? Is it worth adding this advice to
>> that documentation?
>>
>> Thanks,
>>
>> Jamie
>>
>>> -------------------------
>>> FROM: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] ON BEHALF OF Karl Taylor
>>> SENT: 20 December 2012 17:51
>>> TO: stephen.pascoe at stfc.ac.uk
>>> CC: esgf-devel at lists.llnl.gov; go-essp-tech at ucar.edu
>>> SUBJECT: Re: [Go-essp-tech] [esgf-devel] Changing checksums without
>>> changing version number ?
>>>
>>> Hi Stephane and Stephen,
>>>
>>> I agree with Stephen that what you should do is not obvious, but I
>>> think the least confusing option is indeed to publish as a new
>>> version. We'll need to work to make information about how versions
>>> differ (or don't) easily accessible, so as to avoid folks reanalyzing
>>> output when it is unnecessary.
>>>
>>> Karl
>>>
>>> On 12/20/12 2:42 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>
>>>> Hi Stephane,
>>>>
>>>> This is a difficult use case to resolve. I have broadened the
>>>> thread to go-essp-tech because it affects the whole plan of how we
>>>> keep track of changing data.
>>>>
>>>> My opinion is that you should publish this data as a new version.
>>>> We have been assuming that each dataset version has a stable set of
>>>> checksums. We'd like to build tools around this assumption that
>>>> check the integrity of the archive (admittedly we haven't got there
>>>> yet).
>>>>
>>>> If you republish files as the same versions but with different
>>>> checksums we cannot tell that only the metadata has changed. Thinking
>>>> about the archive as a whole, we have to assume that any
>>>> file-versions that change checksum could be different data and flag
>>>> them as such. It would be better to create a new version and document
>>>> that this version is a trivial change.
>>>>
>>>> Unfortunately we don't have a good system for documenting these
>>>> version transitions. BADC did produce a web-app for this some months
>>>> ago but it didn't catch on [1]. Also there is a wiki page [2] where
>>>> you can note down data provider issues but I doubt any users know it
>>>> exists. If you record what you've done in one of those places the
>>>> knowledge will not be lost.
>>>>
>>>> [1] http://cmip-ingest1.badc.rl.ac.uk/drscomm/search/ (Contact
>>>> Ag Stephens for details: CC'd above)
>>>> [2] http://esgf.org/wiki/CMIP5ProviderFAQ
>>>>
>>>> Cheers,
>>>> Stephen.
>>>>
>>>> ---
>>>> Stephen Pascoe +44 (0)1235 445980
>>>> Centre for Environmental Data Archival
>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>>>> 0QX, UK
>>>>
>>>> -----Original Message-----
>>>> From: owner-esgf-devel at lists.llnl.gov
>>>> [mailto:owner-esgf-devel at lists.llnl.gov] On Behalf Of Stéphane Senesi
>>>> Sent: 19 December 2012 16:58
>>>> To: esgf-devel at lists.llnl.gov
>>>> Subject: [esgf-devel] Changing checksums without changing version
>>>> number ?
>>>>
>>>> Dear all,
>>>>
>>>> We experienced a failure of a disk array on the ESG datanode that
>>>> serves our CMIP5 data. We are able to regenerate the same data and
>>>> metadata, except that, in each NetCDF file, a small part of the
>>>> "history" metadata changes (namely the date on which some metadata
>>>> was last set).
>>>> Hence, the checksum does change, and we have no way to avoid it.
>>>>
>>>> We can either re-publish the affected datasets with a new version
>>>> number or with the same version number.
>>>>
>>>> In the first case, all users may think that the data is new, and
>>>> will have to consider whether they want to download it again; if
>>>> they do, they may then complain that we are generating pointless
>>>> extra work.
>>>>
>>>> In the second case, meticulous users will complain that the
>>>> checksums in our THREDDS catalog are not the same as the checksums
>>>> of the files they have already downloaded.
>>>>
>>>> What is the best way forward? I suspect it is the second one,
>>>> because checksums are not supposed to be data identifiers, but only
>>>> a means of checking for data corruption immediately after transfer.
>>>> But does everybody agree with that?
>>>>
>>>> Regards
>>>>
>>>> --
>>>> Stéphane Sénési
>>>> Ingénieur - équipe Assemblage du Système Terre Centre National de
>>>> Recherches Météorologiques Groupe de Météorologie à Grande Echelle et
>>>> Climat
>>>>
>>>> CNRM/GMGEC/ASTER
>>>> 42 Av Coriolis
>>>> F-31057 Toulouse Cedex 1
>>>>
>>>> +33.5.61.07.99.31 (Fax :....9610)


-- 
Stéphane Sénési
Ingénieur - équipe Assemblage du Système Terre
Centre National de Recherches Météorologiques
Groupe de Météorologie à Grande Echelle et Climat

CNRM/GMGEC/ASTER
42 Av Coriolis
F-31057 Toulouse Cedex 1

+33.5.61.07.99.31 (Fax :....9610)


