[Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

martin.juckes at stfc.ac.uk
Fri Dec 21 12:51:26 MST 2012


Hi Karl,

the idea of a batch number is to make it easier to refer to large numbers of files, since data producers will typically be processing thousands at a time. The workflow might be: discover a need to re-process files, record the reasons and label that record, then run the data processing job and write the label into every file. After that, everything becomes easy. It would certainly be possible to build a system around tracking_ids, but I think the granularity would make it difficult. I may be overestimating the difficulty.

regards,
Martin

________________________________
From: Karl Taylor [taylor13 at llnl.gov]
Sent: 21 December 2012 18:17
To: Juckes, Martin (STFC,RAL,RALSP)
Cc: sebastien.denvil at ipsl.jussieu.fr; Stephane.Senesi at meteo.fr; go-essp-tech at ucar.edu; esgf-devel at lists.llnl.gov
Subject: Re: [Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

Hi all,

The original idea was to make more use of "tracking_id", a global attribute uniquely identifying each file. If tracking_id were relied on to distinguish among versions, a data provider could decide to assign the same tracking_id to a file that replaces an older one when only trivial differences exist (as in Stéphane's case). I'm not sure why a "batch number" is needed if we already have "tracking_id".
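For reference, a tracking_id is just a UUID minted when the file is written; a minimal Python sketch (the helper name is illustrative, not CMOR's actual API):

```python
import uuid

def new_tracking_id():
    """Generate a fresh tracking_id: CMIP5 files carry a UUID in the
    'tracking_id' global attribute, assigned at file-creation time."""
    return str(uuid.uuid4())

# A data provider replacing a file with a trivially different copy could
# choose to carry the old tracking_id over instead of minting a new one,
# which is the decision described above.
tid = new_tracking_id()
print(tid)  # a random UUID string on each run
```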

cheers,
Karl

On 12/21/12 4:50 AM, martin.juckes at stfc.ac.uk wrote:

Hello All,

I basically agree, but would say it is more than best practice -- creating a new publication version every time a file changes is the agreed approach and the only way we can track changes in the current system.

As has been pointed out, it would be very desirable to have a better way of keeping track of, and communicating, the nature of the changes between data versions. In the long term, a clean and robust approach is likely to need additional hooks in the file metadata, which we clearly don't want to introduce in CMIP5. In the short term, it might be possible to set up a wiki where modelling groups can list changes, tagged with (for example) institute, model, experiment and date. In the very short term, we could set up a Google Docs spreadsheet with 5 columns: institute, model, experiment, date of file modification, and a URL pointing to a page maintained by the modelling group. This would be simple and very useful.

In the longer term, one problem is that the data producer cannot record what is different in a new version at the time they produce the data, because the version number is, by design, assigned at a later date. This is a common enough problem, and the common solution is for the producer to stamp products with a batch number and then record information against that batch number. I think it would be a useful step to add a global "batch" attribute to the CMOR2 standard.
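A sketch of what such a batch stamp could look like; the attribute name "batch", the label scheme, and the dict standing in for a NetCDF file's global attributes are all assumptions here, not an agreed convention:

```python
import datetime

def make_batch_label(institute_id, run_number):
    # Hypothetical scheme: institute, date the batch was produced,
    # and a running number, e.g. "IPSL-20121221-003".
    today = datetime.date.today().strftime("%Y%m%d")
    return "%s-%s-%03d" % (institute_id, today, run_number)

# 'global_attrs' stands in for the global attributes written into every
# NetCDF file of one processing run; changes can later be recorded
# against the shared batch label.
global_attrs = {"institute_id": "IPSL"}
global_attrs["batch"] = make_batch_label(global_attrs["institute_id"], 3)
```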

Data checksums would also be very useful. This is not technically difficult -- it is just a question of agreeing how to do it. A useful first step might be to add a feature to CMOR2 that calculates a checksum for each variable (e.g. the Fletcher-32 checksum used by the NetCDF library) and adds it as a variable attribute. One step we could take for CMIP5 would be to record somewhere, for each publication unit, the number of files and the number of files which are changed, new, or have changed data. But this would require running code at the data publication stage to compare new and old versions of each file, and the overhead in CPU time may be beyond our resources.
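To make the checksum idea concrete, here is a pure-Python sketch of Fletcher-32; the NetCDF-4/HDF5 library computes it in C over a variable's raw bytes, and details such as word size, byte order, and padding below follow the common convention rather than any agreed CMIP5 scheme:

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 over a byte string, treated as a sequence of
    little-endian 16-bit words (zero-padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

# Recomputing this over a variable's data block and comparing it with a
# stored attribute would detect corruption of the data itself,
# independent of any metadata edits.
print(hex(fletcher32(b"abcde")))  # 0xf04fc729
```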

cheers,
Martin
________________________________________
From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu] on behalf of Sébastien Denvil [sebastien.denvil at ipsl.jussieu.fr]
Sent: 21 December 2012 11:03
To: Stéphane Senesi
Cc: go-essp-tech at ucar.edu; esgf-devel at lists.llnl.gov
Subject: Re: [Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

Hello all,

indeed, Stéphane, the best practice as things stand is to republish a
new version.

I agree with Estani's idea that we might want to have checksums
over the data part only (or separate versioning for the data part and
for the metadata part).

In Europe, IS-ENES phase 1 will come to an end early next year. In this
context we plan to write a data management white paper; I will draft a
first version by mid-January. The paper will emphasize best practices
and recommendations, in particular with respect to versioning (and how
to keep track of changes and version history). I already have a few
people in mind to ask for advice and/or contributions. If you want to
contribute or share your ideas on this, do not hesitate to ping me.

regards and merry Christmas!
Sébastien

Le 21/12/2012 11:27, Stéphane Senesi a écrit :


Estanislao,

Estanislao Gonzalez wrote, On 21/12/2012 10:25:


Hi All,

IMHO:
new checksum == new version

As the system is designed, this means we will have more versions than
necessary if (and only if) the same data is regenerated with some
unimportant metadata change (in all other cases, a new version is
compelling). I don't see why this particular scenario should happen
often, but perhaps I'm wrong.
So if I were you, Stéphane, I would make sure the data is generated
"exactly" like the old data (same checksum). If that's not possible,
then it's not the same data, ergo a new version.

In the future, we might want to compute checksums over the data part
only and skip the metadata (or at most include a few selected metadata
entries, which I wouldn't do, by the way)
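A sketch of that idea, with a plain dict standing in for a NetCDF file; because only the variable data is hashed, a metadata-only edit (e.g. a new 'history' timestamp) leaves the checksum unchanged:

```python
import hashlib

def data_only_checksum(variables):
    """Checksum over variable data only; 'variables' maps each variable
    name to its raw data bytes. Attributes are deliberately excluded."""
    h = hashlib.md5()
    for name in sorted(variables):        # stable, order-independent
        h.update(name.encode("utf-8"))
        h.update(variables[name])
    return h.hexdigest()

# Two copies of the "same" file: identical data, different 'history'.
file_a = {"data": {"tas": b"\x01\x02\x03"},
          "attrs": {"history": "written 2012-12-19"}}
file_b = {"data": {"tas": b"\x01\x02\x03"},
          "attrs": {"history": "rewritten 2012-12-21 after disk failure"}}

assert data_only_checksum(file_a["data"]) == data_only_checksum(file_b["data"])
```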


In my case, the only metadata that I cannot re-construct is a small part
of the 'history' attribute, namely the date at which the (other, useful)
metadata attributes were fixed ...  But I understand the argument.

The need for a coordinated way to describe and publish (clear text)
version information remains strong. Not sure about the usefulness of a
CIM on version information ;-)

S



But as the system is designed right now, that's not possible.

My 2c,
Estani

On 21.12.2012 10:04, Kettleborough, Jamie wrote:


Hello Karl,

Is there any documentation anywhere on what changes should or
shouldn't trigger new versions? Is it worth adding this advice to
that
documentation?

Thanks,

Jamie



________________________________________
From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
Sent: 20 December 2012 17:51
To: stephen.pascoe at stfc.ac.uk
Cc: esgf-devel at lists.llnl.gov; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] [esgf-devel] Changing checksums without changing version number ?

Hi Stephane and Stephen,

I agree with Stephen that what you should do is not obvious, but I
think the least confusing option is indeed to publish a new version.
We'll need to work on making information about how versions differ (or
don't) easily accessible, so as to avoid folks reanalyzing output when
it is unnecessary.

Karl

On 12/20/12 2:42 AM, stephen.pascoe at stfc.ac.uk wrote:



Hi Stephane,

This is a difficult use case to resolve. I have broadened the
thread to go-essp-tech because it affects the whole plan of how we
keep track of changing data.

My opinion is that you should publish this data as a new version.
We have been assuming that each dataset version has a stable set of
checksums, and we'd like to build tools around this assumption that
check the integrity of the archive (admittedly we haven't got there
yet).

If you republish files as the same versions but with different
checksums, we cannot tell that only the metadata has changed. Thinking
about the archive as a whole, we have to assume that any file-version
whose checksum changes could be different data, and flag it as such. It
would be better to create a new version and document that it is a
trivial change.

Unfortunately we don't have a good system for documenting these
version transitions. BADC did produce a web-app for this some months
ago, but it didn't catch on [1]. There is also a wiki page [2] where
you can note down data provider issues, but I doubt any users know it
exists. If you record what you've done in one of those places, the
knowledge will not be lost.

[1] http://cmip-ingest1.badc.rl.ac.uk/drscomm/search/ (contact
Ag Stephens for details: CC'd above)
[2] http://esgf.org/wiki/CMIP5ProviderFAQ

Cheers,
Stephen.

---
Stephen Pascoe +44 (0)1235 445980
Centre of Environmental Data Archival
STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
0QX, UK

-----Original Message-----
From: owner-esgf-devel at lists.llnl.gov
[mailto:owner-esgf-devel at lists.llnl.gov] On Behalf Of Stéphane Senesi
Sent: 19 December 2012 16:58
To: esgf-devel at lists.llnl.gov
Subject: [esgf-devel] Changing checksums without changing version
number ?

Dear all,

We experienced a disk array failure on the ESG datanode hosting our
CMIP5 data. We are able to regenerate the same data and metadata,
except that, in each NetCDF file, a small part of the "history"
metadata changes (namely the date of the last setting of some
metadata).
Hence the checksum does change, and we have no way to avoid it.

We can either re-publish the affected datasets with a new version
number or with the same version number.

In the first case, all users may think that the data is new, will
have to consider whether they want to download it again, and, if they
do, may complain that we are generating unnecessary extra work.

In the second case, meticulous users will complain that the
checksums in our THREDDS catalog are not the same as the checksums
of the files they have already downloaded.

What is the best way forward? I suspect it is the second option,
because checksums are not supposed to be data identifiers, but only a
means of checking for data corruption immediately after transfer. But
does everybody agree with that?

Regards

--
Stéphane Sénési
Ingénieur - équipe Assemblage du Système Terre Centre National de
Recherches Météorologiques Groupe de Météorologie à Grande Echelle et
Climat

CNRM/GMGEC/ASTER
42 Av Coriolis
F-31057 Toulouse Cedex 1

+33.5.61.07.99.31 (Fax :....9610)




--
Sébastien Denvil
IPSL, Pôle de modélisation du climat
UPMC, Case 101, 4 place Jussieu,
75252 Paris Cedex 5

Tour 45-55 2ème étage Bureau 209
Tel: 33 1 44 27 21 10
Fax: 33 1 44 27 39 02





