[Go-essp-tech] versions, checksums and the TDS

Mon Sep 26 10:30:17 MDT 2011

Hello,

Sorry, I haven't had time to digest everything.

In response to the tracking id issue: I think there is a significant chance that a data provider might accidentally provide different files with the same tracking id.  The most likely case is around correction of things like 'forcings' and 'branch_date' - those NetCDF attributes that are left to the data provider to manage.  I think it's easy to make a slip with these (anecdotally, I've heard there are already examples in the CMIP5 repository) and then correct it later (I'm not sure whether there are plans to correct the examples already seen).  A correction made with a simple ncatted will not update the tracking id.

I don't know if this case of different files, same tracking id, has happened already - I guess someone could find out by trawling the catalogues...
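
A trawl of that kind could be sketched roughly as below. This is only an illustration: it assumes the catalogue entries have already been extracted into (path, tracking_id, checksum) records, and both that record layout and the function name are hypothetical rather than part of any existing tool.

```python
from collections import defaultdict


def find_conflicting_tracking_ids(records):
    """Return tracking ids that are attached to more than one distinct checksum.

    `records` is an iterable of (path, tracking_id, checksum) tuples.
    This layout is an assumption made for the example.
    """
    checksums_by_id = defaultdict(set)
    for _path, tracking_id, checksum in records:
        checksums_by_id[tracking_id].add(checksum)
    return {
        tracking_id: sorted(checksums)
        for tracking_id, checksums in checksums_by_id.items()
        if len(checksums) > 1
    }


# Made-up records: the second file was edited in place, so its checksum
# changed while its tracking id stayed the same.
records = [
    ("a/tas_v1.nc", "id-0001", "aaa"),
    ("a/tas_v2.nc", "id-0001", "bbb"),
    ("a/pr_v1.nc", "id-0002", "ccc"),
]
conflicts = find_conflicting_tracking_ids(records)
```

Any tracking id that comes back from this would be a case of "different files, same tracking id".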

Rather than worry about whether the tracking id is reliable, I think it's better to invest effort in getting the checksum into the system for all data.  *But* I don't control any effort on this, so weigh my opinions with that in mind...
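
For what it's worth, computing a checksum is cheap to script. Here is a minimal sketch using Python's standard hashlib. The function name and the default algorithm are my own choices for the example, not a statement of what the system uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Unlike the tracking id, this value changes whenever any byte of the file changes, including an attribute edit made after the file was written.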

(All this was in quite a rush - hope I haven't said anything too stupid/more stupid than usual)

Jamie

________________________________

	From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
	Sent: 25 September 2011 19:25
	To: Karl Taylor
	Cc: go-essp-tech at ucar.edu
	Subject: Re: [Go-essp-tech] versions, checksums and the TDS

	I did indeed mean the data providers. AFAIK some typical post-processing corrections do not generate new tracking_ids, but I might be wrong.

	I think the best place for this to be checked would be in the publisher itself. I assume the tracking id is a column attribute in a DB table. If that's the case, it might already have a unique constraint, or one could be added easily, but that is something Bob certainly knows better.
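
To illustrate what such a constraint would do, here is a small sketch using an in-memory SQLite database. The table name, the column names and the helper function are all hypothetical. The publisher's real schema and database are not assumed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the UNIQUE keyword is the constraint under discussion.
conn.execute(
    "CREATE TABLE file_version ("
    " id INTEGER PRIMARY KEY,"
    " path TEXT NOT NULL,"
    " tracking_id TEXT NOT NULL UNIQUE)"
)


def register_file(conn, path, tracking_id):
    """Insert a file record; return False if the tracking id is already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO file_version (path, tracking_id) VALUES (?, ?)",
                (path, tracking_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = register_file(conn, "tas_v1.nc", "id-0001")
second = register_file(conn, "tas_v2.nc", "id-0001")  # same id, rejected
```

With the constraint in place, the second registration fails at publication time instead of going unnoticed.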

	Thanks,
	Estani
	Am 25.09.2011 19:25, schrieb Karl Taylor: 

		Hi Estani,

		I'm not advocating using the tracking_id to test whether two files are identical.  I'm suggesting that for most users, they will be able to use it to determine whether they have the latest version of a particular file, as opposed to some earlier version.  It's true that you can modify a file without changing the tracking_id, but I'm pretty sure all but a tiny number of users will download the files and never modify them.  Whether or not a user alters files, new files available from the CMIP5 archive will have tracking_ids that the user doesn't have locally, so if they are interested, they can download the new files.

		The above assumes that data *providers* take care to generate a new tracking_id when they generate a file containing new data.  Is this a risky assumption?  Couldn't the CMIP5 QA procedure check whether a file has the same tracking_id as any other file in the system?

		best regards,
		Karl

		On 9/25/11 2:49 AM, Estanislao Gonzalez wrote: 

			I recall a problem where altering a file with some tools (cdo?) did not automatically change the tracking id.
			Are we sure that the same tracking id now always points to the same file?
			Is that earlier problem no longer an issue?

			Thanks,
			Estani
			Am 24.09.2011 18:17, schrieb Karl Taylor: 

				Dear all,

				Concerning:

				On 9/24/11 4:35 AM, Estanislao Gonzalez wrote: 

					2) checksums
					They are the only indication a data node gives to the outside of the
					changes a file underwent from one version to another, i.e. for
					replication we use that information to retrieve only the files that
					changed from one version to another. The same principle could be
					applied to tools designed for end users.
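
The principle described above can be sketched in a few lines. This assumes each version is available as a mapping from file name to checksum. That manifest layout and the function name are hypothetical.

```python
def files_to_fetch(old_version, new_version):
    """Return names of files that are new or whose checksum changed.

    Both arguments map file name -> checksum (an assumed manifest layout).
    """
    return sorted(
        name
        for name, checksum in new_version.items()
        if old_version.get(name) != checksum
    )


# Made-up manifests for two versions of one dataset.
v1 = {"tas.nc": "aaa", "pr.nc": "bbb"}
v2 = {"tas.nc": "aaa", "pr.nc": "ccc", "psl.nc": "ddd"}
changed = files_to_fetch(v1, v2)
```

Only the changed and the newly added files need to be transferred. Unchanged files are skipped.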

				without disputing that checksums should be mandatory, I want to point out that a user who has lost the checksum associated with a file he has downloaded shouldn't have to recompute the checksum to determine whether his file is a copy of a file residing at the data node.  Recall that recorded in each netCDF file is a unique tracking_id, which I'm almost positive is also in the THREDDS catalog.  It will certainly be quicker for the user to read the tracking_id and then check whether it matches the latest version.  I think we want to maintain tracking_id as an option for checking whether new files exist in a new version.
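
A rough sketch of that check follows. It assumes the tracking_ids have already been read from the local files and from the catalog for the latest version. How they are read is left out, and the function name and data layout are hypothetical.

```python
def compare_with_latest(local_ids, latest_ids):
    """Compare local tracking ids against those of the latest version.

    `local_ids` maps local file path -> tracking_id.
    `latest_ids` is an iterable of tracking_ids in the latest version.
    Returns (outdated_paths, missing_ids).
    """
    latest = set(latest_ids)
    # Local files whose id no longer appears in the latest version.
    outdated = sorted(path for path, tid in local_ids.items() if tid not in latest)
    # Ids in the latest version that the user does not hold locally.
    missing = sorted(latest - set(local_ids.values()))
    return outdated, missing


# Made-up ids for illustration.
local = {"tas.nc": "id-1", "pr.nc": "id-2"}
outdated, missing = compare_with_latest(local, ["id-1", "id-3"])
```

This only tells the user that newer files exist. It relies on a new tracking_id being generated whenever a file with new data is produced, and it says nothing about whether a local file has been modified.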

				best regards,
				Karl

			-- 
			Estanislao Gonzalez

			Max-Planck-Institut für Meteorologie (MPI-M)
			Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
			Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

			Phone:   +49 (40) 46 00 94-126
			E-Mail:  gonzalez at dkrz.de 

