[Go-essp-tech] versions, checksums and the TDS

Estanislao Gonzalez gonzalez at dkrz.de
Sat Sep 24 05:35:20 MDT 2011


Hi,

I'm a little lost among the multiple proposals that arose over the last 
few days. I'll try to summarize them while commenting, but I think 
somebody should take a step back and describe the use cases that can't 
be fulfilled with our current federation instance.

1) "latest" directory
The main problem is that "latest" is a moving target: security will 
break, since it is designed to secure URLs and not data, and the 
contents behind the latest URL are meant to change from version to 
version (new files added, old ones removed, etc.).
GridFTP using BDM would indeed be the only viable way of doing this. 
Still, if we wanted to allow it (which we don't, since it completely 
breaks our security infrastructure), the user would lose any reference 
to the version being downloaded, and with that, any metadata referring 
to it. They would only see that some files changed and would not be 
able to compare them to the version they had, since they would have no 
clue which version that was (it's the same directory, named "latest").
So, because of the security and metadata loss, I'd strongly discourage 
following this path any further (just my opinion anyway).
In my opinion, users don't just want access to a directory holding the 
latest version. I think they want to access a particular version (which 
happens to be the latest at that time), notice when a new version comes 
out, know what was changed, and possibly download the new version along 
with the one they already had, so they can compare the two in case the 
results are too far from their expectations. But I might be wrong.

2) checksums
Checksums are the only reference a data node gives to the outside world 
about the changes a file underwent from one version to another; e.g. 
for replication we use that information to retrieve only the files that 
changed between versions. The same principle could be applied to tools 
designed for end users.
In any case, checksums are a must and should be enforced. Checksumming 
is not a cheap operation, and if done wrong (i.e. serially) it will 
take too much time. I think we could provide a simple tool for esgcet 
that does this after the files have been published, updating the DB 
directly for files that are not served off-line; a sketch follows below.
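To make this concrete, here is a minimal sketch of such a 
post-publication tool, checksumming files in parallel with a worker 
pool. MD5 is an assumption (whatever type the federation settles on 
would do), and the final DB update is left as a placeholder, since the 
esgcet schema and site setup vary:

    #!/usr/bin/env python
    # Sketch: parallel post-publication checksummer.
    # Walks a published dataset directory, checksums every file using a
    # pool of worker processes, and emits "md5  path" pairs. Writing the
    # results back into the esgcet DB is deliberately left out.
    import hashlib
    import os
    import sys
    from multiprocessing import Pool

    def md5sum(path):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return path, h.hexdigest()

    def all_files(root):
        for dirpath, _, names in os.walk(root):
            for name in names:
                yield os.path.join(dirpath, name)

    if __name__ == '__main__':
        pool = Pool()  # one worker per core; disk I/O may be the limit
        for path, digest in pool.imap_unordered(md5sum,
                                                all_files(sys.argv[1])):
            # Placeholder: update the corresponding file row in the
            # publisher DB here instead of printing.
            print("%s  %s" % (digest, path))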
Regarding this, I'd like to summarize what "types" of versions there 
are and how we are using them (and mention a problem the last point 
will cause). The catalog version was intended to version the catalogs 
themselves, i.e. the metadata, and the file version to version the 
files. We are currently (and this won't change) using the catalog 
version as the dataset version, which versions a set of files; the file 
version is not used at all. This means that if we republish a catalog 
with new metadata, e.g. checksums that weren't there before, we can't 
mark it as new with a new version number, as that would imply the data 
has changed. So we allow catalogs to change without any notification of 
those changes, as long as the underlying data isn't altered. This 
breaks any tool harvesting the catalogs, or at least forces them to 
re-harvest all catalogs in the whole federation all the time, which 
amounts to the same thing.

3) Checksums inconsistencies
We've seen catalogs that were published with new data but whose 
checksums weren't updated. I'm not entirely sure how that could have 
happened, but it broke our tools completely; there's no workaround for 
this. If possible we should try to prevent it from happening, though, 
perhaps with a pre-publication check along the lines sketched below.
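One possible sketch of such a safeguard, assuming MD5 and an input file 
of "md5  path" pairs (e.g. the output of the tool above, or whatever 
the publisher has recorded), recomputing each checksum and refusing to 
proceed on a mismatch:

    # Sketch: pre-publication consistency check. Recomputes the checksum
    # of each listed file and flags any that no longer match the recorded
    # value, so a catalog is never (re)published with stale checksums.
    import hashlib
    import sys

    def md5sum(path):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    mismatches = 0
    for line in open(sys.argv[1]):
        recorded, path = line.split(None, 1)
        path = path.strip()
        if md5sum(path) != recorded.lower():
            print("MISMATCH: %s" % path)
            mismatches += 1
    sys.exit(1 if mismatches else 0)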

4) version comparison
Karl's suggestion is quite valid and can be summed up in this question: 
I've got a file; how do I know it hasn't been superseded by a newer one?
If the notification infrastructure worked (I don't think it does; I've 
never gotten an email telling me that something I downloaded changed), 
this wouldn't be necessary. Still, I think scientists would like to be 
able to check manually that this is indeed the case.
Basically the procedure is: get the latest file, compare it with the 
local one, and see whether it has changed.
If we have checksums, it becomes: get the checksum of the latest file, 
compute the local checksum, and compare (which reduces bandwidth 
considerably); see the sketch right below.
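As a sketch of the checksum variant, assuming MD5 and that the checksum 
of the latest version has already been retrieved (e.g. from the 
catalog):

    # Sketch: has my local copy been superseded? Compares the local MD5
    # against the checksum published for the latest version.
    import hashlib

    def local_md5(path):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def has_changed(path, latest_checksum):
        return local_md5(path) != latest_checksum.lower()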
I don't think creating those listings is a viable option, as publishers 
already don't want to use the DRS structure (as Karl pointed out) and 
in general have a pretty rough time with the system as it is. 
Furthermore, as Gavin mentioned, we don't want to keep information 
outside the catalogs. But the catalogs *are not* the source of the 
metadata; they are just an API. I think we might want to start writing 
other APIs as well, some basic web services that give access to this 
kind of information. I think this is something we (the developers) 
should do, not the administrators. The information is already there in 
the DB, so I see no point in duplicating it and maintaining a whole new 
structure for it, not to mention that keeping it synchronized would be 
a real pain.
A simple example of how such an API for assuring a file hasn't changed 
might look:
Query: http://server/thredds/validate?checksum=ae4..&version=latest
Response: either HTTP 200 if it's OK, or 404 if it's not.
This could be triggered via a simple command like
    wget http:... -O - -q || echo 'file has changed'
or from any other language or program there is.
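The same check from Python, against that hypothetical validate service 
(the endpoint and its parameters are the proposal above, not an 
existing TDS service):

    # Sketch: client for the proposed validate service. HTTP 200 means
    # the checksum still matches the latest version; 404 means it doesn't.
    import urllib.request
    import urllib.error

    def still_latest(server, checksum):
        url = ("http://%s/thredds/validate?checksum=%s&version=latest"
               % (server, checksum))
        try:
            urllib.request.urlopen(url)
            return True   # HTTP 200: unchanged
        except urllib.error.HTTPError:
            return False  # HTTP 404: superseded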
Does it make sense?

Anyway, this is a long mail already, sorry for that. I've tried to 
summarize the previous thread, which started with the archive view of 
the data and was already aiming at the end-user view. I might have 
missed something; please feel free to comment.

Thanks,
Estani

-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de


