[Go-essp-tech] versions, checksums and the TDS
Estanislao Gonzalez
gonzalez at dkrz.de
Sat Sep 24 05:35:20 MDT 2011
Hi,
I'm a bit lost among the multiple proposals that arose over these last
few days. I'll try to summarize them while commenting, but I think somebody
should take a step back and describe the use cases that can't be
fulfilled with our current federation instance.
1) "latest" directory
The main problem is that "latest" is a moving target. For example, security
will break, since it is designed to secure URLs and not data, and the latest
URL is meant to change from version to version (new files added, old
ones removed, etc.).
GridFTP using BDM would indeed be the only viable way of doing this.
Still, if we wanted to allow it (which we don't, since it completely
breaks our security infrastructure), the user would lose any reference
to the version being downloaded, and with that, any metadata referring to
it. They would only see that some files changed and wouldn't be able to
compare them to the version they had, since they'd have no clue which
version that was (same directory named _latest_).
So because of security and metadata loss I'd strongly discourage
following this path any further (just my opinion anyway).
In my opinion, users don't just want access to a directory holding the
latest version. I think they want to access a particular version (which
happens to be the latest at that time), be notified when a new version
comes out, know what changed, and possibly download the new version
along with the one they already have, so they can compare the two in
case the results are too far from their expectations. But I might be wrong.
2) checksums
They are the only reference a data node gives to the outside world of the
changes a file underwent from one version to another; e.g. for
replication we use that information to retrieve only the files that changed
from one version to the next. The same principle could be applied to
tools designed for end users.
In any case, checksums are a must and should be enforced. Checksumming
is not a cheap operation, and if done wrong (i.e. serially) it
will take too much time. I think we could provide a simple tool for
esgcet that does this after the files have been published, so that it
updates the DB directly for files that are not served off-line.
Regarding this, I'd like to summarize what "types" of versions there are
and how we are using them (and mention a problem the last point will
cause). The catalog version was intended to version the catalogs
themselves, i.e. the metadata, and the file version to version the
files. We are currently (and this won't change) using the catalog
version as the dataset version, which versions a set of files. The file
version is not used at all. This means that if we republish a catalog
with new metadata, e.g. checksums that weren't there before, we
can't mark it as new with a new version number, as that would imply the
data has changed. So we allow catalogs to change without any
notification of these changes, as long as the underlying data isn't
altered. This breaks any tool harvesting the catalogs, or at least
forces them to re-harvest all catalogs in the whole federation all the
time, which is pretty much the same thing.
3) Checksums inconsistencies
We've seen catalogs that were published with new data, but whose
checksums weren't updated. I'm not entirely sure how that could have
happened, but it broke our tools completely. There's no workaround
for this. If possible we should try to prevent it from happening though...
4) version comparison
Karl's suggestion is quite valid and can be summed up in this question:
I've got a file; how do I know it hasn't been superseded by a newer one?
If the notification infrastructure worked (I don't think it does; I've
never got an email telling me that something I downloaded changed), this
wouldn't be necessary. Still, I think scientists would like to be able
to check manually that this is indeed the case.
Basically the procedure is: get the latest file, compare it with the local
one and see if it has changed.
If we have the checksums, it becomes: get the checksum of the latest file,
compute the local checksum and compare (which reduces bandwidth considerably).
I don't think creating those listings is a viable option, as publishers
already don't want to use the DRS structure (as Karl pointed out) and in
general have a pretty rough time with the system as it is. Furthermore,
as Gavin mentioned, we don't want to keep information outside the
catalogs. But the catalogs *are not* the source of the metadata; they are
just an API. I think we might want to start writing other APIs as well,
some basic web services that give access to this kind of information. I
think this is something we (the developers) should do, not the
administrators. The information is already in the DB, so I see no
point in duplicating it and maintaining a whole new structure for it, not to
mention that keeping it synchronized would be a real pain.
A simple example of how such an API for assuring a file hasn't changed
might look like:
Query: http://server/thredds/validate?checksum=ae4..&version=latest
Response: either HTTP 200 if it's OK, or 404 if it's not.
This could be triggered via a simple "wget http:... -O - -q || echo
'file has changed'" command, or with any other language or program out there.
Does it make sense?
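The same client-side check, beyond the wget one-liner, could be scripted in
any language. A Python sketch (note that the /thredds/validate endpoint is
only the proposal above, not an existing service, and "server" is a
placeholder):

```python
import hashlib
import urllib.error
import urllib.request

def file_unchanged(local_path, server):
    """Ask the (proposed) validate service whether our local copy is current.

    Computes the local MD5 and sends it to the hypothetical
    /thredds/validate endpoint; HTTP 200 means unchanged, 404 means the
    file has been superseded.
    """
    h = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    url = f"{server}/thredds/validate?checksum={h.hexdigest()}&version=latest"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # e.g. 404: our copy is stale
```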
Anyway, it's a long mail already, sorry for that. I've tried to
summarize the previous thread, which started with the archive view of the
data and was already aiming at the end-user view. I might have missed
something; please feel free to comment.
Thanks,
Estani
--
Estanislao Gonzalez
Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
Phone: +49 (40) 46 00 94-126
E-Mail: gonzalez at dkrz.de