[Go-essp-tech] Reasoning for the use of symbolic links in drslib

Wed Sep 21 04:46:07 MDT 2011

Hello Gavin,

So is that the consensus - every data node should record the checksums for
every file in the THREDDS catalogues?

(Urm... not sure it's really my role to say this - so sorry if I've
stepped out of line).

Jamie

p.s.  do I have friends?  I thought I was just making enemies

________________________________

	From: Gavin M. Bell [mailto:gavin at llnl.gov] 
	Sent: 20 September 2011 17:26
	To: Kettleborough, Jamie
	Cc: V. Balaji; go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
	Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic links in drslib

	Jamie and friends.

	You've answered your own questions :-)... 
	It is the catalog where these checksums (and other features) are recorded.
	And thus using the catalog you can see what has changed.
	There is a new catalog for every version of a dataset. Given that...
	you can quickly and easily inspect catalog_v1 and catalog_v2 to find what
	the changes are.
	This all answers the question of "WHAT" (to download)... the other question
	of "HOW" is a different, but related story.
	The trick is not to conflate the two issues, which is what filesystem
	discussions do.  When talking about filesystems you are stipulating the
	WHAT but implicitly conflating it with the HOW, because you are implicitly
	designing for tools that intrinsically use the filesystem.  That muddies
	the waters rather than separating the two concerns.  We need to deal with
	these two concepts independently, in a way that does not limit the system
	or cause undue burden on institutions by requiring a rigid structure.
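
	To make the "WHAT" side concrete, here is a minimal sketch of comparing
	two catalog versions by checksum.  It assumes each catalog has already
	been parsed into a plain mapping of file name to checksum; the function
	name diff_catalogs, the file names and the checksum values are all
	hypothetical, and parsing the catalog XML itself is not shown.

```python
def diff_catalogs(old, new):
    """Compare two {file_name: checksum} mappings from successive versions."""
    old_names, new_names = set(old), set(new)
    common = old_names & new_names
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(n for n in common if old[n] != new[n]),
        "unchanged": sorted(n for n in common if old[n] == new[n]),
    }


# Hypothetical file names and checksums, for illustration only.
catalog_v1 = {"tas_day.nc": "aaa111", "pr_day.nc": "bbb222", "ua_day.nc": "ccc333"}
catalog_v2 = {"tas_day.nc": "aaa111", "pr_day.nc": "ddd444", "va_day.nc": "eee555"}

changes = diff_catalogs(catalog_v1, catalog_v2)
# Only the "added" and "changed" files would need to be fetched.
to_download = changes["added"] + changes["changed"]
```

	Nothing here depends on where the files sit on disk, which is the
	point: the comparison needs only what the catalogs record.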

	As I mentioned... it's not the filesystem we need to look at... it's the
	catalogs.

	just my $0.02 - I'll stop flogging this particular horse... but I hope I
	have done a better job expressing the issues and where the solution lies
	(IMHO).

	On 9/20/11 8:14 AM, Kettleborough, Jamie wrote: 

		Hello Balaji,

		I agree - getting all nodes to make the checksums available would
		be a good thing.  It gives you both the data integrity check on
		download, and the ability to see what files really have changed
		from one publication version to the next.

		I don't know how hard it is to do this, particularly for data that
		is already published.

		Jamie 

			-----Original Message-----
			From: V. Balaji [mailto:V.Balaji at noaa.gov] 
			Sent: 20 September 2011 16:01
			To: Kettleborough, Jamie
			Cc: Karl Taylor; go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
			Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic links in drslib

			If nodes can currently choose to record checksums or not,
			I'd strongly recommend this be a non-optional requirement.
			How could anyone download any data with confidence without
			being able to checksum?

			You can of course check timestamps and file sizes and so on,
			but you have to consider those optimizations: a fast option
			for the less paranoid to avoid the sum computation, which
			has to be the gold standard.
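
			As a rough sketch of that ordering (cheap size check first,
			full checksum as the final word), something like the following
			would do.  The function names, the choice of SHA-256 and the
			idea that the expected size and checksum come from the catalog
			are assumptions for illustration, not a description of any
			existing tool.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_checksum, expected_size=None,
                    algorithm="sha256"):
    """Return True only if the file matches the published checksum.

    The size comparison is just a fast rejection path; a matching size
    never replaces the checksum comparison.
    """
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    return file_checksum(path, algorithm) == expected_checksum.lower()
```

			A mismatch on size rejects the file without reading it; a
			match on size still goes on to the full checksum.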

			"Trust but checksum".

			Kettleborough, Jamie writes:

				Hello Karl, everyone,

					For replicating the latest version, I agree that your
					alternate structure poses difficulties (but it seems like
					there must be a way to smartly determine whether you
					already have a file and simply need to move it, rather
					than bring it over again).

				Doesn't every user (not just the replication system) have this
				problem: they want to know what files have changed (or not
				changed) at a new publication version?  No one wants to be using
				bandwidth or storage space to fetch and store files they already
				have.  How is a user expected to know what has really changed?
				Estani mentions checksums - OK, but I don't think all nodes
				expose them (is this right?).  You may try to infer from
				modification dates (not sure, I haven't looked at them that
				closely).  You may try to infer from the TRACKING_ID - but I'm
				not sure how reliable this is (I can imagine scenarios where
				different files share the same TRACKING_ID - e.g. if they have
				been modified with an NCO tool).

				Is there a recommended method for users
to understand what *files* 
				have actually changed when a new
publication version appears?

				Thanks,

				Jamie

			-- 

			V. Balaji                               Office: +1-609-452-6516
			Head, Modeling Systems Group, GFDL      Home:   +1-212-253-6662
			Princeton University                    Email:  v.balaji at noaa.gov

		_______________________________________________
		GO-ESSP-TECH mailing list
		GO-ESSP-TECH at ucar.edu
		http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

	-- 
	Gavin M. Bell
	--

	 "Never mistake a clear view for a short distance."
	       	       -Paul Saffo
