[Go-essp-tech] Reasoning for the use of symbolic links in drslib

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Fri Sep 23 04:10:15 MDT 2011


Hello Karl,
 
thanks for responding on this and making the user view much more
explicit.  And thanks for the note on the checksum - it's good to know
this is close to being 'required'.
 
I also agree that the risk to CMIP5 (and the model contribution to IPCC
AR5) due to data access problems is sufficiently high that simple
(non-general) solutions that can be delivered quickly are needed.  I
think that many of the early CMIP5/working group 1 users are happy to
take some of the responsibility for filtering which data they need,
QC, etc. on themselves.  So this user base can tolerate simpler
solutions.  This may not apply to working groups 2, 3... I don't know.
Later studies of working group 1 may need richer model meta-data.
 
Some questions/comments on your proposal:
 
1. I think the list files are derivable from the thredds catalogue
entries for the publication version dataset (if they all contained the
checksums) - I think you suspect this.  In a sense (I think) they are a
reformatting of the thredds catalogues into a form more parsable by
users.  If it can be achieved in time then I think it's safer to get the
checksums into the thredds catalogues and derive any other format from
there.
2. Do you think you would expose these list files through http?  You
mention gridFTP, but how soon do you think gridFTP will be available for
the users that need it...
   a. I'm not sure how many nodes have data available through gridFTP
(http://esgf.org/wiki/Cmip5Status/ArchiveView suggests not many?)
   b. I'm not sure how many users will have gridFTP clients (or maybe
you can use a standard ftp client?)
3. Do you need to capture this idea of 'latest' in the user view, or can
the user work this out based on the version number?
   a. including 'latest' makes it easier for users as it takes one bit
of responsibility away from them
   b. but you may be introducing an inconsistency between the thredds
interface (which doesn't really expose this idea of latest) and the
more file-based interface
   c. this exposure of 'latest' may be a minor point (but it's ringing
alarm bells with me)
4. I don't think you need the time samples, do you - isn't that in the
file name?
5. What is the full path to the file - the one visible through gridFTP,
or through the thredds file server, or what?
6. An addition - but in the same vein of simplicity - can we have an
easy-to-parse list of the servers that hold CMIP5 data available via
http?  In the first instance this could be populated by hand.  It could
be as simple as a csv file - server,pki_status (see the sketch below).
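
For illustration, a minimal Python sketch of reading such a list (the
file name is hypothetical; only the server,pki_status columns proposed
above are assumed, with no header row):

import csv

def read_server_list(path):
    """Return (server, pki_status) pairs from the proposed CSV."""
    with open(path, newline="") as fh:
        return [(row[0], row[1]) for row in csv.reader(fh) if len(row) >= 2]

# hypothetical usage:
# for server, pki_status in read_server_list("cmip5_http_servers.csv"):
#     print(server, pki_status)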
 
I'm afraid I haven't had time to think about all the issues around hard
links, soft links and tape storage - and there may be more major issues
there.
 
Jamie


________________________________

	From: Karl Taylor [mailto:taylor13 at llnl.gov] 
	Sent: 21 September 2011 23:19
	To: stephen.pascoe at stfc.ac.uk
	Cc: gavin at llnl.gov; Kettleborough, Jamie; go-essp-tech at ucar.edu;
esg-node-dev at lists.llnl.gov
	Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic
links in drslib
	
	
	Hi Stephen and all,
	
	I would add another requirement (or is this part of 4?):
	
	5.  A user (as opposed to a data provider or a "replicator" or a
data center data manager) should be able to determine (through an
automated scripted process) whether a file previously downloaded is in
the current (i.e., "latest")  version of a dataset, or has been
withdrawn or replaced.
	
	To meet all the requirements in a practical way in the next few
weeks, I'll suggest an alternative approach:  We could use drsLib to
create the DRS directory structure, but populate the lowest level (where
the files would normally be found) with a single text file  (referred to
subsequently as the "listing file") containing the following
information:
	
	the publication-level dataset version THREDDS id, which is:
	<activity>.<product>.<institute>.<model>.<experiment>.<frequency>.<modeling realm>.<MIP table>.<ensemble member>.<version number>
	plus the <variable name>
	followed by a table with:

	filename    time units    time of 1st time sample    time of last time sample    full path to file    tracking_id    checksum
	--------    ----------    -----------------------    ------------------------    -----------------    -----------    --------
	file1
	file2
	.
	.
	.
	fileN
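
	As an illustration, here is a minimal Python sketch of how a user
script might read such a "listing file".  It assumes tab-separated
columns and a single id line at the top; the file names and column names
are hypothetical, and this is a sketch rather than anything drslib
currently provides:

COLUMNS = ["filename", "time_units", "time_first", "time_last",
           "full_path", "tracking_id", "checksum"]

def read_listing(path):
    """Parse a listing file into (dataset_id, {filename: record})."""
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    dataset_id = lines[0].strip()          # THREDDS id (+ variable name)
    records = {}
    for line in lines[1:]:
        cells = [c.strip() for c in line.split("\t")]
        # skip blank lines, the column-header row and the dashed ruler
        if (len(cells) < len(COLUMNS) or cells[0] == "filename"
                or cells[0].startswith("--")):
            continue
        records[cells[0]] = dict(zip(COLUMNS, cells))
    return dataset_id, records

# hypothetical usage:
# dataset_id, files = read_listing("latest/tas_listing.txt")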
	
	The "listing file" would be stored twice for the latest version
of each dataset:  once under the numbered version subdirectory and
*also* under the generically labeled "latest" directory.  [This is so a
user interested in the latest version can find it without knowing its
actual number.]  By the way the time information included in the list
might not be absolutely essential, but it could be helpful for those
only wanting to download specific time-segments of an integration.
	
	I realize this is not a particularly elegant approach, but if
users were given access to the drs directory structure (say, through
gridftp), they could run a script that navigated directly to a variable
of interest (based on the DRS directory structure specifications) and
downloaded the "listing file" stored there.  Then, the "latest" listing
file could be compared to the older "listing file" (previously
downloaded by the user) to determine whether a new version was available
(by simply comparing the <version numbers> stored in the THREDDS ID).
If the user didn't have the most recent version, he could then compare
the two "listing files" (old and new) to determine which files were new
and which (if any) had been eliminated.
	
	At that point, the user could generate a local copy of the
latest version by moving/deleting files not found in the latest "listing
file" and by downloading (using, for example, gridftp) only the new
files.
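
	Sketching that comparison step (this assumes the version number is
the final dot-separated field of the THREDDS id, and that old_files and
new_files are the per-filename records a script like the read_listing
sketch above would produce; none of the names are prescriptive):

def version_of(dataset_id):
    """The <version number>, taken to be the last '.'-separated field."""
    return dataset_id.split(".")[-1]

def compare_listings(old_id, old_files, new_id, new_files):
    """Return (to_download, to_delete) to bring a local copy up to date."""
    if version_of(old_id) == version_of(new_id):
        return [], []                      # already at the latest version
    to_download = [f for f in new_files
                   if f not in old_files
                   or new_files[f]["checksum"] != old_files[f]["checksum"]]
    to_delete = [f for f in old_files if f not in new_files]
    return to_download, to_delete

# hypothetical usage:
# old_id, old_files = read_listing("my_copy/tas_listing.txt")
# new_id, new_files = read_listing("latest/tas_listing.txt")
# fetch, obsolete = compare_listings(old_id, old_files, new_id, new_files)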
	
	I bet that in a single day Stephen could enhance drslib to
produce these list files, rather than creating the symbolic links to the
actual file locations as it currently does.  Note that if the actual
files were moved into new directories sometime in the future, a utility
would have to be written to modify all the "list files" to point to the
new file locations (but that's also true of the symbolic links, I
think).
	
	Also note that creation of a new version would *not* require
changing any of the existing "list files" (except the list file in the
"latest" directory would be removed).  A new version subdirectory would
have to be created and for each variable in the dataset, the new "list
file" for that version would have to be generated (and copied also to
"latest").
	
	I'll be interested in your response to this idea and trust that
any time spent thinking about it is warranted (i.e., that this is not a
completely stupid suggestion).  Will it meet all of Stephen's needs?
Are there any other solutions to the data users' troubles in obtaining
data, which we can implement in the next few weeks (since that should be
our goal here)?
	
	My primary interest is in making CMIP5 data easily obtainable by
users (which appears not to be the case at present), and in allowing
users to write scripts that troll for new data they are interested in
and discover any new versions of data that should replace the old.  This
is not meant to be a general solution to all of the possible ESG
applications.  Also, I'm guessing that a similar approach could be
followed where, instead of reading the "list files", one reads the
catalogs, but I doubt that this would be as easy for the typical user to
do.
	
	Best regards,
	Karl
	
	P.S. To weigh in on another issue, I think it *will* be
essential to require, as part of ESG publication, that the checksum be
recorded (in the THREDDS catalog, if I'm not mistaken).  We haven't
asked groups to republish data conforming to this new requirement
because I want to make sure that any other required alterations in the
configuration of the publisher are also communicated, so we only have to
ask groups to republish once.  Note also that if my "alternative"
approach outlined above is adopted, the checksums could either be gotten
from the catalog (if they were computed and stored there) or be
calculated by drslib itself; there would be no need to republish data to
ESG.
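
	For what it's worth, computing such a checksum locally is
straightforward; the sketch below assumes MD5 (this thread doesn't
settle the algorithm) and uses a made-up file name:

import hashlib

def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex checksum of a file, read in chunks to bound memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# hypothetical usage:
# print(file_checksum("tas_Amon_MODEL_historical_r1i1p1_185001-200512.nc"))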
	
	
	On 9/20/11 2:35 PM, stephen.pascoe at stfc.ac.uk wrote: 

		Hi All,

		

		Lots of good discussion here and sorry I've been keeping
quiet.  I want to remind ourselves of the requirements I laid out in the
wiki page:

		

		1. It should allow data from multiple versions to be
kept on disk simultaneously.

		2. It should avoid storing multiple copies of files that
are present in more than one version.

		3. It should be straightforward to copy dataset changes
(i.e. differences between versions) between nodes to allow efficient
replication.

		4. It should rely only on the filesystem so that generic
tools like FTP could be used to expose the structure if necessary.

		

		In my view we should address these directly.  Are they
needed?  Which are the most important?

		

		Gavin said about catalogs:

		> you can quickly and easily inspect catalog_v1 and
catalog_v2 to find what the changes are.
		> This all answers the question of "WHAT" (to
download)... the other question of "HOW" is a different, but related
story.
		> The trick is to not conflate the two issues which is
what filesystem discussions do.

		

		But THREDDS conflates the two as well!  A THREDDS
catalog contains descriptions of service endpoints that are not
independent of the node serving the data (the "HOW").  Maybe we should
have developed a true catalog format but that is not where we are now.
The replication client uses THREDDS catalogs in this way but when I last
looked it was completely unaware of versions -- i.e. it won't help with
#3.

		

		I don't see how Gavin's point addresses any of the
requirements above.  Even if we ditch #4, which I expect Gavin would
argue for, it doesn't directly solve the problem for #1-#3 either.

		

		Briefly on some other points that have been made...

		

		Balaji, some archive tools may be able to detect two
paths pointing to the same filesystem inode, but both Estani and I have
enquired with our backup people and they say hard links must be avoided.
I am happy to include a hard-linking option in drslib though.  I've
created a bugzilla ticket for it.

		

		Karl, I think putting real files in "latest" is
equivalent to putting real files in the latest "vYYYYMMDD" directory.
The directories can be renamed trivially on upgrade but you still have
the same problems as the wiki page says.

		

		I'm sure there were other points but I've lost track.
Checksums will have to wait for another email.

		

		Cheers,

		Stephen.

		

		

		---

		Stephen Pascoe  +44 (0)1235 445980

		Centre of Environmental Data Archival

		STFC Rutherford Appleton Laboratory, Harwell Oxford,
Didcot OX11 0QX, UK

		

		From: go-essp-tech-bounces at ucar.edu
[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Gavin M. Bell
		Sent: 20 September 2011 17:26
		To: Kettleborough, Jamie
		Cc: go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
		Subject: Re: [Go-essp-tech] Reasoning for the use of
symbolic links in drslib

		

		Jamie and friends.
		
		You've answered your own questions :-)... 
		It is the catalog where these checksums (and other
features) are recorded.
		And thus using the catalog you can see what has changed.
		There is a new catalog for every version of a dataset.
Given that...
		you can quickly and easily inspect catalog_v1 and
catalog_v2 to find what the changes are.
		This all answers the question of "WHAT" (to download)...
the other question of "HOW" is a different, but related story.
		The trick is to not conflate the two issues which is
what filesystem discussions do.  When talking about filesystems you are
stipulating the what but implicitly conflating the HOW because you are
implicitly designing for tools that intrinsically use the filesystem.
It is a muddying of the waters that doesn't separate the two concerns.
We need to deal with these two concepts independently in a way that does
not limit the system or cause undue burden on institutions by requiring
a rigid structure.
		
		As I mentioned... it's not the filesystem we need to
look at... it's the catalogs.
		
		just my $0.02 - I'll stop flogging this particular
horse... but I hope I have done a better job expressing the issues and
where the solution lies (IMHO).
		
		On 9/20/11 8:14 AM, Kettleborough, Jamie wrote: 

		Hello Balaji,
		 
		I agree - getting all nodes to make the checksums
available would be a
		good thing.  It gives you both the data integrity check
on download, and
		the ability to see what files really have changed from
one publication
		version to the next.
		 
		I don't know how hard it is to do this, particularly for
data that is
		already published.
		 
		Jamie 
		 

			-----Original Message-----
			From: V. Balaji [mailto:V.Balaji at noaa.gov] 
			Sent: 20 September 2011 16:01
			To: Kettleborough, Jamie
			Cc: Karl Taylor; go-essp-tech at ucar.edu;
esg-node-dev at lists.llnl.gov
			Subject: Re: [Go-essp-tech] Reasoning for the
use of symbolic 
			links in drslib
			 
			If nodes can currently choose to record
checksums or not, I'd 
			strongly recommend this be a non-optional
requirement... how
			could anyone download any data with confidence
without being 
			able to checksum?
			 
			You can of course check timestamps and filesizes
and so on, 
			but you have to consider those optimizations...
a fast option 
			for the less paranoid to avoid the sum
computation, which has 
			to be the gold standard.
			 
			"Trust but checksum".
			 
			Kettleborough, Jamie writes:
			 

				Hello Karl, everyone,

				   For replicating the latest version, I agree that your alternate
				structure poses difficulties (but it seems like there must be a way to
				smartly determine whether you already have a file and simply need to
				move it, rather than bring it over again).

				Doesn't every user (not just the replication system) have this problem:
				they want to know what files have changed (or not changed) at a new
				publication version.  No one wants to be using bandwidth or storage
				space to fetch and store files they already have.  How is a user
				expected to know what has really changed?  Estani mentions checksums
				- OK, but I don't think all nodes expose them (is this right?).  You
				may try to infer from modification dates (not sure, I haven't looked at
				them that closely).  You may try to infer from the TRACKING_ID - but
				I'm not sure how reliable this is (I can imagine scenarios where
				different files share the same TRACKING_ID - e.g. if they have been
				modified with an nco tool).

				Is there a recommended method for users to understand what *files*
				have actually changed when a new publication version appears?

				Thanks,

				Jamie


			 
			-- 
			 
			V. Balaji                               Office:
+1-609-452-6516
			Head, Modeling Systems Group, GFDL      Home:
+1-212-253-6662
			Princeton University                    Email:
v.balaji at noaa.gov
			 

		_______________________________________________
		GO-ESSP-TECH mailing list
		GO-ESSP-TECH at ucar.edu
		http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

		
		
		

		-- 
		Gavin M. Bell
		--
		 
		 "Never mistake a clear view for a short distance."
		               -Paul Saffo
		 





