[Go-essp-tech] Reasoning for the use of symbolic links in drslib

Karl Taylor taylor13 at llnl.gov
Fri Sep 23 11:06:51 MDT 2011


Hello Jamie,

Yes, you are right ... it would be better to rely on the catalogs (I'll 
take your word for it that THREDDS is the appropriate catalog to 
consult, but someone should confirm), but as you note I don't see tools 
being put in place quickly (in the next couple of weeks) to make it easy 
for casual users to access and interpret the information in those catalogs.

I've made some quick comments below:

On 9/23/11 3:10 AM, Kettleborough, Jamie wrote:
> Hello Karl,
> thanks for responding on this and making the user view much more 
> explicit.  And thanks for the note on the checksum - it's good to know 
> this is close to being 'required'.
> I also agree that the risk to CMIP5 (and the model contribution to 
> IPCC AR5) due to data access problems is sufficiently high that simple 
> (non-general) solutions that can be delivered quickly are needed.  I 
> think that many of the early CMIP5/working group 1 users are happy to 
> take some of the responsibility for filtering which data they need, 
> QC, etc. on themselves, so this user base can tolerate simpler 
> solutions.  This may not apply to working groups 2, 3... I don't 
> know.  Later studies of working group 1 may need richer model metadata.
> Some questions/comments on your proposal:
> 1. I think the list files are derivable from the thredds catalogue 
> entries for the publication version dataset (if they all contained 
> the checksums) - I think you suspect this.  In a sense (I think) they 
> are a reformatting of the thredds catalogues into a form more parsable 
> by users.  If it can be achieved in time then I think it's safer to get 
> the checksums into the thredds catalogues and derive any other format 
> from there.
See earlier comment: I agree that getting the checksums into the thredds 
catalogs is indeed a good idea, even if they can also be found in the 
"listing file".
> 2. do you think you would expose these list files through http?  You 
> mention gridFTP but how soon do you think gridFTP will be available 
> for the users that need it...
> a. I'm not sure how many have data available through gridFTP 
> (http://esgf.org/wiki/Cmip5Status/ArchiveView suggests not many?)
>    b. I'm not sure how many users will have gridFTP clients (or maybe 
> you can use a standard ftp client?)
Yes, the listing files should definitely be accessible through http.  Users 
should have options, and I agree the gridFTP option may not be in place 
soon enough.
> 3. Do you need to capture this idea of 'latest' in the user view, or 
> can the user work this out based on the version number?
>    a. including 'latest' makes it easier for users as it takes one bit 
> of responsibility away from them
>    b. but you may be introducing an inconsistency between the thredds 
> interface (which doesn't really expose this idea of latest) and the 
> more file-based interface.
>    c. this exposure of 'latest' may be a minor point (but it's ringing 
> alarm bells with me).
My impression is that the thredds implementation for CMIP5 may not 
easily handle versioning (but I could be completely wrong about this).  
If it can, and simply doesn't recognize "latest", I don't see that as a 
big problem.  The catalog structure we're talking about wouldn't have to 
include branches labeled "latest" if the software written to access the 
data were made smart enough to read the names of all the version 
subdirectories (or extract all the version numbers from thredds) and 
then determine the latest itself, along the lines of the sketch below.
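For instance, a minimal client-side sketch (assuming version 
subdirectories follow the DRS vYYYYMMDD naming convention):

    import os
    import re

    def latest_version(dataset_dir):
        # Version directories look like v20110915; since the date is
        # fixed-width, the lexicographic maximum is also the newest.
        versions = [d for d in os.listdir(dataset_dir)
                    if re.match(r'v\d{8}$', d)]
        return max(versions) if versions else None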
> 4. I don't think you need time sample do you - isn't that in the file 
> name?
Well, I don't want to entirely trust the file name to accurately reflect 
this information, and it also doesn't include the units of time, which 
might be necessary for a fuller understanding (not sure about this, 
though).  The main motivation was to make it easier for the user: it's 
slightly easier to read the information from the listing-file table than 
to parse the file name, I think.  Open to omitting time, though.
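For what it's worth, parsing the temporal subset out of a file name would 
look something like the sketch below (the pattern is assumed from the DRS 
convention, and the example name is invented); note that the digits come 
back with no units or calendar attached:

    import re

    def filename_time_range(filename):
        # DRS filenames end in an optional temporal subset, e.g.
        # tas_Amon_HadCM3_historical_r1i1p1_185001-200512.nc
        # Returns the raw digit strings only: the time units and
        # calendar are not recoverable from the name itself.
        m = re.search(r'_(\d+)-(\d+)\.nc$', filename)
        return (m.group(1), m.group(2)) if m else None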
> 5. what is the full path to the file - the one visible through 
> gridFTP, or through the thredds file server or what?
Someone will have to advise me on this.
> 6. an addition - but in the same vein of simplicity, can we have an 
> easy to parse list of servers that hold CMIP5 data available via 
> http.  In the first instance this could be populated by hand.  It 
> could be as simple as a csv file - server,pki_status.
Good idea, I think.
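Something as small as this would do to start with (hostnames invented 
for illustration):

    server,pki_status
    esgf-data1.example.gov,open
    esgdata.example.ac.uk,pki_required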

7.  Another issue:  How would authentication work for users accessing 
data with http or gridftp using the "listing file" or thredds catalog?

best regards,
Karl
> I'm afraid I haven't had time to think about all the issues around 
> hard links, soft links and tape storage - and there may be more major 
> issues there.
> Jamie
>
>     ------------------------------------------------------------------------
>     *From:* Karl Taylor [mailto:taylor13 at llnl.gov]
>     *Sent:* 21 September 2011 23:19
>     *To:* stephen.pascoe at stfc.ac.uk
>     *Cc:* gavin at llnl.gov; Kettleborough, Jamie; go-essp-tech at ucar.edu;
>     esg-node-dev at lists.llnl.gov
>     *Subject:* Re: [Go-essp-tech] Reasoning for the use of symbolic
>     links in drslib
>
>     Hi Stephen and all,
>
>     I would add another requirement (or is this part of 4?):
>
>     5.  A user (as opposed to a data provider or a "replicator" or a
>     data center data manager) should be able to determine (through an
>     automated scripted process) whether a file previously downloaded
>     is in the current (i.e., "latest")  version of a dataset, or has
>     been withdrawn or replaced.
>
>     To meet all the requirements in a practical way in the next few
>     weeks, I'll suggest an alternative approach:  We could use drsLib
>     to create the DRS directory structure, but populate the lowest
>     level (where the files would normally be found) with a single text
>     file  (referred to subsequently as the "listing file") containing
>     the following information:
>
>     the publication-level dataset version THREDDS id, which is:
>
>     <activity>.<product>.<institute>.<model>.<experiment>.<frequency>.
>     <modeling realm>.<MIP table>.<ensemble member>.<version number>
>
>     plus the <variable name>, followed by a table with one row per
>     file and these columns:
>
>     filename | time units | time of 1st time sample |
>     time of last time sample | full path to file | tracking_id | checksum
>
>     file1
>     file2
>     .
>     .
>     .
>     fileN
>
>     The "listing file" would be stored twice for the latest version of
>     each dataset:  once under the numbered version subdirectory and
>     *also* under the generically labeled "latest" directory.  [This is
>     so a user interested in the latest version can find it without
>     knowing its actual number.]  By the way, the time information
>     included in the list might not be absolutely essential, but it
>     could be helpful for those only wanting to download specific
>     time segments of an integration.
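>
>     For concreteness, the two copies of a hypothetical listing.txt
>     for variable "tas" might sit at (path components invented; the
>     ordering of version and variable directories follows the DRS
>     specification):
>
>         .../mon/atmos/Amon/r1i1p1/v20110915/tas/listing.txt
>         .../mon/atmos/Amon/r1i1p1/latest/tas/listing.txt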
>
>     I realize this is not a particularly elegant approach, but if
>     users were given access to the drs directory structure (say,
>     through gridftp), they could run a script that navigated directly
>     to a variable of interest (based on the DRS directory structure
>     specifications) and download the "listing file" stored there. 
>     Then, the "latest" listing file could be compared to the older
>     "listing file" (previously downloaded by the user) to determine
>     whether a new version was available (by simply comparing the
>     <version numbers> stored in the THREDDS ID).  If the user didn't
>     have the most recent version, he could then compare the two
>     "listing files" (old and new) to determine which files were new
>     and which (if any) had been eliminated.
>
>     At that point, the user could generate a local copy of the latest
>     version by moving/deleting files not found in the latest "listing
>     file" and by downloading (using, for example, gridftp) only the
>     new files.
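>
>     A minimal sketch of that comparison step (purely illustrative,
>     not anything drslib provides today; it assumes each data row of a
>     "listing file" is whitespace-delimited, with the filename in the
>     first column and the checksum in the last):
>
>         def listed_files(listing_path):
>             # Map filename -> checksum; data rows are taken to be the
>             # lines whose first column looks like a netCDF filename.
>             files = {}
>             with open(listing_path) as f:
>                 for line in f:
>                     cols = line.split()
>                     if cols and cols[0].endswith('.nc'):
>                         files[cols[0]] = cols[-1]
>             return files
>
>         def changes(old_listing, new_listing):
>             # Files to fetch: new, or changed checksum; files to
>             # retire: no longer present in the latest version.
>             old, new = listed_files(old_listing), listed_files(new_listing)
>             to_fetch = sorted(f for f in new if old.get(f) != new[f])
>             to_retire = sorted(f for f in old if f not in new)
>             return to_fetch, to_retire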
>
>     I bet that in a single day Stephen could enhance drslib to produce
>     these list files, rather than creating the symbolic links to the
>     actual file locations as it currently does.  Note that if the
>     actual files were moved into new directories sometime in the
>     future, a utility would have to be written to modify all the "list
>     files" to point to the new file locations (but that's also true of
>     the symbolic links, I think).
>
>     Also note that creation of a new version would *not* require
>     changing any of the existing "list files" (except the list file in
>     the "latest" directory would be removed).  A new version
>     subdirectory would have to be created and for each variable in the
>     dataset, the new "list file" for that version would have to be
>     generated (and copied also to "latest").
>
>     I'll be interested in your response to this idea and trust that
>     any time spent thinking about it is warranted (i.e., that this is
>     not a completely stupid suggestion).  Will it meet all of
>     Stephen's needs?  Are there any other solutions to the data users'
>     troubles in obtaining data that we can implement in the next few
>     weeks (since that should be our goal here)?
>
>     My primary interest is in making CMIP5 data easily obtainable by
>     users (which appears not to be the case at present), and in
>     allowing users to write scripts to trawl for new data they are
>     interested in and discover any new versions of data that should
>     replace the old.  This is not meant to be a general solution to
>     all of the possible ESG applications.  Also, I'm guessing that a
>     similar approach could be followed in which, instead of reading
>     the "list files", one reads the catalogs, but I doubt that this
>     would be as easy for the typical user to do.
>
>     Best regards,
>     Karl
>
>     P.S. To weigh in on another issue, I think it *will* be essential
>     to require, as part of ESG publication, that the checksum be
>     recorded (in the THREDDS catalog, if I'm not mistaken).  We
>     haven't asked groups to republish data conforming to this new
>     requirement because I want to make sure that any other required
>     alterations in the configuration of the publisher are also
>     communicated, so we only have to ask groups to republish once.
>     Note also that if my "alternative" approach outlined above is
>     adopted, the checksums could either be obtained from the catalog
>     (if they were computed and stored there) or be calculated by
>     drslib itself; there would be no need to republish data to ESG.
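>
>     For what it's worth, the calculation itself is cheap to script.
>     A sketch, with MD5 assumed as the digest (the project may settle
>     on something else):
>
>         import hashlib
>
>         def file_checksum(path, chunk=1 << 20):
>             # Stream the file in 1 MB blocks so multi-gigabyte
>             # netCDF files need not fit in memory.
>             h = hashlib.md5()
>             with open(path, 'rb') as f:
>                 for block in iter(lambda: f.read(chunk), b''):
>                     h.update(block)
>             return h.hexdigest()
>
>     drslib could apply something like this to each file as it builds
>     a listing.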
>
>
>     On 9/20/11 2:35 PM, stephen.pascoe at stfc.ac.uk wrote:
>>
>>     Hi All,
>>
>>     Lots of good discussion here and sorry I've been keeping quiet. 
>>     I want to remind ourselves of the requirements I laid out in the
>>     wiki page
>>
>>     1. It should allow data from multiple versions to be kept on disk
>>     simultaneously.
>>
>>     2. It should avoid storing multiple copies of files that are
>>     present in more than one version.
>>
>>     3. It should be straightforward to copy dataset changes (i.e.
>>     differences between versions) between nodes to allow efficient
>>     replication.
>>
>>     4. It should rely only on the filesystem so that generic tools
>>     like FTP could be used to expose the structure if necessary.
>>
>>     In my view we should address these directly.  Are they needed? 
>>     Which are the most important?
>>
>>     Gavin said about catalogs
>>
>>     > you can quickly and easily inspect catalog_v1 and catalog_v2 to
>>     find what the changes are.
>>     > This all answers the question of "WHAT" (to download)... the
>>     other question of "HOW" is a different, but related story.
>>     > The trick is to not conflate the two issues which is what
>>     filesystem discussions do.
>>
>>     But THREDDS conflates the two as well!  A THREDDS catalog
>>     contains descriptions of service endpoints that are not
>>     independent of the node serving the data (the "HOW").  Maybe we
>>     should have developed a true catalog format but that is not where
>>     we are now.  The replication client uses THREDDS catalogs in this
>>     way but when I last looked it was completely unaware of versions
>>     -- i.e. it won't help with #3.
>>
>>     I don't see how Gavin's point addresses any of the requirements
>>     above.  Even if we ditch #4, which I expect Gavin would argue
>>     for, it doesn't directly solve the problem for #1-#3 either.
>>
>>     Briefly on some other points that have been made...
>>
>>     Balaji, some archive tools may be able to detect two paths
>>     pointing to the same filesystem inode, but both Estani and I have
>>     enquired with our backup people and they say hard links must be
>>     avoided.  I am happy to include a hard-linking option in drslib though. 
>>     I've created a bugzilla ticket for it.
>>
>>     Karl, I think putting real files in "latest" is equivalent to
>>     putting real files in the latest "vYYYYMMDD" directory.  The
>>     directories can be renamed trivially on upgrade but you still
>>     have the same problems as the wiki page says.
>>
>>     I'm sure there were other points but I've lost track.  Checksums
>>     will have to wait for another email.
>>
>>     Cheers,
>>
>>     Stephen.
>>
>>     ---
>>
>>     Stephen Pascoe  +44 (0)1235 445980
>>
>>     Centre of Environmental Data Archival
>>
>>     STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11
>>     0QX, UK
>>
>>     *From:*go-essp-tech-bounces at ucar.edu
>>     [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Gavin M. Bell
>>     *Sent:* 20 September 2011 17:26
>>     *To:* Kettleborough, Jamie
>>     *Cc:* go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
>>     *Subject:* Re: [Go-essp-tech] Reasoning for the use of symbolic
>>     links in drslib
>>
>>     Jamie and friends.
>>
>>     You've answered your own questions :-)...
>>     It is the catalog where these checksums (and other features) are
>>     recorded.
>>     And thus using the catalog you can see what has changed.
>>     There is a new catalog for every version of a dataset. Given that...
>>     you can quickly and easily inspect catalog_v1 and catalog_v2 to
>>     find what the changes are.
>>     This all answers the question of "WHAT" (to download)... the
>>     other question of "HOW" is a different, but related story.
>>     The trick is to not conflate the two issues, which is what
>>     filesystem discussions do.  When talking about filesystems you
>>     are stipulating the WHAT but implicitly conflating the HOW,
>>     because you are implicitly designing for tools that intrinsically
>>     use the filesystem.  It is a muddying of the waters that doesn't
>>     separate the two concerns.  We need to deal with these two
>>     concepts independently, in a way that does not limit the system
>>     or cause undue burden on institutions by requiring a rigid structure.
>>
>>     As I mentioned... it's not the filesystem we need to look at...
>>     it's the catalogs.
>>
>>     just my $0.02 - I'll stop flogging this particular horse... but I
>>     hope I have done a better job expressing the issues and where the
>>     solution lies (IMHO).
>>
>>     On 9/20/11 8:14 AM, Kettleborough, Jamie wrote:
>>
>>     Hello Balaji,
>>       
>>     I agree - getting all nodes to make the checksums available would be a
>>     good thing.  It gives you both the data integrity check on download, and
>>     the ability to see what files really have changed from one publication
>>     version to the next.
>>       
>>     I don't know how hard it is to do this, particularly for data that is
>>     already published.
>>       
>>     Jamie
>>       
>>
>>         -----Original Message-----
>>         From: V. Balaji [mailto:V.Balaji at noaa.gov]
>>         Sent: 20 September 2011 16:01
>>         To: Kettleborough, Jamie
>>         Cc: Karl Taylor; go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
>>         Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic
>>         links in drslib
>>
>>         If nodes can currently choose to record checksums or not, I'd
>>         strongly recommend this be a non-optional requirement... how
>>         could anyone download any data with confidence without being
>>         able to checksum?
>>
>>         You can of course check timestamps and filesizes and so on,
>>         but you have to consider those optimizations... a fast option
>>         for the less paranoid to avoid the sum computation, which has
>>         to be the gold standard.
>>
>>         "Trust but checksum".
>>
>>         Kettleborough, Jamie writes:
>>
>>             Hello Karl, everyone,
>>
>>             For replicating the latest version, I agree that your
>>             alternate structure poses difficulties (but it seems like
>>             there must be a way to smartly determine whether you
>>             already have a file and simply need to move it, rather
>>             than bring it over again).
>>
>>             Doesn't every user (not just the replication system) have
>>             this problem?  They want to know what files have changed
>>             (or not changed) at a new publication version.  No one
>>             wants to be using bandwidth or storage space to fetch and
>>             store files they already have.  How is a user expected to
>>             know what has really changed?  Estani mentions checksums
>>             - OK, but I don't think all nodes expose them (is this
>>             right?).  You may try to infer from modification dates
>>             (not sure, I haven't looked at them that closely).  You
>>             may try to infer from the TRACKING_ID - but I'm not sure
>>             how reliable this is (I can imagine scenarios where
>>             different files share the same TRACKING_ID - e.g. if they
>>             have been modified with an nco tool).
>>
>>             Is there a recommended method for users to understand
>>             what *files* have actually changed when a new publication
>>             version appears?
>>
>>             Thanks,
>>
>>             Jamie
>>
>>         -- 
>>         V. Balaji                               Office:  +1-609-452-6516
>>         Head, Modeling Systems Group, GFDL      Home:    +1-212-253-6662
>>         Princeton University                    Email:   v.balaji at noaa.gov
>>
>>     _______________________________________________
>>     GO-ESSP-TECH mailing list
>>     GO-ESSP-TECH at ucar.edu
>>     http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>
>>
>>
>>     -- 
>>     Gavin M. Bell
>>     --
>>       
>>       "Never mistake a clear view for a short distance."
>>                     -Paul Saffo
>>       