[Go-essp-tech] Reasoning for the use of symbolic links in drslib

Karl Taylor taylor13 at llnl.gov
Fri Sep 23 11:11:55 MDT 2011


Hi Balaji,

I think it would be more difficult for some users to do what they want
if only the files that should *not* be downloaded were listed, but I'll
think more about this.

Karl


On 9/23/11 8:00 AM, V. Balaji wrote:
> I'd like to assume, optimistically, that the number of files that have
> non-latest status (i.e., been superseded or retracted) is going to be very
> small compared to the total number of files on the system.
>
> In that case, perhaps it would be more efficient for the additional text
> file proposed by Karl to list only those files that have been superseded
> and should _not_ be downloaded, thus only the non-latest rather than the
> latest.
>
> Additionally it should be possible to set a flag in THREDDS that marks
> those files as non-latest, and then have some behaviour from the UI,
> like a warning popup, signalling that an alternate file is to be
> downloaded instead.
>
> Even more basic, we can return to the filesystem directory (which is
> the *original* data catalog, using a software artifact developed in
> 1969 or so... we should all be so lucky as to create a data structure
> that's used 40 years from now...) and signal a file as non-latest by
> simply flipping its read bit (chmod a-r).
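>
> (A minimal sketch of that read-bit flip in Python -- the path is
> hypothetical, and os.chmod here simply clears the three read bits:)
>
>     # Sketch: mark a file as non-latest by clearing its read bits,
>     # the programmatic equivalent of "chmod a-r".  Path is hypothetical.
>     import os
>     import stat
>
>     path = "v20110101/tas/tas_Amon_example.nc"   # hypothetical file
>     mode = os.stat(path).st_mode
>     os.chmod(path, mode & ~(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH))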
>
> PS. I am glad we have consensus on checksums, but I'd like to hear
> comments on Karl's proposal on how to enforce this requirement on
> already catalogued files, pulled out and re-quoted here
>
>>        P.S. To weigh in on another issue, I think it *will* be
>> essential to require, as part of ESG publication, that the checksum be
>> recorded (in the THREDDS catalog, if I'm not mistaken).  We haven't
>> asked groups to republish data conforming to this new requirement
>> because I want to make sure that any other required alterations in the
>> configuration of the publisher are also communicated, so we only have to
>> ask groups to republish once.  Note also that if my "alternative"
>> approach outlined above is adopted, the checksums could either be gotten
>> from the catalog (if they were computed and stored there) or be
>> calculated by drslib itself; there would be no need to republish data to
>> ESG.
> Is Karl's proposed design ok?  Look for it in THREDDS; if not found,
> have drslib or something add it in.
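>
> (A rough sketch of that fallback in Python -- the catalog layout and
> the "checksum" property name are assumptions, not the actual ESG
> publisher schema:)
>
>     # Sketch: prefer a checksum recorded in a THREDDS catalog; if it
>     # is not there, compute it locally (as drslib might).  The XML
>     # structure assumed here is illustrative only.
>     import hashlib
>     import xml.etree.ElementTree as ET
>
>     def checksum_for(catalog_xml, filename, local_path):
>         for elem in ET.parse(catalog_xml).iter():
>             if elem.get("name") == filename:
>                 for prop in elem.iter():
>                     if prop.get("name") == "checksum":
>                         return prop.get("value")    # found in catalog
>         md5 = hashlib.md5()                         # not found: compute
>         with open(local_path, "rb") as f:
>             for chunk in iter(lambda: f.read(1 << 20), b""):
>                 md5.update(chunk)
>         return md5.hexdigest()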
>
> Thanks,
>
> Kettleborough, Jamie writes:
>
>> Hello Karl,
>>
>> thanks for responding on this and making the user view much more
>> explicit.  And thanks for the note on the checksum - it's good to know
>> this is close to being 'required'.
>>
>> I also agree that the risk to CMIP5 (and the model contribution to IPCC
>> AR5) due to data access problems is sufficiently high that simple
>> (non-general) solutions that can be delivered quickly are needed.  I
>> think that many of the early CMIP5/working group 1 users are happy to
>> take some of the responsibility for filtering which data they need,
>> q.c., etc. on themselves.  So this user base can tolerate simpler
>> solutions.  This may not apply to working groups 2, 3... I don't know.
>> Later studies of working group 1 may need richer model meta-data.
>>
>> Some questions /comments on your proposal:
>>
>> 1. I think the list files are derivable from the thredds catalogue
>> entries for the publication version dataset (if they all contained the
>> checksums) - I think you suspect this.  In a sense (I think) they are a
>> reformatting of the thredds catalogues into a form more parsable by
>> users.  If it can be achieved in time then I think it's safer to get the
>> checksums into the thredds catalogues and derive any other format from
>> there.
>> 2. do you think you would expose these list files through http?  You
>> mention gridFTP but how soon do you think gridFTP will be available for
>> the users that need it...
>>    a. I'm not sure how many have data available through gridFTP
>> (http://esgf.org/wiki/Cmip5Status/ArchiveView suggests not many?)
>>    b. I'm not sure how many users will have gridFTP clients (or maybe
>> you can use a standard ftp client?)
>> 3. Do you need to capture this idea of 'latest' in the user view, or can
>> the user work this out based on the version number?
>>    a. including 'latest' makes it easier for users as it takes one bit
>> of responsibility away from them
>>    b. but you may be introducing an inconsistency between the thredds
>> interface (which doesn't really expose this idea of latest) and the
>> more file-based interface
>>    c. this exposure of 'latest' may be a minor point (but it's ringing
>> alarm bells with me).
>> 4. I don't think you need the time sample, do you - isn't that in the
>> file name?
>> 5. what is the full path to the file - the one visible through gridFTP,
>> or through the thredds file server or what?
>> 6. an addition - but in the same vein of simplicity, can we have an
>> easy-to-parse list of servers that hold CMIP5 data available via http?
>> In the first instance this could be populated by hand.  It could be as
>> simple as a csv file - server,pki_status (see the sketch after this
>> list).
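>>
>> (A minimal sketch of parsing such a list in Python; the file name and
>> the two-column layout are assumptions:)
>>
>>     # Sketch: read the proposed hand-maintained server list.
>>     # "cmip5_http_servers.csv" is a hypothetical file name.
>>     import csv
>>
>>     with open("cmip5_http_servers.csv") as f:
>>         for server, pki_status in csv.reader(f):
>>             print(server, pki_status)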
>>
>> I'm afraid I haven't had time to think about all the issues around hard
>> links, soft links and tape storage - and there may be more major issues
>> there.
>>
>> Jamie
>>
>>
>> ________________________________
>>
>>        From: Karl Taylor [mailto:taylor13 at llnl.gov]
>>        Sent: 21 September 2011 23:19
>>        To: stephen.pascoe at stfc.ac.uk
>>        Cc: gavin at llnl.gov; Kettleborough, Jamie; go-essp-tech at ucar.edu;
>> esg-node-dev at lists.llnl.gov
>>        Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic
>> links in drslib
>>
>>
>>        Hi Stephen and all,
>>
>>        I would add another requirement (or is this part of 4?):
>>
>>        5.  A user (as opposed to a data provider or a "replicator" or a
>> data center data manager) should be able to determine (through an
>> automated scripted process) whether a file previously downloaded is in
>> the current (i.e., "latest")  version of a dataset, or has been
>> withdrawn or replaced.
>>
>>        To meet all the requirements in a practical way in the next few
>> weeks, I'll suggest an alternative approach:  We could use drsLib to
>> create the DRS directory structure, but populate the lowest level (where
>> the files would normally be found) with a single text file  (referred to
>> subsequently as the "listing file") containing the following
>> information:
>>
>>        the publication-level dataset version THREDDS id, which is:
>>        <activity>.<product>.<institute>.<model>.<experiment>.<frequency>.<modeling realm>.<MIP table>.<ensemble member>.<version number>
>>        plus the <variable name>
>>        followed by a table with one row per file:
>>
>>        filename | time units | time of 1st time sample | time of last time sample | full path to file | tracking_id | checksum
>>        ---------|------------|-------------------------|--------------------------|-------------------|-------------|---------
>>        file1
>>        file2
>>        .
>>        .
>>        .
>>        fileN
>>
>>        The "listing file" would be stored twice for the latest version
>> of each dataset:  once under the numbered version subdirectory and
>> *also* under the generically labeled "latest" directory.  [This is so a
>> user interested in the latest version can find it without knowing its
>> actual number.]  By the way, the time information included in the list
>> might not be absolutely essential, but it could be helpful for those
>> only wanting to download specific time-segments of an integration.
>>
>>        I realize this is not a particularly elegant approach, but if
>> users were given access to the drs directory structure (say, through
>> gridftp), they could run a script that navigated directly to a variable
>> of interest (based on the DRS directory structure specifications) and
>> download the "listing file" stored there.  Then, the "latest" listing
>> file could be compared to the older "listing file" (previously
>> downloaded by the user) to determine whether a new version was available
>> (by simply comparing the <version numbers> stored in the THREDDS ID).
>> If the user didn't have the most recent version, he could then compare
>> the two "listing files" (old and new) to determine which files were new
>> and which (if any) had been eliminated.
>>
>>        At that point, the user could generate a local copy of the
>> latest version by moving/deleting files not found in the latest "listing
>> file" and by downloading (using, for example, gridftp) only the new
>> files.
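>>
>>        (To make that comparison step concrete, a minimal sketch in
>> Python -- it assumes whitespace-separated columns with the filename
>> first, and a fixed number of header lines, both assumptions rather
>> than a finished format:)
>>
>>     # Sketch: diff an old and a new "listing file" to decide what to
>>     # delete locally and what to download.  Assumes the filename is
>>     # the first column; the 3 skipped header lines are an assumption.
>>     def read_listing(path):
>>         with open(path) as f:
>>             rows = [line.split() for line in f if line.strip()]
>>         return {cols[0]: cols for cols in rows[3:]}
>>
>>     old = read_listing("listing_old.txt")   # hypothetical file names
>>     new = read_listing("listing_new.txt")
>>     to_delete   = sorted(set(old) - set(new))
>>     to_download = sorted(set(new) - set(old))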
>>
>>        I bet that in a single day Stephen could enhance drslib to
>> produce these list files, rather than creating the symbolic links to the
>> actual file locations as it currently does.  Note that if the actual
>> files were moved into new directories sometime in the future, a utility
>> would have to be written to modify all the "list files" to point to the
>> new file locations (but that's also true of the symbolic links, I think).
>>
>>        Also note that creation of a new version would *not* require
>> changing any of the existing "list files" (except the list file in the
>> "latest" directory would be removed).  A new version subdirectory would
>> have to be created and for each variable in the dataset, the new "list
>> file" for that version would have to be generated (and copied also to
>> "latest").
>>
>>        I'll be interested in your response to this idea and trust that
>> any time spent thinking about it is warranted (i.e., that this is not a
>> completely stupid suggestion).  Will it meet all of Stephen's needs?
>> Are there any other solutions to the data users' troubles in obtaining
>> data which we can implement in the next few weeks (since that should be
>> our goal here)?
>>
>>        My primary interest is in making CMIP5 data easily obtainable by
>> users (which appears not to be the case at present), and to allow users
>> to write scripts to troll for new data they are interested in and
>> discover any new versions of data that should replace the old.  This is
>> not meant to be a general solution to all of the possible ESG
>> applications.  Also, I'm guessing that a similar approach could be
>> followed where instead of reading the "list files", one read the
>> catalogs, but I doubt that this would be as easy for the typical user to
>> do.
>>
>>        Best regards,
>>        Karl
>>
>>        P.S. To weigh in on another issue, I think it *will* be
>> essential to require, as part of ESG publication, that the checksum be
>> recorded (in the THREDDS catalog, if I'm not mistaken).  We haven't
>> asked groups to republish data conforming to this new requirement
>> because I want to make sure that any other required alterations in the
>> configuration of the publisher are also communicated, so we only have to
>> ask groups to republish once.  Note also that if my "alternative"
>> approach outlined above is adopted, the checksums could either be gotten
>> from the catalog (if they were computed and stored there) or be
>> calculated by drslib itself; there would be no need to republish data to
>> ESG.
>>
>>
>>        On 9/20/11 2:35 PM, stephen.pascoe at stfc.ac.uk wrote:
>>
>>                Hi All,
>>
>>
>>
>>                Lots of good discussion here and sorry I've been keeping
>> quiet.  I want to remind ourselves of the requirements I laid out in the
>> wiki page
>>
>>
>>
>>                1. It should allow data from multiple versions to be
>> kept on disk simultaneously.
>>
>>                2. It should avoid storing multiple copies of files that
>> are present in more than one version.
>>
>>                3. It should be straightforward to copy dataset changes
>> (i.e. differences between versions) between nodes to allow efficient
>> replication.
>>
>>                4. It should rely only on the filesystem so that generic
>> tools like FTP could be used to expose the structure if necessary.
>>
>>
>>
>>                In my view we should address these directly.  Are they
>> needed?  Which are the most important?
>>
>>
>>
>>                Gavin said about catalogs
>>
>>                >  you can quickly and easily inspect catalog_v1 and
>> catalog_v2 to find what the changes are.
>>                >  This all answers the question of "WHAT" (to
>> download)... the other question of "HOW" is a different, but related
>> story.
>>                >  The trick is to not conflate the two issues which is
>> what filesystem discussions do.
>>
>>
>>
>>                But THREDDS conflates the two as well!  A THREDDS
>> catalog contains descriptions of service endpoints that are not
>> independent of the node serving the data (the "HOW").  Maybe we should
>> have developed a true catalog format but that is not where we are now.
>> The replication client uses THREDDS catalogs in this way, but when I
>> last looked it was completely unaware of versions -- i.e. it won't help
>> with #3.
>>
>>
>>
>>                I don't see how Gavin's point addresses any of the
>> requirements above.  Even if we ditch #4, which I expect Gavin would
>> argue for, it doesn't directly solve the problem for #1-#3 either.
>>
>>
>>
>>                Briefly on some other points that have been made...
>>
>>
>>
>>                Balaji, some archive tools may be able to detect two
>> paths pointing to the same filesystem inode, but both Estani and I have
>> enquired with our backup people and they say hard links must be avoided.
>> I am happy to include a hard-linking option in drslib though.  I've
>> created a bugzilla ticket for it.
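>>
>>                (For reference, a small sketch of the inode check such a
>> tool would need -- standard os.stat fields, with hypothetical paths:)
>>
>>     # Sketch: two directory entries are hard links to the same file
>>     # iff they share a device and inode number.  Paths are hypothetical.
>>     import os
>>
>>     def same_inode(a, b):
>>         sa, sb = os.stat(a), os.stat(b)
>>         return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
>>
>>     print(same_inode("v1/tas.nc", "v2/tas.nc"))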
>>
>>
>>
>>                Karl, I think putting real files in "latest" is
>> equivalent to putting real files in the latest "vYYYYMMDD" directory.
>> The directories can be renamed trivially on upgrade but you still have
>> the same problems as the wiki page says.
>>
>>
>>
>>                I'm sure there were other points but I've lost track.
>> Checksums will have to wait for another email.
>>
>>
>>
>>                Cheers,
>>
>>                Stephen.
>>
>>
>>
>>
>>
>>                ---
>>
>>                Stephen Pascoe  +44 (0)1235 445980
>>
>>                Centre of Environmental Data Archival
>>
>>                STFC Rutherford Appleton Laboratory, Harwell Oxford,
>> Didcot OX11 0QX, UK
>>
>>
>>
>>                From: go-essp-tech-bounces at ucar.edu
>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Gavin M. Bell
>>                Sent: 20 September 2011 17:26
>>                To: Kettleborough, Jamie
>>                Cc: go-essp-tech at ucar.edu; esg-node-dev at lists.llnl.gov
>>                Subject: Re: [Go-essp-tech] Reasoning for the use of
>> symbolic links in drslib
>>
>>
>>
>>                Jamie and friends.
>>
>>                You've answered your own questions :-)...
>>                It is the catalog where these checksums (and other
>> features) are recorded.
>>                And thus using the catalog you can see what has changed.
>>                There is a new catalog for every version of a dataset.
>> Given that...
>>                you can quickly and easily inspect catalog_v1 and
>> catalog_v2 to find what the changes are.
>>                This all answers the question of "WHAT" (to download)...
>> the other question of "HOW" is a different, but related story.
>>                The trick is to not conflate the two issues which is
>> what filesystem discussions do.  When talking about filesystems you are
>> stipulating the what but implicitly conflating the HOW because you are
>> implicitly designing for tools that intrinsically use the filesystem.
>> It is a muddying of the waters that doesn't separate the two concerns.
>> We need to deal with these two concepts independently in a way that does
>> not limit the system or cause undue burden on institutions by requiring
>> a rigid structure.
>>
>>                As I mentioned... it's not the filesystem we need to
>> look at... it's the catalogs.
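>>
>>                (A rough sketch of that catalog-to-catalog inspection --
>> it assumes each catalog has already been reduced to a file->checksum
>> map, e.g. by parsing the catalog XML as in the earlier sketch:)
>>
>>     # Sketch: compare two per-version catalogs via file -> checksum
>>     # maps.  Building the maps from catalog XML is assumed done
>>     # elsewhere; this only computes the differences.
>>     def diff_catalogs(map_v1, map_v2):
>>         added   = set(map_v2) - set(map_v1)
>>         removed = set(map_v1) - set(map_v2)
>>         changed = {f for f in set(map_v1) & set(map_v2)
>>                    if map_v1[f] != map_v2[f]}
>>         return added, removed, changed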
>>
>>                just my $0.02 - I'll stop flogging this particular
>> horse... but I hope I have done a better job expressing the issues and
>> where the solution lies (IMHO).
>>
>>                On 9/20/11 8:14 AM, Kettleborough, Jamie wrote:
>>
>>                Hello Balaji,
>>
>>                I agree - getting all nodes to make the checksums
>> available would be a
>>                good thing.  It gives you both the data integrity check
>> on download, and
>>                the ability to see what files really have changed from
>> one publication
>>                version to the next.
>>
>>                I don't know how hard it is to do this, particularly for
>> data that is
>>                already published.
>>
>>                Jamie
>>
>>
>>                        -----Original Message-----
>>                        From: V. Balaji [mailto:V.Balaji at noaa.gov]
>>                        Sent: 20 September 2011 16:01
>>                        To: Kettleborough, Jamie
>>                        Cc: Karl Taylor; go-essp-tech at ucar.edu;
>> esg-node-dev at lists.llnl.gov
>>                        Subject: Re: [Go-essp-tech] Reasoning for the
>> use of symbolic
>>                        links in drslib
>>
>>                        If nodes can currently choose to record
>> checksums or not, I'd
>>                        strongly recommend this be a non-optional
>> requirement.. how
>>                        could anyone download any data with confidence
>> without being
>>                        able to checksum?
>>
>>                        You can of course check timestamps and filesizes
>> and so on, but you have to consider those as optimizations... a fast
>> option for the less paranoid to avoid the checksum computation, which
>> has to be the gold standard.
>>
>>                        "Trust but checksum".
>>
>>                        Kettleborough, Jamie writes:
>>
>>
>>                                Hello Karl, everyone,
>>
>>
>>                                For replicating the latest version, I
>>                                agree that your alternate structure
>>                                poses difficulties (but it seems like
>>                                there must be a way to smartly determine
>>                                whether you already have a file and
>>                                simply need to move it, rather than
>>                                bring it over again).
>>
>>                                Doesn't every user (not just the
>>                                replication system) have this problem:
>>                                they want to know what files have
>>                                changed (or not changed) at a new
>>                                publication version.  No one wants to be
>>                                using bandwidth or storage space to
>>                                fetch and store files they already have.
>>                                How is a user expected to know what has
>>                                really changed?  Estani mentions
>>                                checksums - OK, but I don't think all
>>                                nodes expose them (is this right?).  You
>>                                may try to infer from modification dates
>>                                (not sure, I haven't looked at them that
>>                                closely).  You may try to infer from the
>>                                TRACKING_ID - but I'm not sure how
>>                                reliable this is (I can imagine
>>                                scenarios where different files share
>>                                the same TRACKING_ID - e.g. if they have
>>                                been modified with an nco tool).
>>
>>                                Is there a recommended method for users
>>                                to understand what *files* have actually
>>                                changed when a new publication version
>>                                appears?
>>
>>                                Thanks,
>>
>>                                Jamie
>>
>>
>>
>>                        --
>>
>>                        V. Balaji                            Office: +1-609-452-6516
>>                        Head, Modeling Systems Group, GFDL   Home:   +1-212-253-6662
>>                        Princeton University                 Email:  v.balaji at noaa.gov
>>
>>
>>                _______________________________________________
>>                GO-ESSP-TECH mailing list
>>                GO-ESSP-TECH at ucar.edu
>>                http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>
>>
>>
>>
>>
>>                --
>>                Gavin M. Bell
>>                --
>>
>>                 "Never mistake a clear view for a short distance."
>>                               -Paul Saffo
>>
>>
>>
>>
>>
>>
> --
>
> V. Balaji                               Office:  +1-609-452-6516
> Head, Modeling Systems Group, GFDL      Home:    +1-212-253-6662
> Princeton University                    Email: v.balaji at noaa.gov

