[Go-essp-tech] Reasoning for the use of symbolic links in drslib

Karl Taylor taylor13 at llnl.gov
Mon Sep 19 06:02:40 MDT 2011


Hi Estani,

I've embedded 2 questions below because I do not understand.  If you 
think I should understand this, please clarify.

thanks,
Karl

On 9/19/11 1:31 AM, Estanislao Gonzalez wrote:
> Hi,
>
> I started this discussion about hard vs soft links a while ago and
> the results, as far as I remember, were these:
> 1) soft-links are identifiable as such, and could be handled properly
> via tools
> 2) Hard-links would result in the same file being stored on tape
> multiple times, because of 1 not holding.
I don't understand "because of 1 not holding".
> We are using the drslib for managing our own data here, but I won't be
> using it for managing the replicas.
> To summarize why:
> 1) soft-links are (and should be) transparent to the APIs publishing
> them (i.e. we should not rely on how people structure their files from
> the outside, either a file, a soft- or a hard-link should be treated
> equally so that tools do not have to cope with a growing complexity).
> 2) the drs-lib creates a non-drs structure within the drs one. A small
> change could solve this by storing the files in e.g. cmip5_file/... and
> the links in cmip5/.. though.
What is the non-drs structure created by drs-lib?
> 3) Only a few tools can handle the copying of files and soft-links
> properly, and only in bulk. Copying a single file could result in either
> copying a link pointing to a non-existent file, or a link and a file
> from a potentially different version. This is very non-intuitive.
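To illustrate point 3 above, here is a minimal Python sketch of what can happen when a single soft-link is copied on its own. It is not drslib code; the helper name copy_one is hypothetical, and it assumes a POSIX file system where the link stores a relative target.

```python
import os
import shutil


def copy_one(src, dst, follow=True):
    """Copy a single path and report whether the result is usable.

    follow=False copies a symbolic link as a link (keeping its stored
    target text); follow=True copies the bytes of the file it points to.
    """
    shutil.copy2(src, dst, follow_symlinks=follow)
    # A link whose target cannot be resolved from its new location is dangling.
    if os.path.islink(dst) and not os.path.exists(dst):
        return "dangling link"
    return "ok"
```

With follow=False, a relative link copied into a different directory keeps its original target string, so it may no longer resolve. With follow=True the data is duplicated, which is safe but gives up the space saving.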
>
> What we are aiming to do is:
> 1) replicate on a meta-data basis, i.e. first get the metadata and then
> make a decision on what to do next. We are using the file name and
> checksum to recognize a file that hasn't changed from a previous version.
> The source url of an unchanged file will be changed to file://... and
> pointed to the existing file (e.g. for those old files in case a new
> version is published and still uses them). This ensures we move only what
> we need.
> 2) Then we will create links from the new file to the identical old file.
> It's not clear if we will point to the original file (first version
> acquired) or to the linked one. It depends on the type of link to use,
> either:
> - soft-links: problems with maintaining a temporary structure while the
> data is being downloaded and prepared and moving it around. The
> difference is that they point to other drs_valid files (specifically,
> the one from last version, even if it's already a link). Version
> deletion is tricky; an error and potentially all versions of the file
> will be lost.
> - or hard-links: files can be created, deleted, or moved around in the
> same file system without any penalty. Version deletion is trivial. But
> you need to maintain the DB to recognize those (which we do, in the
> esg-publisher); finding all hard-links of a file is always possible but
> extremely time-consuming (same as soft-links). Though this will only be
> desirable for improving the tape storage strategy.
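The plan above (match on file name and checksum, then link instead of downloading again) can be sketched roughly as follows for the hard-link option. This is only an illustration, not the esg-publisher or drslib implementation: the names sha256sum and place_file, the fetch callback and the choice of SHA-256 are all assumptions.

```python
import hashlib
import os


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def place_file(new_version_dir, name, checksum, previous_path, fetch):
    """Hypothetical helper: put `name` into `new_version_dir`.

    If `previous_path` has the same file name and checksum as the
    published metadata, hard-link it; otherwise call `fetch(dest)`,
    a caller-supplied download function.
    """
    dest = os.path.join(new_version_dir, name)
    if (previous_path is not None
            and os.path.basename(previous_path) == name
            and sha256sum(previous_path) == checksum):
        # Same inode under a second name: no transfer, no extra disk space.
        os.link(previous_path, dest)
        return "linked"
    fetch(dest)
    return "fetched"
```

Because every name of a hard-linked file is equivalent, deleting an old version directory leaves the data reachable from the newer one. os.link only works when both paths are on the same file system.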
>
> In either case, I do think we should not rely on whether a node uses
> soft-links, files or hard-links. To the outside they should be treated
> like normal files in my opinion. Saving space, and more importantly how
> to do so, should be a decision of each archive site. It really depends
> on the technology in use.
>
> Thanks,
> Estani
>
> Am 19.09.2011 02:23, schrieb V. Balaji:
>> So long as source and destination are on the same filesystem, you
>> can use hard links instead of symbolic ('ln' instead of 'ln -s'),
>> can't you? Hard links are transparent (or do I mean opaque... in
>> either case, I mean not a problem:-) for any data transfer protocol,
>> including gridftp, and do not cost anything in bandwidth or storage.
>>
>> But hard links cannot cross filesystem partitions.
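For reference, a symbolic link can be detected directly on a POSIX file system, whereas the different names of a hard-linked file can only be matched up by comparing device and inode numbers across a directory walk. A minimal Python sketch, with a hypothetical function name, of how a tool such as a tape archiver might find them:

```python
import os
from collections import defaultdict


def hard_link_groups(root):
    """Walk `root` and return lists of paths that share one inode."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symbolic links are identifiable and skipped here
            st = os.stat(path)
            # (device, inode) identifies the underlying file uniquely.
            groups[(st.st_dev, st.st_ino)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

The whole tree has to be scanned to be sure that all names of a file have been found, which is why this gets expensive on a large archive.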
>>
>> Karl Taylor writes:
>>
>>> Hi Stephen,
>>>
>>> For replicating the latest version, I agree that your alternate structure
>>> poses difficulties (but it seems like there must be a way to smartly
>>> determine whether you already have a file and simply need to move
>>> it, rather than bring it over again).
>>>
>>> I wanted to bring up another use case in which your alternative offers some
>>> advantages (after a slight modification):  a user using gridftp wants to
>>> retrieve the latest version of datasets distributed around several sites
>>> (ignore replicated versions for simplicity).  If the user wants to write a
>>> script to retrieve the data without having to look up what the latest version
>>> number is for each dataset, wouldn't it be best to put the actual files in
>>> "latest" subdirectory, with pointers going from the version identified
>>> directory to "latest"?   All of these complications seem to be due to
>>> gridftp's inability to follow links.
>>>
>>> Are there other ways to enable users to do the above without putting the
>>> actual files in a directory named "latest"?
>>>
>>> regards,
>>> Karl
>>>
>>> On 9/17/11 1:16 PM, stephen.pascoe at stfc.ac.uk wrote:
>>>> Hi All,
>>>>
>>>> I'm aware that many people are reluctant to use drslib because of its use
>>>> of symbolic links when constructing the DRS directory structure.  I
>>>> completely understand caution when using symbolic links but I wanted to
>>>> make the case for why I believe it is necessary to meet the goals of a
>>>> consistent distributed and versioned archive.  Therefore I've prepared a
>>>> wiki page that goes into the technical details:
>>>>
>>>> http://esgf.org/wiki/DrsVersionLinking
>>>>
>>>> Please read it carefully if you are currently considering how to implement
>>>> the DRS directory structure.  I propose we discuss this at the next ESGF
>>>> telco.
>>>>
>>>> Thanks,
>>>>
>>>> Stephen.
>>>>
>>>> ---
>>>>
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>
>>>> Centre of Environmental Data Archival
>>>>
>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>>>
>>>>
>