[Go-essp-tech] Reasoning for the use of symbolic links in drslib
Estanislao Gonzalez
gonzalez at dkrz.de
Mon Sep 19 02:31:25 MDT 2011
Hi,
I started this discussion about hard vs soft links a while ago, and
the results, as far as I remember, were these:
1) soft-links are identifiable as such, and could be handled properly
via tools
2) Hard-links would result in the same file being stored on tape
multiple times, because point 1 does not hold for them (they cannot be
identified as links).
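To illustrate the two points above, here is a minimal Python sketch (not taken from drslib; the helper names and file names are made up for the example). It assumes a POSIX file system. A soft-link can be recognized as such, while a hard-link looks exactly like the original file apart from its link count.

```python
import os
import tempfile


def describe_entry(path):
    """Classify a file entry without following symbolic links."""
    st = os.lstat(path)  # lstat inspects the entry itself, not a link target
    if os.path.islink(path):
        return "soft-link"
    if st.st_nlink > 1:
        # More than one name refers to this inode; there is no way to
        # tell which name is "the original".
        return "hard-linked file"
    return "regular file"


def demo():
    """Create one file, one soft-link and one hard-link, then classify them."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.nc")  # hypothetical file name
        with open(original, "w") as f:
            f.write("data")
        before = describe_entry(original)

        soft = os.path.join(tmp, "soft.nc")
        hard = os.path.join(tmp, "hard.nc")
        os.symlink(original, soft)
        os.link(original, hard)

        return (before, describe_entry(soft),
                describe_entry(hard), describe_entry(original))
```

Once the hard-link exists, the original and the hard-link are classified identically, which is why a tape backup tool that just walks the tree would store the content twice.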
We are using the drslib for managing our own data here, but I won't be
using it for managing the replicas.
To summarize why:
1) soft-links are (and should be) transparent to the APIs publishing
them (i.e. we should not rely on how people structure their files from
the outside; a file, a soft-link or a hard-link should all be treated
equally so that tools do not have to cope with growing complexity).
2) the drs-lib creates a non-drs structure within the drs one. A small
change could solve this by storing the files in e.g. cmip5_file/... and
the links in cmip5/.. though.
3) Only a few tools can handle the copying of files and soft-links
properly, and only in bulk. Copying a single file could result in either
copying a link pointing to a non-existent file, or a link and a file
from a potentially different version. This is very unintuitive.
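Point 3 can be reproduced with a short Python sketch. The directory layout and file names below are invented for the example and are not the drslib layout. It copies a single relative soft-link twice, once as a link and once following it.

```python
import os
import shutil
import tempfile


def copy_single_entry():
    """Copy one soft-link out of its tree, as a link and as a file."""
    with tempfile.TemporaryDirectory() as tmp:
        src_dir = os.path.join(tmp, "src")
        dst_dir = os.path.join(tmp, "dst")
        os.makedirs(os.path.join(src_dir, "files"))
        os.makedirs(dst_dir)

        # The real data lives in a side directory; the visible entry is a
        # relative soft-link to it (hypothetical names).
        real = os.path.join(src_dir, "files", "tas.nc")
        with open(real, "w") as f:
            f.write("data")
        link = os.path.join(src_dir, "tas.nc")
        os.symlink(os.path.join("files", "tas.nc"), link)

        # Copy the link itself: the relative target does not exist in dst.
        as_link = os.path.join(dst_dir, "as_link.nc")
        shutil.copyfile(link, as_link, follow_symlinks=False)

        # Copy through the link: we get a real file, the link is gone.
        as_file = os.path.join(dst_dir, "as_file.nc")
        shutil.copyfile(link, as_file)

        return (os.path.islink(as_link), os.path.exists(as_link),
                os.path.islink(as_file), os.path.exists(as_file))
```

The first copy is a dangling link, the second one is a plain file that has lost any relation to the other versions. Which of the two a given transfer tool does is not something a user can easily guess.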
What we are aiming to do is:
1) replicate on a metadata basis, i.e. first get the metadata and then
make a decision on what to do next. We are using the file name and
checksum to recognize a file that hasn't changed from a previous version.
The source URL of an unchanged file will be changed to file://... and
pointed to the existing file (e.g. for those old files in case a new
version is published and still uses them). This ensures we move only what
we need.
2) Then we will create links from the new file to the identical old file.
It's not clear if we will point to the original file (first version
acquired) or to the linked one. It depends on the type of link to use,
either:
- soft-links: problems with maintaining a temporary structure while the
data is being downloaded and prepared and moving it around. The
difference is that they point to other drs_valid files (specifically,
the one from the last version, even if it's already a link). Version
deletion is tricky: one error and potentially all versions of the file
will be lost.
- or hard-links: files can be created, deleted, or moved around in the
same file system without any penalty. Version deletion is trivial. But
you need to maintain the DB to recognize those (which we do, in the
esg-publisher); finding all hard-links of a file is always possible but
extremely time-consuming (same as soft-links). Though this will only be
desirable for improving the tape storage strategy.
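A rough Python sketch of the two steps above, under stated assumptions: the metadata is a list of dicts with "name", "checksum" and "url" keys, and the local index maps (name, checksum) to an absolute path. Both structures and all function names are hypothetical and are not the esg-publisher or drslib API.

```python
import os
import tempfile


def plan_replication(new_version, local_index):
    """Decide per file whether to download it or reuse a local copy.

    A file counts as unchanged when its name and checksum both match
    an entry we already hold; its source is then rewritten to file://.
    """
    plan = []
    for entry in new_version:
        local_path = local_index.get((entry["name"], entry["checksum"]))
        if local_path is not None:
            plan.append((entry["name"], "file://" + local_path))
        else:
            plan.append((entry["name"], entry["url"]))
    return plan


def link_unchanged(existing_path, new_path, hard=True):
    """Expose an unchanged file in the new version via a hard- or soft-link."""
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    if hard:
        os.link(existing_path, new_path)  # must be on the same file system
    else:
        os.symlink(existing_path, new_path)


def survives_old_version_deletion(hard):
    """Delete the old version's entry and check the new one is still usable."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "v1", "tas.nc")  # hypothetical layout
        new = os.path.join(tmp, "v2", "tas.nc")
        os.makedirs(os.path.dirname(old))
        with open(old, "w") as f:
            f.write("data")
        link_unchanged(old, new, hard=hard)
        os.remove(old)
        return os.path.exists(new)
```

With hard-links the new version still has its data after the old entry is removed; with soft-links it is left with a dangling link. That is the asymmetry behind "version deletion is trivial" versus "tricky".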
In either case, I do think we should not rely on whether a node uses
soft-links, files or hard-links. To the outside they should be treated
like normal files in my opinion. Saving space, and more importantly how
to do so, should be a decision of each archive site. It really depends
on the technology in use.
Thanks,
Estani
On 19.09.2011 02:23, V. Balaji wrote:
> So long as source and destination are on the same filesystem, you
> can use hard links instead of symbolic ('ln' instead of 'ln -s'),
> can't you? Hard links are transparent (or do I mean opaque... in
> either case, I mean not a problem:-) for any data transfer protocol,
> including gridftp, and do not cost anything in bandwidth or storage.
>
> But hard links cannot cross filesystem partitions.
>
> Karl Taylor writes:
>
>> Hi Stephen,
>>
>> For replicating the latest version, I agree that your alternate structure
>> poses difficulties (but it seems like there must be a way to smartly
>> determine whether you already have a file and simply need to move
>> it, rather than bring it over again).
>>
>> I wanted to bring up another use case in which your alternative offers some
>> advantages (after a slight modification): a user using gridftp wants to
>> retrieve the latest version of datasets distributed around several sites
>> (ignore replicated versions for simplicity). If the user wants to write a
>> script to retrieve the data without having to look up what the latest version
>> number is for each dataset, wouldn't it be best to put the actual files in
>> "latest" subdirectory, with pointers going from the version identified
>> directory to "latest"? All of these complications seem to be due to
>> gridftp's inability to follow links.
>>
>> Are there other ways to enable users to do the above without putting the
>> actual files in a directory named "latest"?
>>
>> regards,
>> Karl
>>
>> On 9/17/11 1:16 PM, stephen.pascoe at stfc.ac.uk wrote:
>>> Hi All,
>>>
>>> I'm aware that many people are reluctant to use drslib because of its use
>>> of symbolic links when constructing the DRS directory structure. I
>>> completely understand caution when using symbolic links but I wanted to
>>> make the case for why I believe it is necessary to meet the goals of a
>>> consistent distributed and versioned archive. Therefore I've prepared a
>>> wiki page that goes into the technical details:
>>>
>>> http://esgf.org/wiki/DrsVersionLinking
>>>
>>> Please read it carefully if you are currently considering how to implement
>>> the DRS directory structure. I propose we discuss this at the next ESGF
>>> telco.
>>>
>>> Thanks,
>>>
>>> Stephen.
>>>
>>> ---
>>>
>>> Stephen Pascoe +44 (0)1235 445980
>>>
>>> Centre of Environmental Data Archival
>>>
>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>>
>>>
--
Estanislao Gonzalez
Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
Phone: +49 (40) 46 00 94-126
E-Mail: gonzalez at dkrz.de