<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Sorry, I'll try to clarify this better.<br>

    <br>

    On 19.09.2011 14:02, Karl Taylor wrote:

    <blockquote cite="mid:4E772F60.3050200@llnl.gov" type="cite">


      <font face="Times New Roman">Hi Estani,<br>

        <br>

        I've embedded 2 questions below because I do not understand.&nbsp; If

        you think I should understand this, please clarify.<br>

        <br>

        thanks,<br>

        Karl<br>

      </font><br>

      On 9/19/11 1:31 AM, Estanislao Gonzalez wrote:

      <blockquote cite="mid:4E76FDDD.5030109@dkrz.de" type="cite">

        <pre wrap="">Hi,

I've started this discussion about hard vs soft links a while ago and 

the results, as far as I remember, were these:

1) soft-links are identifiable as such, and could be handled properly 

via tools

2) Hard-links would result in the same file being stored on tape 

multiple times, because of 1 not holding.

</pre>

      </blockquote>

      I don't understand "because of 1 not holding".&nbsp; <br>

    </blockquote>

    Sorry for writing too fast and carelessly ... this makes no sense.<br>

    I meant that, since hard links really are like normal files, they will
    get ingested multiple times when stored to tape. Tools therefore
    cannot rely on "detecting" them as they would with soft-links.
    [I'm told that our tape storage cannot store links, so we wouldn't
    benefit from this advantage of soft-links.]<br>

    <br>
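    Roughly what I mean, as a minimal sketch assuming a POSIX file system
    and Python's standard library (the file names are made up for
    illustration): a soft link can be detected directly, whereas a hard
    link only shows up as a link count greater than one, and the
    "original" file looks exactly like its hard link.<br>
    <pre>
```python
import os
import tempfile


def describe_link(path):
    """Classify a path as 'symlink', 'hardlinked file', or 'plain file'."""
    if os.path.islink(path):
        return "symlink"
    # A hard link is an ordinary directory entry; the only hint is the link count.
    if os.lstat(path).st_nlink > 1:
        return "hardlinked file"
    return "plain file"


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, used only for this illustration.
    original = os.path.join(tmp, "vara_1.nc")
    with open(original, "w") as handle:
        handle.write("data")

    soft = os.path.join(tmp, "soft.nc")
    hard = os.path.join(tmp, "hard.nc")
    os.symlink(original, soft)  # soft link: a separate entry pointing to a path
    os.link(original, hard)     # hard link: a second name for the same inode

    results = {
        "original": describe_link(original),
        "soft": describe_link(soft),
        "hard": describe_link(hard),
    }
    # Both names refer to the same inode, so neither is "the" file.
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
```
</pre>
    A tool that just walks the tree and archives every regular file
    would therefore store the content once per hard link, unless it
    additionally keeps track of inode numbers.<br>
    <br>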

    <blockquote cite="mid:4E772F60.3050200@llnl.gov" type="cite">

      <blockquote cite="mid:4E76FDDD.5030109@dkrz.de" type="cite">

        <pre wrap="">We are using the drslib for managing our own data here, but I won't be 

using it for managing the replicas.

To summarize why:

1) soft-links are (and should be) transparent to the APIs publishing 

them (i.e. we should not rely on how people structure their files from 

the outside, either a file, a soft- or a hard-link should be treated 

equally so that tools do not have to cope with a growing complexity).

2) the drs-lib creates a non-drs structure within the drs one. A small 

change could solve this by storing the files in e.g. cmip5_file/... and 

the links in cmip5/.. though.

</pre>

      </blockquote>

      What is the non-drs structure created by drs-lib?<br>

    </blockquote>

    The files/ directory that sits among the different version directories. Stephen
    described that here: <a class="moz-txt-link-freetext" href="http://esgf.org/wiki/DrsVersionLinking">http://esgf.org/wiki/DrsVersionLinking</a><br>

    <br>

    To sum it up, this would be a valid and common scenario in which a new
    file is added to a dataset (e.g. a couple more years have been
    computed):<br>

    f1 = ../cmip5/output1/MyINST/MyModel/exp/freq/realm/table/r1i1p1/files/vara_20110101/vara_1.nc<br>
    f2 = ../cmip5/output1/MyINST/MyModel/exp/freq/realm/table/r1i1p1/files/vara_20110808/vara_2.nc<br>
    l1 = ../cmip5/output1/MyINST/MyModel/exp/freq/realm/table/r1i1p1/v20110101/vara/vara_1.nc -&gt; f1<br>
    l2 = ../cmip5/output1/MyINST/MyModel/exp/freq/realm/table/r1i1p1/v20110808/vara/vara_1.nc -&gt; f1<br>
    l3 = ../cmip5/output1/MyINST/MyModel/exp/freq/realm/table/r1i1p1/v20110808/vara/vara_2.nc -&gt; f2<br>
    [so l1 -&gt; f1 means link 1 points to file 1]<br>

    <br>

    The first two (f1 and f2) are embedded in the DRS structure but do
    not conform to the DRS (they are _normally_ not meant to be accessed
    from the outside). The replication tool must know this in order to
    replicate the structure as it is shown here.<br>
    What if I only replicate v20110808? The only way to get exactly this
    structure is to use a tool that replicates everything "exactly" as
    it is here (keyword: bulk, no subset of files) and that replicates
    links and files as required.<br>

    <br>
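    To make the v20110808 case concrete, here is a minimal sketch,
    assuming a POSIX file system, relative soft links, and Python's
    standard library. The directory tree is a shortened, hypothetical
    stand-in for the paths above, and shutil.copytree merely stands in
    for whatever replication tool is used. Copying only the version
    directory with the links preserved leaves every link dangling,
    because files/ was not copied along with it.<br>
    <pre>
```python
import os
import shutil
import tempfile


def make_file(path):
    """Create a small placeholder file, including its parent directories."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write("data")


def make_link(link_path, target_path):
    """Create a relative soft link at link_path pointing to target_path."""
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    relative = os.path.relpath(target_path, os.path.dirname(link_path))
    os.symlink(relative, link_path)


with tempfile.TemporaryDirectory() as tmp:
    # Shortened stand-in for .../table/r1i1p1 (hypothetical layout).
    src = os.path.join(tmp, "src", "r1i1p1")
    f1 = os.path.join(src, "files", "vara_20110101", "vara_1.nc")
    f2 = os.path.join(src, "files", "vara_20110808", "vara_2.nc")
    make_file(f1)
    make_file(f2)
    make_link(os.path.join(src, "v20110101", "vara", "vara_1.nc"), f1)  # l1
    make_link(os.path.join(src, "v20110808", "vara", "vara_1.nc"), f1)  # l2
    make_link(os.path.join(src, "v20110808", "vara", "vara_2.nc"), f2)  # l3

    # Replicate only the newest version, keeping soft links as soft links.
    dst_version = os.path.join(tmp, "dst", "r1i1p1", "v20110808")
    shutil.copytree(os.path.join(src, "v20110808"), dst_version, symlinks=True)

    copied = sorted(os.listdir(os.path.join(dst_version, "vara")))
    # os.path.exists follows links, so it is False for a dangling link.
    dangling = [
        name for name in copied
        if not os.path.exists(os.path.join(dst_version, "vara", name))
    ]
```
</pre>
    Here both copied links end up dangling. Following the links instead
    (symlinks=False) would give real files, but then the files/
    structure is not reproduced on the replica.<br>
    <br>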

    <br>

    And sorry again for that awful explanation I gave before.<br>

    <br>

    Thanks,<br>

    Estani<br>

    <blockquote cite="mid:4E772F60.3050200@llnl.gov" type="cite">

      <blockquote cite="mid:4E76FDDD.5030109@dkrz.de" type="cite">

        <pre wrap="">3) Only a few tools can handle the copying of files and soft-links 

properly, and only in bulk. Copying a single file could result in either 

copying a link pointing to a non existent file, or a link and a file 

from a potentially different version. This is very non-intuitive.

What we are aiming to do is:

1) replicate on a meta-data basis, i.e. first get the metadata and then 

make a decision on what to do next. We are using the file name and 

checksum to recognize a file that hasn't changed from a previous version. 

The source url of an unchanged file will be changed to <a moz-do-not-send="true" class="moz-txt-link-freetext" href="file://">file://</a>... and 

pointed to the existing file (e.g. for those old files in case a new 

version is published and still uses them). This ensures we move only what 

we need.

2) Then we will create links from the new file to the old same file. 

It's not clear if we will point to the original file (first version 

acquired) or to the linked one. It depends on the type of link to use, 

either:

- soft-links: problems with maintaining a temporary structure while the 

data is being downloaded and prepared and moving it around. The 

difference is that they point to other drs_valid files (specifically, 

the one from last version, even if it's already a link). Version 

deletion is tricky; an error and potentially all versions of the file 

will be lost.

- or hard-links: files can be created, deleted, or moved around in the 

same file system without any penalty. Version deletion is trivial. But 

you need to maintain the DB to recognize those (which we do, in the 

esg-publisher); finding all hard-links of a file is always possible but 

extremely time consuming (same as soft-links). Though this will only be 

desirable for improving the tape storage strategy.

In either case, I do think we should not rely on whether a node uses 

soft-links, files or hard-links. To the outside they should be treated 

like normal files in my opinion. Saving space, and more importantly how 

to do so, should be a decision of each archive site. It really depends 

on the technology in use.

Thanks,

Estani

On 19.09.2011 02:23, V. Balaji wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">So long as source and destination are on the same filesystem, you

can use hard links instead of symbolic ('ln' instead of 'ln -s'),

can't you? Hard links are transparent (or do I mean opaque... in

either case, I mean not a problem:-) for any data transfer protocol,

including gridftp, and do not cost anything in bandwidth or storage.

But hard links cannot cross filesystem partitions.

Karl Taylor writes:

</pre>

          <blockquote type="cite">

            <pre wrap="">Hi Stephen,

For replicating the latest version, I agree that your alternate structure

poses difficulties (but it seems like there must be a way to smartly

determine whether you already have a file and simply need to move

it, rather than bring it over again).

I wanted to bring up another use case in which your alternative offers some

advantages (after a slight modification):  a user using gridftp wants to

retrieve the latest version of datasets distributed around several sites

(ignore replicated versions for simplicity).  If the user wants to write a

script to retrieve the data without having to look up what the latest version

number is for each dataset, wouldn't it be best to put the actual files in

"latest" subdirectory, with pointers going from the version identified

directory to "latest"?   All of these complications seem to be due to

gridftp's inability to follow links.

Are there other ways to enable users to do the above without putting the

actual files in a directory named "latest"?

regards,

Karl

On 9/17/11 1:16 PM, <a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:stephen.pascoe@stfc.ac.uk">stephen.pascoe@stfc.ac.uk</a> wrote:

</pre>

            <blockquote type="cite">

              <pre wrap="">Hi All,

I'm aware that many people are reluctant to use drslib because of its use

of symbolic links when constructing the DRS directory structure.  I

completely understand caution when using symbolic links but I wanted to

make the case for why I believe it is necessary to meet the goals of a

consistent distributed and versioned archive.  Therefore I've prepared a

wiki page that goes into the technical details:

<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://esgf.org/wiki/DrsVersionLinking">http://esgf.org/wiki/DrsVersionLinking</a>

Please read it carefully if you are currently considering how to implement

the DRS directory structure.  I propose we discuss this at the next ESGF

telco.

Thanks,

Stephen.

---

Stephen Pascoe  +44 (0)1235 445980

Centre of Environmental Data Archival

STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK


</pre>

            </blockquote>

          </blockquote>

        </blockquote>


      </blockquote>

    </blockquote>

    <br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Estanislao Gonzalez

Max-Planck-Institut f&uuml;r Meteorologie (MPI-M)

Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre

Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126

E-Mail:  <a class="moz-txt-link-abbreviated" href="mailto:gonzalez@dkrz.de">gonzalez@dkrz.de</a> </pre>

  </body>

</html>