<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    <font face="Times New Roman">Hi Balaji,<br>

      <br>

      I think it would be more difficult for some users to do what they

      want if only the files that should *not* be downloaded were

      listed, but I'll think more about this.<br>

      <br>

      Karl<br>

      <br>

    </font><br>

    On 9/23/11 8:00 AM, V. Balaji wrote:

    <blockquote cite="mid:alpine.DEB.2.00.1109231031000.2069@adyar"

      type="cite">

      <pre wrap="">I'd like to assume, optimistically, that the number of files that have

non-latest status (i.e., have been superseded or retracted) is going to be very

small compared to the total number of files on the system.

In that case, perhaps it would be more efficient for the additional text

file proposed by Karl to list only those files that have been superseded

and should _not_ be downloaded, thus only the non-latest rather than the

latest.

Additionally it should be possible to set a flag in THREDDS that marks

those files as non-latest, and then have some behaviour from the UI,

like a warning popup, signalling that an alternate file is to be

downloaded instead.

Even more basic, we can return to the filesystem directory (which is

the *original* data catalog, using a software artifact developed in

1969 or so... we should all be so lucky as to create a data structure

that's used 40 years from now...) and signal a file as non-latest by

simply flipping its read bit (chmod a-r).
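
A minimal sketch of the read-bit idea, assuming a POSIX filesystem and
Python's standard os and stat modules. The function names
mark_non_latest and is_marked_non_latest are hypothetical; the bit
arithmetic is just the equivalent of chmod a-r.

```python
import os
import stat

# Read permission for user, group and other: the bits that "chmod a-r" clears.
READ_BITS = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH


def mark_non_latest(path):
    """Clear every read bit on path, leaving the other permission bits alone."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Setting the read bits and then XOR-ing them away clears exactly those bits.
    os.chmod(path, (mode | READ_BITS) ^ READ_BITS)


def is_marked_non_latest(path):
    """Return True when none of the read bits is set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return ((mode | READ_BITS) ^ READ_BITS) == mode
```

A download script could call is_marked_non_latest before fetching, with
the caveat that a file made unreadable for some unrelated reason would
look exactly the same.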

PS. I am glad we have consensus on checksums, but I'd like to hear

comments on Karl's proposal on how to enforce this requirement on

already catalogued files, pulled out and re-quoted here:

</pre>

      <blockquote type="cite">

        <pre wrap="">      P.S. to weigh in on another issue, I think it *will* be

essential to require, as part of ESG publication that the check-sum be

recorded (in the THREDDS catalog, if I'm not mistaken).  We haven't

asked groups to republish data conforming to this new requirement

because I want to make sure that any other required alterations in the

configuration of the publisher are also communicated, so we only have to

ask groups to republish once.  Note also that if my "alternative"

approach outlined above is adopted, the checksums could either be gotten

from the catalog (if they were computed and stored there) or be

calculated by drslib itself; there would be no need to republish data to

ESG

</pre>

      </blockquote>

      <pre wrap="">

Is Karl's proposed design ok? Look for the checksum in THREDDS; if not found,

have drslib or something add it in.

Thanks,

Kettleborough, Jamie writes:

</pre>

      <blockquote type="cite">

        <pre wrap="">Hello Karl,

thanks for responding on this and making the user view much more

explicit.  And thanks for the note on the checksum - it's good to know

this is close to being 'required'.

I also agree that the risk to CMIP5 (and the model contribution to IPCC

AR5) due to data access problems is sufficiently high that simple

(non-general) solutions that can be delivered quickly are needed.  I

think that many of the early CMIP5/working group 1 users are happy to

take some of the responsibility for filtering which data they need,

q.c., etc on themselves.  So this user base can tolerate simpler

solutions.  This may not apply to working groups 2, 3... I don't know.

Later studies of working group 1 may need richer model meta-data.

Some questions /comments on your proposal:

1. I think the list files are derivable from the thredds catalogue

entries for the publication version dataset (if they all contained  the

checksums) - I think you suspect this.  In a sense (I think) they are a

reformatting of the thredds catalogues into a form more parsable by

users. If it can be achieved in time then I think it's safer to get the

checksums in the thredds catalogues and derive any other format from

there.

2. do you think you would expose these list files through http?  You

mention gridFTP but how soon do you think gridFTP will be available for

the users that need it...

  a. I'm not sure how many have data available through gridFTP

(<a class="moz-txt-link-freetext" href="http://esgf.org/wiki/Cmip5Status/ArchiveView">http://esgf.org/wiki/Cmip5Status/ArchiveView</a> suggests not many?)

  b. I'm not sure how many users will have gridFTP clients (or maybe

you can use a standard ftp client?)

3. Do you need to capture this idea of 'latest' in the user view, or can

the user work this out based on the version number?

  a. including 'latest' makes it easier for users as it takes one bit

of responsibility away from them

  b. but you may be introducing an inconsistency between the thredds

interface (which doesn't really expose this idea of latest), and the

more file-based interface.

  c. this exposure of 'latest' may be a minor point, (but it's ringing

alarm bells with me).

4. I don't think you need time sample do you - isn't that in the file

name?

5. what is the full path to the file - the one visible through gridFTP,

or through the thredds file server or what?

6. an addition - but in the same vein of simplicity, can we have an easy

to parse list of servers that hold CMIP5 data available via http.  In

the first instance this could be populated by hand.  It could be as

simple as a csv file - server,pki_status.
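
A minimal sketch of reading such a list, assuming a header row of
server,pki_status as named in item 6. The host names and status values
below are made up for illustration and are not real CMIP5 nodes.

```python
import csv
import io


def read_server_list(text):
    """Parse CSV text with a server,pki_status header into (server, status) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["server"].strip(), row["pki_status"].strip()) for row in reader]


# Hypothetical hosts and status values, not real CMIP5 nodes.
example = "server,pki_status\nnode1.example.org,enabled\nnode2.example.org,none\n"
servers = read_server_list(example)
```

Keeping the file to two plain columns means it could be maintained by
hand at first and parsed by any scripting language.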

I'm afraid I haven't had time to think about all the issues around hard

links, soft links and tape storage - and there may be more major issues

there.

Jamie

________________________________

      From: Karl Taylor [<a class="moz-txt-link-freetext" href="mailto:taylor13@llnl.gov">mailto:taylor13@llnl.gov</a>]

      Sent: 21 September 2011 23:19

      To: <a class="moz-txt-link-abbreviated" href="mailto:stephen.pascoe@stfc.ac.uk">stephen.pascoe@stfc.ac.uk</a>

      Cc: <a class="moz-txt-link-abbreviated" href="mailto:gavin@llnl.gov">gavin@llnl.gov</a>; Kettleborough, Jamie; <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a>;

<a class="moz-txt-link-abbreviated" href="mailto:esg-node-dev@lists.llnl.gov">esg-node-dev@lists.llnl.gov</a>

      Subject: Re: [Go-essp-tech] Reasoning for the use of symbolic

links in drslib

      Hi Stephen and all,

      I would add another requirement (or is this part of 4?):

      5.  A user (as opposed to a data provider or a "replicator" or a

data center data manager) should be able to determine (through an

automated scripted process) whether a file previously downloaded is in

the current (i.e., "latest")  version of a dataset, or has been

withdrawn or replaced.

      To meet all the requirements in a practical way in the next few

weeks, I'll suggest an alternative approach:  We could use drsLib to

create the DRS directory structure, but populate the lowest level (where

the files would normally be found) with a single text file  (referred to

subsequently as the "listing file") containing the following

information:

      the publication-level dataset version THREDDS id, which is:

&lt;activity&gt;.&lt;product&gt;.&lt;institute&gt;.&lt;model&gt;.&lt;experiment&gt;.&lt;frequency&gt;.&lt;modeling realm&gt;.&lt;MIP table&gt;.&lt;ensemble member&gt;.&lt;version number&gt;

      plus the &lt;variable name&gt;

      followed by a table with:

      filename   time units   time of 1st time sample   time of last time sample   full path to file   tracking_id   checksum

      --------   ----------   -----------------------   ------------------------   -----------------   -----------   --------

      file1

      file2

      .

      .

      .

      fileN

      The "listing file" would be stored twice for the latest version

of each dataset:  once under the numbered version subdirectory and

*also* under the generically labeled "latest" directory.  [This is so a

user interested in the latest version can find it without knowing its

actual number.]  By the way the time information included in the list

might not be absolutely essential, but it could be helpful for those

only wanting to download specific time-segments of an integration.

      I realize this is not a particularly elegant approach, but if

users were given access to the drs directory structure (say, through

gridftp), they could run a script that navigated directly to a variable

of interest (based on the DRS directory structure specifications) and

download the "listing file" stored there.  Then, the "latest" listing

file could be compared to the older "listing file" (previously

downloaded by the user) to determine whether a new version was available

(by simply comparing the &lt;version numbers&gt; stored in the THREDDS ID).

If the user didn't have the most recent version, he could then compare

the two "listing files" (old and new) to determine which files were new

and which (if any) had been eliminated.

      At that point, the user could generate a local copy of the

latest version by moving/deleting files not found in the latest "listing

file" and by downloading (using, for example, gridftp) only the new

files.
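
The comparison described above can be sketched as follows. This is only
an illustration under an assumed layout, not drslib's behaviour: the
first non-blank line of a listing file is taken to be the THREDDS id
ending in the version number, and every later line is one
whitespace-separated row with the filename first and the checksum last.
The names parse_listing and plan_update are hypothetical.

```python
def parse_listing(text):
    """Return (version, {filename: checksum}) for one assumed-format listing file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The version number is the last dot-separated field of the THREDDS id.
    version = lines[0].rsplit(".", 1)[-1]
    files = {}
    for row in lines[1:]:
        fields = row.split()
        files[fields[0]] = fields[-1]
    return version, files


def plan_update(old_text, new_text):
    """Work out which files to download and which local files to move or delete."""
    old_version, old_files = parse_listing(old_text)
    new_version, new_files = parse_listing(new_text)
    if old_version == new_version:
        return {"download": [], "remove": []}
    # Download anything that is new or whose checksum differs from the local copy.
    download = sorted(
        name for name, checksum in new_files.items()
        if old_files.get(name) != checksum
    )
    # Anything no longer listed has been withdrawn from the latest version.
    remove = sorted(name for name in old_files if name not in new_files)
    return {"download": download, "remove": remove}
```

Comparing checksums rather than filenames alone also catches a file
that was replaced in place under the same name.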

      I bet that in a single day Stephen could enhance drslib to

produce these list files, rather than creating the symbolic links to the

actual file locations as it currently does.  Note that if the actual

files were moved into new directories sometime in the future, a utility

would have to be written to modify all the "list files" to point to the

new file locations (but that's also true of the symbolic links, I think).

      Also note that creation of a new version would *not* require

changing any of the existing "list files" (except the list file in the

"latest" directory would be removed).  A new version subdirectory would

have to be created and for each variable in the dataset, the new "list

file" for that version would have to be generated (and copied also to

"latest").

      I'll be interested in your response to this idea and trust that

any time spent thinking about it is warranted (i.e., that this is not a

completely stupid suggestion).  Will it meet all of Stephen's needs?

Are there any other solutions to the data users' troubles in obtaining

data, which we can implement in the next few weeks (since that should be

our goal here)?

      My primary interest is in making CMIP5 data easily obtainable by

users (which appears not to be the case at present), and to allow users

to write scripts to troll for new data they are interested in and

discover any new versions of data that should replace the old.  This is

not meant to be a general solution to all of the possible ESG

applications.  Also, I'm guessing that a similar approach could be

followed where instead of reading the "list files", one read the

catalogs, but I doubt that this would be as easy for the typical user to

do.

      Best regards,

      Karl

      P.S. to weigh in on another issue, I think it *will* be

essential to require, as part of ESG publication that the check-sum be

recorded (in the THREDDS catalog, if I'm not mistaken).  We haven't

asked groups to republish data conforming to this new requirement

because I want to make sure that any other required alterations in the

configuration of the publisher are also communicated, so we only have to

ask groups to republish once.  Note also that if my "alternative"

approach outlined above is adopted, the checksums could either be gotten

from the catalog (if they were computed and stored there) or be

calculated by drslib itself; there would be no need to republish data to

ESG

      On 9/20/11 2:35 PM, <a class="moz-txt-link-abbreviated" href="mailto:stephen.pascoe@stfc.ac.uk">stephen.pascoe@stfc.ac.uk</a> wrote:

              Hi All,

              Lots of good discussion here and sorry I've been keeping

quiet.  I want to remind ourselves of the requirements I laid out in the

wiki page

              1. It should allow data from multiple versions to be

kept on disk simultaneously.

              2. It should avoid storing multiple copies of files that

are present in more than one version.

              3. It should be straightforward to copy dataset changes

(i.e. differences between versions) between nodes to allow efficient

replication.

              4. It should rely only on the filesystem so that generic

tools like FTP could be used to expose the structure if necessary.

              In my view we should address these directly.  Are they

needed?  Which are the most important?

              Gavin said about catalogs

              &gt; you can quickly and easily inspect catalog_v1 and

catalog_v2 to find what the changes are.

              &gt; This all answers the question of "WHAT" (to

download)... the other question of "HOW" is a different, but related

story.

              &gt; The trick is to not conflate the two issues which is

what filesystem discussions do.

              But THREDDS conflates the two as well!  A THREDDS

catalog contains descriptions of service endpoints that are not

independent of the node serving the data (the "HOW").  Maybe we should

have developed a true catalog format but that is not where we are now.

The replication client uses THREDDS catalogs in this way but when I last

looked it was completely unaware of versions -- i.e. it won't help with

#3.

              I don't see how Gavin's point addresses any of the

requirements above.  Even if we ditch #4, which I expect Gavin would

argue for, it doesn't directly solve the problem for #1-#3 either.

              Briefly on some other points that have been made...

              Balaji, some archive tools may be able to detect 2 paths

pointing to the same filesystem inode but both Estani and I have

enquired with our backup people and they say hard links must be avoided.

I am happy to include a hard-linking option in drslib though.  I've

created a bugzilla ticket for it.

              Karl, I think putting real files in "latest" is

equivalent to putting real files in the latest "vYYYYMMDD" directory.

The directories can be renamed trivially on upgrade but you still have

the same problems as the wiki page says.

              I'm sure there were other points but I've lost track.

Checksums will have to wait for another email.

              Cheers,

              Stephen.

              ---

              Stephen Pascoe  +44 (0)1235 445980

              Centre of Environmental Data Archival

              STFC Rutherford Appleton Laboratory, Harwell Oxford,

Didcot OX11 0QX, UK

              From: <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech-bounces@ucar.edu">go-essp-tech-bounces@ucar.edu</a>

[<a class="moz-txt-link-freetext" href="mailto:go-essp-tech-bounces@ucar.edu">mailto:go-essp-tech-bounces@ucar.edu</a>] On Behalf Of Gavin M. Bell

              Sent: 20 September 2011 17:26

              To: Kettleborough, Jamie

              Cc: <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a>; <a class="moz-txt-link-abbreviated" href="mailto:esg-node-dev@lists.llnl.gov">esg-node-dev@lists.llnl.gov</a>

              Subject: Re: [Go-essp-tech] Reasoning for the use of

symbolic links in drslib

              Jamie and friends.

              You've answered your own questions :-)...

              It is the catalog where these checksums (and other

features) are recorded.

              And thus using the catalog you can see what has changed.

              There is a new catalog for every version of a dataset.

Given that...

              you can quickly and easily inspect catalog_v1 and

catalog_v2 to find what the changes are.

              This all answers the question of "WHAT" (to download)...

the other question of "HOW" is a different, but related story.

              The trick is to not conflate the two issues which is

what filesystem discussions do.  When talking about filesystems you are

stipulating the what but implicitly conflating the HOW because you are

implicitly designing for tools that intrinsically use the filesystem.

It is a muddying of the waters that doesn't separate the two concerns.

We need to deal with these two concepts independently in a way that does

not limit the system or cause undue burden on institutions by requiring

a rigid structure.

              As I mentioned... it's not the filesystem we need to

look at... it's the catalogs.

              just my $0.02 - I'll stop flogging this particular

horse... but I hope I have done a better job expressing the issues and

where the solution lies (IMHO).

              On 9/20/11 8:14 AM, Kettleborough, Jamie wrote:

              Hello Balaji,

              I agree - getting all nodes to make the checksums

available would be a

              good thing.  It gives you both the data integrity check

on download, and

              the ability to see what files really have changed from

one publication

              version to the next.

              I don't know how hard it is to do this, particularly for

data that is

              already published.

              Jamie

                      -----Original Message-----

                      From: V. Balaji [<a class="moz-txt-link-freetext" href="mailto:V.Balaji@noaa.gov">mailto:V.Balaji@noaa.gov</a>]

                      Sent: 20 September 2011 16:01

                      To: Kettleborough, Jamie

                      Cc: Karl Taylor; <a class="moz-txt-link-abbreviated" href="mailto:go-essp-tech@ucar.edu">go-essp-tech@ucar.edu</a>;

<a class="moz-txt-link-abbreviated" href="mailto:esg-node-dev@lists.llnl.gov">esg-node-dev@lists.llnl.gov</a>

                      Subject: Re: [Go-essp-tech] Reasoning for the

use of symbolic

                      links in drslib

                      If nodes can currently choose to record

checksums or not, I'd

                      strongly recommend this be a non-optional

requirement.. how

                      could anyone download any data with confidence

without being

                      able to checksum?

                      You can of course check timestamps and filesizes

and so on,

                      but you have to consider those optimizations...

a fast option

                      for the less paranoid to avoid the sum

computation, which has

                      to be the gold standard.

                      "Trust but checksum".
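
A minimal sketch of that check, assuming the expected checksum is
available as a hex string. The thread does not say which algorithm the
catalogs record, so SHA-256 is only a placeholder default here, and the
names file_checksum and is_intact are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never sit in memory whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected, algorithm="sha256"):
    """Compare the computed checksum with the published one, ignoring case."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

A size or timestamp comparison could run first as the fast path, with
this check as the final word.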

                      Kettleborough, Jamie writes:

                              Hello Karl, everyone,

                                 For replicating the latest version, I

agree that your alternate

                              structure poses difficulties (but it

seems like there must

                      be a way to

                              smartly determine whether you

already have a file

                      and simply

                              need to move it, rather than bring it

over again).

                              Doesn't every user (not just the

replication system) have

                      this problem:

                              they want to know what files have

changed (or not changed) at a new

                              publication version.  No one wants to be

using bandwidth

                      or storage

                              space to fetch and store files they

already have.  How is a user

                              expected to know what has really

changed?  Estani mentions

                      check sums

                              - OK, but I don't think all nodes expose

them (is this

                      right?).  You

                              may try to infer from modification dates

(not sure, I

                      haven't looked at

                              them that closely).  You may try to

infer from the

                      TRACKING_ID - but

                              I'm not sure how reliable this is (I can

imagine scenarios where

                              different files share the same

TRACKING_ID - e.g. if they have been

                              modified with an nco tool).

                              Is there a recommended method for users

to understand what *files*

                              have actually changed when a new

publication version appears?

                              Thanks,

                              Jamie

                      --

                      V. Balaji                               Office:

+1-609-452-6516

                      Head, Modeling Systems Group, GFDL      Home:

+1-212-253-6662

                      Princeton University                    Email:

<a class="moz-txt-link-abbreviated" href="mailto:v.balaji@noaa.gov">v.balaji@noaa.gov</a>

              _______________________________________________

              GO-ESSP-TECH mailing list

              <a class="moz-txt-link-abbreviated" href="mailto:GO-ESSP-TECH@ucar.edu">GO-ESSP-TECH@ucar.edu</a>

              <a class="moz-txt-link-freetext" href="http://mailman.ucar.edu/mailman/listinfo/go-essp-tech">http://mailman.ucar.edu/mailman/listinfo/go-essp-tech</a>

              --

              Gavin M. Bell

              --

               "Never mistake a clear view for a short distance."

                             -Paul Saffo


</pre>

      </blockquote>

      <pre wrap="">

--

V. Balaji                               Office:  +1-609-452-6516

Head, Modeling Systems Group, GFDL      Home:    +1-212-253-6662

Princeton University                    Email: <a class="moz-txt-link-abbreviated" href="mailto:v.balaji@noaa.gov">v.balaji@noaa.gov</a>

</pre>

    </blockquote>

  </body>

</html>