[Go-essp-tech] [esg-node-dev] Use of <metadata> element in THREDDS catalogs

Gavin M. Bell gavin at llnl.gov
Thu Jun 2 10:47:49 MDT 2011


I agree with you, Stephen... completely.

The dataset is OUR (ESGF) logical file unit.  It would be great if we
could make the world think in datasets and completely encapsulate the
notion of files.  I would love that, but until people get acclimated to
the dataset notion through using the system, the notion of "file" that
we are all used to cannot be avoided.  We should manipulate things in
terms of the ESGF *logical* file = the dataset as represented by the
catalog... as much as we can, because it makes sense in our model of how
things should be grouped.  At the replication level things should only
be manipulated in the context of datasets.  For the user... we should
support files, but I think we should do the following:

If a user wants a file from a dataset, they should be able to get the
file but we should maintain the context of the dataset by maintaining
the dataset as a physical filesystem construct.  For example, if you use
a Mac you will see that an "application" is really a top-level directory
for a set of files.  When you download an application what you get is a
set of files in a file hierarchy such that in concert they manifest the
application you expect.  Along the same lines, I would propose that we
have a similar construct for datasets and their relationship to files.
The details of this layout are something I'd like to bring up for
discussion, provided the basic premise of what I am saying is accepted.

We can then build tools that internalize this construct and are thus
able to manipulate datasets directly.  I have been mulling over
building an ESG SHELL and I think I will finally do so.... As a part of
that shell you would be able to perform augmented shell commands like
"ls" that would operate accordingly in the context of our notion of dataset.

something like
.
`-- foo_dataset
    |-- foo_datafile1.nc
    |-- foo_datafile2.nc
    |-- foo_datafile3.nc
    |-- foo_datafile5.nc
    `-- foo_dataset.catalog

With this kind of structure you would always have the full catalog for
the dataset present and represented.  You may have all or a subset of
files that are in the catalog present.  In the replication scenario, you
would have them all.  In the end user scenario you may have a subset. 
With the augmented esgf-shell "ls" command you would additionally be able
to see which files are present and which are not.  Also, because you have
the catalog you can check the checksums of the files and you can then
issue an esgf-shell command to "complete" the dataset and have it pull
down the rest of the files.  In the replication scheme I am exploring
this is how it is intended to work.  Also, the top-level foo_dataset
directory lives under the data.repl directory, where all replicas are
kept.  This bears fruit down the line by simplifying several operations.
This imposition is not required of data publishers for the datasets they
are custodians of, because the publisher's database can be used to
locate the files - which is part of another scheme I have hatched to
divorce the filesystem from the tyranny of the DRS's overreaching (IMHO)
filesystem mandate.
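
As a rough sketch of how the augmented "ls" and the checksum check might
fit together, the Python below compares a dataset directory against its
catalog and reports each file as present, missing, or corrupt.  Several
things here are assumptions made purely for illustration: the catalog is
treated as a plain-text manifest with one "<md5> <filename>" pair per
line (a real THREDDS catalog is XML and would need a proper parser), and
the function names read_catalog, md5_of and dataset_status are
hypothetical, not part of any existing ESGF tool.

```python
import hashlib
from pathlib import Path


def read_catalog(catalog_path):
    """Parse the ASSUMED manifest format: one '<md5> <filename>' per line."""
    entries = {}
    for line in Path(catalog_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        checksum, name = line.split(None, 1)
        entries[name] = checksum
    return entries


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_status(dataset_dir):
    """Map each catalogued file to 'present', 'missing' or 'corrupt'.

    Assumes the catalog sits inside the dataset directory and is named
    '<directory name>.catalog', as in the foo_dataset layout above.
    """
    dataset_dir = Path(dataset_dir).resolve()
    catalog = dataset_dir / (dataset_dir.name + ".catalog")
    status = {}
    for name, expected in read_catalog(catalog).items():
        path = dataset_dir / name
        if not path.is_file():
            status[name] = "missing"
        elif md5_of(path) != expected:
            status[name] = "corrupt"
        else:
            status[name] = "present"
    return status


if __name__ == "__main__":
    import sys

    for name, state in sorted(dataset_status(sys.argv[1]).items()):
        print(f"{state:8s} {name}")
```

Under these assumptions, the files a "complete" command would need to
pull down are simply those whose status is anything other than "present".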

Now I'll be the first to mention that this proposal to impose a
filesystem structure is somewhat hypocritical, since I have railed
against the DRS's imposition of structure on the filesystem... but I
think in this context it is limited enough in scope and provides enough
of a benefit to be justified.

I'd like to have this conversation.

Trust me... this is the way to go. (IMHO)  :-)

On 6/2/11 12:58 AM, stephen.pascoe at stfc.ac.uk wrote:
> My instinct is that we should accept datasets are collections of files
> and not try to completely hide this idea, however most of the system
> should focus on datasets because they are more flexible.

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
