<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffcc" text="#000000">
I agree with you Stephen.... completely.<br>
<br>
The dataset is OUR (ESGF) logical file unit. It would be great if
we could make the world think in datasets and completely encapsulate
the notion of files. I would love that, but until people have become
acclimated to the dataset notion through using the system, the notion
of "file" that we are all used to cannot be avoided. We should
manipulate things in terms of the ESGF *logical* file = the dataset
as represented by the catalog... as much as we can, because it makes
sense in our model of how things should be grouped. At the
replication level things should only be manipulated in the context
of datasets. For the user... we should support files, but I think
we should do the following:<br>
<br>
If a user wants a file from a dataset, they should be able to get
the file but we should maintain the context of the dataset by
maintaining the dataset as a physical filesystem construct. For
example, if you use a Mac you will see that an "application" is
really a top level directory for a set of files. When you download
an application what you get is a set of files in a file hierarchy
such that in concert they manifest the application you expect.
Along the same lines, I would propose that we have a similar
construct for datasets and their relationship to files.<br>
The details of this layout are something I'd like to bring up for
discussion, given that the basic premise of what I am saying is
accepted.<br>
<br>
We can then build tools that internalize this construct and are
thus able to manipulate datasets directly. I have been mulling
over building an ESG SHELL and I think I will finally do so.... as a
part of that shell you would be able to perform augmented shell
commands like "ls" that would operate accordingly in the context of
our notion of dataset.<br>
<br>
something like<br>
.<br>
`-- foo_dataset<br>
&nbsp;&nbsp;&nbsp; |-- foo_datafile1.nc<br>
&nbsp;&nbsp;&nbsp; |-- foo_datafile2.nc<br>
&nbsp;&nbsp;&nbsp; |-- foo_datafile3.nc<br>
&nbsp;&nbsp;&nbsp; |-- foo_datafile5.nc<br>
&nbsp;&nbsp;&nbsp; `-- foo_dataset.catalog<br>
<br>
With this kind of structure you would always have the full catalog
for the dataset present and represented. You may have all or a
subset of files that are in the catalog present. In the replication
scenario, you would have them all. In the end user scenario you may
have a subset. With the augmented esgf-shell "ls" command you would
additionally be able to see which files are present and which are
not. Also, because you have the catalog, you can check the checksums
of the files and you can then issue an esgf-shell command to
"complete" the dataset and have it pull down the rest of the files.
This is how the replication scheme I am exploring is intended to
work. Also, the top level foo_dataset directory lives under the
data.repl directory where all replicas are kept, which bears fruit
down the line by simplifying several operations. This layout is not
imposed on data publishers for the datasets they are custodians of,
because the publisher's database can be used to locate the files -
which is part of another scheme I have hatched to divorce the
filesystem from the tyranny of the DRS's overreaching (IMHO)
filesystem mandate. <br>
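<br>
To make the "ls" and "complete" ideas above a bit more concrete, here
is a rough Python sketch. It is not the esgf-shell, just an
illustration. I am assuming, purely for the sake of the example, that
foo_dataset.catalog is a plain text file with one "sha256 filename"
pair per line; the real catalog format will differ, and the function
names read_catalog, dataset_status and files_to_complete are
hypothetical.<br>
<br>

```python
import hashlib
import os


def read_catalog(catalog_path):
    """Parse the ASSUMED catalog format: one 'sha256-hex  filename' pair per line."""
    entries = {}
    with open(catalog_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            checksum, name = line.split(None, 1)
            entries[name] = checksum
    return entries


def sha256_of(path):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_status(dataset_dir):
    """Return {filename: 'present' | 'missing' | 'corrupt'} for every catalog entry."""
    # Hypothetical convention from the layout above: the catalog is named
    # after the dataset directory, e.g. foo_dataset/foo_dataset.catalog.
    name = os.path.basename(os.path.normpath(dataset_dir))
    catalog = read_catalog(os.path.join(dataset_dir, name + ".catalog"))
    status = {}
    for filename, expected in catalog.items():
        path = os.path.join(dataset_dir, filename)
        if not os.path.isfile(path):
            status[filename] = "missing"
        elif sha256_of(path) != expected:
            status[filename] = "corrupt"
        else:
            status[filename] = "present"
    return status


def files_to_complete(status):
    """Files a 'complete' command would still have to fetch, in sorted order."""
    return sorted(f for f, state in status.items() if state != "present")
```

<br>
An augmented "ls" would print the result of dataset_status for the
dataset directory, and "complete" would pull down whatever
files_to_complete returns. In the replication scenario that list
should end up empty.<br>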
<br>
Now I'll be the first to mention that this proposal to impose a
filesystem structure is somewhat hypocritical, since I have railed
against the DRS's imposition of structure on the filesystem... but I
think in this context it is limited enough in scope and provides
enough of a benefit to be justified.<br>
<br>
I'd like to have this conversation.<br>
<br>
Trust me... this is the way to go. (IMHO) :-)<br>
<br>
On 6/2/11 12:58 AM, <a class="moz-txt-link-abbreviated" href="mailto:stephen.pascoe@stfc.ac.uk">stephen.pascoe@stfc.ac.uk</a> wrote:
<blockquote
cite="mid:4C353E6E4A08AE4792B350DAA392B52119E27D@EXCHMBX01.fed.cclrc.ac.uk"
type="cite"><span style="font-size: 11pt; font-family:
'Calibri','sans-serif'; color: rgb(31, 73,
125);">My instinct is that we should accept datasets are
collections of files and not try to completely hide this idea,
however most of the system should focus on datasets because they
more flexible. </span></blockquote>
<br>
<pre class="moz-signature" cols="72">--
Gavin M. Bell
Lawrence Livermore National Labs
--
"Never mistake a clear view for a short distance."
         -Paul Saffo
(GPG Key - <a class="moz-txt-link-freetext" href="http://rainbow.llnl.gov/dist/keys/gavin.asc">http://rainbow.llnl.gov/dist/keys/gavin.asc</a>)
A796 CE39 9C31 68A4 52A7 1F6B 66B7 B250 21D5 6D3E
</pre>
</body>
</html>