[Go-essp-tech] [esg-node-dev] Use of <metadata> element in THREDDS catalogs

Mon Jun 6 16:31:18 MDT 2011

 Hi Chris,

Thanks for posting the docs.
I'll take a look at it.

I am pretty hell bent on building a shell.  It is the right thing to
do.  Maybe this is a place where we can play well with CDX?
I don't know enough about CDX.  I do know that command line is the way
to go.  I don't have a dog in the RPC hunt because if you design things
properly all RPC layers should be interchangeable (IMHO).

I'll have more meat on this skeleton... so in a couple of weeks can we
gather up interested parties and have this conversation?

On 6/4/11 2:34 PM, Mattmann, Chris A (388J) wrote:
> (from the peanut gallery)
>
>> If a user wants a file from a dataset, they should be able to get the file but we should maintain the context of the dataset by maintaining the dataset as a physical filesystem construct.  For example if you use a mac you will see that an "application" is really a top level directory for a set of files.  When you download an application what you get is a set of files in a file hierarchy such that in concert they manifest the application you expect.  Along the same lines, I would propose that we have a similar construct for datasets and their relationship to files.
> +1 and FWIW, this is the way that we do it in Apache OODT-ville, and on related projects.
>
>
>> The details of this layout are something I'd like to bring up for discussion, given that the basic premise of what I am saying is accepted.
>>
>> We can then build tools that internalize this construct and are thus able to manipulate datasets directly.  I have been mulling over building an ESG SHELL and I think I will finally do so.... As a part of that shell you would be able to perform augmented shell commands like "ls" that would operate accordingly in the context of our notion of dataset.
>>
>> something like
>> .
>> `-- foo_dataset
>>     |-- foo_datafile1.nc
>>     |-- foo_datafile2.nc
>>     |-- foo_datafile3.nc
>>     |-- foo_datafile5.nc
>>     `-- foo_dataset.catalog
> Note here too that, based on our experience on the CDX project, these types of tools are useful, that's for sure. See this paper for some more information [1]. One tradeoff is the maintainability of specific UNIX-like commands versus simply putting up a POSIX or service-style facade interface in front of the underlying grid services that in turn talk nicely with the UNIX-like commands. One way to do this would be to front ESGF services with WebDAV or something similar (aka FUSE at the filesystem/OS level) and then "mount" ESGF datasets.
>
> This doesn't solve the problem of downstream services though, and value-added metadata (unless we go so far as to build Spotlight :-), which I don't think is the way to go. Some examples of value-added services that are interesting can be seen in the context of OODT-139 [2] over at Apache. We've been working on pedigree/tracing, metadata dumping, etc. that might be nice to look at and collaborate on here. 
>
>> With this kind of structure you would always have the full catalog for the dataset present and represented.  You may have all or a subset of the files that are in the catalog present.  In the replication scenario, you would have them all.  In the end user scenario you may have a subset.  With the augmented esgf-shell "ls" command you would additionally be able to see which files are present and which are not.
> Yep, we've built some commands to do this too in CDX, called cdxls. 
>
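A minimal sketch of what such an augmented "ls" could look like, in Python. This is not the esgf-shell or cdxls implementation. The function name dataset_ls is hypothetical, and the sketch assumes the file names have already been read out of foo_dataset.catalog by some other means, since the catalog format is not settled in this thread.

```python
import os


def dataset_ls(dataset_dir, catalog_files):
    """Return (name, status) pairs for every file listed in the catalog.

    catalog_files is assumed to be an iterable of file names already
    extracted from the dataset's catalog (parsing is out of scope here).
    """
    listing = []
    for name in sorted(catalog_files):
        # A file counts as present only if it exists inside the dataset directory.
        present = os.path.isfile(os.path.join(dataset_dir, name))
        listing.append((name, "present" if present else "missing"))
    return listing
```

An end user holding only part of the dataset would see the absent files marked "missing", while a full replica would show every entry as "present".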
>> Also because you have the catalog you can check the checksums of the files and you can then issue an esgf-shell command to "complete" the dataset and have it pull down the rest of the files. 
> (for us, this is cdxget). 
>
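The checksum check behind a "complete" command can be sketched along these lines, again in Python. The names file_checksum and files_to_fetch are hypothetical. The sketch assumes the catalog has already been parsed into a mapping of file name to expected hex digest, and it assumes MD5 as the default algorithm purely for illustration. It only decides which files need to be pulled down; the actual transfer is left out.

```python
import hashlib
import os


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Compute the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(dataset_dir, catalog_checksums, algorithm="md5"):
    """Return names that are missing locally or fail the catalog checksum.

    catalog_checksums is assumed to map file name -> expected hex digest.
    """
    pending = []
    for name, expected in sorted(catalog_checksums.items()):
        path = os.path.join(dataset_dir, name)
        if not os.path.isfile(path) or file_checksum(path, algorithm) != expected:
            pending.append(name)
    return pending
```

A "complete" command would then fetch exactly the names returned by files_to_fetch, so a corrupted local copy is re-downloaded the same way a missing file is.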
>>  In the replication scheme I am exploring, this is how it is intended to work.  Also, the location of the top-level foo_dataset is under the data.repl directory where all replicas are kept.  This bears fruit down the line by simplifying several operations.  This imposition is not required of data publishers for the datasets that they are custodians of, because the publisher's database can be used to perform this file location, which is part of another scheme I have hatched to divorce the filesystem from the tyranny of the DRS's overreaching (IMHO) filesystem mandate.
> Heh.
>
>> Now I'll be the first to mention that this proposal to impose a filesystem structure is somewhat hypocritical, since I have railed against the DRS's imposition of structure on the filesystem... but I think in this context it is limited enough in scope and provides enough of a benefit to be justified.
>>
>> I'd like to have this conversation.
>>
>> Trust me... this is the way to go. (IMHO)  :-)
> I'd like to be a part of this conversation as well.
>
> Thanks!
>
> Cheers,
> Chris
>
> [1] http://sunset.usc.edu/~mattmann/pubs/SEN10.pdf
> [2] http://issues.apache.org/jira/browse/OODT-139
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattmann at nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> Phone: +1 (818) 354-8810
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
