[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

Gavin M Bell gavin at llnl.gov
Mon Jul 5 12:10:03 MDT 2010


Hello gentle-people,

Here are my two cents on this whole DRS business.  I think the
fundamental issue in all of this is the ability to do resource
resolution (lookup).  The issue of having URLs match a DRS structure
that matches the filesystem is a red herring (IMHO).  The basic
requirement is to be able to issue a query to the system such that you
find what you are looking for.  This query mechanism should be a
separate mechanism from filesystem correspondence.  The driving issue
behind the filesystem-correspondence push is that people and/or
applications can infer the location of resources in some regimented
way.  But the true heart of the issue is not the file system; it is to
perform a query that provides resource resolution.  The file system is
a familiar mechanism, but it isn't the only one.  The file system
takes a query (the file system path) and returns the resource to us
(the bits sitting at an inode somewhere, memory-mapped to some
physical platter and spindle location, that is mapped to that path).
We are overloading the file system's query mechanism when that is not
necessary.

I propose the following: we create a *filter* and a small database
(the latter we already have in the publisher).  We send a *query* to
the web server; the web server *filter* intercepts that *query*,
resolves it to the actual resource location using the database, and
returns the resource you want.  Implementing this in a filter divorces
the query structure from the file system structure, and the database
(generated by the publisher when it scans) provides the resolution.
With this mechanism in place, wget, as well as any other URL-based
tool, will be able to fetch the data as intended.

BTW: The "query" is whatever we make it up to be... (not a reference to
SQL query).

This gives the data-node admin the ability to put their files wherever
they want.  If they move files around and so on, they just have to
rescan with the publisher.  The issues around design and efficiency
can be addressed with varying degrees of cleverness.
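
To make this concrete, here is a rough sketch of what such a filter
could look like.  This is purely illustrative: the class, the
in-memory map standing in for the publisher database, and the example
paths are all made up, not taken from any ESG component.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical resolver filter: intercept the "query" (the URL path),
// look it up in a table built by the publisher, and stream the file
// from wherever it actually lives on disk.
public class DrsResolverFilter implements Filter {

    // Stand-in for the publisher database: DRS identifier -> file path.
    private final Map<String, String> index = new HashMap<String, String>();

    public void init(FilterConfig config) throws ServletException {
        // In practice this mapping would be (re)built every time the
        // publisher scans; the entry below is just an example.
        index.put(
            "/thredds/fileserver/CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/tas_example.nc",
            "/replicated/CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/tas_example.nc");
    }

    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        // The "query" is just the request path -- a DRS identifier.
        String resolved = index.get(request.getRequestURI());
        File file = (resolved == null) ? null : new File(resolved);

        if (file != null && file.isFile()) {
            // Stream the bits from their actual location; the URL never
            // has to correspond to the directory layout.
            response.setContentType("application/x-netcdf");
            response.setContentLength((int) file.length());
            InputStream in = new FileInputStream(file);
            try {
                OutputStream out = response.getOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
            }
        } else {
            // Unknown identifier: fall through to normal TDS handling.
            chain.doFilter(req, res);
        }
    }

    public void destroy() { }
}

The point being: the lookup table does the resolving, not the
directory layout, so moving files only means rebuilding the table.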

I welcome any thoughts on this issue... Please talk me down :-). I think
it is about time we put this DRS issue to bed.



Estanislao Gonzalez wrote:
> Hi Bob,
> 
> I guess you must be on vacation now.  Anyway, here's the question;
> maybe someone else can answer it:
> 
> The very first idea I had was almost what you proposed.  Your
> proposal, though, leaves URLs of the form:
> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>                                    <--- (almost) DRS structure --->
> 
> which is not a valid DRS structure (neither CMIP5_replicas nor
> CMIP5_core is in the DRS vocabulary).
> 
> My proposal has a very similar flaw:
> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>                                               <--- full DRS structure --->
> The DRS structure is preserved, but you cannot easily infer the
> correct URL from any dataset.  I think the idea is: if you know the
> prefix (http.../fileserver/) and the dataset DRS name, you can always
> get the file without even browsing the TDS:
> prefix + DRS = URL to file
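> 
> For example, with a hypothetical host and DRS name (made up purely
> for illustration):
> 
>   prefix: http://myserver/thredds/fileserver/
>   DRS:    CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/tas_example.nc
>   URL:    http://myserver/thredds/fileserver/CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/tas_example.nc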
> 
> AFAIK the URL structure used by the TDS will never be 100%
> DRS-conformant (according to DRS version 0.27), which has the form:
> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/
> <frequency>/<modeling realm>/<variable identifier>/<ensemble member>/
> <version>/[<endpoint>]
> 
> whereas the TDS one has the endpoint moved to the front (the
> thredds/fileserver, thredds/dodsC, etc. parts).
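> 
> With the same hypothetical names as above, the DRS form would read
> something like
>   http://myserver/CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/fileserver
> whereas the TDS gives you
>   http://myserver/thredds/fileserver/CMIP5/output/INSTITUTE/MODEL/expt/mon/atmos/tas/r1i1p1/v1/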
> 
> To sum things up:
> Is it possible to publish files from different directory structures
> into a unified URL structure so that it is completely transparent to
> the user?
> Am I the only one facing this problem? Are all other institutions
> planning to publish all files from only one directory?
> 
> The only viable solution I can think of is to rely on Stephen's
> versioning concept and maintain a single true DRS structure with
> links to files kept in other, more manageable directory structures
> (this will probably involve adapting Stephen's tool).
> 
> Thanks,
> Estani
> 
> 
> Bob Drach wrote:
>> Hi Estani,
>>
>> It should be possible to do what you want without running multiple
>> data nodes.
>>
>> The purpose of the THREDDS dataset roots is to hide the directory
>> structure from the end user, and to limit what the TDS can access. But
>> THREDDS can certainly have multiple dataset roots.
>>
>> In your example below, you should associate different paths with the
>> locations, for example:
>>
>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>
>> Also be aware that in the publisher configuration:
>>
>> - the directory_format can have multiple values, separated by vertical
>> bars (|). The publisher will use the first format that matches the
>> directory structure being scanned.
>>
>> - a useful strategy is to create different project sections for
>> various groups of directives. You could define a cmip5_replica
>> project, a cmip5_core project, etc.
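>>
>> For instance, something along these lines (hypothetical section names
>> and format strings -- the exact placeholder fields would need to be
>> checked against a real esg.ini):
>>
>> [project:cmip5_replica]
>> directory_format = /replicated/CMIP5/%(product)s/%(institute)s/%(model)s
>>
>> [project:cmip5_core]
>> directory_format = /core/CMIP5/%(product)s/%(institute)s/%(model)s | /core/CMIP5_old/%(product)s/%(institute)s/%(model)s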
>>
>> Bob
>>
>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>
>>> Hi Bryan,
>>>
>>> thanks for your answer!
>>> Running multiple ESG data nodes is always a possibility, but it
>>> seems like overkill to us, as we may have several different "data
>>> repositories".  We would like to separate: core-replicated,
>>> core-non-replicated, non-core, non-core-on-hpss, as well as other
>>> non-CMIP5 data.  Having 5+ ESG data nodes is not viable in our
>>> scenario.
>>>
>>> The TDS allows the access URL to be decoupled from the underlying
>>> file structure, so it might be possible.  AFAIK the publisher does
>>> not provide a simple way of doing this.
>>>
>>> Setting thredds_dataset_roots to different values while publishing
>>> doesn't appear to work, as those are mapped to map entries at the
>>> catalog root:
>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>> ...
>>>
>>> which is clearly not bijective and therefore can't be reversed to
>>> locate the file from a given URL.
>>>
>>> While publishing, all referenced data will be held in a known
>>> location.  Is it possible to somehow use this information to set up
>>> a proper catalog configuration so that the URL can be properly
>>> mapped, at least at the dataset level?
>>>
>>> The whole HPSS staging procedure should be completely transparent to
>>> the user, as well as the location of the files. I was just looking at
>>> other options in case we cannot publish them the way we want...
>>>
>>> Cheers,
>>> Estani
>>>
>>>
>>>
>>>
>>> Bryan Lawrence wrote:
>>>> sorry.
>>>>
>>>> the first sentence should have read
>>>>
>>>> Just to note that *our* approach to the local versus replication issue
>>>> will be ...
>>>>
>>>> Cheers
>>>> Bryan
>>>>
>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>
>>>>> Hi Estani
>>>>>
>>>>> Just to note that your approach to the local versus replication
>>>>> issue will be to run two different ESG nodes ... which is in fact
>>>>> the desired outcome, so as to get the right things in the
>>>>> catalogues at the right time (vis-à-vis QC etc.).
>>>>>
>>>>> The issue with respect to the cache I'm not so sure about; in
>>>>> what way do you want to expose that in ESG?
>>>>>
>>>>> Bryan
>>>>>
>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>
>>>>>> Hi Stephen,
>>>>>>
>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>
>>>>>> I'm also interested in some variables of the DEFAULT section of
>>>>>> the esg.ini configuration file, more specifically
>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services, or
>>>>>> any other that was changed or that you think might be important).
>>>>>>
>>>>>> The main question here is: how can different local directory
>>>>>> structures be published to the same DRS structure?
>>>>>> The example scenario in our case would be:
>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>
>>>>>> The only solution I can think of is to extend the URL before the
>>>>>> DRS structure starts (the URL won't be 100% DRS-conformant
>>>>>> anyway).  So
>>>>>>    http://server/thredds/fileserver/<DRS structure>
>>>>>> would turn into
>>>>>>    http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>    http://server/thredds/fileserver/local/<DRS structure>
>>>>>>    http://server/thredds/fileserver/cache/<DRS structure>
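>>>>>>
>>>>>> Presumably that would mean one THREDDS dataset root per prefix,
>>>>>> something like (sketch only):
>>>>>>    <datasetRoot path="replicated" location="/replicated"/>
>>>>>>    <datasetRoot path="local" location="/local"/>
>>>>>>    <datasetRoot path="cache" location="/cache"/>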
>>>>>>
>>>>>> Is that viable? Are there any other options?
>>>>>>
>>>>>> Thanks,
>>>>>> Estani
>>>>>>
>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>
>>>>>>> To illustrate how the ESG datanode can be configured to serve
>>>>>>> data for CMIP5 we have deployed a datanode containing a subset of
>>>>>>> CMIP3 in the Data Reference Syntax. Some key features of this
>>>>>>> deployment are:
>>>>>>>
>>>>>>>    * The underlying directory structure is based on the Data
>>>>>>>      Reference Syntax.
>>>>>>>    * Datasets are published at the realm level.
>>>>>>>    * The token-based security filter is replaced by the
>>>>>>>      OpenidRelyingParty security filter.
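>>>>>>>
>>>>>>> In web.xml terms that swap is just a matter of pointing the
>>>>>>> filter declaration at a different class.  The names below are
>>>>>>> hypothetical, not the actual ESG ones:
>>>>>>>
>>>>>>>    <filter>
>>>>>>>      <filter-name>securityFilter</filter-name>
>>>>>>>      <filter-class>example.OpenidRelyingPartyFilter</filter-class>
>>>>>>>    </filter>
>>>>>>>    <filter-mapping>
>>>>>>>      <filter-name>securityFilter</filter-name>
>>>>>>>      <url-pattern>/fileServer/*</url-pattern>
>>>>>>>    </filter-mapping>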
>>>>>>>
>>>>>>> Further notes can be found at
>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>
>>>>>>> This test deployment should be of interest to anyone wanting to
>>>>>>> know how DRS identifiers could be exposed in THREDDS catalogues
>>>>>>> and the TDS HTML interface.  You can also try downloading files
>>>>>>> with OpenID authentication or via wget with SSL-client
>>>>>>> certificate authentication.  See the link above for details.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephen.
>>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>> British Atmospheric Data Centre
>>>>>>> Rutherford Appleton Laboratory
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> GO-ESSP-TECH mailing list
>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>
>>>>
>>>
>>> -- 
>>> Estanislao Gonzalez
>>>
>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>
>>> Phone:   +49 (40) 46 00 94-126
>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>
> 
> 

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E

