[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

Doutriaux, Charles doutriaux1 at llnl.gov
Tue Jul 6 08:36:21 MDT 2010


I second Gavin on this.

As far as my two cents go (not too far), I think this whole DRS thing is
more of a distraction than anything else. The many hours we all spent on
this would probably have been better spent developing the "filter"
proposed by Gavin.

C.


On 7/5/10 11:10 AM, "Gavin M Bell" <gavin at llnl.gov> wrote:

> Hello gentle-people,
> 
> Here is my two cents on this whole DRS business.  I think the
> fundamental issue in all of this is the ability to do resource
> resolution (lookup).  The issue of having URLs match a DRS structure
> that matches the filesystem is a red herring (IMHO).  The basic issue is
> being able to issue a query to the system such that you find what you
> are looking for, and this query mechanism should be separate from
> filesystem correspondence.  The driving issue behind the filesystem
> correspondence push is that people and/or applications can infer the
> location of resources in some regimented way.  But the heart of the
> issue is not the filesystem; it is performing a query that provides
> resource resolution.  The filesystem is a familiar resolution mechanism,
> but it isn't the only one: it takes a query (the filesystem path) and
> returns the resource to us (the bits sitting at an inode location
> somewhere, memory-mapped to some physical platter and spindle location,
> that is mapped to the filesystem path).  We are overloading the
> filesystem query mechanism when it is not necessary.
> 
> I propose the following:  We create a *filter* and a small database (the
> latter we already have in the publisher).  We send a *query* to the web
> server; the web server *filter* intercepts that *query*, resolves it
> against the database to the actual resource location, and returns the
> resource you want.  Implementing this in a filter divorces the query
> structure from the filesystem structure, and the database (generated by
> the publisher when it scans) provides the resolution.  With this
> mechanism in place, wget, as well as any other URL-based tool, will be
> able to fetch the data as intended.
> 
> BTW: The "query" is whatever we define it to be... (not a reference to
> an SQL query).
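> 
> As a rough, purely illustrative sketch (not ESG code; the database,
> table, and column names are all made up), the filter is little more
> than a lookup against the table the publisher writes when it scans.
> In Python/WSGI terms:
> 
>     import sqlite3
>     from wsgiref.simple_server import make_server
>     from wsgiref.util import FileWrapper
> 
>     DB = "publisher.db"  # hypothetical database produced by the publisher scan
> 
>     def resolver(environ, start_response):
>         # The "query" is whatever we define it to be; here, the request path.
>         query = environ.get("PATH_INFO", "")
>         conn = sqlite3.connect(DB)
>         row = conn.execute("SELECT location FROM resources WHERE query = ?",
>                            (query,)).fetchone()
>         conn.close()
>         if row is None:
>             start_response("404 Not Found", [("Content-Type", "text/plain")])
>             return [b"unknown resource\n"]
>         # Serve the file from wherever it actually lives on disk.
>         start_response("200 OK", [("Content-Type", "application/octet-stream")])
>         return FileWrapper(open(row[0], "rb"))
> 
>     if __name__ == "__main__":
>         make_server("", 8080, resolver).serve_forever()
> 
> The point is just that the query-to-location mapping lives in the
> database, not in the directory layout.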
> 
> This gives the data-node admin the ability to put their files wherever
> they want.  If they move files around and so on, they just have to
> rescan with the publisher.  The issues around design and efficiency can
> be addressed with varying degrees of cleverness.
> 
> I welcome any thoughts on this issue... Please talk me down :-). I think
> it is about time we put this DRS issue to bed.
> 
> 
> 
> Estanislao Gonzalez wrote:
>> Hi Bob,
>> 
>> I guess you must be on vacation now. Anyway, here's the question; maybe
>> someone else can answer it:
>> 
>> The very first idea I had was almost what you proposed. Your proposal,
>> though, leaves URLs of the form:
>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>                                    <--- (almost) DRS structure --->
>> 
>> which is not a valid DRS structure (neither CMIP5_replicas nor
>> CMIP5_core is in the DRS vocabulary).
>> 
>> My proposal has a very similar flaw:
>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>                                               <--- full DRS structure --->
>> The DRS structure is preserved, but you cannot easily infer the correct
>> URL from any dataset. I think the idea is: if you know the prefix
>> (http.../fileserver/) and the dataset's DRS name, you can always get the
>> file without even browsing the TDS:
>> prefix + DRS = URL to file
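>> 
>> For example (entirely hypothetical dataset names), with
>> prefix = http://myserver/thredds/fileserver/ and
>> DRS = CMIP5/output/MPI-M/MPI-ESM/historical/mon/atmos/tas/r1i1p1/v1,
>> the concatenation
>> http://myserver/thredds/fileserver/CMIP5/output/MPI-M/MPI-ESM/historical/mon/atmos/tas/r1i1p1/v1
>> would point straight at the dataset's files.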
>> 
>> AFAIK the URL structure used by the TDS will never be 100% DRS-conformant
>> (according to DRS version 0.27), which has the form:
>> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable identifier>/<ensemble member>/<version>/[<endpoint>]
>> 
>> whereas the TDS one has the endpoint moved to the front (the
>> thredds/fileserver, thredds/dodsC, etc. parts).
>> 
>> To sum things up:
>> Is it possible to publish files from different directory structures into
>> a unified URL structure so that it is completely transparent to the user?
>> Am I the only one addressing this problem? Are all other institutions
>> planning to publish all files from only one directory?
>> 
>> The only viable solution I can think of is to rely on Stephen's
>> versioning concept and maintain a single true DRS structure with
>> links to files kept in other, more manageable directory structures
>> (this will probably involve adapting Stephen's tool).
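>> 
>> Roughly, and with entirely hypothetical paths, that single true DRS
>> tree would just be a farm of links into the real storage areas, e.g.:
>> 
>>     import os
>> 
>>     # hypothetical locations: the real file lives in the cache area,
>>     # while the /drs tree is the only one following the DRS layout
>>     real = "/cache/CMIP5/output/MPI-M/MPI-ESM/historical/mon/atmos/tas/r1i1p1/v1/tas.nc"
>>     drs = "/drs/CMIP5/output/MPI-M/MPI-ESM/historical/mon/atmos/tas/r1i1p1/v1/tas.nc"
>> 
>>     os.makedirs(os.path.dirname(drs), exist_ok=True)
>>     os.symlink(real, drs)  # only the /drs link farm gets published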
>> 
>> Thanks,
>> Estani
>> 
>> 
>> Bob Drach wrote:
>>> Hi Estani,
>>> 
>>> It should be possible to do what you want without running multiple
>>> data nodes.
>>> 
>>> The purpose of the THREDDS dataset roots is to hide the directory
>>> structure from the end user, and to limit what the TDS can access. But
>>> THREDDS can certainly have multiple dataset roots.
>>> 
>>> In your example below, you should associate different paths with the
>>> locations, for example:
>>> 
>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>> 
>>> Also be aware that in the publisher configuration:
>>> 
>>> - the directory_format can have multiple values, separated by vertical
>>> bars (|). The publisher will use the first format that matches the
>>> directory structure being scanned.
>>> 
>>> - a useful strategy is to create different project sections for
>>> various groups of directives. You could define a cmip5_replica
>>> project, a cmip5_core project, etc. (a sketch of both ideas follows).
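>>> 
>>> Purely as a sketch (the section and field names here are illustrative,
>>> not an exact recipe), the two ideas together might look like:
>>> 
>>>     [project:cmip5_replica]
>>>     directory_format = /replicated/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(frequency)s/%(realm)s/%(variable)s/%(ensemble)s | /cache/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(frequency)s/%(realm)s/%(variable)s/%(ensemble)s
>>> 
>>>     [project:cmip5_core]
>>>     directory_format = /core/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(frequency)s/%(realm)s/%(variable)s/%(ensemble)s
>>> 
>>> where the publisher tries each |-separated format in turn against the
>>> directory being scanned.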
>>> 
>>> Bob
>>> 
>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>> 
>>>> Hi Bryan,
>>>> 
>>>> thanks for your answer!
>>>> Running multiple ESG data nodes is always a possibility, but it seems
>>>> like overkill to us, as we may have several different "data repositories".
>>>> We would like to separate core-replicated, core-non-replicated,
>>>> non-core, non-core-on-HPSS, as well as other non-CMIP5 data. Having 5+
>>>> ESG data nodes is not viable in our scenario.
>>>> 
>>>> The TDS allows separating the access URL from the underlying file
>>>> structure, so it might be possible. AFAIK the publisher does not
>>>> provide a simple way of doing this.
>>>> 
>>>> Setting thredds_dataset_roots to different values while publishing
>>>> doesn't appear to work, as those are mapped to map entries at the
>>>> catalog root:
>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>>> ..
>>>> 
>>>> which is clearly not bijective and therefore can't be reversed to
>>>> locate the file from a given URL.
>>>> 
>>>> At publishing time, all referenced data will be held at a known
>>>> location. Is it possible to somehow use this information to set up a
>>>> proper catalog configuration so that the URL can be properly mapped,
>>>> at least at the dataset level?
>>>> 
>>>> The whole HPSS staging procedure should be completely transparent to
>>>> the user, as should the location of the files. I was just looking at
>>>> other options in case we cannot publish them the way we want...
>>>> 
>>>> Cheers,
>>>> Estani
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Bryan Lawrence wrote:
>>>>> sorry.
>>>>> 
>>>>> the first sentence should have read
>>>>> 
>>>>> Just to note that *our* approach to the local versus replication issue
>>>>> will be ...
>>>>> 
>>>>> Cheers
>>>>> Bryan
>>>>> 
>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>> 
>>>>>> Hi Estani
>>>>>> 
>>>>>> Just to note that your approach to the local versus replication will
>>>>>> be to run two different ESG nodes ... which is in fact the desired
>>>>>> outcome, so as to get the right things in the catalogues at the right
>>>>>> time (vis-à-vis QC etc.).
>>>>>> 
>>>>>> The issue with respect to the cache I'm not so sure about: in what
>>>>>> way do you want to expose that in ESG?
>>>>>> 
>>>>>> Bryan
>>>>>> 
>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>> 
>>>>>>> Hi Stephen,
>>>>>>> 
>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>> 
>>>>>>> I'm also interested in some variables of the DEFAULT section of
>>>>>>> the esg.ini configuration file, more specifically
>>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services, or
>>>>>>> any other that was changed or that you think might be important).
>>>>>>> 
>>>>>>> The main question here is: how can different local directory
>>>>>>> structures be published to the same DRS structure?
>>>>>>> The example scenario in our case would be:
>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>> 
>>>>>>> The only solution I can think of is to extend the URL before the
>>>>>>> DRS structure starts (the URL won't be 100% DRS-conformant anyway). So
>>>>>>>    http://server/thredds/fileserver/<DRS structure>
>>>>>>> will turn into
>>>>>>>    http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>    http://server/thredds/fileserver/local/<DRS structure>
>>>>>>>    http://server/thredds/fileserver/cache/<DRS structure>
>>>>>>> 
>>>>>>> Is that viable? Are there any other options?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Estani
>>>>>>> 
>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>> 
>>>>>>>> To illustrate how the ESG datanode can be configured to serve
>>>>>>>> data for CMIP5 we have deployed a datanode containing a subset of
>>>>>>>> CMIP3 in the Data Reference Syntax. Some key features of this
>>>>>>>> deployment are:
>>>>>>>> 
>>>>>>>>    * The underlying directory structure is based on the Data
>>>>>>>>      Reference Syntax.
>>>>>>>>    * Datasets are published at the realm level.
>>>>>>>>    * The token-based security filter is replaced by the
>>>>>>>>      OpenidRelyingParty security filter.
>>>>>>>> 
>>>>>>>> Further notes can be found at
>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>> 
>>>>>>>> This test deployment should be of interest to anyone wanting to
>>>>>>>> know how DRS identifiers could be exposed in THREDDS catalogues
>>>>>>>> and the TDS HTML interface.  You can also try downloading files
>>>>>>>> with OpenID authentication or via wget with SSL-client
>>>>>>>> certificate authentication.  See the link above for details.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Stephen.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>>> British Atmospheric Data Centre
>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> Estanislao Gonzalez
>>>> 
>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>> 
>>>> Phone:   +49 (40) 46 00 94-126
>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>> 
>>>> 
>> 
>> 


