[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

Bob Drach drach1 at llnl.gov
Mon Jul 5 12:18:35 MDT 2010


Hi Gavin,

I agree completely. Having a regularized DRS syntax is a very good  
idea, but to implement it we will need to introduce a level of  
indirection between the DRS URL (your 'query') and the underlying  
filesystem. Separating these two concerns will have a very important  
benefit: it will allow the data node managers to organize their  
filesystems as they see fit.

Bob

On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:

> Hello gentle-people,
>
> Here is my two cents on this whole DRS business.  I think the
> fundamental issue in all of this is the ability to do resource
> resolution (lookup).  The issue of having URLs match a DRS structure
> that matches the filesystem is a red herring (IMHO).  The basic issue
> is to be able to issue a query to the system such that you find what
> you are looking for.  This query mechanism should be separate from
> filesystem correspondence.  The driving issue behind the filesystem
> correspondence push is that people and/or applications can infer the
> location of resources in some regimented way.  The true heart of the
> issue is not the filesystem; it is performing a query that provides
> resource resolution.  The filesystem is a familiar mechanism, but it
> isn't the only one.  The filesystem takes a query (the path) and
> returns the resource to us (the bits sitting at an inode somewhere,
> memory-mapped to some physical platter and spindle location, that is
> mapped to the path).  We are overloading the filesystem's query
> mechanism when it is not necessary.
>
> I propose the following: we create a *filter* and a small database
> (the latter we already have in the publisher).  We send a *query* to
> the web server; the web server *filter* intercepts that *query* and,
> using the database, resolves it to the actual resource location and
> returns the resource you want.  Implementing this in a filter divorces
> the query structure from the filesystem structure.  The database
> (generated by the publisher when it scans) provides the resolution.
> With this mechanism in place, wget, as well as any other URL-based
> tool, will be able to fetch the data as intended.
>
> BTW: the "query" is whatever we make it up to be (it is not a
> reference to an SQL query).
>
> This gives the data-node admin the ability to put their files wherever
> they want.  If they move files around, they just have to rescan with
> the publisher.  The issues around design and efficiency can be
> addressed with varying degrees of cleverness.
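As a rough illustration of the proposed filter, here is a minimal Python sketch. The dict and all names in it are hypothetical, standing in for the small database the publisher builds when it scans; a real deployment would be a servlet filter in front of the web server.

```python
# Minimal sketch of the proposed resolution filter (hypothetical names).
# The dict stands in for the database the publisher builds on scan; the
# query structure is deliberately divorced from the filesystem layout.
DRS_TO_LOCATION = {
    "CMIP5/output/MPI-M/ECHAM6/historical/mon/atmos/tas/r1i1p1/v1":
        "/any/layout/the/admin/likes/tas_r1i1p1.nc",
}

def resolve(query: str) -> str:
    """Resolve a DRS-style query to the actual resource location.

    Moving files only requires rescanning (rebuilding the mapping);
    the URLs seen by users and tools like wget stay stable.
    """
    location = DRS_TO_LOCATION.get(query.strip("/"))
    if location is None:
        raise FileNotFoundError(query)
    return location
```

The design point is simply that the lookup table, not the directory layout, carries the DRS semantics.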
>
> I welcome any thoughts on this issue... Please talk me down :-). I  
> think
> it is about time we put this DRS issue to bed.
>
>
>
> Estanislao Gonzalez wrote:
>> Hi Bob,
>>
>> I guess you must be on vacation now. Anyway, here's the question;
>> maybe someone else can answer it:
>>
>> The very first idea I had was almost what you proposed. Your proposal
>> though leaves URLs of the form:
>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>                                    <-- (almost) DRS structure -->
>>
>> which has no valid DRS structure (neither CMIP5_replicas nor
>> CMIP5_core is in the DRS vocabulary).
>>
>> My proposal has a very similar flaw:
>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>                                               <-- full DRS structure -->
>> The DRS structure is preserved, but you cannot easily infer the
>> correct URL from any dataset. I think the idea is: if you know the
>> prefix (http.../fileserver/) and the dataset's DRS name, you can
>> always get the file without even browsing the TDS:
>> prefix + DRS = URL to file
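The "prefix + DRS = URL" rule is plain string concatenation; a sketch with a hypothetical server and dataset name:

```python
# Hypothetical example of prefix + DRS name = URL (illustrative only).
prefix = "http://myserver/thredds/fileserver/"
drs = "CMIP5/output/MPI-M/ECHAM6/historical/mon/atmos/tas/r1i1p1/v1"
url = prefix + drs
```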
>>
>> AFAIK the URL structure used by the TDS will never be 100% DRS
>> conformant (according to DRS version 0.27), which has the form:
>> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable identifier>/<ensemble member>/<version>/[<endpoint>],
>>
>> whereas the TDS one has the endpoint moved to the front (the
>> thredds/fileserver, thredds/dodsC, etc. parts).
>>
>> To sum things up:
>> Is it possible to publish files from different directory structures
>> into a unified URL structure so that it is completely transparent to
>> the user?
>> Am I the only one addressing this problem? Are all other institutions
>> planning to publish all files from only one directory?
>>
>> The only viable solution I can think of is to rely on Stephen's
>> versioning concept: maintain a single true DRS structure with links
>> to files kept in other, more manageable directory structures (this
>> will probably involve adapting Stephen's tool).
>>
>> Thanks,
>> Estani
>>
>>
>> Bob Drach wrote:
>>> Hi Estani,
>>>
>>> It should be possible to do what you want without running multiple
>>> data nodes.
>>>
>>> The purpose of the THREDDS dataset roots is to hide the directory
>>> structure from the end user, and to limit what the TDS can access.  
>>> But
>>> THREDDS can certainly have multiple dataset roots.
>>>
>>> In your example below, you should associate different paths with the
>>> locations, for example:
>>>
>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>>
>>> Also be aware that in the publisher configuration:
>>>
>>> - the directory_format can have multiple values, separated by
>>> vertical bars (|). The publisher will use the first format that
>>> matches the directory structure being scanned.
>>>
>>> - a useful strategy is to create different project sections for
>>> various groups of directives. You could define a cmip5_replica
>>> project, a cmip5_core project, etc.
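A sketch of what such project sections might look like in esg.ini; the section names and format fields here are illustrative, not the exact publisher schema:

```ini
[project:cmip5_replica]
directory_format = /replicated/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s | /cache/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s

[project:cmip5_core]
directory_format = /core/CMIP5/%(product)s/%(institute)s/%(model)s/%(experiment)s
```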
>>>
>>> Bob
>>>
>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>
>>>> Hi Bryan,
>>>>
>>>> thanks for your answer!
>>>> Running multiple ESG data nodes is always a possibility, but it
>>>> seems overkill to us, as we may have several different "data
>>>> repositories". We would like to separate: core-replicated,
>>>> core-non-replicated, non-core, non-core-on-hpss, as well as other
>>>> non-CMIP5 data. Having 5+ ESG data nodes is not viable in our
>>>> scenario.
>>>>
>>>> The TDS allows separating the access URL from the underlying file
>>>> structure, so it should be possible. AFAIK the publisher does not
>>>> provide a simple way of doing this.
>>>>
>>>> Setting thredds_dataset_roots to different values while publishing
>>>> doesn't appear to work, as those are mapped to map entries at the
>>>> catalog root:
>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>>> ..
>>>>
>>>> which is clearly not bijective and therefore can't be inverted to
>>>> locate the file from a given URL.
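The non-invertibility is the same problem as duplicate keys in a map, sketched here in Python:

```python
# Two datasetRoot entries with the same path behave like duplicate map
# keys: the later entry silently wins, so the mapping can't be inverted.
roots = {}
roots["CMIP5"] = "/replicated/CMIP5"
roots["CMIP5"] = "/core/CMIP5"  # overwrites the first entry
# A URL under .../CMIP5/ now always resolves to /core/CMIP5; files that
# exist only under /replicated/CMIP5 become unreachable.
```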
>>>>
>>>> While publishing, all referenced data will be held at a known
>>>> location. Is it possible to somehow use this information to set up
>>>> a proper catalog configuration so that the URL can be properly
>>>> mapped, at least at the dataset level?
>>>>
>>>> The whole HPSS staging procedure should be completely transparent
>>>> to the user, as well as the location of the files. I was just
>>>> looking at other options in case we cannot publish them the way we
>>>> want...
>>>>
>>>> Cheers,
>>>> Estani
>>>>
>>>>
>>>>
>>>>
>>>> Bryan Lawrence wrote:
>>>>> sorry.
>>>>>
>>>>> the first sentence should have read
>>>>>
>>>>> Just to note that *our* approach to the local versus replication  
>>>>> issue
>>>>> will be ...
>>>>>
>>>>> Cheers
>>>>> Bryan
>>>>>
>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>
>>>>>> Hi Estani
>>>>>>
>>>>>> Just to note that your approach to the local versus replication
>>>>>> will be to run two different ESG nodes ... which is in fact the
>>>>>> desired outcome, so as to get the right things in the catalogues
>>>>>> at the right time (vis-à-vis QC etc.).
>>>>>>
>>>>>> The issue with respect to the cache I'm not so sure about; in
>>>>>> what way do you want to expose that into ESG?
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>
>>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>
>>>>>>> I'm also interested in some variables of the DEFAULT section of
>>>>>>> the esg.ini configuration file, more specifically
>>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services or
>>>>>>> any other which was changed or you think might be important).
>>>>>>>
>>>>>>> The main question here is: how can different local directory
>>>>>>> structures be published to the same DRS structure?
>>>>>>> The example scenario in our case would be:
>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>>
>>>>>>> The only solution I can think of is to extend the URL before
>>>>>>> the DRS structure starts (the URL won't be 100% DRS conformant
>>>>>>> anyway). So
>>>>>>>   http://server/thredds/fileserver/<DRS structure>
>>>>>>> will turn into
>>>>>>>   http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>   http://server/thredds/fileserver/local/<DRS structure>
>>>>>>>   http://server/thredds/fileserver/cache/<DRS structure>
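In TDS catalog terms, those three prefixes would correspond to three distinct dataset roots, along these lines (locations are illustrative):

```xml
<datasetRoot path="replicated" location="/replicated/CMIP5"/>
<datasetRoot path="local" location="/local/CMIP5"/>
<datasetRoot path="cache" location="/cache/CMIP5"/>
```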
>>>>>>>
>>>>>>> Is that viable? Are there any other options?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Estani
>>>>>>>
>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>
>>>>>>>> To illustrate how the ESG datanode can be configured to serve
>>>>>>>> data for CMIP5 we have deployed a datanode containing a  
>>>>>>>> subset of
>>>>>>>> CMIP3 in the Data Reference Syntax. Some key features of this
>>>>>>>> deployment are:
>>>>>>>>
>>>>>>>>   * The underlying directory structure is based on the Data
>>>>>>>>     Reference Syntax.
>>>>>>>>   * Datasets are published at the realm level.
>>>>>>>>   * The token-based security filter is replaced by the
>>>>>>>>     OpenidRelyingParty security filter.
>>>>>>>>
>>>>>>>> Further notes can be found at
>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>
>>>>>>>> This test deployment should be of interest to anyone wanting to
>>>>>>>> know how DRS identifiers could be exposed in THREDDS catalogues
>>>>>>>> and the TDS HTML interface.  You can also try downloading files
>>>>>>>> with OpenID authentication or via wget with SSL-client
>>>>>>>> certificate authentication.  See the link above for details.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Stephen.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>>> British Atmospheric Data Centre
>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>
>>>>>
>>>>
>>>> -- 
>>>> Estanislao Gonzalez
>>>>
>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing  
>>>> Centre
>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>
>>>> Phone:   +49 (40) 46 00 94-126
>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>
>>>>
>>
>>
>
> -- 
> Gavin M. Bell
> Lawrence Livermore National Labs
> --
>
> "Never mistake a clear view for a short distance."
>       	       -Paul Saffo
>
> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>
> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E


