[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

Mon Jul 5 12:45:27 MDT 2010

Martin and friends,

This is false economy.  Two things.  First, implementing this is not
hard.  Second, implementing this will resolve the issues w.r.t. the
incongruence between DRS and the filesystem that Estanislao's email
illuminated.  So it seems to me that the alternative is to keep fitting
this square DRS peg into the round file system hole.  That would mean
having to do a whole other set of gymnastics to get the DRS <-> file
system beast tamed.  There is work to be done either way because things
are not ready to go as they stand.  I suggest we fix the problem at the
root, now, not "later".  Essentially the current course requires the
data providers to jump through file system layout hoops.  I am of the
opinion that we should "require" as little as possible from our users,
especially something like this... it hurts adoption IMHO.

Actually, let me frame this differently.  How about we fork efforts and
have some folks think about what the *query* URL should be for the
functionality I suggested, while others continue on the current path.
When the former development is ripe, I will update the install script and
have it installed automagically upon the clients' next install, with no
slowdown for anyone.  The null transform would be equivalent to what we
have now, so we would be backward compatible for folks who have done the
task of making their file systems congruent to DRS.  Fair enough?

Sound good?
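
To make the "null transform" idea concrete, here is a minimal sketch in
Python of a lookup that falls back to the current behaviour.  Everything
named here is an assumption for illustration only: the function name
resolve, the in-memory dict standing in for the database the publisher
generates when it scans, the example file names, and the /data default
root.  The real filter and the shape of the *query* URL are still to be
designed.

```python
from typing import Mapping


def resolve(query_path: str, index: Mapping[str, str], default_root: str) -> str:
    """Resolve a DRS-style query path to a file location.

    `index` is a hypothetical stand-in for the publisher's database,
    mapping a DRS path to the absolute path where the file really lives.
    """
    key = query_path.strip("/")
    location = index.get(key)
    if location is not None:
        # The publisher recorded where this file actually is.
        return location
    # Null transform: nothing recorded, so assume the file system
    # already mirrors the DRS layout under default_root.
    return default_root.rstrip("/") + "/" + key


if __name__ == "__main__":
    # Hypothetical entry: one file kept outside the DRS-shaped tree.
    index = {"CMIP5/output/example.nc": "/replicated/CMIP5/output/example.nc"}
    print(resolve("/CMIP5/output/example.nc", index, "/data"))
    # prints /replicated/CMIP5/output/example.nc
    print(resolve("/CMIP5/output/other.nc", index, "/data"))
    # prints /data/CMIP5/output/other.nc
```

With an empty index every query takes the null-transform branch, which
is the backward-compatible case for sites whose file systems are already
congruent to DRS.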

martin.juckes at stfc.ac.uk wrote:
> Hello Gavin, Bob,
> 
> I agree that this is a good idea in principle, but I think it is a bad idea now. The thing about "now" is that we want to deploy and test the system we have agreed on. We want to do it now because modelling centres have supercomputers running and churning out vast volumes of data, there are thousands of scientists waiting to get at it, and we have the job of installing a system to distribute it. It is, I think, a bad time to start implementing changes in the system design. Sorry if this sounds a bit harsh, but impending deadlines make me nervous,
> 
> cheers,
> Martin
> 
> 
> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
> Sent: Mon 05/07/2010 19:18
> To: Gavin M Bell
> Cc: go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; Charles Doutriaux
> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS
>  
> Hi Gavin,
> 
> I agree completely. Having a regularized DRS syntax is a very good  
> idea, but to implement it we will need to introduce a level of  
> indirection between the DRS URL (your 'query') and the underlying  
> filesystem. Separating these two concerns will have a very important  
> benefit: it will allow the data node managers to organize their  
> filesystems as they see fit.
> 
> Bob
> 
> On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:
> 
>> Hello gentle-people,
>>
>> Here is my two cents on this whole DRS business.  I think that the
>> fundamental issue to all of this is the ability to do resource
>> resolution (lookup).  The issue of having URLs match a DRS structure
>> that matches the filesystem is a red herring (IMHO).  The basic issue
>> is to be able to issue a query to the system such that you find what
>> you are looking for.  This query mechanism should be a separate
>> mechanism from filesystem correspondence.  The driving issue behind the
>> file system correspondence push is so that people and/or applications
>> can infer the location of resources in some regimented way.  The true
>> heart of the issue is not with the file system.  The heart of the issue
>> is to perform a query such that you provide resource resolution.  The
>> file system is a familiar mechanism but it isn't the only one.  The
>> file system takes a query (the file system path) and returns the
>> resource to us (the bits sitting at an inode location somewhere that is
>> memory mapped to some physical platter and spindle location, that is
>> mapped to the file system path).  We are overloading the file system
>> query mechanism when it is not necessary.
>>
>> I propose the following:  We create a *filter* and a small database
>> (the latter we already have in the publisher).  We send a *query* to
>> the web server; the web server *filter* intercepts that *query*,
>> resolves it to the actual resource location using the database, and
>> returns the resource you want.  Implementing this in a filter divorces
>> the query structure from the file system structure.  The use of the
>> database (that is generated by the publisher when it scans) provides
>> the resolution.  With this mechanism in place, WGET, as well as any
>> other URL-based tool, will be able to fetch the data as intended.
>>
>> BTW: The "query" is whatever we make it up to be... (not a reference
>> to an SQL query).
>>
>> This gives the data-node admin the ability to put their files wherever
>> they want.  If they move files around and so on, they just have to
>> rescan with the publisher.  The issues around design and efficiency
>> can be addressed with varying degrees of cleverness.
>>
>> I welcome any thoughts on this issue... Please talk me down :-). I
>> think it is about time we put this DRS issue to bed.
>>
>>
>>
>> Estanislao Gonzalez wrote:
>>> Hi Bob,
>>>
>>> I guess you must be on vacation now. Anyway, here's the question;
>>> maybe someone else can answer it:
>>>
>>> The very first idea I had was almost what you proposed. Your proposal
>>> though leaves URLs of the form:
>>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>>                                    <--- (almost) DRS Structure --->
>>>
>>> Which has no valid DRS structure (neither CMIP5_replicas nor
>>> CMIP5_core is in the DRS vocabulary).
>>>
>>> My proposal has a very similar flaw:
>>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>>                                               <--- full DRS Structure --->
>>> The DRS structure is preserved, but you cannot easily infer the
>>> correct URL from any dataset. I think the idea is: if you know the
>>> prefix (http.../fileserver/) and the dataset DRS name you can always
>>> get the file without even browsing the TDS:
>>> prefix + DRS = URL to file
>>>
>>> AFAIK the URL structure used by the TDS will never be 100% DRS
>>> conformant (according to DRS version 0.27).
>>> That one has the form:
>>> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable identifier>/<ensemble member>/<version>/[<endpoint>],
>>>
>>> whereas the TDS one has the endpoint moved to the front (the
>>> thredds/fileserver, thredds/dodsC, etc. parts).
>>>
>>> To sum things up:
>>> Is it possible to publish files from different directory structures
>>> into a unified URL structure so that it is completely transparent to
>>> the user?
>>> Am I the only one addressing this problem? Are all other institutions
>>> planning to publish all files from only one directory?
>>>
>>> The only viable solution I can think of is to rely on Stephen's
>>> versioning concept and maintain a single true DRS structure with
>>> links to files kept in other, more manageable directory structures
>>> (this will probably involve adapting Stephen's tool).
>>>
>>> Thanks,
>>> Estani
>>>
>>>
>>> Bob Drach wrote:
>>>> Hi Estani,
>>>>
>>>> It should be possible to do what you want without running multiple
>>>> data nodes.
>>>>
>>>> The purpose of the THREDDS dataset roots is to hide the directory
>>>> structure from the end user, and to limit what the TDS can access.
>>>> But THREDDS can certainly have multiple dataset roots.
>>>>
>>>> In your example below, you should associate different paths with the
>>>> locations, for example:
>>>>
>>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>>>
>>>> Also be aware that in the publisher configuration:
>>>>
>>>> - the directory_format can have multiple values, separated by
>>>> vertical bars (|). The publisher will use the first format that
>>>> matches the directory structure being scanned.
>>>>
>>>> - a useful strategy is to create different project sections for
>>>> various groups of directives. You could define a cmip5_replica
>>>> project, a cmip5_core project, etc.
>>>>
>>>> Bob
>>>>
>>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>>
>>>>> Hi Bryan,
>>>>>
>>>>> thanks for your answer!
>>>>> Running multiple ESG data nodes is always a possibility, but it
>>>>> seems overkill to us as we may have several different "data
>>>>> repositories".
>>>>> We would like to separate: core-replicated, core-non-replicated,
>>>>> non-core, non-core-on-hpss, as well as other non-cmip5 data.
>>>>> Having 5+ ESG data nodes is not viable in our scenario.
>>>>>
>>>>> The TDS allows the separation of the access URL from the underlying
>>>>> file structure, so it might be possible. AFAIK the publisher does
>>>>> not provide a simple way of doing this.
>>>>>
>>>>> Setting thredds_dataset_roots to different values while publishing
>>>>> doesn't appear to work as those are mapped to a map-entry at the
>>>>> catalog root:
>>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>>>> ..
>>>>>
>>>>> which is clearly non bijective and can't therefore be reversed to
>>>>> locate the file from a given URL.
>>>>>
>>>>> While publishing, all referred data will be held in a known
>>>>> location.
>>>>> Is it possible to somehow use this information to set up a proper
>>>>> catalog configuration so that the URL can be properly mapped? At
>>>>> least on a dataset level?
>>>>>
>>>>> The whole HPSS staging procedure should be completely transparent
>>>>> to the user, as well as the location of the files. I was just
>>>>> looking at other options in case we cannot publish them the way we
>>>>> want...
>>>>>
>>>>> Cheers,
>>>>> Estani
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Bryan Lawrence wrote:
>>>>>> sorry.
>>>>>>
>>>>>> the first sentence should have read
>>>>>>
>>>>>> Just to note that *our* approach to the local versus replication
>>>>>> issue will be ...
>>>>>>
>>>>>> Cheers
>>>>>> Bryan
>>>>>>
>>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>>
>>>>>>> Hi Estani
>>>>>>>
>>>>>>> Just to note that your approach to the local versus replication
>>>>>>> will be to run two different ESG nodes ... which is in fact the
>>>>>>> desired outcome so as to get the right things in the catalogues at
>>>>>>> the right time (vis-a-vis qc etc).
>>>>>>>
>>>>>>> The issue with respect to cache I'm not so sure about; in what
>>>>>>> way do you want to expose that into ESG?
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>>
>>>>>>>> Hi Stephen,
>>>>>>>>
>>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>>
>>>>>>>> I'm also interested in some variables of the DEFAULT section
>>>>>>>> from the esg.ini configuration file. More specifically:
>>>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services or
>>>>>>>> any other which was changed or which you think might be important)
>>>>>>>>
>>>>>>>> The main question here is: how can different local directory
>>>>>>>> structures be published to the same DRS structure?
>>>>>>>> The example scenario in our case will be:
>>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>>> /cache/<DRS structure> - for data staged from a HPSS system
>>>>>>>>
>>>>>>>> The only solution I can think of is to extend the URL before the
>>>>>>>> DRS structure starts (the URL won't be 100% DRS conformant
>>>>>>>> anyway). So
>>>>>>>>   http://server/thredds/fileserver/<DRS structure>
>>>>>>>> will turn into
>>>>>>>>   http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>>   http://server/thredds/fileserver/local/<DRS structure>
>>>>>>>>   http://server/thredds/fileserver/cache/<DRS structure>
>>>>>>>>
>>>>>>>> Is that viable? Are there any other options?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Estani
>>>>>>>>
>>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>
>>>>>>>>> To illustrate how the ESG datanode can be configured to serve
>>>>>>>>> data for CMIP5 we have deployed a datanode containing a subset
>>>>>>>>> of CMIP3 in the Data Reference Syntax. Some key features of this
>>>>>>>>> deployment are:
>>>>>>>>>
>>>>>>>>>   * The underlying directory structure is based on the Data
>>>>>>>>>     Reference Syntax.
>>>>>>>>>   * Datasets are published at the realm level.
>>>>>>>>>   * The token-based security filter is replaced by the
>>>>>>>>>     OpenidRelyingParty security filter.
>>>>>>>>>
>>>>>>>>> Further notes can be found at
>>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>>
>>>>>>>>> This test deployment should be of interest to anyone wanting to
>>>>>>>>> know how DRS identifiers could be exposed in THREDDS catalogues
>>>>>>>>> and the TDS HTML interface.  You can also try downloading files
>>>>>>>>> with OpenID authentication or via wget with SSL-client
>>>>>>>>> certificate authentication.  See the link above for details.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Stephen.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>>>> British Atmospheric Data Centre
>>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----------------------------------------------------------------
>>>>>>>>> -- -----
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>
>>>>> -- 
>>>>> Estanislao Gonzalez
>>>>>
>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing  
>>>>> Centre
>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>
>>>>> Phone:   +49 (40) 46 00 94-126
>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>
>>>>>
>>>
>> -- 
>> Gavin M. Bell
>> Lawrence Livermore National Labs
>> --
>>
>> "Never mistake a clear view for a short distance."
>>       	       -Paul Saffo
>>
>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>
>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
> 
> 

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E