[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS
Doutriaux, Charles
doutriaux1 at llnl.gov
Tue Jul 6 08:36:21 MDT 2010
I second Gavin on this.
As far as my two cents go (not very far), I think this whole DRS thing is
more of a distraction than anything else. The many hours we all spent on
it would probably have been better spent developing the "filter" proposed
by Gavin.
C.
On 7/5/10 11:10 AM, "Gavin M Bell" <gavin at llnl.gov> wrote:
> Hello gentle-people,
>
> Here is my two cents on this whole DRS business. I think that the
> fundamental issue in all of this is the ability to do resource
> resolution (lookup). The issue of having URLs match a DRS structure
> that matches the filesystem is a red herring (IMHO). The basic issue is
> being able to issue a query to the system such that you find what you
> are looking for. This query mechanism should be a separate mechanism
> from filesystem correspondence. The driving issue behind the filesystem
> correspondence push is that people and/or applications can infer the
> location of resources in some regimented way. The true heart of the
> issue is not the file system; it is performing a query that provides
> resource resolution. The file system is a familiar mechanism, but it
> isn't the only one. The file system takes a query (the file system
> path) and returns the resource to us (the bits sitting at an inode
> location somewhere that is memory-mapped to some physical platter and
> spindle location, which is mapped to the file system path). We are
> overloading the file system's query mechanism when it is not necessary.
>
> I propose the following: we create a *filter* and a small database (the
> latter we already have in the publisher). We send a *query* to the web
> server; the web server *filter* intercepts that *query*, resolves it to
> the actual resource location using the database, and returns the
> resource you want. Implementing this in a filter divorces the query
> structure from the file system structure. The database (which is
> generated by the publisher when it scans) provides the resolution.
> With this mechanism in place, wget, as well as any other URL-based
> tool, will be able to fetch the data as intended.
>
> BTW: The "query" is whatever we make it up to be... (not a reference to
> SQL query).
>
> This gives the data-node admin the ability to put their files wherever
> they want. If they move files around and so on, they just have to
> rescan with the publisher. The issues around design and efficiency can
> be addressed with varying degrees of cleverness.
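Gavin's filter idea can be sketched as a small piece of web-server middleware. This is only an illustration under invented assumptions: the `resources` table, its schema, and the class name are hypothetical stand-ins for whatever the publisher would actually write out.

```python
import sqlite3

class DRSResolverFilter:
    """WSGI middleware sketch: intercept a DRS-style request path, look it
    up in a small database populated by the publisher, and serve the file
    from wherever it actually lives on disk. Table/column names are
    hypothetical, not the real publisher schema."""

    def __init__(self, app, db_path):
        self.app = app                      # downstream application (fallback)
        self.db = sqlite3.connect(db_path)  # resolution database

    def resolve(self, query):
        # The "query" is just the request path; it need not mirror the filesystem.
        row = self.db.execute(
            "SELECT location FROM resources WHERE drs_id = ?", (query,)
        ).fetchone()
        return row[0] if row else None

    def __call__(self, environ, start_response):
        location = self.resolve(environ.get("PATH_INFO", "").strip("/"))
        if location is None:
            # Unknown query: fall through to the wrapped application.
            return self.app(environ, start_response)
        start_response("200 OK", [("Content-Type", "application/octet-stream")])
        with open(location, "rb") as f:
            return [f.read()]
```

Because the mapping lives in the database, an admin can move files freely and simply rescan with the publisher, exactly as the paragraph above suggests.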
>
> I welcome any thoughts on this issue... Please talk me down :-). I think
> it is about time we put this DRS issue to bed.
>
>
>
> Estanislao Gonzalez wrote:
>> Hi Bob,
>>
>> I guess you must be on vacation now. Anyway, here's the question; maybe
>> someone else can answer it:
>>
>> The very first idea I had was almost what you proposed. Your proposal,
>> though, leaves URLs of the form:
>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>                                    <--- (almost) DRS structure --->
>>
>> which is not a valid DRS structure (neither CMIP5_replicas nor
>> CMIP5_core is in the DRS vocabulary).
>>
>> My proposal has a very similar flaw:
>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>                                               <--- full DRS structure --->
>>
>> The DRS structure is preserved, but you cannot easily infer the correct
>> URL from any dataset. I think the idea is: if you know the prefix
>> (http.../fileserver/) and the dataset's DRS name, you can always get the
>> file without even browsing the TDS:
>> prefix + DRS = URL to file
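The "prefix + DRS = URL to file" rule Estani describes is just string concatenation. A minimal sketch, with an entirely made-up server, dataset identifier, and filename:

```python
# Hypothetical illustration of "prefix + DRS = URL to file".
# The prefix, DRS path, and filename below are invented examples.
def drs_url(prefix: str, drs_path: str, filename: str) -> str:
    """Join a TDS fileserver prefix, a DRS dataset path, and a filename."""
    parts = [prefix.rstrip("/"), drs_path.strip("/"), filename]
    return "/".join(parts)

url = drs_url("http://myserver/thredds/fileserver",
              "CMIP5/output/MPI-M/ECHAM6/historical",
              "tas_20100101.nc")
```

The point of the thread is precisely that this only works if a single, known prefix maps onto every dataset, regardless of where its files sit on disk.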
>>
>> AFAIK the URL structure used by the TDS will never be 100% DRS-conformant
>> (according to DRS version 0.27), which has the form:
>> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable identifier>/<ensemble member>/<version>/[<endpoint>]
>>
>> whereas the TDS one has the endpoint moved to the front (the
>> thredds/fileserver, thredds/dodsC, etc. parts).
>>
>> To sum things up:
>> Is it possible to publish files from different directory structures under
>> a unified URL structure so that it is completely transparent to the user?
>> Am I the only one facing this problem? Are all other institutions
>> planning to publish all files from only one directory?
>>
>> The only viable solution I can think of is to rely on Stephen's
>> versioning concept and maintain a single true DRS structure with
>> links to files kept in other, more manageable directory structures
>> (this will probably involve adapting Stephen's tool).
>>
>> Thanks,
>> Estani
>>
>>
>> Bob Drach wrote:
>>> Hi Estani,
>>>
>>> It should be possible to do what you want without running multiple
>>> data nodes.
>>>
>>> The purpose of the THREDDS dataset roots is to hide the directory
>>> structure from the end user, and to limit what the TDS can access. But
>>> THREDDS can certainly have multiple dataset roots.
>>>
>>> In your example below, you should associate different paths with the
>>> locations, for example:
>>>
>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>>
>>> Also be aware that in the publisher configuration:
>>>
>>> - the directory_format can have multiple values, separated by vertical
>>> bars (|). The publisher will use the first format that matches the
>>> directory structure being scanned.
>>>
>>> - a useful strategy is to create different project sections for
>>> various groups of directives. You could define a cmip5_replica
>>> project, a cmip5_core project, etc.
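Bob's two publisher hints can be combined into per-project sections. The fragment below is only a sketch: the section names are the hypothetical ones Bob suggests, and the `%(...)s` field names are illustrative, not the exact esg.ini vocabulary.

```ini
; Illustrative esg.ini sketch -- field names are examples, not the real set.
; directory_format may list several patterns separated by vertical bars;
; the publisher uses the first pattern that matches the scanned directory.
[project:cmip5_replica]
directory_format = /replicated/CMIP5/%(product)s/%(institute)s/%(model)s | /cache/CMIP5/%(product)s/%(institute)s/%(model)s

[project:cmip5_core]
directory_format = /core/CMIP5/%(product)s/%(institute)s/%(model)s
```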
>>>
>>> Bob
>>>
>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>
>>>> Hi Bryan,
>>>>
>>>> thanks for your answer!
>>>> Running multiple ESG data nodes is always a possibility, but it seems like
>>>> overkill to us, as we may have several different "data repositories".
>>>> We would like to separate core-replicated, core-non-replicated,
>>>> non-core, non-core-on-HPSS, as well as other non-CMIP5 data. Having 5+
>>>> ESG data nodes is not viable in our scenario.
>>>>
>>>> The TDS allows separating the access URL from the underlying file
>>>> structure, so it should be possible. AFAIK the publisher does not
>>>> provide a simple way of doing this.
>>>>
>>>> Setting thredds_dataset_roots to different values while publishing
>>>> doesn't appear to work, as those are mapped to map entries at the
>>>> catalog root:
>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>>> ..
>>>>
>>>> which is clearly not bijective and therefore can't be reversed to
>>>> locate the file from a given URL.
>>>>
>>>> While publishing, all the referenced data will be held in a known
>>>> location. Is it possible to somehow use this information to set up a
>>>> proper catalog configuration so that the URL can be properly mapped,
>>>> at least at the dataset level?
>>>>
>>>> The whole HPSS staging procedure should be completely transparent to
>>>> the user, as well as the location of the files. I was just looking at
>>>> other options in case we cannot publish them the way we want...
>>>>
>>>> Cheers,
>>>> Estani
>>>>
>>>>
>>>>
>>>>
>>>> Bryan Lawrence wrote:
>>>>> sorry.
>>>>>
>>>>> the first sentence should have read
>>>>>
>>>>> Just to note that *our* approach to the local versus replication issue
>>>>> will be ...
>>>>>
>>>>> Cheers
>>>>> Bryan
>>>>>
>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>
>>>>>> Hi Estani
>>>>>>
>>>>>> Just to note that your approach to the local versus replication issue
>>>>>> will be to run two different ESG nodes ... which is in fact the desired
>>>>>> outcome, so as to get the right things in the catalogues at the right
>>>>>> time (vis-à-vis QC etc.).
>>>>>>
>>>>>> The issue with respect to cache, I'm not so sure about, in what way
>>>>>> do you want to expose that into ESG?
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>
>>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>
>>>>>>> I'm also interested in some variables of the DEFAULT section from
>>>>>>> the esg.ini configuration file. More specifically:
>>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services or
>>>>>>> any other which was changed or you think it might be important)
>>>>>>>
>>>>>>> The main question here is: how can different local directory
>>>>>>> structures be published to the same DRS structure?
>>>>>>> The example scenario in our case would be:
>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>>
>>>>>>> The only solution I can think of is to extend the URL before the
>>>>>>> DRS structure starts (the URL won't be 100% DRS-conformant anyway), so
>>>>>>> http://server/thredds/fileserver/<DRS structure>
>>>>>>> will turn into
>>>>>>> http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>> http://server/thredds/fileserver/local/<DRS structure>
>>>>>>> http://server/thredds/fileserver/cache/<DRS structure>
>>>>>>>
>>>>>>> Is that viable? Are there any other options?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Estani
>>>>>>>
>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>
>>>>>>>> To illustrate how the ESG datanode can be configured to serve
>>>>>>>> data for CMIP5 we have deployed a datanode containing a subset of
>>>>>>>> CMIP3 in the Data Reference Syntax. Some key features of this
>>>>>>>> deployment are:
>>>>>>>>
>>>>>>>> * The underlying directory structure is based on the Data
>>>>>>>> Reference Syntax.
>>>>>>>> * Datasets are published at the realm level.
>>>>>>>> * The token-based security filter is replaced by the
>>>>>>>> OpenidRelyingParty security filter.
>>>>>>>>
>>>>>>>> Further notes can be found at
>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>
>>>>>>>> This test deployment should be of interest to anyone wanting to
>>>>>>>> know how DRS identifiers could be exposed in THREDDS catalogues
>>>>>>>> and the TDS HTML interface. You can also try downloading files
>>>>>>>> with OpenID authentication or via wget with SSL-client
>>>>>>>> certificate authentication. See the link above for details.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Stephen.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>> Stephen Pascoe +44 (0)1235 445980
>>>>>>>> British Atmospheric Data Centre
>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Estanislao Gonzalez
>>>>
>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>
>>>> Phone: +49 (40) 46 00 94-126
>>>> E-Mail: estanislao.gonzalez at zmaw.de
>>>>
>>>>
>>
>>