[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Tue Jul 6 10:47:14 MDT 2010


Hi Folks

And here are my two cents. I too am not following all the detail, but I 
am following enough to know that lots of different issues are being 
conflated.  Firstly, I'm totally with Balaji ... I know you can avoid 
the DRS, but the cost of doing so, in confusion, will be major.

The DRS was developed as a *requirement* by the community - by and large 
the conversation has been from folk whose task is to implement a 
requirement, not change it!  The DRS has been out for discussion for a 
very long time. Be very very careful before you undo it or even bend it.  
It is one of the very few fixed points in what has been a frustratingly 
rapidly changing landscape.

Despite the discussion so far, I can see no reason to move away from 
what has been agreed for a few years now. There are some issues to 
resolve (in particular, the use of the output/replicated trees, and the 
fact that some parts of some variables may not be replicated from the 
output to the replicated tree), but a lot has already been invested in 
the DRS and its usage.

Bryan

On Tuesday 06 Jul 2010 15:36:21 Doutriaux, Charles wrote:
> I second Gavin on this.
> 
> As far as my two cents go (not too far), I think this whole DRS thing
>  is more of a distraction than anything else. The many hours we all
>  spent on this would have probably been better spent developing the
>  "filter" proposed by Gavin.
> 
> C.
> 
> On 7/5/10 11:10 AM, "Gavin M Bell" <gavin at llnl.gov> wrote:
> > Hello gentle-people,
> >
> > Here are my two cents on this whole DRS business.  I think the
> > fundamental issue in all of this is the ability to do resource
> > resolution (lookup).  The issue of having URLs match a DRS
> > structure that matches the filesystem is a red herring (IMHO).  The
> > basic issue is to be able to issue a query to the system such that
> > you find what you are looking for.  This query mechanism should be
> > a separate mechanism from filesystem correspondence.  The driving
> > issue behind the file system correspondence push is that people
> > and/or applications can infer the location of resources in some
> > regimented way.  The true heart of the issue is not the file
> > system; it is to perform a query that provides resource
> > resolution.  The file system is a familiar mechanism, but it isn't
> > the only one.  The file system takes a query (the file system
> > path) and returns the resource to us (the bits sitting at an inode
> > location somewhere, memory mapped to some physical platter and
> > spindle location, that is mapped to the file system path).  We are
> > overloading the file system's query mechanism when it is not
> > necessary.
> >
> > I propose the following: we create a *filter* and a small database
> > (the latter we already have in the publisher).  We send a *query*
> > to the web server; the web server *filter* intercepts that *query*,
> > resolves it, using the database, to the actual resource location,
> > and returns the resource you want.  Implementing this in a filter
> > divorces the query structure from the file system structure.  The
> > database (generated by the publisher when it scans) provides the
> > resolution.  With this mechanism in place, WGET, as well as any
> > other URL-based tool, will be able to fetch the data as intended.
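> >
> > As a rough sketch of what I mean (class and helper names here are
> > hypothetical, and the lookup table is assumed to be whatever the
> > publisher writes out when it scans), the filter could be an
> > ordinary servlet filter sitting in front of the data server:
> >
> > // DrsResolverFilter.java -- illustrative sketch, not a tested
> > // implementation.  DrsCatalog is a hypothetical wrapper around the
> > // publisher's database, mapping DRS-style paths to disk locations.
> > import java.io.IOException;
> > import javax.servlet.*;
> > import javax.servlet.http.*;
> >
> > public class DrsResolverFilter implements Filter {
> >
> >     private DrsCatalog catalog;
> >
> >     public void init(FilterConfig config) throws ServletException {
> >         // Open the lookup database named in the filter's config.
> >         catalog = DrsCatalog.open(config.getInitParameter("catalogDb"));
> >     }
> >
> >     public void doFilter(ServletRequest req, ServletResponse res,
> >                          FilterChain chain)
> >             throws IOException, ServletException {
> >         HttpServletRequest request = (HttpServletRequest) req;
> >         // The "query" is just the request path, e.g.
> >         // /CMIP5/output/<institute>/<model>/.../<version>/<file>
> >         String drsQuery = request.getPathInfo();
> >         String location = (drsQuery == null) ? null
> >                                              : catalog.resolve(drsQuery);
> >         if (location == null) {
> >             // Not a DRS query we know about: pass it through untouched.
> >             chain.doFilter(req, res);
> >             return;
> >         }
> >         // Forward to the servlet that serves the resolved resource, so
> >         // the URL structure never has to match the filesystem layout.
> >         request.getRequestDispatcher(location).forward(req, res);
> >     }
> >
> >     public void destroy() {}
> > }
> >
> > Moving files on disk then just means rescanning with the publisher;
> > no URL a client holds ever changes.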
> >
> > BTW: the "query" is whatever we define it to be... (not a
> > reference to an SQL query).
> >
> > This gives the data-node admin the ability to put their files
> > wherever they want.  If they move files around and so on, they
> > just have to rescan with the publisher.  The issues around design
> > and efficiency can be addressed with varying degrees of cleverness.
> >
> > I welcome any thoughts on this issue... Please talk me down :-). I
> > think it is about time we put this DRS issue to bed.
> >
> > Estanislao Gonzalez wrote:
> >> Hi Bob,
> >>
> >> I guess you must be on vacation now. Anyway, here's the question;
> >> maybe someone else can answer it:
> >>
> >> The very first idea I had was almost what you proposed. Your
> >> proposal, though, leaves URLs of the form:
> >>
> >> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
> >>                                    <--- (almost) DRS structure --->
> >>
> >> which is not a valid DRS structure (neither CMIP5_replicas nor
> >> CMIP5_core is in the DRS vocabulary).
> >>
> >> My proposal has a very similar flaw:
> >>
> >> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
> >>                                               <--- full DRS structure --->
> >> The DRS structure is preserved, but you cannot easily infer the
> >> correct URL from any dataset. I think the idea is: if you know the
> >> prefix (http.../fileserver/) and the dataset DRS name, you can
> >> always get the file without even browsing the TDS:
> >> prefix + DRS = URL to file
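> >>
> >> For example (a hypothetical dataset), a prefix of
> >> http://myserver/thredds/fileserver/ plus a DRS name of
> >> CMIP5/output/MPI-M/ECHAM6/rcp45/mon/atmos/tas/r1i1p1/v1 would give
> >> http://myserver/thredds/fileserver/CMIP5/output/MPI-M/ECHAM6/rcp45/mon/atmos/tas/r1i1p1/v1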
> >>
> >> AFAIK the URL structure used by the TDS will never be 100%
> >> DRS-conformant (according to DRS version 0.27), which has the
> >> form:
> >>
> >> http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling realm>/<variable identifier>/<ensemble member>/<version>/[<endpoint>]
> >>
> >> whereas the TDS one has the endpoint moved to the front (the
> >> thredds/fileserver, thredds/dodsC, etc. parts).
> >>
> >> To sum things up:
> >> Is it possible to publish files from different directory
> >> structures into a unified URL structure so that it is completely
> >> transparent to the user? Am I the only one addressing this
> >> problem? Are all other institutions planning to publish all files
> >> from only one directory?
> >>
> >> The only viable solution I can think of is to rely on Stephen's
> >> versioning concept and maintain a single true DRS structure with
> >> links to files kept in other, more manageable directory
> >> structures (this will probably involve adapting Stephen's tool).
> >>
> >> Thanks,
> >> Estani
> >>
> >> Bob Drach wrote:
> >>> Hi Estani,
> >>>
> >>> It should be possible to do what you want without running
> >>> multiple data nodes.
> >>>
> >>> The purpose of the THREDDS dataset roots is to hide the directory
> >>> structure from the end user, and to limit what the TDS can
> >>> access. But THREDDS can certainly have multiple dataset roots.
> >>>
> >>> In your example below, you should associate different paths with
> >>> the locations, for example:
> >>>
> >>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
> >>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
> >>>
> >>> Also be aware that in the publisher configuration:
> >>>
> >>> - the directory_format can have multiple values, separated by
> >>> vertical bars (|). The publisher will use the first format that
> >>> matches the directory structure being scanned.
> >>>
> >>> - a useful strategy is to create different project sections for
> >>> various groups of directives. You could define a cmip5_replica
> >>> project, a cmip5_core project, etc., as sketched below.
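> >>>
> >>> For instance, a hypothetical pair of project sections (the DRS
> >>> field names below mirror the URL structure quoted in this thread,
> >>> and the exact pattern keys depend on your publisher version, so
> >>> treat this as a sketch; the two patterns could equally go into
> >>> one directory_format separated by |):
> >>>
> >>> [project:cmip5_replica]
> >>> directory_format = /replicated/%(activity)s/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(frequency)s/%(realm)s/%(variable)s/%(ensemble)s/%(version)s
> >>>
> >>> [project:cmip5_core]
> >>> directory_format = /core/%(activity)s/%(product)s/%(institute)s/%(model)s/%(experiment)s/%(frequency)s/%(realm)s/%(variable)s/%(ensemble)s/%(version)s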
> >>>
> >>> Bob
> >>>
> >>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
> >>>> Hi Bryan,
> >>>>
> >>>> thanks for your answer!
> >>>> Running multiple ESG data nodes is always a possibility, but it
> >>>> seems like overkill to us, as we may have several different
> >>>> "data repositories". We would like to separate: core-replicated,
> >>>> core-non-replicated, non-core, non-core-on-hpss, as well as
> >>>> other non-CMIP5 data. Having 5+ ESG data nodes is not viable in
> >>>> our scenario.
> >>>>
> >>>> The TDS allows the access URL to be separated from the
> >>>> underlying file structure, so it might be possible. AFAIK the
> >>>> publisher does not provide a simple way of doing this.
> >>>>
> >>>> Setting thredds_dataset_roots to different values while
> >>>> publishing doesn't appear to work, as those are mapped to map
> >>>> entries at the catalog root:
> >>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
> >>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
> >>>> ...
> >>>>
> >>>> which is clearly non-bijective (the same path maps to two
> >>>> locations) and therefore can't be reversed to locate the file
> >>>> from a given URL.
> >>>>
> >>>> While publishing, all referenced data will be held in a known
> >>>> location. Is it possible to somehow use this information to set
> >>>> up a proper catalog configuration so that the URL can be
> >>>> properly mapped, at least at the dataset level?
> >>>>
> >>>> The whole HPSS staging procedure should be completely
> >>>> transparent to the user, as well as the location of the files. I
> >>>> was just looking at other options in case we cannot publish them
> >>>> the way we want...
> >>>>
> >>>> Cheers,
> >>>> Estani
> >>>>
> >>>> Bryan Lawrence wrote:
> >>>>> sorry.
> >>>>>
> >>>>> the first sentence should have read
> >>>>>
> >>>>> Just to note that *our* approach to the local versus
> >>>>> replication issue will be ...
> >>>>>
> >>>>> Cheers
> >>>>> Bryan
> >>>>>
> >>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
> >>>>>> Hi Estani
> >>>>>>
> >>>>>> Just to note that your approach to the local versus
> >>>>>> replication issue will be to run two different ESG nodes ...
> >>>>>> which is in fact the desired outcome, so as to get the right
> >>>>>> things in the catalogues at the right time (vis-à-vis QC
> >>>>>> etc.).
> >>>>>>
> >>>>>> The issue with respect to the cache I'm not so sure about: in
> >>>>>> what way do you want to expose that in ESG?
> >>>>>>
> >>>>>> Bryan
> >>>>>>
> >>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
> >>>>>>> Hi Stephen,
> >>>>>>>
> >>>>>>> the page contains really helpful information, thanks a lot!
> >>>>>>>
> >>>>>>> I'm also interested in some variables of the DEFAULT section
> >>>>>>> of the esg.ini configuration file, more specifically
> >>>>>>> thredds_dataset_roots (and maybe thredds_aggregation_services,
> >>>>>>> or any other that was changed or that you think might be
> >>>>>>> important).
> >>>>>>>
> >>>>>>> The main question here is: how can different local directory
> >>>>>>> structures be published to the same DRS structure?
> >>>>>>> The example scenario in our case will be:
> >>>>>>> /replicated/<DRS structure> - for replicated data
> >>>>>>> /local/<DRS structure> - for non-replicated data held on disk
> >>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
> >>>>>>>
> >>>>>>> The only solution I can think of is to extend the URL before
> >>>>>>> the DRS structure starts (the URL won't be 100%
> >>>>>>> DRS-conformant anyway), so that
> >>>>>>>
> >>>>>>> http://server/thredds/fileserver/<DRS structure>
> >>>>>>>
> >>>>>>> will turn into:
> >>>>>>>
> >>>>>>> http://server/thredds/fileserver/replicated/<DRS structure>
> >>>>>>> http://server/thredds/fileserver/local/<DRS structure>
> >>>>>>> http://server/thredds/fileserver/cache/<DRS structure>
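> >>>>>>>
> >>>>>>> The corresponding TDS mapping could then be three dataset
> >>>>>>> roots, one per prefix (a sketch using the standard TDS
> >>>>>>> datasetRoot syntax):
> >>>>>>>
> >>>>>>> <datasetRoot path="replicated" location="/replicated"/>
> >>>>>>> <datasetRoot path="local" location="/local"/>
> >>>>>>> <datasetRoot path="cache" location="/cache"/>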
> >>>>>>>
> >>>>>>> Is that viable? Are there any other options?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Estani
> >>>>>>>
> >>>>>>> stephen.pascoe at stfc.ac.uk wrote:
> >>>>>>>> To illustrate how the ESG datanode can be configured to
> >>>>>>>> serve data for CMIP5 we have deployed a datanode containing
> >>>>>>>> a subset of CMIP3 in the Data Reference Syntax. Some key
> >>>>>>>> features of this deployment are:
> >>>>>>>>
> >>>>>>>>    * The underlying directory structure is based on the Data
> >>>>>>>>      Reference Syntax.
> >>>>>>>>    * Datasets are published at the realm level.
> >>>>>>>>    * The token-based security filter is replaced by the
> >>>>>>>>      OpenidRelyingParty security filter.
> >>>>>>>>
> >>>>>>>> Further notes can be found at
> >>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
> >>>>>>>>
> >>>>>>>> This test deployment should be of interest to anyone wanting
> >>>>>>>> to know how DRS identifiers could be exposed in THREDDS
> >>>>>>>> catalogues and the TDS HTML interface.  You can also try
> >>>>>>>> downloading files with OpenID authentication or via wget
> >>>>>>>> with SSL-client certificate authentication.  See the link
> >>>>>>>> above for details.
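> >>>>>>>>
> >>>>>>>> For the wget route, a hypothetical invocation (the host,
> >>>>>>>> path, and credential files below are placeholders; see the
> >>>>>>>> wiki page above for the real details) might look like:
> >>>>>>>>
> >>>>>>>>   wget --certificate=cert.pem --private-key=key.pem \
> >>>>>>>>       https://host/thredds/fileserver/<DRS structure>/file.nc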
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Stephen.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ---
> >>>>>>>> Stephen Pascoe  +44 (0)1235 445980
> >>>>>>>> British Atmospheric Data Centre
> >>>>>>>> Rutherford Appleton Laboratory
> >>>>>>>>
> 
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> 

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

