[Go-essp-tech] Replication: requested and output DRS products.

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Wed Jul 7 03:48:53 MDT 2010


Hello Estani,

The reference to CMIP5_archive_size.xls was not very useful, apologies for referencing a file that isn't publicly available -- it is attached.

According to the DRS document, everything should be found under the "output" branch, and the "requested" branch will be a subset of the "output".

An end user may want a homogeneous dataset, and so may opt to restrict attention to the "requested" data where he is likely to find the same variables from a large range of models. He may, on the other hand, want all available data for a given set of experiments, in which case he should go to the "output" branch. He will then find additional (low priority) variables and extended time coverage from a small number of models.
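
As a purely illustrative sketch of the distinction (the path values below are
invented and simplified relative to the full DRS syntax):

    # Illustrative only: the same realm-level dataset as it might appear under
    # the two product branches (component values are invented; version and
    # filename are omitted).
    requested = "CMIP5/requested/MPI-M/SOME-MODEL/historical/mon/atmos/tas/r1i1p1"
    output    = "CMIP5/output/MPI-M/SOME-MODEL/historical/mon/atmos/tas/r1i1p1"
    # A user after a homogeneous multi-model set browses "requested"; a user
    # after everything a model provides (extra low-priority variables,
    # extended time coverage) browses "output".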

I'll see what can be done about a "DRS:requested" and "ESGF:replicated" document (or wiki page),

cheers,
Martin 




-----Original Message-----
From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
Sent: Wed 07/07/2010 08:54
To: Juckes, Martin (STFC,RAL,SSTD)
Cc: Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
Subject: Re: Replication: requested and output DRS products.
 
Hi Martin,

I couldn't find the file you mentioned (CMIP5_archive_size.xls), could
you please provide a link to it?

I'm aware now that output > requested > replicated. But the distinction
between the latter two is not clear to me. I totally agree that it would
be great if someone could sum that up.

And one question from the "monster" thread that still remains is:
It is clear that requested is a subset of output. Does this imply that
all data under .../requested/... should also be found under the
.../output/... DRS sub-structure?

I think not... but then again, why would the end user need to know about
this separation?

Thanks,
Estani


martin.juckes at stfc.ac.uk wrote:
> Hello again,
>
> The decision as to what is to be replicated is, I think, embedded in "CMIP5_archive_size.xls", and its implementation through the DRS is based on the separation between "requested" and "output" products. It would be useful to have a brief document outlining these decisions and some code to implement them. I'm not sure of the latest status on these two points; perhaps Stephen can add something.
>
> cheers,
> Martin
>
>
> -----Original Message-----
> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
> Sent: Tue 06/07/2010 17:15
> To: V. Balaji
> Cc: Juckes, Martin (STFC,RAL,SSTD); Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov; taylor13 at llnl.gov
> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS
>  
> Hi Balaji,
>
> To put things in context once more: (I think there's no such thing as
> over-clarification :-)
>
> DRS file and directory structure will be assured. The problem is if for
> some reason we have two different directories, e.g. A and B, and we want
> to publish data in DRS from both directories. So we have A/<DRS
> structure> and B/<DRS structure>.
> We'd like both of them to be mapped to a central URL, e.g.
> http://www.server.de/thredds/fileserver/<DRS structure> so that the user
> requires absolutely no knowledge about this separation.
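
As a rough illustration of the kind of mapping being asked for here, a minimal
sketch, assuming two hypothetical physical roots and a hypothetical resolve
step (this is not how the TDS or the publisher actually works):

    import os

    # Hypothetical physical roots ("A" and "B"), both holding DRS-structured data.
    PHYSICAL_ROOTS = ["/A/CMIP5", "/B/CMIP5"]

    def resolve(drs_subpath):
        """Map the DRS sub-path from the central URL to whichever physical
        root actually holds it; return None if neither does."""
        for root in PHYSICAL_ROOTS:
            candidate = os.path.join(root, drs_subpath)
            if os.path.exists(candidate):
                return candidate
        return None

    # e.g. a request to http://www.server.de/thredds/fileserver/<DRS structure>
    # would call resolve() on the <DRS structure> part behind the scenes.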
>
> The remaining question is: why on earth would someone want to have A and
> B?! :-)
> Well some reasons are:
> 1) simplified management. We don't have a mega-mix of millions of files
> from which some have to be replicated, some are held only at our
> institution, some are "temporarily" held as being cached from tape.
> Telling all these apart might not be an easy task.
> 2) Safety. In such a context a simple error might be disastrous (e.g.
> someone tries to remove the replicated files to re-deploy them without
> being aware that they share the directory with other files...)
> 3) Backup. If we (ok, somebody else, we will have everything on tape, I
> think...) want to back up a portion of the data, this won't be easily
> achieved (the replicated data is already redundant, but the other isn't)
> 4) Storage. We might get more disks, but we certainly won't be able
> to "merge" all of them into a single storage system (well, that's because
> they will arrive way after we start publishing things, so the first disks
> will already have some data). In any case, for political (e.g.
> institutional), technical (e.g. disk speed) or philosophical (e.g
> ...uh....) reasons it might be desirable to keep separate storage systems.
>
> And as I said we have to cope with that, somehow.
> The starter question was: can this be achieved with the publisher? And
> the answer was "no".
>
> And I totally agree with you regarding AR5. I must have a very good
> reason for not adhering to a default, even a de facto one. But the
> decision behind the storage for AR5 is a political one that, AFAIK,
> hasn't been taken yet.
>
> Well, I hope this helped to clarify things a bit.
>
> Thanks,
> Estani
>
> V. Balaji wrote:
>   
>> There are undoubtedly parts of this I'm not following too well, so I
>> apologize in advance for any misunderstandings. This is all from the
>> perspective of a modeling center.
>>
>> I do not understand the logic for _not_ wishing to lay data out in
>> DRS-compliant fashion on the public data server. I know you can do it,
>> but I don't understand why you'd want to. One thing I'd like to make
>> sure is captured as a requirement is that 'wget -r' should deliver
>> data laid out per DRS directory structure.
>>
>> The second issue is that, again from the modeling centre perspective, I
>> fervently hope that whatever's done for CMIP5 becomes a de-facto
>> standard for other projects requiring coordinated model data output. We
>> (modeling centres) cannot build one-off solutions for each project. We
>> have with some success made CMOR1/AR4 a template which was forked off
>> for other projects (ENSEMBLES, CHFP, HTAP), because there's no way we
>> can repeatedly undertake the task of integrating multiple inconsistent
>> CMORs and DRSes into our data processing workflow. This is in ref to
>> Martin's question about "non-CMIP5 data".
>>
>> martin.juckes at stfc.ac.uk writes:
>>
>>     
>>> Hello Estanislao, Gavin,
>>>
>>> There is a key part of your problem I don't understand -- what do you
>>> mean by "non CMIP5 data"?
>>>
>>> Before going into the ESGF CMIP5 archive, all files will be CMOR2
>>> compliant. This means that they fit in the "requested" or "output"
>>> product categories of the DRS. The data to be replicated will be a
>>> subset of the "ESG published units" (also known as realm level
>>> datasets) in the "requested" category.
>>>
>>> There has been an agreement that the ESGF CMIP5 archive would be run
>>> on disk, and so it is not surprising that the infrastructure does not
>>> support tape storage. I can see that something along the lines Gavin
>>> describes would resolve the problems with tape storage, but we need
>>> to get the disk based system working as the first priority.
>>>
>>> Stephen raises the issue of replication and this is relevant, since
>>> straight disk-to-disk copies (e.g. to an external hard drive which
>>> can be posted) are a vital aspect of the replication plan. For the
>>> time being, this requires people to stick to the DRS directory
>>> structure.
>>>
>>> Within CMIP5 the data from different institutions is clearly
>>> separated at the institution directory level, so I can't see why
>>> there should be any confusion here.
>>>
>>> For non-CMIP5 data -- why would you want to describe it with the
>>> CMIP5 DRS?
>>>
>>> cheers,
>>> Martin
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: is-enes-sa2-jra4-bounces at lists.enes.org on behalf of
>>> stephen.pascoe at stfc.ac.uk
>>> Sent: Tue 06/07/2010 12:13
>>> To: estanislao.gonzalez at zmaw.de; gavin at llnl.gov
>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of
>>> configuring a datanode to serve CMIP3-DRS
>>>
>>>
>>>
>>> Hi Estanislao,
>>>
>>>       
>>>> * The only true problem is to differentiate between core and
>>>> non-core data (which as far as I know is a file issue instead of a
>>>> dataset one, i.e. some datasets contain core and non-core data)
>>>>         
>>> I'm not sure you were involved then but we had lengthy discussions
>>> last year on how we would deal with the separation of requested and
>>> non-requested data (Karl discourages the term "core").  There is a
>>> fundamental problem that the DRS vocabularies don't cleanly map onto
>>> what is requested and not requested.  The outcome was to introduce
>>> the DRS component "product" to divide the two.  If you are interested
>>> take a look at the following threads:
>>>
>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2010-January/000335.html
>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2009-December/000255.html
>>>
>>> There hasn't been much discussion of how we identify and manage
>>> requested data since then and the nitty-gritty details still aren't
>>> fixed.  This is going to be a challenge when we come to replicate.
>>>
>>> S.
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>
>>> -----Original Message-----
>>> From: is-enes-sa2-jra4-bounces at lists.enes.org
>>> [mailto:is-enes-sa2-jra4-bounces at lists.enes.org] On Behalf Of
>>> Estanislao Gonzalez
>>> Sent: 06 July 2010 11:18
>>> To: Gavin M Bell
>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuring
>>> a datanode to serve CMIP3-DRS
>>>
>>> Hi people,
>>>
>>> well I think we do require something like this (at least at the major
>>> data nodes where data will get replicated). Managing all data mixed
>>> up under one single directory is not a very neat solution for the
>>> data administrator. In our particular case we will be publishing many
>>> (much?
>>> :-) data from different institutions and even types (not only CMIP5).
>>> And we shouldn't forget about the replicated data (is that ===
>>> core?): how can we tell which data requires being replicated? By
>>> maintaining a second "catalog" in a DB? I think by maintaining a
>>> separate filesystem a simple rsync will do the job (after the very
>>> first replication, of course).
>>> In any case the fact that we at DKRZ cannot hold all CMIP5 data on
>>> disk (yes, the core one we can :-) implies that we will have to
>>> maintain a cache somewhere, and mixing this cache with the core data
>>> is something we should probably avoid.
>>>
>>> Gavin's solution, if I got it right, has a major problem. The
>>> catalogs will be created pointing to the real files (e.g.
>>> .../core/CMIP5), so that the filter can alter the request from the
>>> DRS query
>>> (../CMIP5/<core_data>) to the real one, and thus allow the TDS to
>>> work as usual. This leaves the catalogs unaltered, and thereby the
>>> harvested data, which will have no reference to the mapped DRS
>>> structure but only to the real one. Or did I miss something here?
>>>
>>> I have already tried several possible solutions without any success
>>> at all:
>>> 1) Setting multiple datasetRoot entries is not allowed.
>>> 2) Altering the TDS to accept multiple datasetRoot entries and look
>>> in all of them one after the other until something matches is almost
>>> impossible (in the time we have ahead; the mere architecture of the
>>> TDS is, in my opinion, a mess).
>>> 3) In general, altering the TDS is not a "nice" solution.
>>> 4) Filtering the request breaks the coherence between the catalogs
>>> and the DRS "virtual" structure (the catalogs have no information
>>> whatsoever that a second link to the files exists).
>>>
>>> The only viable solution I can think of (and it remains to be seen
>>> whether it's really viable) is to maintain the files somewhere else
>>> and link them into the "central" DRS filesystem before publishing.
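
A rough sketch of that link-based idea, assuming hypothetical storage paths
and assuming each tree already follows the DRS layout:

    import os

    # Hypothetical: merge separate DRS-structured storage trees into one
    # central tree of symlinks, which is then what gets published.
    STORAGE_TREES = ["/replicated/CMIP5", "/cache/CMIP5"]
    CENTRAL_TREE = "/central/CMIP5"

    for tree in STORAGE_TREES:
        for dirpath, _, filenames in os.walk(tree):
            rel = os.path.relpath(dirpath, tree)
            target_dir = os.path.join(CENTRAL_TREE, rel)
            os.makedirs(target_dir, exist_ok=True)
            for name in filenames:
                link = os.path.join(target_dir, name)
                if not os.path.lexists(link):        # first tree wins on clashes
                    os.symlink(os.path.join(dirpath, name), link)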
>>>
>>> After discussing this with Stephan we came up with something I'd like
>>> to sum up here:
>>> * All non-CMIP5 data can be mapped to a DRS structure "not" starting
>>> with CMIP5, so it can be easily mapped to somewhere else (the TDS
>>> allows that)
>>> * The only true problem is to differentiate between core and non-core
>>> data (which as far as I know is a file issue instead of a dataset
>>> one, i.e. some datasets contain core and non-core data)
>>> * The replication can rely on external sources for differentiating
>>> this, e.g. a DB.
>>> * The cached non-core data can co-exist, in the worst-case scenario,
>>> with the core data by removing the write permissions of the latter
>>> (besides the security this implies, it will be used as a flag in case
>>> the server is restarted: all non-flagged (write-enabled) files will be
>>> treated as leftovers from the stopped cache and will continue to be served)
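
A minimal sketch of the permission-bit flagging idea in that last point, with
hypothetical paths (whether a read-only bit is a robust enough flag is exactly
the kind of thing that would need discussing):

    import os, stat

    def flag_as_core(path):
        """Mark a replicated/core file by removing its write permission."""
        mode = os.stat(path).st_mode
        os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

    def cache_leftovers(drs_root):
        """After a restart, write-enabled files are taken to be cache leftovers."""
        leftovers = []
        for dirpath, _, filenames in os.walk(drs_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.access(path, os.W_OK):
                    leftovers.append(path)
        return leftovers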
>>>
>>> So we might get away with it without performing any major changes. But
>>> this is something we should definitely discuss before the next iteration :-)
>>>
>>> I hope this brings some light into the matter... sorry for the
>>> lengthy mail...
>>>
>>> Regards,
>>> Estani
>>>
>>> Gavin M Bell wrote:
>>>       
>>>> Martin,
>>>>
>>>> The saving is that the data provider / data-node admin doesn't have
>>>> to do any additional work, whether that be providing a filesystem <->
>>>> DRS mapping or (re)arranging their file system.  In the current state of
>>>> things all the salient information is already in the database created
>>>> as a result of the publisher [software] scan.  I think it would be
>>>> prudent to use that information to the benefit of our end users
>>>> instead of imposing a DRS directory structure requirement for esg
>>>> participation.
>>>>
>>>> You said:
>>>> "Remember that not having to configure the file system is only a real
>>>> saving if the alternative (configuring the file system to URL mapping)
>>>> is actually easier than configuring the file system."
>>>>
>>>> I am saying:
>>>> The 'alternative' you describe does not exist, because there is no
>>>> "configuring the file system to URL mapping" necessary... unless the
>>>> end-user wants there to be, in which case we, as dutiful programmers,
>>>> provide that opportunity.  This is what my code sketch was
>>>> illustrating with the property "drs.resolve.strategy", and the use of
>>>> a factory and strategy pattern - of which we will set a default that
>>>> requires them to do *no additional work*.  The data-node admin won't
>>>> have to do any actual setup outside of running an "esg-node --update".
>>>> The upgrade/update process (determined by the esg-node install script)
>>>> will install the filter, without them having to do anything additional.
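
A minimal sketch of the factory/strategy selection Gavin describes, written
here in Python rather than as a servlet filter; only the property key
"drs.resolve.strategy" comes from his description, the strategy names are
invented:

    # Sketch of strategy selection keyed on the "drs.resolve.strategy" property.
    # The default ("null") strategy leaves requests untouched, so an admin whose
    # filesystem already follows the DRS layout does no extra work.

    def null_resolver(url_path):
        return url_path                 # identity: DRS path == filesystem path

    def database_resolver(url_path):
        # would look the DRS identifier up in the publisher database (not shown)
        raise NotImplementedError

    RESOLVERS = {"null": null_resolver, "database": database_resolver}

    def make_resolver(properties):
        strategy = properties.get("drs.resolve.strategy", "null")
        return RESOLVERS[strategy]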
>>>>
>>>> Indeed, the code I posted was a quick and dirty filter code sketch
>>>> demonstrating that putting a filter in place is easy. Yes, the
>>>> resolution work would be done in the code that I only alluded to, the
>>>> "DRSResolver". Current duties preclude me from actually implementing
>>>> this issue outright, today, for this email conversation. However, if
>>>> we all conclude that it is worthwhile then I or someone else could
>>>> make it happen.
>>>>
>>>> I hope I have done a better job of making my point clearer: that we
>>>> can free our end-users of this DRS directory structure requirement
>>>> while allowing the DRS itself to be more flexible with its
>>>> representation.
>>>> Also, the mechanism I described does not preclude anyone from
>>>> setting up their filesystem to follow the DRS structure; we get that
>>>> for free! :-)
>>>>
>>>> I am glad that we do indeed agree that the effort to bring this to
>>>> fruition can and should be done in a way that does not impede or
>>>> distract the current  deliverable path.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> martin.juckes at stfc.ac.uk wrote:
>>>>
>>>>         
>>>>> Er... the attachment you sent didn't actually do any mapping. But I'm
>>>>> sure it could be done. The extra work I'm talking about is the same
>>>>> as the extra work you talk about at the end of your mail, so I'm
>>>>> going to ignore your suggestion at the start of your email that there
>>>>> isn't any,
>>>>>
>>>>> cheers,
>>>>> Martin
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>> Sent: Mon 05/07/2010 21:37
>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring
>>>>> a datanode to serve CMIP3-DRS
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> With regards to the savings... One, perhaps default, setup is not
>>>>> having the data provider do anything additional at all with respect
>>>>> to configuration or setup.  They simply use the publisher to scan
>>>>> their files into the system, something that must be done in all
>>>>> cases... (so we can normalize that out). With that said, they would
>>>>> not have to do
>>>>> *any* additional work.  No work is easier than some work, regardless
>>>>> of how easy ;-).
>>>>>
>>>>> I have attached the filter code that would almost do it.  The real
>>>>> intelligence would be in the "DRSResolver" object to do the
>>>>> resolution.
>>>>>  I would have sketched out that class as well but that would be
>>>>> tantamount to completing this task... and to finish it off I would
>>>>> have to confer with Bob on the publisher database, and we would all
>>>>> have to settle on the DRS query syntax.
>>>>> With a DRS URL query scheme we could wrap this up quite directly.
>>>>>
>>>>> The DRSResolver would:
>>>>> - parse the request URL (the query) and pull out the salient parts;
>>>>> - fashion those parts into a SQL query against the publisher database;
>>>>> - return the THREDDS root-based URL to the rest of the processing
>>>>> stream. If it cannot be resolved, punt and return the same input
>>>>> string as the output and let some other part of the processing
>>>>> stream regurgitate an error.
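
A very rough sketch of those three steps; the regex, table and column names
are invented, and the real publisher schema would have to be confirmed with
Bob:

    import re
    import sqlite3  # assuming an sqlite3-style handle on the publisher database

    # Hypothetical DRS query pattern; the component names follow the DRS
    # document, the regex itself is illustrative only.
    DRS_QUERY = re.compile(
        r"^/(?P<activity>[^/]+)/(?P<product>[^/]+)/(?P<institute>[^/]+)"
        r"/(?P<model>[^/]+)/(?P<experiment>[^/]+)/")

    def resolve(url_path, db):
        """Parse the DRS query, look it up, and return a THREDDS root-based
        URL; if it cannot be resolved, punt and return the input unchanged."""
        m = DRS_QUERY.match(url_path)
        if m is None:
            return url_path                                   # punt
        cur = db.execute(                                     # table/columns invented
            "SELECT thredds_url FROM drs_files WHERE activity=? AND product=?"
            " AND institute=? AND model=? AND experiment=?",
            (m["activity"], m["product"], m["institute"], m["model"],
             m["experiment"]))
        row = cur.fetchone()
        return row[0] if row else url_path                    # punt if unknown

    # usage (hypothetical): resolve(request_path, sqlite3.connect("publisher.db"))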
>>>>>
>>>>> Because all the metadata is pulled out in the publisher's scan, file
>>>>> system placement of the scanned files is moot.
>>>>>
>>>>> In the code I attached, I leave room for the data-node user to select
>>>>> their own implementation of the resolver following a factory/strategy
>>>>> pattern.  At that point indeed we allow end users to do 'work' by
>>>>> doing their own mappings.  Perhaps we integrate a few canned mapping
>>>>> schemes etc... We can be arbitrarily clever with these kinds of
>>>>> things, of course. :-)
>>>>>
>>>>> P.S.
>>>>> The DRSResolver logic would/should be ported to all ingress request
>>>>> streams.  Also the published catalogs would be published with the DRS
>>>>> query syntax scheme as the canonical name of the resource - something
>>>>> the search facility would use to identify the resource.
>>>>>
>>>>> done.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>
>>>>>           
>>>>>> Hi Gavin,
>>>>>>
>>>>>> I'm not convinced about the connection to Estanislao's email, but
>>>>>> the idea of thinking about the next step while implementing the
>>>>>> current system is certainly a good one. Remember that not having to
>>>>>> configure the file system is only a real saving if the alternative
>>>>>> (configuring the file system to URL mapping) is actually easier than
>>>>>> configuring the file system. Setting up the DRS is not difficult,
>>>>>>
>>>>>> cheers,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>>> Sent: Mon 05/07/2010 19:45
>>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>
>>>>>> Martin and friends,
>>>>>>
>>>>>> This is false economy.  Two things.  First, implementing this is not
>>>>>> hard.  Secondly, implementing this will resolve the issues w.r.t. the
>>>>>> incongruence between the DRS and the filesystem that Estanislao's
>>>>>> email illuminated.  So it seems to me that the alternative is to keep
>>>>>> fitting this square DRS peg into the round file system hole.  That would
>>>>>> mean having to do a whole other set of gymnastics to get the DRS <->
>>>>>> file system beast tamed.  There is work to be done either way
>>>>>> because things are not ready to go as it stands. I suggest we fix
>>>>>> the problem at the root, now, not "later".  Essentially the current
>>>>>> course requires the data providers to jump through file system
>>>>>> layout hoops.  I am of the opinion that we should "require" as
>>>>>> little as possible from our users, especially something like
>>>>>> this... it hurts adoption IMHO.
>>>>>>
>>>>>> Actually, let me frame this differently.  How about we fork efforts,
>>>>>> and have some folks think about what the *query* URL should be for
>>>>>> the functionality I suggested, while others continue the current
>>>>>> path.  When the former development is ripe I update the install
>>>>>> script and have it installed upon the clients' next install
>>>>>> automagically, no slowdown for anyone.  The null transform would be
>>>>>> equivalent to what we have now so we would be backward compatible
>>>>>> for folks who have done the task of making their file systems
>>>>>> congruent with the DRS.  Fair enough?
>>>>>>
>>>>>> Sound good?
>>>>>>
>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>
>>>>>>             
>>>>>>> Hello Gavin, Bob,
>>>>>>>
>>>>>>> I agree that this is a good idea in principle, but I think it is a
>>>>>>> bad idea now. The thing about "now" is that we want to deploy and
>>>>>>> test the system we have agreed on. We want to do it now because
>>>>>>> modelling centres have supercomputers running and churning out vast
>>>>>>> volumes of data, there are thousands of scientists waiting to get
>>>>>>> at it and we have the job of installing a system to distribute it.
>>>>>>> It is, I think, I bad time to start implementing changes in the
>>>>>>> system design. Sorry if this sounds a bit harsh, but impending
>>>>>>> deadlines make me nervous,
>>>>>>>
>>>>>>> cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
>>>>>>> Sent: Mon 05/07/2010 19:18
>>>>>>> To: Gavin M Bell
>>>>>>> Cc: go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; Charles
>>>>>>> Doutriaux
>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>>
>>>>>>> Hi Gavin,
>>>>>>>
>>>>>>> I agree completely. Having a regularized DRS syntax is a very good
>>>>>>> idea, but to implement it we will need to introduce a level of
>>>>>>> indirection between the DRS URL (your 'query') and the underlying
>>>>>>> filesystem. Separating these two concerns will have a very
>>>>>>> important
>>>>>>> benefit: it will allow the data node managers to organize their
>>>>>>> filesystems as they see fit.
>>>>>>>
>>>>>>> Bob
>>>>>>>
>>>>>>> On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>>>>> Hello gentle-people,
>>>>>>>>
>>>>>>>> Here is my two cents on this whole DRS business.  I think that the
>>>>>>>> fundamental issue to all of this is the ability to do resource
>>>>>>>> resolution (lookup).  The issue of having URLs match a DRS
>>>>>>>> structure that matches the filesystem is a red herring (IMHO).
>>>>>>>> The basic issue is to be able to issue a query to the system such
>>>>>>>> that you find what you are looking for.  This query mechanism
>>>>>>>> should be a separate mechanism from filesystem correspondence.  The
>>>>>>>> driving issue behind the file system correspondence push is so
>>>>>>>> that people and/or applications can infer the location of
>>>>>>>> resources in some regimented way.  The true heart of the issue is
>>>>>>>> not with the file system.  The heart of the issue is to perform a
>>>>>>>> query such that you provide resource resolution.  The file system
>>>>>>>> is a familiar mechanism but it isn't the only one.  The file
>>>>>>>> system takes a query (the file system path) and returns the
>>>>>>>> resource to us (the bits sitting at an inode location somewhere
>>>>>>>> that is memory mapped to some physical platter and spindle
>>>>>>>> location, that is mapped to the file system path).  We are
>>>>>>>> overloading the file system query mechanism when it is not
>>>>>>>> necessary.
>>>>>>>>
>>>>>>>> I propose the following:  We create a *filter* and a small
>>>>>>>> database (the latter we already have in the publisher).  We send a
>>>>>>>> *query* to the web server; the web server *filter* intercepts that
>>>>>>>> *query*, resolves it using the database to the actual resource
>>>>>>>> location, and returns the resource you want.  Implementing this in
>>>>>>>> a filter divorces the query structure from the file system
>>>>>>>> structure.  The use of the database (that is generated by the
>>>>>>>> publisher when it scans) provides the resolution.
>>>>>>>> With this mechanism in place, WGET, as well as any other URL based
>>>>>>>> tool will be able to fetch the data as intended.
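
A bare-bones sketch of that filter idea, written here as WSGI middleware
rather than a Java servlet filter; the lookup table stands in for the
publisher database:

    # Sketch: intercept the incoming *query* path, resolve it against a lookup
    # table standing in for the publisher database, and rewrite the path before
    # handing the request on to the normal file-serving application.
    LOOKUP = {}   # filled from the publisher scan: query path -> real location

    def drs_filter(app):
        def middleware(environ, start_response):
            query = environ.get("PATH_INFO", "")
            environ["PATH_INFO"] = LOOKUP.get(query, query)   # unresolved: pass through
            return app(environ, start_response)
        return middleware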
>>>>>>>>
>>>>>>>> BTW: The "query" is whatever we make it up to be... (not a
>>>>>>>> reference to SQL query).
>>>>>>>>
>>>>>>>> This gives the data-node admin the ability to put their files
>>>>>>>> wherever they want.  If they move files around and so on, they
>>>>>>>> just have to rescan with the publisher.  The issues around design
>>>>>>>> and efficiency can be addressed with varying degrees of cleverness.
>>>>>>>>
>>>>>>>> I welcome any thoughts on this issue... Please talk me down :-). I
>>>>>>>> think it is about time we put this DRS issue to bed.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Estanislao Gonzalez wrote:
>>>>>>>>
>>>>>>>>                 
>>>>>>>>> Hi Bob,
>>>>>>>>>
>>>>>>>>> I guess you must be on vacation now. Anyway, here's the
>>>>>>>>> question, maybe someone else can answer it:
>>>>>>>>>
>>>>>>>>> The very first idea I had was almost what you proposed. Your
>>>>>>>>> proposal though leaves URLs of the form:
>>>>>>>>> http://****myserver/thredds/fileserver/CMIP5_replicas/output/...
>>>>>>>>>                                        <--- (almost) DRS structure --->
>>>>>>>>>
>>>>>>>>> which is not a valid DRS structure (neither CMIP5_replicas nor
>>>>>>>>> CMIP5_core is in the DRS vocabulary).
>>>>>>>>>
>>>>>>>>> My proposal has a very similar flaw:
>>>>>>>>> http://****myserver/thredds/fileserver/replicated/CMIP5/output/...
>>>>>>>>>                                                   <--- full DRS structure --->
>>>>>>>>>
>>>>>>>>> The DRS structure is preserved, but you cannot easily infer the
>>>>>>>>> correct URL from any dataset. I think the idea is: if you know the
>>>>>>>>> prefix (http.../fileserver/) and the dataset DRS name you can always
>>>>>>>>> get the file without even browsing the TDS:
>>>>>>>>> prefix + DRS = URL to file
>>>>>>>>>
>>>>>>>>> AFAIK the URL structure used by the TDS will never be 100%
>>>>>>>>> DRS-conformant (according to DRS version 0.27). The DRS form is:
>>>>>>>>> http://****<hostname>/<activity>/<product>/<institute>/<model>/
>>>>>>>>> <experiment>/<frequency>/<modeling realm>/<variable identifier>/
>>>>>>>>> <ensemble member>/<version>/[<endpoint>],
>>>>>>>>>
>>>>>>>>> whereas the TDS one has the endpoint moved to the front (the
>>>>>>>>> thredds/fileserver, thredds/dodsC, etc. parts).
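
To make the "prefix + DRS = URL to file" idea above concrete, a small sketch
with invented component values and an invented hostname:

    # DRS components per the form quoted above (values are invented).
    drs = {"activity": "CMIP5", "product": "output", "institute": "MPI-M",
           "model": "SOME-MODEL", "experiment": "historical", "frequency": "mon",
           "modeling realm": "atmos", "variable identifier": "tas",
           "ensemble member": "r1i1p1", "version": "v1"}
    order = ["activity", "product", "institute", "model", "experiment",
             "frequency", "modeling realm", "variable identifier",
             "ensemble member", "version"]

    # TDS-style URL: the endpoint (e.g. thredds/fileserver) moves to the front.
    prefix = "http://myserver.example/thredds/fileserver/"
    url = prefix + "/".join(drs[k] for k in order)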
>>>>>>>>>
>>>>>>>>> To sum things up:
>>>>>>>>> Is it possible to publish files from different directory
>>>>>>>>> structures into a unified URL structure so that it is completely
>>>>>>>>> transparent to the user?
>>>>>>>>> Am I the only one addressing this problem? Are all other
>>>>>>>>> institutions planning to publish all files from only one
>>>>>>>>> directory?
>>>>>>>>>
>>>>>>>>> The only viable solution I can think of is to rely on Stephen's
>>>>>>>>> versioning concept and maintain a single true DRS structure
>>>>>>>>> with links to files kept in other, more manageable directory
>>>>>>>>> structures (this will probably involve adapting Stephen's tool).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Estani
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Bob Drach wrote:
>>>>>>>>>
>>>>>>>>>                   
>>>>>>>>>> Hi Estani,
>>>>>>>>>>
>>>>>>>>>> It should be possible to do what you want without running
>>>>>>>>>> multiple data nodes.
>>>>>>>>>>
>>>>>>>>>> The purpose of the THREDDS dataset roots is to hide the
>>>>>>>>>> directory structure from the end user, and to limit what the
>>>>>>>>>> TDS can access. But THREDDS can certainly have multiple dataset
>>>>>>>>>> roots.
>>>>>>>>>>
>>>>>>>>>> In your example below, you should associate different paths with
>>>>>>>>>> the locations, for example:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> <datasetRoot path="CMIP5_replicas"
>>>>>>>>>>> location="/replicated/CMIP5"/> <datasetRoot path="CMIP5_core"
>>>>>>>>>>> location="/core/CMIP5"/>
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>> Also be aware that in the publisher configuration:
>>>>>>>>>>
>>>>>>>>>> - the directory_format can have multiple values, separated by
>>>>>>>>>> vertical bars (|). The publisher will use the first format that
>>>>>>>>>> matches the directory structure being scanned.
>>>>>>>>>>
>>>>>>>>>> - a useful strategy is to create different project sections for
>>>>>>>>>> various groups of directives. You could define a cmip5_replica
>>>>>>>>>> project, a cmip5_core project, etc.
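
A sketch of the "first format that matches" behaviour Bob describes; the
patterns below are illustrative regular expressions standing in for the
publisher's own directory_format syntax, not its actual configuration:

    import re

    # Stand-ins for multiple directory_format values (separated by vertical
    # bars in esg.ini); each is tried in turn and the first match wins.
    DIRECTORY_FORMATS = [
        r"^/replicated/CMIP5/(?P<product>[^/]+)/(?P<institute>[^/]+)/",
        r"^/core/CMIP5/(?P<product>[^/]+)/(?P<institute>[^/]+)/",
    ]

    def match_directory(path):
        for fmt in DIRECTORY_FORMATS:
            m = re.match(fmt, path)
            if m:
                return m.groupdict()
        return None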
>>>>>>>>>>
>>>>>>>>>> Bob
>>>>>>>>>>
>>>>>>>>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                     
>>>>>>>>>>> Hi Bryan,
>>>>>>>>>>>
>>>>>>>>>>> thanks for your answer!
>>>>>>>>>>> Running multiple ESG data nodes is always a possibility, but it
>>>>>>>>>>> seems like overkill to us, as we may have several different "data
>>>>>>>>>>> repositories".
>>>>>>>>>>> We would like to separate: core-replicated,
>>>>>>>>>>> core-non-replicated, non-core, non-core-on-hpss, as well as
>>>>>>>>>>> other non-cmip5 data.
>>>>>>>>>>> Having 5+
>>>>>>>>>>> ESG data nodes is not viable in our scenario.
>>>>>>>>>>>
>>>>>>>>>>> The TDS allows the separation of access URL from the underlying
>>>>>>>>>>> file structure so that it might be possible. AFAIK the
>>>>>>>>>>> publisher does not provide a simple way of doing this.
>>>>>>>>>>>
>>>>>>>>>>> Setting thredds_dataset_roots to different values while
>>>>>>>>>>> publishing doesn't appear to work as those are mapped to a
>>>>>>>>>>> map-entry at the catalog root:
>>>>>>>>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>>>>>>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/> ..
>>>>>>>>>>>
>>>>>>>>>>> which is clearly not bijective and therefore can't be reversed
>>>>>>>>>>> to locate the file from a given URL.
>>>>>>>>>>>
>>>>>>>>>>> While publishing, all referenced data will be held at a known
>>>>>>>>>>> location.
>>>>>>>>>>> Is it possible to somehow use this information to set up a
>>>>>>>>>>> proper catalog configuration so that the URL can be properly
>>>>>>>>>>> mapped? At least at a dataset level?
>>>>>>>>>>>
>>>>>>>>>>> The whole HPSS staging procedure should be completely
>>>>>>>>>>> transparent to the user, as well as the location of the files.
>>>>>>>>>>> I was just looking at other options in case we cannot publish
>>>>>>>>>>> them the way we want...
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Estani
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Bryan Lawrence wrote:
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>>>>>> sorry.
>>>>>>>>>>>>
>>>>>>>>>>>> the first sentence should have read
>>>>>>>>>>>>
>>>>>>>>>>>> Just to note that *our* approach to the local versus
>>>>>>>>>>>> replication issue will be ...
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Bryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                         
>>>>>>>>>>>>> Hi Estani
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to note that your approach to the local versus
>>>>>>>>>>>>> replication will be to run two different ESG nodes ... which
>>>>>>>>>>>>> is in fact the desired outcome so as to get the right things
>>>>>>>>>>>>> in the catalogues at the right time (vis-à-vis QC etc.).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The issue with respect to cache, I'm not so sure about, in
>>>>>>>>>>>>> what way do you want to expose that into ESG?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>                           
>>>>>>>>>>>>>> Hi Stephen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also interested in some variables of the DEFAULT section
>>>>>>>>>>>>>> from the esg.ini configuration file. More specifically:
>>>>>>>>>>>>>> thredds_dataset_roots (and maybe
>>>>>>>>>>>>>> thredds_aggregation_services or any other which was changed
>>>>>>>>>>>>>> or you think it might be important)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main question here is: how can different local directory
>>>>>>>>>>>>>> structures be published to the same DRS structure?
>>>>>>>>>>>>>> The example scenario in our case will be:
>>>>>>>>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>>>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>>>>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only solution I can think of is to extend the URL before
>>>>>>>>>>>>>> the DRS structure starts (the URL won't be 100% DRS-conformant
>>>>>>>>>>>>>> anyway). So
>>>>>>>>>>>>>>   http://*****server/thredds/fileserver/<DRS structure> will
>>>>>>>>>>>>>> turn into
>>>>>>>>>>>>>>   http://*****server/thredds/fileserver/replicated/<DRS
>>>>>>>>>>>>>> structure>
>>>>>>>>>>>>>>   http://*****server/thredds/fileserver/local/<DRS structure>
>>>>>>>>>>>>>>   http://*****server/thredds/fileserver/cache/<DRS
>>>>>>>>>>>>>> structure>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is that viable? Are there any other options?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Estani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                             
>>>>>>>>>>>>>>> To illustrate how the ESG datanode can be configured to
>>>>>>>>>>>>>>> serve data for CMIP5 we have deployed a datanode containing
>>>>>>>>>>>>>>> a subset of
>>>>>>>>>>>>>>> CMIP3 in the Data Reference Syntax. Some key features of
>>>>>>>>>>>>>>> this deployment are:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   * The underlying directory structure is based on the Data
>>>>>>>>>>>>>>>     Reference Syntax.
>>>>>>>>>>>>>>>   * Datasets published at the realm level.
>>>>>>>>>>>>>>>   * The token-based security filter is replaced by the
>>>>>>>>>>>>>>>     OpenidRelyingParty security filter.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Further notes can be found at
>>>>>>>>>>>>>>> http://*****proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This test deployment should be of interest to anyone
>>>>>>>>>>>>>>> wanting to know how DRS identifiers could be exposed in
>>>>>>>>>>>>>>> THREDDS catalogues and the TDS HTML interface.  You can
>>>>>>>>>>>>>>> also try downloading files with OpenID authentication or
>>>>>>>>>>>>>>> via wget with SSL-client certificate authentication.  See
>>>>>>>>>>>>>>> the link above for details.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Stephen.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> Stephen Pascoe  +44 (0)1235 445980 British Atmospheric Data
>>>>>>>>>>>>>>> Centre Rutherford Appleton Laboratory
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Estanislao Gonzalez
>>>>>>>>>>>
>>>>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches
>>>>>>>>>>> Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>>>>
>>>>>>>>>>> Phone:   +49 (40) 46 00 94-126
>>>>>>>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                       
>>>>>>>> -- 
>>>>>>>> Gavin M. Bell
>>>>>>>> Lawrence Livermore National Labs
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> "Never mistake a clear view for a short distance."
>>>>>>>>                  -Paul Saffo
>>>>>>>>
>>>>>>>> (GPG Key - http://***rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>>>>>
>>>>>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>>>>>
>>>>>>>>                 
>>>>>>>
>>>>>>>
>>>>>>>               
>>>>         
>>> -- 
>>> Estanislao Gonzalez
>>>
>>> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches
>>> Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108
>>> - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>
>>> Phone:   +49 (40) 46 00 94-126
>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>
>>>
>>>       
>
>
>   


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  estanislao.gonzalez at zmaw.de





-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMIP5_archive_size_xls.zip
Type: application/zip
Size: 7136125 bytes
Desc: CMIP5_archive_size_xls.zip
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20100707/2e1ab197/attachment-0001.zip 

