[Go-essp-tech] Replication: requested and output DRS products.

Gavin M Bell gavin at llnl.gov
Thu Jul 8 16:25:07 MDT 2010


Hello Gentle-Folks,

May I suggest the following as a potential remedy for this issue of requested...

How about we generate a separate catalog for the requested files in a
dataset.  Have that separate catalog look, smell and taste like our
standard catalog, except it only has the selected files present.  Then
we can specify a DRS naming scheme for this "requested" dataset.  The
additional wrinkle would be associating the "requested" dataset
(catalog) with the superset dataset (catalog) it comes from / belongs
to.  This can perhaps be easily solved via DRS naming convention.
During replication, since the files in the requested catalog point to the
same files in the 'regular' catalog, there would be no additional work
done to fetch/replicate the constituent files if you've already
replicated the superset 'regular' catalog since part of the replication
process is to compare what you have with what you want.  You would have
two catalogs pointing to the same files... perfectly legal.  Keeping the
requested catalog in step with the originating catalog would be rather
straightforward, especially in our single-author versioning universe.
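
A minimal sketch of the "compare what you have with what you want" step
described above, assuming every catalog entry carries a checksum (or an
equivalent id). The CatalogEntry class, the files_to_fetch function, the
file paths and the checksum values are all hypothetical, made up for
illustration; they are not part of the publisher or of any existing ESG
tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One file reference in a catalog (hypothetical, simplified model)."""
    path: str       # DRS-style path of the file (illustrative only)
    checksum: str   # checksum, or an id tantamount to one


def files_to_fetch(wanted, held):
    """Return the entries in `wanted` whose checksum is not held locally."""
    held_checksums = {entry.checksum for entry in held}
    return [entry for entry in wanted if entry.checksum not in held_checksums]


# The 'regular' (superset) catalog, and a "requested" catalog that points
# at a subset of the very same files.
regular = [
    CatalogEntry("output/tas_day.nc", "aaa111"),
    CatalogEntry("output/pr_day.nc", "bbb222"),
    CatalogEntry("output/extra_var.nc", "ccc333"),
]
requested = [e for e in regular if e.checksum in {"aaa111", "bbb222"}]

# A site that already replicated the superset has nothing left to fetch.
already_have = list(regular)
print(files_to_fetch(requested, already_have))

# A site holding only one of the files still needs the other requested one.
partial = [regular[0]]
print([e.path for e in files_to_fetch(requested, partial)])
```

Matching on checksum rather than on path is a design choice in this
sketch: two catalogs can then reference the same file without any of it
being transferred twice.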

I think this is pretty much the same thing that Bryan mentioned earlier
in this thread... I am just putting it in catalog-ese.

Again, if our lingua franca is the catalog (not files) then things boil
down to rather straightforward solutions.

The replicated catalogs can be published to the search engine database
(like any other catalog) and show up as a first class search result
(subject to result filtering shenanigans at the end-user's discretion).

The powers that be that determine what is in "requested" would only have
to create "requested" catalogs and publish them.

P.S.
Yes, we need checksums (or an id tantamount to them) everywhere!

On 7/8/10 12:28 PM, Bob Drach wrote:
> Hi Karl,
> 
> As I recall we discussed whether to build into the publisher (probably  
> within the CMIP5 / IPCC5 handler) the definition of 'requested  
> datasets'. The publisher could then associate a 'requested' value for  
> some property (product?) to aid the identification. This hasn't been  
> done yet - and it's not obvious how straightforward it would be - but  
> could be if the definition is sufficiently well defined at this point.
> 
> Bob
> 
> 
> On Jul 7, 2010, at 5:59 PM, Karl Taylor wrote:
> 
>> Dear all,
>>
>> I'm not sure scientists care much about the distinction between the  
>> "output", "requested", and "replicated" categories. Martin indicates  
>> there might be mild interest in being able to search only over the
>> "requested" category, since outside that category, there may be  
>> little uniformity in what is available from the different models.  
>> There will be little interest in being able to distinguish between  
>> "requested" and "replicated" unless there is a difference in the  
>> quality control tests that have been applied to these two categories  
>> (and then only if a noticeable amount of data in the "requested"  
>> category wouldn't pass the "replicated" tests). Will this likely be  
>> the case?
>>
>> Clearly the ESG federation must be able to decide which files to  
>> replicate, so unlike the scientists there is real interest to some  
>> of you on this list that we be able to distinguish that subset. I'm  
>> not sure this information has to be part of the DRS though. Couldn't  
>> we just have some database that lists the criteria for selecting  
>> data to be replicated? The database coupled with coding to access  
>> that information could be used to decide whether each file in the
>> "output" category needs to be replicated or not. This is why,
>> although the current DRS document allows "product" to be either  
>> "output" or "requested", an all caps note appears stating: "[WILL  
>> POSSIBLY MODIFY THE ABOVE IF WE DON’T NEED TO KNOW ABOUT  
>> “REQUESTED”]."
>>
>> Bob Drach and I had some extended discussions about this some time  
>> ago, but I can't recall if he decided to include some capability  
>> along these lines in the publisher (i.e., enables the publisher to  
>> determine whether files are in the "requested" category or not), or  
>> if we've left that for completely independent external coding. Bob  
>> returns from vacation later this week, so I suggest we wait for some  
>> input from him.
>>
>> Best regards,
>> Karl
>>
>> On 7/7/10 2:48 AM, martin.juckes at stfc.ac.uk wrote:
>>> Hello Estani,
>>>
>>> The reference to CMIP5_archive_size.xls was not very useful,  
>>> apologies for referencing a file that isn't publicly available --  
>>> it is attached.
>>>
>>> According to the DRS document, everything should be found under the  
>>> "output" branch, and the "requested" branch will be a subset of the  
>>> "output".
>>>
>>> An end user may want a homogeneous dataset, and so may opt to  
>>> restrict attention to the "requested" data where he is likely to  
>>> find the same variables from a large range of models. He may, on  
>>> the other hand, want all available data for a given set of  
>>> experiments, in which case he should go to the "output" branch. He  
>>> will then find additional (low priority) variables and extended  
>>> time coverage from a small number of models.
>>>
>>> I'll see what can be done about a "DRS:requested" and  
>>> "ESGF:replicated" document (or wiki page),
>>>
>>> cheers,
>>> Martin
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
>>> Sent: Wed 07/07/2010 08:54
>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>> Cc: Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov;
>>> drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>> Subject: Re: Replication: requested and output DRS products.
>>>
>>> Hi Martin,
>>>
>>> I couldn't find the file you mentioned (CMIP5_archive_size.xls); could
>>> you please provide a link to it?
>>>
>>> I'm aware now that output > requested > replicated. But the distinction
>>> between the latter ones is not clear to me. I totally agree that it would
>>> be great if someone could sum that up.
>>>
>>> And one question from the "monster" thread that still remains is:
>>> It is clear that requested is a subset of output. Does this imply that
>>> all data under .../requested/... should also be found under the
>>> .../output/... DRS sub-structure?
>>>
>>> I think not... but then again, why would the end user need to know about
>>> this separation?
>>>
>>> Thanks,
>>> Estani
>>>
>>>
>>> martin.juckes at stfc.ac.uk wrote:
>>>
>>>> Hello again,
>>>>
>>>> The decision as to what is to be replicated is, I think, embedded
>>>> in "CMIP5_archive_size.xls", and its implementation through the  
>>>> DRS is based on the separation between "requested" and "output"  
>>>> products. It would be useful to have a brief document outlining  
>>>> these decisions and some code to implement them. I'm not sure of  
>>>> the latest status on these two points, perhaps Stephen can add  
>>>> something.
>>>>
>>>> cheers,
>>>> Martin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
>>>> Sent: Tue 06/07/2010 17:15
>>>> To: V. Balaji
>>>> Cc: Juckes, Martin (STFC,RAL,SSTD); Pascoe, Stephen (STFC,RAL,SSTD);
>>>> gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov; taylor13 at llnl.gov
>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>> configuring a datanode to serve CMIP3-DRS
>>>>
>>>> Hi Balaji,
>>>>
>>>> To put things in context once more: (I think there's no such thing as
>>>> over-clarification :-)
>>>>
>>>> DRS file and directory structure will be assured. The problem is if for
>>>> some reason we have two different directories, e.g. A and B, and we want
>>>> to publish data in DRS from both directories. So we have A/<DRS
>>>> structure> and B/<DRS structure>.
>>>> We'd like both of them to be mapped to a central URL, e.g.
>>>> http://www.server.de/thredds/fileserver/<DRS structure> so that the user
>>>> requires absolutely no knowledge about this separation.
>>>>
>>>> The remaining question is: why on earth would someone want to have A and
>>>> B?! :-)
>>>> Well some reasons are:
>>>> 1) Simplified management. We don't have a mega-mix of millions of files
>>>> from which some have to be replicated, some are held only at our
>>>> institution, some are "temporarily" held as being cached from tape.
>>>> Telling these all apart might not be an easy task.
>>>> 2) Safety. In such a context a simple error might be disastrous (e.g.
>>>> someone tries to remove the replicated files to re-deploy them without
>>>> being aware that they share the directory with other files...)
>>>> 3) Backup. If we (ok, somebody else, we will have everything on tape, I
>>>> think...) want to back up a portion of the data, this won't be easily
>>>> achieved (the replicated data is already redundant, but the other isn't)
>>>> 4) Storage. We might get more disks, but we certainly won't be able
>>>> to "merge" all of them into a single storage (well, that's because they
>>>> will arrive way after we start publishing things, so the first disks
>>>> will already have some data). In any case, for political (e.g.
>>>> institutional), technical (e.g. disk speed) or philosophical (e.g.
>>>> ...uh....) reasons it might be desirable to keep different storages.
>>>>
>>>> And as I said we have to cope with that, somehow.
>>>> The starter question was: can this be achieved with the publisher? And
>>>> the answer was "no".
>>>>
>>>> And I totally agree with you regarding AR5. I must have a very good
>>>> reason for not adhering to a default, even a de facto one. But the
>>>> decision behind the storage in AR5 is a political one that, AFAIK,
>>>> hasn't been taken yet.
>>>>
>>>> Well, I hope this helped to clarify things a bit.
>>>>
>>>> Thanks,
>>>> Estani
>>>>
>>>> V. Balaji wrote:
>>>>
>>>>
>>>>> There are undoubtedly parts of this I'm not following too well, so I
>>>>> apologize in advance for any misunderstandings. This is all from the
>>>>> perspective of a modeling center.
>>>>>
>>>>> I do not understand the logic for _not_ wishing to lay data out in
>>>>> DRS-compliant fashion on the public data server. I know you can do it,
>>>>> but I don't understand why you'd want to. One thing I'd like to make
>>>>> sure is captured as a requirement is that 'wget -r' should deliver
>>>>> data laid out per DRS directory structure.
>>>>>
>>>>> The second issue is that, again from the modeling centre perspective, I
>>>>> fervently hope that whatever's done for CMIP5 becomes a de-facto
>>>>> standard for other projects requiring coordinated model data output. We
>>>>> (modeling centres) cannot build one-off solutions for each project. We
>>>>> have with some success made CMOR1/AR4 a template which was forked off
>>>>> for other projects (ENSEMBLES, CHFP, HTAP), because there's no way we
>>>>> can repeatedly undertake the task of integrating multiple inconsistent
>>>>> CMORs and DRSes into our data processing workflow. This is in ref to
>>>>> Martin's question about "non-CMIP5 data".
>>>>>
>>>>> martin.juckes at stfc.ac.uk writes:
>>>>>
>>>>>
>>>>>
>>>>>> Hello Estanislao, Gavin,
>>>>>>
>>>>>> There is a key part of your problem I don't understand -- what do you
>>>>>> mean by "non CMIP5 data"?
>>>>>>
>>>>>> Before going into the ESGF CMIP5 archive, all files will be CMOR2
>>>>>> compliant. This means that they fit in the "requested" or "product"
>>>>>> categories of the DRS. The data to be replicated will be a subset of
>>>>>> the "ESG published units" (also known as realm level datasets) in the
>>>>>> "requested" category.
>>>>>>
>>>>>> There has been an agreement that the ESGF CMIP5 archive would be run
>>>>>> on disk, and so it is not surprising that the infrastructure does not
>>>>>> support tape storage. I can see that something along the lines Gavin
>>>>>> describes would resolve the problems with tape storage, but we need
>>>>>> to get the disk based system working as the first priority.
>>>>>>
>>>>>> Stephen raises the issue of replication and this is relevant, since
>>>>>> straight disk to disk copies (i.e. to an external hard drive which
>>>>>> can be posted) is a vital aspect of the replication plan. For the
>>>>>> time being, this requires people to stick to the DRS directory
>>>>>> structure.
>>>>>>
>>>>>> Within CMIP5 the data from different institutions is clearly
>>>>>> separated at the institution directory level, I can't see why there
>>>>>> should be any confusion here.
>>>>>>
>>>>>> For non-CMIP5 data -- why would you want to describe it with the
>>>>>> CMIP5 DRS?
>>>>>>
>>>>>> cheers,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: is-enes-sa2-jra4-bounces at lists.enes.org on behalf of
>>>>>> stephen.pascoe at stfc.ac.uk
>>>>>> Sent: Tue 06/07/2010 12:13
>>>>>> To: estanislao.gonzalez at zmaw.de; gavin at llnl.gov
>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of
>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Estanislao,
>>>>>>
>>>>>>
>>>>>>
>>>>>>> * The only true problem is to differentiate between core and
>>>>>>> non-core data (which as far as I know is a file issue instead of a
>>>>>>> dataset one, i.e. some datasets contain core and non core data)
>>>>>>>
>>>>>>>
>>>>>> I'm not sure you were involved then but we had lengthy discussions
>>>>>> last year on how we would deal with the separation of requested and
>>>>>> non-requested data (Karl discourages the term "core").  There is a
>>>>>> fundamental problem that the DRS vocabularies don't cleanly map onto
>>>>>> what is requested and not requested.  The outcome was to introduce
>>>>>> the DRS component "product" to divide the two.  If you are interested
>>>>>> take a look at the following threads:
>>>>>>
>>>>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2010-January/000335.html
>>>>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2009-December/000255.html
>>>>>>
>>>>>> There hasn't been much discussion of how we identify and manage
>>>>>> requested data since then and the nitty-gritty details still aren't
>>>>>> fixed.  This is going to be a challenge when we come to replicate.
>>>>>>
>>>>>> S.
>>>>>>
>>>>>> ---
>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>> British Atmospheric Data Centre
>>>>>> Rutherford Appleton Laboratory
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: is-enes-sa2-jra4-bounces at lists.enes.org
>>>>>> [mailto:is-enes-sa2-jra4-bounces at lists.enes.org] On Behalf Of
>>>>>> Estanislao Gonzalez
>>>>>> Sent: 06 July 2010 11:18
>>>>>> To: Gavin M Bell
>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuring
>>>>>> a datanode to serve CMIP3-DRS
>>>>>>
>>>>>> Hi people,
>>>>>>
>>>>>> well I think we do require something like this (at least at the major
>>>>>> data nodes where data will get replicated). Managing all data mixed
>>>>>> up under one single directory is not a very neat solution for the
>>>>>> data administrator. In our particular case we will be publishing many
>>>>>> (much? :-) data from different institutions and even types (not only
>>>>>> CMIP5).
>>>>>> And we shouldn't forget about the replicated data (is that ===
>>>>>> core?), how can we tell which data requires being replicated? by
>>>>>> maintaining a second "catalog" in a DB? I think by maintaining a
>>>>>> separate filesystem a simple rsync will do the job (after the very
>>>>>> first replication, of course).
>>>>>> In any case the fact that we at DKRZ cannot hold all CMIP5 data on
>>>>>> disk (yes, the core one we can :-) implies that we will have to
>>>>>> maintain a cache somewhere, and mixing this cache with the core data
>>>>>> is something we should probably avoid.
>>>>>>
>>>>>> Gavin's solution, if I got it right, has a major problem. The
>>>>>> catalogs will be created pointing to the real files (e.g.
>>>>>> .../core/CMIP5), so that the filter can alter the request from the
>>>>>> DRS query (../CMIP5/<core_data>) to the real one, and thus allow the
>>>>>> TDS to work as usual. This leaves the catalogs unaltered and thereby
>>>>>> the harvested data, which will have no reference to the mapped DRS
>>>>>> structure but to the real one. Or did I miss something here?
>>>>>>
>>>>>> I have already tried several possible solutions without any success
>>>>>> at all:
>>>>>> 1) Setting multiple datasetRoot entries is not allowed
>>>>>> 2) Altering the TDS to accept multiple datasetRoot entries and look
>>>>>> in all of them one after the other until something matches is almost
>>>>>> impossible (for the time we have ahead, the mere architecture of the
>>>>>> TDS is, in my opinion, a mess).
>>>>>> 3) In general altering the TDS is not a "nice" solution.
>>>>>> 4) Filtering the request breaks the coherence between the catalogs
>>>>>> and the DRS "virtual" structure (the catalogs have no information
>>>>>> whatsoever that a second link to the files exists).
>>>>>>
>>>>>> The only viable solution I can think of (and it remains to be seen if
>>>>>> it's really viable) is to maintain the files somewhere else and link
>>>>>> them to the "central" DRS filesystem before being published.
>>>>>>
>>>>>> After discussing this with Stephen we came up with something I'd like
>>>>>> to sum up here:
>>>>>> * All non CMIP5 data can be mapped to a DRS structure "not" starting
>>>>>> with CMIP5 so it can be easily mapped to somewhere else (TDS allows
>>>>>> that)
>>>>>> * The only true problem is to differentiate between core and non-core
>>>>>> data (which as far as I know is a file issue instead of a dataset
>>>>>> one, i.e. some datasets contain core and non core data)
>>>>>> * The replication can rely on external sources for differentiating
>>>>>> this, e.g. a DB.
>>>>>> * The cached non-core data can co-live, in the worst case scenario,
>>>>>> with the core data by removing the write permissions of the latter
>>>>>> (besides the security that it implies, this will be used as a flag in
>>>>>> case the server is restarted. All non-flagged (write enabled) files
>>>>>> will be treated as left overs from the stopped cache and will be
>>>>>> further served)
>>>>>>
>>>>>> So we might get away with it without performing any major changes. But
>>>>>> this is something we should definitely discuss before the next
>>>>>> iteration :-)
>>>>>>
>>>>>> I hope this brings some light into the matter... sorry for the
>>>>>> lengthy mail...
>>>>>>
>>>>>> Regards,
>>>>>> Estani
>>>>>>
>>>>>> Gavin M Bell wrote:
>>>>>>
>>>>>>
>>>>>>> Martin,
>>>>>>>
>>>>>>> The savings is that the data provider / data-node admin doesn't have
>>>>>>> to do any additional work, whether it be providing any filesystem <->
>>>>>>> DRS mapping or (re)arranging their file system.  In the current state
>>>>>>> of things all the salient information is already in the database
>>>>>>> created as a result of the publisher [software] scan.  I think it
>>>>>>> would be prudent to use that information to the benefit of our end
>>>>>>> users instead of imposing a DRS directory structure requirement for
>>>>>>> esg participation.
>>>>>>>
>>>>>>> You said:
>>>>>>> "Remember that not having to configure the file system is only a real
>>>>>>> saving if the alternative (configuring the file system to URL mapping)
>>>>>>> is actually easier than configuring the file system."
>>>>>>>
>>>>>>> I am saying:
>>>>>>> The 'alternative' you describe does not exist, because there is no
>>>>>>> "configuring the file system to URL mapping" necessary... unless the
>>>>>>> end-user wants there to be. In which case we, as dutiful programmers,
>>>>>>> provide that opportunity.  This is what my code sketch was
>>>>>>> illustrating with the property "drs.resolve.strategy", and the use of
>>>>>>> a factory and strategy pattern - of which we will set a default that
>>>>>>> requires them to do *no additional work*.  The data-node admin won't
>>>>>>> have to do any actual setup outside of running an "esg-node --update".
>>>>>>> The upgrade/update process (determined by the esg-node install script)
>>>>>>> will install the filter, without them having to do anything additional.
>>>>>>>
>>>>>>> Indeed, the code I posted was a quick and dirty filter code sketch
>>>>>>> demonstrating that putting a filter in place is easy. Yes, the
>>>>>>> resolution work would be done in the code that I only alluded to, the
>>>>>>> "DRSResolver". Current duties preclude me from actually implementing
>>>>>>> this outright, today, for this email conversation. However, if
>>>>>>> we all conclude that it is worthwhile then I or someone else could
>>>>>>> make it happen.
>>>>>>>
>>>>>>> I hope I have done a better job of making my point more clear: that we
>>>>>>> can free our end-users of this DRS directory structure requirement
>>>>>>> while allowing the DRS itself to be more flexible with its
>>>>>>> representation.
>>>>>>> Also that the mechanism I described does not preclude anyone from
>>>>>>> setting up their filesystem to follow the DRS structure, we get that
>>>>>>> for free! :-)
>>>>>>>
>>>>>>> I am glad that we do indeed agree that the effort to bring this to
>>>>>>> fruition can and should be done in a way that does not impede or
>>>>>>> distract from the current deliverable path.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Er... the attachment you sent didn't actually do any mapping. But I'm
>>>>>>>> sure it could be done. The extra work I'm talking about is the same
>>>>>>>> as the extra work you talk about at the end of your mail, so I'm
>>>>>>>> going to ignore your suggestion at the start of your email that there
>>>>>>>> isn't any,
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>>>>> Sent: Mon 05/07/2010 21:37
>>>>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring
>>>>>>>> a datanode to serve CMIP3-DRS
>>>>>>>>
>>>>>>>> Hi Martin,
>>>>>>>>
>>>>>>>> With regards to the savings... One, perhaps default, setup is not
>>>>>>>> having the data provider do anything additional at all with respect
>>>>>>>> to configuration or setup.  They simply use the publisher to scan
>>>>>>>> their files into the system, something that must be done in all
>>>>>>>> cases... (so we can normalize that out). With that said, they would
>>>>>>>> not have to do *any* additional work.  No work is easier than some
>>>>>>>> work, regardless of how easy ;-).
>>>>>>>>
>>>>>>>> I have attached the filter code that would almost do it.  The real
>>>>>>>> intelligence would be in the "DRSResolver" object to do the
>>>>>>>> resolution.  I would have sketched out that class as well but that
>>>>>>>> would be tantamount to completing this task... and to finish it off I
>>>>>>>> would have to confer with Bob on the publisher database.  And have us
>>>>>>>> all settled on the DRS query syntax.
>>>>>>>> With a DRS URL query scheme we could wrap this up quite directly.
>>>>>>>>
>>>>>>>> The DRSResolver would:
>>>>>>>> - parse the request url (the query) and pull out the salient parts.
>>>>>>>> - fashion those parts into a SQL query against the publisher database.
>>>>>>>> - return the thredds' root based url to the rest of the processing
>>>>>>>>   stream. If it is not able to be resolved, punt and return the same
>>>>>>>>   input string as the output and let some other part of the process
>>>>>>>>   stream regurgitate an error.
>>>>>>>>
>>>>>>>> Because all the metadata is pulled out in the publisher's scan, file
>>>>>>>> system placement of the scanned files is moot.
>>>>>>>>
>>>>>>>> In the code I attached, I leave room for the data-node user to select
>>>>>>>> their own implementation of the resolver following a factory/strategy
>>>>>>>> pattern.  At that point indeed we allow end users to do 'work' with
>>>>>>>> doing their own mappings.  Perhaps we integrate a few canned mapping
>>>>>>>> schemes etc... We can be arbitrarily clever with these kinds of
>>>>>>>> things of course. :-)
>>>>>>>>
>>>>>>>> P.S.
>>>>>>>> The DRSResolver logic would/should be ported to all ingress request
>>>>>>>> streams.  Also the published catalogs would be published with the DRS
>>>>>>>> query syntax scheme as the canonical name of the resource - something
>>>>>>>> the search facility would use to identify the resource.
>>>>>>>>
>>>>>>>> done.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Gavin,
>>>>>>>>>
>>>>>>>>> I'm not convinced about the connection to Estanislao's email, but
>>>>>>>>> the idea of thinking about the next step while implementing the
>>>>>>>>> current system is certainly a good one. Remember that not having to
>>>>>>>>> configure the file system is only a real saving if the alternative
>>>>>>>>> (configuring the file system to URL mapping) is actually easier than
>>>>>>>>> configuring the file system. Setting up the DRS is not difficult,
>>>>>>>>>
>>>>>>>>> cheers,
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>>>>>> Sent: Mon 05/07/2010 19:45
>>>>>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>>>>
>>>>>>>>> Martin and friends,
>>>>>>>>>
>>>>>>>>> This is false economy.  Two things.  First, implementing this is not
>>>>>>>>> hard.  Secondly, implementing this will resolve the issues w.r.t. the
>>>>>>>>> incongruence between DRS and the filesystem that Estanislao's email
>>>>>>>>> illuminated.  So it seems to me that the alternative is to keep
>>>>>>>>> fitting this square DRS peg into the round file system hole.  That
>>>>>>>>> would mean having to do a whole other set of gymnastics to get the
>>>>>>>>> DRS <-> file system beast tamed.  There is work to be done either way
>>>>>>>>> because things are not ready to go as it stands. I suggest we fix
>>>>>>>>> the problem at the root, now, not "later".  Essentially the current
>>>>>>>>> course requires the data providers to jump through file system
>>>>>>>>> layout hoops.  I am of the opinion that we should "require" as
>>>>>>>>> little as possible from our users, especially something like
>>>>>>>>> this... it hurts adoption IMHO.
>>>>>>>>>
>>>>>>>>> Actually, let me frame this differently.  How about we fork efforts,
>>>>>>>>> and have some folks think about what the *query* URL should be for
>>>>>>>>> the functionality I suggested, while others continue the current
>>>>>>>>> path.  When the former development is ripe I update the install
>>>>>>>>> script and have it installed upon the clients' next install
>>>>>>>>> automagically, no slowdown for anyone.  The null transform would be
>>>>>>>>> equivalent to what we have now so we would be backward compatible
>>>>>>>>> for folks who have done the task of making their file systems
>>>>>>>>> congruent to DRS.  Fair enough?
>>>>>>>>>
>>>>>>>>> Sound good?
>>>>>>>>>
>>>>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello Gavin, Bob,
>>>>>>>>>>
>>>>>>>>>> I agree that this is a good idea in principle, but I think it is a
>>>>>>>>>> bad idea now. The thing about "now" is that we want to deploy and
>>>>>>>>>> test the system we have agreed on. We want to do it now because
>>>>>>>>>> modelling centres have supercomputers running and churning out vast
>>>>>>>>>> volumes of data, there are thousands of scientists waiting to get
>>>>>>>>>> at it and we have the job of installing a system to distribute it.
>>>>>>>>>> It is, I think, a bad time to start implementing changes in the
>>>>>>>>>> system design. Sorry if this sounds a bit harsh, but impending
>>>>>>>>>> deadlines make me nervous,
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> Martin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
>>>>>>>>>> Sent: Mon 05/07/2010 19:18
>>>>>>>>>> To: Gavin M Bell
>>>>>>>>>> Cc: go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org;
>>>>>>>>>> Charles Doutriaux
>>>>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>>>>>
>>>>>>>>>> Hi Gavin,
>>>>>>>>>>
>>>>>>>>>> I agree completely. Having a regularized DRS syntax is a very good
>>>>>>>>>> idea, but to implement it we will need to introduce a level of
>>>>>>>>>> indirection between the DRS URL (your 'query') and the underlying
>>>>>>>>>> filesystem. Separating these two concerns will have a very
>>>>>>>>>> important benefit: it will allow the data node managers to organize
>>>>>>>>>> their filesystems as they see fit.
>>>>>>>>>>
>>>>>>>>>> Bob
>>>>>>>>>>
>>>>>>>>>> On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hello gentle-people,
>>>>>>>>>>>
>>>>>>>>>>> Here is my two cents on this whole DRS business.  I think that the
>>>>>>>>>>> fundamental issue to all of this is the ability to do resource
>>>>>>>>>>> resolution (lookup).  The issue of having urls match a DRS
>>>>>>>>>>> structure that matches the filesystem is a red herring (IMHO).
>>>>>>>>>>> The basic issue is to be able to issue a query to the system such
>>>>>>>>>>> that you find what you are looking for.  This query mechanism
>>>>>>>>>>> should be a separate mechanism from filesystem correspondence.  The
>>>>>>>>>>> driving issue behind the file system correspondence push is so
>>>>>>>>>>> that people and/or applications can infer the location of
>>>>>>>>>>> resources in some regimented way.  The true heart of the issue is
>>>>>>>>>>> not with the file system.  The heart of the issue is to perform a
>>>>>>>>>>> query such that you provide resource resolution.  The file system
>>>>>>>>>>> is a familiar mechanism but it isn't the only one.  The file
>>>>>>>>>>> system takes a query (the file system path) and returns the
>>>>>>>>>>> resource to us (the bits sitting at an inode location somewhere
>>>>>>>>>>> that is memory mapped to some physical platter and spindle
>>>>>>>>>>> location, that is mapped to the file system path).  We are
>>>>>>>>>>> overloading the file system query mechanism when it is not
>>>>>>>>>>> necessary.
>>>>>>>>>>>
>>>>>>>>>>> I propose the following:  We create a *filter* and a small
>>>>>>>>>>> database (the latter we already have in the publisher).  We send a
>>>>>>>>>>> *query* to the web server; the web server *filter* intercepts that
>>>>>>>>>>> *query*, resolves it to the actual resource location using the
>>>>>>>>>>> database, and returns the resource you want.  Implementing this in
>>>>>>>>>>> a filter divorces the query structure from the file system
>>>>>>>>>>> structure.  The use of the database (that is generated by the
>>>>>>>>>>> publisher when it scans) provides the resolution.
>>>>>>>>>>> With this mechanism in place, WGET, as well as any other URL-based
>>>>>>>>>>> tool, will be able to fetch the data as intended.
>>>>>>>>>>>
>>>>>>>>>>> BTW: The "query" is whatever we make it up to be... (not a
>>>>>>>>>>> reference to SQL query).
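To make the filter-plus-database idea above concrete, here is a minimal sketch in Python. It assumes the publisher's scan can be reduced to rows of (query, location); the table name `resource`, the functions `build_index` and `resolve`, and the sample paths are all hypothetical and are not taken from the publisher or the TDS.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory lookup table mapping a query key to a file location."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE resource (query TEXT PRIMARY KEY, location TEXT NOT NULL)"
    )
    db.executemany("INSERT INTO resource VALUES (?, ?)", rows)
    db.commit()
    return db


def resolve(db, query):
    """Return the physical location recorded for a query, or None if unknown."""
    row = db.execute(
        "SELECT location FROM resource WHERE query = ?", (query,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical entries, standing in for what a publisher scan might record.
db = build_index([
    ("CMIP5/output/a.nc", "/replicated/CMIP5/output/a.nc"),
    ("CMIP5/output/b.nc", "/core/CMIP5/output/b.nc"),
])
print(resolve(db, "CMIP5/output/b.nc"))
```

A web-server filter would call something like `resolve` on the path part of the request and then stream the file found at the returned location; moving files only requires rebuilding the table.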
>>>>>>>>>>>
>>>>>>>>>>> This gives the data-node admin the ability to put their files
>>>>>>>>>>> wherever they want.  If they move files around and so on, they
>>>>>>>>>>> just have to rescan with the publisher.  The issues around design
>>>>>>>>>>> and efficiency can be addressed with varying degrees of cleverness.
>>>>>>>>>>>
>>>>>>>>>>> I welcome any thoughts on this issue... Please talk me down :-). I
>>>>>>>>>>> think it is about time we put this DRS issue to bed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Estanislao Gonzalez wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Hi Bob,
>>>>>>>>>>>>
>>>>>>>>>>>> I guess you must be on vacation now. Anyway, here's the
>>>>>>>>>>>> question, maybe someone else can answer it:
>>>>>>>>>>>>
>>>>>>>>>>>> The very first idea I had was almost what you proposed. Your
>>>>>>>>>>>> proposal though leaves URLs of the form:
>>>>>>>>>>>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>>>>>>>>>>>                                    <--- (almost) DRS Structure --->
>>>>>>>>>>>>
>>>>>>>>>>>> This has no valid DRS structure (neither CMIP5_replicas nor
>>>>>>>>>>>> CMIP5_core is in the DRS vocabulary).
>>>>>>>>>>>>
>>>>>>>>>>>> My proposal has a very similar flaw:
>>>>>>>>>>>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>>>>>>>>>>>                                               <--- full DRS Structure --->
>>>>>>>>>>>>
>>>>>>>>>>>> The DRS structure is preserved, but you cannot easily infer the
>>>>>>>>>>>> correct URL from any dataset. I think the idea is: if you know the
>>>>>>>>>>>> prefix (http.../fileserver/) and the dataset DRS name, you can
>>>>>>>>>>>> always get the file without even browsing the TDS:
>>>>>>>>>>>> prefix + DRS = URL to file
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK the URL structure used by the TDS will never be 100%
>>>>>>>>>>>> DRS-conformant (according to DRS version 0.27). The DRS one has
>>>>>>>>>>>> the form:
>>>>>>>>>>>> http://<hostname>/<activity>/<product>/<institute>/<model>/
>>>>>>>>>>>> <experiment>/<frequency>/<modeling realm>/<variable identifier>/
>>>>>>>>>>>> <ensemble member>/<version>/[<endpoint>],
>>>>>>>>>>>>
>>>>>>>>>>>> whereas the TDS one has the endpoint moved to the front (the
>>>>>>>>>>>> thredds/fileserver, thredds/dodsC, etc. parts).
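The difference between the two URL shapes described above can be sketched as follows. This is only an illustration: the field order follows the DRS form quoted in the message, while the function names, the dictionary keys, and the sample values are assumptions made for the example.

```python
# Field order as listed in the DRS URL form quoted above.
DRS_FIELDS = (
    "activity", "product", "institute", "model", "experiment",
    "frequency", "modeling_realm", "variable", "ensemble", "version",
)


def drs_path(parts):
    """Join the DRS components in their fixed order."""
    return "/".join(parts[name] for name in DRS_FIELDS)


def drs_url(hostname, parts, endpoint=""):
    """DRS-style URL: the optional endpoint comes last."""
    return "http://%s/%s/%s" % (hostname, drs_path(parts), endpoint)


def tds_url(hostname, parts, service="thredds/fileserver"):
    """TDS-style URL: the service prefix comes before the DRS components."""
    return "http://%s/%s/%s/" % (hostname, service, drs_path(parts))


# Illustrative values only.
example = {
    "activity": "CMIP5", "product": "output", "institute": "INST",
    "model": "MODEL", "experiment": "EXP", "frequency": "mon",
    "modeling_realm": "atmos", "variable": "tas",
    "ensemble": "r1i1p1", "version": "v1",
}
print(drs_url("myserver", example))
print(tds_url("myserver", example))
```

With a fixed service prefix, `tds_url` is the "prefix + DRS = URL to file" rule; an extra path segment such as `replicated/` breaks that rule because the caller can no longer know the prefix in advance.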
>>>>>>>>>>>>
>>>>>>>>>>>> To sum things up:
>>>>>>>>>>>> Is it possible to publish files from different directory
>>>>>>>>>>>> structures into a unified URL structure so that it is completely
>>>>>>>>>>>> transparent to the user?
>>>>>>>>>>>> Am I the only one addressing this problem? Are all other
>>>>>>>>>>>> institutions planning to publish all files from only one
>>>>>>>>>>>> directory?
>>>>>>>>>>>>
>>>>>>>>>>>> The only viable solution I can think of is to rely on Stephen's
>>>>>>>>>>>> versioning concept and maintain a single true DRS structure
>>>>>>>>>>>> with links to files kept in other more manageable directory
>>>>>>>>>>>> structures (this will probably involve adapting Stephen's tool).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Estani
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Bob Drach wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Estani,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It should be possible to do what you want without running
>>>>>>>>>>>>> multiple data nodes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The purpose of the THREDDS dataset roots is to hide the
>>>>>>>>>>>>> directory structure from the end user, and to limit what the
>>>>>>>>>>>>> TDS can access. But THREDDS can certainly have multiple dataset
>>>>>>>>>>>>> roots.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In your example below, you should associate different paths with
>>>>>>>>>>>>> the locations, for example:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>>>>>>>>>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Also be aware that in the publisher configuration:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - the directory_format can have multiple values, separated by
>>>>>>>>>>>>> vertical bars (|). The publisher will use the first format that
>>>>>>>>>>>>> matches the directory structure being scanned.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - a useful strategy is to create different project sections for
>>>>>>>>>>>>> various groups of directives. You could define a cmip5_replica
>>>>>>>>>>>>> project, a cmip5_core project, etc.
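The "first format that matches wins" behaviour described above can be sketched like this. The formats here are written as Python regular expressions purely for illustration; this is not the publisher's actual directory_format syntax, and `FORMATS` and `match_directory` are hypothetical names.

```python
import re

# Hypothetical formats, tried in the order given (stand-ins for the
# bar-separated directory_format values).
FORMATS = [
    re.compile(r"^/replicated/CMIP5/(?P<product>[^/]+)/(?P<institute>[^/]+)"),
    re.compile(r"^/core/CMIP5/(?P<product>[^/]+)/(?P<institute>[^/]+)"),
]


def match_directory(path):
    """Return (index of first matching format, extracted fields), or None."""
    for index, pattern in enumerate(FORMATS):
        found = pattern.match(path)
        if found:
            return index, found.groupdict()
    return None


print(match_directory("/core/CMIP5/output/INST/some/file.nc"))
```

Because matching stops at the first hit, the order of the formats matters whenever two of them could accept the same directory.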
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bob
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Bryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks for your answer!
>>>>>>>>>>>>>> Running multiple ESG data nodes is always a possibility, but it
>>>>>>>>>>>>>> seems like overkill to us as we may have several different "data
>>>>>>>>>>>>>> repositories".
>>>>>>>>>>>>>> We would like to separate: core-replicated,
>>>>>>>>>>>>>> core-non-replicated, non-core, non-core-on-hpss, as well as
>>>>>>>>>>>>>> other non-cmip5 data.
>>>>>>>>>>>>>> Having 5+ ESG data nodes is not viable in our scenario.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The TDS allows the separation of access URL from the underlying
>>>>>>>>>>>>>> file structure so that it might be possible. AFAIK the
>>>>>>>>>>>>>> publisher does not provide a simple way of doing this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Setting thredds_dataset_roots to different values while
>>>>>>>>>>>>>> publishing doesn't appear to work as those are mapped to a
>>>>>>>>>>>>>> map-entry at the catalog root:
>>>>>>>>>>>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>>>>>>>>>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>  ..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> which is clearly not bijective and therefore can't be reversed
>>>>>>>>>>>>>> to locate the file from a given URL.
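The reversibility problem described above can be checked mechanically. The sketch below treats each datasetRoot as a (path, location) pair and reports any URL path that points at more than one location; `find_ambiguous_paths` is a hypothetical helper, not part of the publisher or the TDS.

```python
def find_ambiguous_paths(roots):
    """Return {path: [locations]} for every path mapped to more than one location."""
    seen = {}
    for path, location in roots:
        seen.setdefault(path, set()).add(location)
    return {p: sorted(locs) for p, locs in seen.items() if len(locs) > 1}


# The two roots from the message: same path, different locations.
clashing = [("CMIP5", "/replicated/CMIP5"), ("CMIP5", "/core/CMIP5")]
# Distinct paths, as in the CMIP5_replicas / CMIP5_core variant in this thread.
distinct = [("CMIP5_replicas", "/replicated/CMIP5"), ("CMIP5_core", "/core/CMIP5")]

print(find_ambiguous_paths(clashing))
print(find_ambiguous_paths(distinct))
```

An empty result means each URL path identifies exactly one directory, so a URL can be mapped back to a file; the price, as the thread notes, is that the path segment is no longer part of the DRS vocabulary.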
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While publishing, all referenced data will be held at a known
>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>> Is it possible to somehow use this information to set up a
>>>>>>>>>>>>>> proper catalog configuration so that the URL can be properly
>>>>>>>>>>>>>> mapped? At least on a dataset level?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The whole HPSS staging procedure should be completely
>>>>>>>>>>>>>> transparent to the user, as well as the location of the files.
>>>>>>>>>>>>>> I was just looking at other options in case we cannot publish
>>>>>>>>>>>>>> them the way we want...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Estani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bryan Lawrence wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> sorry.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> the first sentence should have read
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Just to note that *our* approach to the local versus
>>>>>>>>>>>>>>> replication issue will be ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>> Bryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Estani
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just to note that your approach to the local versus
>>>>>>>>>>>>>>>> replication will be to run two different ESG nodes ... which
>>>>>>>>>>>>>>>> is in fact the desired outcome so as to get the right things
>>>>>>>>>>>>>>>> in the catalogues at the right time (vis-a-vis QC etc).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The issue with respect to cache, I'm not so sure about; in
>>>>>>>>>>>>>>>> what way do you want to expose that into ESG?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Stephen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm also interested in some variables of the DEFAULT section
>>>>>>>>>>>>>>>>> from the esg.ini configuration file. More specifically:
>>>>>>>>>>>>>>>>> thredds_dataset_roots (and maybe
>>>>>>>>>>>>>>>>> thredds_aggregation_services or any other which was changed
>>>>>>>>>>>>>>>>> or you think it might be important)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The main question here is: how can different local directory
>>>>>>>>>>>>>>>>> structures be published to the same DRS structure?
>>>>>>>>>>>>>>>>> The example scenario in our case will be:
>>>>>>>>>>>>>>>>> /replicated/<DRS structure>  - for replicated data
>>>>>>>>>>>>>>>>> /local/<DRS structure>  - for non-replicated data held on disk
>>>>>>>>>>>>>>>>> /cache/<DRS structure>  - for data staged from an HPSS system
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The only solution I can think of is to extend the URL before
>>>>>>>>>>>>>>>>> the DRS structure starts (the URL won't be 100% DRS conform
>>>>>>>>>>>>>>>>> anyway). So
>>>>>>>>>>>>>>>>>   http://server/thredds/fileserver/<DRS structure>
>>>>>>>>>>>>>>>>> will turn into
>>>>>>>>>>>>>>>>>   http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>>>>>>>>>>>   http://server/thredds/fileserver/local/<DRS structure>
>>>>>>>>>>>>>>>>>   http://server/thredds/fileserver/cache/<DRS structure>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is that viable? Are there any other options?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Estani
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To illustrate how the ESG datanode can be configured to
>>>>>>>>>>>>>>>>>> serve data for CMIP5 we have deployed a datanode containing
>>>>>>>>>>>>>>>>>> a subset of CMIP3 in the Data Reference Syntax. Some key
>>>>>>>>>>>>>>>>>> features of this deployment are:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>   * The underlying directory structure is based on the Data
>>>>>>>>>>>>>>>>>>     Reference Syntax.
>>>>>>>>>>>>>>>>>>   * Datasets published at the realm level.
>>>>>>>>>>>>>>>>>>   * The token-based security filter is replaced by the
>>>>>>>>>>>>>>>>>>     OpenidRelyingParty security filter.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Further notes can be found at
>>>>>>>>>>>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This test deployment should be of interest to anyone
>>>>>>>>>>>>>>>>>> wanting to know how DRS identifiers could be exposed in
>>>>>>>>>>>>>>>>>> THREDDS catalogues and the TDS HTML interface.  You can
>>>>>>>>>>>>>>>>>> also try downloading files with OpenID authentication or
>>>>>>>>>>>>>>>>>> via wget with SSL-client certificate authentication.  See
>>>>>>>>>>>>>>>>>> the link above for details.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Stephen.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>>>>>>>>>>>>> British Atmospheric Data Centre
>>>>>>>>>>>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----------------------------------------------------------
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Estanislao Gonzalez
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>>>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Phone:   +49 (40) 46 00 94-126
>>>>>>>>>>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Gavin M. Bell
>>>>>>>>>>> Lawrence Livermore National Labs
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> "Never mistake a clear view for a short distance."
>>>>>>>>>>>                  -Paul Saffo
>>>>>>>>>>>
>>>>>>>>>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>>>>>>>>
>>>>>>>>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Estanislao Gonzalez
>>>>>>
>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>
>>>>>> Phone:   +49 (40) 46 00 94-126
>>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>>
>>>>>> _______________________________________________
>>>>>> is-enes-sa2-jra4 mailing list
>>>>>> is-enes-sa2-jra4 at lists.enes.org
>>>>>> https://lists.enes.org/mailman/listinfo/is-enes-sa2-jra4
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Estanislao Gonzalez
>>>
>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>
>>> Phone:   +49 (40) 46 00 94-126
>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>
>>>
>>>
>>>
>>>
>>>
> 
> 
> 

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E

