[Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Tue Jul 6 09:12:02 MDT 2010


Hello Estani,

I am sure that replication will involve no splitting of "ESG published units" -- this was a firm decision, taken in light of the fact that the distributed archive design has no way of coping with such a split. 

I've included Karl in the address list, because I am less sure of the next statement: I think that, in light of the above decision, the phrase you quote from "standard_output.xls" means that the relevant "low" priority variables will not be stored in the "requested" branch of the DRS -- they will be stored in the "output" branch. Only ESG published units from the "requested" branch will be replicated. 

Thanks for the clarification about publication of tape data. 

cheers,
Martin


-----Original Message-----
From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
Sent: Tue 06/07/2010 15:57
To: Juckes, Martin (STFC,RAL,SSTD)
Cc: Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuringadatanode to serve CMIP3-DRS
 
Hello Martin,

It wasn't clear to me that the non-CMIP5 AR5 data would be stored
somewhere else, hence the confusion; sorry for that.

Are you sure the replication involves no splitting of "ESG published
units"? What, then, does this mean (from standard_output.xls):
""all*" indicates that although all years will be included in the
"replicated" subset, only the high and medium priority variables will be
included in the replicated subset."

Does this not imply that some variables in the ESG published unit will
not get replicated?

And, to sum the procedure up: data destined for tape will reside on disk
before being published, then be moved to tape and retrieved on demand
through a filter when someone requests it.
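
To make this concrete, here is a rough sketch of what such a staging filter could look like, assuming a servlet-based front end; the class name, the init parameter and the stageFromTape() placeholder are purely illustrative, not our actual implementation:

import java.io.File;
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Illustration only: intercept file requests and stage the file back from
// tape to its published disk location before letting the TDS serve it.
public class TapeStagingFilter implements Filter {

    private String diskRoot;   // root of the published DRS tree on disk

    public void init(FilterConfig config) throws ServletException {
        diskRoot = config.getInitParameter("diskRoot");
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String path = ((HttpServletRequest) req).getPathInfo();
        File onDisk = new File(diskRoot, path);
        if (!onDisk.exists()) {
            // Block until the file is back on disk; a real implementation
            // would queue requests, time out, and report staging errors.
            stageFromTape(path, onDisk);
        }
        chain.doFilter(req, res);   // the TDS now finds the file on disk as usual
    }

    private void stageFromTape(String path, File target) {
        // placeholder for the call into the HPSS/tape system
    }

    public void destroy() { }
}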

Regards,
Estani



martin.juckes at stfc.ac.uk wrote:
> Hello Estanislao,
>
> If, by "AR5 data" you mean data from the climate models contributing to CMIP5 which will be assigned a DOI and designated as part of the IPCC AR5 archive, then this will definitely be a subset of the CMIP5 archive. There may be additional AR5 data (observational datasets, for example), but these will clearly not be stored using the CMIP5 DRS.
>
> CMOR2 does not check the DRS structure -- my point was that any CMOR2 compliant file can easily be assigned a place in the DRS directory structure. So long as we restrict attention to CMOR2 compliant files on disk, I don't think there is any need to go beyond the DRS directory structure. Seeing your email below, this point is perhaps not relevant to the problems you are raising.
>
> By "ESG published unit" I mean a collection of files published as a single unit in the ESG data node. After lengthy discussion it was decided that, for CMIP5, the publication would be done at the "realm level", meaning that all files for given institute/model/experiment/output frequency/realm will be published as a single unit (I think you are right, that this corresponds to a single thredds id, but I'm not sure of the implementation details). The point relevant to our discussion is that the replicated portion of the archive will not involve splitting any of these units.  
>
> Finally, concerning the CMOR2 compliant data which you want to store on tape: my understanding is that, at present, the ESG data node is not able to scan such files in the way it scans files on disk, and so publication in a way which is integrated with the disk portion of the archive is not possible at present. I agree that it would be good to resolve this issue, but I think it needs to be done on a longer time frame -- as you suggest below. In terms of IS-ENES objectives, we should certainly identify a way of resolving this before the project is finished,
>
> cheers,
> Martin
>
> -----Original Message-----
> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
> Sent: Tue 06/07/2010 14:11
> To: Juckes, Martin (STFC,RAL,SSTD)
> Cc: Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuringadatanode to serve CMIP3-DRS
>  
> Hi Martin,
>
> I just got your email, so I'll try to answer briefly.
>
> martin.juckes at stfc.ac.uk wrote:
>   
>> Hello Estanislao, Gavin, 
>>
>> There is a key part of your problem I don't understand -- what do you mean by "non CMIP5 data"?
>>   
>>     
> Well, I don't want to talk about things I'm not sure about... but I think
> non-CMIP5 refers to AR5 data, for example: data not required by CMIP5 but
> which will still get published (in the DOI sense). If someone notices
> I'm wrong, please correct me.
>   
>> Before going into the ESGF CMIP5 archive, all files will be CMOR2 compliant. This means that they fit in the "requested" or "product" categories of the DRS. The data to be replicated will be a subset of the "ESG published units" (also known as realm level datasets) in the "requested" category. 
>>   
>>     
> CMOR2 checks the DRS structure? I think CMOR2 creates it, but there
> is also a CMOR checker for data not produced by CMOR2 (maybe I got
> that wrong?). In any case, are you implying that product and requested
> are mutually exclusive and that replicated is a subset of requested?
> (Just to be sure.) I'm not sure, though, what an "ESG published unit" (aka
> realm-level dataset) is. Is it the TDS dataset_id (the DRS without
> variable_name and ensemble)?
>   
>> There has been an agreement that the ESGF CMIP5 archive would be run on disk, and so it is not surprising that the infrastructure does not support tape storage. I can see that something along the lines Gavin describes would resolve the problems with tape storage, but we need to get the disk based system working as the first priority. 
>>   
>>     
> AFAIK the agreement was for what we call the "core" data (which I think
> is the requested data and, if it's not the same thing, includes the
> replicated data). So there will be no problem with the replication.
>   
>> Stephen raises the issue of replication and this is relevant, since straight disk to disk copies (i.e. to an external hard drive which can be posted) is a vital aspect of the replication plan. For the time being, this requires people to stick to the DRS directory structure. 
>>
> Within CMIP5 the data from different institutions is clearly separated at the institution directory level, so I can't see why there should be any confusion here.
>>
>> For non-CMIP5 data -- why would you want to describe it with the CMIP5 DRS?
>>   
>>     
> Indeed we could do whatever we want, but I still think adhering to some
> kind of standard will help. As I said, though, there will be no problem, as
> the data will have its own DRS structure starting with something other
> than CMIP5. This data therefore presents no problem.
> The only open issue is the differentiation between replicated and
> non-replicated data. And that's probably only important to us, as the
> non-replicated data will be (at least partially) held on tape.
>
> So I don't see a requirement for changing anything in the
> current version. We will implement a staging filter (which is almost
> done) to transparently serve the files held on tape.
>
> Still, I see a need to discuss this for future CMIP iterations, as I
> think at some point we won't be able to hold all files on disk. The
> cost of keeping all of it promptly available exceeds, in my opinion,
> its benefit. But that's probably only me :-)
>
> Regards,
> Estani
>   
>> cheers,
>> Martin
>>
>>
>>
>> -----Original Message-----
>> From: is-enes-sa2-jra4-bounces at lists.enes.org on behalf of stephen.pascoe at stfc.ac.uk
>> Sent: Tue 06/07/2010 12:13
>> To: estanislao.gonzalez at zmaw.de; gavin at llnl.gov
>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuringadatanode to serve CMIP3-DRS
>>  
>>
>>  
>> Hi Estanislao,
>>
>>   
>>     
>>> * The only true problem is to differentiate between core and non-core data (which as far as I know is a file issue rather than a dataset one, 
>>> i.e. some datasets contain both core and non-core data)
>>>     
>>>       
>> I'm not sure you were involved then but we had lengthy discussions last year on how we would deal with the separation of requested and non-requested data (Karl discourages the term "core").  There is a fundamental problem that the DRS vocabularies don't cleanly map onto what is requested and not requested.  The outcome was to introduce the DRS component "product" to divide the two.  If you are interested take a look at the following threads:
>>
>> http://mailman.ucar.edu/pipermail/go-essp-tech/2010-January/000335.html
>> http://mailman.ucar.edu/pipermail/go-essp-tech/2009-December/000255.html
>>
>> There hasn't been much discussion of how we identify and manage requested data since then and the nitty-gritty details still aren't fixed.  This is going to be a challenge when we come to replicate.
>>
>> S.
>>
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>
>> -----Original Message-----
>> From: is-enes-sa2-jra4-bounces at lists.enes.org [mailto:is-enes-sa2-jra4-bounces at lists.enes.org] On Behalf Of Estanislao Gonzalez
>> Sent: 06 July 2010 11:18
>> To: Gavin M Bell
>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuring adatanode to serve CMIP3-DRS
>>
>> Hi people,
>>
>> well, I think we do require something like this (at least at the major data nodes where data will get replicated). Managing all data mixed up under one single directory is not a very neat solution for the data administrator. In our particular case we will be publishing many (much? :-) data from different institutions and even of different types (not only CMIP5).
>> And we shouldn't forget about the replicated data (is that === core?): how can we tell which data needs to be replicated? By maintaining a second "catalog" in a DB? I think that by maintaining a separate filesystem a simple rsync will do the job (after the very first replication, of course).
>> In any case, the fact that we at DKRZ cannot hold all CMIP5 data on disk (yes, the core data we can :-) implies that we will have to maintain a cache somewhere, and mixing this cache with the core data is something we should probably avoid.
>>
>> Gavin's solution, if I got it right, has a major problem. The catalogs will be created pointing to the real files (e.g. .../core/CMIP5), so that the filter can alter the request from the DRS query
>> (../CMIP5/<core_data>) to the real one and thus allow the TDS to work as usual. This leaves the catalogs unaltered, and thereby also the harvested data, which will reference not the mapped DRS structure but the real one. Or did I miss something here?
>>
>> I have already tried several possible solutions, without any success at all:
>> 1) Setting multiple datasetRoot entries is not allowed.
>> 2) Altering the TDS to accept multiple datasetRoot entries and look in all of them one after the other until something matches is almost impossible (in the time we have ahead; the mere architecture of the TDS is, in my opinion, a mess).
>> 3) In general, altering the TDS is not a "nice" solution.
>> 4) Filtering the request breaks the coherence between the catalogs and the DRS "virtual" structure (the catalogs have no information whatsoever that a second link to the files exists).
>>
>> The only viable solution I can think of (and it remains to be seen whether it's really viable) is to maintain the files somewhere else and link them into the "central" DRS filesystem before they are published.
>>
>> After discussing this with Stephan we came up with something I'd like to sum up here:
>> * All non-CMIP5 data can be mapped to a DRS structure "not" starting with CMIP5, so it can easily be mapped to somewhere else (the TDS allows that).
>> * The only true problem is to differentiate between core and non-core data (which as far as I know is a file issue rather than a dataset one, i.e. some datasets contain both core and non-core data).
>> * The replication can rely on external sources for differentiating this, e.g. a DB.
>> * The cached non-core data can coexist, in the worst-case scenario, with the core data by removing the write permissions of the latter (besides the security that this implies, it will be used as a flag in case the server is restarted: all non-flagged (write-enabled) files will be treated as leftovers from the stopped cache and will continue to be served; see the sketch below).
>>
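>> To make the restart check concrete, here is a rough sketch (the class name is invented and the eviction policy is left open; this is just an illustration, not our actual implementation):
>>
>> import java.io.File;
>>
>> // Illustration only: after a restart, write-protected files are treated as
>> // permanent (core/replicated) data, while writable files are cache leftovers
>> // that stay servable until the normal cache policy evicts them.
>> public class CacheRestartScan {
>>     public static void scan(File dir) {
>>         File[] entries = dir.listFiles();
>>         if (entries == null) return;
>>         for (File f : entries) {
>>             if (f.isDirectory()) {
>>                 scan(f);
>>             } else if (f.canWrite()) {
>>                 // write-enabled => not flagged as core: a cache leftover
>>                 System.out.println("cache leftover, still served: " + f);
>>             }
>>         }
>>     }
>> }
>>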
>> So we might get away without performing any major changes. But this is something we should definitely discuss before the next iteration :-)
>>
>> I hope this sheds some light on the matter... sorry for the lengthy mail...
>>
>> Regards,
>> Estani
>>
>> Gavin M Bell wrote:
>>   
>>     
>>> Martin,
>>>
>>> The saving is that the data provider / data-node admin doesn't have
>>> to do any additional work, whether that be providing a filesystem <-> DRS
>>> mapping or (re)arranging their file system.  In the current state of
>>> things all the salient information is already in the database created
>>> as a result of the publisher [software] scan.  I think it would be
>>> prudent to use that information to the benefit of our end users
>>> instead of imposing a DRS directory structure requirement for ESG participation.
>>>
>>> You said:
>>> "Remember that not having to configure the file system is only a real 
>>> saving if the alternative (configuring the file system to URL mapping) 
>>> is actually easier than configuring the file system."
>>>
>>> I am saying:
>>> The 'alternative' you describe does not exist, because there is no
>>> "configuring the file system to URL mapping" necessary... unless the
>>> end-user wants there to be, in which case we, as dutiful programmers,
>>> provide that opportunity.  This is what my code sketch was
>>> illustrating with the property "drs.resolve.strategy" and the use of
>>> a factory and strategy pattern - for which we will set a default that
>>> requires them to do *no additional work*.  The data-node admin won't
>>> have to do any actual setup outside of running an "esg-node --update".
>>> The upgrade/update process (determined by the esg-node install script)
>>> will install the filter without them having to do anything additional.
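>>>
>>> To make the factory/strategy idea concrete, here is a rough sketch along those lines (the class names and the property values are invented for illustration; this is not the code from my attachment):
>>>
>>> import java.util.Properties;
>>>
>>> // Illustrative sketch: the "drs.resolve.strategy" property selects a
>>> // resolver implementation; the default needs no setup from the admin.
>>> public class DRSResolverFactory {
>>>
>>>     /** Maps a DRS query string to a path the TDS can actually serve. */
>>>     public interface DRSResolver {
>>>         String resolve(String drsQuery);
>>>     }
>>>
>>>     public static DRSResolver create(Properties props) {
>>>         String strategy = props.getProperty("drs.resolve.strategy", "publisher-db");
>>>         if ("identity".equals(strategy)) {
>>>             // null transform: the filesystem already follows the DRS layout
>>>             return query -> query;
>>>         }
>>>         // default: consult the publisher database (the real intelligence);
>>>         // stubbed here because the schema details live with the publisher
>>>         return query -> lookUpInPublisherDatabase(query);
>>>     }
>>>
>>>     private static String lookUpInPublisherDatabase(String query) {
>>>         // placeholder for the SQL lookup against the publisher's tables
>>>         return query;
>>>     }
>>> }
>>>
>>> The 'identity' case is the null transform: nodes whose filesystems already follow the DRS keep working unchanged.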
>>>
>>> Indeed, the code I posted was a quick and dirty filter code sketch 
>>> demonstrating that putting a filter in place is easy. Yes, the 
>>> resolution work would be done in the code that I only alluded to, the 
>>> "DRSResolver". Current duties preclude me from actually implementing 
>>> this issue outright, today, for this email conversation. However, if 
>>> we all conclude that it is worthwhile then I or someone else could 
>>> make it happen.
>>>
>>> I hope I have done a better job of making my point clear: that we
>>> can free our end-users of this DRS directory structure requirement
>>> while allowing the DRS itself to be more flexible with its representation.
>>> Also that the mechanism I described does not preclude anyone from
>>> setting up their filesystem to follow the DRS structure; we get that
>>> for free! :-)
>>>
>>> I am glad that we do indeed agree that the effort to bring this to
>>> fruition can and should be done in a way that does not impede or
>>> distract from the current deliverable path.
>>>
>>> Thanks.
>>>
>>>
>>> martin.juckes at stfc.ac.uk wrote:
>>>   
>>>     
>>>       
>>>> Er... the attachment you sent didn't actually do any mapping. But I'm 
>>>> sure it could be done. The extra work I'm talking about is the same 
>>>> as the extra work you talk about at the end of your mail, so I'm 
>>>> going to ignore your suggestion at the start of your email that there 
>>>> isn't any,
>>>>
>>>> cheers,
>>>> Martin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>> Sent: Mon 05/07/2010 21:37
>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu; 
>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring 
>>>> adatanode to serve CMIP3-DRS
>>>>  
>>>> Hi Martin,
>>>>
>>>> With regards to the savings... One, perhaps default, setup is not 
>>>> having the data provider do anything additional at all with respect 
>>>> to configuration or setup.  They simply use the publisher to scan 
>>>> their files into the system, something that must be done in all 
>>>> cases... (so we can normalize that out). With that said, they would 
>>>> not have to do
>>>> *any* additional work.  No work is easier than some work, regardless 
>>>> of how easy ;-).
>>>>
>>>> I have attached the filter code that would almost do it.  The real
>>>> intelligence would be in the "DRSResolver" object to do the resolution.
>>>> I would have sketched out that class as well, but that would be
>>>> tantamount to completing this task... and to finish it off I would
>>>> have to confer with Bob on the publisher database, and we would all
>>>> have to settle on the DRS query syntax.
>>>> With a DRS URL query scheme we could wrap this up quite directly.
>>>>
>>>> The DRSResolver would:
>>>> - parse the request URL (the query) and pull out the salient parts;
>>>> - fashion those parts into a SQL query against the publisher database;
>>>> - return the THREDDS-root-based URL to the rest of the processing
>>>> stream. If the query cannot be resolved, punt: return the same
>>>> input string as the output and let some other part of the processing
>>>> stream report an error.
>>>>
>>>> Because all the metadata is pulled out in the publisher's scan, file 
>>>> system placement of the scanned files is moot.
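>>>>
>>>> A rough sketch of that resolution logic could look like the following (the
>>>> table and column names here are invented and would have to match the real
>>>> publisher schema; again, this is an illustration, not the attached code):
>>>>
>>>> import java.sql.Connection;
>>>> import java.sql.PreparedStatement;
>>>> import java.sql.ResultSet;
>>>>
>>>> // Illustration only: resolve a DRS-style request URL to the URL/path the
>>>> // TDS already knows about, using the publisher database.
>>>> public class DRSResolver {
>>>>
>>>>     private final Connection db;   // connection to the publisher database
>>>>
>>>>     public DRSResolver(Connection db) {
>>>>         this.db = db;
>>>>     }
>>>>
>>>>     public String resolve(String requestUrl) {
>>>>         try {
>>>>             // 1. parse the request URL (the query) and pull out the salient
>>>>             //    part (the "/drs/" prefix here is hypothetical)
>>>>             String drsId = requestUrl.replaceFirst(".*/drs/", "");
>>>>
>>>>             // 2. fashion that into a SQL query against the publisher database
>>>>             PreparedStatement stmt = db.prepareStatement(
>>>>                 "SELECT tds_url FROM file_location WHERE drs_id = ?");
>>>>             stmt.setString(1, drsId);
>>>>             ResultSet rs = stmt.executeQuery();
>>>>
>>>>             // 3. return the THREDDS-root-based URL to the processing stream
>>>>             if (rs.next()) {
>>>>                 return rs.getString("tds_url");
>>>>             }
>>>>         } catch (Exception e) {
>>>>             // fall through and punt
>>>>         }
>>>>         // unresolved: return the input unchanged and let a later stage
>>>>         // report the error
>>>>         return requestUrl;
>>>>     }
>>>> }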
>>>>
>>>> In the code I attached, I leave room for the data-node user to select
>>>> their own implementation of the resolver following a factory/strategy
>>>> pattern.  At that point we do indeed allow end users to do 'work' by
>>>> doing their own mappings.  Perhaps we integrate a few canned mapping
>>>> schemes etc... We can be arbitrarily clever with these kinds of
>>>> things, of course. :-)
>>>>
>>>> P.S.
>>>> The DRSResolver logic would/should be ported to all ingress request 
>>>> streams.  Also the published catalogs would be published with the DRS 
>>>> query syntax scheme as the canonical name of the resource - something 
>>>> the search facility would use to identify the resource.
>>>>
>>>> done.
>>>>
>>>>
>>>>
>>>>
>>>> martin.juckes at stfc.ac.uk wrote:
>>>>     
>>>>       
>>>>         
>>>>> Hi Gavin,
>>>>>
>>>>> I'm not convinced about the connection to Estanislao's email, but 
>>>>> the idea of thinking about the next step while implementing the 
>>>>> current system is certainly a good one. Remember that not having to 
>>>>> configure the file system is only a real saving if the alternative 
>>>>> (configuring the file system to URL mapping) is actually easier than 
>>>>> configuring the file system. Setting up the DRS is not difficult,
>>>>>
>>>>> cheers,
>>>>> Martin
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>> Sent: Mon 05/07/2010 19:45
>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu; 
>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of 
>>>>> configuring adatanode to serve CMIP3-DRS
>>>>>  
>>>>> Martin and friends,
>>>>>
>>>>> This is false economy.  Two things.  First, implementing this is not
>>>>> hard.  Secondly, implementing this will resolve the issues w.r.t. the
>>>>> incongruence between the DRS and the filesystem that Estanislao's email
>>>>> illuminated.  So it seems to me that the alternative is to keep fitting
>>>>> this square DRS peg into the round file system hole.  That would
>>>>> mean having to do a whole other set of gymnastics to get the DRS <->
>>>>> file system beast tamed.  There is work to be done either way
>>>>> because things are not ready to go as they stand. I suggest we fix
>>>>> the problem at the root, now, not "later".  Essentially the current
>>>>> course requires the data providers to jump through file system
>>>>> layout hoops.  I am of the opinion that we should "require" as
>>>>> little as possible from our users, especially something like this... it hurts adoption IMHO.
>>>>>
>>>>> Actually, let me frame this differently.  How about we fork efforts
>>>>> and have some folks think about what the *query* URL should be for
>>>>> the functionality I suggested, while others continue on the current
>>>>> path?  When the former development is ripe I update the install
>>>>> script and have it installed upon the clients' next install
>>>>> automagically, with no slowdown for anyone.  The null transform would be
>>>>> equivalent to what we have now, so we would be backward compatible
>>>>> for folks who have done the task of making their file systems congruent to the DRS.  Fair enough?
>>>>>
>>>>> Sound good?
>>>>>
>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>       
>>>>>         
>>>>>           
>>>>>> Hello Gavin, Bob,
>>>>>>
>>>>>> I agree that this is a good idea in principle, but I think it is a
>>>>>> bad idea now. The thing about "now" is that we want to deploy and
>>>>>> test the system we have agreed on. We want to do it now because
>>>>>> modelling centres have supercomputers running and churning out vast
>>>>>> volumes of data, there are thousands of scientists waiting to get
>>>>>> at it, and we have the job of installing a system to distribute it.
>>>>>> It is, I think, a bad time to start implementing changes in the
>>>>>> system design. Sorry if this sounds a bit harsh, but impending
>>>>>> deadlines make me nervous,
>>>>>>
>>>>>> cheers,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
>>>>>> Sent: Mon 05/07/2010 19:18
>>>>>> To: Gavin M Bell
>>>>>> Cc: go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; Charles 
>>>>>> Doutriaux
>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of 
>>>>>> configuring adatanode to serve CMIP3-DRS
>>>>>>  
>>>>>> Hi Gavin,
>>>>>>
>>>>>> I agree completely. Having a regularized DRS syntax is a very good 
>>>>>> idea, but to implement it we will need to introduce a level of 
>>>>>> indirection between the DRS URL (your 'query') and the underlying 
>>>>>> filesystem. Separating these two concerns will have a very 
>>>>>> important
>>>>>> benefit: it will allow the data node managers to organize their 
>>>>>> filesystems as they see fit.
>>>>>>
>>>>>> Bob
>>>>>>
>>>>>> On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:
>>>>>>
>>>>>>         
>>>>>>           
>>>>>>             
>>>>>>> Hello gentle-people,
>>>>>>>
>>>>>>> Here is my two cents on this whole DRS business.  I think that the 
>>>>>>> fundamental issue to all of this is the ability to do resource 
>>>>>>> resolution (lookup).  The issue of having urls match a DRS 
>>>>>>> structure that matches the filesystem is a red herring (IMHO).  
>>>>>>> The basic issue is to be able to issue a query to the system such 
>>>>>>> that you find what you are looking for.  This query mechanism 
>>>>>>> should be a separate mechanism from filesystem correspondence.  The 
>>>>>>> driving issue behind the file system correspondence push is so 
>>>>>>> that people and/or applications can infer the location of 
>>>>>>> resources in some regimented way.  The true heart of the issue is 
>>>>>>> not with the file system.  The heart of the issue is to perform a 
>>>>>>> query such that you provide resource resolution.  The file system 
>>>>>>> is a familiar mechanism but it isn't the only one.  The file 
>>>>>>> system takes a query (the file system path) and returns the 
>>>>>>> resource to us (the bits sitting at an inode location somewhere 
>>>>>>> that is memory mapped to some physical platter and spindle 
>>>>>>> location, that is mapped to the file system path).  We are 
>>>>>>> overloading the file system query mechanism when it is not 
>>>>>>> necessary.
>>>>>>>
>>>>>>> I propose the following:  We create a *filter* and a small 
>>>>>>> database (the latter we already have in the publisher).  We send a 
>>>>>>> *query* to the web server; the web server *filter* intercepts that 
>>>>>>> *query*, resolves it to the actual resource location using the 
>>>>>>> database, and returns the resource you want.  Implementing this in 
>>>>>>> a filter divorces the query structure from the file system 
>>>>>>> structure.  The use of the database (that is generated by the 
>>>>>>> publisher when it scans) provides the resolution.
>>>>>>> With this mechanism in place, WGET, as well as any other URL based 
>>>>>>> tool will be able to fetch the data as intended.
>>>>>>>
>>>>>>> BTW: The "query" is whatever we make it up to be... (not a 
>>>>>>> reference to SQL query).
>>>>>>>
>>>>>>> This gives the data-node admin the ability to put their files 
>>>>>>> wherever they want.  If they move files around and so on, they 
>>>>>>> just have to rescan with the publisher.  The issues around design 
>>>>>>> and efficiency can be addressed with varying degrees of cleverness.
>>>>>>>
>>>>>>> I welcome any thoughts on this issue... Please talk me down :-). I 
>>>>>>> think it is about time we put this DRS issue to bed.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Estanislao Gonzalez wrote:
>>>>>>>           
>>>>>>>             
>>>>>>>               
>>>>>>>> Hi Bob,
>>>>>>>>
>>>>>>>> I guess you must be on vacation now. Anyway, here's the 
>>>>>>>> question, maybe someone else can answer it:
>>>>>>>>
>>>>>>>> The very first idea I had was almost what you proposed. Your
>>>>>>>> proposal, though, leaves URLs of the form:
>>>>>>>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>>>>>>>                                    <--- (almost) DRS structure --->
>>>>>>>>
>>>>>>>> which does not have a valid DRS structure (neither CMIP5_replicas
>>>>>>>> nor CMIP5_core is in the DRS vocabulary).
>>>>>>>>
>>>>>>>> My proposal has a very similar flaw:
>>>>>>>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>>>>>>>                                               <--- full DRS structure --->
>>>>>>>>
>>>>>>>> The DRS structure is preserved, but you cannot easily infer the correct
>>>>>>>> URL for any dataset. I think the idea is: if you know the prefix
>>>>>>>> (http.../fileserver/) and the dataset's DRS name, you can always get
>>>>>>>> the file without even browsing the TDS:
>>>>>>>> prefix + DRS = URL to file
>>>>>>>>
>>>>>>>> AFAIK the URL structure used by the TDS will never be 100% DRS
>>>>>>>> conformant. According to DRS version 0.27, the DRS URL has the form:
>>>>>>>> http://<hostname>/<activity>/<product>/<institute>/<model>/
>>>>>>>>   <experiment>/<frequency>/<modeling realm>/<variable identifier>/
>>>>>>>>   <ensemble member>/<version>/[<endpoint>],
>>>>>>>>
>>>>>>>> whereas the TDS one has the endpoint moved to the front (the
>>>>>>>> thredds/fileserver, thredds/dodsC, etc. parts).
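>>>>>>>>
>>>>>>>> In code, the "prefix + DRS = URL" idea amounts to something like the
>>>>>>>> following sketch (the dataset components in the example are
>>>>>>>> hypothetical, and the prefix is only an example):
>>>>>>>>
>>>>>>>> // Illustration only: build the download URL from a service prefix plus
>>>>>>>> // the DRS components, in the order given by DRS v0.27 (with the
>>>>>>>> // endpoint, e.g. thredds/fileserver, moved to the front as the prefix).
>>>>>>>> public class DrsUrl {
>>>>>>>>     public static String fileUrl(String prefix, String... drsComponents) {
>>>>>>>>         StringBuilder url = new StringBuilder(prefix);
>>>>>>>>         for (String component : drsComponents) {
>>>>>>>>             url.append(component).append('/');
>>>>>>>>         }
>>>>>>>>         return url.toString();
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     public static void main(String[] args) {
>>>>>>>>         // hypothetical dataset components, for illustration only
>>>>>>>>         System.out.println(fileUrl("http://myserver/thredds/fileserver/",
>>>>>>>>             "CMIP5", "output", "SOME-INSTITUTE", "SOME-MODEL", "historical",
>>>>>>>>             "mon", "atmos", "tas", "r1i1p1", "v1"));
>>>>>>>>     }
>>>>>>>> }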
>>>>>>>>
>>>>>>>> To sum things up:
>>>>>>>> Is it possible to publish files from different directory
>>>>>>>> structures into a unified URL structure so that it is completely
>>>>>>>> transparent to the user?
>>>>>>>> Am I the only one addressing this problem? Are all other
>>>>>>>> institutions planning to publish all files from only one directory?
>>>>>>>>
>>>>>>>> The only viable solution I can think of is to rely on Stephen's
>>>>>>>> versioning concept and maintain a single true DRS structure
>>>>>>>> with links to files kept in other, more manageable directory
>>>>>>>> structures (this will probably involve adapting Stephen's tool).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Estani
>>>>>>>>
>>>>>>>>
>>>>>>>> Bob Drach wrote:
>>>>>>>>             
>>>>>>>>               
>>>>>>>>                 
>>>>>>>>> Hi Estani,
>>>>>>>>>
>>>>>>>>> It should be possible to do what you want without running 
>>>>>>>>> multiple data nodes.
>>>>>>>>>
>>>>>>>>> The purpose of the THREDDS dataset roots is to hide the 
>>>>>>>>> directory structure from the end user, and to limit what the TDS can
>>>>>>>>> access. But THREDDS can certainly have multiple dataset roots.
>>>>>>>>>
>>>>>>>>> In your example below, you should associate different paths with 
>>>>>>>>> the locations, for example:
>>>>>>>>>
>>>>>>>>>               
>>>>>>>>>                 
>>>>>>>>>                   
>>>>>>>>>> <datasetRoot path="CMIP5_replicas" 
>>>>>>>>>> location="/replicated/CMIP5"/> <datasetRoot path="CMIP5_core" 
>>>>>>>>>> location="/core/CMIP5"/>
>>>>>>>>>>                 
>>>>>>>>>>                   
>>>>>>>>>>                     
>>>>>>>>> Also be aware that in the publisher configuration:
>>>>>>>>>
>>>>>>>>> - the directory_format can have multiple values, separated by 
>>>>>>>>> vertical bars (|). The publisher will use the first format that 
>>>>>>>>> matches the directory structure being scanned.
>>>>>>>>>
>>>>>>>>> - a useful strategy is to create different project sections for 
>>>>>>>>> various groups of directives. You could define a cmip5_replica 
>>>>>>>>> project, a cmip5_core project, etc.
>>>>>>>>>
>>>>>>>>> Bob
>>>>>>>>>
>>>>>>>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>>>>>>>
>>>>>>>>>               
>>>>>>>>>                 
>>>>>>>>>                   
>>>>>>>>>> Hi Bryan,
>>>>>>>>>>
>>>>>>>>>> thanks for your answer!
>>>>>>>>>> Running multiple ESG data nodes is always a possibility, but it
>>>>>>>>>> seems like overkill to us, as we may have several different "data
>>>>>>>>>> repositories". We would like to separate: core-replicated,
>>>>>>>>>> core-non-replicated, non-core, non-core-on-HPSS, as well as other
>>>>>>>>>> non-CMIP5 data. Having 5+ ESG data nodes is not viable in our scenario.
>>>>>>>>>>
>>>>>>>>>> The TDS allows the separation of the access URL from the underlying
>>>>>>>>>> file structure, so it might be possible. AFAIK the
>>>>>>>>>> publisher does not provide a simple way of doing this.
>>>>>>>>>>
>>>>>>>>>> Setting thredds_dataset_roots to different values while
>>>>>>>>>> publishing doesn't appear to work, as those are mapped to
>>>>>>>>>> map entries at the catalog root:
>>>>>>>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>>>>>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> which is clearly not bijective and therefore can't be reversed
>>>>>>>>>> to locate the file from a given URL.
>>>>>>>>>>
>>>>>>>>>> While publishing, all referenced data will be held in a known
>>>>>>>>>> location.
>>>>>>>>>> Is it possible to somehow use this information to set up a
>>>>>>>>>> proper catalog configuration so that the URL can be properly
>>>>>>>>>> mapped? At least at the dataset level?
>>>>>>>>>>
>>>>>>>>>> The whole HPSS staging procedure should be completely 
>>>>>>>>>> transparent to the user, as well as the location of the files. 
>>>>>>>>>> I was just looking at other options in case we cannot publish 
>>>>>>>>>> them the way we want...
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Estani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Bryan Lawrence wrote:
>>>>>>>>>>                 
>>>>>>>>>>                   
>>>>>>>>>>                     
>>>>>>>>>>> sorry.
>>>>>>>>>>>
>>>>>>>>>>> the first sentence should have read
>>>>>>>>>>>
>>>>>>>>>>> Just to note that *our* approach to the local versus 
>>>>>>>>>>> replication issue will be ...
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Bryan
>>>>>>>>>>>
>>>>>>>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>                     
>>>>>>>>>>>                       
>>>>>>>>>>>> Hi Estani
>>>>>>>>>>>>
>>>>>>>>>>>> Just to note that your approach to the local versus 
>>>>>>>>>>>> replication will be to run two different ESG nodes ... which 
>>>>>>>>>>>> is in fact the desired outcome so as to get the right things 
>>>>>>>>>>>> in the catalogues at the right time (vis-à-vis QC etc.).
>>>>>>>>>>>>
>>>>>>>>>>>> The issue with respect to the cache I'm not so sure about: in
>>>>>>>>>>>> what way do you want to expose that in ESG?
>>>>>>>>>>>>
>>>>>>>>>>>> Bryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>                     
>>>>>>>>>>>>                       
>>>>>>>>>>>>                         
>>>>>>>>>>>>> Hi Stephen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm also interested in some variables of the DEFAULT section
>>>>>>>>>>>>> of the esg.ini configuration file. More specifically:
>>>>>>>>>>>>> thredds_dataset_roots (and maybe
>>>>>>>>>>>>> thredds_aggregation_services, or any other that was changed
>>>>>>>>>>>>> or that you think might be important).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main question here is: how can different local directory 
>>>>>>>>>>>>> structures be published to the same DRS structure?
>>>>>>>>>>>>> The example scenario in our case will be:
>>>>>>>>>>>>> /replicated/<DRS structure> - for replicated data
>>>>>>>>>>>>> /local/<DRS structure> - for non-replicated data held on disk
>>>>>>>>>>>>> /cache/<DRS structure> - for data staged from an HPSS system
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only solution I can think of is to extend the URL before
>>>>>>>>>>>>> the DRS structure starts (the URL won't be 100% DRS-conformant
>>>>>>>>>>>>> anyway). So
>>>>>>>>>>>>>   http://server/thredds/fileserver/<DRS structure>
>>>>>>>>>>>>> will turn into
>>>>>>>>>>>>>   http://server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>>>>>>>   http://server/thredds/fileserver/local/<DRS structure>
>>>>>>>>>>>>>   http://server/thredds/fileserver/cache/<DRS structure>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is that viable? Are there any other options?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Estani
>>>>>>>>>>>>>
>>>>>>>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>                       
>>>>>>>>>>>>>                         
>>>>>>>>>>>>>                           
>>>>>>>>>>>>>> To illustrate how the ESG datanode can be configured to
>>>>>>>>>>>>>> serve data for CMIP5, we have deployed a datanode containing
>>>>>>>>>>>>>> a subset of CMIP3 in the Data Reference Syntax. Some key
>>>>>>>>>>>>>> features of this deployment are:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   * The underlying directory structure is based on the Data
>>>>>>>>>>>>>>     Reference Syntax.
>>>>>>>>>>>>>>   * Datasets are published at the realm level.
>>>>>>>>>>>>>>   * The token-based security filter is replaced by the
>>>>>>>>>>>>>>     OpenidRelyingParty security filter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Further notes can be found at 
>>>>>>>>>>>>>> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This test deployment should be of interest to anyone 
>>>>>>>>>>>>>> wanting to know how DRS identifiers could be exposed in 
>>>>>>>>>>>>>> THREDDS catalogues and the TDS HTML interface.  You can 
>>>>>>>>>>>>>> also try downloading files with OpenID authentication or 
>>>>>>>>>>>>>> via wget with SSL-client certificate authentication.  See the link above for details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Stephen.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> Stephen Pascoe  +44 (0)1235 445980 British Atmospheric Data 
>>>>>>>>>>>>>> Centre Rutherford Appleton Laboratory
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -------------------------------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>> http://*****mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                         
>>>>>>>>>>>>>>                           
>>>>>>>>>>>>>>                             
>>>>>>>>>> --
>>>>>>>>>> Estanislao Gonzalez
>>>>>>>>>>
>>>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches 
>>>>>>>>>> Klimarechenzentrum (DKRZ) - German Climate Computing Centre 
>>>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>>>
>>>>>>>>>> Phone:   +49 (40) 46 00 94-126
>>>>>>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>> http://*****mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>
>>>>>>>>>>                 
>>>>>>>>>>                   
>>>>>>>>>>                     
>>>>>>> --
>>>>>>> Gavin M. Bell
>>>>>>> Lawrence Livermore National Labs
>>>>>>> --
>>>>>>>
>>>>>>> "Never mistake a clear view for a short distance."
>>>>>>>       	       -Paul Saffo
>>>>>>>
>>>>>>> (GPG Key - http://***rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>>>>
>>>>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>>>>           
>>>>>>>             
>>>>>>>               
>>>>>> _______________________________________________
>>>>>> GO-ESSP-TECH mailing list
>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>> http://***mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>
>>>>>>         
>>>>>>           
>>>>>>             
>>>   
>>>     
>>>       
>> --
>> Estanislao Gonzalez
>>
>> Max-Planck-Institut für Meteorologie (MPI-M) Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>
>> Phone:   +49 (40) 46 00 94-126
>> E-Mail:  estanislao.gonzalez at zmaw.de
>>
>> _______________________________________________
>> is-enes-sa2-jra4 mailing list
>> is-enes-sa2-jra4 at lists.enes.org
>> https://lists.enes.org/mailman/listinfo/is-enes-sa2-jra4
>>   
>>     
>
>
>   


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  estanislao.gonzalez at zmaw.de



