[Go-essp-tech] Replication: requested and output DRS products.

Wed Jul 7 18:59:53 MDT 2010

Dear all,

I'm not sure scientists care much about the distinction between the 
"output", "requested", and "replicated" categories. Martin indicates 
there might be mild interest in being able to search only over the 
"requested" category, since outside that category, there may be little 
uniformity in what is available from the different models. There will be 
little interest in being able to distinguish between "requested" and 
"replicated" unless there is a difference in the quality control tests 
that have been applied to these two categories (and then only if a 
noticeable amount of data in the "requested" category wouldn't pass the 
"replicated" tests). Will this likely be the case?

Clearly the ESG federation must be able to decide which files to 
replicate, so, unlike the scientists, some of you on this list have a 
real interest in our being able to distinguish that subset. I'm not 
sure this information has to be part of the DRS though. Couldn't we just 
have some database that lists the criteria for selecting data to be 
replicated? The database coupled with coding to access that information 
could be used to decide whether each file in the "output" category needs 
to be replicated or not. This is why, although the current DRS 
document allows "product" to be either "output" or "requested", an all 
caps note appears stating: "[WILL POSSIBLY MODIFY THE ABOVE IF WE DON’T 
NEED TO KNOW ABOUT “REQUESTED”]."
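
A minimal sketch of the database idea above, assuming the criteria can be expressed as (experiment, frequency, variable) triples and that paths follow the DRS component order quoted later in this thread. The table name `criteria`, the function names, and the sample values are hypothetical; the real criteria would come from whatever the federation agrees on.

```python
import sqlite3

# Component order assumed from the DRS URL layout quoted later in this thread.
DRS_FIELDS = ("activity", "product", "institute", "model", "experiment",
              "frequency", "realm", "variable", "ensemble", "version")


def make_criteria_db(rows):
    """Build an in-memory table of (experiment, frequency, variable) criteria."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE criteria (experiment TEXT, frequency TEXT, variable TEXT)")
    db.executemany("INSERT INTO criteria VALUES (?, ?, ?)", rows)
    return db


def needs_replication(db, drs_path):
    """Return True if a file in the "output" product matches a criterion."""
    parts = drs_path.strip("/").split("/")
    if len(parts) < len(DRS_FIELDS):
        return False  # too short to be a full DRS path
    fields = dict(zip(DRS_FIELDS, parts))
    if fields["product"] != "output":
        return False
    row = db.execute(
        "SELECT 1 FROM criteria"
        " WHERE experiment = ? AND frequency = ? AND variable = ?",
        (fields["experiment"], fields["frequency"], fields["variable"]),
    ).fetchone()
    return row is not None


# Hypothetical criterion: monthly "tas" from the "historical" experiment.
db = make_criteria_db([("historical", "mon", "tas")])
```

With something like this, the "product" component would not need a separate "requested" value; the selection lives entirely in the table.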

Bob Drach and I had some extended discussions about this some time ago, 
but I can't recall if he decided to include some capability along these 
lines in the publisher (i.e., enables the publisher to determine whether 
files are in the "requested" category or not), or if we've left that for 
completely independent external coding. Bob returns from vacation later 
this week, so I suggest we wait for some input from him.

Best regards,
Karl

On 7/7/10 2:48 AM, martin.juckes at stfc.ac.uk wrote:
> Hello Estani,
>
> The reference to CMIP5_archive_size.xls was not very useful, apologies for referencing a file that isn't publicly available -- it is attached.
>
> According to the DRS document, everything should be found under the "output" branch, and the "requested" branch will be a subset of the "output".
>
> An end user may want a homogeneous dataset, and so may opt to restrict attention to the "requested" data where he is likely to find the same variables from a large range of models. He may, on the other hand, want all available data for a given set of experiments, in which case he should go to the "output" branch. He will then find additional (low priority) variables and extended time coverage from a small number of models.
>
> I'll see what can be done about a "DRS:requested" and "ESGF:replicated" document (or wiki page),
>
> cheers,
> Martin
>
>
>
>
> -----Original Message-----
> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
> Sent: Wed 07/07/2010 08:54
> To: Juckes, Martin (STFC,RAL,SSTD)
> Cc: Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
> Subject: Re: Replication: requested and output DRS products.
>
> Hi Martin,
>
> I couldn't find the file you mentioned (CMIP5_archive_size.xls), could
> you please provide a link to it?
>
> I'm aware now that output > requested > replicated. But the distinction
> between the latter two is not clear to me. I totally agree that it would
> be great if someone could sum that up.
>
> And one question from the "monster" thread that still remains is:
> It is clear that requested is a subset of output. Does this imply that
> all data under .../requested/... should also be found under the
> .../output/... DRS sub-structure?
>
> I think not... but then again, why would the end user need to know about
> this separation?
>
> Thanks,
> Estani
>
>
> martin.juckes at stfc.ac.uk wrote:
>    
>> Hello again,
>>
>> The decision as to what is to be replicated is, I think, embedded in "CMIP5_archive_size.xls", and its implementation through the DRS is based on the separation between "requested" and "output" products. It would be useful to have a brief document outlining these decisions and some code to implement them. I'm not sure of the latest status on these two points; perhaps Stephen can add something.
>>
>> cheers,
>> Martin
>>
>>
>> -----Original Message-----
>> From: Estanislao Gonzalez [mailto:estanislao.gonzalez at zmaw.de]
>> Sent: Tue 06/07/2010 17:15
>> To: V. Balaji
>> Cc: Juckes, Martin (STFC,RAL,SSTD); Pascoe, Stephen (STFC,RAL,SSTD); gavin at llnl.gov; drach1 at llnl.gov; go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov; taylor13 at llnl.gov
>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring a datanode to serve CMIP3-DRS
>>
>> Hi Balaji,
>>
>> To put things in context once more: (I think there's no such thing as
>> over-clarification :-)
>>
>> DRS file and directory structure will be assured. The problem arises
>> if, for some reason, we have two different directories, e.g. A and B,
>> and we want to publish data in DRS from both of them. So we have
>> A/<DRS structure> and B/<DRS structure>.
>> We'd like both of them to be mapped to a central URL, e.g.
>> http://www.server.de/thredds/fileserver/<DRS structure> so that the user
>> requires absolutely no knowledge about this separation.
>>
>> The remaining question is: why on earth would someone want to have A and
>> B?! :-)
>> Well some reasons are:
>> 1) Simplified management. We don't have a mega-mix of millions of files
>> of which some have to be replicated, some are held only at our
>> institution, some are "temporarily" held as being cached from tape.
>> Telling these all apart might not be an easy task.
>> 2) Safety. In such a context a simple error might be disastrous (e.g.
>> someone tries to remove the replicated files to re-deploy them without
>> being aware that they share the directory with other files...)
>> 3) Backup. If we (ok, somebody else, we will have everything on tape, I
>> think...) want to back up a portion of the data, this won't be easily
>> achieved (the replicated data is already redundant, but the other isn't)
>> 4) Storage. We might get more disks, but we certainly won't be able
>> to "merge" all of them into a single storage (well, that's because they
>> will arrive way after we start publishing things, so the first disks
>> will already have some data). In any case, for political (e.g.
>> institutional), technical (e.g. disk speed) or philosophical (e.g.
>> ...uh....) reasons it might be desirable to keep different storages.
>>
>> And as I said we have to cope with that, somehow.
>> The starter question was: can this be achieved with the publisher? And
>> the answer was "no".
>>
>> And I totally agree with you regarding AR5. I must have a very good
>> reason for not adhering to a default, even a de facto one. But the
>> decision behind the storage in AR5 is a political one that, AFAIK, hasn't
>> been taken yet.
>>
>> Well, I hope this helped to clarify things a bit.
>>
>> Thanks,
>> Estani
>>
>> V. Balaji wrote:
>>
>>      
>>> There are undoubtedly parts of this I'm not following too well, so I
>>> apologize in advance for any misunderstandings. This is all from the
>>> perspective of a modeling center.
>>>
>>> I do not understand the logic for _not_ wishing to lay data out in
>>> DRS-compliant fashion on the public data server. I know you can do it,
>>> but I don't understand why you'd want to. One thing I'd like to make
>>> sure is captured as a requirement is that 'wget -r' should deliver
>>> data laid out per DRS directory structure.
>>>
>>> The second issue is that, again from the modeling centre perspective, I
>>> fervently hope that whatever's done for CMIP5 becomes a de-facto
>>> standard for other projects requiring coordinated model data output. We
>>> (modeling centres) cannot build one-off solutions for each project. We
>>> have with some success made CMOR1/AR4 a template which was forked off
>>> for other projects (ENSEMBLES, CHFP, HTAP), because there's no way we
>>> can repeatedly undertake the task of integrating multiple inconsistent
>>> CMORs and DRSes into our data processing workflow. This is in ref to
>>> Martin's question about "non-CMIP5 data".
>>>
>>> martin.juckes at stfc.ac.uk writes:
>>>
>>>
>>>        
>>>> Hello Estanislao, Gavin,
>>>>
>>>> There is a key part of your problem I don't understand -- what do you
>>>> mean by "non CMIP5 data"?
>>>>
>>>> Before going into the ESGF CMIP5 archive, all files will be CMOR2
>>>> compliant. This means that they fit in the "requested" or "product"
>>>> categories of the DRS. The data to be replicated will be a subset of
>>>> the "ESG published units" (also known as realm level datasets) in the
>>>> "requested" category.
>>>>
>>>> There has been an agreement that the ESGF CMIP5 archive would be run
>>>> on disk, and so it is not surprising that the infrastructure does not
>>>> support tape storage. I can see that something along the lines Gavin
>>>> describes would resolve the problems with tape storage, but we need
>>>> to get the disk based system working as the first priority.
>>>>
>>>> Stephen raises the issue of replication and this is relevant, since
>>>> straight disk-to-disk copies (i.e. to an external hard drive which
>>>> can be posted) are a vital aspect of the replication plan. For the
>>>> time being, this requires people to stick to the DRS directory
>>>> structure.
>>>>
>>>> Within CMIP5 the data from different institutions is clearly
>>>> separated at the institution directory level, I can't see why there
>>>> should be any confusion here.
>>>>
>>>> For non-CMIP5 data -- why would you want to describe it with the
>>>> CMIP5 DRS?
>>>>
>>>> cheers,
>>>> Martin
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: is-enes-sa2-jra4-bounces at lists.enes.org on behalf of
>>>> stephen.pascoe at stfc.ac.uk
>>>> Sent: Tue 06/07/2010 12:13
>>>> To: estanislao.gonzalez at zmaw.de; gavin at llnl.gov
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of
>>>> configuring a datanode to serve CMIP3-DRS
>>>>
>>>>
>>>>
>>>> Hi Estanislao,
>>>>
>>>>
>>>>          
>>>>> * The only true problem is to differentiate between core and
>>>>> non-core data (which as far as I know is a file issue instead of a
>>>>> dataset one, i.e. some datasets contain core and non-core data)
>>>>>
>>>>>            
>>>> I'm not sure you were involved then but we had lengthy discussions
>>>> last year on how we would deal with the separation of requested and
>>>> non-requested data (Karl discourages the term "core").  There is a
>>>> fundamental problem that the DRS vocabularies don't cleanly map onto
>>>> what is requested and not requested.  The outcome was to introduce
>>>> the DRS component "product" to divide the two.  If you are interested
>>>> take a look at the following threads:
>>>>
>>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2010-January/000335.html
>>>> http://mailman.ucar.edu/pipermail/go-essp-tech/2009-December/000255.html
>>>>
>>>> There hasn't been much discussion of how we identify and manage
>>>> requested data since then and the nitty-gritty details still aren't
>>>> fixed.  This is going to be a challenge when we come to replicate.
>>>>
>>>> S.
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> British Atmospheric Data Centre
>>>> Rutherford Appleton Laboratory
>>>>
>>>> -----Original Message-----
>>>> From: is-enes-sa2-jra4-bounces at lists.enes.org
>>>> [mailto:is-enes-sa2-jra4-bounces at lists.enes.org] On Behalf Of
>>>> Estanislao Gonzalez
>>>> Sent: 06 July 2010 11:18
>>>> To: Gavin M Bell
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>> Subject: Re: [is-enes-sa2-jra4] [Go-essp-tech] Example of configuring
>>>> a datanode to serve CMIP3-DRS
>>>>
>>>> Hi people,
>>>>
>>>> well I think we do require something like this (at least at the major
>>>> data nodes where data will get replicated). Managing all data mixed
>>>> up under one single directory is not a very neat solution for the
>>>> data administrator. In our particular case we will be publishing many
>>>> (much? :-) data from different institutions and even types (not only
>>>> CMIP5). And we shouldn't forget about the replicated data (is that ===
>>>> core?): how can we tell which data requires being replicated? By
>>>> maintaining a second "catalog" in a DB? I think by maintaining a
>>>> separate filesystem a simple rsync will do the job (after the very
>>>> first replication, of course).
>>>> In any case the fact that we at DKRZ cannot hold all CMIP5 data on
>>>> disk (yes, the core one we can :-) implies that we will have to
>>>> maintain a cache somewhere, and mixing this cache with the core data
>>>> is something we should probably avoid.
>>>>
>>>> Gavin's solution, if I got it right, has a major problem. The
>>>> catalogs will be created pointing to the real files (e.g.
>>>> .../core/CMIP5), so that the filter can alter the request from the
>>>> DRS query
>>>> (../CMIP5/<core_data>) to the real one, and thus allow the TDS to
>>>> work as usual. This leaves the catalogs unaltered, and thereby the
>>>> harvested data, which will have no reference to the mapped DRS
>>>> structure but only to the real one. Or did I miss something here?
>>>>
>>>> I have already tried several possible solutions without any success
>>>> at all:
>>>> 1) Setting multiple datasetRoot entries is not allowed
>>>> 2) Altering the TDS to accept multiple datasetRoot entries and look
>>>> in all of them one after the other until something matches is almost
>>>> impossible (for the time we have ahead; the mere architecture of the
>>>> TDS is, in my opinion, a mess).
>>>> 3) In general altering the TDS is not a "nice" solution.
>>>> 4) Filtering the request breaks the coherence between the catalogs
>>>> and the DRS "virtual" structure (the catalogs have no information
>>>> whatsoever that a second link to the files exists).
>>>>
>>>> The only viable solution I can think of (and it remains to be seen
>>>> whether it's really viable) is to maintain the files somewhere else
>>>> and link them to the "central" DRS filesystem before being published.
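
The linking idea can be sketched roughly as follows, assuming each storage area already holds its files in DRS layout and that symbolic links are acceptable to the server. The function name `link_into_drs` and the directory names in `demo` are hypothetical, and nothing here touches the publisher or the TDS.

```python
import os
import tempfile


def link_into_drs(storage_root, drs_root):
    """Mirror every file under storage_root into drs_root as a symlink."""
    storage_root = os.path.abspath(storage_root)
    linked = []
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        rel = os.path.relpath(dirpath, storage_root)
        target_dir = os.path.normpath(os.path.join(drs_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            link = os.path.join(target_dir, name)
            if not os.path.lexists(link):  # never overwrite an existing entry
                os.symlink(os.path.join(dirpath, name), link)
                linked.append(link)
    return linked


def demo():
    """Link two separate storage areas into one central DRS tree."""
    with tempfile.TemporaryDirectory() as tmp:
        drs = os.path.join(tmp, "drs")
        for store, rel in (("replicated", "CMIP5/output/a.nc"),
                           ("core", "CMIP5/output/b.nc")):
            path = os.path.join(tmp, store, rel)
            os.makedirs(os.path.dirname(path))
            open(path, "w").close()  # empty placeholder file
            link_into_drs(os.path.join(tmp, store), drs)
        return sorted(os.listdir(os.path.join(drs, "CMIP5", "output")))
```

Both storage areas stay physically separate, while a single tree is what gets published.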
>>>>
>>>> After discussing this with Stephen we came up with something I'd like
>>>> to sum up here:
>>>> * All non-CMIP5 data can be mapped to a DRS structure "not" starting
>>>> with CMIP5 so it can be easily mapped to somewhere else (TDS allows
>>>> that)
>>>> * The only true problem is to differentiate between core and non-core
>>>> data (which as far as I know is a file issue instead of a dataset
>>>> one, i.e. some datasets contain core and non-core data)
>>>> * The replication can rely on external sources for differentiating
>>>> this, e.g. a DB.
>>>> * The cached non-core data can co-exist, in the worst case scenario,
>>>> with the core data by removing the write permissions of the latter
>>>> (besides the security that this implies, it will be used as a flag in
>>>> case the server is restarted. All non-flagged (write-enabled) files
>>>> will be treated as leftovers from the stopped cache and will be
>>>> further served)
>>>>
>>>> So we might get away with it without performing any major changes. But
>>>> this is something we should definitely discuss before next iteration :-)
>>>>
>>>> I hope this sheds some light on the matter... sorry for the
>>>> lengthy mail...
>>>>
>>>> Regards,
>>>> Estani
>>>>
>>>> Gavin M Bell wrote:
>>>>
>>>>          
>>>>> Martin,
>>>>>
>>>>> The savings is that the data provider / data-node admin doesn't have
>>>>> to do any additional work, whether providing a filesystem <-> DRS
>>>>> mapping or (re)arranging their file system.  In the current state of
>>>>> things all the salient information is already in the database created
>>>>> as a result of the publisher [software] scan.  I think it would be
>>>>> prudent to use that information to the benefit of our end users
>>>>> instead of imposing a DRS directory structure requirement for esg
>>>>> participation.
>>>>>
>>>>> You said:
>>>>> "Remember that not having to configure the file system is only a real
>>>>> saving if the alternative (configuring the file system to URL mapping)
>>>>> is actually easier than configuring the file system."
>>>>>
>>>>> I am saying:
>>>>> The 'alternative' you describe, does not exist.  Because there is no
>>>>> "configuring the file system to URL mapping" necessary... unless the
>>>>> end-user wants there to be. In which case we, as dutiful programmers,
>>>>> provide that opportunity.  This is what my code sketch was
>>>>> illustrating with the property "drs.resolve.strategy", and the use of
>>>>> a factory and strategy pattern - of which we will set a default that
>>>>> requires them to do *no additional work*.  The data-node admin won't
>>>>> have to do any actual setup outside of running an "esg-node --update".
>>>>> The upgrade/update process (determined by the esg-node install script)
>>>>> will install the filter, without them having to do anything additional.
>>>>>
>>>>> Indeed, the code I posted was a quick and dirty filter code sketch
>>>>> demonstrating that putting a filter in place is easy. Yes, the
>>>>> resolution work would be done in the code that I only alluded to, the
>>>>> "DRSResolver". Current duties preclude me from actually implementing
>>>>> this issue outright, today, for this email conversation. However, if
>>>>> we all conclude that it is worthwhile then I or someone else could
>>>>> make it happen.
>>>>>
>>>>> I hope I have done a better job of making my point clearer: that we
>>>>> can free our end-users of this DRS directory structure requirement
>>>>> while allowing the DRS itself to be more flexible with its
>>>>> representation.
>>>>> Also that the mechanism I described does not preclude anyone from
>>>>> setting up their filesystem to follow the DRS structure, we get that
>>>>> for free! :-)
>>>>>
>>>>> I am glad that we do indeed agree that the effort to bring this to
>>>>> fruition can and should be done in a way that does not impede or
>>>>> distract the current  deliverable path.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>
>>>>>
>>>>>            
>>>>>> Er... the attachment you sent didn't actually do any mapping. But I'm
>>>>>> sure it could be done. The extra work I'm talking about is the same
>>>>>> as the extra work you talk about at the end of your mail, so I'm
>>>>>> going to ignore your suggestion at the start of your email that there
>>>>>> isn't any,
>>>>>>
>>>>>> cheers,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>>> Sent: Mon 05/07/2010 21:37
>>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of configuring
>>>>>> a datanode to serve CMIP3-DRS
>>>>>>
>>>>>> Hi Martin,
>>>>>>
>>>>>> With regards to the savings... One, perhaps default, setup is not
>>>>>> having the data provider do anything additional at all with respect
>>>>>> to configuration or setup.  They simply use the publisher to scan
>>>>>> their files into the system, something that must be done in all
>>>>>> cases... (so we can normalize that out). With that said, they would
>>>>>> not have to do *any* additional work.  No work is easier than some
>>>>>> work, regardless of how easy ;-).
>>>>>>
>>>>>> I have attached the filter code that would almost do it.  The real
>>>>>> intelligence would be in the "DRSResolver" object to do the
>>>>>> resolution.  I would have sketched out that class as well but that
>>>>>> would be tantamount to completing this task... and to finish it off
>>>>>> I would have to confer with Bob on the publisher database, and have
>>>>>> us all settled on the DRS query syntax.
>>>>>> With a DRS URL query scheme we could wrap this up quite directly.
>>>>>>
>>>>>> The DRSResolver would:
>>>>>> - parse the request URL (the query) and pull out the salient parts.
>>>>>> - fashion those parts into a SQL query against the publisher database.
>>>>>> - return the THREDDS root-based URL to the rest of the processing
>>>>>> stream. If it cannot be resolved, punt and return the same
>>>>>> input string as the output and let some other part of the process
>>>>>> stream regurgitate an error.
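
The three steps above might look roughly like this. It is only a sketch under stated assumptions: the attached filter code is not reproduced in this thread, the real publisher database schema is not shown here, and the table `files(drs_path, location)`, the `PREFIX` value, and the sample row are all hypothetical. For brevity the whole DRS path is used as the lookup key instead of its individual parts.

```python
import sqlite3
from urllib.parse import urlparse

# Assumed URL prefix in front of the DRS part of a request.
PREFIX = "/thredds/fileserver/"


class DRSResolver:
    """Map a DRS-style request URL to a THREDDS root-based location."""

    def __init__(self, db):
        self.db = db

    def resolve(self, url):
        # Step 1: parse the request URL and pull out the DRS part.
        path = urlparse(url).path
        if not path.startswith(PREFIX):
            return url  # not a DRS query: punt, return the input unchanged
        drs_path = path[len(PREFIX):]
        # Step 2: turn it into a SQL query against the (hypothetical) table.
        row = self.db.execute(
            "SELECT location FROM files WHERE drs_path = ?", (drs_path,)
        ).fetchone()
        # Step 3: return the resolved location, or punt if nothing matched.
        return row[0] if row else url


def make_demo_db():
    """Stand-in for the database produced by the publisher scan."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (drs_path TEXT PRIMARY KEY, location TEXT)")
    db.execute(
        "INSERT INTO files VALUES (?, ?)",
        ("CMIP5/output/x.nc", "/thredds/fileserver/core/CMIP5/output/x.nc"),
    )
    return db


resolver = DRSResolver(make_demo_db())
```

Returning the input unchanged on a miss matches the "punt" behaviour described above: some later stage reports the error.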
>>>>>>
>>>>>> Because all the metadata is pulled out in the publisher's scan, file
>>>>>> system placement of the scanned files is moot.
>>>>>>
>>>>>> In the code I attached, I leave room for the data-node user to select
>>>>>> their own implementation of the resolver following a factory/strategy
>>>>>> pattern.  At that point indeed we allow end users to do 'work' with
>>>>>> doing their own mappings.  Perhaps we integrate a few canned mapping
>>>>>> schemes etc... We can be arbitrarily clever with these kinds of
>>>>>> things of course. :-)
>>>>>>
>>>>>> P.S.
>>>>>> The DRSResolver logic would/should be ported to all ingress request
>>>>>> streams.  Also the published catalogs would be published with the DRS
>>>>>> query syntax scheme as the canonical name of the resource - something
>>>>>> the search facility would use to identify the resource.
>>>>>>
>>>>>> done.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> Hi Gavin,
>>>>>>>
>>>>>>> I'm not convinced about the connection to Estanislao's email, but
>>>>>>> the idea of thinking about the next step while implementing the
>>>>>>> current system is certainly a good one. Remember that not having to
>>>>>>> configure the file system is only a real saving if the alternative
>>>>>>> (configuring the file system to URL mapping) is actually easier than
>>>>>>> configuring the file system. Setting up the DRS is not difficult,
>>>>>>>
>>>>>>> cheers,
>>>>>>> Martin
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>>>> Sent: Mon 05/07/2010 19:45
>>>>>>> To: Juckes, Martin (STFC,RAL,SSTD)
>>>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu;
>>>>>>> is-enes-sa2-jra4 at lists.enes.org; doutriaux1 at llnl.gov
>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>>
>>>>>>> Martin and friends,
>>>>>>>
>>>>>>> This is false economy.  Two things.  First, implementing this is not
>>>>>>> hard.  Secondly, implementing this will resolve the issues w.r.t. the
>>>>>>> incongruence between DRS and the filesystem that Estanislao's email
>>>>>>> illuminated.  So it seems to me that the alternative is to keep fitting
>>>>>>> this square DRS peg into the round file system hole.  That would
>>>>>>> mean having to do a whole other set of gymnastics to get the DRS <->
>>>>>>> file system beast tamed.  There is work to be done either way
>>>>>>> because things are not ready to go as it stands. I suggest we fix
>>>>>>> the problem at the root, now, not "later".  Essentially the current
>>>>>>> course requires the data providers to jump through file system
>>>>>>> layout hoops.  I am of the opinion that we should "require" as
>>>>>>> little as possible from our users, especially something like
>>>>>>> this... it hurts adoption IMHO.
>>>>>>>
>>>>>>> Actually, let me frame this differently.  How about we fork efforts,
>>>>>>> and have some folks think about what the *query* URL should be for
>>>>>>> the functionality I suggested, while others continue the current
>>>>>>> path.  When the former development is ripe I update the install
>>>>>>> script and have it installed upon the clients' next install
>>>>>>> automagically, no slowdown for anyone.  The null transform would be
>>>>>>> equivalent to what we have now so we would be backward compatible
>>>>>>> for folks who have done the task of making their file systems
>>>>>>> congruent to DRS.  Fair enough?
>>>>>>>
>>>>>>> Sound good?
>>>>>>>
>>>>>>> martin.juckes at stfc.ac.uk wrote:
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>> Hello Gavin, Bob,
>>>>>>>>
>>>>>>>> I agree that this is a good idea in principle, but I think it is a
>>>>>>>> bad idea now. The thing about "now" is that we want to deploy and
>>>>>>>> test the system we have agreed on. We want to do it now because
>>>>>>>> modelling centres have supercomputers running and churning out vast
>>>>>>>> volumes of data, there are thousands of scientists waiting to get
>>>>>>>> at it and we have the job of installing a system to distribute it.
>>>>>>>> It is, I think, a bad time to start implementing changes in the
>>>>>>>> system design. Sorry if this sounds a bit harsh, but impending
>>>>>>>> deadlines make me nervous,
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Martin
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
>>>>>>>> Sent: Mon 05/07/2010 19:18
>>>>>>>> To: Gavin M Bell
>>>>>>>> Cc: go-essp-tech at ucar.edu; is-enes-sa2-jra4 at lists.enes.org; Charles
>>>>>>>> Doutriaux
>>>>>>>> Subject: Re: [Go-essp-tech] [is-enes-sa2-jra4] Example of
>>>>>>>> configuring a datanode to serve CMIP3-DRS
>>>>>>>>
>>>>>>>> Hi Gavin,
>>>>>>>>
>>>>>>>> I agree completely. Having a regularized DRS syntax is a very good
>>>>>>>> idea, but to implement it we will need to introduce a level of
>>>>>>>> indirection between the DRS URL (your 'query') and the underlying
>>>>>>>> filesystem. Separating these two concerns will have a very
>>>>>>>> important benefit: it will allow the data node managers to
>>>>>>>> organize their filesystems as they see fit.
>>>>>>>>
>>>>>>>> Bob
>>>>>>>>
>>>>>>>> On Jul 5, 2010, at 11:10 AM, Gavin M Bell wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> Hello gentle-people,
>>>>>>>>>
>>>>>>>>> Here is my two cents on this whole DRS business.  I think that the
>>>>>>>>> fundamental issue to all of this is the ability to do resource
>>>>>>>>> resolution (lookup).  The issue of having urls match a DRS
>>>>>>>>> structure that matches the filesystem is a red herring (IMHO).
>>>>>>>>> The basic issue is to be able to issue a query to the system such
>>>>>>>>> that you find what you are looking for.  This query mechanism
>>>>>>>>> should be a separate mechanism from filesystem correspondence.  The
>>>>>>>>> driving issue behind the file system correspondence push is so
>>>>>>>>> that people and/or applications can infer the location of
>>>>>>>>> resources in some regimented way.  The true heart of the issue is
>>>>>>>>> not with the file system.  The heart of the issue is to perform a
>>>>>>>>> query such that you provide resource resolution.  The file system
>>>>>>>>> is a familiar mechanism but it isn't the only one.  The file
>>>>>>>>> system takes a query (the file system path) and returns the
>>>>>>>>> resource to us (the bits sitting at an inode location somewhere
>>>>>>>>> that is memory mapped to some physical platter and spindle
>>>>>>>>> location, that is mapped to the file system path).  We are
>>>>>>>>> overloading the file system query mechanism when it is not
>>>>>>>>> necessary.
>>>>>>>>>
>>>>>>>>> I propose the following:  We create a *filter* and a small
>>>>>>>>> database (the latter we already have in the publisher).  We send a
>>>>>>>>> *query* to the web server; the web server *filter* intercepts that
>>>>>>>>> *query*, resolves it to the actual resource location using the
>>>>>>>>> database, and returns the resource you want.  Implementing this in
>>>>>>>>> a filter divorces the query structure from the file system
>>>>>>>>> structure.  The use of the database (that is generated by the
>>>>>>>>> publisher when it scans) provides the resolution.
>>>>>>>>> With this mechanism in place, WGET, as well as any other URL based
>>>>>>>>> tool will be able to fetch the data as intended.
>>>>>>>>>
>>>>>>>>> BTW: The "query" is whatever we make it up to be... (not a
>>>>>>>>> reference to SQL query).
>>>>>>>>>
>>>>>>>>> This gives the data-node admin the ability to put their files
>>>>>>>>> wherever they want.  If they move files around and so on, they
>>>>>>>>> just have to rescan with the publisher.  The issues around design
>>>>>>>>> and efficiency can be addressed with varying degrees of cleverness.
>>>>>>>>>
>>>>>>>>> I welcome any thoughts on this issue... Please talk me down :-). I
>>>>>>>>> think it is about time we put this DRS issue to bed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Estanislao Gonzalez wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> Hi Bob,
>>>>>>>>>>
>>>>>>>>>> I guess you must be on vacation now. Anyway, here's the
>>>>>>>>>> question, maybe someone else can answer it:
>>>>>>>>>>
>>>>>>>>>> The very first idea I had was almost what you proposed. Your
>>>>>>>>>> proposal though leaves URLs of the form:
>>>>>>>>>> http://myserver/thredds/fileserver/CMIP5_replicas/output/...
>>>>>>>>>>                                    <--- (almost) DRS Structure --->
>>>>>>>>>>
>>>>>>>>>> Which has no valid DRS structure (neither CMIP5_replicas nor
>>>>>>>>>> CMIP5_core is in the DRS vocabulary).
>>>>>>>>>>
>>>>>>>>>> My proposal has a very similar flaw:
>>>>>>>>>> http://myserver/thredds/fileserver/replicated/CMIP5/output/...
>>>>>>>>>>                                               <--- full DRS Structure --->
>>>>>>>>>>
>>>>>>>>>> The DRS structure is preserved, but you cannot easily infer the
>>>>>>>>>> correct URL from any dataset. I think the idea is: if you know the
>>>>>>>>>> prefix (http.../fileserver/) and the dataset DRS name, you can
>>>>>>>>>> always get the file without even browsing the TDS:
>>>>>>>>>> prefix + DRS = URL to file
>>>>>>>>>>
>>>>>>>>>> AFAIK the URL structure used by the TDS will never be 100% DRS
>>>>>>>>>> conformant (according to the DRS version 0.27). This one has the
>>>>>>>>>> form:
>>>>>>>>>> http://<hostname>/<activity>/<product>/<institute>/<model>/
>>>>>>>>>> <experiment>/<frequency>/<modeling realm>/<variable identifier>/
>>>>>>>>>> <ensemble member>/<version>/[<endpoint>],
>>>>>>>>>>
>>>>>>>>>> where the TDS one has the endpoint moved to the front (the
>>>>>>>>>> thredds/fileserver, thredds/dodsC, etc parts).
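
To make the difference between the two layouts concrete, here is a small sketch that builds both URL forms from the same DRS components. The component order follows the form quoted above; the function names and every sample value (host, institute, model, and so on) are hypothetical.

```python
# Component order taken from the DRS URL form quoted above.
DRS_ORDER = ("activity", "product", "institute", "model", "experiment",
             "frequency", "realm", "variable", "ensemble", "version")


def drs_path(components):
    """Join the DRS components in their defined order."""
    return "/".join(components[name] for name in DRS_ORDER)


def drs_url(host, components, endpoint):
    """DRS-style form: the endpoint comes last."""
    return "http://{}/{}/{}".format(host, drs_path(components), endpoint)


def tds_url(host, components, endpoint):
    """TDS-style form: the endpoint (e.g. thredds/fileserver) comes first."""
    return "http://{}/{}/{}".format(host, endpoint, drs_path(components))


# Hypothetical sample values, for illustration only.
example = {
    "activity": "CMIP5", "product": "output", "institute": "INST",
    "model": "MODEL", "experiment": "historical", "frequency": "mon",
    "realm": "atmos", "variable": "tas", "ensemble": "r1i1p1",
    "version": "v1",
}
```

In the TDS-style form, "prefix + DRS = URL" only holds if nothing else (such as a `replicated/` or `CMIP5_core` segment) sits between the endpoint and the DRS path.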
>>>>>>>>>>
>>>>>>>>>> To sum things up:
>>>>>>>>>> Is it possible to publish files from different directory
>>>>>>>>>> structures into a unified URL structure so that it is completely
>>>>>>>>>> transparent to the user?
>>>>>>>>>> Am I the only one addressing this problem? Are all other
>>>>>>>>>> institutions planning  to publish all files from only one
>>>>>>>>>> directory?
>>>>>>>>>>
>>>>>>>>>> The only viable solution I can think of is to rely on Stephen's
>>>>>>>>>> versioning concept and maintaining a single true DRS structure
>>>>>>>>>> with links to files kept in other more manageable directory
>>>>>>>>>> structures (This will probably involve adapting Stephen's tool).
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Estani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Bob Drach wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>>>> Hi Estani,
>>>>>>>>>>>
>>>>>>>>>>> It should be possible to do what you want without running
>>>>>>>>>>> multiple data nodes.
>>>>>>>>>>>
>>>>>>>>>>> The purpose of the THREDDS dataset roots is to hide the
>>>>>>>>>>> directory structure from the end user, and to limit what the
>>>>>>>>>>> TDS can access. But THREDDS can certainly have multiple dataset
>>>>>>>>>>> roots.
>>>>>>>>>>>
>>>>>>>>>>> In your example below, you should associate different paths with
>>>>>>>>>>> the locations, for example:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>>>> <datasetRoot path="CMIP5_replicas" location="/replicated/CMIP5"/>
>>>>>>>>>>>> <datasetRoot path="CMIP5_core" location="/core/CMIP5"/>
>>>>>>>>>>>>
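To see why distinct root paths make the URL-to-file mapping reversible, here is a small Python model of it. It is a sketch of the idea only, not TDS code; `build_roots` and `resolve` are hypothetical names, and the example file name is made up.

```python
def build_roots(pairs):
    """Collect (path, location) pairs, refusing duplicate root paths."""
    roots = {}
    for path, location in pairs:
        if path in roots:
            # Two locations behind one path cannot be told apart later.
            raise ValueError(f"duplicate datasetRoot path: {path!r}")
        roots[path] = location
    return roots


def resolve(roots, url_path):
    """Map '<root path>/<rest>' to '<location>/<rest>'."""
    root, _, rest = url_path.partition("/")
    return f"{roots[root]}/{rest}"


roots = build_roots([
    ("CMIP5_replicas", "/replicated/CMIP5"),
    ("CMIP5_core", "/core/CMIP5"),
])
```

Here `resolve(roots, "CMIP5_core/output/tas.nc")` yields a path under `/core/CMIP5`, whereas registering the path `CMIP5` twice, as in the earlier attempt, is rejected by `build_roots`.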
>>>>>>>>>>> Also be aware that in the publisher configuration:
>>>>>>>>>>>
>>>>>>>>>>> - the directory_format can have multiple values, separated by
>>>>>>>>>>> vertical bars (|). The publisher will use the first format that
>>>>>>>>>>> matches the directory structure being scanned.
>>>>>>>>>>>
>>>>>>>>>>> - a useful strategy is to create different project sections for
>>>>>>>>>>> various groups of directives. You could define a cmip5_replica
>>>>>>>>>>> project, a cmip5_core project, etc.
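The "first format that matches" rule described above can be illustrated with a short Python sketch. It is not the publisher's implementation; it assumes placeholders of the form `%(name)s` that each stand for one path segment, and the function names and example formats are hypothetical.

```python
import re


def format_to_regex(directory_format):
    """Turn '%(name)s' placeholders into named groups matching one segment."""
    parts = re.split(r"%\((\w+)\)s", directory_format)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2:
            pattern += f"(?P<{part}>[^/]+)"  # odd entries are placeholder names
        else:
            pattern += re.escape(part)       # even entries are literal text
    # Allow further subdirectories below the described part of the path.
    return re.compile(pattern + r"(?:/.*)?$")


def first_match(directory_formats, directory):
    """Try the '|'-separated formats in order and return the first match."""
    for fmt in directory_formats.split("|"):
        fmt = fmt.strip()
        match = format_to_regex(fmt).match(directory)
        if match:
            return fmt, match.groupdict()
    return None


FORMATS = "/replicated/%(project)s/%(model)s | /core/%(project)s/%(model)s"
```

Because the formats are tried in order, the more specific ones would have to be listed first whenever two formats can match the same directory.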
>>>>>>>>>>>
>>>>>>>>>>> Bob
>>>>>>>>>>>
>>>>>>>>>>> On Jul 1, 2010, at 5:42 AM, Estanislao Gonzalez wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Bryan,
>>>>>>>>>>>>
>>>>>>>>>>>> thanks for your answer!
>>>>>>>>>>>> Running multiple ESG data nodes is always a possibility, but it
>>>>>>>>>>>> seems like overkill to us, as we may have several different
>>>>>>>>>>>> "data repositories".
>>>>>>>>>>>> We would like to separate: core-replicated,
>>>>>>>>>>>> core-non-replicated, non-core, non-core-on-hpss, as well as
>>>>>>>>>>>> other non-cmip5 data.
>>>>>>>>>>>> Having 5+ ESG data nodes is not viable in our scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> The TDS allows the separation of access URL from the underlying
>>>>>>>>>>>> file structure so that it might be possible. AFAIK the
>>>>>>>>>>>> publisher does not provide a simple way of doing this.
>>>>>>>>>>>>
>>>>>>>>>>>> Setting thredds_dataset_roots to different values while
>>>>>>>>>>>> publishing doesn't appear to work as those are mapped to a
>>>>>>>>>>>> map-entry at the catalog root:
>>>>>>>>>>>> <datasetRoot path="CMIP5" location="/replicated/CMIP5"/>
>>>>>>>>>>>> <datasetRoot path="CMIP5" location="/core/CMIP5"/>  ..
>>>>>>>>>>>>
>>>>>>>>>>>> which is clearly not bijective and therefore can't be reversed
>>>>>>>>>>>> to locate the file from a given URL.
>>>>>>>>>>>>
>>>>>>>>>>>> While publishing, all referenced data will be held at a known
>>>>>>>>>>>> location.
>>>>>>>>>>>> Is it possible to somehow use this information to set up a
>>>>>>>>>>>> proper catalog configuration so that the URL can be properly
>>>>>>>>>>>> mapped? At least at the dataset level?
>>>>>>>>>>>>
>>>>>>>>>>>> The whole HPSS staging procedure should be completely
>>>>>>>>>>>> transparent to the user, as well as the location of the files.
>>>>>>>>>>>> I was just looking at other options in case we cannot publish
>>>>>>>>>>>> them the way we want...
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Estani
>>>>>>>>>>>>
>>>>>>>>>>>> Bryan Lawrence wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> sorry.
>>>>>>>>>>>>>
>>>>>>>>>>>>> the first sentence should have read
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to note that *our* approach to the local versus
>>>>>>>>>>>>> replication issue will be ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Bryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thursday 01 Jul 2010 11:25:37 Bryan Lawrence wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Estani
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to note that your approach to the local versus
>>>>>>>>>>>>>> replication will be to run two different ESG nodes ... which
>>>>>>>>>>>>>> is in fact the desired outcome so as to get the right things
>>>>>>>>>>>>>> in the catalogues at the right time (vis-à-vis QC etc.).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not so sure about the issue with respect to cache: in
>>>>>>>>>>>>>> what way do you want to expose that into ESG?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday 30 Jun 2010 17:05:57 Estanislao Gonzalez wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Stephen,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> the page contains really helpful information, thanks a lot!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also interested in some variables of the DEFAULT section
>>>>>>>>>>>>>>> from the esg.ini configuration file. More specifically:
>>>>>>>>>>>>>>> thredds_dataset_roots (and maybe
>>>>>>>>>>>>>>> thredds_aggregation_services or any other which was changed
>>>>>>>>>>>>>>> or you think it might be important)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main question here is: how can different local directory
>>>>>>>>>>>>>>> structures be published to the same DRS structure?
>>>>>>>>>>>>>>> The example scenario in our case will be:
>>>>>>>>>>>>>>> /replicated/<DRS structure>  - for replicated data
>>>>>>>>>>>>>>> /local/<DRS structure>  - for non-replicated data held on disk
>>>>>>>>>>>>>>> /cache/<DRS structure>  - for data staged from an HPSS system
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The only solution I can think of is to extend the URL before
>>>>>>>>>>>>>>> the DRS structure starts (the URL won't be 100% DRS conformant
>>>>>>>>>>>>>>> anyway). So
>>>>>>>>>>>>>>>    http://******server/thredds/fileserver/<DRS structure>
>>>>>>>>>>>>>>> will turn into
>>>>>>>>>>>>>>>    http://******server/thredds/fileserver/replicated/<DRS structure>
>>>>>>>>>>>>>>>    http://******server/thredds/fileserver/local/<DRS structure>
>>>>>>>>>>>>>>>    http://******server/thredds/fileserver/cache/<DRS structure>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is that viable? Are there any other options?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Estani
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To illustrate how the ESG datanode can be configured to
>>>>>>>>>>>>>>>> serve data for CMIP5 we have deployed a datanode containing
>>>>>>>>>>>>>>>> a subset of CMIP3 in the Data Reference Syntax. Some key
>>>>>>>>>>>>>>>> features of this deployment are:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    * The underlying directory structure is based on the Data
>>>>>>>>>>>>>>>>      Reference Syntax.
>>>>>>>>>>>>>>>>    * Datasets are published at the realm level.
>>>>>>>>>>>>>>>>    * The token-based security filter is replaced by the
>>>>>>>>>>>>>>>>      OpenidRelyingParty security filter.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Further notes can be found at
>>>>>>>>>>>>>>>> http://******proj.badc.rl.ac.uk/go-essp/wiki/CMIP3_Datanode
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This test deployment should be of interest to anyone
>>>>>>>>>>>>>>>> wanting to know how DRS identifiers could be exposed in
>>>>>>>>>>>>>>>> THREDDS catalogues and the TDS HTML interface.  You can
>>>>>>>>>>>>>>>> also try downloading files with OpenID authentication or
>>>>>>>>>>>>>>>> via wget with SSL-client certificate authentication.  See
>>>>>>>>>>>>>>>> the link above for details.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Stephen.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>>>>>>>>>>> British Atmospheric Data Centre
>>>>>>>>>>>>>>>> Rutherford Appleton Laboratory
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>>>>>>> http://******mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Estanislao Gonzalez
>>>>>>>>>>>>
>>>>>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>>>>>
>>>>>>>>>>>> Phone:   +49 (40) 46 00 94-126
>>>>>>>>>>>> E-Mail:  estanislao.gonzalez at zmaw.de
>>>>>>>>> --
>>>>>>>>> Gavin M. Bell
>>>>>>>>> Lawrence Livermore National Labs
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> "Never mistake a clear view for a short distance."
>>>>>>>>>                   -Paul Saffo
>>>>>>>>>
>>>>>>>>> (GPG Key - http://****rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>>>>>>
>>>>>>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>>>>>>>>
>>>>
>>>> _______________________________________________
>>>> is-enes-sa2-jra4 mailing list
>>>> is-enes-sa2-jra4 at lists.enes.org
>>>> https://*lists.enes.org/mailman/listinfo/is-enes-sa2-jra4
>