[Go-essp-tech] Non-DRS File structure at data nodes

Fri Sep 2 08:27:30 MDT 2011

Hello,

Thanks for all the replies on this.  It's all interesting/confusing.  Sorry, what follows is quite detailed and technical, and I may have some of it wrong.  I hope someone has the patience to follow it through, as there are requests for clarity.  Maybe I should be reading some documentation, in which case please point me towards it. I'll summarise first, then give a bit more context.

1. When the DRS talks about a publication-level dataset version THREDDS ID, which xml element or attribute does this mean?

2. Are the urlPaths in a thredds catalog accurate enough to
  a. download the data (by prepending a service path) - apart from any authentication issues, and
  b. interpret the last part as a file name, where the filename conforms to the DRS?
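
To make question 2 concrete, here is a minimal sketch of the kind of processing being asked about. It assumes the dataset catalogue uses the standard THREDDS InvCatalog 1.0 namespace and advertises a service with serviceType "HTTPServer"; the catalogue fragment, host name and file path are made up for illustration and are not taken from a real data node.

```python
import posixpath
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical catalogue fragment: the path and file name are invented.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="HTTPServer" serviceType="HTTPServer" base="/thredds/fileServer/"/>
  <dataset name="example dataset" ID="example">
    <dataset name="example file"
             urlPath="esg_dataroot/site/layout/tas_3hrCurt_HadGEM2-A_tamip200904_r10i1p1_200904010000-200904052100.nc"/>
  </dataset>
</catalog>
"""


def file_urls(catalog_xml, host):
    """Return (download_url, filename) pairs for every dataset with a urlPath."""
    root = ET.fromstring(catalog_xml)

    # Find the base path of the plain HTTP download service.
    base = None
    for service in root.iter("{%s}service" % THREDDS_NS):
        if service.get("serviceType") == "HTTPServer":
            base = service.get("base")
            break
    if base is None:
        raise ValueError("no HTTPServer service found in catalogue")

    results = []
    for dataset in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = dataset.get("urlPath")
        if not url_path:
            continue  # container datasets carry no urlPath
        # Question 2a: prepend the host and the service base to the urlPath.
        url = "/".join([host.rstrip("/"), base.strip("/"), url_path.lstrip("/")])
        # Question 2b: treat the last path segment as the file name.
        results.append((url, posixpath.basename(url_path)))
    return results


for url, filename in file_urls(SAMPLE_CATALOG, "http://datanode.example.org"):
    print(filename, url)
```

Whether the last path segment really is a DRS-conformant filename on every node is exactly the open question; the sketch only shows where that assumption enters.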

Now for the context.........

I think it's back to understanding what the DRS document means.  When I read the DRS I interpret the word 'will' to mean there is a mandatory requirement on someone/something to make sure this happens.  I think that's a reasonable interpretation (if not a reasonable expectation).

So, sorry to be a pedant (again) but DRS 1.2 March 9th says in section 3.4:

'Each publication-level dataset version **will** have the THREDDS id:
<activity>.<product>.<institute>.<model>.<experiment>.<frequency>.
<modeling realm>.<MIP table>.<ensemble member>.<version>'

(highlight on **will** is mine).

Of course this is a bit vague on what it means by a THREDDS id, but it implies there is something somewhere in a thredds catalogue that has this DRS-defined identifier.  Can someone clarify what this thing is?  I have interpreted it as the xlink:title attribute in the catalogRef elements of the esgcet catalogue.  But I think Stephen is interpreting it as the property drs_id in the dataset xml document.  Based on what I can see I don't think this second interpretation is quite right - the property drs_id looks more like what the DRS document calls a publication-level dataset id (not versioned).  Picking an example from our own data...

<property name="drs_id" value="tamip.output1.MOHC.HadGEM2-A.tamip200904.3hr.atmos.3hrCurt.r10i1p1"/>

(I know it's tamip - but I think tamip should be treated in the same way as cmip5).  This drs_id does not contain the version string.  Here's what the DRS document says about this kind of id:

'The CMIP5 best practices document3 defines a Publication-level dataset_id
as:
<activity>.<product>.<institute>.<model>.<experiment>.<frequency>.<modeling
realm>.<MIP table>.<ensemble member>'

(which ties in with Estani saying some things are only best practice).
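
The difference between the two kinds of id can be checked mechanically. The following is only a sketch: it assumes that no DRS component contains a '.', and the version string "v1" in the usage lines is a made-up placeholder, not a real published version.

```python
# Component names follow the two templates quoted from the DRS document.
DRS_COMPONENTS = [
    "activity", "product", "institute", "model", "experiment",
    "frequency", "modeling_realm", "mip_table", "ensemble_member",
]


def parse_dataset_id(dataset_id):
    """Split a dot-separated id into DRS components.

    Nine components are read as a publication-level dataset id (no
    version); ten are read as a dataset version id whose last component
    is the version.  Assumes no component itself contains a dot.
    """
    parts = dataset_id.split(".")
    if len(parts) == len(DRS_COMPONENTS):
        version = None
    elif len(parts) == len(DRS_COMPONENTS) + 1:
        version = parts.pop()
    else:
        raise ValueError("unexpected number of components: %r" % dataset_id)
    fields = dict(zip(DRS_COMPONENTS, parts))
    fields["version"] = version
    return fields


# The drs_id property quoted above: no version component.
drs_id = "tamip.output1.MOHC.HadGEM2-A.tamip200904.3hr.atmos.3hrCurt.r10i1p1"
print(parse_dataset_id(drs_id)["version"])

# The same id with a hypothetical version appended.
print(parse_dataset_id(drs_id + ".v1")["version"])
```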

Rather than the xlink:title the other candidate for the publication-level dataset version THREDDS id that I can see is <dataset ... ID=>.  (but this may change?)

Please, please can someone clarify.  If I can't use xlink:title, what is it used for?  (It's obviously more performant if I can just use xlink:title, as I only have to get one document.)

I also *think* that the filename is mandatory (again quoting the DRS): 

'For CMIP5 the filename **will** be constructed as
follows:
filename = <variable name>_<MIP table>_<model>_<experiment>_
<ensemble member>[_<temporal subset>].nc'

Which means I should always be able to get the variable name from the filename, which I'm inferring from the urlPath - which I'm assuming is accurate enough to do this, even if it is not compliant with what the DRS says:

'URLs referencing the data files will have a site dependent prefix followed by the DRS directory
structure.'

(hmm does all this prove anything apart from the level of detail I'm prepared to read and interpret the DRS document to?)
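
As a rough illustration of the filename inference described above, the sketch below splits the last segment of a urlPath according to the DRS filename template. It assumes that no component contains an underscore, and the example path is invented rather than taken from a real catalogue.

```python
import posixpath

# Mandatory filename components, in the order given by the DRS template.
FILENAME_COMPONENTS = [
    "variable", "mip_table", "model", "experiment", "ensemble_member",
]


def parse_cmip5_filename(url_path):
    """Infer DRS filename components from the last segment of a urlPath."""
    filename = posixpath.basename(url_path)
    if not filename.endswith(".nc"):
        raise ValueError("not a netCDF filename: %r" % filename)
    parts = filename[:-len(".nc")].split("_")
    # Five mandatory components plus an optional temporal subset.
    if len(parts) not in (5, 6):
        raise ValueError("filename does not match the DRS template: %r" % filename)
    fields = dict(zip(FILENAME_COMPONENTS, parts[:5]))
    fields["temporal_subset"] = parts[5] if len(parts) == 6 else None
    return fields


# Hypothetical urlPath with a non-DRS directory layout.
example = "some/site/layout/tas_3hrCurt_HadGEM2-A_tamip200904_r10i1p1_200904010000-200904052100.nc"
print(parse_cmip5_filename(example)["variable"])
```

A filename that fails this check would be one of the cases where the 'will' in the DRS is not being honoured.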

I know using the data nodes is cumbersome, and means I may 'list' more data than I can 'get' because of the authorization issue, *but* 

  1. I can cope with failed attempts to get data better than I can cope with data I never knew about.  I *think* the gateways still don't have a full picture of all published data do they?  (e.g. I can't see IPSL from PCMDI, as far as I can tell).  

  2. I don't think it's so cumbersome that it's not worth the effort (though performance may be an issue depending on what things mean...).

(Is there any documentation on the REST interface to the gateway API?  I've found https://wiki.ucar.edu/display/esgcet/Remote+Metadata+Query+API, but I think it's still based on the Hessian interface.)

Thanks for bearing with all this,

Jamie

> -----Original Message-----
> From: stephen.pascoe at stfc.ac.uk [mailto:stephen.pascoe at stfc.ac.uk] 
> Sent: 02 September 2011 12:53
> To: Kettleborough, Jamie; martin.juckes at stfc.ac.uk; 
> gonzalez at dkrz.de; taylor13 at llnl.gov
> Cc: go-essp-tech at ucar.edu; Laura.E.Carriere at nasa.gov
> Subject: RE: [Go-essp-tech] Non-DRS File structure at data nodes
> 
> Jamie,
> 
> If you are parsing THREDDS catalogs the THREDDS properties 
> would be the indicator of the DRS structure.  
> 
> * The <property name="drs_id"> element states all DRS 
> components down to the publication level.  The dataset_id 
> property could also be used but may diverge from drs_id in the future.
> * The <variable> elements describe variable names.
> * The <property name="version"> element states the version.
> 
> Generally don't use the dataset@ID attribute as this could 
> change in future THREDDS versions.
> 
> Cheers,
> Stephen.
> 
> 
> ---
> Stephen Pascoe  +44 (0)1235 445980
> Centre of Environmental Data Archival
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot 
> OX11 0QX, UK
> 
> 
> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of 
> Kettleborough, Jamie
> Sent: 02 September 2011 12:32
> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de; 
> taylor13 at llnl.gov
> Cc: go-essp-tech at ucar.edu; Laura.E.Carriere at nasa.gov
> Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> 
> Hello,
> 
> Martin, like you, we want to store data locally in a DRS-like 
> way.  I think this means we will not preserve the actual 
> directory structure as implied by the URL, but will infer its 
> DRS location from the publication data set version id (I 
> think that's the right word), and the filename in the URL.  
> Do you know of any reason (or examples) where this won't work?
> 
> In practice this means we will take the publication data set 
> version id from the esgcet/thredds/catalog.xml <catalogRef 
> xlink:title> (hope you understand the short hand - I don't 
> know xpath...).  We will infer the variable from the pre '_' 
> part of the filename in the URL, and the filename will be 
> taken from the URL.  From what I understood of the DRS 
> document this seems to be the most reliable way of deriving 
> the implied DRS directory structure.   We'll get the URL from 
> the thredds data set catalogue (<catalogRef  xlink:href> in 
> esgcet/thredds/catalog.xml) using <dataset><dataset urlPath>. 
> 
> Any pitfalls?  (Apart from those pointed out a while ago by 
> Estani on the use of the thredds catalogue rather than the 
> gateway API - but my guess is the gateways get populated by 
> harvesting this exact same information - or is this guess wrong?).
> 
> Thanks,
> 
> Jamie
>  
> 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of 
> > martin.juckes at stfc.ac.uk
> > Sent: 02 September 2011 11:59
> > To: gonzalez at dkrz.de; taylor13 at llnl.gov
> > Cc: go-essp-tech at ucar.edu; Laura.E.Carriere at nasa.gov
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hello All,
> > 
> > Just a few comments. As Karl says, it is clear that users can get at 
> > data which is not in the DRS directory structure, and in many cases 
> > will not be aware of the distinction. In addition to the points Estani 
> > raises, some users may wish to preserve the directory structure in 
> > their local copies and will be faced with a range of different 
> > directory structures
> > -- so it is clear that lack of standardisation is going to cause 
> > problems for some users.
> > 
> > Another aspect is version control: as Karl points out, CMOR is 
> > generally going to be run before it is possible to determine the 
> > version of the dataset to which a file will be assigned. So the 
> > version needs to be assigned later. We talked a great deal about the 
> > importance of having version control of data implemented at the data 
> > nodes, and I was under the impression that it would be mandatory -- 
> > but perhaps we didn't get that far.
> > 
> > Data which is replicated to BADC will be available through a range of 
> > interfaces, including direct file system access to users logged onto 
> > local machines. We will convert data into the DRS directory structure 
> > (having different structures for data from different groups is far too 
> > complicated to be worth considering). This directory structure is also 
> > required for quality control. We do have a requirement to ensure that 
> > copies of data published at the archive centres (PCMDI, DKRZ and BADC) 
> > are identical to those published at the providing centres. The plan 
> > was to exploit the DRS directory structure to meet this requirement -- 
> > if directory structures vary between copies we may struggle here -- 
> > though it should be possible to find a solution using file checksums.
> > 
> > cheers,
> > Martin
> > 
> > 
> > 
> > 
> > ________________________________
> > From: go-essp-tech-bounces at ucar.edu
> > [go-essp-tech-bounces at ucar.edu] on behalf of Estanislao Gonzalez 
> > [gonzalez at dkrz.de]
> > Sent: 02 September 2011 10:55
> > To: Karl Taylor
> > Cc: go-essp-tech at ucar.edu; Laura Carriere
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Dear Laura, Karl
> > 
> > Regarding Karl's three points:
> > 
> > 1) Indeed, what Karl said is true. Our discussion around DRS is 
> > precisely because it's not mandated.
> > I think we made quite a few mistakes in this: if we had delivered 
> > proper tools in time, there should have been no need for data centers 
> > to come up with different directory structures.
> > 
> > 2) The drslib is not intended for CMIP3, though it will/might be used 
> > for that purpose. It mainly produces a valid DRS structure out of 
> > any files in another structure (including CMOR2). I think Stephen can 
> > comment more on this if required.
> > 
> > 3) In my opinion, the recommendation is useful for datacenters, but 
> > not on an archive level. We must cope with data centers not complying 
> > with this, so it's the same as if there were no recommendation at all.
> > 
> > I know the main idea is to create a middleware layer that would make 
> > file structures obsolete. But then, we will have to write all tools 
> > again in order to interact with this intermediate level or at least 
> > patch them somehow. gridFTP, as well as ftp, are only useful as 
> > transmission protocols, you can't write your own script to use them, 
> > you have to rely on either the gateway or the datanode to find what 
> > you are looking for.
> > In my opinion, we will be relying too much on the ESG infrastructure. 
> > What would happen if we lose the publisher database? How would we 
> > tell apart one version from another, if this is not represented in the 
> > directory structure?
> > My fear is that if we keep separating the metadata from the data 
> > itself, we add a new weak link in the chain. Now if we lose the 
> > metadata the data will also be useless (this would indeed be the worst 
> > case scenario). In 10 years we will have no idea what these interfaces 
> > were like, probably both data node and gateways will be superseded by 
> > newer versions that can't translate our old requirements. But as I 
> > said, that's a problem for LTAs only. In any case, we need the 
> > middleware to provide some services and speed things up, but I don't 
> > think we should rely blindly on it.
> > 
> > And regarding CMOR2, indeed it was designed to be flexible, but drslib 
> > also relies on the same CMOR tables to separate what output1 and 2 is. 
> > And there's no magic in drslib regarding versioning, it must be input 
> > by hand. Why this functionality was kept away from CMOR2 is not really 
> > clear to me. Whatever the reason was, I'm not sure it works best for 
> > all configurations regarding who creates, post-processes and publishes 
> > the data.
> > 
> > I don't mean we should change any of these, it's too late and that 
> > wasn't the point anyway. I just thought that it is worth the 
> > discussion, especially for the future.
> > 
> > Thanks,
> > Estani
> > 
> > On 02.09.2011 00:02, Karl Taylor wrote:
> > Dear Laura,
> > 
> > Thank you for providing an important perspective on this.  I agree 
> > that misunderstanding and poor communication about this has caused 
> > considerable confusion.
> > 
> > Here are some short answers to your questions, followed by a more 
> > complete discussion that others may also want to read carefully:
> > 
> > 1.  It is *not* true that CMIP5 or ESG mandate a specific directory 
> > structure, although the DRS document recommends a specific directory 
> > structure for CMIP5.  Note that for reanalysis data, which falls 
> > under the "obs4MIPs" project, the recommended (again not required) 
> > directory structure differs from CMIP5.
> > 
> > 2.  The directory structure produced by CMOR2 is not identical to the 
> > directory structure for CMIP*3* data stored at PCMDI.  It also differs 
> > from the "final" form of the recommended (not required) directory 
> > structure for CMIP5. I'm not sure if drslib
> > (http://esgf.org/esgf-drslib-site/index.html) can convert from CMIP3 
> > to final recommended CMIP5 directory structure, but I know it can 
> > convert from the default CMOR2-produced directory structure to final 
> > CMIP5 structure (although I didn't see this mentioned in the drslib 
> > documentation).
> > 
> > 3.  The recommended procedure for treatment of CMIP5 data is to write 
> > it using CMOR2 (without overriding the default directory structure it 
> > produces) and then use drslib (or equivalent) to produce the final 
> > directory structure.
> > 
> > Now for some discussion....
> > 
> > For ESG, there is no directory structure imposed.  When datasets are 
> > published, information is recorded that enables users (through 
> > gateways) to access the data they want (without any knowledge of 
> > directory structures).  The directory structures recommended for CMIP5 
> > and for the "obs4MIPs" activity are different, but this does not 
> > hamper ESG from serving them and searching them, because it doesn't 
> > really care about directory structure.
> > 
> > For CMIP5 (which is only one of the projects served by ESG),
> > CMOR2 creates a directory structure that is a reasonable way to 
> > organize the output, and CMOR2 can generate filenames according to a 
> > template required by CMIP5, as described in the DRS document.
> > 
> > For CMIP5 the DRS document recommends (but does *not* require) a 
> > final directory structure.  Because this is only a recommendation, 
> > individual data nodes may choose to organize their data to fit their 
> > own local requirements.
> > 
> > The DRS specifies a controlled vocabulary, and various "descriptors" 
> > of CMIP5 datasets that are stored in catalogs at the data nodes.  
> > This information can be accessed in various ways, but by "reading" 
> > the catalogs (which are xml files), a user can obtain the URL that 
> > can be used to get the data.  The uniformity in structure for all 
> > CMIP5 catalogs ensures that software can be written to automatically 
> > translate between a set of DRS descriptors that uniquely identify the 
> > data being sought and a list of (possibly *non-uniformly* structured) 
> > directories/filenames containing that data.  Thus the ESG gateway can 
> > generate wget scripts that can be run to download the data even when 
> > the directory structures differ from one node to another.  Presumably 
> > other tools could get the URLs similarly.
> > 
> > By the way, CMOR2 was designed to meet the needs of many different 
> > projects, not just CMIP5, so having it automatically generate 
> > directory structures consistent with the requirements of these 
> > different projects is difficult.  For one thing, the "output" 
> > descriptor called for by the DRS requires a complicated algorithm 
> > unique to CMIP5 and thus this information is unknown by CMOR2.  Also 
> > the version number (which appears in the final recommended DRS 
> > directory structure) is based on the ("publication") date of the 
> > dataset.  Since a dataset comprises many different variables, perhaps 
> > written on different days, it would be impossible for CMOR2 to assign 
> > this date automatically, which is why the version number is assigned 
> > when the data are published.  Thus, the full, final directory 
> > structure *recommended* by CMIP5 cannot be assigned by CMOR2.
> > 
> > So, those are the rules for CMIP5:  the directory structure is not 
> > mandated, but it is certainly recommended.  I think that using drslib 
> > is a good way to put CMOR2 output in the recommended DRS directory 
> > structure, and I don't *think* other steps are required.
> > 
> > Please let me know if you have questions, and please feel free to 
> > respond.
> > 
> > Best regards,
> > Karl
> > 
> > 
> > 
> > On 9/1/11 12:55 PM, Laura Carriere wrote:
> > 
> > For what it's worth, I'm going to add my own perspective, one that 
> > comes from someone who is managing the team that is publishing the 
> > data at NASA/GSFC but is not involved in writing the code or producing 
> > the data.  In other words, I'm sure there's lots I don't understand, 
> > but here's what I have managed to decipher.
> > 
> > I'll start by saying that we don't have a strong opinion about what 
> > directory structure is used.  Our focus is on providing users quick 
> > access to data that is accurate and easily identified.  Our initial 
> > understanding was that CMOR2 would create the correct DRS file 
> > structure but we have since learned that this is not the case.  We 
> > were also under the impression that the DRS file structure was 
> > "recommended" not "required".  This, also, appears not to be the case.
> > 
> > After learning that we weren't using the correct file structure, we 
> > re-read the documentation more carefully but we were still left not 
> > really knowing what the expectations were.
> > 
> > First I read the CMIP5 Data Reference Syntax (DRS) and Controlled 
> > Vocabulary documentation:
> > 
> > 
> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf
> > 
> > Section 3.1 shows the DRS structure we were creating by using CMOR2, 
> > and 3.3 shows the DRS structure that we are supposed to be creating.
> > 
> > It also states that there is an expectation that we are responsible 
> > for "transforming" the CMOR2 structure to the recommended structure.  
> > I found this surprising so I checked the CMOR2 release notes and found 
> > that there's no reference to modifying CMOR2 to have an option to 
> > produce the new DRS structure so it became clear that we needed to do 
> > this ourselves.
> > 
> > I then looked at the drslib page:
> > 
> > http://esgf.org/esgf-drslib-site/index.html
> > 
> > This is a utility to convert a CMIP3 directory structure to 
> > DRS-compliant form but since our team is quite new to the IPCC 
> > activity, we don't know if what CMOR2 creates is CMIP3 or not.
> > 
> > That left us not knowing if there were any tools to do what we had 
> > been asked to do.  The data provider was willing to recreate the data 
> > with the missing directories so we republished all the data we had at 
> > the time.  However, that doesn't really help us with the next data 
> > provider who is just now starting to give us data.
> > 
> > What I would like to be able to find is a simple way for the data 
> > providers (who are running CMOR2 but are not publishing the data) to 
> > prepare the directory structure in a way that is compliant.  I would 
> > rather not ask them to wade through all the above documentation and 
> > translate the directory structure themselves because they are busy 
> > enough as it is.
> > 
> > Ideally I would like to be able to tell them to use a particular 
> > option to CMOR2 to create the right structure but such an option 
> > doesn't exist.  The second best option would be some clarification on 
> > the use of drslib.  Specifically, can it be run on the directory 
> > structure that CMOR2 produces and will it then produce a compliant 
> > directory structure that we can publish?  And are there any 
> > additional steps required?
> > 
> > So, in the interests of improving communication, I suggest that 
> > someone remove the word "recommended" from sections 3.1 and 3.3 in the 
> > DRS document, explain why it's "required" and the repercussions of not 
> > complying and also add instructions on how to get to the "required"
> > structure.  In an ideal world, an option would be added to
> > CMOR2 to do this there.
> > 
> > As I said, this is just my perspective from the data publication side.
> > Please feel free to enlighten me on what I've missed.  Thanks.
> > 
> >    Laura Carriere
> > 
> > 
> > On 9/1/2011 4:58 AM, Kettleborough, Jamie wrote:
> > 
> > 
> > Hello,
> > 
> > Isn't one issue that for some applications the *interface* with the 
> > data is at the *file system level* - not the catalogues? Version 
> > management, QC look like they are examples, and replication may be too 
> > (and I think these are pretty much federation wide 
> > activities/applications).  So if you want to minimise the complexity 
> > (~= minimise time to develop, cost of maintenance) in the way these 
> > applications interact with the data you want to ensure consistency in 
> > the way data is stored in the file system.
> > Bryan - I wasn't sure what interfaces you were talking about... Sorry.
> > 
> > I'm going to be a bit pedantic here - but I don't think the DRS 
> > document says that data nodes must follow the DRS directory structure, 
> > it's only a recommendation.  Though there *may* be a slight 
> > inconsistency in the way the DRS is written as it says the URLs 
> > *will* be a site dependent prefix followed by the *DRS directory 
> > structure*.  At least that's my reading of the 1.2 version dated 9th 
> > March. I don't think all nodes are following the DRS specification 
> > for the URLs because they don't have the same underlying directory 
> > structure.  I don't know if the way the DRS is written or being 
> > interpreted is one of the sources of misunderstanding over this issue 
> > of DRS directory structure?  (This is not a criticism, it's an 
> > acceptance that communicating specification and plans is a hard 
> > problem to crack).
> > 
> > Another (possibly weak) motivation for keeping all data in the DRS 
> > directory structure is it gives you a last ditch back up strategy - if 
> > you lose the catalogues you can regenerate the version info etc from 
> > the file system.
> > 
> > Jamie
> > 
> > 
> > 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Bryan Lawrence
> > Sent: 01 September 2011 08:55
> > To: go-essp-tech at ucar.edu
> > Cc: stockhause at dkrz.de
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi Folks
> > 
> > At least it's now clear to me, that we can't rely on the DRS structure
> > so we should try to cope with this.
> > 
> > I'm just coming back to this, and I haven't read all of this thread, 
> > but I don't agree with this statement!  If we can't rely on the DRS 
> > *at the interface level*, then ESGF is fundamentally doomed as a 
> > distributed activity, because we'll never have the resource to support 
> > all the possible variants.
> > 
> > Behind those interfaces, more flexibility might be possible, but 
> > components would need to be pretty targeted in their functionality.
> > 
> > Bryan
> > 
> > 
> > 
> > 
> > Thanks,
> > Estani
> > 
> > On 31.08.2011 12:55, stephen.pascoe at stfc.ac.uk wrote:
> > 
> > 
> > Hi Estani,
> > 
> > I see you have some code in esgf-contrib.git for managing a replica
> > database.  There's quite a lot of drs-parsing code there.  Is there
> > any reason why this couldn't use drslib?
> > 
> > Cheers,
> > Stephen.
> > 
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > Centre of Environmental Data Archival STFC Rutherford Appleton 
> > Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> > 
> > 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao 
> > Gonzalez
> > Sent: 31 August 2011 10:23
> > To: Juckes, Martin (STFC,RAL,RALSP)
> > Cc: stockhause at dkrz.de; go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi Martin,
> > 
> > Are you planning to publish that data as a new instance or as a 
> > replica? If I recall it right, Karl said he thought the replica was 
> > attached at a semantic level. But I have my doubts and haven't got any 
> > feedback on this. Does anyone know if the gateway can handle a replica 
> > with a different url path? (dataset and version "should" be the same, 
> > although keeping the same version will be difficult, because no tool 
> > can handle this AFAIK, i.e. replicating or publishing multiple 
> > datasets with different versions)
> > 
> > And regarding replication (independently from the previous question), 
> > how are you going to cope with new versions? Do you already have tools 
> > for harvesting the TDS and building a list of which files need to be 
> > replicated, given what you already have? The catalog will just publish 
> > a dataset and version along with a bunch of files, so you would need 
> > to keep a DB with the files you've already downloaded, and compare it 
> > with the catalog to work out what should be done next. This 
> > information is what drslib should use to create the next version. Is 
> > that what will happen? If not, how will you be solving this?
> > 
> > 
> > 
> > Thanks,
> > Estani
> > 
> > On 31.08.2011 10:54, martin.juckes at stfc.ac.uk wrote:
> > 
> > 
> > Hello Martina,
> > 
> > For BADC, I don't think we are considering storing data in anything 
> > other than the DRS structure -- we just don't have the time to build 
> > systems around multiple structures. This means that data that comes 
> > from a node with a different directory structure will have to be 
> > re-mapped. Verification of file identities will rely on check-sums, 
> > as it always will when dealing with files from archives from which we 
> > have no curation guarantees,
> > 
> > cheers,
> > Martin
> > 
> > ________________________________
> > From: go-essp-tech-bounces at ucar.edu
> > [go-essp-tech-bounces at ucar.edu] on behalf of Martina Stockhause 
> > [stockhause at dkrz.de]
> > Sent: 31 August 2011 09:44
> > To: go-essp-tech at ucar.edu
> > Subject: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi everyone,
> > 
> > we promised to describe the problems regarding the non-DRS file 
> > structures at the data nodes. Estani has already started the 
> > discussion on the replication/user download problems (see attached 
> > email and document).
> > 
> > Implications for the QC:
> > - In the QCDB we need DRS syntax. The DOI process, creation of CIM 
> > documents, and identification of the data the QC results are connected 
> > to rely on that.
> > - The QC needs to know the version of the data checked. The DOI at the 
> > end of the QC process is assigned to a specific unchangeable data 
> > version. At least at DKRZ we have to guarantee that the data is not 
> > changed after assignment of the DOI, therefore we store a data copy in 
> > our archive.
> > - The QC checker tool runs on files in a given directory structure and 
> > creates results in a copy of this structure. The QC wrapper can deal 
> > with recombinations of path parts. So, if the directory structure 
> > includes all parts of the DRS syntax, the wrapper can create the DRS 
> > syntax before insertion into the QCDB. But we deal with structures at 
> > the data nodes where some information is missing in the directory 
> > path, i.e. version and MIP table. Therefore additional information 
> > would be needed for that mapping.
> > 
> > 
> > 
> > Possible solutions to map the given file structure to the DRS 
> > directory structure before insertion into the QCDB:
> > 
> > 1. The data nodes of the three gateways that store replicas (PCMDI, 
> > BADC, DKRZ) publish data in the DRS directory structure. Then the QC 
> > run is possible without mapping. Replication problems?
> > 
> > 2. The directory structures of the data nodes are replicated as they 
> > are. We store the data under a certain version. How? Are there 
> > implications for the replication from the data nodes? The individual 
> > file structures down to the chunk level are stored together with their 
> > DRS identification in a repository and a service is created to access 
> > the DRS id for the given file in the given file structure. The QC and 
> > maybe other user data services use this service for mapping. That will 
> > slow down the QC insert process. Before each insert of a chunk name, a 
> > qc result for a specific variable, and the qc result on the experiment 
> > level, that service has to be called. And who can set up and maintain 
> > such a repository? DKRZ does not have the manpower to do that in the 
> > coming months.
> > 
> > 
> > 
> > Cheers,
> > Martina
> > 
> > 
> > 
> > -------- Original Message --------
> > Subject: RE: ESG discussion
> > Date:    Wed, 10 Aug 2011 15:35:04 +0100
> > From:    Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> > To:      Karl Taylor <taylor13 at llnl.gov>,
> >          Wood, Richard <richard.wood at metoffice.gov.uk>
> > CC:      Carter, Mick <mick.carter at metoffice.gov.uk>,
> >          Elkington, Mark <mark.elkington at metoffice.gov.uk>,
> >          Bentley, Philip <philip.bentley at metoffice.gov.uk>,
> >          Senior, Cath <cath.senior at metoffice.gov.uk>,
> >          Hines, Adrian <adrian.hines at metoffice.gov.uk>,
> >          Dean N. Williams <williams13 at llnl.gov>,
> >          Estanislao Gonzalez <gonzalez at dkrz.de>,
> >          <martin.juckes at stfc.ac.uk>,
> >          Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> > 
> > 
> > Hello Karl, Dean,
> > 
> > Thanks for your reply on this, and for the fact you are taking our
> > concerns seriously. You are right to challenge us for the specific
> > issues, rather than us just highlighting the things that don't meet
> > our (possibly idealised) expectations of how the system should look.
> > As a result, we have had a thorough review of our key issues. I think
> > some of them are issues that make it harder for us to do things now;
> > other issues are maybe more concerns of problems being stored up.
> > This document has been prepared with the help of Estani Gonzalez.
> > We would like to have Martin Juckes' input on this too - but he is
> > currently away on holiday.  I hope he can add to this when he
> > returns - he has spent a lot of time thinking about the implications
> > of data node directory structure on versioning. I hope this helps
> > clarify issues; if not please let us know.
> >
> > Thanks,
> >
> > Jamie
> > 
> > ________________________________
> > From: Karl Taylor [mailto:taylor13 at llnl.gov]
> > Sent: 09 August 2011 01:48
> > To: Wood, Richard
> > Cc: Carter, Mick; Kettleborough, Jamie; Elkington, Mark; Bentley,
> >     Philip; Senior, Cath; Hines, Adrian; Dean N. Williams
> > Subject: Re: ESG discussion
> > 
> > Dear all,
> > 
> > Thanks for taking the time to bring to my attention the ESG issues
> > that I hope can be addressed reasonably soon.  I think we're in
> > general agreement that the user's experience should be improved.
> >
> > I've discussed this briefly with Dean.  I plan to meet with him and
> > others here, and, drawing on your suggestions, we'll attempt to find
> > solutions and methods of communication that might improve matters.
> >
> > Before doing this, it would help if you could briefly answer the
> > following questions:
> > 
> > 1.  Is the main issue that it is currently difficult to script
> > downloads from all the nodes because only some support PKI?  What
> > other uniformity among nodes is required for you to be able to do
> > what you want to do (i.e., what do you specifically want to do that
> > is difficult to do now)?  [nb. all data nodes are scheduled to be
> > operating with PKI authentication by September 1.]
> >
> > 2.  Is there anything from the perspective of a data *provider* that
> > needs to be done (other than make things easier for data users)?
> > 
> > 
> > 3.  Currently ESG and CMIP5 do not dictate the directory structure
> > found at each data node (although most nodes are adhering to the
> > recommendations of the DRS).  The gateway software and catalog make
> > it possible to get to the data regardless of directory structure.
> > It is possible that "versioning" might impose additional constraints
> > on the directory structure, but I'm not sure about this.  (By the
> > way, I'm not sure what the "versioning" issue is since currently I
> > think it's impossible for users to know about more than one version;
> > is that the issue?)  From a user's or provider's perspective, is
> > there any essential reason that the directory structure should be
> > the same at each node?
> > 
> > 4.  ESG allows considerable flexibility in publishing data, and
> > CMIP5 has suggested "best practices" to reduce differences.  Only
> > some of the "best practices" are currently requirements.  A certain
> > amount of flexibility is essential since different data providers
> > have different resources to support the potential capabilities of
> > ESG (e.g., not all can support server-side calculations, which will
> > be put in place at some nodes).  Likewise a provider can currently
> > turn off the "checksum", if this is deemed to slow publication too
> > much (although we could insist that checksums be stored in the
> > thredds catalogue).  Nevertheless, it is unlikely that every data
> > node will be identically configured for all options.  What are the
> > *essential* ways that the data nodes should respond identically (we
> > may not be able to insist on uniformity that isn't essential for
> > serving our users)?
> > 
> > Thanks again for your input, and I look forward to your 
> > further help with this.
> > 
> > Best regards,
> > Karl
> > 
> > 
> > On 8/5/11 10:43 AM, Wood, Richard wrote:
> > 
> > Dear Karl,
> > 
> > Following on from our phone call I had a discussion with technical
> > colleagues here (Mick Carter, Jamie Kettleborough, Mark Elkington,
> > also earlier with Phil Bentley), and with Adrian Hines who is
> > coordinating our CMIP5 analysis work, about ideas for future
> > development of the ESG.
> >
> > Our observations are from the user perspective, and based on what
> > we can gather from mailing lists and our own experience.  Coming out
> > of our discussion we have a couple of suggestions that could help
> > with visibility for data providers and users:
> > 
> > - Some areas need agreement among the data nodes as to the technical
> > solution, and then implementation across all the nodes, while others
> > need a specific solution to be developed in one place and rolled
> > out.  The group teleconferences that Dean organises appear to be a
> > good forum for airing specific technical ideas and solutions.
> > However, in our experience it can be difficult in that kind of forum
> > to discuss planning and prioritisation questions.  From our
> > perspective we don't have visibility of the more project-related
> > issues such as key technical decisions, prioritisation and
> > timelines, or of whether issues that have arisen in the mailing list
> > discussions are being followed up.  We guess these may be discussed
> > in separate project teleconferences involving the technical leads
> > from the data nodes.  As users we would not necessarily expect to be
> > involved in those discussions, but as data providers and downloaders
> > it would be very helpful for our planning to see the outcomes of the
> > discussions.  The sort of thing we had in mind would be a simple web
> > page showing the priority development areas, agreed solutions and
> > estimated dates for completion/release.  Some solutions will need to
> > be implemented separately across all the participating data nodes,
> > and in these cases it would be useful to see the estimated timeframe
> > for implementation at each node.  This would not be intended as a
> > 'big stick' to the partners, but simply as a planning aid so that
> > everyone can see what's available when and the project can identify
> > any potential bottlenecks or issues in advance.  Also the intention
> > is not to generate a lot of extra work.  Hopefully providing this
> > information would be pretty light on people's time.
> > 
> > 
> > - From where we sit it appears that some nodes are quite successful
> > in following best practice and implementing the federation policies
> > as far as they are aware of them.  Could what these nodes do be made
> > helpful to all the data nodes (e.g. by using identical software)?
> > We realise there may be real differences between some data nodes -
> > but where possible we think that what is similar could be enforced
> > or made explicitly the same through sharing the software components
> > and tools.
> > 
> > 
> > To set the discussion on priorities rolling, Jamie has prepared, in
> > consultation with others here, a short document showing the Met
> > Office view of current priority issues (attached).  If you could
> > update us on the status of work on these issues, that would be very
> > useful (ideally via the web pages proposed above, which we think
> > would be of interest to many users, or via email in the interim).
> >
> > Many thanks for the update on tokenless authentication, which is
> > very good news.
> >
> > Once again, our thanks to you, Dean and the team for all the hard
> > work we know is going into this.  Please let us know what you think
> > of the above ideas and the attachment, and if there is anything we
> > can do to help.
> >
> > Best wishes,
> >
> > Richard
> > 
> > --------------
> > Richard Wood
> > Met Office Fellow and Head (Oceans, Cryosphere and Dangerous Climate Change)
> > Met Office Hadley Centre
> > FitzRoy Road, Exeter EX1 3PB, UK
> > Phone +44 (0)1392 886641  Fax +44 (0)1392 885681
> > Email richard.wood at metoffice.gov.uk
> > http://www.metoffice.gov.uk
> > Personal web page
> > http://www.metoffice.gov.uk/research/scientists/cryosphere-oceans/richard-wood
> >
> > *** Please note I also work as Theme Leader (Climate System) for the
> > Natural Environment Research Council ***
> > *** Where possible please send emails on NERC matters to
> > rwtl at nerc.ac.uk ***
> > 
> > 
> > --
> > Bryan Lawrence
> > University of Reading: Professor of Weather and Climate Computing
> > National Centre for Atmospheric Science: Director of Models and Data
> > STFC: Director of the Centre of Environmental Data Archival
> > Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence
> > _______________________________________________
> > GO-ESSP-TECH mailing list
> > GO-ESSP-TECH at ucar.edu<mailto:GO-ESSP-TECH at ucar.edu>
> > http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> > 
> > 
> > 
> > --
> > 
> > Laura Carriere    laura.carriere at nasa.gov
> > SAIC              301 614-5064
> > 
> > --
> > Estanislao Gonzalez
> > 
> > Max-Planck-Institut für Meteorologie (MPI-M)
> > Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> > Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >
> > Phone:   +49 (40) 46 00 94-126
> > E-Mail:  gonzalez at dkrz.de