[Go-essp-tech] Non-DRS File structure at data nodes

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Thu Sep 1 01:55:11 MDT 2011


Hi Folks

> At least it's now clear to me, that we can't rely on the DRS structure
> so we should try to cope with this.

I'm just coming back to this, and I haven't read all of this thread, but I don't 
agree with this statement!  If we can't rely on the DRS *at the interface 
level*, then ESGF is fundamentally doomed as a distributed activity, because 
we'll never have the resources to support all the possible variants.

Behind those interfaces, more flexibility might be possible, but components would 
need to be pretty targeted in their functionality.

Bryan


> Thanks,
> Estani
> 
> On 31.08.2011 12:55, stephen.pascoe at stfc.ac.uk wrote:
> > Hi Estani,
> > 
> > I see you have some code in esgf-contrib.git for managing a replica
> > database.  There's quite a lot of drs-parsing code there.  Is there any
> > reason why this couldn't use drslib?
> > 
> > Cheers,
> > Stephen.
> > 
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > Centre of Environmental Data Archival
> > STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> > 
> > 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> > Sent: 31 August 2011 10:23
> > To: Juckes, Martin (STFC,RAL,RALSP)
> > Cc: stockhause at dkrz.de; go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi Martin,
> > 
> > Are you planning to publish that data as a new instance or as a replica?
> > If I recall correctly, Karl said he thought a replica was attached at a
> > semantic level. But I have my doubts and haven't got any feedback on
> > this. Does anyone know if the gateway can handle a replica with a
> > different URL path? (Dataset and version "should" be the same, although
> > keeping the same version will be difficult, because AFAIK no tool can
> > handle this, i.e. replicating or publishing multiple datasets with
> > different versions.)
> > 
> > And regarding replication (independently of the previous question),
> > how are you going to cope with new versions? Do you already have tools
> > for harvesting the TDS and building a list of which files need to be
> > replicated, given what you already have?
> > 
> > The catalog will just publish a dataset and version along with a bunch
> > of files; you would need to keep a DB of the files you've already
> > downloaded and compare it with the catalog to work out what should be
> > done next. This information is what drslib should use to create the
> > next version. Is that what will happen? If not, how will you solve it?
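> > 
> > To make that concrete, here is a minimal sketch of the compare step
> > (in Python; the names are hypothetical and the catalog harvesting
> > itself is out of scope):
> > 
> >     import sqlite3
> > 
> >     def files_to_fetch(catalog_files, db_path="replicated.db"):
> >         """Return the catalog entries we have not replicated yet.
> > 
> >         catalog_files maps file name -> checksum, as harvested from
> >         the TDS catalog.
> >         """
> >         conn = sqlite3.connect(db_path)
> >         conn.execute("CREATE TABLE IF NOT EXISTS replicated "
> >                      "(name TEXT PRIMARY KEY, checksum TEXT)")
> >         have = dict(conn.execute("SELECT name, checksum FROM replicated"))
> >         conn.close()
> >         # Fetch a file if we lack it, or if its checksum changed (i.e.
> >         # a new version was published under the same name).
> >         return {n: c for n, c in catalog_files.items()
> >                 if have.get(n) != c}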
> > 
> > Thanks,
> > Estani
> > 
> > On 31.08.2011 10:54, martin.juckes at stfc.ac.uk wrote:
> >> Hello Martina,
> >> 
> >> For BADC, I don't think we are considering storing data in anything
> >> other than the DRS structure -- we just don't have the time to build
> >> systems around multiple structures. This means that data that comes
> >> from a node with a different directory structure will have to be
> >> re-mapped. Verification of file identities will rely on check-sums, as
> >> it always will when dealing with files from archives from which we have
> >> no curation guarantees,
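> >> 
> >> For illustration, such a check can be as simple as the following
> >> minimal sketch (assuming MD5 sums are what the catalog provides):
> >> 
> >>     import hashlib
> >> 
> >>     def verify(path, expected_md5):
> >>         """Compare a local file's MD5 sum against the catalog value."""
> >>         md5 = hashlib.md5()
> >>         with open(path, "rb") as f:
> >>             # Read in 1 MB blocks so large files don't exhaust memory.
> >>             for block in iter(lambda: f.read(1 << 20), b""):
> >>                 md5.update(block)
> >>         return md5.hexdigest() == expected_md5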
> >> 
> >> cheers,
> >> Martin
> >> 
> >> ________________________________
> >> From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu]
> >> on behalf of Martina Stockhause [stockhause at dkrz.de]
> >> Sent: 31 August 2011 09:44
> >> To: go-essp-tech at ucar.edu
> >> Subject: [Go-essp-tech] Non-DRS File structure at data nodes
> >> 
> >> Hi everyone,
> >> 
> >> we promised to describe the problems regarding the non-DRS file
> >> structures at the data nodes. Estani has already started the discussion
> >> on the replication/user download problems (see attached email and
> >> document).
> >> 
> >> Implications for the QC:
> >> 
> >> - In the QCDB we need DRS syntax. The DOI process, the creation of CIM
> >> documents, and the identification of the data the QC results are
> >> connected to all rely on that.
> >> - The QC needs to know the version of the data checked. The DOI at the
> >> end of the QC process is assigned to a specific, unchangeable data
> >> version. At least at DKRZ we have to guarantee that the data is not
> >> changed after assignment of the DOI; therefore we store a data copy in
> >> our archive.
> >> - The QC checker tool runs on files in a given directory structure and
> >> creates results in a copy of this structure. The QC wrapper can deal
> >> with recombinations of path parts. So, if the directory structure
> >> includes all parts of the DRS syntax, the wrapper can create the DRS
> >> syntax before insert into the QCDB. But we deal with structures at the
> >> data nodes where some information is missing from the directory path,
> >> i.e. version and MIP table. Therefore additional information would be
> >> needed for that mapping.
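> >> 
> >> As a sketch of what that recombination step amounts to (illustrative
> >> only: the facet names and ordering below are assumptions, and the
> >> missing facets have to be supplied from outside the path):
> >> 
> >>     def to_drs_id(path, layout, extra):
> >>         """Recombine path parts into a DRS identifier.
> >> 
> >>         layout names each component of the node-local path; extra
> >>         supplies the facets missing from it, e.g.
> >>         {"version": "v20110831", "table": "Amon"}.
> >>         """
> >>         facets = dict(zip(layout, path.strip("/").split("/")))
> >>         facets.update(extra)
> >>         order = ("activity", "product", "institute", "model",
> >>                  "experiment", "frequency", "realm", "table",
> >>                  "ensemble", "version", "variable")
> >>         return ".".join(facets[k] for k in order if k in facets)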
> >> 
> >> Possible solutions to map the given file structure to the DRS directory
> >> structure before insert into the QCDB:
> >> 
> >> 1. The data nodes of the three gateways that store replicas (PCMDI,
> >> BADC, DKRZ) publish data in the DRS directory structure. Then the QC
> >> run is possible without mapping. Replication problems?
> >> 
> >> 2. The directory structures of the data nodes are replicated as they
> >> are. We store the data under a certain version. How? Are there
> >> implications for the replication from the data nodes? The individual
> >> file structures down to the chunk level are stored together with their
> >> DRS identification in a repository, and a service is created to access
> >> the DRS id for a given file in a given file structure. The QC and
> >> maybe other user data services use this service for mapping. That will
> >> slow down the QC insert process: before each insert of a chunk name, a
> >> QC result for a specific variable, or the QC result on the experiment
> >> level, that service has to be called. And who can set up and maintain
> >> such a repository? DKRZ does not have the manpower to do that in the
> >> next months.
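> >> 
> >> Purely as an illustration of what such a lookup service amounts to (an
> >> in-memory stand-in, not a design proposal):
> >> 
> >>     class DrsMapping:
> >>         """Stand-in for the proposed path-to-DRS repository service."""
> >> 
> >>         def __init__(self):
> >>             self._map = {}  # node-local file path -> DRS id
> >> 
> >>         def register(self, local_path, drs_id):
> >>             self._map[local_path] = drs_id
> >> 
> >>         def lookup(self, local_path):
> >>             # Called before every QCDB insert (chunk name, variable
> >>             # result, experiment-level result) -- the extra round trip
> >>             # that slows the insert process down.
> >>             return self._map.get(local_path)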
> >> 
> >> Cheers,
> >> Martina
> >> 
> >> 
> >> 
> >> -------- Original Message --------
> >> Subject: RE: ESG discussion
> >> Date: Wed, 10 Aug 2011 15:35:04 +0100
> >> From: Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> >> To: Karl Taylor <taylor13 at llnl.gov>, Wood, Richard
> >> <richard.wood at metoffice.gov.uk>
> >> CC: Carter, Mick <mick.carter at metoffice.gov.uk>, Elkington, Mark
> >> <mark.elkington at metoffice.gov.uk>, Bentley, Philip
> >> <philip.bentley at metoffice.gov.uk>, Senior, Cath
> >> <cath.senior at metoffice.gov.uk>, Hines, Adrian
> >> <adrian.hines at metoffice.gov.uk>, Dean N. Williams
> >> <williams13 at llnl.gov>, Estanislao Gonzalez <gonzalez at dkrz.de>,
> >> <martin.juckes at stfc.ac.uk>, Kettleborough, Jamie
> >> <jamie.kettleborough at metoffice.gov.uk>
> >> 
> >> 
> >> Hello Karl, Dean,
> >> 
> >> Thanks for your reply on this, and for taking our concerns seriously.
> >> You are right to challenge us for the specific issues, rather than us
> >> just highlighting the things that don't meet our (possibly idealised)
> >> expectations of how the system should look. As a result, we have had a
> >> thorough review of our key issues. I think some of them are issues that
> >> make it harder for us to do things now; others are maybe more concerns
> >> about problems being stored up. This document has been prepared with
> >> the help of Estani Gonzalez. We would also like Martin Juckes' input on
> >> this - but he is currently away on holiday. I hope he can add to this
> >> when he returns - he has spent a lot of time thinking about the
> >> implications of data node directory structure on versioning. I hope
> >> this helps clarify the issues; if not, please let us know,
> >> Thanks,
> >> Jamie
> >> 
> >> ________________________________
> >> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> >> Sent: 09 August 2011 01:48
> >> To: Wood, Richard
> >> Cc: Carter, Mick; Kettleborough, Jamie; Elkington, Mark; Bentley,
> >> Philip; Senior, Cath; Hines, Adrian; Dean N. Williams Subject: Re: ESG
> >> discussion
> >> 
> >> Dear all,
> >> 
> >> Thanks for taking the time to bring to my attention the ESG issues that
> >> I hope can be addressed reasonably soon.  I think we're in general
> >> agreement that the user's experience should be improved.
> >> 
> >> I've discussed this briefly with Dean.  I plan to meet with him and
> >> others here, and, drawing on your suggestions, we'll attempt to find
> >> solutions and methods of communication that might improve matters. 
> >> Before doing this, it would help if you could briefly answer the
> >> following questions:
> >> 
> >> 1.  Is the main issue that it is currently difficult to script downloads
> >> from all the nodes because only some support PKI?  What other
> >> uniformity among nodes is required for you to be able to do what you
> >> want to do (i.e., what do you specifically want to do that is difficult
> >> to do now)?  [nb. all data nodes are scheduled to be operating with PKI
> >> authentication by September 1.]
> >> 
> >> 2.  Is there anything from the perspective of a data *provider*  that
> >> needs to be done (other than make things easier for data users)?
> >> 
> >> 3.  Currently ESG and CMIP5 do not dictate the directory structure found
> >> at each data node (although most nodes are adhering to the
> >> recommendations of the DRS).   The gateway software and catalog make it
> >> possible to get to the data regardless of directory structure.  It is
> >> possible that "versioning" might impose additional constraints on the
> >> directory structure, but I'm not sure about this.  (By the way, I'm not
> >> sure what the "versioning" issue is since currently I think it's
> >> impossible for users to know about more than one version; is that the
> >> issue?)  From a user's or provider's perspective, is there any
> >> essential reason that the directory structure should be the same at
> >> each node?
> >> 
> >> 4.  ESG allows considerable flexibility in publishing data, and CMIP5
> >> has suggested "best practices" to reduce differences.  Only some of the
> >> "best practices" are currently requirements.  A certain amount of
> >> flexibility is essential since different data providers have different
> >> resources to support the potential capabilities of ESG (e.g., not all
> >> can support server-side calculations, which will be put in place at
> >> some nodes). Likewise a provider can currently turn off the "checksum"
> >> if this is deemed to slow publication too much (although we could
> >> insist that checksums be stored in the THREDDS catalogue).
> >> Nevertheless, it is unlikely that every data node will be identically
> >> configured for all options. What are the *essential* ways that the data
> >> nodes should respond identically (we may not be able to insist on
> >> uniformity that isn't essential for serving our users)?
> >> 
> >> Thanks again for your input, and I look forward to your further help
> >> with this.
> >> 
> >> Best regards,
> >> Karl
> >> 
> >> 
> >> On 8/5/11 10:43 AM, Wood, Richard wrote:
> >> 
> >> Dear Karl,
> >> 
> >> Following on from our phone call I had a discussion with technical
> >> colleagues here (Mick Carter, Jamie Kettleborough, Mark Elkington, also
> >> earlier with Phil Bentley), and with Adrian Hines who is coordinating
> >> our CMIP5 analysis work, about ideas for future development of the ESG.
> >> Our observations are from the user perspective, and based on what we can
> >> gather from mailing lists and our own experience. Coming out of our
> >> discussion we have a couple of suggestions that could help with
> >> visibility for data providers and users:
> >> 
> >> - Some areas need agreement among the data nodes as to the technical
> >> solution, and then implementation across all the nodes, while others
> >> need a specific solution to be developed in one place and rolled out.
> >> The group teleconferences that Dean organises appear to be a good forum
> >> for airing specific technical ideas and solutions. However, in our
> >> experience it can be difficult in that kind of forum to discuss
> >> planning and prioritisation questions. From our perspective we don't
> >> have visibility of the more project-related issues such as key technical
> >> decisions, prioritisation and timelines, or of whether issues that have
> >> arisen in the mailing list discussions are being followed up. We guess
> >> these may be discussed in separate project teleconferences involving the
> >> technical leads from the data nodes. As users we would not necessarily
> >> expect to be involved in those discussions, but as data providers and
> >> downloaders it would be very helpful for our planning to see the outcomes
> >> of the discussions. The sort of thing we had in mind would be a simple
> >> web page showing the priority development areas, agreed solutions and
> >> estimated dates for completion/release. Some solutions will need to be
> >> implemented separately across all the participating data nodes, and in
> >> these cases it would be useful to see the estimated timeframe for
> >> implementation at each node.
> >> This would not be intended as a 'big stick' to the partners, but simply
> >> as a planning aid so that everyone can see what's available when and the
> >> project can identify any potential bottlenecks or issues in advance.
> >> Also the intention is not to generate a lot of extra work. Hopefully
> >> providing this information would be pretty light on people's time.
> >> 
> >> - From where we sit it appears that some nodes are quite successful in
> >> following best practice and implementing the federation policies as far
> >> as they are aware of them. Could what these nodes do be made available
> >> to all the data nodes (e.g. by using identical software)? We realise there
> >> may be real differences between some data nodes - but where possible we
> >> think that what is similar could be enforced or made explicitly the same
> >> through sharing the software components and tools.
> >> 
> >> To set the discussion on priorities rolling, Jamie has prepared, in
> >> consultation with others here, a short document showing the Met Office
> >> view of current priority issues (attached). If you could update us on
> >> the status of work on these issues, that would be very useful (ideally
> >> via the web pages proposed above, which we think would be of interest to
> >> many users, or via email in the interim). Many thanks for the update on
> >> tokenless authentication, which is very good news.
> >> 
> >> Once again, our thanks to you, Dean and the team for all the hard work
> >> we know is going into this. Please let us know what you think of the
> >> above ideas and the attachment, and if there is anything we can do to
> >> help.
> >> 
> >> Best wishes,
> >> 
> >> Richard
> >> 
> >> --------------
> >> Richard Wood
> >> Met Office Fellow and Head (Oceans, Cryosphere and Dangerous Climate
> >> Change)
> >> Met Office Hadley Centre
> >> FitzRoy Road, Exeter EX1 3PB, UK
> >> Phone +44 (0)1392 886641  Fax +44 (0)1392 885681
> >> Email richard.wood at metoffice.gov.uk
> >> http://www.metoffice.gov.uk
> >> Personal web page:
> >> http://www.metoffice.gov.uk/research/scientists/cryosphere-oceans/richard-wood
> >> 
> >> *** Please note I also work as Theme Leader (Climate System) for the
> >> Natural Environment Research Council ***
> >> *** Where possible please send emails on NERC matters to
> >> rwtl at nerc.ac.uk ***

--
Bryan Lawrence
University of Reading:  Professor of Weather and Climate Computing
National Centre for Atmospheric Science: Director of Models and Data 
STFC: Director of the Centre of Environmental Data Archival 
Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence

