[Go-essp-tech] Non-DRS File structure at data nodes

Thu Sep 1 03:19:17 MDT 2011

Hi Jamie

I agree, it's an interface to the filesystem in many cases, so if folks want to 
hide it, then they have to provide a file system interface ... or something that 
can behave like one (I suspect we could do it through an http interface if we 
really have to, but it'll feel quite unnatural to write s/w to run on a data 
node that communicates with its own data via the http interface).

I also agree that we are (and probably will be) paying a heavy price (in time 
and effort) for poor communication both within and about the DRS ...

Cheers
Bryan

> Hello,
> 
> Isn't one issue that for some applications the *interface* with the data
> is at the *file system level* - not the catalogues? Version management,
> QC look like they are examples, and replication may be too (and I think
> these are pretty much federation wide activities/applications).  So if
> you want to minimise the complexity (~= minimise time to develop, cost
> of maintenance) in the way these applications interact with the data you
> want to ensure consistency in the way data is stored in the file system.
> Bryan - I wasn't sure what interfaces you were talking about... Sorry.
> 
> I'm going to be a bit pedantic here - but I don't think the DRS document
> says that data nodes must follow the DRS directory structure; it's only a
> recommendation.  Though there *may* be a slight inconsistency in the way
> the DRS is written, as it says the URLs *will* be a site-dependent prefix
> followed by the *DRS directory structure*.  At least that's my reading of
> the 1.2 version dated 9th March. I don't think all nodes are following the
> DRS specification for the URLs because they don't have the same underlying
> directory structure.  I don't know if the way the DRS is written or
> being interpreted is one of the sources of misunderstanding over this
> issue of DRS directory structure?  (This is not a criticism; it's an
> acceptance that communicating specifications and plans is a hard problem
> to crack).
> 
> Another (possibly weak) motivation for keeping all data in the DRS
> directory structure is that it gives you a last-ditch backup strategy - if
> you lose the catalogues you can regenerate the version info etc. from
> the file system.
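
To illustrate the fallback described above, here is a minimal sketch of
rebuilding version information from directory paths alone. The component
order in DRS_COMPONENTS, the dotted dataset id and the function names are
assumptions made for illustration; they are not taken from the DRS document
or from drslib, and the order should be checked against the DRS version in
use.

```python
import os
from collections import defaultdict

# Assumed component order below the site-dependent prefix (illustrative only).
DRS_COMPONENTS = [
    "activity", "product", "institute", "model", "experiment",
    "frequency", "realm", "mip_table", "ensemble", "version", "variable",
]


def parse_drs_path(rel_path):
    """Split a path (relative to the site prefix) into named components.

    Returns None when the path does not have the expected depth.
    """
    parts = rel_path.replace(os.sep, "/").strip("/").split("/")
    if len(parts) != len(DRS_COMPONENTS) + 1:  # components plus the filename
        return None
    fields = dict(zip(DRS_COMPONENTS, parts))
    fields["filename"] = parts[-1]
    return fields


def recover_versions(root):
    """Walk `root` and rebuild {dataset id: sorted versions} from paths alone."""
    versions = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            fields = parse_drs_path(rel)
            if fields is None:
                continue  # not laid out as assumed, so skip it
            dataset_id = ".".join(
                fields[c] for c in DRS_COMPONENTS
                if c not in ("version", "variable")
            )
            versions[dataset_id].add(fields["version"])
    return {key: sorted(found) for key, found in versions.items()}
```

Note that this only works where the version really is a path component; a
node that omits it from its directory layout gives the scan nothing to
recover.
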
> 
> Jamie
> 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Bryan Lawrence
> > Sent: 01 September 2011 08:55
> > To: go-essp-tech at ucar.edu
> > Cc: stockhause at dkrz.de
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi Folks
> > 
> > > At least it's now clear to me, that we can't rely on the DRS structure
> > > so we should try to cope with this.
> > 
> > I'm just coming back to this, and I haven't read all of this
> > thread, but I don't agree with this statement!  If we can't
> > rely on the DRS *at the interface level*, then ESGF is
> > fundamentally doomed as a distributed activity, because we'll
> > never have the resource to support all the possible variants.
> > 
> > Behind those interfaces, more flexibility might be possible,
> > but components would need to be pretty targeted in their
> > functionality.
> > 
> > Bryan
> > 
> > > Thanks,
> > > Estani
> > > 
> > > On 31.08.2011 12:55, stephen.pascoe at stfc.ac.uk wrote:
> > > > Hi Estani,
> > > > 
> > > > I see you have some code in esgf-contrib.git for managing a replica
> > > > database.  There's quite a lot of drs-parsing code there.  Is there
> > > > any reason why this couldn't use drslib?
> > > > 
> > > > Cheers,
> > > > Stephen.
> > > > 
> > > > ---
> > > > Stephen Pascoe  +44 (0)1235 445980
> > > > Centre of Environmental Data Archival STFC Rutherford Appleton
> > > > Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: go-essp-tech-bounces at ucar.edu
> > > > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao
> > > > Gonzalez
> > > > Sent: 31 August 2011 10:23
> > > > To: Juckes, Martin (STFC,RAL,RALSP)
> > > > Cc: stockhause at dkrz.de; go-essp-tech at ucar.edu
> > > > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > > > 
> > > > Hi Martin,
> > > > 
> > > > Are you planning to publish that data as a new instance or as a
> > > > replica? If I recall it right, Karl said he thought the replica was
> > > > attached at a semantic level. But I have my doubts and haven't got
> > > > any feedback on this. Does anyone know if the gateway can handle a
> > > > replica with a different url path? (dataset and version "should" be
> > > > the same, although keeping the same version will be difficult,
> > > > because no tool can handle this AFAIK, i.e. replicating or publishing
> > > > multiple datasets with different versions)
> > > > 
> > > > And regarding replication (independently from the previous
> > > > question), how are you going to cope with new versions? Do you
> > > > already have tools for harvesting the TDS and building a list of
> > > > which files need to be replicated, given what you already have?
> > > > The catalog will just publish a dataset and version along with a
> > > > bunch of files; you would need to keep a DB with the files you've
> > > > already downloaded, and compare it with the catalog to work out what
> > > > should be done next. This information is what drslib should use to
> > > > create the next version. Is that what will happen? If not, how will
> > > > you be solving this?
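
The comparison described above (catalogue listing versus a record of what
has already been downloaded) might look roughly like the sketch below. The
function plan_replication and the name-to-checksum dictionaries are
hypothetical; a real tool would fill them from the TDS catalogue and from a
local database, and nothing here reflects how drslib or any existing
replication tool actually works.

```python
def plan_replication(catalog_files, local_files):
    """Compare a published file list with what is already held locally.

    Both arguments map file name -> checksum (hypothetical structures).
    """
    fetch, keep = [], []
    for name, checksum in sorted(catalog_files.items()):
        if name in local_files and local_files[name] == checksum:
            keep.append(name)   # already downloaded and unchanged
        else:
            fetch.append(name)  # new file, or its content has changed
    # Files held locally that the catalogue no longer lists.
    dropped = sorted(set(local_files) - set(catalog_files))
    return {"fetch": fetch, "keep": keep, "dropped": dropped}


# Example with made-up file names and checksums.
catalog = {"tas.nc": "aa", "pr.nc": "bb", "ua.nc": "cc"}
local = {"tas.nc": "aa", "pr.nc": "old", "va.nc": "dd"}
plan = plan_replication(catalog, local)
```

A non-empty "fetch" or "dropped" list is the signal that a new local version
is needed; how that version is then labelled is a separate question.
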
> > 
> > > > Thanks,
> > > > Estani
> > > > 
> > > > On 31.08.2011 10:54, martin.juckes at stfc.ac.uk wrote:
> > > >> Hello Martina,
> > > >> 
> > > >> For BADC, I don't think we are considering storing data in anything
> > > >> other than the DRS structure -- we just don't have the time to
> > > >> build systems around multiple structures. This means that data that
> > > >> comes from a node with a different directory structure will have to
> > > >> be re-mapped. Verification of file identities will rely on
> > > >> check-sums, as it always will when dealing with files from archives
> > > >> from which we have no curation guarantees,
> > > >> 
> > > >> cheers,
> > > >> Martin
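
As a minimal sketch of checksum-based verification of file identity: the
helper names and the choice of MD5 as the default are assumptions for
illustration, and the algorithm should match whatever the publishing node
records for its files.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path, expected_checksum, algorithm="md5"):
    """True if the local file matches the checksum published elsewhere."""
    return file_checksum(path, algorithm) == expected_checksum.lower()
```

Because the comparison uses content rather than location, it works the same
whether or not the source node follows the DRS directory structure.
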
> > > >> 
> > > >> ________________________________
> > > >> From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu]
> > > >> on behalf of Martina Stockhause [stockhause at dkrz.de]
> > > >> Sent: 31 August 2011 09:44
> > > >> To: go-essp-tech at ucar.edu
> > > >> Subject: [Go-essp-tech] Non-DRS File structure at data nodes
> > > >> 
> > > >> Hi everyone,
> > > >> 
> > > >> we promised to describe the problems regarding the non-DRS file
> > > >> structures at the data nodes. Estani has already started the
> > > >> discussion on the replication/user download problems (see attached
> > > >> email and document).
> > > >> 
> > > >> Implications for the QC:
> > > >> - In the QCDB we need DRS syntax. The DOI process, creation of CIM
> > > >>   documents, and identification of the data the QC results are
> > > >>   connected to rely on that.
> > > >> - The QC needs to know the version of the data checked. The DOI at
> > > >>   the end of the QC process is assigned to a specific unchangeable
> > > >>   data version. At least at DKRZ we have to guarantee that the data
> > > >>   is not changed after assignment of the DOI, therefore we store a
> > > >>   data copy in our archive.
> > > >> - The QC checker tool runs on files in a given directory structure
> > > >>   and creates results in a copy of this structure. The QC wrapper
> > > >>   can deal with recombinations of path parts.
> > > >> So, if the directory structure includes all parts of the DRS
> > > >> syntax, the wrapper can create the DRS syntax before insertion in
> > > >> the QCDB. But we deal with structures at the data nodes where some
> > > >> information is missing in the directory path, i.e. version and MIP
> > > >> table. Therefore additional information would be needed for that
> > > >> mapping.
> > 
> > > >> Possible solutions to map the given file structure to the DRS
> > > >> directory structure before insertion in the QCDB:
> > > >> 
> > > >> 1. The data nodes of the three gateways that store replicas
> > > >> (PCMDI, BADC, DKRZ) publish data in the DRS directory structure.
> > > >> Then the QC run is possible without mapping.
> > > >> Replication problems?
> > > >> 
> > > >> 2. The directory structures of the data nodes are replicated as
> > > >> they are. We store the data under a certain version. How? Are there
> > > >> implications for the replication from the data nodes? The
> > > >> individual file structures down to the chunk level are stored
> > > >> together with their DRS identification in a repository, and a
> > > >> service is created to access the DRS id for the given file in the
> > > >> given file structure. The QC and maybe other user data services use
> > > >> this service for mapping. That will slow down the QC insert process:
> > > >> before each insert of a chunk name, a QC result for a specific
> > > >> variable, and the QC result on the experiment level, that service
> > > >> has to be called. And who can set up and maintain such a
> > > >> repository? DKRZ does not have the manpower to do that in the next
> > > >> months.
> > 
> > > >> Cheers,
> > > >> Martina
> > > >> 
> > > >> 
> > > >> 
> > > >> -------- Original Message --------
> > > >> Subject: RE: ESG discussion
> > > >> Date:    Wed, 10 Aug 2011 15:35:04 +0100
> > > >> From:    Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> > > >> To:      Karl Taylor <taylor13 at llnl.gov>,
> > > >>          Wood, Richard <richard.wood at metoffice.gov.uk>
> > > >> CC:      Carter, Mick <mick.carter at metoffice.gov.uk>,
> > > >>          Elkington, Mark <mark.elkington at metoffice.gov.uk>,
> > > >>          Bentley, Philip <philip.bentley at metoffice.gov.uk>,
> > > >>          Senior, Cath <cath.senior at metoffice.gov.uk>,
> > > >>          Hines, Adrian <adrian.hines at metoffice.gov.uk>,
> > > >>          Dean N. Williams <williams13 at llnl.gov>,
> > > >>          Estanislao Gonzalez <gonzalez at dkrz.de>,
> > > >>          <martin.juckes at stfc.ac.uk>,
> > > >>          Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> > > >> 
> > > >> 
> > > >> Hello Karl, Dean,
> > > >> 
> > > >> Thanks for your reply on this, and the fact you are taking our
> > > >> concerns seriously. You are right to challenge us for the specific
> > > >> issues, rather than us just highlighting the things that don't meet
> > > >> our (possibly idealised) expectations of how the system should
> > > >> look.  As a result, we have had a thorough review of our key
> > > >> issues. I think some of them are issues that make it harder for us
> > > >> to do things now; other issues are maybe more concerns of problems
> > > >> being stored up. This document has been prepared with the help of
> > > >> Estani Gonzalez.  We would like to have Martin Juckes' input on this
> > > >> too - but he is currently away on holiday.  I hope he can add to
> > > >> this when he returns - he has spent a lot of time thinking about
> > > >> the implications of data node directory structure on versioning. I
> > > >> hope this helps clarify issues; if not, please let us know.
> > > >> 
> > > >> Thanks,
> > > >> Jamie
> > > >> 
> > > >> ________________________________
> > > >> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> > > >> Sent: 09 August 2011 01:48
> > > >> To: Wood, Richard
> > > >> Cc: Carter, Mick; Kettleborough, Jamie; Elkington, Mark; Bentley,
> > > >> Philip; Senior, Cath; Hines, Adrian; Dean N. Williams
> > > >> Subject: Re: ESG discussion
> > > >> 
> > > >> Dear all,
> > > >> 
> > > >> Thanks for taking the time to bring to my attention the ESG issues
> > > >> that I hope can be addressed reasonably soon.  I think we're in
> > > >> general agreement that the user's experience should be improved.
> > > >> 
> > > >> I've discussed this briefly with Dean.  I plan to meet with him and
> > > >> others here, and, drawing on your suggestions, we'll attempt to
> > > >> find solutions and methods of communication that might improve
> > > >> matters.
> > > >> 
> > > >> Before doing this, it would help if you could briefly answer the
> > > >> following questions:
> > > >> 
> > > >> 1.  Is the main issue that it is currently difficult to script
> > > >> downloads from all the nodes because only some support PKI?  What
> > > >> other uniformity among nodes is required for you to be able to do
> > > >> what you want to do (i.e., what do you specifically want to do that
> > > >> is difficult to do now)?  [nb. all data nodes are scheduled to be
> > > >> operating with PKI authentication by September 1.]
> > > >> 
> > > >> 2.  Is there anything from the perspective of a data *provider*
> > > >> that needs to be done (other than make things easier for data
> > > >> users)?
> > > >> 
> > > >> 3.  Currently ESG and CMIP5 do not dictate the directory structure
> > > >> found at each data node (although most nodes are adhering to the
> > > >> recommendations of the DRS).   The gateway software and catalog
> > > >> make it possible to get to the data regardless of directory
> > > >> structure.  It is possible that "versioning" might impose additional
> > > >> constraints on the directory structure, but I'm not sure about this.
> > > >> (By the way, I'm not sure what the "versioning" issue is since
> > > >> currently I think it's impossible for users to know about more than
> > > >> one version; is that the issue?)  From a user's or provider's
> > > >> perspective, is there any essential reason that the directory
> > > >> structure should be the same at each node?
> > > >> 
> > > >> 4.  ESG allows considerable flexibility in publishing data, and
> > > >> CMIP5 has suggested "best practices" to reduce differences.  Only
> > > >> some of the "best practices" are currently requirements.  A certain
> > > >> amount of flexibility is essential since different data providers
> > > >> have resources to support the potential capabilities of ESG (e.g.,
> > > >> not all can support server-side calculations, which will be put in
> > > >> place at some nodes).  Likewise a provider can currently turn off
> > > >> the "checksum", if this is deemed to slow publication too much
> > > >> (although we could insist that checksums be stored in the thredds
> > > >> catalogue).  Nevertheless, it is unlikely that every data node will
> > > >> be identically configured for all options.    What are the
> > > >> *essential* ways that the data nodes should respond identically (we
> > > >> may not be able to insist on uniformity that isn't essential for
> > > >> serving our users)?
> > > >> 
> > > >> Thanks again for your input, and I look forward to your further
> > > >> help with this.
> > > >> 
> > > >> Best regards,
> > > >> Karl
> > > >> 
> > > >> 
> > > >> On 8/5/11 10:43 AM, Wood, Richard wrote:
> > > >> 
> > > >> Dear Karl,
> > > >> 
> > > >>     Following on from our phone call I had a discussion with
> > > >> technical colleagues here (Mick Carter, Jamie Kettleborough, Mark
> > > >> Elkington, also earlier with Phil Bentley), and with Adrian Hines
> > > >> who is coordinating our CMIP5 analysis work, about ideas for future
> > > >> development of the ESG. Our observations are from the user
> > > >> perspective, and based on what we can gather from mailing lists and
> > > >> our own experience. Coming out of our discussion we have a couple of
> > > >> suggestions that could help with visibility for data providers and
> > > >> users:
> > > >> 
> > > >> - Some areas need agreement among the data nodes as to the
> > > >> technical solution, and then implementation across all the nodes,
> > > >> while others need a specific solution to be developed in one place
> > > >> and rolled out. The group teleconferences that Dean organises appear
> > > >> to be a good forum for airing specific technical ideas and
> > > >> solutions. However, in our experience it can be difficult in that
> > > >> kind of forum to discuss planning and prioritisation questions. From
> > > >> our perspective we don't have visibility of the more project-related
> > > >> issues such as key technical decisions, prioritisation and
> > > >> timelines, or of whether issues that have arisen in the mailing list
> > > >> discussions are being followed up. We guess these may be discussed
> > > >> in separate project teleconferences involving the technical leads
> > > >> from the data nodes. As users we would not necessarily expect to be
> > > >> involved in those discussions, but as data providers and downloaders
> > > >> it would be very helpful for our planning to see the outcomes of the
> > > >> discussions. The sort of thing we had in mind would be a simple web
> > > >> page showing the priority development areas, agreed solutions and
> > > >> estimated dates for completion/release. Some solutions will need to
> > > >> be implemented separately across all the participating data nodes,
> > > >> and in these cases it would be useful to see the estimated timeframe
> > > >> for implementation at each node. This would not be intended as a
> > > >> 'big stick' to the partners, but simply as a planning aid so that
> > > >> everyone can see what's available when and the project can identify
> > > >> any potential bottlenecks or issues in advance. Also the intention
> > > >> is not to generate a lot of extra work. Hopefully providing this
> > > >> information would be pretty light on people's time.
> > > >> 
> > > >> - From where we sit it appears that some nodes are quite successful
> > > >> in following best practice and implementing the federation policies
> > > >> as far as they are aware of them. Could what these nodes do be made
> > > >> helpful to all the data nodes (e.g. by using identical software)?
> > > >> We realise there may be real differences between some data nodes -
> > > >> but where possible we think that what is similar could be enforced
> > > >> or made explicitly the same through sharing the software components
> > > >> and tools.
> > 
> > > >> To set the discussion on priorities rolling, Jamie has prepared, in
> > > >> consultation with others here, a short document showing the Met
> > > >> Office view of current priority issues (attached). If you could
> > > >> update us on the status of work on these issues, that would be very
> > > >> useful (ideally via the web pages proposed above, which we think
> > > >> would be of interest to many users, or via email in the interim).
> > > >> Many thanks for the update on tokenless authentication, which is
> > > >> very good news.
> > > >> 
> > > >>     Once again, our thanks to you, Dean and the team for all the
> > > >> hard work we know is going into this. Please let us know what you
> > > >> think of the above ideas and the attachment, and if there is
> > > >> anything we can do to help.
> > > >> 
> > > >>         Best wishes,
> > > >>         
> > > >>          Richard
> > > >> 
> > > >> --------------
> > > >> Richard Wood
> > > >> Met Office Fellow and Head (Oceans, Cryosphere and Dangerous
> > > >> Climate Change)
> > > >> Met Office Hadley Centre
> > > >> FitzRoy Road, Exeter EX1 3PB, UK
> > > >> Phone +44 (0)1392 886641  Fax +44 (0)1392 885681
> > > >> Email richard.wood at metoffice.gov.uk
> > > >> http://www.metoffice.gov.uk
> > > >> Personal web page
> > > >> http://www.metoffice.gov.uk/research/scientists/cryosphere-oceans/richard-wood
> > > >> 
> > > >> *** Please note I also work as Theme Leader (Climate System) for
> > > >> the Natural Environment Research Council ***
> > > >> *** Where possible please send emails on NERC matters to
> > > >> rwtl at nerc.ac.uk ***
> > 
> > --
> > Bryan Lawrence
> > University of Reading:  Professor of Weather and Climate Computing
> > National Centre for Atmospheric Science: Director of Models and Data
> > STFC: Director of the Centre of Environmental Data Archival
> > Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence

--
Bryan Lawrence
University of Reading:  Professor of Weather and Climate Computing
National Centre for Atmospheric Science: Director of Models and Data 
STFC: Director of the Centre of Environmental Data Archival 
Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence