[Go-essp-tech] Non-DRS File structure at data nodes

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Thu Sep 1 03:28:01 MDT 2011


Hi Estani and Stephen

Thanks both.

I agree that the situation on the ground is not what we would want. However, we 
can try and avoid any more "non-DRS" instantiations, and for the core archives, 
of course we will all be DRS compliant ...

So, my take on the non-compliant sites is that we hack together methods of getting 
the data to a core site *in* DRS format, and eventually deprecate support for 
non-compliant nodes to the point that users are encouraged to avoid downloading 
from them. Some services might not work at non-DRS sites. Tough!

Republishing is hard, but letting a thousand flowers bloom will be harder. 

Bryan

> Hi Bryan,
> 
> I don't agree with what I said either. But continuing what Stephen
> explained, the issue is not really what we want or where we are aiming;
> it's about what we have.
> There are just too many institutions publishing with non-DRS
> directories and URLs (IDs and filenames are OK, AFAIK).
> What I now see, which I hadn't previously, is that this won't change,
> at least not for CMIP5, as there's no commitment to it. Republishing
> seems to be a major effort (especially now that people are already
> downloading data), and I can't say I disagree.
> 
> So, we have to cope with this somehow. As Martin mentioned, and as is
> also our intention, we will comply with the DRS and write tools that
> require such structures (hopefully flexible enough to be adapted in the
> mid term to "other" semantically rich structures). It's not really that
> we want it; it's just that having no coherent data structure won't work
> for the amount of data we are going to store, so there's no other
> option. A single institution might be able to pull it off, as they can
> devise a new structure for themselves. An archive can't do that.
> 
> What's missing is to adapt these "free" structures to the DRS one. At the
> moment it's not clear to me exactly what this means, as it depends on
> how far away a given structure is from the DRS one and at what point we
> should be linking them (QC? replication? publication? replica
> publication? etc.). But that's what I should be aiming at from now on,
> mostly because it's the only part of CMIP5 I have access to.
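> 
> As a rough illustration of the kind of adaptation I have in mind, the
> sketch below exposes an existing file under a DRS-style tree via a
> symlink once a complete set of facets is known. The facet order is
> abridged and the helper is hypothetical (not any existing tool's API);
> deriving the facets from each site's local layout is the real,
> site-specific work:
> 
>     import os
> 
>     def link_into_drs(source_file, facets, drs_root):
>         """Expose an existing file under a DRS-style directory tree."""
>         # Abridged, illustrative facet order -- not the full DRS spec.
>         order = ("activity", "product", "institute", "model", "experiment",
>                  "mip_table", "ensemble", "version", "variable")
>         target_dir = os.path.join(drs_root, *(facets[k] for k in order))
>         if not os.path.isdir(target_dir):
>             os.makedirs(target_dir)
>         os.symlink(os.path.abspath(source_file),
>                    os.path.join(target_dir, os.path.basename(source_file)))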
> 
> Well, that's only my opinion anyway. I just thought it might be worth
> clarifying it.
> 
> Thanks,
> Estani
> 
> On 01.09.2011 10:13, stephen.pascoe at stfc.ac.uk wrote:
> > Hi Bryan,
> > 
> > I think what Estani is getting at is that *at this moment* we can't rely
> > on conformance to the DRS directory structure on all datanodes.  There
> > hasn't been sufficient commitment to it throughout the federation. This
> > doesn't mean the interfaces won't develop to comply with DRS in the
> > future.  Bob/Gavin have expressed interest in adding a software layer to
> > TDS to do this.  However, with everything else that's going on, it
> > probably won't happen unless those of us using the DRS demonstrate why
> > we need it by building DRS tools that make our life easier.
> > 
> > To this end I'm putting the future of DRS on the agenda of the IS-ENES
> > coding sprint Sept 21-23rd.  I hope we can come up with some concrete
> > suggestions for what DRS can do for us and how to move towards using it
> > more effectively.
> > 
> > Cheers,
> > Stephen.
> > 
> > 
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > Centre of Environmental Data Archival
> > STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> > 
> > -----Original Message-----
> > From: Bryan Lawrence [mailto:bryan.lawrence at stfc.ac.uk]
> > Sent: 01 September 2011 08:55
> > To: go-essp-tech at ucar.edu
> > Cc: Estanislao Gonzalez; Pascoe, Stephen (STFC,RAL,RALSP); stockhause at dkrz.de
> > Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> > 
> > Hi Folks
> > 
> >> At least it's now clear to me that we can't rely on the DRS structure,
> >> so we should try to cope with this.
> > 
> > I'm just coming back to this, and I haven't read all of this thread, but
> > I don't agree with this statement!  If we can't rely on the DRS *at the
> > interface level*, then ESGF is fundamentally doomed as a distributed
> > activity, because we'll never have the resource to support all the
> > possible variants.
> > 
> > Behind those interfaces, more flexibility might be possible, but
> > components would need to be pretty targeted in their functionality.
> > 
> > Bryan
> > 
> >> Thanks,
> >> Estani
> >> 
> >> On 31.08.2011 12:55, stephen.pascoe at stfc.ac.uk wrote:
> >>> Hi Estani,
> >>> 
> >>> I see you have some code in esgf-contrib.git for managing a replica
> >>> database.  There's quite a lot of drs-parsing code there.  Is there any
> >>> reason why this couldn't use drslib?
> >>> 
> >>> Cheers,
> >>> Stephen.
> >>> 
> >>> ---
> >>> Stephen Pascoe  +44 (0)1235 445980
> >>> Centre of Environmental Data Archival
> >>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX,
> >>> UK
> >>> 
> >>> 
> >>> -----Original Message-----
> >>> From: go-essp-tech-bounces at ucar.edu
> >>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >>> Sent: 31 August 2011 10:23
> >>> To: Juckes, Martin (STFC,RAL,RALSP)
> >>> Cc: stockhause at dkrz.de; go-essp-tech at ucar.edu
> >>> Subject: Re: [Go-essp-tech] Non-DRS File structure at data nodes
> >>> 
> >>> Hi Martin,
> >>> 
> >>> Are you planning to publish that data as a new instance or as a
> >>> replica? If I recall correctly, Karl said he thought the replica was
> >>> attached at a semantic level, but I have my doubts and haven't got any
> >>> feedback on this. Does anyone know whether the gateway can handle a
> >>> replica with a different URL path? (Dataset and version "should" be
> >>> the same, although keeping the same version will be difficult, because
> >>> no tool can handle this AFAIK, i.e. replicating or publishing multiple
> >>> datasets with different versions.)
> >>> 
> >>> And regarding replication (independently of the previous question),
> >>> how are you going to cope with new versions? Do you already have tools
> >>> for harvesting the TDS and building a list of which files need to be
> >>> replicated, given what you already have?
> >>> 
> >>> The catalog will just publish a dataset and version along with a bunch
> >>> of files; you would need to keep a DB of the files you've already
> >>> downloaded and compare it with the catalog to work out what should be
> >>> done next. This information is what drslib should use to create the
> >>> next version. Is that what will happen? If not, how will you solve
> >>> this?
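> >>> 
> >>> To be concrete, the comparison I mean is no more than the sketch below,
> >>> where both inputs are plain filename-to-checksum mappings; how they are
> >>> obtained (harvesting the TDS catalog, querying the local replica DB) is
> >>> the part the tooling still has to provide, and the names here are made
> >>> up:
> >>> 
> >>>     def files_to_replicate(catalog_files, local_files):
> >>>         """Return catalogued files we don't hold yet, or whose
> >>>         checksum differs from the published one."""
> >>>         todo = []
> >>>         for name, checksum in catalog_files.items():
> >>>             if local_files.get(name) != checksum:
> >>>                 todo.append(name)
> >>>         return sorted(todo)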
> >>> 
> >>> Thanks,
> >>> Estani
> >>> 
> >>> On 31.08.2011 10:54, martin.juckes at stfc.ac.uk wrote:
> >>>> Hello Martina,
> >>>> 
> >>>> For BADC, I don't think we are considering storing data in anything
> >>>> other than the DRS structure -- we just don't have the time to build
> >>>> systems around multiple structures. This means that data coming from
> >>>> a node with a different directory structure will have to be
> >>>> re-mapped. Verification of file identities will rely on checksums, as
> >>>> it always will when dealing with files from archives for which we
> >>>> have no curation guarantees.
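> >>>> 
> >>>> By checksum verification I mean nothing more elaborate than comparing
> >>>> a locally computed digest against the value published in the catalogue;
> >>>> a minimal sketch, assuming MD5 is what the catalogue provides:
> >>>> 
> >>>>     import hashlib
> >>>> 
> >>>>     def verify_file(path, expected, algorithm="md5", chunk_size=1 << 20):
> >>>>         """Compare a local file's checksum with the published value."""
> >>>>         digest = hashlib.new(algorithm)
> >>>>         with open(path, "rb") as fh:
> >>>>             for block in iter(lambda: fh.read(chunk_size), b""):
> >>>>                 digest.update(block)
> >>>>         return digest.hexdigest() == expected.lower()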
> >>>> 
> >>>> cheers,
> >>>> Martin
> >>>> 
> >>>> ________________________________
> >>>> From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu] on behalf of Martina Stockhause [stockhause at dkrz.de]
> >>>> Sent: 31 August 2011 09:44
> >>>> To: go-essp-tech at ucar.edu
> >>>> Subject: [Go-essp-tech] Non-DRS File structure at data nodes
> >>>> 
> >>>> Hi everyone,
> >>>> 
> >>>> we promised to describe the problems regarding the non-DRS file
> >>>> structures at the data nodes. Estani has already started the
> >>>> discussion on the replication/user download problems (see attached
> >>>> email and document).
> >>>> 
> >>>> Implications for the QC:
> >>>> 
> >>>> - In the QCDB we need DRS syntax. The DOI process, the creation of CIM
> >>>> documents, and the identification of the data the QC results are
> >>>> connected to all rely on that.
> >>>> 
> >>>> - The QC needs to know the version of the data checked. The DOI at the
> >>>> end of the QC process is assigned to a specific, unchangeable data
> >>>> version. At least at DKRZ we have to guarantee that the data is not
> >>>> changed after assignment of the DOI, so we store a data copy in our
> >>>> archive.
> >>>> 
> >>>> - The QC checker tool runs on files in a given directory structure and
> >>>> creates results in a copy of this structure. The QC wrapper can deal
> >>>> with recombinations of path parts, so if the directory structure
> >>>> includes all parts of the DRS syntax, the wrapper can create the DRS
> >>>> syntax before inserting into the QCDB. But we deal with structures at
> >>>> the data nodes where some information, i.e. version and MIP table, is
> >>>> missing from the directory path. Therefore additional information
> >>>> would be needed for that mapping.
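> >>>> 
> >>>> To make that last point concrete, here is a minimal sketch of the
> >>>> recombination step; the local path layout, the facet order and the
> >>>> "extra" mapping are purely illustrative, not the actual wrapper code:
> >>>> 
> >>>>     def drs_id_from_path(path, extra):
> >>>>         """Recombine path parts into a DRS-style identifier; 'extra'
> >>>>         supplies whatever the local layout does not encode (here the
> >>>>         MIP table and the version)."""
> >>>>         # hypothetical layout: .../<institute>/<model>/<experiment>/<ensemble>/<variable>/<file>
> >>>>         institute, model, experiment, ensemble, variable, _fname = path.split("/")[-6:]
> >>>>         return ".".join(["cmip5", "output", institute, model, experiment,
> >>>>                          extra["mip_table"], ensemble, variable,
> >>>>                          "v" + str(extra["version"])])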
> >>>> 
> >>>> Possible solutions to map the given file structure to the DRS
> >>>> directory structure before insert in the QCDB:
> >>>> 
> >>>> 1. The data nodes of the three gateways that store replicas (PCMDI,
> >>>> BADC, DKRZ) publish data in the DRS directory structure. Then the QC
> >>>> run is possible without mapping. Replication problems?
> >>>> 
> >>>> 2. The directory structures of the data nodes are replicated as they
> >>>> are. We store the data under a certain version. How? Are there
> >>>> implications for the replication from the data nodes? The individual
> >>>> file structures, down to the chunk level, are stored together with
> >>>> their DRS identification in a repository, and a service is created to
> >>>> look up the DRS id for a given file in a given file structure. The QC
> >>>> and maybe other user data services use this service for mapping. That
> >>>> will slow down the QC insert process: before each insert of a chunk
> >>>> name, of a QC result for a specific variable, or of the QC result at
> >>>> the experiment level, that service has to be called. And who can set
> >>>> up and maintain such a repository? DKRZ does not have the manpower to
> >>>> do that in the coming months.
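> >>>> 
> >>>> The service in option 2 need not be more than a keyed lookup over such
> >>>> a repository; a minimal sketch, with the table layout and names
> >>>> invented purely for illustration:
> >>>> 
> >>>>     import sqlite3
> >>>> 
> >>>>     def open_mapping_db(db_path):
> >>>>         """One row per file: the path as published at the data node,
> >>>>         mapped to its DRS identifier."""
> >>>>         db = sqlite3.connect(db_path)
> >>>>         db.execute("""CREATE TABLE IF NOT EXISTS drs_map (
> >>>>                           node_path TEXT PRIMARY KEY,
> >>>>                           drs_id    TEXT NOT NULL)""")
> >>>>         return db
> >>>> 
> >>>>     def drs_id_for(db, node_path):
> >>>>         row = db.execute("SELECT drs_id FROM drs_map WHERE node_path = ?",
> >>>>                          (node_path,)).fetchone()
> >>>>         return row[0] if row else None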
> >>>> 
> >>>> Cheers,
> >>>> Martina
> >>>> 
> >>>> 
> >>>> 
> >>>> -------- Original Message --------
> >>>> Subject: RE: ESG discussion
> >>>> Date: Wed, 10 Aug 2011 15:35:04 +0100
> >>>> From: Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> >>>> To: Karl Taylor <taylor13 at llnl.gov>, Wood, Richard <richard.wood at metoffice.gov.uk>
> >>>> CC: Carter, Mick <mick.carter at metoffice.gov.uk>, Elkington, Mark <mark.elkington at metoffice.gov.uk>,
> >>>>     Bentley, Philip <philip.bentley at metoffice.gov.uk>, Senior, Cath <cath.senior at metoffice.gov.uk>,
> >>>>     Hines, Adrian <adrian.hines at metoffice.gov.uk>, Dean N. Williams <williams13 at llnl.gov>,
> >>>>     Estanislao Gonzalez <gonzalez at dkrz.de>, <martin.juckes at stfc.ac.uk>,
> >>>>     Kettleborough, Jamie <jamie.kettleborough at metoffice.gov.uk>
> >>>> 
> >>>> 
> >>>> Hello Karl, Dean,
> >>>> 
> >>>> Thanks for your reply on this, and for taking our concerns seriously.
> >>>> You are right to challenge us for the specific issues, rather than us
> >>>> just highlighting the things that don't meet our (possibly idealised)
> >>>> expectations of how the system should look.  As a result, we have had
> >>>> a thorough review of our key issues. I think some of them are issues
> >>>> that make it harder for us to do things now; other issues are perhaps
> >>>> more concerns about problems being stored up. This document has been
> >>>> prepared with the help of Estani Gonzalez.  We would like to have
> >>>> Martin Juckes's input on this too - but he is currently away on
> >>>> holiday.  I hope he can add to this when he returns - he has spent a
> >>>> lot of time thinking about the implications of data node directory
> >>>> structure on versioning. I hope this helps clarify the issues; if not,
> >>>> please let us know.
> >>>> Thanks,
> >>>> Jamie
> >>>> 
> >>>> ________________________________
> >>>> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> >>>> Sent: 09 August 2011 01:48
> >>>> To: Wood, Richard
> >>>> Cc: Carter, Mick; Kettleborough, Jamie; Elkington, Mark; Bentley, Philip; Senior, Cath; Hines, Adrian; Dean N. Williams
> >>>> Subject: Re: ESG discussion
> >>>> 
> >>>> Dear all,
> >>>> 
> >>>> Thanks for taking the time to bring to my attention the ESG issues
> >>>> that I hope can be addressed reasonably soon.  I think we're in
> >>>> general agreement that the user's experience should be improved.
> >>>> 
> >>>> I've discussed this briefly with Dean.  I plan to meet with him and
> >>>> others here, and, drawing on your suggestions, we'll attempt to find
> >>>> solutions and methods of communication that might improve matters.
> >>>> Before doing this, it would help if you could briefly answer the
> >>>> following questions:
> >>>> 
> >>>> 1.  Is the main issue that it is currently difficult to script
> >>>> downloads from all the nodes because only some support PKI?  What
> >>>> other uniformity among nodes is required for you to be able to do
> >>>> what you want to do (i.e., what do you specifically want to do that
> >>>> is difficult to do now)?  [nb. all data nodes are scheduled to be
> >>>> operating with PKI authentication by September 1.]
> >>>> 
> >>>> 2.  Is there anything from the perspective of a data *provider*  that
> >>>> needs to be done (other than make things easier for data users)?
> >>>> 
> >>>> 3.  Currently ESG and CMIP5 do not dictate the directory structure
> >>>> found at each data node (although most nodes are adhering to the
> >>>> recommendations of the DRS).   The gateway software and catalog make
> >>>> it possible to get to the data regardless of directory structure.  It
> >>>> is possible that "versioning" might impose additional constraints on
> >>>> the directory structure, but I'm not sure about this.  (By the way,
> >>>> I'm not sure what the "versioning" issue is since currently I think
> >>>> it's impossible for users to know about more than one version; is
> >>>> that the issue?)  From a user's or provider's perspective, is there
> >>>> any essential reason that the directory structure should be the same
> >>>> at each node?
> >>>> 
> >>>> 4.  ESG allows considerable flexibility in publishing data, and CMIP5
> >>>> has suggested "best practices" to reduce differences.  Only some of
> >>>> the "best practices" are currently requirements.  A certain amount of
> >>>> flexibility is essential since data providers have differing resources
> >>>> to support the potential capabilities of ESG (e.g., not all can
> >>>> support server-side calculations, which will be put in place at some
> >>>> nodes). Likewise a provider can currently turn off the "checksum" if
> >>>> this is deemed to slow publication too much (although we could insist
> >>>> that checksums be stored in the THREDDS catalogue). Nevertheless, it
> >>>> is unlikely that every data node will be identically configured for
> >>>> all options. What are the *essential* ways in which the data nodes
> >>>> should respond identically (we may not be able to insist on
> >>>> uniformity that isn't essential for serving our users)?
> >>>> 
> >>>> Thanks again for your input, and I look forward to your further help
> >>>> with this.
> >>>> 
> >>>> Best regards,
> >>>> Karl
> >>>> 
> >>>> 
> >>>> On 8/5/11 10:43 AM, Wood, Richard wrote:
> >>>> 
> >>>> Dear Karl,
> >>>> 
> >>>> Following on from our phone call I had a discussion with technical
> >>>> colleagues here (Mick Carter, Jamie Kettleborough, Mark Elkington,
> >>>> also earlier with Phil Bentley), and with Adrian Hines who is
> >>>> coordinating our CMIP5 analysis work, about ideas for future
> >>>> development of the ESG. Our observations are from the user
> >>>> perspective, and based on what we can gather from mailing lists and
> >>>> our own experience. Coming out of our discussion we have a couple of
> >>>> suggestions that could help with visibility for data providers and
> >>>> users:
> >>>> 
> >>>> - Some areas need agreement among the data nodes as to the technical
> >>>> solution, and then implementation across all the nodes, while others
> >>>> need a specific solution to be developed in one place and rolled out.
> >>>> The group teleconferences that Dean organises appear to be a good
> >>>> forum for airing specific technical ideas and solutions. However, in
> >>>> our experience it can be  difficult in that kind of forum to discuss
> >>>> planning and prioritisation questions. From our perspective we don't
> >>>> have visibility of the more project-related issues such as key
> >>>> technical decisions, prioritisation and timelines, or of whether
> >>>> issues that have arisen in the mailing list discussions are being
> >>>> followed up. We guess these may be discussed in separate project
> >>>> teleconferences involving the technical leads from the data nodes. As
> >>>> users we would not necessarily expect to be involved in those
> >>>> discussions, but as data providers and downloaders it would be very
> >>>> helpful for our planning to see the outcomes of the discussions. The
> >>>> sort of thing we had in mind would be a simple web page showing the
> >>>> priority development areas, agreed solutions and estimated dates for
> >>>> completion/release. Some solutions will need to be implemented
> >>>> separately across all the participating data nodes, and in these
> >>>> cases it would be useful to see the estimated timeframe for
> >>>> implementation at each node.
> >>>> This would not be intended as a 'big stick' to the partners, but
> >>>> simply as a planning aid so that everyone can see what's available
> >>>> when and the project can identify any potential bottlenecks or issues
> >>>> in advance. Also the intention is not to generate a lot of extra
> >>>> work. Hopefully providing this information would be pretty light on
> >>>> people's time.
> >>>> 
> >>>> - From where we sit it appears that some nodes are quite successful in
> >>>> following best practice and implementing the federation policies as
> >>>> far as they are aware of them. Could what these nodes do be made
> >>>> helpful to all the data nodes (e.g. by using identical software)?  We
> >>>> realise there may be real differences between some data nodes - but
> >>>> where possible we think that what is similar could be enforced or
> >>>> made explicitly the same through sharing the software components and
> >>>> tools.
> >>>> 
> >>>> To set the discussion on priorities rolling, Jamie has prepared, in
> >>>> consultation with others here, a short document showing the Met Office
> >>>> view of current priority issues (attached). If you could update us on
> >>>> the status of work on these issues, that would be very useful (ideally
> >>>> via the web pages proposed above, which we think would be of interest
> >>>> to many users, or via email in the interim). Many thanks for the
> >>>> update on tokenless authentication, which is very good news.
> >>>> 
> >>>> Once again, our thanks to you, Dean and the team for all the hard work
> >>>> we know is going into this. Please let us know what you think of the
> >>>> above ideas and the attachment, and if there is anything we can do to
> >>>> help.
> >>>> 
> >>>> Best wishes,
> >>>> 
> >>>> Richard
> >>>> 
> >>>> --------------
> >>>> Richard Wood
> >>>> Met Office Fellow and Head (Oceans, Cryosphere and Dangerous Climate
> >>>> Change)
> >>>> Met Office Hadley Centre
> >>>> FitzRoy Road, Exeter EX1 3PB, UK
> >>>> Phone +44 (0)1392 886641  Fax +44 (0)1392 885681
> >>>> Email richard.wood at metoffice.gov.uk
> >>>> http://www.metoffice.gov.uk
> >>>> Personal web page:
> >>>> http://www.metoffice.gov.uk/research/scientists/cryosphere-oceans/richard-wood
> >>>> 
> >>>> *** Please note I also work as Theme Leader (Climate System) for the
> >>>> Natural Environment Research Council ***
> >>>> *** Where possible please send emails on NERC matters to
> >>>> rwtl at nerc.ac.uk ***
> > 
> > --
> > Bryan Lawrence
> > University of Reading:  Professor of Weather and Climate Computing
> > National Centre for Atmospheric Science: Director of Models and Data
> > STFC: Director of the Centre of Environmental Data Archival
> > Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence

--
Bryan Lawrence
University of Reading:  Professor of Weather and Climate Computing
National Centre for Atmospheric Science: Director of Models and Data 
STFC: Director of the Centre of Environmental Data Archival 
Phone +44 1235 445012; Web: home.badc.rl.ac.uk/lawrence

