[Go-essp-tech] Replica support in Gateway 2.0

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Mon Nov 14 09:36:42 MST 2011


Hello Martin,

I'm not sure how effective this suggestion would be.  I don't know that analysts will contact data suppliers, nor that the data suppliers will always be in a position to respond.

I do wonder if there isn't some place for feedback from data analysts to other analysts and data suppliers (maybe Ag's version comment app? Or a community wiki, or a place to upload ncatted scripts?).  My guess is that some issues are hard to detect in automatic QC, e.g. how do you detect that the forcing attribute is wrong?  Yet for some analyses (detection and attribution, for instance) it is crucial that this is right.  These things are only discovered after data has gone out, by analysts.  It would be great if they could let others know (just letting the data supplier know may not be enough, as their response timescale may be slow - I know ours is at the moment).

What is the plan when a DOI'd data set is found to be wrong?

Jamie

> -----Original Message-----
> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk] 
> Sent: 11 November 2011 09:46
> To: Kettleborough, Jamie; gonzalez at dkrz.de; 
> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> 
> Hello Jamie,
> 
> Some of your points might have been answered elsewhere, but I 
> just wanted to point out some issues raised by your comment 
> about people publishing now.
> 
> The archive is required to distribute provisional data, and 
> this has some consequences. DOIs are clearly important 
> because they make a clean break between the provisional and 
> the reference archive. There was a help desk response a 
> couple of weeks ago suggesting that a user who wanted 
> information about possible changes to data contact the 
> individual data node managers. This is relevant to people 
> publishing now, and perhaps we should be explicitly stating 
> that users should contact the data suppliers before 
> publishing. There are two reasons for doing this:
> (1) The data suppliers might be on the point of retracting 
> and/or replacing data;
> (2) Technical problems in the distributed archive mean that 
> people need to check exactly which data they are using;
> 
> We should, as you say, deal with item (2) as fast as 
> possible. Item (1) will remain until we get assurances from 
> data suppliers about the stability of the data, which is 
> essentially the DOI step. So I think we should be advising 
> users who are publishing work based on non-DOI'ed data to 
> contact data suppliers before submitting their work for 
> publication. What do you think of this last suggestion?
> 
> cheers,
> Martin
> ________________________________________
> From: Kettleborough, Jamie [jamie.kettleborough at metoffice.gov.uk]
> Sent: 10 November 2011 17:14
> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de; 
> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Cc: Kettleborough, Jamie
> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> 
> Hello,
> 
> I know this thread has moved on, but can we rewind just a 
> bit.  I think Nathan asked for use cases around this issue.
> 
> As far as I'm aware there are two main user use cases here 
> (there may be others obviously).
> 
> 1. User wants to get data *now* (even if it's only 'preprint' 
> in Bryan's language) in a way that suits their particular 
> needs (functional and non-functional)
> 
> 2. In a few years' time (not sure how long - could be months) 
> user wants to get a copy of the data to verify or extend some 
> previous analysis.
> 
> I'm sure someone will correct me if I have this wrong, but I 
> *think* most of the discussion so far has centered around the 
> second of these.  I think the first one needs some discussion 
> too though, doesn't it - it feels more urgent to me?
> 
> I think it needs reviewing in the light of replication, as 
> when there is replication a user *may* choose to go to a 
> particular data-node as it suits them better - they may see 
> faster downloads because of the network route between them 
> and the servers, or it might support some data service they 
> want.  Something in the system has to have the responsibility 
> of choosing which replica to download.  To be honest I think 
> this is best left to the user.  If this is the case then the 
> user has to have a good view of *where* the data is through 
> their interface.  Nate - I think your proposed implementation 
> would expose the information, no?
> 
> The other issue for replication I can think of coming out of 
> use case 1 is versioning.  Data will be revised by data 
> providers (we have examples of this), so I think that the 
> replication system has to keep up and the interfaces have to 
> be able to communicate this to the user.   A case I'm worried 
> about is if a replica goes stale (sorry Estani, I *think* we 
> have examples of these, e.g. 
> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> at DKRZ is stale - we've resubmitted as 
> v20111102.  I only discovered this yesterday, and understood 
> what *might* be happening today; I can send more examples if 
> you need them).  I think a user needs to be able to tell 
> (without too much hard work) what really is a replica, and 
> what is a replica of a previous version.  Nate - can you cope 
> with this situation in your proposed implementation?
> 
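To make the stale-replica case above concrete, here is a minimal sketch of how a client might flag it. It assumes that a dataset id ends in a ".v<digits>" component (as in the ids quoted above) and that a numerically larger value means a newer version. The names Holding, split_version and stale_replicas, and the "MOHC" node label for the original, are hypothetical and are not part of any Gateway or data node API.

```python
import re
from typing import List, NamedTuple, Tuple

# Assumption: the version is the trailing ".v<digits>" component of the id.
_VERSION_RE = re.compile(r"^(?P<name>.+)\.v(?P<version>\d+)$")


class Holding(NamedTuple):
    node: str        # hypothetical label for the node holding this copy
    dataset_id: str  # full dataset id, including the version component


def split_version(dataset_id: str) -> Tuple[str, int]:
    """Split a dataset id into (unversioned name, numeric version)."""
    match = _VERSION_RE.match(dataset_id)
    if match is None:
        raise ValueError("no trailing .v<digits> component in %r" % dataset_id)
    return match.group("name"), int(match.group("version"))


def stale_replicas(original: Holding, replicas: List[Holding]) -> List[Holding]:
    """Return replicas of the same dataset whose version is older than the original's."""
    name, current = split_version(original.dataset_id)
    stale = []
    for replica in replicas:
        replica_name, replica_version = split_version(replica.dataset_id)
        # Only compare copies of the same dataset; older versions are stale.
        if replica_name == name and replica_version < current:
            stale.append(replica)
    return stale


# Ids copied verbatim from the message above; node labels are illustrative.
BASE = "cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1"
original = Holding("MOHC", BASE + ".v20111102")
replicas = [Holding("DKRZ", BASE + ".v2011032")]

for holding in stale_replicas(original, replicas):
    print("stale replica at", holding.node, "->", holding.dataset_id)
```

A check like this only works if the interface exposes the full versioned id of every copy, which is the point being raised about what users need to see.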
> (Are there any issues around authorisation when it comes to 
> replicas - would a new published version mean all replicas of 
> previous versions are no longer 'authorisable' against, or 
> would stale replicas be available?)
> 
> I know that replication has started to happen - but is this 
> the right thing now?  Is everything in place to do this in a 
> way that is *safe* and not going to confuse users?
> 
> Jamie
> 
> (I guess if you have a client that only uses the data nodes 
> you know what node you were talking to, and see the full 
> version information from the outset, so these aren't such big issues).
> 
> ps yes I know I still have to answer some questions from 
> Sebastien about our client.
> 
> > -----Original Message-----
> > From: go-essp-tech-bounces at ucar.edu
> > [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of 
> > martin.juckes at stfc.ac.uk
> > Sent: 10 November 2011 16:26
> > To: gonzalez at dkrz.de; go-essp-tech at ucar.edu; 
> > esg-gateway-dev at earthsystemgrid.org
> > Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >
> > Hi Estani,
> >
> > You missed the start -- the bit which is not achievable is publishing
> > a replica to the same gateway used for the original publication of 
> > that data, e.g. IPSL data published to BADC.
> >
> > Cheers,
> > Martin
> >
> > > >-----Original Message-----
> > > > >From: go-essp-tech-bounces at ucar.edu
> > > > >[mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> > > >Sent: 10 November 2011 16:20
> > > >To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> > > >Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> > > >
> > > >Hi,
> > > >
> > > > >this analogy seems perfect. Now, regarding the options we have
> > > > >at the moment:
> > > > >1) How "unique" is the dataset id in a Gateway? Federation wide, 
> > > > >local Gateway or project unique?
> > > > >2) Depending on 1), the procedure could involve "moving" the 
> > > > >published data to some other project, Gateway, Federation :-)
> > > > >
> > > > >I think this could be achievable:
> > > > >1) Data gets replicated to some Gateway (redundancy enforced)
> > > > >2) The originating Gateway, if it's also replicating, should
> > > > >replicate (just data, no publication yet) from the QC checked replica.
> > > > >3) The "pre-print" gets removed (which either means moving it to a 
> > > > >different project, Gateway, etc. or really completely deleting it
> > > > >from the Gateway)
> > > > >4) The replica gets published.
> > > > >
> > > > >I might be omitting something, but it seems achievable right now.
> > > >
> > > >My 2c,
> > > >Estani
> > > >
> > > > >On 10.11.2011 07:15, Bryan Lawrence wrote:
> > > > >> Martin has been quite vociferous (quite rightly) in personal email
> > > > >> to me that as far as QC goes, the dataset which gets through QC2 will
> > > > >> *not* be the original dataset - we have no control over the original
> > > > >> dataset's permanence and/or immutability.
> > > > >>
> > > > >> This raises some interesting issues about the role of ESGF ... and
> > > > >> its interaction with the data owner and the publication process,
> > > > >> which is governed by DKRZ as the Publisher (and in the future
> > > > >> probably multiple publication processes and multiple Publishers). The
> > > > >> correct analogy here, as I said in an earlier email today, is to
> > > > >> consider the original dataset as a preprint of a Published dataset
> > > > >> (at QC level 3).
> > > > >>
> > > > >> Incidentally, this distinction might offer us a possible (distinct)
> > > > >> future for two different types of gateways into ESGF: the Published
> > > > >> datasets view (which makes pre-eminent the QC'd copy) and the
> > > > >> published view (which makes pre-eminent whatever someone sticks on
> > > > >> a data node).
> > > > >>
> > > > >> But meanwhile, I think we can live with what you proposed, as long
> > > > >> as the QC status of the replicas is clearly visible - and the DOI
> > > > >> points to a landing page that somehow prioritises those versions,
> > > > >> which would be trivial if your page was organised in the same way
> > > > >> (prioritising the replicas of QC level 3, then replicas of QC
> > > > >> level 2, and then originals).
> > > >>
> > > >> Cheers
> > > >> Bryan
> > > >>
> > > >>
> > > >>> Hi Stephen,
> > > >>>
> > > >>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> > > >>>> Hi Eric,
> > > >>>>
> > > > >>>> Replicas are beginning to show up in CMIP5 and this is exposing some
> > > > >>>> gaps in what Gateway 1.x can do. I know you are reimplementing replica
> > > > >>>> support in Gateway 2.0 so I'd like to raise these issues now.
> > > > >>>>
> > > > >>>> We need to be able to publish a replica to the same Gateway that hosts
> > > > >>>> the original. I can't imagine this being possible with Gateway 1.x since
> > > > >>>> the URL http://<GATEWAY>/dataset/<dataset-id>.html only points to one
> > > > >>>> dataset on that Gateway. Either that page needs to link to the original
> > > > >>>> and all replicas for that dataset or we need separate URLs for each
> > > > >>>> replica/original, or both.
> > > > >>> The current direction for the implementation would be to have one
> > > > >>> page for the original dataset and have that page list where replicas
> > > > >>> are located.
> > > > >>>
> > > > >>> If there are use cases for the other options we should get those
> > > > >>> identified.
> > > >>>
> > > >>> Thanks!
> > > >>> -Nate
> > > >>>
> > > >>>
> > > > >>>> Is this part of your design for Gateway 2.0's replica support?
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> Stephen.
> > > >>>>
> > > >>>> ---
> > > >>>>
> > > >>>> Stephen Pascoe +44 (0)1235 445980
> > > >>>>
> > > >>>> Centre of Environmental Data Archival
> > > >>>>
> > > > >>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Scanned by iCritical.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> _______________________________________________
> > > >>>> GO-ESSP-TECH mailing list
> > > >>>> GO-ESSP-TECH at ucar.edu
> > > >>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> > > >>>
> > > >> --
> > > >> Bryan Lawrence
> > > > >> University of Reading:  Professor of Weather and Climate Computing.
> > > > >> National Centre for Atmospheric Science: Director of Models and Data.
> > > > >> STFC: Director of the Centre for Environmental Data Archival.
> > > > >> Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
> > > >
> > > >
> > > >--
> > > >Estanislao Gonzalez
> > > >
> > > > >Max-Planck-Institut für Meteorologie (MPI-M)
> > > > >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> > > > >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> > > >
> > > >Phone:   +49 (40) 46 00 94-126
> > > >E-Mail:  gonzalez at dkrz.de
> > > >

