[Go-essp-tech] Replica support in Gateway 2.0

Kettleborough, Jamie jamie.kettleborough at metoffice.gov.uk
Tue Nov 15 06:50:59 MST 2011


Hello Michael, Bryan, Martin,

Thanks for the replies on this; that has clarified things as much as necessary for now.

Jamie 

> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Michael 
> Lautenschlager
> Sent: 15 November 2011 08:10
> To: Bryan Lawrence
> Cc: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> 
> That is exactly why we should be somewhat cautious in
> finalising QC-L3 and assigning DataCite DOIs. This is by
> definition a final data product and we cannot remove this
> entity. We have to keep it and, in case of error, add a
> corrigendum or publish a new version with a new DataCite DOI,
> which then also has to be kept forever. As Bryan pointed out,
> DataCite DOI publications follow rules comparable to those for
> scientific publications.
> 
> For the replicas, we have to ensure that they are identical to
> the DataCite published entities.
> Best, Michael
> 
> ---------------
> Dr. Michael Lautenschlager
> Head of DKRZ Department Data Management
> Director World Data Center Climate
> 
> German Climate Computing Centre (DKRZ)
> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
> PHONE:   +4940-460094-118
> E-Mail:  lautenschlager at dkrz.de
> 
> URL:    http://www.dkrz.de/
>          http://www.wdc-climate.de/
> 
> 
> Managing Director: Prof. Dr. Thomas Ludwig
> Registered office: Hamburg; District Court Hamburg HRB 39784
> 
> 
> On 14.11.2011 20:59, Bryan Lawrence wrote:
> > the bottom line is that a DOI'd dataset is just like a journal
> > paper. If you subsequently find fault, you publish a corrigendum,
> > or a new paper ... but the original still hangs around, because
> > it's part of the evidential world ...
> > Cheers
> > Bryan
> >
> >> Hi Jamie,
> >>
> >> when a DOI'ed data set is found to be wrong the role of the data
> >> centre is to post a note on the DOI landing page and email those
> >> who have applied for access to it if applicable. The data provider
> >> may want to do more -- e.g. notify journals if they have already
> >> published results based on the data.
> >>
> >> For the non-DOI'ed data -- it is true that not everyone will follow
> >> advice,
> >>
> >> cheers,
> >> Martin
> >>
> >> ________________________________________
> >> From: Kettleborough, Jamie [jamie.kettleborough at metoffice.gov.uk]
> >> Sent: 14 November 2011 16:36
> >> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de; 
> >> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >> Cc: Kettleborough, Jamie
> >> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>
> >> Hello Martin,
> >>
> >> I'm not sure how effective this suggestion would be.  I don't know
> >> that analysts will contact data suppliers, nor that the data
> >> suppliers will always be in a position to respond.
> >>
> >> I do wonder if there isn't some place for feedback from data
> >> analysts to other analysts and data suppliers (maybe Ag's version
> >> comment app? Or a community wiki, or a place to upload ncatted
> >> scripts?).  My guess is that some issues are hard to detect in
> >> automatic QC, e.g. how do you detect that the forcing attribute is
> >> wrong?  Yet for some analysis (detection and attribution, for
> >> instance) it is crucial that this is right.  These things are only
> >> discovered after data has gone out: by analysts.  It would be great
> >> if they could let others know (just letting the data supplier know
> >> may not be enough as their response timescale may be slow - I know
> >> ours is at the moment).
> >>
> >> What is the plan when a DOI'd data set is found to be wrong?
> >>
> >> Jamie
> >>
> >>> -----Original Message-----
> >>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
> >>> Sent: 11 November 2011 09:46
> >>> To: Kettleborough, Jamie; gonzalez at dkrz.de; 
> go-essp-tech at ucar.edu; 
> >>> esg-gateway-dev at earthsystemgrid.org
> >>> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>>
> >>> Hello Jamie,
> >>>
> >>> Some of your points might have been answered elsewhere, but I just
> >>> wanted to point out some issues raised by your comment about people
> >>> publishing now.
> >>>
> >>> The archive is required to distribute provisional data, and this
> >>> has some consequences. DOIs are clearly important because they
> >>> make a clean break between the provisional and the reference
> >>> archive. There was a help desk response a couple of weeks ago
> >>> suggesting that a user who wanted information about possible
> >>> changes to data contact the individual data node managers. This is
> >>> relevant to people publishing now, and perhaps we should be
> >>> explicitly stating that users should contact the data suppliers
> >>> before publishing. There are two reasons for doing this:
> >>> (1) The data suppliers might be on the point of retracting and/or
> >>> replacing data;
> >>> (2) Technical problems in the distributed archive mean that people
> >>> need to check exactly which data they are using.
> >>>
> >>> We should, as you say, deal with item (2) as fast as possible. Item
> >>> (1) will remain until we get assurances from data suppliers about
> >>> the stability of the data, which is essentially the DOI step. So I
> >>> think we should be advising users who are publishing work based on
> >>> non-DOI'ed data to contact data suppliers before submitting their
> >>> work for publication. What do you think of this last suggestion?
> >>>
> >>> cheers,
> >>> Martin
> >>> ________________________________________
> >>> From: Kettleborough, Jamie [jamie.kettleborough at metoffice.gov.uk]
> >>> Sent: 10 November 2011 17:14
> >>> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de; 
> >>> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>> Cc: Kettleborough, Jamie
> >>> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>>
> >>> Hello,
> >>>
> >>> I know this thread has moved on, but can we rewind just a bit?  I
> >>> think Nathan asked for use cases around this issue.
> >>>
> >>> As far as I'm aware there are two main user use cases here (there 
> >>> may be others obviously).
> >>>
> >>> 1. User wants to get data *now* (even if it's only 'preprint'
> >>> in Bryan's language) in a way that suits their particular needs
> >>> (functional and non-functional)
> >>>
> >>> 2. In a few years' time (not sure how long - could be months) user
> >>> wants to get a copy of the data to verify or extend some previous
> >>> analysis.
> >>>
> >>> I'm sure someone will correct me if I have this wrong, but I
> >>> *think* most of the discussion so far has centered around the
> >>> second of these.  I think the first one needs some discussion too
> >>> though, doesn't it - it feels more urgent to me?
> >>>
> >>> I think it needs reviewing in the light of replication, as when
> >>> there is replication a user *may* choose to go to a particular
> >>> data-node as it suits them better - they may see faster downloads
> >>> because of the network route between them and the servers, or it
> >>> might support some data service they want.  Something in the system
> >>> has to have the responsibility of choosing which replica to
> >>> download.  To be honest I think this is best left to the user.  If
> >>> this is the case then the user has to have a good view of *where*
> >>> the data is through their interface.  Nate - I think your proposed
> >>> implementation would expose the information, no?
> >>>
> >>> The other issue for replication I can think of coming out of use
> >>> case 1 is versioning.  Data will be revised by data providers (we
> >>> have examples of this), so I think that the replication system has
> >>> to keep up and the interfaces have to be able to communicate this
> >>> to the user.  A case I'm worried about is if a replica goes stale
> >>> (sorry Estani, I *think* we have examples of these, e.g.
> >>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> >>> at DKRZ is stale - we've resubmitted as v20111102.  I only
> >>> discovered this yesterday, and understood what *might* be happening
> >>> today; I can send more examples if you need them).  I think a user
> >>> needs to be able to tell (without too much hard work) what really
> >>> is a replica, and what is a replica of a previous version.  Nate -
> >>> can you cope with this situation in your proposed implementation?
> >>>
> >>> (Are there any issues around authorisation when it comes to
> >>> replicas - would a new published version mean all replicas of
> >>> previous versions are no longer 'authorisable' against, or would
> >>> stale replicas be available?)
> >>>
> >>> I know that replication has started to happen - but is this the
> >>> right thing now?  Is everything in place to do this in a way that
> >>> is *safe* and not going to confuse users?
> >>>
> >>> Jamie
> >>>
> >>> (I guess if you have a client that only uses the data nodes you
> >>> know what node you were talking to, and see the full version
> >>> information from the outset, so these aren't such big issues).
> >>>
> >>> ps yes I know I still have to answer some questions from Sebastien
> >>> about our client.
> >>>
> >>>> -----Original Message-----
> >>>> From: go-essp-tech-bounces at ucar.edu 
> >>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of 
> >>>> martin.juckes at stfc.ac.uk
> >>>> Sent: 10 November 2011 16:26
> >>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu; 
> >>>> esg-gateway-dev at earthsystemgrid.org
> >>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>
> >>>> Hi Estani,
> >>>>
> >>>> You missed the start -- the bit which is not achievable is
> >>>> publishing a replica to the same gateway used for the original
> >>>> publication of that data. E.g. IPSL data published to BADC,
> >>>>
> >>>> Cheers,
> >>>> Martin
> >>>>
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech- 
> >>>>>> bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >>>>>> Sent: 10 November 2011 16:20
> >>>>>> To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> this analogy seems perfect. Now, regarding the options we have
> >>>>>> at the moment:
> >>>>>> 1) How "unique" is the dataset id in a Gateway? Federation wide,
> >>>>>> local Gateway or project unique?
> >>>>>> 2) Depending on (1), the procedure could involve "moving" the
> >>>>>> published data to some other project, Gateway, Federation :-)
> >>>>>>
> >>>>>> I think this could be achievable:
> >>>>>> 1) Data gets replicated to some Gateway (redundancy enforced)
> >>>>>> 2) The originating Gateway, if it's also replicating, should
> >>>>>> replicate (just data, no publication yet) from the QC checked
> >>>>>> replica.
> >>>>>> 3) The "pre-print" gets removed (which either means moving it to
> >>>>>> a different project, Gateway, etc. or really deleting it
> >>>>>> completely from the Gateway)
> >>>>>> 4) The replica gets published.
> >>>>>>
> >>>>>> I might be omitting something, but it seems achievable right now.
> >>>>>>
> >>>>>> My 2c,
> >>>>>> Estani
> >>>>>>
> >>>>>> On 10.11.2011 07:15, Bryan Lawrence wrote:
> >>>>>>> Martin has been quite vociferous (quite rightly) in personal
> >>>>>>> email to me that as far as QC goes, the dataset which gets
> >>>>>>> through QC2 will *not* be the original dataset - we have no
> >>>>>>> control over the original dataset's permanence and/or
> >>>>>>> immutability.
> >>>>>>> This raises some interesting issues about the role of ESGF ...
> >>>>>>> and its interaction with the data owner and the publication
> >>>>>>> process which is governed by DKRZ as the Publisher (and in the
> >>>>>>> future probably multiple publication processes and multiple
> >>>>>>> Publishers). The correct analogy here, as I said in an earlier
> >>>>>>> email today, is to consider the original dataset as a preprint
> >>>>>>> of a Published dataset (at QC level 3).
> >>>>>>> Incidentally, this distinction might offer us a possible
> >>>>>>> (distinct) future for two different types of gateways into
> >>>>>>> ESGF: the Published datasets view (which makes pre-eminent the
> >>>>>>> QC'd copy) and the published view (which makes pre-eminent
> >>>>>>> whatever someone sticks on a data node).
> >>>>>>> But meanwhile, I think we can live with what you proposed, as
> >>>>>>> long as the QC status of the replicas is clearly visible - and
> >>>>>>> the DOI points to a landing page that somehow prioritises those
> >>>>>>> versions, which would be trivial if your page was organised in
> >>>>>>> the same way (prioritising the replicas of QC level 3, then
> >>>>>>> replicas of QC level 2, and then originals).
> >>>>>>> Cheers
> >>>>>>> Bryan
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi Stephen,
> >>>>>>>>
> >>>>>>>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> >>>>>>>>> Hi Eric,
> >>>>>>>>>
> >>>>>>>>> Replicas are beginning to show up in CMIP5 and this is
> >>>>>>>>> exposing some gaps in what Gateway 1.x can do. I know you are
> >>>>>>>>> reimplementing replica support in Gateway 2.0 so I'd like to
> >>>>>>>>> raise these issues now.
> >>>>>>>>>
> >>>>>>>>> We need to be able to publish a replica to the same Gateway
> >>>>>>>>> that hosts the original. I can't imagine this being possible
> >>>>>>>>> with Gateway 1.x since the URL
> >>>>>>>>> http://<GATEWAY>/dataset/<dataset-id>.html only points to one
> >>>>>>>>> dataset on that Gateway. Either that page needs to link to
> >>>>>>>>> the original and all replicas for that dataset or we need
> >>>>>>>>> separate URLs for each replica/original, or both.
> >>>>>>>> The current direction for the implementation would be to have
> >>>>>>>> one page for the original dataset and have that page list
> >>>>>>>> where replicas are located.
> >>>>>>>>
> >>>>>>>> If there are use cases for the other options we should get
> >>>>>>>> those identified.
> >>>>>>>> Thanks!
> >>>>>>>> -Nate
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Is this part of your design for Gateway 2.0's
> >>> replica support?
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Stephen.
> >>>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> Stephen Pascoe +44 (0)1235 445980
> >>>>>>>>>
> >>>>>>>>> Centre of Environmental Data Archival
> >>>>>>>>>
> >>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford,
> >>>>>>>>> Didcot OX11 0QX, UK
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Scanned by iCritical.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Bryan Lawrence
> >>>>>>> University of Reading:  Professor of Weather and Climate Computing.
> >>>>>>> National Centre for Atmospheric Science: Director of Models and Data.
> >>>>>>> STFC: Director of the Centre for Environmental Data Archival.
> >>>>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> >>>>>>
> >>>>>> --
> >>>>>> Estanislao Gonzalez
> >>>>>>
> >>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
> >>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>>>>>
> >>>>>> Phone:   +49 (40) 46 00 94-126
> >>>>>> E-Mail:  gonzalez at dkrz.de
> >>>>>>
> >>>>
> >>>
> > --
> > Bryan Lawrence
> > University of Reading:  Professor of Weather and Climate Computing.
> > National Centre for Atmospheric Science: Director of Models and Data.
> > STFC: Director of the Centre for Environmental Data Archival.
> > Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence 
> >
> 
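
As a rough illustration of the stale-replica case raised in the quoted thread (a replica at DKRZ still at v2011032 after the original was resubmitted as v20111102), a minimal Python sketch might compare the version component of two dataset ids. The function names, and the assumption that the version is the final dot-separated component written as "v" followed by digits, are hypothetical and for illustration only; this is not part of any Gateway or ESGF interface.

```python
def split_dataset_id(dataset_id):
    """Split an id such as 'a.b.c.v20111102' into ('a.b.c', 'v20111102').

    Assumption (hypothetical): the version is the last dot-separated
    component and has the form 'v' followed by digits.
    """
    base, _, version = dataset_id.rpartition(".")
    if not base or not version.startswith("v") or not version[1:].isdigit():
        raise ValueError("no version component found in %r" % dataset_id)
    return base, version


def is_stale(replica_id, original_id):
    """Return True if the replica's version differs from the original's.

    Versions are only tested for equality, not ordered, because the
    version strings are not guaranteed to share one format.
    """
    replica_base, replica_version = split_dataset_id(replica_id)
    original_base, original_version = split_dataset_id(original_id)
    if replica_base != original_base:
        raise ValueError("ids refer to different datasets")
    return replica_version != original_version


if __name__ == "__main__":
    base = "cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1"
    print(is_stale(base + ".v2011032", base + ".v20111102"))
```

A check like this only flags that a replica and the current original disagree; it says nothing about which copy is correct, so the interface would still need to show both versions to the user.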

