[Go-essp-tech] Replica support in Gateway 2.0
Kettleborough, Jamie
jamie.kettleborough at metoffice.gov.uk
Tue Nov 15 06:50:59 MST 2011
Hello Michael, Bryan, Martin,
Thanks for the replies on this; they have clarified things as much as necessary for now.
Jamie
> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Michael
> Lautenschlager
> Sent: 15 November 2011 08:10
> To: Bryan Lawrence
> Cc: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
>
> That is exactly why we should be somewhat cautious in
> finalising QC-L3 and assigning DataCite DOIs. This is by
> definition a final data product and we cannot remove this
> entity. We have to keep it and, in case of error, add a
> corrigendum or publish a new version with a new DataCite DOI,
> which then also has to be kept forever. As Bryan pointed out,
> DataCite DOI publications follow rules comparable to those of
> scientific publications.
>
> For the replicas we have to ensure that they are identical to
> the DataCite published entities.
> Best, Michael
>
> ---------------
> Dr. Michael Lautenschlager
> Head of DKRZ Department Data Management
> Director World Data Center Climate
>
> German Climate Computing Centre (DKRZ)
> ADDRESS: Bundesstrasse 45a, D-20146 Hamburg, Germany
> PHONE: +4940-460094-118
> E-Mail: lautenschlager at dkrz.de
>
> URL: http://www.dkrz.de/
> http://www.wdc-climate.de/
>
>
> Managing Director: Prof. Dr. Thomas Ludwig. Registered office:
> Hamburg. Amtsgericht Hamburg (district court) HRB 39784
>
>
> On 14.11.2011 20:59, Bryan Lawrence wrote:
> > The bottom line is that a DOI'd dataset is just like a journal
> > paper. If you subsequently find fault, you publish a corrigendum,
> > or a new paper ... but the original still hangs around, because
> > it's part of the evidential world ...
> > Cheers
> > Bryan
> >
> >> Hi Jamie,
> >>
> >> when a DOI'ed data set is found to be wrong, the role of the data
> >> centre is to post a note on the DOI landing page and email those
> >> who have applied for access to it, if applicable. The data provider
> >> may want to do more -- e.g. notify journals if they have already
> >> published results based on the data.
> >>
> >> For the non-DOI'ed data -- it is true that not everyone will follow
> >> advice,
> >>
> >> cheers,
> >> Martin
> >>
> >> ________________________________________
> >> From: Kettleborough, Jamie [jamie.kettleborough at metoffice.gov.uk]
> >> Sent: 14 November 2011 16:36
> >> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de;
> >> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >> Cc: Kettleborough, Jamie
> >> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>
> >> Hello Martin,
> >>
> >> I'm not sure how effective this suggestion would be. I don't know
> >> that analysts will contact data suppliers, nor that the data
> >> suppliers will always be in a position to respond.
> >>
> >> I do wonder if there isn't some place for feedback from data
> >> analysts to other analysts and data suppliers (maybe Ag's version
> >> comment app? Or a community wiki, or a place to upload ncatted
> >> scripts?). My guess is that some issues are hard to detect in
> >> automatic QC, e.g. how do you detect that the forcing attribute is
> >> wrong? Yet for some analysis (detection and attribution, for
> >> instance) it is crucial that this is right. These things are only
> >> discovered after data has gone out: by analysts. It would be great
> >> if they could let others know (just letting the data supplier know
> >> may not be enough as their response timescale may be slow - I know
> >> ours is at the moment).
> >>
> >> What is the plan when a DOI'd data set is found to be wrong?
> >>
> >> Jamie
> >>
> >>> -----Original Message-----
> >>> From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
> >>> Sent: 11 November 2011 09:46
> >>> To: Kettleborough, Jamie; gonzalez at dkrz.de; go-essp-tech at ucar.edu;
> >>> esg-gateway-dev at earthsystemgrid.org
> >>> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>>
> >>> Hello Jamie,
> >>>
> >>> Some of your points might have been answered elsewhere, but I just
> >>> wanted to point out some issues raised by your comment about people
> >>> publishing now.
> >>>
> >>> The archive is required to distribute provisional data, and this has
> >>> some consequences. DOIs are clearly important because they make a
> >>> clean break between the provisional and the reference archive. There
> >>> was a help desk response a couple of weeks ago suggesting that a
> >>> user who wanted information about possible changes to data contact
> >>> the individual data node managers. This is relevant to people
> >>> publishing now, and perhaps we should be explicitly stating that
> >>> users should contact the data suppliers before publishing. There are
> >>> two reasons for doing this:
> >>> (1) The data suppliers might be on the point of retracting and/or
> >>> replacing data;
> >>> (2) Technical problems in the distributed archive mean that people
> >>> need to check exactly which data they are using.
> >>>
> >>> We should, as you say, deal with item (2) as fast as possible. Item
> >>> (1) will remain until we get assurances from data suppliers about
> >>> the stability of the data, which is essentially the DOI step. So I
> >>> think we should be advising users who are publishing work based on
> >>> non-DOI'ed data to contact data suppliers before submitting their
> >>> work for publication. What do you think of this last suggestion?
> >>>
> >>> cheers,
> >>> Martin
> >>> ________________________________________
> >>> From: Kettleborough, Jamie [jamie.kettleborough at metoffice.gov.uk]
> >>> Sent: 10 November 2011 17:14
> >>> To: Juckes, Martin (STFC,RAL,RALSP); gonzalez at dkrz.de;
> >>> go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>> Cc: Kettleborough, Jamie
> >>> Subject: RE: [Go-essp-tech] Replica support in Gateway 2.0
> >>>
> >>> Hello,
> >>>
> >>> I know this thread has moved on, but can we rewind just a bit. I
> >>> think Nathan asked for use cases around this issue.
> >>>
> >>> As far as I'm aware there are two main user use cases here (there
> >>> may be others obviously).
> >>>
> >>> 1. User wants to get data *now* (even if it's only 'preprint'
> >>> in Bryan's language) in a way that suits their particular needs
> >>> (functional and non-functional)
> >>>
> >>> 2. In a few years' time (not sure how long - could be months) the
> >>> user wants to get a copy of the data to verify or extend some
> >>> previous analysis.
> >>>
> >>> I'm sure someone will correct me if I have this wrong, but I
> >>> *think* most of the discussion so far has centered around the second
> >>> of these. I think the first one needs some discussion too though,
> >>> doesn't it - it feels more urgent to me?
> >>>
> >>> I think it needs reviewing in the light of replication, as when there
> >>> is replication a user *may* choose to go to a particular data-node
> >>> as it suits them better - they may see faster downloads because of
> >>> the network route between them and the servers, or it might support
> >>> some data service they want. Something in the system has to have
> >>> the responsibility of choosing which replica to download. To be
> >>> honest I think this is best left to the user. If this is the case
> >>> then the user has to have a good view of *where* the data is through
> >>> their interface. Nate - I think your proposed implementation would
> >>> expose the information, no?
> >>>
> >>> The other issue for replication I can think of coming out of use
> >>> case 1 is versioning. Data will be revised by data providers (we
> >>> have examples of this), so I think that the replication system has
> >>> to keep up and the interfaces have to be able to communicate this
> >>> to the user. A case I'm worried about is if a replica goes stale
> >>> (sorry Estani, I *think* we have examples of these, e.g.
> >>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> >>> at DKRZ is stale - we've resubmitted as v20111102. I only
> >>> discovered this yesterday, and understood what *might* be
> >>> happening today; I can send more examples if you need them). I
> >>> think a user needs to be able to tell (without too much hard work)
> >>> what really is a replica, and what is a replica of a previous
> >>> version. Nate - can you cope with this situation in your proposed
> >>> implementation?
> >>>
> >>> (Are there any issues around authorisation when it comes to replicas
> >>> - would a new published version mean all replicas of previous
> >>> versions are no longer 'authorisable' against, or would stale
> >>> replicas be available?)
> >>>
> >>> I know that replication has started to happen - but is this the
> >>> right thing now? Is everything in place to do this in a way that is
> >>> *safe* and not going to confuse users?
> >>>
> >>> Jamie
> >>>
> >>> (I guess if you have a client that only uses the data nodes you know
> >>> what node you were talking to, and see the full version information
> >>> from the outset, so these aren't such big issues).
> >>>
> >>> ps yes I know I still have to answer some questions from Sebastien
> >>> about our client.
> >>>
> >>>> -----Original Message-----
> >>>> From: go-essp-tech-bounces at ucar.edu
> >>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
> >>>> martin.juckes at stfc.ac.uk
> >>>> Sent: 10 November 2011 16:26
> >>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu;
> >>>> esg-gateway-dev at earthsystemgrid.org
> >>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>
> >>>> Hi Estani,
> >>>>
> >>>> You missed the start -- the bit which is not achievable is publishing
> >>>> a replica to the same gateway used for the original publication of
> >>>> that data. E.g. IPSL data published to BADC,
> >>>>
> >>>> Cheers,
> >>>> Martin
> >>>>
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu
> >>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >>>>>> Sent: 10 November 2011 16:20
> >>>>>> To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> this analogy seems perfect. Now regarding the options we have at
> >>>>>> the moment:
> >>>>>> 1) How "unique" is the dataset id in a Gateway? Federation wide,
> >>>>>> local Gateway or project unique?
> >>>>>> 2) Depending on 1), the procedure could involve "moving" the
> >>>>>> published data to some other project, Gateway, Federation :-)
> >>>>>>
> >>>>>> I think this could be achievable:
> >>>>>> 1) Data gets replicated to some Gateway (redundancy enforced)
> >>>>>> 2) The originating Gateway, if it's also replicating, should replicate
> >>>>>> (just data, no publication yet) from the QC checked replica.
> >>>>>> 3) The "pre-print" gets removed (which either means moving it to a
> >>>>>> different project, Gateway, etc. or really completely deleting it from
> >>>>>> the Gateway)
> >>>>>> 4) The replica gets published.
> >>>>>>
> >>>>>> I might be omitting something, but it seems achievable right now.
> >>>>>> My 2c,
> >>>>>> Estani
> >>>>>>
> >>>>>> On 10.11.2011 07:15, Bryan Lawrence wrote:
> >>>>>>> Martin has been quite vociferous (quite rightly) in personal email
> >>>>>>> to me that as far as QC goes, the dataset which gets through QC2 will
> >>>>>>> *not* be the original dataset - we have no control over the original
> >>>>>>> dataset's permanence and/or immutability.
> >>>>>>>
> >>>>>>> This raises some interesting issues about the role of ESGF ... and
> >>>>>>> its interaction with the data owner and the publication process
> >>>>>>> which is governed by DKRZ as the Publisher (and in the future
> >>>>>>> probably multiple publication processes and multiple Publishers). The
> >>>>>>> correct analogy here, as I said in an earlier email today, is to
> >>>>>>> consider the original dataset as a preprint of a Published dataset
> >>>>>>> (at QC level 3).
> >>>>>>>
> >>>>>>> Incidentally, this distinction might offer us a possible (distinct)
> >>>>>>> future for two different types of gateways into ESGF: the Published
> >>>>>>> datasets view (which makes pre-eminent the QC'd copy) and the
> >>>>>>> published view (which makes pre-eminent whatever someone sticks on
> >>>>>>> a data node).
> >>>>>>>
> >>>>>>> But meanwhile, I think we can live with what you proposed, as long
> >>>>>>> as the QC status of the replicas is clearly visible - and the DOI
> >>>>>>> points to a landing page that somehow prioritises those versions,
> >>>>>>> which would be trivial if your page was organised in the same way
> >>>>>>> (prioritising the replicas of QC level 3, then replicas of QC
> >>>>>>> level 2, and then originals).
> >>>>>>> Cheers
> >>>>>>> Bryan
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi Stephen,
> >>>>>>>>
> >>>>>>>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> >>>>>>>>> Hi Eric,
> >>>>>>>>>
> >>>>>>>>> Replicas are beginning to show up in CMIP5 and this is exposing
> >>>>>>>>> some gaps in what Gateway 1.x can do. I know you are reimplementing
> >>>>>>>>> replica support in Gateway 2.0 so I'd like to raise these issues now.
> >>>>>>>>>
> >>>>>>>>> We need to be able to publish a replica to the same Gateway that
> >>>>>>>>> hosts the original. I can't imagine this being possible with Gateway
> >>>>>>>>> 1.x since the URL http://<GATEWAY>/dataset/<dataset-id>.html only
> >>>>>>>>> points to one dataset on that Gateway. Either that page needs to
> >>>>>>>>> link to the original and all replicas for that dataset or we need
> >>>>>>>>> separate URLs for each replica/original, or both.
> >>>>>>>> The current direction for the implementation would be to have one
> >>>>>>>> page for the original dataset and have that page list where replicas
> >>>>>>>> are located.
> >>>>>>>>
> >>>>>>>> If there are use cases for the other options we should get those
> >>>>>>>> identified.
> >>>>>>>> Thanks!
> >>>>>>>> -Nate
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Is this part of your design for Gateway 2.0's replica support?
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Stephen.
> >>>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> Stephen Pascoe +44 (0)1235 445980
> >>>>>>>>>
> >>>>>>>>> Centre of Environmental Data Archival
> >>>>>>>>>
> >>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Scanned by iCritical.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Bryan Lawrence
> >>>>>>> University of Reading: Professor of Weather and Climate Computing.
> >>>>>>> National Centre for Atmospheric Science: Director of Models and Data.
> >>>>>>> STFC: Director of the Centre for Environmental Data Archival.
> >>>>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> >>>>>>
> >>>>>> --
> >>>>>> Estanislao Gonzalez
> >>>>>>
> >>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
> >>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>>>>>
> >>>>>> Phone: +49 (40) 46 00 94-126
> >>>>>> E-Mail: gonzalez at dkrz.de
> >>>>>>
> >>>>
> >>>
> > --
> > Bryan Lawrence
> > University of Reading: Professor of Weather and Climate Computing.
> > National Centre for Atmospheric Science: Director of Models and Data.
> > STFC: Director of the Centre for Environmental Data Archival.
> > Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> >
>
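
The stale-replica case raised earlier in the thread (a replica still carrying an older version after the original has been resubmitted) could be checked roughly as follows. This is only a sketch under stated assumptions: the Copy record, the node names and the example dataset id are hypothetical, and it assumes every dataset id ends in a ".vYYYYMMDD" version suffix that can be compared as an integer. It is not how the Gateway or the replication system actually does it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One published copy of a dataset (hypothetical record layout)."""
    dataset_id: str   # full id, assumed to end in ".vYYYYMMDD"
    node: str         # data node holding this copy
    is_replica: bool  # False for the original publication


def split_version(dataset_id: str) -> tuple[str, int]:
    """Split 'a.b.c.v20111102' into ('a.b.c', 20111102)."""
    base, _, version = dataset_id.rpartition(".v")
    if not base or not version.isdigit():
        raise ValueError(f"no version suffix in {dataset_id!r}")
    return base, int(version)


def stale_replicas(copies: list[Copy]) -> list[Copy]:
    """Return replicas whose version is older than the newest original."""
    # Newest version published by the originating node, per dataset.
    latest: dict[str, int] = {}
    for c in copies:
        if not c.is_replica:
            base, version = split_version(c.dataset_id)
            latest[base] = max(version, latest.get(base, 0))
    # A replica is stale if a newer original of the same dataset exists.
    stale = []
    for c in copies:
        if c.is_replica:
            base, version = split_version(c.dataset_id)
            if base in latest and version < latest[base]:
                stale.append(c)
    return stale


# Hypothetical example: one original and two replicas of the same dataset.
copies = [
    Copy("cmip5.example.model.exp.v20111102", "origin-node", False),
    Copy("cmip5.example.model.exp.v20110301", "replica-node-a", True),
    Copy("cmip5.example.model.exp.v20111102", "replica-node-b", True),
]
for c in stale_replicas(copies):
    print(f"stale replica on {c.node}: {c.dataset_id}")
```

Comparing against the newest original, rather than against the newest copy anywhere, matches the concern in the thread: the user wants to know whether a replica is a replica of the current version or of a previous one.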