[Go-essp-tech] Replica support in Gateway 2.0

Bryan Lawrence bryan.lawrence at ncas.ac.uk
Fri Nov 11 01:38:45 MST 2011


... and having replied to the first part of Jamie's message and ignored the second ... clearly that was a mistake (on my part and the gateway's) ... I agree that this is a major issue ... thanks Jamie!

Bryan

> Hi Jamie,
> 
> I wasn't aware of the "stale" datasets. Well, that is a *major* Gateway
> bug: the Gateway should already know about the newer version, so as it is
> designed right now it "should" point just to the original, because there's
> no replica of that version.
> [What happens now is that if you select BADC as the Gateway to download
> from, you end up with the new version, although DKRZ said the version
> was the older one. It looks like a major bug to me.]
> 
> The fact that the new version does not get replicated immediately is
> not a problem but a characteristic of any distributed system. It's
> impossible to keep a replica of the whole archive synchronized to the
> minute. What we can do is "hide" things that are outdated.
> Basically:
> 1) If the user searches for an id, that resolves to the latest version. 
> (If no version is given, we might safely presume the latest is 
> requested)
> 2) Only replicas of that version should be displayed if available (this 
> should be already possible)
> 3) If the user searches for a particular version, she/he should be 
> getting exactly this (feature not implemented)
> 4) Same as 2, display only known replicas of that version.
> 
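One way to read the four rules above is as a lookup over (dataset id, version, node) records. What follows is a minimal sketch under stated assumptions, not Gateway code: the `Copy` record, the `resolve` function, the node names and the integer versions are all hypothetical and exist only to illustrate the rules.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Copy:
    """One published copy of a dataset version (hypothetical record type)."""
    dataset_id: str   # dataset id without the version suffix
    version: int      # larger means newer (assumption)
    node: str         # where this copy is hosted
    is_replica: bool  # False for the original publication


def resolve(catalog: Iterable[Copy], dataset_id: str,
            version: Optional[int] = None) -> Tuple[int, List[Copy]]:
    """Return the resolved version and only the copies of that version."""
    known = [c for c in catalog if c.dataset_id == dataset_id]
    if not known:
        raise LookupError(f"unknown dataset: {dataset_id}")
    if version is None:
        # Rule 1: no version given, so presume the latest is requested.
        version = max(c.version for c in known)
    # Rules 2 and 4: show only copies of the resolved version, so copies
    # of other versions are hidden rather than listed as replicas.
    matches = [c for c in known if c.version == version]
    if not matches:
        # Rule 3: an explicit version must resolve to exactly that version.
        raise LookupError(f"{dataset_id} has no version {version}")
    return version, matches


# Hypothetical catalog: version 2 exists only as the original on node-a.
catalog = [
    Copy("example.dataset", 1, "node-a", False),
    Copy("example.dataset", 1, "node-b", True),
    Copy("example.dataset", 2, "node-a", False),
]
latest_version, latest_copies = resolve(catalog, "example.dataset")
```

With this catalog, a search without a version resolves to version 2 and lists only node-a; the version 1 replica on node-b appears only when version 1 is asked for explicitly.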
> I don't think this is complicated. And regarding the "few years [from
> now]", I don't think that's the case, and not only because of bandwidth
> (which IS an issue already, as is server load). I'm pretty sure
> papers are getting written "as we speak", even before a DOI gets out
> there. So these people "need" to cite something, right? The only thing
> they can hold on to is a bunch of URLs AND checksums. It's the only thing
> you can "cite" at the moment; without it no papers could be written,
> and not only in the CMIP5 context (and thus a large share of AR5).
> That's the reason the archives are there. You can cite what you find at
> DKRZ, or BADC, or PCMDI. What's in there is guaranteed to remain exactly
> where it was found for the next 10+ years; other institutions do not
> have this commitment (nor can they).
> 
> Again 2c. Thanks,
> Estani
> 
> On 10.11.2011 09:14, Kettleborough, Jamie wrote:
> > Hello,
> >
> > I know this thread has moved on, but can we rewind just a bit.  I
> > think Nathan asked for use cases around this issue.
> >
> > As far as I'm aware there are two main user use cases here (there may
> > be others obviously).
> >
> > 1. User wants to get data *now* (even if it's only 'preprint' in
> > Bryan's language) in a way that suits their particular needs
> > (functional and non-functional).
> >
> > 2. In a few years' time (not sure how long - could be months) the user
> > wants to get a copy of the data to verify or extend some previous
> > analysis.
> >
> > I'm sure someone will correct me if I have this wrong, but I *think*
> > most of the discussion so far has centered around the second of these.
> > I think the first one needs some discussion too though, doesn't it? It
> > feels more urgent to me.
> >
> > I think it needs reviewing in the light of replication, as when there
> > is replication a user *may* choose to go to a particular data node
> > because it suits them better - they may see faster downloads because of
> > the network route between them and the servers, or it might support some
> > data service they want.  Something in the system has to have the
> > responsibility of choosing which replica to download.  To be honest I
> > think this is best left to the user.  If this is the case then the
> > user has to have a good view of *where* the data is through their
> > interface.  Nate - I think your proposed implementation would expose
> > the information, no?
> >
> > The other issue for replication I can think of coming out of use case
> > 1 is versioning.  Data will be revised by data providers (we have
> > examples of this), so I think that the replication system has to keep
> > up and the interfaces have to be able to communicate this to the user.
> > A case I'm worried about is if a replica goes stale (sorry Estani, I
> > *think* we have examples of these, e.g.
> >
> > cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> > at DKRZ is stale - we've resubmitted as v20111102.  I only discovered
> > this yesterday, and understood what *might* be happening today; I can
> > send more examples if you need them).  I think a user needs to be able
> > to tell (without too much hard work) what really is a replica, and
> > what is a replica of a previous version.  Nate - can you cope with
> > this situation in your proposed implementation?
> >
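The stale-replica case described above can be stated as a simple comparison between the original's latest version and the version each replica holds. This is only an illustrative sketch: `classify_replicas`, the node names and the integer versions are hypothetical, and it assumes versions compare as numbers with larger meaning newer.

```python
from typing import Dict


def classify_replicas(latest_version: int,
                      replica_versions: Dict[str, int]) -> Dict[str, str]:
    """Label each replica node 'current' or 'stale' relative to the original.

    latest_version is the newest version published by the original node;
    replica_versions maps a replica node name to the version it holds.
    """
    return {
        node: "stale" if version < latest_version else "current"
        for node, version in replica_versions.items()
    }


# Hypothetical example: the original is at version 2, node-b still holds 1.
labels = classify_replicas(2, {"node-b": 1, "node-c": 2})
```

An interface could show such a label next to each replica so that a user can tell a true replica from a replica of a previous version without comparing version strings by hand.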
> > (Are there any issues around authorisation when it comes to replicas
> > - would a newly published version mean all replicas of previous versions
> > are no longer 'authorisable' against, or would stale replicas still be
> > available?)
> >
> > I know that replication has started to happen - but is this the right
> > thing now?  Is everything in place to do this in a way that is *safe*
> > and not going to confuse users?
> >
> > Jamie
> >
> > (I guess if you have a client that only uses the data nodes you know
> > what node you were talking to, and see the full version information
> > from the outset, so these aren't such big issues).
> >
> > ps yes I know I still have to answer some questions from Sebastien
> > about our client.
> >
> >> -----Original Message-----
> >> From: go-essp-tech-bounces at ucar.edu
> >> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
> >> martin.juckes at stfc.ac.uk
> >> Sent: 10 November 2011 16:26
> >> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu;
> >> esg-gateway-dev at earthsystemgrid.org
> >> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>
> >> Hi Estani,
> >>
> >> You missed the start -- the bit which is not achievable is
> >> publishing a replica to the same gateway used for the
> >> original publication of that data, e.g. IPSL data published to BADC.
> >>
> >> Cheers,
> >> Martin
> >>
> >> > >-----Original Message-----
> >> > >From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
> >> > >bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >> > >Sent: 10 November 2011 16:20
> >> > >To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >> > >Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >> > >
> >> > >Hi,
> >> > >
> >> > >This analogy seems perfect. Now, regarding the options we have at
> >> > >the moment:
> >> > >1) How "unique" is the dataset id in a Gateway? Federation-wide,
> >> > >local to the Gateway, or unique within a project?
> >> > >2) Depending on 1), the procedure could involve "moving" the
> >> > >published data to some other project, Gateway, Federation :-)
> >> > >
> >> > >I think this could be achievable:
> >> > >1) Data gets replicated to some Gateway (redundancy enforced).
> >> > >2) The originating Gateway, if it's also replicating, should
> >> > >replicate (just data, no publication yet) from the QC-checked replica.
> >> > >3) The "pre-print" gets removed (which means either moving it to a
> >> > >different project, Gateway, etc. or really deleting it completely
> >> > >from the Gateway).
> >> > >4) The replica gets published.
> >> > >
> >> > >I might be omitting something, but it seems achievable right now.
> >> > >
> >> > >My 2c,
> >> > >Estani
> >> > >
> >> > >Am 10.11.2011 07:15, schrieb Bryan Lawrence:
> >> > >> Martin has been quite vociferous (quite rightly) in personal email
> >> > >> to me that as far as QC goes, the dataset which gets through QC2
> >> > >> will *not* be the original dataset - we have no control over the
> >> > >> original dataset's permanence and/or immutability.
> >> > >>
> >> > >> This raises some interesting issues about the role of ESGF ... and
> >> > >> its interaction with the data owner and the publication process,
> >> > >> which is governed by DKRZ as the Publisher (and in the future
> >> > >> probably multiple publication processes and multiple Publishers).
> >> > >> The correct analogy here, as I said in an earlier email today, is
> >> > >> to consider the original dataset as a preprint of a Published
> >> > >> dataset (at QC level 3).
> >> > >>
> >> > >> Incidentally, this distinction might offer us a possible (distinct)
> >> > >> future for two different types of gateways into ESGF: the Published
> >> > >> datasets view (which makes pre-eminent the QC'd copy) and the
> >> > >> published view (which makes pre-eminent whatever someone sticks on
> >> > >> a data node).
> >> > >>
> >> > >> But meanwhile, I think we can live with what you proposed, as long
> >> > >> as the QC status of the replicas is clearly visible - and the DOI
> >> > >> points to a landing page that somehow prioritises those versions,
> >> > >> which would be trivial if your page was organised in the same way
> >> > >> (prioritising the replicas of QC level 3, then replicas of QC
> >> > >> level 2, and then originals).
> >> > >>
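The landing-page ordering suggested above (replicas at QC level 3 first, then replicas at QC level 2, then originals) amounts to sorting copies into three tiers. The sketch below is an illustration only: the dictionary keys `node`, `is_replica` and `qc_level`, the function name and the sample data are hypothetical, and any copy not named in the suggestion (for example a replica below QC level 2) is assumed to fall into the last tier alongside the originals.

```python
from typing import Dict, List


def landing_page_order(copies: List[Dict]) -> List[Dict]:
    """Order copies for display: QC3 replicas, QC2 replicas, everything else."""
    def tier(copy: Dict) -> int:
        if copy["is_replica"] and copy["qc_level"] == 3:
            return 0
        if copy["is_replica"] and copy["qc_level"] == 2:
            return 1
        return 2  # originals, and anything the suggestion does not mention

    # sorted() is stable, so copies within a tier keep their input order.
    return sorted(copies, key=tier)


# Hypothetical copies of one dataset version.
copies = [
    {"node": "node-a", "is_replica": False, "qc_level": 1},
    {"node": "node-b", "is_replica": True, "qc_level": 2},
    {"node": "node-c", "is_replica": True, "qc_level": 3},
]
ordered = landing_page_order(copies)
```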
> >> > >> Cheers
> >> > >> Bryan
> >> > >>
> >> > >>
> >> > >>> Hi Stephen,
> >> > >>>
> >> > >>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> >> > >>>> Hi Eric,
> >> > >>>>
> >> > >>>> Replicas are beginning to show up in CMIP5 and this is exposing
> >> > >>>> some gaps in what Gateway 1.x can do. I know you are reimplementing
> >> > >>>> replica support in Gateway 2.0 so I'd like to raise these issues now.
> >> > >>>>
> >> > >>>> We need to be able to publish a replica to the same Gateway that
> >> > >>>> hosts the original. I can't imagine this being possible with
> >> > >>>> Gateway 1.x since the URL http://<GATEWAY>/dataset/<dataset-id>.html
> >> > >>>> only points to one dataset on that Gateway. Either that page needs
> >> > >>>> to link to the original and all replicas for that dataset or we
> >> > >>>> need separate URLs for each replica/original, or both.
> >> > >>> The current direction for the implementation would be to have one
> >> > >>> page for the original dataset and have that page list where
> >> > >>> replicas are located.
> >> > >>>
> >> > >>> If there are use cases for the other options we should get those
> >> > >>> identified.
> >> > >>>
> >> > >>> Thanks!
> >> > >>> -Nate
> >> > >>>
> >> > >>>
> >> > >>>> Is this part of your design for Gateway 2.0's replica support?
> >> > >>>>
> >> > >>>> Thanks,
> >> > >>>>
> >> > >>>> Stephen.
> >> > >>>>
> >> > >>>> ---
> >> > >>>>
> >> > >>>> Stephen Pascoe +44 (0)1235 445980
> >> > >>>>
> >> > >>>> Centre of Environmental Data Archival
> >> > >>>>
> >> > >>>> STFC Rutherford Appleton Laboratory, Harwell Oxford,
> >> > >>>> Didcot OX11 0QX, UK
> >> > >>>>
> >> > >>>>
> >> > >>>> _______________________________________________
> >> > >>>> GO-ESSP-TECH mailing list
> >> > >>>> GO-ESSP-TECH at ucar.edu
> >> > >>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >> > >>>
> >> > >> --
> >> > >> Bryan Lawrence
> >> > >> University of Reading: Professor of Weather and Climate Computing.
> >> > >> National Centre for Atmospheric Science: Director of Models and Data.
> >> > >> STFC: Director of the Centre for Environmental Data Archival.
> >> > >> Ph: +44 118 3786507 or 1235 445012; Web: home.badc.rl.ac.uk/lawrence
> >> > >
> >> > >
> >> > >--
> >> > >Estanislao Gonzalez
> >> > >
> >> > >Max-Planck-Institut für Meteorologie (MPI-M)
> >> > >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >> > >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >> > >
> >> > >Phone:   +49 (40) 46 00 94-126
> >> > >E-Mail:  gonzalez at dkrz.de
> >> > >
> >>
> 
> 

--
Bryan Lawrence
University of Reading:  Professor of Weather and Climate Computing.
National Centre for Atmospheric Science: Director of Models and Data. 
STFC: Director of the Centre for Environmental Data Archival.
Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence

