[Go-essp-tech] [esg-gateway-dev] Replica support in Gateway 2.0
Kettleborough, Jamie
jamie.kettleborough at metoffice.gov.uk
Mon Nov 14 09:53:05 MST 2011
Hello Eric,
Thanks for this. I think there is a third bullet point to add to your 'notable things'. It is implied by much of the discussion, but maybe needs making explicit:
* The version a user is downloading needs to be made clear to the users (I'm not sure it is at the moment - it certainly is something that has caused users here a lot of issues).
I think there are different ways of exposing this, and I don't know what is optimal... How hard would it be for the gateway to expose the version number? (are there reasons why this is a bad idea?)
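To make the question concrete, here is a minimal sketch in Python of what "exposing the version" could look like. It is purely illustrative and is not Gateway code. It assumes the version is the trailing `.vYYYYMMDD` part of the dataset id, as in the CMIP5 ids quoted further down this thread. The names `Replica`, `split_version` and `resolve` are hypothetical, and the older version `v20110101` in the example is made up.

```python
import re
from dataclasses import dataclass

# Assumption: a versioned dataset id ends in ".vYYYYMMDD".
_VERSIONED_ID = re.compile(r"^(?P<base>.+)\.v(?P<version>\d{8})$")


@dataclass(frozen=True)
class Replica:
    node: str        # site holding the copy, e.g. "BADC" or "DKRZ"
    dataset_id: str  # full id, including the trailing version


def split_version(dataset_id):
    """Return (base_id, version); version is None if the id carries none."""
    match = _VERSIONED_ID.match(dataset_id)
    if match is None:
        return dataset_id, None
    return match.group("base"), int(match.group("version"))


def resolve(query, replicas):
    """Return (version, copies, is_latest) for a search query.

    A query without a version resolves to the latest known version;
    a query with a version resolves to exactly that version. Only
    copies of the resolved version are returned, and is_latest tells
    the user whether that version has been superseded.
    """
    base, wanted = split_version(query)
    known = []
    for replica in replicas:
        r_base, r_version = split_version(replica.dataset_id)
        if r_base == base and r_version is not None:
            known.append((r_version, replica))
    if not known:
        return None, [], False
    latest = max(version for version, _ in known)
    target = latest if wanted is None else wanted
    copies = [replica for version, replica in known if version == target]
    return target, copies, target == latest


# Example with one up-to-date copy and one stale copy (hypothetical data).
BASE = "cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1"
copies_known = [
    Replica("BADC", BASE + ".v20111102"),
    Replica("DKRZ", BASE + ".v20110101"),  # made-up older version
]
version, copies, is_latest = resolve(BASE, copies_known)
print(version, [c.node for c in copies], is_latest)
```

With something like this, a user who asks for the unversioned id is only shown copies of the latest version, and a user who asks for an older version explicitly gets it together with a flag saying it has been superseded.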
Jamie
> -----Original Message-----
> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org
> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org]
> On Behalf Of Eric Nienhouse
> Sent: 11 November 2011 14:41
> To: Bryan Lawrence
> Cc: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Subject: Re: [esg-gateway-dev] [Go-essp-tech] Replica support
> in Gateway 2.0
>
> Hi All,
>
> There is much to discuss here and I'd like to identify two
> notable things that have been raised in the thread:
>
> * Stale replica datasets may appear in search results UI
> confusing users (as Martin describes this, a "file share" issue.)
>
> * QC/DOI applied "reference" datasets must be labeled as such
> and *may* be fundamentally different than the original
> dataset supplied by the data producers. This is represented
> as a "reference archive" issue.
>
> We've often been using the term "replication" to describe
> many aspects of this data management problem.
>
> We're working on a design to address the first issue above in
> the Gateway UI, namely directing users to the latest version
> of a dataset and hiding (or making it difficult to access)
> older/superseded replicated versions.
> The feedback and suggestions offered in this thread are very
> useful here. There is more to shine light on in this area,
> including end user notifications regarding downloads of stale
> datasets as well as administrative notification of new
> dataset versions requiring replication. These are likely
> lower priority than making end user data access clear, though
> they will help in scaling the system in the future.
>
> There may be more at play regarding the QC/DOI assigned
> "reference archive" datasets. I don't want to introduce
> undue complexity at this point and feel we should strive to
> keep things as simple as we can.
> However, a case can be made to identify a QC'ed dataset as a
> separate "first class" entity with specific QC and DOI
> related attributes that has been "derived from" the original,
> and perhaps less controlled, dataset. The QC'ed dataset,
> though related to the original, lives a life of its own and
> is replicated as such, forming the reference archive version
> of this particular dataset. This would make "pre-print" data
> distinctly different than a "published QC3" dataset.
>
> I believe we need to pursue the latter issue in further
> detail. Bryan and Martin have advocated this model which I
> believe we need in some form. I'd like to get this more
> formalized quickly. It is notable that we all agree that
> undue dataset copies and transfers are not desirable.
>
> What do you think about this approach?
>
> Thanks and regards,
>
> -Eric
>
>
>
>
> Bryan Lawrence wrote:
> > ... and having replied to the first part of Jamie's message, and
> > ignored the second ... clearly that was a mistake (on my behalf and
> > the gateway) ... I agree that this is a major issue ... thanks Jamie!
> >
> > Bryan
> >
> >
> >> Hi Jamie,
> >>
> >> I wasn't aware of the "stale" datasets. Well, that is a *major*
> >> Gateway bug, because the Gateway should already know about a newer
> >> version, so as it is designed right now it "should" point just to
> >> the original because there's no replica for this.
> >> [What now happens is that if you select BADC as the Gateway to
> >> download to, you end up with the new version, although at DKRZ it
> >> said the version was the older one. - It looks like a major bug to me]
> >>
> >> The fact that the new version does not get immediately replicated
> >> is not a problem but a particularity of any distributed system.
> >> It's impossible to have a replica of the whole archive synchronized
> >> to the minute. What we can do is "hide" things that are outdated.
> >> Basically:
> >> 1) If the user searches for an id, that resolves to the latest version.
> >> (If no version is given, we might safely presume the latest is
> >> requested)
> >> 2) Only replicas of that version should be displayed if available
> >> (this should be already possible)
> >> 3) If the user searches for a particular version, she/he should be
> >> getting exactly this (feature not implemented)
> >> 4) Same as 2, display only known replicas of that version.
> >>
> >> I don't think this is complicated. And regarding the "few years [from
> >> now]" I don't think that's the case, and not only because of
> >> bandwidth (which IS an issue already, as well as server load). I'm
> >> pretty sure papers are getting written "as we speak", even before a
> >> DOI gets out there. So these people "need" to cite something, right?
> >> The only thing they can hold on to is a bunch of URLs AND checksums.
> >> It's the only thing you can "cite" at the moment; without this no
> >> papers could be written, and not only in the CMIP5 context (and thus
> >> a large AR5 share).
> >> That's the reason the archives are there. You can cite what you find
> >> at DKRZ, or BADC, or PCMDI. What's in there is guaranteed to remain
> >> exactly where it was found for the next 10+ years; other institutions
> >> do not have this commitment (nor can they).
> >>
> >> Again 2c. Thanks,
> >> Estani
> >>
> >> On 10.11.2011 09:14, Kettleborough, Jamie wrote:
> >>
> >>> Hello,
> >>>
> >>> I know this thread has moved on, but can we rewind just a bit. I
> >>> think Nathan asked for use cases around this issue.
> >>>
> >>> As far as I'm aware there are two main user use cases here (there
> >>> may be others obviously).
> >>>
> >>> 1. User wants to get data *now* (even if it's only 'preprint' in
> >>> Bryan's language) in a way that suits their particular needs
> >>> (functional and non-functional)
> >>>
> >>> 2. In a few years (not sure how long - could be months) time user
> >>> wants to get a copy of the data to verify or extend some previous
> >>> analysis.
> >>>
> >>> I'm sure someone will correct me if I have this wrong, but I *think*
> >>> most of the discussion so far has centered around the second of
> >>> these.
> >>> I think the first one needs some discussion too though, doesn't it?
> >>> It feels more urgent to me.
> >>>
> >>> I think it needs reviewing in the light of replication, as when there
> >>> is replication a user *may* choose to go to a particular data-node
> >>> because it suits them better - they may see faster downloads because
> >>> of the network route between them and the servers, or it might
> >>> support some data service they want. Something in the system has to
> >>> have the responsibility of choosing which replica to download. To be
> >>> honest I think this is best left to the user. If this is the case
> >>> then the user has to have a good view of *where* the data is through
> >>> their interface. Nate - I think your proposed implementation would
> >>> expose the information, no?
> >>>
> >>> The other issue for replication I can think of coming out of use
> >>> case 1 is versioning. Data will be revised by data providers (we have
> >>> examples of this), so I think that the replication system has to
> >>> keep up and the interfaces have to be able to communicate this to
> >>> the user.
> >>> A case I'm worried about is if a replica goes stale (sorry Estani,
> >>> I *think* we have examples of these, e.g.
> >>>
> >>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> >>>
> >>> at DKRZ is stale - we've resubmitted as v20111102. I only
> >>> discovered this yesterday, and understood what *might* be happening
> >>> today. I can send more examples if you need them). I think a user
> >>> needs to be able to tell (without too much hard work) what really is
> >>> a replica, and what is a replica of a previous version. Nate - can
> >>> you cope with this situation in your proposed implementation?
> >>>
> >>> (Are there any issues around authorisation when it comes to replicas
> >>> - would a new published version mean all replicas of previous
> >>> versions are no longer 'authorisable' against, or would stale
> >>> replicas be available?)
> >>>
> >>> I know that replication has started to happen - but is this the
> >>> right thing now? Is everything in place to do this in a way that is
> >>> *safe* and not going to confuse users?
> >>>
> >>> Jamie
> >>>
> >>> (I guess if you have a client that only uses the data nodes you know
> >>> what node you were talking to, and see the full version information
> >>> from the outset, so these aren't such big issues).
> >>>
> >>> ps yes I know I still have to answer some questions from Sebastien
> >>> about our client.
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: go-essp-tech-bounces at ucar.edu
> >>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
> >>>> martin.juckes at stfc.ac.uk
> >>>> Sent: 10 November 2011 16:26
> >>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu;
> >>>> esg-gateway-dev at earthsystemgrid.org
> >>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>
> >>>> Hi Estani,
> >>>>
> >>>> You missed the start -- the bit which is not achievable is
> >>>> publishing a replica to the same gateway used for the original
> >>>> publication of that data. E.g. IPSL data published to BADC,
> >>>>
> >>>> Cheers,
> >>>> Martin
> >>>>
> >>>>
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
> >>>>>> bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >>>>>> Sent: 10 November 2011 16:20
> >>>>>> To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> this analogy seems perfect. Now regarding the options we have at
> >>>>>> the moment:
> >>>>>> 1) How "unique" is the dataset id in a Gateway? Federation wide,
> >>>>>> local Gateway or project unique?
> >>>>>> 2) Depending on 1), the procedure could involve "moving" the
> >>>>>> published data to some other project, Gateway, Federation :-)
> >>>>>>
> >>>>>> I think this could be achievable:
> >>>>>> 1) Data gets replicated to some Gateway (redundancy enforced)
> >>>>>> 2) The originating Gateway, if it's also replicating, should replicate
> >>>>>> (just data, no publication yet) from the QC checked replica.
> >>>>>> 3) The "pre-print" gets removed (which either means moving it to a
> >>>>>> different project, Gateway, etc. or really completely deleting it
> >>>>>> from the Gateway)
> >>>>>> 4) The replica gets published.
> >>>>>>
> >>>>>> I might be omitting something, but it seems achievable right now.
> >>>>>>
> >>>>>> My 2c,
> >>>>>> Estani
> >>>>>>
> >>>>>> Am 10.11.2011 07:15, schrieb Bryan Lawrence:
> >>>>>>
> >>>>>>> Martin has been quite vociferous (quite rightly) in personal email
> >>>>>>> to me that as far as QC goes, the dataset which gets through QC2 will
> >>>>>>> *not* be the original dataset - we have no control over the original
> >>>>>>> dataset's permanence and/or immutability.
> >>>>>>>
> >>>>>>> This raises some interesting issues about the role of ESGF ... and
> >>>>>>> its interaction with the data owner and the publication process
> >>>>>>> which is governed by DKRZ as the Publisher (and in the future
> >>>>>>> probably multiple publication processes and multiple Publishers). The
> >>>>>>> correct analogy here, as I said on an earlier email today, is to
> >>>>>>> consider the original dataset as a preprint of a Published dataset
> >>>>>>> (at QC level 3).
> >>>>>>>
> >>>>>>> Incidentally, this distinction might offer us a possible (distinct)
> >>>>>>> future for two different types of gateways into ESGF: the Published
> >>>>>>> datasets view (which makes pre-eminent the QC'd copy) and the
> >>>>>>> published view (which makes pre-eminent whatever someone sticks on
> >>>>>>> a data node).
> >>>>>>>
> >>>>>>> But meanwhile, I think we can live with what you proposed, as long
> >>>>>>> as the QC status of the replicas is clearly visible - and the DOI
> >>>>>>> points to a landing page that somehow prioritises those versions,
> >>>>>>> which would be trivial if your page was organised in the same way
> >>>>>>> (prioritising the replicants of QC level 3, then replicants of QC
> >>>>>>> level 2, and then originals).
> >>>>>>
> >>>>>>> Cheers
> >>>>>>> Bryan
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi Stephen,
> >>>>>>>>
> >>>>>>>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> >>>>>>>>
> >>>>>>>>> Hi Eric,
> >>>>>>>>>
> >>>>>>>>> Replicas are beginning to show up in CMIP5 and this is exposing some
> >>>>>>>>> gaps in what Gateway 1.x can do. I know you are reimplementing replica
> >>>>>>>>> support in Gateway 2.0 so I'd like to raise these issues now.
> >>>>>>>>>
> >>>>>>>>> We need to be able to publish a replica to the same Gateway that hosts
> >>>>>>>>> the original. I can't imagine this being possible with Gateway 1.x since
> >>>>>>>>> the URL http://<GATEWAY>/dataset/<dataset-id>.html only points to one
> >>>>>>>>> dataset on that Gateway. Either that page needs to link to the original
> >>>>>>>>> and all replicas for that dataset or we need separate URLs for each
> >>>>>>>>> replica/original, or both.
> >>>>>>>>>
> >>>>>>>> The current direction for the implementation would be to have one page
> >>>>>>>> for the original dataset and have that page list where replicas
> >>>>>>>> are located.
> >>>>>>>>
> >>>>>>>> If there are use cases for the other options we should get those
> >>>>>>>> identified.
> >>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -Nate
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Is this part of your design for Gateway 2.0's replica support?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Stephen.
> >>>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> Stephen Pascoe +44 (0)1235 445980
> >>>>>>>>>
> >>>>>>>>> Centre of Environmental Data Archival
> >>>>>>>>>
> >>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> >>>>>>
> >>>>>>>>> --
> >>>>>>>>> Scanned by iCritical.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>>>>>
> >>>>>>> --
> >>>>>>> Bryan Lawrence
> >>>>>>> University of Reading: Professor of Weather and Climate Computing.
> >>>>>>> National Centre for Atmospheric Science: Director of Models and Data.
> >>>>>>> STFC: Director of the Centre for Environmental Data Archival.
> >>>>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> >>>>
> >>>>>> --
> >>>>>> Estanislao Gonzalez
> >>>>>>
> >>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
> >>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>>>>>
> >>>>>> Phone: +49 (40) 46 00 94-126
> >>>>>> E-Mail: gonzalez at dkrz.de
> >>>>>>
> >>
> >
>
> _______________________________________________
> esg-gateway-dev mailing list
> esg-gateway-dev at mailman.earthsystemgrid.org
> http://mailman.earthsystemgrid.org/mailman/listinfo/esg-gateway-dev
>