[Go-essp-tech] [esg-gateway-dev] Replica support in Gateway 2.0

Mon Nov 14 09:53:05 MST 2011

Hello Eric,

Thanks for this.  I think there is a third bullet point to add to your 'notable things'. It is implied by much of the discussion, but may need making explicit:

* The version a user is downloading needs to be made clear to them (I'm not sure it is at the moment, and it has certainly caused users here a lot of issues).

I think there are different ways of exposing this, and I don't know which is optimal. How hard would it be for the gateway to expose the version number? (Are there reasons why this is a bad idea?)
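
As a rough illustration of what "exposing the version" could mean, here is a minimal sketch that pulls the version out of a dataset id so an interface can display it next to the dataset. It assumes the version is the last dotted component of the id and has the form "v" plus an 8-digit date, as in the v20111102 example further down this thread. The names split_version, label and DatasetRef are made up for this sketch and are not part of any Gateway API.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the version is the last dotted component of the dataset id
# and looks like "v" followed by an 8-digit date (e.g. v20111102).
_VERSION = re.compile(r"^v(\d{8})$")


class DatasetRef(NamedTuple):
    base_id: str            # dataset id without the version component
    version: Optional[str]  # e.g. "v20111102", or None if not present


def split_version(dataset_id: str) -> DatasetRef:
    """Split a dotted dataset id into (id without version, version)."""
    base, _, last = dataset_id.rpartition(".")
    if base and _VERSION.match(last):
        return DatasetRef(base, last)
    # No recognisable version component: return the id unchanged.
    return DatasetRef(dataset_id, None)


def label(dataset_id: str) -> str:
    """Build a display string that always states the version explicitly."""
    ref = split_version(dataset_id)
    return f"{ref.base_id} (version {ref.version or 'unknown'})"
```

For example, label("cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v20111102") gives "cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1 (version v20111102)". Showing "unknown" rather than silently omitting the version is a deliberate choice here, so that a missing version is visible to the user.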

Jamie

> -----Original Message-----
> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org 
> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org] 
> On Behalf Of Eric Nienhouse
> Sent: 11 November 2011 14:41
> To: Bryan Lawrence
> Cc: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> Subject: Re: [esg-gateway-dev] [Go-essp-tech] Replica support 
> in Gateway 2.0
> 
> Hi All,
> 
> There is much to discuss here and I'd like to identify two 
> notable things that have been raised in the thread:
> 
> * Stale replica datasets may appear in the search results UI, 
> confusing users (as Martin describes it, a "file share" issue).
> 
> * QC/DOI applied "reference" datasets must be labeled as such 
> and *may* be fundamentally different than the original 
> dataset supplied by the data producers.  This is represented 
> as a "reference archive" issue.
> 
> We've often been using the term "replication" to describe 
> many aspects of this data management problem.
> 
> We're working on a design to address the first issue above in 
> the Gateway UI, namely directing users to the latest version 
> of a dataset and hiding, or making it difficult to access, 
> older/superseded replicated versions.  
> The feedback and suggestions offered in this thread are very 
> useful here.  There is more to shine light on in this area, 
> including end user notifications regarding downloads of stale 
> datasets as well as administrative notification of new 
> dataset versions requiring replication.  These are likely 
> lower priority than making end user data access clear, though 
> they will help in scaling the system in the future.
> 
> There may be more at play regarding the QC/DOI assigned 
> "reference archive" datasets.  I don't want to introduce 
> undue complexity at this point and feel we should strive to 
> keep things as simple as we can.  
> However, a case can be made to identify a QC'ed dataset as a 
> separate "first class" entity with specific QC and DOI 
> related attributes that has been "derived from" the original, 
> and perhaps less controlled, dataset.  The QC'ed dataset, 
> though related to the original, lives a life of its own and 
> is replicated as such, forming the reference archive version 
> of this particular dataset.  This would make "pre-print" data 
> distinctly different than a "published QC3" dataset.
> 
> I believe we need to pursue the latter issue in further 
> detail.  Bryan and Martin have advocated this model which I 
> believe we need in some form.  I'd like to get this more 
> formalized quickly.  It is notable that we all agree that 
> undue dataset copies and transfers are not desirable.
> 
> What do you think about this approach?
> 
> Thanks and regards,
> 
> -Eric
> 
> 
> 
> 
> Bryan Lawrence wrote:
> > ... and having replied to the first part of Jamie's message, and
> > ignored the second ... clearly that was a mistake (on my behalf and
> > the gateway) ... I agree that this is a major issue ... thanks Jamie!
> >
> > Bryan
> >
> >   
> >> Hi Jamie,
> >>
> >> I wasn't aware of the "stale" datasets. Well, that is a *major*
> >> Gateway bug, because it should already know about a newer version,
> >> so as it is designed right now it "should" point just to the
> >> original because there's no replica for this.
> >> [What now happens is that if you select BADC as the Gateway to
> >> download to, you end up with the new version, although DKRZ said
> >> the version was the older one. - It looks like a major bug to me]
> >>
> >> The fact that the new version does not get replicated immediately
> >> is not a problem but a particularity of any distributed system.
> >> It's impossible to have a replica of the whole archive synchronized
> >> to the minute. What we can do is "hide" things that are outdated.
> >> Basically:
> >> 1) If the user searches for an id, that resolves to the latest
> >> version. (If no version is given, we might safely presume the
> >> latest is requested)
> >> 2) Only replicas of that version should be displayed if available
> >> (this should be already possible)
> >> 3) If the user searches for a particular version, she/he should be
> >> getting exactly this (feature not implemented)
> >> 4) Same as 2, display only known replicas of that version.
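
The four rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the Gateway is implemented: the Copy record, its fields and the resolve function are hypothetical, and "latest" is taken to be the largest version string, which only works if every version has the same "vYYYYMMDD" shape.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Copy:
    """One published copy of a dataset version (hypothetical record)."""
    dataset_id: str   # dataset id without the version component
    version: str      # assumed form "vYYYYMMDD"
    node: str         # site hosting this copy
    is_replica: bool  # False for the original, True for a replica


def resolve(catalog: List[Copy], dataset_id: str,
            version: Optional[str] = None) -> List[Copy]:
    """Return the copies a search for dataset_id should display."""
    copies = [c for c in catalog if c.dataset_id == dataset_id]
    if not copies:
        return []
    # Rule 1: no version given, so resolve to the latest one.
    # Rule 3: an explicit version is honoured exactly.
    # Assumption: same-length "vYYYYMMDD" strings sort chronologically.
    wanted = version or max(c.version for c in copies)
    # Rules 2 and 4: show only copies of that version; stale copies
    # of other versions stay hidden.
    return [c for c in copies if c.version == wanted]
```

With a catalog holding an original at version v20111102 on one site and a replica of v20110101 on another (made-up values), resolve(catalog, id) returns only the original, while resolve(catalog, id, "v20110101") returns only the older replica. Asking for a version nobody holds returns an empty list rather than falling back to the latest.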
> >>
> >> I don't think this is complicated. And regarding the "few years
> >> [from now]" I don't think that's the case, and not only because of
> >> bandwidth (which IS an issue already, as well as server load). I'm
> >> pretty sure papers are getting written "as we speak", even before a
> >> DOI gets out there. So these people "need" to cite something,
> >> right? The only thing they can hold on to is a bunch of URLs AND
> >> checksums. It's the only thing you can "cite" at the moment;
> >> without this no papers could be written, and not only in the CMIP5
> >> context (and thus a large AR5 share).
> >> That's the reason the archives are there. You can cite what you
> >> find at DKRZ, or BADC, or PCMDI. What's in there is guaranteed to
> >> remain exactly where it was found for the next 10+ years; other
> >> institutions do not have this commitment (nor can they).
> >>
> >> Again 2c. Thanks,
> >> Estani
> >>
> >> On 10.11.2011 09:14, Kettleborough, Jamie wrote:
> >>     
> >>> Hello,
> >>>
> >>> I know this thread has moved on, but can we rewind just a bit.  I 
> >>> think Nathan asked for use cases around this issue.
> >>>
> >>> As far as I'm aware there are two main user use cases here (there 
> >>> may be others obviously).
> >>>
> >>> 1. User wants to get data *now* (even if it's only 'preprint' in 
> >>> Bryan's language) in a way that suits their particular needs 
> >>> (functional and non-functional)
> >>>
> >>> 2. In a few years (not sure how long - could be months) time user 
> >>> wants to get a copy of the data to verify or extend some previous 
> >>> analysis.
> >>>
> >>> I'm sure someone will correct me if I have this wrong, but I
> >>> *think* most of the discussion so far has centered around the
> >>> second of these.
> >>> I think the first one needs some discussion too though, doesn't
> >>> it? It feels more urgent to me.
> >>>
> >>> I think it needs reviewing in the light of replication, as when
> >>> there is replication a user *may* choose to go to a particular
> >>> data-node as it suits them better - they may see faster downloads
> >>> because of the network route between them and the servers, or it
> >>> might support some data service they want.  Something in the
> >>> system has to have the responsibility of choosing which replica to
> >>> download.  To be honest I think this is best left to the user.  If
> >>> this is the case then the user has to have a good view of *where*
> >>> the data is through their interface.  Nate - I think your proposed
> >>> implementation would expose the information, no?
> >>>
> >>> The other issue for replication I can think of coming out of use
> >>> case 1 is versioning.  Data will be revised by data providers (we
> >>> have examples of this), so I think that the replication system has
> >>> to keep up and the interfaces have to be able to communicate this
> >>> to the user.
> >>> A case I'm worried about is if a replica goes stale (sorry Estani,
> >>> I *think* we have examples of these, e.g.
> >>>
> >>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
> >>> at DKRZ is stale - we've resubmitted as v20111102.  I only
> >>> discovered this yesterday, and understood what *might* be
> >>> happening today; I can send more examples if you need them).  I
> >>> think a user needs to be able to tell (without too much hard work)
> >>> what really is a replica, and what is a replica of a previous
> >>> version.  Nate - can you cope with this situation in your proposed
> >>> implementation?
> >>>
> >>> (Are there any issues around authorisation when it comes to
> >>> replicas - would a new published version mean all replicas of
> >>> previous versions are no longer 'authorisable' against, or would
> >>> stale replicas be available?)
> >>>
> >>> I know that replication has started to happen - but is this the
> >>> right thing now?  Is everything in place to do this in a way that
> >>> is *safe* and not going to confuse users?
> >>>
> >>> Jamie
> >>>
> >>> (I guess if you have a client that only uses the data nodes you
> >>> know what node you were talking to, and see the full version
> >>> information from the outset, so these aren't such big issues).
> >>>
> >>> ps yes I know I still have to answer some questions from Sebastien
> >>> about our client.
> >>>
> >>>       
> >>>> -----Original Message-----
> >>>> From: go-essp-tech-bounces at ucar.edu 
> >>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of 
> >>>> martin.juckes at stfc.ac.uk
> >>>> Sent: 10 November 2011 16:26
> >>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu; 
> >>>> esg-gateway-dev at earthsystemgrid.org
> >>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>
> >>>> Hi Estani,
> >>>>
> >>>> You missed the start -- the bit which is not achievable is 
> >>>> publishing a replica to the same gateway used for the original 
> >>>> publication of that data. E.g. IPSL data published to BADC.
> >>>>
> >>>> Cheers,
> >>>> Martin
> >>>>
> >>>>         
> >>>>>> -----Original Message-----
> >>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech- 
> >>>>>> bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
> >>>>>> Sent: 10 November 2011 16:20
> >>>>>> To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
> >>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> this analogy seems perfect. Now, regarding the options we have
> >>>>>> at the moment:
> >>>>>> 1) How "unique" is the dataset id in a Gateway? Federation wide,
> >>>>>> local Gateway or project unique?
> >>>>>> 2) Depending on 1), the procedure could involve "moving" the
> >>>>>> published data to some other project, Gateway, Federation :-)
> >>>>>>
> >>>>>> I think this could be achievable:
> >>>>>> 1) Data gets replicated to some Gateway (redundancy enforced)
> >>>>>> 2) The originating Gateway, if it's also replicating, should
> >>>>>> replicate (just data, no publication yet) from the QC checked
> >>>>>> replica.
> >>>>>> 3) The "pre-print" gets removed (which either means moving it to
> >>>>>> a different project, Gateway, etc. or really completely deleting
> >>>>>> it from the Gateway)
> >>>>>> 4) The replica gets published.
> >>>>>>
> >>>>>> I might be omitting something, but it seems achievable right now.
> >>>>>>
> >>>>>> My 2c,
> >>>>>> Estani
> >>>>>>
> >>>>>> On 10.11.2011 07:15, Bryan Lawrence wrote:
> >>>>>>> Martin has been quite vociferous (quite rightly) in personal
> >>>>>>> email to me that as far as QC goes, the dataset which gets
> >>>>>>> through QC2 will *not* be the original dataset - we have no
> >>>>>>> control over the original dataset's permanence and/or
> >>>>>>> immutability.
> >>>>>>>
> >>>>>>> This raises some interesting issues about the role of ESGF ...
> >>>>>>> and its interaction with the data owner and the publication
> >>>>>>> process which is governed by DKRZ as the Publisher (and in the
> >>>>>>> future probably multiple publication processes and multiple
> >>>>>>> Publishers). The correct analogy here, as I said in an earlier
> >>>>>>> email today, is to consider the original dataset as a preprint
> >>>>>>> of a Published dataset (at QC level 3).
> >>>>>>>
> >>>>>>> Incidentally, this distinction might offer us a possible
> >>>>>>> (distinct) future for two different types of gateways into
> >>>>>>> ESGF: the Published datasets view (which makes pre-eminent the
> >>>>>>> QC'd copy) and the published view (which makes pre-eminent
> >>>>>>> whatever someone sticks on a data node).
> >>>>>>>
> >>>>>>> But meanwhile, I think we can live with what you proposed, as
> >>>>>>> long as the QC status of the replicas is clearly visible - and
> >>>>>>> the DOI points to a landing page that somehow prioritises those
> >>>>>>> versions, which would be trivial if your page was organised in
> >>>>>>> the same way (prioritising the replicants of QC level 3, then
> >>>>>>> replicants of QC level 2, and then originals).
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>> Bryan
> >>>>>>>
> >>>>>>>
> >>>>>>>               
> >>>>>>>> Hi Stephen,
> >>>>>>>>
> >>>>>>>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
> >>>>>>>>
> >>>>>>>>> Hi Eric,
> >>>>>>>>>
> >>>>>>>>> Replicas are beginning to show up in CMIP5 and this is exposing
> >>>>>>>>> some gaps in what Gateway 1.x can do. I know you are
> >>>>>>>>> reimplementing replica support in Gateway 2.0 so I'd like to
> >>>>>>>>> raise these issues now.
> >>>>>>>>>
> >>>>>>>>> We need to be able to publish a replica to the same Gateway that
> >>>>>>>>> hosts the original. I can't imagine this being possible with
> >>>>>>>>> Gateway 1.x since the URL
> >>>>>>>>> http://<GATEWAY>/dataset/<dataset-id>.html only points to one
> >>>>>>>>> dataset on that Gateway. Either that page needs to link to the
> >>>>>>>>> original and all replicas for that dataset or we need separate
> >>>>>>>>> URLs for each replica/original, or both.
> >>>>>>>>
> >>>>>>>> The current direction for the implementation would be to have a
> >>>>>>>> single page for the original dataset and have that page list
> >>>>>>>> where replicas are located.
> >>>>>>>>
> >>>>>>>> If there are use cases for the other options we should get those
> >>>>>>>> identified.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -Nate
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Is this part of your design for Gateway 2.0's replica support?
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Stephen.
> >>>>>>>>>
> >>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>> Stephen Pascoe +44 (0)1235 445980
> >>>>>>>>>
> >>>>>>>>> Centre of Environmental Data Archival
> >>>>>>>>>
> >>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> GO-ESSP-TECH mailing list
> >>>>>>>>> GO-ESSP-TECH at ucar.edu
> >>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>>>>>> --
> >>>>>>> Bryan Lawrence
> >>>>>>> University of Reading:  Professor of Weather and Climate Computing.
> >>>>>>> National Centre for Atmospheric Science: Director of Models and Data.
> >>>>>>> STFC: Director of the Centre for Environmental Data Archival.
> >>>>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
> >>>>>> --
> >>>>>> Estanislao Gonzalez
> >>>>>>
> >>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
> >>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
> >>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
> >>>>>>
> >>>>>> Phone:   +49 (40) 46 00 94-126
> >>>>>> E-Mail:  gonzalez at dkrz.de
> >>>>>>
> >
> > --
> > Bryan Lawrence
> > University of Reading:  Professor of Weather and Climate Computing.
> > National Centre for Atmospheric Science: Director of Models and Data.
> > STFC: Director of the Centre for Environmental Data Archival.
> > Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence 
> 
> _______________________________________________
> esg-gateway-dev mailing list
> esg-gateway-dev at mailman.earthsystemgrid.org
> http://mailman.earthsystemgrid.org/mailman/listinfo/esg-gateway-dev
>