[Go-essp-tech] [esg-gateway-dev] Replica support in Gateway 2.0
Eric Nienhouse
ejn at ucar.edu
Mon Nov 14 15:53:13 MST 2011
Hi All,
Thanks Jamie. Another notable bullet from this discussion:
* Clear indication of the version of the dataset under download.
Tomorrow is an "on" week for our go-essp tech call. I suggest we
convene and discuss this area further and work to condense this thread
into actions. In particular clarification and prioritizing of the
following:
* QC L2 datasets as derived from the "pre-press" original.
* Clear indication of dataset version. (It would not be hard to expose
the dataset version in the Gateway.)
* Stale replicated datasets.
A related topic from our previous call agenda is:
* DRS conformance. Can Karl's scriptable download scenario be achieved?
Other input welcome!
Thanks,
-Eric
Kettleborough, Jamie wrote:
> Hello Eric,
>
> Thanks for this. I think there is a third bullet point to add to your 'notable things'. It is implied by much of the discussion, but maybe needs making explicit:
>
> * The version a user is downloading needs to be made clear to the users (I'm not sure it is at the moment - it certainly is something that has caused users here a lot of issues).
>
> I think there are different ways of exposing this, and I don't know what is optimal... How hard would it be for the gateway to expose the version number? (are there reasons why this is a bad idea?)
>
> Jamie
>
>
>> -----Original Message-----
>> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org
>> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org]
>> On Behalf Of Eric Nienhouse
>> Sent: 11 November 2011 14:41
>> To: Bryan Lawrence
>> Cc: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
>> Subject: Re: [esg-gateway-dev] [Go-essp-tech] Replica support
>> in Gateway 2.0
>>
>> Hi All,
>>
>> There is much to discuss here and I'd like to identify two
>> notable things that have been raised in the thread:
>>
>> * Stale replica datasets may appear in the search results UI,
>> confusing users (as Martin describes this, a "file share" issue).
>>
>> * QC/DOI applied "reference" datasets must be labeled as such
>> and *may* be fundamentally different than the original
>> dataset supplied by the data producers. This is represented
>> as a "reference archive" issue.
>>
>> We've often been using the term "replication" to describe
>> many aspects of this data management problem.
>>
>> We're working on a design to address the first issue above in
>> the Gateway UI, namely directing users to the latest version
>> of a dataset and hiding, or making it difficult to access,
>> older/superseded replicated versions.
>> The feedback and suggestions offered in this thread are very
>> useful here. There is more to shine light on in this area,
>> including end user notifications regarding downloads of stale
>> datasets as well as administrative notification of new
>> dataset versions requiring replication. These are likely
>> lower priority than making end user data access clear, though
>> they will help in scaling the system in the future.
>>
>> There may be more at play regarding the QC/DOI assigned
>> "reference archive" datasets. I don't want to introduce
>> undue complexity at this point and feel we should strive to
>> keep things as simple as we can.
>> However, a case can be made to identify a QC'ed dataset as a
>> separate "first class" entity with specific QC and DOI
>> related attributes that has been "derived from" the original,
>> and perhaps less controlled, dataset. The QC'ed dataset,
>> though related to the original, lives a life of its own and
>> is replicated as such, forming the reference archive version
>> of this particular dataset. This would make "pre-print" data
>> distinctly different than a "published QC3" dataset.
>>
>> I believe we need to pursue the latter issue in further
>> detail. Bryan and Martin have advocated this model which I
>> believe we need in some form. I'd like to get this more
>> formalized quickly. It is notable that we all agree that
>> undue dataset copies and transfers are not desirable.
>>
>> What do you think about this approach?
>>
>> Thanks and regards,
>>
>> -Eric
>>
>>
>>
>>
>> Bryan Lawrence wrote:
>>
>>> ... and having replied to the first part of Jamie's message, and
>>> ignored the second ... clearly that was a mistake (on my behalf and
>>> the gateway) ... I agree that this is a major issue ... thanks Jamie!
>>>
>>
>>> Bryan
>>>
>>>
>>>
>>>> Hi Jamie,
>>>>
>>>> I wasn't aware of the "stale" datasets. Well, that is a *major*
>>>> Gateway bug: the Gateway should already know about a newer version,
>>>> so as it is designed right now it "should" point just to the original
>>>> because there's no replica for this.
>>>> [What now happens is that if you select BADC as the Gateway to
>>>> download from, you end up with the new version, although DKRZ said
>>>> the version was the older one. It looks like a major bug to me.]
>>>>
>>>> The fact that the new version does not get replicated immediately
>>>> is not a problem but a particularity of any distributed system.
>>>> It's impossible to keep a replica of the whole archive synchronized
>>>> to the minute. What we can do is "hide" things that are outdated.
>>>> Basically:
>>>> 1) If the user searches for an id, that resolves to the latest
>>>> version. (If no version is given, we might safely presume the
>>>> latest is requested.)
>>>> 2) Only replicas of that version should be displayed, if available
>>>> (this should be already possible).
>>>> 3) If the user searches for a particular version, she/he should
>>>> get exactly this (feature not implemented).
>>>> 4) Same as 2: display only known replicas of that version.
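[Illustrative sketch, not Gateway code.] The four rules above can be read as one lookup function. This is a minimal sketch under stated assumptions: the `Copy` record, its field names and the in-memory `catalog` list are hypothetical, and versions are assumed to be of the form "v" followed by digits (as in v20111102), so they can be compared numerically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    dataset_id: str   # version-less dataset id (hypothetical field)
    version: str      # assumed to look like "v20111102"
    node: str         # where this copy is held
    is_replica: bool  # False for the data producer's original


def version_number(version):
    """Turn 'v20111102' into 20111102 so versions compare numerically."""
    return int(version.lstrip("v"))


def resolve(catalog, dataset_id, version=None):
    """Return (version, copies) for a search on dataset_id.

    Rules 1 and 2: with no version given, pick the latest version and
    return only copies of that version. Rules 3 and 4: with an explicit
    version, return exactly the copies of that version.
    """
    candidates = [c for c in catalog if c.dataset_id == dataset_id]
    if not candidates:
        return None, []
    if version is None:
        # Assumption: the originals define what "latest" means; fall back
        # to all known copies if no original is catalogued.
        originals = [c for c in candidates if not c.is_replica]
        pool = originals or candidates
        version = max((c.version for c in pool), key=version_number)
    return version, [c for c in candidates if c.version == version]
```

With this shape, a stale replica is simply never returned for a version-less search, because its version does not match the resolved one.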
>>>>
>>>> I don't think this is complicated. And regarding the "few years
>>>> [from now]", I don't think that's the case, and not only because of
>>>> bandwidth (which IS an issue already, as well as server load). I'm
>>>> pretty sure papers are getting written "as we speak", even before a
>>>> DOI gets out there. So these people "need" to cite something, right?
>>>> The only thing they can hold on to is a bunch of URLs AND checksums.
>>>> It's the only thing you can "cite" at the moment; without this no
>>>> papers could be written, and not only in the CMIP5 context (and thus
>>>> a large AR5 share).
>>>> That's the reason the archives are there. You can cite what you find
>>>> at DKRZ, or BADC, or PCMDI. What's in there is guaranteed to remain
>>>> exactly where it was found for the next 10+ years; other institutions
>>>> do not have this commitment (nor can they).
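[Illustrative sketch.] If a URL plus checksum is what gets cited, a reader can later check that a downloaded file is the cited one by recomputing the checksum. A minimal sketch, assuming the cited checksum is a hex digest; the function names are hypothetical and the hash algorithm is a parameter because the thread does not say which one the archives publish.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_citation(path, cited_checksum, algorithm="sha256"):
    """True if the local file has the checksum recorded in the citation."""
    return file_checksum(path, algorithm) == cited_checksum.strip().lower()
```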
>>>>
>>>> Again 2c. Thanks,
>>>> Estani
>>>>
>>>> On 10.11.2011 09:14, Kettleborough, Jamie wrote:
>>>>
>>>>
>>>>> Hello,
>>>>>
>>>>> I know this thread has moved on, but can we rewind just a bit? I
>>>>> think Nathan asked for use cases around this issue.
>>>>>
>>>>> As far as I'm aware there are two main user use cases here (there
>>>>> may be others obviously).
>>>>>
>>>>> 1. User wants to get data *now* (even if it's only 'preprint' in
>>>>> Bryan's language) in a way that suits their particular needs
>>>>> (functional and non-functional)
>>>>>
>>>>> 2. In a few years' time (not sure how long - could be months) user
>>>>> wants to get a copy of the data to verify or extend some previous
>>>>> analysis.
>>>>>
>>>>> I'm sure someone will correct me if I have this wrong, but I *think*
>>>>> most of the discussion so far has centered around the second of
>>>>> these.
>>>>> I think the first one needs some discussion too though, doesn't it?
>>>>> It feels more urgent to me.
>>>>>
>>>>> I think it needs reviewing in the light of replication, as when there
>>>>> is replication a user *may* choose to go to a particular data-node
>>>>> as it suits them better - they may see faster downloads because of
>>>>> the network route between them and the servers, or it might support
>>>>> some data service they want. Something in the system has to have
>>>>> the responsibility of choosing which replica to download. To be
>>>>> honest I think this is best left to the user. If this is the case
>>>>> then the user has to have a good view of *where* the data is through
>>>>> their interface. Nate - I think your proposed implementation would
>>>>> expose the information, no?
>>>>>
>>>>> The other issue for replication I can think of coming out of use
>>>>> case 1 is versioning. Data will be revised by data providers (we have
>>>>> examples of this), so I think that the replication system has to
>>>>> keep up and the interfaces have to be able to communicate this to
>>>>> the user.
>>>>> A case I'm worried about is if a replica goes stale (sorry Estani,
>>>>> I *think* we have examples of these, e.g.
>>>>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
>>>>> at DKRZ is stale - we've resubmitted as v20111102. I only
>>>>> discovered this yesterday, and understood what *might* be happening
>>>>> today; I can send more examples if you need them). I think a user
>>>>> needs to be able to tell (without too much hard work) what really is
>>>>> a replica, and what is a replica of a previous version. Nate - can
>>>>> you cope with this situation in your proposed implementation?
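[Illustrative sketch, not part of any proposed implementation.] Telling a true replica from a replica of a previous version comes down to comparing the version suffix of the replica's id with the newest version the data provider has published. A minimal sketch, assuming ids end in ".v" followed by digits; the function names and the (node, id) pair layout are hypothetical.

```python
import re

# Assumption: a versioned id is "<base>.v<digits>", e.g. "a.b.r1i1p1.v20111102".
_VERSIONED_ID = re.compile(r"^(?P<base>.+)\.v(?P<num>\d+)$")


def split_version(versioned_id):
    """Split a versioned id into (base id, numeric version)."""
    match = _VERSIONED_ID.match(versioned_id)
    if match is None:
        raise ValueError(f"no version suffix in {versioned_id!r}")
    return match.group("base"), int(match.group("num"))


def stale_replicas(original_ids, replicas):
    """Return the (node, id) replica pairs that are older than the latest original.

    original_ids: versioned ids published by the data providers.
    replicas: iterable of (node, versioned id) pairs held elsewhere.
    """
    latest = {}
    for versioned_id in original_ids:
        base, num = split_version(versioned_id)
        latest[base] = max(num, latest.get(base, num))
    stale = []
    for node, versioned_id in replicas:
        base, num = split_version(versioned_id)
        if base in latest and num < latest[base]:
            stale.append((node, versioned_id))
    return stale
```

A Gateway, or a client talking directly to data nodes, could run a check of this kind before listing download locations, so that a stale copy is either hidden or labelled as a previous version.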
>>>>>
>>>>> (Are there any issues around authorisation when it comes to replicas
>>>>> - would a new published version mean all replicas of previous
>>>>> versions are no longer 'authorisable' against, or would stale
>>>>> replicas be available?)
>>>>>
>>>>> I know that replication has started to happen - but is this the
>>>>> right thing now? Is everything in place to do this in a way that is
>>>>> *safe* and not going to confuse users?
>>>>>
>>>>> Jamie
>>>>>
>>>>> (I guess if you have a client that only uses the data nodes you know
>>>>> what node you were talking to, and see the full version information
>>>>> from the outset, so these aren't such big issues).
>>>>>
>>>>> ps yes I know I still have to answer some questions from Sebastien
>>>>> about our client.
>>>>>
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: go-essp-tech-bounces at ucar.edu
>>>>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
>>>>>> martin.juckes at stfc.ac.uk
>>>>>> Sent: 10 November 2011 16:26
>>>>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu;
>>>>>> esg-gateway-dev at earthsystemgrid.org
>>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
>>>>>>
>>>>>> Hi Estani,
>>>>>>
>>>>>> You missed the start -- the bit which is not achievable is
>>>>>> publishing a replica to the same gateway used for the original
>>>>>> publication of that data. E.g. IPSL data published to BADC.
>>>>>>
>>>>>> Cheers,
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>>>>>>> bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
>>>>>>>> Sent: 10 November 2011 16:20
>>>>>>>> To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
>>>>>>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> this analogy seems perfect. Now regarding the options we have at
>>>>>>>> the moment:
>>>>>>>> 1) How "unique" is the dataset id in a Gateway? Federation wide,
>>>>>>>> local Gateway or project unique?
>>>>>>>> 2) Depending on 1), the procedure could involve "moving" the
>>>>>>>> published data to some other project, Gateway, Federation :-)
>>>>>>>>
>>>>>>>> I think this could be achievable:
>>>>>>>> 1) Data gets replicated to some Gateway (redundancy enforced)
>>>>>>>> 2) The originating Gateway, if it's also replicating, should
>>>>>>>> replicate (just data, no publication yet) from the QC checked
>>>>>>>> replica.
>>>>>>>> 3) The "pre-print" gets removed (which either means move to a
>>>>>>>> different project, Gateway, etc. or really completely delete it
>>>>>>>> from the Gateway)
>>>>>>>> 4) The replica gets published.
>>>>>>>>
>>>>>>>> I might be omitting something, but it seems achievable right now.
>>
>>>>>>>> My 2c,
>>>>>>>> Estani
>>>>>>>>
>>>>>>>> On 10.11.2011 07:15, Bryan Lawrence wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Martin has been quite vociferous (quite rightly) in personal email
>>>>>>>>> to me that as far as QC goes, the dataset which gets through QC2
>>>>>>>>> will *not* be the original dataset - we have no control over the
>>>>>>>>> original dataset's permanence and/or immutability.
>>>>>>>>>
>>>>>>>>> This raises some interesting issues about the role of ESGF ... and
>>>>>>>>> its interaction with the data owner and the publication process
>>>>>>>>> which is governed by DKRZ as the Publisher (and in the future
>>>>>>>>> probably multiple publication processes and multiple Publishers).
>>>>>>>>> The correct analogy here, as I said in an earlier email today, is
>>>>>>>>> to consider the original dataset as a preprint of a Published
>>>>>>>>> dataset (at QC level 3).
>>>>>>>>>
>>>>>>>>> Incidentally, this distinction might offer us a possible (distinct)
>>>>>>>>> future for two different types of gateways into ESGF: the Published
>>>>>>>>> datasets view (which makes pre-eminent the QC'd copy) and the
>>>>>>>>> published view (which makes pre-eminent whatever someone sticks on
>>>>>>>>> a data node).
>>>>>>>>>
>>>>>>>>> But meanwhile, I think we can live with what you proposed, as long
>>>>>>>>> as the QC status of the replicas is clearly visible - and the DOI
>>>>>>>>> points to a landing page that somehow prioritises those versions,
>>>>>>>>> which would be trivial if your page was organised in the same way
>>>>>>>>> (prioritising the replicants of QC level 3, then replicants of QC
>>>>>>>>> level 2, and then originals).
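[Illustrative sketch.] The landing-page ordering described above (replicas at QC level 3 first, then replicas at QC level 2, then originals) is a sort with a three-step rank. A minimal sketch, assuming each copy is a dict with hypothetical keys "is_replica" and "qc_level"; where replicas at other QC levels should go is not stated in the thread, so placing them last is an assumption.

```python
def landing_page_rank(copy):
    """Rank a copy for display: lower numbers are listed first."""
    if copy["is_replica"] and copy["qc_level"] == 3:
        return 0  # replicas of QC level 3
    if copy["is_replica"] and copy["qc_level"] == 2:
        return 1  # replicas of QC level 2
    if not copy["is_replica"]:
        return 2  # originals
    return 3      # assumption: any other replica goes last


def landing_page_order(copies):
    """Order copies for a DOI landing page; ties keep their input order."""
    return sorted(copies, key=landing_page_rank)
```

Because Python's sort is stable, copies with the same rank stay in the order the catalogue supplied them, so the QC status remains the only thing that changes the listing.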
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Bryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi Stephen,
>>>>>>>>>>
>>>>>>>>>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi Eric,
>>>>>>>>>>>
>>>>>>>>>>> Replicas are beginning to show up in CMIP5 and this is exposing
>>>>>>>>>>> some gaps in what Gateway 1.x can do. I know you are
>>>>>>>>>>> reimplementing replica support in Gateway 2.0 so I'd like to
>>>>>>>>>>> raise these issues now.
>>>>>>>>>>>
>>>>>>>>>>> We need to be able to publish a replica to the same Gateway that
>>>>>>>>>>> hosts the original. I can't imagine this being possible with
>>>>>>>>>>> Gateway 1.x since the URL
>>>>>>>>>>> http://<GATEWAY>/dataset/<dataset-id>.html only points to one
>>>>>>>>>>> dataset on that Gateway. Either that page needs to link to the
>>>>>>>>>>> original and all replicas for that dataset or we need separate
>>>>>>>>>>> URLs for each replica/original, or both.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> The current direction for the implementation would be to have
>>>>>>>>>> one page for the original dataset and have that page list where
>>>>>>>>>> replicas are located.
>>>>>>>>>>
>>>>>>>>>> If there are use cases for the other options we should get those
>>>>>>>>>> identified.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> -Nate
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Is this part of your design for Gateway 2.0's replica support?
>>>>>>
>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Stephen.
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> Stephen Pascoe +44 (0)1235 445980
>>>>>>>>>>>
>>>>>>>>>>> Centre of Environmental Data Archival
>>>>>>>>>>>
>>>>>>>>>>> STFC Rutherford Appleton Laboratory, Harwell Oxford,
>>>>>>>>>>> Didcot OX11 0QX, UK
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Scanned by iCritical.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> GO-ESSP-TECH mailing list
>>>>>>>>>>> GO-ESSP-TECH at ucar.edu
>>>>>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Bryan Lawrence
>>>>>>>>> University of Reading: Professor of Weather and Climate Computing.
>>>>>>>>> National Centre for Atmospheric Science: Director of Models and Data.
>>>>>>>>> STFC: Director of the Centre for Environmental Data Archival.
>>>>>>>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Estanislao Gonzalez
>>>>>>>>
>>>>>>>> Max-Planck-Institut für Meteorologie (MPI-M)
>>>>>>>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>>>>>>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>>>>>>>
>>>>>>>> Phone: +49 (40) 46 00 94-126
>>>>>>>> E-Mail: gonzalez at dkrz.de
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>> --
>>> Bryan Lawrence
>>> University of Reading: Professor of Weather and Climate Computing.
>>> National Centre for Atmospheric Science: Director of Models and Data.
>>
>>> STFC: Director of the Centre for Environmental Data Archival.
>>> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>>
>>>
>> _______________________________________________
>> esg-gateway-dev mailing list
>> esg-gateway-dev at mailman.earthsystemgrid.org
>> http://mailman.earthsystemgrid.org/mailman/listinfo/esg-gateway-dev
>>
>>