[Go-essp-tech] [esg-gateway-dev] Replica support in Gateway 2.0

Fri Nov 11 12:57:06 MST 2011

Hi Stephen,

that's right, but that's just the "view" from BADC (which I failed to 
mention). So again, from both sides:
- From the BADC Gateway, the dataset 
cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1 shows 
version 20111112 and a replica at WDCC (which doesn't exist)
- Clicking on the replica sends the user to DKRZ's latest version, 
which is 20110327

- From the DKRZ Gateway, the dataset is shown in version 20110302, with 
the original at BADC (that's correct, but is not what we want to 
display)
- Clicking on the BADC link sends the user to BADC, landing on version 
20111112

That's it.
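
To make the mismatch concrete, here is a rough sketch of the check a Gateway could run before labelling a copy as a replica. The function name and labels are made up for illustration and are not part of any Gateway code; it only assumes versions are YYYYMMDD strings, as in the examples above.

```python
def describe_replica(original_version, replica_version):
    """Classify a copy relative to the latest version held by the original.

    Versions are assumed to be YYYYMMDD strings, so plain string
    comparison orders them chronologically.
    """
    if replica_version == original_version:
        return "replica"
    if replica_version < original_version:
        return "stale replica"
    # The copy claims a newer version than the original: metadata problem.
    return "inconsistent"


# Versions from the example above: original at BADC, copy at DKRZ.
print(describe_replica("20111112", "20110327"))  # stale replica
```

With a check like this, the DKRZ copy would be shown as a copy of a previous version instead of as a plain replica.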

BADC - 
http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.html
DKRZ - 
http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.html

 From those URLs you'll see that the problem lies in the Gateway 
dropping the version from the dataset IDs. They should have been:
BADC - 
http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v20111112.html
BADC - 
http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v20110327.html
DKRZ - 
http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v20110327.html

And the user should be forwarded to the latest version if none is 
specified. AFAIK this is work in progress for Gateway 2.0 (besides the 
fact that the URLs look different).
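
Roughly, the lookup I have in mind could be sketched like this. The helper names and the ".vYYYYMMDD" suffix convention are assumptions taken from the example URLs above, not the actual Gateway 2.0 design.

```python
import re

# Optional ".vYYYYMMDD" suffix at the end of a dataset id (assumed convention).
VERSION_SUFFIX = re.compile(r"^(?P<dataset>.+?)(?:\.v(?P<version>\d{8}))?$")


def split_dataset_id(dataset_id):
    """Split 'a.b.c.v20111112' into ('a.b.c', '20111112'); version may be None."""
    match = VERSION_SUFFIX.match(dataset_id)
    return match.group("dataset"), match.group("version")


def resolve_version(dataset_id, known_versions):
    """Return (dataset, version), defaulting to the latest known version."""
    dataset, version = split_dataset_id(dataset_id)
    if version is None:
        # No version requested: forward the user to the latest one.
        version = max(known_versions)
    elif version not in known_versions:
        raise LookupError(f"{dataset} has no version {version}")
    return dataset, version


base = "cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1"
print(resolve_version(base, ["20110327", "20111112"]))
print(resolve_version(base + ".v20110327", ["20110327", "20111112"]))
```

The first call lands on 20111112 because no version was given; the second returns exactly the requested 20110327.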

thanks,
Estani

On 11.11.2011 07:36, stephen.pascoe at stfc.ac.uk wrote:
> Hi Estani,
>
>> I wasn't aware of the "stale" datasets. Well, that is a *major* Gateway
>> Bug, because it should know already about a newer version, so as it is
>> designed right now "should" point just to the original because there's
>> no replica for this.
>> [What now happens is that if you select BADC as the Gateway to download
>> to, you end up with the new version, although at DKRZ said the version
>> was the older one. - It looks like a major bug to me]
>
> I'd like to restate  this bug for clarity.
>
> The dataset
> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1 is
> visible on our gateway at the URL [1].  You can see this is version
> 20111112 and the "History" tab displays the two versions.  This
> dataset is named according to the old version number on the DKRZ
> datanode (not easy to provide a URL for this).
>
> So the problems are:
>
>  * The version is baked into the dataset name but a mirror may have a
> different version to the one held locally
>  * Harvesting metadata doesn't update the version of mirrors if you
> have a copy locally
>  * There is no easy way for replicating sites to discover new versions
>
> Does that sound right to you?
>
> [1]
> http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.html
>
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> Centre of Environmental Data Archival
> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>
>
> -----Original Message-----
> From: esg-gateway-dev-bounces at mailman.earthsystemgrid.org
> [mailto:esg-gateway-dev-bounces at mailman.earthsystemgrid.org] On Behalf
> Of Estanislao Gonzalez
> Sent: 10 November 2011 23:03
> To: Kettleborough, Jamie
> Cc: esg-gateway-dev at earthsystemgrid.org; go-essp-tech at ucar.edu;
> Juckes, Martin (STFC,RAL,RALSP)
> Subject: Re: [esg-gateway-dev] [Go-essp-tech] Replica support in Gateway 2.0
>
> Hi Jamie,
>
> I wasn't aware of the "stale" datasets. Well, that is a *major* Gateway
> Bug, because it should know already about a newer version, so as it is
> designed right now "should" point just to the original because there's
> no replica for this.
> [What now happens is that if you select BADC as the Gateway to download
> to, you end up with the new version, although at DKRZ said the version
> was the older one. - It looks like a major bug to me]
>
> The fact that the new version does not get immediately replicated is
> not a problem but a particularity of any distributed system. It's
> impossible to have a replica of the whole archive synchronized to the
> minute. What we can do is "hide" things that are outdated.
> Basically:
> 1) If the user searches for an id, that resolves to the latest version.
> (If no version is given, we might safely presume the latest is
> requested)
> 2) Only replicas of that version should be displayed if available (this
> should be already possible)
> 3) If the user searches for a particular version, she/he should be
> getting exactly this (feature not implemented)
> 4) Same as 2, display only known replicas of that version.
>
> I don't think this is complicated. And regarding the "few years [from
> now]" I don't think that's the case, and not only because of bandwidth
> (which IS an issue already, as well as server load). I'm pretty sure
> papers are getting written "as we speak", even before a DOI gets out
> there. So these people "need" to cite something, right? The only thing
> they can hold on to is a bunch of URLs AND checksums. It's the only thing
> you can "cite" at the moment; without this no papers could be written,
> and not only in the CMIP5 context (and thus a large AR5 share).
> That's the reason the archives are there. You can cite what you find at
> DKRZ, or BADC, or PCMDI. What's in there is guaranteed to remain exactly
> where it was found for the next 10+ years; other institutions do not
> have this commitment (nor can they).
>
> Again 2c. Thanks,
> Estani
>
> On 10.11.2011 09:14, Kettleborough, Jamie wrote:
>> Hello,
>>
>> I know this thread has moved on, but can we rewind just a bit.  I
>> think Nathan asked for use cases around this issue.
>>
>> As far as I'm aware there are two main user use cases here (there may
>> be others obviously).
>>
>> 1. User wants to get data *now* (even if it's only 'preprint' in
>> Bryan's language) in a way that suits their particular needs
>> (functional and non-functional)
>>
>> 2. In a few years (not sure how long - could be months) time user
>> wants to get a copy of the data to verify or extend some previous
>> analysis.
>>
>> I'm sure someone will correct me if I have this wrong, but I *think*
>> most of the discussion so far has centered around the second of
>> these.
>> I think the first one needs some discussion too though, doesn't it - it
>> feels more urgent to me?
>>
>> I think it needs reviewing in the light of replication as when there
>> is replication a user *may* choose to go to a particular data-node as
>> it suits them better - they may see faster downloads because of the
>> network route between them and the servers, or it might support some
>> data service they want.  Something in the system has to have the
>> responsibility of choosing which replica to download.  To be honest I
>> think this is best left to the user.  If this is the case then the
>> user has to have a good view of *where* the data is through their
>> interface.  Nate - I think your proposed implementation would expose
>> the information no?
>>
>> The other issue for replication I can think of coming out of use case
>> 1 is versioning.  Data will be revised by data providers (we have
>> examples of this), so I think that the replication system has to keep
>> up and the interfaces have to be able to communicate this to the
>> user.
>>  A case I'm worried about is if a replica goes stale (sorry Estani, I
>> *think* we have examples of these e.g.
>>
>> cmip5.output1.MOHC.HadGEM2-ES.historicalGHG.day.atmos.day.r1i1p1.v2011032
>> at DKRZ is stale - we've resubmitted as v20111102.  I only discovered
>> this yesterday, and understood what *might* be happening today, I can
>> send more examples if you need them).  I think a user needs to be able
>> to tell (without too much hard work) what really is a replica, and
>> what is a replica of a previous version.  Nate - can you cope with
>> this situation in your proposed implementation?
>>
>> (Are there any issues around authorisation when it comes to replicas
>> - would a new published version mean all replicas of previous versions
>> are no longer 'authorisable' against, or would stale replicas be
>> available?)
>>
>> I know that replication has started to happen - but is this the right
>> thing now?  Is everything in place to do this in a way that is *safe*
>> and not going to confuse users?
>>
>> Jamie
>>
>> (I guess if you have a client that only uses the data nodes you know
>> what node you were talking to, and see the full version information
>> from the outset, so these aren't such big issues).
>>
>> ps yes I know I still have to answer some questions from Sebastien
>> about our client.
>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of
>>> martin.juckes at stfc.ac.uk
>>> Sent: 10 November 2011 16:26
>>> To: gonzalez at dkrz.de; go-essp-tech at ucar.edu;
>>> esg-gateway-dev at earthsystemgrid.org
>>> Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
>>>
>>> Hi Estani,
>>>
>>> You missed the start -- the bit which is not achievable is
>>> publishing a replica to the same gateway used for the
>>> original publication of that data. E.g. IPSL data published to BADC.
>>>
>>> Cheers,
>>> Martin
>>>
>>> > >-----Original Message-----
>>> > >From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-
>>> > >bounces at ucar.edu] On Behalf Of Estanislao Gonzalez
>>> > >Sent: 10 November 2011 16:20
>>> > >To: go-essp-tech at ucar.edu; esg-gateway-dev at earthsystemgrid.org
>>> > >Subject: Re: [Go-essp-tech] Replica support in Gateway 2.0
>>> > >
>>> > >Hi,
>>> > >
>>> > >this analogy seems perfect. Now regarding the options we have at
>>> > >the moment:
>>> > >1) How "unique" is the dataset id in a Gateway? Federation wide,
>>> > >local Gateway or project unique?
>>> > >2) Depending on 1), the procedure could involve "moving" the
>>> > >published data to some other project, Gateway, Federation :-)
>>> > >
>>> > >I think this could be achievable:
>>> > >1) Data gets replicated to some Gateway (redundancy enforced)
>>> > >2) The originating Gateway, if it's also replicating, should replicate
>>> > >(just data, no publication yet) from the QC checked replica.
>>> > >3) The "pre-print" gets removed (which either means moving it to a
>>> > >different project, Gateway, etc. or really completely deleting it from
>>> > >the Gateway)
>>> > >4) The replica gets published.
>>> > >
>>> > >I might be omitting something, but it seems achievable right now.
>>> > >
>>> > >My 2c,
>>> > >Estani
>>> > >
>>> > >Am 10.11.2011 07:15, schrieb Bryan Lawrence:
>>> > >> Martin has been quite vociferous (quite rightly) in personal email
>>> > >> to me that as far as QC goes, the dataset which gets through QC2 will
>>> > >> *not* be the original dataset - we have no control over the original
>>> > >> dataset's permanence and/or immutability.
>>> > >>
>>> > >> This raises some interesting issues about the role of ESGF ... and
>>> > >> its interaction with the data owner and the publication process
>>> > >> which is governed by DKRZ as the Publisher (and in the future
>>> > >> probably multiple publication processes and multiple Publishers). The
>>> > >> correct analogy here, as I said on an earlier email today, is to
>>> > >> consider the original dataset as a preprint of a Published dataset
>>> > >> (at QC level 3).
>>> > >>
>>> > >> Incidentally, this distinction might offer us a possible (distinct)
>>> > >> future for two different types of gateways into ESGF: the Published
>>> > >> datasets view (which makes pre-eminent the QC'd copy) and the
>>> > >> published view (which makes pre-eminent whatever someone sticks on
>>> > >> a data node).
>>> > >>
>>> > >> But meanwhile, I think we can live with what you proposed, as long
>>> > >> as the QC status of the replicas is clearly visible - and the DOI
>>> > >> points to a landing page that somehow prioritises those versions,
>>> > >> which would be trivial if your page was organised in the same way
>>> > >> (prioritising the replicants of QC level 3, then replicants of QC
>>> > >> level 2, and then originals).
>>> > >>
>>> > >> Cheers
>>> > >> Bryan
>>> > >>
>>> > >>
>>> > >>> Hi Stephen,
>>> > >>>
>>> > >>> On 11/10/2011 05:23 AM, stephen.pascoe at stfc.ac.uk wrote:
>>> > >>>> Hi Eric,
>>> > >>>>
>>> > >>>> Replicas are beginning to show up in CMIP5 and this is exposing some
>>> > >>>> gaps in what Gateway 1.x can do. I know you are reimplementing replica
>>> > >>>> support in Gateway 2.0 so I'd like to raise these issues now.
>>> > >>>>
>>> > >>>> We need to be able to publish a replica to the same Gateway that hosts
>>> > >>>> the original. I can't imagine this being possible with Gateway 1.x since
>>> > >>>> the URL http://<GATEWAY>/dataset/<dataset-id>.html only points to one
>>> > >>>> dataset on that Gateway. Either that page needs to link to the original
>>> > >>>> and all replicas for that dataset or we need separate URLs for each
>>> > >>>> replica/original, or both.
>>> > >>> The current direction for the implementation would be to have one
>>> > >>> page for the original dataset and have that page list where replicas
>>> > >>> are located.
>>> > >>>
>>> > >>> If there are use cases for the other options we should get those
>>> > >>> identified.
>>> > >>>
>>> > >>> Thanks!
>>> > >>> -Nate
>>> > >>>
>>> > >>>
>>> > >>>> Is this part of your design for Gateway 2.0's replica support?
>>> > >>>>
>>> > >>>> Thanks,
>>> > >>>>
>>> > >>>> Stephen.
>>> > >>>>
>>> > >>>> ---
>>> > >>>>
>>> > >>>> Stephen Pascoe +44 (0)1235 445980
>>> > >>>>
>>> > >>>> Centre of Environmental Data Archival
>>> > >>>>
>>> > >>>> STFC Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
>>> > >>>>
>>> > >>>>
>>> > >>>> --
>>> > >>>> Scanned by iCritical.
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> _______________________________________________
>>> > >>>> GO-ESSP-TECH mailing list
>>> > >>>> GO-ESSP-TECH at ucar.edu
>>> > >>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>> > >> --
>>> > >> Bryan Lawrence
>>> > >> University of Reading:  Professor of Weather and Climate Computing.
>>> > >> National Centre for Atmospheric Science: Director of Models and Data.
>>> > >> STFC: Director of the Centre for Environmental Data Archival.
>>> > >> Ph: +44 118 3786507 or 1235 445012; Web:home.badc.rl.ac.uk/lawrence
>>> > >
>>> > >
>>> > >--
>>> > >Estanislao Gonzalez
>>> > >
>>> > >Max-Planck-Institut für Meteorologie (MPI-M)
>>> > >Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>>> > >Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>> > >
>>> > >Phone:   +49 (40) 46 00 94-126
>>> > >E-Mail:  gonzalez at dkrz.de
>>> > >

-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de