[Go-essp-tech] [IPSL-CMIP5] Re: Dataset versions across CMIP5 Gateways

Estanislao Gonzalez gonzalez at dkrz.de
Mon Jan 23 08:43:37 MST 2012


Hi Sébastien,

[<dropping the help-desk link, as this is more a GO-ESSP subject>]

Thanks for the feedback. The replication problem is a beast that was 
left to grow alone... so pretty much every institution has a different 
procedure that's not being communicated properly. Sadly, there are many 
other issues that keep us from getting to the core of this problem. 
Nevertheless, and as I said, this is an ongoing effort and everyone is 
really doing their best with the spare time they get after fixing other 
more urgent problems.

In our case replication is done "intelligently", if I may say so (i.e. 
only deltas are moved around, everything is being kept track of, etc.), 
but not as automated as we'd like it to be... too many exceptions, too 
little time. For instance, and AFAIK, GridFTP is hardly used anywhere 
and very few institutions provide fast access to their data for data 
replicators (i.e. when replicating we go through the same channel as 
all other users, which slows things down considerably). We have 2 
GridFTP servers and one dedicated to replication; BADC has pretty much 
the same thing, although they have to keep both servers in sync; and 
PCMDI has some GridFTP, but last time I checked (a while ago) only a 
few datasets were available. AFAIK no other institutions have those 
resources (and we three provide them because of our "commitment" as 
archives).
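
To illustrate what "only deltas are moved around" could look like, here 
is a minimal sketch. It assumes each dataset version comes with a 
manifest mapping file paths to checksums; the function name plan_delta, 
the file names and the checksum values are all hypothetical and do not 
come from any of the replication tools mentioned in this thread.

```python
# Hypothetical sketch: not the code of any replication tool in this thread.

def plan_delta(local, remote):
    """Compare two {relative_path: checksum} manifests.

    Returns three sorted lists of paths:
      to_fetch  - new or changed at the origin, must be downloaded
      unchanged - identical checksum, the local file can be reused
      obsolete  - present locally but no longer part of the remote version
    """
    to_fetch = sorted(p for p, c in remote.items() if local.get(p) != c)
    unchanged = sorted(p for p, c in remote.items() if local.get(p) == c)
    obsolete = sorted(p for p in local if p not in remote)
    return to_fetch, unchanged, obsolete


# Made-up manifests for two versions of one dataset.
old_version = {"so_Omon.nc": "aa11", "thetao_Omon.nc": "bb22", "zos_Omon.nc": "cc33"}
new_version = {"so_Omon.nc": "aa11", "thetao_Omon.nc": "dd44", "tos_Omon.nc": "ee55"}

to_fetch, unchanged, obsolete = plan_delta(old_version, new_version)
print("fetch:", to_fetch)
print("reuse:", unchanged)
print("drop: ", obsolete)
```

Only the files in to_fetch would need to cross the network; the ones in 
unchanged could be reused locally, for example by linking them into the 
new version directory.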

But basically the impediment to publishing replicas is what's holding us 
back... what's the point of having replicas if they break functionality? 
They were meant to help users (not to mention that archives rely on 
them), but as of now the procedure hinders them by taking up precious 
resources (bandwidth, especially at institutions not providing a 
separate means of accessing data for replication) and not offering 
anything in exchange, as we just can't publish replicas at this time.

And regarding the write permission, I'd say nothing should be a matter 
of trust. We just don't have a proper paradigm in place.
There are truly two different types of "replicas": the archive one, 
meant for persistence and LTA (long-term archiving), and the redundant 
one, meant for increasing bandwidth and used as a back-up.
The problem is that the first one can be used for redundancy as well... 
though they have a very different nature: while the redundant one is a 
truly subordinate copy (i.e. it must follow the "original" one), the 
archive copy is not; it's a completely new entity. In programming terms, 
the redundant one is a pointer with cached data, while the archive one 
is a deep copy.
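
The pointer/deep-copy analogy can be made concrete with a small sketch. 
This is only an illustration under my own assumptions: the classes 
Dataset, RedundantReplica and ArchiveCopy, the file name and the 
checksums are hypothetical and are not part of any gateway software.

```python
import copy


class Dataset:
    """The "original" dataset held by the publishing institution."""

    def __init__(self, dataset_id, version, files):
        self.dataset_id = dataset_id
        self.version = version
        self.files = files  # {filename: checksum}


class RedundantReplica:
    """Subordinate copy: a reference to the original plus cached data."""

    def __init__(self, original):
        self.original = original  # the "pointer"
        self.cached_version = original.version
        self.cached_files = dict(original.files)

    def is_stale(self):
        # The cache must follow the original; a mismatch means re-sync.
        return self.cached_version != self.original.version

    def refresh(self):
        self.cached_version = self.original.version
        self.cached_files = dict(self.original.files)


class ArchiveCopy:
    """Independent entity: a deep copy frozen when it was taken."""

    def __init__(self, original):
        self.snapshot = copy.deepcopy(original)


original = Dataset(
    "cmip5.output1.IPSL.IPSL-CM5A-LR.piControl.mon.ocean.Omon.r1i1p1",
    "v20110324",
    {"example_Omon.nc": "aa11"},  # made-up file name and checksum
)
redundant = RedundantReplica(original)
archive = ArchiveCopy(original)

# The origin publishes a new version and changes a file.
original.version = "v20111010"
original.files["example_Omon.nc"] = "bb22"

print(redundant.is_stale())      # the subordinate copy now has to catch up
print(archive.snapshot.version)  # the archive copy is unaffected
```

In this picture a redundant replica only makes sense together with its 
original, whereas the archive copy keeps its own state no matter what 
happens to the original afterwards.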

Sorry for the long mail, but I think the community should know about the 
current status of replication.
As usual, feedback is more than welcome.

Thanks,
Estani

Am 23.01.2012 15:40, schrieb Sébastien Denvil:
> Hi Estani,
>
> Le 23/01/2012 12:40, Estanislao Gonzalez a écrit :
>> Hi Sébastien,
>>
>> This is a known problem with replicas. I'm removing all replicas 
>> from our system (just from the Gateway) hoping this will get solved. 
>> This shouldn't inhibit replication via the BDM, but it will prevent 
>> discovery... Anyway, I was waiting/hoping for a solution to this, but 
>> I see no other option than to withdraw them. Should be ready soon...
>>
>
> Ok, thanks for letting me know. What about PCMDI replicated/published 
> datasets?
>
>> Regarding your last point:
>> > Can you confirm that *not* all users authorised to publish at DKRZ 
>> are able to modify this dataset?
>>
>> Well indeed they can; since there's only one person authorized to 
>> publish to DKRZ (me) and only one person able to change that 
>> authorization (myself), I don't think this is a problem... 
>
> Ok, that was my supposition. If it's only you then no problem. In the 
> future it could be that other users can publish to the DKRZ gateway. 
> By that time it would be good to change permissions (just to avoid 
> mistakes).
>
>> I guess your question is more about why I am able to "write" an IPSL 
>> dataset. Well, please remember that we are talking about replicas, so 
>> I can't alter the IPSL dataset, but I can publish a replica. 
>> Furthermore, I'm even able to publish wrong information, i.e. another 
>> dataset or a corrupt one, and mark it as a replica of IPSL's. This is 
>> something we don't really want to happen.
>
> Again it's a matter of trust. I'm sure you perform all the necessary 
> checks to avoid publication of corrupted replicas.
>
>> On the other hand, IPSL may remove, alter or do whatever it likes 
>> with the "original", and that's again something archives don't want, 
>> at least not if "our" copy is treated as the "replica".
>>
>
> Up to now we have preserved dataset versions and followed the CMIP5 
> procedures precisely. The benefit for you is that you have time to 
> define the best replica/publication strategy.
>
>> We do have a lot to define regarding replicas. This is an ongoing 
>> conversation, so I'll kindly ask our stakeholders to speak their mind.
>
> I believe gateways should expose all dataset versions (especially the 
> latest one).
>
> Because it takes time to replicate/publish, this would mean that 
> the latest version may not have been replicated yet but should appear 
> as such in gateways.
>
> It would also mean that the best thing the replication software can 
> achieve is to download only what has changed (based on 
> checksums when available) and to follow the drslib strategy of building 
> links when nothing has changed.
>
> Thanks.
> Sébastien
>
>>
>> Thanks,
>> Estani
>>
>> Am 23.01.2012 10:33, schrieb Sébastien Denvil:
>>> Dear all,
>>>
>>> Browsing the gateways of PCMDI, BADC and DKRZ using the underlying 
>>> facets of this dataset, I observed some strange behaviour: 
>>> cmip5.output1.IPSL.IPSL-CM5A-LR.piControl.mon.ocean.Omon.r1i1p1
>>>
>>> This dataset has 2 versions, v20110324 and v20111010.
>>>
>>> Only the BADC gateway displays the latest version. The other two 
>>> gateways display the old one and never mention the existence of a 
>>> new version. I believe this is a major issue due to replication side 
>>> effects.
>>>
>>> Because there isn't any "version" facet, it would be important to 
>>> make every version of a dataset visible in a homogeneous way 
>>> across gateways.
>>>
>>> http://cmip-gw.badc.rl.ac.uk/dataset/cmip5.output1.IPSL.IPSL-CM5A-LR.piControl.mon.ocean.Omon.r1i1p1.html 
>>>
>>> http://pcmdi3.llnl.gov/esgcet/dataset/cmip5.output1.IPSL.IPSL-CM5A-LR.piControl.mon.ocean.Omon.r1i1p1.html 
>>>
>>> http://ipcc-ar5.dkrz.de/dataset/cmip5.output1.IPSL.IPSL-CM5A-LR.piControl.mon.ocean.Omon.r1i1p1.html 
>>>
>>>
>>> Also selecting the administration tab from the DKRZ gateway I can 
>>> read the following:
>>> Groups authorized for Writing: Users authorized to publish at DKRZ
>>> Gateway Administrators
>>>
>>> Can you confirm that *not* all users authorised to publish at DKRZ 
>>> are able to modify this dataset?
>>>
>>> Regards.
>>> Sébastien
>>>
>>>
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>
>>
>> -- 
>> Estanislao Gonzalez
>>
>> Max-Planck-Institut für Meteorologie (MPI-M)
>> Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
>> Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
>>
>> Phone:   +49 (40) 46 00 94-126
>> E-Mail:gonzalez at dkrz.de  
>
>
> -- 
> Sébastien Denvil
> IPSL, Pôle de modélisation du climat
> UPMC, Case 101, 4 place Jussieu,
> 75252 Paris Cedex 5
>
> Tour 45-55 2ème étage Bureau 209
> Tel: 33 1 44 27 21 10
> Fax: 33 1 44 27 39 02


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  gonzalez at dkrz.de
