[Go-essp-tech] Replication plan

Estanislao Gonzalez estanislao.gonzalez at zmaw.de
Thu Mar 3 10:14:13 MST 2011


Hi Ann,

There's nothing there yet. Until now, replication hasn't really been an 
issue since, to be honest, there hasn't been much data to replicate.

I just hope we can make better use of the resources at our disposal. 
Building an optimal solution would take a lot of time, and there's a 
trade-off between the time we invest in this and when we start 
replicating.

I would think of something more pragmatic, like dividing the datasets 
by model, or building a simple hash function over the dataset id to 
decide what to start downloading first (at least that way we could 
immediately tell where each dataset *should* have been replicated 
first). I think this will be enough, and it's pretty simple to 
accomplish.
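A minimal sketch of that hash idea, assuming Python and made-up site and dataset names (the real dataset ids and replica centres would come from the catalogue): each site hashes every dataset id and only fetches the ones assigned to it first, so no coordination traffic is needed.

```python
# Hypothetical sketch: map each dataset id to a replica site by hashing,
# so every site can independently compute what to download first.
# Site names and dataset ids below are invented for illustration.
import hashlib

REPLICA_SITES = ["DKRZ", "PCMDI"]  # example replica centres

def assigned_site(dataset_id: str) -> str:
    """Return the site that should replicate this dataset first."""
    digest = hashlib.md5(dataset_id.encode()).hexdigest()
    return REPLICA_SITES[int(digest, 16) % len(REPLICA_SITES)]

datasets = [
    "cmip5.output1.MPI-M.MPI-ESM-LR.historical.mon.atmos.Amon.r1i1p1",
    "cmip5.output1.MOHC.HadGEM2-ES.rcp45.mon.ocean.Omon.r1i1p1",
]
# What DKRZ would start on; PCMDI computes its own half the same way.
mine = [d for d in datasets if assigned_site(d) == "DKRZ"]
```

Because the hash is deterministic, both sites agree on the split without talking to each other, and the remaining half can later be pulled from the sister site instead of the original data node.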

The missing part is the replication script, which should also gather 
the information about the replicas (where the files actually are). The 
BDM, or whatever script we use, will have to keep one queue per 
replica gateway consuming this list of files, so that it won't matter 
which gateway connection is faster.
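The queue-per-gateway idea could be sketched roughly as follows, assuming Python threads and placeholder gateway names and a dummy download function (the real transfer would be a BDM or wget call): each gateway drains its own queue, so a slow connection never blocks files available elsewhere.

```python
# Hypothetical sketch of "one download queue per replica gateway":
# each gateway has its own worker draining its own queue, so a fast
# gateway simply finishes its queue sooner.
import queue
import threading

def download(gateway: str, file_id: str) -> None:
    # Placeholder for the real transfer tool (e.g. BDM or wget).
    print(f"fetching {file_id} from {gateway}")

def worker(gateway: str, q: "queue.Queue[str]", done: list) -> None:
    # Consume this gateway's queue until it is empty.
    while True:
        try:
            file_id = q.get_nowait()
        except queue.Empty:
            return
        download(gateway, file_id)
        done.append(file_id)

# One queue per gateway; the replica catalogue would tell us
# which files are served by which gateway.
queues = {"badc": queue.Queue(), "pcmdi": queue.Queue()}
queues["badc"].put("file_a.nc")
queues["pcmdi"].put("file_b.nc")

done: list = []
threads = [threading.Thread(target=worker, args=(g, q, done))
           for g, q in queues.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the per-gateway queues is that total throughput is just the sum of the per-gateway rates; no global scheduler has to know in advance which connection is faster.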

In my opinion, we should make this replication solution as pragmatic as 
it gets. We have enough problems elsewhere, we probably won't reuse 
this solution, and the improvement is not so great as to justify 
investing too much time in it and thereby delaying the start of 
replication. Still, I would leverage the fact that after the first 
replication pass the data will at least be available from more gateways.

I agree we will need a better replication procedure in the future, as 
this will become more and more common. I'm thinking of something more 
like p2p, torrents and the like, but I don't see that happening soon 
enough.

My 2c anyway,
Estani

Am 03.03.2011 17:49, schrieb Ann Chervenak:
>
> Hi, Estani,
>
> I agree that such a replication plan is a smart idea. Otherwise, we 
> are likely to create hot spots where a newly published data set gets a 
> lot of simultaneous access, slowing down everyone using that site.
>
> Your suggestion of two sites each downloading half the data set and 
> then acting as alternate sources for replication operations makes sense.
>
> Such a replication plan can get fairly sophisticated--e.g., using a 
> tree configuration to disseminate data to multiple mirror sites, with 
> each newly created replica acting as a source site for subsequent 
> replication operations.
>
> We could also think about scheduling downloads based on when system 
> and network loads are likely to be lighter.
>
> Are you able to schedule replication operations (i.e., can you expect 
> BADC to publish certain data sets at certain times), or are 
> replication operations more reactive (initiated when you see that a 
> new data set has been published)?
>
> Ann
>
>
> On 3/3/11 12:09 AM, Estanislao Gonzalez wrote:
>> Hi,
>>
>> I decided to split my last email as this is something a little
>> different, but got originated from what I said there.
>>
>> Shouldn't we have a replication plan?
>> For example, if PCMDI and DKRZ replicate the same datasets from BADC at
>> the same time, it would be a waste of time. We (DKRZ) could replicate one
>> half and PCMDI the other, and for the second half each of us would be able
>> to download simultaneously from two other gateways (actually datanodes,
>> but you get the idea).
>>
>> Any thoughts on how this could be achievable? Or if it even makes sense?
>>
>> Thanks,
>> Estani
>>


-- 
Estanislao Gonzalez

Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany

Phone:   +49 (40) 46 00 94-126
E-Mail:  estanislao.gonzalez at zmaw.de


