[Go-essp-tech] Replication plan
Estanislao Gonzalez
estanislao.gonzalez at zmaw.de
Thu Mar 3 10:14:13 MST 2011
Hi Ann,
There's nothing there yet. Until now replication was not really an issue
because, to be honest, there's not much data to be replicated.
I just hope we can make better use of the resources we have at our
disposal. Building an optimal solution would take a lot of time, and
there's a trade-off between the time we invest in this and when we
start replicating.
I'm thinking of something more pragmatic, like dividing the datasets by
model, or building a simple hash function over the dataset id so we at
least know what to start downloading first (by doing that we would
immediately know where each dataset *should* have been replicated
first). I think this will be enough, and it is pretty simple to
accomplish.
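To illustrate the hash-function idea, here is a minimal sketch in Python. The site names and the dataset id in the example are made up for illustration, not actual ESGF identifiers:

```python
# Sketch: deterministically assign each dataset to the site that should
# replicate it first. All sites run the same function, so no coordination
# is needed to agree on the assignment.
import hashlib

# Hypothetical list of replicating centres (illustrative only).
REPLICA_SITES = ["DKRZ", "PCMDI", "BADC"]

def primary_replica_site(dataset_id: str) -> str:
    """Map a dataset id to the site expected to download it first."""
    digest = hashlib.md5(dataset_id.encode("utf-8")).hexdigest()
    return REPLICA_SITES[int(digest, 16) % len(REPLICA_SITES)]
```

Any site can then immediately tell, for any dataset id, where the replica *should* appear first, without querying anyone.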
The missing part is the replication script, which should also gather the
info about the replicas (where the files truly are). The BDM, or
whatever script we use, will have to rely on one queue per
replica-gateway consuming this list of files, so that it won't matter
which gateway connection is faster.
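The one-queue-per-gateway idea could look roughly like this sketch (gateway and file names are invented; the actual download call is left as a placeholder):

```python
# Sketch: one download queue per source gateway, each drained by its own
# consumer thread. A slow gateway only delays its own queue; files served
# by a faster gateway keep flowing independently.
import queue
import threading

def consume(gateway: str, files: "queue.Queue[str]", done: list) -> None:
    """Drain one gateway's queue; stop when it is empty."""
    while True:
        try:
            path = files.get_nowait()
        except queue.Empty:
            return
        # A real script would call something like download(gateway, path)
        # here; we just record which gateway served which file.
        done.append((gateway, path))

# One queue per replica-gateway (names are hypothetical).
queues = {"gateway-A": queue.Queue(), "gateway-B": queue.Queue()}
queues["gateway-A"].put("file1.nc")
queues["gateway-B"].put("file2.nc")

done: list = []
threads = [threading.Thread(target=consume, args=(gw, q, done))
           for gw, q in queues.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the design is that the consumers run concurrently, so overall throughput is bounded by each gateway's own speed rather than by the slowest one.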
In my opinion, we should make this replication solution as pragmatic as
possible. We have enough problems elsewhere, we probably won't reuse
this solution, and the improvement isn't big enough to justify investing
too much time in it and thereby delaying the replication start. Still, I
would leverage the fact that after the first replication the replicas
will at least be available from more gateways.
I agree we will need a better replication procedure in the future, as
this will become more and more common. I'm thinking more along the lines
of p2p, torrents and the like, but I don't see that happening soon enough.
My 2c anyway,
Estani
On 03.03.2011 17:49, Ann Chervenak wrote:
>
> Hi, Estani,
>
> I agree that such a replication plan is a smart idea. Otherwise, we
> are likely to create hot spots where a newly published data set gets a
> lot of simultaneous access, slowing down everyone using that site.
>
> Your suggestion of two sites each downloading half the data set and
> then acting as alternate sources for replication operations makes sense.
>
> Such a replication plan can get fairly sophisticated--e.g., using a
> tree configuration to disseminate data to multiple mirror sites, with
> each newly created replica acting as a source site for subsequent
> replication operations.
>
> We could also think about scheduling downloads based on when system
> and network loads are likely to be lighter.
>
> Are you able to schedule replication operations (i.e., can you expect
> BADC to publish certain data sets at certain times), or are
> replication operations more reactive (initiated when you see that a
> new data set has been published)?
>
> Ann
>
>
> On 3/3/11 12:09 AM, Estanislao Gonzalez wrote:
>> Hi,
>>
>> I decided to split this off from my last email as it is something a
>> little different, but it originated from what I said there.
>>
>> Shouldn't we have a replication plan?
>> For example, if PCMDI and DKRZ replicate the same datasets from BADC at
>> the same time, it will be a waste of time. We (DKRZ) could replicate one
>> half and PCMDI the other, and for the second half we would then be able
>> to download simultaneously from two other gateways (actually datanodes,
>> but you get the idea).
>>
>> Any thoughts on how this could be achievable? Or if it even makes sense?
>>
>> Thanks,
>> Estani
>>
--
Estanislao Gonzalez
Max-Planck-Institut für Meteorologie (MPI-M)
Deutsches Klimarechenzentrum (DKRZ) - German Climate Computing Centre
Room 108 - Bundesstrasse 45a, D-20146 Hamburg, Germany
Phone: +49 (40) 46 00 94-126
E-Mail: estanislao.gonzalez at zmaw.de