[Go-essp-tech] Replication plan

Thu Mar 3 12:47:06 MST 2011

Hi, Estani,

Yes, I agree with you that we should do something simple and practical 
in the short term.

The more sophisticated replication planning would be a longer-term 
solution.

Ann

On 3/3/11 9:14 AM, Estanislao Gonzalez wrote:
> Hi Ann,
>
> There's nothing there yet. Until now replication was not really an 
> issue as, to be honest, there's not much data to be replicated.
>
> I just hope we can make better use of the resources we have at our
> disposal. Building an optimal solution would take a lot of time, and
> there's a trade-off between the time we invest in this and when we
> start replicating.
>
> I am thinking of something more pragmatic, like perhaps dividing the
> datasets by model or building a simple hash function from the dataset
> id to know at least what to start downloading first (that way we
> could immediately know where the dataset *should* have been
> replicated first). This will be enough, I think, and pretty simple to
> accomplish.
>
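A minimal sketch of the hash idea in Python, under stated assumptions: the function names, the choice of SHA-256, and the dataset ids are all hypothetical, and DKRZ and PCMDI are used only because they are the two sites named in this thread. Every site can run the same function locally and reach the same answer about who fetches a dataset first, without any coordination.

```python
import hashlib


def first_replica_site(dataset_id, sites):
    """Return the site that should replicate dataset_id first.

    Assumption: all sites agree on the same list of site names.
    The list is sorted so the result does not depend on its order.
    """
    ordered = sorted(sites)
    digest = hashlib.sha256(dataset_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


def split_by_site(dataset_ids, sites):
    """Group dataset ids by the site that should download them first."""
    plan = {site: [] for site in sites}
    for dataset_id in dataset_ids:
        plan[first_replica_site(dataset_id, sites)].append(dataset_id)
    return plan


if __name__ == "__main__":
    # Hypothetical dataset ids, for illustration only.
    ids = ["dataset-%d" % n for n in range(10)]
    for site, assigned in split_by_site(ids, ["DKRZ", "PCMDI"]).items():
        print(site, assigned)
```

The split is only roughly even, and it says nothing about dataset size, so this is a starting order rather than a balanced plan.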
> The missing part is the replication script, which should also gather
> the info on the replicas (where the files truly are). The BDM or
> whatever script we use will have to rely on one queue per
> replica-gateway consuming this list of files, so that it won't matter
> which gateway connection is faster.
>
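One way to read the queue idea, sketched in Python: a single shared list of pending files and one worker per gateway pulling from it, so a gateway with a faster connection simply ends up fetching more files. This is an illustration, not the BDM's behaviour. The `fetch` callable, the function name, and the assumption that every gateway already holds every file are all hypothetical.

```python
import queue
import threading


def replicate(files, gateways, fetch):
    """Download each file once, with one worker thread per gateway.

    fetch(gateway, filename) is a caller-supplied (hypothetical) function
    that performs the actual transfer. Assumes every gateway has every file.
    Returns a dict mapping each gateway to the files it ended up serving.
    """
    pending = queue.Queue()
    for name in files:
        pending.put(name)

    served = {gateway: [] for gateway in gateways}

    def worker(gateway):
        while True:
            try:
                name = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to download
            fetch(gateway, name)
            served[gateway].append(name)  # each worker owns its own list

    threads = [threading.Thread(target=worker, args=(g,)) for g in gateways]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return served


if __name__ == "__main__":
    result = replicate(
        ["file-%d.nc" % n for n in range(6)],
        ["gateway-a", "gateway-b"],
        fetch=lambda gateway, name: None,  # stand-in for a real transfer
    )
    print(result)
```

A real script would also need the replica information mentioned above, so that a worker only takes files its gateway actually holds, plus retry handling for failed transfers.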
> In my opinion, we should make this replication solution as pragmatic
> as it gets. We have enough problems elsewhere, we probably won't reuse
> this solution, and the improvement is not big enough to justify
> investing too much time in it and thereby delaying the start of
> replication. Still, I would take advantage of the fact that replicas
> will at least be on more gateways after the first replication.
>
> I agree we will need a better replication procedure for the future, as
> this will become more and more common. I'm thinking more along the
> lines of p2p, torrent and the like, but I don't see that happening
> soon enough.
>
> My 2c anyway,
> Estani
>
> Am 03.03.2011 17:49, schrieb Ann Chervenak:
>>
>> Hi, Estani,
>>
>> I agree that such a replication plan is a smart idea. Otherwise, we 
>> are likely to create hot spots where a newly published data set gets 
>> a lot of simultaneous access, slowing down everyone using that site.
>>
>> Your suggestion of two sites each downloading half the data set and 
>> then acting as alternate sources for replication operations makes sense.
>>
>> Such a replication plan can get fairly sophisticated--e.g., using a 
>> tree configuration to disseminate data to multiple mirror sites, with 
>> each newly created replica acting as a source site for subsequent 
>> replication operations.
>>
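The tree configuration can be sketched as a round-by-round schedule in Python. The simplifying assumptions here are mine: each site serves at most one transfer per round and all transfers take about the same time. The function name and the third mirror name are hypothetical. Because every new replica becomes a source in the next round, the number of sources can double each round.

```python
def dissemination_rounds(origin, mirrors):
    """Plan transfers so that each new replica becomes a source.

    Returns a list of rounds; each round is a list of (source, target)
    pairs. Assumes each site serves at most one transfer per round.
    """
    holders = [origin]        # sites that already have the data
    waiting = list(mirrors)   # sites still to be served
    rounds = []
    while waiting:
        transfers = []
        for source in holders:
            if not waiting:
                break
            transfers.append((source, waiting.pop(0)))
        # Targets of this round can act as sources from the next round on.
        holders = holders + [target for _, target in transfers]
        rounds.append(transfers)
    return rounds


if __name__ == "__main__":
    plan = dissemination_rounds("BADC", ["DKRZ", "PCMDI", "mirror-c"])
    for number, transfers in enumerate(plan, start=1):
        print("round", number, transfers)
```

With one origin and three mirrors this takes two rounds instead of three sequential downloads from the origin.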
>> We could also think about scheduling downloads based on when system 
>> and network loads are likely to be lighter.
>>
>> Are you able to schedule replication operations (i.e., can you
>> expect BADC to publish certain data sets at certain times), or are 
>> replication operations more reactive (initiated when you see that a 
>> new data set has been published)?
>>
>> Ann
>>
>>
>> On 3/3/11 12:09 AM, Estanislao Gonzalez wrote:
>>> Hi,
>>>
>>> I decided to split my last email as this is something a little
>>> different, but it originated from what I said there.
>>>
>>> Shouldn't we have a replication plan?
>>> For example, if PCMDI and DKRZ replicate the same datasets from BADC
>>> at the same time, it will be a waste of time. We (DKRZ) could
>>> replicate one half and PCMDI the other, and for the second half we
>>> will be able to download simultaneously from two other gateways
>>> (actually datanodes, but you get the idea).
>>>
>>> Any thoughts on how this could be achieved? Or whether it even
>>> makes sense?
>>>
>>> Thanks,
>>> Estani
>>>
>
>