[Go-essp-tech] Replication call: agenda and document

Thu Nov 5 23:07:46 MST 2009

Hi Gavin

I think we agree more than we differ, until we get to implementation :-)

If we strip out the detail, the replication problem is related to the following:

1) modellers produce data and introduce it to ESG publisher (i.e. put it in the right place for the publisher to see it)

2) ESG publisher scans the data (and now, following conversations with Karl, may need to move data or raise an error if the data isn't in the right version directories) ... and loads its internal DB and produces Thredds catalogs (which are then harvested by gateways).

3) We need to interrogate the system (or the system needs to notify us) to find out that there is new data to be replicated.

4) We need to construct an inventory, in whatever format, that contains a list of tuples which have the following semantics: 
 [(directory path at source, filename, checksum),...]

5) the data at those tuples needs to be moved

6) then we essentially repeat steps 1) and 2) with the modification that this time, ESG publisher marks the data as replicates, and we don't expect anyone else to replicate it.

and eventually

7) All gateways show all replicates.
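
As an illustration of the tuples in step 4), here is a minimal Python sketch that builds them from a list of file paths. The function names and the choice of MD5 are assumptions made for the example; nothing above fixes the checksum algorithm, and in practice the values might come straight from the publisher's database rather than from re-reading the files.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()  # MD5 is an assumption; any agreed algorithm would do
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(paths):
    """Turn file paths into (directory path, filename, checksum) tuples."""
    inventory = []
    for path in paths:
        directory, filename = os.path.split(path)
        inventory.append((directory, filename, file_checksum(path)))
    return inventory
```

The checksum travels with each entry so that the receiving site can verify a file after the move in step 5).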

In the context of replication there are, I believe, no more sophisticated semantics necessary, so even a Thredds catalog carries more semantics than replication needs. (And yes, I'm familiar with Thredds, albeit not as familiar as I would like; one day someone will show me some UML, meanwhile I've had to be content with instances and XML schema.)

Now, what's the easiest way to get this info? I would argue a direct interrogation of the ESG publisher database (since what we need is a difference, and that's likely to be most easily obtained from there). Then, given the only purpose of this difference is to enable the movement of these files, why wouldn't we simply serialise the "inventory" with only the semantics above? The Thredds semantics are reproduced at the target end at step 6), and we can then, before declaring replication victory, compare Thredds catalogs as you have suggested.
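
To make the "difference" concrete, here is a rough sketch of querying for files acquired between two dates and serialising only the inventory semantics. The table and column names (files, directory, filename, checksum, acquired) are invented for the example and are not the real ESG publisher schema; an in-memory SQLite database stands in for the publisher's DB, and JSON is just one possible serialisation.

```python
import json
import sqlite3


def inventory_between(conn, start, end):
    """Return (directory, filename, checksum) tuples acquired in [start, end)."""
    rows = conn.execute(
        "SELECT directory, filename, checksum FROM files "
        "WHERE acquired >= ? AND acquired < ? "
        "ORDER BY directory, filename",
        (start, end),
    )
    return [tuple(row) for row in rows]


def serialise(inventory):
    """Serialise the inventory as a JSON list of three-element lists."""
    return json.dumps([list(entry) for entry in inventory])


# Toy database with a hypothetical schema, standing in for the publisher's DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files "
    "(directory TEXT, filename TEXT, checksum TEXT, acquired TEXT)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("/data/v1", "tas.nc", "aaa", "2009-11-01"),
        ("/data/v1", "pr.nc", "bbb", "2009-11-04"),
        ("/data/v2", "tas.nc", "ccc", "2009-11-05"),
    ],
)

# ISO-format date strings sort correctly as plain text.
new_files = inventory_between(conn, "2009-11-03", "2009-11-06")
print(serialise(new_files))
```

This only works if the date of acquisition is stored alongside each file, which is itself an assumption.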

To get started, we could write the code to do steps 3, 4 and 5 in python in an afternoon and exploit os.system (or the subprocess module) to run BDM ... and with that it would take no time at all for Bob (or us) to produce a pylons controller to expose the inventory given two dates (assuming the date of acquisition was stored in the db).
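
Driving an external transfer tool from Python for step 5) could look roughly like the sketch below, which uses the standard subprocess module. The BDM command-line arguments are not given here, so the command is passed in as a template by the caller; the "{path}" placeholder convention and the function name are assumptions made for the example.

```python
import posixpath
import subprocess


def transfer(inventory, command_template):
    """Run one external command per inventory entry; return the failures.

    command_template is a list of arguments in which the text "{path}" is
    replaced by the source path of the file. The real BDM arguments are
    not known here, so the caller has to supply them.
    """
    failed = []
    for directory, filename, checksum in inventory:
        path = posixpath.join(directory, filename)
        command = [arg.replace("{path}", path) for arg in command_template]
        result = subprocess.run(command)
        if result.returncode != 0:
            # Keep the whole tuple so the entry can be retried later.
            failed.append((directory, filename, checksum))
    return failed
```

Passing the command as a list of arguments avoids any shell quoting problems with unusual file names. Checking the checksum at the target after each transfer would be the natural next step.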

Now we do have to handle notification and/or interrogation, but it's a very simple use case ... congruent to the one you want to support for users, but with one significant difference: the replication sites are going to be trusted to a higher level ...

In which case I could (for say 20 ESG data nodes) run my own scripts to poll that pylons controller if we had nothing more sophisticated on the table, as could all the other replication sites.
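
Such a polling script might be sketched as follows. The URL layout, the "start" and "end" query parameters, and the JSON response format are all assumptions, since the controller does not exist yet; the fetch function is passed in so that the polling logic can be tried without a live server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_inventory(base_url, start, end):
    """Fetch the inventory for [start, end) from a hypothetical controller."""
    url = base_url + "?" + urlencode({"start": start, "end": end})
    with urlopen(url) as response:
        entries = json.loads(response.read().decode("utf-8"))
    return [tuple(entry) for entry in entries]


def poll(nodes, last_seen, now, fetch=fetch_inventory):
    """Ask every node for files acquired since it was last polled.

    last_seen maps node URL to the date of the previous successful poll
    and is updated in place, so a node that raises is asked again for
    the same period next time.
    """
    new_files = {}
    for node in nodes:
        start = last_seen.get(node, "1970-01-01")
        new_files[node] = fetch(node, start, now)
        last_seen[node] = now
    return new_files
```

Run from cron against the 20 or so data nodes, something like this would cover interrogation until a proper notification mechanism is available.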

So, with roughly two days' work, and exploiting a command line interface to BDM, we could have replication working semi-operationally.

Now obviously we can do more sophisticated stuff, but why wouldn't we get something working asap?

Cheers
Bryan

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence