[Go-essp-tech] Replication call: agenda and document

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Wed Nov 4 03:36:14 MST 2009


Hi Gavin

Thanks for your email, it's good to get this stuff out in the open .... I've picked over it in some detail below.  I've chosen to use a Socratic (confrontational) manner, but don't take it personally; I find that a direct argument tends to get the discussion focussed quickly on the issues (where I'm as likely to be wrong as right :-).

> The main idea I want to get across is for us to have a *catalog-centric*
> view of the system.  It is the catalog that is the primary currency of
> the system.

I think where we are going to disagree is about the definition of the system, and on the use cases of interaction with the system.  

But before I disagree, let me agree vociferously. From a *user* perspective, you are absolutely right. If you dig into any of my history, you'll see I've banged on ad nauseam about the issue of getting metadata (aka catalogue, for some cases) views right ... but I've also introduced a taxonomy of metadata (e.g. doi:10.1098/rsta.2008.0237) to deal with the fact that there can *never* be "one ring to rule them all", because the ways we interact with systems differ.

For example, even in the ESG perspective you have the TDS catalogs, and you have the gateway views ... (integrating rather more metadata etc). Clearly these are both useful catalogs, but they're not the same catalog.

I think we would both agree that the best way of building a complex system is to link together simple systems ... and  as you rightly note a catalog is an incredibly useful view on a system, and that you can build a more complex system out of the catalogs and interactions with them.

The question not addressed, however, mainly because it can't be addressed by anyone but us, is how the ESG system will interact with my *ALREADY EXISTING SYSTEM*. Not to put too fine a point on it, but we already have a catalog, with hundreds of TB of data, ten thousand plus users, etc. ... and established operational procedures for managing data (and in particular, ingesting data), and the ESG system is simply not mature enough to replace it, so it's going to have to work with it.

Even on the smaller scale of the ESG data node, folks will be producing data in their production environments and moving it into the data node environment ... so, if you like, the introduction of data *into* the ESG system is outside the scope of the ESG system ... (by definition).

So, a priori, the construction of a new version of data is out of scope for ESG ... the question on the table then is how best to deal with replicating a new (or first) version of data. (Maybe I didn't need to say all of the above, but it's important to put things in context ...)

I used the phrase "inventory" on the telco deliberately, because it could be a catalog, or it could be a simpler file list along with checksums ... (does the catalog have all the file checksums in it? Which catalog?)
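
To be concrete, the sort of thing I mean by an inventory is no more than
this (a sketch only; the names, the choice of md5 and the format are mine,
not anything the publisher produces today):

    import hashlib
    import os

    def build_inventory(dataset_dir):
        """Walk a dataset directory and return {relative_path: checksum}.

        A flat mapping like this, written out once per dataset version,
        is all the "inventory" needs to be.
        """
        inventory = {}
        for root, _dirs, files in os.walk(dataset_dir):
            for name in files:
                path = os.path.join(root, name)
                digest = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        digest.update(chunk)
                inventory[os.path.relpath(path, dataset_dir)] = digest.hexdigest()
        return inventory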

> - The catalog gets generated and published from the data via the
> data-node/publisher to the gateway.

Which catalog? The TDS catalog?

> - The gateway is simply, in the context of this model, a searchable
> index over a collection of catalogs.

We're not interested in search for this use case. So, as you say, we have a collection of TDS catalogs.

> - Changes to catalogs are what is versioned.

I don't agree with the perspective. I agree with the result :-) Atomic datasets are versioned, and that versioning is reflected in the catalog.

> - Changes to catalogs are what trigger notifications

I have no problem with this, as long as we remember that the "replicating sites" are not in the same category as "users" when it comes to notifications ... not least because you don't want your download metrics to include replication downloads ... also, one is "desirable", the other is "essential".

> - Replication should be about replicating catalogs, where files
> transfers are the necessary side-effect of proper catalog replication.

Sorry, again I think this perspective can lead to muddled thinking.  One could forget what the catalog is for: it is there to enable users to access data. Catalogs are a necessary mechanism for making that possible, and they might also be a necessary component of a replication system. (But catalogs such as the ESG catalog are NOT data management systems. I've learnt that the hard way with my own team, where I used to think they could be.)

> It is the catalog that is the central 'document' that we are interested
> in.  It is the single entity that contains the necessary information
> used in all levels of this system.

I think this way leads to complex systems. I'm minded to think of a Unix system being made up of lots of things that all work well on their own. Or a layered network model ... TCP doesn't know about Ethernet ... the central thing of interest is the data!

So, frankly, what I'm ultimately interested in is the data. I'm interested in the ESG catalog IFF it adds value to what I can already do (it does), but it won't supplant what I have, because it won't drive my operational data management system. So clearly, the (ESG) catalog doesn't control everything in the ESG *federation* .... (if it were to, we would have had rather a lot of influence over its functionality).

> The very good point that was brought up on the call was, what is the
> interface between parts of the system?  It has become clear to me that
> if each part of the system understood the catalog then they could
> operate quite well, gleaning the information out of catalogs.

I don't doubt this point. As I said, my "inventory" *could* be a catalog document. But I'm picking the bones out of your email, to expose assumptions :-) :-)

> The topic today was replication:
> So... In a catalog centric model, the question of replication becomes
> simply, what datasets have changed?  

Yes.

> This is equivalent to asking, what catalogs have changed? 

Sort of. More particularly, what entries in the catalog have changed. Which is also equivalent to asking "What files have changed?" I can answer that with rather fewer lines of code than it takes to parse the TDS catalog. (I'm not saying I should do the former, but one has to ask what the value proposition is for doing the latter, given we can *and must* rebuild the catalog on ingestion at the remote site.)
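
And for the record, "what files have changed?" against two such
inventories (as sketched earlier) really is only a few lines; this
comparison, with hypothetical names, is all I have in mind:

    def diff_inventories(old, new):
        """Compare two {relative_path: checksum} mappings for one dataset.

        Returns the paths that were added, removed, and changed between
        the old and new inventories.
        """
        added   = sorted(p for p in new if p not in old)
        removed = sorted(p for p in old if p not in new)
        changed = sorted(p for p in new if p in old and new[p] != old[p])
        return added, removed, changed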

> The replication agent is interested in
> these notifications thus should be defacto subscribed to getting such
> notification messages. 

This is possible.

> When the replication agent is notified it would 
> look on its system and see if the notification is something in its
> list to have a replica of, its "replication list".  If so it can pull
> down the catalog or some subset (diff) of that catalog, or simply the
> necessary tuple to find the location(s) of holders of the newest
> catalog.  The catalog will always have in it its authoritative source
> (dataset name and gateway).  This can be resolved to the actual data
> node that has the new version of that catalog (and any other replicas
> that are up-to-date).  Then it is the job of the replication agent that
> wants to be updated to contact the authoritative data-node or any
> up-to-date replica holder and basically sync catalogs.  Syncing catalogs
> means grabbing the latest catalog, from the authoritative source or an
> updated data-node replica, and diffing it with the stale catalog it
> currently has... the result of the diff is the set of files and such that
> need to be transferred in order to make the state of the stale node
> equivalent to the state of the latest catalog.  It is the catalog that
> contains the 'inventory' and all other necessary information.  

This is possible. But it's not the simplest mechanism. Simple is good.

The simplest mechanism is to avoid the gateway, and recognise that, either in the db or in the TDS view, the ESG publisher has (or could have) the information to directly produce a "diff update" with respect to some previous time. That diff, aka inventory, could be a list of files and checksums. And the collection of such diffs is what a replication agent is interested in exploiting.
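
Putting the two little sketches above together, the agent side of that
approach might look roughly like the following. The base_url, the use of
plain HTTP and the function names are illustrative assumptions on my part
(in practice the transfer would presumably be GridFTP or whatever fabric
we settle on):

    import os
    try:
        from urllib.request import urlretrieve   # Python 3
    except ImportError:
        from urllib import urlretrieve            # Python 2

    def sync_dataset(base_url, source_inventory, local_dir):
        """Bring local_dir up to date against a source inventory.

        Only the files the diff says are new or changed get fetched.
        Files the source has withdrawn are returned rather than deleted,
        since what to do with them is a policy decision.
        """
        local = build_inventory(local_dir)
        added, removed, changed = diff_inventories(local, source_inventory)
        for rel in added + changed:
            target = os.path.join(local_dir, rel)
            parent = os.path.dirname(target)
            if parent and not os.path.isdir(parent):
                os.makedirs(parent)
            urlretrieve(base_url.rstrip("/") + "/" + rel, target)
        return removed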

In the abstract view both approaches are the same.  The question we need to answer is which is easiest to implement, and less likely to fail.

> Once 
> files are transferred, integrity checking can be accomplished at a few
> levels.  The first is to have the stale node generate its own catalog and
> then check it against the reference (up-to-date) catalog it got from the
> source. If replication has been done successfully they should be
> identical!  The catalog should have a 'header' portion that contains the
> checksum of the immutable portion 'body' of the catalog.  The first
> level integrity check would be to see if what is generated and the
> reference are the same; if not, a second level check is required that
> walks the catalog's (xml) tree and compares the two trees.  It is in
> the latter check where individual file entries are checked to detect
> what files may need to be fetched again.  Also if the connection goes
> down or fails in some way, generating a catalog over the partial set of
> files that have already been downloaded, and comparing it with the
> source catalog will tell the replication agent where to pick up from.

I agree that, whatever mechanism one uses to establish and carry out replication, this step, pretty much as you describe it, is necessary.
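
In the file-list view, that check is just the same comparison again, run
over what actually landed on disk (reusing the sketches above; again the
name is hypothetical):

    def verify_replica(local_dir, reference_inventory):
        """Re-checksum the local copy and report what still needs fetching.

        An empty result means the replica matches the reference; after an
        interrupted transfer, the result is exactly the list to resume from.
        """
        local = build_inventory(local_dir)
        return sorted(rel for rel, checksum in reference_inventory.items()
                      if local.get(rel) != checksum)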

> The source catalog could be cached by the replication agent and then
> purged after replication is done.  Or, to be more up-to-date, it can refetch
> a catalog from any in the list of already up-to-date replica holders.
> 
> The model is consistent.  Perhaps what needs to happen is for every part
> of this system to be able to parse and glean information from the catalog.

Which is a bigger job ... will take longer ... and is riskier.

> There are system tweaks and optimizations that can be made (Ex:
> subscribing to be notified for specific entities or doing a general
> subscription blast.  Refetching latest catalog from source or up-to-date
> replicas vs holding on to the source you already have - a question of
> freshness, etc...).  But the model of being catalog centric is
> consistent and complete.  I think this is the direction we should go in
> if we want this system to be scalable and provide clean interfacing of
> the different parts. 

If one wants something scalable and supportive of future evolution, one uses building blocks which are as dumb as possible. I think using the catalog view *all* the way through is quite smart, and therefore riskier ... and potentially less accommodating of change.

Cheers
Bryan

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

