[Go-essp-tech] Replication call: agenda and document

Gavin M Bell gavin at llnl.gov
Wed Nov 4 14:07:42 MST 2009


Hello Bryan,

Thanks for the feedback!! :-)  My response is interleaved below.

Bryan Lawrence wrote:
> Hi Gavin
> 
> Thanks for your email, it's good to get this stuff out in the open .... I've picked over it in some detail below.  I've chosen to use a Socratic (confrontational) manner, but don't take it personally; I find that a direct argument tends to get the discussion focussed quickly on the issues (where I'm as likely to be wrong as right :-).
> 
>> The main idea I want to get across is for us to have a *catalog-centric*
>> view of the system.  It is the catalog that is the primary currency of
>> the system.
> 
> I think where we are going to disagree is about the definition of the system, and on the use cases of interaction with the system.  
> 
> But before I disagree, let me agree vociferously. From a *user* perspective, you are absolutely right. If you dig into any of my history, you'll see I've banged on ad nauseam about the issue of getting metadata (aka catalogue for some cases) views right ... but I've also introduced a taxonomy of metadata (e.g. doi:10.1098/rsta.2008.0237) to deal with the fact that there can *never* be "one ring to rule them all", because the way we interact with systems is different.
> 
> For example, even in the ESG perspective you have the TDS catalogs, and you have the gateway views ... (integrating rather more metadata etc). Clearly these are both useful catalogs, but they're not the same catalog.
> 

Indeed, I defer to you that there may indeed be no catalog / metadata
panacea for representing all the things that the system may deal with.
But I think it is critically important to deal with today's problem
while building the system with enough wisdom that we don't paint
ourselves into a corner.  Right now ESG is dealing with catalogs, more
specifically the very amenable schema that defines the THREDDS catalog.

"Let's not make the perfect, the enemy of progress" B. Obama :-)

> I think we would both agree that the best way of building a complex system is to link together simple systems ... and  as you rightly note a catalog is an incredibly useful view on a system, and that you can build a more complex system out of the catalogs and interactions with them.
> 

nod

> The question not addressed however, mainly because it can't be by anyone but us, is how the ESG system will interact with my *ALREADY EXISTING SYSTEM*. Not to put too fine a point on it, but we already have a catalog, with hundreds of TB of data, ten thousand plus users etc ... and established operational procedures for managing data (and in particular, ingesting data), and the ESG system is simply not mature enough to replace it, so it's going to have to work with it.
>

In this case, we should be careful to separate the packaging of the
data from the semantics of the data.  The THREDDS catalog format
(again, very flexible and simple) is simply one way of looking at the
data; if the information being captured has the same semantic meaning,
then this is "but a simple matter of code" ;-) - as my old adviser
would always say to me.

The ways of dealing with this would be:
- write translator code, or
- write an uber-wrapper encapsulating both formats (or any number of
formats), where the particular format is the payload (just like IP) -
there is nothing new under the sun.

The primary idea that I am trying to convey is that the representation
of the data should be recognized and used by each of the building
blocks.  This puts interoperability at the level of the data transport
format rather than at any particular API call, where you end up
nickel-and-diming the API over the right parameter lists, etc.
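
To make the wrapper idea concrete, here is a toy sketch (Python; none
of these names are real ESG code) of a format-tagged envelope whose
carrier, like IP, never looks inside:

    import json

    def wrap(format_name, payload):
        """Package a catalog, in whatever format, for transport."""
        return json.dumps({"format": format_name, "payload": payload})

    def unwrap(envelope):
        """The receiver dispatches on the format tag to the right parser."""
        msg = json.loads(envelope)
        return msg["format"], msg["payload"]

    env = wrap("thredds-1.0", "<catalog>...</catalog>")
    print(unwrap(env))  # ('thredds-1.0', '<catalog>...</catalog>')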

> Even on the smaller scale of the ESG data node, folks will be producing data in their production environments and moving it into the data node environment ... so if you like, the introduction of data *into* the ESG system is outside of scope of the ESG system ... (by definition).
> 

Well, we have to *do* something.  We have to pick a place to start.  And
the work at the data node to glean the metadata out of netCDF files is,
I think, a pretty darn good place to start.  Ultimately we need a
system that provides a service to our end users; we are providing a
product, not a framework.  First we build it, then we make it better...
right?

> So, a priori, the construction of a new version of data is out of scope for ESG ... the question on the table then is how best to deal with replicating a new (or first) version of data. (Maybe I didn't need to say all of the above, but it's important to put things in context ...)
> 

I certainly appreciate the context! :-)

> I used the phrase "inventory" on the telco, deliberately, because it could be a catalog, or it could be a more simple file list, along with checksums ... (does the catalog have all the file checksums in it? Which catalog?)
> 

The catalog is not so complex.  It is rather straightforward and makes
sense; I think if you take a look at it you'll see it is quite
sensible.  And yes, it does have checksums (though they are optional, to
accommodate legacy data that may want to be represented in the system...
checksums are expensive to compute, especially over large files, so
computing them wholesale a posteriori would be prohibitive).  However, I
wholly expect that new data moving forward will all be checksummed, as
good data citizens. :-)
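
To give a feel for how lightweight the gleaning is, here is a toy
sketch using Python's standard ElementTree.  The inline catalog and the
"checksum" property name are my illustrative assumptions about how the
publisher records things, not a schema reference:

    import xml.etree.ElementTree as ET

    TDS = "{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}"

    CATALOG = """\
    <catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
      <dataset name="tas_A1.nc" urlPath="cmip/tas_A1.nc">
        <property name="checksum" value="2f1e9aa..."/>
      </dataset>
      <dataset name="legacy_pr.nc" urlPath="cmip/legacy_pr.nc"/>
    </catalog>"""

    def checksums(catalog_xml):
        """Yield (urlPath, checksum-or-None) for every dataset entry."""
        root = ET.fromstring(catalog_xml)
        for ds in root.iter(TDS + "dataset"):
            props = {p.get("name"): p.get("value")
                     for p in ds.findall(TDS + "property")}
            yield ds.get("urlPath"), props.get("checksum")  # optional!

    for path, csum in checksums(CATALOG):
        print(path, csum or "<no checksum>")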

>> - The catalog gets generated and published from the data via the
>> data-node/publisher to the gateway.
> 
> Which catalog. The TDS catalog?

Yes (at the moment... we have to choose some representation, and
THREDDS was a good one).

> 
>> - The gateway is simply, in the context of this model, a searchable
>> index over a collection of catalogs.
> 
> We're not interested in search for this use case. So, as you say, we have a collection of TDS catalogs.
> 

Just giving a breakdown of the system.  End users care very much about
being able to find the data they are looking for.

>> - Changes to catalogs are what is versioned.
> 
> I don't agree with the perspective. I agree with the result :-) Atomic datasets are versioned, and that versioning is reflected in the catalog.
> 

Now you are just toying with me :-).

>> - Changes to catalogs are what trigger notifications
> 
> I have no problem with this, as long as we remember that the "replicating sites" are not in the same category as "users" ... not least because you don't want your metrics of downloads to include replication downloads ... also, one is "desirable", the other is "essential".
> 

Indeed, though at varying levels of abstraction they can be thought of,
in broad brush strokes, as the same, insofar as they are both entities
being notified.  Remember that after we ship around these disks to
bootstrap the system, notification and replication become the updating
mechanisms for data churn, versus wholesale tabula-rasa standup.
Agreed?

Not including replicas in the metrics is a horse of a different color;
we'll deal with that when we deal with metrics. TBD

>> - Replication should be about replicating catalogs, where file
>> transfers are the necessary side-effect of proper catalog replication.
> 
> Sorry, again I think this perspective can lead to muddled thinking.  One could forget what the catalog is for. It is to enable the users to access data. Catalogs are a necessary mechanism for making it possible for users to access data. They might also be a necessary component of a replication system. (Catalogs such as the ESG catalog are NOT data management systems. I've learnt that the hard way with my team, where I used to think they could be.)
> 

Catalogs are not a data management system, correct.  They are a data
representation that contains all the salient information from which a
replication agent can get its necessary data.  What the catalog is for
is wholly dependent on who is using the catalog!  To a data provider
(publisher) it captures the metadata that they wish to post.  To a
replication agent it represents the list of files as a logical unit.  To
another actor in the system it may mean something else.  This
underscores my thesis of having the catalog be thought of as the
currency of the system: it carries the full fidelity of the information
(data), so that the interoperating parts of the system have a rich
lingua franca.  This facilitates interoperability.  Much like stringing
together unix programs through pipes, the catalogs are our stdin/stdout
format.
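
To push the pipe analogy, a toy filter might look like this (Python;
the chained script names are hypothetical, not real ESG tools):

    #!/usr/bin/env python
    # Read a catalog on stdin, glean/annotate, write a catalog on
    # stdout, so components chain like unix pipes:
    #
    #     publish.py | replicate.py | notify.py
    #
    import sys
    import xml.etree.ElementTree as ET

    def main():
        catalog = ET.fromstring(sys.stdin.read())  # parse incoming catalog
        # ... glean whatever this component needs from `catalog` ...
        sys.stdout.write(ET.tostring(catalog, encoding="unicode"))

    if __name__ == "__main__":
        main()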

>> It is the catalog that is the central 'document' that we are interested
>> in.  It is the single entity that contains the necessary information
>> used in all levels of this system.
> 
> I think this way leads to complex systems. I'm minded to think of a unix system being made up of lots of things that all work well on their own. Or a network layered model ... TCP doesn't know about ethernet ... the central thing of interest is the data!
> 

Indeed the central thing of interest is the data! I agree, this is my
whole point. :-)

> So, frankly, what I'm ultimately interested in is the data. I'm interested in the ESG catalog IFF it adds value to what I can already do (it does), but it won't supplant it, because it won't drive my operational data management system.  So clearly, the (ESG) catalog doesn't control everything in the ESG *federation* ... (if it were to, we would have had rather a lot of influence over its functionality).
> 

We have to start somewhere.  First we solve the problem at hand and
make our customers happy with the product; then we make everyone else
happy.  The focus must be the end-user, IMHO.  We can make the most
fantastically beautiful system, but if no one uses it, then what's the
point?  So, we are making an omelet... eggs anyone?

>> The very good point that was brought up on the call was, what is the
>> interface between parts of the system?  It has become clear to me that
>> if each part of the system understood the catalog then they could
>> operate quite well, gleaning the information out of catalogs.
> 
> I don't doubt this point. As I said, my "inventory" *could* be a catalog document. But I'm picking the bones out of your email, to expose assumptions :-) :-)
> 

Ah that's the spirit! :-)

>> The topic today was replication:
>> So... In a catalog-centric model, the question of replication becomes
>> simply: what datasets have changed?
> 
> Yes.
> 

It's a love fest at this point.

>> This is equivalent to asking, what catalogs have changed? 
> 
> Sort of. More particularly, what entries in the catalog have changed. Which is also equivalent to saying "What files have changed?" I can do that with rather fewer lines of code than I can by parsing the TDS catalog. (I'm not saying I should do the former, but one has to ask what the value proposition is for doing the latter, given we can *and must* rebuild the catalog on ingestion at the remote site.)
> 
>> The replication agent is interested in
>> these notifications and thus should be de facto subscribed to such
>> notification messages.
> 
> This is possible.

nod

> 
>> When the replication agent is notified, it would
>> look on its system and see if the notification is something in its
>> list to have a replica of, its "replication list".  If so, it can pull
>> down the catalog or some subset (diff) of that catalog, or simply the
>> necessary tuple to find the location(s) of holders of the newest
>> catalog.  The catalog will always have in it its authoritative source
>> (dataset name and gateway).  This can be resolved to the actual data
>> node that has the new version of that catalog (and any other replicas
>> that are up-to-date).  Then it is the job of the replication agent that
>> wants to be updated to contact the authoritative data node, or any
>> up-to-date replica holder, and basically sync catalogs.  Syncing catalogs
>> means grabbing the latest catalog, from the authoritative source or an
>> updated data-node replica, and diffing it with the stale catalog it
>> currently has... the result of the diff is the set of files that
>> need to be transferred in order to make the state of the stale node
>> equivalent to the state of the latest catalog.  It is the catalog that
>> contains the 'inventory' and all other necessary information.
> 
> This is possible. But it's not the simplest mechanism. Simple is good.
> 
> The simplest mechanism is to avoid the gateway, and recognise that either in the db or the TDS view, the ESG publisher has (or could have) the information to directly produce a "diff update" with respect to some previous time. That diff, aka inventory, could be a list of files and checksums. And the collection of such diffs is what a replication agent is interested in exploiting.
> 
> In the abstract view both approaches are the same.  The question we need to answer is which is easiest to implement, and less likely to fail.
> 

<not saying a word>
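
(Well, perhaps a few words in code.  The heart of "syncing catalogs" is
the diff; here is a self-contained toy, in which every name is invented
for illustration:)

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entry:
        name: str
        checksum: str

    def catalog_diff(stale, fresh):
        """Entries in the fresh catalog whose (name, checksum) pair is
        absent from the stale one -- the files still to be fetched."""
        have = {(e.name, e.checksum) for e in stale}
        return [e for e in fresh if (e.name, e.checksum) not in have]

    stale = [Entry("tas.nc", "aaa"), Entry("pr.nc", "bbb")]
    fresh = [Entry("tas.nc", "aaa"), Entry("pr.nc", "ccc"),
             Entry("huss.nc", "ddd")]
    print(catalog_diff(stale, fresh))  # pr.nc changed, huss.nc is new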

>> Once
>> files are transferred, integrity checking can be accomplished at a few
>> levels.  The first is to have the stale node generate its own catalog and
>> then check it against the reference (up-to-date) catalog it got from the
>> source.  If replication has been done successfully, they should be
>> identical!  The catalog should have a 'header' portion that contains the
>> checksum of the immutable 'body' portion of the catalog.  The first
>> level integrity check would be to see if what is generated and the
>> reference are the same; if not, a second-level check is required that
>> walks the catalog's (xml) tree and compares the two trees.  It is in
>> the latter check where individual file entries are checked to detect
>> what files may need to be fetched again.  Also, if the connection goes
>> down or fails in some way, generating a catalog over the partial set of
>> files that have already been downloaded, and comparing it with the
>> source catalog, will tell the replication agent where to pick up from.
> 
> I agree that regardless of whatever mechanism one uses to establish and carry out replication, this step, pretty much as you describe, is necessary. 
> 
>> The source catalog could be cached by the replication agent and then
>> purged after replication is done.  Or, to be more up-to-date, it can
>> refetch a catalog from any in the list of already up-to-date replica holders.
>>
>> The model is consistent.  Perhaps what needs to happen is for every part
>> of this system to be able to parse and glean information from the catalog.
> 
> Which is a bigger job ... will take longer ... and is riskier.
> 

If you properly circumscribe the parsing of the data format (catalog)
from the mechanisms that use the gleaned data, well then it is not that
risky.  Also, changing the data transport format becomes a matter of
changing that portion of the code, which can certainly be made pluggable
and arbitrarily automatic.  (See JINI - a great system before its time -
with regards to its publication and use of active code for protocol
handling.  Apple's Rendezvous protocol took huge inspiration from it.)
Essentially, other data formats could be supported with minimal system
impact if we want to change the format or support additional formats,
while keeping the semantics of the data.  If the semantics of the data
changes, well then "it is but a small matter of coding" :-).
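
Concretely, the circumscription might look like this minimal sketch
(all class and function names here are hypothetical):

    from abc import ABC, abstractmethod

    class CatalogFormat(ABC):
        """The circumscribed bit: only subclasses know a wire format."""
        @abstractmethod
        def parse(self, raw):
            """Turn raw catalog text into the common in-memory model."""

    class ThreddsFormat(CatalogFormat):
        def parse(self, raw):
            return {"files": []}  # real XML parsing elided

    FORMATS = {"thredds": ThreddsFormat()}  # new formats plug in here

    def replicate(raw, fmt="thredds"):
        model = FORMATS[fmt].parse(raw)  # the only format-aware line
        # ... the replication mechanism sees only the common model ...
        return model["files"]

    print(replicate("<catalog>...</catalog>"))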

>> There are system tweaks and optimizations that can be made (Ex:
>> subscribing to be notified for specific entities or doing a general
>> subscription blast.  Refetching latest catalog from source or up-to-date
>> replicas vs holding on to the source you already have - a question of
>> freshness, etc...).  But the model of being catalog-centric is
>> consistent and complete.  I think this is the direction we should go in
>> if we want this system to be scalable and provide clean interfacing of
>> the different parts. 
> 
> If one wants scalable, and supportive of future evolution, one uses building blocks which are as dumb as possible. I think using the catalog view *all* the way is quite smart, and therefore, riskier ... and potentially less admitting of change.
> 

I think my last comment goes a long way to demonstrating how we can
mitigate the risks. :-)

> Cheers
> Bryan
> 

Thanks again for the input.  The more we exorcise / exercise these ideas
the better the product is for it.


May I remind you, I am but a single voice in this community.  I just
think that we should all make the effort to look at the whole picture to
better inform our individual parts.

All the best.

(pardon any gaffes in typing, my fingers can't keep up with my head)
:-)

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo


