[Go-essp-tech] Proposed version directory structure document

Tue May 4 08:40:17 MDT 2010

Hi Gavin, Stephen,
please let me add a few comments:

o) As we start thinking about interoperating with existing data centers that do not necessarily have the full ESG stack installed, I think we should also consider the use case where the THREDDS catalogs are already existing, i.e. they are not generated by the publisher. So, I don't think we can necessarily assume that the publisher is the only editor of the files.

o) Along the same lines, I can also imagine the use case where THREDDS catalogs are used as an XML interchange format, but are not necessarily used by a TDS to serve the data. For example, you can imagine a center having an already existing FTP site, and wanting to expose its holdings for ESG harvesting by producing metadata-enriched THREDDS catalogs that reference the FTP URLs. These catalogs can be harvested by the Gateway and return search records that point back to the FTP data center.
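
To make the interchange use case concrete, here is a minimal sketch of a script a data center might run to describe its FTP holdings as a THREDDS-style catalog. It assumes the InvCatalog 1.0 namespace and an `FTP` service type; the function name `build_ftp_catalog`, the service name `ftp`, and the example host are hypothetical, and a real harvestable catalog would carry much more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"


def build_ftp_catalog(name, ftp_base, paths):
    """Return catalog XML text whose leaf datasets resolve to FTP URLs."""
    ET.register_namespace("", THREDDS_NS)
    catalog = ET.Element(f"{{{THREDDS_NS}}}catalog", {"name": name})
    # One service entry: the base URL is prepended to each dataset's urlPath.
    ET.SubElement(
        catalog,
        f"{{{THREDDS_NS}}}service",
        {"name": "ftp", "serviceType": "FTP", "base": ftp_base},
    )
    top = ET.SubElement(catalog, f"{{{THREDDS_NS}}}dataset", {"name": name})
    for path in paths:
        ET.SubElement(
            top,
            f"{{{THREDDS_NS}}}dataset",
            {
                "name": path.rsplit("/", 1)[-1],  # file name only
                "urlPath": path,                  # path relative to the service base
                "serviceName": "ftp",
            },
        )
    return ET.tostring(catalog, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host and path, for illustration only.
    print(build_ftp_catalog("demo", "ftp://ftp.example.org/pub/", ["a/b/file1.nc"]))
```

Nothing here requires a TDS: the catalog is only a description that a harvester could read and turn into search records pointing back at the FTP site.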

I think it would be important to build this kind of flexibility into the system, even if it goes beyond the pure ESG/CMIP5 scope (which does remain our most important use case).

thanks, Luca
________________________________________
From: go-essp-tech-bounces at ucar.edu [go-essp-tech-bounces at ucar.edu] On Behalf Of Gavin M Bell [gavin at llnl.gov]
Sent: Tuesday, May 04, 2010 6:50 AM
To: Stephen Pascoe
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Proposed version directory structure document

Hi Stephen,

The publisher is the only "editor" of files.  When the publisher
publishes it is essentially "saving" a new version of the catalog.  In
this context the catalog that is published has two important features:
its version number and the origin of the publication.  With the version
number in the file we can easily inspect the version of the file
(visually, or via the database that the data node already has, which
holds this information). As I mentioned, we have a single-editor
(publisher) system, which simplifies our versioning task.  In the GIT
case I suspect we would additionally store the SHA1 hash of the catalog
along with the version number in the data node's database.
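
As a rough sketch of the bookkeeping described above, the following records a catalog's version number and SHA1 digest in a database. The table name `catalog_version`, its columns, and the use of SQLite are assumptions for illustration; the data node's real schema is not shown in this thread. Note that this is a plain SHA1 of the file bytes, which differs from the object id Git computes (Git hashes a short header plus the content).

```python
import hashlib
import sqlite3


def record_catalog(conn, catalog_id, version, catalog_bytes):
    """Store the version number and SHA-1 digest of one published catalog."""
    digest = hashlib.sha1(catalog_bytes).hexdigest()
    # Hypothetical table; one row per (catalog, version) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog_version ("
        "catalog_id TEXT, version INTEGER, sha1 TEXT, "
        "PRIMARY KEY (catalog_id, version))"
    )
    conn.execute(
        "INSERT INTO catalog_version VALUES (?, ?, ?)",
        (catalog_id, version, digest),
    )
    return digest


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(record_catalog(conn, "example.catalog", 1, b"<catalog/>"))
```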

When the publisher runs it generates actual thredds catalogs that are
then ingested into thredds.  The gateway fetches the catalogs, that have
been ingested by thredds, and indexes them for search.

Well, I have gone over this with Bob and indeed, quite rightly, there are
two kinds of catalogs.  This is indeed the rub. The catalog generated
in the scanning process, local to a node, is a 'configuration'
catalog, and the ones presented to external nodes are the
external catalogs that everyone knows and loves. The intention has
always been to version the external (served by thredds) catalogs only.

So to your second point: The version controlled catalogs would be the
external THREDDS catalogs that are created by THREDDS from the
*publisher*'s input to thredds (the 'configuration' catalogs).

Indeed the devil is in the details, so I think a prototype is in order.
 I will come up with something we can all poke at soon.

Fun stuff! :-)

Stephen Pascoe wrote:
> Hi Gavin,
>
> I have one concern, as I mentioned briefly before. As it stands the
> datanode produces THREDDS catalogues as output from either a directory
> scan or a map file: esgpublish does not take THREDDS catalogues as
> input. So where would the publisher fit in to your replication scenario?
>
> It is for this reason that I don't think it is obvious that our DVCS
> catalogues are the same as the THREDDS catalogues used to initialize
> TDS, although they are clearly prime candidates. For instance, even if
> we extended esgpublish to take THREDDS as input I doubt they would
> round-trip unchanged without care.
>
> A prototype sounds a very good idea.
>
> Cheers,
> Stephen.
>
> --
> Stephen Pascoe
>
>
> On 3 May 2010, at 20:48, Gavin M Bell <gavin at llnl.gov> wrote:
>
>> Hi,
>>
>> Because of the way our system is designed, as I mentioned before, with a
>> single editor and a single publisher, there is no merging and there is
>> indeed ONE person that has the ground truth of the *latest* version.
>>
>> So entertain the following scenario...
>>
>> From the system-side of things: (at a 15000 ft level)
>>
>> Phase 1: "The handshake: What's new?"
>> Domain A wants to make sure it has the latest *catalogs* from Domain B,
>> for the *catalogs* it cares about.  Domain A asks Domain B for the
>> latest versions of these *catalogs*.  Domain B provides this list of
>> latest versions of said *catalogs* back to Domain A iff there is a
>> latest version to get. Domain A initiates a pull from Domain B of these
>> newest catalog files.
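
The Phase 1 handshake can be sketched as a comparison of version tables. This is only an illustration: it assumes versions are comparable integers and that each domain can report a mapping of catalog id to latest version. The function name `newer_catalogs` is hypothetical.

```python
def newer_catalogs(local_versions, remote_versions):
    """Return {catalog_id: remote_version} for catalogs Domain A should pull.

    local_versions:  catalogs Domain A cares about, with the versions it holds.
    remote_versions: latest versions reported by Domain B.
    """
    updates = {}
    for catalog_id, local_version in local_versions.items():
        remote_version = remote_versions.get(catalog_id)
        # Report a catalog only if B has it and B is strictly ahead of A.
        if remote_version is not None and remote_version > local_version:
            updates[catalog_id] = remote_version
    return updates


if __name__ == "__main__":
    domain_a = {"cat_alpha": 1, "cat_beta": 2}
    domain_b = {"cat_alpha": 3, "cat_beta": 2, "cat_gamma": 9}
    print(newer_catalogs(domain_a, domain_b))
```

Catalogs that Domain A does not track (here `cat_gamma`) are ignored, matching the "catalogs it cares about" wording above.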
>>
>> Phase 2: "The 'realization' of these catalogs"
>> (this is where the catalog centric model's rubber hits the road)
>> This realization is basically inspecting the catalog and
>> reconciling the data files that the catalog has in its list with what is
>> on its file system.  This is going basically back to the email I sent
>> at the beginning of this thread.  The system reconciles what it needs to
>> make the current 'latest' catalog "true".  There is a list of files that
>> fall out of this that can (will) be fed to the fetching mechanism (BDM
>> in our case) to pull down these files into the prescribed directory. I
>> suggested a .esg_data_files directory at the same level as the catalog
>> where its files are kept.  This concept of dealing with catalogs for
>> versioning is separate from the transfer of the datafiles that
>> constitute them. Getting the catalog to be "true" is out of band of any
>> version control system, though the catalogs themselves are version
>> controlled.
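
A minimal sketch of the Phase 2 reconciliation, assuming the catalog has already been parsed into (file name, checksum) pairs and that checksums are MD5 hex digests (the thread does not say which algorithm is used). The function name `files_to_fetch` is hypothetical; its result is the list that would be handed to the fetching mechanism.

```python
import hashlib
from pathlib import Path


def files_to_fetch(entries, data_dir):
    """Return the catalog's file names that are missing or differ locally.

    entries:  iterable of (file_name, expected_md5_hex) pairs from a catalog.
    data_dir: directory holding the data files, e.g. ".esg_data_files".
    """
    needed = []
    for name, expected in entries:
        path = Path(data_dir) / name
        if not path.is_file():
            needed.append(name)
            continue
        # Reads the whole file at once; large files would need chunked hashing.
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if actual != expected:
            needed.append(name)
    return needed
```

Version control never sees the data files here: the catalog is the input, and the output is only a to-do list for the out-of-band transfer.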
>>
>> This is also my proposal for how replication should be done.
>>
>> This scenario begs a bootstrapping question for what happens at "phase
>> 0" to support the initial contact between Domain A and Domain B... At
>> the moment it could be a potentially charged issue, so I am punting on
>> that.
>>
>> Anyway, I hope my attempt illustrates the separation between where
>> a DVCS would be applied (only with catalog xml files) and where data
>> file fetching would be done (via another out of band mechanism - Ex:
>> BDM).  I also hope I conveyed the elegance of this approach as it speaks
>> to both versioning and replication (and user copying/downloading).
>>
>> Thoughts?
>>
>> Perhaps I should prototype this and share it?
>>
>> martin.juckes at stfc.ac.uk wrote:
>>> Thanks, I think I understand a bit better now: the idea is to have a
>>> single GIT repository, and the files deposited in the repository are
>>> THREDDS catalogues? This does mean, I think, that the repository
>>> version is not much use to a user trying to track changes in a small
>>> subset of the THREDDS catalogues. There is a section in the GIT
>>> wikipedia page ( http://en.wikipedia.org/wiki/Git_%28software%29 )
>>> which suggests that GIT is quite inefficient at getting the version
>>> history of deposited files (i.e. THREDDS catalogues). I'm not sure
>>> this matters, but we need to check the consequences of the different
>>> use cases: GIT is clearly designed for the situation where users will
>>> generally be interested in a complete version of the archived
>>> package, whereas we are dealing with a case in which users will
>>> generally only want a small part of it.
>>>
>>>
>>> cheers,
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Sent: Sun 02/05/2010 16:22
>>> To: Juckes, Martin (STFC,RAL,SSTD); Gavin M Bell
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: RE: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>>
>>> Martin,
>>>
>>> I think Gavin uses the term catalogue to mean an aggregation of files
>>> similar to a THREDDS catalogue, not the entire archive.  Each
>>> catalogue would represent a realm-dataset.  As you say, that's what
>>> we need.
>>>
>>> What is probably confusing is that Gavin suggests a single GIT
>>> repository holding all catalogues (the repository can be replicated
>>> throughout the system -- that's what makes GIT a distributed version
>>> control system).  He also discusses how catalogues would be mapped
>>> onto versions of files, which would need their own internal
>>> identifiers to make the system work.
>>>
>>> As far as I can see the user would be aware of only one type of
>>> "version".
>>>
>>> Stephen.
>>>
>>>
>>> -----Original Message-----
>>> From: Juckes, Martin (STFC,RAL,SSTD)
>>> Sent: Sat 5/1/2010 6:43 PM
>>> To: Gavin M Bell; Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: RE: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>>
>>> Hello Gavin, Stephen,
>>>
>>> I haven't been following this discussion, so the following concern
>>> may well have been dealt with. It looks to me as though you are
>>> discussing a system with two levels of versioning: versions of
>>> individual files, and a version of the entire catalogue, which will
>>> increment every time any files are changed. This, I think, leaves too
>>> big a gap in which it is difficult for users to specify which set of
>>> files they have used. If someone uses a few thousand files, a change
>>> in the catalogue version doesn't tell him if these files have been
>>> changed, and listing all the file versions is not a useful option in
>>> publications and correspondence -- so we need versioning at
>>> intermediate levels such as published units and atomic datasets as
>>> well as at file and catalogue level,
>>>
>>> cheers,
>>> Martin
>>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu on behalf of Gavin M Bell
>>> Sent: Fri 30/04/2010 19:03
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>> Hi Stephen,
>>>
>>> I am glad that my scheme is making you warm and fuzzy, I hope the rest
>>> of the team is also on board.  I too am enamored with the simple
>>> elegance of it, if I do say so myself. :-)
>>>
>>> The thing is... my scheme is essentially that we *only* version control
>>> *catalogs* with DVCS (GIT).  Catalogs are simple text xml files. Each of
>>> those files is not that big, and the number of catalog files is well
>>> within the supported range for GIT.  With respect to catalogs... there
>>> are no BIG catalogs (files) in the context of anything that would be
>>> prohibitive for vanilla GIT.  The key bit of niceness is that the esg
>>> system is pushing off the storage and durability of the actual data
>>> files (the big files) to the institutions themselves.  Between local
>>> institution durability duties and replication we can be somewhat safe
>>> that we won't 'lose' data. The only thing the esg system will explicitly
>>> version are the catalogs themselves, which in turn point to the specific
>>> hard data files (netcdf) living on disk and replicated.
>>>
>>> In short; GIT with no additional bells and whistles should be able to
>>> handle all our ESG catalogs.  Note: There is one GIT repo per datanode.
>>>
>>> ...
>>>
>>> Some Things To Think About:
>>> There are some things that would need to be changed like - the catalog
>>> naming scheme.  If catalogs are version controlled then we no longer
>>> need to version files by explicitly naming them i.e.
>>> foo_catalog_v{1..n}. But, we *should* have that version value be put
>>> *in* the file itself.  Thus quick inspection of the file can give you
>>> the ESG version value, while the VCS sees a single filename entity to
>>> version control.  (I'll have to talk to Bob on that one.)  Also, thus
>>> far we are not using the "D" part of the VCS.  In order to do so we
>>> would have to a) flatten the file hierarchy (or at least settle on a
>>> consistent one) this would additionally facilitate the ability to
>>> divorce the catalog placement from the filesystem hierarchy - this is
>>> where a simpler version of your link idea would come to bear - or b)
>>> interrogate this ourselves (via code we write) as we do version
>>> negotiating among federated entities.
>>>
>>> I'd be happy to discuss this more.
>>>
>>>
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>> Hi Gavin,
>>>>
>>>> If we go down the DVCS-catalogue route you might be interested to note
>>>> that Mercurial already has an extension that does something very
>>>> similar to what you are proposing with GIT.
>>>>
>>>> http://mercurial.selenic.com/wiki/BigfilesExtension
>>>>
>>>> Maybe we need something more bespoke, but it's a useful reference.  The
>>>> more I think about version-controlled catalogues the more it appears to
>>>> solve some of our problems (particularly replication).
>>>>
>>>> S.
>>>>
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> British Atmospheric Data Centre
>>>> Rutherford Appleton Laboratory
>>>>
>>>> -----Original Message-----
>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>> Sent: 26 April 2010 18:33
>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>>> document
>>>>
>>>> Hi Stephen,
>>>>
>>>> I hope you are well. :-)
>>>>
>>>> Disclaimer:
>>>> Okay, let me get straight at it... :-)  Please pardon the length of
>>>> this response. I do tend to get a bit garrulous when I am trying to
>>>> tackle technical issues. :-\ sorry.  And pardon typos; I get a bit
>>>> James Joycean at times too.
>>>>
>>>> (take a deep breath.... :-)
>>>>
>>>> So, the catalog centric model does not impose any constraints on the
>>>> file system.  You can move the catalogs as you would any file on the
>>>> file system.  Indeed there will be a tool made available (I think I am
>>>> just going to write it and be done with it) so that the physical
>>>> constituent files move along with the logical catalog as well.  There is
>>>> no file system lock in at all, it's just a meta layer to essentially
>>>> group files.
>>>>
>>>> The source of catalog information is the catalog.  The database is a
>>>> nice tool to do data manipulations and queries but the ground truth is
>>>> the file that is the catalog IMHO.  As long as the catalogs are
>>>> identical (checksums match) then it doesn't matter where the catalog
>>>> comes from, ipso facto. It is my proposed plan that the data node manager
>>>> will provide the handshaking between/among data nodes; this is
>>>> something that will be leveraged by the replication agent, and thus we
>>>> can somewhat think of catalogs as being eventually consistent.  This can
>>>> be upper bounded (sort of) based on the type of event propagation
>>>> protocols we want to use.  For the moment I am looking at using a gossip
>>>> protocol among nodes for addressing this issue, piggybacking on GIT's
>>>> distributed syncing.  The fact that we have a single editor system
>>>> simplifies things greatly!!!
>>>>
>>>> W.r.t. GIT, indeed there are limits. I guess what would be good is to
>>>> get a back of the envelope calculation on how many catalogs there would
>>>> be per data node, just by order of magnitude.  I suspect at that
>>>> catalog (aggregation) level we will be well in the safe zone for GIT.
>>>>
>>>> As for the DRS, indeed you can lay out your filesystem as prescribed by
>>>> the DRS exactly, or you can relax that requirement by having filters in
>>>> between the caller and storage that will translate to and fro between
>>>> the DRS layout and the filesystem layout. I think it is a good idea to
>>>> give admins an 'out' from having it be compulsory that the filesystem
>>>> looks like DRS.  It has been my thought that ground truth catalogs at
>>>> the data node will refer to files where they are physically on the
>>>> machine.  When the catalog leaves the datanode or is queried from a
>>>> caller, these exchanges use the canonical name for files as directed by
>>>> the DRS.
>>>>
>>>> The basic idea in the catalog centric approach is to have a simple and
>>>> consistent model of "files" in our system.  The basic fact is that
>>>> semantically we aggregate files when we run our models and produce N
>>>> output files for a run.  We even go so far as having a catalog
>>>> describing these collections.  So then why, on the other side of the
>>>> system (those wanting to consume files), do we then make them
>>>> deal with files individually?  So I simply propose we stay consistent
>>>> all the way through. With regards to "fancy" tools (nice editorializing
>>>> Stephen :-) - just messing with you)... they are not so fancy, as they
>>>> are simple automations that allow you to manipulate catalogs and not
>>>> worry about moving their constituent files.
>>>>
>>>> So the layout I described was something like... have a /foo/catalog1
>>>> which will always have a /foo/catalog1/.datafiles directory that
>>>> contains the files stipulated in /foo/catalog1.  I would have a tool (I
>>>> think I am just going to write an esg-shell) that you would use to say
>>>> mv /foo/catalog1 to /bar/catalog1.  What the tool/shell would do is
>>>> mv /foo/catalog1 to /bar/catalog1 and then move all the files described
>>>> in /foo/catalog1 that reside in /foo/catalog1/.datafiles/* over as well
>>>> to /bar/catalog1/.datafiles/*.  If the "fancy" tool fails, one could just
>>>> read the catalog1 file and read the file entries and the checksums and
>>>> look for them in .datafiles and move them over by hand, or have a
>>>> script, which would be effectively equivalent to the "fancy" tool that
>>>> does just that.  The key is that where catalog1 lives is totally up to
>>>> the datanode admin.
>>>>
>>>> Oh and filters.  You can apply them to ingress and egress data in
>>>> tomcat.  The intrepid data node admin can install a filter to
>>>> rewrite the catalog such that what the outside world sees are DRS paths
>>>> to files.  This makes everything else 'just work' i.e. wget scripts,
>>>> etc.  The additional technical wrinkle is that I would suggest having a
>>>> specific esg-filter class that said intrepid data node admin would
>>>> subclass to make their filter.  The added bit of functionality would be
>>>> to write the filter name and version into the catalog's mutable
>>>> portion.  This way the catalog knows what filter was used to transform
>>>> it, and thus a filter factory can be set up on the fly to do the proper
>>>> translations.  Yes, this means the data node admin would have to
>>>> maintain their own filters.  But that's fine... they are intrepid! For
>>>> the less intrepid, don't do any rewrite and make your filesystem match
>>>> DRS.
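
The subclassable filter idea can be sketched independently of tomcat. The classes below are hypothetical (`EsgPathFilter`, `PrefixFilter`, and the `name:version` stamp format are all assumptions); they only illustrate a base class that an admin subclasses, plus the name and version that would be written into the catalog so the right filter can be reconstructed later.

```python
class EsgPathFilter:
    """Base filter: translates between DRS paths and on-disk paths.

    The default behaviour is the identity, i.e. the filesystem matches DRS.
    """

    name = "identity"
    version = "1"

    def to_drs(self, physical_path):
        return physical_path

    def to_physical(self, drs_path):
        return drs_path

    def stamp(self):
        """Text to record in the catalog so a factory can pick this filter."""
        return f"{self.name}:{self.version}"


class PrefixFilter(EsgPathFilter):
    """Example admin-written filter that swaps one root directory for another."""

    name = "prefix"
    version = "1"

    def __init__(self, physical_root, drs_root):
        self.physical_root = physical_root.rstrip("/")
        self.drs_root = drs_root.rstrip("/")

    def to_drs(self, physical_path):
        return self._swap(physical_path, self.physical_root, self.drs_root)

    def to_physical(self, drs_path):
        return self._swap(drs_path, self.drs_root, self.physical_root)

    @staticmethod
    def _swap(path, old_root, new_root):
        # Refuse paths outside the root rather than rewriting them wrongly.
        if path != old_root and not path.startswith(old_root + "/"):
            raise ValueError(f"{path!r} is not under {old_root!r}")
        return new_root + path[len(old_root):]
```

For example, `PrefixFilter("/data/disk3", "/cmip5/output").to_drs("/data/disk3/a/b.nc")` yields `/cmip5/output/a/b.nc`, and `to_physical` reverses it.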
>>>>
>>>> For non-tomcat filter amenable tools/protocols, I would suggest
>>>> writing the filter code such that it can be loaded up as a simple
>>>> translation service.  We have the source for GridFTP, right, and I
>>>> believe it is written in Java. So I think, if programmed wisely, there
>>>> would only be a single filter that can be applied to every
>>>> ingress/egress.
>>>>
>>>> I welcome discussing this further.  I will be the first to say that
>>>> I do not know much about DRS and the details therein, but I think that
>>>> technically this is a surmountable issue.  Someone please educate me on
>>>> the semantics.  I don't yet know what I don't know. :-).
>>>>
>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>> Hi Gavin,
>>>>>
>>>>> Sorry it's taken me so long to respond to this.  It's a good point
>>>>> that we could version control catalogue information and then write
>>>>> tools to synchronise the catalogues with file versions.  I like the
>>>>> idea but I think it has far-reaching implications for the system as a
>>>>> whole.
>>>>>
>>>>> There are a couple of reasons why I haven't embraced the "catalogue
>>>>> centric" approach so far.  First, the ESG datanode database already has
>>>>> all the information you'd put in catalogues.  The DRY principle suggests
>>>>> we should have only 1 source for catalogue information and I have assumed
>>>>> that is the database.  Now, the database has advantages and
>>>>> disadvantages: Bob's schema manages multiple versions but there is no
>>>>> mechanism for distributing version changes amongst datanodes, whereas
>>>>> tools like GIT would give us distributed version control of catalogues
>>>>> out of the box.  However, if we start version controlling catalogues
>>>>> we will end up with our catalogue information spread all over the
>>>>> place and we'll have to keep them all synchronised:
>>>>>
>>>>> 1. In the ESG database
>>>>> 2. In the archive
>>>>> 3. In the THREDDS catalogue tree
>>>>>
>>>>> Also, the reason I've stuck with symbolic links rather than tools to
>>>>> map to DRS paths is that there is an argument for keeping the on-disk
>>>>> layout as close to the DRS as possible so that there is a fallback to
>>>>> getting data if the fancy tools fail.  If you do this with symlinks
>>>>> you can always point an ftp server at the archive if all else fails.
>>>>> I'm on the fence about whether this argument is worth the reduced
>>>>> flexibility.
>>>>>
>>>>> We should also bear in mind that GIT has performance problems for both
>>>>> size of files and number of files per repository
>>>>> (http://stackoverflow.com/questions/984707/what-are-the-git-limits)
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>> ---
>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>> British Atmospheric Data Centre
>>>>> Rutherford Appleton Laboratory
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>> Sent: 21 April 2010 00:15
>>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>>>> document
>>>>>
>>>>> Hey Stephen,
>>>>>
>>>>> Great write up.  I read through it and it is really well thought out.
>>>>> As I mentioned on the call today it is pretty much exactly how GIT is
>>>>> designed, so you are in good company :-).
>>>>>
>>>>> http://progit.org/book/ch1-3.html
>>>>>
>>>>> The main issue I have is that we are thinking too low level.  We are
>>>>> thinking about filesystem files.  Again, I think we should be thinking
>>>>> of things at the "ESG FILE" level, i.e. the catalog level.  Imagine
>>>>> the following:
>>>>>
>>>>> Users download ESG FILES aka catalogs.  The catalogs have versions
>>>>> associated with them.  When the user downloads the catalog they get
>>>>> the physical catalog xml file as well as the files that come with it.
>>>>> If they download a new version of the catalog that has two files out
>>>>> of 100 different, they would only pull down the two new files.  How
>>>>> are files named to avoid collisions?  Easy, files are named in the
>>>>> scheme <filename>.<checksum>; this information can be gleaned from
>>>>> inspecting the catalog, which has both pieces of information.  Catalogs
>>>>> are version controlled in GIT (easy, they are just text xml files...
>>>>> perfect for version control).  Let's make it even more explicit that
>>>>> there are no filesystem files in the context of ESG by putting all
>>>>> the files in a dot directory.
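
To illustrate why the <filename>.<checksum> scheme makes incremental download simple, here is a small sketch that compares two catalog versions. It assumes each version has been reduced to a list of (file name, checksum) pairs; `physical_name` and `new_files` are hypothetical helper names.

```python
def physical_name(file_name, checksum):
    """Name a stored file as <filename>.<checksum>, as proposed above."""
    return f"{file_name}.{checksum}"


def new_files(old_entries, new_entries):
    """Return stored names that the new catalog version needs but the old lacks."""
    already_have = {physical_name(name, checksum) for name, checksum in old_entries}
    wanted = {physical_name(name, checksum) for name, checksum in new_entries}
    return sorted(wanted - already_have)


if __name__ == "__main__":
    v1 = [("file1.nc", "aa11"), ("file2.nc", "bb22")]
    v2 = [("file1.nc", "aa11"), ("file2.nc", "cc33")]
    print(new_files(v1, v2))
```

Because the checksum is part of the stored name, a changed file never collides with its predecessor, and unchanged files are recognised without reading their contents.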
>>>>>
>>>>> Example:
>>>>>
>>>>> I am in directory foo/bar (DRS dir hierarchy perhaps) and we pull down
>>>>> catalog_alpha_v1.  When we do so we now have the following
>>>>> structure.
>>>>>
>>>>> pwd   -> /root/foo/bar
>>>>> ls    -> catalog_alpha_v1
>>>>>
>>>>> ls -a -> catalog_alpha_v1
>>>>>      -> .esg_files/file{1...n}.nc.<checksum>
>>>>>
>>>>> See what I mean?
>>>>>
>>>>> In GIT we tell GIT to ignore all .esg_files directories, thus only
>>>>> versioning the catalog files.
>>>>>
>>>>> The implication of this is that we would have to build tools to do our
>>>>> own interrogation of the file system to give us the file.nc
>>>>> translations.  This tool will use the catalog at that directory level
>>>>> and be able to get directly at the files users want.
>>>>>
>>>>> Furthermore, WGET will "just work" if we point wget to an HTTP URL
>>>>> that will have a filter applied to it that will do this interrogation
>>>>> and interpretation and fetch the files referred to.  This is a
>>>>> tomcat filter (pretty straightforward to do).
>>>>>
>>>>> This means, no linking, no extra anything at the OS file system level.
>>>>> The important files are versioned i.e. the catalogs.  And we can still
>>>>> use WGET scripts as long as they point to our translation web service,
>>>>> which consists pretty much only of a filter! :-).
>>>>>
>>>>> As for the atomic data set thing... well they are represented already
>>>>> as aggregates in the catalog.  The only additional bit of information
>>>>> that we could add would be a version attribute.  The issues behind
>>>>> what the gateways read or don't read from the catalogs, I am confident
>>>>> will be surmounted, so that should not be a blocking issue to
>>>>> implementing this.
>>>>>
>>>>> Things to do:
>>>>> -> Write this translation code.
>>>>>   -We know the .esg_files directory (a given)
>>>>>   -We know how to parse the catalog (use xml parser du jour)
>>>>>   -translate the input file name, as a wget script would use it,
>>>>>    to the actual physical filesystem filename.
>>>>>   -Put this in a filter for tomcat in front of a catoon service
>>>>>   -Create a shell for esg... a simple read-eval-print loop that calls
>>>>>    the translator when it is in git directories with catalog looking
>>>>>    files and .esg_files directories to show you a filesystem looking "ls"
>>>>>    but for esg-files.
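
The translation step in the list above might look like the following sketch. It assumes the catalog has already been parsed into (file name, checksum) pairs and that data files sit in a `.esg_files` directory beside the catalog; the function name `translate` is hypothetical.

```python
from pathlib import PurePosixPath


def translate(requested_name, catalog_dir, entries):
    """Map a logical file name to its physical path under .esg_files.

    requested_name: the name a wget script would ask for, e.g. "file1.nc".
    catalog_dir:    directory that holds the catalog file.
    entries:        (file_name, checksum) pairs parsed from that catalog.
    """
    checksums = dict(entries)
    if requested_name not in checksums:
        raise KeyError(f"{requested_name} is not listed in the catalog")
    stored = f"{requested_name}.{checksums[requested_name]}"
    return str(PurePosixPath(catalog_dir) / ".esg_files" / stored)


if __name__ == "__main__":
    print(translate("file1.nc", "/root/foo/bar", [("file1.nc", "abc123")]))
```

The same function could back both the HTTP filter and an "ls"-style shell command, since each only needs the logical-to-physical mapping.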
>>>>>
>>>>>
>>>>> In my "spare" time I would love to write an ESG shell such that when
>>>>> you load the esg shell it will be able to do ls like a traditional OS's
>>>>> ls, using this translation code to show you the files that live there
>>>>> in the right version context.... I don't have a lot of spare time
>>>>> right about now. :-(
>>>>>
>>>>> This catalog centric modeling of the system has been a model I have
>>>>> pushed for months now.  I feel like Cassandra :-).
>>>>>
>>>>> Thanks for listening.
>>>>>
>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>> Hi Bob,
>>>>>>
>>>>>> Thanks for promptly commenting on the document.  Clarifying that the
>>>>>> publisher has these features is great news and I'm sorry that, in
>>>>>> trying to give everyone time to digest the document by Tuesday, I
>>>>>> didn't have time to confirm the facts with you.  I'm hoping this way
>>>>>> any errors will come out in the wash.
>>>>>>
>>>>>> The main thing I missed was the ability to create multiple THREDDS
>>>>>> catalogues for a dataset (or 1 catalogue per dataset version).
>>>>>> Omitting this feature felt like a fundamental difference in model
>>>>>> to the DRS.
>>>>>>
>>>>>> I need to work out how to do this now and I'll revise the version
>>>>>> directory structure document too.  Phil Bentley has recommended a
>>>>>> different structure that has some advantages so the document will
>>>>>> probably look very different next time.
>>>>>>
>>>>>> Incidentally, I'm increasingly impressed with the ESG publisher and
>>>>>> I'm really enjoying working with it.  The stuff you've done with
>>>>>> project handler plugins in the latest release strengthens my
>>>>>> impression that it is a tool we will be using for a long time.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>
>>>>>> ---
>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>> British Atmospheric Data Centre
>>>>>> Rutherford Appleton Laboratory
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Bob Drach [mailto:drach1 at llnl.gov]
>>>>>> *Sent:* 16 April 2010 00:14
>>>>>> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
>>>>>> *Cc:* go-essp-tech at ucar.edu
>>>>>> *Subject:* Re: [Go-essp-tech] Proposed version directory structure
>>>>>> document
>>>>>>
>>>>>> Hi Stephen,
>>>>>>
>>>>>> Let me clarify a few points in the description of ESG Publisher:
>>>>>>
>>>>>> The document states: "ESG Publisher version system is built around
>>>>>> mutable datasets.  It does not attempt to maintain references to
>>>>>> previous data and the dataset version number is not part of the
>>>>>> dataset id unless the publisher is configured to include it from the
>>>>>> dataset metadata.  This means that it is not straight forward at this
>>>>>> time to publish multiple versions of an atomic dataset unless each
>>>>>> version is published as a separate dataset.  This approach would
>>>>>> effectively ignore ESG Publisher's version system and manage all
>>>>>> versions independently."
>>>>>>
>>>>>> - As of Version 2 the unit of publication is in fact a 'dataset
>>>>>> version', terminology that came out of the December meeting in
>>>>>> Boulder.
>>>>>> A dataset version is an immutable object which can represent a 'DRS
>>>>>> dataset including version number'. The published 'dataset version'
>>>>>> itself has an identifier which typically consists of
>>>>>> dataset_id+version number; this appears in the THREDDS catalog. As
>>>>>> you stated in the document, whether or not the published dataset
>>>>>> corresponds to a DRS dataset is a matter of publisher configuration,
>>>>>> not an inherent property of the publisher.
>>>>>>
>>>>>> - The node database does in fact maintain references to the
>>>>>> composition of previous dataset versions. It is possible to have
>>>>>> multiple versions published simultaneously, to list all published
>>>>>> versions of a dataset, and for any given dataset version the files
>>>>>> contained in that version can be listed.
>>>>>>
>>>>>> - The intention of the publisher design is to automate versioning as
>>>>>> much as possible. A 'dataset' is considered to be a collection of
>>>>>> dataset versions. Consequently, 'publishing a dataset' really means
>>>>>> 'publishing a dataset version where the version number is incremented
>>>>>> relative to the previous version.' Similarly, 'unpublishing' a
>>>>>> dataset by default unpublishes all versions of a dataset. The
>>>>>> terminology dataset_id#n can be used to refer to a specific version.
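
The dataset_id#n convention and the "increment relative to the previous version" rule can be sketched as follows. This is not the publisher's code: the function names are hypothetical, and starting the numbering at 1 is an assumption.

```python
def parse_dataset_ref(ref):
    """Split 'dataset_id#n' into (dataset_id, n); n is None when absent."""
    dataset_id, sep, version = ref.rpartition("#")
    if not sep:
        # No '#' present: the reference names the dataset as a whole.
        return ref, None
    return dataset_id, int(version)


def next_version(published_versions):
    """Version number for the next publication of a dataset.

    Assumes numbering starts at 1 when nothing has been published yet.
    """
    return max(published_versions, default=0) + 1


if __name__ == "__main__":
    print(parse_dataset_ref("example.dataset#3"))
    print(next_version([1, 2, 3]))
```

A reference without `#n` maps naturally onto operations such as the default unpublish, which acts on all versions of the dataset.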
>>>>>>
>>>>>>
>>>>>> In short, there is no fundamental mismatch between the DRS model and
>>>>>> the ESG publisher.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Bob
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk
>>>>>> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Attached is my view on how we should structure the archive to
>>>>>>> support multiple versions.  It divides into 2 main sections. The
>>>>>>> first is a fairly lengthy summary of why this problem isn't solved
>>>>>>> yet in terms of the differences between the ESG datanode software and
>>>>>>> the DRS document.  The second section lays out the proposed structure
>>>>>>> and how we would manage symbolic links and moving from one version to
>>>>>>> another.  I restrict myself to directories below the atomic dataset
>>>>>>> level.
>>>>>>>
>>>>>>> Lots of issues are left to resolve, in particular how the ESG
>>>>>>> publisher can make use of this structure.  I'll try and draw
>>>>>>> attention to these points in the agenda for Tuesday's telco, which
>>>>>>> will follow later today.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephen.
>>>>>>>
>>>>>>> ---
>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>> British Atmospheric Data Centre
>>>>>>> Rutherford Appleton Laboratory
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Scanned by iCritical.
>>>>>>>
>>>>>>>
>>>>>>> <ESGF_version_structure.odt>
>>>>>>> _______________________________________________
>>>>>>> GO-ESSP-TECH mailing list
>>>>>>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>> --
>>>>> Gavin M. Bell
>>>>> Lawrence Livermore National Labs
>>>>> --
>>>>>
>>>>> "Never mistake a clear view for a short distance."
>>>>>                  -Paul Saffo
>>>>>
>>>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>>
>>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>
> From: Stehen Pascoe <Stephen.Pascoe at stfc.ac.uk>
> To: Gavin M Bell <gavin at llnl.gov>
> Cc: "martin.juckes at stfc.ac.uk" <martin.juckes at stfc.ac.uk>,
>        "go-essp-tech at ucar.edu" <go-essp-tech at ucar.edu>
> Subject: Re: [Go-essp-tech] Proposed version directory structure document
> Date: Tue, 4 May 2010 12:37:28 +0100
>
> Hi Gavin,
>
> I have one concern, as I mentioned briefly before. As it stands the
> datanode produces THREDDS catalogues as output from either a directory
> scan or a map file: esgpublish does not take THREDDS catalogues as
> input. So where would the publisher fit in to your replication scenario?
>
> It is for this reason that I don't think it is obvious that our DVCS
> catalogues are the same as the THREDDS catalogues used to initialize
> TDS, although they are clearly prime candidates. For instance, even if
> we extended esgpublish to take THREDDS as input I doubt they would
> round-trip unchanged without care.
>
> A prototype sounds a very good idea.
>
> Cheers,
> Stephen.
>
> --
> Stephen Pascoe
>
>
> On 3 May 2010, at 20:48, Gavin M Bell <gavin at llnl.gov> wrote:
>
>> Hi,
>>
>> As I mentioned before, our system is designed around a single editor
>> and a single publisher.  There is no merging, and there is indeed ONE
>> person that has the ground truth of the *latest* version.
>>
>> So entertain the following scenario...
>>
>> From the system side of things (at a 15,000 ft level):
>>
>> Phase 1: "The handshake: What's new?"
>> Domain A wants to make sure it has the latest *catalogs* from Domain B,
>> for the *catalogs* it cares about.  Domain A asks Domain B for the
>> latest versions of these *catalogs*.  Domain B provides the list of
>> latest versions of said *catalogs* back to Domain A iff there is a
>> newer version to get.  Domain A initiates a pull from Domain B of these
>> newest catalog files.
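
A minimal sketch of the Phase 1 comparison, assuming each domain can report its catalogs as a mapping from catalog id to an integer version. The function name, the mapping shape and the integer versions are illustrative assumptions, not part of the ESG software.

```python
def catalogs_to_pull(local_versions, remote_versions):
    """Return the ids of catalogs Domain A should pull from Domain B.

    local_versions:  {catalog_id: version} for the catalogs Domain A cares about.
    remote_versions: {catalog_id: version} as reported by Domain B.
    Both shapes are assumptions made for this sketch.
    """
    to_pull = []
    for catalog_id in sorted(local_versions):
        remote = remote_versions.get(catalog_id)
        # Only pull when Domain B actually holds something newer.
        if remote is not None and remote > local_versions[catalog_id]:
            to_pull.append(catalog_id)
    return to_pull
```

Catalogs that Domain B holds but Domain A has not asked about are ignored, which matches "for the catalogs it cares about".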
>>
>> Phase 2: "The 'realization' of these catalogs"
>> (this is where the catalog-centric model's rubber hits the road)
>> This realization is basically inspecting the catalog and reconciling
>> the data files that the catalog lists with what is on the local file
>> system.  This goes back to the email I sent at the beginning of this
>> thread.  The system works out what it needs to make the current
>> 'latest' catalog "true".  A list of files falls out of this that can
>> (will) be fed to the fetching mechanism (BDM in our case) to pull
>> those files down into the prescribed directory.  I suggested a
>> .esg_data_files directory, at the same level as the catalog, where its
>> files are kept.  Dealing with catalogs for versioning is separate from
>> the transfer of the data files that constitute them.  Getting the
>> catalog to be "true" is out of band of any version control system,
>> though the catalogs themselves are version controlled.
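
The reconciliation step could be sketched roughly as follows. It assumes the catalog has already been parsed into (filename, checksum) pairs and that the checksums are MD5 hex digests; the thread does not say which checksum is used, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so large data files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(catalog_entries, data_dir):
    """List the files needed to make the catalog "true" on this file system.

    catalog_entries: iterable of (filename, checksum) pairs taken from the
    catalog (assumed shape).  A file is needed if it is absent from data_dir
    or its checksum differs from the one the catalog records.
    """
    data_dir = Path(data_dir)
    needed = []
    for name, checksum in catalog_entries:
        local = data_dir / name
        if not local.is_file() or md5_of(local) != checksum:
            needed.append(name)
    return needed
```

The returned list is what would be handed to the fetching mechanism; the fetch itself stays out of band.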
>>
>> This is also my proposal for how replication should be done.
>>
>> This scenario raises a bootstrapping question: what happens at "phase
>> 0" to support the initial contact between Domain A and Domain B?  At
>> the moment it could be a potentially charged issue, so I am punting on
>> that.
>>
>> Anyway, I hope this illustrates the separation between where a DVCS
>> would be applied (only to the catalog xml files) and where data file
>> fetching would be done (via another out-of-band mechanism, e.g. BDM).
>> I also hope I conveyed the elegance of this approach, as it speaks to
>> both versioning and replication (and user copying/downloading).
>>
>> Thoughts?
>>
>> Perhaps I should prototype this and share it?
>>
>> martin.juckes at stfc.ac.uk wrote:
>>> Thanks, I think I understand a bit better now: the idea is to have a
>>> single GIT repository, and the files deposited in the repository are
>>> THREDDS catalogues? This does mean, I think, that the repository
>>> version is not much use to a user trying to track changes in a small
>>> subset of the THREDDS catalogues. There is a section in the GIT
>>> Wikipedia page ( http://en.wikipedia.org/wiki/Git_%28software%29 )
>>> which suggests that GIT is quite inefficient at getting the version
>>> history of deposited files (i.e. THREDDS catalogues). I'm not sure
>>> this matters, but we need to check the consequences of the different
>>> use cases: GIT is clearly designed for the situation where users will
>>> generally be interested in a complete version of the archived
>>> package, whereas we are dealing with a case in which users will
>>> generally only want a small part of it.
>>>
>>>
>>> cheers,
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Sent: Sun 02/05/2010 16:22
>>> To: Juckes, Martin (STFC,RAL,SSTD); Gavin M Bell
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: RE: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>>
>>> Martin,
>>>
>>> I think Gavin uses the term catalogue to mean an aggregation of files
>>> similar to a THREDDS catalogue, not the entire archive.  Each
>>> catalogue would represent a realm-dataset.  As you say, that's what
>>> we need.
>>>
>>> What is probably confusing is that Gavin suggests a single GIT
>>> repository holding all catalogues (the repository can be replicated
>>> throughout the system -- that's what makes GIT a distributed version
>>> control system).  He also discusses how catalogues would be mapped
>>> onto versions of files, which would need their own internal
>>> identifiers to make the system work.
>>>
>>> As far as I can see the user would be aware of only one type of
>>> "version".
>>>
>>> Stephen.
>>>
>>>
>>> -----Original Message-----
>>> From: Juckes, Martin (STFC,RAL,SSTD)
>>> Sent: Sat 5/1/2010 6:43 PM
>>> To: Gavin M Bell; Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: RE: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>>
>>> Hello Gavin, Stephen,
>>>
>>> I haven't been following this discussion, so the following concern
>>> may well have been dealt with. It looks to me as though you are
>>> discussing a system with two levels of versioning: versions of
>>> individual files, and a version of the entire catalogue, which will
>>> increment every time any files are changed. This, I think, leaves too
>>> big a gap in which it is difficult for users to specify which set of
>>> files they have used. If someone uses a few thousand files, a change
>>> in the catalogue version doesn't tell him if these files have been
>>> changed, and listing all the file versions is not a useful option in
>>> publications and correspondence -- so we need versioning at
>>> intermediate levels such as published units and atomic datasets as
>>> well as at file and catalogue level,
>>>
>>> cheers,
>>> Martin
>>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu on behalf of Gavin M Bell
>>> Sent: Fri 30/04/2010 19:03
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>> Hi Stephen,
>>>
>>> I am glad that my scheme is making you warm and fuzzy; I hope the rest
>>> of the team is also on board.  I too am enamored with the simple
>>> elegance of it, if I do say so myself. :-)
>>>
>>> The thing is... my scheme is essentially that we *only* version control
>>> *catalogs* with a DVCS (GIT).  Catalogs are simple text xml files.  Each
>>> of those files is not that big, and the number of catalog files is well
>>> within the supported range for GIT.  With respect to catalogs... there
>>> are no BIG catalogs (files) in the context of anything that would be
>>> prohibitive for vanilla GIT.  The key bit of niceness is that the esg
>>> system is pushing off the storage and durability of the actual data
>>> files (the big files) to the institutions themselves.  Between local
>>> institution durability duties and replication we can be somewhat safe
>>> that we won't 'lose' data.  The only thing the esg system will explicitly
>>> version are the catalogs themselves, which in turn point to the specific
>>> hard data files (netcdf) living on disk and replicated.
>>>
>>> In short; GIT with no additional bells and whistles should be able to
>>> handle all our ESG catalogs.  Note: There is one GIT repo per datanode.
>>>
>>> ...
>>>
>>> Some Things To Think About:
>>> Some things would need to change, like the catalog naming scheme.  If
>>> catalogs are version controlled then we no longer need to version files
>>> by explicitly naming them, i.e. foo_catalog_v{1..n}.  But we *should*
>>> put that version value *in* the file itself.  Thus quick inspection of
>>> the file can give you the ESG version value, while the VCS sees a single
>>> filename entity to version control.  (I'll have to talk to Bob on that
>>> one.)  Also, thus far we are not using the "D" part of the DVCS.  In
>>> order to do so we would have to either a) flatten the file hierarchy (or
>>> at least settle on a consistent one), which would additionally let us
>>> divorce the catalog placement from the filesystem hierarchy (this is
>>> where a simpler version of your link idea would come to bear), or b)
>>> interrogate this ourselves (via code we write) as we do version
>>> negotiating among federated entities.
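
As an illustration of keeping the version value *in* the catalog file, the sketch below reads it back from the XML. It assumes the version is stored as a THREDDS-style `<property name="..." value="..."/>` element; the property name `dataset_version` is a made-up placeholder, since the thread does not settle where the value would live.

```python
import xml.etree.ElementTree as ET


def catalog_version(xml_text, property_name="dataset_version"):
    """Return the integer version recorded inside a catalog, or None.

    Looks for <property name=property_name value="N"/> anywhere in the
    document.  The property name is a hypothetical choice for this sketch.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Strip any "{namespace}" prefix so the lookup works with or
        # without an XML namespace on the catalog.
        local_tag = element.tag.rsplit("}", 1)[-1]
        if local_tag == "property" and element.get("name") == property_name:
            return int(element.get("value"))
    return None
```

With this in place the file name can stay fixed (e.g. foo_catalog) while the version is still visible by quick inspection.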
>>>
>>> I'd be happy to discuss this more.
>>>
>>>
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>> Hi Gavin,
>>>>
>>>> If we go down the DVCS-catalogue route you might be interested to note
>>>> that Mercurial already has an extension that does something very
>>>> similar to what you are proposing with GIT.
>>>>
>>>> http://mercurial.selenic.com/wiki/BigfilesExtension
>>>>
>>>> Maybe we need something more bespoke, but it's a useful reference.  The
>>>> more I think about version-controlled catalogues the more it appears to
>>>> solve some of our problems (particularly replication).
>>>>
>>>> S.
>>>>
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> British Atmospheric Data Centre
>>>> Rutherford Appleton Laboratory
>>>>
>>>> -----Original Message-----
>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>> Sent: 26 April 2010 18:33
>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>>> document
>>>>
>>>> Hi Stephen,
>>>>
>>>> I hope you are well. :-)
>>>>
>>>> Disclaimer:
>>>> Okay, let me get straight at it... :-)  Please pardon the length of
>>>> this response. I do tend to get a bit garrulous when I am trying to
>>>> tackle technical issues. :-\ sorry.  And pardon typos; I get a bit
>>>> James Joycean at times too.
>>>>
>>>> (take a deep breath.... :-)
>>>>
>>>> So, the catalog-centric model does not impose anything on the file
>>>> system.  You can move the catalogs as you would any file on the file
>>>> system.  Indeed there will be a tool made available (I think I am just
>>>> going to write it and be done with it) so that the physical constituent
>>>> files move along with the logical catalog as well.  There is no file
>>>> system lock-in at all; it's just a meta layer to essentially group
>>>> files.
>>>>
>>>> The source of catalog information is the catalog.  The database is a
>>>> nice tool to do data manipulations and queries, but the ground truth is
>>>> the file that is the catalog, IMHO.  As long as the catalogs are
>>>> identical (checksums match) then it doesn't matter where the catalog
>>>> comes from, ipso facto.  It is my proposed plan that the data node
>>>> manager will provide the handshaking between/among data nodes.  This is
>>>> something that will be leveraged by the replication agent, and thus we
>>>> can somewhat think of catalogs as being eventually consistent.  This
>>>> can be upper bounded (sort of) based on the type of event propagation
>>>> protocols we want to use.  For the moment I am looking at using a
>>>> gossip protocol among nodes for addressing this issue, piggybacking on
>>>> GIT's distributed syncing.  The fact that we have a single-editor
>>>> system simplifies things greatly!!!
>>>>
>>>> W.r.t. GIT, indeed there are limits. I guess what would be good is to
>>>> get a back-of-the-envelope calculation of how many catalogs there would
>>>> be per data node, just by order of magnitude.  I suspect that at the
>>>> catalog (aggregation) level we will be well in the safe zone for GIT.
>>>>
>>>> As for the DRS, indeed you can lay out your filesystem exactly as
>>>> prescribed by the DRS, or you can relax that requirement by having
>>>> filters in between the caller and storage that translate to and from
>>>> the DRS layout and the filesystem layout. I think it is a good idea to
>>>> give admins an 'out' from having it be compulsory that the filesystem
>>>> looks like DRS.  It has been my thought that ground truth catalogs at
>>>> the data node will refer to files where they are physically on the
>>>> machine.  When the catalog leaves the datanode or is queried by a
>>>> caller, these exchanges use the canonical name for files as directed
>>>> by the DRS.
>>>>
>>>> The basic idea in the catalog-centric approach is to have a simple and
>>>> consistent model of "files" in our system.  The basic fact is that
>>>> semantically we aggregate files when we run our models and produce N
>>>> output files for a run.  We even go so far as having a catalog
>>>> describing these collections.  So then why, on the other side of the
>>>> system (those wanting to consume files), do we make them deal with
>>>> files individually?  So I simply propose we stay consistent
>>>> all the way through. With regards to "fancy" tools (nice editorializing
>>>> Stephen :-) - just messing with you)... they are not so much fancy as
>>>> they are simple automations that allow you to manipulate catalogs and
>>>> not worry about moving their constituent files.
>>>>
>>>> So the layout I described was something like... have a /foo/catalog1
>>>> which will always have a /foo/catalog1/.datafiles directory that
>>>> contains the files stipulated in /foo/catalog1.  I would have a tool (I
>>>> think I am just going to write an esg-shell) that you would use to say
>>>> mv /foo/catalog1 to /bar/catalog1.  What the tool/shell would do is
>>>> move /foo/catalog1 to /bar/catalog1 and then move all the files
>>>> described in /foo/catalog1 that reside in /foo/catalog1/.datafiles/*
>>>> over as well to /bar/catalog1/.datafiles/*.  If the "fancy" tool fails,
>>>> one could just read the catalog1 file, read the file entries and the
>>>> checksums, look for them in .datafiles and move them over by hand, or
>>>> have a script, effectively equivalent to the "fancy" tool, that does
>>>> just that.  The key is that where catalog1 lives is totally up to
>>>> the datanode admin.
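
A rough sketch of what such a "mv" helper might do. It assumes the catalog is a single file sitting beside a `.datafiles` directory (the message can also be read as `catalog1` being a directory that contains `.datafiles`), and that the catalog's file entries have already been extracted into a list of names. `move_catalog` is a hypothetical name, not an existing ESG tool.

```python
import shutil
from pathlib import Path


def move_catalog(catalog_path, dest_dir, data_file_names, data_dir_name=".datafiles"):
    """Move a catalog file together with the data files it lists.

    catalog_path:    path of the catalog file, e.g. /foo/catalog1
    dest_dir:        directory to move it into, e.g. /bar
    data_file_names: file names listed in the catalog (assumed pre-parsed)
    Returns the new path of the catalog file.
    """
    catalog_path = Path(catalog_path)
    dest_dir = Path(dest_dir)
    src_data = catalog_path.parent / data_dir_name
    dest_data = dest_dir / data_dir_name
    dest_data.mkdir(parents=True, exist_ok=True)

    # Move the constituent files first, so the catalog only appears at
    # the destination once everything it refers to is already there.
    for name in data_file_names:
        shutil.move(str(src_data / name), str(dest_data / name))

    new_catalog = dest_dir / catalog_path.name
    shutil.move(str(catalog_path), str(new_catalog))
    return new_catalog
```

Files in the source `.datafiles` directory that the catalog does not list are left where they are, which is the same thing one would do when moving the files over by hand.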
>>>>
>>>> Oh, and filters.  You can apply them to ingress and egress data in
>>>> tomcat.  The intrepid data node admin can install a filter to
>>>> rewrite the catalog such that what the outside world sees are DRS paths
>>>> to files.  This makes everything else 'just work', i.e. wget scripts,
>>>> etc.  The additional technical wrinkle is that I would suggest having a
>>>> specific esg-filter class that said intrepid data node admin would
>>>> subclass to make their filter.  The added bit of functionality would be
>>>> to write the filter name and version into the catalog's mutable
>>>> portion.  This way the catalog knows what filter was used to transform
>>>> it, and thus a filter factory can be set up on the fly to do the proper
>>>> translations.  Yes, this means the data node admin would have to
>>>> maintain their own filters.  But that's fine... they are intrepid! For
>>>> the less intrepid: don't do any rewrite and make your filesystem match
>>>> DRS.
>>>>
>>>> For tools/protocols that are not amenable to tomcat filters, I would
>>>> suggest writing the filter code such that it can be loaded up as a
>>>> simple translation service.  We have the source for GridFTP, right, and
>>>> I believe it is written in Java. So I think, if programmed wisely,
>>>> there would only be a single filter that can be applied to every
>>>> ingress/egress.
>>>>
>>>> I welcome discussing this further.  I will be the first to say that
>>>> I do not know much about DRS and the details therein, but I think that
>>>> technically this is a surmountable issue.  Someone please educate me on
>>>> the semantics.  I don't yet know what I don't know. :-)
>>>>
>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>> Hi Gavin,
>>>>>
>>>>> Sorry it's taken me so long to respond to this.  It's a good point
>>>>> that we could version control catalogue information and then write
>>>>> tools to synchronise the catalogues with file versions.  I like the
>>>>> idea but I think it has far-reaching implications for the system as a
>>>>> whole.
>>>>>
>>>>> There are a couple of reasons why I haven't embraced the "catalogue
>>>>> centric" approach so far.  First, the ESG datanode database already
>>>>> has all the information you'd put in catalogues.  The DRY principle
>>>>> suggests we should have only 1 source for catalogue information and I
>>>>> have assumed that is the database.  Now, the database has advantages
>>>>> and disadvantages: Bob's schema manages multiple versions but there is
>>>>> no mechanism for distributing version changes amongst datanodes,
>>>>> whereas tools like GIT would give us distributed version control of
>>>>> catalogues out of the box.  However, if we start version controlling
>>>>> catalogues we will end up with our catalogue information spread all
>>>>> over the place and we'll have to keep them all synchronised:
>>>>>
>>>>> 1. In the ESG database
>>>>> 2. In the archive
>>>>> 3. In the THREDDS catalogue tree
>>>>>
>>>>> Also, the reason I've stuck with symbolic links rather than tools to
>>>>> map to DRS paths is that there is an argument for keeping the on-disk
>>>>> layout as close to the DRS as possible so that there is a fallback to
>>>>> getting data if the fancy tools fail.  If you do this with symlinks
>>>>> you can always point an ftp server at the archive if all else fails.
>>>>> I'm on the fence about whether this argument is worth the reduced
>>>>> flexibility.
>>>>>
>>>>> We should also bear in mind that GIT has performance problems for both
>>>>> size of files and number of files per repository
>>>>> (http://stackoverflow.com/questions/984707/what-are-the-git-limits)
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>> ---
>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>> British Atmospheric Data Centre
>>>>> Rutherford Appleton Laboratory
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>>> Sent: 21 April 2010 00:15
>>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>>>> document
>>>>>
>>>>> Hey Stephen,
>>>>>
>>>>> Great write up.  I read through it and it is really well thought out.
>>>>> As I mentioned on the call today it is pretty much exactly how GIT is
>>>>> designed, so you are in good company :-).
>>>>>
>>>>> http://progit.org/book/ch1-3.html
>>>>>
>>>>> The main issue I have is that we are thinking too low level.  We are
>>>>> thinking about filesystem files.  Again, I think we should be thinking
>>>>> of things at the "ESG FILE" level, i.e. the catalog level.  Imagine
>>>>> the following:
>>>>>
>>>>> Users download ESG FILES aka catalogs.  The catalogs have versions
>>>>> associated with them.  When the user downloads the catalog they get
>>>>> the physical catalog xml file as well as the files that come with it.
>>>>> If they download a new version of the catalog in which two files out
>>>>> of 100 are different, they would only pull down the two new files.  How
>>>>> are files named to avoid collisions?  Easy: files are named using the
>>>>> scheme <filename>.<checksum>, and this information can be gleaned by
>>>>> inspecting the catalog, which has both pieces of information.  Catalogs
>>>>> are version controlled in GIT (easy, they are just text xml files...
>>>>> perfect for version control).  Let's make it even more explicit that
>>>>> there are no filesystem files in the context of ESG by putting all
>>>>> the files in a dot directory.
>>>>>
>>>>> Example:
>>>>>
>>>>> I am in directory foo/bar (DRS dir hierarchy perhaps) and we pull down
>>>>> catalog_alpha_v1.  When we do so we now have the following
>>>>> structure.
>>>>>
>>>>> pwd   -> /root/foo/bar
>>>>> ls    -> catalog_alpha_v1
>>>>>
>>>>> ls -a -> catalog_alpha_v1
>>>>>      -> .esg_files/file{1...n}.nc.<checksum>
>>>>>
>>>>> See what I mean?
>>>>>
>>>>> In GIT we tell GIT to ignore all .esg_files directories, thus only
>>>>> versioning the catalog files.
>>>>>
>>>>> The implication of this is that we would have to build tools to do our
>>>>> own interrogation of the file system to give us the file.nc
>>>>> translations.  This tool will use the catalog at that directory level
>>>>> and be able to get directly at the files users want.
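
The core of that translation might look like the sketch below. It assumes the catalog has been parsed into a mapping from logical file name to checksum; the mapping shape and the function name are illustrative, while the `.esg_files` directory and the `<filename>.<checksum>` naming scheme come from the description above.

```python
from pathlib import Path


def physical_path(catalog_dir, requested_name, catalog_checksums, store_name=".esg_files"):
    """Translate a logical file name into its on-disk location.

    catalog_checksums: {filename: checksum} for the catalog version in use
    (assumed shape).  Raises KeyError if the catalog does not list the file.
    """
    checksum = catalog_checksums[requested_name]
    return Path(catalog_dir) / store_name / f"{requested_name}.{checksum}"
```

Because the checksum comes from whichever catalog version is being consulted, two versions of the same logical file can sit side by side in `.esg_files` without colliding.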
>>>>>
>>>>> Furthermore, WGET will "just work" if we point wget at an HTTP URL
>>>>> that has a filter applied to it that does this interrogation
>>>>> and interpretation and fetches the files referred to.  This is a
>>>>> tomcat filter (pretty straightforward to do).
>>>>>
>>>>> This means, no linking, no extra anything at the OS file system level.
>>>>> The important files are versioned i.e. the catalogs.  And we can still
>>>>> use WGET scripts as long as they point to our translation web service,
>>>>> which consists pretty much only of a filter! :-).
>>>>>
>>>>> As for the atomic data set thing... well, they are represented already
>>>>> as aggregates in the catalog.  The only additional bit of information
>>>>> that we could add would be a version attribute.  The issues behind
>>>>> what the gateways read or don't read from the catalogs will, I am
>>>>> confident, be surmounted, so that should not be a blocking issue to
>>>>> implementing this.
>>>>>
>>>>> Things to do:
>>>>> -> Write this translation code.
>>>>>   -We know the .esg_files directory (a given)
>>>>>   -We know how to parse the catalog (use xml parser du jour)
>>>>>   -Translate the input file name, as a wget script would use it,
>>>>>    to the actual physical filesystem filename.
>>>>>   -Put this in a filter for tomcat in front of a catoon service
>>>>>   -Create a shell for esg... a simple read-eval-print loop that calls
>>>>>    the translator when it is in git directories with catalog-looking
>>>>>    files and .esg_files directories, to show you a filesystem-looking
>>>>>    "ls" but for esg-files.
>>>>>
>>>>>
>>>>> In my "spare" time I would love to write an ESG shell such that when
>>>>> you load the esg shell it will be able to do ls like a traditional OS's
>>>>> ls, using this translation code to show you the files that live there
>>>>> in the right version context.... I don't have a lot of spare time
>>>>> right about now. :-(
>>>>>
>>>>> This catalog centric modeling of the system has been a model I have
>>>>> pushed for months now.  I feel like Cassandra :-).
>>>>>
>>>>> Thanks for listening.
>>>>>
>>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>>> Hi Bob,
>>>>>>
>>>>>> Thanks for promptly commenting on the document.  Clarifying that the
>>>>>> publisher has these features is great news and I'm sorry that, in
>>>>>> trying to give everyone time to digest the document by Tuesday, I
>>>>>> didn't have time to confirm the facts with you.  I'm hoping this way
>>>>>> any errors will come out in the wash.
>>>>>>
>>>>>> The main thing I missed was the ability to create multiple THREDDS
>>>>>> catalogues for a dataset (or 1 catalogue per dataset version).
>>>>>> Omitting this feature felt like a fundamental difference in model
>>>>>> to the DRS.
>>>>>> I need to work out how to do this now and I'll revise the version
>>>>>> directory structure document too.  Phil Bentley has recommended a
>>>>>> different structure that has some advantages so the document will
>>>>>> probably look very different next time.
>>>>>>
>>>>>> Incidentally, I'm increasingly impressed with the ESG publisher and
>>>>>> I'm really enjoying working with it.  The stuff you've done with
>>>>>> project handler plugins in the latest release strengthens my
>>>>>> impression that it is a tool we will be using for a long time.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>
>>>>>> ---
>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>> British Atmospheric Data Centre
>>>>>> Rutherford Appleton Laboratory
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>> *From:* Bob Drach [mailto:drach1 at llnl.gov]
>>>>>> *Sent:* 16 April 2010 00:14
>>>>>> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
>>>>>> *Cc:* go-essp-tech at ucar.edu
>>>>>> *Subject:* Re: [Go-essp-tech] Proposed version directory structure
>>>>>> document
>>>>>>
>>>>>> Hi Stephen,
>>>>>>
>>>>>> Let me clarify a few points in the description of ESG Publisher:
>>>>>>
>>>>>> The document states: "ESG Publisher version system is built around
>>>>>> mutable datasets.  It does not attempt to maintain references to
>>>>>> previous data and the dataset version number is not part of the
>>>>>> dataset id unless the publisher is configured to include it from the
>>>>>> dataset metadata.  This means that it is not straight forward at this
>>>>>> time to publish multiple versions of an atomic dataset unless each
>>>>>> version is published as a separate dataset.  This approach would
>>>>>> effectively ignore ESG Publisher's version system and manage all
>>>>>> versions independently."
>>>>>>
>>>>>> - As of Version 2 the unit of publication is in fact a 'dataset
>>>>>> version', terminology that came out of the December meeting in
>>>>>> Boulder.  A dataset version is an immutable object which can represent
>>>>>> a 'DRS dataset including version number'. The published 'dataset
>>>>>> version' itself has an identifier which typically consists of
>>>>>> dataset_id+version number; this appears in the THREDDS catalog. As
>>>>>> you stated in the document, whether or not the published dataset
>>>>>> corresponds to a DRS dataset is a matter of publisher configuration,
>>>>>> not an inherent property of the publisher.
>>>>>>
>>>>>> - The node database does in fact maintain references to the
>>>>>> composition of previous dataset versions. It is possible to have
>>>>>> multiple versions published simultaneously, to list all published
>>>>>> versions of a dataset, and for any given dataset version the files
>>>>>> contained in that version can be listed.
>>>>>>
>>>>>> - The intention of the publisher design is to automate versioning as
>>>>>> much as possible. A 'dataset' is considered to be a collection of
>>>>>> dataset versions. Consequently, 'publishing a dataset' really means
>>>>>> 'publishing a dataset version where the version number is incremented
>>>>>> relative to the previous version.' Similarly, 'unpublishing' a
>>>>>> dataset by default unpublishes all versions of a dataset. The
>>>>>> terminology dataset_id#n can be used to refer to a specific version.
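
To illustrate the dataset_id#n convention, here is a small sketch that splits such a reference into its parts. The function is hypothetical and is not taken from the publisher's code; it only assumes that n is an integer and that a reference without "#n" means "all versions".

```python
def parse_dataset_ref(ref):
    """Split 'dataset_id#n' into (dataset_id, n).

    Returns (ref, None) when no '#n' suffix is present, which this sketch
    takes to mean every version of the dataset.
    """
    dataset_id, sep, version = ref.rpartition("#")
    if not sep:
        return ref, None
    return dataset_id, int(version)
```

For example, `parse_dataset_ref("some.dataset.id#2")` yields the id and the integer 2, while a bare id comes back with `None` as the version.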
>>>>>>
>>>>>>
>>>>>> In short, there is no fundamental mismatch between the DRS model and
>>>>>> the ESG publisher.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Bob
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk
>>>>>> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Attached is my view on how we should structure the archive to
>>>>>>> support multiple versions.  It divides into 2 main sections.  The
>>>>>>> first is a fairly lengthy summary of why this problem isn't solved
>>>>>>> yet, in terms of the differences between the ESG datanode software
>>>>>>> and the DRS document.  The second section lays out the proposed
>>>>>>> structure and how we would manage symbolic links and moving from
>>>>>>> one version to another.  I restrict myself to directories below the
>>>>>>> atomic dataset level.
>>>>>>>
>>>>>>> Lots of issues are left to resolve, in particular how the ESG
>>>>>>> publisher can make use of this structure.  I'll try to draw
>>>>>>> attention to these points in the agenda for Tuesday's telco, which
>>>>>>> will follow later today.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephen.
>>>>>>>
>>>>>>> ---
>>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>>> British Atmospheric Data Centre
>>>>>>> Rutherford Appleton Laboratory
>>>>>>>
>>>>>>>
>>>>>>> <ESGF_version_structure.odt>
>
>

--
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
               -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
_______________________________________________
GO-ESSP-TECH mailing list
GO-ESSP-TECH at ucar.edu
http://mailman.ucar.edu/mailman/listinfo/go-essp-tech