[Go-essp-tech] Proposed version directory structure document

Mon Apr 26 07:28:53 MDT 2010

Hi Gavin,

Sorry it's taken me so long to respond to this.  It's a good point that
we could version control catalogue information and then write tools to
synchronise the catalogues with file versions.  I like the idea but I
think it has far-reaching implications for the system as a whole.

There are a couple of reasons why I haven't embraced the "catalogue centric"
approach so far.  First, the ESG datanode database already has all the
information you'd put in catalogues.  The DRY principle suggests we
should have only 1 source for catalogue information and I have assumed
that is the database.  Now, the database has advantages and
disadvantages: Bob's schema manages multiple versions but there is no
mechanism for distributing version changes amongst datanodes, whereas
tools like GIT would give us distributed version control of catalogues
out of the box.  However, if we start version controlling catalogues we
will end up with our catalogue information spread all over the place and
we'll have to keep them all synchronised:

 1. In the ESG database
 2. In the archive
 3. In the THREDDS catalogue tree

Also, the reason I've stuck with symbolic links rather than tools to map
to DRS paths is that there is an argument for keeping the on-disk layout
as close to the DRS as possible so that there is a fallback to getting
data if the fancy tools fail.  If you do this with symlinks you can
always point an ftp server at the archive if all else fails.  I'm on the
fence about whether this argument is worth the reduced flexibility.
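
To make the symlink fallback concrete, here is a minimal Python sketch.
It assumes a POSIX filesystem. The helper name link_drs_path and the
example paths are hypothetical, and the real layout is the one proposed
in the attached document, not this one. The idea shown is only that a
DRS-style path is a symlink to wherever the physical file really lives,
so a plain ftp or http server pointed at the archive still sees DRS paths.

```python
import os


def link_drs_path(archive_root, drs_relpath, physical_path):
    """Create or replace a symlink at archive_root/drs_relpath -> physical_path.

    Hypothetical helper: the names and layout are illustrative only.
    """
    link = os.path.join(archive_root, drs_relpath)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # Use a relative target so the tree still resolves if the archive
    # is mounted at a different location.
    target = os.path.relpath(physical_path, os.path.dirname(link))
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Renaming the temporary link over the old one swaps versions in one step.
    os.replace(tmp, link)
    return link
```

Moving from one version to another is then a second call with the same
DRS path and a new physical file.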

We should also bear in mind that GIT has performance problems for both
size of files and number of files per repository
(http://stackoverflow.com/questions/984707/what-are-the-git-limits).

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Gavin M Bell [mailto:gavin at llnl.gov] 
Sent: 21 April 2010 00:15
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Proposed version directory structure
document

Hey Stephen,

Great write up.  I read through it and it is really well thought out.
As I mentioned on the call today it is pretty much exactly how GIT is
designed, so you are in good company :-).

http://progit.org/book/ch1-3.html

The main issue I have is that we are thinking too low level.  We are
thinking about filesystem files.  Again, I think we should be thinking
of things at the "ESG FILE" level, i.e. the catalog level.  Imagine the
following:

Users download ESG FILES, aka catalogs.  The catalogs have versions
associated with them.  When the user downloads the catalog they get the
physical catalog xml file as well as the files that come with it.  If
they download a new version of the catalog that has two files out of 100
different, they would only pull down the two new files.  How are files
named to avoid collisions?  Easy: files are named using the scheme
<filename>.<checksum>.  This information can be gleaned from inspecting
the catalog, which has both pieces of information.  Catalogs are version
controlled in GIT (easy, they are just text xml files... perfect for
version control).  Let's make it even more explicit that there are no
filesystem files in the context of ESG by putting all the files in a
dot directory.
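
The "only pull down the two new files" step can be sketched in a few
lines of Python. This is an illustration under assumptions, not an
existing ESG tool: a catalog version is modelled as a plain dict from
logical file name to checksum, and the function names are made up.

```python
def stored_name(filename, checksum):
    """Physical name under the <filename>.<checksum> scheme."""
    return f"{filename}.{checksum}"


def files_to_fetch(old_catalog, new_catalog):
    """Return the stored names needed for new_catalog that old_catalog lacks.

    Both arguments map logical file name -> checksum (an assumed,
    simplified view of what a catalog holds).
    """
    have = {stored_name(n, c) for n, c in old_catalog.items()}
    need = {stored_name(n, c) for n, c in new_catalog.items()}
    return sorted(need - have)


# Usage: 100 files in v1, two of them changed in v2.
v1 = {f"file{i}.nc": f"sum{i}" for i in range(1, 101)}
v2 = dict(v1)
v2["file7.nc"] = "sum7b"
v2["file42.nc"] = "sum42b"
to_download = files_to_fetch(v1, v2)
```

Because the checksum is part of the stored name, the old and new copies
of a changed file can sit side by side without colliding.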

Example:

I am in directory foo/bar (DRS dir hierarchy perhaps) and we pull down
catalog_alpha_v1.  When we do so we now have the following
structure.

pwd   -> /root/foo/bar
ls    -> catalog_alpha_v1

ls -a -> catalog_alpha_v1
      -> .esg_files/file{1...n}.nc.<checksum>

See what I mean?

In GIT we tell GIT to ignore all .esg_files directories, thus only
versioning the catalog files.

The implication of this is that we would have to build tools to do our
own interrogation of the file system to give us the file.nc
translations.  This tool will use the catalog at that directory level
and be able to get directly at the files users want.
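
A minimal Python sketch of such a translation tool follows. The XML
layout used here (a <catalog> element containing <file name="..."
checksum="..."/> entries) is a simplified stand-in invented for the
example and is not the THREDDS catalog schema. The function names are
also hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def load_catalog(xml_text):
    """Parse the assumed, simplified catalog XML into {name: checksum}."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): f.get("checksum") for f in root.iter("file")}


def translate(directory, catalog, logical_name):
    """Map a logical name such as file1.nc to its physical path.

    Raises KeyError if the name is not part of this catalog version.
    """
    checksum = catalog[logical_name]
    return os.path.join(directory, ".esg_files", f"{logical_name}.{checksum}")


# Usage with a made-up catalog for the foo/bar example.
CATALOG_XML = """
<catalog name="catalog_alpha_v1">
  <file name="file1.nc" checksum="abc123"/>
  <file name="file2.nc" checksum="def456"/>
</catalog>
"""
catalog = load_catalog(CATALOG_XML)
path = translate("/root/foo/bar", catalog, "file1.nc")
```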

Furthermore, WGET will "just work" if we point wget to an HTTP URL that
has a filter applied to it that does this interrogation and
interpretation and fetches the files referred to.  This is a tomcat
filter (pretty straightforward to do).

This means, no linking, no extra anything at the OS file system level.
The important files are versioned i.e. the catalogs.  And we can still
use WGET scripts as long as they point to our translation web service,
which consists pretty much only of a filter! :-).
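
The proposal is a Tomcat (Java servlet) filter. As a language-neutral
illustration of the same request flow, here is a small Python WSGI
callable used as a stand-in, not a Tomcat filter. The name
make_translating_app and the dict-based catalog are assumptions made for
the sketch, and it reads the whole file into memory, which a real
service serving large data files would avoid by streaming.

```python
import os


def make_translating_app(directory, catalog):
    """Build a WSGI app that serves /<logical name> from .esg_files.

    catalog maps logical file name -> checksum (assumed, simplified).
    Only names present in the catalog are ever opened.
    """
    def app(environ, start_response):
        logical = environ.get("PATH_INFO", "").lstrip("/")
        checksum = catalog.get(logical)
        if checksum is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not in catalog\n"]
        physical = os.path.join(directory, ".esg_files",
                                f"{logical}.{checksum}")
        try:
            with open(physical, "rb") as fh:
                body = fh.read()
        except OSError:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"file missing on disk\n"]
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

A wget script would keep requesting plain names like file1.nc, and the
mapping to file1.nc.<checksum> would stay on the server side.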

As for the atomic data set thing... well they are represented already as
aggregates in the catalog.  The only additional bit of information that
we could add would be a version attribute.  The issues behind what the
gateways read or don't read from the catalogs, I am confident will be
surmounted, so that should not be a blocking issue to implementing this.

Things to do:
-> Write this translation code.
   -We know the .esg_files directory (a given)
   -We know how to parse the catalog (use xml parser du jour)
   -Translate the input file name, as a wget script would use it,
    to the actual physical filesystem filename.
   -Put this in a filter for tomcat in front of a catoon service
   -Create a shell for esg... a simple read-eval-print loop that calls
    the translator when it is in git directories with catalog-looking
    files and .esg_files directories, to show you a filesystem-looking
    "ls" but for esg-files.

In my "spare" time I would love to write an ESG shell such that when you
load the esg shell it will be able to do an ls like a traditional OS's ls,
using this translation code to show you the files that live there in the
right version context.... I don't have a lot of spare time right about
now. :-(
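
The core of such an "ls" could be quite small. The Python sketch below
is only a guess at the behaviour described: it lists the logical names
from the current catalog version whose <filename>.<checksum> file is
actually present in .esg_files. The function name esg_ls and the
dict-based catalog are assumptions for the example.

```python
import os


def esg_ls(directory, catalog):
    """List logical file names visible in this catalog version.

    catalog maps logical file name -> checksum (assumed, simplified).
    Files in .esg_files that belong to other versions are ignored.
    """
    store = os.path.join(directory, ".esg_files")
    present = set(os.listdir(store)) if os.path.isdir(store) else set()
    return sorted(
        name for name, checksum in catalog.items()
        if f"{name}.{checksum}" in present
    )
```

A read-eval-print loop would call this whenever the user types ls in a
directory that holds a catalog and a .esg_files directory.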

This catalog centric modeling of the system has been a model I have
pushed for months now.  I feel like Cassandra :-).

Thanks for listening.

stephen.pascoe at stfc.ac.uk wrote:
> Hi Bob,
>  
> Thanks for promptly commenting on the document.  Clarifying that the 
> publisher has these features is great news and I'm sorry that, in 
> trying to give everyone time to digest the document by Tuesday, I 
> didn't have time to confirm the facts with you.  I'm hoping this way 
> any errors will come out in the wash.
>  
> The main thing I missed was the ability to create multiple THREDDS 
> catalogues for a dataset (or 1 catalogue per dataset version).  
> Omitting this feature felt like a fundamental difference in model to 
> the DRS.  I need to work out how to do this now and I'll revise the 
> version 
> directory structure document too.  Phil Bentley has recommended a 
> different structure that has some advantages so the document will 
> probably look very different next time.
>  
> Incidentally, I'm increasingly impressed with the ESG publisher and 
> I'm really enjoying working with it.  The stuff you've done with 
> project handler plugins in the latest release strengthens my 
> impression that it is a tool we will be using for a long time.
>  
> Cheers,
> Stephen.
>  
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>  
> 
> ------------------------------------------------------------------------
> *From:* Bob Drach [mailto:drach1 at llnl.gov]
> *Sent:* 16 April 2010 00:14
> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
> *Cc:* go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] Proposed version directory structure 
> document
> 
> Hi Stephen,
> 
> Let me clarify a few points in the description of ESG Publisher:
> 
> The document states: "ESG Publisher version system is built around 
> mutable datasets.  It does not attempt to maintain references to 
> previous data and the dataset version number is not part of the 
> dataset id unless the publisher is configured to include it from the 
> dataset metadata.  This means that it is not straight forward at this 
> time to publish multiple versions of an atomic dataset unless each 
> version is published as a separate dataset.  This approach would 
> effectively ignore ESG Publisher's version system and manage all 
> versions independently."
> 
> - As of Version 2 the unit of publication is in fact a 'dataset 
> version', terminology that came out of the December meeting in 
> Boulder.
> A dataset version is an immutable object which can represent a 'DRS 
> dataset including version number'. The published 'dataset version'
> itself has an identifier which typically consists of 
> dataset_id+version number; this appears in the THREDDS catalog. As you 
> stated in the document, whether or not the published dataset 
> corresponds to a DRS dataset is a matter of publisher configuration, 
> not an inherent property of the publisher.
> 
> - The node database does in fact maintain references to the 
> composition of previous dataset versions. It is possible to have 
> multiple versions published simultaneously, to list all published 
> versions of a dataset, and for any given dataset version the files 
> contained in that version can be listed.
> 
> - The intention of the publisher design is to automate versioning as 
> much as possible. A 'dataset' is considered to be a collection of 
> dataset versions. Consequently, 'publishing a dataset' really means 
> 'publishing a dataset version where the version number is incremented 
> relative to the previous version.' Similarly, 'unpublishing' a dataset 
> by default unpublishes all versions of a dataset. The terminology 
> dataset_id#n can be used to refer to a specific version.
> 
> 
> In short, there is no fundamental mismatch between the DRS model and 
> the ESG publisher.
> 
> 
> Best regards,
> 
> 
> Bob
> 
> 
> 
> 
> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk 
> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
> 
>> Hi everyone,
>>  
>> Attached is my view on how we should structure the archive to support 
>> multiple versions.  It divides into 2 main sections, the first is a 
>> fairly lengthy summary of why this problem isn't solved yet in terms 
>> of the differences between the ESG datanode software and the DRS 
>> document.  The second section lays out the proposed structure and how 
>> we would manage symbolic links and moving from one version to 
>> another.  I restrict myself to directories below the atomic dataset 
>> level.
>>  
>> Lots of issues are left to resolve, in particular how the ESG 
>> publisher can make use of this structure.  I'll try and draw 
>> attention to these points in the agenda for Tuesday's telco, which 
>> will follow later today.
>>  
>> Cheers,
>> Stephen.
>>  
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>  
>>
>> --
>> Scanned by iCritical.
>>
>>
>> <ESGF_version_structure.odt>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu> 
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> 
> 

--
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E