[Go-essp-tech] Proposed version directory structure document

Mon Apr 26 11:33:14 MDT 2010

Hi Stephen,

I hope you are well. :-)

Disclaimer:
Okay, let me get straight to it... :-)  Please pardon the length of this
response. I do tend to get a bit garrulous when I am trying to tackle
technical issues. :-\ Sorry.  And pardon the typos; I get a bit James
Joycean at times too.

(take a deep breath.... :-)

So, the catalog-centric model does not impose anything on the file
system.  You can move the catalogs as you would any file on the file
system.  Indeed there will be a tool made available (I think I am just
going to write it and be done with it) so that the physical constituent
files move along with the logical catalog as well.  There is no file
system lock-in at all; it's just a meta layer that essentially groups
files.

The source of catalog information is the catalog.  The database is a
nice tool for data manipulation and queries, but the ground truth is
the file that is the catalog, IMHO.  As long as the catalogs are
identical (checksums match), it doesn't matter where the catalog
comes from.  My proposed plan is that the data node manager will
provide the handshaking between/among data nodes.  This is something
that will be leveraged by the replication agent, and thus we can think
of catalogs as being eventually consistent.  The time to reach
consistency can be upper bounded (sort of) depending on the type of
event propagation protocol we want to use.  For the moment I am looking
at using a gossip protocol among nodes to address this issue,
piggybacking on GIT's distributed syncing.  The fact that we have a
single-editor system simplifies things greatly!!!
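
As an illustration of the "checksums match" test above, here is a
minimal Python sketch.  It assumes SHA-256 as the checksum (this thread
does not name an algorithm), and the function names and node names are
hypothetical; none of this is existing ESG code.

```python
import hashlib
from typing import Dict, List


def catalog_checksum(catalog_bytes: bytes) -> str:
    """Return a hex digest of the raw catalog file (SHA-256 is an assumption)."""
    return hashlib.sha256(catalog_bytes).hexdigest()


def same_catalog(a: bytes, b: bytes) -> bool:
    """Two catalogs count as identical when their checksums match."""
    return catalog_checksum(a) == catalog_checksum(b)


def stale_nodes(reference: bytes, node_digests: Dict[str, str]) -> List[str]:
    """Names of nodes whose advertised digest differs from the reference catalog."""
    want = catalog_checksum(reference)
    return sorted(name for name, digest in node_digests.items() if digest != want)
```

A gossip round could use something like stale_nodes() to decide which
peers still need the newer catalog; how digests are actually exchanged
is left open here.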

W.r.t. GIT, indeed there are limits. I guess what would be good is to
get a back-of-the-envelope calculation of how many catalogs there would
be per data node, just by order of magnitude.  I suspect that at the
catalog (aggregation) level we will be well in the safe zone for GIT.

As for the DRS, indeed you can lay out your filesystem exactly as
prescribed by the DRS, or you can relax that requirement by having
filters between the caller and storage that translate between the DRS
layout and the filesystem layout. I think it is a good idea to give
admins an 'out' so that it is not compulsory for the filesystem to look
like the DRS.  My thought has been that ground truth catalogs at the
data node will refer to files where they physically are on the machine.
When the catalog leaves the data node or is queried by a caller, these
exchanges use the canonical names for files as directed by the DRS.
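
To make the two-way translation concrete, here is a rough Python sketch
of a lookup table between canonical (DRS-style) names and physical
paths.  The class name and every path shown are hypothetical
placeholders, not real DRS paths, and a real filter would build the
table from the catalog rather than from a literal dict.

```python
from typing import Dict


class PathTranslator:
    """Two-way lookup between canonical names and on-disk paths (illustrative only)."""

    def __init__(self, canonical_to_physical: Dict[str, str]) -> None:
        self._to_physical = dict(canonical_to_physical)
        self._to_canonical = {p: c for c, p in self._to_physical.items()}
        # Two canonical names pointing at one physical file would make the
        # reverse lookup ambiguous, so reject that up front.
        if len(self._to_canonical) != len(self._to_physical):
            raise ValueError("each physical path must map to one canonical name")

    def to_physical(self, canonical: str) -> str:
        """Ingress: canonical name from a caller -> where the file really lives."""
        return self._to_physical[canonical]

    def to_canonical(self, physical: str) -> str:
        """Egress: physical path in the local catalog -> name the outside world sees."""
        return self._to_canonical[physical]
```

The local catalog keeps the physical paths; only what crosses the data
node boundary goes through to_canonical().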

The basic idea in the catalog-centric approach is to have a simple and
consistent model of "files" in our system.  The basic fact is that
semantically we aggregate files when we run our models and produce N
output files for a run.  We even go so far as having a catalog
describing these collections.  So why, on the other side of the
system (those wanting to consume files), do we then make them
deal with files individually?  I simply propose we stay consistent
all the way through. With regard to "fancy" tools (nice editorializing,
Stephen :-) - just messing with you)... they are not so much fancy as
simple automations that let you manipulate catalogs without
worrying about moving their constituent files.

So the layout I described was something like... have a /foo/catalog1
which will always have a /foo/catalog1/.datafiles directory that contains
the files stipulated in /foo/catalog1.  I would have a tool (I think I
am just going to write an esg-shell) that you would use to say mv
/foo/catalog1 to /bar/catalog1.  What the tool/shell would do is mv
/foo/catalog1 to /bar/catalog1 and then move all the files described in
/foo/catalog1 that reside in /foo/catalog1/.datafiles/* over to
/bar/catalog1/.datafiles/* as well.  If the "fancy" tool fails, one could
just read the catalog1 file, take the file entries and checksums,
look for them in .datafiles and move them over by hand, or have a
script do just that, which would be effectively equivalent to the
"fancy" tool.  The key is that where catalog1 lives is totally up to
the data node admin.
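
A minimal Python sketch of the move described above.  It assumes that
catalog1 is a directory holding one catalog file (the name catalog.xml
is made up) next to .datafiles, that the checksum is SHA-256, and that
the file entries have already been read out of the catalog into a dict;
all of these are assumptions for illustration, and the esg-shell itself
does not exist yet.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Dict

CATALOG_FILE = "catalog.xml"  # hypothetical name for the catalog file
DATA_DIR = ".datafiles"       # dot directory named in the message above


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_catalog(src: Path, dst: Path, entries: Dict[str, str]) -> None:
    """Move a catalog and its constituent files from src to dst.

    entries maps each file name listed in the catalog to its expected checksum.
    """
    src_data = src / DATA_DIR
    # Verify everything first, so a bad checksum leaves the source untouched.
    for name, expected in entries.items():
        if sha256_of(src_data / name) != expected:
            raise ValueError(f"checksum mismatch for {name}")
    dst_data = dst / DATA_DIR
    dst_data.mkdir(parents=True, exist_ok=False)
    shutil.move(str(src / CATALOG_FILE), str(dst / CATALOG_FILE))
    for name in entries:
        shutil.move(str(src_data / name), str(dst_data / name))
```

Calling move_catalog(Path("/foo/catalog1"), Path("/bar/catalog1"),
entries) is the scripted version of the by-hand fallback: look up each
entry in .datafiles, check it, and carry it over.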

Oh, and filters.  You can apply them to ingress and egress data in
Tomcat.  The intrepid data node admin can install a filter to
rewrite the catalog such that what the outside world sees are DRS paths
to files.  This makes everything else 'just work', i.e. wget scripts,
etc.  The additional technical wrinkle is that I would suggest having a
specific esg-filter class that said intrepid data node admin would
subclass to make their filter.  The added bit of functionality would be
to write the filter name and version into the catalog's mutable portion.
This way the catalog knows what filter was used to transform it, and thus
a filter factory can be set up on the fly to do the proper translations.
Yes, this means the data node admin would have to maintain their own
filters.  But that's fine... they are intrepid! For the less intrepid:
don't do any rewrite and make your filesystem match the DRS.
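
The subclass-plus-factory idea can be sketched as follows.  This is
Python rather than an actual Tomcat (Java servlet) filter, the catalog
is modelled as a plain dict instead of THREDDS XML, and every class,
key and path name is hypothetical; it only shows how recording the
filter name and version in the mutable portion lets a factory pick the
right filter later.

```python
from typing import Dict, Tuple, Type


class EsgFilter:
    """Base filter: leaves paths unchanged but stamps the catalog with its identity."""

    name = "identity"
    version = "1"

    def translate(self, path: str) -> str:
        return path

    def apply(self, catalog: dict) -> dict:
        """Return a rewritten copy; the input catalog is not modified."""
        out = dict(catalog)
        out["files"] = [self.translate(p) for p in catalog["files"]]
        out["mutable"] = dict(
            catalog.get("mutable", {}),
            filter_name=self.name,
            filter_version=self.version,
        )
        return out


_REGISTRY: Dict[Tuple[str, str], Type[EsgFilter]] = {}


def register(cls: Type[EsgFilter]) -> Type[EsgFilter]:
    """Make a filter class discoverable by (name, version)."""
    _REGISTRY[(cls.name, cls.version)] = cls
    return cls


register(EsgFilter)


@register
class PrefixFilter(EsgFilter):
    """Example admin-written filter: swap a physical root for a DRS-looking root."""

    name = "prefix"
    version = "1"
    physical_root = "/data/disk1"  # made-up local layout
    drs_root = "/drs"              # made-up public layout

    def translate(self, path: str) -> str:
        if path.startswith(self.physical_root):
            return self.drs_root + path[len(self.physical_root):]
        return path


def filter_for(catalog: dict) -> EsgFilter:
    """Factory: build the filter named in the catalog's mutable portion."""
    mutable = catalog.get("mutable", {})
    key = (
        mutable.get("filter_name", EsgFilter.name),
        mutable.get("filter_version", EsgFilter.version),
    )
    return _REGISTRY[key]()
```

A catalog with no stamp falls back to the identity filter, which
matches the "less intrepid" case where the filesystem already follows
the DRS.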

For tools/protocols that are not amenable to Tomcat filters, I would
suggest writing the filter code such that it can be loaded up as a
simple translation service.  We have the source for GridFTP, right, and
I believe it is written in Java. So I think, if programmed wisely, there
would only need to be a single filter that can be applied to every
ingress/egress.

I welcome discussing this further.  I will be the first to say that I do
not know much about the DRS and the details therein, but I think that
technically this is a surmountable issue.  Someone please educate me on
the semantics.  I don't yet know what I don't know. :-)

stephen.pascoe at stfc.ac.uk wrote:
> Hi Gavin,
> 
> Sorry it's taken me so long to respond to this.  It's a good point that
> we could version control catalogue information and then write tools to
> synchronise the catalogues with file versions.  I like the idea but I
> think it has far-reaching implications for the system as a whole.
> 
> There are a couple of reasons why I haven't embraced the "catalogue centric"
> approach so far.  First, the ESG datanode database already has all the
> information you'd put in catalogues.  The DRY principle suggests we
> should have only 1 source for catalogue information and I have assumed
> that is the database.  Now, the database has advantages and
> disadvantages: Bob's schema manages multiple versions but there is no
> mechanism for distributing version changes amongst datanodes, whereas
> tools like GIT would give us distributed version control of catalogues
> out of the box.  However, if we start version controlling catalogues we
> will end up with our catalogue information spread all over the place and
> we'll have to keep them all synchronised:
> 
>  1. In the ESG database
>  2. In the archive
>  3. In the THREDDS catalogue tree
> 
> Also, the reason I've stuck with symbolic links rather than tools to map
> to DRS paths is that there is an argument for keeping the on-disk layout
> as close to the DRS as possible so that there is a fallback to getting
> data if the fancy tools fail.  If you do this with symlinks you can
> always point an ftp server at the archive if all else fails.  I'm on the
> fence about whether this argument is worth the reduced flexibility.
> 
> We should also bear in mind that GIT has performance problems for both
> size of files and number of files per repository
> (http://stackoverflow.com/questions/984707/what-are-the-git-limits)
> 
> Cheers,
> Stephen.
> 
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
> 
> -----Original Message-----
> From: Gavin M Bell [mailto:gavin at llnl.gov] 
> Sent: 21 April 2010 00:15
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Proposed version directory structure
> document
> 
> Hey Stephen,
> 
> Great write up.  I read through it and it is really well thought out.
> As I mentioned on the call today it is pretty much exactly how GIT is
> designed, so you are in good company :-).
> 
> http://progit.org/book/ch1-3.html
> 
> The main issue I have is that we are thinking too low level.  We are
> thinking about filesystem files.  Again, I think we should be thinking
> of things at the "ESG FILE" level, i.e. the catalog level.  Imagine the
> following:
> 
> users download ESG FILES aka catalogs.  The catalogs have versions
> associated with them.  When the user downloads the catalog they get the
> physical catalog xml file as well as the files that come with it.  If
> they download a new version of the catalog that has two files out of 100
> different, they would only pull down the two new files.  How are files
> named to avoid collisions?  Easy, files are named in the scheme
> <filename>.<checksum> this information can be gleaned from inspecting
> the catalog that has both pieces of information.  Catalogs are version
> controlled in GIT (easy, they are just text xml files... perfect for
> version control).  Let's make it even more explicit that there are
> no filesystem files in the context of ESG by putting all the files in a
> dot directory.
> 
> Example:
> 
> I am in directory foo/bar (DRS dir hierarchy perhaps) we pull down
> catalog_alpha_v1.  When we do so we now have the following
> structure.
> 
> pwd   -> /root/foo/bar
> ls    -> catalog_alpha_v1
> 
> ls -a -> catalog_alpha_v1
>       -> .esg_files/file{1...n}.nc.<checksum>
> 
> See what I mean?
> 
> In GIT we tell GIT to ignore all .esg_files directories, thus only
> versioning the catalog files.
> 
> The implication of this is that we would have to build tools to do our
> own interrogation of the file system to give us the file.nc
> translations.  This tool will use the catalog at that directory level
> and be able to get directly at the files users want.
> 
> Furthermore, WGET will "just work" if we point wget to an HTTP URL that
> will have a filter applied to it that will do this interrogation and
> interpretation and fetch the files referenced to.  This is a tomcat
> filter (pretty straight forward to do).
> 
> This means, no linking, no extra anything at the OS file system level.
> The important files are versioned i.e. the catalogs.  And we can still
> use WGET scripts as long as they point to our translation web service,
> which consists pretty much only of a filter! :-).
> 
> As for the atomic data set thing... well they are represented already as
> aggregates in the catalog.  The only additional bit of information that
> we could add would be a version attribute.  The issues behind what the
> gateways read or don't read from the catalogs, I am confident will be
> surmounted, so that should not be a blocking issue to implementing this.
> 
> Things to do:
> -> Write this translation code.
>    -We know that .esg_files directory (a given)
>    -We know how to parse the catalog (use xml parser du jour)
>    -translate input file name as a wget script would use
>     to the actual physical filesystem filename.
>    -Put this in a filter for tomcat in front of a catoon service
>    -Create a shell for esg... simple read-eval-print loop that calls the
> translator when it is in git directories with catalog looking files and
> .esg_files directories to show you a filesystem looking "ls" but for
> esg-files.
> 
> 
> In my "spare" time I would love to write an ESG shell such that when you
> load the esg shell it will be able to do ls like traditional OS's ls
> using this translation code to show you the files that live there in the
> right version context.... I don't have a lot of spare time right about
> now. :-(
> 
> This catalog centric modeling of the system has been a model I have
> pushed for months now.  I feel like Cassandra :-).
> 
> Thanks for listening.
> 
> stephen.pascoe at stfc.ac.uk wrote:
>> Hi Bob,
>>  
>> Thanks for promptly commenting on the document.  Clarifying that the 
>> publisher has these features is great news and I'm sorry that, in 
>> trying to give everyone time to digest the document by Tuesday, I 
>> didn't have time to confirm the facts with you.  I'm hoping this way 
>> any errors will come out in the wash.
>>  
>> The main thing I missed was the ability to create multiple THREDDS 
>> catalogues for a dataset (or 1 catalogue per dataset version).  
>> Omitting this feature felt like a fundamental difference in model to
>> the DRS.
>> I need to work out how to do this now and I'll revise the version 
>> directory structure document too.  Phil Bentley has recommended a 
>> different structure that has some advantages so the document will 
>> probably look very different next time.
>>  
>> Incidentally, I'm increasingly impressed with the ESG publisher and 
>> I'm really enjoying working with it.  The stuff you've done with 
>> project handler plugins in the latest release strengthens my 
>> impression that it is a tool we will be using for a long time.
>>  
>> Cheers,
>> Stephen.
>>  
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>  
>>
>> ------------------------------------------------------------------------
>> *From:* Bob Drach [mailto:drach1 at llnl.gov]
>> *Sent:* 16 April 2010 00:14
>> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
>> *Cc:* go-essp-tech at ucar.edu
>> *Subject:* Re: [Go-essp-tech] Proposed version directory structure 
>> document
>>
>> Hi Stephen,
>>
>> Let me clarify a few points in the description of ESG Publisher:
>>
>> The document states: "ESG Publisher version system is built around 
>> mutable datasets.  It does not attempt to maintain references to 
>> previous data and the dataset version number is not part of the 
>> dataset id unless the publisher is configured to include it from the 
>> dataset metadata.  This means that it is not straight forward at this 
>> time to publish multiple versions of an atomic dataset unless each 
>> version is published as a separate dataset.  This approach would 
>> effectively ignore ESG Publisher's version system and manage all
>> versions independently."
>> - As of Version 2 the unit of publication is in fact a 'dataset
>> version', terminology that came out of the December meeting in
>> Boulder.
>> A dataset version is an immutable object which can represent a 'DRS 
>> dataset including version number'. The published 'dataset version'
>> itself has an identifier which typically consists of 
>> dataset_id+version number; this appears in the THREDDS catalog. As you
>> stated in the document, whether or not the published dataset
>> corresponds to a DRS dataset is a matter of publisher configuration, 
>> not an inherent property of the publisher.
>>
>> - The node database does in fact maintain references to the 
>> composition of previous dataset versions. It is possible to have 
>> multiple versions published simultaneously, to list all published 
>> versions of a dataset, and for any given dataset version the files 
>> contained in that version can be listed.
>>
>> - The intention of the publisher design is to automate versioning as 
>> much as possible. A 'dataset' is considered to be a collection of 
>> dataset versions. Consequently, 'publishing a dataset' really means 
>> 'publishing a dataset version where the version number is incremented 
>> relative to the previous version.' Similarly, 'unpublishing' a dataset
>> by default unpublishes all versions of a dataset. The terminology
>> dataset_id#n can be used to refer to a specific version.
>>
>>
>> In short, there is no fundamental mismatch between the DRS model and 
>> the ESG publisher.
>>
>>
>> Best regards,
>>
>>
>> Bob
>>
>>
>>
>>
>> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk 
>> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
>>
>>> Hi everyone,
>>>  
>>> Attached is my view on how we should structure the archive to support
>>> multiple versions.  It divides into 2 main sections, the first is a
>>> fairly lengthy summary of why this problem isn't solved yet in terms
>>> of the differences between the ESG datanode software and the DRS
>>> document.  The second section lays out the proposed structure and how
>>> we would manage symbolic links and moving from one version to
>>> another.  I restrict myself to directories below the atomic dataset 
>>> level.
>>>  
>>> Lots of issues are left to resolve, in particular how the ESG
>>> publisher can make use of this structure.  I'll try and draw
>>> attention to these points in the agenda for Tuesday's telco, which
>>> will follow later today.
>>>  
>>> Cheers,
>>> Stephen.
>>>  
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>  
>>>
>>> --
>>> Scanned by iCritical.
>>>
>>>
>>> <ESGF_version_structure.odt>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> 
> --
> Gavin M. Bell
> Lawrence Livermore National Labs
> --
> 
>  "Never mistake a clear view for a short distance."
>        	       -Paul Saffo
> 
> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
> 
>  A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E