[Go-essp-tech] Proposed version directory structure document

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Tue May 4 09:25:37 MDT 2010


Hi Gavin,

OK -- but I think we do need to look at the differences between the design goals of a DVCS and what we want to achieve. I don't like the idea of having to batch up files for a commit -- we will have enough operational problems as it is. 

cheers,
Martin


-----Original Message-----
From: Gavin M Bell [mailto:gavin at llnl.gov]
Sent: Tue 04/05/2010 15:53
To: Juckes, Martin (STFC,RAL,SSTD)
Cc: Pascoe, Stephen (STFC,RAL,SSTD); go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Proposed version directory structure document
 
Hi...

:-)

"GIT is not very efficient at providing versions for individual
deposited files"

I cannot speak directly to efficiency as that is context dependent.
What may be deemed inefficient may be plenty efficient for our needs.
Also, talk of efficiency is more of an optimization issue.  And though I
really really love GIT, perhaps there is another DVCS that better fits
the bill (though I can't think of one) ;-).  If this is coded properly,
we should be able to be DVCS agnostic and use them specifically for
their version storage / tracking capabilities - per node.

"We may end up with hundreds of archive versions because of updates
occurring at all the independent sites"

In the context of a single catalog, the only issue we face is efficient
dissemination of the singular new catalog posted by the publisher.  As
for many, many catalogs being updated throughout the federation, this
again is an optimization problem that may be solved by batching up
files to do single commits of many files, etc.

I think it is worth a shot. :-) If nothing else, it would be educational
and give us concrete things to say about this kind of a system. Lemme go
fire up Emacs ;-).

Be well...


martin.juckes at stfc.ac.uk wrote:
> Hi Gavin,
> 
> thanks -- I'm reasonably convinced that this is a good way of dealing with synchronisation problems, as you describe. This gives us a well defined version system for the archive as a whole, and we already have versions on individual files -- so this covers the archive management versioning requirements.  
> 
> The problem I'm trying to raise is more to do with how scientific users keep track of changes to the data they have extracted from the archive, which brings some additional versioning requirements. In particular, we need to be able to provide a version for each published dataset (which I think means for each THREDDS catalogue, which in turn means for each file deposited in GIT). I'm sure this can be done, but I noticed a comment on the GIT Wikipedia page that GIT is not very efficient at providing versions for individual deposited files (i.e. THREDDS catalogues). We may end up with hundreds of archive versions because of updates occurring at all the independent sites, and if GIT has to trawl through all these to identify changes to a particular THREDDS catalogue, we may find we are struggling to provide answers to what might be a common query,
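Martin's worry can be made concrete with a toy model: GIT stores whole-tree snapshots, so answering "which versions has this one catalogue been through?" means walking every commit. This is an illustrative sketch only (made-up data, not GIT internals):

```python
def file_history(commits, path):
    """Return ids of commits that changed `path`, newest first.

    `commits` is a newest-first list of (commit_id, {path: checksum})
    whole-tree snapshots.  A commit changed `path` iff the checksum
    differs from the next-older snapshot -- which is why the whole
    history must be scanned to list a single file's versions.
    """
    history = []
    for i, (cid, tree) in enumerate(commits):
        older = commits[i + 1][1] if i + 1 < len(commits) else {}
        if tree.get(path) != older.get(path):
            history.append(cid)
    return history

commits = [  # newest first; each commit snapshots *all* catalogues
    ("c3", {"catA.xml": "v2", "catB.xml": "v1"}),
    ("c2", {"catA.xml": "v1", "catB.xml": "v1"}),
    ("c1", {"catA.xml": "v1"}),
]
```

Only c3 and c1 touched catA.xml, but discovering that required inspecting c2 as well; with hundreds of archive versions the scan grows accordingly.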
> 
> cheers,
> Martin 
> 
> 
> -----Original Message-----
> From: Gavin M Bell [mailto:gavin at llnl.gov]
> Sent: Mon 03/05/2010 20:48
> To: Juckes, Martin (STFC,RAL,SSTD)
> Cc: Pascoe, Stephen (STFC,RAL,SSTD); go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Proposed version directory structure document
>  
> Hi,
> 
> Because of the way our system is designed, as I mentioned before, single
> editor single publisher. There is no merging and there is indeed ONE
> person that has the ground truth of the *latest* version.
> 
> So entertain the following scenario...
> 
> From the system-side of things: (at a 15000 ft level)
> 
> Phase 1: "The handshake: What's new?"
> Domain A wants to make sure it has the latest *catalogs* from Domain B,
> for the *catalogs* it cares about.  Domain A asks Domain B for the
> latest versions of these *catalogs*.  Domain B provides this list of
> latest versions of said *catalogs* back to Domain A iff there is a
> latest version to get. Domain A initiates a pull from Domain B of these
> newest catalog files.
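The Phase 1 handshake can be sketched as a simple version comparison; the names and version-number scheme are assumptions, and the real exchange would ride the node-to-node protocol:

```python
def catalogs_to_pull(mine, theirs, interesting):
    """Catalogs (among those Domain A cares about) where Domain B is ahead.

    `mine` and `theirs` map catalog name -> monotonically increasing
    version number; absence counts as version 0.
    """
    return [c for c in interesting if theirs.get(c, 0) > mine.get(c, 0)]

# Domain A's view vs. Domain B's view:
mine   = {"catA": 3, "catB": 7}
theirs = {"catA": 4, "catB": 7, "catC": 1}
```

Here a pull is initiated only for catA: B has nothing newer for catB, and catC isn't among the catalogs A cares about.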
> 
> Phase 2: "The 'realization' of these catalogs"
> (this is where the catalog centric model's rubber hits the road)
> This realization is basically the inspection of the catalog and
> reconciling the data files that the catalog has in its list with what is
> on its file system.  This goes basically back to the email I sent
> at the beginning of this thread.  The system reconciles what it needs to
> make the current 'latest' catalog "true".  There is a list of files that
> fall out of this that can (will) be fed to the fetching mechanism (BDM
> in our case) to pull down these files into the prescribed directory. I
> suggested a .esg_data_files directory at the same level as the catalog,
> where its files are kept.  This concept of dealing with catalogs for
> versioning is separate from the transfer of the datafiles that
> constitute them. Getting the catalog to be "true" is out of band of any
> version control system, though the catalogs themselves are version
> controlled.
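A minimal sketch of that reconciliation step (the field and directory conventions here are assumptions, not settled ESG design):

```python
def files_to_fetch(catalog_entries, on_disk):
    """Decide what the fetcher (BDM) must pull to make the catalog 'true'.

    catalog_entries: (filename, checksum) pairs parsed from the catalog.
    on_disk: set of names already present in .esg_data_files, stored
    under a <filename>.<checksum> convention.
    """
    return [(name, csum) for name, csum in catalog_entries
            if "%s.%s" % (name, csum) not in on_disk]
```

Anything the catalog lists but the directory lacks falls out as the fetch list; everything already present is left alone.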
> 
> This is also my proposal for how replication should be done.
> 
> This scenario begs a bootstrapping question for what happens at "phase
> 0" to support the initial contact between Domain A and Domain B... At
> the moment it could be a potentially charged issue, so I am punting on that.
> 
> Anyway, I hope my attempt illustrated the separation between where
> a DVCS would be applied (only with catalog xml files) and where data
> file fetching would be done (via another out of band mechanism - Ex:
> BDM).  I also hope I conveyed the elegance of this approach, as it speaks
> to both versioning and replication (and user copying/downloading).
> 
> Thoughts?
> 
> Perhaps I should prototype this and share it?
> 
> martin.juckes at stfc.ac.uk wrote:
>> Thanks, I think I understand a bit better now: the idea is to have a single GIT repository, and the files deposited in the repository are THREDDS catalogues? This does mean, I think, that the repository version is not much use to a user trying to track changes in a small subset of the THREDDS catalogues. There is a section in the GIT Wikipedia page ( http://en.wikipedia.org/wiki/Git_%28software%29 ) which suggests that GIT is quite inefficient at getting the version history of deposited files (i.e. THREDDS catalogues). I'm not sure this matters, but we need to check the consequences of the different use cases: GIT is clearly designed for the situation where users will generally be interested in a complete version of the archived package, whereas we are dealing with a case in which users will generally only want a small part of it.
>>
>>  
>> cheers,
>> Martin 
>>
>>
>> -----Original Message-----
>> From: Pascoe, Stephen (STFC,RAL,SSTD)
>> Sent: Sun 02/05/2010 16:22
>> To: Juckes, Martin (STFC,RAL,SSTD); Gavin M Bell
>> Cc: go-essp-tech at ucar.edu
>> Subject: RE: [Go-essp-tech] Proposed version directory structure document
>>  
>>
>> Martin,
>>
>> I think Gavin uses the term catalogue to mean an aggregation of files similar to a THREDDS catalogue, not the entire archive.  Each catalogue would represent a realm-dataset.  As you say, that's what we need.
>>
>> What is probably confusing is that Gavin suggests a single GIT repository holding all catalogues (the repository can be replicated throughout the system -- that's what makes GIT a distributed version control system).  He also discusses how catalogues would be mapped onto versions of files, which would need their own internal identifiers to make the system work.
>>
>> As far as I can see the user would be aware of only one type of "version".
>>
>> Stephen.
>>
>>
>> -----Original Message-----
>> From: Juckes, Martin (STFC,RAL,SSTD)
>> Sent: Sat 5/1/2010 6:43 PM
>> To: Gavin M Bell; Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: RE: [Go-essp-tech] Proposed version directory structure document
>>  
>>
>> Hello Gavin, Stephen,
>>
>> I haven't been following this discussion, so the following concern may well have been dealt with. It looks to me as though you are discussing a system with two levels of versioning: versions of individual files, and a version of the entire catalogue, which will increment every time any files are changed. This, I think, leaves too big a gap in which it is difficult for users to specify which set of files they have used. If someone uses a few thousand files, a change in the catalogue version doesn't tell him if these files have been changed, and listing all the file versions is not a useful option in publications and correspondence -- so we need versioning at intermediate levels such as published units and atomic datasets as well as at file and catalogue level,
>>
>> cheers,
>> Martin
>>
>> -----Original Message-----
>> From: go-essp-tech-bounces at ucar.edu on behalf of Gavin M Bell
>> Sent: Fri 30/04/2010 19:03
>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] Proposed version directory structure document
>>  
>> Hi Stephen,
>>
>> I am glad that my scheme is making you warm and fuzzy; I hope the rest
>> of the team is also on board.  I too am enamored with the simple
>> elegance of it, if I do say so myself. :-)
>>
>> The thing is... my scheme is essentially that we *only* version control
>> *catalogs* with a DVCS (GIT).  Catalogs are simple text xml files. Each of
>> those files is not that big, and the number of catalog files is well
>> within the supported range for GIT.  With respect to catalogs... there are no
>> BIG catalogs (files) in the context of anything that would be
>> prohibitive for vanilla GIT.  The key bit of niceness is that the esg
>> system is pushing off the storage and durability of the actual data
>> files (the big files) to the institutions themselves.  Between local
>> institution durability duties and replication we can be somewhat safe
>> that we won't 'lose' data. The only things the esg system will explicitly
>> version are the catalogs themselves, which in turn point to the specific
>> hard data files (netcdf) living on disk and replicated.
>>
>> In short; GIT with no additional bells and whistles should be able to
>> handle all our ESG catalogs.  Note: There is one GIT repo per datanode.
>>
>> ...
>>
>> Some Things To Think About:
>> There are some things that would need to be changed, like the catalog
>> naming scheme.  If catalogs are version controlled then we no longer
>> need to version files by explicitly naming them, i.e.
>> foo_catalog_v{1..n}. But we *should* have that version value put
>> *in* the file itself.  Thus quick inspection of the file can give you
>> the ESG version value, while the VCS sees a single filename entity to
>> version control.  (I'll have to talk to Bob on that one.)  Also, thus
>> far we are not using the "D" part of the DVCS.  In order to do so we
>> would have to a) flatten the file hierarchy (or at least settle on a
>> consistent one) - this would additionally facilitate the ability to
>> divorce the catalog placement from the filesystem hierarchy, and this is
>> where a simpler version of your link idea would come to bear - or b)
>> interrogate this ourselves (via code we write) as we do version
>> negotiating among federated entities.
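Putting the version value *in* the catalog could look like the following sketch; the esg_version attribute name is a placeholder, not an agreed schema (that's the detail to talk to Bob about):

```python
import xml.etree.ElementTree as ET

def set_version(xml_text, version):
    """Stamp the ESG version into the catalog's root element."""
    root = ET.fromstring(xml_text)
    root.set("esg_version", str(version))
    return ET.tostring(root, encoding="unicode")

def get_version(xml_text):
    """Quick inspection: read the version back out of the file."""
    return int(ET.fromstring(xml_text).get("esg_version"))

catalog = set_version("<catalog name='foo'/>", 3)
```

The VCS keeps seeing one stable filename, while get_version answers the "which ESG version is this?" question without consulting GIT at all.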
>>
>> I'd be happy to discuss this more.
>>
>>
>>
>> stephen.pascoe at stfc.ac.uk wrote:
>>> Hi Gavin,
>>>
>>> If we go down the DVCS-catalogue route you might be interested to note
>>> that Mercurial already has an extension that does something very similar
>>> to what you are proposing with GIT.
>>>
>>> http://mercurial.selenic.com/wiki/BigfilesExtension
>>>
>>> Maybe we need something more bespoke, but it's a useful reference.  The
>>> more I think about version-controlled catalogues the more it appears to
>>> solve some of our problems (particularly replication).
>>>
>>> S.
>>>
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>
>>> -----Original Message-----
>>> From: Gavin M Bell [mailto:gavin at llnl.gov] 
>>> Sent: 26 April 2010 18:33
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>> Hi Stephen,
>>>
>>> I hope you are well. :-)
>>>
>>> Disclaimer:
>>> Okay, let me get straight to it... :-)  Please pardon the length of this
>>> response. I do tend to get a bit garrulous when I am trying to tackle
>>> technical issues. :-\ sorry.  And pardon any typos; I get a bit James
>>> Joycean at times too.
>>>
>>> (take a deep breath.... :-)
>>>
>>> So, the catalog centric model does not put any implications on the file
>>> system.  You can move the catalogs as you would any file on the file
>>> system.  Indeed there will be a tool made available (I think I am just
>>> going to write it and be done with it) so that the physical constituent
>>> files move along with the logical catalog as well.  There is no file
>>> system lock in at all, it's just a meta layer to essentially group
>>> files.
>>>
>>> The source of catalog information is the catalog.  The database is a
>>> nice tool to do data manipulations and queries, but the ground truth is
>>> the file that is the catalog, IMHO.  As long as the catalogs are
>>> identical (checksums match) then it doesn't matter where the catalog
>>> comes from, ipso facto. It is my proposed plan that the data node manager
>>> will provide the handshaking between/among data nodes; this is something
>>> that will be leveraged by the replication agent, and thus we can somewhat
>>> think of catalogs as being eventually consistent.  This can be upper
>>> bounded (sort of) based on the type of event propagation protocols we
>>> want to use.  For the moment I am looking at using a gossip protocol
>>> among nodes for addressing this issue, piggybacking on GIT's distributed
>>> syncing.  The fact that we have a single editor system simplifies things
>>> greatly!!!
>>>
>>> W.r.t. GIT, indeed there are limits. I guess what would be good is to
>>> get a back of the envelope calculation of how many catalogs there would
>>> be per data node, just by order of magnitude.  I suspect at that catalog
>>> (aggregation) level we will be well in the safe zone for GIT.
>>>
>>> As for the DRS, indeed you can lay out your filesystem as prescribed by
>>> the DRS exactly, or you can relax that requirement by having filters in
>>> between the caller and storage that translate to and from the DRS
>>> layout and the filesystem layout. I think it is a good idea to give
>>> admins an 'out' from having it be compulsory that the filesystem looks
>>> like the DRS.  It has been my thought that ground truth catalogs at the
>>> data node will refer to files where they physically are on the machine.
>>> When the catalog leaves the datanode or is queried by a caller, these
>>> exchanges use the canonical name for files as directed by the DRS.
>>>
>>> The basic idea in the catalog centric approach is to have a simple and
>>> consistent model of "files" in our system.  The basic fact is that
>>> semantically we aggregate files when we run our models and produce N
>>> number of output files for a run.  We even go so far as having a catalog
>>> describing these collections.  So why, on the other side of the
>>> system (those wanting to consume files), do we then make them
>>> deal with files individually?  I simply propose we stay consistent
>>> all the way through. With regards to "fancy" tools (nice editorializing
>>> Stephen :-) - just messing with you)... they are not so fancy as they
>>> are simple automations that allow you to manipulate catalogs and not
>>> worry about moving their constituent files.
>>>
>>> So the layout I described was something like... have a /foo/catalog1
>>> which will always have a /foo/catalog1/.datafiles directory containing
>>> the files stipulated in /foo/catalog1.  I would have a tool (I think I
>>> am just going to write an esg-shell) that you would use to say mv
>>> /foo/catalog1 to /bar/catalog1.  What the tool/shell would do is mv
>>> /foo/catalog1 to /bar/catalog1 and then move all the files described in
>>> /foo/catalog1 that reside in /foo/catalog1/.datafiles/* over as well to
>>> /bar/catalog1/.datafiles/*.  If the "fancy" tool fails, one could just
>>> read the catalog1 file, read the file entries and the checksums, and
>>> look for them in .datafiles and move them over by hand, or have a
>>> script, which would be effectively equivalent to the "fancy" tool that
>>> does just that.  The key is that where catalog1 lives is totally up to
>>> the datanode admin.
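A rough sketch of that tool, reading /foo/catalog1 as a directory holding the catalog xml plus its .datafiles (one interpretation of the layout above, with made-up names):

```python
import os
import shutil
import tempfile

def esg_mv(src, dst):
    """Move a catalog directory so the constituent files travel with it."""
    if not os.path.isdir(os.path.join(src, ".datafiles")):
        raise ValueError("%s does not look like a catalog directory" % src)
    shutil.move(src, dst)

# tiny demonstration in a scratch directory
root = tempfile.mkdtemp()
src = os.path.join(root, "foo", "catalog1")
dst = os.path.join(root, "bar", "catalog1")
os.makedirs(os.path.join(src, ".datafiles"))
open(os.path.join(src, "catalog1.xml"), "w").close()
os.makedirs(os.path.join(root, "bar"))
esg_mv(src, dst)
moved_ok = (os.path.isfile(os.path.join(dst, "catalog1.xml"))
            and os.path.isdir(os.path.join(dst, ".datafiles")))
```

The manual fallback is exactly the same operation done by hand: read the catalog, find the listed files in .datafiles, and move them alongside it.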
>>>
>>> Oh, and filters.  You can apply them to ingress and egress data in
>>> tomcat.  For the intrepid data node admin, they can install a filter to
>>> rewrite the catalog such that what the outside world sees are DRS paths
>>> to files.  This makes everything else 'just work' i.e. wget scripts,
>>> etc.  The additional technical wrinkle is that I would suggest having a
>>> specific esg-filter class that said intrepid data node admin would
>>> subclass to make their filter.  The added bit of functionality would be
>>> to write the filter name and version into the catalog's mutable portion.
>>> This way the catalog knows what filter was used to transform it and thus
>>> a filter factory can be setup on the fly to do the proper translations.
>>>  Yes, this means the data node admin would have to maintain their own
>>> filters.  But that's fine... they are intrepid! For the less intrepid,
>>> don't do any rewrite and make your filesystem match DRS.
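The egress rewrite such a filter would perform can be sketched as a prefix translation; the mapping below is entirely illustrative:

```python
def to_drs(local_path, mapping):
    """Rewrite a node-local file path to its canonical DRS path on egress.

    `mapping` pairs a local storage prefix with the DRS prefix it
    corresponds to; an ingress filter would apply the inverse.
    """
    for local_prefix, drs_prefix in mapping.items():
        if local_path.startswith(local_prefix):
            return drs_prefix + local_path[len(local_prefix):]
    return local_path  # no rule: the filesystem already matches the DRS

mapping = {"/raid3/store/": "/cmip5/output/MOHC/"}
```

The less intrepid admin simply configures no rules, and paths pass through untouched.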
>>>
>>> For non-tomcat-filter-amenable tools/protocols, I would suggest writing
>>> the filter code such that it can be loaded up as a simple translation
>>> service.  We have the source for GridFTP, right? And I believe it is
>>> written in Java. So I think, if programmed wisely, there would only be a
>>> single filter that can be applied to every ingress/egress.
>>>
>>> I welcome discussing this further.  I will be the first to say that I do
>>> not know much about DRS and the details therein, but I think that
>>> technically this is a surmountable issue.  Someone please educate me on
>>> the semantics.  I don't yet know what I don't know. :-).
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>> Hi Gavin,
>>>>
>>>> Sorry it's taken me so long to respond to this.  It's a good point 
>>>> that we could version control catalogue information and then write 
>>>> tools to synchronise the catalogues with file versions.  I like the 
>>>> idea but I think it has far-reaching implications for the system as a
>>> whole.
>>>> There a couple of reasons why I haven't embraced the "catalogue
>>> centric"
>>>> approach so far.  First, the ESG datanode database already has all the
>>>> information you'd put in catalogues.  The DRY principle suggests we 
>>>> should have only 1 source for catalogue information and I have assumed
>>>> that is the database.  Now, the database has advantages and
>>>> disadvantages: Bob's schema manages multiple versions but there is no 
>>>> mechanism for distributing version changes amongst datanodes, whereas 
>>>> tools like GIT would give us distributed version control of catalogues
>>>> out of the box.  However, if we start version controlling catalogues 
>>>> we will end up with our catalogue information spread all over the 
>>>> place and we'll have to keep them all synchronised:
>>>>
>>>>  1. In the ESG database
>>>>  2. In the archive
>>>>  3. In the THREDDS catalogue tree
>>>>
>>>> Also, the reason I've stuck with symbolic links rather than tools to 
>>>> map to DRS paths is that there is an argument for keeping the on-disk 
>>>> layout as close to the DRS as possible so that there is a fallback to 
>>>> getting data if the fancy tools fail.  If you do this with symlinks 
>>>> you can always point an ftp server at the archive if all else fails.  
>>>> I'm on the fence about whether this argument is worth the reduced
>>> flexibility.
>>>> We should also bear in mind that GIT has performance problems for both
>>>> size of files and number of files per repository
>>>> (http://stackoverflow.com/questions/984707/what-are-the-git-limits)
>>>>
>>>> Cheers,
>>>> Stephen.
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> British Atmospheric Data Centre
>>>> Rutherford Appleton Laboratory
>>>>
>>>> -----Original Message-----
>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>> Sent: 21 April 2010 00:15
>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure 
>>>> document
>>>>
>>>> Hey Stephen,
>>>>
>>>> Great write up.  I read through it and it is really well thought out.
>>>> As I mentioned on the call today, it is pretty much exactly how GIT is
>>>> designed, so you are in good company :-).
>>>>
>>>> http://progit.org/book/ch1-3.html
>>>>
>>>> The main issue I have is that we are thinking too low level.  We are 
>>>> thinking about filesystem files.  Again, I think we should be thinking
>>>> of things at the "ESG FILE" level, i.e. the catalog level.  Imagine 
>>>> the
>>>> following:
>>>>
>>>> users download ESG FILES aka catalogs.  The catalogs have versions 
>>>> associated with them.  When the user downloads the catalog they get 
>>>> the physical catalog xml file as well as the files that come with it.
>>>> If they download a new version of the catalog that has two files out 
>>>> of 100 different, they would only pull down the two new files.  How 
>>>> are files named to avoid collisions?  Easy, files are named in the 
>>>> scheme <filename>.<checksum>; this information can be gleaned from 
>>>> inspecting the catalog, which has both pieces of information.  Catalogs 
>>>> are version controlled in GIT (easy, they are just text xml files... 
>>>> perfect for version control).  Let's make it even more explicit that
>>>> there are no filesystem files in the context of ESG by putting all 
>>>> the files in a dot directory.
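The <filename>.<checksum> scheme sketched in code (md5 chosen purely as an example digest; the real system would use whatever checksum the catalog records):

```python
import hashlib

def stored_name(filename, content):
    """Name a file in the dot directory by its content checksum.

    Two versions of the 'same' logical file get distinct on-disk names,
    so they can never collide.
    """
    return "%s.%s" % (filename, hashlib.md5(content).hexdigest())
```

Both pieces of the name come straight from the catalog entry, so translation never needs any state beyond the catalog itself.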
>>>>
>>>> Example:
>>>>
>>>> I am in directory foo/bar (DRS dir hierarchy perhaps); we pull down 
>>>> catalog_alpha_v1.  When we do so we now have the following 
>>>> structure.
>>>>
>>>> pwd   -> /root/foo/bar
>>>> ls    -> catalog_alpha_v1
>>>>
>>>> ls -a -> catalog_alpha_v1
>>>>       -> .esg_files/file{1...n}.nc.<checksum>
>>>>
>>>> See what I mean?
>>>>
>>>> In GIT we tell GIT to ignore all .esg_files directories, thus only 
>>>> versioning the catalog files.
>>>>
>>>> The implication of this is that we would have to build tools to do our
>>>> own interrogation of the file system to give us the file.nc 
>>>> translations.  This tool will use the catalog at that directory level 
>>>> and be able to get directly at the files users want.
>>>>
>>>> Furthermore, WGET will "just work" if we point wget to an HTTP URL 
>>>> that will have a filter applied to it that will do this interrogation 
>>>> and interpretation and fetch the files referenced to.  This is a 
>>>> tomcat filter (pretty straight forward to do).
>>>>
>>>> This means, no linking, no extra anything at the OS file system level.
>>>> The important files are versioned i.e. the catalogs.  And we can still
>>>> use WGET scripts as long as they point to our translation web service,
>>>> which consists pretty much only of a filter! :-).
>>>>
>>>> As for the atomic data set thing... well they are represented already 
>>>> as aggregates in the catalog.  The only additional bit of information 
>>>> that we could add would be a version attribute.  The issues behind 
>>>> what the gateways read or don't read from the catalogs, I am confident
>>>> will be surmounted, so that should not be a blocking issue to
>>> implementing this.
>>>> Things to do:
>>>> -> Write this translation code.
>>>>    -We know the .esg_files directory (a given)
>>>>    -We know how to parse the catalog (use the xml parser du jour)
>>>>    -translate the input file name as a wget script would use it
>>>>     to the actual physical filesystem filename.
>>>>    -Put this in a filter for tomcat in front of a catalog service
>>>>    -Create a shell for esg... simple read-eval-print loop that calls 
>>>> the translator when it is in git directories with catalog looking 
>>>> files and .esg_files directories to show you a filesystem looking "ls"
>>>> but for esg-files.
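The to-do list above boils down to a small translator; all names here are hypothetical:

```python
import os

def resolve(logical_name, catalog, base_dir):
    """Map the filename a wget script asks for to its physical location.

    `catalog` maps logical filename -> checksum (as parsed from the
    catalog xml at this directory level).
    """
    csum = catalog.get(logical_name)
    if csum is None:
        raise KeyError("not in catalog: %s" % logical_name)
    return os.path.join(base_dir, ".esg_files",
                        "%s.%s" % (logical_name, csum))
```

A tomcat filter (or the esg shell's "ls") would call exactly this kind of lookup before touching the filesystem.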
>>>>
>>>>
>>>> In my "spare" time I would love to write an ESG shell such that when 
>>>> you load the esg shell it will be able to do ls like a traditional OS's 
>>>> ls, using this translation code to show you the files that live there 
>>>> in the right version context.... I don't have a lot of spare time 
>>>> right about now. :-(
>>>>
>>>> This catalog centric modeling of the system has been a model I have 
>>>> pushed for months now.  I feel like Cassandra :-).
>>>>
>>>> Thanks for listening.
>>>>
>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>> Hi Bob,
>>>>>  
>>>>> Thanks for promptly commenting on the document.  Clarifying that the 
>>>>> publisher has these features is great news and I'm sorry that, in 
>>>>> trying to give everyone time to digest the document by Tuesday, I 
>>>>> didn't have time to confirm the facts with you.  I'm hoping this way 
>>>>> any errors will come out in the wash.
>>>>>  
>>>>> The main thing I missed was the ability to create multiple THREDDS 
>>>>> catalogues for a dataset (or 1 catalogue per dataset version).
>>>>> Omitting this feature felt like a fundamental difference in model from
>>>>> the DRS.
>>>>> I need to work out how to do this now and I'll revise the version 
>>>>> directory structure document too.  Phil Bentley has recommended a 
>>>>> different structure that has some advantages so the document will 
>>>>> probably look very different next time.
>>>>>  
>>>>> Incidentally, I'm increasingly impressed with the ESG publisher and 
>>>>> I'm really enjoying working with it.  The stuff you've done with 
>>>>> project handler plugins in the latest release strengthens my 
>>>>> impression that it is a tool we will be using for a long time.
>>>>>  
>>>>> Cheers,
>>>>> Stephen.
>>>>>  
>>>>> ---
>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>> British Atmospheric Data Centre
>>>>> Rutherford Appleton Laboratory
>>>>>  
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Bob Drach [mailto:drach1 at llnl.gov]
>>>>> *Sent:* 16 April 2010 00:14
>>>>> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
>>>>> *Cc:* go-essp-tech at ucar.edu
>>>>> *Subject:* Re: [Go-essp-tech] Proposed version directory structure 
>>>>> document
>>>>>
>>>>> Hi Stephen,
>>>>>
>>>>> Let me clarify a few points in the description of ESG Publisher:
>>>>>
>>>>> The document states: "ESG Publisher version system is built around 
>>>>> mutable datasets.  It does not attempt to maintain references to 
>>>>> previous data and the dataset version number is not part of the 
>>>>> dataset id unless the publisher is configured to include it from the 
>>>>> dataset metadata.  This means that it is not straight forward at this
>>>>> time to publish multiple versions of an atomic dataset unless each 
>>>>> version is published as a separate dataset.  This approach would 
>>>>> effectively ignore ESG Publisher's version system and manage all
>>>> versions independently."
>>>>> - As of Version 2 the unit of publication is in fact a 'dataset 
>>>>> version', terminology that came out of the December meeting in
>>>> Boulder.
>>>>> A dataset version is an immutable object which can represent a 'DRS 
>>>>> dataset including version number'. The published 'dataset version'
>>>>> itself has an identifier which typically consists of 
>>>>> dataset_id+version number; this appears in the THREDDS catalog. As 
>>>>> you
>>>>> stated in the document, whether or not the published dataset 
>>>>> corresponds to a DRS dataset is a matter of publisher configuration, 
>>>>> not an inherent property of the publisher.
>>>>>
>>>>> - The node database does in fact maintain references to the 
>>>>> composition of previous dataset versions. It is possible to have 
>>>>> multiple versions published simultaneously, to list all published 
>>>>> versions of a dataset, and for any given dataset version the files 
>>>>> contained in that version can be listed.
>>>>>
>>>>> - The intention of the publisher design is to automate versioning as 
>>>>> much as possible. A 'dataset' is considered to be a collection of 
>>>>> dataset versions. Consequently, 'publishing a dataset' really means 
>>>>> 'publishing a dataset version where the version number is incremented
>>>>> relative to the previous version.' Similarly, 'unpublishing' a 
>>>>> dataset
>>>>> by default unpublishes all versions of a dataset. The terminology 
>>>>> dataset_id#n can be used to refer to a specific version.
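The dataset_id#n notation parses trivially; a sketch only (not ESG publisher code, and the example id is made up):

```python
def parse_dataset_ref(ref):
    """Split 'dataset_id#n' into (dataset_id, version).

    A bare dataset_id returns version None, i.e. 'the dataset as a
    collection of versions' in the publisher's sense.
    """
    if "#" in ref:
        dataset_id, n = ref.rsplit("#", 1)
        return dataset_id, int(n)
    return ref, None
```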
>>>>>
>>>>>
>>>>> In short, there is no fundamental mismatch between the DRS model and 
>>>>> the ESG publisher.
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk 
>>>>> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>  
>>>>>> Attached is my view on how we should structure the archive to 
>>>>>> support
>>>>>> multiple versions.  It divides into 2 main sections, the first is a 
>>>>>> fairly lengthy summary of why this problem isn't solved yet in terms
>>>>>> of the differences between the ESG datanode software and the DRS 
>>>>>> document.  The second section lays out the proposed structure and 
>>>>>> how
>>>>>> we would manage symbolic links and moving from one version to 
>>>>>> another.  I restrict myself to directories below the atomic dataset 
>>>>>> level.
>>>>>>  
>>>>>> Lots of issues are left to resolve, in particular how the ESG 
>>>>>> publisher can make use of this structure.  I'll try and draw 
>>>>>> attention to these points in the agenda for Tuesday's telco, which
>>>>>> will follow later today.
>>>>>>  
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>  
>>>>>> ---
>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>> British Atmospheric Data Centre
>>>>>> Rutherford Appleton Laboratory
>>>>>>  
>>>>>>
>>>>>> --
>>>>>> Scanned by iCritical.
>>>>>>
>>>>>>
>>>>>> <ESGF_version_structure.odt>
>>>>>> _______________________________________________
>>>>>> GO-ESSP-TECH mailing list
>>>>>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu> 
>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> GO-ESSP-TECH mailing list
>>>>> GO-ESSP-TECH at ucar.edu
>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>> --
>>>> Gavin M. Bell
>>>> Lawrence Livermore National Labs
>>>> --
>>>>
>>>>  "Never mistake a clear view for a short distance."
>>>>        	       -Paul Saffo
>>>>
>>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>
>>>>  A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
> 




More information about the GO-ESSP-TECH mailing list