[Go-essp-tech] Proposed version directory structure document

Stephen Pascoe Stephen.Pascoe at stfc.ac.uk
Tue May 4 05:37:28 MDT 2010


Hi Gavin,

I have one concern, as I mentioned briefly before. As it stands the  
datanode produces THREDDS catalogues as output from either a directory  
scan or a map file: esgpublish does not take THREDDS catalogues as  
input. So where would the publisher fit into your replication scenario?

It is for this reason that I don't think it is obvious that our DVCS  
catalogues are the same as the THREDDS catalogues used to initialize  
TDS, although they are clearly prime candidates. For instance, even if  
we extended esgpublish to take THREDDS as input I doubt they would  
round-trip unchanged without care.

A prototype sounds like a very good idea.

Cheers,
Stephen.

--
Stephen Pascoe


On 3 May 2010, at 20:48, Gavin M Bell <gavin at llnl.gov> wrote:

> Hi,
>
> Because of the way our system is designed, as I mentioned before,  
> single
> editor single publisher. There is no merging and there is indeed ONE
> person that has the ground truth of the *latest* version.
>
> So entertain the following scenario...
>
> From the system-side of things: (at a 15000 ft level)
>
> Phase 1: "The handshake: What's new?"
> Domain A wants to make sure it has the latest *catalogs* from Domain  
> B,
> for the *catalogs* it cares about.  Domain A asks Domain B for the
> latest versions of these *catalogs*.  Domain B provides this list of
> latest versions of said *catalogs* back to Domain A iff there is a
> latest version to get. Domain A initiates a pull from Domain B of  
> these
> newest catalog files.
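A minimal sketch of this Phase 1 handshake (the data structures and integer version numbers are hypothetical; the real exchange would go over the wire):

```python
# Sketch of the Phase 1 handshake between two domains.
# Each domain tracks the latest version it holds per catalog name.

def handshake(domain_a, domain_b, catalogs_of_interest):
    """Return the (catalog, version) pairs Domain A should pull from Domain B."""
    to_pull = []
    for name in catalogs_of_interest:
        local = domain_a.get(name, 0)    # version A currently holds (0 = none)
        remote = domain_b.get(name, 0)   # latest version B advertises
        if remote > local:               # B answers iff there is a newer version
            to_pull.append((name, remote))
    return to_pull

a = {"cmip5.ocean.catalog.xml": 2, "cmip5.atmos.catalog.xml": 3}
b = {"cmip5.ocean.catalog.xml": 3, "cmip5.atmos.catalog.xml": 3}
print(handshake(a, b, a.keys()))  # -> [('cmip5.ocean.catalog.xml', 3)]
```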
>
> Phase 2: "The 'realization' of these catalogs"
> (this is where the catalog-centric model's rubber hits the road)
> This realization is basically the inspecting of the catalog and
> reconciling the data files that the catalog has in its list with
> what is on its file system.  This goes basically back to the email I sent
> at the beginning of this thread.  The system reconciles what it  
> needs to
> make the current 'latest' catalog "true".  There is a list of files  
> that
> fall out of this that can (will) be fed to the fetching mechanism (BDM
> in our case) to pull down these files into the prescribed directory. I
> suggested a .esg_data_files directory at the same level as the
> catalog, where its files are kept.  This concept of dealing with catalogs for
> versioning is separate from the transfer of the datafiles that
> constitute them. Getting the catalog to be "true" is out of band of  
> any
> version control system, though the catalogs themselves are version
> controlled.
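The Phase 2 reconciliation step described above can be sketched as follows (the checksum algorithm and the shape of the catalog entries are assumptions, not part of any agreed schema); the returned list is what would be handed to the fetching mechanism (BDM):

```python
import hashlib
import os

def files_to_fetch(catalog_entries, data_dir):
    """Given (filename, checksum) pairs listed in a catalog, return the
    entries that are missing or mismatched on the local filesystem.
    Checksum algorithm (md5) is an assumption for illustration."""
    missing = []
    for name, checksum in catalog_entries:
        path = os.path.join(data_dir, name)
        if not os.path.exists(path):
            missing.append((name, checksum))
            continue
        with open(path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != checksum:
                missing.append((name, checksum))
    return missing
```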
>
> This is also my proposal for how replication should be done.
>
> This scenario begs a bootstrapping question for what happens at "phase
> 0" to support the initial contact between Domain A and Domain B... At
> the moment it could be a potentially charged issue, so I am punting  
> on that.
>
> Anyway, I hope my attempt to illustrate the separation between where
> a DVCS would be applied (only to catalog xml files) and where data
> file fetching would be done (via another out-of-band mechanism, e.g.
> BDM) was clear.  I also hope I conveyed the elegance of this approach,
> as it speaks to both versioning and replication (and user
> copying/downloading).
>
> Thoughts?
>
> Perhaps I should prototype this and share it?
>
> martin.juckes at stfc.ac.uk wrote:
>> Thanks, I think I understand a bit better now: the idea is to have  
>> a single GIT repository, and the files deposited in the repository  
>> are THREDDS catalogues? This does mean, I think, that the  
>> repository version is not much use to a user trying to track  
>> changes in a small subset of the THREDDS catalogues. There is a  
>> section in the GIT wikipedia page ( http://en.wikipedia.org/wiki/Git_%28software%29 
>>  ) which suggests that GIT is quite inefficient at getting the  
>> version history of deposited files (i.e. THREDDS catalogues). I'm  
>> not sure this matters, but we need to check the consequences of the  
>> different use cases: GIT is clearly designed for the situation  
>> where users will generally be interested in a complete version of  
>> the archived package, whereas we are dealing with a case in which  
>> users will generally only want a small part of it.
>>
>>
>> cheers,
>> Martin
>>
>>
>> -----Original Message-----
>> From: Pascoe, Stephen (STFC,RAL,SSTD)
>> Sent: Sun 02/05/2010 16:22
>> To: Juckes, Martin (STFC,RAL,SSTD); Gavin M Bell
>> Cc: go-essp-tech at ucar.edu
>> Subject: RE: [Go-essp-tech] Proposed version directory structure  
>> document
>>
>>
>> Martin,
>>
>> I think Gavin uses the term catalogue to mean an aggregation of  
>> files similar to a THREDDS catalogue, not the entire archive.  Each  
>> catalogue would represent a realm-dataset.  As you say, that's what  
>> we need.
>>
>> What is probably confusing is that Gavin suggests a single GIT  
>> repository holding all catalogues (the repository can be replicated  
>> throughout the system -- that's what makes GIT a distributed  
>> version control system).  He also discusses how catalogues would be  
>> mapped onto versions of files, which would need their own internal  
>> identifiers to make the system work.
>>
>> As far as I can see the user would be aware of only one type of  
>> "version".
>>
>> Stephen.
>>
>>
>> -----Original Message-----
>> From: Juckes, Martin (STFC,RAL,SSTD)
>> Sent: Sat 5/1/2010 6:43 PM
>> To: Gavin M Bell; Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: RE: [Go-essp-tech] Proposed version directory structure  
>> document
>>
>>
>> Hello Gavin, Stephen,
>>
>> I haven't been following this discussion, so the following concern  
>> may well have been dealt with. It looks to me as though you are  
>> discussing a system with two levels of versioning: versions of  
>> individual files, and a version of the entire catalogue, which will  
>> increment every time any files are changed. This, I think, leaves  
>> too big a gap in which it is difficult for users to specify which  
>> set of files they have used. If someone uses a few thousand files,  
>> a change in the catalogue version doesn't tell him if these files  
>> have been changed, and listing all the file versions is not a  
>> useful option in publications and correspondence -- so we need  
>> versioning at intermediate levels such as published units and  
>> atomic datasets as well as at file and catalogue level,
>>
>> cheers,
>> Martin
>>
>> -----Original Message-----
>> From: go-essp-tech-bounces at ucar.edu on behalf of Gavin M Bell
>> Sent: Fri 30/04/2010 19:03
>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] Proposed version directory structure  
>> document
>>
>> Hi Stephen,
>>
>> I am glad that my scheme is making you warm and fuzzy, I hope the  
>> rest
>> of the team is also on board.  I too am enamored with the simple
>> elegance of it, if I do say so myself. :-)
>>
>> The thing is... my scheme is essentially that we *only* version  
>> control
>> with DVCS (GIT) *catalogs*.  Catalogs are simple text xml files.  
>> Each of those files is not that big, and the number of catalog
>> files is well within the supported range for GIT.  With respect to
>> catalogs... there are no
>> BIG catalogs (files) in the context of anything that would be
>> prohibitive for vanilla GIT.  The key bit of niceness is that the esg
>> system is pushing off the storage and durability of the actual data
>> files (the big files) to the institutions themselves.  Between local
>> institution durability duties and replication we can be somewhat safe
>> that we won't 'lose' data. The only thing the esg system will  
>> explicitly
>> version are the catalogs themselves, which in turn point to the specific
>> hard data files (netcdf) living on disk and replicated.
>>
>> In short; GIT with no additional bells and whistles should be able to
>> handle all our ESG catalogs.  Note: There is one GIT repo per  
>> datanode.
>>
>> ...
>>
>> Some Things To Think About:
>> There are some things that would need to be changed, like the
>> catalog naming scheme.  If catalogs are version controlled then we no longer
>> need to version files by explicitly naming them i.e.
>> foo_catalog_v{1..n}. But, we *should* have that version value be put
>> *in* the file itself.  Thus quick inspection of the file can give you
>> the ESG version value, while the VCS sees a single filename entity to
>> version control.  (I'll have to talk to Bob on that one.)  Also, thus
>> far we are not using the "D" part of the VCS.  In order to do so we
>> would have to a) flatten the file hierarchy (or at least settle on a
>> consistent one) this would additionally facilitate the ability to
>> divorce the catalog placement from the filesystem hierarchy - this is
>> where a simpler version of your link idea would come to bear - or b)
>> interrogate this ourselves (via code we write) as we do version
>> negotiating among federated entities.
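Putting the version value *in* the file, as suggested above, could look like this quick inspection (the `esg_version` attribute name is purely hypothetical; the actual schema would be Bob's call):

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog carrying the ESG version inside the file itself,
# so the VCS sees one stable filename while ESG reads the version out.
catalog_xml = """<catalog name="foo_catalog" esg_version="4">
  <dataset name="foo.v4"/>
</catalog>"""

root = ET.fromstring(catalog_xml)
print(root.get("esg_version"))  # -> 4
```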
>>
>> I'd be happy to discuss this more.
>>
>>
>>
>> stephen.pascoe at stfc.ac.uk wrote:
>>> Hi Gavin,
>>>
>>> If we go down the DVCS-catalogue route you might be interested to  
>>> note
>>> that Mercurial already has an extension that does something very  
>>> similar
>>> to what you are proposing with GIT.
>>>
>>> http://mercurial.selenic.com/wiki/BigfilesExtension
>>>
>>> Maybe we need something more bespoke, but it's a useful  
>>> reference.  The
>>> more I think about version-controlled catalogues the more it  
>>> appears to
>>> solve some of our problems (particularly replication).
>>>
>>> S.
>>>
>>>
>>> ---
>>> Stephen Pascoe  +44 (0)1235 445980
>>> British Atmospheric Data Centre
>>> Rutherford Appleton Laboratory
>>>
>>> -----Original Message-----
>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>> Sent: 26 April 2010 18:33
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>> document
>>>
>>> Hi Stephen,
>>>
>>> I hope you are well. :-)
>>>
>>> Disclaimer:
>>> Okay, let me get straight at it... :-)  Please pardon the length  
>>> of this
>>> response. I do tend to get a bit garrulous when I am trying to  
>>> tackle
>>> technical issues. :-\ sorry.  And pardon typos; I get a bit James
>>> Joycean at times too.
>>>
>>> (take a deep breath.... :-)
>>>
>>> So, the catalog-centric model does not impose any constraints on the
>>> file system.  You can move the catalogs as you would any file on the file
>>> system.  Indeed there will be a tool made available (I think I am  
>>> just
>>> going to write it and be done with it) so that the physical  
>>> constituent
>>> files move along with the logical catalog as well.  There is no file
>>> system lock in at all, it's just a meta layer to essentially group
>>> files.
>>>
>>> The source of catalog information is the catalog.  The database is a
>>> nice tool to do data manipulations and queries but the ground  
>>> truth is
>>> the file that is the catalog IMHO.  As long as the catalogs are
>>> identical (checksums match) then it doesn't matter where the catalog
>>> comes from, ipso facto. It is my proposed plan that the data node manager
>>> will provide the handshaking between/among data nodes; this is
>>> something that will be leveraged by the replication agent, and thus
>>> we can somewhat think of catalogs as being eventually consistent.
>>> This can be upper bounded (sort of) based on the type of event
>>> propagation protocols we want to use.  For the moment I am looking
>>> at using a gossip protocol among nodes for addressing this issue,
>>> piggybacking on GIT's distributed syncing.  The fact that we have a
>>> single-editor system simplifies things greatly!!!
>>>
>>> W.r.t. GIT, indeed there are limits. I guess what would be good is
>>> to get a back-of-the-envelope calculation of how many catalogs there
>>> would be per data node, just by order of magnitude.  I suspect at
>>> that catalog (aggregation) level we will be well within the safe
>>> zone for GIT.
>>>
>>> As for the DRS, indeed you can layout your filesystem as  
>>> prescribed by
>>> the DRS exactly, or you can relax that requirement by having  
>>> filters in
>>> between the caller and storage that will translate to and fro the  
>>> DRS
>>> layout and the filesystem layout. I think it is a good idea to give
>>> admins an 'out' from having it be compulsory that the filesystem  
>>> looks
>>> like DRS.  It has been my thought that ground truth catalogs at  
>>> the data
>>> node will refer to files where they are physically on the  
>>> machine.  When
>>> the catalog leaves the datanode or is queried from a caller, these
>>> exchanges use the canonical name for files as directed by the DRS.
>>>
>>> The basic idea in the catalog centric approach is to have a simple  
>>> and
>>> consistent model of "files" in our system.  The basic fact is that
>>> semantically we aggregate files when we run our models and produce N
>>> output files per run.  We even go so far as having a catalog
>>> describing these collections.  So why, on the other side of the
>>> system (those wanting to consume files), do we then make them deal
>>> with files individually?  So I simply propose we stay consistent
>>> all the way through. With regard to "fancy" tools (nice editorializing,
>>> Stephen :-) - just messing with you)... they are not so fancy as they
>>> are simple automations that allow you to manipulate catalogs and not
>>> worry about moving their constituent files.
>>>
>>> So the layout I described was something like... have a /foo/catalog1
>>> which will always have a /foo/catalog1/.datafiles directory that
>>> contains the files stipulated in /foo/catalog1.  I would have a tool
>>> (I think I am just going to write an esg-shell) that you would use
>>> to say mv /foo/catalog1 to /bar/catalog1.  What the tool/shell would
>>> do is mv /foo/catalog1 to /bar/catalog1 and then move all the files
>>> described in /foo/catalog1 that reside in /foo/catalog1/.datafiles/*
>>> over as well to
>>> /bar/catalog1/.datafiles/*.  If the "fancy" tool fails, one could  
>>> just
>>> read the catalog1 file and read the file entries and the checksums  
>>> and
>>> look for them in .datafiles and move them over by hand, or have a
>>> script, which would be effectively equivalent to the "fancy" tool  
>>> that
>>> does just that.  The key is that where catalog1 lives is totally  
>>> up to
>>> the datanode admin.
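A minimal sketch of that move tool, assuming the catalog is a file with its data files in a hidden sibling directory (the exact placement of the data directory is still an open question in this thread):

```python
import os
import shutil

def esg_mv(src_catalog, dst_catalog):
    """Move a catalog and its constituent data files together.
    Assumes data files live in a hidden '<catalog>.datafiles' directory
    next to the catalog -- one of the layouts discussed, not a settled
    convention."""
    shutil.move(src_catalog, dst_catalog)
    src_data = src_catalog + ".datafiles"
    if os.path.isdir(src_data):
        shutil.move(src_data, dst_catalog + ".datafiles")
```

If the tool fails, one can still fall back to reading the catalog and moving the listed files by hand, exactly as described above.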
>>>
>>> Oh, and filters.  You can apply them to ingress and egress data in
>>> tomcat.  An intrepid data node admin can install a
>>> filter to
>>> rewrite the catalog such that what the outside world sees are DRS  
>>> paths
>>> to files.  This makes everything else 'just work' i.e. wget scripts,
>>> etc.  The additional technical wrinkle is that I would suggest  
>>> having a
>>> specific esg-filter class that said intrepid data node admin would
>>> subclass to make their filter.  The added bit of functionality  
>>> would be
>>> to write the filter name and version into the catalog's mutable  
>>> portion.
>>> This way the catalog knows what filter was used to transform it  
>>> and thus
>>> a filter factory can be setup on the fly to do the proper  
>>> translations.
>>> Yes, this means the data node admin would have to maintain their own
>>> filters.  But that's fine... they are intrepid! For the less  
>>> intrepid,
>>> don't do any rewrite and make your filesystem match DRS.
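The core of the rewrite such a filter would perform might look like this (the servlet plumbing is omitted, and the local paths, DRS paths, and mapping table are all made up for illustration):

```python
# Egress rewrite: map on-disk file paths in a catalog to canonical
# DRS-style paths before the catalog leaves the datanode.
# Both paths below are invented examples, not real DRS identifiers.
LOCAL_TO_DRS = {
    "/data/archive/run42/tas.nc":
        "/cmip5/output/MOHC/HadGEM2/historical/mon/atmos/tas/r1i1p1/tas.nc",
}

def rewrite_catalog(text):
    """Replace every known local path with its canonical DRS path."""
    for local, drs in LOCAL_TO_DRS.items():
        text = text.replace(local, drs)
    return text
```

An ingress filter would apply the inverse mapping, so the outside world only ever sees DRS paths.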
>>>
>>> For tools/protocols not amenable to tomcat filters, I would suggest
>>> writing the filter code such that it can be loaded up as a simple
>>> translation
>>> service.  We have the source for GridFTP, right, and I believe it is
>>> written in Java. So I think, if programmed wisely, there would  
>>> only be a
>>> single filter that can be applied to every ingress/egress.
>>>
>>> I welcome discussing this further.  I will be the first to say  
>>> that I do
>>> not know much about DRS and the details therein, but I think that
>>> technically this is a surmountable issue.  Someone please educate  
>>> me on
>>> the semantics.  I don't yet know what I don't know. :-).
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>> Hi Gavin,
>>>>
>>>> Sorry it's taken me so long to respond to this.  It's a good point
>>>> that we could version control catalogue information and then write
>>>> tools to synchronise the catalogues with file versions.  I like the
>>>> idea but I think it has far-reaching implications for the system  
>>>> as a
>>>> whole.
>>>> There are a couple of reasons why I haven't embraced the "catalogue
>>>> centric"
>>>> approach so far.  First, the ESG datanode database already has  
>>>> all the
>>>> information you'd put in catalogues.  The DRY principle suggests we
>>>> should have only 1 source for catalogue information and I have  
>>>> assumed
>>>> that is the database.  Now, the database has advantages and
>>>> disadvantages: Bob's schema manages multiple versions but there  
>>>> is no
>>>> mechanism for distributing version changes amongst datanodes,  
>>>> whereas
>>>> tools like GIT would give us distributed version control of  
>>>> catalogues
>>>> out of the box.  However, if we start version controlling  
>>>> catalogues
>>>> we will end up with our catalogue information spread all over the
>>>> place and we'll have to keep them all synchronised:
>>>>
>>>> 1. In the ESG database
>>>> 2. In the archive
>>>> 3. In the THREDDS catalogue tree
>>>>
>>>> Also, the reason I've stuck with symbolic links rather than tools  
>>>> to
>>>> map to DRS paths is that there is an argument for keeping the on- 
>>>> disk
>>>> layout as close to the DRS as possible so that there is a  
>>>> fallback to
>>>> getting data if the fancy tools fail.  If you do this with symlinks
>>>> you can always point an ftp server at the archive if all else  
>>>> fails.
>>>> I'm on the fence about whether this argument is worth the reduced
>>> flexibility.
>>>> We should also bear in mind that GIT has performance problems for  
>>>> both
>>>> size of files and number of files per repository
>>>> (http://stackoverflow.com/questions/984707/what-are-the-git-limits)
>>>>
>>>> Cheers,
>>>> Stephen.
>>>>
>>>> ---
>>>> Stephen Pascoe  +44 (0)1235 445980
>>>> British Atmospheric Data Centre
>>>> Rutherford Appleton Laboratory
>>>>
>>>> -----Original Message-----
>>>> From: Gavin M Bell [mailto:gavin at llnl.gov]
>>>> Sent: 21 April 2010 00:15
>>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>>> Cc: drach1 at llnl.gov; go-essp-tech at ucar.edu
>>>> Subject: Re: [Go-essp-tech] Proposed version directory structure
>>>> document
>>>>
>>>> Hey Stephen,
>>>>
>>>> Great write up.  I read through it and it is really well thought  
>>>> out.
>>>> As I mentioned on the call today, it is pretty much exactly how
>>>> GIT is designed, so you are in good company :-).
>>>>
>>>> http://progit.org/book/ch1-3.html
>>>>
>>>> The main issue I have is that we are thinking too low level.  We  
>>>> are
>>>> thinking about filesystem files.  Again, I think we should be  
>>>> thinking
>>>> of things at the "ESG FILE" level, i.e. the catalog level.  Imagine
>>>> the
>>>> following:
>>>>
>>>> users download ESG FILES aka catalogs.  The catalogs have versions
>>>> associated with them.  When the user downloads the catalog they get
>>>> the physical catalog xml file as well as the files that come with  
>>>> it.
>>>> If they download a new version of the catalog that has two files  
>>>> out
>>>> of 100 different, they would only pull down the two new files.  How
>>>> are files named to avoid collisions?  Easy, files are named in the
>>>> scheme <filename>.<checksum> this information can be gleaned from
>>>> inspecting the catalog that has both pieces of information.   
>>>> Catalogs
>>>> are version controlled in GIT (easy, they are just text xml  
>>>> files...
>>>> perfect for version control).  Let's make it even more explicit
>>>> that there are no filesystem files in the context of ESG by putting
>>>> all the files in a dot directory.
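The `<filename>.<checksum>` naming scheme above can be sketched in a few lines (the checksum algorithm is an assumption; the thread doesn't fix one):

```python
import hashlib

def esg_filename(name, payload):
    """Name a file <filename>.<checksum> so different versions of the
    same logical file can coexist in the dot directory without
    collision.  md5 is used here only as a stand-in algorithm."""
    return name + "." + hashlib.md5(payload).hexdigest()
```

Given a catalog listing both the filename and the checksum, the physical name is fully recoverable by inspection, as described above.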
>>>>
>>>> Example:
>>>>
>>>> I am in directory foo/bar (DRS dir hierarchy perhaps) we pull down
>>>> catalog_alpha_v1.  When we do so we now have the following
>>>> structure.
>>>>
>>>> pwd   -> /root/foo/bar
>>>> ls    -> catalog_alpha_v1
>>>>
>>>> ls -a -> catalog_alpha_v1
>>>>      -> .esg_files/file{1...n}.nc.<checksum>
>>>>
>>>> See what I mean?
>>>>
>>>> In GIT we tell GIT to ignore all .esg_files directories, thus only
>>>> versioning the catalog files.
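Telling GIT to skip the data directories is a one-line ignore rule (a sketch, assuming the `.esg_files` name used above):

```
# .gitignore at the repository root: exclude every .esg_files directory
# so only the catalog files themselves are version controlled.
.esg_files/
```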
>>>>
>>>> The implication of this is that we would have to build tools to  
>>>> do our
>>>> own interrogation of the file system to give us the file.nc
>>>> translations.  This tool will use the catalog at that directory  
>>>> level
>>>> and be able to get directly at the files users want.
>>>>
>>>> Furthermore, WGET will "just work" if we point wget to an HTTP URL
>>>> that will have a filter applied to it that will do this
>>>> interrogation and interpretation and fetch the files referenced.
>>>> This is a tomcat filter (pretty straightforward to do).
>>>>
>>>> This means, no linking, no extra anything at the OS file system  
>>>> level.
>>>> The important files are versioned i.e. the catalogs.  And we can  
>>>> still
>>>> use WGET scripts as long as they point to our translation web  
>>>> service,
>>>> which consists pretty much only of a filter! :-).
>>>>
>>>> As for the atomic data set thing... well they are represented  
>>>> already
>>>> as aggregates in the catalog.  The only additional bit of  
>>>> information
>>>> that we could add would be a version attribute.  The issues behind
>>>> what the gateways read or don't read from the catalogs, I am  
>>>> confident
>>>> will be surmounted, so that should not be a blocking issue to
>>>> implementing this.
>>>> Things to do:
>>>> -> Write this translation code.
>>>>   -We know that .esg_files directory (a given)
>>>>   -We know how to parse the catalog (use xml parser dejour)
>>>>   -translate input file name as a wget script would use
>>>>    to the actual physical filesystem filename.
>>>>   -Put this in a filter for tomcat in front of a catalog service
>>>>   -Create a shell for esg... simple read-eval-print loop that calls
>>>> the translator when it is in git directories with catalog looking
>>>> files and .esg_files directories to show you a filesystem looking  
>>>> "ls"
>>>> but for esg-files.
>>>>
>>>>
>>>> In my "spare" time I would love to write an ESG shell such that  
>>>> when
>>>> you load the esg shell it will be able to do ls like traditional  
>>>> OS's
>>>> ls using this translation code to show you the files that live
>>>> there in the right version context.... I don't have a lot of spare
>>>> time right about now. :-(
>>>>
>>>> This catalog centric modeling of the system has been a model I have
>>>> pushed for months now.  I feel like Cassandra :-).
>>>>
>>>> Thanks for listening.
>>>>
>>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>> Hi Bob,
>>>>>
>>>>> Thanks for promptly commenting on the document.  Clarifying that  
>>>>> the
>>>>> publisher has these features is great news and I'm sorry that, in
>>>>> trying to give everyone time to digest the document by Tuesday, I
>>>>> didn't have time to confirm the facts with you.  I'm hoping this  
>>>>> way
>>>>> any errors will come out in the wash.
>>>>>
>>>>> The main thing I missed was the ability to create multiple THREDDS
>>>>> catalogues for a dataset (or 1 catalogue per dataset version).
>>>>> Omitting this feature felt like a fundamental difference in model
>>>>> to the DRS.
>>>>> I need to work out how to do this now and I'll revise the version
>>>>> directory structure document too.  Phil Bentley has recommended a
>>>>> different structure that has some advantages so the document will
>>>>> probably look very different next time.
>>>>>
>>>>> Incidentally, I'm increasingly impressed with the ESG publisher  
>>>>> and
>>>>> I'm really enjoying working with it.  The stuff you've done with
>>>>> project handler plugins in the latest release strengthens my
>>>>> impression that it is a tool we will be using for a long time.
>>>>>
>>>>> Cheers,
>>>>> Stephen.
>>>>>
>>>>> ---
>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>> British Atmospheric Data Centre
>>>>> Rutherford Appleton Laboratory
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Bob Drach [mailto:drach1 at llnl.gov]
>>>>> *Sent:* 16 April 2010 00:14
>>>>> *To:* Pascoe, Stephen (STFC,RAL,SSTD)
>>>>> *Cc:* go-essp-tech at ucar.edu
>>>>> *Subject:* Re: [Go-essp-tech] Proposed version directory structure
>>>>> document
>>>>>
>>>>> Hi Stephen,
>>>>>
>>>>> Let me clarify a few points in the description of ESG Publisher:
>>>>>
>>>>> The document states: "ESG Publisher version system is built around
>>>>> mutable datasets.  It does not attempt to maintain references to
>>>>> previous data and the dataset version number is not part of the
>>>>> dataset id unless the publisher is configured to include it from  
>>>>> the
>>>>> dataset metadata.  This means that it is not straightforward at
>>>>> this
>>>>> time to publish multiple versions of an atomic dataset unless each
>>>>> version is published as a separate dataset.  This approach would
>>>>> effectively ignore ESG Publisher's version system and manage all
>>>>> versions independently."
>>>>> - As of Version 2 the unit of publication is in fact a 'dataset
>>>>> version', terminology that came out of the December meeting in
>>>>> Boulder.
>>>>> A dataset version is an immutable object which can represent a  
>>>>> 'DRS
>>>>> dataset including version number'. The published 'dataset version'
>>>>> itself has an identifier which typically consists of
>>>>> dataset_id+version number; this appears in the THREDDS catalog. As
>>>>> you
>>>>> stated in the document, whether or not the published dataset
>>>>> corresponds to a DRS dataset is a matter of publisher  
>>>>> configuration,
>>>>> not an inherent property of the publisher.
>>>>>
>>>>> - The node database does in fact maintain references to the
>>>>> composition of previous dataset versions. It is possible to have
>>>>> multiple versions published simultaneously, to list all published
>>>>> versions of a dataset, and for any given dataset version the files
>>>>> contained in that version can be listed.
>>>>>
>>>>> - The intention of the publisher design is to automate  
>>>>> versioning as
>>>>> much as possible. A 'dataset' is considered to be a collection of
>>>>> dataset versions. Consequently, 'publishing a dataset' really  
>>>>> means
>>>>> 'publishing a dataset version where the version number is  
>>>>> incremented
>>>>> relative to the previous version.' Similarly, 'unpublishing' a
>>>>> dataset
>>>>> by default unpublishes all versions of a dataset. The terminology
>>>>> dataset_id#n can be used to refer to a specific version.
>>>>>
>>>>>
>>>>> In short, there is no fundamental mismatch between the DRS model  
>>>>> and
>>>>> the ESG publisher.
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Apr 15, 2010, at 3:24 AM, <stephen.pascoe at stfc.ac.uk
>>>>> <mailto:stephen.pascoe at stfc.ac.uk>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Attached is my view on how we should structure the archive to
>>>>>> support
>>>>>> multiple versions.  It divides into 2 main sections, the first  
>>>>>> is a
>>>>>> fairly lengthy summary of why this problem isn't solved yet in  
>>>>>> terms
>>>>>> of the differences between the ESG datanode software and the DRS
>>>>>> document.  The second section lays out the proposed structure and
>>>>>> how
>>>>>> we would manage symbolic links and moving from one version to
>>>>>> another.  I restrict myself to directories below the atomic  
>>>>>> dataset
>>>>>> level.
>>>>>>
>>>>>> Lots of issues are left to resolve, in particular how the ESG
>>>>>> publisher can make use of this structure.  I'll try to draw
>>>>>> attention to these points in the agenda for Tuesday's telco which
>>>>>> will follow later today.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephen.
>>>>>>
>>>>>> ---
>>>>>> Stephen Pascoe  +44 (0)1235 445980
>>>>>> British Atmospheric Data Centre
>>>>>> Rutherford Appleton Laboratory
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Scanned by iCritical.
>>>>>>
>>>>>>
>>>>>> <ESGF_version_structure.odt> 
>>>>>> _______________________________________________
>>>>>> GO-ESSP-TECH mailing list
>>>>>> GO-ESSP-TECH at ucar.edu <mailto:GO-ESSP-TECH at ucar.edu>
>>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> GO-ESSP-TECH mailing list
>>>>> GO-ESSP-TECH at ucar.edu
>>>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>>> --
>>>> Gavin M. Bell
>>>> Lawrence Livermore National Labs
>>>> --
>>>>
>>>> "Never mistake a clear view for a short distance."
>>>>                  -Paul Saffo
>>>>
>>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>>
>>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>> --
>>> Gavin M. Bell
>>> Lawrence Livermore National Labs
>>> --
>>>
>>> "Never mistake a clear view for a short distance."
>>>                  -Paul Saffo
>>>
>>> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>>>
>>> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E
>>
>
> -- 
> Gavin M. Bell
> Lawrence Livermore National Labs
> --
>
> "Never mistake a clear view for a short distance."
>                  -Paul Saffo
>
> (GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)
>
> A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E


More information about the GO-ESSP-TECH mailing list