[Go-essp-tech] How will it all work?

Gavin M Bell gavin at llnl.gov
Tue Mar 23 12:38:22 MDT 2010


Merkle trees
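
A rough sketch of the idea in Python: aggregate per-file SHA-256 digests into a single dataset-level root hash, so two sites can compare one value instead of every file. The helper names are mine, not part of any ESG component; sorting the leaves first makes the root independent of file order.

```python
# Sketch only: Merkle-style aggregation of per-file checksums into one
# dataset-level digest.  Illustrative code, not an ESG implementation.
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_digests):
    """Combine sorted leaf digests pairwise until one root remains."""
    level = sorted(leaf_digests)          # sort so file order doesn't matter
    if not level:
        return sha256(b"")                # sentinel for an empty dataset
    while len(level) > 1:
        if len(level) % 2:                # odd count: duplicate the last node
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Two datanodes holding the same files compute the same root:
files = [b"file1-bytes", b"file2-bytes", b"file3-bytes"]
root_a = merkle_root([sha256(f) for f in files])
root_b = merkle_root([sha256(f) for f in reversed(files)])
```

With a tree (rather than one flat hash) a site can also localise *which* file differs by comparing subtree digests.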


martin.juckes at stfc.ac.uk wrote:
> I would like to strengthen Stephen’s last point – we need to have a
> system for unambiguously determining whether two files or two datasets
> are the same or not. I.e. we need to have checksums on the files and
> some means of aggregating these to a checksum at the ESG published
> dataset level. Ideally, the ESG gateway would not advertise datasets
> held at two different sites as being the same without having verified
> that this is the case by looking at the dataset checksums.
> 
> Cheers,
> 
> Martin
> 
>  
> 
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of
> *stephen.pascoe at stfc.ac.uk
> *Sent:* 23 March 2010 14:55
> *To:* taylor13 at llnl.gov; go-essp-tech at ucar.edu
> *Subject:* Re: [Go-essp-tech] How will it all work?
> 
>  
> 
> Hi Karl,
> 
>  
> 
> There is lots to consider in this email.  I'll leave the policy stuff to
> Bryan but I want to focus on your description of how versions will
> work.  I think the definition of versions below is confused.  Just to
> enumerate what I understand you are saying:
> 
>  
> 
>  1. ESG datanode assigns versions to files and datasets
>  2. A dataset is a collection of variables from a particular experiment
> (& realisation) and model (presumably these collections are realms)
>  3. A version subdirectory is inserted after <ensemble-member> in the
> DRS hierarchy
>  4. variables may be stored in more than one file.
> 
>  
> 
> There are 3 different version concepts here: file versions, dataset
> versions and the version subdirectory (aka DRS-version) and they all
> apply to different levels of granularity.  This has arisen because
> "dataset" is no longer at the same level as DRS-version if we follow
> Bob's advice and publish at the realm level.  The DRS-version no longer
> matches either type of version managed by the ESG publisher.  I think I
> suggested in a previous email that this could be solved by moving the
> version subdirectory to be directly below realm. 
> 
>  
> 
> However, if we stick with the DRS structure as-is there is a non-trivial
> relationship between these 3 version concepts.  To try and understand
> your proposal I've sketched it out at
> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP5/VersionStructure.  I've
> given 2 scenarios there on how files will be moved into version
> folders.  How these relate to "dataset version" is TBD.
> 
>  
> 
> You say version consistency is not essential across datanodes (#7) and
> that the version directory should contain a "latest" directory(#14).  In
> this case how do we know whether 2 datanodes have the same "latest"
> data?  It would seem obvious that datanodes need consistent versioning. 
> Even if version numbers are consistent I'm not convinced "latest" is a
> good idea.  On one datanode "latest" could mean v2 and on another it
> could mean v3.
> 
>  
> 
> If a dataset's version can't be kept consistent across all datanodes we
> will need another means of determining whether 2 datasets are the same. 
> One possibility is a combined hash of all the files' tracking_ids, or
> even combined checksum.
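
A minimal sketch of Stephen's suggestion, assuming each file carries a tracking_id attribute (as CMOR writes one): hash the sorted set of tracking_ids so any two nodes holding the same set of files derive the same fingerprint, regardless of the order they enumerate them in. The function name and ids are illustrative only.

```python
# Sketch: order-independent combined hash of file tracking_ids as a
# dataset identity check across datanodes.  Illustrative, not ESG code.
import hashlib

def dataset_fingerprint(tracking_ids):
    h = hashlib.sha256()
    for tid in sorted(tracking_ids):      # sort: same set => same digest
        h.update(tid.encode("utf-8"))
    return h.hexdigest()

node1 = ["id-b", "id-a", "id-c"]          # hypothetical tracking_ids
node2 = ["id-a", "id-c", "id-b"]          # same set, different order
```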
> 
>  
> 
> Cheers,
> 
> Stephen.
> 
>  
> 
> ---
> 
> Stephen Pascoe  +44 (0)1235 445980
> 
> British Atmospheric Data Centre
> 
> Rutherford Appleton Laboratory
> 
>  
> 
>  
> 
> ------------------------------------------------------------------------
> 
> *From:* go-essp-tech-bounces at ucar.edu
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Karl Taylor
> *Sent:* 22 March 2010 09:30
> *To:* GO-ESSP
> *Subject:* [Go-essp-tech] How will it all work?
> 
> Dear all,
> 
> Here is an attempt to write down how CMIP5 data might be served by ESG. 
> Perhaps someone can find a better way to do this, but if not, perhaps
> this will be acceptable. (I apologize if my limited understanding of ESG
> means that either this is impractical or stupid.  It is meant to inspire
> others to come up with a better approach, but I would like to see a very
> explicit written description of any proposed alternative.)  Perhaps some
> of you will have a chance to study this before our next teleconference.
> 
> Procedure for putting in place the CMIP5 archive:
> 1.  A modeling group generates model output in native format and file
> structure.
> 2.  The modeling group rewrites data consistent with CMIP5 requirements
> (see attached document) using either CMOR2  or an equivalent
> post-processing coding.  Data is placed in a directory structure
> specified by the "CMIP5 and AR5 Data Reference Syntax (DRS)" (see
> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf).
> [This is automatically assured by CMOR2, but otherwise must be enforced
> by the user;]
> 3. Certain quality control (QC) criteria are guaranteed to be satisfied
> by processing output through CMOR2 (available from PCMDI).  To ensure
> that the same QC criteria are met, output that has *not* been processed
> through CMOR2 should be required to successfully pass the tests imposed
> by the "CMOR2 checker" code (also available from PCMDI).
> 4.  For modeling groups hosting an ESG node, CMIP5-compliant model
> output is "published" to the ESG federation (i.e., it is registered in
> an ESG catalog and becomes visible to the ESG federation).  Other
> groups unable to host an ESG node may send output to an archival center
> (e.g., PCMDI, BADC, DKRZ) which will become the surrogate "owner" of the
> output. The owner will publish the data to the ESG federation. As a
> first step in the publication job stream, the files will be moved from a
> directory at the "realization" level to a subdirectory at the "version"
> level. [CMOR2 writes data to the following directory:
> <activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling
> realm>/<variable name>/<ensemble member>/, and in the ESG publisher
> procedure, the files will be moved to a directory under <ensemble
> member>, which will be named v<i> where i is the version assigned to this
> by ESG. Note that ESG assigns version numbers to individual files and
> also to "datasets". The "i" refers to the file version number.  An ESG
> "dataset" comprises several variables produced from a single run (and
> realization) from a single model.  Output from a single variable may be
> stored in several files.  Thus, a dataset will include files from a
> number of variables and in general for each variable the data will be
> stored in multiple files.  ESG will assign a version number to each file
> (and the directory name will be consistent with this) and ESG will also
> assign a version number to each dataset.  If the version of any file
> within a dataset is incremented, then the version of the dataset must be
> incremented (by 1).
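
The directory move in step 4 can be sketched as follows, assuming the DRS path components above; the function name and the demo path segments are illustrative, not part of the ESG publisher.

```python
# Sketch of step 4: files written by CMOR2 under .../<ensemble member>/
# are relocated into a v<i> subdirectory at publication time.
import os
import shutil
import tempfile

def publish_version(ensemble_dir: str, version: int) -> str:
    """Move plain files under ensemble_dir into a v<version> subdirectory."""
    vdir = os.path.join(ensemble_dir, "v%d" % version)
    os.makedirs(vdir, exist_ok=True)
    for name in os.listdir(ensemble_dir):
        src = os.path.join(ensemble_dir, name)
        if os.path.isfile(src):           # leave v1, v2, ... directories alone
            shutil.move(src, os.path.join(vdir, name))
    return vdir

# Demo with a hypothetical DRS path:
# <activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<realm>/<variable>/<ensemble member>
root = tempfile.mkdtemp()
ens = os.path.join(root, "CMIP5", "output", "inst", "model", "expt",
                   "mon", "atmos", "tas", "r1i1p1")
os.makedirs(ens)
open(os.path.join(ens, "tas_file.nc"), "w").close()
vdir = publish_version(ens, 1)
```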
> 5.  As a step in the ESG publication procedure, a subdirectory under
> <ensemble member> will be created named "latest" and a link will be
> created in this subdirectory pointing to the latest version of all files
> that together contribute to the latest version of the dataset.  This
> so-called "latest" subdirectory can be accessed to retrieve the most
> recent (and, presumably, trustworthy) model output available.
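
Step 5 might look like this: a "latest" directory of symlinks under <ensemble member>, refreshed to point at the newest version of each file. Directory and file names here are hypothetical.

```python
# Sketch of step 5: maintain a "latest" subdirectory of symlinks next to
# the v<i> directories.  Illustrative only; requires symlink support.
import os
import tempfile

def refresh_latest(ensemble_dir: str, version: int) -> None:
    """Point latest/<name> at v<version>/<name> for every file in v<version>."""
    vdir = os.path.join(ensemble_dir, "v%d" % version)
    latest = os.path.join(ensemble_dir, "latest")
    os.makedirs(latest, exist_ok=True)
    for name in os.listdir(vdir):
        link = os.path.join(latest, name)
        if os.path.lexists(link):         # replace any stale link
            os.remove(link)
        os.symlink(os.path.join(vdir, name), link)

# Demo with a hypothetical ensemble directory:
ens = os.path.join(tempfile.mkdtemp(), "r1i1p1")
os.makedirs(os.path.join(ens, "v2"))
open(os.path.join(ens, "v2", "tas_a.nc"), "w").close()
refresh_latest(ens, 2)
```

Note that links must be refreshed per file, not per directory: the latest version of one file may live in v2 while another still lives in v1.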
> 6.  The owner (or surrogate owner) of the model output will send the
> so-called "CMIP5 requested model output" (as defined in a document
> available from
> http://cmip-pcmdi.llnl.gov/cmip5/output_req.html?submenuheader=3#req_list)
> via 2-Tbyte disks to PCMDI (and subsequently it will be passed on to
> other archival centers).  Each of the archival centers will decide
> whether to store all of the requested output or some subset of the
> requested output, or none of the output.  There is no requirement that
> all of the archival centers host exactly the same portion of model output.
> 7. Each archival center will store the output in a directory structure
> consistent with the "CMIP5 and AR5 Data Reference Syntax (DRS)", as
> described above.  The "version number" assigned to each file (and as
> automatically guaranteed by the ESG publication procedure also assigned
> to the directory name?) would ideally be the same as that found at the
> data owner's node, but I don't think this is essential.  Note that each
> of the archival centers will publish to the ESG federation the subset
> (or complete) model output it chooses to archive.  
> 8.  If users find errors in the model output that has been published (or
> if additional quality assurance procedures applied by the ESG federation
> uncover any flaws), it is reported to the data "owner" who may withdraw
> the output and possibly replace it with corrected output.  If the data
> is withdrawn and not replaced, the data owner informs the federation
> that data has been withdrawn, and the archival centers withdraw all the
> affected files.  At all sites the dataset version is incremented, and the
> withdrawn files are not included in this new version of the dataset.  If
> the data is replaced, the data owner publishes the new data (placed in
> an incremented "version" subdirectory) and  informs the federation that
> the data has been replaced.  The archival centers update their archives
> with the latest files (placed in incremented "version" subdirectories).
> At all sites the dataset "version" is also incremented and this new
> dataset version now includes the replacement files.
> 9.  At a time when the dataset has "matured" and it is deemed
> appropriate, a (substantial) subset of the "CMIP5 requested output"  for
> a given model and experiment will be submitted for assignment of a
> DOI.  (DOI's will be assigned with a granularity following the ESG
> "dataset" granularity -- i.e., DOI's will be assigned to each subset of
> a single model's output defined by a single experiment, a single
> realization, a single realm, and a single frequency.  The dataset will
> include many variables.)  The procedure for assigning a DOI to model
> output is described elsewhere, but a requirement is that the data must
> be archived at one, some, or all of the following locations: PCMDI,
> BADC, and DKRZ.  The expected persistence of these groups and their
> ability to support data archives makes it likely that the output will
> remain accessible far into the future.
> 10.  As part of the submission procedure for DOI status, the model
> output "owner" will publish a set of new "ESG datasets" that will
> typically include only a subset of the original model output.  Each of
> these new datasets is a candidate for DOI assignment.  Because these new
> ESG datasets constitute a subset of the originally published "model
> output", they may not be of much interest to users who come to ESG in
> search of data (since the users will presumably be keen to examine *all*
> the model output).  Nevertheless, if DOI status is granted, the subset
> of output included will presumably be perceived as somewhat more
> permanent and reliable (since we expect additional quality assurance
> procedures will be invoked in the procedure to gain DOI status).  The
> DOI's will also serve future researchers who might want to reproduce
> research results that cite certain DOI-labeled datasets.  The modeling
> groups will also be able to substantiate claims that their data has
> actually contributed to the research results that cite their DOI's. 
> This capability requires that the DOI-designated datasets be given
> special status by ESG.  With the current ESG design it may be necessary
> (for the purpose of defining DOI datasets) to create a parallel directory
> structure to the original directory structure where the model output is
> stored.  This parallel directory would contain links to only the
> subset of model output files that are included in the DOI-designated
> (and ESG federation-replicated) subset.  A user with access to the
> actual DOI archive directory would only see files included in the
> DOI-designated data.  The user could go to the *original* directory to
> see *all* the data available at the site, which would include the
> DOI-designated data. 
> 11. Once the output submitted for DOI candidacy has been published,
> archival centers that have copies of this data will publish to the
> federation the same (subset of) model output and these copies will be
> identified by ESG as "replicated" datasets.  These replicated datasets
> will likely be subsets of the already published corresponding model
> output datasets, in which case there will be two distinct datasets
> registered with the ESG federation, one containing the entire available
> output at the site and the other containing only the replicated subset.
> 12. At this point the ESG federation will be aware of a number of
> different datasets that are similar but differ in the fraction of output
> included from the total output available (within the granularity defined
> by the ESG "dataset" definition).  For example, the total output might
> include all time samples simulated. PCMDI might archive only a subset of
> this output.  And the DOI-candidate output (which might be "replicated"
> at BADC and DKRZ) might include only a subset of the variables (of most
> interest).  Thus, at least 3 different ESG datasets would be defined,
> with only one of these being replicated across certain archival centers.
> 13.  The user who comes to an ESG portal should be able to search the
> distributed, federated ESG database and find out whether data of
> interest is available.  Initially it will likely be unimportant (from
> the user's perspective) to learn where exactly the data is stored (and I
> think the user should initially not see all the different ESG datasets
> that include the data of interest).  But before the user actually
> attempts to retrieve the data, he/she should be given the opportunity to
> select a preferred site from which to obtain it.  ESG should then
> provide the wget script (or equivalent) that the user can subsequently
> use to download the data.  This wget script would access the data from
> the preferred site, unless it were unavailable there in which case it
> would direct the user to an archive where it was available.
> 14.  Note that the directory structure described above includes a
> "latest" subdirectory containing links that point to the most recent
> versions of files available.  The wget script should probably point to
> the links in this "latest" subdirectory because this will make it
> possible for the user to edit the script to obtain files for a different
> variable.  If the wget script points to the actual file location for a
> particular variable, the user will in general be unable to easily edit
> the wget script to get a different variable because the "version
> subdirectory" where the latest version of each file is located may
> differ from one variable to another.
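
The point in step 14 can be illustrated with a URL builder: because "latest" sits below <ensemble member> while the variable component sits above it, swapping the variable name is the only edit a user needs. The host and path segments below are hypothetical.

```python
# Sketch: why wget URLs through "latest" are easy to edit for another
# variable -- no per-variable v<i> component appears in the path.
def latest_url(variable: str, filename: str) -> str:
    base = "http://datanode.example.org/fileServer"   # hypothetical host
    return "/".join([base, "CMIP5", "output", "inst", "model", "expt",
                     "mon", "atmos", variable, "r1i1p1", "latest", filename])

tas = latest_url("tas", "tas_2000.nc")
pr = latest_url("pr", "pr_2000.nc")
```

With explicit version paths the user would also have to know that, say, tas is at v3 but pr is still at v1 before the edited script would work.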
> 
> I have left out the details of what specific QC procedures are required
> at various points in the procedure.  I have also omitted lots of details
> that will have to be worked out.  Note also that I do not think
> "replication" is of major interest or concern.  My view is that whether
> a given dataset is replicated or not is not so important.  ESG will say
> what files are available (and it will know where copies of individual
> files can be found).  My guess is that most of the major "archival
> centers" will want to have copies of the files that are DOI-anointed,
> and ESG should be able to keep track of these "replicated" datasets.  If
> this is not practical, I'm not sure how making a "bigger deal" about
> replication remedies any difficulty posed by the above.
> 
> I look forward to your reactions/comments/alternative suggestions.
> 
> Best regards,
> Karl
> 
> P.S. It's rather late in the evening, so please allow for that in
> reading the above.
> 
> 
> 
> -- 
> Scanned by iCritical.
> 
>  
> 
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech

-- 
Gavin M. Bell
Lawrence Livermore National Labs
--

 "Never mistake a clear view for a short distance."
       	       -Paul Saffo

(GPG Key - http://rainbow.llnl.gov/dist/keys/gavin.asc)

 A796 CE39 9C31 68A4 52A7  1F6B 66B7 B250 21D5 6D3E


More information about the GO-ESSP-TECH mailing list