[Go-essp-tech] How will it all work?

Karl Taylor taylor13 at llnl.gov
Tue Mar 23 17:56:52 MDT 2010


Hi Stephen and all,

Last night I edited the document I sent earlier, making some minor 
changes (you can see them because "track changes" is on).  Perhaps this 
should replace the document already posted online, even though we know 
that Stephen will hopefully propose a far superior approach.

Mostly for Stephen's benefit, I have made some comments below.

------------------------------------------------------------------------
On 23-Mar-10 7:54 AM, stephen.pascoe at stfc.ac.uk wrote:
> Hi Karl,
> There is lots to consider in this email.  I'll leave the policy stuff 
> to Bryan but I want to focus on your description of how versions will 
> work.  I think the definition of versions below is confused.  Just to 
> enumerate what I understand you are saying:
>  1. ESG datanode assigns versions to files and datasets
>  2. A dataset is a collection of variables from a particular 
> experiment (& realisation) and model (presumably these collections are 
> realms)
Note that the dataset is also limited to a single sampling "frequency".
>  3. A version subdirectory is inserted after <ensemble-member> in the 
> DRS hierarchy
>  4. variables may be stored in more than one file.
> There are 3 different version concepts here: file versions, dataset 
> versions and the version subdirectory (aka DRS-version) and they all 
> apply to different levels of granularity.
We should note that the ESG-defined dataset (which, presumably, will 
remain at the "realm" level) also differs from the DOI dataset, which I 
think will include all realms (and frequencies?  I need to check the DOI 
document about this) and may informally also have a "version" associated 
with it (correct?  By this I mean that data associated with a particular 
DOI may be corrected, and a new DOI could then be assigned to the 
corrected DOI dataset).  We may also need to distinguish DRS "atomic" 
dataset versions (but maybe not?), which will in general differ from all 
the other versions.

The "version subdirectory" should, I think, be assigned a number 
consistent with one of the other 3 (or 4?) version numbers.  (I nominate 
that it be consistent with the ESG-defined dataset version, but perhaps 
there are good arguments against this.)
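For concreteness, here is a minimal sketch (in Python; the component 
values are invented, not real controlled-vocabulary entries) of the 
directory described in my original message below, with the version 
subdirectory appended under <ensemble member>:

```python
import os

# Hypothetical DRS component values; the real controlled vocabularies
# are defined in the DRS document.
drs = ["cmip5", "output", "PCMDI", "some-model", "historical",
       "mon", "atmos", "tas", "r1i1p1"]

def versioned_path(components, version):
    """Join the DRS components CMOR2 writes to, then append the v<i>
    subdirectory that the ESG publisher inserts below <ensemble member>."""
    return os.path.join(*components, "v%d" % version)

print(versioned_path(drs, 1))
# cmip5/output/PCMDI/some-model/historical/mon/atmos/tas/r1i1p1/v1
```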

> This has arisen because "dataset" is no longer at the same level as 
> DRS-version if we follow Bob's advice and publish at the realm level.  
> The DRS-version no longer matches either type of version managed by 
> esg publisher.  I think I suggested in a previous email that this 
> could be solved by moving the version subdirectory to be directly 
> below realm.
If I understood our audio conference this morning, you've withdrawn this 
suggestion.
> However, if we stick with the DRS structure as-is there is a 
> non-trivial relationship between these 3 version concepts.  To try and 
> understand your proposal I've sketched it out at 
> http://proj.badc.rl.ac.uk/go-essp/wiki/CMIP5/VersionStructure.  I've 
> given 2 scenarios there on how files will be moved into version 
> folders.  How these relate to "dataset version" is TBD.
> You say version consistency is not essential across datanodes (#7) and 
> that the version directory should contain a "latest" directory (#14).  
> In this case how do we know whether 2 datanodes have the same "latest" 
> data?  It would seem obvious that datanodes need consistent 
> versioning.  Even if version numbers are consistent I'm not convinced 
> "latest" is a good idea.  On one datanode "latest" could mean v2 and 
> on another it could mean v3.
> If a dataset's version can't be kept consistent across all datanodes 
> we will need another means of determining whether 2 datasets are the 
> same.  One possibility is a combined hash of all the files' 
> tracking_ids, or even combined checksum.
I think this deserves some consideration.  Even if we design things 
sensibly, "operator error" will likely lead to inconsistencies between 
labeled versions from one data location to another across the 
federation.
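As a sketch of your hash suggestion above (the choice of SHA-256 and the 
convention of sorting the ids before hashing are my own assumptions, not 
anything ESG has specified):

```python
import hashlib

def dataset_fingerprint(tracking_ids):
    """Combine the tracking_ids of all files in a dataset into a single
    digest.  Sorting first makes the result independent of file order,
    so two datanodes holding identical files agree on the fingerprint."""
    h = hashlib.sha256()
    for tid in sorted(tracking_ids):
        h.update(tid.encode("utf-8"))
    return h.hexdigest()

# The same files listed in a different order give the same fingerprint:
a = dataset_fingerprint(["id-3", "id-1", "id-2"])
b = dataset_fingerprint(["id-1", "id-2", "id-3"])
assert a == b
```

A combined checksum of the file contents would work the same way, at the 
cost of reading every file.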

After looking at your "diagrams" describing the structure, I agree that 
your suggestion will make it easier to recover an old version, and I 
think it is probably the better approach.
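To be explicit about what maintaining "latest" might involve under this 
structure, here is a sketch (Python; the function name and the 
assumption that every file of the dataset appears in the newly published 
version directory are my own simplifications):

```python
import os

def update_latest(ensemble_dir, new_version):
    """Repoint the links in <ensemble member>/latest at the files of the
    newly published v<i> directory.  A sketch only: it assumes the new
    version directory contains every file of the dataset."""
    vname = "v%d" % new_version
    latest = os.path.join(ensemble_dir, "latest")
    os.makedirs(latest, exist_ok=True)
    for fname in os.listdir(os.path.join(ensemble_dir, vname)):
        link = os.path.join(latest, fname)
        if os.path.lexists(link):
            os.remove(link)  # drop the link to the superseded version
        # relative link, so the tree can be moved as a whole
        os.symlink(os.path.join("..", vname, fname), link)
```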

One idea for being able to locate the files associated with a particular 
DOI would be to include a parallel directory structure with the links at 
the "r" level in your diagram not pointing to "latest", but instead 
pointing to the files included in that DOI.  On the other hand, perhaps 
you don't want to create a complete parallel structure and instead could 
find a place within the current structure (shown in your diagram) to 
host the DOI links.  Note that you must allow for newer versions of the 
DOI dataset (with a new DOI number), although we hope this won't be a 
common occurrence.
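A sketch of what building such a parallel link structure might look like 
(all names here are hypothetical; the real layout of a DOI tree is 
exactly what remains to be decided):

```python
import os

def build_doi_tree(archive_root, doi_root, doi_files):
    """Create, under doi_root, a parallel tree of symlinks pointing at
    just the files (given as paths relative to archive_root) that are
    included in a DOI-designated dataset."""
    for rel in doi_files:
        target = os.path.join(os.path.abspath(archive_root), rel)
        link = os.path.join(doi_root, rel)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(target, link)
```

A user browsing doi_root would then see only the DOI-designated files, 
while the original tree still shows everything.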

On the other hand, perhaps it is envisioned that the files associated 
with the DOI are cataloged somewhere externally and can't be retrieved 
directly by just using the directory names and structure as a guide.

I look forward to the next iteration.

Best regards,
Karl
> Cheers,
> Stephen.
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> ------------------------------------------------------------------------
> *From:* go-essp-tech-bounces at ucar.edu 
> [mailto:go-essp-tech-bounces at ucar.edu] *On Behalf Of *Karl Taylor
> *Sent:* 22 March 2010 09:30
> *To:* GO-ESSP
> *Subject:* [Go-essp-tech] How will it all work?
>
> Dear all,
>
> Here is an attempt to write down how CMIP5 data might be served by 
> ESG.  Perhaps someone can find a better way to do this, but if not, 
> perhaps this will be acceptable.  (I apologize if my limited 
> understanding of ESG means that this is impractical or stupid.  
> It is meant to inspire others to come up with a better approach, but I 
> would like to see a very explicit written description of any proposed 
> alternative.)  Perhaps some of you will have a chance to study this 
> before our next teleconference.
>
> Procedure for putting in place the CMIP5 archive:
> 1.  A modeling group generates model output in native format and file 
> structure.
> 2.  The modeling group rewrites data consistent with CMIP5 
> requirements (see attached document) using either CMOR2 or an 
> equivalent post-processing code.  Data is placed in a directory 
> structure specified by the "CMIP5 and AR5 Data Reference Syntax (DRS)" 
> (see 
> http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf).  
> [This is automatically assured by CMOR2, but otherwise must be 
> enforced by the user.]
> 3. Certain quality control (QC) criteria are guaranteed to be 
> satisfied by processing output through CMOR2 (available from PCMDI).  
> Alternatively, to ensure that the same QC criteria are met by output 
> that has *not* been processed through CMOR2, this output should 
> be required to successfully pass the tests imposed by the "CMOR2 
> checker" code (also available from PCMDI).
> 4.  For modeling groups hosting an ESG node, CMIP5-compliant model 
> output is "published" to the ESG federation (i.e., it is registered in 
> an ESG catalog and becomes visible to the ESG federation).  Other 
> groups unable to host an ESG node may send output to an archival 
> center (e.g., PCMDI, BADC, DKRZ) which will become the surrogate 
> "owner" of the output. The owner will publish the data to the ESG 
> federation. As a first step in the publication job stream, the files 
> will be moved from a directory at the "realization" level to a 
> subdirectory at the "version" level. [CMOR2 writes data to the 
> following directory: 
> <activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling 
> realm>/<variable name>/<ensemble member>/, and in the ESG publisher 
> procedure, the files will be moved to a directory under <ensemble 
> member>, which will be named v<i>, where i is the version assigned to 
> this by ESG. Note that ESG assigns version numbers to individual files 
> and also to "datasets". The "i" refers to the file version number.  An 
> ESG "dataset" comprises several variables produced from a single run 
> (and realization) from a single model.  Output from a single variable 
> may be stored in several files.  Thus, a dataset will include files 
> from a number of variables and in general for each variable the data 
> will be stored in multiple files.  ESG will assign a version number to 
> each file (and the directory name will be consistent with this) and 
> ESG will also assign a version number to each dataset.  If the version 
> of any file within a dataset is incremented, then the version of the 
> dataset must be incremented (by 1).
> 5.  As a step in the ESG publication procedure, a subdirectory under 
> <ensemble member> will be created named "latest", and a link will be 
> created in this subdirectory pointing to the latest version of all 
> files that together contribute to the latest version of the dataset.  
> This so-called "latest" subdirectory can be accessed to retrieve the 
> most recent (and, presumably, trustworthy) model output available.
> 6.  The owner (or surrogate owner) of the model output will send the 
> so-called "CMIP5 requested model output" (as defined in a document 
> available from 
> http://cmip-pcmdi.llnl.gov/cmip5/output_req.html?submenuheader=3#req_list) 
> via 2-Tbyte disks to PCMDI (and subsequently it will be passed on to 
> other archival centers).  Each of the archival centers will decide 
> whether to store all of the requested output or some subset of the 
> requested output, or none of the output.  There is no requirement that 
> all of the archival centers host exactly the same portion of model output.
> 7. Each archival center will store the output in a directory structure 
> consistent with the "CMIP5 and AR5 Data Reference Syntax (DRS)", as 
> described above.  The "version number" assigned to each file (and as 
> automatically guaranteed by the ESG publication procedure also 
> assigned to the directory name?) would ideally be the same as that 
> found at the data owner's node, but I don't think this is essential.  
> Note that each of the archival centers will publish to the ESG 
> federation the subset (or complete) model output it chooses to archive.
> 8.  If users find errors in the model output that has been published 
> (or if additional quality assurance procedures applied by the ESG 
> federation uncover any flaws), it is reported to the data "owner" who 
> may withdraw the output and possibly replace it with corrected 
> output.  If the data is withdrawn and not replaced, the data owner 
> informs the federation that data has been withdrawn, and the archival 
> centers withdraw all the affected files.  At all sites the dataset 
> version is incremented, and the withdrawn files are not included in 
> this new version of the dataset.  If the data is replaced, the data 
> owner publishes the new data (placed in an incremented "version" 
> subdirectory) and  informs the federation that the data has been 
> replaced.  The archival centers update their archives with the latest 
> files (placed in incremented "version" subdirectories).  At all 
> sites the dataset "version" is also incremented and this new dataset 
> version now includes the replacement files.
> 9.  At a time when the dataset has "matured" and it is deemed 
> appropriate, a (substantial) subset of the "CMIP5 requested output"  
> for a given model and experiment will be submitted for assignment of 
> DOI's.  (DOI's will be assigned with a granularity following the ESG 
> "dataset" granularity -- i.e., DOI's will be assigned to each subset 
> of a single model's output defined by a single experiment, a single 
> realization, a single realm, and a single frequency.  The dataset will 
> include many variables.)  The procedure for assigning a DOI to model 
> output is described elsewhere, but a requirement is that the data must 
> be archived at one, some, or all of the following locations: PCMDI, 
> BADC, and DKRZ.  The expected persistence of these groups and their 
> ability to support data archives makes it likely that the output will 
> remain accessible far into the future.
> 10.  As part of the submission procedure for DOI status, the model 
> output "owner" will publish a set of new "ESG datasets" that will 
> typically include only a subset of the original model output.  Each of 
> these new datasets is a candidate for DOI assignment.  Because these 
> new ESG datasets constitute a subset of the originally published 
> "model output", they may not be of much interest to users who come to 
> ESG in search of data (since the users will presumably be keen to 
> examine *all* the model output).  Nevertheless, if DOI status is 
> granted, the subset of output included will presumably be perceived as 
> somewhat more permanent and reliable (since we expect additional 
> quality assurance procedures will be invoked in the procedure to gain 
> DOI status).  The DOI's will also serve future researchers who might 
> want to reproduce research results that cite certain DOI-labeled 
> datasets.  The modeling groups will also be able to substantiate 
> claims that their data has actually contributed to the research 
> results that cite their DOI's.  This capability requires that the 
> DOI-designated datasets be given special status by ESG.  With the 
> current ESG design it may be necessary (for the purpose of defining 
> DOI datasets) to create a parallel directory structure alongside the 
> original directory structure where the model output is stored.  This 
> parallel directory would contain links to only the subset of model output 
> files that are included in the DOI-designated (and ESG 
> federation-replicated) subset.  A user with access to the actual DOI 
> archive directory would only see files included in the DOI-designated 
> data.  The user could go to the *original* directory to see *all* the 
> data available at the site, which would include the DOI-designated data.
> 11. Once the output submitted for DOI candidacy has been published, 
> archival centers that have copies of this data will publish to the 
> federation the same (subset of) model output and these copies will be 
> identified by ESG as "replicated" datasets.  These replicated datasets 
> will likely be subsets of the already published corresponding model 
> output datasets, in which case there will be two distinct datasets 
> registered with the ESG federation, one containing the entire 
> available output at the site and the other containing only the 
> replicated subset.
> 12. At this point the ESG federation will be aware of a number of 
> different datasets that are similar but differ in the fraction of 
> output included from the total output available (within the 
> granularity defined by the ESG "dataset" definition).  For example, 
> the total output might include all time samples simulated. PCMDI might 
> archive only a subset of this output.  And the DOI-candidate output 
> (which might be "replicated" at BADC and DKRZ) might include only a 
> subset of the variables (of most interest).  Thus, at least 3 
> different ESG datasets would be defined, with only one of these being 
> replicated across certain archival centers.
> 13.  The user who comes to an ESG portal should be able to search the 
> distributed, federated ESG database and find out whether data of 
> interest is available.  Initially it will likely be unimportant (from 
> the user's perspective) to learn where exactly the data is stored (and 
> I think the user should initially not see all the different ESG 
> datasets that include the data of interest).  But before the user 
> actually attempts to retrieve the data, he/she should be given the 
> opportunity to select a preferred site from which to obtain it.  ESG 
> should then provide the wget script (or equivalent) that the user can 
> subsequently use to download the data.  This wget script would access 
> the data from the preferred site, unless it is unavailable there, in 
> which case it would direct the user to an archive where it is available.
> 14.  Note that the directory structure described above includes a 
> "latest" subdirectory containing links that point to the most recent 
> versions of files available.  The wget script should probably point to 
> the links in this "latest" subdirectory because this will make it 
> possible for the user to edit the script to obtain files for a 
> different variable.  If the wget script points to the actual file 
> location for a particular variable, the user will in general be unable 
> to easily edit the wget script to get a different variable because the 
> "version subdirectory" where the latest version of each file is 
> located may differ from one variable to another.
>
> I have left out the details of what specific QC procedures are 
> required at various points in the procedure.  I have also omitted lots 
> of details that will have to be worked out.  Note also that I do not 
> think "replication" is of major interest or concern.  My view is that 
> whether a given dataset is replicated or not is not so important.  ESG 
> will say what files are available (and it will know where copies of 
> individual files can be found).  My guess is that most of the major 
> "archival centers" will want to have copies of the files that are 
> DOI-anointed, and ESG should be able to keep track of these 
> "replicated" datasets.  If this is not practical, I'm not sure how 
> making a "bigger deal" about replication remedies any difficulty 
> posed by the above.
>
> I look forward to your reactions/comments/alternative suggestions.
>
> Best regards,
> Karl
>
> P.S. It's rather late in the evening, so please allow for that in 
> reading the above.
>
>
>
>
>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ESG_archive_procedures_23032010.docx
Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Size: 19482 bytes
Desc: not available
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20100323/6a0280ed/attachment-0001.bin 

