[Go-essp-tech] How will it all work?

Karl Taylor taylor13 at llnl.gov
Mon Mar 22 03:29:40 MDT 2010


Dear all,

Here is an attempt to write down how CMIP5 data might be served by ESG.  
Perhaps someone can find a better way to do this, but if not, perhaps 
this will be acceptable. (I apologize if my limited understanding of ESG 
means that this is impractical or stupid.  It is meant to inspire 
others to come up with a better approach, but I would like to see a very 
explicit written description of any proposed alternative.)  Perhaps some 
of you will have a chance to study this before our next teleconference.

Procedure for putting in place the CMIP5 archive:
1.  A modeling group generates model output in native format and file 
structure.
2.  The modeling group rewrites the data consistent with CMIP5 
requirements (see attached document) using either CMOR2 or equivalent 
post-processing code.  Data is placed in a directory structure 
specified by the "CMIP5 and AR5 Data Reference Syntax (DRS)" (see 
http://cmip-pcmdi.llnl.gov/cmip5/docs/cmip5_data_reference_syntax.pdf).  [This 
is automatically assured by CMOR2, but otherwise must be enforced by the 
user.]
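
For illustration, a DRS-compliant path can be generated mechanically from 
its components (the full directory template is spelled out in step 4 
below).  Here is a rough Python sketch with hypothetical example values; 
the authoritative vocabularies are those in the DRS document:

    import os

    # Hypothetical example values for one variable of one run; the
    # controlled vocabularies are defined in the DRS document.
    drs = {
        "activity": "cmip5",
        "product": "output",
        "institute": "IPSL",
        "model": "IPSL-CM5A-LR",
        "experiment": "historical",
        "frequency": "mon",
        "realm": "atmos",
        "variable": "tas",
        "ensemble": "r1i1p1",
    }

    # Directory order prescribed by the DRS (version subdirectories
    # are added at publication time; see step 4):
    order = ("activity", "product", "institute", "model", "experiment",
             "frequency", "realm", "variable", "ensemble")
    print(os.path.join(*(drs[k] for k in order)))
    # cmip5/output/IPSL/IPSL-CM5A-LR/historical/mon/atmos/tas/r1i1p1
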
3. Processing output through CMOR2 (available from PCMDI) guarantees 
that certain quality control (QC) criteria are satisfied.  To ensure 
that the same QC criteria are met by output that has *not* been 
processed through CMOR2, that output should be required to pass the 
tests imposed by the "CMOR2 checker" code (also available from PCMDI).
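
The specific tests are defined by the checker itself, but to give a 
flavor of the kind of file-level check involved, here is a sketch (this 
is *not* the actual CMOR2 checker; it uses the netCDF4 Python library, 
and the attribute list is illustrative only):

    from netCDF4 import Dataset

    # A few of the global attributes CMIP5 requires; illustrative only.
    REQUIRED_GLOBALS = ["institution", "source", "experiment_id",
                        "frequency", "realization", "table_id"]

    def rough_qc(path):
        """Return the required global attributes missing from the file;
        an empty list means this rough check passed."""
        with Dataset(path) as nc:
            return [a for a in REQUIRED_GLOBALS if a not in nc.ncattrs()]
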
4.  For modeling groups hosting an ESG node, CMIP5-compliant model 
output is "published" to the ESG federation (i.e., it is registered in 
an ESG catalog and becomes visible to the ESG federation).  Other 
groups unable to host an ESG node may send output to an archival center 
(e.g., PCMDI, BADC, DKRZ), which will become the surrogate "owner" of the 
output. The owner will publish the data to the ESG federation. As a 
first step in the publication job stream, the files will be moved from a 
directory at the "realization" level to a subdirectory at the "version" 
level. [CMOR2 writes data to the following directory: 
<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling 
realm>/<variable name>/<ensemble member>/, and in the ESG publisher 
procedure the files will be moved to a directory under <ensemble 
member> named v<i>, where i is the version number assigned by ESG. 
Note that ESG assigns version numbers to individual files and 
also to "datasets"; the "i" here refers to the file version number.  An ESG 
"dataset" comprises several variables produced from a single run (and 
realization) of a single model.  Output for a single variable may be 
stored in several files.  Thus, a dataset will include files for a 
number of variables, and in general the data for each variable will be 
stored in multiple files.  ESG will assign a version number to each file 
(and the directory name will be consistent with this), and ESG will also 
assign a version number to each dataset.  If the version of any file 
within a dataset is incremented, then the version of the dataset must be 
incremented (by 1).]
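
A rough sketch of the publication-time move just described (the function 
and argument names are mine; the real step is carried out by the ESG 
publisher):

    import os
    import shutil

    def move_to_version_dir(ensemble_dir, file_version):
        """Move newly written files from the <ensemble member> directory
        into a v<i> subdirectory, as the ESG publisher would at
        publication time."""
        vdir = os.path.join(ensemble_dir, "v%d" % file_version)
        os.makedirs(vdir, exist_ok=True)
        for name in os.listdir(ensemble_dir):
            src = os.path.join(ensemble_dir, name)
            if os.path.isfile(src):      # skip v<i> and "latest" subdirs
                shutil.move(src, os.path.join(vdir, name))
        return vdir
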
5.  As a step in the ESG publication procedure, a subdirectory named 
"latest" will be created under <ensemble member>, and links will be 
created in this subdirectory pointing to the latest versions of all files 
that together contribute to the latest version of the dataset.  This 
so-called "latest" subdirectory can be accessed to retrieve the most 
recent (and, presumably, trustworthy) model output available.
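
For instance, the "latest" links might be maintained as follows (a 
sketch assuming a POSIX filesystem and the layout above; all names are 
illustrative):

    import os

    def refresh_latest(ensemble_dir, latest_file_versions):
        """(Re)build the 'latest' subdirectory: one symlink per file,
        each pointing into the version subdirectory holding that file's
        most recent version.  latest_file_versions maps file name ->
        latest file version number."""
        latest = os.path.join(ensemble_dir, "latest")
        os.makedirs(latest, exist_ok=True)
        for name, v in latest_file_versions.items():
            link = os.path.join(latest, name)
            if os.path.lexists(link):
                os.remove(link)
            # relative links, so the whole tree can be relocated intact
            os.symlink(os.path.join("..", "v%d" % v, name), link)
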
6.  The owner (or surrogate owner) of the model output will send the 
so-called "CMIP5 requested model output" (as defined in a document 
available from 
http://cmip-pcmdi.llnl.gov/cmip5/output_req.html?submenuheader=3#req_list)  
via 2-Tbyte disks to PCMDI (and subsequently it will be passed on to 
other archival centers).  Each archival center will decide whether to 
store all, some, or none of the requested output.  There is no 
requirement that all of the archival centers host exactly the same 
portion of the model output.
7. Each archival center will store the output in a directory structure 
consistent with the "CMIP5 and AR5 Data Reference Syntax (DRS)", as 
described above.  The "version number" assigned to each file (and, as 
automatically guaranteed by the ESG publication procedure, also 
reflected in the directory name?) would ideally be the same as that 
found at the data owner's node, but I don't think this is essential.  
Note that each
of the archival centers will publish to the ESG federation the subset 
(or complete) model output it chooses to archive.
8.  If users find errors in the model output that has been published (or 
if additional quality assurance procedures applied by the ESG federation 
uncover any flaws), these are reported to the data "owner", who may 
withdraw the output and possibly replace it with corrected output.  If 
the data is withdrawn and not replaced, the data owner informs the 
federation that the data has been withdrawn, and the archival centers 
withdraw all the affected files.  At all sites the dataset version is 
incremented, and the withdrawn files are not included in this new 
version of the dataset.  If the data is replaced, the data owner 
publishes the new data (placed in an incremented "version" subdirectory) 
and informs the federation that the data has been replaced.  The 
archival centers update their archives with the latest files (placed in 
incremented "version" subdirectories).  At all sites the dataset 
"version" is also incremented, and this new dataset version now includes 
the replacement files.
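
The version bookkeeping in this step might look roughly like the 
following sketch (names and data structures are mine, purely for 
illustration):

    def next_dataset_version(current_files, withdrawn, replaced):
        """current_files maps file name -> file version; withdrawn is
        the set of names removed without replacement; replaced is the
        set of names republished with corrections.  Returns the
        file-version map for the new dataset version."""
        new_files = {}
        for name, version in current_files.items():
            if name in withdrawn:
                continue                       # dropped from new version
            elif name in replaced:
                new_files[name] = version + 1  # corrected file, new v<i>
            else:
                new_files[name] = version      # unchanged, carried over
        return new_files

    # Example: tas_2000.nc is withdrawn, pr_2000.nc is replaced;
    # the dataset version itself is incremented by 1 in either case.
    files = {"tas_2000.nc": 1, "pr_2000.nc": 1, "psl_2000.nc": 1}
    print(next_dataset_version(files, {"tas_2000.nc"}, {"pr_2000.nc"}))
    # {'pr_2000.nc': 2, 'psl_2000.nc': 1}
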
9.  At a time when the dataset has "matured" and it is deemed 
appropriate, a (substantial) subset of the "CMIP5 requested output" for 
a given model and experiment will be submitted for assignment of DOIs.  
(DOIs will be assigned with a granularity following the ESG "dataset" 
granularity -- i.e., a DOI will be assigned to each subset of a single 
model's output defined by a single experiment, a single realization, a 
single realm, and a single frequency.  The dataset will include many 
variables.)  The procedure for assigning a DOI to model output is 
described elsewhere, but a requirement is that the data must be archived 
at one, some, or all of the following locations: PCMDI, BADC, and DKRZ.  
The expected persistence of these groups and their ability to support 
data archives make it likely that the output will remain accessible far 
into the future.
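
To make the granularity concrete, a DOI-candidate dataset would be 
identified by something like the following (the identifier format is my 
own illustration, not an agreed convention):

    def doi_dataset_id(model, experiment, realization, realm, frequency):
        """One DOI per model/experiment/realization/realm/frequency
        combination; each such dataset contains many variables."""
        return ".".join([model, experiment, realization, realm, frequency])

    print(doi_dataset_id("IPSL-CM5A-LR", "historical", "r1i1p1",
                         "atmos", "mon"))
    # IPSL-CM5A-LR.historical.r1i1p1.atmos.mon
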
10.  As part of the submission procedure for DOI status, the model 
output "owner" will publish a set of new "ESG datasets" that will 
typically include only a subset of the original model output.  Each of 
these new datasets is a candidate for DOI assignment.  Because these new 
ESG datasets constitute a subset of the originally published "model 
output", they may not be of much interest to users who come to ESG in 
search of data (since the users will presumably be keen to examine *all* 
the model output).  Nevertheless, if DOI status is granted, the subset 
of output included will presumably be perceived as somewhat more 
permanent and reliable (since we expect additional quality assurance 
procedures will be invoked in the process of gaining DOI status).  The 
DOIs will also serve future researchers who might want to reproduce 
research results that cite certain DOI-labeled datasets.  The modeling 
groups will also be able to substantiate claims that their data has 
actually contributed to the research results that cite their DOIs.  
This capability requires that the DOI-designated datasets be given 
special status by ESG.  With the current ESG design it may be necessary 
(for the purpose of defining DOI datasets) to create a directory 
structure parallel to the original directory structure where the model 
output is stored.  This parallel directory would contain links to only 
the subset of model output files that are included in the DOI-designated
(and ESG federation-replicated) subset.  A user with access to the 
actual DOI archive directory would only see files included in the 
DOI-designated data.  The user could go to the *original* directory to 
see *all* the data available at the site, which would include the 
DOI-designated data.
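
A sketch of how such a parallel link tree might be built (all names are 
illustrative; the real mechanism would be part of the DOI publication 
procedure):

    import os

    def build_doi_tree(archive_root, doi_root, doi_files):
        """Mirror the DRS structure under doi_root, creating one symlink
        per file in the DOI-designated subset.  doi_files holds paths
        relative to archive_root."""
        for rel in doi_files:
            target = os.path.join(archive_root, rel)
            link = os.path.join(doi_root, rel)
            os.makedirs(os.path.dirname(link), exist_ok=True)
            if not os.path.lexists(link):
                os.symlink(target, link)
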
11. Once the output submitted for DOI candidacy has been published, 
archival centers that have copies of this data will publish to the 
federation the same (subset of) model output, and these copies will be 
identified by ESG as "replicated" datasets.  These replicated datasets
will likely be subsets of the already published corresponding model 
output datasets, in which case there will be two distinct datasets 
registered with the ESG federation, one containing the entire available 
output at the site and the other containing only the replicated subset.
12. At this point the ESG federation will be aware of a number of 
different datasets that are similar but differ in the fraction of output 
included from the total output available (within the granularity defined 
by the ESG "dataset" definition).  For example, the total output might 
include all time samples simulated. PCMDI might archive only a subset of 
this output.  And the DOI-candidate output (which might be "replicated" 
at BADC and DKRZ) might include only a subset of the variables (of most 
interest).  Thus, at least 3 different ESG datasets would be defined, 
with only one of these being replicated across certain archival centers.
13.  The user who comes to an ESG portal should be able to search the 
distributed, federated ESG database and find out whether data of 
interest is available.  Initially it will likely be unimportant (from 
the user's perspective) to learn where exactly the data is stored (and I 
think the user should initially not see all the different ESG datasets 
that include the data of interest).  But before the user actually 
attempts to retrieve the data, he/she should be given the opportunity to 
select a preferred site from which to obtain it.  ESG should then 
provide the wget script (or equivalent) that the user can subsequently 
use to download the data.  This wget script would access the data from 
the preferred site, unless it were unavailable there, in which case it 
would direct the user to an archive where it was available.
14.  Note that the directory structure described above includes a 
"latest" subdirectory containing links that point to the most recent 
versions of files available.  The wget script should probably point to 
the links in this "latest" subdirectory because this will make it 
possible for the user to edit the script to obtain files for a different 
variable.  If the wget script points to the actual file location for a 
particular variable, the user will in general be unable to easily edit 
the wget script to get a different variable because the "version 
subdirectory" where the latest version of each file is located may 
differ from one variable to another.
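
To illustrate points 13 and 14 together: a portal might generate the 
user's script roughly as follows (the host names, URL layout, and file 
name are assumptions on my part).  Because every URL goes through 
"latest", switching to a different variable is a simple textual edit of 
the script:

    # Sketch of wget-script generation; hosts and layout are illustrative.
    PREFERRED = "http://preferred-site.example.org/data"
    FALLBACK = "http://fallback-site.example.org/data"

    def wget_script(variable_dir, filenames):
        """Emit a shell script that fetches files through the 'latest'
        links, trying the preferred site first and falling back to a
        second archive if the download fails."""
        lines = ["#!/bin/sh"]
        for name in filenames:
            rel = "%s/latest/%s" % (variable_dir, name)
            lines.append("wget %s/%s || wget %s/%s"
                         % (PREFERRED, rel, FALLBACK, rel))
        return "\n".join(lines)

    print(wget_script(
        "cmip5/output/IPSL/IPSL-CM5A-LR/historical/mon/atmos/tas/r1i1p1",
        ["tas_Amon_IPSL-CM5A-LR_historical_r1i1p1_200001-200912.nc"]))
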

I have left out the details of what specific QC procedures are required 
at various points in the procedure.  I have also omitted lots of details 
that will have to be worked out.  Note also that I do not think 
"replication" is of major interest or concern.  My view is that whether 
a given dataset is replicated or not is not so important.  ESG will say 
what files are available (and it will know where copies of individual 
files can be found).  My guess is that most of the major "archival 
centers" will want to have copies of the files that are DOI-anointed, 
and ESG should be able to keep track of these "replicated" datasets.  If 
this is not practical, I'm not sure how making a "bigger deal" about 
replication remedies any difficulty posed by the above.

I look forward to your reactions/comments/alternative suggestions.

Best regards,
Karl

P.S. It's rather late in the evening, so please allow for that in 
reading the above.



-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMIP5_output_requirements19Mar10.pdf
Type: application/pdf
Size: 244288 bytes
Desc: not available
Url : http://mailman.ucar.edu/pipermail/go-essp-tech/attachments/20100322/95780118/attachment-0001.pdf 

