[Go-essp-tech] Proposal for adjusting our definition of an atomic dataset

Mon Dec 7 16:33:25 MST 2009

Dear Stephen and all,

Before commenting on the substance of your email, let me suggest that we 
not talk about "standard" and "non-standard" output.  Rather, I think it 
will be less confusing to talk about:
1. CMIP5 "requested" output
2. output not requested by CMIP5. 

As an aside, I think it is best to avoid the term "core" output, and 
instead refer to the subset of the output that will be replicated at 
several gateways (e.g., PCMDI, BADC, DKRZ, ...) as "centralized CMIP5 
output".  Dean and I agree this will avoid confusion.

Now to suggestions in your email:

I'm not sure I understand option 1, but I'm definitely opposed to option 
2.  We are not talking about two different experiments; we are talking 
about different subsets of output from a single experiment.  Option 2 
would, I'm sure, confuse at least 99% of the users (well, maybe I 
exaggerate).

As for option 1,

1. What would the allowable "values" be for the additional DRS attribute?
2. What is meant by "Atomic datasets that currently span standard and 
non-standard output would be split into 2 atomic datasets"?  I don't 
think there are any current atomic datasets (except in our imagination), 
so there is no need to split them.
3.  Rather than saying "Other atomic datasets would exist in one 
category or the other," couldn't we simply say that an atomic dataset 
can refer either to all time-samples output from the run or to a subset 
of contiguous time-samples defined by the project?   [I'm not sure that 
it's absolutely necessary that they be contiguous, but I would think 
this would be less confusing.  For example, suppose the CMIP5 requested 
output was for the years 1950-1980, but the full experiment ran from 
1850 to 2005.  The atomic dataset defined by the CMIP5 requested output 
would then fall inside the atomic dataset for the non-requested output 
and would seem to "split" it, which seems contradictory (can you split 
an atomic dataset?).]
4.  Note that there are some cases in which the CMIP5 *requested* output 
is non-contiguous.  For example, in the case of aerosol data, some of 
the 3-D fields are collected in 1-year samples as follows: 1850 to 1950 
every 20 years, 1960 to 2020 every 10 years, 2040 to 2100 every 20 
years.  If we require the time-samples in an atomic dataset to be 
contiguous, then 17 different atomic datasets would be needed to hold 
the CMIP5 requested output for these variables.  Perhaps that's 
unattractive and argues against requiring that the data be contiguous.
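
For anyone who wants to check the arithmetic in points 3 and 4, here is 
a rough Python sketch.  The function and variable names are made up for 
illustration and are not part of the DRS or any ESG software.  It 
assumes each aerosol sample covers exactly one calendar year and that 
"contiguous" simply means consecutive years.

```python
def sample_years(blocks):
    """Expand (start, end, step) blocks into a sorted list of sampled years."""
    years = set()
    for start, end, step in blocks:
        years.update(range(start, end + 1, step))
    return sorted(years)


def contiguous_runs(years):
    """Group a sorted list of years into runs of consecutive years."""
    runs = []
    for year in years:
        if runs and year == runs[-1][-1] + 1:
            runs[-1].append(year)   # extends the current run
        else:
            runs.append([year])     # gap found, start a new run
    return runs


# Point 4: aerosol 3-D fields, 1-year samples (hypothetical encoding).
AEROSOL_BLOCKS = [(1850, 1950, 20), (1960, 2020, 10), (2040, 2100, 20)]
aerosol_runs = contiguous_runs(sample_years(AEROSOL_BLOCKS))
print(len(aerosol_runs))  # 17 separate contiguous pieces

# Point 3: requested 1950-1980 carved out of a full 1850-2005 run.
full_run = set(range(1850, 2006))
requested = set(range(1950, 1981))
leftover_runs = contiguous_runs(sorted(full_run - requested))
print([(run[0], run[-1]) for run in leftover_runs])
```

The second print shows the non-requested years falling into two 
pieces, 1850-1949 and 1981-2005, which is the "split" I was worried 
about in point 3.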

I'll try to join tomorrow at the beginning, at least.

Best regards,
Karl

stephen.pascoe at stfc.ac.uk wrote:
>
> A bunch of the ESG developers are at NCAR this week talking in detail
> about versioning and representing replicas in the datanode and gateway.  
> We have come to the conclusion that in order to implement replication 
> we need to confine ourselves to replicating entire atomic datasets.  
> We would like to work with the following principles:
>
> 1. CMIP5 archive is a set of atomic datasets
> 2. The CMIP5 standard output is a subset of the CMIP5 archive
> 3. We only replicate entire atomic datasets.
>
> From previous emails it is apparent that the standard output does not
> correspond to a set of atomic datasets because in some cases standard
> output is a temporal subset of an atomic dataset.  This implies that a
> replica of an atomic dataset would be a temporal subset of that atomic
> dataset.
>
> Therefore we propose adjusting the definition of an atomic dataset to
> allow us to only replicate entire atomic datasets.  We suggest 2 ways
> of achieving this:
>
>  1. Add an extra attribute to the DRS syntax to represent the
>  difference between standard and non-standard output.  Atomic datasets
>  that currently span standard and non-standard output would be split
>  into 2 atomic datasets.  Other atomic datasets would exist in one
>  category or the other.
>
>  2. Split all experiments (as defined in the DRS) that contain atomic
>  datasets that span standard and non-standard output into 2
>  experiments e.g. "<expt>_standard", "<expt>_optional".
>
> We'd like to discuss this proposal at the telco tomorrow.  Comments welcome.
>
> Thanks,
> Stephen.
>
>