[Go-essp-tech] Are atomic datasets mutable?

Fri Nov 20 03:26:56 MST 2009

Karl,

First on a specific point:

> At a lower priority some of these runs will be extended to the end of
> the 23rd century.
> ... The component time-periods are part of the same experiment and
> CMOR2 will write output into the same directory.

Does this mean that a single atomic dataset could contain data from 2
different tiers (core and tier 1)?  If so I think this contradicts
Bryan's assertion:

> There are two classes of this example:
>  - some of these examples are going to come from tier1 experiments
>    which extend core experiments.
>  - some are going to be because of the way folk have done their runs
>
> The first instance is covered by the fact that technically the second
> tranche of data is a different atomic dataset, and analysis should
> exploit a concatenation of two atomic datasets
> ...

On the more general point I completely agree with Bryan.  The DOI
point is the most concise statement of why we need immutable atomic
datasets.  What we need to agree on is what is a "version".  We probably
have different ideas what a version is so I'll share my perspective,
which comes from software engineering and version control systems.

Version control systems have a concept of an atomic unit: a file.  Any
change to a file is considered a new version (or revision, the
terminology varies).  Such a system doesn't differentiate between
additions and changes -- a change could be as trivial as a newline at
the end of the file.  The point is that the system's knowledge of the
internal structure of objects needs to stop somewhere and that's the
atomic unit.
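
To make that concrete, here is a minimal sketch of the idea in Python.
It assumes a revision is identified purely by a hash of its bytes; the
function name and the sample records are hypothetical, and real
version control systems differ in the details.

```python
import hashlib


def revision_id(content: bytes) -> str:
    """Identify a revision by its bytes alone, ignoring internal structure."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical file contents, for illustration only.
original = b"tas,2005,287.4\n"
extended = original + b"tas,2006,287.6\n"  # an addition
touched = original + b"\n"                 # a trivial trailing newline

# Collect the distinct identifiers: an addition and a trivial change
# are treated alike, each simply yields a different revision.
ids = {revision_id(original), revision_id(extended), revision_id(touched)}
```

All three inputs end up with different identifiers, so the system never
has to decide whether a change was "only" an extension.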

Therefore I think extension should imply a new version.  As Bryan says,
explaining the relationship between 2 versions is the job of metadata.
Also, the problem of how we efficiently store 2 versions, one of which
is an extension of the other, can be solved separately.
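
As a rough sketch of how metadata might describe the relationship
between 2 versions, assume each version carries a manifest mapping file
names to checksums.  Everything below (the class, the function, the
file names and the checksum strings) is hypothetical and not part of
CMOR2 or any existing tool.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatasetVersion:
    """One version of an atomic dataset, described by its manifest."""
    number: int
    files: Mapping[str, str]  # file name -> checksum


def relationship(old: DatasetVersion, new: DatasetVersion) -> str:
    """Classify how `new` relates to `old` by comparing manifests."""
    if dict(old.files) == dict(new.files):
        return "identical"
    # An extension keeps every old file byte-for-byte and only adds files.
    unchanged = all(
        new.files.get(name) == checksum
        for name, checksum in old.files.items()
    )
    return "extension" if unchanged else "modification"


# Hypothetical manifests: v2 adds a later time chunk, v3 replaces a file.
v1 = DatasetVersion(1, {"tas_2006-2100.nc": "aaa111"})
v2 = DatasetVersion(2, {"tas_2006-2100.nc": "aaa111",
                        "tas_2101-2300.nc": "bbb222"})
v3 = DatasetVersion(3, {"tas_2006-2100.nc": "ccc333",
                        "tas_2101-2300.nc": "bbb222"})

v1_to_v2 = relationship(v1, v2)
v1_to_v3 = relationship(v1, v3)
```

With something like this a user holding version 1 could tell from the
metadata alone whether version 2 merely extends what they already have
or replaces some of it.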

S.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Karl Taylor [mailto:taylor13 at llnl.gov] 
Sent: 19 November 2009 20:34
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] Are atomic datasets mutable?

Hi all,

Another common case that we'll have to deal with in CMIP5 and which
should be considered in defining what a "version" is:  For the future
(so-called RCP) runs, CMIP5 calls for runs initiated from the end of the
historical run and as part of the core set of expts., running to the end
of the 21st century.  At a lower priority some of these runs will be
extended to the end of the 23rd century.  Groups will likely carry out
these simulations in stages, sending us the 21st century output long 
before the 22nd and 23rd century output becomes available.   The 
component time-periods are part of the same experiment and CMOR2 will
write output into the same directory. 

I think the users will be confused if a new "version" of model output
(that has been modified in some way) is indistinguishable from model
output that has been simply extended.  Both of the following options
will be confusing:

Option 1) 21st century data for the RCP4.5 future run is received and
identified as version 1.  Then the continuation of that run to the end
of the 23rd century is received and stored as version 2.  New users will
have to download both version 1 and version 2 to get the complete run.

Option 2) 21st century data for the RCP4.5 future run is received and 
identified as version 1.   Then the continuation of that run to the end 
of the 23rd century is received and stored as version 2 along with a
copy of the data already stored as version 1.  In this case a new user
will get all the data by downloading version 2, but an old user who
already downloaded version 1, won't know if what's in version 2 is a
duplicate of the data he already has, or is replacement data which has
corrected some problems in the earlier version.

I would suggest therefore that for a single experiment, it would be best
from a user's perspective to not assign a new version to model output
that simply extends a previous run.  We will have to find a method by
which to advise old users who already downloaded data that the runs have
now been extended.

Best regards,
Karl

stephen.pascoe at stfc.ac.uk wrote:
> Hi all,
>  
> The UKMO has flagged up a use case where an atomic dataset might 
> change over time without being a new version.  The example is the 1000
> year piControl run where UKMO is likely to deliver it in several time 
> chunks and would want it to be published before the full run is 
> complete.  Since atomic datasets represent the whole time period these
> datasets will grow over time.
>  
> I am tempted to say each addition to the dataset triggers a new 
> version that deprecates the previous one but UKMO wasn't too keen on 
> that.  Any ideas?
>  
> S.
>  
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
