[Go-essp-tech] Are atomic datasets mutable?

Mon Nov 23 02:43:11 MST 2009

Hi Folks

This is getting quite hard to follow even for me, and I care about every little detail in this instance. We'll write up a summary of all the threads shortly, but meanwhile ...

1) The definition of experiment ...

The way we have set up the CMIP5 description with "continuation" experiments in the tiers is obviously already causing confusion. Metafor, which is riddled with climate modellers, had interpreted the case of a tierN extension as being a "new" experiment, and the reason was pretty obvious: most places will do the core experiments, fewer will do the tierN, so an ensemble average on the core timescale will be rather different from an ensemble average on the tierN timescale. Further, we expect someone to be able to cite a core description including which models contributed *without* having to document them individually in the paper itself (by default, see the discussion in 3) below). Clearly that becomes impossible if you allow the tierN and core experiments to be essentially the same.

This becomes impossible if it's done the way you describe now ...  and would cause a major rethink in the work we have done.

The only issue for a consumer of the archive in keeping them split is that you have to think about concatenating two atomic datasets if you need to do a full-period analysis. It becomes much harder to document and handle if you don't ...
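
To make the cost of keeping them split concrete, here is a minimal sketch in Python of what a consumer would have to do. Everything in it is an assumption for illustration: the AtomicDataset class, its fields and the concatenate function are hypothetical names, not part of the DRS, CMOR2 or any ESG tool, and one value per year stands in for real model output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicDataset:
    """Hypothetical stand-in for one atomic dataset (not a real DRS object)."""
    experiment: str   # illustrative label, e.g. "core" or "tier1"
    start_year: int   # first year covered, inclusive
    end_year: int     # last year covered, inclusive
    values: tuple     # one illustrative value per year

    def __post_init__(self):
        # Guard against a dataset whose data does not match its stated period.
        expected = self.end_year - self.start_year + 1
        if len(self.values) != expected:
            raise ValueError("number of values does not match the period")


def concatenate(core, extension):
    """Join a core dataset and its extension into one (year, value) series."""
    # The extension must start in the year after the core run ends.
    if extension.start_year != core.end_year + 1:
        raise ValueError("extension does not start where the core run ends")
    years = range(core.start_year, extension.end_year + 1)
    return list(zip(years, core.values + extension.values))
```

The only real work is the contiguity check; the rest is a plain join of the two periods.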

That said, the consequence of the metafor decision is a lot of spurious experiment conformance documentation, in so far as the two experiments will conform in the same way ... (all the other documentation comes for free via copying).

The metafor decisions can be undone ... and relatively quickly, but I personally think the discrimination is a helpful one for archive consumers. I've copied this to metafor to see if there is a will to do this.

2) Downloading

As Stephen says, this is a different problem. It seems that we are treating versioning + DRS like a hammer and all our problems like nails. If I have two different versions, then a) the metadata should tell me why they are different, and b) the catalog system should help me with tools to download version differences.
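
As a rough illustration of b), a catalog tool could compare per-file checksums between two versions and tell a user which files to fetch. This is only a sketch under stated assumptions: the manifest layout (file name mapped to checksum) and the version_diff function are hypothetical, and nothing here describes how the ESG publisher or gateway actually works.

```python
def version_diff(old_manifest, new_manifest):
    """Compare two hypothetical manifests of the form {file_name: checksum}.

    Returns the files a user holding the old version would need to fetch
    (added or changed) and the files that no longer exist (removed).
    """
    added = sorted(name for name in new_manifest if name not in old_manifest)
    changed = sorted(
        name
        for name in new_manifest
        if name in old_manifest and new_manifest[name] != old_manifest[name]
    )
    removed = sorted(name for name in old_manifest if name not in new_manifest)
    return {"added": added, "changed": changed, "removed": removed}
```

With something like this, a version that merely extends a run shows up entirely under "added", while a version that corrects earlier data shows up under "changed", which is the distinction someone who already downloaded the earlier version needs.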

Now it may be that ESG publisher + ESG gateway can't do b) on day one, but that's no reason to make a hostage to fortune ...

3) Analysis Subsets

I think it's fairly obvious that modellers *can* (not *will*) be fairly discriminating about which models they analyse in specific contexts. History and intuition tell us that the wider community of archive users probably will not; they'll use entire sets of models available in various contexts ...

Versioning matters in this context simply as a marker that something has changed; we can't handle all the reasons why things have changed in the DRS URL or DOI.

All that said, we can use it to give hints, and Karl's made a proposal in a separate email, and I'll reply to that separately.

Cheers
Bryan

On Monday 23 November 2009 05:34:13 Karl Taylor wrote:
> Dear Stephen,
> 
> Here are some responses to your email:
> 
> stephen.pascoe at stfc.ac.uk wrote:
> > CMOR2 will write output into the same directory. 
> >
> > Does this mean that a single atomic dataset could contain data from 2
> > different tiers (core and tier 1)? 
> Yes.
> >
> > On the more general point I completely agree with Bryan.  The DOI
> > point is the most concise statement of why we need immutable atomic
> > datasets.  What we need to agree on is what is a "version".  We probably
> > have different ideas what a version is so I'll share my perspective,
> > which comes from software engineering and version control systems.
> >
> > VCS systems have a concept of an atomic unit: a file.  Any change to a
> > file is considered a new version (or revision, the terminology varies).
> > It doesn't differentiate between additions and changes -- a change could
> > be as trivial as a newline at the end of the file.  The point is that
> > the system's knowledge of the internal structure of objects needs to
> > stop somewhere and that's the atomic unit.  
> >
> > Therefore I think extension should imply a new version.  As Bryan says
> > the job of explaining the relationship between 2 versions is the job of
> > metadata.  Also the problem of how we efficiently store 2 versions one
> > of which is an extension of the first can be solved separately.
> >
> >   
>  From a user's perspective, I think the common understanding of 
> "version" will be that it differs in some substantive way from other 
> versions, not that it simply contains data not previously contributed to 
> the archive.  Anyone publishing a scientific article will have to say 
> which time-period he analyzed, and this will not be evident simply by 
> specifying the version, since many papers will be based on some subset 
> of the total output available under a single version number.
> 
> best regards,
> Karl
> > S.
> >
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > British Atmospheric Data Centre
> > Rutherford Appleton Laboratory
> >
> > -----Original Message-----
> > From: Karl Taylor [mailto:taylor13 at llnl.gov] 
> > Sent: 19 November 2009 20:34
> > To: Pascoe, Stephen (STFC,RAL,SSTD)
> > Cc: go-essp-tech at ucar.edu
> > Subject: Re: [Go-essp-tech] Are atomic datasets mutable?
> >
> > Hi all,
> >
> > Another common case that we'll have to deal with in CMIP5 and which
> > should be considered in defining what a "version" is:  For the future
> > (so-called RCP) runs, CMIP5 calls for runs initiated from the end of the
> > historical run and as part of the core set of expts., running to the end
> > of the 21st century.  At a lower priority some of these runs will be
> > extended to the end of the 23rd century.  Groups will likely carry out
> > these simulations in stages, sending us the 21st century output long 
> > before the 22nd and 23rd century output becomes available.   The 
> > component time-periods are part of the same experiment and CMOR2 will
> > write output into the same directory. 
> >
> > I think the users will be confused if a new "version" of model output
> > (that has been modified in some way) is indistinguishable from model
> > output that has been simply extended.  Both of the following options
> > will be confusing:
> >
> > Option 1) 21st century data for the RCP4.5 future run is received and
> > identified as version 1.  Then the continuation of that run to the end
> > of the 23rd century is received and stored as version 2.  New users will
> > have to download both version 1 and version 2 to get the complete run.
> >
> > Option 2) 21st century data for the RCP4.5 future run is received and 
> > identified as version 1.   Then the continuation of that run to the end 
> > of the 23rd century is received and stored as version 2 along with a
> > copy of the data already stored as version 1.  In this case a new user
> > will get all the data by downloading version 2, but an old user who
> > already downloaded version 1, won't know if what's in version 2 is a
> > duplicate of the data he already has, or is replacement data which has
> > corrected some problems in the earlier version.
> >
> > I would suggest therefore that for a single experiment, it would be best
> > from a user's perspective to not assign a new version to model output
> > that simply extends a previous run.  We will have to find a method by
> > which to advise old users who already downloaded data that the runs have
> > now been extended.
> >
> > Best regards,
> > Karl
> >
> >
> > stephen.pascoe at stfc.ac.uk wrote:
> >   
> >> Hi all,
> >>  
> >> The UKMO has flagged up a use case where an atomic dataset might 
> >> change over time without being a new version.  The example is the 1000
> >> year piControl run where UKMO is likely to deliver it in several time 
> >> chunks and would want it to be published before the full run is 
> >> complete.  Since atomic datasets represent the whole time period these
> >> datasets will grow over time.
> >>  
> >> I am tempted to say each addition to the dataset triggers a new 
> >> version that deprecates the previous one but UKMO wasn't too keen on 
> >> that.  Any ideas?
> >>  
> >> S.
> >>  
> >> ---
> >> Stephen Pascoe  +44 (0)1235 445980
> >> British Atmospheric Data Centre
> >> Rutherford Appleton Laboratory
> >>  
> >>
> >> ------------------------------------------------------------------------
> >>
> >> _______________________________________________
> >> GO-ESSP-TECH mailing list
> >> GO-ESSP-TECH at ucar.edu
> >> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
> >>   
> >>     
> >
> >   
> 
> 

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence