[Go-essp-tech] Are atomic datasets mutable?

Mon Nov 23 04:00:10 MST 2009

Hi Karl,

> > Does this mean that a single atomic dataset could contain data from 
> > 2 different tiers (core and tier 1)?
> Yes.

I can see a few serious problems with this.

It has implications for our data management system, particularly how we
replicate the core.  It means that our atomic datasets aren't atomic!
Consider an atomic dataset that has both core and tier 1 data.  The core
portion will be replicated at PCMDI, BADC, etc.  These replicas will
contain less data than the copy at the originating node.  This will be
extremely confusing to users who download replicas of the core.

It also makes our job of working out what needs replicating more
difficult.  We had assumed that we just needed to decide which atomic
datasets needed replicating.
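
To make the replication concern concrete, here is a minimal Python sketch.
The file names, the tier labels and the replicate_core() helper are all
hypothetical, made up for illustration; none of them come from CMOR2 or
from any existing replication tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str  # hypothetical file name
    tier: str  # "core" or "tier1" (labels assumed for this sketch)


def replicate_core(dataset):
    """Return the subset of files a core-only replica would hold."""
    return frozenset(f for f in dataset if f.tier == "core")


# One "atomic" dataset that mixes both tiers (illustrative contents only).
origin = frozenset({
    DataFile("tas_2006-2100.nc", "core"),
    DataFile("tas_2101-2300.nc", "tier1"),
})

replica = replicate_core(origin)
missing = origin - replica

# The replica carries the same dataset identity but fewer files.
print(len(origin), len(replica))
print(sorted(f.name for f in missing))
```

If replication decisions were made per atomic dataset, as we had assumed,
origin and replica would always hold the same set of files.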

My experience with scientific user communities is quite limited, but from
my knowledge of the impacts community, many people will create aggregate
statistics from a basket of models and experiments to generate 1 or 2
numbers.  In this case I can see lots of confusion if a "dataset"
doesn't have a clearly defined time period.

I quite like your versioning algorithm and directory layout but if we
can't sort these issues out we have a big problem.

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Karl Taylor [mailto:taylor13 at llnl.gov] 
Sent: 23 November 2009 05:34
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu; Lawrence, Bryan (STFC,RAL,SSTD); Juckes,
Martin (STFC,RAL,SSTD); Bob Drach; Charles Doutriaux
Subject: Re: [Go-essp-tech] Are atomic datasets mutable?

Dear Stephen,

Here are some responses to your email:

stephen.pascoe at stfc.ac.uk wrote:
> CMOR2 will write output into the same directory. 
>
> Does this mean that a single atomic dataset could contain data from 2 
> different tiers (core and tier 1)?
Yes.
>
> On the more general point I completely agree with Bryan.  The DOI 
> point is the most concise statement of why we need immutable atomic 
> datasets.  What we need to agree on is what is a "version".  We 
> probably have different ideas what a version is so I'll share my 
> perspective, which comes from software engineering and version control
> systems.
>
> A VCS has a concept of an atomic unit: a file.  Any change to a
> file is considered a new version (or revision, the terminology varies).
> It doesn't differentiate between additions and changes -- a change 
> could be as trivial as a newline at the end of the file.  The point is
> that the system's knowledge of the internal structure of objects needs
> to stop somewhere and that's the atomic unit.
>
> Therefore I think extension should imply a new version.  As Bryan says,
> explaining the relationship between 2 versions is the job 
> of metadata.  Also the problem of how we efficiently store 2 versions, 
> one of which is an extension of the first, can be solved separately.
>
>   
From a user's perspective, I think the common understanding of
"version" will be that it differs in some substantive way from other
versions, not that it simply contains data not previously contributed to
the archive.  Anyone publishing a scientific article will have to say
which time-period he analyzed, and this will not be evident simply by
specifying the version, since many papers will be based on some subset
of the total output available under a single version number.
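
As a rough illustration of this point, here is a minimal Python sketch.
The DatasetState record, its fields, the extend() and citation() helpers
and the year values are all assumptions made up for the example; this is
not an agreed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetState:
    version: int     # changes only when the data is substantively modified
    start_year: int  # first year currently held in the archive
    end_year: int    # last year currently held in the archive


def extend(state, new_end_year):
    """Add later years without assigning a new version number."""
    if new_end_year <= state.end_year:
        raise ValueError("an extension must add later years")
    return DatasetState(state.version, state.start_year, new_end_year)


def citation(state, first_year, last_year):
    """A paper has to state the analysed period as well as the version."""
    if not (state.start_year <= first_year <= last_year <= state.end_year):
        raise ValueError("analysed period lies outside the archived coverage")
    return f"version {state.version}, years {first_year}-{last_year}"


early = DatasetState(version=1, start_year=2006, end_year=2100)
later = extend(early, 2300)

# Same version, same analysed period: the citation is unchanged by the extension.
print(citation(early, 2006, 2100))
print(citation(later, 2006, 2100))
```

Under this sketch the version number alone never identifies the time
period; the citation carries it explicitly.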

best regards,
Karl
> S.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>
> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: 19 November 2009 20:34
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Are atomic datasets mutable?
>
> Hi all,
>
> Another common case that we'll have to deal with in CMIP5 and which 
> should be considered in defining what a "version" is:  For the future 
> (so-called RCP) runs, CMIP5 calls for runs initiated from the end of 
> the historical run and as part of the core set of expts., running to 
> the end of the 21st century.  At a lower priority some of these runs 
> will be extended to the end of the 23rd century.  Groups will likely 
> carry out these simulations in stages, sending us the 21st century output
> long before the 22nd and 23rd century output becomes available.   The 
> component time-periods are part of the same experiment and CMOR2 will 
> write output into the same directory.
>
> I think the users will be confused if a new "version" of model output 
> (that has been modified in some way) is indistinguishable from model 
> output that has been simply extended.  Both of the following options 
> will be confusing:
>
> Option 1) 21st century data for the RCP4.5 future run is received and 
> identified as version 1.  Then the continuation of that run to the end
> of the 23rd century is received and stored as version 2.  New users 
> will have to download both version 1 and version 2 to get the complete run.
>
> Option 2) 21st century data for the RCP4.5 future run is received and 
> identified as version 1.   Then the continuation of that run to the end 
> of the 23rd century is received and stored as version 2 along with a 
> copy of the data already stored as version 1.  In this case a new user
> will get all the data by downloading version 2, but an old user who 
> already downloaded version 1 won't know if what's in version 2 is a 
> duplicate of the data he already has, or is replacement data which has
> corrected some problems in the earlier version.
>
> I would suggest therefore that for a single experiment, it would be 
> best from a user's perspective to not assign a new version to model 
> output that simply extends a previous run.  We will have to find a 
> method by which to advise old users who already downloaded data that 
> the runs have now been extended.
>
> Best regards,
> Karl
>
>
> stephen.pascoe at stfc.ac.uk wrote:
>   
>> Hi all,
>>  
>> The UKMO has flagged up a use case where an atomic dataset might 
>> change over time without being a new version.  The example is the 
>> 1000 year piControl run where UKMO is likely to deliver it in several time
>> chunks and would want it to be published before the full run is 
>> complete.  Since atomic datasets represent the whole time period 
>> these datasets will grow over time.
>>  
>> I am tempted to say each addition to the dataset triggers a new 
>> version that deprecates the previous one but UKMO wasn't too keen on 
>> that.  Any ideas?
>>  
>> S.
>>  
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>  
>>
>> --
>> Scanned by iCritical.
>>
>>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu
>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
