[Go-essp-tech] Are atomic datasets mutable?

Karl Taylor taylor13 at llnl.gov
Sun Nov 22 22:32:46 MST 2009


Hi Bryan, Martin, and Stephen, (I got a message that go-essp-tech is not 
passing this on. Does anyone know whether I have permission to 
distribute to the go-essp mail list?)

First, I have no strong feelings about what constitutes a new "version" 
of a dataset or what the definition of an atomic dataset should be.  I 
only care about how easy it is for users to download the latest 
version of a dataset, and to know when data they have already 
downloaded has been replaced (or extended in time).  [I don't think it 
has to be particularly convenient to get earlier versions of a dataset, 
but I understand this is a requirement that some think is important.]
Is my view of what's important too narrow?

I also would not like to see several duplicate copies of files 
populating the data archive.  Not only does this seem wasteful of disk 
space, but why should it be necessary?

I'm mostly concerned at this time with the directory structure.  If it 
is as clean as possible, I should think it would be easy to create a 
catalog documenting what's there and making it easy to find what you need.

With these criteria in mind, I propose the following.  Prior to (or as 
part of) executing the ESG "publication" procedure:

1) A version identifier would be assigned, such as 1a, 1b, 2a, 3a, etc.  
If a run had simply been extended, with no data from the earlier version 
withdrawn, then the letter would be incremented (e.g., from "a" to 
"b").  If any data were withdrawn and/or replaced, then the integer 
would be incremented (e.g., from "2" to "3").

2) A new subdirectory would be created (e.g., "v2b") where any new files 
would be placed.

3) Any files from the previous version that had not been replaced would 
then be moved into this new directory, and a link to them would be 
created in the previous version's subdirectory.  (For example, if we 
simply add output from an extension of a run currently stored in "v2a", 
the new files would be placed in "v2b", the files from "v2a" would be 
moved into "v2b", and links would be created in "v2a" pointing to the 
files that were moved to "v2b".  Similarly, if a single file were found 
to be corrupted within the current version, a new version would be 
established, with all the uncorrupted old files moved to the new 
subdirectory along with the single replacement file.  Links would be 
created in the old version's subdirectory pointing to the moved files, 
but, of course, not to the replacement file.)  A rough sketch of this 
procedure appears below.
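
To make steps 1-3 concrete, here is a minimal Python sketch of what the 
publication step might look like.  The names (next_version, publish), 
the "v<id>" directory layout, and the use of symbolic links are my own 
illustrative assumptions, not part of any existing ESG tooling:

import os
import shutil

def next_version(current, data_withdrawn):
    # From, e.g., "2a": a withdrawal/replacement bumps the integer
    # ("2a" -> "3a"); a pure extension bumps the letter ("2a" -> "2b").
    number, letter = int(current[:-1]), current[-1]
    if data_withdrawn:
        return f"{number + 1}a"
    return f"{number}{chr(ord(letter) + 1)}"

def publish(dataset_dir, current, new_files, replaced=(), data_withdrawn=False):
    # Step 1: assign the new version identifier.
    new = next_version(current, data_withdrawn)
    old_dir = os.path.join(dataset_dir, "v" + current)
    new_dir = os.path.join(dataset_dir, "v" + new)
    os.makedirs(new_dir)

    # Step 2: new (or replacement) files go straight into the new version.
    for path in new_files:
        shutil.move(path, os.path.join(new_dir, os.path.basename(path)))

    # Step 3: files that were not replaced are moved to the new
    # subdirectory, each leaving a symbolic link behind so the old
    # version's subdirectory still appears complete.
    for name in sorted(os.listdir(old_dir)):
        if name in replaced:
            continue  # a withdrawn/corrupted file stays put, with no link
        old_path = os.path.join(old_dir, name)
        new_path = os.path.join(new_dir, name)
        shutil.move(old_path, new_path)
        os.symlink(new_path, old_path)
    return new

Extending a run stored in "v2a" would then be publish(d, "2a", new_files), 
yielding "v2b"; replacing one corrupted file would be 
publish(d, "2a", [replacement], replaced={"corrupted.nc"}, 
data_withdrawn=True), yielding "v3a".  (The file names here are 
hypothetical.)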

Will this work and be acceptable?

Below I respond to some of Bryan's specific questions/comments.  I'll 
respond to Martin's and Stephen's emails separately.

Bryan Lawrence wrote:
>  you can't have a DOI pointing at an object, and allow it to change (even by extension).
>   
O.K.
>   
>> I think the users will be confused if a new "version" of model output 
>> (that has been modified in some way) is indistinguishable from model 
>> output that has been simply extended.  Both of the following options 
>> will be confusing:
>>     
>
> The key word is "indistinguishable". How? You do need to start relying on metadata (internal AND external to files) or this data management job is just impossible.
>   
Not sure I get it, but I think I agree.
>   
>> Option 1) 21st century data for the RCP4.5 future run is received and 
>> identified as version 1.  Then the continuation of that run to the end 
>> of the 23rd century is received and stored as version 2.  New users will 
>> have to download both version 1 and version 2 to get the complete run.
>>     
>
> There are two classes of this example:
>  - some of these examples are going to come from tier1 experiments which extend core experiments.
> - some are going to be because of the way folk have done their runs
>
> The first instance is covered by the fact that technically the second tranche of data is a different atomic dataset, and analysis should exploit a concatenation of two atomic datasets ... so the issue I think is about the second case, and I think we covered that in discussion yesterday.
>
>   
I don't think the designation of tier1 vs. core should have anything to 
do with the DRS or version discussion.  If the first part of an 
experiment is higher priority and in the "core" set, while the extension 
is lower priority, we shouldn't care.  The database need only record 
what years are available from the runs.
>> I would suggest therefore that for a single experiment, it would be best 
>> from a user's perspective to not assign a new version to model output 
>> that simply extends a previous run.  We will have to find a method by 
>> which to advise old users who already downloaded data that the runs have 
>> now been extended.
>>     
>
> I don't think this is the right solution, it makes for even more confusion in the long run, because your two users have used *different* data in their analyses and yet they have assigned it the same "reference" (name including version with or without a DOI). 
>
>   
I understand this, but even those using the same "reference" name won't 
necessarily use the same data.  For example, if all the data from an 
experiment are made available at the same time, they will be assigned 
a single reference name.  One user might choose to use only the first 
100 years of data stored, while another user might use only the last 100 
years stored.  They would both reference the data with the same 
reference "name", but they would be using completely different data.  
The bottom line: a reference name is insufficient to know what data the 
user has used in his study; you also need to know what time period he 
has considered.  I think anyone publishing an analysis would be expected 
to indicate what years of model output the analysis was based on, so 
attempting to use the reference "name" to convey this information is 
both impossible in general and unnecessary.

Best regards,
Karl
