[Go-essp-tech] Are atomic datasets mutable?

Thu Nov 19 23:08:46 MST 2009

Hi Karl

I've promised Don to write something up about the use cases for versioning as we understand them, and that would include the use cases you describe. But to cut to the chase, the bottom line is: you can't have a DOI pointing at an object and allow that object to change (even by extension).

> I think the users will be confused if a new "version" of model output 
> (that has been modified in some way) is indistinguishable from model 
> output that has been simply extended.  Both of the following options 
> will be confusing:

The key word is "indistinguishable". How would they be indistinguishable? You do need to start relying on metadata (internal AND external to files), or this data management job is just impossible.

> Option 1) 21st century data for the RCP4.5 future run is received and 
> identified as version 1.  Then the continuation of that run to the end 
> of the 23rd century is received and stored as version 2.  New users will 
> have to download both version 1 and version 2 to get the complete run.

There are two classes of this example:
 - some are going to come from tier1 experiments which extend core experiments;
 - some are going to arise because of the way folk have done their runs.

The first class is covered by the fact that, technically, the second tranche of data is a different atomic dataset, and analysis should exploit a concatenation of the two atomic datasets. So the issue, I think, is about the second class, and I think we covered that in discussion yesterday.

> Option 2) 21st century data for the RCP4.5 future run is received and 
> identified as version 1.   Then the continuation of that run to the end 
> of the 23rd century is received and stored as version 2 along with a 
> copy of the data already stored as version 1.  In this case a new user 
> will get all the data by downloading version 2, but an old user who 
> already downloaded version 1, won't know if what's in version 2 is a 
> duplicate of the data he already has, or is replacement data which has 
> corrected some problems in the earlier version.

Metadata metadata metadata.
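
To make that concrete, here is a minimal sketch of the kind of external metadata I have in mind. It is an illustration only, not anything the archive actually implements: the per-version manifest (file name mapped to checksum), the function names and the file names are all assumptions. Given such a manifest for each version, the old user can tell which files in version 2 duplicate what he already has, which replace it, and which are new.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest for the contents of one file."""
    return hashlib.sha256(data).hexdigest()


def compare_versions(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify files by comparing two {file name: checksum} manifests.

    The manifest layout is hypothetical; any per-file checksum record
    held alongside each dataset version would serve the same purpose.
    """
    report: dict[str, list[str]] = {
        "unchanged": [], "replaced": [], "added": [], "withdrawn": []
    }
    for name, digest in new.items():
        if name not in old:
            report["added"].append(name)       # extension: data the old user lacks
        elif old[name] == digest:
            report["unchanged"].append(name)   # duplicate of data already held
        else:
            report["replaced"].append(name)    # same name, different content
    report["withdrawn"] = [name for name in old if name not in new]
    for names in report.values():
        names.sort()
    return report


# Hypothetical file names and contents, for illustration only.
v1 = {"tas_2006-2100.nc": checksum(b"21st century data")}
v2 = {
    "tas_2006-2100.nc": checksum(b"21st century data"),
    "tas_2101-2300.nc": checksum(b"22nd and 23rd century data"),
}
report = compare_versions(v1, v2)
print(report["unchanged"])  # ['tas_2006-2100.nc']
print(report["added"])      # ['tas_2101-2300.nc']
```

With this, a pure extension shows up as "added" files only, while a correction shows up under "replaced", so the two cases in your Option 2 are no longer indistinguishable.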

> I would suggest therefore that for a single experiment, it would be best 
> from a user's perspective to not assign a new version to model output 
> that simply extends a previous run.  We will have to find a method by 
> which to advise old users who already downloaded data that the runs have 
> now been extended.

I don't think this is the right solution. It makes for even more confusion in the long run, because your two users have used *different* data in their analyses and yet they have assigned it the same "reference" (name including version, with or without a DOI).

Speaking as a potential DOI authority: we cannot allow anything to change underfoot, otherwise we'd have no credibility. Extension is change.
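
A rough way to see why, sketched under assumptions of my own (the manifest of file names and checksums, the function name and the fingerprint scheme are all hypothetical, not a description of how any DOI is actually minted): if the thing a reference points at is identified by a digest over its complete contents, then adding files alters that digest just as surely as editing them does.

```python
import hashlib


def dataset_fingerprint(manifest: dict[str, str]) -> str:
    """Digest a whole dataset version from its {file name: checksum} manifest.

    Names are processed in sorted order so the result does not depend on
    the order in which files were listed. The scheme is illustrative only.
    """
    h = hashlib.sha256()
    for name in sorted(manifest):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and checksum
        h.update(manifest[name].encode("utf-8"))
        h.update(b"\n")  # separator between entries
    return h.hexdigest()


# Hypothetical manifests: the second merely extends the first.
original = {"tas_2006-2100.nc": "aaa111"}
extended = {"tas_2006-2100.nc": "aaa111", "tas_2101-2300.nc": "bbb222"}

# Nothing in the original file was modified, yet the object as a whole differs.
print(dataset_fingerprint(original) == dataset_fingerprint(extended))  # False
```

Two users citing the same reference should be able to assume they analysed the same bytes, and an extended dataset fails that test.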

As I say, we'll try and write something coherent on this over the next fortnight.

Bryan

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence