[Go-essp-tech] Are atomic datasets mutable?

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Mon Nov 23 03:07:37 MST 2009


Hi Karl

> I also would not like to see several duplicate copies of files 
> populating the data archive.  Not only does this seem wasteful of disk 
> space, but why should it be necessary?

We already discussed this at Hamburg and decided it made other problems go away ... provided we (ESG+archives) decide it's none of our business how the version problems are handled locally (see below). That said, there is clearly no point in having datasets that are updated often resulting in many copies, so having some "best practice" guidance is a good thing, and having a nomenclature that helps users is obviously good too ...

We need to handle that properly, but how much effort do we need to put into it? From your experience with CMIP3, did you find that many centres provided updates which were simply extensions?

> I'm mostly concerned at this time with the directory structure.  If it 
> is as clean as possible, I should think it would be easy to create a 
> catalog documenting what's there and making it easy to find what you need.

As I said in my previous email, we risk trying to do too much with the directory structure ... but I agree that we should at least handle extension a bit more cleanly.

I'll get to your proposal below in a moment, but Martin and I considered a rather simpler idea. How about we simply say to folks that if they are submitting in pieces, they submit as version0, and tell us when it's complete, at which point we call it v1 ... and we never give DOIs to version0 ... and we tell folks not to submit papers based on version0 data ... IN THE LICENSE AGREEMENT. That would allow modelling centres to control "when their data is ready".

However, I admit that idea hasn't been fully thought through yet, so the following deserves consideration, and I rather like it.

> With these criteria in mind, I propose the following:  Prior to (or as 
> part of) executing the ESG "publication" procedure
> 
> 1) a version identifier would be assigned such as 1a, 1b, 2a, 3a, etc.  
> If a run had simply been extended with no data from the earlier version 
> withdrawn, then the letter would be incremented (e.g., from "a" to 
> "b").  If any data were withdrawn and/or replaced, then the integer 
> would be incremented (e.g, from "2" to "3"). 

This implies a maximum of 26 extension increments per integer version. Fine by me.
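Just to make the scheme concrete, here's a rough sketch of the increment rule as I read it (the function name and parsing are my own illustration, not anything agreed for the DRS):

```python
# Karl's proposed version identifiers, e.g. "1a", "1b", "2a":
# extending a run bumps the letter; withdrawing or replacing any
# file bumps the integer and resets the letter to "a".

def next_version(current: str, files_withdrawn: bool) -> str:
    integer, letter = int(current[:-1]), current[-1]
    if files_withdrawn:
        return f"{integer + 1}a"                    # e.g. "2b" -> "3a"
    if letter == "z":
        raise ValueError("only 26 extension increments per integer version")
    return f"{integer}{chr(ord(letter) + 1)}"       # e.g. "2a" -> "2b"

print(next_version("2a", files_withdrawn=False))    # -> 2b
print(next_version("2b", files_withdrawn=True))     # -> 3a
```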

> 2) A new subdirectory would be created (e.g., "v2b") where any new files 
> would be placed.
 
> 3) Any files from the previous version that had not been replaced, would 
> then be moved into this new directory, and a link to them would be 
> created in the previous version's subdirectory.

I'm kind of OK with this. In practice I think it'll be fine, but the reason I'm against *relying* on links is that when we get to petascale, we may well find that those who are using NFS can't reliably have a single filesystem (for NFS performance reasons). That's OK in itself, since the DRS allows for splitting it, but then what happens if we have most of a dataset in one place and the updated version somewhere else? We can't do links between different hosts. I suspect that's such an edge case that we can just forget it, but just so you know why I wasn't so keen on links previously. (I think, we think, at BADC, that we can live with this ... even in an NFS environment, but only by being clever and paying for someone to move data as necessary.)

> (For example, if we  
> simply add output from an extension of a run currently stored in "v2a", 
> the new files would be placed in "v2b" and the files from "v2a" would be 
> moved into "v2b" and links would be created in "v2a" that point to the 
> files that were moved to "v2b".  Similarly, if a single file was found 
> to be corrupted within the current version, a new version would be 
> established with all the uncorrupted old files moved to the new 
> subdirectory along with the single replacement file. Links would be 
> created in the old version subdirectory pointing to the moved files, 
> but, of course, not the replacement file.) 
> Will this work and be acceptable?

As above. I think it'll work with the NFS petascale caveat .... 

However, I don't think it's something you need to mandate. You simply mandate that the files appear according to the version scheme above; whether they're links or not is irrelevant to you (the governor of the DRS). It's up to me, the archive maintainer, whether I use links or not ... 

> > The key word is "indistinguishable". How? You do need to start relying on metadata (internal AND external to files) or this data management job is just impossible.
> >   
> Not sure I get it, but I think I agree.

Once you get the files home, the only thing that you can guarantee is that the internal file metadata has a tracking id, and we need to be able to query the catalogue to find out how that file was made, whether it's an updated version or whatever ... the filename may have changed, etc. ... certainly the organisation will have (e.g. see Jonathan's email to the CF list on the 22nd). So relying on the DRS is OK for the maintainer, but may, in the end, not help the users.
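That is, the workflow I have in mind is roughly this (the catalogue structure and entries here are entirely invented for illustration; in reality you'd query the gateway, not a local dict):

```python
# Given only a file's internal tracking id, recover its provenance from
# a (hypothetical) catalogue, however the file has since been renamed or
# reorganised on the user's own disk.
catalogue = {
    "example-uuid-1234": {
        "dataset": "an illustrative DRS dataset identifier",
        "version": "2b",
        "superseded_by": None,
    },
}

def provenance(tracking_id):
    entry = catalogue.get(tracking_id)
    if entry is None:
        return "unknown file: not published through ESG"
    return entry
```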

> >> Option 1) 21st century data for the RCP4.5 future run is received and 
> >> identified as version 1.  Then the continuation of that run to the end 
> >> of the 23rd century is received and stored as version 2.  New users will 
> >> have to download both version 1 and version 2 to get the complete run.

Now I think that's much more confusing than handling them as two different experiments, as outlined in my previous email. I'm obviously willing to be outvoted on this ... but consider that, for good reasons, the communities interested in the two different runs may be quite different ... (even if that wasn't the original intention).

> > There are two classes of this example:
> >  - some of these examples are going to come from tier1 experiments which extend core experiments.
> > - some are going to be because of the way folk have done their runs
> >
> > The first instance is covered by the fact that technically the second tranche of data is a different atomic dataset, and analysis should exploit a concatenation of two atomic datasets ... so the issue I think is about the second case, and I think we covered that in discussion yesterday.
> >
> >   
> I don't think the designation of tier1 vs. core should have anything to 
> do with the DRS or version discussion.  If the first part of an 
> experiment is higher priority and in the "core" set, while the extension 
> is lower priority, we shouldn't care.  The database need only record 
> what years are available from the runs.

That's not how you guys wrote the CMIP5 paper, and not how a lot of folks have interpreted it. However, like I say, I can happily lose this argument ...

> >> I would suggest therefore that for a single experiment, it would be best 
> >> from a user's perspective to not assign a new version to model output 
> >> that simply extends a previous run.  We will have to find a method by 
> >> which to advise old users who already downloaded data that the runs have 
> >> now been extended.

Different issue. Give that to the gateway :-)

> I understand this, but even those using the same "reference" name won't 
> necessarily use the same data.  
> For example, if all the data from an  
> experiment have been made available at the same time it will be assigned 
> a single reference name.  One user might choose only to use the first 
> 100 years of data stored, while another user might use only the last 100 
> years stored.  They would both reference the data with the same 
> reference "name", but they would be using completely different data.  
> The bottom line: a reference name is insufficient to know what data the 
> user has used in his study; you also need to know what time-period he 
> has considered.

However, what if I said: I have used all the 500 mb height data from the
4.1 (RCP4.5) experiment, and used the Metafor experiment descriptor
to get all the simulation descriptions (for citations). That's completely
unambiguous as to time, and more importantly, as to simulation content. In fact,
if, as I expect, experiment descriptions (based on Metafor documents) are
published, then one could cite them unambiguously.

> I think anyone publishing an analysis would be expected  
> to indicate what years of model output his analysis was based on, so 
> attempting to use the reference "name" to indicate this information is 
> both impossible in general and unnecessary. 

Both possible in detail, and in a new world of data citation, desirable! Now I don't
for a moment think that folks will jump onto this new bandwagon on day one,
but don't assume it can't happen :-) :-) Nor will it be appropriate for all analyses,
particularly those done by sophisticated climate modellers ...

Cheers
Bryan

-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence
