[Go-essp-tech] [versioning] some issues (related to atomic dataset concept)

Bryan Lawrence bryan.lawrence at stfc.ac.uk
Fri Mar 6 10:23:14 MST 2009


Hi Folks

My apologies to all that I've not been engaged thus far! 

Charlotte and I have just reviewed the wiki material.

1)  We really like the Versioning Operations section!

2)  We're not really sure of the definition of "dataset hierarchy"; isn't the whole reason it appears that we have
the notion of wanting to tag groups of datasets ... (and potentially the whole group that exists at a given time)?
(Bryan has a major problem with the word "hierarchy" in this context.)

3) The state of information which has been replaced or retracted is not clear, and it needs to be, given that we state at the end that older-version data can still be obtained.  This means that the earlier statement that files will not be directly versioned needs to be married to the ability to retain older versions, and if we stick with atomic datasets we
have to replicate atomic datasets even if only a few files within them have changed.  We ought to handle that more cleverly (i.e. older pieces of replaced data are kept as new "old sub-atomic datasets" or whatever - and the new "parent" atomic dataset will of course include the entire set of current data).
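The "keep only the changed pieces" idea could be sketched as a content-addressed version store, where each version of an atomic dataset records its full file listing but file content is physically stored only once per distinct checksum, so unchanged files are shared between versions. This is just an illustration of the principle; all names here are invented:

```python
import hashlib

class DatasetStore:
    """Toy versioned store for one atomic dataset: each version keeps the
    complete file listing, but content is stored once per checksum, so
    unchanged files are never duplicated between versions."""

    def __init__(self):
        self.blobs = {}      # checksum -> file content
        self.versions = []   # list of manifests: {filename: checksum}

    def commit(self, files):
        """Register a new dataset version from {filename: content}."""
        manifest = {}
        for name, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)  # store content only once
            manifest[name] = digest
        self.versions.append(manifest)
        return len(self.versions)  # 1-based version number

    def checkout(self, version):
        """Reconstruct the entire dataset as it was at a given version."""
        manifest = self.versions[version - 1]
        return {name: self.blobs[d] for name, d in manifest.items()}

store = DatasetStore()
store.commit({"tas.nc": b"temperature v1", "pr.nc": b"precip v1"})
store.commit({"tas.nc": b"temperature v2", "pr.nc": b"precip v1"})  # only tas.nc changed
assert store.checkout(1)["tas.nc"] == b"temperature v1"  # old version still obtainable
assert len(store.blobs) == 3  # pr.nc content held once, not replicated
```

The new "parent" version still presents the entire current dataset, while the superseded file lives on only as an "old sub-atomic" piece.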

4) We cannot see how changes in data would not be reflected in metadata; however, since not all the metadata has a one-to-one relationship to the data, it's probably cleaner to say that, where appropriate, external metadata will be versioned and/or updated to reflect the data versioning. So we're not at all comfortable with the last bullet point 
under the data versioning breakdown. In particular, the activities that give rise to data changes are pretty important!

5) The audit log ought to be integral to the external metadata.
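Points 4 and 5 taken together suggest a metadata record that carries its own version, a link to the data version it describes, and an embedded audit log recording the activity behind each change. A sketch of that shape, with all field and method names made up for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AuditEntry:
    date: str
    actor: str
    action: str  # e.g. "file replaced", "dataset retracted"

@dataclass
class MetadataRecord:
    dataset_id: str
    data_version: int      # the data version this record describes
    metadata_version: int  # bumped whenever the record itself changes
    audit_log: List[AuditEntry] = field(default_factory=list)

    def record_data_change(self, date, actor, action, new_data_version):
        """Reflect a data change in the metadata: update the data-version
        link, bump the metadata version, and log why the change happened."""
        self.data_version = new_data_version
        self.metadata_version += 1
        self.audit_log.append(AuditEntry(date, actor, action))

rec = MetadataRecord("some.atomic.dataset", data_version=1, metadata_version=1)
rec.record_data_change("2009-03-06", "badc", "file replaced after QC failure", 2)
assert rec.data_version == 2 and rec.metadata_version == 2
assert len(rec.audit_log) == 1  # the activity behind the change is retained
```

Keeping the audit log inside the record (rather than alongside it) is what makes it "integral" to the external metadata: it travels with every copy and replica.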

6) If someone holds retracted data, we think it should show up in the (a?) catalogue. The whole reason for having it would be for evidential reasons. We can certainly make it non-trivial to accidentally use it, but it should be discoverable somehow!
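One way to make retracted data hard to use by accident while keeping it discoverable is a status flag on the catalogue record, so that default searches exclude it but an explicit query still finds it. A minimal sketch, with invented names and statuses:

```python
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    dataset_id: str
    version: int
    status: str  # e.g. "current", "superseded", or "retracted"

def search(catalogue, include_retracted=False):
    """Default searches hide retracted entries; they stay discoverable
    when asked for explicitly (e.g. for evidential purposes)."""
    return [e for e in catalogue
            if include_retracted or e.status != "retracted"]

catalogue = [
    CatalogueEntry("some.atomic.dataset", 2, "current"),
    CatalogueEntry("some.atomic.dataset", 1, "retracted"),
]
assert len(search(catalogue)) == 1                          # hidden by default
assert len(search(catalogue, include_retracted=True)) == 2  # still discoverable
```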

It seems that it might be helpful to have some definitions of a) what information artifacts exist, and b) where they exist.
There are (at least) the following entities floating round:

1) a catalogue, which links atomic fileset names to services which allow one to download/manipulate the data.

2) original data files

3) original file-level metadata (extracted from the files)

4) extra metadata created by humans

5) all the information (as opposed to the gridded data) lives in at least: 
  a - the ESG/Curator type catalogue
  b - the simple catalogue concept listed above 
  c - in other metadata catalogues.

(It is not obvious to us that 5a and 5b are the same - maybe they are, but we hear mixed messages).
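The "simple catalogue" of item 1 above could be as small as a table mapping atomic fileset names to the services that let one download or manipulate the data. A sketch, with all names and URLs purely illustrative:

```python
# Minimal sketch of the simple catalogue in item 1: atomic fileset names
# map to service endpoints. Everything here is invented for illustration.
catalogue = {
    "some.atomic.fileset": {
        "download": "http://example.org/download/some.atomic.fileset",
        "subset":   "http://example.org/subset/some.atomic.fileset",
    },
}

def services_for(fileset_name):
    """Return the service endpoints registered for a fileset name
    (empty dict if the name is not catalogued)."""
    return catalogue.get(fileset_name, {})

assert "download" in services_for("some.atomic.fileset")
assert services_for("unknown.fileset") == {}
```

Whether the ESG/Curator catalogue (5a) subsumes this or sits beside it is exactly the 5a-versus-5b question raised above.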

I'm always banging on about the need for sequence diagrams/flow charts. What information is created when, and what's the consequence of it changing? What flows from those changes? Maybe we need one of those here? Clearly we could create such a thing, but frankly, we don't have enough visibility of how the ESG guys think this is going to happen ... so the ball is in your court as to whether it would help, and if so, creating it :-) :-)

Cheers
Bryan


-- 
Bryan Lawrence
Director of Environmental Archival and Associated Research
(NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
STFC, Rutherford Appleton Laboratory
Phone +44 1235 445012; Fax ... 5848; 
Web: home.badc.rl.ac.uk/lawrence

