[Go-essp-tech] [versioning] some issues (related to atomic dataset concept)

Nathan Wilhelmi wilhelmi at ucar.edu
Wed Mar 25 14:40:29 MDT 2009


Hi Bryan - Sorry for the delay in following up. I will try to address 
your questions as well as possible. The responses are inline.

Thanks!

-Nate

Bryan Lawrence wrote:
> Hi Folks
>
> My apologies to all that I've not been engaged thus far! 
>
> Charlotte and I have just reviewed the wiki material.
>
> 1)  We really like the Versioning Operations section!
>
> 2)  We're not really sure of the definition of "dataset hierarchy"; isn't the whole reason it appears that we have
> the notion of wanting to tag groups of datasets ... (and potentially the whole group that exists at a given time)?
> (Bryan has a major problem with the word "hierarchy" in this context).
>   
'Tags' in versioning are like SVN tags: they just mark the state of 
the archive at a point in time.
'Tags' on the user-interface side are web tag-cloud style tags: 
community-defined keywords.

The hierarchy referred to is the nested structure of the datasets, like 
a file system hierarchy.
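
To illustrate the distinction, here is a minimal sketch in Python. The 
dataset names and keywords are purely illustrative assumptions, not the 
actual ESG identifiers or data model.

    # Versioning tag: like an SVN tag, it records which version of each
    # dataset the archive held at a point in time.
    archive_tag_2009_03 = {
        "project/modelA/experiment1": "v2",
        "project/modelA/experiment2": "v1",
    }

    # User-interface tags: community-defined keywords, tag-cloud style.
    ui_tags = {
        "project/modelA/experiment1": ["precipitation", "monthly"],
    }

    # Dataset hierarchy: datasets nest like directories in a file system.
    hierarchy = {
        "project": {
            "modelA": {"experiment1": {}, "experiment2": {}},
        },
    }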

> 3) The state of information which has been replaced but retracted is not clear, and it needs to be, given that we state at the end that older version data can still be obtained.  This means that the earlier statement that files will not be directly versioned needs to be married to the ability to retain older versions, and if we stick with atomic datasets we
> have to replicate atomic datasets even if only a few files within them have changed.  We ought to handle that more cleverly (i.e. older pieces of replaced data are kept as new "old sub atomic datasets" or whatever - and the new "parent" atomic dataset will of course include the entire set of current data).
>   
It wasn't the intention to completely replicate entire datasets if one 
file changes. If a file is changed, it becomes a new LogicalFile in the 
system. We version the relationship between a Dataset and its 
LogicalFiles, so only the new LogicalFiles would need to be replicated. 
If the data provider decides to leave the old data available, it can be 
retrieved. If the provider removes it, we would still hold the metadata 
record, and users would have to contact the provider to see if the data 
can still be obtained.
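
A minimal sketch of that idea (my own illustration, not the actual ESG 
data model; class and file names are assumptions): a new dataset version 
references the unchanged LogicalFiles and only the corrected file is new, 
so only that file needs to be replicated.

    class LogicalFile:
        def __init__(self, name, checksum):
            self.name = name
            self.checksum = checksum

    class DatasetVersion:
        def __init__(self, version, files):
            self.version = version
            self.files = list(files)   # references to LogicalFiles, not copies

    f1 = LogicalFile("tas_2000.nc", "abc1")
    f2 = LogicalFile("tas_2001.nc", "abc2")
    v1 = DatasetVersion(1, [f1, f2])

    # One file is corrected: it becomes a *new* LogicalFile, and the new
    # dataset version points to it plus the unchanged file.
    f2_fixed = LogicalFile("tas_2001.nc", "def3")
    v2 = DatasetVersion(2, [f1, f2_fixed])

    # Only the new LogicalFile has to be replicated; f1 is shared.
    to_replicate = [f for f in v2.files if f not in v1.files]
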
> 4) We cannot see how changes in data would not be reflected in metadata; however, since all the metadata does not have a one-to-one relationship to data, it's probably cleaner to say that, where appropriate, external metadata will be versioned and/or updated to reflect the data versioning. So, we're not at all comfortable with the last bullet point 
> under data versioning breakdown. In particular, activities that give rise to data changes are pretty important!
>   
Given time constraints, this wasn't deemed necessary. It can always 
be re-evaluated later.
> 5) The audit log ought to be integral to the external metadata.
>   
This is the intent.
> 6) If someone holds retracted data, we think it should show up in the (a?) catalogue. The whole reason for having it would be for evidential reasons. We can certainly make it non-trivial to accidentally use it, but it should be discoverable somehow!
>   

For a retracted dataset, citation links etc. will all still work. If you 
know the ID/URL you can still get to it directly; it just won't show up 
in normal searching or browsing. I believe it was Gary Strand who argued 
that it should be difficult to find retracted datasets.
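
As a rough sketch of that behaviour (a hypothetical catalogue interface, 
not ESG code; the IDs are made up): normal search filters out retracted 
datasets, while direct lookup by ID still resolves, so citation links 
keep working.

    catalogue = {
        "ds.modelA.tas.v1": {"retracted": False},
        "ds.modelA.tas.v2": {"retracted": True},
    }

    def search(catalogue):
        """Normal search/browse: retracted datasets are filtered out."""
        return [k for k, v in catalogue.items() if not v["retracted"]]

    def get_by_id(catalogue, dataset_id):
        """Direct access by a known ID/URL works even if retracted."""
        return catalogue.get(dataset_id)

    assert "ds.modelA.tas.v2" not in search(catalogue)
    assert get_by_id(catalogue, "ds.modelA.tas.v2") is not None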

> It seems that it might be helpful to have some definitions of a) what information artifacts exist, and b) where they exist.
> There are (at least) the following entities floating round
>   
> 1) a catalogue, which links atomic fileset names to services which allow one to download/manipulate the data.
>
> 2) original data files
>
> 3) original file-level metadata (extracted from the files)
>
> 4) extra metadata created by humans
>
> 5) all the information (as opposed to the gridded data) lives in at least: 
>   a - the ESG/Curator type catalogue
>   b - the simple catalogue concept listed above 
>   c - in other metadata catalogues.
>
> (It is not obvious to us that 5a and 5b are the same - maybe they are, but we hear mixed messages).
>
> I'm always banging on about the need for sequence diagrams/flow charts. What information is created when, and what's the consequence of it changing? What flows from those changes? Maybe we need one of those here? Clearly we could create such a thing, but frankly, we don't have enough visibility of how the ESG guys think this is going to happen ... so the ball is in your court as to whether it would help, and if so, creating it :-) :-)
>
> Cheers
> Bryan