[Go-essp-tech] Are atomic datasets mutable?

Bob Drach drach at llnl.gov
Mon Nov 23 12:37:36 MST 2009


All,

The thread *is* starting to get hard to follow - I hope someone  
(Bryan?) can summarize.

The discussion is pertinent to the upcoming revision of esgpublish,
which adds support for versioning. In particular, this revision will
support the standard create/replace/update/delete operations. If new
files are added to an existing dataset, the default action is to
increment the version number, i.e., treat the dataset as a new
version. The same is true if files in an existing dataset are modified
or deleted. Although this is the default action, it can be overridden,
and ultimate responsibility is given to the data publisher (person) to
choose a version number.
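
To make that default concrete, here is a rough sketch in Python - not
the actual esgpublish code, and the function and file names are only
illustrative - of the policy just described:

    # Illustrative sketch only, not the esgpublish implementation.
    # Any change to a dataset's file list bumps the version by default;
    # the data publisher (person) can override the choice.
    def decide_version(current_version, files_added, files_modified,
                       files_deleted, override_version=None):
        """Return the version number to publish the dataset under."""
        if override_version is not None:
            # Ultimate responsibility rests with the data publisher.
            return override_version
        if files_added or files_modified or files_deleted:
            # Default: treat *any* change as a new dataset version.
            return current_version + 1
        return current_version

    # Adding a file to a dataset at version 2 yields version 3 by
    # default, unless the publisher explicitly keeps the old number.
    print(decide_version(2, ["tas_2006.nc"], [], []))                      # 3
    print(decide_version(2, ["tas_2006.nc"], [], [], override_version=2))  # 2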

I think Karl is correct that there are cases where users would view
the addition of files to a dataset as *not* constituting a new
version. In particular this will be the case when datasets are built
up incrementally. Because of the high volume of data, I expect this to
be a common occurrence. But as Bryan notes, this is an issue for the
gateway to deal with in the notification service. For example, the  
default behavior could be to notify a user only if files that the  
user has downloaded are modified or deleted, not if new files are  
added to the dataset. And of course this could be refined with more  
options at the gateway.
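
As a rough illustration (hypothetical names, not an actual gateway
API), that default notification rule amounts to something like:

    # Hypothetical gateway-side filter: notify a user only when files
    # they have already downloaded are modified or deleted; files newly
    # added to the dataset do not trigger a notification by default.
    def should_notify(downloaded_files, modified_files, deleted_files):
        affected = set(downloaded_files) & (set(modified_files) | set(deleted_files))
        return bool(affected)

    # False: nothing the user holds changed (e.g. only new files added).
    print(should_notify({"tas_2000.nc"}, set(), set()))
    # True: a file the user already downloaded was modified.
    print(should_notify({"tas_2000.nc"}, {"tas_2000.nc"}, set()))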

For the internals of ESG the main concern is to properly support  
replication, and this drives the decision to treat *any* change as a  
new dataset version. This simplifies the various decisions on how to  
replicate and publish datasets. But in my mind the discussion has  
raised some issues and concerns that still need resolution:

- Who generates the version number of a dataset - the data producer  
or data publisher? I strongly feel that decision must be made by the  
data publisher, who has more complete information to make the choice.  
This argues that CMOR should not include version numbers in its  
output, and suggests that the DRS directory syntax is relevant only
to the archival directory structure, not to the directory structures
used at the point of data production.

- At the moment the data publisher produces version numbers that are  
positive integers. The advantage is that they can be generated  
automatically, and publishers can publish multiple datasets at
possibly different version numbers, without worrying about setting  
versions for each dataset individually. This supports publication of  
large numbers of files/datasets in a single operation - a key design  
goal. I see the appeal of more refined versioning schemes but  
question whether they buy much for the extra complication.
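
A minimal sketch (illustrative names only, not the esgpublish
internals) of what that automatic assignment amounts to when many
datasets are published in one operation:

    # Each dataset independently gets the next positive integer, with
    # no per-dataset manual intervention by the publisher.
    def next_versions(current_versions, dataset_ids):
        """current_versions maps dataset id -> latest published version."""
        return {ds: current_versions.get(ds, 0) + 1 for ds in dataset_ids}

    # Dataset "A" is already at version 3; "B" has never been published.
    print(next_versions({"A": 3}, ["A", "B"]))   # {'A': 4, 'B': 1}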

- From CMIP3 experience I know the value of carefully defining the  
archival directory structure as DRS does. It's easy to understand and  
supports ftp services nicely. But I've come to the realization that  
for CMIP5 there needs to be an abstraction of the (actual) directory  
structure that sits between the user and the physical directories.  
This is for a number of reasons: security requirements, the fact that  
the data is truly distributed, the eventual need to store data under  
multiple file systems, the need to support legacy data and data that  
can't be easily moved around, and so on. This is not to criticize DRS  
- on the contrary the value of DRS is that it defines nicely what  
that abstraction layer might look like. Viewed this way, a role of the
publisher is to associate the DRS fields with each dataset, and one
role of the gateway (which implements the abstraction layer) is to
map the fields to specific datasets. The physical directory structure
is really irrelevant. This also resolves the issue of updating
files/datasets: just pick a reasonable physical layout, don't
duplicate files, and publish the new version as you would any other
dataset.
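
As a toy illustration only - the field names, dataset ids, and paths
below are made up - the abstraction layer amounts to the publisher
attaching DRS fields to each dataset and the gateway resolving field
queries to dataset identifiers without ever exposing physical paths:

    # DRS fields are metadata attached to datasets; the gateway maps a
    # field query to dataset ids.  Physical locations stay hidden.
    catalogue = [
        {"id": "ds-0001", "model": "ModelX", "experiment": "historical",
         "variable": "tas", "version": 2, "location": "/archive/siteA/tas/v2"},
        {"id": "ds-0002", "model": "ModelX", "experiment": "historical",
         "variable": "tas", "version": 3, "location": "/archive/siteB/tas/v3"},
    ]

    def resolve(query, latest_only=True):
        hits = [d for d in catalogue
                if all(d.get(k) == v for k, v in query.items())]
        if latest_only and hits:
            hits = [max(hits, key=lambda d: d["version"])]
        return [d["id"] for d in hits]

    print(resolve({"model": "ModelX", "experiment": "historical",
                   "variable": "tas"}))
    # ['ds-0002'] -- the latest version, wherever the files actually live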

Bob

On Nov 23, 2009, at 8:12 AM, Karl Taylor wrote:

> Dear Stephen, Bryan et al.,
>
> stephen.pascoe at stfc.ac.uk wrote:
>>>> Does this mean that a single atomic dataset could contain data  
>>>> from 2 different tiers (core and tier 1)?
>>>>
>>> Yes.
>>>
>>
>> I can see a few serious problems with this.
>>
>> It has implications for our data management system, particularly  
>> how we
>> replicate the core.  It means that our atomic datasets aren't atomic!
>> Consider an atomic dataset that has both core and tier 1 data.   
>> The core
>> portion will be replicated at PCMDI, BADC, etc.  These replicas will
>> contain less data than at the originating node.  This will be
>> extremely confusing to users who go to download replicas of the core.
>>
>> It also makes our job of working out what needs replicating more
>> difficult.  We had assumed that we just needed to decide which atomic
>> datasets needed replicating.
>>
> To echo Bryan, there are two uses of "core" that get confused. The  
> original use of the term was to indicate which experiments were  
> seen to be essential for all models to do, whereas, depending on a   
> group's particular scientific interests and available resources,  
> only a subset of the experiments in tier1 and tier2 might be  
> performed.
>
> For the federated data archive, on the other hand, we have for a  
> number of reasons decided it would be good to collect a subset of  
> all the model output and replicate it at several ESG gateways.   
> This subset of model output has been referred to as "core", but  
> since that term has already been used to describe the essential  
> CMIP5 experiments, perhaps we should find another word; perhaps we  
> could call it the "replicated subset of model output" (too long?)  
> or "high-demand data".  We could refer to the "core data centers"  
> as "replicating sites of high-demand data" (or "mirror sites"?)
> Changing the term would avoid future confusion, but perhaps someone  
> can improve on the above suggestions or find ways to make sure  
> others don't get confused by the dual use of the term "core".
>> My experience with scientific user communities is quite limited  
>> but from
>> my knowledge of the impacts community many people will create  
>> aggregate
>> statistics from a basket of models and experiments to generate 1
>> or 2
>> numbers.  In this case I can see lots of confusion if a "dataset"
>> doesn't have a clearly defined time period.
> In at least one case (the pre-industrial control experiment),  
> different models will be run for different durations and so the  
> "dataset" for one model will clearly differ from the "dataset" for  
> another model.  I don't see why this should be a problem for the  
> user in citing the data.  They will simply say "I looked at the  
> years 1979-2005 from the historical runs, which are archived in the
> following "dataset name(s)".  They might also specify which region
> of those runs they analyzed.  I think they will rarely be able
> to refer to the entire dataset without indicating which years and  
> which region they focused on.
>> I quite like your versioning algorithm and directory layout but if we
>> can't sort these issues out we have a big problem.
>>
> Your suggestion of version numbers 1.01, 1.02, ... was the key to
> coming up with this.
>> Cheers,
>> Stephen.
>>
>> ---
>> Stephen Pascoe  +44 (0)1235 445980
>> British Atmospheric Data Centre
>> Rutherford Appleton Laboratory
>>


