[Go-essp-tech] Are atomic datasets mutable?

Mon Nov 23 09:12:45 MST 2009

Dear Stephen, Bryan et al.,

stephen.pascoe at stfc.ac.uk wrote:
>>> Does this mean that a single atomic dataset could contain data from 
>>> 2 different tiers (core and tier 1)?
>>>       
>> Yes.
>>     
>
> I can see a few serious problems with this.
>
> It has implications for our data management system, particularly how we
> replicate the core.  It means that our atomic datasets aren't atomic!
> Consider an atomic dataset that has both core and tier 1 data.  The core
> portion will be replicated at PCMDI, BADC, etc.  These replicas will
> contain less data than at the originating node.  This will be extremely
> confusing to users who download replicas of the core.
>
> It also makes our job of working out what needs replicating more
> difficult.  We had assumed that we just needed to decide which atomic
> datasets needed replicating.
>   
To echo Bryan, there are two uses of "core" that get confused. The 
original use of the term was to indicate which experiments were seen to 
be essential for all models to do, whereas, depending on a group's 
particular scientific interests and available resources, only a subset 
of the experiments in tier1 and tier2 might be performed.

For the federated data archive, on the other hand, we have for a number 
of reasons decided it would be good to collect a subset of all the model 
output and replicate it at several ESG gateways.  This subset of model 
output has been referred to as "core", but since that term has already 
been used to describe the essential CMIP5 experiments, perhaps we should 
find another word; perhaps we could call it the "replicated subset of 
model output" (too long?) or "high-demand data".  We could refer to the 
"core data centers" as "replicating sites of high-demand data" (or 
"mirror sites"?).

Changing the term would avoid future confusion, but perhaps someone can 
improve on the above suggestions or find ways to make sure others don't 
get confused by the dual use of the term "core".
> My experience with scientific user communities is quite limited but from
> my knowledge of the impacts community many people will create aggregate
> statistics from a basket of models and experiments to generate 1 or 2
> numbers.  In this case I can see lots of confusion if a "dataset"
> doesn't have a clearly defined time period.
In at least one case (the pre-industrial control experiment), different 
models will be run for different durations and so the "dataset" for one 
model will clearly differ from the "dataset" for another model.  I don't 
see why this should be a problem for the user in citing the data.  They 
will simply say "I looked at the years 1979-2005 from the historical 
runs, which are archived in the following dataset(s): [dataset name(s)]".  
They might also specify which region of those runs they analyzed.  I think they 
will rarely be able to refer to the entire dataset without indicating 
which years and which region they focused on.
> I quite like your versioning algorithm and directory layout but if we
> can't sort these issues out we have a big problem.
>   
Your suggestion of version numbers 1.01, 1.02, ... was the key for 
coming up with this.
> Cheers,
> Stephen.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>