[Go-essp-tech] Are atomic datasets mutable?

Mon Nov 23 09:32:59 MST 2009

Hi Karl,

I knew the term "core" was developing multiple meanings, but I thought
the data we were replicating was a subset of data from the "essential
CMIP5 experiments".  What you say now makes me doubt this.

So, are you saying the data we are replicating will be from the core and
tier 1/2 experiments?
If so, is it still the case that we will be replicating entire "atomic
datasets" and not parts thereof?

In Hamburg I brought up the question of when and how we will identify
which data is to be replicated.  We didn't get to the bottom of it then;
maybe we need to now.

S.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Karl Taylor [mailto:taylor13 at llnl.gov] 
Sent: 23 November 2009 16:13
To: Pascoe, Stephen (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu; Lawrence, Bryan (STFC,RAL,SSTD); Juckes,
Martin (STFC,RAL,SSTD); drach at llnl.gov; doutriaux1 at llnl.gov
Subject: Re: [Go-essp-tech] Are atomic datasets mutable?

Dear Stephen, Bryan et al.,

stephen.pascoe at stfc.ac.uk wrote:
>>> Does this mean that a single atomic dataset could contain data from
>>> 2 different tiers (core and tier 1)?
>>>       
>> Yes.
>>     
>
> I can see a few serious problems with this.
>
> It has implications for our data management system, particularly how 
> we replicate the core.  It means that our atomic datasets aren't
> atomic!
> Consider an atomic dataset that has both core and tier 1 data.  The 
> core portion will be replicated at PCMDI, BADC, etc.  These replicas 
> will contain less data than at the originating node.  This will be
> extremely confusing to users who download replicas of the core.
>
> It also makes our job of working out what needs replicating more 
> difficult.  We had assumed that we just needed to decide which atomic 
> datasets needed replicating.
>   
To echo Bryan, there are two uses of "core" that get confused. The
original use of the term was to indicate which experiments were seen as
essential for all models to perform, whereas, depending on a group's
particular scientific interests and available resources, only a subset
of the experiments in tier 1 and tier 2 might be performed.

For the federated data archive, on the other hand, we have for a number
of reasons decided it would be good to collect a subset of all the model
output and replicate it at several ESG gateways.  This subset of model
output has been referred to as "core", but since that term has already
been used to describe the essential CMIP5 experiments, perhaps we should
find another word; perhaps we could call it the "replicated subset of
model output" (too long?) or "high-demand data".  We could refer to the
"core data centers" as "replicating sites of high-demand data" (or
"mirror sites"?).

Changing the term would avoid future confusion, but perhaps someone can
improve on the above suggestions or find ways to make sure others don't
get confused by the dual use of the term "core".
> My experience with scientific user communities is quite limited but 
> from my knowledge of the impacts community many people will create 
> aggregate statistics from a basket of models and experiments to 
> generate 1 or 2 numbers.  In this case I can see lots of confusion if
> a "dataset" doesn't have a clearly defined time period.
In at least one case (the pre-industrial control experiment), different
models will be run for different durations and so the "dataset" for one
model will clearly differ from the "dataset" for another model.  I don't
see why this should be a problem for the user in citing the data.  They
will simply say, "I looked at the years 1979-2005 from the historical
runs, which are archived in the following 'dataset name(s)'".  They might
also specify which region of those runs they analyzed.  I think they
will rarely be able to refer to the entire dataset without indicating
which years and which region they focused on.
> I quite like your versioning algorithm and directory layout but if we 
> can't sort these issues out we have a big problem.
>   
Your suggestion of version numbers 1.01, 1.02, ... was the key to
coming up with this.
> Cheers,
> Stephen.
>
> ---
> Stephen Pascoe  +44 (0)1235 445980
> British Atmospheric Data Centre
> Rutherford Appleton Laboratory
>   