[Go-essp-tech] Are atomic datasets mutable?

Mon Nov 23 09:46:05 MST 2009

Hello,

Arising from Bryan's points, can I check a couple of specific questions:

(1) Will there be any way for a user to trace where a specific file has
come from (e.g. through an "archive_node" attribute) or will replicated
files be identical? I'd expected the latter, but want to check.

(2) If we have 100 years of daily rcp45 data from IPSL on their node,
and only replicate 30 years to PCMDI, BADC and DKRZ, what happens when a
user searches for IPSL daily rcp45 data? Do they get a result showing
how the data is spread around the distributed archive, or do they just
get the result that there are 100 years of data and a link to a script
which will deliver the data?

If the distributed nature of the archive is hidden from the users, then
an atomic dataset from the user perspective should constitute everything
that is available across the entire archive.

Cheers,
Martin

> -----Original Message-----
> From: Karl Taylor [mailto:taylor13 at llnl.gov]
> Sent: 23 November 2009 16:13
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu; Lawrence, Bryan (STFC,RAL,SSTD); Juckes,
> Martin (STFC,RAL,SSTD); drach at llnl.gov; doutriaux1 at llnl.gov
> Subject: Re: [Go-essp-tech] Are atomic datasets mutable?
> 
> Dear Stephen, Bryan et al.,
> 
> stephen.pascoe at stfc.ac.uk wrote:
> >>> Does this mean that a single atomic dataset could contain data from
> >>> 2 different tiers (core and tier 1)?
> >>>
> >> Yes.
> >>
> >
> > I can see a few serious problems with this.
> >
> > It has implications for our data management system, particularly how
> > we replicate the core.  It means that our atomic datasets aren't atomic!
> > Consider an atomic dataset that has both core and tier 1 data.  The core
> > portion will be replicated at PCMDI, BADC, etc.  These replicas will
> > contain less data than at the originating node.  This will be extremely
> > confusing to users who download replicas of the core.
> >
> > It also makes our job of working out what needs replicating more
> > difficult.  We had assumed that we just needed to decide which atomic
> > datasets needed replicating.
> >
> To echo Bryan, there are two uses of "core" that get confused. The
> original use of the term was to indicate which experiments were seen to
> be essential for all models to do, whereas, depending on a group's
> particular scientific interests and available resources, only a subset
> of the experiments in tier1 and tier2 might be performed.
> 
> For the federated data archive, on the other hand, we have for a number
> of reasons decided it would be good to collect a subset of all the model
> output and replicate it at several ESG gateways.  This subset of model
> output has been referred to as "core", but since that term has already
> been used to describe the essential CMIP5 experiments, perhaps we should
> find another word; perhaps we could call it the "replicated subset of
> model output" (too long?) or "high-demand data".  We could refer to the
> "core data centers" as "replicating sites of high-demand data" (or
> "mirror sites"?)
> 
> Changing the term would avoid future confusion, but perhaps someone can
> improve on the above suggestions or find ways to make sure others don't
> get confused by the dual use of the term "core".
> > My experience with scientific user communities is quite limited but from
> > my knowledge of the impacts community many people will create aggregate
> > statistics from a basket of models and experiments to generate 1 or 2
> > numbers.  In this case I can see lots of confusion if a "dataset"
> > doesn't have a clearly defined time period.
> In at least one case (the pre-industrial control experiment), different
> models will be run for different durations and so the "dataset" for one
> model will clearly differ from the "dataset" for another model.  I don't
> see why this should be a problem for the user in citing the data.  They
> will simply say "I looked at the years 1979-2005 from the historical
> runs, which are archived in the following 'dataset name(s)'".  They might
> also specify which region of those runs they analyzed.  I think they
> will rarely be able to refer to the entire dataset without indicating
> which years and which region they focused on.
> > I quite like your versioning algorithm and directory layout but if we
> > can't sort these issues out we have a big problem.
> >
> Your suggestion of version numbers 1.01, 1.02, ... was the key for
> coming up with this.
> > Cheers,
> > Stephen.
> >
> > ---
> > Stephen Pascoe  +44 (0)1235 445980
> > British Atmospheric Data Centre
> > Rutherford Appleton Laboratory
> >