[Go-essp-tech] Proposal for adjusting our definition of an atomic dataset

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Tue Dec 8 02:24:02 MST 2009


Hello All,

 

I agree with Karl about option 2, but before discussing option 1 I'd
like to clarify the third of the "starting point" statements:

> > 3. We only replicate entire atomic datasets

This should say, I think: 3. We only replicate entire atomic dataset
versions.

 

I can't see any grounds for requiring that all versions be replicated. 

 

Making this change introduces another option:

3. When an atomic dataset on a node contains data beyond that which is
to be replicated (centralized CMIP5 output), a version containing only
the portion to be replicated will be maintained.

 

This would require a modification to the versioning system currently
proposed. For example:

When a subset of the data in an atomic dataset is to be replicated, a
version with an id of the form "vr<version number><version letter>" will
be created, which contains (copies of or links to) a subset of the files
in the corresponding "v<version number><version letter>" version.
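
To make the naming concrete, here is a minimal Python sketch of how such
ids might be parsed and how the file list of a "vr.." version could be
derived. The concrete syntax (digits followed by one lowercase letter),
the class VersionId and the function make_replica_version are
assumptions for illustration only; none of them is part of the
versioning system currently proposed.

```python
import re
from dataclasses import dataclass

# Assumed concrete syntax: "v" or "vr", then digits, then one lowercase letter.
_VERSION_RE = re.compile(r"^v(r?)(\d+)([a-z])$")


@dataclass(frozen=True)
class VersionId:
    """Hypothetical representation of a version id such as "v1a" or "vr1a"."""
    number: int
    letter: str
    partial: bool  # True for "vr..": holds only the subset to be replicated

    @classmethod
    def parse(cls, text):
        match = _VERSION_RE.match(text)
        if match is None:
            raise ValueError(f"not a version id: {text!r}")
        return cls(int(match.group(2)), match.group(3), match.group(1) == "r")

    def __str__(self):
        prefix = "vr" if self.partial else "v"
        return f"{prefix}{self.number}{self.letter}"

    def replica_id(self):
        """Id of the partial version that corresponds to this version."""
        return VersionId(self.number, self.letter, True)


def make_replica_version(files, to_replicate):
    """Return the sorted file names for the "vr.." version.

    Every replicated file must already belong to the full "v.." version,
    so the partial version is always a subset of the full one.
    """
    selected = set(to_replicate)
    missing = selected - set(files)
    if missing:
        raise ValueError(f"not in the full version: {sorted(missing)}")
    return sorted(selected)
```

For example, str(VersionId.parse("v1a").replica_id()) gives "vr1a", and
the subset check rejects any file that is not in the full version.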

 

This avoids the complication of having to split the larger atomic
dataset on the source node. It does increase the number of versions and
links that need to be managed within an atomic dataset, but avoids
multiplying the number of atomic datasets. It would also mean that,
within the DRS, we would have a clear indication in the version id as to
whether an atomic dataset was a complete ("v..") or partial ("vr..")
replication.

 

The main difference between expanding the use of the version attribute
as I'm suggesting and Stephen's option 2 is that the latter would
require breaking up the data on the source node into two atomic
datasets. By making use of the fact that different versions of an atomic
dataset can share files we can avoid this fragmentation.
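
As a rough illustration of versions sharing files, the sketch below
builds a "vr1a" directory next to "v1a" by hard-linking only the files
to be replicated, falling back to a copy where links are not supported.
The directory layout, the file names and the function
build_replica_version are hypothetical; the time ranges simply reuse
Karl's 1950-1980 example.

```python
import os
import shutil
import tempfile


def build_replica_version(dataset_dir, full_version, replica_version, keep):
    """Populate replica_version with links to a subset of full_version.

    keep is a predicate on file names. Files are hard-linked so both
    versions share the same data on disk; if linking fails, the file is
    copied instead ("copies of or links to").
    """
    src_dir = os.path.join(dataset_dir, full_version)
    dst_dir = os.path.join(dataset_dir, replica_version)
    os.makedirs(dst_dir, exist_ok=True)
    added = []
    for name in sorted(os.listdir(src_dir)):
        if not keep(name):
            continue
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        try:
            os.link(src, dst)
        except OSError:
            shutil.copy2(src, dst)
        added.append(name)
    return added


def demo():
    """Create a throwaway dataset and return the contents of "vr1a"."""
    with tempfile.TemporaryDirectory() as root:
        full = os.path.join(root, "v1a")
        os.makedirs(full)
        # Hypothetical file names covering an 1850-2005 run.
        for name in ("tas_1850-1949.nc", "tas_1950-1980.nc", "tas_1981-2005.nc"):
            with open(os.path.join(full, name), "w") as handle:
                handle.write("placeholder")
        wanted = {"tas_1950-1980.nc"}  # the portion to be replicated
        build_replica_version(root, "v1a", "vr1a", wanted.__contains__)
        return sorted(os.listdir(os.path.join(root, "vr1a")))
```

The source node keeps a single atomic dataset; only the "vr1a" version
would be handed to the replication machinery.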

 

Cheers,

Martin

  

> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
> Sent: 07 December 2009 23:33
> To: Pascoe, Stephen (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] Proposal for adjusting our definition of an atomic dataset

> 

> Dear Stephen and all,
>
> Before commenting on the substance of your email, let me suggest that we
> not talk about "standard" and "non-standard" output.  Rather, I think it
> will be less confusing to talk about:
> 1. CMIP5 "requested" output
> 2. output not requested by CMIP5.
>
> As an aside, I think it is best to avoid the term "core" output, and
> instead refer to the subset of the output that will be replicated at
> several gateways (e.g., PCMDI, BADC, DKRZ, ...) as "centralized CMIP5
> output".  Dean and I agree this will avoid confusion.

> 

> Now to suggestions in your email:
>
> I'm not sure I understand option 1, but I'm definitely opposed to option
> 2.  We are not talking about two different experiments, we are talking
> about different subsets of output from a single experiment.  Option 2
> would, I'm sure, confuse at least 99% of the users (well, maybe I
> exaggerate).

> 

> As for option 1,
>
> 1. What would the allowable "values" be for the additional DRS
> attribute?
> 2. What is meant by "Atomic datasets that currently span standard and
> non-standard output would be split into 2 atomic datasets"?  I don't
> think there are any current atomic datasets (except in our imagination),
> so there is no need to split them.
> 3.  Rather than saying "Other atomic datasets would exist in one
> category or the other," couldn't we simply say, an atomic dataset can
> either refer to all time-samples output from the run, or a subset of
> contiguous time-samples defined by the project.   [I'm not sure that
> it's absolutely necessary that they be contiguous, but I would think
> this would be less confusing.  For example, suppose the CMIP5 requested
> output was for the years 1950-1980, but the full expt. ran from 1850 to
> 2005.  I would think that having the atomic dataset defined by the CMIP5
> requested output falling inside the atomic dataset for the non-requested
> output would seem to "split" the non-requested atomic dataset, which
> seems contradictory (can you split an atomic dataset?).]
> 4.  Note that there are some cases in which the CMIP5 *requested* output
> is non-contiguous.  For example, in the case of aerosol data, some of
> the 3-D fields are collected in 1-year samples as follows: 1850 to 1950
> every 20 years, 1960 to 2020 every 10 years, 2040 to 2100 every 20
> years.  If we require the time-samples in an atomic dataset be
> contiguous, this would require that 17 different atomic datasets
> comprise the CMIP5 requested output for these variables.  Perhaps that's
> unattractive and argues against requiring that the data be contiguous.

> 

> I'll try to join tomorrow at the beginning, at least.
>
> Best regards,
> Karl

> 

> 

> stephen.pascoe at stfc.ac.uk wrote:
> >
> > A bunch of the ESG developers are in NCAR this week talking in detail
> > about versioning and representing replicas in the datanode and gateway.
> > We have come to the conclusion that in order to implement replication
> > we need to confine ourselves to replicating entire atomic datasets.
> > We would like to work with the following principles:
> >
> > 1. CMIP5 archive is a set of atomic datasets
> > 2. The CMIP5 standard output is a subset of the CMIP5 archive
> > 3. We only replicate entire atomic datasets.
> >
> > From previous emails it is apparent that the standard output does not
> > correspond to a set of atomic datasets because in some cases standard
> > output is a temporal subset of an atomic dataset.  This implies that a
> > replica of an atomic dataset would be a temporal subset of that atomic
> > dataset.
> >
> > Therefore we propose adjusting the definition of an atomic dataset to
> > allow us to only replicate entire atomic datasets.  We suggest 2 ways
> > of achieving this:
> >
> >  1. Add an extra attribute to the DRS syntax to represent the
> >  difference between standard and non-standard output.  Atomic datasets
> >  that currently span standard and non-standard output would be split
> >  into 2 atomic datasets.  Other atomic datasets would exist in one
> >  category or the other.
> >
> >  2. Split all experiments (as defined in the DRS) that contain atomic
> >  datasets that span standard and non-standard output into 2
> >  experiments e.g. "<expt>_standard", "<expt>_optional".
> >
> > We'd like to discuss this proposal at the telco tomorrow.  Comments welcome.
> >
> > Thanks,
> > Stephen.

> >

> >

> 


