[Go-essp-tech] Proposal for adjusting our definition of an atomic dataset

Tue Dec 8 08:23:04 MST 2009

Stephen, thanks for posting that summary. The group's proposal here  
was aimed at simplifying things and reducing additional development  
requirements.  Otherwise, I think we delve into the realm of managing  
sub-atomic particles... Talk to you soon.

don

On Dec 8, 2009, at 7:05 AM, Dean N. Williams wrote:

> Hi Stephen,
>
> 	This sounds all good, but we must also keep in mind that we are under
> time constraints and need to weigh the user experience(s) with getting
> something out in a practical/reasonable amount of time. I look forward to
> the telecon later this morning and hearing from all of you.
>
> Telecon Number: (925) 424-8105  access code 305757#
>
> Best regards,
> 	Dean
>
> On Dec 8, 2009, at 5:52 AM, <stephen.pascoe at stfc.ac.uk> wrote:
>
>>
>> Hello All,
>>
>> Terminology is getting so confused here that I'd like to hold off on
>> the details until telco.
>>
>> However :-) ...
>>
>> We really aren't just talking about replication and versioning any
>> more.  We think we also need subsetting and so we are now trying to
>> shoe-horn subsetting into our current concepts of "atomic dataset",
>> "dataset version" and "replica".  A subset of something isn't a
>> replica and it isn't a new version.  Replica implies a complete copy
>> of something.  Version implies superseding a previous version of
>> something (except the initial version, obviously).  Code has already
>> been written with these implied semantics in mind.
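
The distinction Stephen draws above can be sketched in a few lines of
Python. This is only an illustration, assuming a dataset version is
modelled as a set of file names; the names Relation and classify_copy
are hypothetical and do not come from the ESG code base.

```python
from enum import Enum


class Relation(Enum):
    REPLICA = "replica"  # a complete copy of the source version
    SUBSET = "subset"    # some, but not all, of the source files
    OTHER = "other"      # contains files the source does not have


def classify_copy(source_files, copy_files):
    """Classify how a copy held on another node relates to its source version."""
    source = frozenset(source_files)
    copy = frozenset(copy_files)
    if copy == source:
        return Relation.REPLICA
    if copy < source:  # proper subset: cannot be called a replica
        return Relation.SUBSET
    return Relation.OTHER
```

Under this model a temporal subset of an atomic dataset falls into
SUBSET, which is why it fits neither the "replica" nor the "version"
semantics already assumed by existing code.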
>>
>> So unless we can reconcile the atomic dataset with what actually
>> gets replicated and versioned we need an extra concept.  Another
>> concept means more developer effort: in particular Bob's
>> implementation of versioning would need significant change.
>>
>> S.
>>
>>
>> -----Original Message-----
>> From: Juckes, Martin (STFC,RAL,SSTD)
>> Sent: Tue 12/8/2009 9:24 AM
>> To: 'Karl Taylor'; Pascoe, Stephen (STFC,RAL,SSTD); go-essp-tech at ucar.edu
>> Subject: RE: [Go-essp-tech] Proposal for adjusting our definition of
>> an atomic dataset
>>
>> Hello All,
>>
>>
>>
>> I agree with Karl about option 2, but before discussing option 1 I'd
>> like to clarify the third of the "starting point" statements:
>>
>>>> 3. We only replicate entire atomic datasets
>>
>> This should say, I think: 3. We only replicate entire atomic dataset
>> versions.
>>
>>
>>
>> I can't see any grounds for requiring that all versions be  
>> replicated.
>>
>>
>>
>> Making this change introduces another option:
>>
>> 3. When an atomic dataset on a node contains data beyond that which
>> is to be replicated (centralized CMIP5 output), a version containing
>> only the portion to be replicated will be maintained.
>>
>>
>>
>> This would require a modification to the versioning system currently
>> proposed. E.g.
>>
>> When a subset of the data in an atomic dataset is to be replicated, a
>> version with an id of the form "vr<version number><version letter>"
>> will be created, which contains (copies of or links to) a subset of
>> the files in a corresponding "v<version number><version letter>".
>>
>>
>>
>> This avoids the complication of having to split the larger atomic
>> dataset on the source node. It does increase the number of versions
>> and links that need to be managed within an atomic dataset, but
>> avoids multiplying the number of atomic datasets. It would also mean
>> that, within the DRS, we would have a clear indication in the
>> version id as to whether an atomic dataset was a complete ("v..") or
>> partial ("vr..") replication.
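
A rough Python sketch of how the "v.." / "vr.." convention described
above might be parsed. The exact syntax is an assumption here: the
version number is taken to be decimal digits and the version letter a
single lowercase letter. VersionId, parse_version_id and
replication_id are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple

# Assumed syntax: "v" or "vr", then digits, then one lowercase letter.
_VERSION_ID = re.compile(r"v(r?)(\d+)([a-z])")


class VersionId(NamedTuple):
    number: int
    letter: str
    partial: bool  # True for "vr.." (only the portion to be replicated)


def parse_version_id(text):
    """Split a version id into number, letter and a partial/complete flag."""
    match = _VERSION_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised version id: {text!r}")
    return VersionId(int(match.group(2)), match.group(3), match.group(1) == "r")


def replication_id(version):
    """Return the "vr.." id that corresponds to a given version."""
    return f"vr{version.number}{version.letter}"
```

For example, parse_version_id("vr2a") reports a partial version whose
corresponding complete version is "v2a", so the two can share files
within the same atomic dataset.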
>>
>>
>>
>> The main difference between expanding the use of the version
>> attribute as I'm suggesting and Stephen's option 2 is that the
>> latter would require breaking up the data on the source node into
>> two atomic datasets. By making use of the fact that different
>> versions of an atomic dataset can share files we can avoid this
>> fragmentation.
>>
>>
>>
>> Cheers,
>>
>> Martin
>>
>>
>>
>>> -----Original Message-----
>>> From: go-essp-tech-bounces at ucar.edu
>>> [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
>>> Sent: 07 December 2009 23:33
>>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>>> Cc: go-essp-tech at ucar.edu
>>> Subject: Re: [Go-essp-tech] Proposal for adjusting our definition of an
>>> atomic dataset
>>>
>>> Dear Stephen and all,
>>>
>>> Before commenting on the substance of your email, let me suggest that we
>>> not talk about "standard" and "non-standard" output.  Rather, I think it
>>> will be less confusing to talk about:
>>> 1. CMIP5 "requested" output
>>> 2. output not requested by CMIP5.
>>>
>>> As an aside, I think it is best to avoid the term "core" output, and
>>> instead refer to the subset of the output that will be replicated at
>>> several gateways (e.g., PCMDI, BADC, DKRZ, ...) as "centralized CMIP5
>>> output".  Dean and I agree this will avoid confusion.
>>>
>>> Now to suggestions in your email:
>>>
>>> I'm not sure I understand option 1, but I'm definitely opposed to option
>>> 2.  We are not talking about two different experiments, we are talking
>>> about different subsets of output from a single experiment.  Option 2
>>> would, I'm sure, confuse at least 99% of the users (well, maybe I
>>> exaggerate).
>>>
>>> As for option 1,
>>>
>>> 1. What would the allowable "values" be for the additional DRS
>>> attribute?
>>> 2. What is meant by "Atomic datasets that currently span standard and
>>> non-standard output would be split into 2 atomic datasets"?  I don't
>>> think there are any current atomic datasets (except in our imagination),
>>> so there is no need to split them.
>>> 3.  Rather than saying "Other atomic datasets would exist in one
>>> category or the other," couldn't we simply say, an atomic dataset can
>>> either refer to all time-samples output from the run, or a subset of
>>> contiguous time-samples defined by the project?   [I'm not sure that
>>> it's absolutely necessary that they be contiguous, but I would think
>>> this would be less confusing.  For example, suppose the CMIP5 requested
>>> output was for the years 1950-1980, but the full expt. ran from 1850 to
>>> 2005.  I would think that having the atomic dataset defined by the CMIP5
>>> requested output falling inside the atomic dataset for the non-requested
>>> output would seem to "split" the non-requested atomic dataset, which
>>> seems contradictory (can you split an atomic dataset?).]
>>> 4.  Note that there are some cases in which the CMIP5 *requested* output
>>> is non-contiguous.  For example, in the case of aerosol data, some of
>>> the 3-D fields are collected in 1-year samples as follows: 1850 to 1950
>>> every 20 years, 1960 to 2020 every 10 years, 2040 to 2100 every 20
>>> years.  If we require the time-samples in an atomic dataset be
>>> contiguous, 17 different atomic datasets would be needed to
>>> comprise the CMIP5 requested output for these variables.  Perhaps that's
>>> unattractive and argues against requiring that the data be
>>> contiguous.
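
The count of 17 in point 4 can be checked with a short Python sketch.
It assumes each range in the sampling schedule includes both its
start and end year; the function names are hypothetical.

```python
def aerosol_sample_years():
    """One-year samples from the schedule quoted above (end years inclusive)."""
    schedule = [(1850, 1950, 20), (1960, 2020, 10), (2040, 2100, 20)]
    years = []
    for start, stop, step in schedule:
        years.extend(range(start, stop + 1, step))
    return years


def contiguous_runs(years):
    """Group years into (first, last) runs of consecutive years."""
    runs = []
    for year in sorted(years):
        if runs and year == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], year)  # extend the current run
        else:
            runs.append((year, year))       # start a new run
    return runs


print(len(contiguous_runs(aerosol_sample_years())))  # 6 + 7 + 4 = 17
```

No two sampled years are adjacent, so a contiguity requirement would
turn every sampled year into its own atomic dataset.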
>>
>>>
>>> I'll try to join tomorrow at the beginning, at least.
>>>
>>> Best regards,
>>> Karl
>>>
>>>
>>> stephen.pascoe at stfc.ac.uk wrote:
>>>>
>>>> A bunch of the ESG developers are in NCAR this week talking in detail
>>>> about versioning and representing replicas in the datanode and gateway.
>>>> We have come to the conclusion that in order to implement replication
>>>> we need to confine ourselves to replicating entire atomic datasets.
>>>> We would like to work with the following principles:
>>>>
>>>> 1. CMIP5 archive is a set of atomic datasets
>>>> 2. The CMIP5 standard output is a subset of the CMIP5 archive
>>>> 3. We only replicate entire atomic datasets.
>>>>
>>>> From previous emails it is apparent that the standard output does not
>>>> correspond to a set of atomic datasets because in some cases standard
>>>> output is a temporal subset of an atomic dataset.  This implies that a
>>>> replica of an atomic dataset would be a temporal subset of that atomic
>>>> dataset.
>>>>
>>>> Therefore we propose adjusting the definition of an atomic dataset to
>>>> allow us to only replicate entire atomic datasets.  We suggest 2 ways
>>>> of achieving this:
>>>>
>>>> 1. Add an extra attribute to the DRS syntax to represent the
>>>> difference between standard and non-standard output.  Atomic datasets
>>>> that currently span standard and non-standard output would be split
>>>> into 2 atomic datasets.  Other atomic datasets would exist in one
>>>> category or the other.
>>>>
>>>> 2. Split all experiments (as defined in the DRS) that contain atomic
>>>> datasets that span standard and non-standard output into 2
>>>> experiments e.g. "<expt>_standard", "<expt>_optional".
>>>>
>>>> We'd like to discuss this proposal at the telco tomorrow.  Comments
>>>> welcome.
>>>>
>>>> Thanks,
>>>> Stephen.
>>>>
>>>>
>>>
>>> _______________________________________________
>>> GO-ESSP-TECH mailing list
>>> GO-ESSP-TECH at ucar.edu
>>> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>>
>>
>