[Go-essp-tech] Proposal for adjusting our definition of an atomic dataset

Tue Dec 8 07:05:48 MST 2009

Hi Stephen,

	This all sounds good, but we must also keep in mind that we are under
time constraints and need to weigh the user experience(s) against getting
something out in a practical/reasonable amount of time. I look forward to
the telecon later this morning and hearing from all of you.

Telecon Number: (925) 424-8105  access code 305757#

Best regards,
	Dean

On Dec 8, 2009, at 5:52 AM, <stephen.pascoe at stfc.ac.uk> wrote:

>
> Hello All,
>
> Terminology is getting so confused here that I'd like to hold off on  
> the details until telco.
>
> However :-) ...
>
> We really aren't just talking about replication and versioning any  
> more.
> We think we also need subsetting and so we are now trying to
> shoe-horn subsetting into our current concepts of "atomic dataset",
> "dataset version" and "replica".  A subset of something isn't a
> replica and it isn't a new version.  Replica implies a complete copy  
> of something.  Version implies superseding a previous version of
> something (except the initial version, obviously).  Code has already  
> been written with these implied semantics in mind.
>
> So unless we can reconcile the atomic dataset with what actually  
> gets replicated and versioned, we need an extra concept.  Another
> concept means more developer effort: in particular Bob's  
> implementation of versioning would need significant change.
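
The distinction drawn above between a replica and a subset can be pictured with a minimal Python sketch. It assumes a dataset can be treated as a plain set of file names; the names Relation and classify_copy are hypothetical and are not taken from the ESG code. Versioning is left out because it depends on history, not on file contents alone.

```python
from enum import Enum


class Relation(Enum):
    REPLICA = "replica"      # complete copy of the source
    SUBSET = "subset"        # strictly fewer files than the source
    UNRELATED = "unrelated"  # holds files the source does not have


def classify_copy(source_files, copy_files):
    """Classify a copy relative to its source by comparing file-name sets."""
    source, copy = set(source_files), set(copy_files)
    if copy == source:
        return Relation.REPLICA
    if copy < source:  # proper subset
        return Relation.SUBSET
    return Relation.UNRELATED
```

Under this reading a temporal subset can never be classified as a replica, which is why the thread looks for either a redefined atomic dataset or an extra concept.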
>
> S.
>
>
> -----Original Message-----
> From: Juckes, Martin (STFC,RAL,SSTD)
> Sent: Tue 12/8/2009 9:24 AM
> To: 'Karl Taylor'; Pascoe, Stephen (STFC,RAL,SSTD); go-essp-tech at ucar.edu
> Subject: RE: [Go-essp-tech] Proposal for adjusting our definition of  
> an atomic dataset
>
> Hello All,
>
>
>
> I agree with Karl about option 2, but before discussing option 1 I'd  
> like to clarify the third of the "starting point" statements:
>
>>> 3. We only replicate entire atomic datasets
>
> This should say, I think: 3. We only replicate entire atomic dataset  
> versions.
>
>
>
> I can't see any grounds for requiring that all versions be replicated.
>
>
>
> Making this change introduces another option:
>
> 3. When an atomic dataset on a node contains data beyond that which  
> is to be replicated (centralized CMIP5 output), a version containing  
> only the portion to be replicated will be maintained.
>
>
>
> This would require a modification to the versioning system currently  
> proposed. E.g.
>
> When a subset of the data in an atomic dataset is to be replicated, a
> version with an id of the form "vr<version number><version letter>"  
> will be created, which contains (copies of or links to) a subset of  
> the files in a corresponding "v<version number><version letter>".
>
>
>
> This avoids the complication of having to split the larger atomic  
> dataset on the source node. It does increase the number of versions  
> and links that need to be managed within an atomic dataset, but  
> avoids multiplying the number of atomic datasets. It would also mean  
> that, within the DRS, we would have a clear indication in the  
> version id as to whether an atomic dataset was a complete ("v..") or  
> partial ("vr..") replication.
>
>
>
> The main difference between expanding the use of the version  
> attribute as I'm suggesting and Stephen's option 2 is that the  
> latter would require breaking up the data on the source node into  
> two atomic datasets. By making use of the fact that different  
> versions of an atomic dataset can share files we can avoid this  
> fragmentation.
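
The "vr" scheme described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the thread gives the id form "v<version number><version letter>" but not the allowed digits or letters, so the regular expression is a guess, and DatasetVersion, is_partial, make_replication_version and the file names used with them are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed id shape: "v2a" (complete) or "vr2a" (partial replication version).
VERSION_ID = re.compile(r"^(vr|v)(\d+)([a-z])$")


@dataclass(frozen=True)
class DatasetVersion:
    version_id: str
    files: frozenset  # file names; different versions may share files


def is_partial(version_id):
    """Return True for a "vr..." id, False for a "v..." id."""
    match = VERSION_ID.match(version_id)
    if match is None:
        raise ValueError(f"unrecognised version id: {version_id!r}")
    return match.group(1) == "vr"


def make_replication_version(full, to_replicate):
    """Build the "vr" version holding only the files to be replicated."""
    if is_partial(full.version_id):
        raise ValueError("expected a complete 'v...' version")
    subset = frozenset(to_replicate)
    missing = subset - full.files
    if missing:
        raise ValueError(f"files not in {full.version_id}: {sorted(missing)}")
    # Same number and letter, "vr" prefix; the files are shared, not copied.
    return DatasetVersion("vr" + full.version_id[1:], subset)
```

Because the "vr" version only references files already present in the matching "v" version, the source node keeps a single atomic dataset and nothing has to be split.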
>
>
>
> Cheers,
>
> Martin
>
>
>
>> -----Original Message-----
>> From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Karl Taylor
>> Sent: 07 December 2009 23:33
>> To: Pascoe, Stephen (STFC,RAL,SSTD)
>> Cc: go-essp-tech at ucar.edu
>> Subject: Re: [Go-essp-tech] Proposal for adjusting our definition of an atomic dataset
>>
>> Dear Stephen and all,
>>
>> Before commenting on the substance of your email, let me suggest that we
>> not talk about "standard" and "non-standard" output.  Rather, I think it
>> will be less confusing to talk about:
>> 1. CMIP5 "requested" output
>> 2. output not requested by CMIP5.
>>
>> As an aside, I think it is best to avoid the term "core" output, and
>> instead refer to the subset of the output that will be replicated at
>> several gateways (e.g., PCMDI, BADC, DKRZ, ...) as "centralized CMIP5
>> output".  Dean and I agree this will avoid confusion.
>>
>> Now to suggestions in your email:
>>
>> I'm not sure I understand option 1, but I'm definitely opposed to option
>> 2.  We are not talking about two different experiments, we are talking
>> about different subsets of output from a single experiment.  Option 2
>> would, I'm sure, confuse at least 99% of the users (well, maybe I
>> exaggerate).
>>
>> As for option 1,
>>
>> 1. What would the allowable "values" be for the additional DRS
>> attribute?
>> 2. What is meant by "Atomic datasets that currently span standard and
>> non-standard output would be split into 2 atomic datasets"?  I don't
>> think there are any current atomic datasets (except in our imagination),
>> so there is no need to split them.
>> 3.  Rather than saying "Other atomic datasets would exist in one
>> category or the other," couldn't we simply say, an atomic dataset can
>> either refer to all time-samples output from the run, or a subset of
>> contiguous time-samples defined by the project?   [I'm not sure that
>> it's absolutely necessary that they be contiguous, but I would think
>> this would be less confusing.  For example, suppose the CMIP5 requested
>> output was for the years 1950-1980, but the full expt. ran from 1850 to
>> 2005.  I would think that having the atomic dataset defined by the CMIP5
>> requested output falling inside the atomic dataset for the non-requested
>> output would seem to "split" the non-requested atomic dataset, which
>> seems contradictory (can you split an atomic dataset?).]
>> 4.  Note that there are some cases in which the CMIP5 *requested* output
>> is non-contiguous.  For example, in the case of aerosol data, some of
>> the 3-D fields are collected in 1-year samples as follows: 1850 to 1950
>> every 20 years, 1960 to 2020 every 10 years, 2040 to 2100 every 20
>> years.  If we require the time-samples in an atomic dataset to be
>> contiguous, this would require 17 different atomic datasets to comprise
>> the CMIP5 requested output for these variables.  Perhaps that's
>> unattractive and argues against requiring that the data be contiguous.
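
The figure of 17 in point 4 can be checked with a short Python sketch. The schedule triples are taken from the message; the helper names sample_years and contiguous_blocks are hypothetical, and each sample is assumed to cover exactly one calendar year.

```python
def sample_years(schedule):
    """Expand (first_year, last_year, step) triples into a sorted list of years."""
    years = set()
    for first, last, step in schedule:
        years.update(range(first, last + 1, step))
    return sorted(years)


def contiguous_blocks(years):
    """Group sorted years into runs of consecutive years."""
    blocks = []
    for year in years:
        if blocks and year == blocks[-1][-1] + 1:
            blocks[-1].append(year)
        else:
            blocks.append([year])
    return blocks


# 1850-1950 every 20 years, 1960-2020 every 10 years, 2040-2100 every 20 years.
AEROSOL_SCHEDULE = [(1850, 1950, 20), (1960, 2020, 10), (2040, 2100, 20)]

years = sample_years(AEROSOL_SCHEDULE)
blocks = contiguous_blocks(years)
print(len(years), len(blocks))  # 17 17
```

No two sampled years are adjacent, so every sample is its own contiguous block, which gives the 17 atomic datasets mentioned if contiguity were required.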
>>
>> I'll try to join tomorrow at the beginning, at least.
>>
>> Best regards,
>> Karl
>>
>> stephen.pascoe at stfc.ac.uk wrote:
>>>
>>> A bunch of the ESG developers are in NCAR this week talking in detail
>>> about versioning and representing replicas in the datanode and gateway.
>>> We have come to the conclusion that in order to implement replication
>>> we need to confine ourselves to replicating entire atomic datasets.
>>> We would like to work with the following principles:
>>>
>>> 1. CMIP5 archive is a set of atomic datasets
>>> 2. The CMIP5 standard output is a subset of the CMIP5 archive
>>> 3. We only replicate entire atomic datasets.
>>>
>>> From previous emails it is apparent that the standard output does not
>>> correspond to a set of atomic datasets because in some cases standard
>>> output is a temporal subset of an atomic dataset.  This implies that a
>>> replica of an atomic dataset would be a temporal subset of that atomic
>>> dataset.
>>>
>>> Therefore we propose adjusting the definition of an atomic dataset to
>>> allow us to only replicate entire atomic datasets.  We suggest 2 ways
>>> of achieving this:
>>>
>>> 1. Add an extra attribute to the DRS syntax to represent the
>>> difference between standard and non-standard output.  Atomic datasets
>>> that currently span standard and non-standard output would be split
>>> into 2 atomic datasets.  Other atomic datasets would exist in one
>>> category or the other.
>>>
>>> 2. Split all experiments (as defined in the DRS) that contain atomic
>>> datasets that span standard and non-standard output into 2
>>> experiments e.g. "<expt>_standard", "<expt>_optional".
>>>
>>> We'd like to discuss this proposal at the telco tomorrow.  Comments welcome.
>>>
>>> Thanks,
>>> Stephen.
>>>
>>
>> _______________________________________________
>> GO-ESSP-TECH mailing list
>> GO-ESSP-TECH at ucar.edu
>> http://*mailman.ucar.edu/mailman/listinfo/go-essp-tech