[Go-essp-tech] publishing by realm -- required DRS modifications.

Fri Feb 26 11:22:41 MST 2010

Hi Luca,

The problem is that we don't want to replicate all the "ocean/requested" data, because we don't have enough storage to keep 2 PB of data.
When we were expecting to publish at the atomic dataset level, the idea was that we would replicate some, but not all, of the datasets within the "ocean/requested" realm. If all the "ocean/requested" data goes into a single dataset this is not going to work. Karl's selection of what is to be replicated is based on priorities assigned to different variables: "ocean/requested" contains a mixture of high and low priority variables. So we need to create a published unit which only contains the high priority variables.

cheers,
Martin

-----Original Message-----
From: Luca Cinquini [mailto:luca at ucar.edu]
Sent: Fri 26/02/2010 16:51
To: Juckes, Martin (STFC,RAL,SSTD)
Cc: drach at llnl.gov; Lawrence, Bryan (STFC,RAL,SSTD); taylor13 at llnl.gov; go-essp-tech at ucar.edu
Subject: Re: [Go-essp-tech] publishing by realm -- required DRS modifications.

Hi Martin,
	the way I understood it, there would be two datasets for each  
realm, for example "atmos/output" and "atmos/requested", the latter  
being a sub-set of the former, and each of those could be replicated  
as a whole.
Luca

On Feb 26, 2010, at 8:43 AM, <martin.juckes at stfc.ac.uk> wrote:

>
> Hello,
>
> I have eventually got round to checking this idea against Karl's  
> specification of the replication subset. The latter would not be  
> complete realm level datasets (e.g. for ocean, not all monthly data  
> is to be replicated). This means that we would have to revise the  
> replication plan, because we had been counting on replicating  
> complete published units of the "requested" product. An alternative  
> approach might be to replace the "requested" product with a  
> "replicated" product. The ESG data node would then have "output" and  
> "replicated" products, the latter being a subset of the former both  
> in terms of the temporal coverage and the number of variables  
> included. The entire "replicated" product would then, as the name  
> suggests, be replicated.
>
> A second DRS modification which would be required by the realm level  
> publishing is the scrapping of the atomic dataset versioning and  
> replacing this with versioning at the realm level.
>
> cheers,
> Martin
>
> -----Original Message-----
> From: go-essp-tech-bounces at ucar.edu on behalf of Bob Drach
> Sent: Thu 25/02/2010 22:48
> To: Lawrence, Bryan (STFC,RAL,SSTD)
> Cc: go-essp-tech at ucar.edu
> Subject: Re: [Go-essp-tech] publishing by realm
>
> Hi Bryan,
>
> Are you assuming that CMOR will assign the version numbers (either
> atomic_dataset | realm_dataset | file)? That's not the case, and I'm
> not sure that CMOR has sufficient information to do so.
>
> It's worth recapping how the ESG publisher currently deals with
> versioning:
>
> - The publisher is given a dataset id and a list of files to be
> published. Let's assume that dataset == realm_dataset here.
> - If this is a new dataset, the dataset is assigned dataset_version=1
> by default. Each file is assigned file_version=1. Dataset_version and
> file_version are completely independent.
> - If the dataset exists, each file is compared with its existing
> counterpart in the dataset (if present), based on a set of metadata:
> checksum, file length, modification date, etc. If a file has changed,
> its file_version is incremented and that value is recorded in the
> THREDDS catalog. Similarly, if the dataset has any files that have
> been added, deleted, or modified, its dataset_version is incremented
> and this is also recorded in THREDDS.
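
The versioning rules in the list above can be sketched roughly as
follows. This is only an illustration, not the ESG publisher's actual
code: FileMeta, Dataset and publish are hypothetical names, the file
list passed in is assumed to be the complete list for the dataset (so
a file missing from it counts as deleted), and a newly added file is
assumed to start at file_version=1.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FileMeta:
    # Hypothetical record of the metadata used for comparison.
    checksum: str
    length: int
    mtime: float


@dataclass
class Dataset:
    dataset_version: int = 1
    # path -> (metadata, file_version)
    files: Dict[str, Tuple[FileMeta, int]] = field(default_factory=dict)


def publish(existing: Optional[Dataset], incoming: Dict[str, FileMeta]) -> Dataset:
    """Return the dataset state after publishing the given file list."""
    if existing is None:
        # New dataset: dataset_version=1 and every file_version=1.
        return Dataset(1, {path: (meta, 1) for path, meta in incoming.items()})

    # Any added or deleted path counts as a change to the dataset.
    changed = set(existing.files) != set(incoming)
    files = {}
    for path, meta in incoming.items():
        old = existing.files.get(path)
        if old is None:
            files[path] = (meta, 1)           # added file (assumed to start at 1)
        elif old[0] != meta:
            files[path] = (meta, old[1] + 1)  # modified file: bump file_version
            changed = True
        else:
            files[path] = old                 # unchanged file keeps its version
    # dataset_version is bumped once, however many files changed.
    return Dataset(existing.dataset_version + (1 if changed else 0), files)
```

Note that in this sketch file_version and dataset_version never
influence each other directly; the dataset version only records that
something in the file list changed.
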
>
> So suppose that we publish at the realm_dataset granularity and one
> of the files in that dataset is updated. Then the file has a new
> file_version, the dataset has a new dataset_version, both are
> recorded in the TDS catalog. It should be possible for the replica
> manager to compare old and new dataset versions - by comparing old
> and new catalogs - to determine which files have changed, and only
> transfer those files to the replica site.
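
The catalog comparison described above might look something like the
sketch below. It assumes each catalog has already been reduced to a
plain mapping from file path to file_version; parsing the real THREDDS
XML is not shown, and files_to_transfer is a hypothetical name rather
than part of any existing replica manager.

```python
from typing import Dict, List, Tuple


def files_to_transfer(
    old_catalog: Dict[str, int], new_catalog: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Compare two dataset versions, each given as path -> file_version.

    Returns (to_fetch, to_delete): files that are new or whose
    file_version differs, and files no longer present in the new version.
    """
    to_fetch = sorted(
        path
        for path, version in new_catalog.items()
        if old_catalog.get(path) != version
    )
    to_delete = sorted(path for path in old_catalog if path not in new_catalog)
    return to_fetch, to_delete
```

With this approach only the paths in to_fetch need to be moved to the
replica site, even though the dataset as a whole has a new version.
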
>
> Bob
>
>
> On Feb 25, 2010, at 12:08 PM, Bryan Lawrence wrote:
>
>> Hi Bob
>>
>> On Thursday 25 February 2010 19:27:15 Bob Drach wrote:
>>> Where would 'atomic dataset version' be stored? In ESG there would
>>> only be realm-dataset versions and individual file versions.
>>
>> The DRS is writing a version associated with the atomic dataset as
>> defined within it. We expect modelling groups would conform to
>> that, and update versions according to it. We could rewrite the
>> DRS ... (and hence CMOR presumably ... but it's a bit late for
>> that ... or maybe I'm missing something).
>>
>> That means, if we leave things the way they are: there is a
>> logical disconnect, and the risk of either vastly more data
>> movement than is necessary, or a complex resolution problem (is my
>> replicated "realm" level dataset the same as yours, if we've done
>> replication at the file level).
>>
>> cheers
>> Bryan
>>
>> -- 
>> Bryan Lawrence
>> Director of Environmental Archival and Associated Research
>> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC)
>> STFC, Rutherford Appleton Laboratory
>> Phone +44 1235 445012; Fax ... 5848;
>> Web: home.badc.rl.ac.uk/lawrence
>
> _______________________________________________
> GO-ESSP-TECH mailing list
> GO-ESSP-TECH at ucar.edu
> http://mailman.ucar.edu/mailman/listinfo/go-essp-tech
>
> -- 
> Scanned by iCritical.
