[Go-essp-tech] publishing by realm

Fri Feb 26 03:26:49 MST 2010

Hi All,

The confusion here stems from the fact that we (BADC, and maybe others)
have been focussing on the DRS document as the design for the system.
We knew implementations were in flux, but the DRS document appeared to
indicate the direction in which the software was heading.

The DRS defines an atomic dataset as:

  The collection of data constituting a "product" from a single model
  run is characterized by sharing a single activity, institute, model,
  experiment/scenario, data frequency, modelling-realm, variable name,
  local ensemble member, and version.

This implies to me that each atomic dataset can have an independent
version.  Also, the DRS URL syntax strongly suggests that individual
variables can be versioned, as it puts <version> as the last component
in the path.  We already have code for constructing DRS paths that makes
this assumption.

This fits with the publisher if we publish by atomic-dataset.  However,
if we publish by realm-dataset there is a mismatch.  This is nothing to
do with how much data needs shipping about to do replication: as Gavin
and Bob point out there are algorithms for ensuring only updated files
need copying.  However, the DRS document needs to accurately describe
what the system is versioning -- files, atomic-datasets or
realm-datasets.

I'm not suggesting the DRS document is cast in stone, just that it
should be consistent with the implementation.  It strikes me that the
DRS needs to align atomic-dataset with realm-dataset.

Just to sketch some changes that may need to be made to the DRS:

 1. New definition of atomic dataset:

  The collection of data constituting a "product" from a single model
  run characterized by sharing a single activity, institute, model,
  experiment/scenario, data frequency, modelling-realm and version.

  The data within each dataset is classified according to variable,
local ensemble member and temporal subset.

 2. Rearrange the DRS URL/path structure as:

http://<hostname>/<activity>/<product>/<institute>/<model>/<experiment>/<frequency>/<modeling-realm>/<version>/<variable>/<ensemble-member>[<endpoint>]
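
As a rough illustration of the rearranged layout, here is a minimal
sketch in Python.  It is not our existing path-construction code: the
function name, the dictionary keys and the example values are all
made up for illustration, and only the ordering of the components comes
from the proposal above.

```python
# Hypothetical sketch: the component order is taken from the proposed
# layout; the names and example values are illustrative placeholders.
PROPOSED_ORDER = [
    "activity", "product", "institute", "model", "experiment",
    "frequency", "modeling_realm", "version", "variable", "ensemble_member",
]


def proposed_drs_url(hostname, components, endpoint=""):
    """Join the DRS components in the proposed order.

    Raises KeyError if any component is missing from `components`.
    """
    parts = [str(components[name]) for name in PROPOSED_ORDER]
    return "http://" + hostname + "/" + "/".join(parts) + endpoint


example = {
    "activity": "cmip5", "product": "output", "institute": "INST",
    "model": "MODEL", "experiment": "exp1", "frequency": "mon",
    "modeling_realm": "atmos", "version": "v1", "variable": "tas",
    "ensemble_member": "r1",
}
print(proposed_drs_url("example.org", example))
# http://example.org/cmip5/output/INST/MODEL/exp1/mon/atmos/v1/tas/r1
```

With <version> placed directly after the realm, everything beneath it
(variables and ensemble members) shares that one version.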

The wording needs improving but I think that would do it.

Cheers,
Stephen.

---
Stephen Pascoe  +44 (0)1235 445980
British Atmospheric Data Centre
Rutherford Appleton Laboratory

-----Original Message-----
From: Bob Drach [mailto:drach at llnl.gov] 
Sent: 25 February 2010 22:48
To: Lawrence, Bryan (STFC,RAL,SSTD)
Cc: go-essp-tech at ucar.edu; Pascoe, Stephen (STFC,RAL,SSTD)
Subject: Re: [Go-essp-tech] publishing by realm

Hi Bryan,

Are you assuming that CMOR will assign the version numbers (either
atomic_dataset | realm_dataset | file)? That's not the case, and I'm not
sure that CMOR has sufficient information to do so.

It's worth recapping how the ESG publisher currently deals with
versioning:

- The publisher is given a dataset id and a list of files to be
published. Let's assume that dataset == realm_dataset here.
- If this is a new dataset, the dataset is assigned dataset_version=1 by
default. Each file is assigned file_version=1. Dataset_version and
file_version are completely independent.
- If the dataset exists, each file is compared with its existing
counterpart in the dataset (if present), based on a set of metadata:
checksum, file length, modification date, etc. If a file has changed,
its file_version is incremented and that value is recorded in the
THREDDS catalog. Similarly, if the dataset has any files that have been
added, deleted, or modified, its dataset_version is incremented and this
is also recorded in THREDDS.
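
The rules above can be sketched roughly as follows.  This is not the ESG
publisher code: the function name and the in-memory dictionary standing
in for the catalog are hypothetical.  It also assumes two things the
description does not spell out: that the file list passed in is the
complete list for the dataset (so a missing file counts as deleted), and
that a file added to an existing dataset starts at file_version=1.

```python
def publish(catalog, dataset_id, files):
    """Apply the versioning rules to an in-memory stand-in for the catalog.

    catalog: dict mapping dataset_id -> {"version": int, "files": {...}}
    files:   dict mapping path -> metadata tuple (e.g. checksum, length)
    """
    existing = catalog.get(dataset_id)
    if existing is None:
        # New dataset: dataset_version=1 and every file_version=1.
        catalog[dataset_id] = {
            "version": 1,
            "files": {p: {"version": 1, "meta": m} for p, m in files.items()},
        }
        return catalog[dataset_id]

    changed = False
    new_files = {}
    for path, meta in files.items():
        old = existing["files"].get(path)
        if old is None:
            # Added file (assumption: it starts at file_version=1).
            new_files[path] = {"version": 1, "meta": meta}
            changed = True
        elif old["meta"] != meta:
            # Modified file: increment its file_version.
            new_files[path] = {"version": old["version"] + 1, "meta": meta}
            changed = True
        else:
            new_files[path] = old

    # Deleted files (assumption: absence from `files` means deletion).
    if set(existing["files"]) - set(files):
        changed = True

    existing["files"] = new_files
    if changed:
        # Any added, deleted, or modified file bumps dataset_version once.
        existing["version"] += 1
    return existing
```

Note that dataset_version and file_version move independently here: one
publish call bumps dataset_version at most once, however many files
changed.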

So suppose that we publish at the realm_dataset granularity and one of
the files in that dataset is updated. Then the file has a new
file_version, the dataset has a new dataset_version, and both are
recorded in the TDS catalog. It should be possible for the replica
manager to compare old and new dataset versions - by comparing old and
new catalogs - to determine which files have changed, and only transfer
those files to the replica site.
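
A minimal sketch of that comparison, assuming each catalog has already
been reduced to a mapping of file path to file_version (parsing the
actual THREDDS XML is left out, and the function names are hypothetical,
not part of any replica manager):

```python
def files_to_transfer(old_catalog, new_catalog):
    """Paths in the new catalog that are new or whose file_version differs."""
    return sorted(
        path for path, version in new_catalog.items()
        if old_catalog.get(path) != version
    )


def files_to_remove(old_catalog, new_catalog):
    """Paths present in the old catalog but absent from the new one."""
    return sorted(set(old_catalog) - set(new_catalog))


old = {"tas.nc": 1, "pr.nc": 1, "ua.nc": 2}
new = {"tas.nc": 2, "pr.nc": 1, "va.nc": 1}
print(files_to_transfer(old, new))  # ['tas.nc', 'va.nc']
print(files_to_remove(old, new))    # ['ua.nc']
```

Only the updated and the newly added file would be shipped; the
unchanged file stays where it is.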

Bob

On Feb 25, 2010, at 12:08 PM, Bryan Lawrence wrote:

> Hi Bob
>
> On Thursday 25 February 2010 19:27:15 Bob Drach wrote:
>> Where would 'atomic dataset version' be stored? In ESG there would 
>> only be realm-dataset versions and individual file versions.
>
> The DRS is writing a version associated with the atomic dataset as
> defined within it. We expect modelling groups would conform to that,
> and update versions according to it. We could rewrite the DRS ... (and
> hence CMOR presumably ... but it's a bit late for that ... or maybe
> I'm missing something).
>
> That means, if we leave things the way they are: there is a logical
> disconnect, and the risk of either vastly more data movement than is
> necessary, or a complex resolution problem (is my replicated "realm"
> level dataset the same as yours, if we've done replication at the file
> level).
>
> cheers
> Bryan
>
> --
> Bryan Lawrence
> Director of Environmental Archival and Associated Research 
> (NCAS/British Atmospheric Data Centre and NCEO/NERC NEODC) STFC, 
> Rutherford Appleton Laboratory Phone +44 1235 445012; Fax ... 5848;
> Web: home.badc.rl.ac.uk/lawrence
